Sample Midterm Exam for BU425 – Fall 2022
Part I: Multiple choice questions
1. A dataset is stored in a variable called BigData. Which of the following lines of R code would replace the missing values in the Age column of BigData with the mean age (stored in meanAge)?
a. BigData[!is.na(BigData$Age),"Age"] <- meanAge
b. BigData$Age[is.na(BigData$Age)] <- meanAge
c. BigData[is.na(BigData$Age),] <- meanAge
d. BigData$Age[!is.na(BigData$Age)] <- meanAge
e. BigData$Age[is.na(BigData$Age)== meanAge,]
2. Which of the options most closely describes the following R statement:
X <- 1.2*min(data$sales)
data[data$sales<X,"sales"] <- X
a. Sets all values in data to X
b. Replaces the column sales in data with X
c. Sets the lowest values in data$sales to 120% of the lowest value
d. Sets rows smaller than 120% of the minimum value of sales to X
3. What are the steps for using a gradient descent algorithm?
i. Calculate error between the actual value and the predicted value
ii. Reiterate until you find the best weights of the network
iii. Pass an input through the network and get values from the output layer
iv. Set random weights and biases
v. Go to each neuron that contributes to the error and change its respective values to reduce the error
a. iv, iii, i, v, ii
b. i, ii, iii, iv, v
c. v, iv, iii, ii, i
d. iii, ii, i, v, iv
e. iii, iv, i, v, ii
4. Which of the following is not true about logistic regression:
a. Logistic regression gives class probability estimates
b. Logistic regression takes a categorical target variable in training data
c. Logistic regression is less prone to overfitting than tree induction
d. Logistic regression represents the log odds of class membership as a linear function of attributes
e. All of the above are true
5. It is important for a predictive model …
a. to have a high R²
b. not to have any multicollinearity among variables
c. not to have any omitted variables
d. to have significant p-values for all attributes (e.g. p < 0.05)
e. all of the above
6. Which one of the choices below does not result in overfitting?
a. Adding variables to the model
b. Transforming variables
c. Having a large training set
d. More nodes in a classification tree
e. All choices above will result in overfitting
7. Which classifier would you expect to have highest accuracy on the training set?
a. k-nearest neighbour (where k is the size of the training set)
b. Logistic regression
c. Support vector machine with non-linear kernel
d. A classification tree with pure leaf nodes
e. An artificial neural network with 4 hidden layers and 8 neurons in each layer
8. The points on a model’s ROC curve …
a. represent different rankings of examples
b. represent the performance of different classification thresholds
c. represent the cost of different classifications
d. represent the number of true positives in training vs. testing samples
9. We use a validation set to …
a. find the k in the k-nearest neighbour
b. decide on a classification threshold
c. select features to be included in the model
d. find the best regularization rate
e. All choices above
10. Below is an example of a neural network. You can see that the last neuron takes input from two neurons before it. The activation function for all the neurons is given by:
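f(x) = [piecewise definition, including a case for x < 0; not recoverable from the source]
[Figure: neural network with inputs X1 and X2; diagram not recoverable from the source]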
Suppose X1 is 0.5 and X2 is 1, what will be the output for the above neural network?
a. -1
b. 0
c. 0.5
d. 1
e. -0.5
11. How do you select the best hyperparameters in tree-based models?
a. Measure performance over training data
b. Measure performance over validation data
c. Both of these
12. Suppose you are given the following scenarios of training and validation error for a decision tree. Which hyperparameter setting would you choose?
Scenario | Depth | Training Error | Validation Error
1        | 2     | 100            | 110
2        | 4     | 90             | 104
4        | 8     | 45             | 104
5        | 10    | 30             | 150
a. Scenario 1
b. Scenario 2
c. Scenario 3
d. Scenario 4
13. You are training a decision tree that splits a node based on the highest information gain. Which of the following splits has the highest information gain?
a. Outlook
b. Humidity
c. Windy
d. Temperature
Part II: Short answer questions
Please answer in complete sentences in the space provided.
1. Briefly explain what a classification threshold is and how it may change the accuracy of a classifier (including the different types of errors).
2. What is regularization? Explain the difference between the L1 and L2 norms.
3. Describe two advantages of a classification tree over a neural network.
4. A data mining routine has been applied to a transaction dataset and has classified 88 records as fraudulent (30 correctly so) and 952 as non-fraudulent (920 correctly so). Construct the confusion matrix and calculate the overall error rate.
5. Consider the following decile-wise lift chart for the transaction data model in question 4 (Note that Mean Response on the y-axis represents the lift value).
a. Interpret the meaning of the first and second bars from the left.
b. Explain how you might use this information in practice.
c. Another analyst comments that you could improve the accuracy of the model by classifying everything as non-fraudulent. If you do that, what is the error rate?
d. Comment on the usefulness, in this situation, of these two metrics of model performance (error rate and lift).
6. A company that manufactures riding mowers wants to identify the best sales prospects for an intensive sales campaign. In particular, the manufacturer is interested in classifying households as prospective owners or nonowners on the basis of Income (in $1000s) and Lot Size (in 1000 ft²). The marketing expert looked at a random sample of 24 households. Use all the data to fit a logistic regression of ownership on the two predictors.
reg <- glm(Ownership ~ ., data = mowers.df, family = "binomial")
summary(reg)
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -25.9382    11.4871  -2.258   0.0239 *
Income        0.1109     0.0543   2.042   0.0412 *
Lot_Size      0.9638     0.4628   2.083   0.0373 *
Confusion Matrix:
          Reference
Prediction Nonowner Owner
  Nonowner       10     2
  Owner           2    10
a. Among nonowners, what is the percentage of households classified correctly?
b. To increase the percentage of correctly classified nonowners, should the cutoff probability be increased or decreased?
c. What are the odds that a household with a $60K income and a lot size of 20,000 ft² is an owner?
d. Using a cutoff probability of 0.5, what is the minimum income that a household with a 16,000 ft² lot should have before it is classified as an owner?
e. What is the classification of a household with a $60K income and a lot size of 20,000 ft²? Use cutoff = 0.5.
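As a study aid for parts (c) to (e), here is a minimal sketch of how the fitted coefficients map to log odds, odds and a class label. The coefficient values are copied from the summary(reg) output above. The helper names (log_odds, odds, prob_owner, classify) are hypothetical, and treating a probability exactly equal to the cutoff as "Owner" is an assumption.

```r
# Coefficients copied from the summary(reg) output in question 6.
b0       <- -25.9382  # intercept
b_income <- 0.1109    # Income, in $1000s
b_lot    <- 0.9638    # Lot_Size, in 1000 sq ft

# Logistic regression models the log odds as a linear function of the predictors.
log_odds <- function(income, lot_size) {
  b0 + b_income * income + b_lot * lot_size
}

# odds = exp(log odds); probability = odds / (1 + odds).
odds <- function(income, lot_size) exp(log_odds(income, lot_size))
prob_owner <- function(income, lot_size) plogis(log_odds(income, lot_size))

# Assumption: a probability at or above the cutoff is classified as "Owner".
classify <- function(income, lot_size, cutoff = 0.5) {
  ifelse(prob_owner(income, lot_size) >= cutoff, "Owner", "Nonowner")
}
```

Income and lot size must be entered in the units of the model, so a $60K income and a 20,000 ft² lot are passed as odds(60, 20). With a cutoff of 0.5 the decision boundary is where the log odds equal zero, which is the equation to solve for income in part (d).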
Part III: Case questions
Please answer in complete sentences in the space provided.
Case Question 1
EasyLoan has archived a large amount of data from 30,000 previous clients, including information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients. In addition to this credit data, we also have the amount of credit given to each customer. EasyLoan is now reassessing the way in which it evaluates customers for loans.
Appendix Q1-A provides definitions of each of the data elements
Appendix Q1-B provides a sample of the data (the first 20 customers)
Appendix Q1-C provides regression output for Part A
Appendix Q1-D provides the impact of classification threshold on model accuracy
Appendix Q1-E provides regression output for Part B
Part A - We started the investigation by performing a logistic regression with the following independent variables:
LIMIT_BAL, SEX, EDUCATION, MARRIAGE, AGE
1.1) Interpret the output of the regression (Appendix Q1-C) and describe the characteristics of a customer likely to default. Be sure to consider the statistical significance of the variables in your response.
1.2) Based on this regression output (Appendix Q1-C), which of the customers below has a lower probability of default in his/her payment? Explain.
          | Customer 1 | Customer 2
LIMIT_BAL | 20000      | 10000
SEX       | 1          | 2
EDUCATION | 3          | 3
MARRIAGE  | 3          | 1
AGE       | 20         | 30
1.3) Initially we set the threshold for the probability estimate of the logistic regression output (for identifying customers who will default) to 0.5. We calculate a train error of 0.2214 and a test error of 0.2204. What can we say about the fit of our model (overfit / good fit / underfit)?
1.4) We investigate the accuracy of the model when choosing different thresholds (Appendix Q1-D). What can we conclude about the different types of errors (false positive rate / false negative rate) of the model at our initial threshold of 0.5?
1.5) To induce more spending, EasyLoan is considering offering a credit increase to customers with a good credit history (no default).
a. How should we change the initial threshold of 0.5 to select customers to send the offer to?
b. How will this change impact the original test error of 0.2204?
c. Discuss the impact on false positive and false negative rates.
Part B – Next we have performed a logistic regression including all the variables
1.6) How would you interpret the output of the regression (Appendix Q1-E) and the impact of the variables missing from Part A on the probability of default?
1.7) Explain how you would select the variables to be included in your final logistic regression model.
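For questions 1.4 and 1.5, the trade-off between error types at different thresholds can be explored with a sketch like the one below. The vectors actual and p_hat are hypothetical toy data, not the EasyLoan data, and error_rates is an illustrative helper. It assumes the positive class is "default" (coded 1) and that a probability at or above the threshold is predicted as default.

```r
# Hypothetical toy data: actual labels (1 = default) and predicted probabilities.
actual <- c(1, 0, 0, 1, 0, 1, 0, 0)
p_hat  <- c(0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.3, 0.55)

# Overall error, false positive rate and false negative rate at one threshold.
error_rates <- function(actual, p_hat, threshold) {
  pred <- as.integer(p_hat >= threshold)   # 1 = predicted default
  fp <- sum(pred == 1 & actual == 0)       # non-defaulters flagged as defaulters
  fn <- sum(pred == 0 & actual == 1)       # defaulters that were missed
  c(threshold = threshold,
    error = (fp + fn) / length(actual),
    fpr = fp / sum(actual == 0),
    fnr = fn / sum(actual == 1))
}

# One row per threshold, to compare how the error types move.
results <- t(sapply(c(0.3, 0.5, 0.7), function(th) error_rates(actual, p_hat, th)))
print(results)
```

On a real model, p_hat would typically come from predict(model, newdata, type = "response") on the test set.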
Appendix Q1-A
Dataset Information
This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.
Content
There are 25 variables:
• ID: ID of each client
• LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
• SEX: Gender (1=male, 2=female)
• EDUCATION: (4=graduate school, 3=university, 2=high school, 1=others)
• MARRIAGE: Marital status (1=married, 2=single, 3=others)
• AGE: Age in years
• PAY_0: Repayment status in September, 2005 (-2&-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above)
• PAY_2: Repayment status in August, 2005 (scale same as above)
• PAY_3: Repayment status in July, 2005 (scale same as above)
• PAY_4: Repayment status in June, 2005 (scale same as above)
• PAY_5: Repayment status in May, 2005 (scale same as above)
• PAY_6: Repayment status in April, 2005 (scale same as above)
• BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
• BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
• BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
• BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
• BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
• BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
• PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
• PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
• PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
• PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
• PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
• PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
• default.payment.next.month: Default payment (1=yes, 0=no)
Appendix Q1-B
Some of the payment and billing columns are hidden for scalability
Appendix Q1-C
Appendix Q1-D
Appendix Q1-E
2022-11-14