Sample Midterm Exam for BU425, Fall 2022

Part I: Multiple choice questions

1.   A dataset is stored in a variable called BigData. Which of the following lines of R code would replace the missing values in the Age column of BigData with the mean age (stored in meanAge)?

a.    BigData[!is.na(BigData$Age),"Age"]) <- meanAge

b.    BigData$Age [is.na(BigData$Age)] <- meanAge

c.    BigData[is.na(BigData$Age),] <- meanAge

d.    BigData$Age [!is.na(BigData$Age)] <- meanAge

e.    BigData$Age [is.na(BigData$Age)== meanAge,]
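For illustration, here is a minimal sketch of mean imputation in R. The toy values in the data frame are made up; only the names BigData, Age, and meanAge come from the question.

```r
# Toy data frame standing in for BigData (values are made up for illustration)
BigData <- data.frame(Age = c(25, NA, 40, NA, 31))

# Mean of the observed ages; na.rm = TRUE skips the missing entries
meanAge <- mean(BigData$Age, na.rm = TRUE)

# Logical index that is TRUE exactly where Age is missing
missing <- is.na(BigData$Age)

# Overwrite only the missing positions with the mean
BigData$Age[missing] <- meanAge

print(BigData$Age)
```

Indexing with is.na(...) selects the missing entries, whereas !is.na(...) would select (and overwrite) the observed ones.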

2.   Which of the options most closely describes the following R statement:

X <- 1.2*min(data$sales)

data[data$sales<X,"sales"] <- X

a.    Sets all values in data to X

b.    Replaces the column sales in data with X

c.    Sets the lowest values in data$sales to 120% of the lowest value

d.   Sets rows smaller than 120% of the minimum value of sales to X
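To see what the two statements do, they can be run on a small example. The sketch below uses a made-up data frame; only the column name sales and the variable X come from the question.

```r
# Toy data frame (values are made up for illustration)
data <- data.frame(sales = c(100, 110, 150, 300))

# 120% of the smallest sales value: 1.2 * 100 = 120
X <- 1.2 * min(data$sales)

# Rows whose sales fall below X get their sales value raised to X;
# rows at or above X are left unchanged
data[data$sales < X, "sales"] <- X

print(data$sales)
```

Only the sales column of the selected rows is modified, so the statement acts as a floor on sales at 120% of the original minimum.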

3.   What are the steps for using a gradient descent algorithm?

i.   Calculate error between the actual value and the predicted value

ii.   Reiterate until you find the best weights of the network

iii.   Pass an input through the network and get values from output layer

iv.  Set random weights and biases

v.  Go to each neuron that contributes to the error and change its respective values to reduce the error

a.    iv, iii, i, v, ii

b.    i, ii, iii, iv, v

c.    v, iv, iii, ii, i

d.    iii, ii, i, v, iv

e.    iii, iv, i, v, ii
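The steps listed in this question can be sketched as a short training loop. This is a minimal illustration, not a full neural network: it assumes a single linear neuron, a mean squared error loss, made-up data, and an arbitrarily chosen learning rate and number of iterations.

```r
set.seed(1)

# Toy inputs and targets (made up): the true relation is y = 2 * x + 1
x <- c(0, 1, 2, 3, 4)
y <- 2 * x + 1

# Set a random initial weight and bias
w <- rnorm(1)
b <- rnorm(1)

rate <- 0.05  # learning rate (assumed value)

# Repeat the update many times so the weights settle near their best values
for (epoch in 1:2000) {
  # Pass the inputs through the neuron to get the predicted outputs
  pred <- w * x + b

  # Error between the predicted and the actual values
  err <- pred - y

  # Gradient of the mean squared error with respect to w and b
  grad_w <- mean(2 * err * x)
  grad_b <- mean(2 * err)

  # Change each parameter in the direction that reduces the error
  w <- w - rate * grad_w
  b <- b - rate * grad_b
}

cat("w =", round(w, 4), " b =", round(b, 4), "\n")
```

With these settings the loop recovers a weight close to 2 and a bias close to 1, the values used to generate the toy data.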

4.   Which of the following is not true about logistic regression?

a.    Logistic regression gives class probability estimates

b.    Logistic regression takes a categorical target variable in training data

c.    Logistic regression is less prone to overfitting than tree induction

d.    Logistic regression represents the log odds of class membership as a linear function of attributes

e.   All of the above are true

5.    It is important for a predictive model …

a.    to have a high R²

b.    not to have any multicollinearity among variables

c.    not to have any omitted variables

d.   to have significant p-values for all attributes (e.g. p < 0.05)

e.    all of the above

6.   Which one of the choices below does not result in overfitting?

a.    Adding variables to the model

b.   Transforming variables

c.    Having a large training set

d.    More nodes in a classification tree

e.   All choices above will result in overfitting

7.   Which classifier would you expect to have the highest accuracy on the training set?

a.    k-nearest neighbour (k is the size of the training set)

b.    Logistic regression

c.    Support vector machine with non-linear kernel

d.   A classification tree with pure leaf nodes

e.   An artificial neural network with 4 hidden layers and 8 neurons in each layer

8.   The points on a model’s ROC curve 

a.    represent different rankings of examples

b.    represent the performance of different classification thresholds

c.    represent the cost of different classifications

d.    represent the number of true positives in a training vs. testing samples
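As a small illustration of how an ROC curve is traced out, the sketch below computes the false positive rate and true positive rate of one scoring model at several classification thresholds. The scores, labels, and thresholds are made up for the example.

```r
# Toy predicted probabilities and true classes (1 = positive), made up
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
labels <- c(1, 1, 0, 1, 0, 1, 0, 0)

# Thresholds to evaluate (assumed values)
thresholds <- c(0.25, 0.5, 0.75)

# One row per threshold: each row is one point (FPR, TPR) on the ROC curve
roc_points <- t(sapply(thresholds, function(th) {
  pred <- as.integer(scores >= th)                         # classify at this threshold
  tpr <- sum(pred == 1 & labels == 1) / sum(labels == 1)   # true positive rate
  fpr <- sum(pred == 1 & labels == 0) / sum(labels == 0)   # false positive rate
  c(threshold = th, FPR = fpr, TPR = tpr)
}))

print(roc_points)
```

Raising the threshold lowers both rates here, which is why moving along the curve trades false positives against false negatives.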

9.   We use a validation set to 

a.    find the k in the k-nearest neighbour

b.    decide on a classification threshold

c.    select features to be included in the model

d.   find the best regularization rate

e.   All choices above

10. Below is an example of a neural network. You can see that the last neuron takes input from two neurons before it. The activation function for all the neurons is given by:

f(x) = { … , for x < 0

[Figure: neural network diagram with inputs X1 and X2; only the labels "1" and "a2" survive]

Suppose X1 is 0.5 and X2 is 1, what will be the output for the above neural network?

a.   -1

b.   0

c.   0.5

d.   1

e.   -0.5

11. How should the best hyperparameters be selected in tree-based models?

a.    Measure performance over training data

b.    Measure performance over validation data

c.     Both of these

12. Suppose you are given the following training and validation errors for a decision tree. Which of the following hyperparameter settings would you choose in this case?

Scenario   Depth   Training Error   Validation Error
1          2       100              110
2          4       90               104
4          8       45               104
5          10      30               150

a.   Scenario 1

b.   Scenario 2

c.   Scenario 3

d.   Scenario 4

13. You are training a decision tree that splits a node based on the highest information gain. Which of the following splits has the highest information gain?

 

a.     Outlook

b.    Humidity

c.    Windy

d.   Temperature

Part II: Short answer questions

Please answer in complete sentences in the space provided.

1.    Briefly explain what a classification threshold is and how it may change the accuracy of a classifier (including the different types of errors).

2.   What is regularization? Explain the difference between the L1 and L2 norms.

3.    Describe two advantages of a classification tree over a neural network.

4.   A data mining routine has been applied to a transaction dataset and has classified 88 records as fraudulent (30 correctly so) and 952 as non-fraudulent (920 correctly so). Construct the confusion matrix and calculate the overall error rate.
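The arithmetic behind this kind of question can be sketched in R as follows. The counts are taken from the question; the variable names and the layout of the matrix (predicted classes in rows, actual classes in columns) are choices made for this illustration.

```r
# Counts stated in the question
pred_fraud_total      <- 88   # records classified as fraudulent
pred_fraud_correct    <- 30   # of those, actually fraudulent
pred_nonfraud_total   <- 952  # records classified as non-fraudulent
pred_nonfraud_correct <- 920  # of those, actually non-fraudulent

# Confusion matrix: rows are predicted classes, columns are actual classes
cm <- matrix(
  c(pred_fraud_correct, pred_fraud_total - pred_fraud_correct,
    pred_nonfraud_total - pred_nonfraud_correct, pred_nonfraud_correct),
  nrow = 2, byrow = TRUE,
  dimnames = list(Predicted = c("Fraud", "Nonfraud"),
                  Actual    = c("Fraud", "Nonfraud"))
)

# Overall error rate: off-diagonal (misclassified) records over all records
error_rate <- (sum(cm) - sum(diag(cm))) / sum(cm)

print(cm)
print(error_rate)  # 90 / 1040, about 0.0865
```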

5.    Consider the following decile-wise lift chart for the transaction data model in question 4 (Note that Mean Response on the y-axis represents the lift value).

 

a.    Interpret the meaning of the first and second bars from the left.

b.    Explain how you might use this information in practice.

c.    Another analyst comments that you could improve the accuracy of the model by classifying everything as non-fraudulent. If you do that, what is the error rate?

d.    Comment on the usefulness, in this situation, of these two metrics of model performance (error rate and lift).

6.   A company that manufactures riding mowers wants to identify the best sales prospects for an intensive sales campaign. In particular, the manufacturer is interested in classifying households as prospective owners or nonowners on the basis of Income (in $1000s) and Lot Size (in 1000 ft²). The marketing expert looked at a random sample of 24 households. Use all the data to fit a logistic regression of ownership on the two predictors.

reg <- glm(Ownership ~ ., data = mowers.df, family = "binomial")
summary(reg)

              Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)   -25.9382   11.4871      -2.258    0.0239 *
Income          0.1109    0.0543       2.042    0.0412 *
Lot_Size        0.9638    0.4628       2.083    0.0373 *

Confusion Matrix:

             Reference
Prediction   Nonowner   Owner
  Nonowner         10       2
  Owner             2      10

a.    Among nonowners, what is the percentage of households classified correctly?

b.   To increase the percentage of correctly classified nonowners, should the cutoff probability be increased or decreased?

c.    What are the odds that a household with a $60K income and a lot size of 20,000 ft² is an owner?

d.    Using a cutoff probability of 0.5, what is the minimum income that a household with a 16,000 ft² lot should have before it is classified as an owner?

e.    What is the classification of a household with a $60K income and a lot size of 20,000 ft²? Use cutoff = 0.5.
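The calculations in parts c to e follow directly from the fitted coefficients. The sketch below uses the rounded estimates printed in the regression output, so the results are approximate. It assumes that Income is entered in $1000s and Lot_Size in 1000 ft² (as stated in the question), that the model predicts the probability of being an owner, and that a household is classified as an owner when that probability is at least 0.5. The variable names are chosen for this illustration.

```r
# Rounded coefficient estimates from the regression output
b0       <- -25.9382
b_income <-   0.1109
b_lot    <-   0.9638

# Household with a $60K income and a 20,000 square foot lot
logit <- b0 + b_income * 60 + b_lot * 20   # log odds of ownership, about -0.0082
odds  <- exp(logit)                         # odds of ownership, just under 1
prob  <- odds / (1 + odds)                  # probability of ownership, just under 0.5

# Classification at a cutoff of 0.5 (assumed rule: owner if prob >= 0.5)
label <- if (prob >= 0.5) "Owner" else "Nonowner"

# Minimum income (in $1000s) for a 16,000 square foot lot:
# prob >= 0.5 is equivalent to logit >= 0, so solve b0 + b_income * I + b_lot * 16 = 0
min_income <- -(b0 + b_lot * 16) / b_income  # about 94.84

cat("odds =", round(odds, 4), " prob =", round(prob, 4), " class =", label, "\n")
cat("minimum income (in $1000s) =", round(min_income, 2), "\n")
```

Because a probability of 0.5 corresponds to odds of 1 and a log odds of 0, the cutoff question reduces to solving a linear equation in Income.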

Part III: Case questions

Please answer in complete sentences in the space provided.

Case Question 1

EasyLoan has archived a large amount of data from 30,000 previous clients, which includes information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients. In addition to this credit data, we also have the amount of credit given to each customer. EasyLoan is now reassessing the way in which it evaluates customers for loans.

Appendix Q1-A provides definitions of each of the data elements

Appendix Q1-B provides a sample of the data (the first 20 customers)

Appendix Q1-C provides regression output for Part A

Appendix Q1-D provides the impact of classification threshold on model accuracy

Appendix Q1-E provides regression output for Part B

Part A - We started the investigation by performing a logistic regression with the following independent variables:

LIMIT_BAL, SEX, EDUCATION, MARRIAGE, AGE

1.1) Interpret the output of the regression (Appendix Q1-C) and describe the characteristics of a customer likely to default. Be sure to consider the statistical significance of the variables in your response.

1.2) Based on this regression output (Appendix Q1-C), which of the customers below has a lower probability of default in his/her payment? Explain.

 

            Customer 1   Customer 2
LIMIT_BAL   20000        10000
SEX         1            2
EDUCATION   3            3
MARRIAGE    3            1
AGE         20           30

1.3) Initially we set the threshold for the probability estimate of the logistic regression output (for identifying customers who will default) to 0.5. We calculate a training error of 0.2214 and a test error of 0.2204. What can we say about the fit of our model (overfit / good fit / underfit)?

1.4) We investigate the accuracy of the model when choosing different thresholds (Appendix Q1-D). What can we conclude about the different types of errors (false positive rate / false negative rate) of the model for our initial threshold of 0.5?

1.5) To induce more spending, EasyLoan is considering offering a credit increase to customers with a good credit history (no default).

a.   How should we change the initial threshold of 0.5 to select customers to send the offer to?

b.   How will this change affect the original test error of 0.2204?

c.   Discuss the impact on false positive and false negative rates.

Part B - Next, we performed a logistic regression including all the variables.

1.6) How would you interpret the output of the regression (Appendix Q1-E) and the impact of the variables missing from Part A on the probability of default?

1.7) Explain how you would select the variables to be included in your final logistic regression model.

Appendix Q1-A

Dataset Information

This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.

Content

There are 25 variables:

•     ID: ID of each client

•     LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)

•    SEX: Gender (1=male, 2=female)

•     EDUCATION: (4=graduate school, 3=university, 2=high school, 1=others)

•     MARRIAGE: Marital status (1=married, 2=single, 3=others)

•    AGE: Age in years

•     PAY_0: Repayment status in September, 2005 (-2 and -1 = pay duly, 1 = payment delay for one month, 2 = payment delay for two months, ... 8 = payment delay for eight months, 9 = payment delay for nine months and above)

•     PAY_2: Repayment status in August, 2005 (scale same as above)

•     PAY_3: Repayment status in July, 2005 (scale same as above)

•     PAY_4: Repayment status in June, 2005 (scale same as above)

•     PAY_5: Repayment status in May, 2005 (scale same as above)

•     PAY_6: Repayment status in April, 2005 (scale same as above)

•     BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)

•     BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)

•     BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)

•     BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)

•     BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)

•     BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)

•     PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)

•     PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)

•     PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)

•     PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)

•     PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)

•     PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)

•    default.payment.next.month: Default payment (1=yes, 0=no)

Appendix Q1-B

Some of the payment and billing columns are hidden for readability.

 

Appendix Q1-C

 

Appendix Q1-D

 

Appendix Q1-E