
ADEC 7320 - Econometrics

Homework #2 Assignment Requirements

Overview

In this homework assignment, you will explore, analyze and model a data set containing approximately 8000 records, each representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A “0” means that the person was not in a car crash. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero.

Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

VARIABLE NAME   DEFINITION                                THEORETICAL EFFECT
INDEX           Identification Variable (do not use)      None
TARGET_FLAG     Was Car in a crash? 1=YES 0=NO            None
TARGET_AMT      If car was in a crash, what was the cost  None
AGE             Age of Driver                             Very young people tend to be risky. Maybe very old people also.
BLUEBOOK        Value of Vehicle                          Unknown effect on probability of collision, but probably affects the payout if there is a crash
CAR_AGE         Vehicle Age                               Unknown effect on probability of collision, but probably affects the payout if there is a crash
CAR_TYPE        Type of Car                               Unknown effect on probability of collision, but probably affects the payout if there is a crash
CAR_USE         Vehicle Use                               Commercial vehicles are driven more, so might increase probability of collision
CLM_FREQ        # Claims (Past 5 Years)                   The more claims you filed in the past, the more you are likely to file in the future
EDUCATION       Max Education Level                       Unknown effect, but in theory more educated people tend to drive more safely
HOMEKIDS        # Children at Home                        Unknown effect
HOME_VAL        Home Value                                In theory, home owners tend to drive more responsibly
INCOME          Income                                    In theory, rich people tend to get into fewer crashes
JOB             Job Category                              In theory, white collar jobs tend to be safer
KIDSDRIV        # Driving Children                        When teenagers drive your car, you are more likely to get into crashes
MSTATUS         Marital Status                            In theory, married people drive more safely
MVR_PTS         Motor Vehicle Record Points               If you get lots of traffic tickets, you tend to get into more crashes
OLDCLAIM        Total Claims (Past 5 Years)               If your total payout over the past five years was high, this suggests future payouts will be high
PARENT1         Single Parent                             Unknown effect
RED_CAR         A Red Car                                 Urban legend says that red cars (especially red sports cars) are riskier. Is that true?
REVOKED         License Revoked (Past 7 Years)            If your license was revoked in the past 7 years, you probably are a riskier driver.
SEX             Gender                                    Urban legend says that women have fewer crashes than men. Is that true?
TIF             Time in Force                             People who have been customers for a long time are usually safer.
TRAVTIME        Distance to Work                          Long drives to work usually suggest greater risk
URBANICITY      Home/Work Area                            Unknown
YOJ             Years on Job                              People who stay at a job for a long time are usually safer

Deliverables:

•    A write-up submitted in PDF format. Your write-up should have four sections. Each one is described below. You may assume you are addressing me as a fellow data scientist, so you do not need to shy away from technical details.

•    Assigned predictions (probabilities, classifications, cost) for the evaluation data set. Use a 0.5 threshold.

•    Include your R statistical programming code too. Ideally, create the PDF using R Markdown directly and include both your code and output. Alternatively, you can submit your R script separately, or put the code in the Appendix. Ensure that the code runs without errors from top to bottom.

Write Up:

1.    DATA EXPLORATION (50 Points)

Describe the size and the variables in the insurance training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a checklist of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas.

a.    Mean / Standard Deviation / Median / Min / Max / Skewness - you can use the stargazer package to create a professional-looking summary statistics table, and please do mention an insight or two on a few variables

b.    Histograms or Bar Chart or Box Plot of the data

c.    Are the variables correlated with the target variable (or with other variables)?

d.   Do any of the variables have missing values that need to be imputed / “fixed”?
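
As one possible starting point for items a, c and d, the sketch below computes summary statistics, missing-value counts and correlations with the target in base R. The data frame `train` is a small synthetic stand-in with made-up values (an assumption for illustration, not the real training file), and `skewness()` is a simple moment-based helper defined here, not a base R function.

```r
# Synthetic stand-in for the training data (hypothetical values, not the real file)
set.seed(1)
train <- data.frame(
  TARGET_FLAG = rbinom(200, 1, 0.25),
  AGE         = round(rnorm(200, 45, 9)),
  TRAVTIME    = round(rexp(200, 1 / 30)),
  YOJ         = round(runif(200, 0, 20))
)
train$AGE[c(3, 17)]     <- NA   # inject a few missing values
train$YOJ[c(5, 40, 99)] <- NA

# Simple moment-based skewness (helper defined for this sketch)
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

# Summary statistics for each column, ignoring missing values
stats <- sapply(train, function(x) c(
  mean     = mean(x, na.rm = TRUE),
  sd       = sd(x, na.rm = TRUE),
  median   = median(x, na.rm = TRUE),
  min      = min(x, na.rm = TRUE),
  max      = max(x, na.rm = TRUE),
  skewness = skewness(x)
))
print(round(t(stats), 2))

# Number of missing values per variable
na_counts <- colSums(is.na(train))
print(na_counts)

# Correlation of each variable with the target, using pairwise complete cases
target_cor <- cor(train, use = "pairwise.complete.obs")[, "TARGET_FLAG"]
print(round(target_cor, 3))
```

On the real data, the same calls would apply after reading the file with `read.csv()` and restricting to numeric columns.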

2.    DATA PREPARATION (50 Points)

Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations.

a.    Create flags to suggest if a variable was missing

b.    Fix missing values (maybe with a Mean or Median value)

c.    Convert categorical variables into dummy variables or into numeric variables

d.   Transform data by putting it into buckets

e.    Mathematical transformations such as log or square root (or use Box-Cox). You can also try an inverse transformation (1/x) or truncation (capping the maximum possible value)

f.     Combine variables (such as ratios or adding or multiplying) to create new variables
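
The transformations listed above might look roughly like this in base R. This is a minimal sketch on synthetic data: the values, the bucket cut points and the derived column names (`CAR_AGE_MISSING`, `COMMERCIAL`, `MVR_BUCKET`, `LOG_INCOME`, `BLUEBOOK_TO_INCOME`) are all assumptions chosen for illustration.

```r
# Synthetic stand-in for the training data (hypothetical values)
set.seed(2)
train <- data.frame(
  INCOME   = round(rlnorm(100, meanlog = 11, sdlog = 0.6)),
  BLUEBOOK = round(runif(100, 5000, 40000)),
  CAR_AGE  = sample(0:20, 100, replace = TRUE),
  CAR_USE  = sample(c("Commercial", "Private"), 100, replace = TRUE),
  MVR_PTS  = rpois(100, 1.5)
)
train$CAR_AGE[c(4, 10)] <- NA   # inject missing values

# (a) Flag missing values before imputing, so the information is not lost
train$CAR_AGE_MISSING <- as.integer(is.na(train$CAR_AGE))

# (b) Impute missing values with the median
train$CAR_AGE[is.na(train$CAR_AGE)] <- median(train$CAR_AGE, na.rm = TRUE)

# (c) Convert a categorical variable into a 0/1 dummy variable
train$COMMERCIAL <- as.integer(train$CAR_USE == "Commercial")

# (d) Put a numeric variable into buckets (cut points are arbitrary here)
train$MVR_BUCKET <- cut(train$MVR_PTS, breaks = c(-Inf, 0, 3, Inf),
                        labels = c("none", "1-3", "4+"))

# (e) Log-transform a right-skewed variable; log1p(x) = log(1 + x) tolerates zeros
train$LOG_INCOME <- log1p(train$INCOME)

# (f) Combine variables into a ratio; the +1 guards against division by zero
train$BLUEBOOK_TO_INCOME <- train$BLUEBOOK / (train$INCOME + 1)

str(train)
```

Whichever transformations you keep, the write-up should say why each one was applied.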

3.    BUILD MODELS (50 Points)

Using the training data set, build at least two different multiple linear regression and at least three different binary logistic regression models, using different variables (or the same variables with different transformations). You may select the variables manually, use an approach such as Forward or Stepwise, or use a combination of techniques. Describe the techniques you used. If you manually selected a variable for inclusion in the model or exclusion from the model, indicate why this was done.

BONUS (20 Points): You can substitute one multiple regression model and/or one logistic regression model with a ridge or lasso regression. Both are very closely related to OLS: they simply modify the objective function (minimizing the sum of squared residuals) by adding a penalty term on the size of the coefficients. Ridge penalizes the squared coefficients, while lasso penalizes their absolute values, which makes the lasso penalty more extreme in that it can shrink some coefficients exactly to zero. You might also come across elastic net regression models, which combine the penalty functions of ridge and lasso and thereby have properties of both, but you can ignore that class of models. Because lasso can eliminate variables, you can try to implement it for feature selection: start from the base variables, drop the variables to which lasso assigns a coefficient of zero, and then potentially eliminate more variables successively based on issues like multicollinearity, residual analysis, coefficients that lack the expected sign or statistical significance, et cetera.
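
For reference, the two penalized objectives can be written as below. The notation follows a common textbook convention rather than anything specific to this assignment: λ ≥ 0 is a tuning parameter (usually chosen by cross-validation), β_0 is the unpenalized intercept, and p is the number of predictors. Setting λ = 0 recovers OLS.

```latex
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2

\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

Because both penalties depend on the scale of each predictor, the predictors are normally standardized before fitting.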

Discuss the coefficients in the models. Do they make sense? For example, if a person has a lot of traffic tickets, you would reasonably expect that person to have more car crashes. But by how much? If the coefficient is negative (suggesting that the person is a safer driver), then that needs to be discussed. In particular, for the logit/probit models, I will be looking for the quantitative interpretation of the coefficient in terms of log odds, odds ratio, factor, or percent increase, as we discussed in the MeetUp. Thus, for your favorite/final model, give me the interpretation of both a positive and a negative coefficient. This will ensure you really understand the methodology. Feel free to share your personal notes/understanding of the approach too, such as why we prefer logit over the OLS/linear probability model; it will help you internalize the material better. You will need to get comfortable with proportions/percentages/probabilities and odds (ratios), and with switching between probability, odds, and log odds.
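
A minimal sketch of this kind of interpretation, assuming synthetic data in which crash risk rises with MVR_PTS and falls with TIF. The true coefficients (0.30 and -0.05) are made up for illustration and say nothing about the real data set.

```r
# Synthetic data (hypothetical): log odds of a crash = -1.0 + 0.30*MVR_PTS - 0.05*TIF
set.seed(3)
n <- 2000
MVR_PTS     <- rpois(n, 1.5)
TIF         <- sample(1:20, n, replace = TRUE)
p_true      <- plogis(-1.0 + 0.30 * MVR_PTS - 0.05 * TIF)
TARGET_FLAG <- rbinom(n, 1, p_true)

fit <- glm(TARGET_FLAG ~ MVR_PTS + TIF, family = binomial(link = "logit"))

log_odds   <- coef(fit)               # change in log odds per one-unit increase
odds_ratio <- exp(log_odds)           # multiplicative change in the odds
pct_change <- 100 * (odds_ratio - 1)  # percent change in the odds
print(round(cbind(log_odds, odds_ratio, pct_change), 3))

# Switching between probability, odds and log odds
p    <- 0.20
odds <- p / (1 - p)   # odds of 0.25, i.e. 1 to 4
lo   <- log(odds)     # log odds; identical to qlogis(p)
back <- plogis(lo)    # back to the probability 0.20
```

Reading the true values used in the simulation: a coefficient of 0.30 corresponds to an odds ratio of exp(0.30) ≈ 1.35, so each additional MVR point multiplies the odds of a crash by about 1.35 (a 35% increase), holding TIF fixed. A coefficient of -0.05 gives exp(-0.05) ≈ 0.951, about a 4.9% decrease in the odds per additional year in force. The fitted estimates will differ somewhat from these values because of sampling noise.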

Are you keeping the model even though it is counterintuitive? Why? The boss needs to know.

4.    SELECT MODELS (50 Points)

Decide on the criteria for selecting the best multiple linear regression model and the best binary logistic regression model. Will you select a model with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model.

A.   For the multiple linear regression model, will you use a metric such as adjusted R², RMSE, etc.? Be sure to explain how you can make inferences from the model, discuss multi-collinearity issues (if any), and discuss other relevant model output. Using the training data set, evaluate the multiple linear regression model based on

a.    mean squared error,

b.    R²,

c.    F-statistic, and

d.    residual plots.

i.       Residuals vs Fitted Values

ii.       Normal Q-Q (Quantile-Quantile) Plot: tells whether the residuals are normally distributed by comparing them with a theoretical normal distribution

iii.       Scale-Location / Spread-Location Plot: shows whether the residuals are spread equally across the range of predictions, in order to check homoscedasticity

iv.       Residuals vs Leverage Plot: shows influential data points that have a big effect on the linear model
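
The metrics and plots in item A could be obtained along these lines. This is a sketch on synthetic data: the values and the two-variable model are assumptions for illustration, not a suggested specification.

```r
# Synthetic data (hypothetical values)
set.seed(4)
n <- 300
BLUEBOOK   <- runif(n, 5000, 40000)
CAR_AGE    <- sample(0:20, n, replace = TRUE)
TARGET_AMT <- 1000 + 0.10 * BLUEBOOK - 50 * CAR_AGE + rnorm(n, sd = 800)

fit <- lm(TARGET_AMT ~ BLUEBOOK + CAR_AGE)
s   <- summary(fit)

mse    <- mean(residuals(fit)^2)    # mean squared error on the training data
rmse   <- sqrt(mse)
r2     <- s$r.squared
adj_r2 <- s$adj.r.squared
f_stat <- s$fstatistic[["value"]]   # F-statistic for overall significance

print(c(MSE = mse, RMSE = rmse, R2 = r2, Adj_R2 = adj_r2, F = f_stat))

# The four diagnostic plots: Residuals vs Fitted, Normal Q-Q,
# Scale-Location, and Residuals vs Leverage
par(mfrow = c(2, 2))
plot(fit, which = c(1, 2, 3, 5))
par(mfrow = c(1, 1))
```

Note that `mean(residuals(fit)^2)` divides by n, whereas the residual standard error reported by `summary()` divides by the residual degrees of freedom, so the two will not match exactly.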

B.    For the binary logistic regression model, will you use a metric such as log likelihood, AIC, ROC curve, etc.? Using the training data set, evaluate the binary logistic regression model based on confusion-matrix-based classification/misclassification metrics:

a.    classification error rate, (REQUIRED)

b.    accuracy, (REQUIRED)

c.    sensitivity, (REQUIRED)

d.   specificity, (REQUIRED)

e.    precision, (OPTIONAL)

f.     F1 score, (OPTIONAL)

g.    AUC (OPTIONAL)
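
These metrics can all be computed by hand from a confusion matrix, which is a useful check on whatever a package reports. The sketch below uses ten made-up observations (hypothetical actual classes and predicted probabilities) and the 0.5 threshold, and computes AUC with the rank-based (Mann-Whitney) formula.

```r
# Hypothetical actual classes and predicted probabilities
actual <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0)
prob   <- c(0.81, 0.10, 0.62, 0.45, 0.20, 0.77, 0.05, 0.35, 0.90, 0.55)
pred   <- as.integer(prob >= 0.5)

# Confusion matrix: rows are predictions, columns are actual classes
cm <- table(Predicted = factor(pred, levels = c(0, 1)),
            Actual    = factor(actual, levels = c(0, 1)))
print(cm)

TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]

accuracy    <- (TP + TN) / sum(cm)
error_rate  <- 1 - accuracy
sensitivity <- TP / (TP + FN)   # true positive rate (recall)
specificity <- TN / (TN + FP)   # true negative rate
precision   <- TP / (TP + FP)
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

# AUC: probability that a random positive is ranked above a random negative
r   <- rank(prob)
n1  <- sum(actual == 1)
n0  <- sum(actual == 0)
auc <- (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

print(round(c(accuracy = accuracy, error_rate = error_rate,
              sensitivity = sensitivity, specificity = specificity,
              precision = precision, F1 = f1, AUC = auc), 4))
```

For this toy example the counts are TP = 3, TN = 4, FP = 2 and FN = 1, giving accuracy 0.70, sensitivity 0.75, specificity 4/6 ≈ 0.667 and precision 0.60.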


You are likely to cross paths with the caret package here. You can refer to its R cheat sheet, watch YouTube videos, or find examples of the implementation online at, say, Stack Exchange or other places.


Also, there is NO need for residual analysis on the logit model.

C.    Make predictions using the evaluation data set.
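
One way the three required outputs (probability, classification, cost) might be produced is sketched below. Everything here is an assumption for illustration: the data are synthetic, the models use one predictor each, and fitting the cost model only on records with a crash is just one possible design choice, not a requirement of the assignment.

```r
# Synthetic stand-ins for the training and evaluation sets (hypothetical values)
set.seed(5)
make_data <- function(n) {
  d <- data.frame(MVR_PTS = rpois(n, 1.5), BLUEBOOK = runif(n, 5000, 40000))
  d$TARGET_FLAG <- rbinom(n, 1, plogis(-1.5 + 0.3 * d$MVR_PTS))
  crash_cost    <- pmax(500 + 0.1 * d$BLUEBOOK + rnorm(n, sd = 500), 1)
  d$TARGET_AMT  <- ifelse(d$TARGET_FLAG == 1, crash_cost, 0)
  d
}
train    <- make_data(1000)
eval_set <- make_data(200)[, c("MVR_PTS", "BLUEBOOK")]  # targets unknown at prediction time

# Crash model (logit) and cost model (OLS on crash records only)
logit_fit <- glm(TARGET_FLAG ~ MVR_PTS, data = train, family = binomial)
amt_fit   <- lm(TARGET_AMT ~ BLUEBOOK, data = subset(train, TARGET_FLAG == 1))

# Predicted probability, classification at the 0.5 threshold, and cost given a crash
eval_set$PROB  <- predict(logit_fit, newdata = eval_set, type = "response")
eval_set$CLASS <- as.integer(eval_set$PROB >= 0.5)
eval_set$COST  <- predict(amt_fit, newdata = eval_set)

head(eval_set)
```

Any transformations applied to the training data (imputation, dummies, logs) must be applied in the same way to the evaluation data before calling `predict()`.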

GRADING RUBRIC:  I will be looking for:

I.       DATA EXPLORATION: Performing EDA as well as summarizing some key patterns/insights that you found useful for model building.

II.       DATA PREPARATION:  Dealing with missing values and outliers, along with performing feature engineering and some variable transformations.

III.       BUILDING MODELS: Building and discussing at least two different multiple linear regression models and at least three different binary logistic regression models.

Discussion of the model coefficient estimates (final logit model specifically) and regression model output.

IV.       SELECTING MODELS:

Appropriate justification of your "best" model. Most importantly, your ability to construct the confusion matrix and model performance measures such as classification error rate, accuracy, sensitivity, and specificity. BONUS POINTS (10) to the person with the highest accuracy. Your predictions on the evaluation set.