ADEC 7320 - Econometrics Homework #2
Homework #2 Assignment Requirements
Overview
In this homework assignment, you will explore, analyze and model a data set containing approximately 8,000 records, each representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A “0” means that the person was not in a car crash. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car. If they did crash their car, it is a value greater than zero.
Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
| VARIABLE NAME | DEFINITION | THEORETICAL EFFECT |
|---|---|---|
| INDEX | Identification Variable (do not use) | None |
| TARGET_FLAG | Was Car in a crash? 1=YES 0=NO | None |
| TARGET_AMT | If car was in a crash, what was the cost | None |
| AGE | Age of Driver | Very young people tend to be risky. Maybe very old people also. |
| BLUEBOOK | Value of Vehicle | Unknown effect on probability of collision, but probably affects the payout if there is a crash |
| CAR_AGE | Vehicle Age | Unknown effect on probability of collision, but probably affects the payout if there is a crash |
| CAR_TYPE | Type of Car | Unknown effect on probability of collision, but probably affects the payout if there is a crash |
| CAR_USE | Vehicle Use | Commercial vehicles are driven more, so might increase probability of collision |
| CLM_FREQ | # Claims (Past 5 Years) | The more claims you filed in the past, the more you are likely to file in the future |
| EDUCATION | Max Education Level | Unknown effect, but in theory more educated people tend to drive more safely |
| HOMEKIDS | # Children at Home | Unknown effect |
| HOME_VAL | Home Value | In theory, home owners tend to drive more responsibly |
| INCOME | Income | In theory, rich people tend to get into fewer crashes |
| JOB | Job Category | In theory, white collar jobs tend to be safer |
| KIDSDRIV | # Driving Children | When teenagers drive your car, you are more likely to get into crashes |
| MSTATUS | Marital Status | In theory, married people drive more safely |
| MVR_PTS | Motor Vehicle Record Points | If you get lots of traffic tickets, you tend to get into more crashes |
| OLDCLAIM | Total Claims (Past 5 Years) | If your total payout over the past five years was high, this suggests future payouts will be high |
| PARENT1 | Single Parent | Unknown effect |
| RED_CAR | A Red Car | Urban legend says that red cars (especially red sports cars) are riskier. Is that true? |
| REVOKED | License Revoked (Past 7 Years) | If your license was revoked in the past 7 years, you probably are a riskier driver. |
| SEX | Gender | Urban legend says that women have fewer crashes than men. Is that true? |
| TIF | Time in Force | People who have been customers for a long time are usually safer. |
| TRAVTIME | Distance to Work | Long drives to work usually suggest greater risk |
| URBANICITY | Home/Work Area | Unknown |
| YOJ | Years on Job | People who stay at a job for a long time are usually safer |
Deliverables:
• A write-up submitted in PDF format. Your write-up should have four sections. Each one is described below. You may assume you are addressing me as a fellow data scientist, so you do not need to shy away from technical details.
• Assigned predictions (probabilities, classifications, cost) for the evaluation data set. Use a 0.5 threshold for the classifications.
• Include your R statistical programming code too. Ideally, create the PDF using R Markdown directly and include both your code and output. Alternatively, you can submit your R script separately, or put the code in the Appendix. Ensure that the code runs without errors from top to bottom.
Write Up:
1. DATA EXPLORATION (50 Points)
Describe the size and the variables in the insurance training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a check list of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas.
a. Mean / Standard Deviation / Median / Min / Max / Skewness - you can use the stargazer package to create a professional-looking summary statistics table; please also mention an insight or two on a few variables
b. Histograms or Bar Chart or Box Plot of the data
c. Is the data correlated to the target variable (or to other variables)?
d. Are any of the variables missing and need to be imputed / “fixed”?
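As a starting point for suggestions a, c and d, here is a minimal base R sketch. It uses a small simulated data frame, not the real insurance file: the values and the helper names `skewness` and `summarize_column` are made up for illustration, and only a few of the columns from the table above are mimicked.

```r
# Simulated stand-in for the training data; values are made up for illustration.
set.seed(1)
n <- 200
train <- data.frame(
  TARGET_FLAG = rbinom(n, 1, 0.25),
  AGE         = round(rnorm(n, 45, 9)),
  INCOME      = round(rlnorm(n, 10.8, 0.7)),
  MVR_PTS     = rpois(n, 1.5)
)
train$AGE[sample(n, 10)] <- NA   # inject some missing values

# Sample skewness (third standardized moment), ignoring NAs.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Summary statistics and missing-value count for one numeric column.
summarize_column <- function(x) {
  c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE), min = min(x, na.rm = TRUE),
    max = max(x, na.rm = TRUE), skew = skewness(x),
    n_missing = sum(is.na(x)))
}
summary_table <- t(sapply(train, summarize_column))   # one row per variable
print(round(summary_table, 2))

# Correlation of each predictor with the binary target (pairwise complete cases).
target_cor <- cor(train[, -1], train$TARGET_FLAG, use = "pairwise.complete.obs")
print(round(target_cor, 3))
```

A table like this is only raw material. The write-up should still single out the one or two patterns worth the manager's attention, such as a strongly skewed variable or a column with many missing values.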
2. DATA PREPARATION (50 Points)
Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations.
a. Create flags to suggest if a variable was missing
b. Fix missing values (maybe with a Mean or Median value)
c. Convert categorical variables into dummy variables or into numeric variables
d. Transform data by putting it into buckets
e. Mathematical transformations such as log or square root (or use Box-Cox). Can try inverse transformation (1/x) or truncation (cap the maximum value possible)
f. Combine variables (such as ratios or adding or multiplying) to create new variables
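The transformations listed above could be sketched in base R roughly as follows. The data frame is simulated, and the derived column names (`INCOME_MISSING`, `AGE_BUCKET`, `LOG_INCOME`, `BLUEBOOK_CAP`, `BLUEBOOK_TO_INCOME`), the cut points and the 99th-percentile cap are arbitrary choices for illustration, not requirements of the assignment.

```r
# Simulated stand-in for a few training columns; values are made up.
set.seed(2)
n <- 100
train <- data.frame(
  INCOME   = round(rlnorm(n, 10.8, 0.7)),
  AGE      = round(rnorm(n, 45, 9)),
  CAR_TYPE = sample(c("Minivan", "SUV", "Sports Car"), n, replace = TRUE),
  BLUEBOOK = round(runif(n, 1500, 60000)),
  stringsAsFactors = TRUE
)
train$INCOME[sample(n, 8)] <- NA

# (a) Flag missing values before imputing, so the information is not lost.
train$INCOME_MISSING <- as.integer(is.na(train$INCOME))

# (b) Impute with the median of the observed values.
train$INCOME[is.na(train$INCOME)] <- median(train$INCOME, na.rm = TRUE)

# (c) Expand a factor into 0/1 dummy columns (the first level is the baseline).
dummies <- model.matrix(~ CAR_TYPE, data = train)[, -1]

# (d) Bucket a continuous variable; the cut points here are arbitrary.
train$AGE_BUCKET <- cut(train$AGE, breaks = c(-Inf, 25, 65, Inf),
                        labels = c("young", "middle", "senior"))

# (e) Log transform (log1p tolerates zeros) and cap extreme values.
train$LOG_INCOME   <- log1p(train$INCOME)
cap                <- quantile(train$BLUEBOOK, 0.99)
train$BLUEBOOK_CAP <- pmin(train$BLUEBOOK, cap)

# (f) Combine variables into a ratio (+1 avoids division by zero).
train$BLUEBOOK_TO_INCOME <- train$BLUEBOOK / (train$INCOME + 1)

head(train)
head(dummies)
```

Whatever you choose, apply exactly the same steps (with imputation values and caps computed on the training data) to the evaluation data set before predicting.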
3. BUILD MODELS (50 Points)
Using the training data set, build at least two different multiple linear regression models and at least three different binary logistic regression models, using different variables (or the same variables with different transformations). You may select the variables manually, use an approach such as Forward or Stepwise selection, or use a combination of techniques. Describe the techniques you used. If you manually selected a variable for inclusion in or exclusion from the model, indicate why this was done.
BONUS (20 Points): You can substitute one multiple regression model and/or one logistic regression model with a ridge or lasso regression. Both are closely related to OLS: they modify the objective function (minimizing the sum of squared residuals) by adding a penalty term on the size of the coefficients, which discourages keeping many variables in the model. The lasso penalty is the more extreme of the two because it can shrink coefficients exactly to zero, whereas ridge only shrinks them toward zero. You might also come across elastic net regression models, which combine the ridge and lasso penalties and so have properties of both, but you can ignore that class of models. Lasso may be more helpful in eliminating variables, so you can try it for feature selection: start from the base variables, drop the variables to which lasso assigns a coefficient of zero, and then potentially eliminate more variables successively based on issues such as multicollinearity, residual analysis, coefficients not having the expected signs, or coefficients not being statistically significant, et cetera.
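For reference, the two penalized objectives can be written out as below. The notation is an assumption made here, not something fixed by the assignment: $y_i$ is the response, $x_{ij}$ the $j$-th predictor for record $i$, $\beta_0$ the unpenalized intercept, and $\lambda \ge 0$ a tuning parameter that is usually chosen by cross-validation.

```latex
\hat{\beta}^{\text{ridge}}
  = \arg\min_{\beta}\; \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^{2}
    + \lambda \sum_{j=1}^{p} \beta_j^{2}

\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta}\; \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^{2}
    + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

Setting $\lambda = 0$ recovers OLS in both cases. For a logistic model, the sum of squared residuals is replaced by the negative log-likelihood while the penalty term stays the same. In R, the glmnet package fits both, with `alpha = 0` for ridge and `alpha = 1` for lasso.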
Discuss the coefficients in the models. Do they make sense? For example, if a person has a lot of traffic tickets, you would reasonably expect that person to have more car crashes. But by how much? If the coefficient is negative (suggesting that the person is a safer driver), then that needs to be discussed. In particular, for the logit/probit models, I will be looking for a quantitative interpretation of the coefficient in terms of log odds, odds ratio, factor, or percent increase, as we discussed in the MeetUp. Thus, for your favorite/final model, give me the interpretation for both a positive and a negative coefficient. This will ensure you really understand the methodology. Feel free to share your personal notes/understanding of the approach too, such as why we prefer logit over the OLS/linear probability model. It will help you internalize the material better. You will need to get comfortable with proportions/percents/probability, odds (ratio), and switching between probability, odds, and log odds.
Are you keeping the model even though it is counterintuitive? Why? The boss needs to know.
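To practice switching between probability, odds and log odds, and reading a logit coefficient as an odds ratio or percent change, a small sketch may help. The helper function names and the two coefficient values are hypothetical; they are not estimates from the assignment data.

```r
# Conversions between probability, odds, and log odds.
prob_to_odds    <- function(p) p / (1 - p)
odds_to_prob    <- function(o) o / (1 + o)
prob_to_logodds <- function(p) log(p / (1 - p))   # same as qlogis(p)
logodds_to_prob <- function(z) 1 / (1 + exp(-z))  # same as plogis(z)

prob_to_odds(0.2)    # probability 0.2 corresponds to odds of 1 to 4
odds_to_prob(0.25)   # and back again

# Hypothetical logit coefficients, NOT estimates from the assignment data.
beta_mvr_pts <- 0.10    # positive: each extra point raises the log odds by 0.10
beta_tif     <- -0.05   # negative: each extra year lowers the log odds by 0.05

odds_ratio <- exp(c(MVR_PTS = beta_mvr_pts, TIF = beta_tif))
pct_change <- 100 * (odds_ratio - 1)

round(odds_ratio, 3)   # multiplicative factor on the odds per one-unit increase
round(pct_change, 1)   # percent change in the odds per one-unit increase
```

With these made-up values, one more MVR point multiplies the odds of a crash by about 1.105 (roughly a 10.5% increase), and one more year as a customer multiplies them by about 0.951 (roughly a 4.9% decrease), holding the other variables fixed. Note that these are changes in the odds, not in the probability; the change in probability depends on where on the logistic curve the person starts.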
4. SELECT MODELS (50 Points)
Decide on the criteria for selecting the best multiple linear regression model and the best binary logistic regression model. Will you select a model with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model.
A. For the multiple linear regression model, will you use a metric such as Adjusted R², RMSE, etc.? Be sure to explain how you can make inferences from the model, discuss multicollinearity issues (if any), and discuss other relevant model output. Using the training data set, evaluate the multiple linear regression model based on
a. mean squared error,
b. R²,
c. F-statistic, and
d. residual plots.
i. Residuals vs Fitted Values
ii. Normal Q-Q (Quantile-Quantile) Plot – shows whether the residuals are normally distributed by comparing their quantiles with those of a normal distribution
iii. Scale-Location / Spread-Location Plot – shows whether the residuals are spread equally across the range of predictions, in order to check homoscedasticity
iv. Residuals vs Leverage Plot – shows influential data points that have a big effect on the linear model
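The evaluation items a through d could be computed along these lines. This is a sketch on simulated data with made-up coefficients, and the two-predictor model is chosen only to keep the example short; it is not a suggested specification.

```r
# Simulated stand-in for crash records; values and coefficients are made up.
set.seed(3)
n <- 300
crash <- data.frame(BLUEBOOK = runif(n, 1500, 60000),
                    CAR_AGE  = rpois(n, 8))
crash$TARGET_AMT <- 1000 + 0.1 * crash$BLUEBOOK - 50 * crash$CAR_AGE +
  rnorm(n, sd = 1500)

fit <- lm(TARGET_AMT ~ BLUEBOOK + CAR_AGE, data = crash)
s   <- summary(fit)

mse    <- mean(residuals(fit)^2)   # (a) mean squared error on the training data
rmse   <- sqrt(mse)
r2     <- s$r.squared              # (b) R-squared
adj_r2 <- s$adj.r.squared
f_stat <- s$fstatistic             # (c) F value with its degrees of freedom
f_pval <- pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
             lower.tail = FALSE)

print(c(MSE = mse, RMSE = rmse, R2 = r2, AdjR2 = adj_r2,
        F = unname(f_stat["value"])))

# (d) The four diagnostic plots, in the order listed above:
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage.
par(mfrow = c(2, 2))
plot(fit, which = c(1, 2, 3, 5))
```

The `which = c(1, 2, 3, 5)` argument selects exactly the four plots named in the list; `plot(fit)` with no argument produces the same set by default.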
B. For the binary logistic regression model, will you use a metric such as log likelihood, AIC, ROC curve, etc.? Using the training data set, evaluate the binary logistic regression model using classification/misclassification metrics derived from the confusion matrix:
a. classification error rate, (REQUIRED)
b. accuracy, (REQUIRED)
c. sensitivity, (REQUIRED)
d. specificity, (REQUIRED)
e. precision, (OPTIONAL)
f. F1 score, (OPTIONAL)
g. AUC (OPTIONAL)
You are likely to cross paths with the caret package here. You can refer to its R cheat sheet, watch YouTube videos, or find examples of the implementation online at, say, Stack Exchange or other places.
Also, there is NO need for residual analysis on the logit model.
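If you want to check what a package reports, the metrics in the list above can also be computed by hand from the four cells of the confusion matrix. The sketch below uses simulated data with made-up coefficients and base R only; the AUC is computed with the rank (Mann-Whitney) formula rather than by drawing the ROC curve.

```r
# Simulated stand-in for the training data; values and coefficients are made up.
set.seed(4)
n <- 500
train <- data.frame(MVR_PTS = rpois(n, 1.5), TIF = rpois(n, 5))
train$TARGET_FLAG <- rbinom(n, 1,
                            plogis(-1 + 0.5 * train$MVR_PTS - 0.1 * train$TIF))

fit  <- glm(TARGET_FLAG ~ MVR_PTS + TIF, data = train, family = binomial)
prob <- predict(fit, type = "response")
pred <- as.integer(prob >= 0.5)   # 0.5 threshold, as the assignment requires

# Count the four cells directly, so this works even if a class is never predicted.
tp <- sum(pred == 1 & train$TARGET_FLAG == 1)
tn <- sum(pred == 0 & train$TARGET_FLAG == 0)
fp <- sum(pred == 1 & train$TARGET_FLAG == 0)
fn <- sum(pred == 0 & train$TARGET_FLAG == 1)

accuracy    <- (tp + tn) / n
error_rate  <- 1 - accuracy
sensitivity <- tp / (tp + fn)   # true positive rate (recall)
specificity <- tn / (tn + fp)   # true negative rate
precision   <- tp / (tp + fp)   # NaN if no record is predicted positive
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

# AUC via ranks: the probability that a random crash case receives a
# higher predicted probability than a random non-crash case (ties count half).
r   <- rank(prob)
n1  <- sum(train$TARGET_FLAG == 1)
n0  <- n - n1
auc <- (sum(r[train$TARGET_FLAG == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

round(c(error = error_rate, accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision, F1 = f1, AUC = auc), 3)
```

Because crashes are the minority class, accuracy alone can look good for a model that rarely predicts a crash, which is why sensitivity and specificity are required alongside it.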
C. Make predictions using the evaluation data set.
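One possible way to produce the three required outputs (probability, classification, cost) is sketched below on simulated data. The helper `make_data`, the output column names, and the choice to fit the cost model on crash records only are assumptions made for this illustration, not instructions from the assignment.

```r
# Simulated stand-ins for the training and evaluation sets; values are made up.
set.seed(5)
make_data <- function(n) {
  d <- data.frame(MVR_PTS = rpois(n, 1.5), BLUEBOOK = runif(n, 1500, 60000))
  d$TARGET_FLAG <- rbinom(n, 1, plogis(-1.5 + 0.4 * d$MVR_PTS))
  d$TARGET_AMT  <- ifelse(d$TARGET_FLAG == 1,
                          pmax(100, 1000 + 0.1 * d$BLUEBOOK + rnorm(n, sd = 1000)),
                          0)
  d
}
train      <- make_data(400)
evaluation <- make_data(50)[, c("MVR_PTS", "BLUEBOOK")]  # targets unknown here

logit_fit <- glm(TARGET_FLAG ~ MVR_PTS, data = train, family = binomial)
# Assumption: fit the cost model on crash records only, since TARGET_AMT is
# zero for everyone else.
cost_fit  <- lm(TARGET_AMT ~ BLUEBOOK, data = subset(train, TARGET_FLAG == 1))

# type = "response" returns probabilities rather than log odds.
evaluation$PROBABILITY    <- predict(logit_fit, newdata = evaluation,
                                     type = "response")
evaluation$CLASSIFICATION <- as.integer(evaluation$PROBABILITY >= 0.5)
# Predicted cost conditional on a crash occurring.
evaluation$COST           <- predict(cost_fit, newdata = evaluation)

head(evaluation)
```

If you transformed or imputed variables during data preparation, the evaluation data must go through the same steps before calling `predict`, otherwise the model will see columns it was not trained on.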
GRADING RUBRIC: I will be looking for –
I. DATA EXPLORATION: Performing EDA as well as summarizing some key patterns/insights that you found useful for model building.
II. DATA PREPARATION: Dealing with missing values and outliers, along with performing feature engineering and some variable transformations.
III. BUILDING MODELS: Building and discussing at least three different multiple linear regression models.
Discussion of the model coefficient estimates (final logit model specifically) and regression model output.
IV. SELECTING MODELS:
Appropriate justification of your "best" model. Most importantly, your ability to construct the confusion matrix and to quantify model performance with measures like classification error rate, accuracy, sensitivity, and specificity. BONUS POINTS (10) to the person with the highest accuracy. Your predictions on the evaluation set.
2023-04-22