ADEC 7320 - Econometrics Homework #1 Assignment Requirements
Overview
In this homework assignment, you will explore, analyze, and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162-game season.
You can watch some YouTube videos to learn the basic rules of the game (videos 1, 2, 3, 4, 5), and also explore the field of Sports Analytics: "Future of the Game: Baseball's Latest Statistical Revolution" and "Extreme Moneyball: An Independent Baseball Team’s Descent Into Sabermetric Thinking".
Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
VARIABLE NAME | DEFINITION | THEORETICAL EFFECT
INDEX | Identification Variable (do not use) | None
TARGET_WINS | Number of wins |
TEAM_BATTING_H | Base Hits by batters (1B,2B,3B,HR) | Positive Impact on Wins
TEAM_BATTING_2B | Doubles by batters (2B) | Positive Impact on Wins
TEAM_BATTING_3B | Triples by batters (3B) | Positive Impact on Wins
TEAM_BATTING_HR | Homeruns by batters (4B) | Positive Impact on Wins
TEAM_BATTING_BB | Walks by batters | Positive Impact on Wins
TEAM_BATTING_HBP | Batters hit by pitch (get a free base) | Positive Impact on Wins
TEAM_BATTING_SO | Strikeouts by batters | Negative Impact on Wins
TEAM_BASERUN_SB | Stolen bases | Positive Impact on Wins
TEAM_BASERUN_CS | Caught stealing | Negative Impact on Wins
TEAM_FIELDING_E | Errors | Negative Impact on Wins
TEAM_FIELDING_DP | Double Plays | Positive Impact on Wins
TEAM_PITCHING_BB | Walks allowed | Negative Impact on Wins
TEAM_PITCHING_H | Hits allowed | Negative Impact on Wins
TEAM_PITCHING_HR | Homeruns allowed | Negative Impact on Wins
TEAM_PITCHING_SO | Strikeouts by pitchers | Positive Impact on Wins
Deliverables:
• A write-up submitted in PDF format. Your write-up should have four sections. Each one is described below. You may assume you are addressing me as a fellow data scientist, so you do not need to shy away from technical details.
• Assigned predictions (the number of wins for the team) for the evaluation data set.
• Include your R statistical programming code too. You can create the PDF from R Markdown directly and include both the code and output, or submit the R script separately, or put the code in the Appendix. Ensure that the code runs without errors from top to bottom.
Write Up:
1. DATA EXPLORATION (50 Points)
Describe the size and the variables in the moneyball training data set. Keep in mind that too much detail will cause a manager to lose interest, while too little detail will make the manager think that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a checklist of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas.
a. Mean / Standard Deviation / Median
b. Bar Chart or Box Plot of the data
c. Is the data correlated to the target variable (or to other variables)?
d. Are any of the variables missing values that need to be imputed ("fixed")?
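As a rough illustration of suggestions a, c, and d, the sketch below computes a mean, standard deviation, median, missing-value count, and correlation with the target. It is written in Python with the standard library only, purely to show the arithmetic; the submitted code must still be in R. The four rows are hypothetical toy values, not records from the training data, and the helper names `summarize` and `pearson` are made up for this example.

```python
import math
import statistics

# Hypothetical toy rows standing in for the training data; None marks a missing value.
rows = [
    {"TARGET_WINS": 70, "TEAM_BATTING_H": 1400, "TEAM_BATTING_SO": 900},
    {"TARGET_WINS": 85, "TEAM_BATTING_H": 1500, "TEAM_BATTING_SO": None},
    {"TARGET_WINS": 95, "TEAM_BATTING_H": 1600, "TEAM_BATTING_SO": 800},
    {"TARGET_WINS": 60, "TEAM_BATTING_H": 1300, "TEAM_BATTING_SO": 1000},
]

def summarize(rows, column):
    """Count missing values and summarize the observed ones."""
    observed = [r[column] for r in rows if r[column] is not None]
    return {
        "missing": len(rows) - len(observed),
        "mean": statistics.mean(observed),
        "sd": statistics.stdev(observed),      # sample standard deviation
        "median": statistics.median(observed),
    }

def pearson(rows, x_col, y_col):
    """Pearson correlation over the rows where both columns are observed."""
    pairs = [(r[x_col], r[y_col]) for r in rows
             if r[x_col] is not None and r[y_col] is not None]
    xs, ys = zip(*pairs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

hits_summary = summarize(rows, "TEAM_BATTING_H")
so_summary = summarize(rows, "TEAM_BATTING_SO")
r_hits = pearson(rows, "TEAM_BATTING_H", "TARGET_WINS")
r_so = pearson(rows, "TEAM_BATTING_SO", "TARGET_WINS")

print(hits_summary)
print(so_summary)
print(f"corr(hits, wins) = {r_hits:.3f}, corr(strikeouts, wins) = {r_so:.3f}")
```

In this toy data the signs agree with the theoretical effects in the variable table (hits positive, batter strikeouts negative); on the real data that agreement is something to check, not assume.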
2. DATA PREPARATION (50 Points)
Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations.
a. Fix missing values (maybe with a Mean or Median value)
b. Create flags to suggest if a variable was missing
c. Transform data by putting it into buckets
d. Mathematical transformations such as log or square root (or use Box-Cox). You can also try an inverse transformation (1/x) or truncation (cap the maximum value possible)
e. Combine variables (such as ratios or adding or multiplying) to create new variables
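A minimal sketch of transformations a, b, d, and e follows, again in standard-library Python for illustration only (the assignment itself requires R). The values, the cap of 200, and the derived `sb_success_rate` variable are all assumptions chosen for the example, not recommendations for the real data.

```python
import math
import statistics

# Hypothetical values for two columns; None marks a missing entry.
stolen_bases = [120, None, 80, 400, None, 100]
caught_stealing = [40, 30, 20, 100, 50, 25]

def impute_with_median(values):
    """Return (filled values, 0/1 missing flags) using the median of observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    flags = [1 if v is None else 0 for v in values]
    filled = [fill if v is None else v for v in values]
    return filled, flags

# a + b: fix missing values and keep a flag recording where they were.
sb_filled, sb_missing_flag = impute_with_median(stolen_bases)

# d (truncation): cap extreme values at an arbitrary, illustrative ceiling.
SB_CAP = 200
sb_capped = [min(v, SB_CAP) for v in sb_filled]

# d (log): log1p(x) = log(1 + x), which stays defined when x is 0.
sb_log = [math.log1p(v) for v in sb_capped]

# e (ratio): share of steal attempts that succeeded, a hypothetical derived variable.
sb_success_rate = [sb / (sb + cs) for sb, cs in zip(sb_filled, caught_stealing)]

print(sb_filled)
print(sb_missing_flag)
print(sb_capped)
print([round(v, 3) for v in sb_success_rate])
```

Keeping the missing flag alongside the imputed column lets a model distinguish "observed near the median" from "filled in", which matters if missingness is itself related to wins.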
3. BUILD MODELS (50 Points)
Using the training data set, build at least three different multiple linear regression models, using different variables (or the same variables with different transformations). Since we have not yet covered automated variable selection methods, you should select the variables manually (unless you previously learned Forward or Stepwise selection, etc.). For each variable you manually include in or exclude from a model, indicate why this was done.
Discuss the coefficients in the models. Do they make sense? For example, if a team hits a lot of Home Runs, it would be reasonable to expect that such a team would win more games. However, if the coefficient is negative (suggesting that the team would lose more games), then that needs to be discussed. Are you keeping the model even though it is counterintuitive? Why? The boss needs to know.
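To make the coefficient discussion concrete, here is a small sketch of fitting a multiple linear regression by solving the normal equations (X'X)b = X'y, then checking coefficient signs against the theoretical effects. It is standard-library Python for illustration only; in R the equivalent fit would normally come from `lm()`. The data are hypothetical and noise-free, generated from wins = 10 + 5*hits - 0.2*errors with hits in hundreds and errors in tens, so the recovered coefficients are known in advance. The helpers `solve_linear_system`, `fit_ols`, and `predict` are made up for this example.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, copied so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (perfectly collinear predictors?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(predictors, target):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    design = [[1.0] + list(row) for row in predictors]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, target)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict(coefs, predictors):
    """Apply fitted coefficients to new rows of predictor values."""
    return [coefs[0] + sum(b * x for b, x in zip(coefs[1:], row)) for row in predictors]

# Hypothetical rows: (hits in hundreds, errors in tens) and the resulting wins.
predictors = [(14.0, 15.0), (15.0, 12.0), (16.0, 20.0), (13.0, 10.0), (14.5, 18.0)]
wins = [77.0, 82.6, 86.0, 73.0, 78.9]

coefs = fit_ols(predictors, wins)
intercept, b_hits, b_errors = coefs
print(f"intercept={intercept:.3f}, hits={b_hits:.3f}, errors={b_errors:.3f}")

# Sign check against the theoretical effects: hits positive, errors negative.
signs_as_expected = b_hits > 0 and b_errors < 0

# Prediction for one hypothetical evaluation row.
eval_prediction = predict(coefs, [(15.5, 14.0)])[0]
```

A coefficient whose sign contradicts the table is not automatically wrong; correlated predictors can flip signs, which is one reason to report it and explain the decision to keep or drop the variable.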
4. SELECT MODELS (50 Points)
Decide on the criteria for selecting the best multiple linear regression model. Will you select a model with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model.
For selecting the best multiple linear regression model, will you use a metric such as Adjusted R², RMSE, etc.? Be sure to explain how you can make inferences from the model, discuss multicollinearity issues (if any), and discuss other relevant model output. Using the training data set, evaluate the multiple linear regression model based on
a. mean squared error,
b. R²,
c. F-statistic, and
d. residual plots (see this video for details).
i. Residuals vs Fitted Values
ii. Normal Q-Q (Quantile-Quantile) – tells if residuals are normally distributed by comparing them with actual normal distribution
iii. Scale-Location / Spread-Location Plot – shows if residuals are spread equally among our predictions in order to check homoscedasticity
iv. Residuals vs Leverage Plot – shows influential data points that have a big effect on the linear model
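The numeric criteria above reduce to a few sums of squares. The sketch below shows one way to compute mean squared error, RMSE, R², Adjusted R², and the overall F-statistic from actual and fitted values. It is standard-library Python for illustration only (R reports the same quantities through `summary()` on a fitted `lm` object). The actual and fitted values are hypothetical, `n_predictors = 2` is assumed, and the F formula shown presumes an ordinary least-squares fit with an intercept. Here MSE is taken as SSE/n; some texts divide by the residual degrees of freedom instead.

```python
import math

def regression_metrics(actual, fitted, n_predictors):
    """Fit statistics for a linear model with an intercept and n_predictors slopes."""
    n = len(actual)
    residuals = [y - f for y, f in zip(actual, fitted)]
    mean_y = sum(actual) / n
    sse = sum(e ** 2 for e in residuals)            # residual sum of squares
    sst = sum((y - mean_y) ** 2 for y in actual)    # total sum of squares
    df_resid = n - n_predictors - 1
    r2 = 1 - sse / sst
    return {
        "residuals": residuals,
        "mse": sse / n,
        "rmse": math.sqrt(sse / n),
        "r2": r2,
        "adj_r2": 1 - (1 - r2) * (n - 1) / df_resid,
        # Overall F: explained variation per predictor over residual variance.
        "f_stat": ((sst - sse) / n_predictors) / (sse / df_resid),
    }

# Hypothetical actual wins and model-fitted wins for five teams.
actual = [70, 85, 95, 60, 80]
fitted = [72, 83, 94, 62, 79]

metrics = regression_metrics(actual, fitted, n_predictors=2)
for name in ("mse", "rmse", "r2", "adj_r2", "f_stat"):
    print(f"{name}: {metrics[name]:.4f}")

# Pairs of (fitted, residual) are the points of a Residuals vs Fitted plot.
resid_vs_fitted = list(zip(fitted, metrics["residuals"]))
```

Adjusted R² penalizes extra predictors, so it is usually the fairer number when comparing models with different numbers of variables.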
Make predictions using the evaluation data set.
GRADING RUBRIC: I will be looking for –
I. DATA EXPLORATION: Performing EDA as well as summarizing some key patterns/insights that you found useful for model building.
II. DATA PREPARATION: Dealing with missing values and outliers, along with performing feature engineering and some variable transformations.
III. BUILDING MODELS: Building and discussing at least three different multiple linear regression models. Discussion of the model coefficient estimates and regression model output.
IV. SELECTING MODELS: Appropriate justification of your "best" model. Your predictions on the evaluation set are provided.
2023-04-10