ADEC 7320 - Econometrics

Homework #3 Assignment Requirements

Overview

In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely a wine is to be sold at a high-end restaurant.

A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust its wine offering to maximize sales.

Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

VARIABLE NAME       DEFINITION                                                  THEORETICAL EFFECT

INDEX               Identification Variable (do not use)                        None
TARGET              Number of Cases Purchased                                   None
AcidIndex           Proprietary method of testing total acidity of wine by
                    using a weighted average
Alcohol             Alcohol Content
Chlorides           Chloride content of wine
CitricAcid          Citric Acid Content
Density             Density of Wine
FixedAcidity        Fixed Acidity of Wine
FreeSulfurDioxide   Sulfur Dioxide content of wine
LabelAppeal         Marketing Score indicating the appeal of label design for   Many consumers purchase based on the visual
                    consumers. High numbers suggest customers like the label    appeal of the wine label design. Higher
                    design. Negative numbers suggest customers don't like the   numbers suggest better sales.
                    design.
ResidualSugar       Residual Sugar of wine
STARS               Wine rating by a team of experts. 4 Stars = Excellent,      A high number of stars suggests high sales
                    1 Star = Poor
Sulphates           Sulfate content of wine
TotalSulfurDioxide  Total Sulfur Dioxide of Wine
VolatileAcidity     Volatile Acid content of wine
pH                  pH of wine

Deliverables:

•    A write-up submitted in PDF format. Your write-up should have four sections. Each one is described below. You may assume you are addressing me as a fellow data scientist, so you do not need to shy away from technical details.

•    Assigned predictions (probabilities, classifications, cost) for the evaluation data set. Use a 0.5 threshold.

•    Include your R statistical programming code too. Ideally, create the PDF using R Markdown directly and include both your code and output. Alternatively, you can submit your R script separately, or put the code in the Appendix. Ensure that the code works without errors from top to bottom.

Write Up:

1.    DATA EXPLORATION (50 Points)

Describe the size and the variables in the wine training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a checklist of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas.

a.    Mean / Standard Deviation / Median / Min / Max - you can use the stargazer package to create a professional-looking summary statistics table. Please also mention an insight or two on a few variables, such as skewness.

b.    Histograms or Bar Chart or Box Plot of the data

c.    Is the data correlated to the target variable (or to other variables)?

d.   Do any of the variables have missing values that need to be imputed / “fixed”?
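
The exploration ideas above can be sketched in base R as follows. This is a minimal illustration, not a required approach: the data frame is synthetic (the real training file is not reproduced here), the column names are borrowed from the variable table, and all values are made up.

```r
# Synthetic stand-in for the wine training data; every value is made up.
set.seed(1)
n <- 200
wine <- data.frame(
  TARGET      = rpois(n, lambda = 3),
  Alcohol     = rnorm(n, mean = 10.5, sd = 3.7),
  LabelAppeal = sample(-2:2, n, replace = TRUE),
  STARS       = sample(c(1:4, NA), n, replace = TRUE)
)

# (a) Mean / SD / median / min / max per variable, ignoring missing values.
summary_stats <- t(sapply(wine, function(x) c(
  mean   = mean(x, na.rm = TRUE),
  sd     = sd(x, na.rm = TRUE),
  median = median(x, na.rm = TRUE),
  min    = min(x, na.rm = TRUE),
  max    = max(x, na.rm = TRUE)
)))
print(round(summary_stats, 2))

# (c) Pairwise correlations, using rows where both variables are observed.
cor_matrix <- cor(wine, use = "pairwise.complete.obs")
print(round(cor_matrix[, "TARGET"], 2))

# (d) Number of missing values per variable.
missing_counts <- colSums(is.na(wine))
print(missing_counts)
```

For item (b), hist() on a continuous variable or barplot(table(wine$TARGET)) on the count target gives a quick view of each distribution.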

2.    DATA PREPARATION (50 Points)

Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Do not confuse null values with 0. Here are some possible transformations.

a.    Create flags to suggest if a variable was missing

b.    Fix missing values (maybe with a Mean or Median value).

c.    Transform data by putting it into buckets

d.    Mathematical transformations such as log or square root (or use Box-Cox). You can also try an inverse transformation (1/x) or truncation (capping the maximum value possible).

e.    Combine variables (such as ratios or adding or multiplying) to create new variables
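
A few of the transformations listed above might look like this in base R. This is a sketch under stated assumptions: the data are synthetic, the helper impute_median() and the derived column names (STARS_missing, Alcohol_bin, and so on) are hypothetical, and median imputation is only one of several reasonable choices.

```r
# Synthetic stand-in data; every value is made up.
set.seed(2)
n <- 200
wine <- data.frame(
  STARS         = sample(c(1:4, NA), n, replace = TRUE),
  ResidualSugar = c(NA, rexp(n - 1, rate = 0.2)),
  Alcohol       = rnorm(n, mean = 10.5, sd = 3.7)
)

# (a) Flag missingness BEFORE imputing, so the information is not lost.
wine$STARS_missing         <- as.integer(is.na(wine$STARS))
wine$ResidualSugar_missing <- as.integer(is.na(wine$ResidualSugar))

# (b) Median imputation (hypothetical helper). NA becomes the median, never 0.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
wine$STARS_imp         <- impute_median(wine$STARS)
wine$ResidualSugar_imp <- impute_median(wine$ResidualSugar)

# (c) Bucket a continuous variable into quartile bins.
wine$Alcohol_bin <- cut(
  wine$Alcohol,
  breaks = quantile(wine$Alcohol, probs = seq(0, 1, 0.25)),
  include.lowest = TRUE,
  labels = c("Q1", "Q2", "Q3", "Q4")
)

# (d) log1p(x) = log(1 + x); valid here because the synthetic values are >= 0.
wine$ResidualSugar_log <- log1p(wine$ResidualSugar_imp)

# (d) Truncation: cap values above the 99th percentile.
cap <- quantile(wine$ResidualSugar_imp, 0.99)
wine$ResidualSugar_cap <- pmin(wine$ResidualSugar_imp, cap)
```

Keeping the missing-value flags alongside the imputed columns lets a model use the fact that a value was missing, which the hint in the overview suggests can be predictive.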

3.    BUILD MODELS (50 Points)

Using the training data set, build at least two different Poisson regression models, at least two negative binomial models, and at least two multiple linear regression models, using different variables (or the same variables with different transformations). Sometimes Poisson and negative binomial regression models give the same results. If that is the case, comment on that. Consider changing the input variables if that occurs so that you get different models. Although not covered in class, you may also want to consider building zero-inflated Poisson and negative binomial regression models. Describe the technique(s) you used for variable selection. If you manually selected a variable for inclusion in the model or exclusion from the model, indicate why this was done.

Discuss the coefficients in the models. Do they make sense? In this case, the key variables you should comment on are the number of stars (STARS) and the wine label appeal (LabelAppeal). You might comment on the coefficients and magnitudes of variables and how they are similar or different from model to model (the stargazer package will be helpful). For example, you might say “pH seems to have a major positive impact in my Poisson regression model, but a negative effect in my multiple linear regression model”.

Are you keeping the model even though it is counterintuitive? Why? The boss needs to know.
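
As a rough illustration of the three model families named above, the sketch below fits a Poisson, a negative binomial, and a linear model to the same formula and lines up their coefficients. The data and the data-generating coefficients are invented for the example, and the two-predictor formula is an assumption chosen for brevity; glm() and lm() are base R, and glm.nb() comes from the MASS package.

```r
library(MASS)  # provides glm.nb()

# Synthetic data with an invented, overdispersed data-generating process.
set.seed(3)
n <- 500
wine <- data.frame(
  STARS       = sample(1:4, n, replace = TRUE),
  LabelAppeal = sample(-2:2, n, replace = TRUE)
)
mu <- exp(0.2 + 0.3 * wine$STARS + 0.15 * wine$LabelAppeal)
wine$TARGET <- rnbinom(n, size = 2, mu = mu)

fit_pois <- glm(TARGET ~ STARS + LabelAppeal,
                family = poisson(link = "log"), data = wine)
fit_nb   <- glm.nb(TARGET ~ STARS + LabelAppeal, data = wine)
fit_lm   <- lm(TARGET ~ STARS + LabelAppeal, data = wine)

# Side-by-side coefficients. Poisson and negative binomial estimates are on
# the log scale; linear regression estimates are on the count scale.
coef_table <- cbind(
  poisson = coef(fit_pois),
  negbin  = coef(fit_nb),
  linear  = coef(fit_lm)
)
print(round(coef_table, 3))

# exp(coef) is the multiplicative change in the expected count
# for a one-unit increase in the predictor.
print(round(exp(coef(fit_pois)), 3))

# Rough overdispersion check: Pearson chi-square over residual degrees of
# freedom. Values well above 1 suggest the negative binomial is worth a look.
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)
print(dispersion)
```

When the dispersion statistic is close to 1, the Poisson and negative binomial fits tend to agree closely, which is the situation the assignment asks you to comment on.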

4.    SELECT MODELS (50 Points)

Decide on the criteria for selecting the best count regression model. Will you select a model with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your models.

A.   For the count regression model, will you use a metric such as AIC, average squared error, etc.? Be sure to explain how you can make inferences from the model, and discuss other relevant model output. If you like the multiple linear regression model the best, please say why.

B.    Make predictions using the evaluation data set. You must select a count regression model for model deployment.

C.    Using the training data set, evaluate the performance of the count regression model. You will have a multiclass classification matrix, and there are a few different ways to build it in R, such as the caret package, the cvms package, cross tabulation with table commands, et cetera.
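
One possible way to carry out the comparison and evaluation steps above, using only base R, is sketched here. The data are synthetic, the two candidate formulas are placeholders, and rounding the fitted mean to the nearest integer is just one assumed rule for turning a predicted mean into a predicted class.

```r
# Synthetic data; every value and coefficient is made up.
set.seed(4)
n <- 500
wine <- data.frame(
  STARS       = sample(1:4, n, replace = TRUE),
  LabelAppeal = sample(-2:2, n, replace = TRUE)
)
wine$TARGET <- rpois(n, exp(0.2 + 0.3 * wine$STARS + 0.15 * wine$LabelAppeal))

fit_small <- glm(TARGET ~ STARS, family = poisson, data = wine)
fit_full  <- glm(TARGET ~ STARS + LabelAppeal, family = poisson, data = wine)

# Lower AIC is better. The values are comparable here because both models
# are fit by maximum likelihood to the same response on the same rows.
aic_table <- AIC(fit_small, fit_full)
print(aic_table)

# Average squared error of the fitted means on the training data.
pred_mean <- predict(fit_full, type = "response")
ase <- mean((wine$TARGET - pred_mean)^2)
print(ase)

# Multiclass confusion matrix: round fitted means to whole cases and
# cross-tabulate. Shared factor levels keep the matrix square.
pred_count <- round(pred_mean)
lv <- 0:max(wine$TARGET, pred_count)
conf_mat <- table(
  Predicted = factor(pred_count, levels = lv),
  Actual    = factor(wine$TARGET, levels = lv)
)
print(conf_mat)

accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)
```

Predictions for the evaluation data set follow the same pattern, with predict(fit_full, newdata = ..., type = "response") applied to the evaluation data frame after it has received the same preparation steps as the training data.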

GRADING RUBRIC: I will be looking for

I.       DATA EXPLORATION: Performing EDA as well as summarizing some key patterns/insights that you found useful for model building.

II.       DATA PREPARATION: Dealing with missing values and outliers, along with performing feature engineering and some variable transformations.

III.       BUILDING MODELS: Building and discussing at least three different multiple linear regression models. Discussion of the coefficient estimates on STARS and LabelAppeal (do you find the theoretical effect, statistical significance and economic magnitude?), along with a few other inferences from the model (e.g., you might find that people on average do not like “sugary” wines or wines with excess sulphur/chlorides).

IV.       SELECTING MODELS: Appropriate justification of your "best" count model, and how it compares to your multiple linear regression model. Evaluating the performance of the count regression model from the confusion matrix.