Problem Set 2

DSME 6756:  Business Intelligence Techniques and Applications (Winter 2022)

Due at 9:30AM, Monday, December 19, 2022

Instructions

Please read the Jupyter Notebooks for Session 2.  Submit a Jupyter Notebook of your solutions with code on Blackboard.  Because this problem set is relatively long, the total achievable points are 8, which means you will earn 2 extra points. Please name your Jupyter Notebook as

 YourLastName_YourFirstName_PS2.ipynb (e.g., Zhang_Renyu_PS2.ipynb)

1.  Forecasting Auto Sales (4 points)

In this problem, we will try to predict monthly sales of an Auto Brand.

The file Auto.csv contains data for the problem.  Each observation is a month, from January 2010 to February 2014. For each month, we have the following variables:

• Month = the month of the year for the observation (1 = January, 2 = February, 3 = March, ...).

• Year = the year of the observation.

• AutoSales = the number of units of the Auto sold in the United States in the given month.

• Unemployment = the estimated unemployment percentage in the United States in the given month.

• Queries = a (normalized) approximation of the number of Google searches for “Auto” in the given month.

• CPI_energy = the monthly consumer price index (CPI) for energy for the given month.

• CPI_all = the consumer price index (CPI) for all products for the given month; this is a measure of the magnitude of the prices paid by consumer households for goods and services (e.g., food, clothing, electricity, etc.).

Load the data set into Python and split the data set into training and testing sets as follows: Place all observations for 2012 and earlier in the training set, and all observations for 2013 and 2014 into the testing set.
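The split described above can be sketched as follows, assuming pandas is used. The small data frame is made-up stand-in data, not the contents of Auto.csv; in the notebook it would be replaced by pd.read_csv("Auto.csv").

```python
import pandas as pd

# Made-up stand-in rows; in practice: auto = pd.read_csv("Auto.csv")
auto = pd.DataFrame({
    "Month": [11, 12, 1, 2, 1],
    "Year": [2012, 2012, 2013, 2013, 2014],
    "AutoSales": [100, 120, 90, 95, 110],
})

# Training set: 2012 and earlier; testing set: 2013 and 2014.
train = auto[auto["Year"] <= 2012]
test = auto[auto["Year"] >= 2013]

print(len(train), len(test))
```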

(a)  (0.5 point) Build a linear regression model to predict monthly Auto sales using Unemployment, CPI_all, CPI_energy and Queries as the independent variables. Use all of the training set data to do this. Please try to interpret your estimation results.

(b)  (0.5 point) We would now like to improve the model by incorporating seasonality. Seasonality refers to the fact that demand is often cyclical/periodic in time. For example, demand for warm outerwear (like jackets and coats) is higher in fall/autumn and winter than in spring and summer. In our problem, since our data includes the month of the year in which the units were sold, it is feasible for us to incorporate monthly seasonality. From a modeling point of view, it is reasonable to expect that the month has an effect on how many Auto units are sold. To incorporate the seasonal effect due to the month, build a new linear regression model that predicts monthly Auto sales using Month as well as Unemployment, CPI_all, CPI_energy and Queries. Do not modify the training and testing data frames before building the model. Based on the model estimation results, how do you evaluate the new model compared with the original one?

(c)  (1 point) In the new model, given two monthly periods that are otherwise identical in Unemployment, CPI_all, CPI_energy and Queries, what is the absolute difference in predicted Auto sales given that one period is in January and one is in March? Consider the new model again: given two monthly periods that are otherwise identical in Unemployment, CPI_all, CPI_energy and Queries, what is the absolute difference in predicted Auto sales given that one period is in January and one is in May? Is there anything about this finding that makes you uncomfortable?

(d)  (1 point) Alternatively, we consider Month as a factor variable, instead of a numeric variable. Then, we can use the binary variable technique introduced in Session 2’s lecture to build a linear regression model. Why do you think we should use the factor variable instead of the numeric variable to represent month?

(e)  (0.5 point) Re-run the regression with the Month variable modeled as a factor variable. From the new regression results, what seasonality pattern have you observed?
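One common way to build the binary (dummy) variables for a factor is pandas' get_dummies. This is only a sketch on made-up data, assuming pandas is used; the Session 2 notebooks may use a different approach. Dropping the first level makes January the reference month, so each remaining column measures a month's effect relative to January.

```python
import pandas as pd

# Made-up illustration data, not Auto.csv.
df = pd.DataFrame({"Month": [1, 2, 3, 1, 2, 3]})

# One 0/1 column per month except the first (January becomes the baseline).
dummies = pd.get_dummies(df["Month"], prefix="Month", drop_first=True, dtype=int)

print(dummies.columns.tolist())
```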

(f)  (0.5 point) Another peculiar observation about the regression results (with Month as a factor variable) concerns the signs of the Queries variable and the CPI_energy variable. Why are their signs counter-intuitive? Please try to give an explanation for this phenomenon and find a way to address the issue. You may need to remove some independent variables and re-build the linear regression model.

(g)  (0.5 point) Use an out-of-sample test to evaluate all the models you have built to estimate Auto sales. Report the out-of-sample R^2 of each model and discuss which model you would recommend to this Auto Brand for its sales forecasting.
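A minimal sketch of an out-of-sample R^2 computation, assuming the common convention 1 - SSE/SST in which SST is measured against the training-set mean; check the Session 2 notebooks for the convention used in this course. The function name and the numbers are illustrative only.

```python
import numpy as np

def out_of_sample_r2(y_test, y_pred, y_train_mean):
    """1 - SSE/SST, with SST taken relative to the training-set mean (assumed convention)."""
    y_test = np.asarray(y_test, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_test - y_pred) ** 2)        # model's squared error on the test set
    sst = np.sum((y_test - y_train_mean) ** 2)  # baseline: always predict the training mean
    return 1.0 - sse / sst

# Made-up numbers purely for illustration.
r2 = out_of_sample_r2([10, 12, 14], [11, 12, 13], y_train_mean=12.0)
```

Unlike the in-sample R^2, this quantity can be negative when the model predicts the test set worse than the training mean does.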

2.  Election Forecasting (4 points)

In this problem, you will use polling data from the months leading up to a presidential election to predict the winner by logistic regression. The file polling.csv contains the polling data for the United States Presidential Elections in 2004, 2008 and 2012. The variables are listed as follows:

• State: Name of state

• Year: Election year (2004, 2008, 2012)

• Rasmussen and SurveyUSA: The percentage of voters who said they were likely to vote Republican minus the percentage who said they were likely to vote Democrat, from two major polling sources, Rasmussen and SurveyUSA.

• DiffCount: The number of polls that predicted a Republican winner in the state minus the number of polls that predicted a Democratic winner

• PropR: The proportion of all polls that predicted a Republican winner

• Republican: Whether a Republican actually won that state in that particular election year (the label, taking a 1/0-value)

Please solve the following questions.

(a)  (1 point) Read the data set polling.csv into Python. Then, split the data into a training set, consisting of all the observations in 2004 and 2008, and a testing set consisting of observations in 2012. Based on the training data set, let the baseline model predict that the outcome of the 2012 election in each state will be the same as the outcome of the 2008 election. Please evaluate the false positive rate, the false negative rate, and the accuracy of the baseline model.
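The three metrics can be computed from the counts of the confusion matrix. The sketch below assumes that Republican = 1 is treated as the positive class; the helper name and the labels are made up for illustration.

```python
import numpy as np

def classification_rates(y_true, y_pred):
    """False positive rate, false negative rate and accuracy for 1/0 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "fpr": fp / (fp + tn),            # share of actual 0s predicted as 1
        "fnr": fn / (fn + tp),            # share of actual 1s predicted as 0
        "accuracy": (tp + tn) / len(y_true),
    }

# Made-up labels: here tp=1, fn=1, tn=2, fp=1.
rates = classification_rates([1, 1, 0, 0, 0], [1, 0, 0, 1, 0])
```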

(b)  (1 point) A more credible baseline model would be to follow one of the polls and make a prediction. In our case, we will use the variable Rasmussen to make the prediction. Specifically, if the variable Rasmussen is positive, then the new baseline model predicts that the Republican will win; if it is negative, the model predicts that the Democrat will win; and if it equals zero, the model randomly predicts which party will win.

To determine the sign of the variable, you can use the function numpy.sign().

Compute the overall accuracy of the new baseline model. Take the cases in which the model does not know which to select as wrong predictions. Does the new baseline model outperform the original one in overall accuracy?
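As a rough illustration of how numpy.sign() can drive such a rule, the sketch below uses made-up margins and outcomes (not polling.csv). Mapping the 1/0 label to +1/-1 lets the two sign vectors be compared directly, and a tie (sign 0) never matches either value, so it is counted as a wrong prediction.

```python
import numpy as np

margin = np.array([3.0, -2.0, 0.0, -1.0])  # made-up Rasmussen-style margins
republican = np.array([1, 0, 1, 1])        # made-up 1/0 outcomes

predicted_sign = np.sign(margin)           # +1, -1, or 0 for a tie
actual_sign = 2 * republican - 1           # map 1/0 labels to +1/-1

# A tie (0) can never equal +1 or -1, so it is scored as wrong.
accuracy = np.mean(predicted_sign == actual_sign)
```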

3.  Logistic Regression as Linear Classifier (2 points)

Assume that a fitted logistic regression model is

P(Yi = 1|Xi) = exp(0.5 − 0.3Xi1 − 2.7Xi2 + 3.9Xi3) / [1 + exp(0.5 − 0.3Xi1 − 2.7Xi2 + 3.9Xi3)]

(a)  (1 point) The model predicts that Yi = 1 if P(Yi = 1|Xi) ≥ 0.6 and predicts Yi = 0 if P(Yi = 1|Xi) < 0.6. Under what condition on Xi = (Xi1, Xi2, Xi3) will the model predict Yi = 1? If we change the threshold from t = 0.6 to t = 0.5, what will happen to the false-positive rate and the false-negative rate?

(b)  (1 point) If Xi = (3, 1, 1), what is the prediction of Yi with the model if the threshold is set at t = 0.5?
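The idea behind the problem title can be sketched generically: because the logistic function is increasing, a probability threshold t is equivalent to a cutoff of ln(t / (1 - t)) on the linear score. The coefficients and inputs below are hypothetical and are not the fitted model from this problem; the helper names are also illustrative.

```python
import math

def logistic(z):
    """Logistic function: maps a linear score z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def score_cutoff(t):
    """Linear-score cutoff equivalent to probability threshold t (0 < t < 1)."""
    return math.log(t / (1.0 - t))

# Hypothetical coefficients and inputs, not the fitted model from the problem.
intercept, coefs = 0.2, [1.0, -0.5]
x = [1.0, 2.0]
z = intercept + sum(b * xi for b, xi in zip(coefs, x))  # linear score

# Thresholding the probability and thresholding the score give the same decision.
same_decision = (logistic(z) >= 0.5) == (z >= score_cutoff(0.5))
```

A threshold of 0.5 corresponds to a score cutoff of 0, and raising the threshold raises the cutoff, which shifts the linear decision boundary without changing its orientation.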