BUSI 650

Final Exam

a. Name your files as: course name_first name_family name. Example: BUSI 650_Adam_Smith

b. Please save your Python notebook files as “.ipynb”.

Please submit your “.ipynb” and Excel files, as well as your detailed solutions and explanations, on Canvas->Modules->Final Exam.

1. (40 Points) Linear Regression

Bingo Software is a firm that sells games and educational software. It recently revised its collection of 2000 items in a new catalog, which are available to customers via the cloud. The file Bingo.csv contains information on 2000 purchases. Based on these data, Bingo wants to devise a model for predicting the amount that a purchasing customer will spend.

The following table describes the variables to be used in the problem (the Excel file contains additional variables).

Description of Variables for Bingo Software

a. Explore the relationship between spending and each of the two continuous predictors by creating two scatterplots (Spending vs. Freq, and Spending vs. last_update_days_ago). Does there seem to be a linear relationship?

b. To fit a predictive model for Spending:

i. Partition the 2000 records into training and validation sets.

ii. Run a multiple linear regression model for Spending vs. all six predictors. Write the estimated predictive equation.

iii. Based on this model, what type of purchaser is most likely to spend a large amount of money?

v. Show how the prediction and the prediction error are computed for the first purchase in the validation set.

vi. Evaluate the predictive accuracy of the model by examining its performance on the validation set.

vii. Create a histogram of the model residuals. Do they appear to follow a normal distribution?
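
As a rough illustration of parts b.i, b.v and b.vi, the sketch below partitions records, computes one prediction with its error, and summarizes validation accuracy with RMSE. It is not a solution: the 60/40 split, the seed, the coefficient values and the two validation rows are all made-up assumptions, and only two of the six predictors (Freq and last_update_days_ago) are used because they are the only ones named above.

```python
import math
import random


def partition(records, train_fraction=0.6, seed=1):
    """Shuffle a copy of the records and split it into (training, validation).

    The 60/40 split and the seed are assumptions, not part of the exam text.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def predict(intercept, coefficients, row):
    """Prediction = intercept + sum of (coefficient * predictor value)."""
    return intercept + sum(coefficients[name] * row[name] for name in coefficients)


def rmse(errors):
    """Root mean squared error of a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Partitioning 2000 record indices gives 1200 training and 800 validation rows.
train_ids, valid_ids = partition(range(2000))

# HYPOTHETICAL fitted values, used only to show the arithmetic.
intercept = 10.0
coefficients = {"Freq": 90.0, "last_update_days_ago": -0.25}

# HYPOTHETICAL validation rows (not taken from Bingo.csv).
validation = [
    {"Freq": 2, "last_update_days_ago": 40, "Spending": 200.0},
    {"Freq": 1, "last_update_days_ago": 80, "Spending": 70.0},
]

# First validation record: prediction, then error = actual - predicted.
first = validation[0]
first_prediction = predict(intercept, coefficients, first)
first_error = first["Spending"] - first_prediction

# Validation-set accuracy summarized by RMSE.
errors = [row["Spending"] - predict(intercept, coefficients, row) for row in validation]
validation_rmse = rmse(errors)

print(first_prediction, first_error, validation_rmse)
```

With a real fit, the intercept and coefficients would come from the regression output on the training set, and the same error list could feed the histogram asked for in part vii.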

2. (25 points) Bayes Rule

Predicting Delayed Flights. The file FlightDelays.csv contains information on all commercial flights departing the Washington, DC area and arriving at New York. For each flight, there is information on the departure and arrival airports, the distance of the route, the scheduled time and date of the flight, and so on. The predicted variable is whether or not a flight is delayed. A delay is defined as an arrival that is at least 15 minutes later than scheduled.

Here, we focus on two predictors: weather (1 represents inclement weather and 0 otherwise) and Day_Week (day of the week, with 1 representing Sunday, 2 representing Monday, 3 representing Tuesday,…), and the outcome “flight status”.

Transform the day-of-week variable (DAY_WEEK) into a categorical variable.

a. Create a pivot table for the training data with Day_Week as a column variable, weather as a row variable, and flight status as a secondary row variable. The values inside the table should convey the count.

b. Consider the task of classifying a flight during inclement weather on a Friday. Looking at the pivot table, what is the probability that this flight is delayed?

c. Compute the following quantities [P(A ∣ B) means “the probability of A given B”]:

P(no inclement weather ∣ delayed)

P(flight on Tuesday ∣ ontime)

P(ontime ∣ flight on Monday, no inclement weather)
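
The conditional probabilities in parts b and c are ratios of counts: the number of records matching both A and B, divided by the number matching B. The sketch below shows that calculation on a handful of made-up records, not on FlightDelays.csv. The tuple layout and the helper name conditional_probability are assumptions introduced for illustration; the day coding (2 = Monday, 3 = Tuesday, 6 = Friday) follows the description above.

```python
# Made-up toy records: (weather, day_week, status). NOT from FlightDelays.csv.
flights = [
    (0, 2, "ontime"),
    (0, 2, "ontime"),
    (0, 2, "delayed"),
    (0, 3, "ontime"),
    (0, 3, "delayed"),
    (0, 6, "ontime"),
    (1, 6, "delayed"),
    (1, 6, "delayed"),
]


def conditional_probability(records, event, given):
    """Return P(event | given) as count(event and given) / count(given).

    `event` and `given` are predicates that take one record.
    """
    given_records = [r for r in records if given(r)]
    if not given_records:
        raise ValueError("no records satisfy the conditioning event")
    return sum(1 for r in given_records if event(r)) / len(given_records)


# P(delayed | inclement weather, Friday)
p_delay_bad_friday = conditional_probability(
    flights, lambda r: r[2] == "delayed", lambda r: r[0] == 1 and r[1] == 6
)

# P(no inclement weather | delayed)
p_clear_given_delayed = conditional_probability(
    flights, lambda r: r[0] == 0, lambda r: r[2] == "delayed"
)

# P(flight on Tuesday | ontime)
p_tuesday_given_ontime = conditional_probability(
    flights, lambda r: r[1] == 3, lambda r: r[2] == "ontime"
)

# P(ontime | flight on Monday, no inclement weather)
p_ontime_monday_clear = conditional_probability(
    flights, lambda r: r[2] == "ontime", lambda r: r[1] == 2 and r[0] == 0
)

print(p_delay_bad_friday, p_clear_given_delayed,
      p_tuesday_given_ontime, p_ontime_monday_clear)
```

On the real data the same ratios can be read directly from the pivot table in part a, since each cell holds the count for one combination of weather, day and flight status.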

3. (35 points) Discriminant Analysis

A consultant is studying the roles played by experience and training in a system administrator’s ability to complete a set of tasks on time. Data are collected on the performance of 75 randomly selected administrators. They are stored in the file SystemAdministrators.csv.

Using these data, the consultant performs a discriminant analysis. The variable “Experience” measures months of full-time system administrator experience, while “Training” measures number of relevant training credits. The dependent variable “Completed” is either Yes or No, according to whether or not the administrator completed the tasks.

a. Create a scatter plot of Experience vs. Training using color or symbol to differentiate administrators who completed the tasks from those who did not complete them. See if you can identify a line that separates the two classes with minimum misclassification.

b. Run a discriminant analysis with both predictors using the entire dataset as training data. Among those who completed the tasks, what is the percentage of administrators who are classified incorrectly as failing to complete the tasks?

c. Compute the two classification scores for an administrator with four months of experience and six credits of training. Based on these, how would you classify this administrator?

d. How much experience must be accumulated by an administrator with four training credits before his or her estimated probability of completing the tasks exceeds 0.5?
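
For parts c and d, a two-class linear discriminant analysis yields one linear classification function per class. The record is assigned to the class with the larger score, and the estimated probability of "Yes" exceeds 0.5 exactly when the "Yes" score exceeds the "No" score. The sketch below illustrates that arithmetic with HYPOTHETICAL coefficients that were not estimated from SystemAdministrators.csv. It assumes the class priors are already folded into the constants and that the Experience coefficient is larger for "Yes" than for "No".

```python
import math

# HYPOTHETICAL classification-function coefficients (illustration only).
functions = {
    "Yes": {"const": -25.0, "Experience": 2.0, "Training": 3.0},
    "No": {"const": -10.0, "Experience": 1.0, "Training": 2.0},
}


def score(label, experience, training):
    """Linear classification score for one class."""
    f = functions[label]
    return f["const"] + f["Experience"] * experience + f["Training"] * training


def probability_yes(experience, training):
    """Estimated P(Yes) from the two scores: exp(s_yes) / (exp(s_yes) + exp(s_no))."""
    s_yes = score("Yes", experience, training)
    s_no = score("No", experience, training)
    m = max(s_yes, s_no)  # subtract the max for numerical stability
    e_yes, e_no = math.exp(s_yes - m), math.exp(s_no - m)
    return e_yes / (e_yes + e_no)


def experience_threshold(training):
    """Experience at which the two scores are equal, i.e. P(Yes) = 0.5.

    Assumes the Experience coefficient for "Yes" exceeds the one for "No",
    so P(Yes) rises above 0.5 beyond this value.
    """
    d_const = functions["Yes"]["const"] - functions["No"]["const"]
    d_exp = functions["Yes"]["Experience"] - functions["No"]["Experience"]
    d_train = functions["Yes"]["Training"] - functions["No"]["Training"]
    return -(d_const + d_train * training) / d_exp


# Four months of experience, six training credits.
s_yes = score("Yes", 4, 6)
s_no = score("No", 4, 6)
predicted_class = "Yes" if s_yes > s_no else "No"

# Break-even experience for four training credits.
threshold = experience_threshold(4)

print(s_yes, s_no, predicted_class, probability_yes(4, 6), threshold)
```

With coefficients from an actual fit, the same two functions give the scores for part c and the break-even experience for part d.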