
COMM1190: DATA, INSIGHTS, AND DECISIONS

PRACTICE QUESTIONS

QUESTION 1 (30 marks)

PART A 18 MARKS

Concerning the charts below, answer all of the following questions.

a)  The bar chart below presents the changes in the variable “Attrition” based on the two variables “Yearsatcompany” and “Gender”. Summarise two facts based on your interpretation of this bar chart.

Note: “Attrition” is a categorical variable (yes/no), denoting whether an employee leaves the company or not; “Yearsatcompany” is a numerical variable, denoting the number of years an employee has worked at the company; “Gender” is a categorical variable (Female/Male).

[max 120 words] (8 marks)

b)  The scatter plot below presents the correlation between the horsepower of a car and its capacity to travel on the highway. Based on the chart, formulate a descriptive problem, a predictive problem, and a prescriptive problem that can be addressed using the scatter plot below. Note: “Horsepower” denotes the power that a car engine produces; “MPG(Highway)” denotes how far a car can travel for every gallon of fuel it uses on the highway.

[max 180 words] (10 marks)


PART B 12 MARKS

Which features of this graph are redundant or irrelevant?

[max 200 words]

QUESTION 2 (35 marks)

PART A 16 MARKS

You are examining the relationship between the concentration of substance A from measurements of peak area and the percentage of colour B. You have observed the following data points: (x_i, y_i), where i = 1, 2, …, n, and x_i and y_i represent the percentage of colour B and the concentration of substance A, respectively. Here, 0 ≤ x_i ≤ 1, and x_i = 0.5 means that the percentage of colour B is 50%.

Concerning the information above, answer all of the following questions.

a)  Suppose you want to fit a simple linear regression model to the dataset by treating the percentage of colour B as the predictor and the concentration of substance A from measurements of peak area as the response. Write down the mathematical equation of a simple linear regression model. (3 marks)

b)  The following table presents some of the statistics from the above fit (the linear regression model):

                          Coefficient Estimate   Standard Error   t-statistic   p-value
Intercept                 0.0729                 0.0279           2.6129        0.017
Percentage of Colour B    10.77                  0.27             39.8889       0.000

If you want to test whether there is a relationship between the predictor (Percentage of Colour B) and the response (Concentration of Substance A from Measurements of Peak Area), what is the null hypothesis, and what is your conclusion based on the output in the above table? Justify your conclusion.

[max 60 words] (4 marks)
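As a rough sketch of how the columns of the table relate to each other, each t-statistic is the coefficient estimate divided by its standard error. The dictionary layout and variable names below are illustrative only and are not part of the question.

```python
# Estimates and standard errors copied from the regression table above.
coefficients = {
    "Intercept": (0.0729, 0.0279),
    "Percentage of Colour B": (10.77, 0.27),
}

# t-statistic = estimate / standard error (for testing H0: coefficient = 0).
t_statistics = {
    name: estimate / std_error
    for name, (estimate, std_error) in coefficients.items()
}

for name, t_value in t_statistics.items():
    print(f"{name}: t = {t_value:.4f}")
```

The printed values can be compared against the t-statistic column of the table.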

c)  Based on the output in the table in Q2b), provide an interpretation of the coefficient associated with the Percentage of Colour B.

[max 50 words] (3 marks)

d)  To assess the quality of the fit of the linear regression model, you want to examine whether the residuals for the data on the concentration of substance A from measurements of peak area follow a normal distribution. Name one graphical method from this course that you can use to perform this task and briefly describe in words how you can visually check this.

Note: residuals refer to the differences between the observed values and the fitted values using the above linear regression model.

[max 60 words] (3 marks)

e)  There are two samples: the percentage of colour B in sample 1 is 10% and the percentage of colour B in sample 2 is 50%. Compare the average concentration of substance A from measurements of peak area in the two samples by calculating their ratio based on the fitted linear regression model. (3 marks)
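A minimal sketch of the arithmetic behind a fitted-value comparison, assuming the coefficient estimates from the table in Q2b) and the convention that x is a fraction between 0 and 1. The function name `fitted_concentration` is hypothetical.

```python
# Coefficient estimates copied from the table in Q2b).
INTERCEPT = 0.0729
SLOPE = 10.77


def fitted_concentration(colour_b_fraction: float) -> float:
    """Fitted concentration of substance A for a colour B fraction in [0, 1]."""
    return INTERCEPT + SLOPE * colour_b_fraction


sample_1 = fitted_concentration(0.10)  # 10% colour B
sample_2 = fitted_concentration(0.50)  # 50% colour B
ratio = sample_1 / sample_2

print(f"sample 1: {sample_1:.4f}, sample 2: {sample_2:.4f}, ratio: {ratio:.4f}")
```

Because the intercept is not zero, the ratio of fitted values is not exactly the ratio of the two percentages.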

PART B 19 MARKS

A medical experiment has been carried out to build a model for predicting a deformation D in young patients after a certain type of medical surgery. The dataset includes the following information for each patient under study: D deformation (deformation or normal), Age (in months), Number (the number of parts involved), and Position (the position of the topmost part operated on).

Concerning the information above, answer all of the following questions.

a)  Suppose you want to fit a logistic regression model with three predictors: Age, Number, and Position. Write down the mathematical equation of the logistic regression model. (3 marks)

b)  You are given the following output after fitting the logistic regression model in part a).

             Coefficient estimate
Intercept    -2.04
Age          0.01
Number       0.41
Position     -0.21

Explain whether you predict that a young patient with the characteristics (i.e., Age = 1, Number = 2, Position = 10) will have deformation D.

[max 80 words] (4 marks)
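The calculation behind a logistic regression prediction can be sketched as follows. This assumes that the model codes deformation as the positive class and that a probability cut-off of 0.5 is used; neither is stated in the question, and the variable names are illustrative.

```python
import math

# Coefficient estimates copied from the output table above.
coef = {"Intercept": -2.04, "Age": 0.01, "Number": 0.41, "Position": -0.21}

# Patient characteristics given in the question.
patient = {"Age": 1, "Number": 2, "Position": 10}

# Linear predictor (log-odds), assuming deformation is the positive class.
log_odds = coef["Intercept"] + sum(coef[k] * v for k, v in patient.items())

# Logistic transformation from log-odds to a probability.
probability = 1.0 / (1.0 + math.exp(-log_odds))

THRESHOLD = 0.5  # assumed cut-off, not given in the question
predicted_deformation = probability >= THRESHOLD

print(f"log-odds = {log_odds:.2f}, probability = {probability:.4f}")
print("predicted deformation:", predicted_deformation)
```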

c)  Suppose you also fitted the classification tree below:

Based on the above classification tree, would you predict a young patient with the following characteristics (i.e., Age = 1, Number = 2, Position = 10) will have deformation D? Justify your answer.

[max 60 words] (3 marks)

d)  Using the table in part b) and the graph in part c), comment on the consistency of results from the logistic regression and the classification tree from two aspects.

[max 80 words] (4 marks)

e)  The tables below show the confusion matrices for the classification tree and the logistic regression. Compare the two classification approaches by making full use of the confusion matrices and explain which approach you would prefer for predicting D deformation in young patients.

Classification tree

                      Predicted Normal   Predicted Deformation
Actual Normal         53                 11
Actual Deformation    2                  15

Logistic regression

                      Predicted Normal   Predicted Deformation
Actual Normal         52                 12
Actual Deformation    10                 7

[max 100 words] (5 marks)
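One way to summarise the two confusion matrices is sketched below, treating "Deformation" as the positive class (an assumption made here for illustration). The helper `summarise` and the choice of metrics are hypothetical and not prescribed by the question.

```python
def summarise(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Common summary rates for a 2x2 confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,      # share of all correct predictions
        "sensitivity": tp / (tp + fn),      # actual deformations detected
        "specificity": tn / (tn + fp),      # actual normals detected
    }


# Counts copied from the tables above; positive class = Deformation (assumed).
tree = summarise(tn=53, fp=11, fn=2, tp=15)
logit = summarise(tn=52, fp=12, fn=10, tp=7)

for label, metrics in (("Classification tree", tree), ("Logistic regression", logit)):
    formatted = ", ".join(f"{name} = {value:.3f}" for name, value in metrics.items())
    print(f"{label}: {formatted}")
```

Which rate matters most depends on the relative cost of missing a deformation versus raising a false alarm.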

QUESTION 3 (35 marks)

A multinational hotel and resort group has recently opened three new holiday resorts in Australia. They are Resort A, Resort B, and Resort C. The business plan was to position the resorts as “upmarket” complexes with a range of facilities, including spas, boat and bike hire, beauty and massage services, restaurants, and a small range of boutique shops. Customers would be charged separately for these extra services if they used them, and the plan was to generate considerable revenue over and above the accommodation charges. The Head Office of the company has decided to explore different ways to promote this type of extra spending but has left it to the resort managers as to how they gather relevant evidence. Ultimately, Head Office will decide on how promotion is best achieved based on the evidence from each of the resorts.

Two modes of promotion are being considered. Call these treatments BOOK and TV:

•   BOOK: customers are provided with a glossy booklet explaining the available facilities when they first check in to a resort;

•   TV: whenever the television in their rooms is turned on, and before customers could watch anything else, advertisements would run providing the same information contained in the booklet.

Regarding the information above, answer all of the following questions.

a)  In Resort A, the manager decides to use two one-month periods to gather evidence. In the first one-month period, all customers are allocated to the BOOK treatment. In the second one-month period, all customers are allocated to the TV treatment. The key outcome, denoted by Expend, is defined as the total amount of dollars spent per booking per day over and above accommodation. At the end of the second period, the manager calculates the difference in means of Expend for customers in the two treatments. Explain why this may not be a good approach to estimate the difference in the causal effects of the two treatments on Expend by including at least two criticisms.

[max 200 words] (8 marks)

b)  In Resort B, the manager decides that all customers, over one month, are allocated to either the BOOK or TV treatment depending on whether their booking number is odd or even. Define Bkd_i to be a dummy variable equal to 1 if the i-th customer was allocated to the BOOK treatment and equal to 0 if they were in the TV treatment. Consider the regression Expend_i = β0 + β1 Bkd_i + u_i. Using data on the 200 customers who stayed at Resort B over the one month of the experiment produces OLS estimates for this model given below.

How do you interpret the magnitudes of the estimates of β0 and β1? Is the estimate of β1 statistically significant? Provide an overall interpretation of the difference between the treatments, BOOK and TV, that would be suitable for reporting to senior management.

Note: The 97.5th percentile of a standard normal distribution is 1.96.

[Fitted regression of Expend_i on Bkd_i; the coefficient estimates and standard errors are illegible in the source.]

n = 200, R^2 = 0.048, standard errors in ( . )

[max 200 words]

c)  In Resort C, the manager uses the same approach as in Resort B, except that the allocation to either the BOOK or the TV treatment was decided by check-in staff when a customer arrived. The table below provides the sample means, separately for the two treatment sub-samples, for the 398 customers who stayed at Resort C over the one month of their experiment.

i.   Use these sample means to estimate β1.

[max 50 words] (2 marks)

ii.   Based on the sample means for the customer characteristics, discuss whether you think randomization into the two treatments was successful or not.

[max 80 words] (3 marks)

iii.  Use this discussion to provide an argument for why the difference in the causal effects of the two treatments on Expend is possibly biased using the Resort C approach.

[max 80 words] (3 marks)

Table: Sample means for key variables divided into the two treatment groups

Variable       Definition                                                        BOOK     TV
Bkd            = 1 if assigned to BOOK, = 0 if assigned to TV                    1.00     0.00
Income         = 1 if family income > $100,000, = 0 otherwise                    0.68     0.79
People         Number of people in the booking party                             3.53     4.42
Length         Length of stay in days                                            4.68     4.81
Age            Age in years of the person making the booking                     47.2     47.8
Expend         Expenditure ($) per booking per day over & above accommodation    231.8    273.2
Observations                                                                     178      220
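For part c) i., a minimal sketch of the arithmetic: in a regression of Expend on a single dummy variable, the OLS slope equals the difference between the two group means, and the intercept equals the mean of the group coded 0. The variable names below are illustrative.

```python
# Sample means of the expenditure variable, copied from the table above.
mean_expend_book = 231.8  # group with Bkd = 1
mean_expend_tv = 273.2    # group with Bkd = 0

# With one dummy regressor, OLS reproduces the group means:
# the intercept is the mean of the Bkd = 0 group, and the
# slope is the difference in means (Bkd = 1 minus Bkd = 0).
beta0_hat = mean_expend_tv
beta1_hat = mean_expend_book - mean_expend_tv

print(f"beta0_hat = {beta0_hat:.1f}, beta1_hat = {beta1_hat:.1f}")
```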

d)  In reviewing the results for Resort B, Head Office is surprised that no other controls were included in the regression reported in part b). In particular, they note that the correlation of 0.61 between the number of people in the booking party and Expend is positive, as expected, and quite strong. Explain to Head Office why you think this is not a problem in interpreting the estimate of β1 obtained in part b) as causal.

[max 100 words] (4 marks)

e)  After reviewing the evidence, Head Office decided to implement TV for all customers and advertise the facilities through the television at all three resorts. After implementing this change, the Resort B manager monitored their Facebook reviews and noticed that the negative reactions to the television advertising outweighed the positive posts. What do you think is the most likely explanation for why the experimental evidence obtained for Resort B is different from the reactions on Facebook?

[max 100 words] (5 marks)