
2022/2023

Machine Learning and Forecasting (BUSN9108)

Question 1.

Download dataset Question1.sav from module BUSN9108 on moodle.kent.ac.uk. Answer the following three questions.

(1). Use a boxplot to compare the Miles_Per_Gallon of the three groups Origin=1, 2, 3. Provide two of your findings. [2 marks]

(2). Plot the Q-Q plot of Acceleration to check if it follows the normal distribution. What is your conclusion?  [3 marks]

(3). Use both the Kolmogorov-Smirnov (K-S) test and the Shapiro-Wilk (S-W) test to test whether the variable Acceleration is normally distributed. What is your conclusion and why? [4 marks]

Question 2.

A company surveyed how frequently consumers of various age groups use two different payment methods when making purchases. Sample data for 3000 customers shows the results by four age groups.

 

Payment     Age group
            18-24    25-34    35-44    45 and over
Method 1    215      265      276      354
Method 2    206      364      411      909

(1) Test for independence between method of payment and age group. Formulate the problem statistically by posing it as a hypothesis test. What is the p-value? Using α = 0.05, what is your conclusion? [4 marks]

(2) Assume Method 1 is better than Method 2. Find the gamma association measure. [3 marks]
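The two parts of this question can be cross-checked outside SPSS. The following is a minimal sketch, not a model answer: it computes the Pearson chi-square statistic for the table above, its upper-tail probability using a closed form that is valid only for 3 degrees of freedom, and the Goodman-Kruskal gamma. All function names are illustrative. The sign of gamma depends on how the two methods are ranked; the code assumes the rows are listed from the higher-ranked method (Method 1) to the lower-ranked one (Method 2) and the columns from youngest to oldest, so reversing either order flips the sign.

```python
import math

# Observed counts from the table above: rows are Method 1 and Method 2,
# columns are the age groups 18-24, 25-34, 35-44, 45 and over.
observed = [
    [215, 265, 276, 354],
    [206, 364, 411, 909],
]


def chi_square_statistic(table):
    """Pearson chi-square statistic for a two-way contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat


def chi_square_sf_df3(x):
    """Upper-tail probability of a chi-square variable with exactly 3 degrees
    of freedom (closed form; not valid for other degrees of freedom)."""
    t = x / 2
    return math.erfc(math.sqrt(t)) + 2 / math.sqrt(math.pi) * math.sqrt(t) * math.exp(-t)


def goodman_kruskal_gamma(table):
    """Gamma = (C - D) / (C + D), where C and D count concordant and
    discordant pairs. Assumes rows and columns are both in rank order."""
    concordant = discordant = 0
    n_rows, n_cols = len(table), len(table[0])
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):
                for m in range(n_cols):
                    if m > j:
                        concordant += table[i][j] * table[k][m]
                    elif m < j:
                        discordant += table[i][j] * table[k][m]
    return (concordant - discordant) / (concordant + discordant)


# A 2 x 4 table has (2 - 1) * (4 - 1) = 3 degrees of freedom.
statistic = chi_square_statistic(observed)
p_value = chi_square_sf_df3(statistic)
gamma = goodman_kruskal_gamma(observed)
print(statistic, p_value, gamma)
```

The null hypothesis of independence is rejected at α = 0.05 when `p_value` is below 0.05.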

Question 3.

Download the dataset Question3.sav from the module BUSN9108 on moodle.kent.ac.uk. Build a multiple linear regression model using SPSS and the following settings:

· independent variables: , , ,, , ,, ; and dependent variable , and

· modelling method “Backward”.

Answer the following questions.

(1). What is the estimated multiple linear regression equation? [2 Marks]

(2). Are there any outliers? If yes, which cases are outliers? Why? [3 Marks]

(3). Check the residual plot (ZPRED as the X axis and ZRESID as the Y axis). What is your conclusion from it? [3 Marks]

(4). Is there a problem of multi-collinearity? If yes, which variables are they? [3 Marks]

(5). If there is anything that may violate the major assumptions of the linear regression model or have a strong influence over the model, provide your remedies. [9 marks]

Question 4. A dataset was split into two subsets: training set and test set. Two regression models with the same number of independent variables were built based on the training dataset. Their Akaike Information Criterion (AIC) on the training dataset and the test dataset are given below. [10 Marks]

 

Akaike Information Criterion (AIC)
                Model 1    Model 2
Training set    11.23      12.14
Test set        15.21      12.26

 Which model will you select and why? Discuss it.

Question 5.

The dataset, Question5.sav, which can be downloaded from the module BUSN9108 on moodle.kent.ac.uk, concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data. Build a logistic regression model using SPSS and the following settings:

· the covariates (or independent variables) are x1, x2 and x3; the dependent variable is ; and

· the modelling method is “Backward: Wald”.

(1). Provide the logistic regression equation and the modelling process. [6 Marks]

(2). Interpret one of the coefficients of the model developed in part (1). [2 Marks]

(3). Provide the accuracy and the precision of the model developed in part (1), respectively. [4 Marks]
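As a reminder of the two metrics in part (3), the sketch below computes them from the cells of a 2 x 2 classification table. The counts are hypothetical and do not come from Question5.sav; in practice they would be read from the SPSS classification table, with the "positive" class chosen by the analyst.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all cases that the model classifies correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


# Hypothetical counts for illustration only.
tp, fp, fn, tn = 40, 10, 20, 30
model_accuracy = accuracy(tp, fp, fn, tn)
model_precision = precision(tp, fp)
print(model_accuracy, model_precision)
```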

Question 6.

A construction equipment company, CX, started its business about 100 years ago. Recently, one of its products, the compact excavator, has become CX's leading product, but the sales team has suffered from repeated overproduction and underproduction caused by inaccurate sales forecasts. The production team needs to schedule, one month in advance, how many compact excavators to produce. See dataset Quesion6.xlsx from module BUSN9108 on moodle.kent.ac.uk for the sales data.

Ms Chan, the senior manager of the sales team, is very concerned about the accuracy of the current sales forecasting model, which uses the moving average of the last three months. The sales team is puzzled by the large month-to-month variability in sales, and the current forecasting model seems too simplistic: no one can justify why it averages the last three periods. Some members of her team suggested that the historical sales data might contain seasonal dependencies, but they do not know how to confirm the existence of seasonal patterns or how to model them. Others suggested using a regression model, but they could not find any relevant external factors that explain the behaviour of the sales time series, so the team decided to improve the forecast using only the historical time series data at this stage.

(1). Before any modelling, visually analyse each of the systematic patterns (e.g. trend and seasonality) in the time series and discuss their existence and/or patterns. [3 marks]

(2). Divide the data into the training dataset/period (up to the end of 2015) for estimating forecast models, and the test (hold-out) dataset/period (Jan-2016 onward) to evaluate your model forecasts. Develop a Holt’s exponential smoothing (HES) model and a Holt-Winters model. When estimating the model parameter(s), you are advised to use the MSE (mean squared error) over the fitting period only.

a) Present the two models, respectively. Note: you need to provide the mathematical models. [5 marks]

b) Based on Part a), which model do you recommend finally and why? [2 marks]
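The estimation step described in part (2) can be sketched as follows for Holt's linear exponential smoothing only; a Holt-Winters model would add a seasonal component and a third smoothing parameter, which this sketch does not cover. The sales figures are hypothetical, the function names are illustrative, the initialisation (level from the first observation, trend from the first difference) is one common choice among several, and the coarse grid search stands in for whatever optimiser is actually used.

```python
def holt_one_step_forecasts(series, alpha, beta):
    """One-step-ahead forecasts for series[1:] from Holt's linear method.

    level_t = alpha * y_t + (1 - alpha) * (level_{t-1} + trend_{t-1})
    trend_t = beta * (level_t - level_{t-1}) + (1 - beta) * trend_{t-1}
    forecast for t+1 = level_t + trend_t
    """
    level = series[0]
    trend = series[1] - series[0]  # assumed initialisation
    forecasts = []
    for y in series[1:]:
        forecasts.append(level + trend)  # forecast made before y is observed
        new_level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts


def mse(actual, predicted):
    """Mean squared error between two equally long sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def fit_holt(train, grid_steps=10):
    """Grid-search alpha and beta in (0, 1], minimising the one-step-ahead
    MSE over the training (fitting) period only."""
    best = None
    for a in range(1, grid_steps + 1):
        for b in range(1, grid_steps + 1):
            alpha, beta = a / grid_steps, b / grid_steps
            error = mse(train[1:], holt_one_step_forecasts(train, alpha, beta))
            if best is None or error < best[0]:
                best = (error, alpha, beta)
    return best


# Hypothetical monthly sales for illustration only.
train = [120, 132, 128, 141, 150, 147, 160, 171]
best_mse, best_alpha, best_beta = fit_holt(train)
print(best_mse, best_alpha, best_beta)
```

Once the parameters are fixed on the training period, the same recursion can be run over the hold-out period to compare the candidate models on data they were not fitted to.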

Question 7.

Download dataset Question7.sav from module BUSN9108 on moodle.kent.ac.uk. The dataset contains the following variables measuring the geometric parameters of a kind of plant:

v1. area A,

v2. perimeter P,

v3. length of kernel,

v4. width of kernel,

v5. asymmetry coefficient

v6. length of kernel groove

Answer the following questions.

(1). Use Ward’s method to determine the number of clusters and explain the reason (i.e., how do you select it?). [3 marks]

(2). Based on the number of clusters determined from the above step, use the k-means clustering method to cluster the observations and interpret the outcomes. [3 marks] 

Question 8. 

Download dataset Question8.sav from module BUSN9108 on moodle.kent.ac.uk. This dataset contains responses to a questionnaire on factors related to the quality of a public place. Each observation represents a response from a user. Answer the following questions.

(1). How many factors do you select and how do you select them? [2 marks]

(2). What is the cumulative percentage of variance accounted for by your selected factors? Interpret it. [2 marks]

(3). If you use rotation method “Varimax” and use extraction method “Principal component analysis”, which variables are your factor(s) associated with? [2 marks]

Question 9.

Download dataset Question9.sav from module BUSN9108 on moodle.kent.ac.uk. Use three modelling methods to build three classification models, each with variable Y as the dependent variable and the other variables as independent variables (you may select some of them). One of the modelling methods must be decision tree modelling, and logistic regression must not be used in this question. Answer the following questions.

(1). Provide your modelling process. [11 marks]

(2). Convert your decision tree from part (1) to a set of rules and provide the rules. [3 marks]

(3). Select the model with the best performance and explain how you will use it in the future. [6 marks]
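The conversion asked for in part (2) amounts to reading off one rule per leaf, with the rule's premise formed by the split conditions along the path from the root. A minimal sketch of that idea follows. The tree, its variable names (x1, x2), thresholds, and class labels are entirely hypothetical and are not derived from Question9.sav; the nested-dictionary representation is an assumption made for illustration and is not how SPSS stores a tree.

```python
def tree_to_rules(node, conditions=()):
    """Return one IF-THEN rule per leaf of a binary decision tree."""
    if "label" in node:  # leaf: emit the accumulated path conditions
        premise = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {premise} THEN Y = {node['label']}"]
    test = f"{node['feature']} <= {node['threshold']}"
    negated = f"{node['feature']} > {node['threshold']}"
    return (tree_to_rules(node["left"], conditions + (test,))
            + tree_to_rules(node["right"], conditions + (negated,)))


# Hypothetical tree for illustration only.
tree = {
    "feature": "x1", "threshold": 2.5,
    "left": {"label": 0},
    "right": {
        "feature": "x2", "threshold": 7,
        "left": {"label": 1},
        "right": {"label": 0},
    },
}

rules = tree_to_rules(tree)
for rule in rules:
    print(rule)
```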