BU510.650 – Data Analytics Sample Exam

Published: 2023-01-02

Note: This document is not meant to reflect the length or distribution of topics for the actual exam. Its goal is to familiarize you with the exam style.

1. For each statement below, choose True or False.

(a) Adding an input variable (predictor) to a linear regression will never reduce R².

TRUE FALSE

(b) “Regressor” is one of the terms used in data analytics for the “dependent” variable.

TRUE FALSE

(c) If we add more input variables (predictors) to a linear regression, we will reduce the residual sum of squares (RSS).

TRUE FALSE
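Statements (a) and (c) can be explored empirically. The following is a minimal sketch, assuming simulated data (the exam supplies none for this question); the names `x1`, `x2`, `fit_small`, and `fit_large` are hypothetical. It fits two nested linear regressions and prints their training RSS and R² side by side.

```r
# Simulated data (an assumption; not part of the exam).
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)                 # pure noise: unrelated to y by construction
y  <- 1 + 2 * x1 + rnorm(n)

fit_small <- lm(y ~ x1)        # one predictor
fit_large <- lm(y ~ x1 + x2)   # same model plus one extra predictor

# Training RSS is the sum of squared residuals.
rss <- c(small = sum(resid(fit_small)^2),
         large = sum(resid(fit_large)^2))

# Training R-squared as reported by summary().
r2 <- c(small = summary(fit_small)$r.squared,
        large = summary(fit_large)$r.squared)

print(rss)
print(r2)
```

Both quantities are computed on the training data only, so the comparison says nothing about test error.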

2. Suppose we are predicting whether an apartment’s rent is above $2000 per month, as a function of the number of bedrooms (#Rooms) and the distance from school in miles (Distance). The following figure shows what our prediction would be in three different regions (“Yes” indicates a prediction that the rent will be above $2000, and “No” indicates a prediction that the rent will be below $2000.)

Create a decision tree that captures the same information shown in the figure above.

3. For 200 runners in a marathon (observations), the RunTime data set provides the following data:

LastDistance (in miles): the length of the runner’s last training run

LastTime (in hours): the time it took the runner to finish his or her last training run

Age (in years): the runner’s age

MarathonTime (in hours): the time it took the runner to finish the marathon

The following table shows the first three rows of this data set:

	LastDistance	LastTime	Age	MarathonTime
1	18.52	1.96	42.01	3.07
2	10.94	1.77	34.62	4.90
3	18.87	1.86	27.69	2.83

We run linear regression with MarathonTime as the output variable (response) and LastDistance, LastTime, and Age as the input variables (predictors). The following is the output of this linear regression:

Which of the three predictors (LastDistance, LastTime, and Age) have a statistically significant effect on MarathonTime? Explain your answer in one sentence.

4. Suppose that we have data about arrests in 11 US states, and we want to separate the observations into three clusters. (Notice that each state is an observation.) After running hierarchical clustering, we obtain the dendrogram shown in Figure 1. If we cut the tree to obtain three clusters, what clusters would we obtain? That is, write the observations (i.e., state names) for each cluster.
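Cutting a dendrogram into a fixed number of clusters can be sketched in R with `hclust` and `cutree`. This is an illustration only. It assumes the first 11 rows of R's built-in `USArrests` data as a stand-in, with scaled variables and complete linkage; the exam's actual states, distance, and linkage are not given, so the result need not match Figure 1.

```r
# Assumption: the first 11 rows of the built-in USArrests data stand in for
# the exam's 11 states, whose identity and settings are not given.
arrests <- scale(USArrests[1:11, ])

# Hierarchical clustering on Euclidean distances with complete linkage.
hc <- hclust(dist(arrests), method = "complete")

# Cut the tree so that exactly three clusters remain.
clusters <- cutree(hc, k = 3)

# List the state names in each cluster.
print(split(names(clusters), clusters))
```

Calling `plot(hc)` draws the dendrogram, which makes it possible to compare the cut with a visual reading of the tree.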

5. We run a linear regression with Revenue (in millions of dollars) as the output variable (response) and Salesforce as the input variable (predictor), based on training data from the previous 200 periods (weeks). The following is part of the summary output of this linear regression.

Answer the following questions based on the above summary output.

(a) What proportion of variation in the training data is explained by the linear regression model?

(b) Does Salesforce have a statistically significant effect on Revenue? Explain your answer.

(c) In the next two weeks, we want to change the Salesforce level to 25 and 30, respectively. Predict the corresponding revenues based on the above regression output.

(d) Suppose that we introduce Advertising as an input variable (predictor) in this regression. Answer the following questions.

i. Will the new model fit the training data better? (If yes, explain how you would verify it from the output. If no, explain why not.)

ii. Will the new model yield higher prediction accuracy for the test data? (If yes, explain how you would verify it from the output. If no, explain why not.)
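The quantities asked about in parts (a) to (c) can be read off a fitted `lm` object. The sketch below assumes simulated stand-in data, because the exam's summary output is not reproduced here; the data frame `sales` and its coefficients are hypothetical and will not match the exam's numbers.

```r
# Simulated stand-in data; the numbers are NOT the exam's regression output.
set.seed(2)
sales <- data.frame(Salesforce = runif(200, min = 10, max = 40))
sales$Revenue <- 5 + 0.8 * sales$Salesforce + rnorm(200)

fit <- lm(Revenue ~ Salesforce, data = sales)
s <- summary(fit)

# Proportion of training variation explained by the model.
print(s$r.squared)

# p-value of the t-test on the Salesforce slope.
print(s$coefficients["Salesforce", "Pr(>|t|)"])

# Predictions at Salesforce = 25 and 30: intercept + slope * Salesforce.
new_weeks <- data.frame(Salesforce = c(25, 30))
pred <- predict(fit, newdata = new_weeks)
print(pred)
```

With only the printed coefficient table available, the same predictions follow by hand from the estimated intercept plus the estimated slope times the Salesforce level.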

6. Consider the following data, which shows the types of smart phones available at a store. (AboveMed is 1 if the phone’s price is above median, and 0 otherwise.) Suppose this data is stored as a data frame called Phone. In the rest of this question, you will be asked about the result of certain R commands applied to this data frame. The *grey cells* (Memory, Size, Price, AboveMed, Available, A, B, C, D, E, F) are *not part of the data* – they are just column names and row names.

	Memory	Size	Price	AboveMed	Available
A	32	Regular	650	0	0
B	128	Regular	750	0	1
C	256	Regular	850	1	1
D	32	Plus	770	0	1
E	128	Plus	870	1	1
F	256	Plus	970	1	0

For each R command below, please write down the result (you do not have to display the output just as R would; simply indicate the result you would obtain):

(a) > Phone[2,3]

(b) > Phone[1,]

(d) > table(Phone$AboveMed, Phone$Available)

(e) > train = -c(1,2,5)

> Phone[train,]

(f) > x = seq(from = 1, to = 9, by = 2)

> y = 2

> x*y

(g) > set.seed(19)

> sample(1:15,2)

[1] 2 7

> set.seed(19)

> sample(1:15,2)
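To try the commands in this question, the Phone data frame can be rebuilt from the table. This is a minimal sketch; it assumes that Size is stored as a character column and the remaining columns as numeric, which the exam does not specify.

```r
# Rebuild the table as a data frame; A-F become row names, not a data column.
Phone <- data.frame(
  Memory    = c(32, 128, 256, 32, 128, 256),
  Size      = c("Regular", "Regular", "Regular", "Plus", "Plus", "Plus"),
  Price     = c(650, 750, 850, 770, 870, 970),
  AboveMed  = c(0, 0, 1, 0, 1, 1),
  Available = c(0, 1, 1, 1, 1, 0),
  row.names = c("A", "B", "C", "D", "E", "F")
)

# Inspect the structure before running the commands from the question.
str(Phone)
```

Two R conventions are useful when checking the answers: a negative index vector drops the listed rows rather than selecting them, and calling `set.seed` with the same value before `sample` reproduces the same draw.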

The questions below are about “model selection”, which we will cover in week 7.

7. For each statement below, choose True or False.

(a) All model selection methods (best subset selection, forward selection, and backward selection) lead to the same set of input variables (predictors).

TRUE FALSE

(b) Adding an input variable (predictor) to a linear regression will never reduce R².

TRUE FALSE

TRUE FALSE

(d) “Regressor” is one of the terms used in data analytics for the “dependent” variable.

TRUE FALSE

(e) If we add more input variables (predictors) to a linear regression, we will reduce the residual sum of squares (RSS).

TRUE FALSE

(f) Suppose we are using the “best subset selection” method to find the best subset of predictors for a linear regression. The best subset of predictors according to the Cp criterion will be the same as the best subset of predictors according to the AIC criterion.

TRUE FALSE

8. Given 10 possible predictors, we would like to use model selection techniques to determine the best subset of predictors. How many subsets will we evaluate if we use best subset selection?
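The counting argument behind this question can be sketched for a small case. Best subset selection considers every model of every size, so for p predictors there are choose(p, k) candidates with exactly k predictors. The helper `n_subsets` below is hypothetical, and it counts the empty (intercept-only) model as one of the subsets, which is a convention rather than something the question states.

```r
# Number of candidate models for p predictors: choose(p, k) models of each
# size k = 0, 1, ..., p, where k = 0 is the intercept-only model.
n_subsets <- function(p) sum(choose(p, 0:p))

# Small example with p = 3: 1 + 3 + 3 + 1 candidate models.
print(n_subsets(3))

# The per-size counts add up to 2^p.
print(n_subsets(3) == 2^3)
```

The same function applies to any p; if the intercept-only model is not counted, the total is one less.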

9. The College data set provides data about 18 variables (columns) for more than 700 colleges (observations). The following table shows some of the columns for the first three rows of this data set:

	Private	Apps	Accept	Enroll	…	Expend	Grad.Rate
Abilene Christian University	Yes	1660	1232	721	…	7041	60
Adelphi University	Yes	2186	1924	512	…	10527	56
Adrian College	Yes	1428	1097	336	…	8735	54

We would like to model Apps (the number of applications) as the output variable (response), using a linear function of all other variables as the input variables (predictors). To determine the best subset of predictors, we use best subset selection. Part of the code and a partial output are shown below:

regfit.best=regsubsets(Apps~., data=College, nvmax=17)

summary(regfit.best)

best.summary=summary(regfit.best)

best.summary$cp

Answer the following questions based on the above information.

(a) Which predictors are included in the best model with 5 predictors?

(b) How many predictors are there in the best model according to the Cp criterion?
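Both parts can be answered programmatically from the objects created by the code in the question. The sketch below assumes the `leaps` package (which provides `regsubsets`) and the `ISLR` package are installed, and that `ISLR::College` is the College data set the question refers to; the exam does not state either. The code is skipped when the packages are missing.

```r
# Assumes the leaps and ISLR packages are installed; ISLR::College is assumed
# to be the data set used in the question.
if (requireNamespace("leaps", quietly = TRUE) &&
    requireNamespace("ISLR", quietly = TRUE)) {
  College <- ISLR::College
  regfit.best  <- leaps::regsubsets(Apps ~ ., data = College, nvmax = 17)
  best.summary <- summary(regfit.best)

  # Coefficients (and hence predictors) of the best 5-predictor model.
  print(coef(regfit.best, id = 5))

  # Model size with the smallest Cp value.
  print(which.min(best.summary$cp))
}
```

In the printed summary of `regfit.best`, the row for a given model size marks the included predictors with an asterisk, which is the by-hand equivalent of the `coef` call.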

10. Suppose we have a data set, which has p predictors (input variables), and we perform model selection using (i) best subset selection (BSS), (ii) forward stepwise selection (FwSS), and (iii) backward stepwise selection (BwSS). Specifically, using each of these three approaches, we determine the best model with k predictors for all possible values of k, that is, k = 1, 2, …, p.

(a) For a given k, suppose we are comparing the models obtained by these three methods, that is, the model with k predictors obtained by BSS, FwSS, BwSS. Which of the three models will have the smallest RSS on the data we used to perform model selection? Explain your answer.

(b) Answer the following True or False questions:

(i) The predictors in the k-predictor model identified by FwSS are a subset of predictors in the (k+1)-predictor model identified by FwSS.

(ii) The predictors in the k-predictor model identified by BwSS are a subset of predictors in the (k+1)-predictor model identified by BwSS.

(iii) The predictors in the k-predictor model identified by BSS are a subset of predictors in the (k+1)-predictor model identified by BSS.

(iv) The predictors in the k-predictor model identified by BwSS are a subset of predictors in the (k+1)-predictor model identified by FwSS.

(v) The predictors in the k-predictor model identified by FwSS are a subset of predictors in the (k+1)-predictor model identified by BwSS.
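The mechanics of forward stepwise selection can be made concrete with a small hand-rolled version. This is a sketch under stated assumptions: the data are simulated with p = 4 predictors, the helper `rss_of` and all variable names are hypothetical, and training RSS is used as the selection rule at each step. It is not the `regsubsets` implementation.

```r
# Simulated data with p = 4 predictors (an assumption for illustration).
set.seed(3)
n <- 80
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
X$y <- 2 * X$x1 - X$x3 + rnorm(n)

# Training RSS of the linear model that uses the predictors named in vars.
rss_of <- function(vars) {
  f <- reformulate(vars, response = "y")
  sum(resid(lm(f, data = X))^2)
}

candidates <- c("x1", "x2", "x3", "x4")
chosen <- character(0)
path <- list()
for (k in seq_along(candidates)) {
  remaining <- setdiff(candidates, chosen)
  # Greedy step: add the single predictor that lowers training RSS the most.
  rss <- sapply(remaining, function(v) rss_of(c(chosen, v)))
  chosen <- c(chosen, remaining[which.min(rss)])
  path[[k]] <- chosen
}

# path[[k]] holds the predictors of the k-predictor model found at step k.
print(path)
```

Because each step only appends to `chosen`, the printed sets show how the k-predictor and (k+1)-predictor models from this procedure relate. Best subset selection carries no such constraint, since it searches each model size independently.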