AD699: Data Mining for Business Analytics 2018 Quiz #1

Published: 2024-06-28

Thursday, 26JUL2018

1.  Suppose a user built a k-nearest neighbors model, with a k value equal to the total number of records in the training data set.  What would happen in such an instance?

a.    In a situation such as this, a new record would be assigned to a particular class by a random tie-breaker.

b.   A number of binary dummies equal to the full value of the training set would be required.

c.   All of the records would be assigned to the majority class from the training data.

d.   This would eliminate the overfitting risk, but at a great computational cost.
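The scenario in question 1 can be tried out directly. The sketch below is a minimal pure-Python k-nearest-neighbors classifier, not any particular library's implementation; the function name `knn_predict` and the toy records are hypothetical and chosen only for illustration. It classifies the same query point once with k = 1 and once with k equal to the number of training records, so the two predictions can be compared.

```python
from collections import Counter
import math


def knn_predict(train, query, k):
    """Return the majority label among the k training records nearest to query.

    train is a list of (features, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: five records of class "A" and two of class "B".
train = [
    ((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
    ((0.3, 0.3), "A"), ((0.2, 0.4), "A"),
    ((5.0, 5.0), "B"), ((5.1, 4.9), "B"),
]

query = (5.0, 5.1)  # lies right next to the "B" records

print("k = 1:", knn_predict(train, query, k=1))
print("k = n:", knn_predict(train, query, k=len(train)))
```

Moving `query` around and rerunning shows how much the k = n prediction depends on the query's position.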

2. A classification tree with four decision nodes will have how many terminal nodes?

(Questions 3 through 7).  We are working with the McDonald’s in Kenmore Square to conduct a study of purchase patterns.  After we observed 12 transactions, we created the following binary matrix:

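| Transaction Number | Big Mac | Quarter Pounder | French Fries | Egg McMuffin | Apple Pie | Soft Drink | McFlurry |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 2 | 1 | 1 | 1 | 0 | 1 | 1 | 0 |
| 3 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| 4 | 1 | 0 | 1 | 0 | 0 | 1 | 1 |
| 5 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 6 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |
| 7 | 1 | 1 | 1 | 1 | 0 | 1 | 0 |
| 8 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 9 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 10 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| 11 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| 12 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |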

3.   What is the support for {Big Mac, French Fries}?

4.   What is the confidence for IF {Big Mac, Quarter Pounder} THEN {Apple Pie}?

5.   What is the support for {Quarter Pounder, French Fries, Soft Drink, Apple Pie}?

6.  What is the lift ratio for IF {French Fries} THEN {Soft Drink}?

7.  What is the lift ratio for IF {French Fries, Egg McMuffin} THEN {Apple Pie}?
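Answers to questions 3 through 7 can be checked with a short script. The sketch below encodes the 12-transaction binary matrix from this quiz and assumes the common definitions: support as the fraction of transactions containing every item in the set, confidence as support(antecedent and consequent) divided by support(antecedent), and lift as confidence divided by support(consequent). Some texts report support as a raw count instead of a fraction, so adjust if the course uses that convention. The helper names `support`, `confidence`, and `lift` are illustrative.

```python
ITEMS = ["Big Mac", "Quarter Pounder", "French Fries", "Egg McMuffin",
         "Apple Pie", "Soft Drink", "McFlurry"]

# One row per transaction (1 through 12), columns in the order of ITEMS.
ROWS = [
    [1, 0, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 1, 1, 0],
    [1, 0, 0, 0, 1, 1, 0],
    [1, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 0, 1, 0],
    [1, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 1],
    [0, 0, 1, 1, 1, 1, 1],
    [0, 0, 1, 1, 0, 1, 0],
    [1, 1, 1, 0, 0, 1, 0],
]

# Convert each binary row into the set of items purchased.
TRANSACTIONS = [{item for item, flag in zip(ITEMS, row) if flag} for row in ROWS]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in TRANSACTIONS) / len(TRANSACTIONS)


def confidence(antecedent, consequent):
    """Support of antecedent and consequent together, relative to the antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


def lift(antecedent, consequent):
    """Confidence divided by the benchmark confidence, support(consequent)."""
    return confidence(antecedent, consequent) / support(consequent)


print("Q3:", support({"Big Mac", "French Fries"}))
print("Q4:", confidence({"Big Mac", "Quarter Pounder"}, {"Apple Pie"}))
print("Q5:", support({"Quarter Pounder", "French Fries", "Soft Drink", "Apple Pie"}))
print("Q6:", lift({"French Fries"}, {"Soft Drink"}))
print("Q7:", lift({"French Fries", "Egg McMuffin"}, {"Apple Pie"}))
```

`confidence` raises a `ZeroDivisionError` if the antecedent never occurs, which does not happen for the rules asked about here.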

Questions 8 and 9.  The Red Sox are commissioning a study to help them understand how the BU Bridge construction project will affect the way fans travel to home games at Fenway Park.  After speaking with 925 fans, they found that 400 of them never drive to games.  100 of the fans said, “I never drive, and I never take the T.”  150 fans said, “Sometimes I take the T, and sometimes I drive...it just depends on how I feel that day.”

8.  Given that a Red Sox fan never takes the T to the games, what is the chance that he never drives to the games?

9.   Given that a Red Sox fan sometimes drives to games, what is the chance that he never takes the T to games?
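The conditional probabilities in questions 8 and 9 follow from filling in a two-way table of counts. The sketch below assumes every fan falls into exactly one of "never drives" or "sometimes drives", and exactly one of "never takes the T" or "sometimes takes the T"; that partition is an interpretation of the survey wording, not something the problem states explicitly. The variable names are illustrative.

```python
total = 925
never_drive = 400                  # given
never_drive_never_t = 100          # given: never drive AND never take the T
sometimes_drive_sometimes_t = 150  # given: sometimes drive AND sometimes take the T

# Fill in the remaining cells of the 2x2 table (assumes the partition above).
sometimes_drive = total - never_drive
sometimes_drive_never_t = sometimes_drive - sometimes_drive_sometimes_t
never_t = never_drive_never_t + sometimes_drive_never_t

# Conditional probability = joint count / count of the conditioning event.
p_never_drive_given_never_t = never_drive_never_t / never_t
p_never_t_given_sometimes_drive = sometimes_drive_never_t / sometimes_drive

print("Q8:", round(p_never_drive_given_never_t, 4))
print("Q9:", round(p_never_t_given_sometimes_drive, 4))
```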

10.  Mary and Tim each review six famous novels and give those novels scores from 1 to 10, with 10 being the best and 1 being the worst.  In order, the novels are: The Grapes of Wrath, The Sun Also Rises, The Sound and the Fury, Native Son, Midnight’s Children, and Robinson Crusoe.  Their scores are listed below:

Mary:  (10, 8, 9, 6, 7, 3)

Tim:    (5, 7, 8, 6, 7, 5)

What is the Euclidean distance between Mary and Tim?
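The Euclidean distance in question 10 is the square root of the sum of squared score differences across the six novels. A minimal sketch using only the standard library follows; the helper name `euclidean` is illustrative.

```python
import math

mary = (10, 8, 9, 6, 7, 3)
tim = (5, 7, 8, 6, 7, 5)


def euclidean(a, b):
    """Square root of the summed squared differences between two score vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


distance = euclidean(mary, tim)
print(round(distance, 4))
```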

11.  Which of the following is a distance metric that should not be used when your data consists of continuous numeric variables?

a.    correlation distance.

b.   Euclidean distance.

c.    covariance distance.

d.   Hamming distance.

12.  Imagine that the proportion of all e-mails in our training data that are spam is .62, and the proportion of e-mails that are ham is .38.  An e-mail comes through our Bayesian filter and it contains only two words: “lucrative lottery.”  From our training data, we know the probabilities that each of these words would appear in either type of e-mail.  The probabilities are shown below:

| Word | SPAM | HAM |
|---|---|---|
| lucrative | .05 | .01 |
| lottery | .15 | .03 |

Using a naive Bayes methodology, what is the probability that this e-mail is spam?
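The naive Bayes calculation for question 12 multiplies each class prior by the per-word probabilities for that class, then normalizes across the two classes. The sketch below uses only the numbers given in the question and relies on the usual naive assumption that words occur independently within a class; the dictionary layout and the function name `posterior` are illustrative.

```python
priors = {"spam": 0.62, "ham": 0.38}

# P(word | class), taken from the table in the question.
likelihoods = {
    "spam": {"lucrative": 0.05, "lottery": 0.15},
    "ham": {"lucrative": 0.01, "lottery": 0.03},
}


def posterior(words):
    """Return P(class | words) for each class under the naive independence assumption."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for word in words:
            score *= likelihoods[cls][word]
        scores[cls] = score
    total = sum(scores.values())  # normalizing constant, P(words)
    return {cls: score / total for cls, score in scores.items()}


result = posterior(["lucrative", "lottery"])
print({cls: round(p, 4) for cls, p in result.items()})
```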

13.  We have a data set in front of us, and all of its variables are numerical.  Is it possible to build a naive Bayes model with this data set?

a.   Yes, it is possible.   But to avoid the risk of overfitting, we have to ensure that the numerical predictors that we’ll use as inputs are likely to appear in both our training and validation sets.

b.   No, it is not possible.  We should use a different algorithm instead, such as k-NN.

c.   No, it is not possible.  We can, however, attempt to estimate the probabilities using a different technique, and then rebuild the model.

d.  Yes, it is possible.   But first, we must convert all of the variables into categorical ones, after binning them into particular groups.

14.  Based on the information in our database about interests, habits, and tastes, Janet is very similar to Carla.  Therefore, since Janet gave a favorable rating to the movie “Skyscraper,” we will recommend it to Carla, too.  Which approach does this describe?

a.    Content-based filtering.

b.   User-based collaborative filtering.

c.    Item-based collaborative filtering.

d.   Association rules.

15.  Why would a classification tree in which the terminal nodes (leaf nodes) were completely homogeneous not necessarily be a great model?

a.   Even though this tree model might have perfectly classified the data that was used to build it, we don’t know how it would perform with new data.

b.   Such a tree would not be able to perform well if the training data used to create it had included outliers.

c.   This tree would not be able to avoid the use of cross-validated data.

d.   With a tree such as this, the confusion matrix would reveal splits that did not originally occur in the model.