
CSI4142 Introduction to Data Science

Final Examination 2019

Question 1: Data Mining [20]

Consider the following table containing some sample data about Customers who visited one of the Northern Lights Spas over the last 20 years. Note that the database contains the details of 800,000 individuals.

Name	Smith, John	Smith, J	Jane Doe
Date-of-Birth	01-01-1980	01-01-1990	03-08-1969
Age	24	28	15
Country	Canada	Italy	-
City	Ottawa	Padova	Washington DC
Package	Baths+Robe	Therapeutic Treatment	Baths+Robe
Price ($)	40	160	50
Gender	M	F	-
Fitness	Fit	Unfit	Moderately Fit

Explain how you would preprocess this data prior to data mining, with reference to the Northern Lights Spa case study. Specifically, your answer should focus on:

- Transforming four (4) different types of data: numeric, nominal, ordinal and Boolean (12)

- Handling missing values and noise, given that about 5% of the customer records contain either missing values or noise (4)

- Addressing the curse of dimensionality (4)
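
As a rough illustration of the four transformations named in Question 1, the sketch below encodes one record from the sample table. It is only one possible approach under stated assumptions: the helper names are hypothetical, the price range (40 to 160) is taken from the three sample rows rather than the full database, the ordering Unfit < Moderately Fit < Fit is assumed, and Gender is treated as a binary attribute with "-" meaning missing.

```python
# Hypothetical helpers; attribute names mirror the sample table.
FITNESS_ORDER = {"Unfit": 0, "Moderately Fit": 1, "Fit": 2}  # assumed ordering


def min_max(value, lo, hi):
    """Numeric: rescale a value to [0, 1] given the observed min and max."""
    return (value - lo) / (hi - lo)


def one_hot(value, categories):
    """Nominal: one 0/1 indicator per category, with no implied order."""
    return [1 if value == c else 0 for c in categories]


def encode_ordinal(value, order):
    """Ordinal: map a value to its rank, scaled to [0, 1]."""
    return order[value] / (len(order) - 1)


def encode_gender(value):
    """Binary: 1 for 'M', 0 for 'F', None when the value is missing ('-')."""
    return {"M": 1, "F": 0}.get(value)


record = {"Price": 40, "Country": "Canada", "Fitness": "Fit", "Gender": "M"}
encoded = (
    [min_max(record["Price"], lo=40, hi=160)]          # assumed range
    + one_hot(record["Country"], ["Canada", "Italy"])  # assumed category list
    + [encode_ordinal(record["Fitness"], FITNESS_ORDER)]
    + [encode_gender(record["Gender"])]
)
print(encoded)
```

Missing values surface here as `None` and would still need to be imputed or dropped before mining.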

Question 2: Classification [30]

Suppose that your first data mining task is to construct a model to distinguish between customers, based on the types of Package they choose. To this end, you aim to apply a classification algorithm to your preprocessed data, with “Package” as the target class. Initially, your aim is to learn a model that distinguishes between customers who only visit the baths and those who also book a treatment. This is therefore a binary classification task with the target labels “Baths” and “Treatment”. (Recall that the database contains the details of 800,000 individuals.)

a. Suppose that the majority of customers (90%) only access the baths, while the remaining 10% also book a treatment. Explain how you would address such class imbalance prior to classification. (6)

b. Suppose that you created a decision tree that clearly overfits the data. Explain how you would modify your decision tree algorithm in order to avoid overfitting. (2)

c. Accuracy and error rate are not always the most appropriate evaluation measures, especially in an imbalanced setting. Name two (2) other measures that could be used and explain how they are calculated – you may show the formula. (4)

d. Suppose that the decision tree algorithm created in Question 2(b) does not produce highly accurate results. The owner of the Northern Lights Spa group has heard of ensemble-based methods. She is keen for you to use an ensemble in an attempt to improve on this accuracy.

i) Contrast the following three techniques - Boosting, Bagging and Random Forests - by focusing on the details of the algorithms and datasets used to construct a model. (6)

ii) Choose one of these three algorithms and motivate your choice, with reference to the Northern Lights Spa data. (4)

e. Explain why support vector machines (SVMs) are effective in high dimensions. (4)

f. Explain the pros and cons of lazy learning when compared with eager learning. (4)
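
For Question 2(c), precision, recall and the F1 score are among the measures commonly used in imbalanced settings. The sketch below computes them from confusion-matrix counts, with “Treatment” as the positive class. The counts are hypothetical, chosen only to match the 90%/10% split in Question 2(a) for 1,000 customers; they are not taken from the case study.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for the positive class.

    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    F1        = 2 * precision * recall / (precision + recall)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 100 "Treatment" customers out of 1,000.
tp, fp, fn, tn = 60, 40, 40, 860

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(accuracy, precision, recall, f1)
```

With these counts the accuracy is high even though four in ten “Treatment” customers are missed, which is the effect the question alludes to.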

Question 3: Cluster Analysis [30]

Suppose that the Head Chef of the Northern Lights Spa group’s Bistros is interested in grouping customers in terms of the menu items they purchase.

You decide to employ a cluster analysis algorithm to determine potential clusters. To this end, you merge the customers’ details with their purchases at the restaurant, into a single database view. (Note that about 80% of the customers enjoy a meal at a Bistro while visiting a Spa.)

The partial schema of this view, together with the sample data of two customers, is as follows:

Customer-Age	20	50
Customer-Gender	F	M
Food Item	Chicken Wrap	Soup
Food Item Price	8.99	6.99
Beverage	Tea	Coffee
Beverage-Price	3.99	3.99
Day-of-Week	Saturday	Sunday

a. Explain how you would calculate the distances between Customers. (6)

b. Suppose that you have the option to apply a partitioning-based, a density-based or a model-based algorithm to this data.

i) For each one of these three algorithms: (9)

a. Explain how the method constructs clusters

b. Discuss the strengths of the method

c. Discuss the limitations of the method

ii) Choose one of these three approaches for the Northern Lights Spa data and motivate your choice. (2)

c. Explain how you would evaluate the quality of the clusters formed by the algorithm you chose in Question 3(b). (6)

d. Explain what subspace clustering entails and give an example, with reference to the Northern Lights case study, where it would be appropriate. (4)
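
For Question 3(a), one possible way to handle the mix of numeric and categorical attributes is a Gower-style dissimilarity: range-normalised absolute difference for numeric attributes, simple mismatch for categorical ones, averaged over all attributes. The sketch below applies it to the two sample customers. The function name, the dictionary keys and the attribute ranges are assumptions made for illustration; real ranges would come from the full view.

```python
def gower_distance(a, b, numeric_ranges):
    """Average per-attribute dissimilarity between two records, in [0, 1]."""
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            # Numeric attribute: absolute difference scaled by its range.
            total += abs(a[key] - b[key]) / numeric_ranges[key]
        else:
            # Categorical attribute: 0 if equal, 1 otherwise.
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


c1 = {"age": 20, "gender": "F", "food": "Chicken Wrap", "food_price": 8.99,
      "beverage": "Tea", "beverage_price": 3.99, "day": "Saturday"}
c2 = {"age": 50, "gender": "M", "food": "Soup", "food_price": 6.99,
      "beverage": "Coffee", "beverage_price": 3.99, "day": "Sunday"}

# Assumed attribute ranges (max - min); not given in the exam text.
ranges = {"age": 80, "food_price": 20.0, "beverage_price": 5.0}

d = gower_distance(c1, c2, ranges)
print(round(d, 4))
```

Day-of-Week is treated as nominal here; treating it as ordinal or cyclic would be an equally defensible design choice.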

Question 4: Outliers and anomalies [20]

The manager of the Northern Lights Spa group recently noticed that it is very difficult to predict outliers (surprises) in the number of “walk ins”, measured in terms of people who request a therapeutic treatment when already on site. This unfortunately leads to scheduling difficulties and loss of revenue, since it is almost impossible to determine the optimal number of therapists to schedule on a given day.

To this end, he would like to detect abnormally high, or abnormally low, rates of such bookings. He plans to use this information to determine whether there are any interplays between customer profiles, seasonal events, time of the day, day of the week, and so on, that may be the cause of such outliers.

a. Outliers may be grouped into three categories: global, contextual and collective. Into which one of these three categories would the therapeutic treatment outlier detection problem fall? Motivate your answer. (4)

b. There are four main challenges associated with outlier detection. Explain what these are, with reference to the Northern Lights Spa case study. (12)

c. Briefly describe one (1) classification-based method that is suitable for outlier detection. (4)