CSI4142 Introduction to Data Science Final Examination 2019
Question 1: Data Mining [20]
Consider the following table containing some sample data about Customers who visited one of the Northern Lights Spas over the last 20 years. Note that the database contains the details of 800,000 individuals.
| Name | Date-of-Birth | Age | Country | City | Package | Price ($) | Gender | Fitness |
| Smith, John | 01-01-1980 | 24 | Canada | Ottawa | Baths+Robe | 40 | M | Fit |
| Smith, J | 01-01-1990 | 28 | Italy | Padova | Therapeutic Treatment | 160 | F | Unfit |
| Jane Doe | 03-08-1969 | 15 | - | Washington DC | Baths+Robe | 50 | - | Moderately Fit |
Explain how you would preprocess this data prior to data mining, with reference to the Northern Lights Spa case study. Specifically, your answer should focus on:
- Transforming four (4) different types of data: numeric, nominal, ordinal and Boolean (12)
- Handling missing values and noise, given that about 5% of the customer records contain either missing values or noise (4)
- Addressing the curse of dimensionality (4)
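The four attribute types named in the first bullet can be illustrated on the three sample rows. This is only a minimal sketch, not a model answer. The helper names (`min_max`, `one_hot`, `to_bool`), the fitness ordering in `FITNESS_RANK`, and the choice to treat Gender as the Boolean attribute are assumptions made for illustration.

```python
# Sample rows from the table above; "-" marks a missing value in the source.
rows = [
    {"country": "Canada", "price": 40.0, "gender": "M", "fitness": "Fit"},
    {"country": "Italy", "price": 160.0, "gender": "F", "fitness": "Unfit"},
    {"country": "-", "price": 50.0, "gender": "-", "fitness": "Moderately Fit"},
]

# Assumed ordering of the ordinal Fitness attribute (not stated in the source).
FITNESS_RANK = {"Unfit": 0, "Moderately Fit": 1, "Fit": 2}


def min_max(values):
    """Numeric: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]


def one_hot(values, missing="-"):
    """Nominal: one indicator column per observed category.

    A missing value becomes an all-zero row rather than its own category.
    """
    cats = sorted({v for v in values if v != missing})
    return cats, [[1 if v == c else 0 for c in cats] for v in values]


def to_bool(value, true_label, missing="-"):
    """Boolean: True/False, or None when the value is missing."""
    return None if value == missing else value == true_label


prices = min_max([r["price"] for r in rows])
categories, country = one_hot([r["country"] for r in rows])
fitness = [FITNESS_RANK[r["fitness"]] for r in rows]  # Ordinal: rank codes
is_female = [to_bool(r["gender"], "F") for r in rows]
```

Missing values are only flagged here (`None`, or an all-zero indicator row). How to impute or discard them is the subject of the second bullet and is left open.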
Question 2: Classification [30]
Suppose that your first data mining task is to construct a model to distinguish between customers, based on the types of Package they choose. To this end, you aim to apply a classification algorithm to your preprocessed data, with “Package” as the target class. Initially, your aim is to learn a model that distinguishes between customers who only visit the baths, versus those that also book a treatment. This is therefore a binary classification task with the target labels “Baths” and “Treatment”. (Recall that the database contains the details of 800,000 individuals.)
a. Suppose that the majority of customers (90%) only access the baths, while the remaining 10% also book a treatment. Explain how you would address such class imbalance prior to classification. (6)
b. Suppose that you created a decision tree that clearly overfits the data. Explain how you would modify your decision tree algorithm in order to avoid overfitting. (2)
c. Accuracy and error rate are not always the most appropriate evaluation measures, especially in an imbalanced setting. Name two (2) other measures that could be used and explain how they are calculated – you may show the formula. (4)
d. Suppose that the decision tree algorithm created in Question 2(b) does not produce highly accurate results. The owner of the Northern Lights Spa group has heard of ensemble-based methods. She is keen for you to use an ensemble in an attempt to improve on this accuracy.
i) Contrast the following three techniques - Boosting, Bagging and Random Forests - by focusing on the details of the algorithms and datasets used to construct a model. (6)
ii) Choose one of these three algorithms and motivate your choice, with reference to the Northern Lights Spa data. (4)
e. Explain why support vector machines (SVMs) are effective in high dimensions. (4)
f. Explain the pros and cons of lazy learning when compared with eager learning. (4)
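As an illustration for Question 2(c), precision, recall and the F1 score are one possible choice of measures; the question does not prescribe them. The sketch below computes them from confusion-matrix counts. The counts are hypothetical, chosen to mimic the 90%/10% split on 1,000 customers, with “Treatment” assumed to be the positive class.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts.

    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    F1        = 2 * precision * recall / (precision + recall)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 100 true "Treatment" customers, 900 true "Baths".
tp, fn = 60, 40   # "Treatment" customers predicted correctly / missed
fp, tn = 30, 870  # "Baths" customers mislabelled / predicted correctly

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

With these made-up counts the accuracy is high even though 40 of the 100 “Treatment” customers are missed, which is the effect the question alludes to.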
Question 3: Cluster Analysis [30]
Suppose that the Head Chef of the Northern Lights Spa group’s Bistros is interested in grouping customers in terms of the menu items they purchase.
You decide to employ a cluster analysis algorithm to determine potential clusters. To this end, you merge the customers’ details with their purchases at the restaurant, into a single database view. (Note that about 80% of the customers enjoy a meal at a Bistro while visiting a Spa.)
The partial schema of this view, together with the sample data of two customers, is as follows:
| Customer-Age | Customer-Gender | Food Item | Food Item Price | Beverage | Beverage-Price | Day-of-Week |
| 20 | F | Chicken Wrap | 8.99 | Tea | 3.99 | Saturday |
| 50 | M | Soup | 6.99 | Coffee | 3.99 | Sunday |
a. Explain how you would calculate the distances between Customers. (6)
b. Suppose that you have the option to apply a partitioning-based, a density-based or a model-based algorithm to this data.
i) For each one of these three algorithms: (9)
a. Explain how the method constructs clusters
b. Discuss the strengths of the method
c. Discuss the limitations of the method
ii) Choose one of these three approaches for the Northern Lights Spa data and motivate your choice. (2)
c. Explain how you would evaluate the quality of the clusters formed by the algorithm you chose in Question 3(b). (6)
d. Explain what subspace clustering entails and give an example, with reference to the Northern Lights case study, where it would be appropriate. (4)
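For Question 3(a), the view mixes numeric and nominal attributes, so one common option is a Gower-style distance: range-normalised absolute difference for numeric attributes, simple mismatch for nominal ones, averaged over all attributes. The sketch below applies it to the two sample customers. It is one possible approach, not the expected answer. The function name `gower_distance`, the dictionary keys, and the attribute ranges in `RANGES` are hypothetical, since the source does not give the real ranges.

```python
def gower_distance(a, b, numeric_ranges):
    """Average per-attribute dissimilarity between two records.

    Numeric attributes contribute |a - b| / range; all other attributes
    contribute 0 if equal and 1 otherwise.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            total += abs(a[key] - b[key]) / numeric_ranges[key]
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


# The two sample customers from the view above.
c1 = {"age": 20, "gender": "F", "food": "Chicken Wrap", "food_price": 8.99,
      "beverage": "Tea", "beverage_price": 3.99, "day": "Saturday"}
c2 = {"age": 50, "gender": "M", "food": "Soup", "food_price": 6.99,
      "beverage": "Coffee", "beverage_price": 3.99, "day": "Sunday"}

# Assumed attribute ranges (max - min over the whole database); hypothetical.
RANGES = {"age": 80.0, "food_price": 20.0, "beverage_price": 5.0}

d = gower_distance(c1, c2, RANGES)
```

Day-of-Week is treated as nominal here. Treating it as cyclic or as weekday versus weekend would be an equally defensible design choice.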
Question 4: Outliers and anomalies [20]
The manager of the Northern Lights Spa group recently noticed that it is very difficult to predict outliers (surprises) in the number of “walk ins”, measured in terms of people who request a therapeutic treatment when already on site. This unfortunately leads to scheduling difficulties and loss of revenue, since it is almost impossible to determine the optimal number of therapists to schedule on a given day.
To this end, he would like to detect abnormally high, or abnormally low, rates of such bookings. He plans to use this information to determine whether there are any interplays between customer profiles, seasonal events, time of the day, day of the week, and so on, that may be the cause of such outliers.
a. Outliers may be grouped into three categories: global, contextual or collective. In which one of these three categories would the therapeutic treatment outlier detection problem fall? Motivate your answer. (4)
b. There are four main challenges associated with outlier detection. Explain what these are, with reference to the Northern Lights Spa case study. (12)
c. Briefly describe one (1) classification-based method that is suitable for outlier detection. (4)
2022-03-10