CSI4142 Fundamentals of Data Science Final Examination 2020
Case Study
Consider Opération Soleil Club, a company that sells all-inclusive vacation packages departing from major cities in Canada (e.g. Ottawa, Toronto, Calgary, Halifax and Montreal) to Caribbean destinations, such as Cuba, Antigua and the Bahamas. All sales are conducted online, directly to customers, who need to become members of the Opération Soleil Club to book trips. (Note that a member cannot book a vacation for others without also going on the trip. This requirement is strictly enforced to make sure that Opération Soleil is not targeted by resellers.)
You are encouraged to visit relevant websites, such as https://www.tripcentral.ca/ or https://www.escapes.ca/, to get an idea of this business sector.
The Opération Soleil transaction flow is as follows. A new member first registers on the Opération Soleil website and then proceeds to book an all-inclusive package for one (1), two (2) or more travellers. A package includes the airplane ticket(s), the cost of stay at a resort and transfer(s) to and from the airport at the Caribbean destination. The accommodation costs include not only the price of the room(s), but also all meals and most beverages. At some of the resorts, children stay for free when sharing a room with adults, while other resorts offer a reduced rate for minors, i.e. persons younger than 12 years of age.
Members select the resort of their choice by considering factors such as whether it is on the beach front, is adults-only, offers a golf course and/or has a family club. The duration of an all-inclusive stay is typically seven (7) nights, but members may also book shorter or longer stays. All members are required to provide passport details of all travellers in their party at the time of booking. This means that Opération Soleil has access to information such as the age, nationality and gender of all the persons staying at their resorts. Opération Soleil also stores the members’ home addresses and credit card numbers. The company encourages all members to fill in a questionnaire about their employment histories and lifestyle preferences. The current response rate is around 95%, given the incentive that participating members are entered in a draw for an all-inclusive trip for two, to Bora Bora!
The all-inclusive travel business is very competitive and, unfortunately, the owner of Opération Soleil has noticed that the membership rate has remained rather constant. That is, it is difficult to attract new clients. In addition, the financial analysts of Opération Soleil are concerned about two aspects of the business. Firstly, the most expensive packages, notably packages for high-end adults-only destinations during peak seasons, are not selling well. Secondly, the profit margins of “last minute deals” are very low. On the other hand, there are two positive trends. Namely, (i) low-season wedding packages are becoming increasingly popular and (ii) many families with minors in the 6-to-12 years age group tend to “skip school” and choose to travel during semesters, in order to benefit from cheaper shoulder-season rates.
The owner of Opération Soleil decides to construct a data mart, in her quest to better understand the trends in her business. Her main aim is to track the members’ preferences and to explore the popularity of specific destinations, in terms of the services offered, the star ratings, the most popular dates, and so on.
In addition, she would like to determine the typical profile of her clientele, members as well as accompanying persons, in terms of demographics such as age, income, city of origin and occupation. She aims to use this information for future targeted marketing. To this end, the owner wishes to determine the number of adults and children travelling together, and the profiles of the resorts they tend to stay at.
Senior citizens further represent a market that she has plans to expand into. Also, as stated above, another goal is to assess which destinations are popular, or unpopular, and to link such information to age group(s) and family characteristics.
Suppose that you have access to the Opération Soleil member profiles and transactional data from the last ten (10) years. Further, suppose that the grain of the Opération Soleil data mart is an individual visitor to a resort in the Caribbean, on a single trip. This visitor may be a member, an accompanying adult or a minor. For instance, consider a booking for two adults and two minors, travelling to a resort in Cuba for a one week stay during March 2020. This booking will result in four rows in your data mart, one for every visitor in the group.
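The grain described above can be illustrated with a minimal sketch. The field names, the booking identifier and the resort name below are hypothetical and are not part of the case study; the only point is that one booking for four travellers produces four rows, one per visitor.

```python
from dataclasses import dataclass


@dataclass
class Traveller:
    name: str
    role: str  # "member", "accompanying adult" or "minor"


def booking_to_fact_rows(booking_id, resort, country, month, travellers):
    """Return one row per traveller, matching the visitor-level grain."""
    return [
        {
            "booking_id": booking_id,
            "resort": resort,
            "country": country,
            "month": month,
            "traveller": t.name,
            "role": t.role,
        }
        for t in travellers
    ]


# Hypothetical party: two adults and two minors on a single booking.
party = [
    Traveller("A", "member"),
    Traveller("B", "accompanying adult"),
    Traveller("C", "minor"),
    Traveller("D", "minor"),
]
rows = booking_to_fact_rows(1001, "Hypothetical Resort", "Cuba", "2020-03", party)
print(len(rows))  # one row per visitor in the group
```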
You are hired as a data scientist at Opération Soleil and your main task is to build data mining models to better understand the trends in the business, as discussed above.
You decide to focus your attention on determining which destinations, and specifically resorts, are popular, or unpopular, based on the ratings as provided by past customers.
Part A: Data preprocessing [10 marks]
Consider the following table, which contains sample ratings of resorts that past customers visited. This data set contains three data types, namely nominal, binary and numerical data. Note that the last column indicates the class label (Rating), which may take the values “High” or “Low”:
Resorts1(Number-of-stars, Number-of-restaurants, Beach, Price-per-individual, Country, Rating)
| Number of stars | Number of restaurants | Beach (Y/N) | Price per individual (per week) | Country | Rating |
|---|---|---|---|---|---|
| 5 | 3 | Y | 1200 | Cuba | High |
| 2 | 5 | N | 400 | Antigua | Low |
| - | 3 | Y | 800000 | Mexico | High |
| 3 | 3 | Y | 1560 | Cuba | High |
| 4 | 2 | - | 1670 | Mexico | Low |
| 5 | 5 | N | 1300 | - | High |
1. Explain how you would transform the three different data types prior to data mining. (6)
2. Explain how you would handle noise. (2)
3. Explain how you would handle missing values. (2)
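One possible way to make the three questions above concrete is sketched below on the Resorts1 sample. Every choice in it is an assumption rather than the expected answer: the rule that a price above ten times the median is noise, median and mode imputation, an "Unknown" category for a missing country, min-max scaling for numerical attributes, 0/1 coding for the binary attribute and one-hot coding for the nominal attribute.

```python
from collections import Counter
from statistics import median

COLUMNS = ["stars", "restaurants", "beach", "price", "country", "rating"]
RAW = [  # None marks the "-" entries in the table
    (5, 3, "Y", 1200, "Cuba", "High"),
    (2, 5, "N", 400, "Antigua", "Low"),
    (None, 3, "Y", 800000, "Mexico", "High"),
    (3, 3, "Y", 1560, "Cuba", "High"),
    (4, 2, None, 1670, "Mexico", "Low"),
    (5, 5, "N", 1300, None, "High"),
]


def preprocess(raw):
    rows = [dict(zip(COLUMNS, r)) for r in raw]

    # Noise: treat a price above 10x the median as a recording error
    # (assumed threshold) and replace it with the median of the other prices.
    prices = [r["price"] for r in rows]
    limit = 10 * median(prices)
    clean_median = median(p for p in prices if p <= limit)
    for r in rows:
        if r["price"] > limit:
            r["price"] = clean_median

    # Missing values: median for numerical, mode for binary,
    # and an explicit "Unknown" category for nominal attributes.
    star_median = median(r["stars"] for r in rows if r["stars"] is not None)
    beach_mode = Counter(
        r["beach"] for r in rows if r["beach"] is not None
    ).most_common(1)[0][0]
    for r in rows:
        if r["stars"] is None:
            r["stars"] = star_median
        if r["beach"] is None:
            r["beach"] = beach_mode
        if r["country"] is None:
            r["country"] = "Unknown"

    # Transformation: min-max scale numerical, 0/1 binary, one-hot nominal.
    numeric = ("stars", "restaurants", "price")
    bounds = {c: (min(r[c] for r in rows), max(r[c] for r in rows)) for c in numeric}
    countries = sorted({r["country"] for r in rows})
    out = []
    for r in rows:
        features = {c: (r[c] - bounds[c][0]) / (bounds[c][1] - bounds[c][0]) for c in numeric}
        features["beach"] = 1 if r["beach"] == "Y" else 0
        for c in countries:
            features[f"country={c}"] = 1 if r["country"] == c else 0
        features["rating"] = r["rating"]
        out.append(features)
    return out


processed = preprocess(RAW)
for row in processed:
    print(row)
```

Other defensible choices exist, for example dropping incomplete rows or binning the price, and a written answer should justify whichever is used.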
Part B: Classification [25 marks]
1. Consider the following table that contains another subset of data detailing the Ratings of Resorts by past visitors. The class label is the Rating attribute, which can take the values High or Low:
Ratings(Number-of-stars, Number-of-swimming-pools, Adults-only, Type-of-beach, Rating)
| Row | Number of stars | Number of swimming pools | Adults only | Type of beach | Rating |
|---|---|---|---|---|---|
| 1 | Four | Three | No | Pebbles | High |
| 2 | Five | Two | No | Pebbles | High |
| 3 | Four | One | Yes | Sand | High |
| 4 | Three | Two | No | Pebbles | Low |
| 5 | Three | One | Yes | Pebbles | High |
| 6 | Three | Two | Yes | Sand | High |
| 7 | Four | Two | No | Sand | High |
| 8 | Four | Three | Yes | Pebbles | High |
| 9 | Five | Two | No | Sand | Low |
| 10 | Four | One | Yes | Sand | High |
| 11 | Three | Two | No | Pebbles | Low |
| 12 | Five | Two | Yes | Pebbles | Low |
| 13 | Three | Three | No | Pebbles | Low |
Consider the following additional Row 14 describing another Rating.
| Row | Number of stars | Number of swimming pools | Adults only? | Type of beach | Rating |
|---|---|---|---|---|---|
| 14 |  |  | Yes | Sand | High |
First, insert your own values for the number-of-stars and number-of-swimming-pools attributes of Row 14, corresponding to the last digit of your own age. That is, if you are 23 years old, you should enter the values (Number-of-stars = Three) and (Number-of-swimming-pools = Three), giving the following row:
| Row | Number of stars | Number of swimming pools | Adults only? | Type of beach | Rating |
|---|---|---|---|---|---|
| 14 | Three | Three | Yes | Sand | High |
a. Show how the ID3 decision tree algorithm would construct a model against this data, using Rows 1 to 14 as the training set. That is, your answer should detail the steps followed by a decision tree learner using Information Gain. (8)
b. Illustrate the steps that a Boosting ensemble, consisting of k-nearest neighbor classifiers, will follow to build a model against this data. (8)
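For part (a), the quantity ID3 compares at each split can be sketched as follows. This is a minimal illustration, not a full tree builder: it computes the class entropy and the information gain of each attribute over Rows 1 to 14, using the example Row 14 from the text (Three, Three, Yes, Sand, High), which should be replaced by your own values. The short attribute names are assumptions made for readability.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["stars", "pools", "adults_only", "beach"]
ROWS = [  # (stars, pools, adults_only, beach, rating)
    ("Four", "Three", "No", "Pebbles", "High"),
    ("Five", "Two", "No", "Pebbles", "High"),
    ("Four", "One", "Yes", "Sand", "High"),
    ("Three", "Two", "No", "Pebbles", "Low"),
    ("Three", "One", "Yes", "Pebbles", "High"),
    ("Three", "Two", "Yes", "Sand", "High"),
    ("Four", "Two", "No", "Sand", "High"),
    ("Four", "Three", "Yes", "Pebbles", "High"),
    ("Five", "Two", "No", "Sand", "Low"),
    ("Four", "One", "Yes", "Sand", "High"),
    ("Three", "Two", "No", "Pebbles", "Low"),
    ("Five", "Two", "Yes", "Pebbles", "Low"),
    ("Three", "Three", "No", "Pebbles", "Low"),
    ("Three", "Three", "Yes", "Sand", "High"),  # Row 14: example values only
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy of the class label minus the weighted entropy after splitting."""
    base = entropy([r[-1] for r in rows])
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attr_index], []).append(r[-1])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return base - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
root = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("first split:", root)
```

With the example Row 14, the number of stars has the largest gain, so ID3 would split on it first and then repeat the same calculation within each branch that is not yet pure. A different Row 14 changes the counts, so the gains should be recomputed for your own values.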
2022-03-10