CSI4142 Introduction to Data Science Midterm 2019
1. Declare the grain of your data mart. (5)
2. Identify two measures (or so-called facts) in your data mart. (5)
3. Identify the following two types of dimensions in your data mart. Be sure to clearly explain
your choices.
a. A role-playing dimension (3)
b. A slowly changing dimension, other than Artist (2)
Consider the following two queries:
a) How do the ticket sales for jazz concerts in February 2019 compare with the ticket sales for the annual jazz festival that takes place during the summer, in July and August?
For instance, the query may return the following results:
Feb 2019 | Summer 2018 | Summer 2017 | Summer 2016 | …
100 | 2300 | 3400 | 1400 |
b) How do the ticket sales for jazz concerts in February 2019 compare with the ticket sales for jazz concerts in December 2018?
(For instance, the query may show that the total number of tickets sold for February 2019 was 100, while the total sales for December 2018 was 600.)
4. Give an example of one aggregate (or cube) that you would build to speed up both query (a) and query (b). (5)
5. Suppose that the data from 2010 contain many missing values. Specifically, the seating preferences of many customers were not captured. Also, for many events, MyTickets did not record the details of the principal actors and artists. Explain how you would handle such omissions. (5)
6. Suppose that Artist is a slowly changing dimension on marital status and that we wish to keep the history of changes. Explain how you would implement this change in your data mart. (5)
7. There are three ways to increase the speed of the incremental load cycle. Explain how these three techniques may be used, with reference to the MyTickets data mart. (5)
8. The analytical cycle for Business Intelligence, as discussed in class, consists of five steps. Explain these five steps, using your own example against the MyTickets data mart. (5)
9. The manager of MyTickets is interested in identifying the ten (10) customers who purchased the highest total number of tickets in 2018. She wants to determine their names, gender, and postal codes, together with the average number of tickets they purchased for events. Provide the SQL statement to answer this query. (5)
10. As a data scientist, you are tasked with classifying past events into one of two categories, namely as successful (outcome: good) or unsuccessful (outcome: bad). To this end, you collect sample data with the following schema:
Weather | Tickets sold | Day of Week | Event Type | Venue | Outcome
snow | 10,000 | Saturday | Opera | NCC | Good
snow | 1,000 | Monday | Opera | NCC | Bad
clear | 5,000 | Saturday | Jazz | Dows | Bad
rain | 6,000 | Sunday | Jazz | Dows | Good
clear | 500 | Tuesday | Cinema Nouveau | Archives | Good
clear | 1,000 | Tuesday | Blues | NCC | Good
… | | | | |
Explain the process you will follow to construct a model from this data. (5)
2022-03-10