Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

CSI4142 Introduction to Data Science

Midterm 2019

1.   Declare the grain of your data mart.                                                                                    (5)

The grain is expressed as consisting oftwo aspects:

- the purchase of tickets by an individual customer, to attend one event at one location, on a specific date (3 marks) as well as

- the refund of a ticket to an individual customer, who cancel attending one event at one location, on a specific date (2 marks)


2.   Identify two measures (or so-called facts) in your data mart.                                              (5)

Ifthe grain is correct, then the measures are: number of tickets, individual price, total price Ifthe grain is incorrect, they may have something like individual price, ticket (y/n)


3.   Identify the following two types of dimensions in your data mart and clearly explain your

choices.

a.   A role-playing dimension                                                                             (3)

b.   A slowly changing dimension                                                                      (2)

a. Date is a role-playing dimension, since we have purchase date, event date and potentially refund date.

b. Customer is slowly changing on address and demographic details such as marital status



Consider the following two queries:

a) How does the ticket sales for jazz concerts in February 2019 compare with the ticket sales

for the annual jazz festival that takes place during July and August?

For instance, the query may return thefollowing results:

Feb 2019

Summer

2018

Summer

2017

Summer

2016

100

2300

3400

1400

b) How does the ticket sales of jazz concerts in February 2019 compare with the ticket sales for jazz concerts in December 2018?

(For instance, the query may show that the total number of tickets soldfor February 2019 was 100, while the salesfor December 2018 was 800.)


4.   Give an example ofone aggregate (or cube) that you would build to speed up both query (a) and query (b).                                                                                                                       (5)

We need an aggregate on Month that sums on the number oftickets sold.

Thejazz concert occurs in the “where” statement ofthe SQL query.



c)  What are the names and brands of the six tents that had the lowest volume of sales, in Ontario, during the week starting on 25 June 2017? That is, we want to determine which tents did not sell well during the so-called “peak” summer selling season.

For instance, one may Determine that the Spark 1 tent by MEC was only sold 5 times, and so on.

d)  What was the most popular colour for women’s hats, as sold during 2017? Here, we use the term “popular” to refer to the volume of sales and we further assume that we only have hats with colours as contained in the set {red, black, lilac, green}. Strangely, we did not have any hats in stock of any other colour.

e)  Name the five (5) bicycles are the most popular in the Ottawa store, in terms of the total

number of sales in 2017, when compared to the sales of bicycles throughout Canada. For instance, the Ghost Trekking 5 bicycle was the most popular seller in the Ottawa store, during 2017, However, it was ranked 3rdpopular in Canada during the same period oftime.


5.   Suppose that some of the data about ticket purchases in 2010 contain many missing values. Specifically, the seating preferences of many customers were not captured. Also, for many events, Canada Ticks did not records the details of the principle actors and artists. Explain how you would handle such omissions. (5)

See slide 31 of topic 3, data staging.

Customer seating preferences that are more recent may be used to propagate the older data from the same customers. For the events, one may look at the historic data in other sources

(such as the actors’ payrolls) to get this information.

Practically, this is older data, so we could just ignore it.


6.   Suppose that Artist is a slowly changing dimension on marital status and that we wish to keep the history of changes. Explain how you would implement this change in your data mart.                                                                                                                                 (5)

See slides 41-50 ofthe data staging slides.

Ifwe want to keep history, we would add a column (Type 3) or a dimension (Type 4).



7.   There are three ways to speed up the incremental load cycle. Discuss how these three approaches may be used, with reference to the Canada Ticks data mart. (5)

Refer to slide 51 of data staging.

- More frequent loading: instead of say daily, load every hour.

- Partitioned files and indexes: partition the data so that we can load across partitions. Indexes such as star joins and bitmaps may be used.

- Parallel processing: we can use parallelism, etc., in order to speed up the load cycle


8.   The analytical cycle for Business Intelligence consists of five steps. Explain these five steps, using your own example against the Canada Ticks data mart.


Refer to slide 5 ofTopic 4: Data Analytics

Students should mention these 5 steps and use their own example.

For instance, if the sales goes down, we need to understand why (e.g. too pricy, snowstorm, other events, etc).

Similarly, if events are sold out, we need to understand why