
Winter Examination Period 2022

ECS766P

DATA MINING

Question 1

(a) A sales company needs to perform the following tasks:

1. Predict sales trends for the next fiscal year, based on historical sales data.

2. Record daily sales in the store database and calculate the sum of daily sales.

For each of the above two tasks, explain whether it can be considered a data mining task or not; justify your response.

[4 marks]

(b) You are provided with the below dataset from patient records in a hospital:

Patient ID	Age	Admitted to Emergencies?	Patient Risk
PE067254	24	Yes	High
RF243756	46	No	Very low
TG638475	35	No	Moderate
ED068265	51	Yes	Very high

For each attribute from the above dataset, describe its type and sub-type when applicable, and justify your response.

[8 marks]

(c) Consider the below dataset:

attribute1	attribute2	attribute3	attribute4	attribute5
0	0.2	0	0	0
0	0	1.1	0	0.2
2.2	0	0	0	0
0	0	0	0.7	0

What is the type of this dataset and what is the above representation called? Which particular characteristic does the above dataset have?

[3 marks]

(d) Consider data points x1 = (2, 2) and x2 = (3, 5). Calculate the Manhattan and Supremum distance between x1 and x2.

[4 marks]
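
As an illustration of the two distance measures named in part (d), here is a minimal Python sketch. The function names `manhattan` and `supremum` are illustrative choices, not part of the paper; the points are the ones given in the question.

```python
def manhattan(p, q):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def supremum(p, q):
    """L-infinity (Chebyshev) distance: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(p, q))


x1 = (2, 2)
x2 = (3, 5)

print(manhattan(x1, x2))  # 4
print(supremum(x1, x2))   # 3
```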

(e) Consider a customer satisfaction survey, where customers can use one of the following ratings to indicate their satisfaction: extremely dissatisfied, somewhat dissatisfied, neutral, somewhat satisfied, extremely satisfied. Using an appropriate similarity measure, calculate the similarity between a customer who is extremely satisfied and a customer who is somewhat dissatisfied.

[6 marks]
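
One common way to compare ordinal values is to map each level to its rank, scale the rank difference to [0, 1], and take one minus that distance. The sketch below assumes this particular measure; other similarity measures may be equally acceptable, and the name `ordinal_similarity` is hypothetical.

```python
# Levels listed from lowest to highest, as in the question.
RATINGS = [
    "extremely dissatisfied",
    "somewhat dissatisfied",
    "neutral",
    "somewhat satisfied",
    "extremely satisfied",
]


def ordinal_similarity(a, b, levels=RATINGS):
    """Similarity = 1 - |rank(a) - rank(b)| / (number of levels - 1)."""
    rank_a, rank_b = levels.index(a), levels.index(b)
    distance = abs(rank_a - rank_b) / (len(levels) - 1)
    return 1 - distance


print(ordinal_similarity("extremely satisfied", "somewhat dissatisfied"))  # 0.25
```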

Question 2

(a) Assume a dataset of patient records, which includes several numerical attributes related to patients’ vital signs. Through an initial analysis, we found out that several of these attributes are strongly correlated. Suggest an appropriate method in order to reduce the dimensionality of this dataset.

[4 marks]

(b) A university needs to develop IT infrastructure for two tasks: (1) studying module enrolments for the past decade; (2) registering students to modules. Which data infrastructure and system should the university use for each of the above two tasks and why?

[4 marks]

(c) Consider the following dataset: x = [3, 5, 2, 4, 3, 6, 36, 2, 5, 2]. We would like to perform normalisation on the dataset before any further analysis. Which normalisation method should we use and why?

[3 marks]
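
To see how different normalisation methods treat the values in part (c), the following sketch applies min-max and z-score normalisation to the dataset. It only shows the effect of each method and does not settle which one the question expects; the function names are illustrative, and the z-score here uses the population standard deviation as an assumption.

```python
from statistics import mean, pstdev

x = [3, 5, 2, 4, 3, 6, 36, 2, 5, 2]


def min_max(values):
    """Rescale to [0, 1] using the observed minimum and maximum."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Centre on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


print([round(v, 3) for v in min_max(x)])
print([round(v, 3) for v in z_score(x)])
```

With min-max scaling, the single value 36 maps to 1.0 and every other value lands below 0.12, which is the kind of compression an extreme value causes.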

(d) Consider the below table which shows measurements from various individuals’ systolic blood pressure (SBP) along with their corresponding age:

Age	61	21	25	55	64	56
SBP	139	118	120	125	142	129

Are age and SBP positively or negatively correlated? Justify your response.

[6 marks]
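
The sign of the correlation in part (d) can be checked numerically. The sketch below computes the Pearson correlation coefficient from its definition on the values in the table; the helper name `pearson` is an illustrative choice.

```python
from math import sqrt

age = [61, 21, 25, 55, 64, 56]
sbp = [139, 118, 120, 125, 142, 129]


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


r = pearson(age, sbp)
print(round(r, 2))
```

A positive value of `r` indicates that the two attributes tend to increase together.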

(e) Consider a delivery company, which keeps records of items, customers, and depots.

The company would like to build a data warehouse to analyse sales, and would like to measure numbers of items sold in each transaction and the value of each transaction.

1. Draw a star diagram for the above data warehouse. Populate the dimension tables and fact table with attributes relevant to the task.

2. Assume that the item dimension has 2 levels: line item < category. How many cuboids will the cube contain, including the base and apex cuboids?

3. A query to be processed is for all customers and all depots for a specific item=“tablet-YZ”. Suppose there are 3 materialised cuboids:

• cuboid 1: {category, customer, depot}

• cuboid 2: {line item, customer, depot}

• cuboid 3: {line item=“tablet-YZ”, customer, depot=“Mile End”}

Which of these 3 above cuboids should be selected to process the query? Justify your response.

[8 marks]
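
For sub-question 2, the number of cuboids in a cube is the product, over all dimensions, of the number of hierarchy levels plus one (the extra one being the virtual top level "all"). The sketch below assumes the cube has exactly the three dimensions named in the question (item, customer, depot), with only the item dimension having 2 levels; a star schema with further dimensions would give a different count.

```python
from math import prod


def count_cuboids(levels_per_dimension):
    """Product of (levels + 1) over all dimensions; +1 is the 'all' level."""
    return prod(levels + 1 for levels in levels_per_dimension.values())


# Assumed dimensions: item has 2 levels, customer and depot have 1 level each.
levels = {"item": 2, "customer": 1, "depot": 1}

print(count_cuboids(levels))  # 12
```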

Question 3

(a) Consider the following dataset represented by a table.

ID	Feature 1	Feature 2	Feature 3	Feature 4
0	0	4	2	6
1	11	10	1	13
2	5	7	3	11
3	3	6	0	2
4	10	9	2	6
5	3	6	3	11
6	2	5	1	3
7	9	9	0	2

1. Create a scatter plot to visualise features 1 and 2. Do not use code to generate the plot; it’s fine to draw the scatter plot approximately by hand. What can you say about their (Pearson) correlation coefficient based on this visualisation? You don’t need to compute the correlation coefficient.

2. Compute the frequency of each value of feature 3. What are the modes of this feature?

[7 marks]
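
The frequency count in sub-question 2 can be sketched as follows. The list `feature_3` holds the Feature 3 column of the table for IDs 0 to 7; the variable names are illustrative.

```python
from collections import Counter

# Feature 3 values for IDs 0..7, read from the table above.
feature_3 = [2, 1, 3, 0, 2, 3, 1, 0]

counts = Counter(feature_3)
highest = max(counts.values())
# A mode is any value whose frequency equals the highest frequency.
modes = sorted(value for value, count in counts.items() if count == highest)

print(dict(sorted(counts.items())))  # {0: 2, 1: 2, 2: 2, 3: 2}
print(modes)                         # [0, 1, 2, 3]
```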

(b) Answer the following questions about feature representations.

1. Why can a feature with a large range disrupt a distance-based classifier?

2. Should the test set be used in order to make decisions about which features are important?

[4 marks]

(c) Answer the following questions about nearest neighbour classifiers.

1. Consider a classification dataset that contains the observations x1 = (0.5, 1), x2 = (1, 0.5), x3 = (1, 1). Suppose that their respective classes are y1 = 1, y2 = 1, y3 = 2. Which class would a 1-nearest neighbour classifier assign to a new observation x = (0.5, 0.5)? Show your calculations using the Euclidean distance function.

2. In which classification problems does the number of neighbours k affect whether a k-nearest neighbours classifier needs a tie-breaking policy?

[5 marks]
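
The 1-nearest neighbour calculation in sub-question 1 can be sketched in a few lines of Python. The helper `one_nn` is a hypothetical name; when two training points are equally close, this sketch simply keeps the first one in the list, which is an assumption rather than a prescribed tie-breaking policy.

```python
from math import dist  # Euclidean distance, available from Python 3.8

# (observation, class) pairs from the question.
train = [((0.5, 1.0), 1), ((1.0, 0.5), 1), ((1.0, 1.0), 2)]


def one_nn(query, data):
    """Return the class of the training observation closest to the query."""
    _, label = min(data, key=lambda item: dist(query, item[0]))
    return label


query = (0.5, 0.5)
for point, label in train:
    print(point, label, round(dist(query, point), 3))

print(one_nn(query, train))  # 1
```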

(d) Answer the following questions about clustering algorithms.

1. Suppose a cluster at a certain iteration of the k-means algorithm contains solely the observations x1 = (0.5, 1), x2 = (1, 0.5), x3 = (1, 1). What would be the cluster center of this cluster? Show your calculations.

2. Explain why it is important to minimise the sum of squared errors while also minimising the number of clusters in k-means.

[5 marks]
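
For sub-question 1, the cluster centre in k-means is the coordinate-wise mean of the cluster's observations. The sketch below computes it for the three points given, together with the cluster's sum of squared errors; `centroid` and `sse` are illustrative names.

```python
points = [(0.5, 1.0), (1.0, 0.5), (1.0, 1.0)]


def centroid(pts):
    """Coordinate-wise mean of the observations."""
    n = len(pts)
    return tuple(sum(coords) / n for coords in zip(*pts))


def sse(pts, centre):
    """Sum of squared Euclidean distances from each observation to the centre."""
    return sum(sum((a - c) ** 2 for a, c in zip(p, centre)) for p in pts)


centre = centroid(points)
print(tuple(round(v, 3) for v in centre))  # (0.833, 0.833)
print(round(sse(points, centre), 3))       # 0.333
```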

(e) Prove that a classifier with perfect accuracy on a test set has an F1-score of 1.0 on this same test set.

[4 marks]
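
The quantities involved in part (e) can be written out as follows. This is a sketch for the binary case, using the standard confusion-matrix counts TP, TN, FP, FN (notation assumed here, not taken from the paper):

```latex
\begin{align*}
\text{accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN}, \\
\text{precision} &= \frac{TP}{TP + FP}, \qquad
\text{recall}     = \frac{TP}{TP + FN}, \\
F_1 &= \frac{2 \cdot \text{precision} \cdot \text{recall}}
            {\text{precision} + \text{recall}}
     = \frac{2\,TP}{2\,TP + FP + FN}.
\end{align*}
```

Accuracy equals 1 only when FP = FN = 0, in which case the last expression reduces to 2TP / 2TP = 1, provided the test set contains at least one positive example (TP > 0).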

Question 4

(a) Consider a network composed of four web pages. Suppose page 1 has a link to pages 2 and 3. Suppose page 3 has a link to page 4. Suppose page 4 has a link to page 3.

1. Draw a graph to represent this network. Identify any dead-end nodes.

2. Draw a modified graph with artificial edges from dead-end nodes to every node in the graph, as required by the PageRank algorithm.

3. Present the system of PageRank equations for nodes 1 and 2 on the modified graph drawn previously. Let α denote the probability of teleportation.

[8 marks]
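
The procedure in part (a) can be sketched as a small power iteration. The code below is a minimal illustration, assuming that dead-end nodes receive artificial edges to every node (including themselves) and that α is the teleportation probability; the value α = 0.15, the iteration count, and the function name `pagerank` are assumptions, since the paper leaves α symbolic.

```python
def pagerank(links, alpha=0.15, iterations=200):
    """Power iteration; alpha is the teleportation probability."""
    nodes = sorted(links)
    n = len(nodes)
    # Dead-end nodes get artificial edges to every node in the graph.
    out = {u: (list(vs) if vs else list(nodes)) for u, vs in links.items()}
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Every node receives the teleportation share alpha / n ...
        new_rank = {u: alpha / n for u in nodes}
        # ... plus (1 - alpha) times the rank flowing in along edges.
        for u in nodes:
            share = (1 - alpha) * rank[u] / len(out[u])
            for v in out[u]:
                new_rank[v] += share
        rank = new_rank
    return rank


# Page 1 links to 2 and 3; page 3 links to 4; page 4 links to 3; page 2 is a dead end.
links = {1: [2, 3], 2: [], 3: [4], 4: [3]}
ranks = pagerank(links)
for node, value in ranks.items():
    print(node, round(value, 4))
```

Under these assumptions the converged ranks satisfy, for example, rank(1) = α/4 + (1 − α) · rank(2)/4, because the only edge into node 1 is the artificial one from node 2.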

(b) Explain why frequent itemset mining is a preliminary step in finding strong association rules.

[4 marks]

(c) Consider the following transaction dataset:

2	{1, 2, 3}
3	{1, 2, 4, 5}
4	{1, 3}
5	{1, 4, 5}
6	{2, 4}
7	{1, 3, 4, 5}
8	{3, 5}

1. What is the support count of the itemset {1, 3}? What is the support of the itemset {2, 4}?

2. Is the itemset {1, 3, 4} considered frequent for a support threshold of 20%? Without computing the support of the itemset {1, 4}, is it possible to say whether it is frequent for the same support threshold based on your previous answer? Justify your answers.

3. What is the support of the association rule {1, 3} ⇒ {4}? What is the confidence of this association rule?

[7 marks]
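
The definitions of support count, support, and confidence used in these sub-questions can be sketched as below. The transactions in `toy` are a hypothetical example made up for illustration and are not the dataset from the paper; the function names are likewise illustrative.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support_count(antecedent and consequent) / support_count(antecedent)."""
    both = support_count(set(antecedent) | set(consequent), transactions)
    return both / support_count(antecedent, transactions)


# Hypothetical transactions, for illustration only.
toy = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]

print(support_count({"a", "c"}, toy))            # 2
print(support({"a", "c"}, toy))                  # 0.5
print(round(confidence({"a"}, {"c"}, toy), 3))   # 0.667
```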

(d) Answer the following questions about outlier detection.

1. Describe a heuristic method for using a clustering algorithm for unsupervised outlier detection.

2. Describe the similarities and differences between distance-based and density-based methods for outlier detection.

[6 marks]
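
As one possible illustration of a distance-based approach mentioned in sub-question 2, the sketch below scores each point by the distance to its k-th nearest neighbour, so that isolated points receive high scores. The data in `toy`, the choice k = 2, and the function name are all hypothetical; this is one of several ways such a score can be defined.

```python
from math import dist


def knn_distance_scores(points, k=2):
    """Outlier score of a point = distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores


# Hypothetical data: four points in a tight square and one far away.
toy = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
scores = knn_distance_scores(toy, k=2)

for point, score in zip(toy, scores):
    print(point, round(score, 3))
```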