COMP20008 Elements of Data Processing Semester One 2019 Exam
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit
Semester One 2019 Exam
Computing and Information Systems
COMP20008
Elements of Data Processing
Formulae and Notation
Euclidean distance: d(x, y) =_ (xi - yi )2
Pearson’s correlation coefficient: r = ^ (1(_年2 ^(_年¯))(yi
where = xi and y¯ = yi
Entropy: H(X) = pi log2 pi
where pi is the proportion of points in the ith category.
Conditional entropy: H(Y }X) = 年ex p(x)H(Y }X = x)
where < is the set of all possible categories for X and }< } is the number of categories.
Mutual information:
MI(X, Y) = H(Y) - H(Y }X) = H(X) - H(X}Y)
Normalised mutual information:
NMI(X, Y) = min
Accuracy: A = where
TP is number of true positives
TN is number of true negatives
FP is number of false positives
FN is number of false negatives
Set Union: X x Y is the union of sets X and Y (elements in X or in Y or in both X and Y)
Set intersection: X u Y is the intersection of sets X and Y (elements in both X
and Y)
1. This question concerns representations of data in XML/JSON.
(a) (0.5 mark) Consider the XML data below. Is it well-formed? Explain your answer.
<?xml version="1 .0"?>
<subject xmlns:uom="http://UoM/offer/"
xmlns:open_u="http://OpenUniversity/subjects/" >
COMP20008</subject>
<semester>1</semester>
<year/>
<name> Elements of Data Processing</name>
(b) (0.5 mark) Explain the purpose of ‘uom’ and ‘open u’ in the XML data above.
(c) (1 mark) Provide an effective representation of the following data as a list of ‘person’ records in JSON format.
Name |
Age |
Interests |
Tom |
45 |
Gardening, Bird watching |
Jack |
34 |
Movies, Video Games |
Susan |
29 |
Cycling, Chess |
2. Consider the task of processing readings from multiple temperature sensors.
(a) (1 mark) Explain the difference between readings which are noise and readings which are outliers.
(b) (2 marks) Using a simple example, explain two ways in which using a histogram for outlier detection in temperature sensor data might provide misleading results.
(c) Consider the following dataset of readings from a set of 10 temperature sensors collected at a particular time (one reading per sensor).
175, 179, 5, 185, 190, 191, 193, 193, 201, 33
i. (1 mark) Calculate the interquartile range (IQR) of this data.
ii. (1 mark) Explain why readings 5 and 33 would be categorised as outliers in a Tukey boxplot representation.
iii. (2 marks) There is in fact one sensor reading that has a missing value and was not shown above. Suggest an imputation method for this reading, in light of the data above, and justify your choice.
3. Consider a recommender system for an online bookstore Books R Us which records ratings of books from users, and makes recommendations to users about books they might be interested to purchase. The system uses a popu- larity based strategy for its recommendations, randomly recommending a book from the top 50 books. The top 50 books are determined according to total number of purchases in the store during the last month.
(a) (2 marks) Explain one advantage and one disadvantage of this popularity based strategy, compared to using an item based collaborative filtering strategy for making recommendations. Explain any assumptions made.
(b) (2 marks) A data science consultant suggests to Books R Us that an alter- native strategy would be to train a decision tree classification model to predict whether a user will like a particular book. The features of the classification model would include information about the customer such as (age, gender, lo- cation) and also information about the book (cost, topic, year of publication). The class label to be predicted would be nlike, not like{.
Explain one advantage and one disadvantage of this scheme. Explain any assumptions made.
4. Consider a dataset of four data instances nx1 , x2 , x3 , x4 { which has the follow- ing pairwise dissimilarity (distance) matrix.
|
1 |
2 |
3 |
4 |
1 |
0 |
1 |
4 |
5 |
2 |
1 |
0 |
2 |
6 |
3 |
4 |
2 |
0 |
3 |
4 |
5 |
6 |
3 |
0 |
(a) (3 marks) Apply single link agglomerative hierarchical clustering to group the data instances represented by this matrix. Draw the resulting dendrogram. Show all working.
(b) (2 marks) Explain one general advantage and one general disadvantage of single link agglomerative hierarchical clustering compared to k-means cluster- ing.
5. Consider the task of detecting a strong negative (linear) correlation between two features, using visualisation.
(a) (2 marks) Explain, with the aid of a simple diagram, how a parallel co- ordinates plot could help a user detect a strong negative (linear) correlation between two features.
(b) (1 mark) Explain, with the aid of a simple diagram, how a scatter plot could help a user detect a strong negative (linear) correlation between two features.
(c) (1 mark) Which of the two plots do you believe would be more effective for this task? Explain and justify your answer.
6. (2 marks) Consider a dataset D1 with 4000 instances and 10 features. Consider two possible strategies to cluster this dataset, using k = 3.
· (S1) Apply PCA to D1 and retain the top two principal components. Let
the resulting 2-dimensional dataset be D2 . Apply k-means on D2 .
· (S2) Apply k-means directly on D1 .
In what circumstances would you expect the first strategy (S1) to be more effective than the second strategy (S2)? In what circumstances would you expect the second strategy (S2) to be more effective than the first strategy (S1)? Explain any assumptions made.
7. (4 marks) Given a dataset with four classes, A, B, C and D, suppose the root node of a decision tree has 125 instances of class A, 355 instances of class B,
200 instances of class C and 500 instances of class D. Consider a candidate split of this root node into four children, using a numeric feature F with values from 0 to 100.
· The first child contains instances with 0 ● F < 20 and has class propor-
tions (50 class A and 75 class B),
· The second child contains instances with 20 ● F < 40 and has class
proportions (50 class A and 225 class B),
· The third child contains instances with 40 ● F < 60 and has class
proportions (25 class A, 30 class B, 50 class C, 100 class D )
· The fourth child contains instances with 60 ● F ● 100 and has class
proportions (25 class B, 150 class C, 400 class D)
Using these numbers, write an expression for computing the utility of this candidate split using the information gain criterion. The expression may be complex and you do not need to simplify it to a single number.
8. A data scientist collects a large number of data pairs (age, height) of people from birth to 80 years old. She computes a Pearson correlation coefficient between age and height.
(a) (1 mark) Would you expect the correlation to be positive or negative? Why?
(b) (1 mark) Would you expect the correlation to be similar in value if it was computed using a different set of people? Explain your answer and any assumptions made.
9. (3 marks) Suppose I have a dataset of 5000 customers (instances). Each cus- tomer is described by 10 features. There is a binary class label of nhighvalue, lowvalue{.
Suppose we wish to empirically decide whether using k-NN will be a more accurate model than decision tree, for predicting the value of a new customer. Explain the steps one should follow in order to make this decision. Explain any assumptions made.
10. Business X and Business Y have decided to conduct a joint marketing cam- paign. For this marketing campaign, they first need to determine how many customers they have in common (how many people are in the customer list of both businesses). They involved a trusted 3rd party, Company Z, and imple- mented the following 3-party privacy preserving protocol, making use of the SHA-256 one way hashing function.
#In the following, the ’+’ symbol indicates string concatenation (joining two strings)
#Business X does the following
SetX=empty
For each customer at Business X
SetX=SetX n SHA-256(“First Name”+“Last Name”)
Send SetX to Company Z
#Business Y does the following
SetY=empty
For each customer at Business Y
SetY=SetY n SHA-256(“First Name”+“Last Name”)
Send SetY to Company Z
#Company Z does the following
result=count(SetX u SetY)
Share result with Business X and Business Y
(a) (1 mark) Suppose Company Z is malicious and intends to compromise the privacy of the data. Explain two possible privacy attacks that Company Z can perform.
(b) (1 mark) Explain what changes to the protocol would be required for resisting these two possible attacks.
(c) (2 marks) According to the protocol, Company Z can only compute ex- act matches between pairs of records from Business X and Business Y. Propose changes to the protocol so that approximate similarity can be computed in a privacy preserving manner by Company Z.
11. In a data de-duplication scenario, blocking can be used to reduce the cost of finding matching entities.
(a) (1 mark) Using the records below, propose a simple blocking strategy that results in at most two records being in the same block, when applied to the four records below.
Lee Cheng, 7 George St, Carlton
Lee Cheng, 7 George Rd, Brunswick
Jan Cheng, 7 Haddock Rd, Carlton
Jan Lee, 14 George Rd, Carlton
(b) (1 mark) Based on your proposed strategy, show the blocks having two records. (Do not show blocks with only one record.)
(c) (2 marks) Provide two reasons as to why your proposed blocking strategy might perform poorly when applied to a large real dataset.
12. Consider the following dataset which has been anonymised. The quasi-identifiers are nSex,Age,Postcode{ and the sensitive attribute is Diagnosis.
ID Sex Age Postcode Diagnosis
1 |
male |
20-25 |
318* |
skin rash |
2 |
male |
20-25 |
318* |
flu |
3 |
male |
40-45 |
318* |
headache |
4 |
female |
30-35 |
318* |
skin rash |
5 |
male |
20-25 |
318* |
flu |
6 |
female |
30-35 |
318* |
flu |
7 |
female |
30-35 |
318* |
headache |
8 |
male |
40-45 |
318* |
headache |
9 |
male |
40-45 |
318* |
headache |
10 |
male |
20-25 |
318* |
skin rash |
11 |
female |
30-35 |
318* |
headache |
12 |
male |
40-45 |
318* |
headache |
(a) (2 marks) In the context of k-anonymity: Is this data 1-anonymous? Is it 2-anonymous? Is it 3-anonymous? Is it 4-anonymous? Is it 5-anonymous? Is it 6-anonymous? Is it 7-anonymous? Is it 8-anonymous? Explain your answers.
(b) (2 marks) In the context of l-diversity: Is this data 1-diverse? Is it 2- diverse? Is it 3-diverse? Is it 4-diverse? Is it 5-diverse? Is it 5-diverse? Is it 6-diverse? Is it 7-diverse? Is it 8-diverse? Explain your answers.
13. (2 marks) Suppose Bob signs a document with his digital signature. Fred receives the document and changes its contents, but leaves the digital signature unchanged. How could a third party (Alice), know that the document has been modified from its original version, by someone other than Bob?
14. (3 marks) Consider the following quote from William Mougayar about blockchain “Online identity and reputation will be decentralized. We will own the data that belongs to us” .
Using the example of a blockchain for storing education related data, argue 3 distinct reasons why this quote could be true.
2022-07-11