闪电代写 -代写CS作业_CS代写_Finance代写_Economic代写_Statistics代写_代码代做_IT代写_加急帮助

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

Semester Two 2019 Exam

Computing and Information Systems

COMP20008

Elements of Data Processing

Formulae and Notation

Euclidean distance: d(x．y) =_ (zi - yi )2

Pearson’s correlation coeﬃcient: r = ^ (

where = zi and y¯ = yi

Entropy: F(x) = pi log2 pi

where pi is the proportion of points in the ith category.

Conditional entropy: F(y nx) = zex p(z)F(y nx = z)

where x is the set of all possible categories for x

Mutual information:

IH(x．y) = F(y) - F(y nx) = F(x) - F(xny)

Normalised mutual information:

NIH(x．y) = min

Accuracy: , = where

TP is number of true positives

TN is number of true negatives

AP is number of false positives

AN is number of false negatives

Set Union: x u y is the union of sets x and y (elements in x or in y or in both x and y)

Set intersection: x ∩ y is the intersection of sets x and y (elements in both x

and y)

1. Consider the following XML ﬁle:

<?xml version="1 .0"?>

<URL> https://handbook.unimelb.edu.au/subjects/comp20008 </url> <name> Elements of Data Processing </name>

</subject>

<year/>

(a) (1 mark) Identify the errors present in the XML ﬁle. Modify the XML so that it is well formed.

(b) (1 mark) Is this an example of unstructured, semi-structured or struc- tured data? Explain why.

(d) (0.5 mark) Explain one advantage of JSON over XML.

2. Consider the following temperature data for one week in Melbourne:

12, 15, 18, 13, 51, 10, 16

(a) (1 marks) Calculate the median and interquartile range (IQR) of the data (b) (0.5 mark) Draw a Tukey boxplot of the data.

(d) (1 mark) Suppose we consider the 51 value to be infeasible and discard it. One method for calculating a replacement value is imputation by mean. Considering only the values of the datapoints and not the real- world context of the data, do you think this method would work well for this data? Why or why not?

(e) (1 mark) Consider the real-world context of the data. Do you think imputation by mean is likely to work well? Why or why not?

(f) (0.5 mark) Suggest one alternative imputation approach that might work better than imputation by mean for this dataset

3. Consider the following two plots:

(a) (1.5 mark) Are these likely to have come from the same dataset? Give two reasons why or why not.

(b) (1.5 marks) Consider a dataset with 10000 rows and 500 features. Give three reasons why we might want to apply PCA while analysing the dataset.

4. Consider the following data points:

(0,0), (0,3), (4,0), (4,3)

(a) (1 mark) Calculate the proximity matrix using Euclidean distance be- tween points.

(b) (2 marks) Apply hierarchical clustering to this dataset using the single linkage method. Draw the resulting Dendrogram.

(c) (1 mark) Describe one situation where hierarchical clustering using the complete linkage method might perform better than hierarchical cluster- ing using the single linkage method.

(d) (1 mark) Consider the dataset represented by the scatter plot below, where one class of data is indicated by circular points and another class of data is indicated by square points:

Do you think that hierarchical clustering or K-Means clustering would better separate the two classes? Why?

(e) (1 mark) Describe how clustering can be used to detect outliers in a dataset.

5. Pearson Correlation and Mutual Information are two widely used correlation measures. Assume you have collected the following information for 10,000 Australian adults: Age in years (ranging from 20 to 100) and salary in dollars (ranging from 0 to 100,000).

(a) (1 mark) When you calculate the Pearson correlation between age and salary for this data set, the value you obtain is 0.8. Name two conclusions you can draw from this result.

(b) (2 marks) Pearson Correlation has a number of limitations. Name two of them and illustrate with the example above.

(c) (2 marks) If you want to apply mutual information to this example, the ﬁrst step is to create appropriate bins for the data. Brieﬂy explain why this is necessary and show one suitable strategy of creating such bins for the example above.

(d) (1 mark) Another data wrangler applied Mutual Information to the data set and obtained a value of 0.9. Does this together with the Pearson co-eﬃcient of 0.8 mean we can conclude that age determines the salary of a new employee?

6. Cross-validation is a common experimental design tool.

(a) (1 mark) Brieﬂy explain how 5-fold cross-validation works using a small illustration.

(b) (2 marks) Classiﬁcation data sets can often be imbalanced, i.e. there may be far more training examples of one class versus another. Can cross-validation help in this situation? Why or why not?

7. Decision trees consist of a number of elements. Two of them are internal nodes and leaf nodes.

(a) (1 mark) Explain the diﬀerence between these two node types.

(b) (2 marks) Using decision trees, what happens to a data instance during testing if has an attribute value that was never seen in training? Give two possible scenarios.

8. Consider an online ‘Movie’ recommender system. The system collects users’ ratings for movies and uses collaborative ﬁltering techniques to generate rec- ommendations to the users. Ratings are between 1 and 5 (1, 2, 3, 4, or 5).

(a) (2 marks) Explain the steps an item-based collaborative ﬁltering approach takes to predict the likely rating of a movie M1 by a particular user U1 using the user rating matrix.

(b) (2 marks) Give one advantage and one disadvantage of item-based col- laborative ﬁltering method.s

We want to oﬀer external people aggregate statistics with some global diﬀer- ential privacy guarantee. It has been decided that we will allow people to query the following two aggregate statistics for a speciﬁc movie M1 from the user rating matrix: (1) the total sum of the ratings for M1 . (2) the number of ratings for M1 . In this context,

(d) (1 mark) What are the global sensitivity values for the two aggregate statistics?

9. (4 marks ) Consider two parties A and B. Each has a dataset of names, in- cluding personal and company names of about 1 million records.

Explain a 3-party protocol for a privacy preserving data linkage between A and B which can prevent both dictionary and frequency attacks. Explanations should include both data processing and the roles of each of the three parties in performing the processing.

10. Blockchain technology enables data sharing in a peer to peer system.

(a) (3 marks) Explain how to produce a message digest in a digital signature and give 2 properties of a message digest.

(b) (2 marks) Describe the data structure of the header of a block in a blockchain; and explain how the mechanism of chaining the blocks in a blockchain is a part of the design for supporting data immutability

11. (2 marks) Given the anonymised table below, what is the best achievable k- anonymity and l-diversity for the data? Justify your answer.

Gender (non-sensitive) Subject (non-sensitive) Grade (sensitive)

male	COMP20008	H1
male	COMP20008	H1
female	COMP20008	H1
male	COMP20008	H2
male	COMP20008	H2
female	COMP20008	H3
female	COMP20008	P
male	COMP20003	H1
male	COMP20003	H2
male	COMP20003	P