闪电代写 -代写CS作业_CS代写_Finance代写_Economic代写_Statistics代写_代码代做_IT代写_加急帮助

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

MAT700

Mathematical Methods for Data Mining

2020/21

1. (a) i. What is supervised learning? Give an example.

ii. What is unsupervised learning? Give an example.

iii. What is semi-supervised learning? Give an example.

(b) Prove that the edit distance is indeed a distance measure.

(c) Suppose there is a repository of 512 documents, and word w appears in 32 of them. In a particular document D, the maximum number of occurrences of a word is 15. What is the TF IDF score

for w in D if that word appears six times in D?

(d) i. Prove that the Jaccard similarity of bags is always less than or equal to 1/2.

ii. Compute the Jaccard similarities of the following two bags:

一a, b, b, c, c, c, d, d, d, d, e , e, e, e, e{ and

一x, x, e, d, d, c, c, b, b, b, b, a, a, a, a, a{.

2. (a) Describe the na¨ıve Bayesian classiﬁcation.

(b) Explain how any hierarchical clustering algorithm works and how we choose

which two clusters to merge.

(c) During testing of a Spam/NotSpam classiﬁer on a test set of 1000 e-mail messages with 200 spam messages and 800 non-spam messages you found

that 180 spam messages were correctly classiﬁed as spam, and 600 non-spam messages were correctly classiﬁed as not spam. Calculate the precision, recall and F-measure of this classiﬁer.

(d) In a class of 250 students, suppose that the birthdays are independent and uniformly distributed over the 365 days of the year. We are interested in the event that some students have their birthday on the same day. Calculate the expectation of the number of triples of students having

the same birthday.

3. (a) What are spider traps and what problems do they cause in PageRank calculations? Explain how the improved PageRank model:

ν = }(1 │ β) + βM╱ ν

overcomes the problem of spider traps.

(b) Compute the PageRank of each page in Fig. 1, assuming β = 0.8

(c) For a given data set of 3-D input data, we apply the SVM (support vector machine) learning algorithm and therefore achieve an optimal decision plane:

x1 + 2x2 + 2x3 + 3 = 0

i. What is the margin of this SVM?

ii. For three examples in the training set, (-1,-1,-2) with label -1, (2,-2,-1) with label -1, and (-2,-2,2) with label +1, tell whether they are support vectors of the decision plane.

4. (a) What is the unnormalized graph Laplacian operator L for a weighted graph? [4]

(b) Prove that if a function f on a connected weighted graph belongs to the kernel of the unnormalized graph Laplacian operator, i.e. Lf = 0, then f is a constant function.

(d) What is the kNN classiﬁcation algorithm? Explain how to weight the contributions of the neighbours.

(e) Formulate the k-means algorithm. What is the objective function of the k-means algorithm? Prove the convergence of the k-means algorithm.