COMP7103 Assignment 1

Question 1 Data Preprocessing [15%]

Consider a numerical attribute with the following values. It is known that the range of values for the attribute is [0, 20] and it is always an integer.

0 3 5 6 8 12 15 15 15 16 18 19 19 19 20

a) Convert the attribute into an ordinal attribute by splitting the data into 3 bins of equal interval width. State the ranges of the 3 bins.

b) Convert the attribute into an ordinal attribute by splitting the data into 3 bins of equal frequency. State the ranges of the 3 bins.

c) Apply min-max normalization to this attribute.
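
The equal-width binning and min-max steps above can be sketched in Python as follows. This is a minimal illustration, not a model answer: the helper names `equal_width_bin` and `min_max` are hypothetical, and it assumes the bins and the normalization are computed from the stated range [0, 20] rather than from the observed minimum and maximum (the two happen to coincide for these values).

```python
def equal_width_bin(x, lo, hi, k):
    """Return the 0-based index of the equal-width bin that contains x."""
    width = (hi - lo) / k
    # The upper end of the range belongs to the last bin, so clamp the index.
    return min(int((x - lo) / width), k - 1)


def min_max(x, lo, hi):
    """Rescale x linearly from [lo, hi] to [0, 1]."""
    return (x - lo) / (hi - lo)


values = [0, 3, 5, 6, 8, 12, 15, 15, 15, 16, 18, 19, 19, 19, 20]
LO, HI = 0, 20  # assumption: use the known attribute range

bins = [equal_width_bin(v, LO, HI, 3) for v in values]
scaled = [min_max(v, LO, HI) for v in values]
```

Equal-frequency binning is left out on purpose: the repeated values (15 and 19) mean the bin boundaries depend on how ties are handled, which is a choice to be justified in the answer.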

Question 2 Metric Axioms [20%]

Consider a set of document data that records the number of occurrences of n different words in each of the documents.

a) Suppose a distance function d is defined for the document data as d(p, q) = arccos(d_COS(p, q)), where d_COS(p, q) is the cosine similarity of p and q. Validate the distance measure against each of the criteria in the metric axioms (Lecture Notes Chapter 2 p.64).

You may use the following inequalities in your answer. Proof is required for any other inequalities used.

⚫ ∀a, b ∈ ℝ, |a| + |b| ≥ |a + b| (triangle inequality)

⚫ ∀a, b ∈ ℝ², ‖a‖₂² ‖b‖₂² ≥ (a ⋅ b)² (Cauchy-Schwarz inequality)

⚫ Denote by ∠AB the angle between vectors A and B; then |∠AC − ∠CB| ≤ ∠AB ≤ ∠AC + ∠CB (triangle inequality for angles)

b) Suggest one transformation to the document data so that the distance function in part a) satisfies the metric axioms. Explain your answer.
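
For experimenting with the distance function from part a), a small Python sketch may help. The function names and the three example word-count vectors are hypothetical, and the code assumes every document vector is non-zero (the cosine similarity is undefined otherwise). The clamp to [-1, 1] only guards against floating-point rounding before calling `math.acos`.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def angular_distance(p, q):
    """d(p, q) = arccos(cosine similarity), in radians."""
    c = cosine_similarity(p, q)
    c = max(-1.0, min(1.0, c))  # guard against rounding slightly outside [-1, 1]
    return math.acos(c)


# Hypothetical word-count vectors for three documents.
doc1 = [1, 2, 0]
doc2 = [2, 4, 0]  # same direction as doc1, twice the counts
doc3 = [0, 0, 3]  # shares no words with doc1

d12 = angular_distance(doc1, doc2)
d13 = angular_distance(doc1, doc3)
```

Comparing `d12` with `d13` shows how the measure treats documents whose counts differ only by a scale factor, which is worth keeping in mind when checking each axiom.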

Question 3 Classification [30%]

Consider the training and testing samples shown in Table 1 for a binary classification problem.

a) What is the entropy of this collection of training examples with respect to the class attribute?

b) Build a decision tree using entropy as the impurity measure, with the pre-pruning criterion that splitting stops when the information gain is < 0.1. Show your steps.

c) Using the testing data shown in Table 1 as the test set, show the confusion matrix of your classifier.

d) With respect to the ‘+’ class, what are the precision and recall of your classifier?

Training Set

Record A B C Class

Testing Set

Record A B C Class

X X X X X X Y Y X X X X Y Y Y Y

−

− + + + +

Table 1 dataset for a binary classification problem
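
The quantities asked for in this question (entropy, information gain, precision and recall) can be computed with a few short helpers. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, entropy is taken in bits (log base 2), and the counts in the usage lines are made up for demonstration and are not taken from Table 1.

```python
import math


def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    # Classes with zero count contribute nothing (0 * log 0 is treated as 0).
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


def precision_recall(tp, fp, fn):
    """Precision and recall for the positive class from confusion-matrix counts."""
    return tp / (tp + fp), tp / (tp + fn)


# Made-up counts, for demonstration only (not the values in Table 1).
h = entropy([5, 5])                                   # 5 '+' and 5 '-' records
gain = information_gain([5, 5], [[5, 0], [0, 5]])     # a perfectly separating split
precision, recall = precision_recall(tp=3, fp=1, fn=2)
```

Under the pre-pruning rule in part b), a candidate split would be rejected whenever `information_gain` returns a value below 0.1.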

Question 4 Splitting [20%]

Consider a dataset consisting of 150 instances of data, with a single attribute a and a class attribute x. There are three possible values for x (A, B or C). Table 2 summarizes the dataset, showing the number of instances per class label for every value of a that appears in the dataset.

Suppose we want to predict the class attribute x using the numerical attribute a. Compare all possible binary splits using GINI as the impurity measure, and derive the best binary split point. Show clearly all split points considered and the corresponding GINI.

a	x = A	x = B	x = C
1	30	0	0
5	20	7	1
9	0	19	4
11	0	22	11
13	0	2	34

Table 2 summary of 150 instances of data
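
One way to check the hand calculation is to evaluate the weighted GINI of each candidate split programmatically. The sketch below uses the counts from Table 2. It assumes that candidate split points are the midpoints between consecutive distinct values of a and that a split sends a ≤ threshold to the left child; the lecture notes may use a different convention, and the helper names are hypothetical.

```python
def gini(counts):
    """GINI impurity of a node given its per-class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Table 2: value of a -> (count of x = A, x = B, x = C)
table2 = {
    1: (30, 0, 0),
    5: (20, 7, 1),
    9: (0, 19, 4),
    11: (0, 22, 11),
    13: (0, 2, 34),
}


def split_gini(table, threshold):
    """Size-weighted GINI of the two children for the split a <= threshold."""
    left, right = [0, 0, 0], [0, 0, 0]
    for a, counts in table.items():
        side = left if a <= threshold else right
        for i, c in enumerate(counts):
            side[i] += c
    n_left, n_right = sum(left), sum(right)
    return (n_left * gini(left) + n_right * gini(right)) / (n_left + n_right)


values = sorted(table2)
# Assumption: candidate thresholds are midpoints between adjacent values of a.
candidates = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
results = {t: split_gini(table2, t) for t in candidates}
best = min(results, key=results.get)

for t in candidates:
    print(f"a <= {t}: weighted GINI = {results[t]:.4f}")
```

The printed values are meant for cross-checking the steps shown in a written answer, not for replacing them.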

Question 5 Weka [15%]

The Abalone Data Set consists of measurements of abalones and their age (judged by counting the number of rings under a microscope).

Download the dataset from https://archive.ics.uci.edu/ml/datasets/Abalone and read the description. A copy of the dataset and an extract of the description are also available on Moodle.

Construct a new dataset with 5 attributes and a class label as described in Table 3, containing all data from the given dataset. Make sure the data could be imported into Weka so that Weka can identify the data type of the attributes correctly.

Attribute	Data Type	Note
Sex	nominal	Same as “Sex”
Length	numerical	Same as “Length”
Diameter	numerical	Same as “Diameter”
Height	numerical	Same as “Height”
Weight	numerical	Same as “Whole weight”
AgeGroup	nominal (class attribute)	“A” if Rings ≤ 5; “B” if 5 < Rings ≤ 10; “C” if 10 < Rings ≤ 15; “D” if Rings > 15.

Table 3 Attributes of new dataset
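
The mapping in Table 3 can be sketched as a small Python preprocessing step. The helper names `age_group` and `convert_row` are hypothetical. The code assumes the raw file is comma-separated with columns in the order Sex, Length, Diameter, Height, Whole weight, Shucked weight, Viscera weight, Shell weight, Rings; check this against the dataset description before relying on it. The sample row is illustrative only.

```python
def age_group(rings):
    """Map a ring count to the AgeGroup label defined in Table 3."""
    if rings <= 5:
        return "A"
    if rings <= 10:
        return "B"
    if rings <= 15:
        return "C"
    return "D"


def convert_row(raw):
    """Keep the five attributes of Table 3 and replace Rings with AgeGroup.

    Assumes raw = [Sex, Length, Diameter, Height, Whole weight,
                   Shucked weight, Viscera weight, Shell weight, Rings].
    """
    sex, length, diameter, height, whole_weight = raw[:5]
    rings = int(raw[8])
    return [sex, length, diameter, height, whole_weight, age_group(rings)]


# Illustrative row in the assumed raw layout.
sample = "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15".split(",")
converted = convert_row(sample)
```

Writing each converted row as a comma-separated line after an ARFF header is one way to produce a file Weka can load; the header itself is what part a) asks for.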

a) Construct the header of an ARFF file for the preprocessed dataset. Show all parts before “@DATA” .

b) Give a screenshot of the histogram of all attributes with respect to the class label AgeGroup in Weka.

c) Use CVParameterSelection in Weka with the J48 algorithm, picking the value of C among 5 values from 0.1 to 0.5. Choose 10-fold cross-validation in both the test options and the options in CVParameterSelection. Give all classifier output before “J48 pruned tree” .

d) Using the “J48 pruned tree” from the classifier output in part c), classify the instance of data shown in Table 4. Clearly show the section of the tree involved in your answer.