B351 / Q351 Assignment 5

发布时间：2022-11-08

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

Assignment 5

B351 / Q351

1 Summary

● Gain a basic understanding of machine learning

● Implement entropy, information gain, and classiﬁcation methods for a decision tree algorithm, in addition to seeing how to recursively build a decision tree from an input node.

We will be using Python 3, so be sure you aren’t using older versions. Code that will compile in Python 2.7 may not compile in Python 3. See our installation guide for help removing old versions and installing Python 3.

This assignment must be completed individually. You must submit your own ﬁles. Any shared code will not be tolerated and those involved will be subjected to the university’s cheating and plagiarism policy. If you discuss ideas, each person must write a comment at the top of the assignment ﬁle naming the students with whom ideas were shared.

2 Background

Decision Tree Learning is a machine learning algorithm often used for classiﬁcation and regression problems. Decision trees oﬀer several advantages over other algorithms, such as ease of implementation and reasonable interpretability, while facing several drawbacks, such as a tendency towards overﬁtting. Our reasoning behind choosing decision trees as the subject for this assignment is that they require minimal mathematical background as compared to other machine learning methods (Linear Regression, Neural Networks, etc). In this assignment, we hope to provide you with some general background knowledge regarding machine learning, as well as give you a chance to work with a real, commonly used machine learning algorithm.

In order to discuss decision trees, it will help to have a common background on machine learning terminology. The following videos provide some background on machine learning, and we highly recommend you watch them. (Having this baseline of machine learning knowledge is a prerequisite for a good machine learning-related ﬁnal project!)

1. Machine Learning Basics: What is Machine Learning? (7:51)

2. Introduction to Supervised Learning (12:29)

Decision trees can be used for both classiﬁcation and regression problems, but for this assignment, we will be limiting our scope to classiﬁcation. Even further, we limit the type of data which our algorithm can receive to categorical data.

How does a decision tree work? There are two main phases:

1. Learning the tree

2. Classifying data

2.1 Learning the Tree

Learning the tree is a recursive process. We start with a root node, which contains all the data points. We will use some criterion to decide which attribute to use to split the data into 2 subnodes. Repeat this process at each of the subnodes, using the data they receive from their parent node. The process continues until a node that has all the data of the same class is reached, or has identical attributes. Your job is to implement the criterion that decides what to split. For implementation details of the full algorithm itself to learn the tree, refer to the train function at the bottom of a5 .py.

In the textbook, there are detailed explanations on how to build a decision tree. For further information, please refer to section 8.4 (page 184) in the book (you can ﬁnd the book in syllabus).

2.2 Classifying Data

Once the decision tree is built, classifying new datapoints is trivial. Simply propagate that datapoint through the tree, and when you arrive at a leaf node, return the class of the most common data in that node.

3 Programming Component

In this assignment’s programming component, instead of building a whole decision tree, you will just be implementing speciﬁc functions that are used in the creation of a decision tree.

3.1 Data Structures

You will utilize the following classes. You should read and understand this section before embarking on the assignment.

3.1.1 a5.py

To help with your implementations, we imported the math package so you can easily access mathematical functions. We also provide a useful function called unique, based on numpy’s unique function, which takes an iterable and returns

● a list of the unique items from that iterable

● a corresponding list of the counts of the number of times each item appeared in the original iterable

Example: unique([5,3,5,2,3,4,2,5,4,2]) would return ([2,3,4,5] , [3,2,2,3])

You will also see the functions for entropy and information gain. It is your job to implement these, and their criteria is explained under the Objectives section of this documentation.

3.1.2 OtherKey Class

This class is in the a5 .py ﬁle. An OtherKey object is a special key that represents any unexpected value for an attribute in question. This will be useful in the classify method.

3.1.3 Node Class

This class is in the a5 .py ﬁle. Node is a node in a decision tree. It keeps track of the attribute split for its children. It encapsulates the following data members:

● attribute - the attribute used to split at the Node. It is also a key used in a data point’s dictionary.

● children - a dictionary mapping possible values of the above attribute to child nodes for recursive classiﬁcation

● classification - the selected class for a leaf node. This will be None for non-leaf nodes.

The following are Node’s provided object methods:

1. init (self, attribute=None, children=(} , classification=None) - constructor for the Node class

2. repr (self) - provides a string-based representation of the tree under any Node, suitable both for a human to read and for a machine to use to reconstruct the tree.

3. classify point(self, point) - classiﬁes a given point. It is your job to implement this, and its criteria is explained under the Objectives section.

4. train(self, points, labels) - trains the decision tree. You won’t need to do anything with this, but it will give you an idea on how the functions you’ll be implementing will come together to create a decision tree.

3.1.4 KNN Classiﬁer Class

This class is in the a5 .py ﬁle. This class contains all the needed methods for classiﬁcation using the KNN algorithm. It encapsulates the following data members:

● k - the ’K’ in K-NN. used to determine the number of neighbors to use.

The following are KNN Classiﬁer’s provided object methods:

1. init (self, k) - constructor for the KNN Classiﬁer class

2. calc euclidean distance(self, point a, point b) - calculates the euclidean distance of two points. It is your job to implement this, and its criteria is explained under the Objectives section.

3. get top label(self, top k labels) - picks the most frequent label in a list of labels, generally the closest k labels. It is your job to implement this, and its criteria is explained under the Objectives section.

4. classify point(self, point, training points, training labels)

- classiﬁes a given point using the training points and labels given. It is your job to implement this, and its criteria is explained under the Objectives section.

3.1.5 debug.py

This ﬁle aims to help you see whether or not your functions are correct and get the intended entropy, information gain, and classiﬁcation values when they’re run on the given examples. Further debugging is encouraged to be done on your own, but the grading tool will also provide some feedback as to what you might be missing. Simply run the main function at the bottom of the ﬁle to check your results compared to the intended answers.

3.2 Objective

As stated before, the actual creation of the decision tree has been implemented for you already. Your goal is to provide the implementations to the following functions:

1. calc entropy

2. calc information gain

3. Node .classify point

4. KNN Classifier .calc euclidean distance

5. KNN Classifier .get top label

6. KNN Classifier .classify point

to the following speciﬁcations:

3.2.1 calc entropy(classiﬁcations)

Entropy is a concept from information theory, which in vague terms, describes the amount of information missing from a system. This function should use the information contained in the given classiﬁcations to determine the entropy value for a dataset in a decision tree. It should utilize the entropy formula (as seen in lecture), which involves the probability of each class occurring.

● classifications: a list of the classiﬁcations that are assigned to the points of a dataset

3.2.2 calc information gain(parent classiﬁcations, classiﬁcations by val, val freqs)

Information Gain is a metric for choosing decision tree splits, returning a value based on the diﬀerence between the entropy of the parent and the normalized entropy of all the children.

● parent classifications: a list of the classiﬁcations for the points of the parent node

● classifications by val: a dictionary of each unique value of an attribute mapped to a list of the classiﬁcations of each data point with that unique value.

Example: From the sunburn example in class, classifications by val for the attribute ”Hair” would be (’Blonde’:[’Burn’, ’None’, ’Burn’, ’None’], ’Brown’:[’None’, ’None’, ’None’], ’Red’: [’Burn’]}

● val freqs: a dictionary of each unique value of an attribute mapped to the number of times that value occurs

It may be helpful to refresh on Python dictionary methods when implementing this function.

3.2.3 Node.classify point(self, point)

Write a function that classiﬁes a given point going through the decision tree process. It should:

1. If the node has a classiﬁcation, return that classiﬁcation

Otherwise:

1. Get the value of the node’s attribute at the point

2. If this value maps to one of the node’s child nodes, get the child node corresponding to the value

3. Otherwise, get the child node corresponding to OTHER

4. Recursively call classify on the child node using the provided point argument

● point: a dictionary of each attribute for the point mapped to the attribute’s value at that point

Example: From the sunburn example, if we chose ”Annie” as our point, it would look like (’Hair’:’Blonde’, ’Height’:’Short’, ’Weight’:’Average’, ’Lotion’:’No’}

3.2.4 KNN Classiﬁer.calc euclidean distance(self, point1, point2)

Write a function that calculates the euclidean distance between two given points.

● point a & point b: a 2-tuple of ﬂoats representing a point in the classiﬁers data set. Example: (x, y) where x,y are ﬂoats.

3.2.5 KNN Classiﬁer.get top label(self, top k labels)

Write a function that chooses the most frequent label from a list of labels.

● top k labels: a list of labels in the classiﬁers data set. When calling pick label, this needs to be the closest k labels to the point you’re trying to classify.

3.2.6 KNN Classiﬁer.classify point(self, training labels)

point, training points,

Write a function that classiﬁes a given point in the given data set going through the K-nearest neighbors algorithm. It should:

1. Order the given sample points by distance, calling the calc euclidean distance function to ﬁnd the distance from the sample point to the new point, point.

2. Find the k-nearest neighbors using the above ordering

3. Call get top label on the list of labels for the k-nearest neighbors in order to determine the most frequent occurring label.

4. Return the result of pick label exactly.

● point: the point we are wanting to classify

● training points: sample points from a data set. this is a list of 2-tuples representing (x, y) points.

● training labels: labels of the training points. will be the same length n as training points. for each i from 0 to n, training labels[i] is the label of training points[i]

4 Grading

Problems are weighted according to the percentage assigned in the problem title. For each problem, points will be given based on the criteria met in the corresponding table on the next page and whether your solutions pass the test cases on the online grader.

In addition, the students in the tool-assisted group will have their solutions assessed for clarity (15 points), control ﬂow and time complexity (5 points),

and making use of clean and suitable syntax features (5 points).

Finally, remember that this is an individual assignment.

4.1 calc entropy (20%)

Criteria	Points
Counts how many times each class in classifications occurs	5
Gets the total number of classiﬁcations	2
Goes through each count and gets the probability distribution	5
Returns the entropy using the entropy formula used in lecture, utilizing the probability distributions found	8
Total	20

4.2 clac information gain (20%)

Criteria	Points
Gets the total number of data points	2.5
Gets the weighted entropy by going through each value and their classiﬁcations (a) gets the entropy at that value (the child entropy) (b) ﬁnds the probability of that value occurring	5 5
Gets the entropy of the parent	2.5
Returns the information gain using the information gain formula used in lecture	5
Total	20

4.3 classify point (10%)

Criteria	Points
If the node has a classiﬁcation, returns that classiﬁcation	2
Otherwise, Gets the value of the node’s attribute at the point	2
If this value maps to one of the node’s child nodes, get the child node corresponding to the value	2
Otherwise, get the child node corresponding to OTHER	2
Recursively calls classify on the child node using the provided point argument.	2
Total	10