
COMP9417: A collection of sample exam exercises

August 9, 2022

Note to student: Some of these questions are longer/more difficult than the questions that will be on the actual final exam. With that being said, you should aim to work through and understand all questions here as part of your revision. Note that this is not a sample final exam, it is a collection of possible exercises, and the actual final exam will be shorter.

Question 1

Note: this question was incorporated into the Ensemble learning lab and the code there is much improved. Refer to that lab for solutions. This question requires you to implement the Adaptive Boosting Algorithm from lectures. Use the following code to generate a toy binary classification dataset:

Your data should look like:

(a) By now, you will be familiar with the scikit-learn DecisionTreeClassifier class. Fit decision trees of increasing maximum depth for depths ranging from 1 to 9. Plot the decision boundaries of each of your models in a 3 × 3 grid. You may find the following helper function useful:
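The course-provided helper function is not reproduced here; the following is a minimal sketch of one way to produce the 3 × 3 grid, assuming X is an (n, 2) array of features and y the corresponding labels from the generated dataset:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

fig, axes = plt.subplots(3, 3, figsize=(12, 12))
for depth, ax in zip(range(1, 10), axes.ravel()):
    model = DecisionTreeClassifier(max_depth=depth).fit(X, y)
    # evaluate the model on a fine grid covering the data
    xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                         np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
    zz = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, zz, alpha=0.3)        # decision regions
    ax.scatter(X[:, 0], X[:, 1], c=y, s=10)   # data points
    ax.set_title(f"max_depth = {depth}")
plt.tight_layout()
plt.show()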

(b) We now restrict attention to trees of depth 1. These are the most basic decision trees and are commonly referred to as decision stumps. Consider the adaptive boosting algorithm presented in the ensemble methods lecture notes on slide 50/70. In adaptive boosting, we build a model composed of T weak learners from a set of weak learners. At step t, we pick a model from the set of weak learners that minimises weighted error:

where w_{t−1,i} is the weight at the previous step for observation i, and I{y_i = ŷ_i} is equal to 1 if y_i = ŷ_i and zero otherwise. We do this for a total of T steps, which gives us a boosted model composed of T base classifiers:

where α_t is the weight assigned to the t-th model. Classification is then carried out by assigning a point to the positive class if M(x) > 0 or to the negative class if M(x) < 0. Here we will take the class of weak learners to be the class of decision stumps. You may make use of the ‘sample_weight’ argument in the ‘fit()’ method to assign weights to the individual data points. Write code to build a boosted classifier for T = 15. Demonstrate the performance of your model on the generated dataset by printing out a list of your predictions versus the true class labels. (Note: you may be concerned that the decision tree implementation in scikit-learn does not actually minimise ε_t even when weights are assigned, but we will ignore this detail for the current question.)
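A minimal sketch of the boosting loop is given below (this is not the lab solution, and the exact constants used for α_t and the weight update on the lecture slide may differ slightly from the standard AdaBoost choices used here). It assumes the class labels are coded as −1/+1; boost, boosted_predict, X_train and y_train are placeholder names, and boosted_predict also illustrates the predict logic needed in part (c).

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, T=15):
    n = X.shape[0]
    w = np.ones(n) / n                                # uniform initial weights
    stumps, alphas = [], []
    for t in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)              # weighted fit of a decision stump
        pred = stump.predict(X)
        eps = np.sum(w * (pred != y)) / np.sum(w)     # weighted error of this stump
        eps = np.clip(eps, 1e-10, 1 - 1e-10)          # guard against division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)         # weight assigned to this model
        w = w * np.exp(-alpha * y * pred)             # up-weight misclassified points
        w = w / w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def boosted_predict(X, stumps, alphas):
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.where(scores > 0, 1, -1)                # positive class if M(x) > 0

stumps, alphas = boost(X_train, y_train, T=15)
print(boosted_predict(X_train, stumps, alphas))
print(y_train)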

(c) In this question, we will extend our implementation in (b) to be able to use the plotter function from (a). To do this, we need to implement a boosting model class that has a ‘predict’ method. Once you do this, repeat (b) for T = 2, . . . , 17. Plot the decision boundaries of your 16 models in a 4 × 4 grid. The following template may be useful:

(d) Discuss the differences between bagging and boosting.

Question 2

Note: this question is actually a bit longer and (slightly) more difficult than an exam question, but it is good practice. In this question, you will be required to manually implement Gradient Descent in Python to learn the parameters of a linear regression model. You will make use of the ‘real estate.csv’ dataset, which you can download directly from the course Moodle page. The dataset contains 414 real estate records, each of which contains the following features:

• transactiondate: date of transaction

• age: age of property

• nearestMRT: distance of the property to the nearest MRT station

• nConvenience: number of convenience stores in nearby locations

• latitude

• longitude

The target variable is the property price. The goal is to learn to predict property prices as a function of a subset of the above features.

(a) A number of pre-processing steps are required before we attempt to fit any models to the data. First remove any rows of the data that contain a missing (‘NA’) value. List the indices of the removed data points. Then, delete all features from the dataset apart from: age, nearestMRT and nConvenience.

(b) An important pre-processing step is feature normalisation. Feature normalisation involves rescaling the features so that they all have similar scales. This is also important for algorithms like gradient descent to ensure the convergence of the algorithm. One common technique is called min-max normalisation, in which each feature is scaled to the range [0, 1]. To do this, for each feature we must find the minimum and maximum values in the available sample, and then use the following formula to make the transformation:

x_new = (x − x_min) / (x_max − x_min)

After applying this normalisation, the minimum value of your feature will be 0, and the maximum will be 1. For each of the features, provide the mean value over your dataset.
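A short sketch of this step, assuming the data have already been loaded into a pandas DataFrame named df (the column names are assumed to match the csv):

for col in ['age', 'nearestMRT', 'nConvenience']:
    col_min, col_max = df[col].min(), df[col].max()
    df[col] = (df[col] - col_min) / (col_max - col_min)    # min-max rescale to [0, 1]
print(df[['age', 'nearestMRT', 'nConvenience']].mean())    # report the feature means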

(c) Now that the data is pre-processed, we will create train and test sets to build and then evaluate our models on. Use the first half of observations (in the same order as in the original csv file excluding those you removed in the previous question) to create the training set, and the remaining half for the test set. Print out the first and last rows of both your training and test sets.

(d) Consider the loss function

where c ∈ R is a hyper-parameter. Consider the (simple) linear model

ŷ^(i) = w_0 + w_1 x_1^(i) + w_2 x_2^(i) + w_3 x_3^(i).

We can write this more succinctly by letting w = (w_0, w_1, w_2, w_3)^T and X^(i) = (1, x_1^(i), x_2^(i), x_3^(i))^T, so that ŷ^(i) = w^T X^(i). The mean loss achieved by our model (w) on a given dataset of n observations is then

Compute the following derivatives:

You must show your working for full marks.

(e) For some value of c > 0, we have

Using this result, write down the gradient descent updates for w_0, w_1, w_2, w_3 (using pseudocode), assuming a step size of η. Note that in gradient descent we consider the loss over the entire dataset, not just at a single observation. Further, provide pseudocode for stochastic gradient descent updates. Here, w^(t) denotes the weight vector at the t-th iteration of gradient descent.
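For concreteness, a minimal Python sketch of the two update schemes is given below. The function grad(w, X, y) is a placeholder for the gradient of the mean loss over the rows of X; it is not defined here because it depends on the loss derived in part (d).

import numpy as np

def gradient_descent(w, X, y, eta, n_iters):
    for t in range(n_iters):
        w = w - eta * grad(w, X, y)              # each update uses the full dataset
    return w

def stochastic_gradient_descent(w, X, y, eta, n_epochs):
    for epoch in range(n_epochs):
        for i in range(X.shape[0]):              # one update per observation, in stored order
            w = w - eta * grad(w, X[i:i+1], y[i:i+1])
    return w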

(f) In this section, you will implement gradient descent from scratch on the pre-processed dataset, using the gradients computed in part (d) and the pseudocode in part (e). Initialise your weight vector to w^(0) = [1, 1, 1, 1]^T. Consider step sizes

η ∈ {10, 5, 2, 1, 0.5, 0.25, 0.1, 0.05, 0.01}

(a total of 9 step sizes). For each step-size, generate a plot of the loss achieved at each iteration of gradient descent. You should use 400 iterations in total (which will require you to loop over your training data in order). Generate a 3×3 grid of plots showing performance for each step-size. Add a screen shot of the python code written for this question in your report (as well as pasting the code in to your solutions.py file). The following code may help with plotting:
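The course-provided plotting code is not reproduced here; a sketch of one way to build the 3 × 3 grid of loss curves, assuming losses[eta] holds the per-iteration training losses recorded for each step size:

import matplotlib.pyplot as plt

etas = [10, 5, 2, 1, 0.5, 0.25, 0.1, 0.05, 0.01]
fig, axes = plt.subplots(3, 3, figsize=(12, 10))
for eta, ax in zip(etas, axes.ravel()):
    ax.plot(losses[eta])                         # loss at each GD iteration
    ax.set_title(f"step size = {eta}")
    ax.set_xlabel("iteration")
    ax.set_ylabel("loss")
plt.tight_layout()
plt.show()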

(g) From your results in the previous part, choose an appropriate step size (and state your choice), and explain why you made this choice.

(h) For this part, take η = 0.3 and re-run GD under the same parameter initialisations and number of iterations. On a single plot, plot the progression of each of the four weights over the iterations. Print out the final weight vector. Finally, run your model on the train and test set, and print the achieved losses.

(i) We will now re-run the analysis in the previous part, but using stochastic gradient descent (SGD) instead. In SGD, we update the weights after seeing a single observation. Use the same settings as in the gradient descent implementation. For each step-size, generate a plot of the loss achieved at each iteration of stochastic gradient descent, which is to be run for 6 epochs. An epoch is one pass over the entire training dataset, so in total there should be 6 × n_train updates, where n_train is the size of your training data. Be sure to run these updates in the same ordering that the data is stored. Generate a 3 × 3 grid of plots showing performance for each step-size. Add a screen shot of the python code written for this question in your report (as well as pasting the code in to your solutions.py file).

(j) From your results in the previous part, choose an appropriate step size (and state your choice), and explain why you made this choice.

(k) Take η = 0.4 and re-run SGD under the same parameter initialisations and number of epochs. On a single plot, plot the progression of each of the four weights over the iterations. Print your final model. Finally, run your model on the train and test set, and record the achieved losses.

(l) In a few lines, comment on your results from the GD and SGD experiments above. Explain the importance of the step-size in both GD and SGD. Explain why one might have a preference for GD or SGD. Explain why the GD paths look much smoother than the SGD paths.

Question 3

In this question, we will consider the scikit-learn implementations of the following classifiers:

• Decision Trees

• Random Forest

• AdaBoost

• Logistic Regression

• Multilayer Perceptron (Neural Network)

• Support Vector Machine

You are required to compare the performance of the above models on a classification task. The following code loads in these classifiers and defines a function to simulate a toy dataset:

(a) Generate a dataset using the default parameters of create_dataset. Then, randomly split the dataset into training set Xtrain and test set Xtest (use the sklearn train_test_split function with random_state=45), with 80% of examples for training and 20% for testing. Fit each of the models to the training data and then plot the decision boundaries of each of the classifiers (using default parameter settings) on the test set. If you prefer, you may use the following plotter function which plots the decision boundary and works for any sklearn model.

Note that you can use the same general approach for plotting a grid as used in Homework 1, and the plotter function supports an ‘ax’ argument. What to submit: a single 3 × 2 plot, a print screen of the python code used, a copy of your python code in solutions.py.
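The provided plotter function is not reproduced here; the following sketch shows one way to produce the 3 × 2 grid, assuming X_train, X_test, y_train, y_test come from the train_test_split call above and the features are two-dimensional:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

models = [DecisionTreeClassifier(), RandomForestClassifier(), AdaBoostClassifier(),
          LogisticRegression(), MLPClassifier(), SVC()]
names = ["Decision Tree", "Random Forest", "AdaBoost",
         "Logistic Regression", "Neural Network", "SVM"]

fig, axes = plt.subplots(3, 2, figsize=(10, 12))
for model, name, ax in zip(models, names, axes.ravel()):
    model.fit(X_train, y_train)
    xx, yy = np.meshgrid(np.linspace(X_test[:, 0].min() - 1, X_test[:, 0].max() + 1, 200),
                         np.linspace(X_test[:, 1].min() - 1, X_test[:, 1].max() + 1, 200))
    zz = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, zz, alpha=0.3)                      # decision regions
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, s=10)  # test points
    ax.set_title(name)
plt.tight_layout()
plt.show()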

(b) Next, we will study how the performance of each of the classifiers varies as you increase the size of the training set. Fix your training and test sets from part (a). Then, starting from a random subset (no need to set a seed) of your training set of size 50 (chosen with replacement), train your classification model, and compute the accuracy on the test set. Repeat this process for training set sizes of [50, 100, 200, 300, . . . , 1000]. Repeat the experiment a total of 10 times for each classifier. Then, for each classifier, plot its average accuracy (achieved on the test set) for each training set size. Compare the accuracy across different algorithms in a single figure, and in 5 lines or less, discuss your observations: for the models covered in the course so far, use what you know about the bias-variance decomposition to inform your discussion. Which model would you prefer for this task?

Please use the following color scheme for your plots: [Decision Tree, Random Forest, AdaBoost, Logistic Regression, Neural Network, SVM], and please include a legend in your plot. What to submit: a discussion of your observation, a single plot, a print screen of the python code used, a copy of your python code in solutions.py.
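A sketch of the experiment (the specific colour scheme from the original is omitted here), reusing the models and names lists from part (a) and assuming X_train, y_train, X_test, y_test are NumPy arrays:

import numpy as np
import matplotlib.pyplot as plt

sizes = [50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
n_trials = 10
plt.figure()
for model, name in zip(models, names):
    mean_acc = []
    for size in sizes:
        accs = []
        for _ in range(n_trials):
            idx = np.random.choice(len(X_train), size=size, replace=True)  # resample with replacement
            model.fit(X_train[idx], y_train[idx])
            accs.append(model.score(X_test, y_test))                       # test accuracy
        mean_acc.append(np.mean(accs))
    plt.plot(sizes, mean_acc, label=name)
plt.xlabel("training set size")
plt.ylabel("mean test accuracy")
plt.legend()
plt.show()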

(c) Using the time module, record the training time for each of the classifiers at each of the training set sizes. Plot the log of the average training time over the 10 trials of each classifier as a function of the training size (use log base e). You may add code for this section to your code in part (b). What do you observe? In 5 lines or less, discuss your observations. Use the same color scheme as in (b). What to submit: a discussion of your observation, a single plot, a print screen of the python code used, a copy of your python code in solutions.py.
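Timing can be added to the loops from part (b); a sketch using the same placeholder names:

import time
import numpy as np
import matplotlib.pyplot as plt

plt.figure()
for model, name in zip(models, names):
    log_mean_times = []
    for size in sizes:
        times = []
        for _ in range(10):
            idx = np.random.choice(len(X_train), size=size, replace=True)
            start = time.time()
            model.fit(X_train[idx], y_train[idx])
            times.append(time.time() - start)          # seconds for this single fit
        log_mean_times.append(np.log(np.mean(times)))  # natural log of the trial mean
    plt.plot(sizes, log_mean_times, label=name)
plt.xlabel("training set size")
plt.ylabel("log mean training time (seconds)")
plt.legend()
plt.show()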

Question 4

Refer to the following dataset containing a sample S of ten examples. Each example is described using two Boolean attributes A and B. Each is labelled (classified) by the target Boolean function.

(a) What is the entropy of these examples with respect to the given classification?

(b) What is the Information gain of attribute A on sample S above?

(c) What is the information gain of attribute B on sample S above?

(d) What would be chosen as the ‘best’ attribute by a decision tree learner using the information gain splitting criterion? Why?
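As a reminder (the sample table itself is not reproduced here): for a binary-labelled sample S the entropy is H(S) = −p_+ log₂ p_+ − p_− log₂ p_−, where p_+ and p_− are the proportions of positive and negative examples in S, and the information gain of an attribute A is Gain(S, A) = H(S) − Σ_v (|S_v| / |S|) H(S_v), where the sum runs over the values v of A and S_v is the subset of S on which A takes value v.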

(e) What are ensembles? Discuss one example in which decision trees are used in an ensemble.

Question 5

Refer to the following data

(a) Apply the Perceptron Learning Algorithm with starting values w0 = 5, w1 = 1 and w2 = 1, and a learning rate η = 0.4. Be sure to cycle through the training data in the same order that they are presented in the table.
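The data table is not reproduced here; a minimal sketch of the update loop, with X (a NumPy array whose rows are (x1, x2)) and y (labels in {−1, +1}) standing in for the tabulated examples, and predictions taken as the sign of the activation:

import numpy as np

w = np.array([5.0, 1.0, 1.0])              # (w0, w1, w2) as given
eta = 0.4
converged = False
while not converged:
    converged = True
    for x, target in zip(X, y):            # cycle through rows in table order
        activation = w[0] + w[1] * x[0] + w[2] * x[1]
        pred = 1 if activation >= 0 else -1
        if pred != target:                 # mistake: nudge the boundary towards x
            w = w + eta * target * np.array([1.0, x[0], x[1]])
            converged = False
print(w)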

(b) Consider a new point, x∗ = (−5, 3). What is the predicted value and predicted class based on your learned perceptron for this point?

(c) Consider adding a new point to the data set, x∗ = (2, 2) and y∗ = −1. Will your perceptron converge on the new dataset? How might you remedy this?

(d) Consider the following three logical functions:

1. A ∧ ¬B

2. ¬A ∨ B

3. (A ∨ B) ∧ (¬A ∨ ¬B)

Which of these functions can a perceptron learn? Explain. What are two ways that you can extend a perceptron to learn all three functions?

Question 6

Refer to the following information. In the following two parts you will apply the k-means algorithm. You will use a univariate (one-variable) dataset containing the following 12 instances:

{2.01, 3.49, 4.58, 4.91, 4.99, 5.01, 5.32, 5.78, 5.99, 6.21, 7.26, 8.00}

Use the Manhattan or city-block distance, i.e., the distance between two instances xi and xj is the absolute value of the difference xi − xj . For example, if xi = 2 and xj = 3 then the distance between xi and xj is |2 − 3| = 1. Use the arithmetic mean to compute the centroids. Apply the k-means algorithm to the above dataset of examples. Let k = 2. Let the two centroids (means) be initialised to {3.33, 6.67}. On each iteration of the algorithm record the centroids.
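A short sketch of the procedure (values and starting centroids taken from the question):

import numpy as np

x = np.array([2.01, 3.49, 4.58, 4.91, 4.99, 5.01, 5.32, 5.78, 5.99, 6.21, 7.26, 8.00])
centroids = np.array([3.33, 6.67])

for iteration in range(3):
    # assign each point to its nearest centroid under city-block distance
    assign = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
    # recompute each centroid as the arithmetic mean of its cluster
    centroids = np.array([x[assign == k].mean() for k in range(2)])
    print(f"centroids after iteration {iteration + 1}: {centroids}")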

(a) After two iterations of the algorithm you should have recorded two sets of two centroids. What are they?

Now apply the algorithm for one more iteration. Record the new centroids after iteration 3 and answer the following question.

(b) After 3 iterations it is clear that:

(a) due to randomness in the data, the centroids could change on further iterations

(b) due to randomness in the algorithm, the centroids could change on further iterations

(c) k-means converges in probability to the true centroids

(d) the algorithm has converged and the clustering will not change on further iterations

(e) the algorithm has not converged and the clustering will change on further iterations

Question 7

Note: there will not be any MCQ on the final this year - this question is still useful practice though. In this question, we will apply a mistake-bounded learner to the following dataset. This dataset has 6 binary features, x1, x2, . . . x6. The class variable y can be either 1, denoting a positive example of the concept to be learned, or 0, denoting a negative example.

Apply the Winnow2 algorithm to the above dataset of examples in the order in which they appear. Use the following values for the Winnow2 parameters: threshold t = 2, α = 2. Initialise all weights to have the value 1.
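A minimal sketch of one epoch of Winnow2 with these parameters; the example table is not reproduced here, so X (a NumPy array whose rows hold the six binary features) and y (labels in {0, 1}) stand in for it:

import numpy as np

w = np.ones(6)                             # all weights initialised to 1
t, alpha = 2.0, 2.0
for x, target in zip(X, y):
    pred = 1 if np.dot(w, x) > t else 0    # predict 1 if the weighted sum exceeds the threshold
    if pred == 0 and target == 1:
        w[x == 1] *= alpha                 # promotion: scale up weights of active features
    elif pred == 1 and target == 0:
        w[x == 1] /= alpha                 # demotion: scale down weights of active features
print(w)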

(a) After one epoch, i.e., one pass through the dataset, which of the above weight configurations has been learned?

(a) Weight vector 1

(b) Weight vector 2

(c) Weight vector 3

(d) Weight vector 4

(e) Weight vector 5

(b) On which of the examples did the algorithm not make a mistake?

(a) Examples 1), 2) and 5)

(b) Example 5)

(c) Example 4)

(d) Examples 4) and 5)

(e) None of the above

(c) The algorithm has learned a consistent concept on the training data:

(a) True

(b) False

(c) It is not possible to determine this

(d) Assume the target concept from which this dataset was generated is defined by exactly two features. The worst-case mistake bound for the algorithm on this dataset is approximately:

(a) 1.79

(b) 2.58

(c) 3.58

(d) 4.67

(e) 10.75

Question 8

Consider the following dataset:

D := {((1, 1, 1), 0), ((1, 0, 0), 0), ((1, 1, 0), 1), ((0, 0, 1), 1)},

i.e., each data point is composed of a vector of 3 binary features, and a binary response. We will train a decision tree of depth 2 on this dataset using the standard ID3 algorithm covered in lectures and tutorials (please use the natural log (base e) instead of base 2 in all calculations for this question).
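As a quick check of the first step (using base e as instructed): the four responses are 0, 0, 1, 1, so the entropy of the full dataset is −(2/4) ln(2/4) − (2/4) ln(2/4) = ln 2 ≈ 0.693.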

(a) Draw the tree computed and write down the training error it achieves. Be sure to show all working for full marks. What to submit: your working out, either typed or handwritten. Please include a drawing/plot of your tree.

(b) Can you construct a depth 2 decision tree that achieves a lower training error than the one found via ID3? What does this tell you about ID3? Explain. What to submit: your working out, either typed or handwritten. Please include a drawing/plot of your tree.

Question 9

Assume that you have access to a function train_perceptron that trains a standard perceptron (with learning rate η = 1, and w^(0) = 0_p, where 0_p is the vector of zeroes in p dimensions. Here, p denotes the number of features plus 1 for the bias term). Design an algorithm that establishes whether a given dataset is linearly separable or not, and which makes a single call to train_perceptron. Describe your algorithm (a few dot points should suffice), and write a Python function implementing the algorithm. When running the perceptron function, terminate it after at most 10000 iterations. Test your function on the following datasets and report whether or not they are linearly separable.

(i) {010, 011, 100, 111}

(ii) {011,100,110,111}

(iii) {0100, 0101, 0110, 1000, 1100, 1101, 1110, 1111}

(iv) {1000000, 1000001, 1000101}.

Present a summary of your results using the following table:

The datasets represent the points that are labelled +1. All other remaining binary vectors of the same length are labelled −1. For example, for (i), there are 2^3 = 8 possible binary vectors of length 3; the four strings {010, 011, 100, 111} belong to the positive class, and the remaining 4 strings {000, 001, 101, 110} belong to the negative class. You may find the following StackExchange post helpful in coding up the datasets; it describes a simple approach to generating all length k binary vectors in Python using itertools.product.

Note: Using an existing algorithm to determine linear separability here will result in a grade of zero for this part, but you are free to use existing implementations of the perceptron algorithm. What to submit: a description of your algorithm in words, your filled out table, a screen shot of your code used for this section, a copy of your code in solutions.py
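A sketch of the idea: a dataset is linearly separable if and only if the perceptron, run for long enough, reaches zero errors on the training data; so train once with a capped number of passes and then check the training accuracy. The sketch below uses scikit-learn's Perceptron as a stand-in for train_perceptron (existing perceptron implementations are permitted above); note that its max_iter counts passes over the data rather than single updates.

import numpy as np
from itertools import product
from sklearn.linear_model import Perceptron

def is_linearly_separable(positive_strings):
    k = len(positive_strings[0])
    X = np.array(list(product([0, 1], repeat=k)))               # all length-k binary vectors
    y = np.array([1 if ''.join(map(str, row)) in positive_strings else -1 for row in X])
    clf = Perceptron(max_iter=10000, tol=None).fit(X, y)        # cap the number of passes
    return clf.score(X, y) == 1.0                               # separable iff zero training errors

for pos in [['010', '011', '100', '111'],
            ['011', '100', '110', '111'],
            ['0100', '0101', '0110', '1000', '1100', '1101', '1110', '1111'],
            ['1000000', '1000001', '1000101']]:
    print(pos, is_linearly_separable(pos))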