CSCI3151 - Foundations of Machine Learning Assignment 3 Winter 2022
Your assignment is to be submitted on Brightspace as a single .ipynb file (please do not zip it), including your answers to both the math and the experimental questions, in the correct order. Use Markdown syntax (https://www.markdownguide.org/cheat-sheet/) to format your answers.
Note: in solving the math questions, aim for general (symbolic) solutions and substitute the specific numbers at the end. This demonstrates a solid understanding of the key concepts. You can answer the math questions in two ways:
Use LaTeX to typeset the equations. Section H of this LaTeX reference sheet (http://tug.ctan.org/info/latex-refsheet/LaTeX_RefSheet.pdf) is a good reference. Here is another LaTeX reference sheet (https://math.meta.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference). The equations in the questions are typeset in LaTeX, so you can use them as examples.
Use neat handwriting, scan your solution using Camscanner (https://www.camscanner.com/user/download) on your mobile phone, upload the image file, and embed it in your solution notebook. To this end, (1) create an empty Markdown cell; (2) drag-and-drop the image file into the empty Markdown cell, or click on the image icon at the top of the cell and select the image file. The Markdown code that embeds the image, together with its content, then appears.
Your answers to the experimental questions should be in your solution notebook, in the form of code and text cells, using markdown for your text responses. You should also include the results of running your code.
The marking criteria are described in rubrics: one for the math questions and one for the experimental questions.
You can submit multiple versions of your assignment. Only the last one will be marked. It is recommended to upload a complete submission, even if you are still improving it, so that you have something in the system if your computer fails for whatever reason.
IMPORTANT: PLEASE NAME YOUR PYTHON NOTEBOOK FILE AS:
<LAST_NAME>-<FIRST_NAME>-Assignment-N.ipynb
for example: Milios-Evangelos-Assignment-3.ipynb
1. Multi-class Multi-label classification using Naive Bayes
In this question you will implement Naive Bayes to classify the topic of newsgroup posts. This method works fairly well for certain text classification tasks, and newsgroup post classification is one of them, since some words convey a strong indication that a post belongs to a certain topic.
You will make use of the 20 Newsgroups dataset, which can be found in sklearn (the training subset has been fetched for you). You may want to look at sklearn.feature_extraction.text.TfidfVectorizer (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to convert the words to vector representations.
In [ ]:

```python
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'), subset='train')

# collapses runs of whitespace in the data
remove_ws = lambda x: " ".join(x.split())

X_train = list(map(remove_ws, newsgroups_train.data))
y_train = newsgroups_train.target

# uncomment to know more about the dataset
# print(newsgroups_train.DESCR)
```
a) First you will build a multi-class (single-label) classifier on the above training set. You will be using complement Naive Bayes (https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.ComplementNB.html#sklearn.naive_bayes.ComplementNB) from sklearn for this task. Make sure you account for the zero counts (smoothing), so that a prediction is not penalized just because it uses a word that is not present in the training set.
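As a sketch of the smoothing point above: the `alpha` parameter of `ComplementNB` applies additive smoothing, so a document containing a word never seen at training time is not driven to zero probability. The toy corpus and labels below are hypothetical stand-ins for the newsgroup data, only to show the pipeline shape.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the newsgroup posts (hypothetical data)
docs = [
    "the rocket launched into orbit around the moon",
    "the team scored a late goal to win the match",
    "nasa announced a new space mission to mars",
    "the striker missed the penalty kick in the final",
]
labels = [0, 1, 0, 1]  # 0 = space, 1 = sport

# alpha > 0 applies additive smoothing, so a word unseen at training
# time does not zero out a class's score
model = make_pipeline(TfidfVectorizer(), ComplementNB(alpha=1.0))
model.fit(docs, labels)
print(model.predict(["a rocket mission to orbit"]))
```

On the real data you would fit the same pipeline on `X_train` and `y_train`.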
b) For testing, fetch the test subset by passing subset="test" in the code above. Report the classification error and plot the confusion matrix. Note: remember to apply the identical transformation to the test subset as to the training subset before inference.
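The evaluation step can be sketched as follows; the label arrays here are hypothetical placeholders for the true test targets and the model's predictions on the identically transformed test subset.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical stand-ins for the true test labels and the predictions
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

error = 1 - accuracy_score(y_true, y_pred)  # classification error
cm = confusion_matrix(y_true, y_pred)       # rows: true, columns: predicted
print(f"classification error: {error:.3f}")
print(cm)
# sklearn.metrics.ConfusionMatrixDisplay(cm).plot() renders cm as a heatmap
```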
c) Using the same model trained in (a), apply the appropriate sklearn method to build a multi-label classifier. Predict the top 2 labels for at least 10 samples taken from the test subset.
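One way to get the top 2 labels is to rank the per-class posterior probabilities from `predict_proba` and take the two largest. The three-class toy corpus below is a hypothetical stand-in for the newsgroup data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB

# Toy three-class corpus (hypothetical data)
docs = [
    "rockets and orbits and space probes",
    "goals and penalties and matches",
    "guitars and drums and concerts",
]
labels = np.array([0, 1, 2])  # 0 = space, 1 = sport, 2 = music

vec = TfidfVectorizer()
clf = ComplementNB().fit(vec.fit_transform(docs), labels)

proba = clf.predict_proba(vec.transform(["space probes and guitars"]))
# indices of the two classes with the highest posterior, best first
top2 = np.argsort(proba, axis=1)[:, -2:][:, ::-1]
print(top2)
```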
2. Clustering
In this question we are going to explore two different clustering methods on the Wine dataset (https://archive.ics.uci.edu/ml/datasets/wine) and evaluate them using two measures: one is an intrinsic measure (no labels), while the other makes use of the available labels.
In [ ]:
```python
# input data
from sklearn import datasets

wine = datasets.load_wine()
X = wine.data    # 178 instances, 13 features
y = wine.target  # target, 3 classes
```
a) Cluster the dataset using the Agglomerative Clustering (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html) and k-Means (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans) clustering algorithms, without using the class information as part of the features. Experiment with different numbers of clusters ranging from 2 to 5.
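A minimal sketch of the loop over cluster counts is below. Standardizing the features first is an assumption, not part of the question: the 13 wine features are on very different scales, which would otherwise dominate the distance computations.

```python
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering, KMeans

# Standardize so that no single feature dominates the distances (assumption)
X = StandardScaler().fit_transform(datasets.load_wine().data)

results = {}
for k in range(2, 6):
    agg_labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    km_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    results[k] = (agg_labels, km_labels)
    print(k, len(set(agg_labels)), len(set(km_labels)))
```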
b) What is the variability of the resulting clusters as a function of different initializations or parameterizations? Use the Silhouette coefficient and the Adjusted Rand Index as evaluation metrics to discuss the stability of the results.
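One way to probe initialization variability is to re-run k-means with different random seeds (using `n_init=1` so each run actually keeps its own initialization) and compute both metrics for each run; a sketch, assuming standardized features:

```python
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

wine = datasets.load_wine()
X = StandardScaler().fit_transform(wine.data)
y = wine.target

# n_init=1 keeps a single initialization per run, exposing the variability
for seed in range(3):
    labels = KMeans(n_clusters=3, n_init=1, random_state=seed).fit_predict(X)
    sil = silhouette_score(X, labels)     # intrinsic: no true labels needed
    ari = adjusted_rand_score(y, labels)  # extrinsic: compares to true classes
    print(f"seed={seed}  silhouette={sil:.3f}  ARI={ari:.3f}")
```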
c) Based on the Silhouette coefficient, discuss (i) which clustering method you would pick, (ii) how many clusters you would use for your data.
Make sure that appropriate visualizations are used to support the analysis.
3. Gaussian Mixture Model
It’s the year 2120 and you work as a space taxi driver. One day, you suddenly get lost and find a new small inhabited planet, home to a so far unknown civilization. You meet the planet’s prime minister, who explains to you that different alien races joined to live on this planet peacefully. Although the different races look similar, the prime minister explains: “It’s not clear-cut, but the race can be fairly well distinguished by looking at an alien’s height and weight”. You spent a good amount of time chatting and laughing with the prime minister until you realized that you had passengers waiting to be picked up.
A few years later, you find that there is a lot of interest in knowing more about this planet: how many races there are, and what the different races look like. You never managed to find that planet again (as you know, planets move around). You are clearly not good at memorizing data, but luckily, the prime minister shared with you the last census data (aliens.csv), which contains aliens’ heights and weights in meters and kilograms, respectively. Would you be able to infer how many races there are and what their characteristics are?
a) Run a Gaussian Mixture Model, so that you can identify the different races in the civilization. Vary the number of components from 2 to 7. Use the Akaike Information Criterion (https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture.aic) (AIC) to provide a metric of the goodness of the approximation for each. Indicate the most likely number of races.
In [ ]:

```python
# Load data
import io

import pandas as pd
from google.colab import files

# Upload file (Tutorial 1)
uploaded = files.upload()
df = pd.read_csv(io.BytesIO(uploaded['aliens.csv']), header=None)
```
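The AIC sweep can be sketched as follows. Since aliens.csv is not available here, the code generates synthetic two-column (height, weight) data as a stand-in; in the assignment you would use the uploaded DataFrame's values instead.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for aliens.csv: two columns, height (m) and weight (kg)
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal([1.5, 55.0], [0.08, 4.0], size=(150, 2)),
    rng.normal([2.1, 95.0], [0.10, 6.0], size=(150, 2)),
])

aic = {}
for k in range(2, 8):
    gm = GaussianMixture(n_components=k, random_state=0).fit(data)
    aic[k] = gm.aic(data)  # lower AIC = better fit/complexity trade-off

best_k = min(aic, key=aic.get)
print(best_k, {k: round(v, 1) for k, v in aic.items()})
```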
b) Using the most probable number of races (number of Gaussians) based on AIC, make a scatter plot of all your points (heights and weights), where the color of each point is given by the Gaussian with the highest posterior probability.
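For the coloring, note that `GaussianMixture.predict` already returns the component with the highest posterior probability, i.e. the argmax over `predict_proba`. A sketch on the same synthetic stand-in data (the plotting call is left as a comment):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the census data (hypothetical)
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal([1.5, 55.0], [0.08, 4.0], size=(150, 2)),
    rng.normal([2.1, 95.0], [0.10, 6.0], size=(150, 2)),
])

gm = GaussianMixture(n_components=2, random_state=0).fit(data)

# predict() is the argmax of the per-component posteriors
hard = gm.predict(data)
soft = np.argmax(gm.predict_proba(data), axis=1)

# plt.scatter(data[:, 0], data[:, 1], c=hard) would color points by component
print(np.bincount(hard))
```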
c) Compare and contrast the above method with the Bayesian Gaussian Mixture Model (https://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html) of sklearn, starting with the default parameters and experimenting as needed.
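A starting point for the comparison: unlike `GaussianMixture`, the variational `BayesianGaussianMixture` can shrink the weights of superfluous components toward zero, so `n_components` acts as an upper bound rather than a fixed count. A sketch on the same synthetic stand-in data (the 0.05 weight threshold is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Synthetic two-cluster stand-in for the census data (hypothetical)
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal([1.5, 55.0], [0.08, 4.0], size=(150, 2)),
    rng.normal([2.1, 95.0], [0.10, 6.0], size=(150, 2)),
])

# n_components is an upper bound: unneeded components get near-zero weight
bgm = BayesianGaussianMixture(n_components=7, max_iter=500, random_state=0).fit(data)

weights = np.round(bgm.weights_, 3)
effective = int(np.sum(bgm.weights_ > 0.05))  # arbitrary cut-off for "active"
print(weights, effective)
```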
2022-04-02