
CS 3262 - Final Exam - Fall 2023

from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive

Welcome to your final exam!

Logistics:

  Open book/open note/no internet

  You are not allowed to discuss the exam with each other

  All questions about the exam should come to me, through email. Do not send any public messages to me, or to each other, about the exam.

   If there are any clarifications required, I will post them on Brightspace and update this document.

A note on the kinds of answers I expect: As is our style on HW and in class, many of these questions are open-ended and are not asking you to repeat what you've read or heard in class. On the contrary, if I read my own words (or a text's) back, I will mark that down! I expect you to demonstrate your original thoughts. Almost none of these questions require 3-word answers (some do, though; those should be clear from the question!).

Having said that, I also don't want you to start just typing out vocabulary words that we've used in class.

Tip: If you feel you can't answer a question, skip it and come back. Sometimes reading the entire thing will help clarify the individual parts. If all else fails, I will award partial credit for effort, and a clear explanation of what you're confused about and why.

Try and explain your confusion!

Changelog

Note: This is version 1, updated on 2023-12-11

   Notebook Setup

# imports

import numpy as np

import matplotlib.pyplot as plt

colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]

import seaborn as sns

import pandas as pd

import sklearn as sk

# styling additions

from IPython.display import HTML

style = '''

<style>

div.info{

padding: 15px;

border: 1px solid transparent;

border-left: 5px solid #dfb5b4;

border-color: transparent;

margin-bottom: 10px;

border-radius: 4px;

background-color: #fcf8e3;

border-color: #faebcc;

}

hr{

border: 1px solid;

border-radius: 5px;

}

</style>'''

HTML(style)

  Problem 0 - Decision Trees/Random Forests

This problem will use two extra packages to make some nice visualizations of our trees!

Uncomment and run this cell to install these packages:

#!pip install dtreeviz

Collecting dtreeviz
  Downloading dtreeviz-2.2.2-py3-none-any.whl (91 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 91.8/91.8 kB 2.2 MB/s eta 0:00:00
Requirement already satisfied: graphviz>=0.9 in /usr/local/lib/python3.10/dist-packages (from dtreeviz) (0.20.1)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from dtreeviz) (1.5.3)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from dtreeviz) (1.23.5)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from dtreeviz) (1.2.2)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (from dtreeviz) (3.7.1)
Requirement already satisfied: colour in /usr/local/lib/python3.10/dist-packages (from dtreeviz) (0.1.5)
Requirement already satisfied: pytest in /usr/local/lib/python3.10/dist-packages (from dtreeviz) (7.4.3)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->dtreeviz) (1.2.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib->dtreeviz) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->dtreeviz) (4.46.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->dtreeviz) (1.4.5)


Just as in the homeworks, if you run into errors when running 'from dtreeviz import clfviz', you can replace it with 'from dtreeviz import decision_boundaries':

import dtreeviz

#### approach 1: 'from dtreeviz import clfviz' (if this import fails in your dtreeviz version, use approach 2 instead)
# from dtreeviz import clfviz

#### approach 2:
from dtreeviz import decision_boundaries

Now we're ready. Let's start with the wine dataset we used in class:

from sklearn.datasets import load_wine

wine = load_wine()

X = wine.data

X.shape

(178, 13)

This dataset has 13 features:

wine.feature_names

['alcohol',

'malic_acid',

'ash',

'alcalinity_of_ash',

'magnesium',

'total_phenols',

'flavanoids',

'nonflavanoid_phenols',

'proanthocyanins',

'color_intensity',

'hue',

'od280/od315_of_diluted_wines',

'proline']

Let's pick a subset for easy plotting:

X = X[:,[12,6]]

y = wine.target

Now we're ready!

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=50, min_samples_leaf=20, n_jobs=-1)

rf.fit(X, y)

RandomForestClassifier(min_samples_leaf=20, n_estimators=50, n_jobs=-1)

#### approach 1: using clfviz to visualize the boundary

#### if it's not working, try approach 2

fig,axes = plt.subplots(1,1,dpi=300)

clfviz(rf, X, y, ax=axes,

# show classification regions not probabilities

show=[ 'instances', 'boundaries', 'misclassified'],

feature_names=[ 'proline', 'flavanoid']);

NameError                                 Traceback (most recent call last)

<ipython-input-15-9950d8cf5234> in <cell line: 5>()

3

4 fig,axes = plt.subplots(1,1,dpi=300)

----> 5 clfviz(rf, X, y, ax=axes,

6        # show classification regions not probabilities

7        show=[ 'instances', 'boundaries', 'misclassified'],

NameError: name 'clfviz' is not defined


#### approach 2:

fig,axes = plt.subplots(1,1,dpi=300)

decision_boundaries(rf, X, y, ax=axes,

# show classification regions not probabilities

show=[ 'instances', 'boundaries', 'misclassified'],

feature_names=[ 'proline', 'flavanoid'])

WARNING:matplotlib.font_manager:findfont: Font family 'Arial' not found.

  Pause-and-Ponder: Below, regenerate the above analysis for different values of:

min_samples_leaf 

max_depth

n_estimators

Investigate their effect on the decision boundary!
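One possible way to structure that exploration is sketched below. It reuses the decision_boundaries call from approach 2 above; the specific parameter values are only illustrative, not prescribed by the exam.

# Sketch: sweep a few illustrative hyperparameter settings and re-plot the boundary.
# Assumes X, y, decision_boundaries, and plt from the cells above are still in scope.
from sklearn.ensemble import RandomForestClassifier

settings = [
    {'min_samples_leaf': 1},
    {'min_samples_leaf': 50},
    {'max_depth': 2},
    {'max_depth': 10},
    {'n_estimators': 5},
    {'n_estimators': 200},
]

for params in settings:
    kwargs = {'n_estimators': 50, 'n_jobs': -1}
    kwargs.update(params)                      # override the default with the value under study
    rf = RandomForestClassifier(**kwargs)
    rf.fit(X, y)

    fig, ax = plt.subplots(1, 1, dpi=150)
    decision_boundaries(rf, X, y, ax=ax,
                        show=['instances', 'boundaries', 'misclassified'],
                        feature_names=['proline', 'flavanoid'])
    ax.set_title(str(params))
    plt.show()

Smaller min_samples_leaf, larger max_depth, and fewer trees all tend to produce more jagged, data-hugging boundaries; the opposite settings smooth the boundary out.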


  BONUS B1 - PCA

For this bonus problem, run PCA on the full wine dataset we imported above! (meaning you don't have to split your data into training and test sets)

from sklearn.decomposition import PCA

Note: I have intentionally not given you a code example for this problem! Try reading the sklearn documentation and use what we currently know to see how to specify a PCA yourself!

from sklearn.decomposition import PCA

from sklearn.datasets import load_wine

import pandas as pd

wine = load_wine()

X = wine.data

pca = PCA(n_components=2)

X_pca = pca.fit_transform(X)

pca_df = pd.DataFrame(data=X_pca, columns=[ 'PC1', 'PC2'])

print(pca_df)

 

            PC1        PC2
0    318.562979  21.492131
1    303.097420  -5.364718
2    438.061133  -6.537309
3    733.240139   0.192729
4    -11.571428  18.489995
..          ...        ...
173   -6.980211  -4.541137
174    3.131605   2.335191
175   88.458074  18.776285
176   93.456242  18.670819
177 -186.943190  -0.213331

[178 rows x 2 columns]

  Pause-and-Ponder: Comment below on the quality of the fit! How did PCA do on this dataset? Give a good answer here!

PC1 captures the vast majority of the variance, while PC2 adds comparatively little. Because the data are unscaled, large-magnitude features (notably proline) dominate the first component.
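One quick way to quantify that claim (a small sketch using the pca object fit above) is to print the explained variance ratios:

# How much of the total variance does each principal component explain?
print(pca.explained_variance_ratio_)         # per-component fraction of variance
print(pca.explained_variance_ratio_.sum())   # total fraction captured by the 2 components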

  Pause-and-Ponder: Explain what exactly PCA is doing to our dataset. How is it different than linear regression? Comment below!

PCA is a method used to condense the information contained in a dataset with many variables into a smaller set of new variables, known as principal components. These components are ranked such that each subsequent component has the highest possible variance under the constraint that it is orthogonal to the preceding components.

In contrast to PCA, linear regression is a predictive technique that requires both input and output variables. It attempts to predict the value of a dependent variable, based on one or more independent variables, by fitting a linear equation to observed data.

Key distinctions between the two methods include:

  PCA operates without guidance from an output variable, aiming to simplify the data structure through variance. It is a technique for feature extraction and dimensionality reduction.

  Linear Regression works under supervision, employing a target output to shape its predictions. It is a method for understanding the relationship between inputs and outputs within the dataset.
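To make the contrast concrete, here is a small illustrative sketch on synthetic 2D data (not part of the exam answer): the first principal direction maximizes variance in x and y jointly, while the regression slope minimizes vertical prediction error, so the two lines generally differ.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_noisy = 0.5 * x + rng.normal(scale=0.8, size=200)   # noisy linear relationship
data = np.column_stack([x, y_noisy])

# PCA: unsupervised, finds the direction of maximum variance in (x, y) jointly
pc1 = PCA(n_components=1).fit(data).components_[0]
pca_slope = pc1[1] / pc1[0]

# Linear regression: supervised, predicts y from x by minimizing squared vertical error
reg_slope = LinearRegression().fit(x.reshape(-1, 1), y_noisy).coef_[0]

print(f"slope of first principal direction: {pca_slope:.3f}")
print(f"regression slope:                   {reg_slope:.3f}")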

  Bonus B2 - KMeans

For this problem, run K-Means on your PCA dimensionality-reduced data for the following:

  2 components

  5 components

  10 components

And give the three plots!

Note: I have intentionally not given you a code example for this problem! Try reading the sklearn documentation and use what we currently know to see how to specify a KMeans yourself!

from sklearn.decomposition import PCA

from sklearn.cluster import KMeans

from sklearn.datasets import load_wine

import matplotlib.pyplot as plt

import pandas as pd

# 2 components

wine = load_wine()

X = wine.data

pca_2 = PCA(n_components=2)

X_pca_2 = pca_2.fit_transform(X)

kmeans_2 = KMeans(n_clusters=3, random_state=42)

clusters_2 = kmeans_2.fit_predict(X_pca_2)

plt.figure(figsize=(8, 6))

plt.scatter(X_pca_2[:, 0], X_pca_2[:, 1], c=clusters_2)

plt.title( 'K-Means with 2 PCA Components')

plt.xlabel( 'PC1')

plt.ylabel( 'PC2')

plt.show()

/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:870: FutureWarning:

#5 components

pca_5 = PCA(n_components=5)

X_pca_5 = pca_5.fit_transform(X)

kmeans_5 = KMeans(n_clusters=3, random_state=42)

clusters_5 = kmeans_5.fit_predict(X_pca_5)

plt.figure(figsize=(8, 6))

plt.scatter(X_pca_5[:, 0], X_pca_5[:, 1], c=clusters_5)

plt.title( 'K-Means with 5 PCA Components')

plt.xlabel( 'PC1')

plt.ylabel( 'PC2')

plt.show()

#10 components

pca_10 = PCA(n_components=10)

X_pca_10 = pca_10.fit_transform(X)

kmeans_10 = KMeans(n_clusters=3, random_state=42)

clusters_10 = kmeans_10.fit_predict(X_pca_10)

plt.figure(figsize=(8, 6))

plt.scatter(X_pca_10[:, 0], X_pca_10[:, 1], c=clusters_10)

plt.title( 'K-Means with 10 PCA Components')

plt.xlabel( 'PC1')

plt.ylabel( 'PC2')

plt.show()

/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:870: FutureWarning:

 

  Pause-and-Ponder: How are KMeans and KNN different? How are they similar? Explain!

   Differences:

1. Purpose and Application:

  KMeans is an unsupervised learning algorithm used for clustering. It groups data into a specified number K of clusters based on feature similarity.

  KNN is a supervised learning algorithm used for classification or regression. In classification, it predicts the class of a data point by looking at the K nearest labeled data points and taking a majority vote.

2. Learning Method:

  KMeans learns by iteratively updating the centroids of clusters until convergence. It does not use labeled data; the algorithm organizes data into clusters based on feature similarity alone.

  KNN does not have an explicit training phase. It makes predictions based on the labels of the nearest neighbors in the feature space. Each query involves analyzing the entire training set (or a significant portion of it) to find the K nearest neighbors.

Similarities:

1. Parameter K:

  Both algorithms use a parameter K, but its meaning and purpose are different in each. In KMeans, K represents the number of clusters, while in KNN, K represents the number of nearest neighbors to consider for making predictions.

2. Reliance on Distance Metrics:

  Both KMeans and KNN rely on distance metrics to measure similarity or proximity. In KMeans, this is used to assign points to the nearest cluster centroid. In KNN, it's used to find the nearest neighbors.

3. Feature Space Analysis:

  Both algorithms operate in the feature space and perform some form of grouping based on feature similarity, although the way they use this information differs significantly.
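As a small side-by-side sketch (assuming X_pca_2 and y = wine.target from the cells above are still in scope): KMeans invents its own groups without ever seeing y, while KNN cannot make a prediction until it has been given labeled points.

from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_pca_2)   # unsupervised: no labels used
knn = KNeighborsClassifier(n_neighbors=5).fit(X_pca_2, y)            # supervised: labels required

print("KMeans cluster ids (first 10):   ", km.labels_[:10])   # arbitrary ids, not class labels
print("KNN predicted classes (first 10):", knn.predict(X_pca_2[:10]))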