BA222 - Lecture Notes 15: Introduction to Machine Learning, Cluster Analysis
Table of Contents
• Introduction
• Introduction to Cluster Analysis
• Measuring Similarities
• The K-Means Algorithm
• Step 0: Initialization
• Step 1: Distance Calculation
• Step 2: Cluster Assignment
• Step 3: Centroid Recalculation
• Final Step: Convergence Criteria
• Implementation in Python
• Extracting and Displaying Results
• More than two dimensions
• Recommendations based on similarities
Introduction
Machine Learning (ML) is the process of using computer algorithms to detect patterns in data. Machine learning procedures differ from classical statistics in that they require minimal input from the analyst and, generally, large amounts of data. They are mostly used to build models for prediction, not necessarily to identify causal relations.
In this course we'll learn a single ML methodology called cluster analysis. It is an example of unsupervised learning, a branch of ML that focuses on techniques for pattern detection without a specific goal in mind. That is, with unsupervised learning we simply let the computer detect patterns for us.
Introduction to Cluster Analysis
Let's start with something simple. Say we want to teach a computer to differentiate three species of animals based on height and weight data. We'll make it very easy for the machine and use very different animals: the Cottontail Rabbit, the Grey Wolf, and the Polar Bear. All the animals are adult males.
You can find the data in the animals.csv file. Each row is a different animal: the h column is the height in centimeters, w is the weight in pounds, and name is the name of the species.
Let's start by making a scatterplot using w as the horizontal axis and h as the vertical. What patterns do you see in the data?
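A minimal way to draw this plot is sketched below. The rows here are illustrative placeholders, since the actual values live in animals.csv (in the notes, db would come from pd.read_csv('animals.csv')):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative placeholder rows; in the notes, db = pd.read_csv('animals.csv')
db = pd.DataFrame({
    'name': ['rabbit', 'rabbit', 'wolf', 'wolf', 'bear', 'bear'],
    'h':    [20, 22, 80, 85, 130, 135],   # height in centimeters
    'w':    [3, 4, 90, 100, 900, 990],    # weight in pounds
})

plt.scatter(db.w, db.h)       # w on the horizontal axis, h on the vertical
plt.xlabel('Weight (lbs)')
plt.ylabel('Height (cm)')
plt.show()
```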
We are not using a scatterplot as in classical statistics to find the correlation between height and weight, but this time we are looking for clusters of points. Clusters are subsets of the data that are similar to each other.
In this problem it is easy to identify three different clusters. The rabbits are tiny and light, so their data is collected in the bottom left of the scatterplot (it looks like a line because they are so tiny). The wolves are in the top left; they are heavier and taller than the rabbits. Finally, the remaining cluster represents the bears, the heaviest in the dataset.
Measuring Similarities
The similarities between two observations can be represented as their Euclidean Distance. Recall that the formula for the Euclidean distance between two points i and j is:
di,j = √( (yi − yj)² + (xi − xj)² )
For the previous problem it will look something like this:
di,j = √( (heighti − heightj)² + (weighti − weightj)² )
So, when comparing two rabbits, because the height and weight are similar among rabbits, the Euclidean distance will be closer to zero than when comparing a rabbit to a bear.
Let's compute the average weight and height for the three species (we can do this using group by):
import pandas as pd

# db is the name of the dataframe
db[['name', 'h', 'w']].groupby('name').agg('mean')
The Euclidean distance of the average rabbit to the average bear is:
import numpy as np

avgs = db[['name', 'h', 'w']].groupby('name').agg('mean')
hDiff = avgs.h['bear'] - avgs.h['rabbit']
wDiff = avgs.w['bear'] - avgs.w['rabbit']
np.sqrt((hDiff ** 2) + (wDiff ** 2))
The problem with this procedure is that, because weight and height are on different scales (pounds and centimeters), the distance calculation is biased towards the variable with the largest range (in this case weight).
A solution to this problem is to use standardized values of the variables. A standardized value (or Z-score) is calculated by taking the original variable, subtracting the mean, and dividing by the standard deviation. This produces a variable that represents, in standard deviations, how far an observation is from the mean.
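To make the Z-score concrete, here is a quick by-hand computation on a few made-up weights (the values are illustrative, not taken from the animals data):

```python
import numpy as np

# Hypothetical weights in pounds: two rabbits, a wolf, a bear
w = np.array([3.0, 4.0, 95.0, 950.0])

# Z-score: subtract the mean, then divide by the standard deviation
zw = (w - w.mean()) / w.std()

# The standardized variable has mean 0 and standard deviation 1,
# so each value now reads as "standard deviations away from the mean"
print(zw.round(2))
```

This is the same calculation sklearn's scale performs column by column.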
In Python you can standardize variables using the scale function from the sklearn.preprocessing package:
from sklearn.preprocessing import scale
db[['sH', 'sW']] = scale(db[['h', 'w']])
The previous command adds the standardized versions of h and w to the dataframe as sH and sW respectively. If we repeat the original scatterplot using the standardized variables, we'll see that the variation in the data is no longer dominated by weight.
We can also use zero as a reference to identify values that are above or below the mean. This is sometimes helpful to identify clusters visually:
import matplotlib.pyplot as plt

plt.scatter(db.sW, db.sH)
plt.axhline(0, color='red')
plt.axvline(0, color='red')
plt.show()
Now it is a bit easier for us to identify the clusters. Rabbits are below average in height and weight. Wolves are below average in weight and above average in height. Bears are above average in both height and weight.
The K-Means Algorithm
The computer, unfortunately, doesn't have an advanced brain like ours that processes visual information to detect patterns. We are going to aid the machine by designing an algorithm that detects similar observations based on the Euclidean distance. An algorithm is simply a specific sequence of steps, generally in a loop, used to solve a problem.
Step 0: Initialization
To initialize the algorithm we are going to ask the computer to start with a random guess about the average weight and height of the three different species. Thus, we'll start with three randomly selected points (displayed in the diagram below in three different colors). Each of these points, which we'll call centroids from now on, will be updated as the algorithm is executed and will aid in the classification of observations. For now they are just random values:
Step 1: Distance Calculation
Now we'll let the computer calculate the Euclidean distance from each observation to the centroids. This serves as a measure of how similar each observation is to each centroid.
Step 2: Cluster Assignment
Based on the Euclidean distance, each observation is assigned to the nearest centroid.
Step 3: Centroid Recalculation
Based on the observations assigned to each cluster, new centroids are calculated, this time using the average of all the observations assigned to a given cluster.
Final Step: Convergence Criteria
Steps 1 to 3 are repeated in a loop until the values of the centroids and the cluster assignments stop changing.
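Putting Steps 0 through 3 together, the whole loop can be sketched in a few lines of NumPy. This is only an illustration of the idea (with a simple guard for empty clusters), not the implementation sklearn uses:

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Bare-bones K-means: Steps 0-3 in a loop until convergence."""
    rng = np.random.default_rng(seed)
    # Step 0: initialize the centroids with k randomly chosen observations
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: Euclidean distance from every observation to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 2: assign each observation to its nearest centroid
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
        # Convergence: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Running this on two well-separated groups of points recovers the two groups, which is exactly what we'll ask sklearn to do next.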
Implementation in Python
Luckily for us we don't need to code the K-means algorithm from scratch; instead we'll simply use the KMeans() function from the sklearn.cluster package:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=0).fit(db[['sH', 'sW']])
In the KMeans() function we need to specify the number of clusters we want to identify using the n_clusters option. Then, just like in a regression, we use the .fit() function to actually run the algorithm. This time we don't need to specify any formula, just the data we want to use. The data must be numerical for the algorithm to work.
The option random_state = 0 simply fixes the random initialization of the algorithm so the results are reproducible.
Extracting and Displaying Results
To get the results of the algorithm we proceed in a similar way to regression analysis: we extract specific pieces of information from the estimated model, in this case kmeans. First we are going to look at the resulting centroids of the algorithm:
centroids = pd.DataFrame(kmeans.cluster_centers_)
centroids.columns = ['h', 'w']
centroids
The first centroid has a negative standardized value for h, meaning below-average height, and a negative standardized value for w, meaning below-average weight. This is the centroid that corresponds to the rabbits. The second centroid has positive standardized values (above-average height and weight); these must be the bears. Finally, the last one has above-average height but below-average weight: the wolves.
Now, let's see if the individual observations are correctly identified. You can find the individual cluster assignments in the .labels_ attribute. Let's add it to the original dataframe and compare:
db['clusters'] = kmeans.labels_
db[['name', 'clusters']]
I think some data visualization can help with this. We'll use a function similar to regplot, called lmplot, from the seaborn package. It works in a similar manner, but it allows you to color-code data points based on a categorical variable:
import seaborn as sb

sb.lmplot(x='sW', y='sH', hue='clusters', data=db, fit_reg=False)
plt.scatter(centroids.w, centroids.h, color='r', s=100)
plt.show()
You can see each cluster color-coded, with the centroids displayed as red points.
More than two dimensions
The file wine-clustering.csv contains data on different wines; thirteen different variables are measured for each wine. It is hard for us to identify clusters in more than two dimensions because we cannot rely on a visual representation. But because we can still compute the Euclidean distance for any number of dimensions, the K-Means algorithm still works. Let's try it.
Let's start with standardizing the data:
wineS = pd.DataFrame(scale(wine))
wineS.columns = wine.columns
wineS
Now we run the K-Means algorithm, this time with three centroids:
kmeans = KMeans(n_clusters=3, random_state=0).fit(wineS)
kmeans
Let's take a look at the centroids:
centroids = pd.DataFrame(kmeans.cluster_centers_)
centroids.columns = wine.columns
centroids
The values of the centroids are Z-Scores. So positive values mean values above average, negative values mean values below average.
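With thirteen variables it can help to summarize each centroid rather than eyeballing the whole table, for example by listing the variable that deviates most from the overall mean in each cluster. The numbers below are made up for illustration; in practice you would run this on the centroids table produced above:

```python
import pandas as pd

# Made-up centroid table: rows = clusters, columns = standardized variables
centroids = pd.DataFrame({
    'Alcohol':       [ 0.9, -1.0,  0.2],
    'Hue':           [ 0.5, -0.1, -0.9],
    'Total_Phenols': [ 1.1,  0.1, -1.0],
})

# For each cluster, the variable with the largest absolute Z-score
most_extreme = centroids.abs().idxmax(axis=1)
print(most_extreme)
```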
Let's save the cluster assignment back into the standardized dataframe:
wineS['cluster'] = kmeans.labels_
With the cluster assignment we can move on to data visualization. Let's see what the cluster classification looks like based on the variables Alcohol and Hue:
sb.lmplot(x='Alcohol', y='Hue', hue='cluster', data=wineS, fit_reg=False)
plt.scatter(centroids.Alcohol, centroids.Hue, color='r', s=100)
plt.show()
Let's try other dimensions to see if the classification remains as good as the previous one. Let's do Color_Intensity and Total_Phenols:
sb.lmplot(x='Color_Intensity', y='Total_Phenols', hue='cluster', data=wineS, fit_reg=False)
plt.scatter(centroids.Color_Intensity, centroids.Total_Phenols, color='r', s=100)
plt.show()
Interesting! The computer was able to capture all of these patterns with just a simple command, even for a high-dimensional problem like this one (13 variables).
Recommendations based on similarities
Using the concept of Euclidean distance we can do more than classify things (e.g., wines and species) into clusters. Similarity is the basis of some recommendation algorithms used in social media platforms and streaming services. The most basic version of a similarity-based recommendation algorithm works like this: from a database of past behavior, compute a centroid; use the Euclidean distance to identify a set of similar observations; then select a number of them at random and show them to the user.
Let's go over a simple example of a recommendation algorithm using the data for wines.
We are going to select 15 random wines from cluster 0, 2 from cluster 1, and 3 from cluster 2, and pretend that this is the data corresponding to the buying behavior of a consumer.
from0 = wineS[wineS.cluster == 0].sample(15)
from1 = wineS[wineS.cluster == 1].sample(2)
from2 = wineS[wineS.cluster == 2].sample(3)
customerData = pd.concat([from0, from1, from2])
customerData = customerData.drop(columns='cluster')
customerData
I'm using the .sample() function to take a random sample, and pd.concat() to concatenate (join) the different random samples together.
Now let's compute the centroid (the averages) of the customer's data:
centroid = pd.DataFrame(customerData.mean()).T
centroid
We are making the averages a dataframe and transposing it (.T transposes) simply so that the column names match the column names in the wineS dataframe.
Let's be sure to eliminate the cluster assignment from the standardized wine data so that it doesn't bias the distance calculation:
wineS = wineS.drop(columns='cluster')
wineS
Now, let's compute the Euclidean distances:
from sklearn.metrics.pairwise import euclidean_distances
eucD = euclidean_distances(centroid, wineS)
eucD
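euclidean_distances returns a 2-D array with one row per row of its first argument, which is why eucD has one row (the customer centroid) and one column per wine. A tiny self-contained check of that behavior, using made-up points:

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

center = np.array([[0.0, 0.0]])                  # one reference point
points = np.array([[3.0, 4.0], [6.0, 8.0]])      # two observations

d = euclidean_distances(center, points)
print(d)        # one row (the center) by two columns (the points)
print(d.shape)
```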
Then we can add it to the dataframe:
wineS['D'] = pd.DataFrame(eucD).T
Then we can filter the dataframe using some threshold for the similarity. For instance, let's keep only the wines with a distance below the 25th percentile, and from those select five random wines:
recommendation = wineS[wineS.D < wineS.D.quantile(0.25)].sample(5)
recommendation
Let's visualize how the recommendations compare to the original customer data in different dimensions:
sb.lmplot(x='Alcohol', y='Hue', data=customerData, fit_reg=False)
plt.scatter(recommendation.Alcohol, recommendation.Hue, color='r', s=100)
plt.show()
sb.lmplot(x='Color_Intensity', y='Total_Phenols', data=customerData, fit_reg=False)
plt.scatter(recommendation.Color_Intensity, recommendation.Total_Phenols, color='r', s=100)
plt.show()
2023-05-03