ECS607U Data Mining Assignment 2


Assignment 2

Task 1

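In this task we write a function that returns the Kulczynski measure of a rule A => B. The Kulczynski measure is defined as:

Kulc(A, B) = (1/2) * (P(A|B) + P(B|A)) = (1/2) * (sup(A ∪ B) / sup(A) + sup(A ∪ B) / sup(B))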

The Groceries dataset, which contains 38765 rows for market basket analysis, is used as the source of itemsets. Some data samples are as follows:

In the Kulczynski function, we need to calculate the support of A, of B, and of A and B together. The function code is as follows:

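import pandas as pd
import numpy as np

def kulczynski(item_sets, rule):
    # Assumes groceries_data is a DataFrame whose 'itemDescription'
    # column holds the collection of items in each transaction.
    rule_a_part, rule_b_part = rule
    # Number of transactions that contain every item of A
    support_a = sum(groceries_data['itemDescription'].apply(
        lambda x: len(set(x) & set(rule_a_part)) == len(rule_a_part)))
    # Number of transactions that contain every item of B
    support_b = sum(groceries_data['itemDescription'].apply(
        lambda x: len(set(x) & set(rule_b_part)) == len(rule_b_part)))
    # Number of transactions that contain every item of both A and B
    support_a_and_b = sum(groceries_data['itemDescription'].apply(
        lambda x: len(set(x) & set(rule_a_part + rule_b_part))
        == len(rule_a_part + rule_b_part)))
    return (support_a_and_b / support_a + support_a_and_b / support_b) * 0.5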
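
As a quick sanity check on the definition, the following minimal sketch computes the Kulczynski measure on a handful of made-up transactions. The transaction list and the helper names `support_count` and `kulczynski_toy` are illustrative assumptions; they are not part of the Groceries dataset or of the assignment code.

```python
# Toy transactions (hypothetical data, not taken from the Groceries dataset).
transactions = [
    ["milk", "bread"],
    ["milk", "bread", "butter"],
    ["milk"],
    ["bread"],
    ["milk", "butter"],
]

def support_count(transactions, items):
    """Count the transactions that contain every item in `items`."""
    wanted = set(items)
    return sum(1 for t in transactions if wanted <= set(t))

def kulczynski_toy(transactions, rule):
    """Average of the two conditional probabilities P(B|A) and P(A|B)."""
    a_part, b_part = rule
    sup_a = support_count(transactions, a_part)
    sup_b = support_count(transactions, b_part)
    sup_ab = support_count(transactions, list(a_part) + list(b_part))
    return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)

kulc = kulczynski_toy(transactions, (["milk"], ["bread"]))
print(round(kulc, 4))
```

In this toy data milk appears in 4 transactions, bread in 3, and both together in 2, so the measure is 0.5 * (2/4 + 2/3), roughly 0.583.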


Task 2

In this task we write a function that returns the imbalance ratio of a rule A => B. The imbalance ratio is defined as:

IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A ∪ B))

The Groceries dataset is again used as the source of itemsets. In the imbalance ratio function, we need to calculate the support of A, of B, and of A and B together. The function code is as follows:

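import pandas as pd
import numpy as np

def imbalance_ratio(item_sets, rule):
    # Assumes groceries_data is a DataFrame whose 'itemDescription'
    # column holds the collection of items in each transaction.
    rule_a_part, rule_b_part = rule
    # Number of transactions that contain every item of A
    support_a = sum(groceries_data['itemDescription'].apply(
        lambda x: len(set(x) & set(rule_a_part)) == len(rule_a_part)))
    # Number of transactions that contain every item of B
    support_b = sum(groceries_data['itemDescription'].apply(
        lambda x: len(set(x) & set(rule_b_part)) == len(rule_b_part)))
    # Number of transactions that contain every item of both A and B
    support_a_and_b = sum(groceries_data['itemDescription'].apply(
        lambda x: len(set(x) & set(rule_a_part + rule_b_part))
        == len(rule_a_part + rule_b_part)))
    return np.abs(support_a - support_b) / (support_a + support_b - support_a_and_b)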
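
To make the imbalance ratio concrete, here is a small sketch on made-up baskets. The data and the names `count_containing` and `imbalance_ratio_toy` are hypothetical and only illustrate the formula; they do not reproduce the assignment's results.

```python
# Toy baskets (hypothetical data, not taken from the Groceries dataset).
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"bread"},
    {"milk", "butter"},
]

def count_containing(baskets, items):
    """Count the baskets that contain every item in `items`."""
    wanted = set(items)
    return sum(1 for basket in baskets if wanted <= basket)

def imbalance_ratio_toy(baskets, a_part, b_part):
    """|sup(A) - sup(B)| divided by the number of baskets with A or B."""
    sup_a = count_containing(baskets, a_part)
    sup_b = count_containing(baskets, b_part)
    sup_ab = count_containing(baskets, set(a_part) | set(b_part))
    return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)

ir = imbalance_ratio_toy(baskets, ["milk"], ["bread"])
print(ir)
```

With supports of 4 (milk), 3 (bread) and 2 (both), the ratio is |4 - 3| / (4 + 3 - 2) = 0.2. A value near 0 means the two sides of the rule are about equally frequent, and swapping A and B leaves the value unchanged.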

Task 3

Excluding single-item sets, there are (N-1)! possible valid itemsets.

Task 4

The box plot identifies a single outlier, 5.49. Under the standard 1.5 x IQR whisker rule, a box plot treats as outliers the points that fall in roughly the outer 1% (about 0.7%) of a normally distributed population.
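
The whisker rule can be sketched in a few lines. This is a minimal illustration that assumes the conventional 1.5 x IQR fences; the sample values are hypothetical apart from 5.49, which is the outlier reported above.

```python
import statistics

# Hypothetical sample; only the value 5.49 comes from the text above.
values = [1.2, 1.5, 1.7, 1.9, 2.0, 2.1, 2.3, 2.6, 5.49]

# Quartiles with linear interpolation between data points.
q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Points beyond these fences are drawn as outliers in a box plot.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]
print(outliers)
```

For this sample the quartiles are 1.7 and 2.3, the fences are about 0.8 and 3.2, and 5.49 is the only value outside them.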

Task 5

After reading the dataset, the percentage change between consecutive rows was calculated, and a one-class SVM classifier was used to identify outliers. The 3D scatter plot of the dataset, with each object color-coded by whether it is an outlier, is shown in the figure below.

The histogram and the frequencies of the anomaly detection results are shown in the figure below; 18% of the samples are flagged as abnormal. The results of the one-class SVM are not based on distance or density.

import numpy as np
import pandas as pd

# Load the Excel file and set the 'Date' values as the index of each row
stocks = pd.read_excel('stocks.xlsx', header=0)
stocks.index = stocks['Date']
stocks = stocks.drop(['Date'], axis=1)

# Percentage change of each row relative to the previous row
N, d = stocks.shape
delta = pd.DataFrame(
    100 * np.divide(stocks.iloc[1:, :].values - stocks.iloc[:N-1, :].values,
                    stocks.iloc[:N-1, :].values),
    columns=stocks.columns, index=stocks.iloc[1:].index)

from sklearn.svm import OneClassSVM

# fit_predict returns 1 for inliers and -1 for outliers
ee = OneClassSVM(nu=0.01, gamma='auto')
yhat = ee.fit_predict(delta)

import matplotlib.pyplot as plt

# delta starts at the second row of stocks, hence the + 1 offset
normal = np.where(yhat == 1)[0] + 1
abnormal = np.where(yhat == -1)[0] + 1

fig = plt.figure(dpi=150)
ax = fig.add_subplot(projection='3d')
ax.scatter(stocks['MSFT'].iloc[normal], stocks['F'].iloc[normal],
           stocks['BAC'].iloc[normal])
ax.scatter(stocks['MSFT'].iloc[abnormal], stocks['F'].iloc[abnormal],
           stocks['BAC'].iloc[abnormal])
plt.legend(['Normal', 'Abnormal'])
ax.set_xlabel('MSFT')
ax.set_ylabel('F')
ax.set_zlabel('BAC')
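
The percentage-change step can be illustrated on its own. The sketch below uses a short list of made-up prices for a single stock (an assumption for illustration, not data from stocks.xlsx) and plain Python instead of pandas.

```python
# Hypothetical closing prices for one stock on consecutive days.
prices = [100.0, 102.0, 99.96, 104.958]

# Percentage change between each day and the previous one:
# 100 * (today - yesterday) / yesterday
pct_change = [
    100.0 * (today - yesterday) / yesterday
    for yesterday, today in zip(prices, prices[1:])
]

print([round(p, 2) for p in pct_change])
```

The result has one entry fewer than the input because the first day has no predecessor. This is why an index into the change series has to be shifted by one to refer back to the matching row of the original prices.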

Task 6

The dataset is first reduced to two dimensions with PCA, and the distance from each sample to its nearest neighbors is then computed. The sample distances are plotted as a box plot. As shown in the figure below, samples with a distance greater than 2.5 can be regarded as outliers.

The PCA scatter plot is shown in the figure below. The outliers lie comparatively far from the other points.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv(url, header=None)
data = df.values
X, y = data[:, :-1], data[:, -1]

from sklearn.decomposition import PCA

# Project the features onto the first two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

from sklearn.neighbors import NearestNeighbors

knn = NearestNeighbors(n_neighbors=2)
knn.fit(X_pca)
# Assumed step: the listing uses `distances` without defining it
distances, _ = knn.kneighbors(X_pca)
score = distances.mean(1)

plt.scatter(X_pca[score < 2.5, 0], X_pca[score < 2.5, 1])
plt.scatter(X_pca[score > 2.5, 0], X_pca[score > 2.5, 1])
plt.legend(['Normal', 'Abnormal'])
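
The idea of scoring samples by neighbor distance can be shown without scikit-learn. The following sketch is a simplification: it uses the distance to the single nearest other point rather than a mean over neighbors, and the points are hypothetical stand-ins for PCA-projected samples. Only the threshold of 2.5 is taken from the text.

```python
import math

# Hypothetical 2-D points standing in for PCA-projected samples.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (6.0, 6.0)]
threshold = 2.5  # same cut-off as in the text

def nearest_neighbor_distance(index, points):
    """Euclidean distance from points[index] to its closest other point."""
    return min(
        math.dist(points[index], other)
        for j, other in enumerate(points)
        if j != index
    )

nn_distances = [nearest_neighbor_distance(i, points) for i in range(len(points))]
flagged = [p for p, d in zip(points, nn_distances) if d > threshold]
print(flagged)
```

The four clustered points each have a neighbor at distance 0.5, while (6.0, 6.0) is more than 7 units from its closest neighbor, so it is the only point flagged.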

Task 7

There are many HTML tags in the web page, such as html, body, h1, p, table, thead, tr, th and td. According to the HTML documentation, the meaning of each tag is as follows:

- html, start and end of the page
- body, start and end of the page content
- h1, top-level heading
- p, paragraph
- table, start and end of a table
- thead, header section of a table
- tr, row of a table
- th, header cell of a table
- td, data cell of a table

With Beautiful Soup, the data is scraped from the page and stored in a table. The result is shown in the figure below.

import requests
import pandas as pd
from bs4 import BeautifulSoup

data = BeautifulSoup(
    requests.get('http://eecs.qmul.ac.uk/~emmanouilb/income_table.html').text,
    'html.parser')

# Locate the table first, then read the header cells from <thead>
table = data.find('table')
header = [x.text.strip() for x in table.find_all('thead')[0].find_all('th')]

# Read the data cells of every row in <tbody>
table_body = table.find('tbody')
rows = table_body.find_all('tr')
table_data = []
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    table_data.append([ele for ele in cols if ele])

table_data = pd.DataFrame(table_data, columns=header)
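
To show how the th and td tags map onto a header and data rows, here is a minimal sketch that parses an inline HTML snippet. It deliberately swaps Beautiful Soup for Python's standard-library `html.parser`, and the HTML content and the class name `TableParser` are hypothetical, so it illustrates the tag structure rather than the scraping code used in this task.

```python
from html.parser import HTMLParser

# Hypothetical HTML snippet using the tags listed above.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Region</th><th>Income</th></tr></thead>
  <tbody>
    <tr><td>North</td><td>100</td></tr>
    <tr><td>South</td><td>200</td></tr>
  </tbody>
</table>
"""

class TableParser(HTMLParser):
    """Collect <th> text into `header` and <td> text into `rows`."""

    def __init__(self):
        super().__init__()
        self.header = []
        self.rows = []
        self._cell = None  # tag of the cell currently open: 'th', 'td' or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # every <tr> starts a new row
        elif tag in ("th", "td"):
            self._cell = tag

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._cell = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._cell is None:
            return
        if self._cell == "th":
            self.header.append(text)
        else:
            self.rows[-1].append(text)

parser = TableParser()
parser.feed(SAMPLE_HTML)
rows = [row for row in parser.rows if row]  # drop the header row, which has no <td>
print(parser.header, rows)
```

The header list and the row lists produced here have the same shape as the `header` and `table_data` values that are passed to `pd.DataFrame` in the listing above.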

Task 8

Using Beautiful Soup, we crawl the web page for each of several keywords. The frequency of the words in each article is counted, and the web pages are then clustered on these counts. We experimented with various numbers of clusters and recorded the clustering loss for each. The elbow diagram is shown in the figure below. It can be seen from the figure that the optimal number of clusters is 6.

The final clustering results are as follows. The clustering results are reasonable: semi-supervised and supervised learning are in the same category, and data mining and data warehouse are in the same category.

1. ['unsupervised learning' 'anomaly detection' 'dimensionality reduction' 'statistical classification']

2. ['association rule learning']

3. ['data mining' 'data warehouse']

4. ['supervised learning' 'semi-supervised learning' 'online analytical processing']

5. ['cluster analysis']

6. ['regression analysis']

data_texts = []
for keyword in keywords:
    # Build the Wikipedia page title: capitalize the first letter, spaces to underscores
    keyword = keyword[0].upper() + keyword[1:]
    keyword = keyword.replace(' ', '_')
    data = requests.get('https://en.wikipedia.org/wiki/' + keyword).text
    data_text = BeautifulSoup(data).find('div', attrs={'class':

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

data_bag_of_words =

# Fit k-means for each candidate number of clusters and record the loss
sse = []
for n_clusters in range(2, 13):
    km = KMeans(n_clusters=n_clusters)
    km.fit(data_bag_of_words)
    sse.append(km.inertia_)

# Final model with the number of clusters chosen from the elbow diagram
km = KMeans(n_clusters=6)
km.fit(data_bag_of_words)
for idx in range(6):
    print(np.array(keywords)[km.labels_ == idx])
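
The word-frequency step described in this task can be sketched in plain Python. This is only a stand-in for the kind of count matrix a bag-of-words vectorizer produces: the two short texts are made up, the names `documents`, `vocabulary` and `count_vector` are hypothetical, and the splitting on whitespace ignores the tokenization details of `CountVectorizer`.

```python
from collections import Counter

# Hypothetical short texts standing in for the crawled article bodies.
documents = {
    "data mining": "data mining finds patterns in data",
    "cluster analysis": "cluster analysis groups similar data",
}

# Shared vocabulary, sorted so that every vector uses the same column order.
vocabulary = sorted({word for text in documents.values() for word in text.split()})

def count_vector(text, vocabulary):
    """Return how often each vocabulary word occurs in `text`."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

vectors = {name: count_vector(text, vocabulary) for name, text in documents.items()}

print(vocabulary)
for name, vector in vectors.items():
    print(name, vector)
```

Each document becomes one row of word counts over the shared vocabulary, and rows of this form are what a clustering algorithm such as k-means then groups.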