ASSIGNMENT TWO

Semester 1 - 2021


PAPER NAME: Data Mining and Machine Learning

PAPER CODE: COMP809

DUE DATE: Sunday 30 May 2021 (midnight)

TOTAL MARKS: 100


Note: This assignment must be completed individually.


Student Name: ………………………………………………………….……………………………………………….

Student ID: ………………………………………………………….………………………………….………………..


INSTRUCTIONS:

1. ACADEMIC INTEGRITY GUIDELINES

The following actions may be deemed to constitute a breach of the General Academic Regulations Part 7: Academic Discipline, Section 2 Dishonesty During Assessment or Course of Study

• 2.1.1 copies from, or inappropriately communicates with another person

• 2.1.3 plagiarises the work of another person without indicating that the work is not the student’s own – using the full work or partial work of another person without giving due credit to the original creator of that work

• 2.1.4 Unauthorised collaboration in Assessment - collaborates with others in the preparation of material, except where this has been approved as an assessment requirement. This includes contract cheating where a student obtains services to produce or assist with an assessment

• 2.1.5 resubmits previously submitted work without prior approval of the exam board

• 2.1.6 Using any other unfair means

Please email [email protected] immediately if you have any technical issues with your online submission on Blackboard.


Question One

In this question you are required to explore various architectures for building an Artificial Neural Network (ANN). The dataset for this experiment is the Breast Cancer dataset which can be downloaded from here. Use 70% of the data for training and the rest for testing.

Submit Python code used for this question.

1. Use sklearn.neural_network.MLPClassifier with a single hidden layer of k = 25 neurons. Use default values for all parameters other than the number of iterations. Determine the number of iterations that gives the highest accuracy. Use this classification accuracy as a baseline for comparison in later parts of this question.

[5 marks]
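
A minimal sketch of one way to set up the baseline search in part 1. It assumes scikit-learn's bundled Wisconsin breast cancer data as a stand-in for the dataset linked in the question. The candidate iteration counts, the fixed random_state, and the stratified split are illustrative assumptions rather than requirements of the question.

```python
import warnings

from sklearn.datasets import load_breast_cancer
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Assumption: the bundled dataset stands in for the one linked in the question.
X, y = load_breast_cancer(return_X_y=True)

# 70% training, 30% testing. random_state and stratify are added here only
# so that repeated runs are comparable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0, stratify=y
)


def baseline_accuracy(max_iter, k=25):
    """Test accuracy of a single hidden layer of k neurons for a given max_iter."""
    clf = MLPClassifier(hidden_layer_sizes=(k,), max_iter=max_iter, random_state=0)
    with warnings.catch_warnings():
        # Short runs stop before convergence; silence that warning for the sweep.
        warnings.simplefilter("ignore", ConvergenceWarning)
        clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)


candidate_iters = [50, 100, 200, 400, 800]  # hypothetical grid of iteration counts
results = {n: baseline_accuracy(n) for n in candidate_iters}
best_iter = max(results, key=results.get)

for n, acc in results.items():
    print(f"max_iter={n:<4} accuracy={acc:.3f}")
print("best max_iter:", best_iter)
```

The grid can be made finer around whichever value scores best. The accuracies depend on the split and the seed, so no particular numbers are implied here.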

2. Enable the loss value to be shown on the training segment and track the loss as a function of iteration count. You will observe that even when the loss value decreases, the error value can increase between consecutive iterations. Conversely, it is possible that the error value decreases when the loss increases between consecutive iterations. How do you explain this?

[5 marks]
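
In MLPClassifier, setting verbose=True prints the loss at each iteration, and the fitted model keeps the history in its loss_curve_ attribute. The small sketch below is one possible illustration of how loss and error can move in opposite directions. The four probability values are invented for the example, and the helper names log_loss and error_rate are hypothetical.

```python
import math


def log_loss(y_true, p):
    """Mean binary cross-entropy; p holds predicted probabilities of class 1."""
    total = sum(
        t * math.log(q) + (1 - t) * math.log(1 - q) for t, q in zip(y_true, p)
    )
    return -total / len(y_true)


def error_rate(y_true, p, threshold=0.5):
    """Fraction of samples whose thresholded prediction differs from the label."""
    wrong = sum((q >= threshold) != bool(t) for t, q in zip(y_true, p))
    return wrong / len(y_true)


y_true = [1, 1, 0, 0]

# Invented predictions at two consecutive iterations.
p_before = [0.55, 0.55, 0.45, 0.45]  # every sample correct, but only just
p_after = [0.99, 0.99, 0.01, 0.52]   # three confident hits, one narrow miss

print("loss :", log_loss(y_true, p_before), "->", log_loss(y_true, p_after))
print("error:", error_rate(y_true, p_before), "->", error_rate(y_true, p_after))
```

The loss is a continuous function of the predicted probabilities, whereas the error only changes when a prediction crosses the decision threshold, so the two need not move together.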

3. Experiment with two hidden layers and experimentally determine the split of the number of neurons across the two layers that gives the highest classification accuracy. In part 1, we had all k neurons in a single layer. In this part we will transfer neurons from the first hidden layer to the second iteratively in steps of 1. Thus, for example, in the first iteration the first hidden layer will have k-1 neurons whilst the second layer will have 1; in the second iteration k-2 neurons will be in the first layer with 2 in the second, and so on. Summarise your classification accuracy results in a 25 by 2 table with the first column specifying the combination of neurons used (e.g., 12, 13) and the second column specifying the classification accuracy.

[8 marks]
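
The transfer of neurons described in part 3 can be sketched as a loop over the size of the second layer. As before, the bundled breast cancer data, the fixed seed, and max_iter=300 are assumptions made for illustration. In practice the iteration count would be the one found in part 1.

```python
import warnings

from sklearn.datasets import load_breast_cancer
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

K = 25  # total number of hidden neurons shared between the two layers

# Assumption: the bundled dataset stands in for the one linked in the question.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0, stratify=y
)

rows = []
for second in range(1, K):  # move 1, 2, ..., K-1 neurons to the second layer
    first = K - second
    clf = MLPClassifier(
        hidden_layer_sizes=(first, second),
        max_iter=300,      # assumption: replace with the value found in part 1
        random_state=0,
    )
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", ConvergenceWarning)
        clf.fit(X_train, y_train)
    rows.append(((first, second), clf.score(X_test, y_test)))

print("neurons   accuracy")
for (first, second), acc in rows:
    print(f"{first:>2}, {second:<2}    {acc:.3f}")
```

This loop produces the 24 two-layer splits from (24, 1) to (1, 24). Using the single-layer baseline from part 1 as the remaining row of the 25 by 2 table is one possible reading, not something the loop decides.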

4. From the table created in part 3 of this question you will observe a variation in accuracy with the split of neurons across the two layers. Suggest possible reasons for this variation.

[4 marks]

5. By now you must be curious to see whether the trends that you noted in part 4 above hold true for other datasets as well. Identify four suitable criteria for selection of other datasets for further experimentation as described in part 3 of this question. These criteria must be based on metadata properties.

[5 marks]

6. Source two different datasets that meet your criteria from the UCI Machine Learning repository or Kaggle or any other public source. Briefly describe how these datasets fit the criteria that you identified in part 5.

[5 marks]

7. Now experiment with each of the two datasets that you identified and produce two 25 by 2 tables like the one that you produced in part 3 of this question. Compare the tables that you produce in this part with the one from part 3.

a) Discuss whether the trends that you identified in part 4 are also true for the new datasets. [3 marks]

b) Give reasons for any difference in trends. [5 marks]


Question Two

For this question you will explore clustering methods you have learnt in this course. You have been given datasets from four very different application environments and you are required to explore three widely used clustering algorithms and deploy each of them on the different datasets.

The three algorithms that you have decided to explore are 1) K-Means 2) DBSCAN and 3) Agglomerative.

The four datasets that you have been given are:

● Dow Jones Index

● Facebook Live Sellers in Thailand

● Sales Transactions

● Water Treatment Plant

You need to complete three tasks as detailed below.


Task 1

For each activity in this task, you must apply a suitable feature selection algorithm before deploying each clustering algorithm. Your clustering results should include the following measures:

Time taken, Sum of Squared Errors (SSE), Cluster Silhouette Measure (CSM). You may use the Davies-Bouldin score as an alternative to SSE.
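
A minimal sketch of how the measures listed above could be collected for a single clustering run. It uses synthetic blobs rather than any of the four datasets. The helper name sse is hypothetical, and it is defined on labels alone so that it also applies to algorithms that do not expose an inertia value.

```python
import time

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score


def sse(X, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for lab in np.unique(labels):
        members = X[labels == lab]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return float(total)


# Synthetic stand-in data, not one of the assignment datasets.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

start = time.perf_counter()
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
elapsed = time.perf_counter() - start

labels = model.labels_
report = {
    "time_s": elapsed,
    "sse": sse(X, labels),
    "csm": silhouette_score(X, labels),            # higher is better, in [-1, 1]
    "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
}
for name, value in report.items():
    print(f"{name:<15} {value:.4f}")
```

For K-Means the value returned by sse should agree closely with the fitted model's inertia_ attribute, which offers a quick sanity check.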

Submit Python code used for parts a) to c) below. You only need to submit the code for one of the four datasets.

a) Run the K-Means algorithm on each of the four datasets. Obtain the best value of K using SSE and/or CSM. Tabulate your results in a 4 by 3 table, with each row corresponding to a dataset and each column corresponding to one of the three measures mentioned above. Display the CSM plot for the best value of the K parameter for each dataset. [15 marks]
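
One possible way to choose K by CSM is sketched below on synthetic data. VarianceThreshold is used only as an example of an unsupervised feature selection step, the range of K values is an arbitrary choice, and the constant column is added artificially so that the selector has something to remove.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with one artificial constant (uninformative) column.
X_raw, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)
X_raw = np.hstack([X_raw, np.ones((X_raw.shape[0], 1))])

# Feature selection: drop features with zero variance, then standardise.
X_sel = VarianceThreshold(threshold=0.0).fit_transform(X_raw)
X = StandardScaler().fit_transform(X_sel)

scores = {}
for k in range(2, 9):  # silhouette is undefined for a single cluster
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"K={k}  CSM={s:.3f}")
print("best K by CSM:", best_k)
```

For the CSM plot itself, sklearn.metrics.silhouette_samples returns the per-sample silhouette values for a given labelling, which can then be drawn cluster by cluster.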

b) Repeat the same activity for the DBSCAN algorithm and tabulate your results once again, just as you did for part a). Display the CSM plot and the 4 by 3 table for each dataset. [10 marks]
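
DBSCAN labels noise points as -1, which needs some care when computing CSM. The sketch below, on synthetic two-moons data, shows one way to handle this by scoring only the clustered points. The eps and min_samples values are illustrative assumptions and would need tuning for each real dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not one of the assignment datasets.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

# Assumption: eps and min_samples are example values only.
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

clustered = labels != -1                  # mask of points assigned to a cluster
n_clusters = len(set(labels[clustered]))
n_noise = int((~clustered).sum())

# Silhouette and Davies-Bouldin need at least two clusters to be defined.
if n_clusters >= 2:
    csm = silhouette_score(X[clustered], labels[clustered])
    dbi = davies_bouldin_score(X[clustered], labels[clustered])
else:
    csm = dbi = float("nan")

print("clusters:", n_clusters, "noise points:", n_noise)
print("CSM:", csm, "Davies-Bouldin:", dbi)
```

Excluding noise points tends to make the scores look better than they would if noise were counted, so the number of noise points is worth reporting alongside the measures.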

c) Finally, use the Agglomerative algorithm and document your results as you did for parts a) and b). Display the CSM plot and the 4 by 3 table for each dataset. [10 marks]
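
For the Agglomerative algorithm, both the number of clusters and the linkage criterion affect the result. The following is a sketch under assumptions (synthetic data, an arbitrary range of cluster counts) of how the combinations could be timed and compared by CSM.

```python
import time

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data, not one of the assignment datasets.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

scores = {}
timings = {}
for linkage in ("ward", "complete", "average", "single"):
    for k in range(2, 7):
        start = time.perf_counter()
        labels = AgglomerativeClustering(n_clusters=k, linkage=linkage).fit_predict(X)
        timings[(linkage, k)] = time.perf_counter() - start
        scores[(linkage, k)] = silhouette_score(X, labels)

best = max(scores, key=scores.get)
print("best (linkage, k):", best)
print(f"CSM={scores[best]:.3f}  time={timings[best]:.4f}s")
```

Agglomerative clustering has no fitted centroids, so SSE would have to be computed from the labels, for example from per-cluster means.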


Task 2

a) For each dataset identify which clustering algorithm performed best. Justify your answer.

In the event that no single algorithm performs best on all three performance measures, you will need to carefully consider how you will rate each of the measures and then decide how you will produce an overall measure that will enable you to rank the algorithms. [10 marks]

b) For each winner algorithm and for each dataset explain why it produced the best value for the CSM measure. This explanation must refer directly to the conceptual design details of the algorithm. There is no need to produce any further experimental evidence for this part of the question. [10 marks]

c) Based on what you produced in a) above, which clustering algorithm would you consider to be the overall winner (i.e., after taking into consideration performance across all four datasets)? Justify your answer. [5 marks]
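
When no algorithm wins on every measure, as discussed in Task 2 a), one possible approach is a weighted sum of ranks. The sketch below shows only the mechanics. All numbers are placeholders invented for the example, not results from any dataset, and the weights are a hypothetical choice that would need its own justification.

```python
# Placeholder values only: these are NOT results from any of the four datasets.
measurements = {
    "K-Means":       {"time_s": 0.10, "dbi": 0.80, "csm": 0.55},
    "DBSCAN":        {"time_s": 0.05, "dbi": 1.10, "csm": 0.40},
    "Agglomerative": {"time_s": 0.30, "dbi": 0.70, "csm": 0.60},
}

# Direction of each measure: only CSM improves as it grows.
higher_is_better = {"time_s": False, "dbi": False, "csm": True}

# Hypothetical weights expressing how much each measure matters.
weights = {"time_s": 0.2, "dbi": 0.4, "csm": 0.4}


def rank_positions(measure):
    """Rank algorithms on one measure; position 1 is the best."""
    ordered = sorted(
        measurements,
        key=lambda alg: measurements[alg][measure],
        reverse=higher_is_better[measure],
    )
    return {alg: pos for pos, alg in enumerate(ordered, start=1)}


def overall_scores():
    """Weighted sum of rank positions; a lower total is better."""
    totals = {alg: 0.0 for alg in measurements}
    for measure, weight in weights.items():
        for alg, pos in rank_positions(measure).items():
            totals[alg] += weight * pos
    return totals


totals = overall_scores()
winner = min(totals, key=totals.get)
print(totals)
print("lowest weighted rank with these placeholder values:", winner)
```

Repeating this for each dataset and then summing or averaging the totals is one way to approach the overall comparison in part c). Other aggregation schemes, such as normalised scores instead of ranks, are equally defensible.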