MASTER OF COMPUTER AND INFORMATION SCIENCES


COMP809

Data Mining & Machine Learning


ASSIGNMENT ONE


Semester 1, 2021


Due: Friday 16 April at midnight.

Weighting: 50%

Note: This assignment may be completed individually or in groups of size 2.

Submission: A soft copy needs to be submitted through Turnitin (a link for this purpose will be set up in Blackboard). When submitting the assessment, make sure that the name(s) and student ID(s) are indicated on the front page of the report.


AIMS

The aim of this assignment is two-fold. Firstly, in Part A, you are required to conduct a literature review of data mining applications in industry, which will give you further insight into the ways that data mining is used in practice. Secondly, in Part B, you will apply data mining techniques yourself to two real-world problems.


Part A

Your survey should cover two different application areas (ensure that these are from different domains, e.g. banking, health, etc.). The survey is intended to assist you in establishing a suitable framework (application area, tools, algorithms) on which your mining project will be based.


DELIVERABLES

Background information on the organisation that initiated the Data Mining application.

A brief description of the target application (e.g. detecting credit card fraud, diagnosing heart disease, etc.) and the objectives of the data mining exercise undertaken.

A description of the data used in the mining exercise (the level of detail published here will differ due to commercial sensitivity, hence flexibility will be used in the marking of this section).

A description of the mining tools (data mining software) used, together with an identification (no details required) of the mining algorithms and how the mining algorithms were applied on the data.

Discussion of the outcomes and benefits (be as specific as possible, talk about accuracy of results, potential or actual savings in dollar terms or time savings; do not talk in vague, general terms) to the organisation that resulted from the mining exercise. This discussion should contain, in addition to the published material, your own reflection on the level of success achieved by the organisation in meeting their stated aims and objectives.

The total length of your report for Part A is expected to be no longer than 3 pages (1.5 pages for each case study). The criteria that will be used for assessment in Part A are as follows:

  Criterion                          Mark
  Overall Quality of Presentation    6
  Background                         6*2=12
  Tools and Mining algorithms        8*2=16
  Outcomes and Benefits              8*2=16


Part B

This part allows you to solve two real-world data mining problems using Python. In the two questions given below, justification of your answers carries a high proportion (50%) of the marks awarded.

Q1: Application Area 1 (dataset for this is Mortgage.csv; dataset description is in Mortgage.txt)

This application is concerned with predicting the outcome of mortgage applications. The dataset contains 700 applications for mortgages for which outcomes (paid back=0, default on loan=1) are known. A further 150 mortgages have been granted, but the outcomes for these are not yet known as the loans are still in progress.

You are required to build a model using the Decision Tree learner and answer the following questions based on the model built. Use the data segment containing the 700 mortgages whose outcomes are known. In building the model, use the 10-fold cross-validation option for testing.

Your answers below need to be supported by suitable evidence, wherever appropriate. Some examples of suitable evidence are Decision Trees, Confusion Matrices, Model Visualizations and Summary Statistics.
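The cross-validation setup described above can be sketched roughly as follows. This is only an illustration, not a prescribed solution: it assumes scikit-learn is available, and it uses synthetic data in place of Mortgage.csv, since the column names are not given in this brief. The 80/20 class balance and the number of features are made-up values.

```python
# Sketch only: synthetic data stands in for Mortgage.csv (columns not listed in this brief).
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# 700 rows with a binary outcome (0 = paid back, 1 = default on loan).
# The 80/20 class split and the 8 features are assumptions for illustration.
X, y = make_classification(n_samples=700, n_features=8, n_informative=4,
                           weights=[0.8, 0.2], random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
tree = DecisionTreeClassifier(random_state=0)

# Each row is predicted by a model trained on the other nine folds.
y_pred = cross_val_predict(tree, X, y, cv=cv)

cm = confusion_matrix(y, y_pred, labels=[0, 1])
print("Cross-validated accuracy:", accuracy_score(y, y_pred))
print("Confusion matrix (rows = actual, columns = predicted):")
print(cm)

# Fit once on all rows to inspect the size of the resulting tree.
tree.fit(X, y)
print("Nodes in the unrestricted tree:", tree.tree_.node_count)
```

Pooling the out-of-fold predictions gives a single confusion matrix for all 700 rows, which is one convenient form of evidence. Reporting per-fold results instead would be equally valid.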

a) Using an appropriate method, identify the top 4 most influential features in classifying this dataset. [5 marks]

b) Now build a model using the Decision Tree Classifier. By adjusting two suitable parameters (one at a time), reduce the size of the tree to no more than 10 to 15 nodes in order to improve the interpretability of the model generated. Which of the two parameters yielded better accuracy while producing smaller trees? [5 marks]

c) Describe the role that the two parameters you used in b) above play in building the model. Do you expect that manipulating these parameters in the same way will improve accuracy for other types of datasets? Justify your answer. [8 marks]

d) Examine the Confusion Matrix carefully. You will notice that the success rate of predictions for the “default on loan” (1) outcome is significantly smaller than the corresponding success rate for the 0 outcome. Why do you think this happens? Will a suitable visualization help to explain this phenomenon? [5 marks]

e) Suppose the model is applied to the 150 mortgages whose outcomes are not yet known. Do you expect to replicate the same level of success as with the 700 mortgages that you built the model from, or do you expect the prediction to be significantly worse? Justify your answer. Hint: Examine the data distributions of the two sets of data and look for similarities or differences between the two. [10 marks]
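As a rough illustration of the hint in e), the sketch below compares summary statistics for a single numeric feature across the two sets of mortgages. The feature name and every value are invented for illustration and are not taken from Mortgage.csv. The standardised mean difference is just one possible way to quantify a shift, not a required method.

```python
# Sketch only: both lists hold invented values for one hypothetical numeric feature.
from statistics import mean, median, stdev

labelled_income = [32, 45, 51, 38, 60, 47, 55, 41, 36, 58]    # stand-in for the 700 known-outcome rows
unlabelled_income = [72, 65, 80, 58, 91, 77, 69, 84, 62, 75]  # stand-in for the 150 in-progress rows


def summarise(values):
    """Return basic summary statistics for one feature."""
    return {
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
        "min": min(values),
        "max": max(values),
    }


known = summarise(labelled_income)
unknown = summarise(unlabelled_income)

for stat in known:
    print(f"{stat:>6}: known={known[stat]:.2f}  unknown={unknown[stat]:.2f}")

# Standardised mean difference: a large value suggests the model would be
# applied to data that looks unlike the data it was trained on.
pooled_sd = ((known["stdev"] ** 2 + unknown["stdev"] ** 2) / 2) ** 0.5
shift = abs(known["mean"] - unknown["mean"]) / pooled_sd
print(f"Standardised mean difference: {shift:.2f}")
```

In practice the same comparison would be repeated for each feature, and histograms or box plots of the two sets side by side would give visual support for the justification.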


Q2: Application Area 2 (dataset for this is Heart.arff)

This application is from the Medical domain and is concerned with predictions of heart disease for a collection of individuals from whom relevant medical data has been obtained. The objective is to predict whether a given individual will suffer from heart disease (outcome 2) or not (outcome 1) in a year’s time from gathering the data.

For this dataset, you will use both the Decision Tree Classifier and Naïve Bayes algorithms to mine the data. Use the 10-fold cross-validation option to test the performance of both models.
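A minimal sketch of evaluating both learners on the same 10 folds is shown below. It assumes scikit-learn and uses synthetic data in place of Heart.arff. The number of rows and features is arbitrary, and `GaussianNB` is only one Naïve Bayes variant (a categorical variant may suit the real attributes better).

```python
# Sketch only: synthetic data stands in for Heart.arff; sizes are arbitrary.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=1)
y = y + 1  # relabel 0/1 as 1/2 to mirror the outcome coding (1 = no disease, 2 = disease)

# Using the same splitter for both models means they are scored on identical folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "Naive Bayes": GaussianNB(),
}

scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores[name].mean():.3f} "
          f"(std = {scores[name].std():.3f})")
```

Reporting the spread across folds alongside the mean makes it easier to judge whether a difference between the two models is meaningful.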

1. Compare the accuracy of the Naïve Bayes algorithm, which assumes independence between the features, with the accuracy of a probabilistic model that allows dependent features. Which one is preferred? Support your answer with reference to the dataset. [5 marks]

2. Use the probability table produced by the Naïve Bayes model to identify the top 3 (feature, value) pairs that predict the presence of heart disease. Show all working. [7 marks]

3. Now run the Decision Tree algorithm and compare the top 3 features identified in 2 above with the most significant features produced by the Decision Tree model. Identify similarities and differences. Discuss any differences. [5 marks]
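For item 2 above, one plausible way to rank (feature, value) pairs is by the ratio of their conditional probabilities under the two outcomes. The sketch below assumes this likelihood-ratio criterion, which is an illustrative choice rather than the required method. The feature names and probabilities are invented and do not come from Heart.arff.

```python
# Sketch only: invented values of P(feature = value | class); not taken from Heart.arff.
prob_table = {
    ("chest_pain", "asymptomatic"): {"disease": 0.60, "no_disease": 0.20},
    ("exercise_angina", "yes"):     {"disease": 0.40, "no_disease": 0.16},
    ("sex", "male"):                {"disease": 0.80, "no_disease": 0.60},
    ("fasting_sugar", "high"):      {"disease": 0.15, "no_disease": 0.15},
}


def likelihood_ratio(probs):
    """How much more likely the value is when disease is present than when it is absent."""
    return probs["disease"] / probs["no_disease"]


# Sort pairs from the strongest to the weakest indicator of disease.
ranked = sorted(prob_table,
                key=lambda pair: likelihood_ratio(prob_table[pair]),
                reverse=True)
top3 = ranked[:3]

for feature, value in top3:
    ratio = likelihood_ratio(prob_table[(feature, value)])
    print(f"{feature} = {value}: likelihood ratio = {ratio:.2f}")
```

A ratio close to 1 means the value is about equally common in both outcome groups and so carries little predictive information on its own.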



Good luck!