U25764

PRACTICAL DATA ANALYTICS AND MINING

CW II: Data Mining and Machine Learning [60% of the module total mark]

In the ever-evolving field of Data Mining and Machine Learning, classification stands as a cornerstone technique, essential for transforming raw data into actionable insights. Classification is one of the main supervised learning methods, where the objective is to predict categorical class labels of new observations based on past data. The significance of classification lies in its wide array of applications, from predicting customer behavior in business to diagnosing diseases in healthcare. Your task is:

1. Dataset Selection and Justification (10%):

● You are required to search for at least 10 different datasets.

● Ensure diversity in datasets in terms of size, complexity, and domain (e.g., healthcare, finance, social media).

● Clearly justify your choice of each dataset in terms of its characteristics, e.g.:

○ Number of attributes.

○ Number of labels (classes).

○ Size of the dataset.

○ Type of attributes.

○ Missing values.

○ Class balance.

● You can use tables to show the characteristics of your data.

● The marks will be based on the datasets' characteristics and your justification.
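The characteristics listed above can be collected programmatically and shown as a table. The following is a minimal sketch in Python with pandas, not a required approach. The helper name `summarise_dataset`, the `balance_ratio` measure, and the toy data are illustrative assumptions and are not part of the brief.

```python
import pandas as pd


def summarise_dataset(df: pd.DataFrame, target: str) -> dict:
    """Collect the dataset characteristics named in task 1 (hypothetical helper)."""
    features = df.drop(columns=[target])
    class_counts = df[target].value_counts()
    return {
        "n_instances": len(df),
        "n_attributes": features.shape[1],
        "n_numeric": features.select_dtypes(include="number").shape[1],
        "n_nominal": features.select_dtypes(exclude="number").shape[1],
        "n_classes": int(class_counts.size),
        "missing_values": int(df.isna().sum().sum()),
        # Smallest class size divided by largest: 1.0 means perfectly balanced.
        "balance_ratio": round(class_counts.min() / class_counts.max(), 2),
    }


# Tiny made-up frame, for illustration only.
toy = pd.DataFrame({
    "age": [25, 40, None, 31],
    "plan": ["basic", "pro", "pro", "basic"],
    "churned": ["no", "yes", "no", "no"],
})
summary = summarise_dataset(toy, target="churned")
print(pd.DataFrame([summary], index=["toy"]))
```

Calling the helper once per dataset and stacking the resulting rows gives a single comparison table covering all selected datasets.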

2. Exploratory Analytics and Data Visualisation (15%):

● Employ a variety of visualization tools to create insightful graphics. These might include, but are not limited to, histograms, scatter plots, box plots, and heat maps to understand distributions and relationships in the data.

● The expectation is to draw 5 charts or more.

○ These visualizations should ideally cover a range of aspects from the datasets.

○ The charts can either encapsulate insights from all the datasets collectively or focus on individual datasets. It's crucial to demonstrate a mix of both these approaches to ensure comprehensive coverage.

● The evaluation of these visualizations will be based on several key criteria:

○ Quality: This pertains to the technical accuracy and clarity of the visualizations.

○ Diversity: The variety of charts should showcase a range of visual techniques and highlight different aspects of the data.

○ Clarity: Visualizations should be clear and uncluttered, with appropriate labeling, legends, and annotations where necessary.

○ Coverage: The charts should collectively provide a comprehensive view of the datasets, ensuring no significant aspect is overlooked.

○ Novelty: Creative and innovative approaches in presenting data visually will be highly regarded. Novelty can be in the form of unique chart types and/or interesting combinations of data.
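As one possible starting point, the sketch below draws two of the chart types mentioned above (a histogram and a box plot) with matplotlib and pandas. The Iris dataset bundled with scikit-learn is used purely as a stand-in, and the chosen attributes and layout are assumptions rather than requirements.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Stand-in dataset; replace with one of your own selected datasets.
iris = load_iris(as_frame=True)
df = iris.frame  # feature columns plus an integer "target" column
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: distribution of one attribute, split by class.
for name, group in df.groupby("species"):
    ax_hist.hist(group["petal length (cm)"], bins=10, alpha=0.6, label=name)
ax_hist.set_xlabel("petal length (cm)")
ax_hist.set_ylabel("count")
ax_hist.set_title("Petal length by species")
ax_hist.legend(title="species")

# Box plot: spread and outliers of another attribute per class.
df.boxplot(column="sepal width (cm)", by="species", ax=ax_box)
ax_box.set_ylabel("sepal width (cm)")
ax_box.set_title("Sepal width by species")
fig.suptitle("")  # drop the automatic pandas "grouped by" title

fig.tight_layout()
```

The figure can then be written to disk with `fig.savefig(...)` for inclusion in the report. Labeled axes, a legend, and a title on each panel address the clarity criterion directly.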

3. Application of Classification Techniques:

● Apply the following methods

○ Decision tree (J48),

○ Random Forest, and

○ K-NN (IBk) (with K taking the value of 1 up to the number of class labels in the dataset)

○ A method that has not been covered in the lectures.

● Ensure that your implementation in WEKA and/or Python is correctly configured for each dataset, paying attention to parameters that might need tuning.

● Report the results using suitable tables and charts.
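For those working in Python, the experiment could be sketched with scikit-learn as below. Several choices here are assumptions and not part of the brief: scikit-learn's `DecisionTreeClassifier` implements CART, not J48 (WEKA's C4.5), so it is only an approximation of that method; `GaussianNB` is a placeholder for "a method not covered in the lectures", since the brief does not say which methods were covered; the Wine dataset and 10-fold stratified cross-validation are stand-ins.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)  # stand-in dataset
n_classes = len(np.unique(y))

models = {
    # CART, used here as an approximation of J48 (C4.5).
    "Decision tree (CART)": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Placeholder for the method not covered in lectures (assumption).
    "Naive Bayes (placeholder)": GaussianNB(),
}
# k-NN with k from 1 up to the number of class labels, as the task specifies.
# Features are standardised because k-NN is distance-based.
for k in range(1, n_classes + 1):
    models[f"k-NN (k={k})"] = make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=k)
    )

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
results = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name:28s} {acc:.3f}")
```

Fixing `random_state` and reusing the same `cv` object for every model keeps the comparison across techniques reproducible and like-for-like.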

4. Performance Comparison and Analysis:

● Use a variety of metrics (accuracy, precision, recall, etc.) for a comprehensive analysis.

● Include confusion matrices for each technique and dataset.

● Analyze the suitability of each classification technique for different types of datasets, considering aspects like dataset size, imbalance, and feature types.
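A confusion matrix and per-class metrics can be produced in a few lines with scikit-learn. This is a sketch under assumptions: the breast cancer dataset and the random forest are stand-ins, and out-of-fold predictions from 10-fold cross-validation are one reasonable evaluation protocol among several.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

data = load_breast_cancer()  # stand-in dataset
X, y = data.data, data.target

# Out-of-fold predictions: every instance is predicted exactly once,
# by a model that did not see it during training.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_pred = cross_val_predict(RandomForestClassifier(random_state=0), X, y, cv=cv)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y, y_pred)
print(cm)

# Precision, recall and F1 per class, plus overall accuracy.
print(classification_report(y, y_pred, target_names=data.target_names))
```

On imbalanced datasets, the per-class precision and recall in this report are usually more informative than overall accuracy.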

5. Report Writing: Classification (40%):

● Write a report of no longer than 1000 words detailing the results you have reached in (3) and (4) with recommendations on the choice of the data mining technique according to the features of the datasets.

● Structure your report logically: Introduction, Data Selection, Exploratory Analytics, Results, Analysis, Conclusion.

● In your conclusion, provide specific recommendations for matching datasets with classification techniques.

● The mark will be based on:

○ The correctness and range of experiments in (3)

○ The use of tables and figures that effectively summarize the results (3 and 4). Ensure these are clearly labeled and referenced in the text.

○ The correct use of the classification methods (4)

○ The quality and depth of the comparison and discussion.

○ Conclusion with recommendations on how to match a dataset to a technique.

6. Application of Clustering Techniques

● You are required to apply the following clustering techniques using the WEKA and/or Python software on only 5 of the datasets selected in the classification task (1).

● Apply the following clustering techniques (details are in the next point):

○ K-means,

○ Agglomerative (hierarchical) method

● Remove the class attribute before applying the above clustering methods. Once you have applied the clustering techniques to all the datasets, complete the following task:

○ Use the clustering evaluation methods to compare the performance of the above algorithms.
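In Python, the two clustering methods and their evaluation might be sketched as follows. The assumptions here are not part of the brief: Iris is a stand-in dataset, the number of clusters is set equal to the number of classes, Ward linkage is one of several possible linkage choices, and the silhouette score (internal) and adjusted Rand index (external) are only two of the available evaluation measures.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)  # stand-in dataset
# The class attribute y is never passed to the clusterers; it is kept
# aside only for the external evaluation below.
X_scaled = StandardScaler().fit_transform(X)
k = 3  # assumption: one cluster per class

clusterers = {
    "k-means": KMeans(n_clusters=k, n_init=10, random_state=0),
    "agglomerative (Ward)": AgglomerativeClustering(n_clusters=k, linkage="ward"),
}

scores = {}
for name, model in clusterers.items():
    labels = model.fit_predict(X_scaled)
    scores[name] = {
        # Internal measure: cohesion vs. separation, in [-1, 1].
        "silhouette": silhouette_score(X_scaled, labels),
        # External measure: agreement with the held-out class labels.
        "ari": adjusted_rand_score(y, labels),
    }

for name, s in scores.items():
    print(f"{name:22s} silhouette={s['silhouette']:.3f}  ARI={s['ari']:.3f}")
```

Reporting an internal and an external measure side by side helps separate how compact the clusters are from how well they match the known classes.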

7. Report Writing: Clustering (25%):

● Write a report of no longer than 500 words detailing the results you have reached in (6) with recommendations on the choice of the data mining technique according to the features of the datasets.

● Structure your report logically: Clustering Methodology, Results & Analysis, Conclusion.

● Provide insights on when to use each clustering technique based on dataset characteristics.

● The mark will be based on:

○ The correctness and range of experiments in (6)

○ The use of tables and figures that effectively summarize the results (6). Ensure these are clearly labeled and referenced in the text.

○ The correct use of the Clustering methods (6)

○ The quality and depth of the comparison and discussion.

○ Conclusion with recommendations on how to match a dataset to a technique.

8. Coursework Structure, Organization and Language (10%)