ETW3482 DATA MINING AND PREDICTIVE MODELLING Semester 2, 2025
Department of Econometrics and Business Statistics
Data Exploration and Data Preparation Project
Due date: Friday, 26 September 2025, 11:55 p.m.
Weighting: 40% of your final grade.
Objective:
In this assignment, you will demonstrate your data visualisation and statistical skills by exploring and preparing a GenAI-simulated dataset with a binary target variable. You will then prepare a technical report, suitable for a data analytics manager, documenting your process and findings.
Background:
This project builds upon the dataset you identified in the first assignment. You will now explore and prepare this data using SAS Viya. The goal is to create a clean, well-documented dataset ready for predictive modelling.
Project Scope:
This project consists of three stages: Data Simulation, Data Exploration, and Data Preparation.
Stage 1: Data Simulation (10 marks)
1. Dataset Review: Review the dataset you selected in the first assignment. Ensure it has at least 15 columns and between 8,000 and 15,000 rows. It should ideally contain variables with characteristics such as missing values, extreme values, skewed distributions, and high cardinality. The dataset must include a binary target variable. If a binary target variable does not exist, create one by consolidating existing variables or by using GenAI.
2. Data Simulation with GenAI: Use GenAI tools (e.g., ChatGPT or Copilot) to simulate and regenerate columns in your dataset, introducing realistic data issues as described in requirement (1). Focus on simulating missing values, skewed distributions, and high cardinality categorical variables.
o Specific Goals:
Introduce missing values randomly across several columns.
Create a positively skewed distribution for at least one numerical variable.
Simulate a categorical variable with high cardinality (but no more than 10 distinct levels after regeneration).
o Report: Include the prompts you used in the Appendix of your report.
3. Data Quality Control: Ensure the simulated data meets the following criteria:
o Missing values are limited to less than 15% in any single column.
o Skewed distributions are positively skewed.
o High cardinality categorical variables have no more than 10 distinct levels.
Deliverable (Stage 1): A CSV file containing the dataset that includes simulated variables.
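As an illustration only, the Stage 1 goals and quality-control criteria can be checked locally before the CSV file is submitted. The sketch below uses the Python standard library. The column names (`annual_spend`, `region`, `churned`), the 8% missing rate, and the log-normal parameters are hypothetical choices, not requirements of this assignment, and the sketch does not replace the GenAI prompts that must be documented in the Appendix.

```python
import random
import statistics


def simulate_rows(n_rows=10_000, missing_rate=0.08, seed=42):
    """Build rows with the three Stage 1 data issues (hypothetical columns)."""
    rng = random.Random(seed)
    regions = [f"REGION_{i:02d}" for i in range(1, 11)]  # exactly 10 levels
    rows = []
    for i in range(n_rows):
        row = {
            "customer_id": i + 1,
            # Log-normal draws have a long upper tail, i.e. positive skew.
            "annual_spend": round(rng.lognormvariate(8.0, 0.9), 2),
            "region": rng.choice(regions),
            # Binary target; it is never blanked out below.
            "churned": 1 if rng.random() < 0.2 else 0,
        }
        # Introduce missing values completely at random in selected columns.
        for col in ("annual_spend", "region"):
            if rng.random() < missing_rate:
                row[col] = None
        rows.append(row)
    return rows


def skewness(values):
    """Population skewness (third standardised moment)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def quality_report(rows, numeric_col, categorical_col):
    """Summarise the three quality-control criteria for one dataset."""
    n = len(rows)
    missing_share = {
        col: sum(r[col] is None for r in rows) / n for col in rows[0]
    }
    observed = [r[numeric_col] for r in rows if r[numeric_col] is not None]
    levels = {r[categorical_col] for r in rows if r[categorical_col] is not None}
    return {
        "max_missing_share": max(missing_share.values()),  # should be < 0.15
        "skewness": skewness(observed),                     # should be > 0
        "n_levels": len(levels),                            # should be <= 10
    }


rows = simulate_rows()
report = quality_report(rows, "annual_spend", "region")
print(report)
```

The three values in `report` correspond to the three criteria in requirement (3). If the rows are later written out with `csv.DictWriter`, `None` is stored as an empty field, which is one common way to represent a missing value in a CSV file.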
Stage 2: Data Exploration (40 marks)
1. SAS Viya Project Setup: Create a SAS Visual Analytics Project and upload the simulated data from Stage 1.
2. Variable Naming: Ensure all variable names adhere to the SAS Naming Convention before uploading to SAS Viya.
3. Binary Target Analysis: Use SAS Visual Analytics to explore the binary target variable. Create visualisations and summary statistics to understand its distribution and relationship with other variables.
4. Data Exploration in Model Studio: Create a Data Mining and Machine Learning Project in Model Studio. Use the Data Exploration Node to identify data quality issues (e.g., missing values, outliers, unusual distributions).
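The variable-naming step (2) can be done before upload with a small helper. The sketch below is a minimal example, assuming the convention intended is the standard SAS variable-name rule: at most 32 characters, starting with a letter or underscore, and containing only letters, digits, and underscores. The function names and sample headers are hypothetical; confirm the exact convention against the unit materials.

```python
import re

MAX_LEN = 32  # assumed SAS variable-name length limit


def to_sas_name(raw, taken=()):
    """Return a name built only from letters, digits and underscores,
    not starting with a digit, at most MAX_LEN characters long, and
    distinct (ignoring case) from every name in `taken`."""
    # Collapse each run of disallowed characters into one underscore.
    name = re.sub(r"[^A-Za-z0-9_]+", "_", raw.strip()).strip("_")
    # A name may not be empty or start with a digit.
    if not name or name[0].isdigit():
        name = "_" + name
    name = name[:MAX_LEN]
    # Resolve clashes case-insensitively by appending a numeric suffix.
    lowered = {t.lower() for t in taken}
    base, k = name, 1
    while name.lower() in lowered:
        suffix = f"_{k}"
        name = base[:MAX_LEN - len(suffix)] + suffix
        k += 1
    return name


def rename_columns(headers):
    """Convert a list of CSV headers, keeping the results unique."""
    result = []
    for header in headers:
        result.append(to_sas_name(header, result))
    return result


print(rename_columns(["Annual Spend ($)", "2024 visits", "region", "Region"]))
# ['Annual_Spend', '_2024_visits', 'region', 'Region_1']
```

Clashes are resolved without regard to case because SAS treats variable names as case-insensitive, so `region` and `Region` would otherwise refer to the same variable.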
Deliverable (Stage 2): A report (included in the final technical report) summarising:
• The distribution and characteristics of the binary target variable.
• A detailed description of the data quality issues identified using the Data Exploration Node in Model Studio. Include relevant visualisations and summary statistics.
Stage 3: Data Preparation (50 marks)
1. Solution Proposal: For each data quality issue identified in Stage 2, propose a specific solution.
2. Implementation in Model Studio: Implement the proposed solutions using the metadata settings in the Data Tab and Nodes from Data Mining Preprocessing in SAS Viya Model Studio.
3. Documentation: Document the implemented solutions, including the specific steps taken and the rationale behind each choice.
Deliverable (Stage 3): A report (included in the final technical report) summarising:
• The proposed solutions for each data quality issue.
• A detailed description of how the solutions were implemented in SAS Viya Model Studio.
• Justification for the chosen data preparation techniques.
Report Formatting:
• The report should be professional, clean, and adhere to the assessment criteria.
• Use the following formatting options throughout the body of your report (i.e., all sections of the report excluding the front cover and appendix):
o Line spacing: 1.5
o Alignment: Justify
o Font: Calibri (Body)
o Font size: 12
o Page numbering: Bottom of page (centre) & style: Page x of y
• Visual aids such as charts, graphs, diagrams, and tables should be included to enhance the report. All charts, graphs, diagrams, and tables should be labelled with a title and referred to in the report. Lack of appropriate visual aids (e.g., charts and graphs) will be penalised accordingly.
• The report should be written clearly and concisely, and the level of technical language must be appropriate for business technical personnel.
• The word count for this report should not exceed 2000 words (excluding the title page, table of contents, references, and appendices). State the total word count at the end of the technical report. Anything beyond 2000 words will NOT be read or marked.
Submission:
Submit the following files through the Moodle submission link:
1. Technical Report: A single PDF file named ETW3482_DataPreparation_Report_
2. Simulated Dataset: A CSV file named ETW3482_DataPreparation_Data_
Deadline:
All submissions must be made by 11:55 p.m. on Friday, 26 September 2025, through the Moodle submission link.