Department of Econometrics and Business Statistics

ETW3482 DATA MINING AND PREDICTIVE MODELLING

Semester 2, 2025

Data Exploration and Data Preparation Project

Due date: Friday, 26 September 2025, 11:55 p.m.

Weighting: 40% of your final grade.

Objective:

In this assignment, you will demonstrate your data visualisation and statistical skills by exploring and preparing a GenAI-simulated dataset with a binary target variable. You will then prepare a technical report, suitable for a data analytics manager, documenting your process and findings.

Background:

This project builds upon the dataset you identified in the first assignment. You will now explore and prepare this data using SAS Viya. The goal is to create a clean, well-documented dataset ready for predictive modelling.

Project Scope:

This project consists of three stages: Data Simulation, Data Exploration, and Data Preparation.

Stage 1: Data Simulation (10 marks)

1. Dataset Review: Review the dataset you selected in the first assignment. Ensure it has at least 15 columns and between 8,000 and 15,000 rows. It should ideally contain variables with characteristics such as missing values, extreme values, skewed distributions, and high cardinality. The dataset must include a binary target variable. If a binary target variable doesn't exist, create one by consolidating existing variables or using GenAI.

2. Data Simulation with GenAI: Use GenAI tools (e.g., ChatGPT, Copilot, etc.) to simulate and regenerate columns in your dataset, introducing realistic data issues as described in requirement (1). Focus on simulating missing values, skewed distributions, and high cardinality categorical variables.

o Specific Goals:

Introduce missing values randomly across several columns.

Create a positively skewed distribution for at least one numerical variable.

Simulate a categorical variable with high cardinality (but no more than 10 distinct levels after regeneration).

o Report: Include the prompts you used in the Appendix of your report.

3. Data Quality Control: Ensure the simulated data meets the following criteria:

o  Missing values are limited to less than 15% in any single column.

o  Skewed distributions are positively skewed.

o  High cardinality categorical variables have no more than 10 distinct levels.

Deliverable (Stage 1): A CSV file containing the dataset that includes simulated variables.
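
The three quality criteria in Stage 1 can be sanity-checked before the CSV file is uploaded. The Python sketch below is illustrative only and does not replace the GenAI-based simulation required above. The column names (`income`, `age`, `region`, `target`), the missingness rates, the log-normal parameters, and the helper names are all assumptions made for this example, and it generates only five columns rather than the 15 the brief requires.

```python
import random
import statistics


def simulate_rows(n_rows=10_000, seed=42):
    """Generate illustrative rows showing the three Stage 1 data issues."""
    rng = random.Random(seed)
    # Hypothetical categorical variable with exactly 10 levels (the stated cap).
    regions = [f"R{i:02d}" for i in range(1, 11)]
    rows = []
    for i in range(n_rows):
        income = rng.lognormvariate(10.0, 0.8)  # log-normal draws are right-skewed
        age = rng.randint(18, 80)
        rows.append({
            "customer_id": i + 1,
            # Blank out values completely at random (assumed rates of 8% and 5%).
            "income": None if rng.random() < 0.08 else round(income, 2),
            "age": None if rng.random() < 0.05 else age,
            "region": rng.choice(regions),
            "target": 1 if rng.random() < 0.3 else 0,  # binary target
        })
    return rows


def missing_fraction(rows, column):
    """Share of rows in which the column is missing."""
    return sum(r[column] is None for r in rows) / len(rows)


def skewness(values):
    """Moment-based skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def quality_report(rows, numeric_col, categorical_col):
    """Summarise the three criteria listed under Data Quality Control."""
    observed = [r[numeric_col] for r in rows if r[numeric_col] is not None]
    levels = {r[categorical_col] for r in rows if r[categorical_col] is not None}
    return {
        "max_missing": max(missing_fraction(rows, c) for c in rows[0]),
        "skewness": skewness(observed),
        "levels": len(levels),
    }


rows = simulate_rows()
report = quality_report(rows, "income", "region")
print(report)
```

A dataset passes the checks when `max_missing` is below 0.15, `skewness` is greater than zero, and `levels` is at most 10.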

Stage 2: Data Exploration (40 marks)

1. SAS Viya Project Setup: Create a SAS Visual Analytics Project and upload the simulated data from Stage 1.

2. Variable Naming: Ensure all variable names adhere to the SAS Naming Convention before uploading to SAS Viya.

3. Binary Target Analysis: Use SAS Visual Analytics to explore the binary target variable. Create visualisations and summary statistics to understand its distribution and relationship with other variables.

4. Data Exploration in Model Studio: Create a Data Mining and Machine Learning Project in Model Studio. Use the Data Exploration Node to identify data quality issues (e.g., missing values, outliers, unusual distributions).
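
For the variable naming step above, column headers can be cleaned before the CSV file is uploaded. The Python sketch below assumes that "SAS Naming Convention" refers to the standard rules for SAS variable names: at most 32 characters, starting with a letter or underscore, and containing only letters, digits, and underscores. The function names are hypothetical, and any unit-specific naming guidance takes precedence over this sketch.

```python
import re


def to_sas_name(raw, max_len=32):
    """Convert a raw column header into a name that follows the assumed rules."""
    # Replace every character that is not a letter, digit, or underscore.
    name = re.sub(r"[^A-Za-z0-9_]", "_", raw.strip())
    # Collapse runs of underscores and trim them from both ends.
    name = re.sub(r"_+", "_", name).strip("_")
    # A name must start with a letter or underscore (and must not be empty).
    if not name or not re.match(r"[A-Za-z_]", name):
        name = "_" + name
    return name[:max_len]


def is_valid_sas_name(name):
    """Check a name against the assumed rules without changing it."""
    return re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]{0,31}", name) is not None


for header in ["Annual Income ($)", "2024 sales", "credit_score"]:
    print(header, "->", to_sas_name(header))
```

Truncating to 32 characters can make two long headers collide, so the cleaned names should still be checked for uniqueness.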

Deliverable (Stage 2): A report (included in the final technical report) summarising:

•     The distribution and characteristics of the binary target variable.

•     A detailed description of the data quality issues identified using the Data Exploration Node in Model Studio. Include relevant visualisations and summary statistics.
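
The figures reported for the binary target can be cross-checked outside SAS. The Python sketch below is an optional sanity check, not a replacement for the SAS Visual Analytics work required in this stage. It computes the class proportions of the target and the event rate within each level of a categorical variable. The column names `region` and `target` and the four sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def target_distribution(rows, target="target"):
    """Proportion of rows in each class of the binary target."""
    counts = Counter(r[target] for r in rows)
    total = sum(counts.values())
    return {level: counts[level] / total for level in sorted(counts)}


def event_rate_by_level(rows, by, target="target"):
    """Share of rows with target == 1 within each level of `by`."""
    totals = defaultdict(int)
    events = defaultdict(int)
    for r in rows:
        totals[r[by]] += 1
        events[r[by]] += r[target]  # target is coded 0/1, so this counts events
    return {level: events[level] / totals[level] for level in sorted(totals)}


# Tiny hypothetical sample, used only to show the shape of the output.
sample = [
    {"region": "North", "target": 1},
    {"region": "North", "target": 0},
    {"region": "South", "target": 0},
    {"region": "South", "target": 0},
]
print(target_distribution(sample))
print(event_rate_by_level(sample, "region"))
```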

Stage 3: Data Preparation (50 marks)

1. Solution Proposal: For each data quality issue identified in Stage 2, propose a specific solution.

2. Implementation in Model Studio: Implement the proposed solutions using the metadata settings in the Data Tab and Nodes from Data Mining Preprocessing in SAS Viya Model Studio.

3. Documentation: Document the implemented solutions, including the specific steps taken and the rationale behind each choice.

Deliverable (Stage 3): A report (included in the final technical report) summarising:

•     The proposed solutions for each data quality issue.

•     A detailed description of how the solutions were implemented in the SAS Viya Model Studio.

•     Justification for the chosen data preparation techniques.

Report Formatting:

•     The report should be professional, clean, and adhere to the assessment criteria.

•     Use the following formatting options throughout the body of your report (i.e., all sections of the report excluding the front cover and appendix):

o Line spacing: 1.5

o Alignment: Justify

o Font: Calibri (Body)

o Font size: 12

o Page numbering: Bottom of page (centre) & style: Page x of y

•     Visual aids such as charts, graphs, diagrams, and tables should be included to enhance the report. All charts, graphs, diagrams, and tables should be labelled with a title and referred to in the report. Lack of appropriate visual aids (e.g., charts and graphs) will be penalised accordingly.

•     The report should be written clearly and concisely, and the level of technical language must be suitable and appropriate for business technical personnel.

•     The word count for this report should not exceed 2000 words (excluding the title page, table of contents, references, and appendices). State the total word count at the end of the technical report. Anything beyond 2000 words will NOT be read or marked.

Submission:

Submit the following files through the Moodle submission link:

1. Technical Report: A single PDF file named ETW3482_DataPreparation_Report_.pdf

2. Simulated Dataset: A CSV file named ETW3482_DataPreparation_Data_.csv

Deadline:

All submissions must be made by 11:55 p.m. on Friday, 26 September 2025, through the Moodle submission link.