
The Purpose:

This assignment should be completed in teams of 1-4 students. This assignment focuses on the analysis of authentic data and business applications. It is designed to enhance students’ knowledge and skills in the following: identifying data sources; programming for data wrangling and machine learning; construction of tables and graphics and their integration into written reports; high-level written communication.

The Task:

· Identify a dataset (main dataset) that has business application.

· Identify a supporting dataset to provide additional information to the main dataset.

(Hint 1: The dataset has business application if the answers you provide with the data have any material benefit to any business. Some examples: Clientele clustering can help businesses better identify the demographic or behavioural differences among their customers. Sales prediction can guide business decisions. Viewer movie ratings can be used to selectively push movies to clients.)

(Hint 2: Data cleaning should be a minor factor in choosing your dataset. Make sure the data is simple enough that you can feasibly manage it, but complex enough that proper data analysis can be carried out.)

(Hint 3: The supporting dataset should be less sophisticated and more accessible than the main dataset. For example, the main dataset could have 10,000 observations and 20 variables, while the supporting dataset could have 10 observations and 1 variable.)

(Hint 4: The supporting dataset can be merged with, or used to transform, the main dataset (e.g., to create a new variable) to provide better insight.)
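As a rough illustration of Hints 3 and 4, the sketch below left-joins a tiny supporting table onto a main table and derives one new variable from the combination. It is only a sketch under stated assumptions: the two inline CSV snippets, the column names (`region`, `sales`, `population`) and the derived variable `sales_per_1000_people` are all made up for the example and do not come from any course dataset.

```python
import csv
import io

# Made-up stand-ins for the two files; every column name and value is hypothetical.
MAIN_CSV = """order_id,region,sales
1,North,120.0
2,South,80.0
3,North,200.0
4,West,50.0
"""
SUPPORT_CSV = """region,population
North,1000
South,400
"""

main_rows = list(csv.DictReader(io.StringIO(MAIN_CSV)))

# Turn the small supporting dataset into a lookup keyed on the shared column.
support = {
    r["region"]: int(r["population"])
    for r in csv.DictReader(io.StringIO(SUPPORT_CSV))
}

merged = []
for row in main_rows:
    population = support.get(row["region"])  # None when the region has no match
    sales = float(row["sales"])
    merged.append({
        "order_id": int(row["order_id"]),
        "region": row["region"],
        "sales": sales,
        "population": population,
        # New variable that only exists once both datasets are combined.
        "sales_per_1000_people": (
            None if population is None else sales * 1000 / population
        ),
    })

# Keep track of main-dataset rows the supporting dataset could not cover.
unmatched = [m["region"] for m in merged if m["population"] is None]

for m in merged:
    print(m)
print("regions without supporting data:", unmatched)
```

Every row of the main dataset is kept (a left join), so unmatched rows show up as missing values rather than disappearing silently. With pandas, the same step would typically be a `DataFrame.merge(..., how="left")` on the shared column.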

· Identify 1-3 business questions that the dataset can address.

(Hint 1: Your grade is unrelated to the number of questions and answers. It is about the quality of your answers. The questions should be simple (or complex) enough to showcase your machine learning understanding. It is not acceptable if all your questions can be answered by non-machine learning techniques such as calculating means or generating scatter plots.)

· Carry out machine learning analysis such as clustering and classification to address the business questions.

(Hint 1: The amount of analysis should depend on the dataset and the questions. You can use multiple clustering techniques, for example, if they are motivated by your question and datasets.)

(Hint 2: Avoid wholesale implementation of example codes from the class. Some codes introduce one specific technique, and many have redundancies for educational purposes. You should acquire an integrated knowledge based on all examples and run the most suitable codes for your dataset and question(s). To illustrate: the example code sometimes uses a small sample, say a 5% data sampler, to run a pilot analysis. The goal there is better clarity and visualization, not better analysis.)
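One possible reading of the sampling remark in Hint 2 is sketched below: a 5% sample is drawn with a fixed seed for quick pilot plots and sanity checks, while the analysis that is actually reported runs on the full data. The helper name `pilot_sample`, the seed and the made-up 10,000-row dataset are assumptions for illustration only.

```python
import random


def pilot_sample(rows, fraction=0.05, seed=42):
    """Return a reproducible random subset of `rows` for a quick pilot run."""
    k = max(1, round(len(rows) * fraction))
    # A dedicated Random instance keeps the draw repeatable without
    # touching the global random state.
    return random.Random(seed).sample(rows, k)


# Hypothetical dataset: 10,000 made-up observations.
full_data = [{"id": i, "value": i % 7} for i in range(10_000)]

pilot = pilot_sample(full_data)  # 5% subset: fast plots and sanity checks
final_input = full_data          # the reported analysis uses every row

print(len(pilot), "rows in the pilot sample")
print(len(final_input), "rows in the final analysis")
```

Fixing the seed means a pilot plot can be regenerated exactly, and keeping `pilot` and `final_input` as separate names makes it harder to report pilot numbers as final results by accident.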

(Hint 3: You should structure your code based on your discussion and results. That is, if you tried to include or exclude a feature to see the change in performance and quoted numbers in your main text for both trials, you should have separate codes for including or excluding this feature. If you used the scatter plot function to plot two pairs of variables, just include two scatter plots in your final code. If you tried several methods but never discussed them in the main text, remove these codes unless they are the necessary steps for the results you discussed.)
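The feature comparison described in Hint 3 could be laid out roughly as follows: one evaluation function, called once per trial, so that each number quoted in the report maps to one clearly labelled piece of code. This is a sketch under stated assumptions, not a recommended model. The data are synthetic, the column names (`tenure`, `noise`, `churned`) are hypothetical, and a hand-written 1-nearest-neighbour classifier stands in for whichever method you actually use.

```python
import random


def make_rows(n=200, seed=0):
    """Generate made-up customer rows; every column name is hypothetical."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        churned = i % 2  # alternate labels so both classes are balanced
        # 'tenure' separates the classes by construction; 'noise' has no signal.
        tenure = rng.uniform(0.0, 0.3) if churned else rng.uniform(0.7, 1.0)
        rows.append({
            "tenure": tenure,
            "noise": rng.uniform(0.0, 5.0),
            "churned": churned,
        })
    return rows


def holdout_accuracy(rows, features, train_share=0.7):
    """Score a 1-nearest-neighbour classifier on a holdout set using `features`."""
    cut = int(len(rows) * train_share)
    train, test = rows[:cut], rows[cut:]

    def dist(a, b):
        # Squared Euclidean distance restricted to the chosen features.
        return sum((a[f] - b[f]) ** 2 for f in features)

    hits = 0
    for t in test:
        nearest = min(train, key=lambda r: dist(r, t))
        hits += nearest["churned"] == t["churned"]
    return hits / len(test)


rows = make_rows()

# Trial A: include the feature in question.
acc_with_noise = holdout_accuracy(rows, ["tenure", "noise"])

# Trial B: exclude it.
acc_without_noise = holdout_accuracy(rows, ["tenure"])

print(f"accuracy with 'noise':    {acc_with_noise:.3f}")
print(f"accuracy without 'noise': {acc_without_noise:.3f}")
```

Both trials share the same split and the same scoring function, so any difference in the two numbers comes from the feature set alone. Exploratory trials that never appear in the main text would simply be removed from this file.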

· Present and interpret your findings. Discuss how these findings would benefit the business.

(Hint 1: You should prioritize a concise and intuitive presentation with graphs and simple tables to aid your discussion. Refer to Week 2 for data visualization advice. Leverage the data visualization course in the program if possible.)

(Hint 2: The exact form of the empirical write up should depend on the data, your results, and the questions. You can discuss, amongst other things, data identification, wrangling, sampling, choice of model, performance measures, overfitting, interpretation, and business application. However, these should not be given equal weights. Your understanding of what is worth discussing is a part of the assessment. For example, if your data is clean and not much wrangling is needed, do not spend more than a sentence or two on data wrangling.)

The Report Structure:

The report should consist of the following sections:

1. Introduction. This should include a brief overview of the dataset and business questions being addressed.

2. Data and methodology section.

a. for each question analysed, you briefly describe

i. the method applied,

ii. the portion of data used,

iii. and why you have chosen the method.

iv. Cite the source of the dataset with a web link.

b. Describe the specific data wrangling steps you have undertaken (keep it in the appendix)

3. Interpretation and discussion section. Present and discuss your main results and their business applications. Key tables and graphs for your main result should be in your main text (i.e., sections 1-4); they will not count towards the writing page limit.

4. Conclusion.

5. Appendix, if needed. Put supplementary information here, including extra tables and/or graphs that are less connected to the main result.

6. References.

(Hint 1: With a short report, you should not spend too much time defining concepts. Assume that the reader knows the basics such as what a logistic regression is. Go straight to the point: the what, the how and the why of your results.)

What to Submit and When?

Each team should submit one copy of the following via learnonline by 5 pm, 10 Oct 2025:

1. A WORD file (of no more than 2 pages of writing). Supporting materials, such as formulas, calculations, tables, figures, references, and appendices, do not count towards the page limit. There is no word limit. The word limit shown in the course outline is the ‘word-equivalent’ limit. It does not apply because we have quantitative elements.

2. An Excel file showing both the raw dataset and, in a separate sheet, the dataset before you import it to your machine learning language. Sometimes you will have saved the Excel file as a .csv file for a smoother import. However, still ensure that you submit the Excel file. The assignment submission system has a 1GB data limit. If your data is larger than 1GB, put it in cloud storage (for example, Google Drive, OneDrive, etc.) and share the link with me via email.

3. A code file or a zip file with codes for your programming language. The Excel file and codes are not graded. The grader may reference the Excel file if the main report is unclear or ambiguous. Therefore, there are minimum expectations in terms of the formatting within the file and of the explanation provided.

a. Orange

b. Python – Provide instructions and comments

c. R – Provide instructions and comments

d. STATA

e. Others (check with me via email)

Forming your team

This Assignment should be completed by students working in teams of 1 to 4 students. Students are responsible for forming their own teams. Internal and external students can be in the same team.

You are encouraged to discuss team formation before or after classes. Alternatively, you can make a post on the Student Discussion forum to seek teams/teammates.

It has been noticed that early and deliberate team formation is associated with successful assessment outcomes. Only one final assignment per team should be submitted.

Data sources

The dataset chosen should not be one that is included in the course, the textbooks, or Orange. Here are some suggested sources:

• From governments: Australian Bureau of Statistics, Data.gov.au, Data.gov.uk and Data.gov.

• From NGOs: World Bank surveys, the World Integrated Trade Solution (WITS) data platform and the World Development Indicators.

• Financial datasets: Yahoo Finance, Reserve Bank of Australia, NASDAQ and UniSA library.

• Crowd-sourced datasets: Kaggle, GitHub and data.world.

• Others: UC Irvine, AWS open data.

Programming help:

Besides our course resources and staff, there are many open forums and sources for programming help:

· Orange widgets documentation

· Orange to Python documentation

· Medium.com

Artificial Intelligence

The use of artificial intelligence tools, such as ChatGPT, is permitted if the following conditions are met:

1. These tools should be applied critically and with sufficient digestion and transformation. That is, using them should be akin to another form of Googling. Students are expected to understand the pros and cons of the AI outputs, synthesize these outputs, and write these up in their own words.

2. All prompts and outputs of any AI interactions should be attached verbatim as an appendix.

Feedback form for Financial Risk Simulation

Assessment feedback

Grades: F2 / F1 / P2 / P1 / C / D / HD

Qualitative aspects (40%)

• Overview of the dataset and business questions

• Interpretation of key results

• Discussion of business applications

Programming aspects (40%)

• Data wrangling

• Choice of methods

• Correctness and appropriateness of methods

• Succinct visualization and presentation of key results

General Presentation (10%):

Appropriate layout; Proper formatting for tables, figures; Appropriate referencing

Written Presentation (10%):

Legible and well set out; Arguments presented in a clear and logical manner; Correct grammar; Absence of typos, incomplete sentences, and confusing expressions

Additional comments: