QBUS6180

Statistical Learning and Data Mining

Semester 2, 2021

Classification Project: Marketing Analytics


1. Overview

In this project, your team will analyse marketing data from a bank and a retail company. Your team will have two tasks. The first will be to build machine learning models to predict the success of marketing campaigns. The second will be to uncover insights that can help your clients make better marketing decisions.


2. Problem description

As a team of data scientists and business analysts working for a marketing consulting company, you have been tasked with helping two clients, a bank and a fashion store, to leverage their data to increase the effectiveness of their marketing campaigns.

The two clients provided your team with data from their latest direct marketing campaigns. You have two tasks:

1. To develop statistical learning models to predict whether the marketing campaign will be successful with a customer.

2. To obtain at least three insights that can help the clients make decisions about their marketing campaigns. These insights should address the question: what types of customers are more responsive to marketing campaigns?

We will refer to these tasks as statistical learning and data mining, respectively.

As part of the project, you need to write a report according to the instructions below.


3. Understanding the data

3.1 Two datasets

This project involves two marketing datasets, one from a bank and another from a fashion store. The assignment requires you to work with both datasets, but you’ll be able to pick one out of the two for some parts of the report.

One dataset primarily has numerical variables, while the other emphasises categorical variables.


3.2 Bank dataset

The bank dataset is from a phone campaign to encourage clients to subscribe to a term deposit.

The dataset comes in two files: a training dataset with response labels, and a second dataset without response labels, which is used for the Kaggle competition.

Kaggle randomly splits this second file into validation (50%) and test (50%) cases, but you will not know which cases are which. When you submit to the competition, you receive a score equal to the competition metric (to be announced) computed on the validation cases. These scores are displayed on the Public Leaderboard and provide an ongoing ranking of teams. You can use the scores of your submissions to help you select the best model.

You will select one of your submissions to be used as the final model at the end of the competition. Once the competition is over, Kaggle will rank the teams’ final submissions based on the test cases only, and those rankings will be displayed on the Private Leaderboard. Your goal is to score as well as possible on the Private Leaderboard at the end of the competition. Therefore, please be careful not to overfit the validation cases in an attempt to improve your public ranking.
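
The overfitting risk described above can be illustrated with a small simulation. The sketch below is not part of the competition setup: it assumes accuracy as a stand-in metric (the real metric is to be announced), uses synthetic labels, and scores hypothetical submissions that are pure coin flips on a random 50/50 public/private split.

```python
import random


def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def simulate_leaderboard(n_cases=200, n_submissions=50, seed=0):
    """Score no-skill submissions on a random 50/50 public/private split.

    Everything here is synthetic: the labels, the submissions and the
    accuracy metric are assumptions made for illustration only.
    """
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_cases)]

    # Random split into "public" (validation) and "private" (test) halves.
    indices = list(range(n_cases))
    rng.shuffle(indices)
    half = n_cases // 2
    public_idx, private_idx = indices[:half], indices[half:]

    scores = []
    for _ in range(n_submissions):
        # Each submission is a coin flip, so its true skill is 0.5.
        preds = [rng.randint(0, 1) for _ in range(n_cases)]
        public = accuracy([preds[i] for i in public_idx],
                          [labels[i] for i in public_idx])
        private = accuracy([preds[i] for i in private_idx],
                           [labels[i] for i in private_idx])
        scores.append((public, private))
    return scores


scores = simulate_leaderboard()
# Pick the submission that looks best on the public half.
best_public, private_of_best = max(scores, key=lambda s: s[0])
print(f"best public score: {best_public:.3f}")
print(f"private score of that submission: {private_of_best:.3f}")
```

Because every simulated submission has no real skill, any gap between the best public score and its private score comes from selecting on noise. The more submissions are compared on the public cases alone, the more optimistic the best public score tends to be.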

Each row corresponds to a call made to a customer. The response variable, subscribed, is the last column in the dataset. It indicates whether the client subscribed to a term deposit, which was the objective of the campaign.

The data dictionary file describes the predictor variables.


3.3 Fashion store dataset

The store dataset refers to a promotional e-mail campaign.

Each row refers to a different customer. The response variable, responded, indicates whether the customer responded to the promotion.

The data dictionary file describes the predictor variables.


5. Statistical Learning (Task 1)

Requirements:

Bank dataset: your report must provide the Kaggle Public Leaderboard scores for at least five different sets of predictions, including your final model. You need to submit to Kaggle to get each validation score. The five sets of predictions should all come from different machine learning methods.

Fashion store dataset: your report must provide model selection results for at least five different models, including your final model. The project must also include test results.

Fashion store dataset: assume at least one loss matrix, i.e. the loss assigned to each combination of actual and predicted class.

Both datasets: at least one of your models should be a linear model.

Both datasets: at least one of your models should be a tree-based model.

Both datasets: at least one of your models should be a model average or model stack.

Both datasets: identify one of your five models as the benchmark.

Note that these are only minimum requirements. Refer to the rubric for the details on the marking criteria.
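
As an illustration of the model average mentioned in the requirements above, the following is a minimal sketch that combines predicted probabilities from several already-fitted classifiers by a weighted average. The function name, the weights and the example probabilities are hypothetical. A model stack would instead train a second-stage model on such predictions.

```python
def average_probabilities(prob_lists, weights=None):
    """Weighted average of predicted probabilities from several models.

    prob_lists: one list of predicted P(response) per model, all of the
                same length and in the same customer order.
    weights:    optional non-negative weights, one per model
                (defaults to equal weights).
    """
    n_models = len(prob_lists)
    if n_models == 0:
        raise ValueError("need at least one model")
    n_cases = len(prob_lists[0])
    if any(len(probs) != n_cases for probs in prob_lists):
        raise ValueError("all models must score the same cases")
    if weights is None:
        weights = [1.0] * n_models
    if len(weights) != n_models:
        raise ValueError("need one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    # Normalise by the total weight so the result is still a probability.
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
        for i in range(n_cases)
    ]


# Hypothetical predictions for three customers from two models.
linear_probs = [0.2, 0.6, 0.9]
tree_probs = [0.4, 0.8, 0.7]

averaged = average_probabilities([linear_probs, tree_probs])
print(averaged)  # approximately [0.3, 0.7, 0.8]
```

Equal weights are the simplest choice. If the weights are tuned, they should be chosen on validation data rather than on the data used to fit the individual models.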


6. Data Mining (Task 2)

Business question: What types of customers are more responsive to marketing campaigns?

Requirements:

Extract at least three quantitative insights from the data that address the business question.

You can use any combination of the two datasets for this task.

Notes:

This task is open-ended, as is the nature of data mining applications. Think creatively and explore the data in a way that you find interesting. The ability to approach open-ended problems is vital in data science.

Remember that association is not causation. Do not oversell your insights.
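
One simple form a quantitative insight could take is a comparison of response rates across customer segments. The sketch below is only an illustration: the rows are made up, the column name age_band is hypothetical, and only the response variable name responded comes from the fashion store dataset description. It uses the standard library to stay self-contained; the same summary is a short groupby in pandas.

```python
from collections import defaultdict


def response_rate_by_group(rows, group_key, response_key):
    """Response rate and lift over the overall rate, per group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [customers, responders]
    for row in rows:
        counts[row[group_key]][0] += 1
        counts[row[group_key]][1] += int(row[response_key])

    total_n = sum(n for n, _ in counts.values())
    total_r = sum(r for _, r in counts.values())
    overall = total_r / total_n if total_n else 0.0

    summary = {}
    for group, (n, r) in counts.items():
        rate = r / n
        summary[group] = {
            "n": n,  # report the group size alongside the rate
            "rate": rate,
            # Lift > 1 means the group responds more often than average.
            "lift": rate / overall if overall > 0 else None,
        }
    return summary


# Made-up toy rows, not taken from either dataset.
rows = [
    {"age_band": "young", "responded": 0},
    {"age_band": "young", "responded": 0},
    {"age_band": "young", "responded": 1},
    {"age_band": "young", "responded": 0},
    {"age_band": "senior", "responded": 1},
    {"age_band": "senior", "responded": 1},
    {"age_band": "senior", "responded": 0},
    {"age_band": "senior", "responded": 1},
]

summary = response_rate_by_group(rows, "age_band", "responded")
for group, stats in summary.items():
    print(group, stats)
```

A difference in response rates between groups is an association only, and rates from small groups are noisy, which is why the group size is reported with each rate.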


7. Written report

The purpose of the report is to describe, explain, and justify your solution to the clients. You can assume that the clients have training in business analytics. However, do not assume that they are experts on the methods used in your project.

The main text is limited to 15 pages, so preparing the report will involve careful consideration of what should go in it. The main text should focus on the highlights of the project. Note that there is no page limit for the appendix. It is fine to put extra material (such as additional figures and tables) in the appendix and refer to it in the main text.

Requirements:

Discuss three of your best models in detail in the methodology section (the others only need to be mentioned, not discussed). Make sure to include the best performing model for each of the datasets.

Discuss the business problem from the perspective of decision theory (in the problem formulation section). How can machine learning help businesses optimise their marketing efforts?
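
One way to connect predicted probabilities to a marketing decision is through a loss matrix. Contacting a customer minimises expected loss when the predicted response probability p satisfies p > c_FP / (c_FP + c_FN), where c_FP is the extra loss from contacting a non-responder and c_FN is the extra loss from not contacting a responder. The sketch below assumes a 2x2 loss matrix indexed as loss[actual][decision]. The function names and loss values are hypothetical and are not taken from the project data.

```python
def contact_threshold(loss):
    """Probability threshold above which contacting minimises expected loss.

    loss[actual][decision], with 0 = no response / do not contact
    and 1 = response / contact.
    """
    cost_fp = loss[0][1] - loss[0][0]  # extra loss: contacting a non-responder
    cost_fn = loss[1][0] - loss[1][1]  # extra loss: skipping a responder
    if cost_fp < 0 or cost_fn < 0 or cost_fp + cost_fn == 0:
        raise ValueError("errors must cost at least as much as correct decisions")
    return cost_fp / (cost_fp + cost_fn)


def decide(probabilities, loss):
    """Return 1 (contact) or 0 (do not contact) for each customer."""
    threshold = contact_threshold(loss)
    return [int(p > threshold) for p in probabilities]


# Hypothetical losses: a wasted contact costs 1, a missed responder costs 10.
loss = [
    [0, 1],   # actual: no response -> [do not contact, contact]
    [10, 0],  # actual: response    -> [do not contact, contact]
]

threshold = contact_threshold(loss)
decisions = decide([0.05, 0.2, 0.5], loss)
print(f"threshold: {threshold:.3f}")
print(decisions)
```

With these assumed losses the threshold is 1/11, well below 0.5, because missing a responder is taken to be much more costly than a wasted contact. A different loss matrix would move the threshold accordingly.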

Suggested outline:

1. Introduction: write a few paragraphs introducing the project and giving an overview of the methodology and main results. Use plain English and avoid technical language as much as possible in this section (write it for a broad audience).

2. Problem formulation and objectives: state the problem to be solved and the goals of the project.

3. Data understanding: provide essential information about the data, discuss potential issues, and highlight the most interesting findings. Due to a possible lack of space, you may want to refer to the appendix for most EDA plots.

4. Feature engineering.

5. Methodology: focus on the three models specified above. Explain the rationale for using these learning algorithms and explain the choices that you’ve made regarding configuration, training and hyperparameter optimisation. This part is allowed to be more technical than the rest of the report.

6. Results.

7. What types of customers are more responsive to marketing campaigns?


8. Kaggle Competition

We will post the link to join the competition on Canvas.

You will need to create a Kaggle account identifiable by your name to access the competition and make submissions. After creating an account and logging into Kaggle, use the provided link to get to the competition page. Click on “Join Competition”, located in a light blue box near the top right corner of the page, then click to accept the competition rules.

Each group should create a team on Kaggle. The group leader can create a team by joining the competition and then going into the “Team” tab, which will appear near the top of the competition page. The leader can then invite other group members using their (Kaggle) names. The name of the Kaggle team must be identical to the group name on Canvas, i.e. the team number must match the group number. Each student in the group must sign up and be identifiable as a member of a Kaggle team.

Requirement: the Kaggle team must be set up and have a valid submission by the first prediction deadline posted on Canvas.

The purpose of the Kaggle competition is to provide feedback by allowing you to compare your performance with that of other groups. Participation in the competition is part of the assessment. Make sure that your final submission is correct. Your ranking in the competition will typically not affect your marks directly, as long as we can establish that your participation represents a genuine effort to submit good predictions and improve them over the course of the competition.

Real-world relevance: employers highly value the ability to participate in a Kaggle competition. Some companies in Australia go as far as to set up a Kaggle competition just for recruitment.

Bonus marks: the team with the best performance on the Private Leaderboard will receive ten bonus marks on the assignment. To qualify for the bonus, the choice of the final model must be well justified in the report, and your Python code must reproduce the winning predictions. Furthermore, the group must post a description of their winning solution on Ed.

Attention! You have to manually select which submission Kaggle will use to compute the test (Private Leaderboard) results. It will not necessarily pick the best submission for you.