
ACT 6241: Data Mining and Business Analytics

Group Project Description

I. Introduction

This group project is an opportunity for student groups to apply data mining techniques to real-world business problems. It simulates a scenario in which you provide recommendations to the management team of a commercial bank on the problem of “Bank marketing” for a given product (term deposit).

You are provided with a training dataset (“bank_marketing_train.csv”) describing the direct marketing campaigns of a commercial banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

Specifically, the training dataset is composed of 26,246 observations with 20 variables (19 input variables/features and 1 target variable y) as below.

1)     age (numeric)

2)    job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')

3)    marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)

4)     education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')

5)    default: has credit in default? (categorical: 'no','yes','unknown')

6)    housing: has housing loan? (categorical: 'no','yes','unknown')

7)    loan: has personal loan? (categorical: 'no','yes','unknown')

8)     contact: contact communication type (categorical: 'cellular','telephone')

9)    month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')

10)  day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')

11)   campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

12)  pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

13)  previous: number of contacts performed before this campaign and for this client (numeric)

14)  poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

15)   emp.var.rate: employment variation rate - quarterly indicator (numeric)

16)   cons.price.idx: consumer price index - monthly indicator (numeric)

17)   cons.conf.idx: consumer confidence index - monthly indicator (numeric)

18)   euribor3m: euribor 3 month rate - daily indicator (numeric)

19)  nr.employed: number of employees - quarterly indicator (numeric)

20)  y - has the client subscribed a term deposit? (binary: 'yes','no')
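Two encodings in the variable list above deserve care before modeling: pdays uses 999 as a sentinel for "not previously contacted", and several categorical variables use 'unknown' as a category of its own. The following is a minimal sketch of one way to handle both, using only the Python standard library. The sample rows, the comma delimiter, the helper name clean_row, and the derived field previously_contacted are illustrative assumptions, not part of the provided data.

```python
import csv
import io

# Made-up sample rows with a subset of the columns described above.
# The comma delimiter is an assumption about the file format.
SAMPLE = """age,job,default,pdays,y
41,admin.,no,999,no
35,technician,unknown,6,yes
58,retired,no,999,yes
"""

PDAYS_NOT_CONTACTED = 999  # sentinel value documented for pdays


def clean_row(row):
    """Convert raw strings and split the pdays sentinel into two fields."""
    pdays = int(row["pdays"])
    contacted = pdays != PDAYS_NOT_CONTACTED
    return {
        "age": int(row["age"]),
        "job": row["job"],
        # Keep 'unknown' as its own category rather than dropping the row.
        "default": row["default"],
        # Hypothetical derived flag: 1 if contacted in a previous campaign.
        "previously_contacted": int(contacted),
        # Days since last contact is only meaningful if there was a contact.
        "pdays": pdays if contacted else None,
        "y": 1 if row["y"] == "yes" else 0,
    }


rows = [clean_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
```

Treating 999 as an ordinary number would place "never contacted" clients at the far end of the pdays scale, which is why the sketch separates the flag from the day count. Whether to keep 'unknown' as a category or impute it is a modeling choice left to the group.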

You are then required to build a (binary) classification model to predict whether the client will subscribe to a term deposit (target variable y) for the following two tasks.

1.   Task-1: build a model that predicts a ranking score of being positive (y='yes') and achieves the best AUC (area under the ROC curve), then predict the ranking score of being positive for the 8,000 clients in “bank_marketing_test.csv”, which has the 19 input variables without the target variable y. The expected deliverables are listed below.

a.   "bank_marketing_test_scores.csv", which stores the predicted ranking scores for the 8,000 clients in "bank_marketing_test.csv". This file should have 8,000 rows, each recording the ranking score of being positive for the corresponding client, in the same order as in “bank_marketing_test.csv”. You may refer to "bank_marketing_test_scores(example).csv" for an example. Please note that the file name should be exactly "bank_marketing_test_scores.csv".

b.   Python code files that generate the submitted result file "bank_marketing_test_scores.csv"

2.   Task-2: formulate the “Bank marketing” business problem based on your business understanding, and select the models and strategies that achieve the best performance as defined by your formulation. Compared to Task-1, which simply treats the “Bank marketing” problem as maximizing the overall response rate of marketing, Task-2 encourages you to formulate the problem in terms of business applications, e.g., maximizing revenue or profit. You may also introduce assumptions in order to formulate the problem. For example, you need to assume the distribution of term deposit amounts among clients if you formulate the objective of bank marketing as attracting more revenue (a larger amount of term deposits). Alternatively, you may assume that you have a fixed marketing budget and try to maximize the proposed objective under that budget, rather than the AUC of the ROC curve for general situations. The expected deliverables are listed below.

a.   Report (12-point font, single spaced, 10 pages at most including supporting tables, figures, and calculations)

b.   Presentation deck

c.   Project files, including Python code and other required data files (if any, preferably in Excel or CSV format).
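For Task-1, note that the AUC objective depends only on how the scores order the clients, not on the absolute values of the scores. A minimal sketch of this interpretation follows, written in plain Python for illustration. The helper name roc_auc and the sample labels and scores are made up. The pairwise loop is quadratic in the number of clients, so in practice a library routine such as scikit-learn's roc_auc_score would typically be used instead.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = 'yes') and ranking scores from two hypothetical models.
y_true = [1, 0, 1, 0, 0]
baseline = [0.9, 0.8, 0.4, 0.3, 0.1]
proposed = [0.9, 0.2, 0.7, 0.3, 0.1]

auc_baseline = roc_auc(y_true, baseline)  # 5 of 6 positive/negative pairs ordered correctly
auc_proposed = roc_auc(y_true, proposed)  # all 6 pairs ordered correctly
```

Because only the ordering matters, any strictly increasing transformation of the scores (for example, multiplying them all by 10) leaves the AUC unchanged.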

The key requirements and differences of the two tasks above are summarized below for your reference.

 

| | Task-1 | Task-2 |
|---|---|---|
| Main Dataset | bank_marketing_train.csv | bank_marketing_train.csv |
| Additional Dataset | Not allowed | Based on your assumption if necessary (e.g., distribution of client value/cost) |
| Objective | AUC of ROC curve | Other objective with applications in business, proposed by the group |
| Deliverables | "bank_marketing_test_scores.csv", which stores the predicted ranking score (with your selected best model) for the 8,000 clients in "bank_marketing_test.csv" | 1) Report; 2) Presentation deck; 3) Project files, including Python code and other required data files (if any, preferably in Excel or CSV format) |
| Other | | Try and compare at least three different types of predictive models based on your proposed objective |

You need to form a group of 4 students to accomplish the above two tasks and submit these expected deliverables on/before the corresponding deadlines specified in Section III.

The components of the project assessment criteria and their weights are listed in the file “ACT6241_group project_grade_book_2023.xlsx”. As you may notice, Task-2 carries a higher weight in the assessment, since you are required to provide a report and presentation that contain a summary of the data mining process, as well as conclusions and a discussion of suggested actions. Your presentation should summarize your analysis and focus on your contributions by comparing with at least two baseline methods. The detailed guidelines for the Task-2 report and presentation are given below.

II. Guidelines of Task-2 Report and Presentations

At a minimum, your report and presentation for Task-2 should include the following:

1. Introduction and assumptions

Introduce the background of the project, then highlight the objective and motivation of your proposed project, and finally outline your proposed methodology and summarize your contributions. You may also introduce assumptions to simplify or formulate the problem. For example, you need to assume the distribution of term deposit amounts among clients if you formulate the objective of bank marketing as attracting a larger amount of term deposits.

2. Dataset Preparation and Preprocessing

Describe the dataset your project will work on, with a brief introduction to its source, scale, range, and sampling method. Some preliminary exploratory analysis is also welcome to characterize the data and identify potential issues such as skewness, outliers, class imbalance, and missing data.

Accordingly, you may also describe the measures taken to preprocess the data, based on the assumptions or the issues identified in the previous steps. Possible preprocessing includes, but is not limited to, data cleaning, feature encoding/transformation, feature normalization/scaling, and feature reduction/selection.
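Two of the preprocessing steps named above, feature encoding and feature scaling, can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions: the helper names one_hot and standardize and the sample values are made up, and a real project would more likely rely on a data analysis library for the same operations.

```python
import statistics


def one_hot(values):
    """Map each categorical value to a 0/1 vector over the sorted categories."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


def standardize(values):
    """Rescale numeric values to zero mean and unit (population) std dev."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:  # constant column: nothing to scale
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Made-up values for two of the input variables.
contact = ["cellular", "telephone", "cellular", "cellular"]
age = [30, 40, 50, 60]

contact_categories, contact_encoded = one_hot(contact)
age_scaled = standardize(age)
```

When the same transformation is later applied to the test clients, the categories, mean, and standard deviation should be the ones computed on the training data, so that both datasets are encoded consistently.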

3. Modeling

How do you decompose the business problem into smaller problems, and which data mining models do you propose to solve each of them? In general, you are expected to implement and compare at least three different types of predictive models, i.e., linear models (such as Logistic Regression), non-linear models (such as Decision Tree, SVM with non-linear kernels, and Neural Networks), and ensemble models (such as Random Forest). You may choose one as the adopted model and set at least two others as baseline models for comparison.

4. Performance Evaluation

Describe the framework and metrics you propose to evaluate your proposed methodologies. Usually, you are supposed to demonstrate the advantages of your proposed models/approaches over the baseline models/approaches.
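As one possible illustration of such a framework, the sketch below compares two models on a business metric rather than on AUC: the realized profit from calling the top-scored clients under a fixed budget. Everything specific here is an assumption made for illustration: the helper name profit_at_budget, the value of 100 per subscription, the cost of 10 per call, the budget of 30, and the held-out labels and scores.

```python
def profit_at_budget(labels, scores, value_per_subscription, cost_per_call, budget):
    """Realized profit when calling the top-scored clients until the budget runs out."""
    max_calls = int(budget // cost_per_call)
    # Rank client indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    called = ranked[:max_calls]
    revenue = sum(labels[i] for i in called) * value_per_subscription
    return revenue - len(called) * cost_per_call


# Made-up held-out labels (1 = subscribed) and scores from two hypothetical models.
y_holdout = [1, 0, 1, 0, 0, 1]
scores_baseline = [0.2, 0.9, 0.8, 0.7, 0.1, 0.3]
scores_proposed = [0.8, 0.3, 0.9, 0.2, 0.1, 0.7]

profit_baseline = profit_at_budget(
    y_holdout, scores_baseline,
    value_per_subscription=100.0, cost_per_call=10.0, budget=30.0,
)
profit_proposed = profit_at_budget(
    y_holdout, scores_proposed,
    value_per_subscription=100.0, cost_per_call=10.0, budget=30.0,
)
```

With these made-up numbers the budget covers three calls. The baseline reaches one subscriber among its top three and the proposed model reaches three, so the proposed model yields the higher profit. The same comparison should be made on data held out from training, so that the reported advantage is not an artifact of overfitting.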

5. Conclusion

What conclusions can be drawn from your analysis? What additional actions or analyses would be useful to conduct in the future? What additional questions beyond the scope of your project would be useful to answer?

III. Deliverables and Due Dates

Electronic submission is required for the following materials by the due times specified below.

Part 1: All deliverables of Task-1, report of Task-2 (Electronic Copy Due: 23:59, Apr 29, 2023)

Required:

1) Task-1: "bank_marketing_test_scores.csv" which stores the predicted ranking score (with your selected best model) for the 8,000 clients in the "bank_marketing_test.csv".

2) Report of Task-2: 12-point font, single spaced, 10 pages at most including supporting tables, figures, and calculations

Part 2: Presentation (Due May 6, 2023)

Required:

1)   10-minute presentation and 2-minute Q&A

2)  In your presentation slides, you need to include one page containing details of the job allocations among group members.

3)  Not all team members are required to be active in the presentation, but those who do not present may be subject to questions during the Q&A session.

Part 3: Slides, Project Files of Task-2 (Due: 23:59, May 7, 2023)

Required: In addition to the presentation slides, you should also include your project files, including Python code and other required data files (if any, preferably in Excel or CSV format). You may submit these as a file attachment, or as a file sharing link (e.g., Baidu Yun, OneDrive) if the file size exceeds the attachment limit.

Part 4: Peer Evaluation (Due: 23:59, May 10, 2023)

IV. Tips

•   For Task-2, you may introduce external relevant datasets (please see Appendix for reference) to support your analysis.

•   Start exploring your dataset EARLY to identify data issues, as this will be time consuming. Issues may include: datasets that require time-consuming manual cleaning; datasets that are large and need to be analyzed in smaller subsets or imported into a database; more data needing to be collected to conduct your analysis; or a decision to learn a new software tool that would greatly improve your analysis.

•   If you are working with a large dataset, conduct your analyses over a subset of the data first. Once you have finalized what types of analysis you will conduct, you can then apply your analysis to the full sample. If it is infeasible or very costly (in terms of time or computation) to conduct your analyses over the full sample, then state this and explain whether you believe the results would be similar or different over the full sample.

Appendix: External Relevant Datasets

1.    Dataset

Commercial Database: WRDS/CSMAR/Wind

Tableau Resources: https://public.tableau.com/en-us/s/resources

Kaggle Dataset: https://www.kaggle.com/datasets

Data Fountain: https://www.datafountain.cn/datasets

Tianchi: https://tianchi.aliyun.com/competition/gameList/activeList

UCI: http://archive.ics.uci.edu/ml/index.php

AWS Public Datasets: https://aws.amazon.com/public-datasets

KDD Cup: https://www.kdd.org/kdd-cup

Dataset List: https://github.com/awesomedata/awesome-public-datasets

Any publicly accessible data source with the help of web crawler or data API.