ACT 6241: Data Mining and Business Analytics
Group Project Description
I. Introduction
This group project is an opportunity for student groups to apply data mining techniques to real-world business problems. It simulates a scenario in which you provide recommendations to the management team of a commercial bank on a “Bank marketing” problem for a given product (a term deposit).
You are provided with a training dataset (“bank_marketing_train.csv”) describing the direct marketing campaigns of a commercial banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').
Specifically, the training dataset consists of 26,246 observations with 20 variables (19 input variables/features and 1 target variable y), as listed below.
1) age (numeric)
2) job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3) marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4) education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5) default: has credit in default? (categorical: 'no','yes','unknown')
6) housing: has housing loan? (categorical: 'no','yes','unknown')
7) loan: has personal loan? (categorical: 'no','yes','unknown')
8) contact: contact communication type (categorical: 'cellular','telephone')
9) month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10) day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11) campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
12) pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
13) previous: number of contacts performed before this campaign and for this client (numeric)
14) poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
15) emp.var.rate: employment variation rate - quarterly indicator (numeric)
16) cons.price.idx: consumer price index - monthly indicator (numeric)
17) cons.conf.idx: consumer confidence index - monthly indicator (numeric)
18) euribor3m: euribor 3 month rate - daily indicator (numeric)
19) nr.employed: number of employees - quarterly indicator (numeric)
20) y - has the client subscribed a term deposit? (binary: 'yes','no')
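As a starting point for working with these variables, the following is a minimal sketch of reading the rows and typing the numeric columns using only the Python standard library. The comma delimiter, the helper name `parse_rows`, the added `previously_contacted` field, and the tiny sample text are all assumptions made for illustration; they are not part of the project files.

```python
import csv
import io

# Numeric columns as listed in the variable description above.
NUMERIC_COLUMNS = [
    "age", "campaign", "pdays", "previous", "emp.var.rate",
    "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed",
]
PDAYS_NOT_CONTACTED = 999  # sentinel defined in the description of pdays


def parse_rows(text, delimiter=","):
    """Parse CSV text into dicts with typed numeric fields.

    The delimiter is an assumption; check the actual file before relying on it.
    """
    rows = []
    for raw in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        row = dict(raw)
        for name in NUMERIC_COLUMNS:
            if name in row:
                row[name] = float(row[name])
        # Split the 999 sentinel into an explicit flag so it is not
        # mistaken for a real gap of 999 days.
        if "pdays" in row:
            contacted = row["pdays"] != PDAYS_NOT_CONTACTED
            row["previously_contacted"] = contacted
            row["pdays"] = row["pdays"] if contacted else None
        rows.append(row)
    return rows


# Tiny made-up sample (not taken from the real dataset).
SAMPLE = (
    "age,job,pdays,previous,y\n"
    "41,admin.,999,0,no\n"
    "35,technician,6,2,yes\n"
)
sample_rows = parse_rows(SAMPLE)
```

Separating the sentinel from genuine day counts is one possible design choice; a model could otherwise treat 999 as an unusually long gap.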
You are then required to build a (binary) classification model to predict whether the client will subscribe to a term deposit (target variable y) for the following two tasks.
1. Task-1: build a model that predicts the ranking score of being positive (y='yes') and achieves the best AUC (area under the ROC curve). Use it to predict the ranking score of being positive for the 8,000 clients in “bank_marketing_test.csv”, which has the 19 input variables without the target variable y. The expected deliverables are listed below.
a. "bank_marketing_test_scores.csv", which stores the predicted ranking scores for the 8,000 clients in "bank_marketing_test.csv". This file should have 8,000 rows, each recording the ranking score of being positive for the corresponding client, in the same order as in “bank_marketing_test.csv”. You may refer to "bank_marketing_test_scores(example).csv" for an example. Please note that the file name must be exactly "bank_marketing_test_scores.csv".
b. Python code files that generate the submitted result file "bank_marketing_test_scores.csv"
2. Task-2: formulate the “Bank marketing” business problem based on your own business understanding, and select the models and strategies that achieve the best performance as defined by your formulation. Compared with Task-1, which simply treats the “Bank marketing” problem as maximizing the overall response rate of marketing, Task-2 encourages you to formulate the problem as a business application, e.g., maximizing revenue or profit. You may also introduce assumptions in order to formulate the problem. For example, you need to assume the distribution of term deposit amounts among clients if your objective is to attract more revenue (a larger total amount of term deposits). Alternatively, you may assume a fixed marketing budget and maximize your proposed objective under that budget, rather than optimizing the ROC curve in general. The expected deliverables are listed below.
a. Report (12-point font, single spaced, 10 pages at most including supporting tables, figures, and calculations)
b. Presentation deck
c. Project files, including Python code and other required data files (if any, preferably in Excel or CSV format).
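To make the Task-1 objective concrete, here is a minimal sketch of how AUC can be computed from labels and ranking scores through the rank-sum (Mann-Whitney) formulation, using only the standard library. The function name `roc_auc` is an assumption for illustration; in practice a library implementation would normally be used, and this sketch is only meant to show what the metric measures.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    labels: sequence of 1 (y='yes') or 0 (y='no'); scores: ranking scores.
    A tie between a positive and a negative counts as half.
    """
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")

    # Sum the (1-based) ranks of the positives, giving tied scores
    # the average rank of their group.
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        n_pos_in_group = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum_pos += avg_rank * n_pos_in_group
        i = j

    # Mann-Whitney U statistic, normalised by the number of pos/neg pairs.
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)
```

Because AUC depends only on the ordering of the scores, any monotonic rescaling of the ranking scores leaves it unchanged.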
The key requirements and differences of the two tasks are summarized below for your reference.
| | Task-1 | Task-2 |
| --- | --- | --- |
| Main Dataset | bank_marketing_train.csv | bank_marketing_train.csv |
| Additional Dataset | Not allowed | Based on your assumption if necessary (e.g., distribution of client value/cost) |
| Objective | AUC of ROC curve | Other objective with applications in business, that is proposed by the group |
| Deliverables | "bank_marketing_test_scores.csv" which stores the predicted ranking score (with your selected best model) for the 8,000 clients in the "bank_marketing_test.csv". | 1) Report 2) Presentation deck 3) Project files, including Python code and other required data files (if any, preferably in Excel or CSV format). |
| Other | | Try and compare at least three different types of predictive models based on your proposed objective |
You need to form a group of 4 students to accomplish the above two tasks and submit these expected deliverables on/before the corresponding deadlines specified in Section III.
The components of the project assessment criteria and their associated weights are listed in the file “ACT6241_group project_grade_book_2023.xlsx”. As you may notice, Task-2 carries a higher weight in the assessment, since you are required to provide a report and presentation that contain a summary of the data mining process, as well as conclusions and a discussion of suggested actions. Your presentation should summarize your analysis and focus on your contributions by comparing with at least two baseline methods. Detailed guidelines for the Task-2 report and presentation are given below.
II. Guidelines of Task-2 Report and Presentations
At a minimum, your report and presentation for Task-2 should include the following:
1. Introduction and assumptions
Introduce the background of the project, then highlight the objective and motivation of your proposed project, and finally outline your proposed methodology and summarize your contributions. You may also introduce assumptions to simplify or formulate the problem. For example, you need to assume the distribution of term deposit amounts among clients if you formulate the bank marketing objective as attracting a larger amount of term deposits.
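One way such assumptions can turn into a concrete objective is sketched below: given predicted subscription probabilities, choose which clients to call under a fixed budget so that expected profit is maximized. Every parameter here (`expected_deposit`, `margin_rate`, `cost_per_call`, `budget`) and the function name `plan_campaign` are hypothetical values and names that a group would have to justify; the sketch also assumes the scores are calibrated probabilities.

```python
def plan_campaign(scores, expected_deposit, margin_rate, cost_per_call, budget):
    """Pick clients to call under a fixed budget and estimate expected profit.

    scores: predicted probability of subscription per client (assumed calibrated).
    expected_deposit: assumed average deposit amount of a subscriber.
    margin_rate: assumed profit earned per unit of deposit.
    cost_per_call: assumed cost of contacting one client.
    budget: total amount available for calls.
    Returns (indices of clients to call, total expected profit).
    """
    max_calls = int(budget // cost_per_call)
    # Expected profit of calling a client = p * deposit * margin - call cost.
    gains = [
        (p * expected_deposit * margin_rate - cost_per_call, idx)
        for idx, p in enumerate(scores)
    ]
    gains.sort(reverse=True)
    # Call only clients with positive expected profit, best first.
    chosen = [(gain, idx) for gain, idx in gains[:max_calls] if gain > 0]
    return [idx for _, idx in chosen], sum(gain for gain, _ in chosen)


# Made-up numbers, purely for illustration.
to_call, expected_profit = plan_campaign(
    scores=[0.03, 0.40, 0.20, 0.01],
    expected_deposit=10000,
    margin_rate=0.02,
    cost_per_call=10,
    budget=25,
)
```

With a uniform deposit amount the selection reduces to ranking by score; the formulation becomes more interesting once the assumed deposit amount varies by client.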
2. Dataset Preparation and Preprocessing
Describe the dataset your project will work on, with a brief introduction to its source, scale, range, and sampling method. Some preliminary exploratory analysis is also welcome to analyze the general characteristics of the data and identify potential issues such as skewness, outliers, class imbalance, and missing data.
Accordingly, you may also describe the measures taken to preprocess the data, based on the assumptions or issues identified in the previous steps. Possible preprocessing steps include, but are not limited to, data cleaning, feature encoding/transformation, feature normalization/scaling, and feature reduction/selection.
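Two of the preprocessing steps named above, feature encoding and feature scaling, can be sketched in plain Python as follows. The helper names `one_hot` and `standardize` are assumptions for illustration, and mapping unseen categories to an all-zero row is just one possible convention.

```python
def one_hot(values, categories=None):
    """Encode categorical values as 0/1 indicator rows.

    Pass the categories learned on the training data when encoding the
    test data so that both share the same column layout.
    """
    if categories is None:
        categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        if value in index:  # unseen categories encode as all zeros
            row[index[value]] = 1
        rows.append(row)
    return categories, rows


def standardize(values):
    """Scale numeric values to zero mean and unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    if std == 0:
        return [0.0] * n  # a constant column carries no information
    return [(v - mean) / std for v in values]


contact_categories, contact_rows = one_hot(["cellular", "telephone", "cellular"])
scaled_age = standardize([30.0, 40.0, 50.0])
```

In a real pipeline the mean and standard deviation would likewise be estimated on the training data only and then reused for the test data.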
3. Modeling
How do you decompose the business problem into smaller problems, and which data mining models do you propose to solve each of them? In general, you are expected to implement and compare at least three different types of predictive models, i.e., linear models (such as Logistic Regression), non-linear models (such as Decision Tree, SVM with non-linear kernels, and Neural Networks), and ensemble models (such as Random Forest). You may choose one as the adopted model and set at least two others as baseline models for comparison.
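A comparison of the three model types mentioned above might be sketched as follows, assuming scikit-learn is available. The data here is synthetic (generated with `make_classification`) and merely stands in for the encoded training features; the hyperparameters are arbitrary placeholders rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the encoded training features.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.85, 0.15], random_state=0
)

# One linear, one non-linear, and one ensemble model.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated AUC per model. With an integer cv and a
# classifier, scikit-learn uses unshuffled stratified folds, so every
# model is evaluated on the same splits.
auc_by_model = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in models.items()
}
```

For Task-2 the `scoring` argument would be replaced by whatever business metric the group proposes.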
4. Performance Evaluation
Describe the framework and metrics you propose to evaluate your proposed methodologies. Usually, you are supposed to demonstrate the advantages of your proposed models/approaches over the baseline models/approaches.
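As one example of a metric that can show an advantage over a baseline, the sketch below computes the lift of contacting only the top-scored fraction of clients compared with contacting clients at random. The function name `top_fraction_lift` and the sample numbers are assumptions for illustration; lift is only one of many possible evaluation metrics.

```python
def top_fraction_lift(labels, scores, fraction):
    """Lift of the top-scored fraction of clients over contacting at random.

    labels: 1 for y='yes', 0 for y='no'; fraction: share of clients contacted.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no positives")
    # Number of clients contacted, rounded to a whole client (at least one).
    n_top = max(1, int(round(len(labels) * fraction)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    return top_rate / base_rate


# Made-up labels and scores, purely for illustration.
example_labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
example_scores = [0.9, 0.8, 0.1, 0.7, 0.2, 0.3, 0.05, 0.15, 0.25, 0.35]
lift_top_20 = top_fraction_lift(example_labels, example_scores, 0.2)
```

A lift above 1 means the model's top-ranked clients respond more often than a randomly chosen group of the same size.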
5. Conclusion
What conclusions can be drawn from your analysis? What additional actions or analyses would be useful to conduct in the future? What additional questions that are beyond the scope of your project would be useful to answer?
III. Deliverables and Due Dates
Electronic submission is required for the following materials by the due times specified below.
Part 1: All deliverables of Task-1, report of Task-2 (Electronic Copy Due: 23:59, Apr 29, 2023)
Required:
1) Task-1: "bank_marketing_test_scores.csv" which stores the predicted ranking score (with your selected best model) for the 8,000 clients in the "bank_marketing_test.csv".
2) Report of Task-2: 12-point font, single spaced, 10 pages at most including supporting tables, figures, and calculations
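The Task-1 score file could be produced along the following lines. This is a sketch only: the helper name `write_scores`, the absence of a header row, and the six-decimal formatting are assumptions, so the output should be compared against the provided example file before submission.

```python
import csv

EXPECTED_ROWS = 8000  # number of clients in bank_marketing_test.csv


def write_scores(scores, path="bank_marketing_test_scores.csv",
                 expected_rows=EXPECTED_ROWS):
    """Write one ranking score per row, preserving the input order.

    No header row is written; that is an assumption, so check the
    provided example file for the exact layout.
    """
    scores = list(scores)
    if len(scores) != expected_rows:
        raise ValueError(f"expected {expected_rows} scores, got {len(scores)}")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for score in scores:
            writer.writerow([f"{score:.6f}"])
    return path
```

Checking the row count before writing guards against accidentally dropping or duplicating clients during preprocessing.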
Part 2: Presentation (Due May 6, 2023)
Required:
1) 10-minute presentation and 2-minute Q&A
2) In your presentation slides, you need to include one page containing details of the job allocations among group members.
3) Not all team members are required to present. However, those who do not present may be questioned during the Q&A session.
Part 3: Slides, Project Files of Task-2 (Due: 23:59, May 7, 2023)
Required: In addition to the presentation slides, you should also include your project files, including Python code and other required data files (if any, preferably in Excel or CSV format). You may submit these as a file attachment, or as a file-sharing link (e.g., Baidu Yun, OneDrive) if the file size exceeds the attachment limit.
Part 4: Peer Evaluation (Due: 23:59, May 10, 2023)
IV. Tips
• For Task-2, you may introduce external relevant datasets (please see Appendix for reference) to support your analysis.
• Start exploring your dataset EARLY to identify data issues, as this can be time consuming. Such issues may include: datasets that require time-consuming manual cleaning; datasets that are large and need to be analyzed in smaller subsets or imported into a database; the need to collect more data to conduct your analysis; or a decision to learn a new software tool that would greatly improve your analysis.
• If you are working with a large dataset, conduct your analyses over a subset of the data first. Once you have finalized what types of analysis you will conduct, you can then apply your analysis to the full sample. If it is infeasible or very costly (in terms of time or computation) to conduct your analyses over the full sample, then state this and explain whether you believe the results would be similar or different over the full sample.
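Working on a subset first can be as simple as the sketch below, which draws a reproducible random sample of rows. The helper name `sample_subset` and the default seed are assumptions for illustration; for imbalanced data a stratified sample may be preferable.

```python
import random


def sample_subset(rows, fraction, seed=0):
    """Return a reproducible random subset containing `fraction` of the rows."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    size = max(1, round(len(rows) * fraction))
    # A dedicated Random instance keeps the draw independent of global state.
    return random.Random(seed).sample(rows, size)
```

Fixing the seed means the same subset is drawn on every run, which keeps exploratory results comparable across group members.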
Appendix: External Relevant Datasets
1. Dataset
Commercial Database: WRDS/CSMAR/Wind
Tableau Resources: https://public.tableau.com/en-us/s/resources
Kaggle Dataset: https://www.kaggle.com/datasets
Data Fountain: https://www.datafountain.cn/datasets
Tianchi: https://tianchi.aliyun.com/competition/gameList/activeList
UCI: http://archive.ics.uci.edu/ml/index.php
AWS Public Datasets: https://aws.amazon.com/public-datasets
KDD Cup: https://www.kdd.org/kdd-cup
Dataset List: https://github.com/awesomedata/awesome-public-datasets
Any publicly accessible data source with the help of web crawler or data API.
2023-04-20