ECON 1611: Big Data, Machine Learning and Society 

Assessment 2: Empirical Project in Python

Written report (30 marks)

Write a 3-5 page report that answers the following questions:

1) Summarise and describe the data (2 marks)

Steps in Python

1. Print first 20 rows of data

2. Describe the data e.g. mean, median, standard deviation of all the variables

3. Count the number of observations in each response category
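
The three steps above can be sketched with pandas as follows. This is only an illustration: the small DataFrame is a hypothetical stand-in for the real dataset, and the variable names (`first_rows`, `summary`, `medians`, `class_counts`) are assumptions rather than names from the provided notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for bank-additional-full.csv
df = pd.DataFrame({
    "age": [30, 45, 52, 38, 27, 61],
    "duration": [120, 300, 45, 610, 95, 210],
    "y": ["no", "yes", "no", "yes", "no", "no"],
})

# Step 1: first 20 rows (fewer are returned if the frame is shorter)
first_rows = df.head(20)

# Step 2: count, mean, std, min, quartiles and max of the numeric columns
summary = df.describe()
medians = df.median(numeric_only=True)

# Step 3: number of observations in each response category
class_counts = df["y"].value_counts()

print(first_rows)
print(summary)
print(medians)
print(class_counts)
```

Passing `include="all"` to `describe()` would also summarise the categorical columns.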

2) Graphing (2 marks)

Steps in Python

1. Basic scatter plot of two features against each other

2. Histogram
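
A minimal sketch of both plots with matplotlib is shown below. The data are made up for illustration, and the choice of `age` and `campaign` as the two features is an assumption; any pair of numeric features could be used instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; this line is not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data standing in for the real dataset
df = pd.DataFrame({
    "age": [30, 45, 52, 38, 27, 61, 33, 48],
    "campaign": [1, 3, 2, 1, 5, 2, 1, 4],
})

fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Step 1: scatter plot of two features against each other
ax_scatter.scatter(df["age"], df["campaign"], alpha=0.6)
ax_scatter.set_xlabel("age")
ax_scatter.set_ylabel("campaign")

# Step 2: histogram of a single feature
ax_hist.hist(df["age"], bins=5, edgecolor="black")
ax_hist.set_xlabel("age")
ax_hist.set_ylabel("count")

fig.tight_layout()
```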

3) From the 20 inputs, choose the set of controls you will use for your machine learning models. Justify why you have excluded some variables. Hint: the ones highlighted in yellow are those you may want to exclude: why? (1 mark)

4) Build a classification tree in Python (1 mark)

Steps in Python

1. Split sample into train and test

2. Fit the tree without any pruning

3. Set hyperparameters

4. Draw tree in Python
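
The steps above might look like the sketch below, which uses scikit-learn on synthetic data. The feature values, the target rule, the 70/30 split, the Gini criterion and the `random_state` values are all assumptions made for illustration; the provided notebook may use different settings.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; this line is not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in for the bank data (values are made up)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "euribor3m": rng.uniform(0.6, 5.0, n),
    "contact": rng.choice(["cellular", "telephone"], n),
})
y = (X["euribor3m"] < 2.0).astype(int)  # artificial target rule, for illustration only

# Categorical inputs must be encoded as numbers before fitting a tree
X = pd.get_dummies(X, columns=["contact"], dtype=int)

# Step 1: split the sample into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Steps 2 and 3: no max_depth or ccp_alpha is set, so the tree is grown unpruned
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)
print("test accuracy:", test_accuracy)

# Step 4: draw the fitted tree
fig, ax = plt.subplots(figsize=(8, 5))
plot_tree(tree, feature_names=list(X.columns), class_names=["no", "yes"],
          filled=True, ax=ax)
```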

5) Interpret the tree that is built (2 marks)

a) What do the squared errors indicate in each node/leaf?

b) What does it mean to have squared errors = 0?

c) Which node splitting rule is very important and why?

d) Would you say this tree provides a good prediction of who will sign up to a term deposit: why/why not?

6) Calculate feature importance for each feature (2 marks)

a) Which is the most important feature?

b) Which is the second most important?

c) Out of the top 10, are there similarities between the top influential features?

d) What are some problems with this feature importance exercise?
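
As a rough sketch, impurity-based importances can be read from a fitted scikit-learn tree through its `feature_importances_` attribute and ranked with pandas. The data and the target rule below are synthetic assumptions, chosen so that a single feature drives the outcome.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data (values are made up)
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "euribor3m": rng.uniform(0.6, 5.0, n),
    "age": rng.integers(18, 80, n),
    "campaign": rng.integers(1, 10, n),
})
y = (X["euribor3m"] < 2.0).astype(int)  # artificial target rule, for illustration only

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Importances sum to 1; larger values mean a larger total impurity reduction
importances = (
    pd.Series(tree.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
)
print(importances.head(10))
```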

7) Use GridSearchCV to find the optimal tree, then draw the tree (2 marks)

a) What changed between this pruned tree and the previous tree that you built (without pruning)?

b) Which hyperparameter changed this?

c) Why do you think the pruned tree is better?

d) Why does CV enable you to find the best predicting tree?
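
One way the cross-validated search could be set up is sketched below. The synthetic data, the hyperparameter grid (`max_depth`, `min_samples_leaf`), the five folds and the accuracy scoring are assumptions for illustration, not the settings required by the assessment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic classification data standing in for the bank data
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Hypothetical grid: each combination is scored by 5-fold cross-validation
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_leaf": [1, 5, 20],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

# best_estimator_ is refitted on the full training set with the best settings
best_tree = search.best_estimator_
test_accuracy = best_tree.score(X_test, y_test)
print(search.best_params_, search.best_score_, test_accuracy)
```

The fitted `best_tree` can then be drawn with `sklearn.tree.plot_tree` in the same way as an unpruned tree.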

8) After GridSearchCV, calculate feature importance for each feature (3 marks)

a) How have the features changed compared to the non-pruned tree?

b) Out of the top 10, are there similarities between the top influential features?

c) What are some problems with this feature importance exercise?

9) Run a LASSO model (5 marks)

a) Split the sample into train and test

b) Run GridSearchCV

c) Report the optimal hyperparameter

d) Explain what the optimal hyperparameter is

e) What is a potential issue with fitting a LASSO model here?
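
A possible outline of steps a) to c) is given below, assuming that "LASSO" means scikit-learn's linear `Lasso` fitted to a 0/1 outcome; an L1-penalised `LogisticRegression` would be an alternative reading. The synthetic data and the grid of `alpha` values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first column is related to the binary outcome
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

# a) split the sample into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Standardise first, because the L1 penalty depends on the scale of each feature
pipe = make_pipeline(StandardScaler(), Lasso(max_iter=10_000))

# b) cross-validated search over the penalty strength alpha (hypothetical grid)
param_grid = {"lasso__alpha": [0.001, 0.01, 0.1, 1.0]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

# c) the optimal hyperparameter according to the cross-validated score
best_alpha = search.best_params_["lasso__alpha"]
print("best alpha:", best_alpha)
```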

10) Calculate feature importance values for all the variables using the best LASSO model (1 mark)

11) Which features are deemed unimportant by LASSO? (1 mark)
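
For questions 10 and 11, one common convention is to treat the absolute value of each standardised LASSO coefficient as its importance, and to treat coefficients shrunk exactly to zero as unimportant. The sketch below follows that convention on synthetic data; the feature names and the fixed `alpha=0.1` are hypothetical, and in practice the best model from the cross-validated search would be used.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: one informative feature and three pure-noise features
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_1": rng.normal(size=n),
    "noise_2": rng.normal(size=n),
    "noise_3": rng.normal(size=n),
})
y = (X["signal"] + 0.5 * rng.normal(size=n) > 0).astype(int)

# Standardise so that coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1).fit(X_scaled, y)

coefs = pd.Series(lasso.coef_, index=X.columns)
importance = coefs.abs().sort_values(ascending=False)

# Features whose coefficient was shrunk exactly to zero
unimportant = coefs[coefs == 0].index.tolist()
print(importance)
print("dropped by LASSO:", unimportant)
```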

12) Compare features that were deemed important in LASSO and those in the pruned classification tree (6 marks)

a) What are the differences?

b) Why do you think these differences exist?

c) Make a policy recommendation to the bank about their marketing effort.

d) What is a good marketing strategy that the bank should employ to increase the chance that people will sign up to the bank product of a term deposit?

e) Do you think the variables identified in the feature importance exercises would “cause” term deposit purchases to increase?

f) Why?

13) What do you think are some ways you could get around the issue of correlated variables when trying to identify the important features? (2 marks)
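
As one illustrative starting point (not the only possible answer), correlated inputs can be detected with a correlation matrix, and one variable from each highly correlated pair can be dropped. The data below are synthetic, and the 0.9 threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Synthetic data: nr.employed is built to be almost collinear with euribor3m
rng = np.random.default_rng(0)
n = 200
euribor = rng.uniform(0.6, 5.0, n)
df = pd.DataFrame({
    "euribor3m": euribor,
    "nr.employed": 5000 + 50 * euribor + rng.normal(scale=1.0, size=n),
    "age": rng.integers(18, 80, n),
})

# Absolute pairwise correlations between the features
corr = df.corr().abs()

# Keep only the upper triangle so that each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair whose correlation exceeds the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)
print("dropped:", to_drop)
```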

Python script (5 marks)

To answer the assessment questions, open the provided .ipynb file in Jupyter Notebook and complete the code required for the written report component above.

Details about the dataset:

1. Access the dataset at this link: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

2. Dataset Information:

The data are from direct marketing campaigns by a Portuguese banking institution. The marketing campaigns were conducted via phone calls. Often, more than one contact attempt per client was required in order to confirm that the product (bank term deposit) was subscribed to.

There are four datasets: 

1) bank-additional-full.csv contains all examples (41188) and 20 inputs, ordered by date from May 2008 to November 2010

2) bank-additional.csv contains 10% of the examples (4119), randomly selected from dataset 1, and 20 inputs.

3) bank-full.csv contains all examples and 17 inputs, ordered by date. It is an older version of dataset 1 with fewer inputs.

4) bank.csv contains 10% of the examples and 17 inputs, randomly selected from dataset 3.

3. For this assessment use dataset 1: bank-additional-full.csv

4. The classification goal is to predict if the client will subscribe (yes/no) to a term deposit (variable y).

5. Attribute Information:

Input variables

Bank client data

1. age (numeric)

2. job: type of job (categorical)

3. marital: marital status (categorical)

4. education (categorical)

5. default: has credit in default? (categorical)

6. housing: has housing loan? (categorical)

7. loan: has personal loan? (categorical)

Related to the last contact of the current campaign

1. contact: contact communication type (categorical)

2. month: last contact month of year (categorical)

3. day_of_week: last contact day of the week (categorical)

4. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
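
The note on duration suggests keeping two feature sets: one that includes duration for benchmarking only, and one that excludes it for a realistic model. A minimal sketch with a hypothetical toy DataFrame:

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
df = pd.DataFrame({
    "age": [30, 45, 52],
    "duration": [120, 0, 610],
    "y": ["no", "no", "yes"],
})

# Benchmark features: duration kept, for comparison purposes only
X_benchmark = df.drop(columns=["y"])

# Realistic features: duration removed because it is unknown before the call
X_realistic = df.drop(columns=["y", "duration"])

print(list(X_benchmark.columns))
print(list(X_realistic.columns))
```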

Other attributes

1. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

2. pdays: number of days since the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted)

3. previous: number of contacts performed before this campaign and for this client (numeric)

4. poutcome: outcome of the previous marketing campaign (categorical)
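
Because pdays uses 999 as a code for "not previously contacted", its mean and standard deviation are distorted if the code is treated as a real number of days. One possible treatment, sketched here with made-up values and hypothetical column names, is to separate the flag from the day count.

```python
import pandas as pd

# Hypothetical toy values; 999 marks clients who were not previously contacted
df = pd.DataFrame({"pdays": [999, 3, 999, 6, 999]})

# Indicator for whether the client was contacted in a previous campaign
df["previously_contacted"] = (df["pdays"] != 999).astype(int)

# Day count with the 999 code replaced by a missing value
df["pdays_clean"] = df["pdays"].where(df["pdays"] != 999)

print(df)
print("mean with 999 included:", df["pdays"].mean())
print("mean of real day counts:", df["pdays_clean"].mean())
```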

Social and economic context attributes

1. emp.var.rate: employment variation rate - quarterly indicator (numeric)

2. cons.price.idx: consumer price index - monthly indicator (numeric)

3. cons.conf.idx: consumer confidence index - monthly indicator (numeric)

4. euribor3m: euribor 3 month rate - daily indicator (numeric)

5. nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target)

1. y - has the client subscribed to a term deposit? (binary: 'yes','no')
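
A short sketch of loading the data and encoding the target follows. It assumes the CSV is semicolon-delimited with quoted strings, which should be checked against the downloaded file; the in-memory sample below is a made-up stand-in for bank-additional-full.csv, and `y_binary` is a hypothetical column name.

```python
import io

import pandas as pd

# Made-up sample in the assumed file layout; in practice, pass the path to
# bank-additional-full.csv instead of this in-memory buffer
sample = io.StringIO(
    '"age";"job";"y"\n'
    '56;"housemaid";"no"\n'
    '37;"services";"yes"\n'
    '40;"admin.";"no"\n'
)
df = pd.read_csv(sample, sep=";")

# Encode the yes/no target as 1/0 for use in the models
df["y_binary"] = df["y"].map({"no": 0, "yes": 1})
print(df)
```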