BUSI3122 INTRODUCTION TO DATA SCIENCE: BIG DATA ANALYSIS IN BUSINESS AUTUMN SEMESTER 2022-2023
BUSI3122-E1
A LEVEL 3 MODULE, AUTUMN SEMESTER 2022-2023
Question 1. Assorted True/False Questions (20 marks)
Please write on the answer booklet whether each of the following statements is True or False.
a) ‘Whether a customer would like to purchase an iPhone 14 or not’ is an example of a classification problem in data mining. (2 marks)
b) Predicting the relationship between the customer number and the price of the product is an example of regression tasks in data mining. (2 marks)
c) Jane and Rob share 10 friends, and we would like to predict whether they are also friends with each other. This is an example of a co-occurrence grouping task. (2 marks)
d) In data mining terminology, one variable is the same as one feature. (2 marks)
e) Logistic regression is one kind of predictive model for regression tasks. (2 marks)
f) Generalization tries to find the real pattern that can be applied to new data, while overfitting may find some patterns that only fit the training data. (2 marks)
g) Similarity measures are most essential for Naïve Bayes. (2 marks)
h) SVM chooses the line to minimize the margin between two classes. (2 marks)
i) Profit Curve and Lift Curve share the same X-axis. (2 marks)
j) Each individual tree in the Random Forest is built on all observations. (2 marks)
Question 2. Naïve Bayesian (25 marks)
The following dataset contains loan information and can be used to try to predict whether a borrower will default (the last column is the classification). We are going to build a Naïve Bayes model to determine whether a loan X should be classified as a Defaulted Borrower or not. So, determine which is larger, P(Yes|X) or P(No|X) :
Tid | Home Owner | Marital Status | Annual Income | Defaulted Borrower
1 | Yes | Single | High | No
2 | No | Married | High | No
3 | No | Single | Low | No
4 | Yes | Married | High | No
5 | No | Divorced | Low | Yes
6 | No | Married | Low | No
7 | Yes | Divorced | High | No
8 | No | Single | Low | Yes
9 | No | Married | Low | No
10 | No | Single | Low | Yes
a) First, please calculate the prior probabilities for Defaulted Borrower (P(YES)) and Non-Defaulted Borrower (P(NO)), and all the necessary parameters for a Naïve Bayesian classifier. (10 marks)
b) Given a new customer X (Home Owner = No, Marital Status=Married, Income=High), calculate the probability that this customer is a Defaulted Borrower or Non-Defaulted Borrower respectively. Based on the Naïve Bayesian classifier, what will be the predicted class of this customer? (10 marks)
c) If we set the priors to P(YES)=0.1 and P(NO)=0.9 and the other parameters remain the same, answer question b) again, and briefly explain how the prior probabilities influence the classifier's judgement. (5 marks)
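As an illustration of the calculation that parts a) to c) ask for, here is a minimal Python sketch of a count-based Naïve Bayes scorer. It is not a model answer. It assumes plain relative-frequency estimates with no Laplace smoothing, and the names `records`, `fit` and `scores` are illustrative choices rather than anything defined in the paper. The values it returns are P(X|class) × P(class), which are proportional to the posteriors P(class|X), so comparing them is enough to pick the predicted class.

```python
from collections import Counter

# Rows copied from the loan table: (Home Owner, Marital Status, Annual Income, Defaulted Borrower)
records = [
    ("Yes", "Single", "High", "No"),
    ("No", "Married", "High", "No"),
    ("No", "Single", "Low", "No"),
    ("Yes", "Married", "High", "No"),
    ("No", "Divorced", "Low", "Yes"),
    ("No", "Married", "Low", "No"),
    ("Yes", "Divorced", "High", "No"),
    ("No", "Single", "Low", "Yes"),
    ("No", "Married", "Low", "No"),
    ("No", "Single", "Low", "Yes"),
]


def fit(records):
    """Count class frequencies and (feature index, value, class) frequencies."""
    class_counts = Counter(row[-1] for row in records)
    cond_counts = Counter()
    for *features, label in records:
        for idx, value in enumerate(features):
            cond_counts[(idx, value, label)] += 1
    return class_counts, cond_counts


def scores(x, class_counts, cond_counts, priors=None):
    """Return P(x | class) * P(class) for each class (unnormalised posteriors).

    Assumption: no smoothing, so an unseen feature value gives a zero score.
    If `priors` is given, it overrides the priors estimated from the data.
    """
    total = sum(class_counts.values())
    result = {}
    for label, n in class_counts.items():
        prior = priors[label] if priors else n / total
        likelihood = 1.0
        for idx, value in enumerate(x):
            likelihood *= cond_counts[(idx, value, label)] / n
        result[label] = prior * likelihood
    return result


class_counts, cond_counts = fit(records)
x = ("No", "Married", "High")  # the new customer X from part b)
print(scores(x, class_counts, cond_counts))
print(scores(x, class_counts, cond_counts, priors={"Yes": 0.1, "No": 0.9}))
```

Under these assumptions the score for Yes is zero in both runs, because none of the three defaulted borrowers in the table is Married or has High income. Changing the priors only rescales each class score, so it can change the prediction only when both scores are non-zero.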
Question 3. Logistic Regression (30 marks)
The dating web site Jiayuan.com requires its users to create profiles based on a survey in which they rate their interest (on a scale from 0 to 3) in five categories: physical fitness, music, spirituality, education, and alcohol consumption. A new Jiayuan customer, Joseph NoBody, has reviewed the profiles of 20 prospective dates and classified whether he is interested in learning more about them.
Based on Joseph's classification of these 20 profiles, Jiayuan has applied a logistic regression to predict whether Joseph is interested in other profiles that he has not yet viewed. The resulting logistic regression model is as follows:
Log odds of Interested = -0.920 + 0.325 × Fitness - 3.611 × Music + 5.535 × Education - 2.927 × Alcohol
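To make the link between log odds and probability concrete, the following Python sketch evaluates the model above and applies the standard logistic transform p = 1 / (1 + e^(-z)). The coefficients are copied from the equation. The function names and the example profile are hypothetical and chosen only for illustration.

```python
import math

# Coefficients copied from the fitted model above
INTERCEPT = -0.920
COEFFS = {"Fitness": 0.325, "Music": -3.611, "Education": 5.535, "Alcohol": -2.927}


def log_odds(profile):
    """Linear predictor: intercept plus the weighted sum of the ratings."""
    return INTERCEPT + sum(COEFFS[name] * profile[name] for name in COEFFS)


def probability(profile):
    """Convert log odds z into a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds(profile)))


# Hypothetical profile, not taken from the exam paper
example = {"Fitness": 2, "Music": 1, "Education": 1, "Alcohol": 1}
print(log_odds(example), probability(example))
```

A positive log odds value maps to a probability above 0.5 and a negative one to a probability below 0.5, which is why a cut-off of 0.5 on the probability is equivalent to a cut-off of 0 on the log odds.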
For the 20 profiles (observations) that Joseph has viewed and classified, this logistic regression model generates the following predicted probabilities of Interested.
Observation ID | Interested (Actual) | Probability of Interested (Predicted) | Observation ID | Interested (Actual) | Probability of Interested (Predicted)
20 | 1 | 1.000 | 16 | 1 | 0.512
17 | 1 | 0.999 | 9 | 0 | 0.485
4 | 1 | 0.999 | 6 | 0 | 0.419
12 | 0 | 0.877 | 18 | 1 | 0.368
14 | 1 | 0.853 | 3 | 0 | 0.365
19 | 1 | 0.767 | 2 | 0 | 0.330
11 | 1 | 0.754 | 8 | 0 | 0.322
7 | 0 | 0.666 | 5 | 0 | 0.200
13 | 1 | 0.657 | 1 | 0 | 0.168
10 | 1 | 0.602 | 15 | 0 | 0.128
a) Use a cut-off value of 0.5 to classify whether Joseph is interested or not, and construct the confusion matrix for this 20-observation data set. (5 marks)
b) According to Jiayuan, it costs the website 10 cents to recommend one profile to Joseph. If Joseph is interested, he will spend 1 rmb to gain full access to the profile. Please calculate the expected profit the website can gain from Joseph if it applies the classifier. (10 marks)
c) Based on the logistic regression result table, construct the ROC curve and calculate the AUC. Hint: the ROC curve is a set of line segments parallel or perpendicular to the x-axis. (10 marks)
d) If we have a new profile that has values of Fitness = 3, Music = 1, Education = 3, and Alcohol = 1, use the estimated logistic regression equation to compute the probability of Joseph's interest in this profile. (5 marks)
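The mechanics behind parts a) and c) can be sketched in Python as follows. This is an illustrative sketch, not a marking scheme. It assumes a profile is classified as Interested when its predicted probability is at least the cut-off, builds the ROC curve by sweeping the threshold over the distinct predicted probabilities, and computes the area under it with the trapezoidal rule. The names `rows`, `confusion`, `roc_points` and `auc` are illustrative.

```python
from itertools import groupby

# (observation id, actual interest, predicted probability) copied from the table
rows = [
    (20, 1, 1.000), (17, 1, 0.999), (4, 1, 0.999), (12, 0, 0.877), (14, 1, 0.853),
    (19, 1, 0.767), (11, 1, 0.754), (7, 0, 0.666), (13, 1, 0.657), (10, 1, 0.602),
    (16, 1, 0.512), (9, 0, 0.485), (6, 0, 0.419), (18, 1, 0.368), (3, 0, 0.365),
    (2, 0, 0.330), (8, 0, 0.322), (5, 0, 0.200), (1, 0, 0.168), (15, 0, 0.128),
]


def confusion(rows, cutoff=0.5):
    """Return (TP, FP, FN, TN), predicting 1 when probability >= cutoff (assumption)."""
    tp = sum(1 for _, y, p in rows if p >= cutoff and y == 1)
    fp = sum(1 for _, y, p in rows if p >= cutoff and y == 0)
    fn = sum(1 for _, y, p in rows if p < cutoff and y == 1)
    tn = sum(1 for _, y, p in rows if p < cutoff and y == 0)
    return tp, fp, fn, tn


def roc_points(rows):
    """Sweep the threshold from high to low; one (FPR, TPR) point per distinct score."""
    pos = sum(y for _, y, _ in rows)
    neg = len(rows) - pos
    ordered = sorted(rows, key=lambda r: -r[2])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, group in groupby(ordered, key=lambda r: r[2]):
        for _, y, _ in group:
            tp += y
            fp += 1 - y
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under a list of (FPR, TPR) points sorted by FPR."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


print(confusion(rows))                # (TP, FP, FN, TN) at cut-off 0.5
print(round(auc(roc_points(rows)), 3))
```

Grouping tied scores before emitting a point keeps the curve correct when two observations share the same predicted probability, as observations 17 and 4 do here.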
Question 4. Decision Analytic Thinking (25 marks)
As the World Cup 2022 is ongoing in Qatar now, you are about to launch a legal betting shop in your neighbourhood. Thanks to your friend Lucky Yu, a veteran in the betting shop business, you have access to an extensive dataset on existing customers, including gender, age, demographic information by zip code, and their betting history. You plan to send invitations to residents in your area (with a targeting cost) and run experiments under the following assumptions:
Customers’ bets may vary.
Customers may place a bet when they pass by your betting shop (even when they did not receive the invitation).
Targeting cost is fixed.
Other than the targeting cost, there are no additional costs.
You have been asked to build several data mining models that would suggest which customers should be targeted to maximize your profit. Use the expected value framework to determine what models should be used to address the problem, and explain the dataset you need for each of them.
Note: It is sufficient to write down the correct expected value equations to identify the models that should be constructed. You need to consider different situations and build individual models for different situations.
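One way to read the expected value framework for this setting is to compare the expected profit from a resident who is targeted with the expected profit from the same resident who is not targeted, since residents may bet either way. The Python sketch below illustrates that comparison only. Every function name, parameter and number in it is hypothetical, and it assumes that each probability and each expected bet value would come from a separate model trained on the corresponding group (targeted or not targeted).

```python
def expected_profit_targeted(p_bet_targeted, value_targeted, targeting_cost):
    """Expected profit if the resident is invited (all inputs are hypothetical estimates)."""
    return p_bet_targeted * value_targeted - targeting_cost


def expected_profit_untargeted(p_bet_untargeted, value_untargeted):
    """Expected profit if the resident is not invited; no targeting cost is paid."""
    return p_bet_untargeted * value_untargeted


def should_target(p_bet_targeted, value_targeted,
                  p_bet_untargeted, value_untargeted, targeting_cost):
    """Target only when inviting the resident raises the expected profit."""
    gain = expected_profit_targeted(p_bet_targeted, value_targeted, targeting_cost)
    baseline = expected_profit_untargeted(p_bet_untargeted, value_untargeted)
    return gain > baseline


# Hypothetical estimates for one resident
print(should_target(p_bet_targeted=0.30, value_targeted=50.0,
                    p_bet_untargeted=0.10, value_untargeted=40.0,
                    targeting_cost=5.0))
```

Under this reading, the probabilities are outputs of class probability estimation models and the bet values are outputs of regression models, each fitted on data from residents in the matching situation.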
2024-01-04