
Problem Set 1

DSME 6756: Business Intelligence Techniques and Applications (Winter 2022)

Due at 9:30AM, Monday, December 12, 2022

Instructions

Please read the Jupyter Notebook of Session 1 and answer the questions below. Submit a Jupyter Notebook containing your solutions and code on Blackboard. This problem set is worth 6 points in total. Please name your Jupyter Notebook as

 YourLastName_YourFirstName_PS1.ipynb (e.g., Zhang_Renyu_PS1.ipynb)

1.  Playing with the WHO Data Set (3 points)

Please read the data set WHO.csv into Python and answer the following questions:

(a)  (0.6 point) Missing data. Which variables have at least THREE missing (i.e., NA) values?

(b)  (0.6 point) Fertility rate. Which country has the highest fertility rate, and which has the lowest?

(c)  (0.6 point) Variation of GNI. Which region has the smallest variation (measured by standard deviation) in Gross National Income (GNI)? What is the standard deviation of GNI in this region?

(d)  (0.6 point) Child mortality of rich countries. We define a country to be a rich country if its GNI exceeds $20,000. What is the mean child mortality of the rich countries?

(e)  (0.6 point) Correlation. Demonstrate the relationship between income level and life expectancy by calculating their correlation and visualizing it.
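The kinds of operations these questions call for can be sketched with pandas on a small made-up table. This is only an illustration, not a solution: the column names (`Country`, `Region`, `FertilityRate`, `GNI`, `LifeExpectancy`, `ChildMortality`) and all values are assumptions and may not match WHO.csv, and the missing-value threshold is lowered to 2 because the toy table is tiny.

```python
import numpy as np
import pandas as pd

# Toy stand-in for WHO.csv. Column names and values are hypothetical.
who = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E", "F"],
    "Region": ["North", "North", "North", "South", "South", "South"],
    "FertilityRate": [1.5, 2.0, np.nan, 4.0, 5.5, np.nan],
    "GNI": [30000.0, 25000.0, 28000.0, 2000.0, np.nan, 9000.0],
    "LifeExpectancy": [80, 78, 79, 60, 58, 65],
    "ChildMortality": [4.0, 5.0, 6.0, 60.0, 80.0, 40.0],
})

# Count NA values per column, then keep columns at or above a threshold.
min_missing = 2  # the question uses 3; lowered here for the toy table
na_counts = who.isna().sum()
cols_with_many_na = na_counts[na_counts >= min_missing].index.tolist()

# Rows holding the extreme values; idxmax/idxmin skip NA by default.
highest = who.loc[who["FertilityRate"].idxmax(), "Country"]
lowest = who.loc[who["FertilityRate"].idxmin(), "Country"]

# Sample standard deviation (ddof=1) of GNI within each region.
gni_std = who.groupby("Region")["GNI"].std()
least_varied_region = gni_std.idxmin()

# Boolean filter, then a mean over the filtered rows.
rich_mean_mortality = who.loc[who["GNI"] > 20000, "ChildMortality"].mean()

# Pearson correlation; rows with NA in either column are dropped pairwise.
corr = who["GNI"].corr(who["LifeExpectancy"])

print(cols_with_many_na, highest, lowest, least_varied_region)
print(rich_mean_mortality, round(corr, 3))
```

For the visualization, `who.plot.scatter(x="GNI", y="LifeExpectancy")` is one option if matplotlib is installed. With the real file, the table would come from `pd.read_csv("WHO.csv")`, and the column names should be checked with `who.columns` first.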

2.  User Retention (3 points)

The data set Retention.csv contains the active user information of an App for 3 days.  It has three variables:

• user_id: A unique identifier for each user.

• play_duration: The amount of time (in minutes) the user uses the App.

• day: The day of the record.

Note that only users who are active (i.e., log into the App) will be recorded in this data set. Please load the data set into Python and answer the following questions. Hint: You need to use the join operator to link the active users of different days together.
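As a rough illustration of the hint, the sketch below links the active users of two days with `DataFrame.merge` on a made-up log. The values are invented, and it assumes each user appears at most once per day, which the real Retention.csv may or may not satisfy.

```python
import pandas as pd

# Toy stand-in for Retention.csv (values are made up); columns follow
# the description above.
logs = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 1, 3, 5],
    "play_duration": [10.0, 3.0, 7.5, 1.0, 12.0, 2.0, 8.0],
    "day": [1, 1, 1, 1, 2, 2, 2],
})

day1 = logs[logs["day"] == 1]
day2 = logs[logs["day"] == 2]

# Inner join on user_id keeps only users recorded on BOTH days.
both = day1.merge(day2, on="user_id", how="inner", suffixes=("_d1", "_d2"))
active_both_days = sorted(both["user_id"])

# Left join with indicator=True keeps every day-1 user and flags
# whether a matching day-2 record exists.
flagged = day1.merge(day2[["user_id"]], on="user_id", how="left", indicator=True)
flagged["returned"] = flagged["_merge"] == "both"

print(active_both_days)
print(flagged[["user_id", "returned"]])
```

The inner join is enough when only the count of returning users matters; the left join is handy when the day-1 rows (for example their `play_duration`) need to be kept alongside the returned flag.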

(a)  (2 points) The retention rate of day i is defined as the proportion of active users of day i who will remain active in day i + 1:

$$\frac{N_{i,i+1}}{N_i},$$

where $N_{i,i+1}$ is the number of users who are active in BOTH days $i$ and $i + 1$, and $N_i$ is the number of active users in day $i$. Calculate the retention rate of day 1 and day 2, respectively.

(b)  (1 point) Define the users whose play_duration exceeds 6 minutes in day i as very active users. The other users are defined as marginally active users. Compare the retention rates of very active users and marginally active users in day 1 and day 2.
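The retention-rate definition and the very active / marginally active split can be sketched as follows. This is a minimal sketch on invented data, not a reference solution: the helper `retention_rate` is a hypothetical name, and user segments are defined from day-i records only, while any record in day i + 1 counts as remaining active.

```python
import pandas as pd

def retention_rate(today, tomorrow):
    """Share of users in `today` who also appear in `tomorrow`."""
    base = today[["user_id"]].drop_duplicates()
    nxt = tomorrow[["user_id"]].drop_duplicates()
    stayed = base.merge(nxt, on="user_id", how="inner")
    # Guard against an empty segment to avoid division by zero.
    return len(stayed) / len(base) if len(base) else float("nan")

# Toy stand-in for Retention.csv; all values are made up.
logs = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 6, 1, 3, 4, 5],
    "play_duration": [10.0, 3.0, 7.5, 1.0, 9.0, 12.0, 2.0, 4.0, 8.0],
    "day": [1, 1, 1, 1, 1, 2, 2, 2, 2],
})

day1 = logs[logs["day"] == 1]
day2 = logs[logs["day"] == 2]

# Segment day-1 users by the 6-minute threshold from the question.
very_active = day1[day1["play_duration"] > 6]
marginally_active = day1[day1["play_duration"] <= 6]

rates = {
    "all": retention_rate(day1, day2),
    "very_active": retention_rate(very_active, day2),
    "marginally_active": retention_rate(marginally_active, day2),
}
print(rates)
```

The same helper applies to day 2 by passing the day-2 and day-3 records instead.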