
Group Project II

Note:

1. The submission is due on Friday, Dec 8th (5 PM, the SAME time as your final exam).

2. Group submission (up to 5 persons) is encouraged for project assignments. If you submit your assignment as a study group, please make sure that EVERY student in this group has first tried to solve ALL questions before group discussion and can solve ALL questions after group discussion.

3. Please submit your group assignment to Canvas with the title “Group ID_Last Name1_Group ID_Last Name 2_Group ID_Last Name 3”. For example, if Nash from Group 1, Li from Group 2, and Patel from Group 1 worked together on the third assignment, the title should be “G1_Nash_G2_Li_G1_Patel”. Only ONE submission is needed for all members included in the same group. All students included in the same group submission will receive the same grade. Please submit a separate group project if you do not agree with other group members and intend to offer different answers.

4. For calculation questions, please show your derivation step by step or you can take a photo of your derivation steps and submit it with your answers (please combine these in one PDF file).

Please feel free to add extra space (below each question) if needed. Only PDF files are accepted.

Part I

1.   The standard error refers to the standard deviation of the mean from:

Your answer              

A.  Any sample

B.  All possible samples from a population with the same sample size

C.  Many samples from a population with different sample sizes

D.  Many samples from a population with the same sample size

E.  All possible samples from a population with different sample sizes

2.   The central limit theorem states that as the                     (TWO words) becomes larger, the sampling distribution becomes approximately a normal distribution. This conclusion is useful for statistical inference.

3.   Which one of the following statements is correct?   Your answer:                          

A.  If a population distribution is skewed, the sampling distribution of the mean is skewed.

B.  Statistical inference cannot be based on snowball sampling.

C.  Social scientists often collect repeated samples from the same population

D.  The standard error is not related to the variance of the population.

A survey on 900 UBC undergraduate students shows that their average monthly expenditure is 1600 CAD with a standard deviation of 300 CAD. For the calculation below you may need to  use:

 

4.   The standard error of estimating the monthly expenditure is                       .

5.   Calculate a 95% confidence interval of monthly expenditure for all UBC undergraduates (give an interval as your answer and keep two decimal places).

1600 +/- 1.96 x 10=1580.4 to 1619.6
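The arithmetic in Questions 4 and 5 can be checked with a short Python sketch. It assumes the usual large-sample formula, SE = s / sqrt(n), and the critical value z = 1.96 for 95% confidence. The function name `mean_confidence_interval` is only illustrative.

```python
import math

def mean_confidence_interval(mean, sd, n, z):
    """Return (standard error, lower bound, upper bound) for a sample mean."""
    se = sd / math.sqrt(n)   # standard error of the mean
    margin = z * se          # margin of error at the chosen confidence level
    return se, mean - margin, mean + margin

# Values from the survey of 900 UBC undergraduates
se, lower, upper = mean_confidence_interval(mean=1600, sd=300, n=900, z=1.96)
print(f"SE = {se:.2f}, 95% CI = {lower:.2f} to {upper:.2f}")
```

The same function works for other confidence levels by passing a different critical value for `z`.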

6. Suppose employees from a company A have an average salary of 3000 CAD with a standard deviation of 500. You draw a random sample of 400 employees.

6.1 What is the probability that the mean of the sample is between 3050 and 2950?

6.2 What is the probability that the mean of the sample is between 3025 and 2975?

6.3 Why is the probability in 6.2 lower or higher than the probability in 6.1 (between 3050 and 2950)?

Lower, because a narrower interval corresponds to a lower level of confidence; the wider interval in 6.1 has the higher probability.
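As a rough check on 6.1 to 6.3, the sketch below computes both probabilities. It assumes the sampling distribution of the mean is normal with standard error sigma / sqrt(n), which the central limit theorem supports for n = 400. The helper names are hypothetical, and the normal CDF is built from `math.erf` in the standard library.

```python
import math

def normal_cdf(z):
    """Cumulative probability of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_mean_between(low, high, mu, sigma, n):
    """P(low < sample mean < high), assuming a normal sampling distribution."""
    se = sigma / math.sqrt(n)   # standard error of the mean
    return normal_cdf((high - mu) / se) - normal_cdf((low - mu) / se)

p_wide = prob_mean_between(2950, 3050, mu=3000, sigma=500, n=400)
p_narrow = prob_mean_between(2975, 3025, mu=3000, sigma=500, n=400)
print(f"P(2950 < mean < 3050) = {p_wide:.4f}")
print(f"P(2975 < mean < 3025) = {p_narrow:.4f}")
```

With a standard error of 25, the wider interval spans two standard errors on each side of the mean and the narrower one spans a single standard error, which is why the first probability is larger.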

6.4   A similar company B would like to conduct another survey on its employees. For all employees in company B the researcher assumed the same average salary of 3000 CAD with a standard deviation of 500. To achieve a margin of error (sampling error) of 100 at a 95% confidence interval, how many employees should the researcher interview? (provide an integer value as answer)

100 = 1.96 * 500 / sqrt(n) -> sqrt(n) = 9.8 -> n = 9.8**2 = 96.04, rounded up to 97
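The sample-size calculation can be sketched in Python as follows. It assumes z = 1.96 for 95% confidence and follows the common convention of rounding up, so that the margin of error does not exceed the target. The function name `required_sample_size` is illustrative.

```python
import math

def required_sample_size(sd, margin, z=1.96):
    """Return the exact n and the smallest integer n with z * sd / sqrt(n) <= margin."""
    n_exact = (z * sd / margin) ** 2
    return n_exact, math.ceil(n_exact)

n_exact, n_needed = required_sample_size(sd=500, margin=100)
print(f"exact n = {n_exact:.2f}, interview at least {n_needed} employees")
```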

7. One class in the Sociology Department has an average grade of 80 points with a standard deviation of 3 points. The instructor decides to adjust the grade distribution by adding 2 points to EVERY student. The mean and standard deviation of the new grade distribution are                    (one number as your answer) and                  (one number as your answer), respectively.
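The effect of adding a constant to every score can be illustrated with a small Python sketch. The grades below are hypothetical and are not the class data. They only show that a constant shift moves the mean by that constant and leaves the spread unchanged.

```python
import statistics

# Hypothetical grades, used only to illustrate the rule
grades = [75, 78, 80, 82, 85]
adjusted = [g + 2 for g in grades]   # add 2 points to every student

mean_before, sd_before = statistics.mean(grades), statistics.pstdev(grades)
mean_after, sd_after = statistics.mean(adjusted), statistics.pstdev(adjusted)

print(f"before: mean = {mean_before}, sd = {sd_before:.4f}")
print(f"after:  mean = {mean_after}, sd = {sd_after:.4f}")
```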

8. You, as a sales manager of Apple, are interested in estimating the average amount spent by teenage shoppers at an online music store in a one-month period with 90% confidence. You collect a random sample of purchases by 64 teens. In this sample, the teens purchased an average of $56 worth of music online, with a standard deviation of $16. Please calculate the 90% confidence interval.

8.1 The standard error of the sample mean is                           16/8

8.2 The 90% confidence interval is                                   56 +/- 1.65*2

Part II

1.1 Please calculate the correlation matrix for hdi_2013, life_expectancy, mean_schooling, expected_schooling and income_per capita

1.2 Please briefly describe and explain findings from the 5 X 5 matrix above.

Please refer to the Body Mass Index data provided by the instructor.

Variables of the dataset are defined as follows:

Wave: the year when the survey was conducted.

Age: age of respondents

Male: male=1 female=0

Ch_BMI: body mass index expressed in units of kg/m2

Ch_overweight: overweight status- 1 overweight   0 normal weight

Income: household income per capita, adjusted by inflation

Income_tertile: levels of income-3 high income;    2  medium income;    1 low income

Urban: urban area=1; rural area=0

Birthyear: years of birth of respondents.

Birthyear5: years of birth organized by 5-year groups.

2.1 If we treat survey years (wave) as a continuous variable, please regress ch_BMI on wave and describe your findings. Are respondents surveyed in more recent years associated with higher or  lower BMI?

2.2 Now treat wave as a categorical variable and code it as a set of dummy variables. Please show regression results based on different coding methods using the first category (the year  1991), the last category (the year 2009), the year 2000 and the most frequent category as the reference group, respectively.

Note: you may need to use –i- or –ib- options in Stata.

2.3 When the first survey year (1991) is regarded as the reference group, please show the effect of the year 1991 on BMI.

2.4 Please calculate the average BMI across different groups of survey years. Is the average BMI associated with the year 1991 the same as your answer above? Why?

2.5 Please show results from a new regression model with survey wave (still treated as a continuous variable), male, urban, age, and income. Please describe the effect of each independent variable and its significance. How do you explain the reduction in the effect size of survey wave from the base model (in Question 2.1) to the new model?

2.6 How do you explain the increase in R2 from the base model to the new model?

2.7 The R2, Sum of Squares from Regression (SSR), and Sum of Squares from Errors (residuals, or SSE) are all given in the regression outputs. Actually, you could calculate R2 from SSR and SSE. Can you show the equation?

2.8 Now treat survey wave as a categorical variable and use the same model as in Question 2.5. Consider a hypothetical girl aged 9 from an urban area who was surveyed in 2006. If her family income per capita is 2000, what is her predicted BMI (keep four decimal places; you must also show the STATA output of the new regression model)?

2.9 For the model in 2.8, calculate the residuals (errors) using –yhat- and –generate- commands and name this variable error. What is the sum of error? What is the correlation coefficient between error and the predicted values of BMI? What is the correlation coefficient between error and any of the independent variables (wave, male, urban, age, and income) of the new regression model?
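The residual properties that 2.9 asks about hold for any least-squares regression that includes an intercept. The Python sketch below illustrates them on a one-predictor model. The data are hypothetical and are not the BMI dataset, and the variable names are only illustrative. A covariance of zero implies a correlation of zero.

```python
import statistics

# Hypothetical data, used only to illustrate properties of OLS residuals
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

mean_x, mean_y = statistics.mean(x), statistics.mean(y)

# Least-squares slope and intercept for y = a + b * x
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
b = sxy / sxx
a = mean_y - b * mean_x

fitted = [a + b * xi for xi in x]                 # predicted values
error = [yi - fi for yi, fi in zip(y, fitted)]    # residuals

def covariance(u, v):
    """Population covariance of two equal-length lists."""
    mu, mv = statistics.mean(u), statistics.mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / len(u)

sum_error = sum(error)
cov_error_fitted = covariance(error, fitted)
cov_error_x = covariance(error, x)
print(f"sum of residuals       = {sum_error:.10f}")
print(f"cov(residuals, fitted) = {cov_error_fitted:.10f}")
print(f"cov(residuals, x)      = {cov_error_x:.10f}")
```

All three quantities are zero up to floating-point rounding.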

3 Scholars have examined the relationship between gender and cigarette smoking for adults in a Canadian city.

Smoking everyday    Men    Women    Total
Yes                  80       10       90
No                   90      100      190
Total               170      110      280

3.1 What is the number of degrees of freedom for this table? Your answer is                      

The calculation for the chi-square value is listed as follows. Please fill in the cells marked Question 3.2 and Question 3.3 with a number (keep one decimal place)

 

Cell         f_observed    f_expected      (f_observed - f_expected)^2 / f_expected
Men/Yes          80        54.6            11.8
Men/No           90        Question 3.2    5.6
Women/Yes        10        35.4            18.2
Women/No        100        74.6            Question 3.3

3.2 Your answer                 (please refer to the table above for question 3.2)

3.3 Your answer                 (please refer to the table above for question 3.3)

3.4 The Chi-square test statistic is                               (keep one decimal place)

3.5 If we set alpha as 0.001, the corresponding chi-square critical value is 10.828. Based on your calculation, will you reject or accept the null hypothesis that sex and smoking are independent?

(provide ONE word “reject” or “accept” as your answer)

Your answer                    
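The chi-square calculation in Question 3 can be checked with the Python sketch below. It assumes the standard test of independence: each expected count is row total times column total divided by the grand total, and the degrees of freedom are (rows - 1) x (columns - 1). The variable names are illustrative.

```python
# Observed counts from the smoking table: rows = Yes/No, columns = Men/Women
observed = [[80, 10],
            [90, 100]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Each cell contributes (observed - expected)^2 / expected
contributions = [[(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
                 for obs_row, exp_row in zip(observed, expected)]

chi_square = sum(sum(row) for row in contributions)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print("expected:", [[round(e, 1) for e in row] for row in expected])
print("contributions:", [[round(c, 1) for c in row] for row in contributions])
print(f"chi-square = {chi_square:.1f}, df = {df}")
```

Comparing `chi_square` with the critical value of 10.828 given in 3.5 then settles the test decision.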

4. A student plots four correlation graphs as follows based on r=0.98, 0.63, -0.52 and -0.89.

 

4.1  Which one of the four figures corresponds to r=-0.89?    Your answer                  (choose from A, B, C, D)

4.2  Which one of the four figures corresponds to r=0.63? Your answer                (choose from A, B, C, D)

5. In regression analysis, we learned that the sum of errors ei  is                (please provide a number as your answer)

Suppose that we have a regression model as: Income=a+b*Age.

6. The dependent variable of this model is                         (choose one from Income, a, b and Age)

7. We wish to account for variability in the writing test scores using information on reading   scores, math scores and the program type the student is in. The categorical variable prog has three levels: 1) general program, 2) academic program, and 3) vocational program.


We have Stata output as follows:

regress write read math prog2 prog3

      Source |       SS       df       MS
-------------+------------------------------
       Model |  8170.58624     4  2042.64656
    Residual |  9708.28876   195  49.7860962
-------------+------------------------------
       Total |   17878.875   199   89.843593

Number of obs =     200
F(  4,   195) =   41.03
Prob > F      =  0.0000
R-squared     =  ??????
Adj R-squared =  0.4459
Root MSE      =  7.0559

       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |    .289028   .0659478     4.38   0.000     .1589656    .4190905
        math |   .3587215   .0745443     4.81   0.000     .2117048    .5057381
       prog2 |   .6647754    1.32845     0.50   0.617    -1.955198    3.284749
       prog3 |  -2.253484   1.468445    -1.53   0.127    -5.149556    .6425886
       _cons |   19.00854    3.40933     5.58   0.000     12.28465    25.73243

7.1 The sum of squares due to regression (SSR) is                       (provide a number as answer)

7.2 As the value of R2 is missing, can you calculate R2 according to other information listed in the table (keep four decimal places)?

Your answer                          
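The relationship between R2 and the sums of squares can be sketched in Python using the figures from the Stata output above. The sketch assumes the standard decomposition, total sum of squares = SSR + SSE, so that R2 = SSR / (SSR + SSE). The variable names are illustrative.

```python
# Sums of squares copied from the Stata output
ss_model = 8170.58624      # regression (model) sum of squares, SSR
ss_residual = 9708.28876   # error (residual) sum of squares, SSE
ss_total = ss_model + ss_residual

# R-squared is the share of total variation explained by the model;
# it also equals 1 - ss_residual / ss_total
r_squared = ss_model / ss_total
print(f"SS total = {ss_total:.3f}, R-squared = {r_squared:.4f}")
```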

7.3 For a student from the academic program (prog2) whose reading score is 80 and whose math score is 90, the predicted writing test score is                           (provide a number as answer, keep two decimal places)
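A prediction from the fitted model can be sketched in Python as below. The coefficients are copied from the Stata output. The sketch assumes that prog2 and prog3 are 0/1 dummy variables for the academic and vocational programs, with the general program as the omitted reference group. The function name `predict_write` is hypothetical.

```python
# Coefficients copied from the Stata output
# (assumption: the general program is the omitted reference group)
coef = {"_cons": 19.00854, "read": 0.289028, "math": 0.3587215,
        "prog2": 0.6647754, "prog3": -2.253484}

def predict_write(read, math, prog2=0, prog3=0):
    """Predicted writing score from the fitted linear model."""
    return (coef["_cons"] + coef["read"] * read + coef["math"] * math
            + coef["prog2"] * prog2 + coef["prog3"] * prog3)

# Academic-program student with reading score 80 and math score 90
score = predict_write(read=80, math=90, prog2=1)
print(f"predicted writing score = {score:.2f}")
```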

7.4. Which variables above have significant effects on writing scores at the 0.05 level?

A. Reading score        B. Math score         C. academic program (prog2)       D. vocational program (prog3)

Your answer                      

8. If we add more independent variables to the same regression model, what will happen to R2 ?

A. It will always decrease

B. It will always increase

C. It may decrease or increase

D. It can be larger than 100%.

Your answer                              

Part III

Please refer to the STATA dataset titled “Group Project II” for a dataset collected on UBC students and its corresponding questionnaire in the Appendix. For each of the two questions below, please:

1) Identify the key dependent and independent variables in your model;

2) Show your STATA codes for data management (e.g., create new variables, remove outliers);

3) Show your STATA codes for regression analysis and the corresponding STATA output (screenshots are OK).

4) Interpret the significant and non-significant results of your regression analysis.

5) Assess if your regression model(s) fits the data well.

1. There are two competing assumptions about monthly expenses among college students. First, their monthly expenses are determined by their individual characteristics (e.g., grades, major, socioeconomic status). Second, their monthly expenses are determined by more structural factors such as residential arrangements and immigration status.

Please try to adjudicate these two assumptions based on your analysis of Question 18 about monthly expenses and discuss factors associated with higher/lower expected expenses.

2. Please focus on Question 21 (expected GPA) and explore what factors predict one’s high/low GPA.