Group Project II
Note:
1. The submission is due on Friday, Dec 8th (5 PM, the SAME time as your final exam).
2. Group submission (up to 5 persons) is encouraged for project assignments. If you submit your assignment as a study group, please make sure that EVERY student in this group has first tried to
solve ALL questions before group discussion and can solve ALL questions after group discussion.
3. Please submit your group assignment to Canvas with the title “Group ID_Last Name 1_Group ID_Last Name 2_Group ID_Last Name 3”. For example, if Nash from Group 1, Li from Group 2, and Patel from Group 1 worked together on the third assignment, the title should be “G1_Nash_G2_Li_G1_Patel”. Only ONE submission is needed for all members of the same group. All students included in the same group submission will receive the same grade. Please submit a separate group project if you do not agree with the other group members and intend to offer different answers.
4. For calculation questions, please show your derivation step by step or you can take a photo of your derivation steps and submit it with your answers (please combine these in one PDF file).
Please feel free to add extra space (below each question) if needed. Only PDF files are accepted.
Part I
1. The standard error refers to the standard deviation of the mean from:
Your answer
A. Any sample
B. All possible samples from a population with the same sample size
C. Many samples from a population with different sample sizes
D. Many samples from a population with the same sample size
E. All possible samples from a population with different sample sizes
2. The central limit theorem states that as the (TWO words) becomes larger,
the sampling distribution becomes approximately a normal distribution. This conclusion is useful for statistical inference.
3. Which one of the following statements is correct? Your answer:
A. If a population distribution is skewed, the sampling distribution of the mean is skewed.
B. Statistical inference cannot be based on snowball sampling.
C. Social scientists often collect repeated samples from the same population.
D. The standard error is not related to the variance of the population.
A survey on 900 UBC undergraduate students shows that their average monthly expenditure is 1600 CAD with a standard deviation of 300 CAD. For the calculation below you may need to use:
4. The standard error of estimating the monthly expenditure is .
5. Calculate a 95% confidence interval of monthly expenditure for all UBC undergraduates (An interval as your answer and keep two decimal places).
1600 ± 1.96 × 10 = 1580.40 to 1619.60
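The arithmetic behind questions 4 and 5 can be checked with a short Python sketch. It assumes the large-sample normal critical value 1.96 for 95% confidence; the function name `mean_confidence_interval` is illustrative and not part of the course materials.

```python
import math

def mean_confidence_interval(mean, sd, n, z):
    """Return (standard_error, lower, upper) for a sample mean."""
    se = sd / math.sqrt(n)   # standard error of the mean
    margin = z * se          # margin of error at the chosen confidence level
    return se, mean - margin, mean + margin

# Survey of 900 students: mean 1600 CAD, standard deviation 300 CAD
se, lower, upper = mean_confidence_interval(mean=1600, sd=300, n=900, z=1.96)
print(f"SE = {se:.2f}, 95% CI = {lower:.2f} to {upper:.2f}")
# prints: SE = 10.00, 95% CI = 1580.40 to 1619.60
```

The same function can be reused for other confidence levels by changing `z`.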
6. Suppose employees from a company A have an average salary of 3000 CAD with a standard deviation of 500. You draw a random sample of 400 employees.
6.1 What is the probability that the mean of the sample is between 3050 and 2950?
6.2 What is the probability that the mean of the sample is between 3025 and 2975?
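6.3 Why is the probability in 6.2 lower or higher than the probability for the interval from 2950 to 3050 in 6.1?
Lower, because the interval in 6.2 is narrower: a wider interval around the mean captures more of the sampling distribution and therefore carries a higher probability.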
6.4 A similar company B would like to conduct another survey on its employees. For all
employees in company B the researcher assumed the same average salary of 3000 CAD with a standard deviation of 500. To achieve a margin of error (sampling error) of 100 at a 95%
confidence interval, how many employees should the researcher interview? (provide an integer value as answer)
100 = 1.96 × 500/√n → √n = 9.8 → n = 9.8² = 96.04, which rounds up to 97
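The calculations in question 6 can be sketched in Python as a check. This is a minimal sketch, assuming the sampling distribution of the mean is approximately normal (n = 400) and using 1.96 as the 95% critical value; the helper names are illustrative.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_mean_between(low, high, mu, sigma, n):
    """P(low < sample mean < high) under the normal approximation."""
    se = sigma / math.sqrt(n)
    return normal_cdf((high - mu) / se) - normal_cdf((low - mu) / se)

def required_sample_size(sigma, margin, z):
    """Smallest integer n with z * sigma / sqrt(n) <= margin."""
    return math.ceil((z * sigma / margin) ** 2)

# Standard error is 500 / sqrt(400) = 25
p_wide = prob_mean_between(2950, 3050, mu=3000, sigma=500, n=400)    # +/- 2 SE
p_narrow = prob_mean_between(2975, 3025, mu=3000, sigma=500, n=400)  # +/- 1 SE
n_needed = required_sample_size(sigma=500, margin=100, z=1.96)
print(round(p_wide, 4), round(p_narrow, 4), n_needed)
# prints: 0.9545 0.6827 97
```

Values read from a printed z-table may differ in the last decimal place (for example 0.9544 and 0.6826).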
7. One class in the Sociology Department has an average grade of 80 points with a standard
deviation 3 points. The instructor decides to adjust the grades distribution by adding 2 points to EVERY student. The mean and standard deviation of the new grades distribution are (one number as your answer) and (one number as your answer) , respectively.
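The effect of adding a constant to every observation can be illustrated with a small Python sketch. The three grades below are hypothetical and are not the class data; the point is only that the mean shifts by the constant while the spread stays the same.

```python
import statistics

grades = [77, 80, 83]                # hypothetical grades, not course data
adjusted = [g + 2 for g in grades]   # add 2 points to every student

mean_shift = statistics.mean(adjusted) - statistics.mean(grades)
sd_before = statistics.pstdev(grades)    # population standard deviation
sd_after = statistics.pstdev(adjusted)

print("mean shift:", mean_shift)
print("standard deviation unchanged:", abs(sd_before - sd_after) < 1e-12)
```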
8. You, as a sales manager of Apple, are interested in estimating the average amount spent by teenage shoppers at an online music store in a one-month period with 90% confidence. You collect a random sample of purchases of 64 teens. In this sample, the 64 teens purchased an average of $56 worth of music online. And the standard deviation of purchases is $16. Please calculate the 90% confidence interval.
8.1 The standard error of the sample mean is 16/√64 = 16/8 = 2
8.2 The 90% confidence interval is 56 ± 1.65 × 2 = 52.70 to 59.30
Part II
1.1 Please calculate the correlation matrix for hdi_2013, life_expectancy, mean_schooling, expected_schooling and income_per capita
1.2 Please briefly describe and explain findings from the 5 X 5 matrix above.
Please refer to the Body Mass Index data provided by the instructor.
Variables of the dataset are defined as follows:
Wave: the year when the survey was conducted.
Age: age of respondents
Male: male=1 female=0
Ch_BMI: body mass index expressed in units of kg/m2
Ch_overweight: overweight status- 1 overweight 0 normal weight
Income: household income per capita, adjusted by inflation
Income_tertile: levels of income-3 high income; 2 medium income; 1 low income
Urban: urban area=1; rural area=0
Birthyear: years of birth of respondents.
Birthyear5: years of birth organized by 5-year groups.
2.1 If we treat survey years (wave) as a continuous variable, please regress ch_BMI on wave and describe your findings. Are respondents surveyed in more recent years associated with higher or lower BMI?
2.2 Now treat wave as a categorical variable and code it as a set of dummy variables. Please show regression results based on different coding methods using the first category (the year 1991), the last category (the year 2009), the year 2000 and the most frequent category as the reference group, respectively.
Note: you may need to use –i- or –ib- options in Stata.
2.3 When the first survey year (1991) is treated as the reference group, please show the effect of the year 1991 on BMI.
2.4 Please calculate the average BMI across different groups of survey years. Is the average BMI associated with the year 1991 the same as your answer above? Why?
2.5 Please show results from a new regression model with survey waves (still treated as a
continuous variable), male, urban, age, and income. Please describe the effects of each
independent variable and their significance. How do you explain the reduction in survey waves’ size of effect from the base model (in Question 2.1) to the new model?
2.6 How do you explain the increase in R2 from the base model to the new model?
2.7 The R2, the sum of squares from regression (SSR), and the sum of squares from errors (residuals, or SSE) are all given in the regression output. In fact, you can calculate R2 from SSR and SSE. Can you show the equation?
2.8 Now treat survey wave as a categorical variable and use the same model as in Question 2.5. Consider a hypothetical 9-year-old girl from an urban area who was surveyed in 2006. If her family income per capita is 2000, what is her predicted BMI? (Keep four decimal places; you must also show the Stata output of the new regression model.)
2.9 For the model in 2.8, calculate the residuals (errors) using the –yhat- and –generate- commands and name this variable error. What is the sum of error? What is the correlation coefficient between error and the predicted values of BMI? What is the correlation coefficient between error and any of the independent variables (wave, male, urban, age, and income) of the new regression model?
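The properties of least-squares residuals that question 2.9 asks about can be illustrated outside Stata. The Python sketch below fits a one-predictor regression to hypothetical data (not the BMI dataset) and checks three quantities: the sum of the residuals, and the cross-product sums between the residuals and the predictor and between the residuals and the fitted values. Those cross-product sums are the numerators of the corresponding correlation coefficients.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def cross_product_sum(u, v):
    """Sum of cross-products of deviations (numerator of Pearson's r)."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))

# Hypothetical data for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(x, y)
fitted = [a + b * xi for xi in x]
error = [yi - fi for yi, fi in zip(y, fitted)]

# Each check allows for floating-point rounding
print(abs(sum(error)) < 1e-9)
print(abs(cross_product_sum(error, x)) < 1e-9)
print(abs(cross_product_sum(error, fitted)) < 1e-9)
```

In a model with an intercept, the same checks apply to every independent variable included in the regression.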
3 Scholars have examined the relationship between gender and cigarette smoking for adults in a Canadian city.
Smoking every day | Men | Women | Total
Yes               |  80 |    10 |    90
No                |  90 |   100 |   190
Total             | 170 |   110 |   280
3.1 What is the number of degrees of freedom for this table? Your answer is
The calculation of the chi-square value is shown below. Please fill in the cells marked Question 3.2 and Question 3.3 with a number (keep one decimal place).
Cell      | f_observed | f_expected   | (f_observed − f_expected)^2 / f_expected
Men/Yes   |         80 | 54.6         | 11.8
Men/No    |         90 | Question 3.2 | 5.6
Women/Yes |         10 | 35.4         | 18.2
Women/No  |        100 | 74.6         | Question 3.3
3.2 Your answer (please refer to the table above for question 3.2)
3.3 Your answer (please refer to the table above for question 3.3)
3.4 The Chi-square test statistic is (keep one decimal place)
3.5 If we set alpha as 0.001, the corresponding chi-square critical value is 10.828. Based on your calculation, will you reject or accept the null hypothesis that sex and smoking are independent?
(provide ONE word “reject” or “accept” as your answer)
Your answer
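The chi-square calculation in question 3 can be sketched in Python. Each expected frequency is (column total × row total) / grand total, and the test statistic is the sum of (observed − expected)² / expected over the four cells. The variable names are illustrative.

```python
# Observed counts from the gender-by-smoking table
observed = {
    ("Men", "Yes"): 80, ("Men", "No"): 90,
    ("Women", "Yes"): 10, ("Women", "No"): 100,
}
row_totals = {"Yes": 90, "No": 190}
col_totals = {"Men": 170, "Women": 110}
n = 280

chi_square = 0.0
for (sex, smokes), f_obs in observed.items():
    f_exp = col_totals[sex] * row_totals[smokes] / n   # expected frequency
    contribution = (f_obs - f_exp) ** 2 / f_exp
    chi_square += contribution
    print(f"{sex}/{smokes}: expected = {f_exp:.1f}, contribution = {contribution:.1f}")

# Degrees of freedom: (rows - 1) * (columns - 1)
df = (len(row_totals) - 1) * (len(col_totals) - 1)
print(f"chi-square = {chi_square:.1f}, df = {df}")
```

Summing the contributions after rounding each to one decimal place can differ from the unrounded total in the last digit.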
4. A student plots four correlation graphs as follows based on r=0.98, 0.63, -0.52 and -0.89.
4.1 Which one of the four figures corresponds to r=-0.89? Your answer (choose
from A, B, C, D)
4.2 Which one of the four figures corresponds to r=0.63? Your answer (choose from A, B, C, D)
5. In regression analysis, we learned that the sum of errors e_i is (please provide a number as your answer)
Suppose that we have a regression model as: Income=a+b*Age.
6. The dependent variable of this model is (choose one from Income, a, b and Age)
7. We wish to account for variability in the writing test scores using information on reading scores, math scores and the program type the student is in. The categorical variable prog has three levels: 1) general program, 2) academic program, and 3) vocational program.
We have Stata output as follows:
regress write read math prog2 prog3
Source | SS df MS
-------------+------------------------------
Model | 8170.58624 4 2042.64656
Residual | 9708.28876 195 49.7860962
-------------+------------------------------
Total | 17878.875 199 89.843593
Number of obs = 200
F( 4, 195) = 41.03
Prob > F = 0.0000
R-squared = ??????
Adj R-squared = 0.4459
Root MSE = 7.0559
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |    .289028   .0659478     4.38   0.000     .1589656    .4190905
        math |   .3587215   .0745443     4.81   0.000     .2117048    .5057381
       prog2 |   .6647754    1.32845     0.50   0.617    -1.955198    3.284749
       prog3 |  -2.253484   1.468445    -1.53   0.127    -5.149556    .6425886
       _cons |   19.00854    3.40933     5.58   0.000     12.28465    25.73243
7.1 The sum of squares due to regression (SSR) is (provide a number as answer)
7.2 As the value of R2 is missing, can you calculate R2 according to other information listed in the table (keep four decimal places)?
Your answer
7.3 For a student in the academic program (prog2) with a reading score of 80 and a math score of 90, the predicted writing test score is (provide a number as answer, keep two decimal places)
7.4. Which variables above have significant effects on writing scores at the 0.05 level?
A. Reading score
B. Math score
C. Academic program (prog2)
D. Vocational program (prog3)
Your answer
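The quantities in questions 7.1 to 7.3 can be reproduced from the Stata output with a short Python sketch. It uses the identity R2 = SSR / (SSR + SSE) and the fitted equation from the coefficient table; the dictionary names are illustrative, and prog2 = 1, prog3 = 0 encodes a student in the academic program.

```python
ss_model = 8170.58624      # Model SS from the output (SSR)
ss_residual = 9708.28876   # Residual SS from the output (SSE)
ss_total = ss_model + ss_residual

# R-squared is the share of total variation explained by the model;
# equivalently 1 - ss_residual / ss_total
r_squared = ss_model / ss_total

# Coefficients copied from the regression table
coef = {"_cons": 19.00854, "read": 0.289028, "math": 0.3587215,
        "prog2": 0.6647754, "prog3": -2.253484}

# Academic-program student with reading score 80 and math score 90
student = {"read": 80, "math": 90, "prog2": 1, "prog3": 0}
predicted = coef["_cons"] + sum(coef[k] * v for k, v in student.items())

print(f"R-squared = {r_squared:.4f}, predicted write = {predicted:.2f}")
# prints: R-squared = 0.4570, predicted write = 75.08
```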
8. If we add more independent variables to the same regression model, what will happen to R2 ?
A. It will always decrease
B. It will always increase
C. It may decrease or increase
D. It can be larger than 100%.
Your answer
Part III
Please refer to the STATA dataset titled “Group Project II” for a dataset collected on UBC students and its corresponding questionnaire in the Appendix. For each of the two questions below, please:
1) Identify the key dependent and independent variables in your model;
2) Show your STATA codes for data management (e.g., create new variables, remove outliers);
3) Show your STATA codes for regression analysis and the corresponding STATA output (screenshots are OK).
4) Interpret the significant and non-significant results of your regression analysis.
5) Assess if your regression model(s) fits the data well.
1. There are two competing assumptions about monthly expenses among college students. First, their
monthly expenses are determined by their individual characteristics (e.g., grades, major, socioeconomic status). Second, their monthly expenses are determined by more structural factors such as residential arrangements and immigration status.
Please try to adjudicate these two assumptions based on your analysis of Question 18 about monthly expenses and discuss factors associated with higher/lower expected expenses.
2. Please focus on Question 21 (expected GPA) and explore what factors predict one’s high/low GPA.
2023-12-21