Statistics 2120: Introduction to Statistical Analysis Homework 10
Instructions:
- Be sure to provide your full name and computing ID at the top of your work.
- Write out the Honor Pledge under your name and computing ID: “On my honor, I did not give nor receive aid on this assignment beyond the listed collaboration.”
- List the names of students with whom you collaborated under the Honor Pledge. If you did not collaborate, write ‘None’.
- Respond to each problem below thoroughly, showing all relevant work.
- Use Python for all calculations. Include a screenshot showing relevant code and output for each part using Python.
- Save your completed work as a PDF and upload it to Gradescope. Be sure to select the appropriate page(s) for each answer. Unselected work will not be graded.
Problems:
1. Pollution of water resources is a serious problem that can require substantial efforts and funds to rectify. In order to determine the financial resources required, an accurate assessment of the water quality, which is measured by the index of biotic integrity (IBI), is needed. Since IBI is very expensive to measure, a study was done for a collection of streams in the Ozark Highland ecoregion of Arkansas in which the IBI for each stream was measured along with land use measures that are inexpensive to obtain. The land use measures collected in the study are the area of the watershed in square kilometers, Area, and the percent of the watershed area that is forest, Forest. The objective of this study is to determine if these land use measures can predict the IBI so that the funding required for pollution clean up can be accurately estimated. The data collected from the n = 49 watersheds are provided in the file ozark.csv.
A. Create scatterplots to show the relationship between IBI and each of the explanatory variables. Using these plots, does each explanatory variable seem to be useful in predicting the IBI? Explain.
B. Write out the appropriate regression model for this analysis.
Note: This equation should contain no numbers.
What is the least-squares regression equation for this model?
C. Interpret each of the estimated regression coefficients in the equation from part B. in context, including the intercept.
D. Create the three relevant residual plots for this linear regression model.
Do the assumptions for the regression model hold? Explain.
E. Conduct the test to determine if this model is useful. Be sure to show all steps.
F. What proportion of the variation in IBI is explained by this model?
G. Conduct the hypothesis tests for each slope. Be sure to show all steps.
Do the conclusions for these tests match your observations from part A?
H. Researchers would like to estimate the IBI for a stream that was not a part of this study. What is the predicted IBI when the area of the watershed is 55 square kilometers and is 84% forest?
I. Determine the 95% prediction interval for the IBI for the stream described in part H. and the 95% confidence interval for the average IBI for streams with the characteristics given in part H.
J. Ecologists have long lobbied for protection for watershed forests because they believe that it increases the water quality. Before their next discussion with policy makers, they would like to have an estimate for the change in water quality when the percentage of watershed forest increases by 1% (assuming area of the watershed remains the same) and they ask you for an estimate with 95% confidence. What values should you give them? Explain.
2. The mathematics department at a university is interested in learning more about the grades that students earn in an introductory calculus class. A sample of n = 80 students who have taken this introductory calculus course during any semester of the last three academic years is selected. The academic record for each selected student is reviewed. The professors in the math department determine that, of the available information for all students, the relevant variables are their final grade in the calculus course, their high school percentile rank, their score on the algebra placement test given to all incoming students who plan to take a calculus course, their ACT Math score, and their ACT Natural Sciences score.
A. The professors in the math department believe that the score on the algebra placement test and the high school percentile rank will be the best explanatory variables for the final grade in calculus. The partially complete ANOVA table below shows relevant values for the model containing these two explanatory variables. Conduct the F test for whether this model is useful or not.
Source | SS | df | MS | F | p-value
Regression | 2840.4 | | 1420.2 | |
Residuals | 7491.8 | | 97.3 | |
Total | 10332.2 | | | |
B. What is the minimum number of values that need to be known to be able to fill in the ANOVA table completely?
C. One of the professors suggests testing whether adding the two ACT scores to the regression model would improve the model. The ANOVA table below shows the relevant values for the model containing all four explanatory variables. Conduct the appropriate F test to determine if adding the ACT scores improves the model.
Source | SS | df | MS | F | p-value
Regression | 2986.2 | | 746.5 | 7.6 | 3.5e-5
Residuals | 7346.0 | | 97.9 | |
Total | 10332.2 | | | |
D. Which of the two suggested models should the math department use?
E. In an attempt to further improve the model, the professors decide to look at the t-tests for each of the coefficients in the model that you chose in part D. Using the Python output for both models shown below, conduct each t-test for slope.
F. Based on the conclusions from part E., which explanatory variable, if any, would you try removing first?
HW10
Jessica Xiong (pqf6rd)
On my honor, I did not give nor receive aid on this assignment beyond the listed collaboration.
Problems:
1. Pollution of water resources is a serious problem that can require substantial efforts and funds to rectify. In order to determine the financial resources required, an accurate assessment of the water quality, which is measured by the index of biotic integrity (IBI), is needed. Since IBI is very expensive to measure, a study was done for a collection of streams in the Ozark Highland ecoregion of Arkansas in which the IBI for each stream was measured along with land use measures that are inexpensive to obtain. The land use measures collected in the study are the area of the watershed in square kilometers, Area, and the percent of the watershed area that is forest, Forest. The objective of this study is to determine if these land use measures can predict the IBI so that the funding required for pollution clean up can be accurately estimated. The data collected from the n = 49 watersheds are provided in the file ozark.csv.
A. Create scatterplots to show the relationship between IBI and each of the explanatory variables. Using these plots, does each explanatory variable seem to be useful in predicting the IBI? Explain.
Each explanatory variable seems to be useful in predicting the IBI. Each scatterplot shows a weak but positive linear relationship between the explanatory variable (Area or Forest) and IBI: higher values of Area or Forest tend to go with higher values of IBI.
B. Write out the appropriate regression model for this analysis. Note: This equation should contain no numbers.
What is the least-squares regression equation for this model?
x_i1: the area of the watershed in square kilometers, Area
x_i2: the percent of the watershed area that is forest, Forest
The appropriate regression model for this analysis is y_i = β0 + β1 x_i1 + β2 x_i2 + ε_i
The least-squares regression equation for this model is ŷ = 40.629 + 0.569 x1 + 0.234 x2
C. Interpret each of the estimated regression coefficients in the equation from part B. in context, including the intercept.
b0 = 40.629 (the estimate of β0) is the predicted response when all the explanatory variables are zero. In context, when the area of the watershed and the percent of the watershed area that is forest are both zero, the predicted IBI is 40.629.
b1 = 0.569 (the estimate of β1) is the predicted increase in IBI for each additional square kilometer of watershed area when the other explanatory variable, the percent of the watershed area that is forest, is held constant.
b2 = 0.234 (the estimate of β2) is the predicted increase in IBI for each additional percentage point of the watershed area that is forest when the other explanatory variable, the area of the watershed, is held constant.
D. Create the three relevant residual plots for this linear regression model. Do the assumptions for the regression model hold? Explain.
From the graphs, the assumptions for the regression model appear to hold. In each residual plot the residuals are scattered randomly around 0 with roughly uniform variation and no linear or curved trend, so the linearity, independence, and constant variance assumptions are satisfied. The normality assumption is also treated as satisfied, since the MLR model is robust against moderate deviations from the normal distribution.
E. Conduct the test to determine if this model is useful. Be sure to show all steps.
H0 : This model is not useful (β1 = β2 = 0)
Ha : This model is useful (at least one of β1 or β2 ≠ 0)
From the regression output table, the F statistic is 12.78 and the p-value is 3.86 × 10^-5. Since the p-value is less than the assumed significance level of 0.05, we reject the null hypothesis. There is sufficient evidence to conclude that this model is useful, that is, at least one of β1 or β2 is not 0.
F. What proportion of the variation in IBI is explained by this model?
About 35.7% of the variation in IBI can be explained by this model because R-squared is 0.357.
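As a consistency check on parts E and F, the overall F statistic can be recovered from R-squared, since F = (R²/k) / ((1 − R²)/(n − k − 1)) with k = 2 explanatory variables and n = 49 streams. The sketch below uses only the rounded values reported above, so small rounding differences are expected. The function names are illustrative, and the closed-form tail probability is valid only because the numerator degrees of freedom equal 2; scipy.stats.f.sf would be the general-purpose alternative.

```python
# Consistency check: recover the overall F statistic from R-squared,
# then compute the p-value for the reported F statistic.
# Inputs are the rounded values reported in parts E and F.

def f_from_r_squared(r2, n, k):
    """Overall F statistic for a model with k explanatory variables and n cases."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

def f_tail_prob_df1_2(f_stat, df2):
    """P(F > f_stat) for an F(2, df2) distribution.

    This closed form holds only when the numerator df is exactly 2.
    """
    return (1 + 2 * f_stat / df2) ** (-df2 / 2)

n, k = 49, 2                         # 49 watersheds, 2 explanatory variables
f_stat = f_from_r_squared(0.357, n, k)
p_value = f_tail_prob_df1_2(12.78, n - k - 1)

print(f"F recovered from R-squared: {f_stat:.2f}")   # close to the reported 12.78
print(f"p-value for F = 12.78 on (2, 46) df: {p_value:.2e}")
```

Both results agree with the regression output up to rounding, which suggests the reported F statistic, p-value, and R-squared all come from the same fitted model.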
G. Conduct the hypothesis tests for each slope. Be sure to show all steps. Do the conclusions for these tests match your observations from part A?
Area:
H0 : β1 = 0
Ha : β1 ≠ 0
From the table, the test statistic is t = 4.511 and the p-value is reported as 0.000, which is less than the assumed significance level of 0.05, so we reject the null hypothesis. There is sufficient evidence to conclude that β1 ≠ 0: there is a relationship between the area of the watershed and the IBI when Forest is also in the model.
Forest:
H0 : β2 = 0
Ha : β2 ≠ 0
The test statistic is t = 3.365 and the p-value is 0.002, which is less than the assumed significance level of 0.05, so we reject the null hypothesis. There is sufficient evidence to conclude that β2 ≠ 0: there is a relationship between the percent of the watershed area that is forest and the IBI when Area is also in the model.
In part A, I observed a positive linear relationship between each explanatory variable (Area and Forest) and IBI, so the conclusions of both tests match my observations from part A.
H. Researchers would like to estimate the IBI for a stream that was not a part of this study. What is the predicted IBI when the area of the watershed is 55 square kilometers and is 84% forest?
The predicted IBI when the area of the watershed is 55 square kilometers and is 84% forest is 91.5744.
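The prediction can be checked by hand by plugging Area = 55 and Forest = 84 into the least-squares equation from part B. The sketch below uses the rounded coefficients, so it differs slightly from the 91.5744 that comes from the unrounded fit; the function name is illustrative.

```python
# Predicted IBI from the rounded least-squares equation in part B.
B0, B_AREA, B_FOREST = 40.629, 0.569, 0.234   # rounded coefficients from part B

def predict_ibi(area_km2, forest_pct):
    """Predicted IBI for a watershed with the given area (km^2) and percent forest."""
    return B0 + B_AREA * area_km2 + B_FOREST * forest_pct

y_hat = predict_ibi(55, 84)
print(f"Predicted IBI: {y_hat:.2f}")   # about 91.58 with rounded coefficients
```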
I. Determine the 95% prediction interval for the IBI for the stream described in part H. and the 95% confidence interval for the average IBI for streams with the characteristics given in part H.
The 95% prediction interval for the IBI for the stream described in part H is (59.4265, 123.7221).
The 95% confidence interval for the average IBI for streams with the characteristics given in part H is (80.3822, 102.7665).
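A quick sanity check on these two intervals: both should be centered at the predicted IBI of 91.5744, and the prediction interval should be wider because it also accounts for the variation of an individual stream around the mean. The sketch below only does arithmetic on the reported endpoints; the helper names are illustrative.

```python
# Sanity check on the reported intervals for Area = 55, Forest = 84.
predicted = 91.5744
prediction_interval = (59.4265, 123.7221)   # for a single new stream
confidence_interval = (80.3822, 102.7665)   # for the mean IBI of such streams

def midpoint(interval):
    """Center of an interval given as (low, high)."""
    low, high = interval
    return (low + high) / 2

def half_width(interval):
    """Margin of error of an interval given as (low, high)."""
    low, high = interval
    return (high - low) / 2

for name, interval in [("prediction", prediction_interval),
                       ("confidence", confidence_interval)]:
    print(f"{name}: center = {midpoint(interval):.4f}, "
          f"margin = {half_width(interval):.4f}")
```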
J. Ecologists have long lobbied for protection for watershed forests because they believe that it increases the water quality. Before their next discussion with policy makers, they would like to have an estimate for the change in water quality when the percentage of watershed forest increases by 1% (assuming area of the watershed remains the same) and they ask you for an estimate with 95% confidence. What values should you give them? Explain.
The 95% confidence interval I would give is (0.094, 0.373). The question asks for the change in water quality (IBI) when the percentage of watershed forest increases by 1% while the area of the watershed stays the same, which is the slope for Forest, β2. This corresponds to the last row of the regression table: the point estimate is 0.234, and the columns labeled 0.025 and 0.975 give the endpoints 0.094 and 0.373 of the 95% confidence interval.
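The same interval can be approximately rebuilt from the values in part G, since the interval is estimate ± t* × SE and the standard error equals the estimate divided by its t statistic. In the sketch below, the critical value t* = 2.0129 for 46 degrees of freedom is an assumption taken from a t table (scipy.stats.t.ppf(0.975, 46) would compute it), and the inputs are rounded, so the endpoints match only to about three decimals.

```python
# Rebuild the 95% confidence interval for the Forest slope from part G values.
b_forest = 0.234     # estimated slope for Forest (rounded)
t_stat = 3.365       # t statistic for Forest from part G
t_crit = 2.0129      # assumed: t* for 95% confidence, df = 49 - 2 - 1 = 46

se_forest = b_forest / t_stat          # standard error implied by t = b / SE
margin = t_crit * se_forest
lower, upper = b_forest - margin, b_forest + margin

print(f"SE = {se_forest:.4f}, margin = {margin:.4f}")
print(f"95% CI for the Forest slope: ({lower:.3f}, {upper:.3f})")
```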
Problem 2
A.
H0 : This model is not useful (β1 = β2 = 0)
Ha : This model is useful (at least one of β1 or β2 ≠ 0)
F = Regression MS / Residual MS = 1420.2 / 97.3 = 14.5961
From the calculation, the test statistic is F = 14.5961 and the p-value is approximately 0.
Because the p-value is less than the assumed significance level of 0.05, we reject the null hypothesis. There is sufficient evidence to conclude that this model is useful, that is, at least one of β1 or β2 is not 0.
B.
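The minimum number of values that need to be known to fill in the ANOVA table completely is 4: two values in the SS column and two values in the df column. The third value in each column follows from Total = Regression + Residuals, each MS is SS divided by df, F is Regression MS divided by Residual MS, and the p-value is determined by F and its degrees of freedom.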
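To illustrate, the table from part A can be rebuilt from four values: the Regression and Total sums of squares, plus the Regression and Total degrees of freedom (2 explanatory variables and n − 1 = 79, both following from the problem statement). This is a minimal sketch; the function name and dictionary keys are illustrative, and the p-value formula used here is a closed form that applies only when the regression df is 2.

```python
# Fill in an ANOVA table from two SS values and two df values.

def complete_anova(ss_reg, ss_total, df_reg, df_total):
    """Return every ANOVA table entry from four known values."""
    if df_reg != 2:
        raise ValueError("the closed-form p-value below requires df_reg == 2")
    ss_res = ss_total - ss_reg            # Total = Regression + Residuals
    df_res = df_total - df_reg
    ms_reg = ss_reg / df_reg              # MS = SS / df
    ms_res = ss_res / df_res
    f_stat = ms_reg / ms_res
    # P(F > f_stat) for F(2, df_res); valid only for numerator df = 2.
    p_value = (1 + 2 * f_stat / df_res) ** (-df_res / 2)
    return {
        "ss": (ss_reg, ss_res, ss_total),
        "df": (df_reg, df_res, df_total),
        "ms": (ms_reg, ms_res),
        "f": f_stat,
        "p": p_value,
    }

table = complete_anova(ss_reg=2840.4, ss_total=10332.2, df_reg=2, df_total=79)
print(f"Residual SS = {table['ss'][1]:.1f}, residual df = {table['df'][1]}")
print(f"MS = {table['ms'][0]:.1f}, {table['ms'][1]:.1f}")
print(f"F = {table['f']:.2f}, p-value = {table['p']:.1e}")
```

The F statistic differs slightly from the 14.5961 in part A because part A divides by the rounded Residual MS of 97.3.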
C.
H0 : The two ACT scores do not improve the two-variable model (the coefficients of both ACT scores are 0).
Ha : At least one of the two ACT scores improves the two-variable model (at least one of their coefficients is not 0).
From the calculation, the test statistic is F = 0.7443 and the p-value is 0.4786, which is larger than the assumed significance level of 0.05, so we fail to reject the null hypothesis. There is not sufficient evidence to conclude that adding the two ACT scores improves the two-variable model.
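The calculation behind this test is the partial (nested-model) F statistic: the drop in residual SS from adding the two ACT scores, divided by the number of added variables, over the Residual MS of the four-variable model. The sketch below takes the residual SS values from the two ANOVA tables; the residual df of 77 and 75 are derived from n = 80 with 2 and 4 explanatory variables, and the closed-form p-value again relies on the numerator df being 2.

```python
# Partial F test: does adding the two ACT scores improve the two-variable model?
sse_reduced, df_reduced = 7491.8, 77   # two-variable model: 80 - 2 - 1
sse_full, df_full = 7346.0, 75         # four-variable model: 80 - 4 - 1

q = df_reduced - df_full               # number of added variables (2)
f_stat = ((sse_reduced - sse_full) / q) / (sse_full / df_full)

# P(F > f_stat) for F(2, df_full); this closed form needs numerator df = 2.
p_value = (1 + 2 * f_stat / df_full) ** (-df_full / 2)

print(f"F = {f_stat:.4f} on ({q}, {df_full}) df")
print(f"p-value = {p_value:.4f}")
```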
D.
E.
HSRank:
H0 : This variable is not significant (β1 = 0)
Ha : This variable is significant (β1 ≠ 0)
The test statistic is t = 1.896 and the p-value is 0.062, which is larger than the assumed significance level of 0.05, so we fail to reject the null hypothesis. There is not sufficient evidence to conclude that β1 ≠ 0; high school percentile rank is not a significant predictor in this model.
Algebra:
H0 : This variable is not significant (β2 = 0)
Ha : This variable is significant (β2 ≠ 0)
The test statistic is t = 4.248 and the p-value is reported as 0.000, which is smaller than the assumed significance level of 0.05, so we reject the null hypothesis. There is sufficient evidence to conclude that β2 ≠ 0; the score on the algebra placement test is a significant predictor in this model.
F.
Based on the conclusions from part E, I would try removing the high school percentile rank first. It has the smallest test statistic and the largest p-value, so it is the least significant variable in this model. It is also the only variable that part E found not to be significant.
2023-08-16