Assignment 2 for Semester 2, 2021


This assignment is a total of 100 marks worth 15% of your overall grade for this course.

Please submit your assignment in the Assignment section on Wattle using the Turnitin submission link. When uploading to Wattle you must submit the following, combined into a single PDF document:

1. Your assignment report.

2. All the R code you have used for the assignment, added as an appendix at the end of the report. Failure to upload the R code will result in a penalty.

Assignment solutions should be typed. Your assignment may include some carefully edited R output (e.g. graphs, tables) showing the results of your data analysis and a discussion of these results, as well as some carefully selected code. Please be selective about what you present and only include as much R output as necessary to justify your solution. It is important to be concise in your discussion of the results. Clearly label each part of your report with the part of the question that it refers to.

Unless otherwise advised, use a significance level of 5%.

Marks may be deducted if these instructions are not strictly adhered to, and marks will certainly be deducted if the total report is of an unreasonable length, i.e. more than 15 pages including graphs and tables. You must include an appendix containing all the R code; the appendix does not count towards the page limit. The appendix will not be marked, but marks will be deducted if the R code is not provided. The R code is required in case the markers have any questions about the work you have submitted.

You may ask me (Abhinav Mehta) questions about this assignment up to 24 hours before the submission time. This will allow me enough time to respond to your questions. The tutors will not entertain any questions about the assignment other than troubleshooting R code.

Late submissions will attract a penalty of 5% of your mark for each day of delay. No assignments will be accepted 10 days beyond the due date.

Extensions will usually be granted on medical or compassionate grounds on production of appropriate evidence, but you must obtain my permission no later than 24 hours before the submission date. If you are granted an extension and submit your assignment after the extended deadline, the late submission penalty will still apply.

Question 1 [50 Marks]

A group of researchers in the US attempted to look at the pollution-related factors affecting mortality. Sixty US cities were sampled. Total age-adjusted mortality from all causes (mortality), in deaths per 100,000 population, was measured, along with the following covariates: mean annual precipitation in inches (precipitation); median number of school years completed for persons aged 25 years or older (education); percentage of population that is non-white (nonwhite); relative pollution potential of oxides of nitrogen (nox); and relative pollution potential of sulphur dioxide (so2). “Relative pollution potential” is the product of tons emitted per day per square kilometre and a factor correcting for the city dimension and exposure. The data is available in a .csv file, pollution.

(a) [8 marks] Fit a multiple linear regression (MLR) model with mortality as the response variable and all other covariates as predictors. Is the regression model significant?

(b) [12 marks] What are the estimated coefficients of the MLR model in part (a) and the confidence intervals for each of these coefficients at a joint confidence level of 95%? Interpret the values of these estimated coefficients with regard to model specification.

(c) [6 marks] There is a t-test associated with each of these coefficients. Briefly explain what these tests can and cannot be used for. In your answer, be sure to mention the appropriate hypotheses that can be assessed using these t-tests.

(d) [8 marks] Construct an appropriate test of the hypothesis that education and nox are not significant contributors to the model. That is, test $H_0: \beta_{\text{education}} = \beta_{\text{nox}} = 0$.
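
For reference, one common way to test that a subset of coefficients is zero is the general linear (partial) F-test, which compares a full model with a reduced model that omits the coefficients in question. The statement below is a sketch under the usual normal-error MLR assumptions; the symbols $SSE_F$, $SSE_R$, $df_F$ and $df_R$ (error sums of squares and error degrees of freedom of the full and reduced models) are notation assumed here and may differ from the notation used in lectures.

```latex
F^{*} = \frac{(SSE_R - SSE_F)/(df_R - df_F)}{SSE_F/df_F}
\;\sim\; F(df_R - df_F,\; df_F) \quad \text{under } H_0
```

With 60 cities and five predictors plus an intercept, the full model has $df_F = 60 - 6 = 54$, and setting two coefficients to zero gives $df_R - df_F = 2$.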

(e) [10 marks] A researcher from this group suggested that they have been using a model with coefficients: $\beta_{\text{precipitation}} = 2$, $\beta_{\text{education}} = -10$, $\beta_{\text{nonwhite}} = 3$, $\beta_{\text{nox}} = 0$, and $\beta_{\text{so2}} = 1$. Can you test whether this existing model is consistent with the new model you have fit? Write down appropriate full and reduced models for carrying out such a test. Perform the test and comment on the results.

(f) [6 marks] One of the researchers is from the city of San Antonio, and has recorded a new set of measurements on each of the predictors. The precipitation is 33, education is 11.5, nonwhite is 17.2, and nox and so2 are each 1. What do you predict the mortality rate to be? Find a 99% interval for this prediction.
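
For reference, the standard interval for a single new observation at predictor values $\mathbf{x}_h$ is sketched below. It assumes the usual normal-error MLR model; $\mathbf{x}_h$ (the new predictor vector with a leading 1 for the intercept), $\mathbf{b}$ (the vector of estimated coefficients), $\mathbf{X}$ (the design matrix), $MSE$, $n$ and $p$ are notation assumed here rather than taken from the assignment.

```latex
\hat{y}_h \pm t_{1-\alpha/2,\; n-p}\,
\sqrt{MSE\left(1 + \mathbf{x}_h^{\top}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{x}_h\right)},
\qquad \hat{y}_h = \mathbf{x}_h^{\top}\mathbf{b}
```

For a 99% interval, $\alpha = 0.01$; with 60 cities and six estimated coefficients, $n - p = 54$.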

Question 2 [50 Marks]

The Minnesota Department of Revenue has collected data on nearly every agricultural land sale in the six major agricultural regions of Minnesota for the period 2002-2011. The data is provided to you by your boss and is available in the alr4 package in the dataset MinnLand. The variables in the dataset are described in the table below:

  Variable Name   Description
  acrePrice       Sale price in dollars per acre, adjusted to a common date within each year
  region          A factor with levels giving the geographic names of six economic regions of Minnesota
  year            Year of sale
  acres           Size of property, acres
  tillable        Percentage of farm that is rated arable
  improvements    Percentage of property value due to buildings and other improvements
  financing       Type of financing: either a title transfer or seller finance
  crp             CRP if any part of the acreage is enrolled in the U.S. Conservation Reserve Program (CRP), and none otherwise
  crpPct          Percentage of land in CRP
  productivity    A numeric score between 1 and 100 with larger values indicating more productive land, calculated by the University of Minnesota

He has asked you to investigate the effects of various predictors on the sale price of land. You have been asked to build a regression model which can estimate the sale price of any piece of agricultural land in Minnesota once the characteristics of the land for sale are provided.

(a) [8 marks] Based on the data description, which variables are qualitative variables? Read the whole data set into R. Are these qualitative variables shown as factor objects in R? If not, manually convert them to factor objects. For each qualitative variable, how many observations does each group have? For each of these qualitative variables, provide boxplots of the price of land sales, acrePrice, across the different groups of that variable. Compare the differences between groups and summarize your findings.

(b) [6 marks] Provide any comments on the shape of the distribution of the sale price variable. Does the response variable need to be transformed to meet the assumptions of a linear regression model? Suggest a transformation and test it out with supporting plots.

(c) [8 marks] You have been told that the price of land is only influenced by the year of sale, year, size of the land, acres, and whether there are any improvements made on the land, improvements. Fit a regression model with these predictors and with the sale price, under any transformation you chose in part (b), as the response variable. Show the summary table of the fitted results. Is the model significant? Interpret all the estimated coefficients except for the intercept.

(d) [10 marks] You believe that region should be included in the regression model, as well as the percentage of the land that is in CRP, crpPct. Fit a regression model by adding these variables and using the 'Central' region as the reference category for the region variable. Is this model significant? How can you assess whether adding these variables has improved the regression model? Produce the plots or summary statistics which can answer this question.

(e) [8 marks] Produce the Bonferroni 95% joint confidence intervals for the coefficients of only the quantitative variables in the model fitted in part (d).
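
For reference, the general form of Bonferroni joint intervals for $g$ coefficients is sketched below. The symbols $b_k$ (estimated coefficient), $s\{b_k\}$ (its estimated standard error), $n$ (number of observations) and $p$ (number of estimated coefficients, including the intercept) are notation assumed here and may differ from the lecture notation.

```latex
b_k \pm t_{1-\alpha/(2g),\; n-p}\; s\{b_k\}, \qquad k = 1, \dots, g
```

Each individual interval is built at level $1 - \alpha/g$, which guarantees a joint coverage of at least $1 - \alpha$ across all $g$ intervals; here $\alpha = 0.05$.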

(f) [10 marks] There is a piece of land in the South Central region with these characteristics:

Year of sale is 2010 and the size of property is 150 acres

5% of the property value is due to improvements and CRP percentage is 4%

The land sale is financed by the seller

What is the expected sale price for this piece of land? Construct a 95% confidence interval for the expected sale price. Also provide a 90% prediction interval for this same piece of land.
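
For reference, the standard confidence interval for the mean response at predictor values $\mathbf{x}_h$ is sketched below, assuming the usual normal-error regression model. The symbols $\mathbf{x}_h$ (predictor vector including the intercept term and any indicator variables), $\mathbf{X}$ (design matrix), $MSE$, $n$ and $p$ are notation assumed here rather than taken from the assignment.

```latex
\hat{y}_h \pm t_{1-\alpha/2,\; n-p}\,
\sqrt{MSE\;\mathbf{x}_h^{\top}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{x}_h}
```

A prediction interval for a single new sale adds $MSE$ to the variance under the square root, so at the same level it is always wider than the interval for the mean. If the response was transformed in part (b), both intervals are on the transformed scale. Applying the inverse transformation to the endpoints returns them to dollars per acre, but the back-transformed interval for the mean then describes the back-transformed mean (for a log transformation with normal errors, the median price) rather than the mean price itself.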