
STAT2008/STAT2014/STAT4038/STAT6014/STAT6038 REGRESSION MODELLING

Published: 2021-09-02



RESEARCH SCHOOL OF FINANCE, ACTUARIAL STUDIES AND STATISTICS

REGRESSION MODELLING

(STAT2008/STAT2014/STAT4038/STAT6014/STAT6038)

Assignment 1 for Semester 2, 2021


INSTRUCTIONS:

● This assignment is a total of 100 marks worth 15% of your overall grade for this course.

● Please submit your assignment in the Assignment section on Wattle using the Turnitin submission link. When uploading to Wattle you must submit the following, combined into a single PDF document:

1. Your assignment report in PDF format.

2. All the R code you used for the assignment, added as an appendix at the end of the report. Failure to upload the R code will result in a penalty.

● Assignment solutions should be typed. Your assignment may include some carefully edited R output (e.g. graphs, tables) showing the results of your data analysis and a discussion of these results, as well as some carefully selected code. Please be selective about what you present and only include as much R output as necessary to justify your solution. It is important to be concise in your discussion of the results. Clearly label each part of your report with the part of the question that it refers to.

● Unless otherwise advised, use a significance level of 5%.

● Marks may be deducted if these instructions are not strictly adhered to, and marks will certainly be deducted if the total report is of an unreasonable length, i.e. more than 10 pages including graphs and tables. You must include an appendix, in addition to the above page limit, that contains all the R code. The appendix will not be marked, but marks will be deducted if the R code is not provided. The R code is required in case the markers have any questions about the work you have submitted.

● You may ask me (Abhinav Mehta) questions about this assignment up to 24 hours before the submission time. This will allow me enough time to respond to your questions. The tutors will not answer any questions about the assignment other than troubleshooting R code.

● Late submissions will attract a penalty of 5% of your mark for each day of delay. No assignments will be accepted 10 days beyond the due date.

● Extensions will usually be granted on medical or compassionate grounds on production of appropriate evidence, but you must have my permission no later than 24 hours before the submission date. If you are granted an extension and submit your assignment after the extended deadline, the late submission penalty will still apply.


Question 1    [40 Marks]

Data on eruptions of the Old Faithful Geyser were collected in October 1980 and stored in a .csv file ‘oldfaithful’. The variables are the duration in seconds of the current eruption (Duration) and the interval time in minutes to the next eruption (Interval). Data were not collected between approximately midnight and 6 AM. It is suspected that Duration is associated with Interval.

(a) [5 marks] Conduct an exploratory data analysis to assess whether the two variables are associated. Is there a statistically significant correlation between the variables? Use the cor.test() function to conduct a suitable hypothesis test. Clearly specify the hypotheses you are testing and present and interpret the results.
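As a rough illustration of the mechanics of cor.test(), the sketch below runs a two-sided Pearson correlation test on simulated stand-in data. The data frame name geyser and all the simulated values are assumptions made for this example only; they are not the assignment data, and the choice of test and hypotheses remains part of the question.

```r
# Simulated stand-in data (hypothetical values, NOT the 'oldfaithful' data)
set.seed(1)
geyser <- data.frame(Duration = runif(100, min = 100, max = 300))   # seconds
geyser$Interval <- 35 + 0.17 * geyser$Duration + rnorm(100, sd = 6) # minutes

# A scatterplot is a natural first step in the exploratory analysis
plot(Interval ~ Duration, data = geyser,
     xlab = "Duration (seconds)", ylab = "Interval (minutes)")

# Two-sided test of H0: rho = 0 against H1: rho != 0 (Pearson correlation)
ct <- cor.test(geyser$Duration, geyser$Interval,
               alternative = "two.sided", method = "pearson")

ct$estimate   # sample correlation coefficient
ct$statistic  # t statistic on n - 2 degrees of freedom
ct$p.value    # compare with the 5% significance level
ct$conf.int   # 95% confidence interval for rho
```

The components of the returned object (estimate, statistic, p.value, conf.int) are usually easier to report selectively than the full printed output.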

(b) [20 marks] Fit a simple linear regression (SLR) model with Interval as the response variable and Duration as the predictor. Construct a plot of the residuals against the fitted values, a normal Q-Q plot of the residuals, a bar plot of the leverages for each observation and a bar plot of Cook’s distances for each observation. Use these plots (and other means) to comment on the model assumptions and on any unusual data points.
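The four plots requested in part (b) can be produced with base R functions. The following is a minimal sketch on simulated stand-in data (the geyser data frame and its values are hypothetical); the dashed reference line at 2p/n on the leverage plot is one common rule of thumb, not a requirement of the question.

```r
# Simulated stand-in data (hypothetical values, NOT the 'oldfaithful' data)
set.seed(1)
geyser <- data.frame(Duration = runif(100, min = 100, max = 300))
geyser$Interval <- 35 + 0.17 * geyser$Duration + rnorm(100, sd = 6)

fit <- lm(Interval ~ Duration, data = geyser)

op <- par(mfrow = c(2, 2))

# 1. Residuals against fitted values
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# 2. Normal Q-Q plot of the residuals
qqnorm(residuals(fit))
qqline(residuals(fit))

# 3. Bar plot of leverages (diagonal of the hat matrix)
lev <- hatvalues(fit)
barplot(lev, main = "Leverage", xlab = "Observation", ylab = "Leverage")
abline(h = 2 * length(coef(fit)) / nrow(geyser), lty = 2)  # 2p/n rule of thumb

# 4. Bar plot of Cook's distances
cd <- cooks.distance(fit)
barplot(cd, main = "Cook's distance", xlab = "Observation", ylab = "Cook's D")

par(op)  # restore the previous plotting layout
```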

(c) [10 marks] What are the estimated coefficients of the SLR model in part (b) and the standard errors associated with these coefficients? Interpret the values of these estimated coefficients and perform t-tests to test whether or not these coefficients differ significantly from zero. What do you conclude as a result of these t-tests?
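The coefficient table that summary() produces for an lm object contains the estimates, standard errors, t statistics and p-values asked about here. The sketch below, again on simulated stand-in data with hypothetical values, extracts that table and recomputes the t statistics by hand to show where they come from.

```r
# Simulated stand-in data (hypothetical values, NOT the 'oldfaithful' data)
set.seed(1)
geyser <- data.frame(Duration = runif(100, min = 100, max = 300))
geyser$Interval <- 35 + 0.17 * geyser$Duration + rnorm(100, sd = 6)

fit <- lm(Interval ~ Duration, data = geyser)

# Matrix with columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients
coefs

# Each t statistic is estimate / standard error, tested against H0: beta_j = 0
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Two-sided p-values on n - 2 residual degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)

cbind(t_manual, p_manual)
```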

(d) [5 marks] If an eruption lasted for 100 seconds, what will be the interval of time before the next eruption, as predicted by your model? Construct an appropriate interval estimate for the length of this interval.
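For reference, predict() on an lm object accepts an interval argument that distinguishes between a confidence interval for the mean response and a prediction interval for a single new observation. The sketch below shows both on simulated stand-in data (hypothetical values); deciding which one is appropriate is part of the question.

```r
# Simulated stand-in data (hypothetical values, NOT the 'oldfaithful' data)
set.seed(1)
geyser <- data.frame(Duration = runif(100, min = 100, max = 300))
geyser$Interval <- 35 + 0.17 * geyser$Duration + rnorm(100, sd = 6)

fit <- lm(Interval ~ Duration, data = geyser)

# New observation: an eruption lasting 100 seconds
new_obs <- data.frame(Duration = 100)

# Interval for a single future response at Duration = 100
pred_int <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)

# Interval for the mean response at Duration = 100
conf_int <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)

pred_int
conf_int
```

Both calls return the same point estimate in the fit column; the prediction interval is wider because it also accounts for the variability of an individual observation around the regression line.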


Question 2    [60 Marks]

The international bank UBS regularly produces a report (UBS, 2009) on prices and earnings in major cities throughout the world. Three of the measures they include are the prices of 1kg of rice, a 1kg loaf of bread and a Big Mac. An interesting feature is that the prices they report are measured in the minutes of labor required for a ‘typical’ worker in that location to earn enough money to purchase the commodity. Using minutes of labor corrects, at least in part, for currency fluctuations, prevailing wage rates, and local prices. The data file includes measurements for rice, bread and Big Mac from the 2003 and 2009 reports. The year 2003 was before the GFC and the year 2009 was after it, so the data would reflect changes in prices due to this recession.

You may access the dataset from the ‘alr4’ package (note that R package names are case-sensitive) and load the data from this package using the data(UBSprices) command.

(a) [5 marks] Produce a scatterplot of the 2009 Big Mac prices against the 2003 Big Mac prices, with the year 2003 as the predictor variable. Add a line for y = x to this plot. How would you interpret points that are on this line, above the line and below the line?
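A scatterplot with a y = x reference line can be drawn with plot() followed by abline(a = 0, b = 1). The sketch below uses simulated stand-in prices; the data frame prices, its city labels and its values are hypothetical, and the column names bigmac2003 and bigmac2009 are an assumption about how UBSprices is laid out (check with names(UBSprices)).

```r
# Simulated stand-in data (hypothetical cities and values, NOT UBSprices)
set.seed(2)
prices <- data.frame(bigmac2003 = round(runif(20, min = 10, max = 150)))
prices$bigmac2009 <- round(prices$bigmac2003 * exp(rnorm(20, mean = 0, sd = 0.3)))
row.names(prices) <- paste0("City", 1:20)

# Response (2009) on the vertical axis, predictor (2003) on the horizontal axis
plot(bigmac2009 ~ bigmac2003, data = prices,
     xlab = "Big Mac price 2003 (minutes of labor)",
     ylab = "Big Mac price 2009 (minutes of labor)")

# Reference line y = x: intercept 0, slope 1
abline(a = 0, b = 1, lty = 2)

# Which side of the line each point falls on
above <- prices$bigmac2009 > prices$bigmac2003
table(above)
```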

(b) [5 marks] Identify the cities which you consider to have the highest increase or decrease in the Big Mac price. You will find these R functions very useful for this step: identify() and row.names().
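As a hedged sketch of how identify() and row.names() fit together: identify() lets you click on points in an open plot and labels them, while row.names() supplies the labels. Because identify() needs an interactive graphics device, the call is guarded with interactive() here, and a non-interactive alternative based on the raw change is shown as well. The prices data frame and its values are simulated and hypothetical, and measuring change as a simple difference is only one possible choice.

```r
# Simulated stand-in data (hypothetical cities and values, NOT UBSprices)
set.seed(2)
prices <- data.frame(bigmac2003 = round(runif(20, min = 10, max = 150)))
prices$bigmac2009 <- round(prices$bigmac2003 * exp(rnorm(20, mean = 0, sd = 0.3)))
row.names(prices) <- paste0("City", 1:20)

# Change in price between the two reports, labelled by city
change <- prices$bigmac2009 - prices$bigmac2003
names(change) <- row.names(prices)

largest_rise <- row.names(prices)[which.max(change)]
largest_fall <- row.names(prices)[which.min(change)]
largest_rise
largest_fall

# Interactive labelling: click on points, then finish (e.g. Esc or right-click)
if (interactive()) {
  plot(bigmac2009 ~ bigmac2003, data = prices)
  abline(a = 0, b = 1, lty = 2)
  identify(prices$bigmac2003, prices$bigmac2009, labels = row.names(prices))
}
```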

(c) [10 marks] Is a linear model the right one to fit the Big Mac price data? If not, what variable transformations would you consider appropriate? Give justification for your choice of transformation.

(d) [20 marks] Fit a linear model with your chosen transformation using the OLS estimation method. Write out the mathematical expression for the functional form of your model using both the transformed variables and the untransformed original variables. Interpret the effect of the coefficients on the untransformed response variable in the regression model.
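To illustrate how a model fitted on transformed variables maps back to the original scale, the sketch below assumes, purely for the sake of example, a log transformation of both variables. That choice is an assumption of this sketch, not the prescribed answer, and the prices data are simulated with hypothetical values.

```r
# Simulated stand-in data (hypothetical values, NOT UBSprices)
set.seed(2)
prices <- data.frame(bigmac2003 = runif(30, min = 10, max = 150))
prices$bigmac2009 <- exp(0.5 + 0.85 * log(prices$bigmac2003) + rnorm(30, sd = 0.2))

# OLS fit on the (assumed) log-log scale
fit_log <- lm(log(bigmac2009) ~ log(bigmac2003), data = prices)
b <- coef(fit_log)

# Transformed scale:  log(y) = b0 + b1 * log(x)
# Original scale:     y = exp(b0) * x^b1   (fitted curve, error term omitted)
x_new <- 50
on_log_scale <- predict(fit_log, newdata = data.frame(bigmac2003 = x_new))
back_transformed <- exp(b[1]) * x_new^b[2]

# The two routes to a fitted value on the original scale agree
all.equal(unname(exp(on_log_scale)), unname(back_transformed))

# Under this model, multiplying x by a factor k multiplies the fitted y by k^b1
k <- 1.10
unname(k^b[2])
```

With a different transformation the back-transformed form, and therefore the interpretation of the slope, would change accordingly.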

(e) [10 marks] Produce the ANOVA (Analysis of Variance) table for the SLR model and interpret the results of the F-test. What is the coefficient of determination for this model and how should you interpret this summary measure?
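The ANOVA table for an lm fit is returned by anova(), and the coefficient of determination can be recovered from its sums of squares. The sketch below uses simulated stand-in data and, as an illustrative assumption only, a log-log model; the prices data frame and its values are hypothetical.

```r
# Simulated stand-in data (hypothetical values, NOT UBSprices)
set.seed(2)
prices <- data.frame(bigmac2003 = runif(30, min = 10, max = 150))
prices$bigmac2009 <- exp(0.5 + 0.85 * log(prices$bigmac2003) + rnorm(30, sd = 0.2))

# Illustrative model; the transformation is an assumption of this sketch
fit <- lm(log(bigmac2009) ~ log(bigmac2003), data = prices)

aov_tab <- anova(fit)
aov_tab

# Row 1 is the regression term, row 2 is the residual term
ssr <- aov_tab[1, "Sum Sq"]   # regression sum of squares
sse <- aov_tab[2, "Sum Sq"]   # residual (error) sum of squares

# Coefficient of determination: share of total variation explained by the model
r2 <- ssr / (ssr + sse)
all.equal(r2, summary(fit)$r.squared)

# In simple linear regression the F statistic equals the squared slope t statistic
f_stat <- aov_tab[1, "F value"]
t_stat <- summary(fit)$coefficients[2, "t value"]
all.equal(f_stat, t_stat^2)
```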

(f) [10 marks] Assess the model fit against the assumptions of a Normal Error Regression Model. Are there any influential points in your regression model? If so, identify these cities and the type of influence they have on the model fit.