
Assignment 1

2022

Your assignment should be submitted as an R Markdown document with all analysis and graphics included as R chunks, together with the compiled PDF file. The R Markdown document should compile without error. You can assume that the data files are in the same folder as your Rmd file. This assignment consists of two parts and has 40 marks in total.

Part 1 (20 marks)

You are making a big investment in education, hoping for a good return on investment. The goal of the first part of this assignment is to model the returns to education for employees of a bank. The data set Wage.csv includes the following variables:

• Salary: current yearly salary in dollars

• Salbegin: yearly salary in dollars at first position at the bank

• Educ: number of finished years of education

• Gender: 0 for females and 1 for males

• Minority: 0 for non-minorities and 1 for minorities

• Jobcat: 1 for administrative jobs, 2 for custodial jobs, and 3 for management jobs

Please complete the following steps and answer the questions:

1. Do some data preparation: Read in the data set, convert categorical variables to factors, define the labels of the categories, and take the natural logarithm of both salary variables. (4 marks)

2. Fit a regression model for ‘log(Salary)’ with the main effects of the predictors ‘log(Salbegin)’, ‘Educ’, ‘Gender’, ‘Minority’, and ‘Jobcat’. Provide an interpretation of the coefficients for ‘Educ’ and ‘Gender’. (4 marks)

3. The relation between education and wage may be non-linear. Fit a regression model for ‘log(Salary)’ with a polynomial for education and the main effects of the remaining predictors as linear terms. Select the degree of the polynomial using the AIC, considering degrees 1, 2, 3, and 4. (4 marks)

4. Produce one diagnostic plot for heteroskedasticity and one for the error distribution in the model with the selected polynomial degree. Comment on potential outliers. (4 marks)

5. Use visreg to visualize the relation between education and log salary in the model with the selected polynomial degree. Describe what you learn from this. (4 marks)
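The data preparation and AIC comparison in steps 1 and 3 could be organized roughly as in the sketch below. It is only an illustration: it runs on simulated data standing in for Wage.csv (the real file would be read with read.csv), the factor labels are taken from the variable list above, and the names logSalary, logSalbegin, fits, and best_degree are hypothetical choices, not part of the assignment.

```r
# Simulated stand-in for Wage.csv (assumption: same column names as listed above).
# With the real data one would use: wage <- read.csv("Wage.csv")
set.seed(1)
n <- 200
wage <- data.frame(
  Salbegin = round(exp(rnorm(n, mean = 9.7, sd = 0.3))),
  Educ     = sample(8:21, n, replace = TRUE),
  Gender   = rbinom(n, 1, 0.5),
  Minority = rbinom(n, 1, 0.2),
  Jobcat   = sample(1:3, n, replace = TRUE)
)
wage$Salary <- round(wage$Salbegin *
                     exp(0.4 + 0.02 * wage$Educ + rnorm(n, sd = 0.15)))

# Step 1: convert the coded variables to labelled factors and take logs
wage$Gender   <- factor(wage$Gender, levels = c(0, 1),
                        labels = c("Female", "Male"))
wage$Minority <- factor(wage$Minority, levels = c(0, 1),
                        labels = c("Non-minority", "Minority"))
wage$Jobcat   <- factor(wage$Jobcat, levels = 1:3,
                        labels = c("Administrative", "Custodial", "Management"))
wage$logSalary   <- log(wage$Salary)
wage$logSalbegin <- log(wage$Salbegin)

# Step 3: fit one model per polynomial degree and compare their AIC values
degrees <- 1:4
fits <- lapply(degrees, function(d) {
  lm(logSalary ~ logSalbegin + poly(Educ, d) + Gender + Minority + Jobcat,
     data = wage)
})
aic <- sapply(fits, AIC)
best_degree <- degrees[which.min(aic)]  # degree with the smallest AIC

print(data.frame(degree = degrees, AIC = aic))
```

Because the data here are simulated, the selected degree says nothing about the real data set; only the workflow carries over.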

Part 2 (20 marks)

The second part of this assignment analyzes a production robot in a car factory. The robot produces six cars in an hour, and we are interested in the number of cars with defective components in one hour. This number might be explained by how polluted the robot is, measured by a pollution score, or the type of car that is produced in that hour. The data set Robot.csv includes the following variables:

• Defects: the number of cars with defective components in an hour

• Pollution: the pollution score of the production robot in that hour

• Type: the type of car produced in that hour (Sedan, SUV, or Coupe)

Please complete the following steps and answer the questions:

1. Use ggplot() to produce appropriate plots of ‘Defects’ against ‘Type’ and ‘Pollution’. What do you learn about the relation between ‘Defects’ and each of these predictors? (4 marks)

2. Fit a regression model for ‘Defects’ with the main effects of the predictors ‘Type’ and ‘Pollution’. Provide an interpretation of the coefficients in terms of the effect of each variable on the odds of a car being defective. (4 marks)

3. Estimate the dispersion parameter using the Pearson residuals and provide the scaled standard errors of the coefficients. (4 marks)

4. Include an interaction between ‘Type’ and ‘Pollution’, together with their main effects, in a regression that takes overdispersion into account. Test if the interaction is significant. (4 marks)

5. Use visreg to visualize the interaction term in the model, and describe what you learn from this. (4 marks)
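Steps 2 to 4 could be approached along the lines of the sketch below. It rests on assumptions that the assignment does not state outright: that ‘Defects’ is modelled as a binomial count out of the six cars produced per hour (suggested by the request for an interpretation in terms of odds), and that a quasibinomial fit is an acceptable way to take overdispersion into account. The data are simulated in place of Robot.csv, and all object names (robot, fit, phi, se_scaled, fit_q, fit_qi) are hypothetical.

```r
# Simulated stand-in for Robot.csv (assumption: same column names as listed above).
# With the real data one would use: robot <- read.csv("Robot.csv")
set.seed(2)
n <- 150
robot <- data.frame(
  Pollution = runif(n, 0, 10),
  Type      = factor(sample(c("Sedan", "SUV", "Coupe"), n, replace = TRUE))
)
# Extra hour-to-hour noise on the logit scale makes the counts overdispersed
eta <- -2 + 0.25 * robot$Pollution + rnorm(n, sd = 0.7)
robot$Defects <- rbinom(n, size = 6, prob = plogis(eta))

# Step 2: binomial logistic regression, 6 cars per hour (assumed model)
fit <- glm(cbind(Defects, 6 - Defects) ~ Type + Pollution,
           family = binomial, data = robot)
odds_ratios <- exp(coef(fit))  # multiplicative effects on the odds of a defect

# Step 3: Pearson estimate of the dispersion and scaled standard errors
phi <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
se_scaled <- sqrt(phi) * sqrt(diag(vcov(fit)))

# Cross-check: a quasibinomial fit reports the same dispersion and errors
fit_q <- glm(cbind(Defects, 6 - Defects) ~ Type + Pollution,
             family = quasibinomial, data = robot)

# Step 4: add the interaction and test it with an F test,
# which accounts for the estimated dispersion
fit_qi <- update(fit_q, . ~ . + Type:Pollution)
interaction_test <- anova(fit_q, fit_qi, test = "F")

print(phi)
print(cbind(estimate = coef(fit), se_scaled = se_scaled))
print(interaction_test)
```

A dispersion estimate well above 1 indicates overdispersion relative to the binomial model; in that case the unscaled standard errors are too small by a factor of the square root of the estimate.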