Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit


BU.510.650

Data Analytics

Assignment #2


Attention: Please prepare two files for each homework assignment: the .PDF file for your an-swers including figures to each question; the other .R file for your R script. File names should be “LastName_FirstName_number.docx” and “LastName_FirstName_number.R”. All assignments should submitted via Blackboard.


1. Grade point average of 12 graduating MBA students, GPA, and their GMAT scores taken before entering the MBA program are given below. Use the GMAT scores as a predictor of GPA, and conduct a regression of GPA on GMAT scores.

(a) Obtain and interpret the coefficient of determination R2.

(b) Calculate the fitted value for the second person

(c) Test whether GMAT is an important predictor variable (use significant level 0.05)


2. Suppose we have a data set with five predictors, X1 =GPA, X2 = IQ, X3 = Gender (1 for Female and 0 for Male), X4 = Interaction between GPA and IQ, and X5 = Interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get = 50, = 20, = 0.07,  = 35, = 0.01, = −10.

(a) Which answer is correct, and why?

i. For a fixed value of IQ and GPA, males earn more on average than females.

ii. For a fixed value of IQ and GPA, females earn more on average than males.

iii. For a fixed value of IQ and GPA, males earn more on average than females provided that the GPA is high enough.

iv. For a fixed value of IQ and GPA, females earn more on average than males provided that the GPA is high enough.

(b) Predict the salary of a female with IQ of 110 and a GPA of 4.0.

(c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.


3. In this exercise you will create some simulated data and will fit simple linear regression models to it. Make sure to use command set.seed(1) prior to starting part (a) to ensure consistent results. (Hint: rnorm(n, mean = a, sd = b) generates n random variables with mean a, standard deviation b, e.g., rnorm(100, mean = 10, sd = 5) returns a vector with 100 values, each of which follows a normal distribution with mean 10 and standard deviation 5.)

(a) Using the rnorm() function, create a vector, x, containing 100 observations drawn from a N(0, 1) distribution. This represents a feature, X.

(b) Using the rnorm() function, create a vector, , containing 100 observations drawn from a N(0, 0.25) distribution i.e. a normal distribution with mean zero and variance 0.25.

(c) Using x and , generate a vector y according to the model

What is the length of the vector y? What are the values of and in this linear model?

(d) Create a scatterplot displaying the relationship between x and y. Comment on what you observe.

(e) Fit a least squares linear model to predict y using x. Comment on the model obtained. How do and compare to and

(f) Now fit a polynomial regression model that predicts y using x and x2. Is there evidence that the quadratic term improves the model fit? Explain your answer.

(g) Repeat (a)-(f) after modifying the data generation process in such a way that there is less noise in the data. The model (1) should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term in (b). Describe your results.

(h) Repeat (a)-(f) after modifying the data generation process in such a way that there is more noise in the data. The model (1) should remain the same. You can do this by increasing the variance of the normal distribution used to generate the error term in (b). Describe your results.

(i) What are the confidence intervals for and based on the original data set, the noisier data set, and the less noisy data set? Comment on your results.