
Project 2 – Best Subset Selection and Suggested Procedure

Goal:

Once you’ve identified your data set and loaded it into R, your main task will be to build the best possible (in some sense) model for your chosen response variable. As a first step toward selecting the best model for your response variable, you’ll use a systematic procedure called best subset selection.

Best Subset Selection Example

We’ll use a package called leaps to carry out best subset selection. To install this package, run the following R code once:

install.packages("leaps")

This will add leaps to your library of R packages. Once you’ve installed it, you can load the package with the usual

library(leaps)

For a concrete example, suppose we wanted to perform best subset selection for the FirstYearGPA data set that we’ve seen before in labs. We’ll load in the Stat2Data package

library(Stat2Data)      # Load the Stat2Data package

data(FirstYearGPA)      # Load the FirstYearGPA data set

colnames(FirstYearGPA)  # Print the names of each of the columns in the dataframe

##  [1] "GPA"          "HSGPA"        "SATV"         "SATM"         "Male"
##  [6] "HU"           "SS"           "FirstGen"     "White"        "CollegeBound"

As you can see, this data set has 10 variables: 3 categorical (Male, FirstGen, and White) and 7 numerical. As usual, we’d like to build a model that can predict the GPA of a first-year college student using all this information.

However, now that we have the tools for multiple linear regression, we could include any combination of predictors in our model. In total, there are 2^9 = 512 possible models to choose from (for each predictor, we can choose to include it in the model or not). Note that this doesn’t include any interaction terms, either. What a mess! In principle, we could build all of these models and compare them against each other, but it would take a long time to do manually.

Fortunately, we can automate this process by using the regsubsets command. For instance, if we just want to consider the variables HSGPA, SATV, SATM, Male, and White as predictors for GPA, we can use the command:

BSS = regsubsets(GPA ~ HSGPA + SATV + SATM + Male + White, data=FirstYearGPA, nvmax=4)

A little extra syntax explanation: nvmax tells R the maximum number of predictors we want to include in our model.  In summary, the code above tells R to build all linear models with GPA as response and up to four predictors, selected from among: HSGPA, SATV, SATM, Male, White.
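If you’d like to inspect the results numerically before plotting, `summary()` on a regsubsets object reports, for each model size, which predictors the best model includes. A short sketch (the object name `BSSsum` is just illustrative):

```r
# Summarize the best-subsets search (assumes BSS was created as above)
BSSsum = summary(BSS)
BSSsum$which   # logical matrix: one row per model size, TRUE = predictor included
BSSsum$rsq     # multiple R^2 of the best model of each size
```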

Now, let’s graph our results.

plot(BSS, scale="r2")

[Best subset selection plot: predictor variables on the x-axis, multiple R² (roughly 0.2 to 0.29) on the y-axis; shaded squares mark included predictors]

We haven’t encountered this type of graph before, but here’s how to interpret it: the multiple R² for each model is plotted along the y-axis, and the predictor variables are on the x-axis. A shaded square indicates that a predictor was included in the model, and a white square indicates that it wasn’t. For instance, from the top row of the graph we can infer that the model with the highest multiple R² included HSGPA, SATV, Male, and White as predictors, and for this model R² ≈ 0.29. From the next row, we can see that the next best model included only three predictors (HSGPA, SATV, and White), but still had a multiple R² of about 0.28.
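You can also read off a winning model programmatically rather than from the plot: `coef()` on a regsubsets object returns the coefficient estimates for the best model of a given size. A brief sketch, assuming BSS was fit as above:

```r
# Coefficient estimates for the best 4-predictor model (the plot's top row)
coef(BSS, 4)

# The same model's multiple R^2
summary(BSS)$rsq[4]
```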

Note that you can use other metrics for evaluating which model is “best.” For instance, we may prefer to use adjusted R² instead, in which case we would use the code:

plot(BSS, scale="adjr2")

[Best subset selection plot as above, with adjusted R² (roughly 0.2 to 0.27) on the y-axis]

Note that if we use adjusted R² as our metric, which model is “best” changes! In this case, the model with predictors HSGPA, SATV, and White is considered to be the best (with adjusted R² ≈ 0.27).

As a final note, if you wanted to perform best subset selection using all available predictors, you don’t have to type them all out:

BSS = regsubsets(GPA ~ ., data=FirstYearGPA, nvmax=9)

The above code tells R to consider all linear models with GPA as the response variable and any combination of predictor variables (the . tells R to use all the other variables as predictors). This time we’ve set nvmax=9, so R will consider all 512 possible models and rank them according to whatever metric we like (for instance, multiple R² or adjusted R²).
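One way to pull out the single best model under adjusted R² from this full search is to find the model size that maximizes it. A sketch, assuming BSS was fit with the regsubsets call above (the name `full` is just illustrative):

```r
# Find the model with the highest adjusted R^2 across all sizes
full = summary(BSS)
k = which.max(full$adjr2)   # size of the best model by adjusted R^2
coef(BSS, k)                # its coefficient estimates
full$adjr2[k]               # its adjusted R^2
```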

Project Procedure

Remember, your goal is to identify the best linear model for your chosen response variable.  While best subset selection is a good first step in that direction, there may be other considerations depending on your data set. An example procedure that you could use:

1. Plot the numerical variables against the response variable to see if the relationships between each predictor and the response look linear. If not, you should consider applying transformations.

2. Apply transformations, if necessary.

3. Perform best subset selection (BSS) and create plots as described above, using both multiple R² and adjusted R² as metrics for the "best" model.

4. Although it’s possible that the two "best" models from the previous step are the same, it is likely that they’ll be different. Indeed, it’s likely that one model will be nested inside the other. Compare the two models against each other using a nested F-test.

5. Use the result of your test to choose a "base" model to start with.

6. Then, investigate your "base" model closely.

(a) Are all the predictors significant?

(b) What percentage of variability is explained by each predictor?

(c) Is multicollinearity a concern?

7. Answers to the above questions may suggest your next steps. Should some predictors be dropped from the model? Does your "best" model exclude a predictor you think is important, even though it is not statistically significant?

Example: In the BSS example above, the "best" model (using adjusted R²) didn’t include the Male variable. If you look at the summary for this model, you can see that Male isn’t a significant predictor (R output given below). However, there may be good reason to leave it in. For instance, in many fields control variables (race, sex, education level, socioeconomic status, etc.) are often included in models even when insignificant, to show that they were controlled for. Sometimes, the fact that a predictor is not significant is important information. In this case, for instance, the dean of the college would be very happy to be able to show that male and female students don’t have significantly different outcomes in GPA.

8. Consider whether or not you want to add any interaction terms to the model.

9. Once you’ve decided on a final model, you should describe it statistically as completely as you can (individual t-tests, ANOVA, percentage of variability explained by each variable, etc).
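As a sketch of steps 4 and 6(c) using the FirstYearGPA example: a nested F-test can be run with anova(), and variance inflation factors help flag multicollinearity. The vif() function below comes from the car package, which you would need to install separately; the model names are just illustrative.

```r
# Step 4: nested F-test -- does adding Male significantly improve the fit?
reduced = lm(GPA ~ HSGPA + SATV + White, data = FirstYearGPA)
full    = lm(GPA ~ HSGPA + SATV + Male + White, data = FirstYearGPA)
anova(reduced, full)   # a large p-value favors the reduced ("base") model

# Step 6(c): check multicollinearity with variance inflation factors
# (requires install.packages("car") first)
library(car)
vif(full)   # VIFs well above 5 or 10 suggest multicollinearity concerns
```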

##
## Call:
## lm(formula = GPA ~ HSGPA + SATV + Male + White, data = FirstYearGPA)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.01819 -0.28853  0.03101  0.26879  0.89799
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6607630  0.2899104   2.279 0.023642 *
## HSGPA       0.5147294  0.0739244   6.963 4.03e-11 ***
## SATV        0.0007360  0.0003583   2.054 0.041164 *
## Male        0.0511448  0.0547449   0.934 0.351234
## White       0.2393174  0.0708214   3.379 0.000864 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3966 on 214 degrees of freedom
## Multiple R-squared:  0.2874, Adjusted R-squared:  0.2741
## F-statistic: 21.58 on 4 and 214 DF,  p-value: 5.716e-15

Note: The procedure outlined above is only one way to attack the problem of finding the “best” linear model. It is meant to be used as a starting point, but you are free to adapt (or completely change) the procedure to suit your needs. If you think your data set may require a different approach, but you’re unsure where to begin, please contact me so we can discuss!