Project 2 – Best Subset Selection and Suggested Procedure
Goal:
Once you’ve identified your data set and loaded it into R, your main task will be to build the best possible (in some sense) model for your chosen response variable. As a first step toward selecting the best model for your response variable, you’ll use a systematic procedure called best subset selection.
Best Subset Selection Example
We’ll use a package called leaps to carry out best subset selection. To install this package, run the following R code once:
install.packages("leaps")
This will add leaps to your library of R packages. Once you’ve installed it, you can open the package by using the usual
library(leaps)
For a concrete example, suppose we wanted to perform best subset selection for the FirstYearGPA data set that we’ve seen before in labs. We’ll load in the Stat2Data package
library(Stat2Data)     #Load the Stat2Data package
data(FirstYearGPA)     #Load the FirstYearGPA data set
colnames(FirstYearGPA) #Print the names of each of the columns in the dataframe
## [1] "GPA"          "HSGPA"        "SATV"         "SATM"         "Male"
## [6] "HU"           "SS"           "FirstGen"     "White"        "CollegeBound"
As you can see, this data set has 10 variables: 3 categorical (Male, FirstGen, and White) and 7 numerical. As usual, we’d like to build a model that can predict the GPA of a first-year college student using all this information.
However, now that we have the tools for multiple linear regression, we could include any combination of predictors in our model. In total, there are 2⁹ = 512 possible models to choose from (for each predictor, we can choose to include it in the model or not). Note that this doesn’t even include any interaction terms. What a mess! In principle, we could build all of these models and compare them against each other, but it would take a long time to do manually.
Fortunately, we can automate this process by using the regsubsets command. For instance, if we just want to consider the variables HSGPA, SATV, SATM, Male, and White as predictors for GPA, we can use the command:
BSS = regsubsets(GPA ~ HSGPA+SATV+SATM+Male+White, data=FirstYearGPA, nvmax=4)
A little extra syntax explanation: nvmax tells R the maximum number of predictors we want to include in our model. In summary, the code above tells R to build all linear models with GPA as response and up to four predictors, selected from among: HSGPA, SATV, SATM, Male, White.
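Before plotting, it can help to inspect the results directly. Calling summary() on the fitted regsubsets object prints a selection table with one row per model size, where an asterisk marks each predictor included in the best model of that size:

```r
#Print a selection table: one row per model size, with "*" marking
#the predictors included in the best model of that size
summary(BSS)
```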
Now, let’s graph our results.
plot(BSS, scale="r2")
[Plot: best subset selection results with scale="r2"; the y-axis shows multiple R² values ranging from about 0.2 to 0.29]
We haven’t encountered this type of graph before, but here’s how to interpret it: the multiple R² for each model is plotted along the y-axis, and the predictor variables are on the x-axis. A shaded square indicates that a predictor was included in the model, and a white square indicates that it wasn’t. For instance, from the top row of the graph we can infer that the model with the highest multiple R² included HSGPA, SATV, Male, and White as predictors, and for this model R² ≈ 0.29. From the next row, we can see that the next best model included only three predictors (HSGPA, SATV, and White), but still had a multiple R² of about 0.28.
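Once you’ve picked a model size off the graph, you can extract the fitted coefficients for the best model of that size with coef(). For example, for the best four-predictor model:

```r
#Coefficients of the best model with exactly 4 predictors
coef(BSS, 4)
```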
Note that you can use other metrics for evaluating which model is “best.” For instance, we may prefer to use adjusted R² instead, in which case we would use the code:
plot(BSS, scale="adjr2")
[Plot: best subset selection results with scale="adjr2"; the y-axis shows adjusted R² values ranging from about 0.2 to 0.27]
Note that if we use adjusted R² as our metric, which model is “best” changes! In this case, the model with predictors HSGPA, SATV, and White is considered to be the best (with adjusted R² ≈ 0.27).
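You can also read these metrics off numerically rather than from the plot: the summary of the regsubsets object stores one value of each metric per model size.

```r
BSS.summary = summary(BSS)
BSS.summary$rsq              #Multiple R^2 for the best model of each size
BSS.summary$adjr2            #Adjusted R^2 for the best model of each size
which.max(BSS.summary$adjr2) #Size of the model with the highest adjusted R^2
```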
As a final note, if you wanted to perform best subset selection using all available predictors, you don’t have to type them all out:
BSS = regsubsets(GPA ~ . , data=FirstYearGPA, nvmax=9)
The above code tells R to consider all linear models with GPA as a response variable, and any combination of predictor variables (the . indicates to consider all the predictor variables). This time, we’ve set nvmax=9, and so R will consider all 512 possible models and rank them according to whatever metric we like (for instance, multiple R² or adjusted R²).
Project Procedure
Remember, your goal is to identify the best linear model for your chosen response variable. While best subset selection is a good first step in that direction, there may be other considerations depending on your data set. An example procedure that you could use:
1. Plot the numerical variables against the response variable to see if the relationships between each predictor and the response look linear. If not, you should consider applying transformations.
2. Apply transformations, if necessary.
3. Perform best subset selection (BSS) and create plots as described above, using both multiple R² and adjusted R² as metrics for the "best" model.
4. Although it’s possible that the two "best" models from the previous step are the same, it is likely that they’ll be different. Indeed, it’s likely that one model will be nested inside the other. Compare the two models against each other using a nested F-test.
5. Use the result of your test to choose a "base" model to start with.
6. Then, investigate your "base" model closely.
(a) Are all the predictors significant?
(b) What percentage of variability is explained by each predictor?
(c) Is multicollinearity a concern?
7. Answers to the above questions may suggest your next steps. Should some predictors be dropped from the model? Does your "best" model exclude a predictor you think is important, even though it is not statistically significant?
Example: In the BSS example above, the "best" model (using adjusted R²) didn’t include the Male variable.
If you look at the summary for this model, you can see that Male isn’t a significant predictor (R output
given below). However, there may be good reason to leave it in. For instance, in many fields control
variables (race, sex, education level, socioeconomic status, etc.) are often included in models even
when insignificant to show that they were controlled for. Sometimes, the fact that a predictor is not
significant is important information. In this case for instance, the dean of the college would be very
happy to be able to show that male and female students don’t have significantly different outcomes in
GPA.
8. Consider whether or not you want to add any interaction terms to the model.
9. Once you’ve decided on a final model, you should describe it statistically as completely as you can (individual t-tests, ANOVA, percentage of variability explained by each variable, etc).
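To make steps 4–6 concrete, here is a sketch (using the FirstYearGPA example from above) of how you might compare two candidate models with a nested F-test and then check for multicollinearity. Note that the vif() function comes from the car package, which you would need to install separately:

```r
#Candidate models from best subset selection; the smaller model
#is nested inside the larger one
reduced = lm(GPA ~ HSGPA + SATV + White, data=FirstYearGPA)
full    = lm(GPA ~ HSGPA + SATV + Male + White, data=FirstYearGPA)

#Nested F-test: a small p-value favors keeping the larger model
anova(reduced, full)

#Check multicollinearity in the chosen "base" model
#(requires install.packages("car") first)
library(car)
vif(reduced)
```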
##
## Call:
## lm(formula = GPA ~ HSGPA + SATV + Male + White, data = FirstYearGPA)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.01819 -0.28853  0.03101  0.26879  0.89799
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6607630  0.2899104   2.279 0.023642 *
## HSGPA       0.5147294  0.0739244   6.963 4.03e-11 ***
## SATV        0.0007360  0.0003583   2.054 0.041164 *
## Male        0.0511448  0.0547449   0.934 0.351234
## White       0.2393174  0.0708214   3.379 0.000864 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3966 on 214 degrees of freedom
## Multiple R-squared: 0.2874, Adjusted R-squared: 0.2741
## F-statistic: 21.58 on 4 and 214 DF, p-value: 5.716e-15
Note: The procedure outlined above is only one way to attack the problem of finding the “best” linear model. It is meant to be used as a starting point, but you are free to adapt (or completely change) the procedure to suit your needs. If you think your data set may require a different approach, but you’re unsure where to begin, please contact me so we can discuss!
2023-04-29