

Math 4581 Fall 2022

Computer Lab


To get SPSS:

Sign into myNeu;

Click on Services and Links;

Under Services there should be a Software Downloads link; click on it and download SPSS 28.

I have added a thread in Discussions on Canvas with more details.

Reports should be as short as possible, so copy-and-paste only the SPSS results that are necessary.

The data for problems 1-4 comes from a subset of data derived from the 2015 Fuel Consumption Report from Natural Resources Canada. The file is called canada-fuel-2015. You can see the variable names with a quick description if you click on variable view at the bottom left (then go back to data view). If you were inputting data you would have to fill in these columns (especially the measure column which says what type of data it is).

1. Let’s get a basic summary of some of the data

1. Click on the pull-down menu ‘Analyze’, move down to ‘Descriptive Statistics’, and choose ‘Descriptives’.

2. A window should pop up; move all of the numeric variables (enginesize through co2emissions) into the Variables space;

3. Click on ‘Options’ and check off Kurtosis and Skewness in the new window; mean, sd, min, and max should already be checked (if not, check them);

4. Click on Continue and then OK in the original window.

Aside: kurtosis and skewness are used to check whether the data is normal; both should be close to 0 if it is. Kurtosis measures how sharply the data is peaked around the center (and how heavy the tails are) compared with a normal distribution: a positive value says the data is clustered more tightly around the center than if it were normal. Skewness measures how symmetric the data is: a positive value says the data is skewed right. A rough guideline is that the data is OK if the absolute value of the statistic is less than double its standard error. Does it appear that any of the variables are normal using this criterion? If so, list them.
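To make the guideline concrete, here is a minimal Python sketch of the check. It assumes the usual sample-adjusted formulas for skewness, excess kurtosis, and their standard errors, which I believe are the ones SPSS reports, but treat that as an assumption and compare against your own SPSS output. The function name `shape_statistics` and the sample values are made up for illustration and do not come from canada-fuel-2015.

```python
import math

def shape_statistics(data):
    """Sample skewness, excess kurtosis, their standard errors, and the
    'absolute value less than twice the standard error' check."""
    n = len(data)
    if n < 4:
        raise ValueError("need at least 4 observations")
    mean = sum(data) / n
    dev = [x - mean for x in data]
    ss2 = sum(d ** 2 for d in dev)
    ss3 = sum(d ** 3 for d in dev)
    ss4 = sum(d ** 4 for d in dev)
    s = math.sqrt(ss2 / (n - 1))  # sample standard deviation
    if s == 0:
        raise ValueError("data has zero variance")
    # Sample-adjusted skewness and excess kurtosis (both 0 for a normal population)
    skew = n * ss3 / ((n - 1) * (n - 2) * s ** 3)
    kurt = (n * (n + 1) * ss4 / ((n - 1) * (n - 2) * (n - 3) * s ** 4)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    # Standard errors depend only on the sample size
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = math.sqrt(4 * (n * n - 1) * se_skew ** 2 / ((n - 3) * (n + 5)))
    return {
        "skewness": skew, "se_skewness": se_skew,
        "kurtosis": kurt, "se_kurtosis": se_kurt,
        "skewness_ok": abs(skew) < 2 * se_skew,
        "kurtosis_ok": abs(kurt) < 2 * se_kurt,
    }

# Made-up values, for illustration only
sample = [8.4, 9.1, 9.9, 10.3, 10.8, 11.5, 12.2, 13.0, 14.6, 17.9]
for name, value in shape_statistics(sample).items():
    print(name, value)
```

Because the standard errors shrink as the sample grows, this guideline becomes strict for large data sets; a variable can fail it while its histogram still looks roughly bell-shaped.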

Answer the questions and hand in this output by copying and pasting it into a Word document. If this does not work: select the table and go to ‘File’, then ‘Export’. Under ‘Objects to Export’ select ‘Selected’, and under ‘Type’ select ‘Word/RTF’. Copy the resulting images into a single Word document.

2. There are a few ways to look at the summaries of the data. Let’s use a second one here:

1. Click on the pull-down menu ‘Analyze’, move down to ‘Descriptive Statistics’, and choose ‘Frequencies’;

2. A window should pop up; move the same variables as those in question 1 into the Variables space;

3. Click on ‘Statistics’ and check off Quartiles, Median, and Range (you can see it also has the statistics from the Descriptive option).

4. Click on Continue;

5. Uncheck ‘Display Frequency Tables’ and then click on OK.

Hand in this output.
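As a cross-check on what the Quartiles, Median, and Range options compute, here is a small Python sketch using only the standard library. The `method="exclusive"` setting interpolates at position p(n+1), which I believe matches the default percentile method in SPSS Frequencies; that is an assumption, so small differences from SPSS are possible. The function name `five_number_style_summary` and the values are made up for illustration.

```python
import statistics

def five_number_style_summary(values):
    """Quartiles, median, and range for one numeric variable."""
    # n=4 asks for the three cut points that split the data into quarters
    q1, q2, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return {
        "q1": q1,
        "median": statistics.median(values),
        "q3": q3,
        "range": max(values) - min(values),
    }

# Made-up values, for illustration only
print(five_number_style_summary([7.2, 8.1, 8.9, 9.4, 10.6, 11.3, 12.8, 15.0]))
```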

3. Let’s get a few basic graphs

a) The histogram for one of the variables:

1. Go to ‘Graphs’ and choose ‘Chart Builder’.

2. If a dialog pops up saying ‘Before you use this….’ click ‘OK’ to continue.

3. In the new window, click on Histogram in the lower left and then drag the leftmost graph where indicated;

4. Drag fuelusecity to the x-axis and hit OK.

We should be checking to see if the data is normal.

Give a basic description—how many main peaks are there; are there any outliers; is it roughly symmetric?

Does it seem to be roughly normal?

Now do this again for one more variable. Hand in the two graphs and answer the two questions.

b) Get a boxplot to compare fuelusecity and fuelusehwy:

1. Go to ‘Graphs’, move down to ‘Legacy Dialogs’ and choose ‘Boxplot’.

2. In the new window, click on ‘Simple’ and check off ‘Summaries of Separate Variables’;

3. Click ‘Define’;

4. Drag fuelusecity and fuelusehwy to Boxes Represent and press OK.

Hand in the graph.

c) Let’s now get a scatterplot for two variables:

1. Go to ‘Graphs’ and choose ‘Chart Builder’.

2. In the new window, click on ‘Scatter/Dot’ and drag the top left graph (simple scatter);

3. Drag fuelusecity to the x-axis and fuelusehwy to the y-axis;

4. Press OK.

Choose two other variables, make a second scatterplot, and hand in both graphs.

4. Let’s try some regression.

First we will do multiple linear regression with co2emissions as the dependent variable. We’ll try this using two methods:

The first method has a preliminary step.

a) First get a table of the pairwise correlations to see which variables are strongly correlated:

Click on Analyze, then go down to Correlate and choose Bivariate.

Move all the numeric variables (except Year) into the window Variables. Press OK.

You will use this table for the next part, but it is too big to hand in, so go back, choose 4 variables, and get the table with just those. Hand in this smaller output.
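For reference, each entry in the Bivariate table is a Pearson correlation coefficient. The sketch below computes the same kind of pairwise table in plain Python; the helper names `pearson` and `correlation_table` are hypothetical, and the three short columns are made-up numbers, not rows from canada-fuel-2015.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length (at least 2 values)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_table(columns):
    """Pairwise correlations for a dict mapping variable name -> list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Made-up values, for illustration only
data = {
    "enginesize": [1.6, 2.0, 2.4, 3.0, 3.5, 5.0],
    "fuelusecity": [8.1, 9.0, 10.2, 11.8, 12.5, 16.0],
    "fuelusehwy": [6.0, 6.4, 7.3, 8.1, 8.8, 11.0],
}
for row, cells in correlation_table(data).items():
    print(row, {col: round(r, 3) for col, r in cells.items()})
```

Pairs with a correlation close to 1 or -1 carry nearly the same information, which is why using both in one regression rarely makes sense.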

b) In the first method, you decide which variables make sense to use in the regression. Some may not make sense to use because of what they are, while others are so strongly related to one another that it doesn’t make sense to use all of them (for example, fuelusecity and fueluseboth). Using the correlation table from part a), choose 5 of the variables.

Get the regression line for the variables you chose:

Click on Analyze, go to Regression, and choose Linear;

Put co2emissions into the Dependent Variable box and the other variables you chose into the Independent box;

Click on OK.

Hand in the output (you only need the ANOVA and coefficients tables) and answer:

According to the output, is the model significant at the 5% level (look at Sig (the p-value) in the ANOVA table)?

Which variables are not significant (if any) at the 5% level (look at the p-values in the Coefficients table)?

Note: the p-value is labeled as Sig.

c) Now run regression with co2emissions as the dependent variable but using all numeric variables (except Year).

Now find the variables that are not significant at the 5% level, if any. If there are none, you are done. Otherwise, run the regression again after dropping the variable with the largest p-value, and again find the variables that are not significant at the 5% level.

Repeat until every remaining variable is significant.

NOTE: You can do this in one step by repeating the first step but changing the Method to ‘Backward’ from ‘Enter’. Hand in the printout of the first and last step (you again only need the ANOVA and coefficients tables).
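The manual procedure in part c) is a simple loop, sketched below in Python. This shows only the control flow: `fit_pvalues` is a hypothetical callable standing in for "run the regression in SPSS and read the Sig. column", and the p-values in the example are invented. A real model's p-values also change each time a variable is dropped, which the fixed lookup table here does not imitate.

```python
def backward_eliminate(variables, fit_pvalues, alpha=0.05):
    """Repeatedly drop the variable with the largest p-value until every
    remaining variable is significant at level alpha.

    fit_pvalues: hypothetical callable taking a list of variable names and
    returning a dict {name: p-value} for a model fitted with those variables.
    Returns the surviving variables and the p-values seen at each step.
    """
    remaining = list(variables)
    history = []
    while remaining:
        pvalues = fit_pvalues(remaining)  # refit with the current variables
        history.append(dict(pvalues))
        worst = max(remaining, key=lambda name: pvalues[name])
        if pvalues[worst] < alpha:
            break  # everything left is significant
        remaining.remove(worst)  # drop only one variable per step
    return remaining, history

# Invented p-values, for illustration only
def fake_fit(names):
    table = {"enginesize": 0.001, "cylinders": 0.41, "fuelusehwy": 0.12}
    return {name: table[name] for name in names}

kept, steps = backward_eliminate(["enginesize", "cylinders", "fuelusehwy"], fake_fit)
print(kept)
print(len(steps), "regressions run")
```

One caution when comparing with the one-step route: the built-in Backward method applies its own removal threshold, which I believe defaults to 0.10 rather than 0.05, so its final model can keep a variable that this 5% loop would drop. Check the criteria under Options if the two disagree.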

The data for the next problems are from the 2013 Behavioral Risk Factor Surveillance System (BRFSS).

5. a) Let’s try factorial ANOVA (two-way ANOVA extended to three factors) to see whether how many fruits you eat, whether you’re active, and your sex have an impact on BMI.

Click on the drop-down menu ‘Analyze’, choose ‘General Linear Model’ and then ‘Univariate’, which will open a window;

Move BMI into Dependent Variable and fruits, active1, and female into Fixed Factors;

Click on ‘Plots’, put fruits into Horizontal Axis and female into Separate Lines, then click on Add and then Continue;

Click on OK.

Note: we use two-way ANOVA (or, with three factors as here, three-way ANOVA) when we have more than one independent variable and each of these variables is categorical; if they were quantitative, we would use regression instead. We will be testing a few things separately:

(i) are any of the groups different (test to see if the model is significant—look at the Sig for the Corrected Model)?

(ii) are the groups for each independent variable different (in our example, is BMI affected by how many fruits they eat; is it affected by sex; is it affected by how active they are—look at the Sigs for those three lines)?

(iii) is there interaction between the three independent variables? Look at the Sig for fruits*active1, fruits*female, active1*female, and fruits*active1*female.

Hand in the output (we just need the Tests of Between-Subjects Effects and the Plot) and decide whether the model, each variable, and each interaction is significant at the 5% level (Note: if there is no interaction, the lines in the graph should be roughly parallel).

Note: There are a couple of things that you should do, but I won’t have you do them, to save some time:

• Since we find that there are some interactions between the variables, it would make sense to see what happens if we only look at pairs.

• We really should get the graphs for all three pairs instead of just the one. The same is true for the boxplots in the next part.

b) Get a boxplot for each of the subcategories:

Go to ‘Graphs’ and choose ‘Chart Builder’.

In the new window, click on ‘Boxplot’ and drag the middle graph (clustered boxplot);

Drag female to the x-axis, BMI to the y-axis, and active1 into ‘Cluster on …’, then press OK.

Hand in this graph.

6. Here we will get the logistic regression model for whether a person has arthritis. There are again two ways to do this, but you will just choose one of them.

For the (4b) method, choose 5 variables that you think are good predictors. For the (4c) method, use all of the variables, except that you should choose only one from each of these groups: age, under30, age30to64, age65plus; active, active1, activetimes; bmi, bmicat.

a) Get the logistic regression model for the variables you chose:

Click on Analyze, go to Regression, and choose Binary Logistic;

Put arthritis into the Dependent Variable box and the other variables into the Covariates box;

Under Method, choose Backward LR for the (4c) method and Enter for the (4b) method;

Click on Options, choose Display ‘At last step’ (you don’t really have to do this for the 4b method), and click Continue;

Click on OK.

Hand in the output (you only need the Omnibus Tests of Model Coefficients and Variables in the Equation tables).

According to the output, is the model significant at the 5% level?

If you used the (4b) method, which variables are not significant (if any) at the 5% level?

If you used the (4c) method, which variables are in the final model? (These are in the table; you only need to hand in the tables for the first and last steps.)

b) The odds ratio for a variable in a logistic regression model is e^(coefficient). Notice that these are given in the output for the model (the Exp(B) column of the Variables in the Equation table).

Of the variables that are significant, which increase the odds of arthritis and which decrease the odds?

Does the odds ratio make sense for each of these variables?

The odds ratio is used when comparing two groups.

An example: we want to see if exercise can prevent a second heart attack. We would do this by splitting patients into two groups: one that is made to exercise and a control group that does not. We get the odds for each group: the odds for the exercise group would be Pe/(1-Pe), where Pe is the probability that a person in the exercise group has a second heart attack. The odds ratio is then the odds for one group divided by the odds for the other; here, the odds for the exercise group over the odds for the control group.

If this number is less than one, then exercise reduces the odds of having a second heart attack. If the model has many variables, the odds ratio for each variable gives the factor by which the odds change when that variable increases by one unit while the other variables are held constant.
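The arithmetic in the exercise example can be sketched in a few lines of Python. The probabilities 0.20 and 0.40 and the coefficient -0.98 are invented for illustration and do not come from any study or from the BRFSS data; the function names are hypothetical as well.

```python
import math

def odds(p):
    """Convert a probability p (0 <= p < 1) into odds p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("probability must satisfy 0 <= p < 1")
    return p / (1 - p)

def odds_ratio(p_group, p_reference):
    """Odds for one group divided by the odds for the reference group."""
    return odds(p_group) / odds(p_reference)

def odds_ratio_from_coefficient(b):
    """Odds ratio for a one-unit increase in a logistic regression variable."""
    return math.exp(b)

# Invented probabilities of a second heart attack, for illustration only
p_exercise, p_control = 0.20, 0.40
print("odds (exercise):", odds(p_exercise))
print("odds (control): ", odds(p_control))
print("odds ratio:     ", odds_ratio(p_exercise, p_control))

# An invented negative coefficient gives an odds ratio below one
print("e^(-0.98):      ", odds_ratio_from_coefficient(-0.98))
```

A negative coefficient always gives an odds ratio below one (lower odds), a positive coefficient gives one above one, and a coefficient of zero gives exactly one (no effect).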

Note: since the odds ratio gives the change in the odds when the variable in question is increased by one unit, it only makes sense if the variable is numeric or binary. If one of the variables were state of residence, with each state designated by a number (Maine=1, …), the odds ratio would not make sense, since the number is only a label and not a real quantity.