
MESA 8440: Multivariate Statistics

Assignment #1

(Last Updated September 14, 2023)

Completed assignment is due by the beginning of class (submit online), September 27, 2023

Purpose of the Memorandum:

The purpose of this assignment is to ensure that you (and your partner, if you choose) can:

· Check the “log-odds is linear in the predictors” assumption embedded in the specification of a logistic-regression model, for the relationship between a dichotomous outcome and one continuous predictor.

· Fit a taxonomy of binomial logistic-regression models, collating important statistics.

· Conduct GLH tests in selected logistic-regression models, by hand and by computer.

· Interpret a fitted logistic-regression model in terms of fitted probabilities, odds and odds-ratios.

· Interpret a fitted logistic-regression model by plotting fitted nonlinear trends for prototypical individuals.

You may work collaboratively with a partner on this assignment.  Engage in a full, fair and mutually agreeable collaboration with your partner, and do not simply divide the work between you.  Discuss and plan your analyses together, and debate what you find with each other as you make analytic decisions.  Collaborate on writing up your responses to our tasks, perhaps each drafting sections and then actively critiquing and editing each other’s writing to improve its quality, before assembling your final product.

We are requesting that you produce a Data-Analytic Memo, not a paper, as a response to this assignment.  In framing the work as a memo, we want to save you production time, facilitate your work with your partner, and focus your attention on core issues, making sure you can execute and interpret each task, rather than fill reams of paper.  Doing this assignment is a critical part of the learning experience in the course, and not an assessment opportunity for us.  We want to provide a high-quality opportunity for you to review, teach and learn the material presented in class.  We are expecting you to turn in short responses that you and your partner have crafted carefully to contain all critical detail, along with the evidentiary material we request.  However, all stated word and sentence limits are guidelines only, not immutable laws. We suggest that you work with your partner to pre-plan your Stata code so that your do file becomes an orderly record of your work.

Each partnership should turn in one copy of the joint memo; each partner will receive the same grade for each part.  Your memo should contain a response to each of the “work products” requested -- double-spaced, in the order listed.  Your response to each work product should begin on a new page and be labeled with both partners’ names (feel free to use headers/footers).

Reading:

Before you begin work on your data-analyses, please:

· Required:  Review my presentations for Classes #1, #2, and #3, and their associated handouts and other materials.  Pay attention to the technical language in these handouts, and use it as a model for your own writing.

Dataset:

In this assignment, you will investigate a congressperson’s support for abortion rights (“a woman’s right to choose”).  The data are contained in the ABORTION.txt file on the course website and are described in the ABORTION_info codebook.  The dataset contains an assessment of the support expressed for the preservation of abortion rights by the members of the 103rd U.S. Congress.

In the dataset, a congressperson’s overall support for a woman’s “right to choose” has been coded from his or her voting history and recorded in the values of dichotomous variable, A_SUPP.  This will be your outcome in the analyses below.

There are several question and control predictors.  For each congressperson, we know: (a) their personal political affiliation (represented currently by dichotomous variable RPUBLCAN, but soon to be recoded below), (b) their own level of religious (Christian) fundamentalism (represented by ordinal variable I_RELIG, which I ask you to treat below as though it were continuous), (c) their gender (represented by dichotomous variable MALE), and (d) whether they are a person of color (represented by dichotomous variable NONWHT).  All variables are coded as indicated in the codebook.

We also possess aggregate information on the district in which the congressperson was elected, including: (a) the level of prior support for the Democratic-party in their district, as measured by the percentage of voters in the district who voted for candidate Bill Clinton in the Presidential election of 1992 (represented by continuous variable P_CLINTON92), (b) the average level of religious fundamentalism in the district (represented by ordinal variable D_RELIG), (c) the average per capita income in the district (measured by continuous variable D_INC), and (d) the percentage of the district’s populace who are persons of color (represented by the continuous variable P_NONWHT).

Note that – although some of these predictors are measured at the individual level and some at the level of the congressperson’s district – we do not require multilevel modeling to conduct the ensuing analyses because the units of analysis – that is, the members of congress themselves – are not clustered hierarchically into groups.

Using these data, I ask you to explore how a congressperson’s support for abortion rights depended on his or her own political affiliation and religious fundamentalism, and on the level of prior political support and religious fundamentalism in their district, controlling for selected demographic and background characteristics of the congressperson and the district.

Tasks:

1. Input the data and check it:  

a) Read the abortion-rights dataset into Stata, making sure that you have the correct number of cases and variables, according to the codebook.  

b) Before conducting any analyses, eliminate Congressperson #15174, as he refused to declare an individual religious affiliation and so is missing on this variable.  You can eliminate him by including the following line:

drop if ID==15174

c) Assuming that members of congress are affiliated with one of only two political parties (this is not 100% correct, but simplifies the analysis), redefine the variable that describes the congress-person’s individual Republican-Party affiliation by recoding it to now identify whether the congressperson is a Democrat, rather than a Republican.  Call the new variable DEMCRAT, and create its values by the following code:

gen DEMCRAT=1-RPUBLCAN

This conversion is not a political statement, but simply brings individual and district measures of political affiliation into line with each other, so that later interpretation of interactions becomes easier.  The key issue is to align the directions of the individual and district dimensions of political affiliation to be the same.  (We could have just as easily recoded everything to represent the Republican side of the spectrum and the substantive findings would have been the same).

d) For your own future reference, obtain sensible univariate descriptive statistics on outcome A_SUPP and predictors D_INC, P_NONWHT, MALE, NONWHT, DEMCRAT, I_RELIG, P_CLINTON92 and D_RELIG.

summarize A_SUPP D_INC P_NONWHT MALE NONWHT DEMCRAT I_RELIG P_CLINTON92 D_RELIG

<Nothing to submit>

Note: The initial data-analytic work – in the two questions immediately below -- is not a central part of the main thrust of the overall analysis.  Instead, it illustrates something that you may have suspected -- that logistic-regression analysis is a “modern” extension of “classical” contingency-table analysis.  In fact, as you will see, if you have only a dichotomous outcome and a single dichotomous predictor, you can use either approach to conduct the analysis and obtain exactly the same findings.  However, because logistic-regression analysis provides a regression-like framework, it can be extended easily to more complex analyses (such as including continuous predictors) while the contingency-table framework cannot.  So, in the next question and the one following it, explore for yourself the connection between these two strategies by using them in parallel to investigate a simple bivariate “two-way” relationship between the outcome describing a member of congress’ support for abortion rights, A_SUPP, and their individual political affiliation, DEMCRAT.  After that, I return you to the main analyses.

2. Explore the relationship between a member of congress’ support for abortion rights, A_SUPP, and his or her individual political affiliation, DEMCRAT, using a classical approach, as follows:

a) Display the sample bivariate relationship between outcome A_SUPP and question predictor DEMCRAT graphically and inspect it.  One brief example of STATA code for doing this is:

histogram A_SUPP, discrete frequency by(DEMCRAT,col(1))

But, feel free to do something more adventurous, if you feel like it.

b) Conduct a contingency-table analysis of the same relationship.  In this analysis, for your two-way contingency table, obtain:  (a) the observed cell frequencies, along with their corresponding row and column percentages, (b) the associated Likelihood Ratio (LR) χ² statistic and its accompanying p-value.  STATA code for doing this is:

tabulate A_SUPP DEMCRAT, lrchi column row

You may not have come across the Likelihood Ratio (LR) χ² statistic in the past, as it is born of maximum-likelihood estimation theory.  However, it is akin to the classical Pearson χ² statistic that you may have used in the past, but has superior statistical properties, and is used in the standard way to test the null hypothesis of no relationship between A_SUPP and DEMCRAT, in the population.

c) Use the LR χ² statistic to conduct a formal “classical” test of the null hypothesis that a congressperson’s support for abortion rights and individual political-party affiliation are unrelated, in the population.  Provide a complete account of your test.

d) Having rejected the null hypothesis of no population relationship, summarize the strength of the discovered relationship between A_SUPP and DEMCRAT using a sample odds-ratio computed by hand to compare the odds that a Democratic congressperson supports abortion rights (vs. does not support them) to the odds that a Republican congressperson does the same.  (Pay attention to the way that I have framed this statement, as it indicates which odds should be in the numerator and which in the denominator).  In a brief sentence or two, interpret your estimate, assuming that your rejection of the null hypothesis has given you permission to generalize to the population. Provide both your calculation and your interpretation of your estimate.

<Submit your responses to (c) and (d)>
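The hand calculation in (d) can be sketched in Stata with its immediate-table command, tabi.  This is a minimal illustration only: the cell counts are hypothetical and are not taken from the ABORTION data, and the scalar names (odds_dem, odds_rep, or_hand) are my own.  Rows follow A_SUPP (0, 1) and columns follow DEMCRAT (0, 1), matching the layout of the tabulate command above.

```stata
* Hypothetical cell counts, for illustration only (NOT the ABORTION data).
* Rows: A_SUPP = 0, 1.  Columns: DEMCRAT = 0, 1.
tabi 35 10 \ 15 40, lrchi2 column row

* Sample odds-ratio by hand: the odds that a Democrat supports (vs. does not)
* go in the numerator, the same odds for a Republican in the denominator.
scalar odds_dem = 40/10
scalar odds_rep = 15/35
scalar or_hand  = odds_dem/odds_rep
display "Sample odds-ratio (Democrat vs. Republican) = " %6.3f or_hand
```

With your own table, substitute the observed frequencies and keep the same numerator and denominator.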

3. Explore the same relationship between a member of congress’ support for abortion rights, A_SUPP, and his or her own political affiliation, DEMCRAT, using a modern approach, by completing the following analyses:

a) Fit a preliminary logistic-regression model in which you regress the outcome, A_SUPP, on the main effect of the single question predictor DEMCRAT.  Call it Model M2, and retain your output for use below.

b) Write down an expression for the fitted model.  From the slope parameter estimate associated with predictor DEMCRAT, estimate a fitted odds-ratio that summarizes the main effect of predictor DEMCRAT on outcome A_SUPP.  Submit your expression, show your working and interpret the fitted odds-ratio briefly in words.

c) Fit an additional “baseline” unconditional logistic-regression model in which you regress outcome A_SUPP on no predictors at all, call it Model M1.

d) By hand, use the GLH strategy to test the null hypothesis that support for abortion rights does not differ by a member of congress’s own political-party affiliation, in the 103rd Congress, in the population.  Provide a full account of your test, including all working.

e) Notice that the results of the test conducted here and the one conducted above, in Q2, are identical.

<Submit your responses to (b) and (d)>.
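The by-hand GLH test in (d) compares the -2LL statistics of the nested models M1 and M2.  The sketch below illustrates the arithmetic under stated assumptions: it uses hypothetical cell counts entered as frequency weights, not the ABORTION data, and the names y, x, n, ll_M1, ll_M2, diff, or_fit and g2 are illustrative.  It also checks the equivalence noted in (e), that the difference in -2LL matches the LR χ² statistic from the contingency table.

```stata
* Hypothetical 2x2 counts (NOT the ABORTION data): y = outcome, x = predictor.
clear
input y x n
1 1 40
0 1 10
1 0 15
0 0 35
end

* Baseline model with no predictors (the analogue of M1).
logit y [fweight=n]
scalar ll_M1 = e(ll)

* Model with the single dichotomous predictor (the analogue of M2).
logit y x [fweight=n]
scalar ll_M2 = e(ll)
scalar or_fit = exp(_b[x])        // fitted odds-ratio from the slope estimate

* GLH test by hand: difference in -2LL, referred to chi-square on 1 df.
scalar diff = (-2*ll_M1) - (-2*ll_M2)
display "Difference in -2LL = " %8.3f diff "   p = " %6.4f chi2tail(1, diff)
display "Fitted odds-ratio  = " %8.3f or_fit

* Classical LR chi-square from the same table, for comparison.
tabulate y x [fweight=n], lrchi2
scalar g2 = r(chi2_lr)
```

The one degree of freedom reflects the single parameter that distinguishes the two models.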

4. Explore the sample relationship between a member of congress’ support for abortion rights, A_SUPP, and the overall level of prior support for the Democratic-party in their district, represented by continuous predictor P_CLINTON92, as follows:

a) Display the observed sample relationship between dichotomous outcome A_SUPP and continuous predictor P_CLINTON92 as a bivariate scatterplot, with sensible scales on each axis.  The plot does not need to be APA-style.

b) Use the “binning” method illustrated in class to estimate the sample probabilities that a congressperson supported abortion rights within each of a series of “bins” of width five percentage points, defined on the values of P_CLINTON92.  List your sample probabilities, arrayed in their “5% bins”, in a non-APA style table or list so that we can check their values.

c) Display the sample relationship between the new sample probabilities and the values of binned predictor P_CLINTON92, also as a bivariate scatterplot, using the same scales on each axis as the plot in part (a).  Again, the plot does not need to be APA-style.

d) Assemble your pair of plots into a single panel, one beneath the other.  Write a couple of sentences in which you summarize briefly the sample relationship between A_SUPP and P_CLINTON92, citing the sample evidence appropriately.

<Submit responses to (b) and (d)>.

5. With respect to the relationship between a member of congress’ support for abortion rights, A_SUPP, and the level of prior support for the Democratic-party in their district (represented by continuous predictor P_CLINTON92) check the critical functional-form assumption that underpins logistic-regression analysis, as follows:

a) Using the sample probabilities that you obtained above (by averaging the values of A_SUPP within “bins” of five percentage points, defined on continuous predictor P_CLINTON92), estimate the sample odds and log-odds that a congressperson supported abortion rights (vs. not).  List them out in a non-APA-style table or list, along with the corresponding probabilities and binned values of P_CLINTON92, for our inspection.

b) Some of the obtained log-odds values are negative.  Why is this?

c) Display the sample relationship between the sample log-odds of support for abortion rights and the binned values of continuous predictor P_CLINTON92 as a bivariate scatterplot, again with sensible scales on each axis.  Deal appropriately with unwanted infinities.  The plot does not need to be APA-style.

d)  Inspect the plot and assess, by eye, whether the features of this plot confirm that the critical functional-form assumption underpinning a logistic-regression analysis of outcome A_SUPP on predictor P_CLINTON92 has been met.

e) Briefly explain: what the critical assumption is, and whether you believe it has been satisfied (or not) for this outcome and predictor, citing your evidence.

<Submit your responses to (a), (b), (c) and (e)>.
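One possible way to carry out the binning and the conversion from probabilities to odds and log-odds is sketched below.  It is not necessarily the method shown in class.  So that it runs on its own, it uses simulated values stored under the real variable names, so none of its output describes the ABORTION data; the names bin, p_supp, n, odds and logodds are my own.  On the real data you would skip the simulation step and wrap the collapse in preserve and restore.

```stata
* Simulated stand-in values (NOT the ABORTION data), for illustration only.
clear
set seed 8440
set obs 400
gen P_CLINTON92 = 20 + 60*runiform()
gen A_SUPP      = runiform() < invlogit(-4 + 0.09*P_CLINTON92)

* Five-point bins; each bin is labeled by its lower bound (0, 5, 10, ...).
egen bin = cut(P_CLINTON92), at(0(5)100)

* Sample probability in each bin = mean of the 0/1 outcome within the bin.
collapse (mean) p_supp = A_SUPP (count) n = A_SUPP, by(bin)

* Odds and log-odds.  When p_supp is 0 or 1 the log-odds is infinite;
* Stata records it as missing, so such bins simply drop out of the plot.
gen odds    = p_supp/(1 - p_supp)
gen logodds = ln(odds)
list bin n p_supp odds logodds, clean

twoway scatter logodds bin, ///
    xtitle("P_CLINTON92 (lower bound of 5-point bin)") ///
    ytitle("Sample log-odds of support")
```

Note that the log-odds is negative in any bin where the sample probability falls below one half, because the odds are then less than one.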

6. Fit a taxonomy of logistic-regression models, with A_SUPP as the outcome variable:

· Model M3 is your “baseline-control model.”   In this model, control for the main effects of the congressperson’s gender and ethnicity, and the socio-economic and ethnic make-up of their districts.  Thus, this model will contain the main effects of four covariates:  MALE, NONWHT, D_INC and P_NONWHT.

· Model M4.  Add to Model M3 the main effects of the religious fundamentalism of the congressperson and of the district, as represented by predictors I_RELIG and D_RELIG.

· Model M5.  Add to Model M4 the main effects of the Democratic-party affiliation of the congressperson and of the district, as described by predictors DEMCRAT and P_CLINTON92.  You will notice that the effects of several predictors are dramatically altered between models M4 and M5 when the predictors describing individual and district political affiliation are added.

· Model M6.  Simplify model M5 by removing all ethnicity and religious covariates, at both the individual and district levels, because you will find in part (b) below that they are no longer needed.  Thus, in model M6, retain only the main effects of predictors D_INC, MALE, DEMCRAT and P_CLINTON92.

· Model M7.  To the previous model, add the two-way interaction -- call it IPAbyDPA -- created by forming the cross-product of the congressperson’s individual Democratic-party affiliation (DEMCRAT) and the prior level of Democratic-party support in their district (P_CLINTON92).  Inspection of the corresponding marginal z-statistic indicates that the two-way interaction has a statistically significant impact on outcome A_SUPP, controlling for other terms in the model.

Now, complete the following:

a) In words, offer a brief conceptual explanation for the dramatic differences in effects that occurred between models M4 and M5.

b) In intermediate model M5, use the STATA test command to conduct a GLH test of the null hypothesis that the main effects of individual and district ethnicity (NONWHT & P_NONWHT) and of individual and district religious fundamentalism (I_RELIG & D_RELIG), on support for abortion rights, are jointly zero in the population.  This is the test that supports the transition from models M5 to M6 above.

c) Create an APA-style table containing fitted logistic-regression models M1 (from Question 3), M5, M6 and M7.  For each of these, include parameter estimates, approximate p-values, -2LL and pseudo-R2 statistics. Make sure your table includes all necessary elements.  Include a brief summary of the important facets of the GLH test you conducted in (b) above, at the bottom of the appropriate column.

<<Submit your responses to (a) and (c)>>
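The mechanics of fitting and storing the taxonomy, and of running the joint test in (b), might be organized along the following lines.  This is a sketch under stated assumptions: the data are simulated with arbitrary scales so that the block runs on its own, so its estimates say nothing about the real dataset, and the scalar glh_df is an illustrative name.  Note also that Stata’s test command conducts a Wald test, which need not give exactly the same statistic as a by-hand comparison of -2LL values.

```stata
* Simulated stand-in values with arbitrary scales (NOT the ABORTION data).
clear
set seed 8440
set obs 400
gen MALE        = runiform() < 0.9
gen NONWHT      = runiform() < 0.15
gen D_INC       = rnormal(30, 8)
gen P_NONWHT    = 100*runiform()
gen I_RELIG     = floor(5*runiform())
gen D_RELIG     = floor(5*runiform())
gen DEMCRAT     = runiform() < 0.6
gen P_CLINTON92 = 20 + 60*runiform()
gen A_SUPP      = runiform() < invlogit(-5 + 2*DEMCRAT + 0.07*P_CLINTON92)

* M1: unconditional model.
logit A_SUPP
estimates store M1

* M5: controls, religious fundamentalism and political affiliation.
logit A_SUPP MALE NONWHT D_INC P_NONWHT I_RELIG D_RELIG DEMCRAT P_CLINTON92
estimates store M5

* Joint test that the four ethnicity and religion effects are all zero.
test NONWHT P_NONWHT I_RELIG D_RELIG
scalar glh_df = r(df)

* M6: reduced model.
logit A_SUPP D_INC MALE DEMCRAT P_CLINTON92
estimates store M6

* M7: add the two-way interaction.
gen IPAbyDPA = DEMCRAT*P_CLINTON92
logit A_SUPP D_INC MALE DEMCRAT P_CLINTON92 IPAbyDPA
estimates store M7

* Side-by-side listing; -2LL is -2 times the reported log-likelihood (ll).
estimates table M1 M5 M6 M7, stats(N ll r2_p) star b(%9.3f)
```

The listing is a working summary only; the APA-style table still needs to be assembled by hand from it.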

7. Using parameter estimates from final fitted model M7, create a sensible APA-style plot of fitted trend lines for prototypical members of Congress to illustrate how individual political-party affiliation underscores a member of congress’ support for abortion rights, as moderated by the prior political opinions of the voters in their districts.  In creating this plot, set any covariates that you do not display explicitly, nor manipulate directly, to their overall sample averages, except for the individual and district measures of Democratic-party affiliation, DEMCRAT and P_CLINTON92, which you should permit to range over suitable values or set to sensible prototypical values.

<Submit your plot>.
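One way to draw fitted trend lines for prototypical individuals is to plot the inverse-logit of the fitted linear predictor with twoway function.  The sketch below assumes hypothetical coefficient values held in local macros (b0, bD, bP, bI are my own names); they are not estimates from the ABORTION data.  The plotting range of 20 to 80 is likewise an assumption, to be replaced by the observed range of P_CLINTON92.

```stata
* Hypothetical values (NOT estimates from the ABORTION data).
local b0 = -6      // intercept plus covariates held at their sample means
local bD = 1.5     // coefficient on DEMCRAT
local bP = 0.05    // coefficient on P_CLINTON92
local bI = 0.04    // coefficient on IPAbyDPA

* Fitted probability = invlogit(linear predictor), one curve per prototype.
twoway (function y = invlogit(`b0' + `bP'*x), range(20 80))               ///
       (function y = invlogit(`b0' + `bD' + (`bP' + `bI')*x),             ///
            range(20 80) lpattern(dash)),                                 ///
       ylabel(0(.2)1)                                                     ///
       ytitle("Fitted probability of supporting abortion rights")        ///
       xtitle("Percentage of district vote for Clinton, 1992")           ///
       legend(order(1 "Republican" 2 "Democrat"))
```

Folding the covariate means into the intercept keeps each curve a function of P_CLINTON92 alone.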

8. As suggested by the figure you have plotted immediately above, the presence of the statistically significant two-way interaction term, IPAbyDPA, in final model M7 has dramatic implications for the substantive story.  Its presence implies that the relationship between a member of congress’ support for abortion rights (A_SUPP) and the level of prior support for the Democratic-party in their district (P_CLINTON92) will differ by their own political affiliation (DEMCRAT).  Explore this difference, as follows:

a) In model M7, separately for a prototypical Democratic and for a prototypical Republican member of congress, use the STATA test command to test the null hypothesis that the relationship between the member of congress’s support for abortion rights (A_SUPP) and the prior support for the Democratic-party in their district (P_CLINTON92) is zero, in the population.  Provide a full account of your test.

b) Working from the fitted model and showing your working, summarize the fitted relationship between the member of congress’s support for abortion rights (A_SUPP) and prior support for the Democratic-party in their district (P_CLINTON92).  Do this separately for a prototypical Democratic and for a Republican member of congress.  For each, express the relationship between A_SUPP and P_CLINTON92 in terms of: (i) the fitted “slope” parameter estimates, and (ii) the corresponding fitted odds-ratios.  In the latter, construct your odds-ratio estimates to contrast the odds of abortion support (vs. no support) for prototypical members whose prior district support for the Democratic party differs by ten percentage points.

c) Bring the results of these analyses together in a sentence or two. 

<<Submit your responses to (a), (b) and (c)>>
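The algebra behind these conditional slopes can be written out as follows.  The coefficient labels β0 through β5 are my own notation for the parameters of model M7, assumed to be ordered as the predictors are listed in Question 6.

```latex
\begin{align*}
\operatorname{logit}(\hat{p})
  &= \hat\beta_0 + \hat\beta_1\,\text{D\_INC} + \hat\beta_2\,\text{MALE}
   + \hat\beta_3\,\text{DEMCRAT} + \hat\beta_4\,\text{P\_CLINTON92}
   + \hat\beta_5\,(\text{DEMCRAT}\times\text{P\_CLINTON92}) \\[4pt]
\text{slope on P\_CLINTON92}
  &= \hat\beta_4 + \hat\beta_5\,\text{DEMCRAT} \\[4pt]
\text{Republican } (\text{DEMCRAT}=0):\quad
  \text{slope} &= \hat\beta_4,
  \qquad \widehat{\text{OR}}_{10\text{ points}} = e^{10\hat\beta_4} \\[4pt]
\text{Democrat } (\text{DEMCRAT}=1):\quad
  \text{slope} &= \hat\beta_4 + \hat\beta_5,
  \qquad \widehat{\text{OR}}_{10\text{ points}} = e^{10(\hat\beta_4+\hat\beta_5)}
\end{align*}
```

The factor of 10 in each exponent converts the per-point slope into a contrast between districts that differ by ten percentage points.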

9. Equivalently, the presence of the statistically significant two-way interaction term, IPAbyDPA, in final model M7 also implies that the relationship between a member of congress’ support for abortion rights (A_SUPP) and their own political affiliation (DEMCRAT) will differ by the level of prior support for the Democratic-party in their district (P_CLINTON92).  Explore this difference in two ways, as follows:

a) In model M7, separately for a member of congress elected in a district with a weak prior Democratic-party affiliation (P_CLINTON92=25%) and one who was elected in a district with a strong prior Democratic-party affiliation (P_CLINTON92=50%), use the STATA test command to test the null hypothesis that the relationship between the member of congress’s support for abortion rights (A_SUPP) and their own political affiliation (DEMCRAT) is zero, in the population.  Provide a full account of your test.

b) Again, working from the fitted model and showing your working, examine the fitted relationship between the member of congress’s support for abortion rights (A_SUPP) and his or her own individual political affiliation (DEMCRAT).  Do this separately for a member of congress elected in a district with a weak prior Democratic-party affiliation (P_CLINTON92=25%) and one who was elected in a district with a strong prior Democratic-party affiliation (P_CLINTON92=50%).  For each, express the relationship between A_SUPP and DEMCRAT in terms of: (i) the fitted “slope” parameter estimates, and (ii) corresponding fitted odds-ratios.  

c) Bring the results of these analyses together in a sentence or two. 

<<Submit your responses to (a), (b) and (c)>>
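When an interaction is present, the effect of DEMCRAT at a chosen value of P_CLINTON92 is a linear combination of two coefficients, which the test and lincom commands can evaluate directly.  The sketch below assumes simulated values stored under the real variable names, so its results are not those of the ABORTION data; the scalars or_25 and or_50 are illustrative names.  The same pattern applies to the slope on P_CLINTON92 for Democrats, via P_CLINTON92 + IPAbyDPA.

```stata
* Simulated stand-in values (NOT the ABORTION data), for illustration only.
clear
set seed 8440
set obs 400
gen D_INC       = rnormal(30, 8)
gen MALE        = runiform() < 0.9
gen DEMCRAT     = runiform() < 0.6
gen P_CLINTON92 = 20 + 60*runiform()
gen A_SUPP      = runiform() < invlogit(-5 + 2*DEMCRAT + 0.07*P_CLINTON92)
gen IPAbyDPA    = DEMCRAT*P_CLINTON92

logit A_SUPP D_INC MALE DEMCRAT P_CLINTON92 IPAbyDPA

* Effect of DEMCRAT where P_CLINTON92 = 25: b[DEMCRAT] + 25*b[IPAbyDPA].
test DEMCRAT + 25*IPAbyDPA = 0
lincom DEMCRAT + 25*IPAbyDPA
scalar or_25 = exp(r(estimate))

* Effect of DEMCRAT where P_CLINTON92 = 50: b[DEMCRAT] + 50*b[IPAbyDPA].
test DEMCRAT + 50*IPAbyDPA = 0
lincom DEMCRAT + 50*IPAbyDPA
scalar or_50 = exp(r(estimate))

display "Odds-ratio for DEMCRAT at 25% = " %8.3f or_25
display "Odds-ratio for DEMCRAT at 50% = " %8.3f or_50
```

Adding the or option to lincom reports the exponentiated combination directly, in place of the manual exp() step.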

10. Post your Stata “do” file to the Assignment #1 Folder on the course website.

A Final Word

This is a collaborative effort among partners, and so you must be careful to abide by Boston College policies on plagiarism.  If someone makes a contribution to your work, explicitly or implicitly, you cannot be accused of plagiarism if you recognize his or her contribution explicitly in your manuscript.

Thus, I ask you not only to make sure that both partners’ names are readable on each page of your memo, but that you also provide cites at suitable points in your memo, in which you identify contributions from anyone other than your partner (whether enrolled in the class, or not).  There is no penalty for seeking help from anyone, nor for giving help to anyone, providing that help is recognized explicitly.  In fact, I encourage the formation of larger teams to discuss the work, so long as partners then fall back on their own pairings to produce their joint memo and they reveal the explicit authorship of any content that has been shared among groups and partners.  I am particularly concerned that larger groups may craft joint language that they then publish into their own respective collaborations unthinkingly and in doing so appear to have copied from each other.  Recognizing the full authorship explicitly will prevent us from drawing this incorrect conclusion.

The same principle applies if you borrow language from my presentations.  Simply note the source briefly in your work (e.g. “Dougherty, Class Slides #3”), so that – if two groups model their work on the same presentation content – we cannot accuse you of plagiarism.

Finally, please remember that this is not a competition amongst class members.  Every part of everyone’s work will be given a grade that assesses its quality, and not placed on some curve.  I hope this means that the class will function integrally as a team, whose members help each other when requested.