STA304 - Assignment 3
Predicting Overall Popular Vote of The Liberal Party in the Next
Federal Election in Canada.
November 5, 2021
0.1 Introduction
The next Canadian federal election will be held soon (Pammett and Dornan 2016). The outcome is of interest
to all citizens and residents of Canada. The federal election is a countrywide election across 10 provinces
and 3 territories to elect members of the federal government of Canada (Pammett and Dornan 2016). In
this analysis we examined individual-level survey data and post-stratified on census data to predict the overall
popular vote of the Liberal Party of Canada (also known as the Liberals) in this election (Pammett and Dornan
2016). The Liberal Party of Canada is the oldest and longest-serving active federal political party in Canada
(Jeffrey 2010). The party has dominated federal politics for much of Canada’s history (Clarkson
2014). The Liberals held power for almost 60 years of the 20th century (Clarkson 2014). The party supports
the ideologies of liberalism and generally sits at the centre to centre-left of the Western political spectrum
(Jeffrey 2010).
The research question is whether the Liberals would get about the same popular vote in the next federal election as
they did in the last election, which was 33% (Raynauld, Turcotte, and Gillies 2021). Therefore, our research
hypothesis is that the Liberals are going to get 33% of the popular vote in the next federal election. The popular vote
of a political party is the total number of votes cast for it (Raynauld, Turcotte, and Gillies 2021). As such,
the aim of this analysis is to predict the percentage of total votes the Liberals are going to get in the next federal
election. We chose to use multilevel regression with post-stratification (MRP) as the statistical method. The outcome
variable we were interested in was whether a voter would vote for Liberals, which is a binary outcome
(Downes et al. 2018).
We first fit a multivariable, multilevel logistic regression model to predict our outcome variable using a
few demographic variables. Then, we post-stratified using the variables in the logistic
regression model, assigning individuals to cells based on combinations of these
variables. We then used the logistic regression model to predict the probability of voting for Liberals
for each cell. Next, we combined the predicted probabilities of all cells to compute the Liberals’
overall popular vote. The survey dataset was the Canadian Election Study (CES) 2019
- Phone Survey and the census dataset was the 2017 General Social Survey (GSS) on the Family (Canada
2020). Finally, we compared this post-stratified prediction of the popular vote with the hypothesized value
of 33%.
The Data section provides numerical, textual, and graphical descriptions of the census and survey datasets
and the important variables in them. The Methods section covers the statistical methods and analysis techniques
used in this study. The Results section presents and explains our analysis results. The Conclusion section
closes the study with a summary of the main findings and a discussion of the
overall study and analysis.
0.2 Data
Census data
The census dataset was retrieved from the 2017 General Social Survey (GSS) on the Family. The 2017 GSS,
conducted from February 2, 2017 to November 30, 2017, is a sample survey with a cross-sectional design
(Canada 2020). The target population comprised all non-institutionalized persons over 15 years of age living
in the 10 provinces of Canada (Canada 2020). The survey uses a new sampling frame, created in
2013, that combines telephone numbers with Statistics Canada’s Address Register, and collects data
by phone (landline and cell) (Canada 2020).
The important role family plays in people’s lives cannot be disputed. Today’s families, however, must
navigate changing marital, family, and professional trajectories. While our understanding of families in
Canada has improved considerably over the past few years, the future of families remains a topic of great
interest as families become more diverse. The GSS on the Family is intended to inform researchers
about the different types and characteristics of families in Canada and so enhance our understanding of them
(Canada 2020).
The survey collected a large amount of data for each respondent, as well as related information about
each member of the respondent’s household. The response rate was 52.4%, which we consider sufficient for the
sample to be representative of the target population (Canada 2020).
Survey data
The survey data was retrieved from the Canadian Election Study (CES) 2019 - Phone Survey (Stephenson
et al. 2020). There were 2 stages of data collection as part of this survey. During the last Canadian federal
election campaign, held in 2019, telephone interviews were conducted with Canadian citizens over
18 years old (Stephenson et al. 2020). Respondents to this survey were contacted by phone and later
interviewed (Stephenson et al. 2020). The survey included questions about the respondent’s
demographic characteristics, their perspectives on Canadian politics, their opinions of different political parties in
Canada, their voting records, and which party they wanted to vote for in the federal election (Stephenson et
al. 2020).
Data cleaning
We processed data for the variables we chose to use in the MRP analysis. The variables were age, sex, place
of birth, marriage history and province. We rounded age to the nearest integer in both the census
and the survey datasets. We only retained males and females in both datasets since only one person did not
identify themselves as male or female. We used 2 categories for place of birth, born in Canada
and born outside Canada, because the majority of respondents in the two datasets were born
in Canada. In both datasets, we categorized marriage history as ever married or not (2 categories). This
concisely represents an individual’s marriage history. Individuals who were separated, divorced,
common-law or widowed were treated as never married since we treated marriage as a legal, official status.
Both datasets had 10 provinces, so no data cleaning was required for this variable. The response variable is whether
the respondent would vote for Liberals, so we made it a binary variable that takes the value “1” if the respondent
would vote for Liberals and “0” if not.
Below we provide a detailed description of the selected variables.
• Age is a numerical variable that records the age of the respondent.
• Sex is a binary variable that records if the respondent is male or female.
• Place of birth is a binary variable that records if the respondent was born in or outside of Canada.
• Marriage history is a binary variable that records if the respondent was ever married.
• Province is a categorical variable with 10 levels, one for each of the 10 provinces of Canada.
• The response variable is a binary variable that takes the value “1” if the respondent would vote for
Liberals and “0” if not.
For modeling, we removed variables not listed above from both datasets. We also retained only observations
with no missing values for the above variables, since we did not want to introduce any bias that
imputation of missing values might induce.
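The recoding rules described in this subsection can be sketched as follows. This is a minimal Python illustration rather than the code used for the analysis (which was written in R). The field names (`age`, `sex`, `birthplace`, `marital_status`, `province`, `vote_choice`) and the raw category labels are hypothetical stand-ins for the actual CES and GSS codings.

```python
from typing import Optional

# Hypothetical field names; the real CES/GSS variable names differ.
REQUIRED = ("age", "sex", "birthplace", "marital_status", "province", "vote_choice")

def clean_record(raw: dict) -> Optional[dict]:
    """Recode one raw survey record, or return None if it is dropped."""
    # Drop observations with any missing value (no imputation).
    if any(raw.get(key) is None for key in REQUIRED):
        return None
    # Retain only respondents recorded as male or female.
    if raw["sex"] not in ("Male", "Female"):
        return None
    return {
        "age": round(raw["age"]),  # nearest integer
        "male": 1 if raw["sex"] == "Male" else 0,
        "born_outside": 0 if raw["birthplace"] == "Canada" else 1,
        # Separated, divorced, common-law and widowed are coded 0,
        # following the rule stated in the text.
        "ever_married": 1 if raw["marital_status"] == "Married" else 0,
        "province": raw["province"],
        "vote_liberal": 1 if raw["vote_choice"] == "Liberal" else 0,
    }
```

The census records would be recoded the same way, except that they carry no vote choice, so the response variable exists only in the survey data.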
Data summaries
Table 1 shows the proportions of males and females in the survey and the census datasets.
The proportion of males in the survey dataset is 0.575 and in the census dataset is 0.456. The proportion
of females in the survey dataset is 0.425 and in the census dataset is 0.544. The proportions of males and
females in the survey dataset are noticeably different from those in the census dataset.
Table 1: Proportions of males and females in the survey and the census datasets
Dataset   Male proportion   Female proportion
Survey    0.575             0.425
Census    0.456             0.544
Table 2 shows the proportions of respondents born in and outside Canada in the survey and the census
datasets.
The proportion of people born in Canada in the survey dataset is 0.858 and in the census dataset is 0.800.
The proportion of people born outside of Canada in the survey dataset is 0.142 and in the census dataset is
0.200. The difference in the proportions for place of birth is small between the two datasets.
Table 2: Proportions of place of birth in the survey and the census datasets
Dataset   Born in Canada proportion   Born outside Canada proportion
Survey    0.858                       0.142
Census    0.800                       0.200
Table 3 shows the proportions of marriage history in the survey and the census datasets.
The proportion of people who were ever married in the survey dataset is 0.691 and in the census dataset is
0.697. The proportion of people who were never married in the survey dataset is 0.309 and in the census
dataset is 0.303. The proportions for marriage history are nearly the same in the two datasets.
Table 3: Proportions of marriage history for the survey and the census datasets
Dataset   Ever married proportion   Never married proportion
Survey    0.691                     0.309
Census    0.697                     0.303
Figure 1 displays the age distributions in the survey and the census datasets. The age distributions are
very similar, except that the survey dataset includes respondents over 80 years old whereas the census
dataset does not. Both age distributions are close to uniform, close to symmetric, and not
multimodal.
Figure 2 presents the popular vote for Liberals in the survey dataset. We observe that almost 25% of the
respondents replied that they would vote for Liberals.
[Figure: two histograms; x-axis: Age, y-axis: Frequency]
Figure 1: Distribution of age in the census dataset (left) and the survey dataset (right).
[Figure: bar chart; x-axis: No / Yes, y-axis: Percentage]
Figure 2: Distribution of the sample popular vote for Liberals in the survey dataset.
0.3 Methods
0.3.1 Model Specifics
We employed a multilevel logistic regression model to predict our outcome: whether one would vote for Liberals in
the next federal election. Logistic regression is a classification model (Wright 1995). It is useful
for predicting a binary outcome based on a set of predictor variables (Wright 1995). A binary outcome
has only two possible cases: either the event occurs (1) or it does not occur (0). Predictor variables are
those variables that might affect the outcome (Wright 1995). Logistic regression is appropriate in our situation since
the aforementioned outcome is binary. The predictor variables in our model were age, sex, marriage history, and
place of birth. Together, these variables capture most of the different groups and subgroups
within the Canadian population (Hosmer, Lemeshow, and Sturdivant 2000). Using combinations of these
variables we were able to integrate the political opinions of these different groups and, more broadly, of
the entire Canadian population. The model provides the log odds of the outcome. The odds of an event are
the probability of the event occurring divided by the probability of the event not occurring. The regression
coefficient of a predictor quantifies the change in log odds when the predictor changes by one unit (Hosmer,
Lemeshow, and Sturdivant 2000). For post-stratification, we convert the log odds estimated by the logistic
regression model into probabilities of voting for Liberals.
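The relationship between probability, odds, and log odds described above can be sketched in a few lines. This is an illustrative Python snippet, not part of the original R analysis; the function names are our own.

```python
import math

def odds(p: float) -> float:
    """Odds of an event: P(event) / P(no event)."""
    return p / (1.0 - p)

def logit(p: float) -> float:
    """Log odds, the scale on which logistic regression is linear."""
    return math.log(odds(p))

def inv_logit(eta: float) -> float:
    """Convert log odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# A probability of 0.25 corresponds to odds of 1 to 3.
print(odds(0.25))
# Converting to log odds and back recovers the probability.
print(inv_logit(logit(0.25)))
```

The `inv_logit` step is the one needed for post-stratification, since cell-level predictions must be on the probability scale before they are averaged.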
Multilevel models are statistical models whose parameters vary at more than 1 level, most often an
individual level and a group level (McCulloch and Neuhaus 2005). Multilevel models are particularly appropriate
for research designs in which data for subjects are organized at more than 1 level. The units of analysis are
often individuals (lower level) who are nested within groups (higher level) (Demidenko 2013). The random
intercept model is the most widely used type of multilevel model (McCulloch and Neuhaus 2005). In a random
intercept model, intercepts are assumed to vary across groups, so the outcome for
each individual is predicted by the intercept of the group the individual belongs to together with
individual-level predictors (Demidenko 2013). In our analysis, respondents are in Canada and naturally grouped
by the province in which they are located. Thus, each province was modeled to have its own random
intercept that is shared by all individuals located in that province.
Model summaries were examined to determine if each predictor is statistically significant in predicting
whether an individual is going to vote for Liberals.
Here we show the equation of the multilevel logistic regression model:
$$\log\left(\frac{p}{1-p}\right) = \beta_{0j} + \beta_1 X_{\text{male}} + \beta_2 X_{\text{born outside Canada}} + \beta_3 X_{\text{ever married}} + \beta_4 X_{\text{age}}$$
$$\beta_{0j} = r_{00} + r_{01} W_j + \mu_{0j}$$
where
• $p$ is the probability of voting for Liberals.
• $X_{\text{male}}$ = 1 if the respondent is male and = 0 if the respondent is female.
• $X_{\text{born outside Canada}}$ = 1 if the respondent was born outside Canada and = 0 if the respondent was born
in Canada.
• $X_{\text{ever married}}$ = 1 if the respondent was ever married and = 0 if the respondent was never married.
• $X_{\text{age}}$ is the age of the respondent in years.
• $\beta_1$ is the difference in log odds of voting for Liberals for males versus females.
• $\beta_2$ is the difference in log odds of voting for Liberals for those born outside Canada versus those born
in Canada.
• $\beta_3$ is the difference in log odds of voting for Liberals for those who were ever married versus those who
were never married.
• $\beta_4$ is the change in log odds of voting for Liberals for a one-year increase in age.
• $\beta_{0j}$ is the random intercept for the $j$th province.
• $r_{00}$ is the overall intercept across provinces and $r_{01}$ is the coefficient of $W_j$.
• $W_j$ = 1 if the respondent was located in the $j$th province.
• $\mu_{0j}$ is the statistical noise in the random intercept of the $j$th province.
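To make the model concrete, the sketch below evaluates the linear predictor for a single individual and converts it to a probability. The fixed-effect values are the estimates reported in Table 4, including the age term that appears there. The province-level deviations are hypothetical, since the report does not list the estimated random intercepts. The code is Python for illustration only; the analysis itself was done in R.

```python
import math

# Fixed-effect estimates as reported in Table 4.
COEF = {
    "intercept": -1.597,
    "age": 0.010,
    "male": -0.157,
    "born_outside": 0.567,
    "ever_married": -0.249,
}

# HYPOTHETICAL province deviations from the overall intercept;
# the report does not give the estimated random intercepts.
PROVINCE_OFFSET = {"Ontario": 0.20, "Alberta": -0.40}

def predict_prob(age, male, born_outside, ever_married, province):
    """Predicted probability of voting Liberal under a random-intercept logit."""
    eta = (
        COEF["intercept"]
        + PROVINCE_OFFSET.get(province, 0.0)  # 0 if province has no offset
        + COEF["age"] * age
        + COEF["male"] * male
        + COEF["born_outside"] * born_outside
        + COEF["ever_married"] * ever_married
    )
    return 1.0 / (1.0 + math.exp(-eta))

# A 40-year-old, never-married woman born in Canada, in two provinces.
print(predict_prob(40, 0, 0, 0, "Ontario"))
print(predict_prob(40, 0, 0, 0, "Alberta"))
```

Two individuals with identical covariates receive different predictions only through the province offset, which is what the random intercept expresses.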
0.3.2 Post-Stratification
Post-stratification is a widely used method in sampling and survey analysis for combining population
distributions of variables with survey estimates (Buttice and Highton 2013). The basic technique splits
the sample into cells according to combinations of different variables (each distinct combination forms a
cell), and calculates a post-stratification estimate as a weighted combination of the estimates for each cell (Holt
and Smith 1979). Commonly estimated quantities include means, proportions and totals. When the cell estimates are
obtained from a multilevel regression model, which is often done, the technique becomes multilevel regression
and post-stratification (MRP) (Buttice and Highton 2013).
Post-stratification is appropriate when the distributions of particular variables in the sample do not resemble
those in the underlying population (Holt and Smith 1979). This is often the case when mapping
a sub-countrywide or smaller-scale survey to a nationwide or large-scale census, as we did when mapping
the CES 2019 phone survey to the 2017 General Social Survey (census). We found large differences
between the proportions of males and females in the survey and the census datasets. Additionally, the distributions
of province and place of birth are slightly different. In the presence of such differences, post-stratification is a
suitable method.
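The effect of the sex imbalance can be illustrated with the proportions from Table 1. The Python sketch below computes the implied weight for each sex (census share divided by survey share) and compares an unadjusted estimate with one reweighted to the census shares. The support rates by sex are hypothetical, chosen only to show the mechanics; they are not results from this study.

```python
# Shares of males and females, taken from Table 1.
survey_share = {"male": 0.575, "female": 0.425}
census_share = {"male": 0.456, "female": 0.544}

# Implied weight per respondent: above 1 means under-represented in the survey.
weights = {s: census_share[s] / survey_share[s] for s in survey_share}

# HYPOTHETICAL support rates by sex, for illustration only.
support = {"male": 0.24, "female": 0.27}

# Unadjusted estimate uses the survey's own composition.
raw_estimate = sum(survey_share[s] * support[s] for s in support)
# Adjusted estimate uses the census composition instead.
adjusted_estimate = sum(census_share[s] * support[s] for s in support)

print(weights)
print(raw_estimate, adjusted_estimate)
```

Because females are under-represented in the survey, any difference in support between the sexes shifts the estimate once the census composition is used.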
First, we divided individuals in the census dataset into cells. The cells were created from distinct
combinations of age, sex, place of birth, marriage history, and province. Next, we predicted the probability
of voting for Liberals in each cell with our multilevel logistic regression model. Finally, we combined the
estimated probabilities into one aggregate, population-wide probability of voting for Liberals, analogous to
the overall popular vote for Liberals, using the formula:
$$\hat{y}^{PS} = \frac{\sum_j N_j \hat{y}_j}{\sum_j N_j}$$
where
• $N_j$ is the total number of individuals in the $j$th cell
• $\hat{y}_j$ is the predicted probability of voting for Liberals for the $j$th cell
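The post-stratification formula is a cell-size-weighted average of the cell predictions. A minimal Python sketch follows; the three cells and their values are hypothetical and serve only to show the arithmetic.

```python
def poststratify(cells):
    """Weighted average of cell predictions.

    cells: list of (N_j, yhat_j) pairs, where N_j is the census count in
    cell j and yhat_j is the predicted probability for that cell.
    """
    cells = list(cells)
    total = sum(n for n, _ in cells)
    if total == 0:
        raise ValueError("cells must contain at least one individual")
    return sum(n * y for n, y in cells) / total

# HYPOTHETICAL cells: (census count, predicted probability).
toy_cells = [(120, 0.20), (80, 0.35), (200, 0.25)]
print(poststratify(toy_cells))  # (24 + 28 + 50) / 400 = 0.255
```

Larger cells pull the estimate toward their own prediction, which is how the census composition corrects the survey-based model output.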
All analysis for this report was programmed using R version 4.1.1.
0.4 Results
Table 4 contains the multilevel logistic regression model summary for predicting whether an individual would vote
for Liberals. The table reports regression coefficient estimates, their standard errors and p-values. Age, place
of birth and marriage history are statistically significant at the 5% level in predicting whether an individual is
going to vote for Liberals, while sex is not (p = 0.084). For a one-year increase in age, the log odds of voting for
Liberals increase by 0.010. This means that older individuals are more inclined to vote for Liberals. Individuals
who were born outside of Canada had 0.567 higher log odds of voting for Liberals compared to individuals who were
born in Canada. This implies that individuals who were born outside Canada were more likely to vote for Liberals
than individuals who were born in Canada. The log odds of voting for Liberals for individuals who were ever
married were 0.249 lower than for individuals who were never married. This implies that individuals who were ever
married were less likely to vote for Liberals than individuals who were never married. The results do not
surprise us since the Liberals have been more supportive of the less wealthy (those who were never married were
more likely to have a lower household income than those who were married), immigrants and refugees (those
who were born outside Canada) and the elderly by giving them better financial and healthcare support
(Wilson 2011).
Table 4: Multilevel logit regression model summary.
Term           Estimate   Standard Error   P value
(Intercept)    -1.597     0.216            0.000
age             0.010     0.003            0.001
sexMale        -0.157     0.091            0.084
birthOutside    0.567     0.122            0.000
marriageYes    -0.249     0.110            0.024
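The entries of Table 4 can be cross-checked and put on the odds scale with a short calculation. The Python sketch below assumes the reported p-values come from two-sided Wald z-tests (estimate divided by standard error), which is a common default for such models but is not stated in the report. Because the table values are rounded, recomputed p-values match only approximately.

```python
import math

# (estimate, standard error) pairs as reported in Table 4.
TABLE4 = {
    "(Intercept)": (-1.597, 0.216),
    "age": (0.010, 0.003),
    "sexMale": (-0.157, 0.091),
    "birthOutside": (0.567, 0.122),
    "marriageYes": (-0.249, 0.110),
}

def wald_p_value(estimate: float, std_error: float) -> float:
    """Two-sided p-value for a Wald z-test (ASSUMED test type)."""
    z = estimate / std_error
    return math.erfc(abs(z) / math.sqrt(2.0))

def odds_ratio(estimate: float) -> float:
    """Multiplicative change in odds for a one-unit change in the predictor."""
    return math.exp(estimate)

for term, (est, se) in TABLE4.items():
    print(term, round(wald_p_value(est, se), 3), round(odds_ratio(est), 3))
```

On the odds scale, a coefficient of 0.567 for `birthOutside` corresponds to odds of voting for Liberals that are roughly 1.76 times those of respondents born in Canada, holding the other predictors fixed.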
The regression and post-stratification estimate of the overall popular vote for Liberals in the next Canadian
federal election is 0.258. This result is reasonable and not surprising. First, the sample size of the
census dataset is much smaller than the voting population of Canada: about 66% of Canada’s 27 million
registered voters voted in the 2019 federal election (Raynauld, Turcotte, and Gillies 2021). Second,
indicating that one would vote for Liberals does not necessarily imply that one is actually going to vote for
Liberals; preferences could have changed.
The post-stratification estimate of the overall popular vote of Liberals is 25.8%, which is 7.2 percentage points
lower than the hypothesized value of 33.0%. However, this estimate does directly address the research question of
interest and attains the goal of the analysis. We aimed to predict the overall popular vote of Liberals and we
obtained an estimate through MRP. We also addressed the hypothesis by comparing our estimate to our hypothesized
value. Overall, our results were useful for answering the research question.