Bayesian Data Analysis, 2022/2023, Semester 2 Assignment 2
IMPORTANT INFORMATION ABOUT THE ASSIGNMENT
In this paragraph, we summarize the essential information about this assignment. The format
and rules for this assignment are different from your other courses, so please pay attention.
1) Deadline: The deadline for submitting your solutions to this assignment is 17 April, 12:00 noon Edinburgh time.
2) Format: You will need to submit your work as 2 components: a PDF report, and your R Markdown (.Rmd) notebook. There will be two separate submission systems on Learn: Gradescope for the report in PDF format, and a Learn assignment for the code in Rmd format. You need to write your solutions into this R Markdown notebook (code in R chunks and explanations in Markdown chunks), and then select Knit/Knit to PDF in RStudio to create a PDF report.
The compiled PDF needs to contain everything in this notebook, with your code sections
clearly visible (not hidden), and the output of your code included. Reports without the code
displayed in the PDF, or without the output of your code included in the PDF will be marked as 0, with the only feedback “Report did not meet submission requirements”.
You need to upload this PDF to the Gradescope submission system, and your Rmd file to the Learn
assignment submission system. You will be required to tag every sub-question on Gradescope.
Some key points that are different from other courses:
a) Your report needs to contain a written explanation for each question that you solve, and some numbers or plots showing your results. Solutions without a written explanation that clearly demonstrates that you understand what you are doing will be marked as 0, irrespective of whether the numerics are correct.
b) It must be possible to run your code for all questions using Run All in RStudio, and the code must reproduce all of the numerics and plots in your report (up to some small randomness due to the stochasticity of Monte Carlo simulations). Parts of the report containing material that is not reproduced by the code will not be marked (i.e. the score will be 0), and the only feedback in this case will be that the results are not reproducible from the code.
c) Multiple submissions are allowed BEFORE THE DEADLINE for both the report and the code.
However, multiple submissions are NOT ALLOWED AFTER THE DEADLINE.
YOU WILL NOT BE ABLE TO MAKE ANY CHANGES TO YOUR SUBMISSION AFTER THE DEADLINE.
Nevertheless, if you did not submit anything before the deadline, then you can still submit
your work after the deadline, but late penalties will apply. The timing of the late penalties will be determined by the time you have submitted BOTH the report, and the code (i.e. whichever was submitted later counts).
We illustrate these rules by some examples:
Alice has spent a lot of time and effort on her assignment for BDA. Unfortunately, before
submission, she has accidentally introduced a typo in her code in the first question, and it did
not run using Run All in RStudio. - Alice will get 0 for the questions that do not run in her
code (we will try to run each code block individually), with the only feedback “Results are not reproducible from the code”.
Bob has spent a lot of time and effort on his assignment for BDA. Unfortunately he forgot to
submit his code. - Bob will get no personal reminder to submit his code. Bob will get 0 for
the whole assignment, with the only feedback “Results are not reproducible from the code, as
the code was not submitted.”
Charles has spent a lot of time and effort on his assignment for BDA. He has submitted both
his code and report in the correct formats. However, he did not include any explanations in the report. - Charles will get 0 for the whole assignment, with the only feedback “Explanation is missing.”
Denise has spent a lot of time and effort on her assignment for BDA. She has submitted
her report in the correct format, but thought that she could include her code as a link in the
report, and upload it online (such as GitHub, or Dropbox). - Denise will get 0 for the whole assignment, with the only feedback “Code was not uploaded on Learn.”
3) Group work: This is an INDIVIDUAL ASSIGNMENT, like a 2-week exam for the course. Communication between students about the assignment questions is not permitted. Students who submit work that has not been done individually will be reported for Academic Misconduct, which can lead to serious consequences. Each problem will be marked by a single instructor, so we will be able to spot students who copy.
4) Piazza: During the periods of the assignments, the instructor will change Piazza to allow messaging the instructors only, i.e. students will not see each other's messages and replies. Only questions regarding clarification of the statement of the problems will be answered by the instructors. The instructors will not give you any information related to the solution of the problems; such questions will simply be answered with “This is not about the statement of the problem so we cannot answer your question.”
THE INSTRUCTORS ARE NOT GOING TO DEBUG YOUR CODE, AND YOU ARE ASSESSED ON YOUR ABILITY TO RESOLVE ANY CODING OR TECHNICAL DIFFICULTIES THAT YOU ENCOUNTER ON YOUR OWN.
5) Office hours: There will be two office hours per week (Mondays 14:00-15:00, and Wednesdays
15:00-16:00) during the 2 weeks for this assignment. The links are available on Learn / Course Information. I will be happy to discuss the course/workshop materials. However, I will only answer questions about the assignment that require clarifying the statement of the problems, and will not give you any information about the solutions. Students who ask for feedback on their assignment solutions during office hours will be removed from the meeting.
6) Late submissions and extensions: NO EXTENSIONS ARE ALLOWED FOR THIS ASSIGNMENT, AND THERE IS NO SUCH OPTION PROVIDED IN THE ESC SYSTEM. Students who have existing Learning Adjustments in Euclid will be allowed to have the same adjustments applied to this course as well, but they need to apply for this BEFORE THE DEADLINE on the website
https://www.ed.ac.uk/student-administration/extensions-special-circumstances
by clicking on “Access your learning adjustment”. This will be approved automatically.
Students who submit their work late will have late submission penalties applied by the ESC
team automatically (this means that even if you are 1 second late because your internet connection was slow, the penalties will still apply). The penalty is 5% of the total mark deducted for every day of delay started (i.e. one minute of delay counts as 1 day). The course instructors do not have any role in setting these penalties, and we will not be able to change them.
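The day-counting rule above can be made concrete with a small sketch. The function name `late_penalty_pct` is hypothetical, the deadline year 2023 is inferred from the 2022/2023 Semester 2 heading, and the sketch assumes the penalty is measured in percentage points of the total mark with no cap, none of which the text above states explicitly.

```r
# Hypothetical helper: late penalty in percentage points of the total mark.
# Assumptions: every started 24-hour period after the deadline costs 5 points,
# and no cap is applied (the brief does not mention one).
late_penalty_pct <- function(deadline, submitted, pct_per_day = 5) {
  delay_secs <- as.numeric(difftime(submitted, deadline, units = "secs"))
  if (delay_secs <= 0) return(0)               # on time: no penalty
  days_started <- ceiling(delay_secs / 86400)  # a started day counts in full
  pct_per_day * days_started
}

# Assumed deadline: 17 April 2023, 12:00 noon Edinburgh time.
deadline <- as.POSIXct("2023-04-17 12:00:00", tz = "Europe/London")

late_penalty_pct(deadline, deadline + 1)           # 1 second late
late_penalty_pct(deadline, deadline + 86400 + 60)  # 1 day and 1 minute late
```

Under these assumptions a submission that is one second late is treated the same as one that is 23 hours late.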
7) Please make sure to tag all pages in your submission on Gradescope, otherwise we may miss some of your work. Once your upload is complete, tagging does not count towards your submission time (i.e. you won’t get any late penalties for doing it).
rm(list = ls(all = TRUE))
#Do not delete this!
#It clears all variables to ensure reproducibility
Problem 1
In this problem, we study a dataset about car insurance. This data set is based on one-year vehicle insurance policies taken out in 2004 or 2005. In total, there are 67856 policies, of which 4624 have claims.
require(insuranceData)
## Loading required package: insuranceData
data(dataCar)
#You may need to set the working directory first before loading the dataset
#setwd("location of Assignment 1")
#The first 6 rows of the dataframe
print.data.frame(dataCar[1:6, ])
##   veh_value  exposure clm numclaims claimcst0 veh_body veh_age gender area
## 1      1.06 0.3039014   0         0         0    HBACK       3      F    C
## 2      1.03 0.6488706   0         0         0    HBACK       2      F    A
## 3      3.26 0.5694730   0         0         0      UTE       2      F    E
## 4      4.14 0.3175907   0         0         0    STNWG       2      F    D
## 5      0.72 0.6488706   0         0         0    HBACK       4      F    C
## 6      2.01 0.8542094   0         0         0    HDTOP       3      M    C
##   agecat            X_OBSTAT_
## 1      2 01101    0    0    0
## 2      4 01101    0    0    0
## 3      2 01101    0    0    0
## 4      2 01101    0    0    0
## 5      2 01101    0    0    0
## 6      4 01101    0    0    0
Description of the columns.
veh_value: vehicle value in $10000s
exposure: maximum portion of the vehicle value the insurer may need to pay out in case of an incident
claimcst0: claim amount (0 if no claim)
clm: whether there was a claim during the 1 year duration
numclaims: number of claims during the 1 year duration
veh_body: vehicle body type, one of BUS = bus, CONVT = convertible, COUPE = coupe, HBACK = hatchback,
HDTOP = hardtop, MCARA = motorized caravan, MIBUS = minibus, PANVN = panel van,
RDSTR = roadster, SEDAN = sedan, STNWG = station wagon, TRUCK = truck, UTE = utility
gender: F = female, M = male
area: a factor with levels A, B, C, D, E, F
agecat: age category, 1 (youngest), 2, 3, 4, 5, 6
You can use either JAGS, Stan, or INLA for this question.
a)[10 marks] Fit a Bayesian logistic regression model on the dataset dataCar with
● clm as response,
● a link function of your choice,
● using veh_value, exposure, veh_body, veh_age, gender, area, and agecat as covariates (you can use categorical covariates by converting integers to factors if appropriate).
Center and scale the non-categorical covariates.
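As an illustration of the centering and scaling step, here is a minimal sketch on a toy data frame rather than on dataCar itself. The toy values are loosely based on the rows printed earlier, and treating veh_value and exposure as the continuous covariates while converting veh_age to a factor is an assumption for this example, not a requirement of the question.

```r
# Toy stand-in for dataCar so that the sketch runs without the package.
toy <- data.frame(
  veh_value = c(1.06, 1.03, 3.26, 4.14, 0.72, 2.01),
  exposure  = c(0.30, 0.65, 0.57, 0.32, 0.65, 0.85),
  veh_age   = c(3, 2, 2, 2, 4, 3),
  gender    = c("F", "F", "F", "F", "F", "M")
)

# Assumption: veh_value and exposure are the non-categorical covariates;
# veh_age is an integer code and is treated as categorical here.
num_cols <- c("veh_value", "exposure")
toy[num_cols] <- lapply(toy[num_cols], function(x) as.numeric(scale(x)))
toy$veh_age <- factor(toy$veh_age)
toy$gender  <- factor(toy$gender)

# Each scaled column now has mean 0 and standard deviation 1.
round(colMeans(toy[num_cols]), 10)
sapply(toy[num_cols], sd)
```

With standardized covariates, a prior on a slope can be read as the effect of a one-standard-deviation change, which makes prior scales easier to compare across covariates.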
Choose your own prior distributions (do not use default priors), explain the rationale for your prior choices, and ensure that the posterior is not too sensitive to your prior choice [Hint: look at the induced prior on the linear predictor and on the response.]
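One way to read the hint is as a prior predictive simulation: draw coefficients from the prior, push them through the linear predictor and the inverse link, and look at the distribution this induces on the response probability. The sketch below uses base R only. The independent normal priors, the two standard deviations, the number of slopes, and the logit link are all illustrative assumptions, and `induced_prob` and `extreme` are hypothetical helper names.

```r
set.seed(1)

# Illustrative only: independent N(0, sigma^2) priors on an intercept and
# p slopes, with standardized covariates simulated as N(0, 1).
induced_prob <- function(sigma, p = 5, n_draws = 5000) {
  beta0 <- rnorm(n_draws, 0, sigma)
  beta  <- matrix(rnorm(n_draws * p, 0, sigma), n_draws, p)
  x     <- matrix(rnorm(n_draws * p), n_draws, p)
  eta   <- beta0 + rowSums(beta * x)  # induced prior on the linear predictor
  plogis(eta)                         # induced prior on P(clm = 1), logit link
}

p_wide   <- induced_prob(sigma = 10)
p_narrow <- induced_prob(sigma = 1)

# Share of prior mass on near-deterministic probabilities (< 0.01 or > 0.99).
extreme <- function(p) mean(p < 0.01 | p > 0.99)
c(wide = extreme(p_wide), narrow = extreme(p_narrow))
```

A prior that looks vague on the coefficient scale can put most of its mass on probabilities very close to 0 or 1, which is rarely what is intended; a histogram of `p_wide` against `p_narrow` makes the difference visible.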
Compute the posterior means of the model parameters, and discuss the results.
Explanation (min 300 characters in your own words, otherwise -5 marks for insufficient explanation):
b)[10 marks] Fit a Bayesian Poisson regression model on numclaims as response with
● log link function,
● using veh_value, exposure, veh_body, veh_age, gender, area, and agecat as covariates. Center and scale the non-categorical covariates.
Choose your own prior distributions (do not use default priors), explain the rationale for your prior choices, and ensure that the posterior is not too sensitive to your prior choice [Hint: look at the induced prior on the linear predictor and the response.]
Compute the posterior means of the model parameters, and discuss the results.
Explanation (min 300 characters in your own words, otherwise -5 marks for insufficient explanation):
c)[10 marks] Fit a zero-inflated Bayesian Poisson regression model (https://en.wikipedia.org/wiki/Zero-inflated_model) on
● numclaims as response,
● with log link function,
● using veh_value, exposure, veh_body, veh_age, gender, area, and agecat as covariates. Center and scale the non-categorical covariates.
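For reference, a zero-inflated Poisson distribution mixes a point mass at zero (with probability pi0) and an ordinary Poisson(lambda) count; in the regression setting lambda would be exp of the linear predictor under the log link. The sketch below writes out the probability mass function and a simulator in base R. The function names `dzip` and `rzip` and the parameter values are hypothetical, chosen only for illustration.

```r
# Zero-inflated Poisson pmf: with probability pi0 the count is a structural
# zero, otherwise it is drawn from Poisson(lambda).
dzip <- function(y, lambda, pi0) {
  ifelse(y == 0,
         pi0 + (1 - pi0) * dpois(0, lambda),
         (1 - pi0) * dpois(y, lambda))
}

# Simulate from the same mixture.
rzip <- function(n, lambda, pi0) {
  structural_zero <- rbinom(n, 1, pi0)
  ifelse(structural_zero == 1, 0L, rpois(n, lambda))
}

set.seed(1)
y <- rzip(1e5, lambda = 0.5, pi0 = 0.3)

# Empirical and theoretical proportion of zeros should be close.
c(empirical = mean(y == 0), theoretical = dzip(0, 0.5, 0.3))

# The pmf sums to 1 (up to truncation of the support).
sum(dzip(0:50, 0.5, 0.3))
```

The extra mass at zero is what lets the model match datasets in which zeros are more frequent than a plain Poisson with the same mean would predict.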
Choose your own prior distributions (do not use default priors), explain the rationale for your prior choices, and ensure that the posterior is not too sensitive to your prior choice [Hint: look at the induced prior on the linear predictor and the response.]
Compute the posterior means of the model parameters, and discuss the results.
Explanation (min 300 characters in your own words, otherwise -5 marks for insufficient explanation):
d)[10 marks] Fit a new model on numclaims in terms of the same covariates to improve on the models in part b) or part c) by considering interactions between covariates, as well as random effects. Describe your new model and justify your choices.
Choose your own prior distributions (do not use default priors), explain the rationale for your prior choices, and ensure that the posterior is not too sensitive to your prior choice [Hint: look at the induced prior on the linear predictor and the response.]
Compute the posterior means of the model parameters, and discuss the results.
Explanation (min 300 characters in your own words, otherwise -5 marks for insufficient explanation):
e)[10 marks] Perform posterior predictive model checks for your models b, c, d (i.e. using replicates).
As test functions, use the number of rows in the dataset with numclaims equal to 0, 1, 2, 3, and 4 (5 test functions).
Compute the RMSE values for predicting numclaims based on all 3 models. Discuss the results.
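The mechanics of a replicate-based check with these five test functions can be sketched on toy data. Everything model-specific below is a stand-in: the observed counts are simulated, and the "posterior draws" come from an intercept-only Poisson model with a Gamma(1, 1) prior, whose posterior is available in closed form. In a real analysis the matrix of Poisson means would come from the fitted JAGS, Stan, or INLA model. Using the posterior mean as the point prediction for the RMSE is also an assumption.

```r
set.seed(1)
n <- 2000
y_obs <- rpois(n, 0.08)  # toy stand-in for numclaims

# Stand-in posterior draws: S x n matrix of Poisson means, one row per draw.
# Here every row is constant because the toy model has an intercept only.
S <- 500
lambda_draws <- matrix(rgamma(S, shape = sum(y_obs) + 1, rate = n + 1), S, n)

# One replicated dataset per posterior draw.
y_rep <- matrix(rpois(S * n, lambda_draws), S, n)

# Test functions: number of rows with count equal to 0, 1, 2, 3, 4.
count_k <- function(y) sapply(0:4, function(k) sum(y == k))
T_obs <- count_k(y_obs)
T_rep <- t(apply(y_rep, 1, count_k))  # S x 5

# Posterior predictive p-values P(T(y_rep) >= T(y_obs) | y).
ppp <- colMeans(sweep(T_rep, 2, T_obs, ">="))
names(ppp) <- paste0("n", 0:4)
ppp

# RMSE of the posterior mean prediction against the observed counts.
y_hat <- colMeans(lambda_draws)
rmse <- sqrt(mean((y_obs - y_hat)^2))
rmse
```

Posterior predictive p-values close to 0 or 1 for a given count suggest that the model systematically under- or over-produces rows with that number of claims.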
Explanation (min 300 characters in your own words, otherwise -5 marks for insufficient explanation):
Problem 2 - Barcelona study
In this problem, we will use a dataset from the CitieS-Health project that provides insight
into the impact of air pollution on humans. It is comprised of data collected in Barcelona,
Spain, and examines various environmental variables, such as air pollution levels, and their
effects on mental health and wellbeing. In addition to environmental factors, this dataset also captures self-reported survey data on mental health, physical activity, diet habits, and more. From performance in a Stroop test (a type of psychological test evaluating attention capacity and processing speed) to information on total noise exposure at 55 dB - this dataset contains interesting information to understand the link between air pollution and human health.
We start by loading the dataset.
study <- read.csv("Barcelona.csv")
head(study)
##   Person_ID date_all year month day dayoftheweek hour sadness wellbeing energy
## 1       115    22222 2020    11   3            1   18      14         3      2
## 2       212    22247 2020    11  28            5   18       4         9      9
## 3       104    22208 2020    10  20            1   20       1         6      6
## 4       216    22247 2020    11  28            5   18       2         8      8
## 5        94    22213 2020    10  25            6   19      12         8      4
## 6       215    22258 2020    12   9            2   20       4         7      7
##   stress sleep hours_out physical_activity computer_use on_a_diet alcohol drugs
## 1      5     2         5                No          Yes       Yes      No    No
## 2      1     9         5               Yes           No        No     Yes    No
## 3      7     9        11                No          Yes       Yes      No    No
## 4      1     3         2               Yes           No       Yes     Yes    No
## 5      2     8         1                No          Yes        No      No   Yes
## 6
## [The rest of this output is truncated in the source; the only surviving fragment reads "sick No No No Yes No N".]
2023-04-12