
Bayesian Data Analysis, 2022/2023, Semester 2

Assignment 2

IMPORTANT INFORMATION ABOUT THE ASSIGNMENT

In this paragraph, we summarize the essential information about this assignment. The format and rules for this assignment are different from your other courses, so please pay attention.

1) Deadline: The deadline for submitting your solutions to this assignment is 17 April, 12:00 noon Edinburgh time.

2) Format: You will need to submit your work as 2 components: a PDF report, and your R Markdown (.Rmd) notebook. There will be two separate submission systems on Learn: Gradescope for the report in PDF format, and a Learn assignment for the code in Rmd format. You need to write your solutions into this R Markdown notebook (code in R chunks and explanations in Markdown chunks), and then select Knit/Knit to PDF in RStudio to create a PDF report.

The compiled PDF needs to contain everything in this notebook, with your code sections clearly visible (not hidden), and the output of your code included. Reports without the code displayed in the PDF, or without the output of your code included in the PDF, will be marked as 0, with the only feedback “Report did not meet submission requirements”.

You need to upload this PDF to the Gradescope submission system, and your Rmd file to the Learn assignment submission system. You will be required to tag every sub-question on Gradescope.

Some key points that are different from other courses:

a) Your report needs to contain a written explanation for each question that you solve, and some numbers or plots showing your results. Solutions without a written explanation that clearly demonstrates that you understand what you are doing will be marked as 0, irrespective of whether the numerics are correct or not.

b) Your code has to run for all questions using Run All in RStudio, and reproduce all of the numerics and plots in your report (up to some small randomness due to the stochasticity of Monte Carlo simulations). The parts of the report that contain material that is not reproduced by the code will not be marked (i.e. the score will be 0), and the only feedback in this case will be that the results are not reproducible from the code.


c) Multiple submissions are allowed BEFORE THE DEADLINE for both the report and the code.

However, multiple submissions are NOT ALLOWED AFTER THE DEADLINE.

YOU WILL NOT BE ABLE TO MAKE ANY CHANGES TO YOUR SUBMISSION AFTER THE DEADLINE.

Nevertheless, if you did not submit anything before the deadline, then you can still submit your work after the deadline, but late penalties will apply. The timing of the late penalties will be determined by the time at which you have submitted BOTH the report and the code (i.e. whichever was submitted later counts).

We illustrate these rules by some examples:

Alice has spent a lot of time and effort on her assignment for BDA. Unfortunately, before submission, she accidentally introduced a typo in her code in the first question, and it did not run using Run All in RStudio. - Alice will get 0 for the questions that do not run in her code (we will try to run each code block individually), with the only feedback “Results are not reproducible from the code”.

Bob has spent a lot of time and effort on his assignment for BDA. Unfortunately, he forgot to submit his code. - Bob will get no personal reminder to submit his code. Bob will get 0 for the whole assignment, with the only feedback “Results are not reproducible from the code, as the code was not submitted.”

Charles has spent a lot of time and effort on his assignment for BDA. He has submitted both his code and report in the correct formats. However, he did not include any explanations in the report. - Charles will get 0 for the whole assignment, with the only feedback “Explanation is missing.”

Denise has spent a lot of time and effort on her assignment for BDA. She has submitted her report in the correct format, but thought that she could include her code as a link in the report, and upload it online (such as GitHub, or Dropbox). - Denise will get 0 for the whole assignment, with the only feedback “Code was not uploaded on Learn.”

3) Group work: This is an INDIVIDUAL ASSIGNMENT, like a 2 week exam for the course. Communication between students about the assignment questions is not permitted. Students who submit work that has not been done individually will be reported for Academic Misconduct, which can lead to serious consequences. Each problem will be marked by a single instructor, so we will be able to spot students who copy.


4) Piazza: During the periods of the assignments, the instructor will change Piazza to allow messaging the instructors only, i.e. students will not see each other's messages and replies. Only questions regarding clarification of the statement of the problems will be answered by the instructors. The instructors will not give you any information related to the solution of the problems; such questions will simply be answered with “This is not about the statement of the problem so we cannot answer your question.”

THE INSTRUCTORS ARE NOT GOING TO DEBUG YOUR CODE, AND YOU ARE ASSESSED ON YOUR ABILITY TO RESOLVE ANY CODING OR TECHNICAL DIFFICULTIES THAT YOU ENCOUNTER ON YOUR OWN.

5) Office hours: There will be two office hours per week (Mondays 14:00-15:00 and Wednesdays 15:00-16:00) during the 2 weeks for this assignment. The links are available on Learn / Course Information. I will be happy to discuss the course/workshop materials. However, I will only answer questions about the assignment that require clarifying the statement of the problems, and will not give you any information about the solutions. Students who ask for feedback on their assignment solutions during office hours will be removed from the meeting.

6) Late submissions and extensions: NO EXTENSIONS ARE ALLOWED FOR THIS ASSIGNMENT, AND THERE IS NO SUCH OPTION PROVIDED IN THE ESC SYSTEM. Students who have existing Learning Adjustments in Euclid will be allowed to have the same adjustments applied to this course as well, but they need to apply for this BEFORE THE DEADLINE on the website

https://www.ed.ac.uk/student-administration/extensions-special-circumstances

by clicking on “Access your learning adjustment”. This will be approved automatically.

Students who submit their work late will have late submission penalties applied by the ESC team automatically (this means that even if you are 1 second late because your internet connection was slow, the penalties will still apply). The penalties are 5% of the total mark deducted for every day of delay started (i.e. one minute of delay counts as 1 day). The course instructors do not have any role in setting these penalties, and we will not be able to change them.

7) Please make sure to tag all pages in your submission on Gradescope, otherwise we may miss some of your work. Once your upload is complete, tagging does not count towards your submission time (i.e. you won't get any late penalties for doing it).

rm(list = ls(all = TRUE))
# Do not delete this!
# It clears all variables to ensure reproducibility

Problem 1

In this problem, we study a dataset about car insurance. This data set is based on one-year vehicle insurance policies taken out in 2004 or 2005. In total, there are 67856 policies, of which 4624 have claims.

require(insuranceData)

## Loading required package: insuranceData

data(dataCar)

# You may need to set the working directory first before loading the dataset
# setwd("location of Assignment 1")

# The first 6 rows of the dataframe
print.data.frame(dataCar[1:6, ])

##   veh_value  exposure clm numclaims claimcst0 veh_body veh_age gender area
## 1      1.06 0.3039014   0         0         0    HBACK       3      F    C
## 2      1.03 0.6488706   0         0         0    HBACK       2      F    A
## 3      3.26 0.5694730   0         0         0      UTE       2      F    E
## 4      4.14 0.3175907   0         0         0    STNWG       2      F    D
## 5      0.72 0.6488706   0         0         0    HBACK       4      F    C
## 6      2.01 0.8542094   0         0         0    HDTOP       3      M    C
##   agecat             X_OBSTAT_
## 1      2 01101    0    0    0
## 2      4 01101    0    0    0
## 3      2 01101    0    0    0
## 4      2 01101    0    0    0
## 5      2 01101    0    0    0
## 6      4 01101    0    0    0

Description of the columns.

veh_value: vehicle value in $10000s

exposure: maximum portion of the vehicle value the insurer may need to pay out in case of an incident

claimcst0: claim amount (0 if no claim)

clm: whether there was a claim during the 1 year duration

numclaims: number of claims during the 1 year duration

veh_body types: BUS = bus, CONVT = convertible, COUPE = coupe, HBACK = hatchback, HDTOP = hardtop, MCARA = motorized caravan, MIBUS = minibus, PANVN = panel van, RDSTR = roadster, SEDAN = sedan, STNWG = station wagon, TRUCK = truck, UTE = utility

gender: F - female, M - male

area: a factor with levels A, B, C, D, E, F

agecat: age category, 1 (youngest), 2, 3, 4, 5, 6

You can use either JAGS, Stan, or INLA for this question.

a) [10 marks] Fit a Bayesian logistic regression model on the dataset dataCar with

clm as response,

a link function of your choice,

using veh_value, exposure, veh_body, veh_age, gender, area, and agecat as covariates (you can use categorical covariates by converting integers to factors if appropriate).

Center and scale the non-categorical covariates.

Choose your own prior distributions (do not use default priors), explain the rationale for your prior choices, and ensure that the posterior is not too sensitive to your prior choice. [Hint: look at the induced prior on the linear predictor and on the response.]

Compute the posterior means of the model parameters, and discuss the results.
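The hint about the induced prior can be explored by simulation before any model is fitted. The following is a minimal base R sketch, not a solution: it uses simulated stand-in covariates rather than dataCar, the helper `induced_prob` is hypothetical, and the prior standard deviations 10 and 1 are illustrative assumptions rather than recommended values.

```r
set.seed(1)

# Simulated stand-ins for two numeric covariates (assumption: not the real data)
n <- 500
x_raw <- cbind(veh_value = rlnorm(n, 0.5, 0.5),
               exposure  = runif(n))

# Center and scale: each column then has mean 0 and standard deviation 1
x <- scale(x_raw)

# Draw coefficients from independent N(0, prior_sd^2) priors and return the
# induced prior on the claim probability under a logit link
induced_prob <- function(prior_sd, n_draws = 2000) {
  p     <- ncol(x)
  beta0 <- rnorm(n_draws, 0, prior_sd)
  beta  <- matrix(rnorm(n_draws * p, 0, prior_sd), nrow = p)
  # Linear predictor for one randomly chosen row per prior draw
  rows  <- sample(n, n_draws, replace = TRUE)
  eta   <- beta0 + colSums(t(x[rows, , drop = FALSE]) * beta)
  plogis(eta)
}

prob_wide   <- induced_prob(prior_sd = 10)  # hypothetical "vague" prior
prob_narrow <- induced_prob(prior_sd = 1)   # hypothetical tighter prior

# Share of prior mass on near-certain outcomes (p < 0.01 or p > 0.99)
extreme <- function(p) mean(p < 0.01 | p > 0.99)
c(wide = extreme(prob_wide), narrow = extreme(prob_narrow))
```

A prior that looks vague on the coefficients can place most of its mass on probabilities close to 0 or 1, which is the kind of behaviour the hint asks you to check.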

Explanation (min 300 characters in your own words, otherwise -5 marks for insufficient explanation):

b) [10 marks] Fit a Bayesian Poisson regression model on numclaims as response with

log link function,

using veh_value, exposure, veh_body, veh_age, gender, area, and agecat as covariates. Center and scale the non-categorical covariates.

Choose your own prior distributions (do not use default priors), explain the rationale for your prior choices, and ensure that the posterior is not too sensitive to your prior choice. [Hint: look at the induced prior on the linear predictor and the response.]

Compute the posterior means of the model parameters, and discuss the results.
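With a log link, the prior on the coefficients is exponentiated, so its effect on the expected count can be much stronger than its scale suggests. The sketch below, again only an illustration under assumed values, simulates the induced prior on the Poisson mean for one standardized stand-in covariate; `induced_mean` is a hypothetical helper and the prior standard deviations 5 and 0.5 are arbitrary.

```r
set.seed(2)

# Induced prior on the Poisson mean exp(eta), where eta = beta0 + beta1 * x,
# x is a standardized covariate, and both coefficients have N(0, sd^2) priors
induced_mean <- function(prior_sd, n_draws = 5000) {
  x     <- rnorm(n_draws)                  # stand-in standardized covariate
  beta0 <- rnorm(n_draws, 0, prior_sd)
  beta1 <- rnorm(n_draws, 0, prior_sd)
  exp(beta0 + beta1 * x)
}

mu_wide   <- induced_mean(prior_sd = 5)    # hypothetical value
mu_narrow <- induced_mean(prior_sd = 0.5)  # hypothetical value

# Prior quantiles of the expected number of claims per policy
q_wide   <- quantile(mu_wide,   c(0.05, 0.5, 0.95))
q_narrow <- quantile(mu_narrow, c(0.05, 0.5, 0.95))
signif(rbind(wide = q_wide, narrow = q_narrow), 3)

# Induced prior on the response itself: simulated claim counts
y_narrow <- rpois(length(mu_narrow), mu_narrow)
table(y_narrow)
```

Comparing the upper quantiles shows whether a prior implies claim counts that are implausible for a one-year motor policy.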

Explanation (min 300 characters in your own words, otherwise -5 marks for insufficient explanation):

c) [10 marks] Fit a zero-inflated Bayesian Poisson regression model (https://en.wikipedia.org/wiki/Zero-inflated_model) on

numclaims as response,

with log link function,

using veh_value, exposure, veh_body, veh_age, gender, area, and agecat as covariates. Center and scale the non-categorical covariates.

Choose your own prior distributions (do not use default priors), explain the rationale for your prior choices, and ensure that the posterior is not too sensitive to your prior choice. [Hint: look at the induced prior on the linear predictor and the response.]

Compute the posterior means of the model parameters, and discuss the results.
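A zero-inflated Poisson distribution mixes a point mass at zero (with probability `pi0`) with an ordinary Poisson count with mean `lambda`. The base R sketch below writes out its probability mass function and a sampler; `dzip` and `rzip` are hypothetical helper names for illustration only (they are not functions from JAGS, Stan, or INLA), and the parameter values are arbitrary.

```r
# P(Y = 0) = pi0 + (1 - pi0) * exp(-lambda)
# P(Y = k) = (1 - pi0) * dpois(k, lambda)   for k >= 1
dzip <- function(y, lambda, pi0) {
  ifelse(y == 0,
         pi0 + (1 - pi0) * dpois(0, lambda),
         (1 - pi0) * dpois(y, lambda))
}

# Sampler: first decide whether the zero is structural, otherwise draw a Poisson count
rzip <- function(n, lambda, pi0) {
  structural_zero <- rbinom(n, 1, pi0)
  ifelse(structural_zero == 1, 0, rpois(n, lambda))
}

set.seed(3)
lambda <- 0.3  # illustrative Poisson mean
pi0    <- 0.6  # illustrative zero-inflation probability

# The pmf sums to 1 (up to truncation of the support at 50)
total_mass <- sum(dzip(0:50, lambda, pi0))

# Simulated versus exact probability of observing zero claims
y <- rzip(10000, lambda, pi0)
c(simulated = mean(y == 0), exact = dzip(0, lambda, pi0))
```

In a regression setting with a log link, `lambda` would be the exponential of the linear predictor for each policy.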

Explanation (min 300 characters in your own words, otherwise -5 marks for insufficient explanation):

d) [10 marks] Fit a new model on numclaims in terms of the same covariates to improve on the models in part b) or part c) by considering interactions between covariates, as well as random effects. Describe your new model and justify your choices.

Choose your own prior distributions (do not use default priors), explain the rationale for your prior choices, and ensure that the posterior is not too sensitive to your prior choice. [Hint: look at the induced prior on the linear predictor and the response.]

Compute the posterior means of the model parameters, and discuss the results.
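To see what an interaction adds to a model, it can help to compare design matrices. The sketch below uses a small made-up data frame that only reuses some column names from dataCar; the choice of a gender-by-agecat interaction and of agecat as a grouping factor is purely illustrative, not a suggested model.

```r
# Small made-up data frame reusing three column names from dataCar
d <- data.frame(
  gender    = factor(c("F", "M", "F", "M", "F", "M")),
  agecat    = factor(c(1, 1, 2, 2, 3, 3)),
  veh_value = c(1.06, 1.03, 3.26, 4.14, 0.72, 2.01)
)
d$veh_value_s <- as.numeric(scale(d$veh_value))  # centered and scaled

# Main effects only
X_main <- model.matrix(~ veh_value_s + gender + agecat, data = d)

# Main effects plus gender-by-agecat interaction columns
X_int <- model.matrix(~ veh_value_s + gender * agecat, data = d)

colnames(X_main)
colnames(X_int)

# Integer index of a grouping factor, as typically passed to a sampler
# when a random intercept per group is wanted (here: per age category)
group_index <- as.integer(d$agecat)
group_index
```

Each interaction column is the product of the corresponding main-effect columns, so every added interaction also needs its own prior.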

Explanation (min 300 characters in your own words, otherwise -5 marks for insufficient explanation):

e) [10 marks] Perform posterior predictive model checks for your models b, c, d (i.e. using replicates).

As test functions, use the number of rows in the dataset with numclaims equal to 0, 1, 2, 3, and 4 (5 test functions).

Compute the RMSE values for predicting numclaims based on all 3 models. Discuss the results.
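The mechanics of the check can be sketched in base R once replicated datasets are available. In the sketch below both the observed counts and the replicate matrix `y_rep` are simulated stand-ins (assumptions); in practice `y_rep` would come from the fitted model, with one row per posterior draw. Using the mean over replicates as the point prediction for the RMSE is one possible choice among several.

```r
set.seed(4)

# Stand-ins (assumptions): observed counts and a matrix of replicates
n       <- 1000
y_obs   <- rpois(n, 0.08)
n_draws <- 200
y_rep   <- matrix(rpois(n_draws * n, 0.08), nrow = n_draws)

# Test functions: number of rows with count equal to 0, 1, 2, 3, 4
count_k <- function(y) sapply(0:4, function(k) sum(y == k))

T_obs <- count_k(y_obs)               # length-5 vector
T_rep <- t(apply(y_rep, 1, count_k))  # n_draws x 5 matrix

# Posterior predictive p-values: P(T(y_rep) >= T(y_obs)) for each test function
ppp <- colMeans(sweep(T_rep, 2, T_obs, ">="))
names(ppp) <- paste0("count_", 0:4)
ppp

# RMSE of a point prediction (here: the mean over replicates) against y_obs
y_hat <- colMeans(y_rep)
rmse  <- sqrt(mean((y_obs - y_hat)^2))
rmse
```

Histograms of each column of `T_rep` with the corresponding entry of `T_obs` marked give the same information graphically.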

Explanation (min 300 characters in your own words, otherwise -5 marks for insufficient explanation):

Problem 2 - Barcelona study

In this problem, we will use a dataset from the CitieS-Health project that provides insight into the impact of air pollution on humans. It is comprised of data collected in Barcelona, Spain, and examines various environmental variables, such as air pollution levels, and their effects on mental health and wellbeing. In addition to environmental factors, this dataset also captures self-reported survey data on mental health, physical activity, diet habits, and more. From performance in a Stroop test (a type of psychological test evaluating attention capacity and processing speed) to information on total noise exposure at 55 dB - this dataset contains interesting information to understand the link between air pollution and human health.

We start by loading the dataset.

study <- read.csv("Barcelona.csv")

head(study)

##   Person_ID date_all year month day dayoftheweek hour sadness wellbeing energy
## 1       115    22222 2020    11   3            1   18      14         3      2
## 2       212    22247 2020    11  28            5   18       4         9      9
## 3       104    22208 2020    10  20            1   20       1         6      6
## 4       216    22247 2020    11  28            5   18       2         8      8
## 5        94    22213 2020    10  25            6   19      12         8      4
## 6       215    22258 2020    12   9            2   20       4         7      7
##   stress sleep hours_out physical_activity computer_use on_a_diet alcohol drugs
## 1      5     2         5                No          Yes       Yes      No    No
## 2      1     9         5               Yes           No        No     Yes    No
## 3      7     9        11                No          Yes       Yes      No    No
## 4      1     3         2               Yes           No       Yes     Yes    No
## 5      2     8         1                No          Yes        No      No   Yes

sick

No No No Yes No N