Summer Examination Period 2020  May - Semester B

ECS7005 Risk and Decision-Making for Data Science and AI

Question 1

a) What is a linear regression model?                                                           [3 marks]

b) Table 1 shows data on observed average monthly road fatalities in a Northern European country.

Table 1 Monthly Road Fatalities

Month        Average temperature (C)    Fatalities
Jan          5                          190
Feb          4                          180
March        9                          210
April        13                         250
May          16                         220
June         20                         280
July         22                         310
August       24                         340
September    19                         270
October      11                         240
November     8                          190
December     6                          210

By plotting this data, what can you conclude about the statistical association between observed monthly temperatures and the number of fatalities?

[5 marks]

c) How would the statistical association in part b) be represented in a linear regression model? (You do not have to provide the parameters of the model, only the general form, and you should refer to your solution to Question 1 part b).)

[4 marks]

d) Based on the data and analysis in Question 1 parts b) and c), a newspaper runs a report with the headline “Driving in winter conditions reduces the risk of a fatal accident”. Explain why this headline is misleading.

[5 marks]

e) Draw a causal model that better explains the observed data in Table 1.

[8 marks]

Question 2

a) What is Simpson’s paradox? [4 marks]

b) A study into the effectiveness of a new migraine drug is carried out on 800 adults, 400 of whom receive the drug and 400 of whom receive a placebo. Patients reported whether taking the drug stopped their migraine. The results were:

ALL PATIENTS
Drug       50%   (200/400)
Placebo    40%   (160/400)

However, when patients were categorized by sex into male and female, the results were:

MALE
Drug       60%   (180/300)
Placebo    70%   (70/100)

FEMALE
Drug       20%   (20/100)
Placebo    30%   (90/300)

Explain why this is an example of Simpson’s paradox.                         [4 marks]

c) Explain what a confounding variable is, and give an example from the above study.

[4 marks]

d) Explain how the ‘paradox’ has occurred (i.e. why it is not a real paradox after all).

[5 marks]

e) Draw the 3-node Bayesian Network causal model that explains the paradox, showing the probability tables for each node.   [8 marks]

Question 3

a) Pearl used a “ladder of causation” to explain why data-driven algorithms and classical statistics alone cannot achieve “true artificial intelligence”. State the simple words used by Pearl to describe the three rungs on the ladder and where, on the ladder, data-driven algorithms and classical statistics can reach.   [4 marks]

b) By using a simple medical example, explain the terms “association”, “intervention” and “counterfactual” used to describe different types of reasoning and how they relate to Pearl’s ladder of causation.   [6 marks]

c) A University has collected data on a large number of its students to determine whether the amount that a student spends on books influences their final degree performance. The data shows that increased spending does improve performance. However, the data also shows that students who went to grammar or private schools tend to spend more on books and also achieve better degrees.

The observational data enables them to build a Bayesian Network model as shown in Figure 1. There is a proposal to buy £1000 worth of books for each student. How would you use the model to determine whether such an intervention would increase the number of students passing their degree?   [6 marks]

 

Figure 1 Student performance Bayesian Network (with node states shown alongside nodes)

 

d) A particular student, Glenda, achieved a 2i at this University. We know that Glenda spent only £30 on books but we do not know the type of school where she received her education. How would we use the Bayesian Network to answer the question “Would Glenda have achieved a 1st class degree if she had spent at least £1000 on books instead of £30?”

[9 marks]


Question 4

a) A study finds that 2 out of every 100,000 adults who do not eat red meat regularly contract disease D, whereas 3 out of every 100,000 adults who do eat red meat   regularly contract disease D. Using this data:

i)  calculate the relative risk increase of contracting disease D for regular meat eaters compared to non regular meat eaters.   [3 marks]

ii) calculate the absolute risk increase of contracting disease D for regular meat eaters compared to non regular meat eaters.   [3 marks]

b) Assume there is a causal link between eating red meat regularly and contracting disease D (and that there are no confounding variables). Then:

i)  what does the increase in relative risk tell us about the probability of an adult contracting disease D?   [3 marks]

ii) what does the increase in absolute risk tell us about the probability of an adult being a regular meat eater?   [4 marks]

c) How does an influence diagram differ from a causal model, and what is it used for?

[4 marks]

d) Draw an influence diagram that could be used to determine the optimal decision strategy for the following problem:

There is an expensive but accurate test to diagnose a particular disease. There is then the option to operate if the doctor is confident the patient has the disease. There is a high utility of operating when the patient does really have the disease but a negative utility of operating when the patient does not have the disease.

[8 marks]

SOLUTIONS

Question 1

a)   A statistical model of the form Y=aX+b (where a and b are constants) that approximates the relationship between two variables X and Y based on pairs of data for X and Y.   [3 marks]

b)   The plot shows a very strong positive correlation between temperature and fatalities: months with higher average temperatures have more fatalities. [5 marks]

 

c)    Fatalities = a*Temp + b, where a is the gradient of the line of best fit and b is the intercept on the Fatalities axis (about 150)     [4 marks]
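
The general form in part c) can be checked numerically. The following is a minimal sketch that fits the line by ordinary least squares to the Table 1 data; the variable names are illustrative and the question itself does not require the parameters.

```python
# Table 1 data: average temperature (C) and road fatalities, Jan to Dec.
temps = [5, 4, 9, 13, 16, 20, 22, 24, 19, 11, 8, 6]
deaths = [190, 180, 210, 250, 220, 280, 310, 340, 270, 240, 190, 210]

n = len(temps)
mean_t = sum(temps) / n
mean_d = sum(deaths) / n

# Sums of squares and cross-products about the means.
s_tt = sum((t - mean_t) ** 2 for t in temps)
s_dd = sum((d - mean_d) ** 2 for d in deaths)
s_td = sum((t - mean_t) * (d - mean_d) for t, d in zip(temps, deaths))

a = s_td / s_tt                    # gradient of the line of best fit
b = mean_d - a * mean_t            # intercept on the Fatalities axis
r = s_td / (s_tt * s_dd) ** 0.5    # Pearson correlation coefficient

print(f"Fatalities = {a:.2f} * Temp + {b:.1f}, r = {r:.2f}")
```

The fitted intercept comes out close to the "about 150" read from the plot, and the correlation coefficient is above 0.9, in line with the answer to part b).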

d)    The headline “Driving in winter conditions reduces the risk of a fatal accident” implies that there is a causal relationship between temperature and fatalities, i.e. that decreasing the temperature causes fewer fatal accidents. In reality there are underlying hidden causal explanations, such as the number of journeys made (fewer in winter months than in summer months) and the fact that people drive more slowly when road conditions are bad (in winter), meaning fewer dangerous accidents. [5 marks]

e)   A causal model that helps explain the observed data [8 marks]

 

Question 2

a)   Simpson’s paradox is a statistical phenomenon whereby data seem to support a particular hypothesis when the data is aggregated across all subcategories but supports the opposite hypothesis for each subcategory.  [4 marks]

b)   Overall, a higher percentage of patients recover when using the drug than when not using it (50% compared to 40%). However, when we restrict the analysis to each subcategory of patients (male and female), there is a higher percentage of recovery when NOT using the drug than when using it.  [4 marks]
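
The reversal can be verified directly from the counts in the Question 2 tables. This is a small sketch (dictionary layout and names are illustrative) that computes the recovery rate overall and within each sex.

```python
from fractions import Fraction

# (patients whose migraine stopped, patients in that group), from Question 2.
results = {
    "male":   {"drug": (180, 300), "placebo": (70, 100)},
    "female": {"drug": (20, 100),  "placebo": (90, 300)},
}

# Aggregate over sex by summing the counts for each arm of the study.
overall = {}
for arm in ("drug", "placebo"):
    stopped = sum(results[sex][arm][0] for sex in results)
    treated = sum(results[sex][arm][1] for sex in results)
    overall[arm] = Fraction(stopped, treated)

# Recovery rate within each sex.
by_sex = {
    sex: {arm: Fraction(k, n) for arm, (k, n) in arms.items()}
    for sex, arms in results.items()
}

print("overall:", {arm: float(p) for arm, p in overall.items()})
for sex, rates in by_sex.items():
    print(sex + ":", {arm: float(p) for arm, p in rates.items()})
```

The drug has the higher rate in the aggregated data but the lower rate in both subcategories, which is the pattern described in part a).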

c)    A confounding variable X is one which – when hidden – can lead to false conclusions about the      effect of Y on some outcome Z because it influences both Y and Z. Sex is a confounding variable in the above study [4 marks]

d)   Although an equal number of males and females were in the study and although an equal number of people took the drug and placebo, it is clear from the data that far fewer females took the drug than males. In other words ‘sex’ influences whether or not a person took the drug. Clearly, from    the data, males are more likely to recover generally than females. Hence sex influences both          whether or not the drug was taken and the outcome. If an equal number of males and females      were given the drug the paradox would not be possible (however, this does not mean there is not some other confounder like age etc) [5 marks]

e)   3-node Bayesian network causal model that explains the paradox with probability tables for each node                                  [8 marks]
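
The probability tables for the three nodes can be read off the study counts. The sketch below assumes the structure Sex → Drug, Sex → Outcome and Drug → Outcome, which is the structure implied by part d); the variable names are illustrative. As an extra step it also averages the within-sex rates over P(Sex), which removes the influence of sex on who took the drug.

```python
from fractions import Fraction

# counts[sex][arm] = (migraine stopped, patients in that cell), from Question 2.
counts = {
    "male":   {"drug": (180, 300), "placebo": (70, 100)},
    "female": {"drug": (20, 100),  "placebo": (90, 300)},
}
total = sum(n for arms in counts.values() for _, n in arms.values())

# Node 1: P(Sex)
p_sex = {
    sex: Fraction(sum(n for _, n in arms.values()), total)
    for sex, arms in counts.items()
}

# Node 2: P(Drug | Sex)
p_drug_given_sex = {
    sex: {arm: Fraction(n, sum(m for _, m in arms.values()))
          for arm, (_, n) in arms.items()}
    for sex, arms in counts.items()
}

# Node 3: P(Migraine stopped | Drug, Sex)
p_stop = {
    sex: {arm: Fraction(k, n) for arm, (k, n) in arms.items()}
    for sex, arms in counts.items()
}

# Adjusting for sex: weight each within-sex rate by P(Sex), not by who took the drug.
adjusted = {
    arm: sum(p_sex[sex] * p_stop[sex][arm] for sex in counts)
    for arm in ("drug", "placebo")
}

print("P(Sex):", p_sex)
print("P(Drug | Sex):", p_drug_given_sex)
print("P(Stopped | Drug, Sex):", p_stop)
print("Adjusted for sex:", adjusted)
```

Once sex is adjusted for, the placebo rate is higher than the drug rate, which agrees with the within-sex tables.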

 

Question 3

a)    Rung 1: “Seeing”   Rung 2: “Doing”   Rung 3: “Imagining”   [3 marks, 1 each]. Purely data-driven algorithms and classical statistics can reach only rung 1, “Seeing”   [1 mark]

b)   Association (rung 1, seeing): Do patients who use this drug show improved recovery rates? [2 marks]

Intervention (rung 2, doing): If I use this drug, what is the probability I will recover? [2 marks]

Counterfactual (rung 3, imagining): If I use the drug and recover, would I still have recovered if I had not used the drug? [2 marks]

c)    First we need to note the marginal probability of ‘fail’ in the original model. Then we need to cut the link from ‘Type of school’ to ‘spending’:

 

In this revised model enter “>1000” as an observation and run the model

Observe the revised probability of fail. If it is less than the previous marginal then the intervention would be a success.
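
The steps above can be sketched in code. The probability tables of Figure 1 are not available here, so every number below is HYPOTHETICAL and chosen only for illustration, as are the state names; only the procedure (note the marginal, cut the link, fix the spending, compare) follows the answer.

```python
# HYPOTHETICAL tables: the real ones belong to Figure 1 and are not known here.
p_school = {"state": 0.7, "grammar_or_private": 0.3}
p_spend = {  # P(Spending | Type of school)
    "state":              {"<50": 0.6, "50-1000": 0.3, ">1000": 0.1},
    "grammar_or_private": {"<50": 0.2, "50-1000": 0.4, ">1000": 0.4},
}
p_fail = {  # P(Final degree = fail | Type of school, Spending)
    "state":              {"<50": 0.20, "50-1000": 0.15, ">1000": 0.12},
    "grammar_or_private": {"<50": 0.10, "50-1000": 0.07, ">1000": 0.05},
}

# Step 1: marginal probability of 'fail' in the original model.
baseline = sum(
    p_school[s] * p_spend[s][x] * p_fail[s][x]
    for s in p_school for x in p_spend[s]
)

# Steps 2-3: cut the link School -> Spending and fix Spending = '>1000'.
# School keeps its prior because spending no longer carries information about it.
after = sum(p_school[s] * p_fail[s][">1000"] for s in p_school)

# For contrast: merely observing '>1000' in the ORIGINAL model (link not cut).
joint = sum(p_school[s] * p_spend[s][">1000"] * p_fail[s][">1000"] for s in p_school)
evidence = sum(p_school[s] * p_spend[s][">1000"] for s in p_school)
observed = joint / evidence

print(f"baseline P(fail)           = {baseline:.4f}")
print(f"P(fail | do(spend >1000))  = {after:.4f}")
print(f"P(fail | see(spend >1000)) = {observed:.4f}")
```

With these made-up tables the intervention lowers the probability of failing, but by less than simple conditioning would suggest, because conditioning also shifts belief towards grammar or private schooling.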

d)   Create the ‘twin network’ model by copying the nodes Spending and Final degree:

 

Cut the link from Type of School to Spending in the counterfactual world. Enter the observations “<50”  and “2i” in the real world model.

Run the model – this will update the “Type of School” node.

Enter the observation “>1000” in the counterfactual world and run the model again. Look at the probability of “1” in the final degree of the counterfactual world.   [9 marks]
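
The twin-network steps above can be sketched as follows. All probabilities are HYPOTHETICAL (the tables of Figure 1 are not known here), and the sketch follows the simplified twin network described in the answer, in which the real and counterfactual worlds share only the Type of School node.

```python
# HYPOTHETICAL tables, for illustration only.
p_school = {"state": 0.7, "grammar_or_private": 0.3}
p_low_spend = {"state": 0.6, "grammar_or_private": 0.2}  # P(Spending '<50' | School)
p_degree = {  # P(Final degree | Type of school, Spending), two spending states shown
    "state": {
        "<50":   {"1": 0.05, "2i": 0.35, "other": 0.60},
        ">1000": {"1": 0.15, "2i": 0.45, "other": 0.40},
    },
    "grammar_or_private": {
        "<50":   {"1": 0.15, "2i": 0.45, "other": 0.40},
        ">1000": {"1": 0.30, "2i": 0.45, "other": 0.25},
    },
}

# Real world: enter Spending = '<50' and Final degree = '2i', then update School.
weights = {
    s: p_school[s] * p_low_spend[s] * p_degree[s]["<50"]["2i"]
    for s in p_school
}
norm = sum(weights.values())
posterior_school = {s: w / norm for s, w in weights.items()}

# Counterfactual world: link School -> Spending is cut, Spending is set to '>1000'.
# The copied Final degree node depends on the UPDATED School and the new spending.
p_first = sum(
    posterior_school[s] * p_degree[s][">1000"]["1"] for s in p_school
)

print("P(School | real-world evidence):", posterior_school)
print(f"P(counterfactual degree = 1st) = {p_first:.4f}")
```

The real-world evidence is used only to update belief about Glenda's school; the counterfactual spending then acts on the copied Final degree node through that updated belief.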

Question 4

a)

i.   The relative risk increase is (0.00003 minus 0.00002) divided by 0.00002, which is ½, i.e. a 50% increase compared to non meat eaters  [3 marks]

ii.   The absolute risk increase is 0.00003 minus 0.00002, which is 0.00001, i.e. 1 in 100,000 or 0.001% [3 marks]
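
The two calculations in part a) can be checked with exact fractions. A short sketch, with illustrative variable names:

```python
from fractions import Fraction

# Risks taken from the question: cases of disease D per 100,000 adults.
risk_non_eaters = Fraction(2, 100_000)   # do not eat red meat regularly
risk_eaters = Fraction(3, 100_000)       # eat red meat regularly

# Absolute risk increase: difference between the two risks.
absolute_increase = risk_eaters - risk_non_eaters

# Relative risk increase: that difference as a fraction of the baseline risk.
relative_increase = absolute_increase / risk_non_eaters

print(f"absolute risk increase: {float(absolute_increase):.5f} "
      f"({float(absolute_increase):.3%})")
print(f"relative risk increase: {float(relative_increase):.0%}")
```

The same one-in-100,000 difference looks small as an absolute increase and large as a relative one, which is why the two measures are asked for separately.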

b)

i.   The 0.001% increase in absolute risk is the increased probability of contracting the disease if a person who is not a meat eater starts eating meat regularly.  [3 marks]

ii.   The 50% relative risk increase corresponds to the percentage increase in the  posterior probability that a person is a meat eater if you know the person has contracted the disease.  [4 marks]

c)    An influence diagram is a causal model with additional types of nodes, namely decision nodes     (which represent possible interventions) and utility nodes. It is used to take account of costs and benefits and hence determine optimal decision strategies. [4 marks]

d)

 
