
Summer Examination Period 2021 — May — Semester B

ECS7005P Risk and Decision-Making for Data Science and AI

Question 1

A new virus is affecting the population. People who have the virus will normally have specific symptoms such as a cough and the loss of the sense of taste and/or smell.

It is estimated that 1 in 5 people who suffer these symptoms have the virus and 1 in 2000 people without these symptoms have the virus.

A test for the virus has the following accuracy:

· For people with symptoms, the true positive rate is 90% and the false positive rate is 5%

· For people without symptoms, the true positive rate is 80% and the false positive rate is 1%

Answer the following questions:

a) If we know that 5% of the population have symptoms, what percentage of the population has the virus?       [2 marks]

b) What is the probability that a person with symptoms will test positive?       [2 marks]

c) What is the probability that a person without symptoms will test positive?       [2 marks]

d) A person with symptoms tests positive. What is the probability they have the virus?       [2 marks]

e) A person with symptoms tests negative. What is the probability they have the virus?       [2 marks]

f) A person without symptoms tests positive. What is the probability they have the virus?       [2 marks]

g) A person without symptoms tests positive and is subject to an additional test. Assuming that a second test is independent of the first, what is the probability they test positive in this second test?       [4 marks]

h) A person without symptoms tests positive in both the first and second test. What is the probability they have the virus?       [4 marks]

[Question 1 Total: 20 marks]

Question 2

Table 1 summarizes the results from an observational study into the effectiveness of two drugs, A and B, for treating migraine.

 

| Drug | Aged < 50: Effective | Aged < 50: Non-effective | Aged 50+: Effective | Aged 50+: Non-effective |
| --- | --- | --- | --- | --- |
| Drug A | 420 | 80 | 70 | 30 |
| Drug B | 85 | 15 | 150 | 50 |

The ‘success rate’ is the percentage of effective outcomes.

Answer the following questions:

a) What was the ‘success rate’ for Drug A for the study participants overall?       [1 mark]

b) What was the ‘success rate’ for Drug B for the study participants overall?       [1 mark]

c) What was the ‘success rate’ for Drug A for the study participants aged < 50?    [1 mark]

d) What was the ‘success rate’ for Drug B for the study participants aged < 50?    [1 mark]

e) What was the ‘success rate’ for Drug A for the study participants aged 50+?     [1 mark]

f) What was the ‘success rate’ for Drug B for the study participants aged 50+?     [1 mark]

g) What can you conclude from the above results?         [2 marks]

h) Name the paradox evident in this study.       [1 mark]

i) What is the main cause of the paradox in this example?         [3 marks]

j) Draw the causal model that explains the data and write down the probability tables for each node in that model.       [6 marks]

k) How would you amend the model to one that avoids the paradox?       [2 marks]

l) By doing what you proposed in k) (or by other means) estimate the ‘true’ success rate for each drug for the whole population.                     [4 marks]   

m) Suppose you know that a patient took Drug A and the outcome was not effective. We don’t know the patient’s age, but we want to answer the counterfactual question: “Would the outcome have been effective if this patient had taken Drug B instead of Drug A?” In your answer to this question, provide a sketch of a causal model that supports your reasoning.       [6 marks]

[Question 2 Total: 30 marks]

Question 3

It is known that about 2.3% of people who have sleeping disorders have severe insomnia (defined as going more than 36 hours without being able to sleep at all).

A study of 1000 people who have sleeping disorders discovered that tea-drinkers (classified as those who drink more than 2 cups of tea a day) are more likely to suffer severe insomnia.

 

| | Tea-drinkers | Not tea-drinkers |
| --- | --- | --- |
| Severe insomnia | 9 | 14 |
| Other sleeping disorders | 291 | 686 |
| Total | 300 | 700 |

a) Answer the following about people with sleeping disorders:

i) What is the relative increase in risk of having severe insomnia for tea drinkers compared to non-tea drinkers?         [3 marks]

ii) What is the absolute increase in risk of having severe insomnia for tea drinkers compared to those who are not tea-drinkers?       [3 marks]

b) Suppose we know that 10% of the population have sleeping disorders. Of those with sleeping disorders, 30% are tea-drinkers. Of those with no sleeping disorders, only 20% are tea-drinkers. Answer the following questions about the whole population:

i) What is the relative increase in risk of having severe insomnia for tea-drinkers compared to those who are not tea-drinkers?       [5 marks]

ii) What is the absolute increase in risk of having severe insomnia for tea drinkers compared to those who are not tea-drinkers?       [5 marks]

Hint: you should assume a population size of 100,000 and create two tables like the one above, one for people with sleeping disorders and one for people without.

c) What paradox could be triggered if you used the above 1000-person study to make inferences about the risk of severe insomnia caused by tea-drinking in the entire population?       [2 marks]

d) Which of the following headlines is the most misleading?       [2 marks]

i) “Study shows people with sleeping disorders should consider cutting down on the amount of tea they drink”.

ii) “Drinking more than 2 cups of tea a day more than doubles the risk of having the most severe form of sleep disorder”.

iii) “People with sleeping disorders who drink more than 2 cups of tea a day are at increased risk of the most severe sleep deprivation”.

iv) “Drinking more than 2 cups of tea a day may lead to severe sleep deprivation”.

[Question 3 Total: 20 marks]

Question 4

The following algorithm is ‘learnt’ from a subset of the dataset of passengers on the Titanic cruise liner which sank after hitting an iceberg on 15 April 1912:

If Sex = “Male” then Probability (survive) = 0.2

If Sex = “Female” and Class = 1 or 2 then Probability (survive) = 0.8

If Sex = “Female” and Class = 3 then Probability (survive) = 0.6

The relevant information in a different test dataset is summarized as:

 

| | Male | Female Class 1 or 2 | Female Class 3 |
| --- | --- | --- | --- |
| Survived | 75 | 75 | 60 |
| Did not survive | 225 | 15 | 50 |

Based on this test set data, the accuracy of the algorithm for cut-off value 0.1 can be represented in the following format, where “YES” means survive and “NO” means not survive.

 

| | Number predicted YES | Number predicted NO | Total |
| --- | --- | --- | --- |
| Number YES’s | 210 | 0 | 210 |
| Number NO’s | 290 | 0 | 290 |

This enables us to compute:

Sensitivity: 100%; Specificity: 0%; False positive rate: 100%; Accuracy: 42%
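As a sketch of how these four figures follow from the table, assuming the usual definitions (sensitivity = correctly predicted YES / actual YES, specificity = correctly predicted NO / actual NO, false positive rate = 1 - specificity, accuracy = correct predictions / total). The function name `classification_metrics` and its argument names are illustrative and do not come from the exam paper.

```python
def classification_metrics(tp, fn, fp, tn):
    """Return the four summary measures as fractions between 0 and 1.

    tp: actual YES predicted YES    fn: actual YES predicted NO
    fp: actual NO predicted YES     tn: actual NO predicted NO
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# Cut-off 0.1: every passenger is predicted YES, as in the table above.
metrics_01 = classification_metrics(tp=210, fn=0, fp=290, tn=0)
for name, value in metrics_01.items():
    print(f"{name}: {value:.0%}")
```

The same function can be applied to the counts for any other cut-off once the four cells of the table are known.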

a) For each of the different cut-off values 0.5, 0.7, 0.9 complete the following table and fill in all the missing ?? values

 

| | Number predicted YES | Number predicted NO | Total |
| --- | --- | --- | --- |
| Number YES’s | ?? | ?? | 210 |
| Number NO’s | ?? | ?? | 290 |

Sensitivity: ??%; Specificity: ??%; False positive rate: ??%; Accuracy: ??%

You will need to complete three tables and, in each case, give the sensitivity, specificity, false positive rate and accuracy percentages (8 marks each). [24 marks]

b)  Sketch the ROC curve for this algorithm.       [6 marks]

[Question 4 Total: 30 marks]

Solutions 

Question 1

a) If we know that 5% of the population have symptoms, what percentage of the population has the virus?  (0.05 x 0.2)+(0.95 x 0.0005) = 0.010475 = 1.0475%  [2 marks]

b) What is the probability a person with symptoms will test positive? 22%  [2 marks]

c) What is the probability a person without symptoms will test positive? 1.04%  [2 marks]

d) A person with symptoms tests positive. What is the probability they have the virus? 81.8%  [2 marks]

e) A person with symptoms tests negative. What is the probability they have the virus? 2.6% [2 marks]

f) A person without symptoms tests positive. What is the probability they have the virus? 3.8%  [2 marks]

g) A person without symptoms tests positive. Assuming that a second test is independent of the first, what is the probability they test positive in a second test? 4.04%  [4 marks]

h) A person without symptoms tests positive in both the first and second test. What is the probability they have the virus? 76.2%  [4 marks]
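The answers to a) to h) can be reproduced with the law of total probability and Bayes' theorem. The following is a minimal sketch, not the examiner's working: the helper names `p_positive` and `posterior` are made up for illustration, and parts g) and h) assume (as the question states) that the two tests are independent given the person's true virus status, so the posterior from f) serves as the prior for the second test.

```python
# Figures given in Question 1.
p_symptoms = 0.05
p_virus = {"symptoms": 1 / 5, "no_symptoms": 1 / 2000}
tpr = {"symptoms": 0.90, "no_symptoms": 0.80}  # P(positive | virus)
fpr = {"symptoms": 0.05, "no_symptoms": 0.01}  # P(positive | no virus)


def p_positive(group, prior=None):
    """P(positive test) in a group, by the law of total probability."""
    prior = p_virus[group] if prior is None else prior
    return prior * tpr[group] + (1 - prior) * fpr[group]


def posterior(group, positive, prior=None):
    """P(virus | test result) in a group, by Bayes' theorem."""
    prior = p_virus[group] if prior is None else prior
    like_virus = tpr[group] if positive else 1 - tpr[group]
    like_clear = fpr[group] if positive else 1 - fpr[group]
    return prior * like_virus / (prior * like_virus + (1 - prior) * like_clear)


a = p_symptoms * p_virus["symptoms"] + (1 - p_symptoms) * p_virus["no_symptoms"]
b = p_positive("symptoms")
c = p_positive("no_symptoms")
d = posterior("symptoms", positive=True)
e = posterior("symptoms", positive=False)
f = posterior("no_symptoms", positive=True)
g = p_positive("no_symptoms", prior=f)  # prior updated by the first positive
h = posterior("no_symptoms", positive=True, prior=f)

for label, value in zip("abcdefgh", (a, b, c, d, e, f, g, h)):
    print(f"{label}) {value:.4%}")
```

Passing the updated prior explicitly keeps the second test on the same code path as the first, which makes the sequential update in g) and h) easy to follow.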

Question 2

a) Drug A overall?  81.7%  [1 mark]

b) Drug B overall?  78.3%  [1 mark]

c) Drug A for the study participants aged < 50?  84%  [1 mark]

d) Drug B for the study participants aged < 50?  85%  [1 mark]

e) Drug A for the study participants aged 50+?  70%  [1 mark]

f) Drug B for the study participants aged 50+?  75%  [1 mark]

g) In each age subcategory Drug B was more effective than Drug A, but overall Drug A was more effective.  [2 marks]

h) Simpson’s paradox [1 mark]

i) Age is a confounder. There were fewer older people in the study, and older people were more likely to take Drug B than Drug A. [3 marks]

j) The model [6 marks]

 

k) Cut the link into node “Drug” [2 marks]

l) A: 79.3%   B: 81.7%   [4 marks]   

m)  [6 marks]
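The success rates in a) to f) and the adjusted rates in l) can be checked numerically. This is a sketch under one stated assumption: the age mix of the 900 study participants (600 aged < 50, 300 aged 50+) is taken as representative of the whole population, and each drug's age-specific success rate is weighted by that mix. The variable names are illustrative only.

```python
# Counts from Table 1 as (effective, non-effective) per drug and age group.
counts = {
    "A": {"<50": (420, 80), "50+": (70, 30)},
    "B": {"<50": (85, 15), "50+": (150, 50)},
}


def success_rate(effective, non_effective):
    """Fraction of outcomes that were effective."""
    return effective / (effective + non_effective)


# Age distribution across all participants, ignoring which drug they took.
age_totals = {
    age: sum(sum(counts[drug][age]) for drug in counts) for age in ("<50", "50+")
}
n_total = sum(age_totals.values())
p_age = {age: total / n_total for age, total in age_totals.items()}

crude, adjusted = {}, {}
for drug, by_age in counts.items():
    effective = sum(e for e, _ in by_age.values())
    non_effective = sum(x for _, x in by_age.values())
    crude[drug] = success_rate(effective, non_effective)
    # Weight each age-specific rate by that age group's share of the study.
    adjusted[drug] = sum(p_age[age] * success_rate(*by_age[age]) for age in by_age)
    print(f"Drug {drug}: crude {crude[drug]:.1%}, age-adjusted {adjusted[drug]:.1%}")
```

The crude rates favour Drug A while the age-adjusted rates favour Drug B, which is the reversal described in g) and h).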

 

Question 3  (TOTAL 20 marks)

a) People in the study

i) Tea drinkers: 9/300 = 3%; non-tea drinkers: 14/700 = 2%; so the relative risk increase is (3 - 2)/2 = 50%. [3 marks]

ii) Absolute risk increase is 3% - 2% = 1 percentage point.    [3 marks]

b) Whole population

 

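| | Sleep disorders (10,000): Tea drinkers (3,000) | Sleep disorders (10,000): Non-tea drinkers (7,000) | No sleep disorders (90,000): Tea drinkers (18,000) | No sleep disorders (90,000): Non-tea drinkers (72,000) |
| --- | --- | --- | --- | --- |
| Most severe | 90 | 140 | 0 | 0 |
| Not most severe | 2,910 | 6,860 | 18,000 | 72,000 |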

 

Whole population (100,000)

| | Tea drinkers (21,000) | Non-tea drinkers (79,000) |
| --- | --- | --- |
| Most severe | 90 | 140 |
| Not most severe | 20,910 | 78,860 |

i) 90 out of 21,000 tea drinkers (= 0.4286%) have the most severe form of sleep deprivation; 140 out of 79,000 non-tea drinkers (= 0.1772%) have the most severe form of sleep deprivation. So the relative risk increase is (0.4286 - 0.1772)/0.1772 = 142%. [5 marks]

ii) But the absolute risk increase is just 0.4286% - 0.1772% = 0.25 percentage points.     [5 marks]

c) Berkson’s or Collider paradox [2 marks]

d) (ii) is the most misleading. [2 marks]
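The risk figures in a) and b) can be recomputed as follows. This is a sketch rather than the official working. It assumes, as the solution tables do, that severe insomnia occurs only among people with sleeping disorders, and that the 3% and 2% rates from the study carry over to everyone with a sleeping disorder. The function name `risk_increase` is illustrative.

```python
def risk_increase(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Return (relative increase, absolute increase) in risk for the exposed group."""
    risk_exposed = cases_exposed / n_exposed
    risk_unexposed = cases_unexposed / n_unexposed
    absolute = risk_exposed - risk_unexposed
    return absolute / risk_unexposed, absolute


# a) The 1000-person study of people with sleeping disorders.
study_rel, study_abs = risk_increase(9, 300, 14, 700)

# b) Whole population of 100,000, the size suggested by the hint.
population = 100_000
disorder = population * 10 // 100                        # 10,000 with sleeping disorders
tea_disorder = disorder * 30 // 100                      # 3,000 tea drinkers among them
tea_no_disorder = (population - disorder) * 20 // 100    # 18,000 tea drinkers among the rest
tea_total = tea_disorder + tea_no_disorder               # 21,000
non_tea_total = population - tea_total                   # 79,000

# Assumption: severe cases arise only in the sleeping-disorder group.
severe_tea = tea_disorder * 9 // 300                     # 3% of 3,000
severe_non_tea = (disorder - tea_disorder) * 14 // 700   # 2% of 7,000

pop_rel, pop_abs = risk_increase(severe_tea, tea_total, severe_non_tea, non_tea_total)

print(f"Study:      relative {study_rel:.0%}, absolute {study_abs:.2%}")
print(f"Population: relative {pop_rel:.0%}, absolute {pop_abs:.2%}")
```

The relative increase grows when moving from the study to the whole population while the absolute increase shrinks, which is why headline (ii) is singled out in d).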

Question 4

a) The accuracy for cut-off value 0.5 is:

 

| | Number predicted YES | Number predicted NO | Total |
| --- | --- | --- | --- |
| Number YES’s | 135 | 75 | 210 |
| Number NO’s | 65 | 225 | 290 |

Sensitivity: 64%; Specificity: 78%; False positive rate: 22%; Accuracy: 72%

b) The accuracy for cut-off value 0.7 is:

 

| | Number predicted YES | Number predicted NO | Total |
| --- | --- | --- | --- |
| Number YES’s | 75 | 135 | 210 |
| Number NO’s | 15 | 275 | 290 |

Sensitivity: 36%; Specificity: 95%; False positive rate: 5%; Accuracy: 70%

c) The accuracy for cut-off value 0.9 is:

 

| | Number predicted YES | Number predicted NO | Total |
| --- | --- | --- | --- |
| Number YES’s | 0 | 210 | 210 |
| Number NO’s | 0 | 290 | 290 |

Sensitivity: 0%; Specificity: 100%; False positive rate: 0%; Accuracy: 58%

d) ROC curve for this algorithm
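An ROC curve plots sensitivity against false positive rate, with one point per cut-off. The sketch below computes those points from the test-set counts. It assumes a passenger is predicted YES when the predicted probability is at least the cut-off; none of the three probabilities equals a cut-off used here, so a strict comparison would give the same points. The names `groups` and `roc_point` are illustrative.

```python
# Test-set counts per group: (predicted probability, survived, did not survive).
groups = [
    (0.2, 75, 225),  # Male
    (0.8, 75, 15),   # Female, class 1 or 2
    (0.6, 60, 50),   # Female, class 3
]

total_yes = sum(survived for _, survived, _ in groups)  # 210
total_no = sum(died for _, _, died in groups)           # 290


def roc_point(cutoff):
    """Return (false positive rate, sensitivity) for one cut-off value."""
    tp = sum(survived for p, survived, _ in groups if p >= cutoff)
    fp = sum(died for p, _, died in groups if p >= cutoff)
    return fp / total_no, tp / total_yes


roc = {cutoff: roc_point(cutoff) for cutoff in (0.1, 0.5, 0.7, 0.9)}
for cutoff, (false_positive_rate, sensitivity) in roc.items():
    print(f"cut-off {cutoff}: FPR {false_positive_rate:.1%}, sensitivity {sensitivity:.1%}")
```

Joining the four points in order of decreasing cut-off, from (0, 0) at cut-off 0.9 to (1, 1) at cut-off 0.1, gives a piecewise-linear sketch of the curve.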