STATS 101/108 - Past exam SS23

Published: 2024-06-12

Question 1

Q1a

Top stories (articles) published on a news website during 2018 and 2022 were explored, with a focus on stories where the words ‘drink’ or ‘alcohol’ appeared in the headline.

Two different researchers (Researcher A and Researcher B) each took a random sample of 200 of these stories and created data sets with their sample data.

The researchers now each want to estimate the mean number of characters used in headlines (headline_length) for top stories published on the news website.

Use the plots above to answer the following questions.

The mean headline length for Researcher A's sample is approximately _____.

The standard deviation of headline length for Researcher B's sample is approximately _____.

The variability of the data in Researcher A's sample is _____ the variability of the data in Researcher B's sample.

If both researchers construct VIT bootstrap confidence intervals, Researcher A's will be _____ Researcher B's.

Q1b

Previous research related to headlines for news stories on this website has led to the following claim.

The mean number of characters used in all headlines for this news website is 51.6.

Another researcher (Researcher C) took a random sample of top stories where the words ‘drink’ or ‘alcohol’ appeared in the headline and used data from this sample to construct a VIT bootstrap confidence interval to estimate the mean number of characters used in headlines (headline_length).

The bootstrap confidence interval had the following limits: (51.1, 54.8).

Use this information to decide whether each of the following statements is TRUE or FALSE.

According to the VIT bootstrap confidence interval, it is plausible that the mean number of characters used in all headlines for this news website is 61.

The claim is supported by the VIT bootstrap confidence interval.

The population referred to in the claim is the same as the population the sample was taken from.

Q1c

Researcher A is interested in how often the word ‘alcohol’ appeared in the headlines for top stories from this news website.

Researcher B is interested in how often the words 'drink' and 'alcohol' appeared together.

Out of Researcher A's 200 headlines, 121 used only the word 'alcohol'.

Out of Researcher B's 200 headlines, 41 used both the words 'drink' and 'alcohol'.

If both researchers constructed VIT bootstrap confidence intervals for a proportion based on their sample data, which researcher would have a confidence interval that was more precise at estimating the relevant population proportion?
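
One rough way to compare the precision of the two intervals is the standard error of a sample proportion, sqrt(p(1 - p)/n). The sketch below uses the counts from the question and a normal-theory approximation; it is not the VIT bootstrap itself, and the function name is hypothetical.

```python
import math

def proportion_standard_error(successes: int, n: int) -> float:
    """Approximate standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    p = successes / n
    return math.sqrt(p * (1 - p) / n)

# Counts taken from the question text.
se_a = proportion_standard_error(121, 200)  # Researcher A: 'alcohol' only
se_b = proportion_standard_error(41, 200)   # Researcher B: both words

print(f"Researcher A: p = {121 / 200:.3f}, SE = {se_a:.4f}")
print(f"Researcher B: p = {41 / 200:.3f}, SE = {se_b:.4f}")
```

With equal sample sizes, the sample proportion further from 0.5 has the smaller standard error, which would usually correspond to the narrower interval.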

Q1d

Sometimes, when you read the news, it can feel like it’s all doom and gloom!

Researcher A’s sample of 200 headlines was analysed for sentiment. According to the sentiment analysis method used, 62.5% of the headlines had negative sentiment.

The same sample data was used to construct a bootstrap confidence interval for the proportion of all news headlines containing the words ‘drink’ or ‘alcohol’ that had negative sentiment.

Use the VIT output above to:

1. write an interpretation of the confidence interval, and

2. evaluate the claim that for this news website that “most headlines that include the words ‘drink’ or ‘alcohol’ in the headline are negative”.
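
The VIT output is not reproduced here, so the following is only a minimal sketch of how a percentile bootstrap interval for this proportion could be built. It assumes 125 negative headlines (62.5% of 200), 1000 resamples and a fixed seed; the function name is hypothetical, and VIT's internal procedure may differ.

```python
import random

def bootstrap_proportion_interval(successes, n, resamples=1000, seed=1):
    """Percentile bootstrap interval (middle 95%) for a proportion."""
    rng = random.Random(seed)
    sample = [1] * successes + [0] * (n - successes)
    # Resample n headlines with replacement and record each resample proportion.
    props = sorted(
        sum(rng.choices(sample, k=n)) / n for _ in range(resamples)
    )
    lower = props[resamples * 25 // 1000]        # about the 2.5th percentile
    upper = props[resamples * 975 // 1000 - 1]   # about the 97.5th percentile
    return lower, upper

# 62.5% of 200 headlines is 125 headlines with negative sentiment.
lower, upper = bootstrap_proportion_interval(125, 200)
print(f"Approximate 95% bootstrap interval: ({lower:.3f}, {upper:.3f})")
```

A claim about "most" headlines is a claim that the population proportion is above 0.5, so one reasonable check is whether the whole interval sits above 0.5.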

Q1e

Researcher C’s sample of headlines was also analysed for sentiment. The sample data was then used to construct a 95% confidence interval for the proportion of headlines that had negative sentiment, using the Confidence Interval Calculator as shown below.

Note that the sample proportion and sample size have been deliberately hidden for Researcher C.

Describe TWO ways in which the sample taken in this question is different from the sample taken by Researcher A in the previous question (Q1d), by comparing the confidence intervals constructed by Researcher A and Researcher C.

Q1f

Recall that Researcher A has decided to focus on top stories from a news website that have headlines that include the words ‘drink’ or ‘alcohol’.

Researcher A was interested in whether using a different method for sentiment analysis (Method 2) could result in an estimate for the proportion of all news headlines with a negative sentiment that was higher or lower than what is estimated using the initial method for sentiment analysis (Method 1).

The researcher used both Method 1 and Method 2 to analyse the sentiment of each headline in their sample.

The proportion of headlines that were negative according to Method 1 was then compared to the proportion of headlines that were negative according to Method 2, and the difference between the two sample proportions was calculated.

A 95% confidence interval for the difference between these two proportions for all headlines was generated using the Confidence Interval Calculator. The limits of this confidence interval are (-0.025, 0.161).

Use the information above to answer the following questions.

The point estimate used for the confidence interval was _____.

The sample proportion based on Method 1 is _____ than the sample proportion based on Method 2.

We _____ claim that the proportion of all headlines that are negative according to Method 1 is higher than the proportion of all headlines that are negative according to Method 2.

The researcher would have selected the _____ sampling situation when using the Confidence Interval Calculator to generate the confidence interval.
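
If the calculator builds its interval as estimate ± margin of error (an assumption, since the calculator output is not shown here), the point estimate and margin can be recovered from the limits. A minimal sketch:

```python
lower, upper = -0.025, 0.161  # limits given in the question

point_estimate = (lower + upper) / 2   # centre of a symmetric interval
margin_of_error = (upper - lower) / 2  # half the interval width

print(f"point estimate  = {point_estimate:.3f}")
print(f"margin of error = {margin_of_error:.3f}")
print("interval contains 0:", lower < 0 < upper)
```

Whether zero lies inside the interval is what decides if a difference between the two underlying proportions can be claimed.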

Question 2

Q2a

A study asked university students based in the US and Australia to complete a survey related to alcohol consumption.

The study reported that with 95% confidence, the underlying mean number of drinks consumed by Australian university students on a typical drinking day is somewhere between 4.8 and 6.

What is the margin of error for this confidence interval? (round your answer to one decimal place).

Q2b

The recruitment materials for the study invited university students who were aged 18 years and above and who drank alcohol to complete an anonymous online survey.

Students based in the US and Australia were recruited using:

posts on the course management sites of courses

posters placed around the campuses

students enrolled in psychology courses, who were compensated for completion using course credit

The survey included questions such as:

How often do you have a drink containing alcohol? (Never, Monthly or less, 2 to 4 times a month, 2 to 3 times a week, 4 or more times a week)

How often do you have 6 or more standard drinks on one occasion? (Never, Less than monthly, Monthly, Weekly, Daily or almost daily)

Use this information to decide whether each of the following statements is TRUE or FALSE.

The students who complete the survey will be a representative sample of all university students in the US and Australia.

According to the five main types of survey questions covered in Chapter 6, both of these questions are of the same type.

It will be hard to quantify if students are able to accurately recall how much they drink.

Q2c

The university students from the US were asked about their alcohol consumption using two different versions of a question.

In Version A, students were asked: How many drinks do you have on a typical day when drinking?

In Version B, students were asked: How many standard drinks do you have on a typical day when drinking?

Each student answered one version of the question in the survey, and the version they were asked was randomly allocated.

The study found that the mean number of drinks reported was 6.58 when the question asked about standard drinks and the mean number of drinks reported was 4.70 when the question referred to drinks.

The difference between these two means was found to be statistically significant at the 5% level.

What additional information or analysis would you need before you could determine whether or not this result has practical importance?

Q2d

The university students from Australia were also asked about their alcohol consumption using two different versions of a question (see Q2c).

The variable num_drinks is the number of drinks reported in answer to the question.

The variable question_version defines which version of the survey question the student was asked (drinks, standard_drinks).

A two sample t-test was carried out in iNZight to test for a difference between the mean number of drinks reported for Australian students, when asked about “drinks” compared to when they were asked about “standard drinks”.

The iNZight Lite output from this test is below.

Use the information above to answer the following questions.

The observed difference between the means, when comparing drinks to standard_drinks in this order, is _____ (round to three decimal places), which indicates that the mean number of drinks reported is _____ when the survey question asks about “drinks” compared to when it asks about “standard drinks”.

From carrying out the two sample t-test using the variables num_drinks and question_version, the p-value is _____ (round to 4 decimal places, or enter 0.0000 if it's very small).

This makes sense as _____ of the differences in the confidence interval are positive.

Q2e

The confidence interval from the previous question (Q2d) was:

Which of the following is the better interpretation of this confidence interval:

Option A: With 95% confidence, we estimate that the difference between the underlying mean number of drinks reported when the survey question used “drinks” rather than “standard drinks” is somewhere between -0.1 and 1.3 drinks.

Option B: With 95% confidence, we estimate that the underlying mean number of drinks reported when the survey question used “drinks” is somewhere between 0.1 drinks lower and 1.3 drinks higher than the underlying mean number of drinks reported when the survey question used “standard drinks”.

Q2f

Other variables created using the responses from the students who participated in the study include:

age_group (the age of the student in years: 18 to 20, 21 to 22, 23 to 25, 26 or older)

country (the location of each student: Australia, US)

degree_specialisation (the major or specialisation for their degree: psychology, other)

time_since_last_drink (the number of hours since the student last had a drink)

Consider what types of analysis and hypothesis tests would be appropriate to use with these variables.

Use the information provided to decide whether each of the following statements is TRUE or FALSE.

Analysis of the variable degree_specialisation would involve proportions.

Analysis of the variable age_group would involve means.

A two-sided alternative hypothesis based on these variables could be “The underlying mean time_since_last_drink is higher for students from Australia compared to students from the US”.

Q2g

Recall that the study reported that with 95% confidence, the underlying mean number of drinks consumed by Australian university students on a typical drinking day is somewhere between 4.8 and 6. Around 100 responses from Australian university students were used to construct this confidence interval.

If the study had instead involved around 400 responses, given that the sample standard deviation did not change, the 95% confidence interval for the underlying mean number of drinks consumed by Australian university students on a typical drinking day would have been:

about 1/4 as wide

about 4 times wider

about 1/2 as wide

pretty much unchanged
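
The reasoning behind this question can be sketched numerically. The example assumes the margin of error is proportional to s / sqrt(n), with the sample standard deviation fixed and the small change in the t multiplier ignored; the helper name is hypothetical.

```python
import math

def scaled_margin(margin: float, n_old: int, n_new: int) -> float:
    """Rescale a margin of error when only the sample size changes.

    Assumes the margin is proportional to s / sqrt(n), with s held fixed
    and the change in the t multiplier ignored.
    """
    return margin * math.sqrt(n_old / n_new)

lower, upper = 4.8, 6.0           # interval reported in the study
margin_100 = (upper - lower) / 2  # half-width with about 100 responses
margin_400 = scaled_margin(margin_100, 100, 400)

print(f"margin with n = 100: {margin_100:.2f}")
print(f"margin with n = 400: {margin_400:.2f}")
print(f"width ratio: {margin_400 / margin_100:.2f}")
```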

Question 3

Q3a

A study explored whether there was an effect of alcohol on perceptions of attractiveness.

Around 100 participants, who were social drinkers (consumed alcohol at social events), were randomly allocated to drink either an alcoholic or non-alcoholic drink. Both the participant and researcher were unaware of the alcohol content of the drink.

After consuming their drink, each participant was shown 20 photos of real people unknown to them. For each person, they were asked to rate their attractiveness on a seven-point scale, ranging from very unattractive to very attractive.

Use the information about the study to decide whether each of the following three statements is TRUE or FALSE.

Blinding was not used in this study.

A control group was used in this study.

A placebo was not used in this study.

Q3b

Suppose that for this study μ1 is defined as the underlying mean attractiveness rating for participants who consumed an alcoholic drink, and μ2 is defined as the underlying mean attractiveness rating for participants who consumed a non-alcoholic drink.

Which of the following would be a correct formal statement of the null hypothesis for a two sample t-test involving these parameters?

μ1 - μ2 ≠ 0

μ1 - μ2 < 0

μ1 - μ2 = 0

μ1 - μ2 > 0

Q3c

Around three years later, the researchers conducted another study related to the potential effect of alcohol on perceptions of attractiveness.

The study involved participants from two different pubs. After some time drinking at the pub, participants were asked to rate the attractiveness of male and female faces on a tablet device.

After each participant had rated the faces, the amount of alcohol in their breath was measured.

The researchers created three groups of participants for analysis, based on the amount of alcohol measured in their breath: low alcohol, medium alcohol, high alcohol.

Use the information provided above to answer the following questions.

What kind of study was this?

What is the explanatory variable?

Is the explanatory variable best described as a treatment variable or a factor of interest?

Based on the design of the study and supporting statistical evidence (e.g. an appropriate hypothesis test), could a claim be made for these participants that the amount of alcohol consumed causes changes to the attractiveness ratings given?

Q3d

The main conclusion from the first study discussed in Q3a and Q3b was “There was evidence of a drink effect (p-value = 0.031), with higher ratings of attractiveness after drinking alcohol compared to no alcohol.”

The main conclusion from the second study discussed in Q3c was “There was no evidence of a relationship between alcohol consumption and perception of attractiveness (p-value = 0.236).”

By referring to specific features of the two study designs, discuss why the two studies did not produce the same results.

Question 4

Q4a

Data about wines was scraped from a website that provides wine ratings.

Some of the variables recorded were:

category: the category of wine (red, white, other)

rating: the number of points the wine was given, with 100 being the highest rating possible. (Only wines rated between 80 & 100 points are on the website.)

fruit_description: whether fruit or fruity was mentioned in the description of the wine (fruit mentioned, fruit not mentioned)

A two sample t-test was conducted using the variables rating and fruit_description. Some of the output from iNZight Lite is given below.

Note the p-value has been deliberately hidden in the output above.

The visualisation below shows a T distribution based on df = 228.11, with the p-value for this test represented as the blue area shaded under the curve.

Which of the following probability statements corresponds to the p-value for this hypothesis test?

pr(T > 0.987) = 0.081

pr(T > 0.987) = 0.162

pr(T > 0) = 0.162

pr(T > 0) = 0.081
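
The tail areas in these options can be checked approximately. The sketch below assumes 0.987 is the t-test statistic and uses the standard normal distribution in place of the t distribution, which is a close approximation when df = 228.11. Which area is the p-value depends on the alternative hypothesis and on the shaded region in the plot, which is not reproduced here.

```python
from statistics import NormalDist

t_statistic = 0.987  # value that appears in the answer options

# With df = 228.11 the t distribution is very close to the standard normal,
# so normal tail areas are used here as an approximation.
upper_tail = 1 - NormalDist().cdf(t_statistic)
two_tailed = 2 * upper_tail

print(f"one tail (area beyond 0.987):  {upper_tail:.3f}")
print(f"two tails (beyond +/- 0.987): {two_tailed:.3f}")
```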

Q4b

A wine enthusiast was interested in exploring whether the rating of wines changes with category, using the data referred to in the previous question (Q4a).

iNZight Lite was used to create a plot with the variables rating and category and comparison intervals were added for the means.

An ANOVA F-test was also conducted using the data and the output from this is shown below.

Use the information provided above to answer the following questions.

Which group has the highest sample mean for rating?

Which group has the largest sample standard deviation for rating?

The p-value for this ANOVA F-test provides evidence that the rating of wines changes with category.

We _____ claim that red wines have the highest underlying mean rating.

Q4c

In the assignment for Chapter 9 - Variation, you carried out an ANOVA F-test to explore the research question: “Does ### depend on ###?”

Note that, in the research question above, the context of the F-test has been replaced by “###”. This is because this question is from a past semester’s exam and the assignment for Chapter 9 changes each semester.

When you completed this assignment, you were not asked to check any of the conditions for using an ANOVA F-test. Two of these conditions are independence and equal variance.

Discuss these conditions for an ANOVA F-test in the context of the data you used for your ANOVA F-test in the Chapter 9 Assignment.

Question 5

Q5a

Researchers based in the US investigated whether alcohol drinking patterns were associated with alcohol-related risks, such as drinking causing family/friend problems or issues with employment. Data was collected through face to face interviews in the homes of the participants.

Some of the variables measured based on the participant self-reporting were:

fam_friend_problems: Whether their drinking has caused any problems with their family or friends (Yes, No)

drinks_per_occasion: How many drinks they typically consumed at an occasion/event (Up to two drinks, Three or four drinks, Five or more drinks)

employment_issues: Whether their drinking had interfered with their work or job (Yes, No)

The two-way table of counts was created using the variables fam_friend_problems and drinks_per_occasion.

According to this data, how many times as likely is someone who consumes five or more drinks per occasion to have family/friend problems compared to someone who consumes up to two drinks?

(round your answer to one decimal place).
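
A "how many times as likely" comparison is a ratio of two group proportions (a relative risk). The two-way table is not reproduced here, so the counts in this sketch are hypothetical and only illustrate the calculation.

```python
# Hypothetical counts, for illustration only (not the exam's table).
# Each entry: (number with family/friend problems, group total)
counts = {
    "Up to two drinks": (20, 400),
    "Five or more drinks": (45, 300),
}

def risk(group: str) -> float:
    """Proportion of a group reporting family/friend problems."""
    with_problems, total = counts[group]
    return with_problems / total

# Ratio of the two group proportions: five or more drinks vs up to two drinks.
relative_risk = risk("Five or more drinks") / risk("Up to two drinks")
print(f"relative risk = {relative_risk:.1f}")
```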

Q5b

The researchers used the variables employment_issues and drinks_per_occasion to carry out a chi-square test for independence using iNZight Lite. Some of the output from their analysis is shown below.

Under the null, what proportion/percentage of participants who drink five or more drinks per occasion were expected to have employment issues?

_____ % (round your answer to one decimal place).
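
Under the null hypothesis of independence, every drinks_per_occasion group is expected to have the same proportion with employment issues, namely the overall proportion. The sketch below shows that calculation with a hypothetical table, since the iNZight Lite output is not reproduced here.

```python
# Hypothetical two-way table of counts (not the exam's data).
# Rows: drinks_per_occasion group; values: employment_issues (Yes, No).
table = {
    "Up to two drinks": (10, 390),
    "Three or four drinks": (15, 285),
    "Five or more drinks": (25, 275),
}

total_yes = sum(yes for yes, _ in table.values())
grand_total = sum(yes + no for yes, no in table.values())
# Proportion expected in every group if the null hypothesis is true.
overall_yes = total_yes / grand_total

for group, (yes, no) in table.items():
    group_total = yes + no
    # Expected count = row total * column total / grand total.
    expected_yes = group_total * total_yes / grand_total
    print(f"{group}: observed {yes}, expected {expected_yes:.1f} "
          f"({100 * overall_yes:.1f}% of {group_total})")
```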

Q5c

The output of the chi-square test for independence (equal distributions) is shown again below.

Use this test output to answer the following questions.

The chi-square test statistic is relatively _____, indicating a relatively _____ discrepancy between what we see in the data and what we would expect to see if the distribution of employment_issues did not depend on drinks_per_occasion.

We have evidence that employment_issues _____ independent from drinks_per_occasion.

Q5d

In the Chapter 10 notes you read about a study that used data on paediatric emergency department admissions to investigate inflatable bouncer injuries.

Two additional variables that were recorded in that study were:

hospitalisation - whether or not the child required hospitalisation to treat their fracture

fracture_site - the bone which was fractured

Write a suitable research question and null hypothesis involving the variables fracture_site and hospitalisation.

Question 6

Q6a

A common type of study related to alcohol is to compare the amount of alcohol consumed to a person’s Blood Alcohol Level (BAL) at some point after alcohol was first consumed. BAL refers to the amount of alcohol in a person's bloodstream, measured in milligrams per 100 millilitres of blood.

Two different studies in Australia and the US were conducted using university students as participants. In both studies, participants were provided with small glasses of wine (120 mL) during a 45 minute 'social' event, and were able to drink as many glasses as they wanted during this time.

Both studies collected data on the number of glasses of wine consumed (num_wines) and the BAL for each participant 60 minutes after they first started drinking.

The data collected was used to create the plots shown below. Sample A refers to the data collected from the Australia study and Sample B refers to the data collected from the US study.

Use the plots above to answer the following questions.

Sample B has a _____ linear association compared to Sample A, with a linear correlation of around _____.

The residuals/prediction errors for Sample A are generally _____ than Sample B, and the RMSE (Root Mean Square Error) for Sample A is around _____.

Q6b

In a modified version of the study, the US researchers used beer instead of wine.

The data collected was then used to create a model to predict BAL after one hour, using num_beers. A linear trend was fitted to the sample data, as shown in the plot below.

The equation of the line fitted is: BAL = 5.4 + 6.4 * num_beers

Using a linear model based on this fitted line, a person that consumed 6 beers and had an actual BAL of 41 would have a residual/prediction error of _____ (round your answer to one decimal place).
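
A residual is the observed value minus the value predicted by the fitted line. The sketch below applies that definition to the line given in the question; the function names are hypothetical.

```python
def predicted_bal(num_beers: float) -> float:
    """Fitted line from the question: BAL = 5.4 + 6.4 * num_beers."""
    return 5.4 + 6.4 * num_beers

def residual(actual_bal: float, num_beers: float) -> float:
    """Residual (prediction error) = observed value - predicted value."""
    return actual_bal - predicted_bal(num_beers)

print(f"predicted BAL for 6 beers: {predicted_bal(6):.1f}")
print(f"residual for an actual BAL of 41: {residual(41, 6):.1f}")
```

A negative residual means the observed BAL is below the fitted line.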

Q6c

The Australian researchers also carried out a modified version of the study using beer instead of wine.

The data collected was used to conduct a test for no association between BAL and num_beers using iNZight Lite.

Use this information to decide whether each of the following statements is TRUE or FALSE.

The slope of the fitted line is 10.0 (rounded to one decimal place).

At the 1% level of significance, there is evidence of an association between BAL and num_beers.

With 95% confidence, on average, every additional two beers consumed is associated with an increase in BAL of somewhere between 10.6 and 29.4 mg/100 mL.
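
The third statement rescales the slope from one extra beer to two. A minimal sketch of that rescaling follows; the iNZight Lite output is not reproduced here, so the interval limits below are hypothetical and the function name is an assumption.

```python
def scale_slope_interval(lower, upper, units):
    """Scale a confidence interval for a slope to a change of `units` in x."""
    if units <= 0:
        raise ValueError("units must be positive")
    return lower * units, upper * units

# Hypothetical 95% interval for the slope (BAL change per one extra beer).
slope_lower, slope_upper = 4.0, 9.0
two_beer_interval = scale_slope_interval(slope_lower, slope_upper, 2)
print(two_beer_interval)
```

Checking the statement then amounts to multiplying both limits of the slope interval in the actual output by two and comparing them with the limits quoted.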

Q6d

Researchers based in Aotearoa New Zealand conducted a similar study, with the purpose of developing a model to predict a person’s Blood Alcohol Level (BAL) based on the number of beers consumed.

Instead of using the number of beers as the explanatory variable, they created a new variable that took into account how much the person weighed, called drink_weight_ratio.

The drink to weight ratio is calculated by dividing the millilitres of beer consumed by body weight in kilograms e.g. 500 ml divided by 85 kg = 5.9 (rounded to one decimal place).

The scatterplot below was created using data from this study and simple linear regression was used to fit a linear model.

The prediction model developed to generate prediction intervals for BAL based on drink_weight_ratio was:

predicted BAL = -3.421 + 3.858 * drink_weight_ratio ± (2 * RMSE)

Describe a potential issue with using this model to generate prediction intervals for BAL. Discuss how this issue will affect the accuracy and/or precision of any predictions, using specific features of the data.
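
The prediction rule above can be sketched as follows. The RMSE for this model is not given in the text, so the value used is a placeholder, and the function name is hypothetical.

```python
HYPOTHETICAL_RMSE = 10.0  # placeholder; the exam's RMSE is not shown here

def prediction_interval(drink_weight_ratio, rmse=HYPOTHETICAL_RMSE):
    """Return (lower, predicted, upper) using predicted BAL +/- 2 * RMSE."""
    predicted = -3.421 + 3.858 * drink_weight_ratio
    return predicted - 2 * rmse, predicted, predicted + 2 * rmse

# 500 ml of beer for an 85 kg person, as in the question's example.
ratio = round(500 / 85, 1)
lower, predicted, upper = prediction_interval(ratio)
print(f"ratio = {ratio}: predicted BAL = {predicted:.1f}, "
      f"interval = ({lower:.1f}, {upper:.1f})")
```

Because the rule adds and subtracts the same 2 * RMSE everywhere, the interval has the same width at every drink_weight_ratio. That is only reasonable if the scatter about the fitted line is roughly constant across the plot.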