DATA1001 / ENVX1002 Foundations of Data Science / Introduction to Statistical Methods A Semester 1 Main, 2019
Published: 2023-05-15
1.(a) Consider two lab classes of the same size for DATA1001. In class A, the passing rate for Project 1 is 20% for women and 10% for men, and in class B the rates are 50% and 40% respectively.
It is claimed: “The combined passing rate across the two classes must be higher for women than it is for men.” Comment.
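One way to probe the claim in (a) is to try a concrete split of women and men in each class. The sketch below uses hypothetical head counts (100 students per class, with the women/men split chosen purely for illustration; none of these counts come from the question) and combines the passing rates stated in the question.

```r
# Hypothetical class compositions (assumed for illustration only).
# Both classes have 100 students, since the question says they are the same size.
women <- c(A = 90, B = 10)
men   <- c(A = 10, B = 90)

# Passing rates given in the question
rate_women <- c(A = 0.20, B = 0.50)
rate_men   <- c(A = 0.10, B = 0.40)

# Combined passing rate = total passes / total students, for each gender
combined_women <- sum(women * rate_women) / sum(women)
combined_men   <- sum(men * rate_men) / sum(men)

print(c(women = combined_women, men = combined_men))
```

With this split the combined rate is 23% for women and 37% for men, so the claim need not hold: the result depends on how each gender is distributed across the two classes.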
(b) Given quantitative data on air quality from 2 measuring stations (in the Central Coast and the Illawarra), without code propose one way that you could construct a clustered barchart. Sketch an example.
(c) Explain the significance of Anscombe’s Quartet.
(d) A standard pack of 52 cards has one queen of spades. The pack is shuffled, and then five cards are dealt off the top of the pack. Find the chance that the 5th card dealt is the queen of spades. Justify your answer.
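The chance in (d) can be checked by multiplying conditional probabilities: the first four cards must all miss the queen of spades, and the fifth must then be it. A minimal sketch in R, using only the numbers in the question (the variable names are illustrative):

```r
# P(first four cards are not the queen of spades):
# 51/52 * 50/51 * 49/50 * 48/49
p_first_four_miss <- prod((51:48) / (52:49))

# P(fifth card is the queen | first four missed): 1 of the 48 remaining cards
p_fifth_is_queen <- p_first_four_miss * (1 / 48)

print(p_fifth_is_queen)
print(1 / 52)
```

The product telescopes to 1/52, which agrees with the symmetry argument that every position in a shuffled pack is equally likely to hold a given card.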
(e) A company finds that on average their employees have 10 ‘sick days’ per year. They hope to reduce the number of sick days, by introducing more flexible working arrangements. They select a simple random sample of 100 employees and find after introducing the new arrangements, that those employees had on average 9 ‘sick days’ that year, with a sample SD of 5. Formulate a hypothesis and test using a box model.
pt(-2, 99)
## [1] 0.02411985
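The arguments of the pt(-2, 99) call in (e) can be traced back to the sample figures. The sketch below assumes a one-sided test of H0: mean = 10 against H1: mean < 10 (the company hopes for a reduction); that is one reasonable reading of the question, not an official solution, and the variable names are illustrative.

```r
mu0   <- 10   # average sick days per year under the null hypothesis
x_bar <- 9    # sample mean after the new arrangements
s     <- 5    # sample SD
n     <- 100  # sample size

se     <- s / sqrt(n)             # standard error of the sample mean
t_stat <- (x_bar - mu0) / se      # observed test statistic
p_val  <- pt(t_stat, df = n - 1)  # lower-tail p-value

print(t_stat)
print(p_val)
```

The test statistic is -2 on 99 degrees of freedom, and the lower-tail p-value agrees with the printed value of about 0.024.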
2. Spotify is a popular music streaming platform that allows users to listen to music on their devices.
Kahn is having a 21st party and wants to investigate what music he should play for his guests. He downloads the data set spotify from Kaggle.com, which is a public data platform that is owned by Google. The data was scraped from the Spotify API wrapper in November 2018.
dim(spotify)
## [1] 116372 17
head(spotify,2)
## artist_name track_id
## 1 YG 2RM4jf1Xa9zPgMGRDiht8O
## 2 YG 1tHDG53xJNGsItRA3vfVgs
## track_name
## 1 Big Bank feat. 2 Chainz, Big Sean, Nicki Minaj
## 2 BAND DRUM (feat. A$AP Rocky)
## acousticness danceability
## 0.743 0.846
## duration_ms energy instrumentalness key liveness loudness mode
## 1 238373 0.339 0 1 0.0812 -7.678 1
## 2 214800 0.557 0 8 0.2860 -7.259 1
## speechiness tempo time_signature valence popularity
## 1 0.409 203.927 4 0.118 44
## 2 0.457 159.009 4 0.371 10
(a) (i) How many songs are in the data set?
(ii) Outline one possible limitation with using this data.
(b) Kahn is interested in the average length of songs on Spotify.
(i) What type of variable is duration?
class(spotify$duration)
## [1] "integer"
spotify$duration = spotify$duration/(60*1000) # convert to minutes
(ii) Give 3 observations from the following summaries.
summary(spotify$duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.05338 2.73415 3.36288 3.54244 4.00448 93.50033
boxplot(spotify$duration, horizontal = T)
[Figure: horizontal boxplot of spotify$duration; axis ticks from 0 to 80]
(c) Kahn wonders whether the mode of the song (major or minor) affects the average length of songs on Spotify. What does he discover?
boxplot(spotify$duration ~ spotify$mode, horizontal =T) # 0 = minor; 1= major
[Figure: horizontal boxplots of spotify$duration by spotify$mode; axis ticks from 0 to 80]
(d) Kahn is interested in how many songs encourage people to dance. What does he discover?
hist(spotify$danceability) # 1 = most danceable
[Figure: Histogram of spotify$danceability; horizontal axis from 0.0 to 1.0]
(e) Kahn is interested in whether certain keys produce ‘happier’ songs, where a valence of 1 = happiest and 0 = saddest. What does he discover?
3. Kahn wants to draw up a playlist for the party.
(a) (i) Suggest a research question that Kahn could be investigating below. What does he discover?
[Figure: scatter plot with spotify$danceability on the horizontal axis, from 0.0 to 0.8]
cor(spotify$danceability, spotify$loudness)
## [1] 0.4192092
(ii) Kahn predicts that the loudness for a song with danceability score 0.5 is -18.31 + 14.36 × 0.5. Is the formula correct? How useful do you think this would be in practice? Why?
##
## Call:
## lm(formula = spotify$loudness ~ spotify$danceability)
##
## Coefficients:
## spotify$danceability
## 14.36
(b) The Italian word ‘vivace’ refers to a song which is played at 156-176 beats per minute.
Approximately what percentage of Spotify songs are ‘vivace’?
[Figure: Histogram of spotify$tempo; density scale on the vertical axis, horizontal axis "Tempo" from 0 to 250 with 156 and 176 marked]
(c) Assume the danceability score can be modelled by a normal curve with mean 0.6 and SD 0.2. By sketching a picture, calculate the chance of randomly selecting a Spotify song with danceability between 0.4 and 1.
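The area asked for in (c) can be checked numerically with pnorm(). This sketch takes the stated normal model (mean 0.6, SD 0.2) at face value and treats 0.4 to 1 as an interval on the modelled score; the variable names are illustrative.

```r
mu    <- 0.6  # mean of the normal model
sigma <- 0.2  # SD of the normal model

# Convert the interval endpoints to standard units
z_lower <- (0.4 - mu) / sigma  # 1 SD below the mean
z_upper <- (1.0 - mu) / sigma  # 2 SDs above the mean

# Area under the standard normal curve between the two z-scores
area <- pnorm(z_upper) - pnorm(z_lower)

print(c(z_lower = z_lower, z_upper = z_upper))
print(area)
```

The interval runs from 1 SD below the mean to 2 SDs above it, and the area comes out at roughly 0.82.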
(d) Kahn takes a random sample of 100 songs.
(i) What does Kahn discover from the following code?
set.seed(1)
library(dplyr)
spotify1 = sample_n(spotify, 100)
subset(spotify1$artist_name, spotify1$loudness == min(spotify1$loudness))
## [1] Gabriel Fauré
## 32105 Levels: _tag _XPRESSWINDOW -MASA Works DESIGEN- -ness -ToBy- ...
(ii) Explain what the following table represents, and how it could be modelled by a biased coin.
table(spotify1$mode) # 1= major; 0 = minor
##
## 0 1
## 31 69
4.(a) It is claimed that songs in a minor key sound more ‘sad’. Test this claim using the spotify1 data, where a high score of valence indicates a ‘happier’ song. Use α = 0.05.
[Figure: Valence by mode (1 = major, 0 = minor); horizontal axis from 0.0 to 0.8]
##
## Welch Two Sample t-test
##
## data: spotify1$valence by spotify1$mode
## t = 0.2404, df = 71.292, p-value = 0.8107
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.09065477 0.11551293
## sample estimates:
## mean in group 0 mean in group 1
## 0.4522581 0.4398290
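The p-value in the Welch output can be reproduced from the reported t statistic and degrees of freedom. The sketch below uses the rounded values printed in that output, so it matches only to about three decimal places; it checks the arithmetic and is not a substitute for running t.test() on the data. The variable names are illustrative.

```r
t_stat <- 0.2404   # reported t statistic
dof    <- 71.292   # reported Welch degrees of freedom
alpha  <- 0.05     # significance level given in the question

# Two-sided p-value, matching the "not equal to 0" alternative in the output
p_two_sided <- 2 * pt(-abs(t_stat), df = dof)

print(p_two_sided)
print(p_two_sided < alpha)  # TRUE would mean rejecting H0 at level alpha
```

Note that the claim in the question is directional (minor keys sound sadder), whereas the printed alternative is two-sided, so the choice of alternative should be stated when writing up the test.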
(b) It is claimed that there are an equal number of Spotify songs in each of the 12 keys. Test this claim using the spotify1 data. Use α = 0.01.
##
## Chi-squared test for given probabilities
##
## data: table(spotify1$key)
## X-squared = 29.84, df = 11, p-value = 0.001679
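The chi-squared output can be checked in the same way, using only the reported statistic and degrees of freedom. This is a sketch of the arithmetic behind the decision rule, with illustrative variable names, not a rerun of chisq.test() on the data.

```r
x_sq  <- 29.84  # reported X-squared statistic
dof   <- 11     # 12 keys - 1
alpha <- 0.01   # significance level given in the question

# Upper-tail p-value of the chi-squared distribution
p_val <- pchisq(x_sq, df = dof, lower.tail = FALSE)

# Critical value: reject H0 if the statistic exceeds this
crit <- qchisq(1 - alpha, df = dof)

print(p_val)
print(c(statistic = x_sq, critical = crit))
print(p_val < alpha)  # TRUE means the data are inconsistent with equal key frequencies
```

Comparing the p-value with α and comparing the statistic with the critical value are equivalent ways of reaching the same decision.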
(c) In your own words, explain what a p-value is, its role in hypothesis testing, and any potential issues.