
DATA1001 / ENVX1002 Foundations of Data Science / Introduction to Statistical Methods A Semester 1 Main, 2019

Published: 2023-05-15



1. (a) Consider two lab classes of the same size for DATA1001. In class A, the passing rate for Project 1 is 20% and 10% for women and men respectively, and in class B, the rates are 50% and 40% respectively.

It is claimed: “The combined passing rate across the two classes must be higher for women than it is for men.” Comment.
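One way to probe the claim is to try classes with different gender mixes. The sketch below uses hypothetical head counts (90 women and 10 men in class A, the reverse in class B). These counts are an assumption for illustration; the question gives only the four passing rates and says the classes are the same size.

```r
# Passing rates from the question (rows: class, columns: gender)
rates <- matrix(c(0.20, 0.10,
                  0.50, 0.40),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("A", "B"), c("women", "men")))

# Hypothetical head counts (an assumption): each class has 100 students
counts <- matrix(c(90, 10,
                   10, 90),
                 nrow = 2, byrow = TRUE,
                 dimnames = dimnames(rates))

# Combined rate per gender = total passes / total students
combined <- colSums(rates * counts) / colSums(counts)
print(combined)
```

With these made-up counts the combined rate is 23% for women and 37% for men, so the claim does not have to hold even though women do better within each class.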

(b) Given quantitative data on air quality from 2 measuring stations (in the Central Coast and the Illawarra), propose, without code, one way that you could construct a clustered bar chart. Sketch an example.

(c) Explain the significance of Anscombe’s Quartet.
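As a rough illustration, the quartet can be inspected with the `anscombe` data frame that ships with base R (columns `x1` to `x4` and `y1` to `y4`). The helper name `quartet_summary` is just a label chosen for this sketch.

```r
# anscombe is a built-in data set in the datasets package
data(anscombe)

# Numerical summaries for each of the four (x, y) pairs
quartet_summary <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    sd_x = sd(x), sd_y = sd(y), r = cor(x, y))
})
colnames(quartet_summary) <- paste("set", 1:4)
print(round(quartet_summary, 2))
```

The four columns of the summary are nearly identical, yet plotting each pair (for example `plot(anscombe$x1, anscombe$y1)`) gives four very different pictures.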

(d) A standard pack of 52 cards has one queen of spades. The pack is shuffled, and then five cards are dealt off the top of the pack. Find the chance that the 5th card dealt is the queen of spades. Justify your answer.
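One possible check, sketched below, multiplies the conditional chances along the deal and compares the result with a simulation. Labelling the queen of spades as card 1 and using 100,000 shuffles are arbitrary choices for this sketch.

```r
# Exact: the first four cards are not the queen of spades, the fifth is
p_exact <- (51/52) * (50/51) * (49/50) * (48/49) * (1/48)

# Simulation: where does one marked card land in a shuffled pack?
set.seed(1)
n_sims <- 100000
positions <- replicate(n_sims, which(sample(52) == 1))  # card 1 = queen of spades
p_sim <- mean(positions == 5)

print(c(exact = p_exact, simulated = p_sim, one_in_52 = 1/52))
```

The product collapses to 1/52, which matches the symmetry argument that the queen of spades is equally likely to be in any of the 52 positions.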

(e) A company finds that on average their employees have 10 ‘sick days’ per year. They hope to reduce the number of sick days by introducing more flexible working arrangements. They select a simple random sample of 100 employees and find, after introducing the new arrangements, that those employees had on average 9 ‘sick days’ that year, with a sample SD of 5. Formulate a hypothesis and test using a box model.

pt(-2, 99)

## [1] 0.02411985
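A minimal sketch of where the call above could come from, assuming a one-sided test with the null hypothesis that the average of the box is still 10 and the sample SD used as the estimate of the box SD:

```r
# Values from the question
mu0  <- 10    # average of the box under the null hypothesis
xbar <- 9     # observed sample mean
s    <- 5     # sample SD, used to estimate the SD of the box
n    <- 100   # number of draws

se_mean <- s / sqrt(n)              # SE of the sample mean
t_stat  <- (xbar - mu0) / se_mean   # observed test statistic
p_value <- pt(t_stat, df = n - 1)   # one-sided: alternative is mean < 10

print(c(se = se_mean, t = t_stat, p = p_value))
```

The test statistic works out to -2 on 99 degrees of freedom, which reproduces the p-value of about 0.024 printed above.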

2. Spotify is a popular music streaming platform that allows users to listen to music on their devices.

Kahn is having a 21st party and wants to investigate what music he should play for his guests. He downloads the data set spotify from Kaggle.com, which is a public data platform that is owned by Google. The data was scraped from the Spotify API wrapper in November 2018.

dim(spotify)

## [1] 116372 17

head(spotify, 2)

##   artist_name               track_id
## 1          YG 2RM4jf1Xa9zPgMGRDiht8O
## 2          YG 1tHDG53xJNGsItRA3vfVgs
##                                       track_name
## 1 Big Bank feat. 2 Chainz, Big Sean, Nicki Minaj
## 2                   BAND DRUM (feat. A$AP Rocky)
##   acousticness danceability
##          0.743        0.846
##   duration_ms energy instrumentalness key liveness loudness mode
## 1      238373  0.339                0   1   0.0812   -7.678    1
## 2      214800  0.557                0   8   0.2860   -7.259    1
##   speechiness   tempo time_signature valence popularity
## 1       0.409 203.927              4   0.118         44
## 2       0.457 159.009              4   0.371         10

(a) (i) How many songs are in the data set?

(ii) Outline one possible limitation with using this data.

(b) Kahn is interested in the average length of songs on Spotify.

(i) What type of variable is duration?

class(spotify$duration)

## [1] "integer"

spotify$duration = spotify$duration/(60*1000) # convert to minutes

(ii) Give 3 observations from the following summaries.

summary(spotify$duration)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##  0.05338  2.73415  3.36288  3.54244  4.00448 93.50033
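The figures in this summary can be pushed a little further by hand. The sketch below copies the printed values and applies the common 1.5 × IQR rule for flagging outliers; the variable names are chosen only for this illustration.

```r
# Values copied from the summary output above (minutes)
q1 <- 2.73415; med <- 3.36288; avg <- 3.54244; q3 <- 4.00448; mx <- 93.50033

iqr         <- q3 - q1
upper_fence <- q3 + 1.5 * iqr   # usual upper threshold for flagging outliers

print(c(IQR = iqr, upper_fence = upper_fence))
print(avg > med)          # TRUE suggests a right-skewed distribution
print(mx > upper_fence)   # TRUE means the longest song is flagged as an outlier
```

Under that rule, any song longer than about 5.9 minutes would be drawn as an outlier, and the 93.5-minute maximum sits far beyond that threshold.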

boxplot(spotify$duration, horizontal = TRUE)

[Figure: horizontal boxplot of spotify$duration; axis ticks at 0, 20, 40, 60, 80]

(c) Kahn wonders whether the mode of the song (major or minor) affects the average length of songs on Spotify. What does he discover?

boxplot(spotify$duration ~ spotify$mode, horizontal = TRUE) # 0 = minor; 1 = major

(d) Kahn is interested in how many songs encourage people to dance. What does he discover?

hist(spotify$danceability) # 1 = most danceable

[Figure: Histogram of spotify$danceability; x-axis ticks from 0.0 to 1.0]


(e) Kahn is interested in whether certain keys produce ‘happier’ songs, where a valence of 1 = happiest and 0 = saddest. What does he discover?

3. Kahn wants to draw up a playlist for the party.

(a) (i) Suggest a research question that Kahn could be investigating below. What does he discover?

[Figure: plot with spotify$danceability on the x-axis; ticks at 0.0, 0.4, 0.8]

cor(spotify$danceability, spotify$loudness)

## [1] 0.4192092

(ii) Kahn predicts that the loudness for a song with danceability score 0.5 is -18.31 + 14.36 × 0.5. Is the formula correct? How useful do you think this would be in practice? Why?

## Call:
## lm(formula = spotify$loudness ~ spotify$danceability)
##
## Coefficients:
## spotify$danceability
##                14.36
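As a hedged sketch of the arithmetic, the snippet below evaluates the regression line at a danceability of 0.5 and squares the correlation reported earlier. The slope comes from the output above; the intercept of -18.31 is an assumption taken from the formula quoted in the question, since the coefficient table above shows only the slope.

```r
# Slope from the lm() output; intercept is an assumption (see lead-in)
intercept <- -18.31
slope     <- 14.36

predicted_loudness <- intercept + slope * 0.5

# Correlation reported by cor() earlier; in simple linear regression,
# r^2 is the share of variation in loudness that the line accounts for
r         <- 0.4192092
r_squared <- r^2

print(c(prediction = predicted_loudness, r_squared = r_squared))
```

Under these assumptions the prediction is -11.13, and r² is roughly 0.18, so the line leaves most of the variation in loudness unexplained.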

(b) The Italian word ‘vivace’ refers to a song which is played at 156-176 beats per minute.

Approximately what percentage of Spotify songs are ‘vivace’?

[Figure: Histogram of spotify$tempo. x-axis: Tempo, marked at 0, 156, 176 and 250; y-axis marked at 0.000, 0.004 and 0.010]
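On a density-scale histogram, the percentage in an interval is the area of the block over it. The sketch below shows only the arithmetic: the height of 0.004 is a placeholder assumption and should be replaced by the value actually read off the histogram for the 156 to 176 block.

```r
lower  <- 156
upper  <- 176
height <- 0.004   # assumed density height of the block (placeholder)

# Area of a block = width x height = proportion of songs in that interval
proportion <- (upper - lower) * height
print(100 * proportion)   # as a percentage
```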

(c) Assume the danceability score can be modelled by a normal curve with mean 0.6 and SD 0.2. By sketching a picture, calculate the chance of randomly selecting a Spotify song with danceability between 0.4 and 1.
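A minimal sketch of the calculation, converting the end points 0.4 and 1 to standard units under the stated normal curve (mean 0.6, SD 0.2) and taking the area between them with `pnorm`:

```r
mu    <- 0.6
sigma <- 0.2

# Standard units for the two end points
z_low  <- (0.4 - mu) / sigma   # -1
z_high <- (1.0 - mu) / sigma   #  2

# Area under the normal curve between the end points
chance <- pnorm(z_high) - pnorm(z_low)
print(round(chance, 4))
```

The interval runs from 1 SD below the mean to 2 SDs above it, giving a chance of roughly 82%.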

(d) Kahn takes a random sample of 100 songs.

(i) What does Kahn discover from the following code?

set.seed(1)

library(dplyr)

spotify1 = sample_n(spotify, 100)

subset(spotify1$artist_name, spotify1$loudness == min(spotify1$loudness))

## [1] Gabriel Fauré

## 32105 Levels: _tag _XPRESSWINDOW -MASA Works DESIGEN- -ness -ToBy- ...

(ii) Explain what the following table represents, and how it could be modelled by a biased coin.

table(spotify1$mode) # 1= major; 0 = minor

## 0 1