关键词 > DATA1001/ENVX1002

DATA1001/ENVX1002 Foundations of Data Science/Introduction to Statistical Methods Semester 1 Main, 2018

发布时间:2023-05-15

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

Semester 1 Main, 2018

DATA1001/ENVX1002

Foundations of Data Science/Introduction to Statistical Methods

1. In late 2012, Transport for NSW introduced the Opal Card, which is a credit-card sized smartcard for paying for public transport in Sydney and nearby areas. A passenger taps “on”when they begin their journey, and taps“o↵”when they reach their destination.

 

Data was collected over a week in July 2016 for the following 6 variables:

• mode of public transport - bus, train, light rail and ferry.

• date - in yyyymmdd format, eg 20160730 is 30/07/2016.

• tap type - “on”and“o↵”.

• time - in 24hr time collected in 15 minute intervals.

• location - denoted by postcode and names of train stations, ferry wharves and light rail stops.

• count - the number of taps“on”or“o↵”.

opal = read .csv("data/opal .csv")

dim(opal)

## [1] 215630      6

head(opal,4)

##   mode     date tap  time  loc count

## 1  bus 20160730  on 02:30 2000   415

## 2  bus 20160730  on 02:30 2135    18

## 3  bus 20160730  on 02:30   -1    24

## 4  bus 20160730  on 02:30 2010    31

(a)   (i) Is this data a population or sample? Explain.

 

 

 

(ii)

 

 

 

Did the data collection come from a controlled experiment or an observational study? What di↵erence does that make to any conclusions drawn?

(iii) Who owns this data? Suggest one possible ethical issue.

(b)   (i)

 

 

 

How many observations/records are there?          

(ii)

What type of variable is  mode ?                    

(iii)

Why might the location of the 3rd subject be recorded as  -1 ?

 

 

 

 

(iv)

 

 

 

 

What would a value of“0”represent in the

 

 

 

 

 

 

 

 

column?

(c)   Here we focus on the counts for the tap-ons over the week.  Would the mean or

median be a better summary of centre? Explain.

##    Min . 1st Qu .  Median    Mean 3rd Qu .    Max .

##    18 .0    29 .0    50 .0   109 .4   104 .0 14396 .0

(d)   Here we focus on the tap-ons in Redfern over the week.   Make 3 comments, in

context.

Tapons in Redfern

lightrail

2.(a)   Suppose the number of people caught not having an Opal card at Redfern station

each weekday can be modelled by a Normal curve with mean = 10 and SD = 3.

Find the chance that the number of people caught on a certain day is more than 13, by showing clear working on the curve below.

 

(b)   Transport for NSW wants to model the total number of tap-ons for buses and trains

in the morning peak-hour between 5am-7:45am (a 165 minute period).

Tapons in the morning peakhour

150

Time (minutes since 5am)

(i)   What do you notice?

(ii)   Would a linear regression model be appropriate? Explain.

(iii)   If you tted a linear regression model to the tap-ons for trains, what would

the residual plot look like?

 

 

(iv)

 

 

Suppose we t a quadratic model to the tap-ons for trains. Predict how many taps occcur at 7am.

(Note: Leave your answer as an expression. Don’t evaluate.)

time2 = time^2

lm(opaltime_m$train~time + time2)

 

##

## Call:

## lm(formula = opaltime_m$train ~ time + time2)

##

## Coefficients:

## (Intercept)         time        time2

##   22515 .736       61 .633        6 .955

 

 

 

 

 

3.(a)   In your own words, explain why the Central Limit Theorem is important.

(b)   A fair coin is tossed 100 times.

(i)   Draw a box model to represent the number of heads.

(ii)   Show that the expected value and standard error of the number of heads are

50 and 5 respectively.

(iii)   By annotating the Normal curve, show that the chance of getting between 35

and 50 heads is approximately 50%.

(iv)   Is it valid to use a Normal curve here? Explain.

(c)   Choice magazine wants to investigate how often Sydney commuters are late to work.

(i)   Describe a possible survey method.

(ii)   Discuss 2 possible limitations or sources of bias.

(iii)   Propose a biased question.

4. Transport for NSW is interested in comparing the tap-ons for all 4 modes of travel.

(a)  Consider the di↵erence between tap-ons for ferry and light rail.

Note:  ferryon$count  is the tap-ons for ferry.

##

##  Welch Two Sample t-test

##

## data:  log(ferryon$count) and log(raillighton$count)

## t = 8 .3386, df = 5636 .1, p-value < 2 .2e-16

## alternative hypothesis: true difference in means is not equal to 0

## 95 percent confidence interval:

##  0 .1138784 0 .1838814

## sample estimates:

## mean of x mean of y

##  3 .886913  3 .738033

(i) Why might the test have been performed on the  log  of the data?

(ii) Perform an appropriate hypothesis test.

H:

A:

T:

P:

(b)   It is claimed that the proportions of tap-ons for bus:ferry:lightrail:train is 40:5:5:50.

Respond to this claim by using a hypothesis test.

##       bus     ferry lightrail     train

## 41 .462751  1 .645960  1 .229094 55 .662195

##

##  Chi-squared test for given probabilities

##

## data:  total_tapon_prop

## X-squared = 5 .7886, df = 3, p-value = 0 .1224

H:

A:

T:

P:

C:

(c)  Consider the di↵erence between tap-ons and tap-o↵s. What story does this output tell us?

total_tapon-total_tapoff

bus     ferry lightrail     train

32879     -7335      2249    -29138

5. You are a data scientist reporting to a client on the Opal card data. Choose your client and define the purpose of the report. Discuss the limitations of the data and 2 interest-

ing insights, using evidence from previous questions. Suggest an action for the client. Client:                                                                        

Purpose of Report:                                                                                                            

Limitations of data:

Interesting Insights:

1.

2.

Suggested Action:

This page will not be marked - it is for your working.