STATS 769
SEMESTER 2, 2019
STATISTICS
Advanced Data Science Practice
(Time allowed: TWO Hours)
INSTRUCTIONS
• Attempt ALL questions.
• Total marks are 80.
• Calculators are NOT permitted.
1. [10 marks]
The following code fits a logistic regression model and calculates the accuracy of the model.
> glmFit <- glm(am ~ qsec, mtcars, family="binomial")
> glmPred <- as.numeric(predict(glmFit, type="response") > .5)
> table(pred=glmPred, obs=mtcars$am)
obs
pred 0 1
0 17 10
1 2 3
> mean(glmPred == mtcars$am)
[1] 0.625
(a) Write R code to split the data into a training set and a test set, fit the model on the training set, and calculate the accuracy on the test set.
Do not use any add-on packages (i.e., only use base R functions). [4 marks]
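One possible base-R answer (the 2/3 split proportion and the set.seed() value are illustrative choices, not requirements):

```r
# Reproducible random split of the rows into training and test sets
set.seed(769)
n <- nrow(mtcars)
trainIndex <- sample(n, round(2/3 * n))
train <- mtcars[trainIndex, ]
test <- mtcars[-trainIndex, ]

# Fit the logistic regression on the training set only
fit <- glm(am ~ qsec, data = train, family = "binomial")

# Predict on the test set and convert probabilities to 0/1 labels
testPred <- as.numeric(predict(fit, newdata = test, type = "response") > .5)

# Accuracy: the proportion of test rows predicted correctly
mean(testPred == test$am)
```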
(b) Explain why a more flexible model like k-nearest neighbours can sometimes result in a worse test error than a less flexible model like simple logistic regression. [3 marks]
(c) Explain the advantage of using k-fold cross-validation to estimate accuracy compared to using the validation set approach above (a single training/test split). [3 marks]
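For comparison, the k-fold idea itself can be sketched in base R (10 folds on the same mtcars model; the fold count is an illustrative choice):

```r
# Assign each row of mtcars to one of 10 folds at random
set.seed(769)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# For each fold: train on the other k-1 folds, test on the held-out fold
accs <- sapply(1:k, function(i) {
    train <- mtcars[folds != i, ]
    test <- mtcars[folds == i, ]
    fit <- glm(am ~ qsec, data = train, family = "binomial")
    pred <- as.numeric(predict(fit, newdata = test, type = "response") > .5)
    mean(pred == test$am)
})

# Every row is used for testing exactly once; average over the k estimates
mean(accs)
```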
2. [10 marks]
This question relates to 11 CSV files in a directory called /course/data.austintexas.gov/. Each file contains a header line and information on 5000 electric vehicle trips.
$ ls /course/data.austintexas.gov/trips-*.csv
/course/data.austintexas.gov/trips-2018-10.csv
/course/data.austintexas.gov/trips-2018-11.csv
/course/data.austintexas.gov/trips-2018-12.csv
/course/data.austintexas.gov/trips-2018-4.csv
/course/data.austintexas.gov/trips-2018-5.csv
/course/data.austintexas.gov/trips-2018-6.csv
/course/data.austintexas.gov/trips-2018-7.csv
/course/data.austintexas.gov/trips-2018-8.csv
/course/data.austintexas.gov/trips-2018-9.csv
/course/data.austintexas.gov/trips-2019-1.csv
/course/data.austintexas.gov/trips-2019-2.csv
(a) Explain what the following shell code is doing and write down any output from the code.
$ head -1 /course/data.austintexas.gov/trips-2018-4.csv > trips.csv
$ for i in /course/data.austintexas.gov/trips-*.csv
> do
> tail -5000 $i >> trips.csv
> done
$ wc -l trips.csv [5 marks]
(b) The following output shows the first few lines of the file
/course/data.austintexas.gov/trips-2018-4.csv
(each of the CSV files has the same structure).
$ head /course/data.austintexas.gov/trips-2018-4.csv
"type","duration","distance","hour","day","month","year" "scooter",679,1363,14,2,4,2018
"scooter",120,418,7,2,4,2018
"scooter",196,915,15,2,4,2018
"scooter",100,122,19,5,4,2018
"scooter",954,2181,12,4,4,2018
"scooter",618,15,8,3,4,2018
"scooter",1164,1837,11,4,4,2018
"scooter",51,31,10,2,4,2018
"scooter",288,1154,9,2,4,2018
Write shell code to extract the first value (the type column) from every row, except the first, for all of the CSV files and store the result in a new file called trip-types.txt.
The following output shows what the file trip-types.txt should look like.
$ wc -l trip-types.txt
55000 trip-types.txt
$ head trip-types.txt
"scooter"
"scooter"
"scooter"
"scooter"
"scooter"
"scooter"
"scooter"
"scooter"
"scooter"
"scooter" [5 marks]
3. [10 marks]
(a) Given the following XML document, "pets.xml", write R code to read the XML document and select all of the month elements for which pets_adopted is more than 130, using a single XPath expression.
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <row>
    <row _uuid="00000000-0000-0000-AF9A-401551B08E58">
      <month>Jan</month>
      <pets_adopted>129</pets_adopted>
    </row>
    <row _uuid="00000000-0000-0000-F7B9-E37345BC66E7">
      <month>Mar</month>
      <pets_adopted>126</pets_adopted>
    </row>
    <row _uuid="00000000-0000-0000-ADAB-310B0A2E551C">
      <month>Feb</month>
      <pets_adopted>151</pets_adopted>
    </row>
    <row _uuid="00000000-0000-0000-D539-79AF5550719D">
      <month>Apr</month>
      <pets_adopted>128</pets_adopted>
    </row>
    <row _uuid="00000000-0000-0000-0CF1-7C7A0DE7534B">
      <month>May</month>
      <pets_adopted>143</pets_adopted>
    </row>
  </row>
</response>
The result of your code would look like this:
{xml_nodeset (2)}
[1] <month>Feb</month>
[2] <month>May</month> [5 marks]
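A sketch of one possible answer using the xml2 package (the course's usual XML tool); the XML is inlined here as a trimmed stand-in for pets.xml so the sketch is self-contained, and the key point is that a single XPath predicate does the numeric comparison:

```r
library(xml2)

# Stand-in for read_xml("pets.xml"), with a subset of the rows
doc <- read_xml('<response><row>
  <row><month>Jan</month><pets_adopted>129</pets_adopted></row>
  <row><month>Feb</month><pets_adopted>151</pets_adopted></row>
  <row><month>May</month><pets_adopted>143</pets_adopted></row>
</row></response>')

# One XPath expression: rows whose pets_adopted exceeds 130, then their month
xml_find_all(doc, "//row[pets_adopted > 130]/month")
```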
(b) A MongoDB database called starwars contains a large number of documents for all of the characters in the Star Wars universe. Each record in the database has the same structure as the file "luke.json", shown below.
{ "name": "Luke Skywalker", "height": "172", "mass": "77", "hair_color": "blond", "skin_color": "fair", "eye_color": "blue", "gender": "male", "homeworld": "https://swapi. co/api/planets/1/ ", "films": [ "https://swapi. co/api/films/2/ ", "https://swapi. co/api/films/6/ ", "https://swapi. co/api/films/3/ ", "https://swapi. co/api/films/1/ ", "https://swapi. co/api/films/7/ " ] } |
Write R code to query the starwars database and extract the name, height, and gender for the first 5 records with height greater than 170.
The output of your code would look like this:
name height gender
1 Luke Skywalker 172 male
2 Darth Vader 202 male
3 Owen Lars 178 male
4 Biggs Darklighter 183 male
5 Obi-Wan Kenobi 182 male [5 marks]
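A sketch of one possible answer using the mongolite package; the collection name and connection URL are assumptions, and because heights are stored as strings in this data, the "$gt" comparison is lexicographic (which happens to work for these three-digit values):

```r
library(mongolite)

# Connect to the starwars database (collection name and URL are assumptions)
m <- mongo(collection = "characters", db = "starwars",
           url = "mongodb://localhost")

# First 5 records with height > 170, keeping only name, height, and gender
m$find(query  = '{"height": {"$gt": "170"}}',
       fields = '{"name": 1, "height": 1, "gender": 1, "_id": 0}',
       limit  = 5)
```

This requires a running MongoDB server, so it is a sketch rather than something runnable as-is.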
4. [10 marks]
(a) Explain the following output. Why does this data frame take up approximately 16 kilobytes of storage? Why does it not take up exactly 16 kilobytes of storage?
> object.size(data.frame(x=integer(1000),
+                        y=numeric(1000),
+                        z=logical(1000)))
16896 bytes [3 marks]
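The raw data accounts for most of it: integers and logicals take 4 bytes per element in R and doubles 8, so the three columns need 1000 × (4 + 8 + 4) = 16,000 bytes, with the remainder being per-vector headers and the data frame's attributes (column names, row names, class). A quick check of the pieces:

```r
# Raw storage for the three columns: 4 + 8 + 4 bytes per row
1000 * 4 + 1000 * 8 + 1000 * 4    # 16000 bytes of data

# Each column on its own is slightly larger than its raw size,
# because every R vector carries a fixed header
object.size(integer(1000))
object.size(numeric(1000))
object.size(logical(1000))
```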
(b) Explain what “garbage collection” means in R. Why is it necessary? [3 marks]
(c) Explain the difference between the R function gc() and the function profmem() (from the profmem package). [4 marks]
5. [10 marks]
(a) This question uses a file called trips.csv containing 100,000 electric vehicle trips. The fifth column of data in the file, called Trip.Distance, records the distance of each trip (in metres).
The following code, “code A”, calculates the maximum trip distance and measures the memory used during the calculation.
> gc(reset=TRUE)
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 515919 27.6 940480 50.3 515919 27.6
Vcells 986453 7.6 49161520 375.1 986453 7.6
> trips <- read.csv("trips.csv")
> maxDist <- max(trips$Trip.Distance, na.rm=TRUE)
> maxDist
[1] 15096088
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 643492 34.4 1168576 62.5 681312 36.4
Vcells 2954524 22.6 39329216 300.1 11535163 88.1
The following code, “code B”, also calculates the maximum trip distance, but uses a different approach.
> gc(reset=TRUE)
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 643557 34.4 1168576 62.5 643557 34.4
Vcells 2955121 22.6 31463372 240.1 2955121 22.6
> f <- file("trips.csv", "r")
> header <- readLines(f, n=1)
> maxDist <- -Inf
> for (i in 1:10) {
+     tripSet <- read.csv(f, nrow=10000, header=FALSE)
+     maxDist <- max(maxDist, tripSet[,5], na.rm=TRUE)
+     gc()
+ }
> close(f)
> maxDist
[1] 15096088
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 644050 34.4 1168576 62.5 701169 37.5
Vcells 3060113 23.4 10309916 78.7 3779812 28.9
Explain what each line of code is doing, which approach (“code A” or “code B”) uses the most memory, and why one approach uses more memory. [10 marks]
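The chunked pattern of code B can be exercised on synthetic data (a temporary file is written here so the sketch is self-contained; 10 chunks of 10 rows stand in for the 10 chunks of 10,000):

```r
# Write a small synthetic trips file: header plus 100 rows,
# with the distance in the fifth column as in trips.csv
tmp <- tempfile(fileext = ".csv")
d <- data.frame(a = 1, b = 2, c = 3, e = 4,
                Trip.Distance = sample(1:100000, 100))
write.csv(d, tmp, row.names = FALSE)

# Read the file back in 10 chunks, keeping only a running maximum,
# so at most one chunk is held in memory at a time
f <- file(tmp, "r")
header <- readLines(f, n = 1)
maxDist <- -Inf
for (i in 1:10) {
    chunk <- read.csv(f, nrows = 10, header = FALSE)
    maxDist <- max(maxDist, chunk[, 5], na.rm = TRUE)
}
close(f)

# The chunked result agrees with the all-at-once result
maxDist == max(d$Trip.Distance)
```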
6. [10 marks]
(a) The following output shows the CSV files in a directory called Data.
$ ls Data/*.csv
Data/1987.csv
Data/1988.csv
Data/1989.csv
Data/1990.csv
Data/1991.csv
Data/1992.csv
Data/1993.csv
Data/1994.csv
Data/1995.csv
Data/1996.csv
Data/1997.csv
Data/1998.csv
Data/1999.csv
Explain why the “real” times in the following timings are so different.
NOTE: you do not need to know the result of the call to grep to answer this question.
$ time -p (grep LAX Data/*.csv | wc -l)
7361191
real 4.70
user 3.04
sys 3.00
$ time -p (grep LAX Data/*.csv > temp.txt && wc -l temp.txt)
7361191 /tmp/temp.txt
real 7.59
user 2.87
sys 2.48 [4 marks]
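The two plumbing styles can be compared on synthetic data; the counts agree, but the pipe version lets grep and wc run concurrently and never writes the matches to disk, while the temp-file version pays for writing and re-reading every matching line (and for running the two programs one after the other):

```shell
# Synthetic input: 100,000 lines
seq 1 100000 > nums.txt

# Style 1: a pipe -- no intermediate file, both programs run at once
grep 9 nums.txt | wc -l

# Style 2: a temporary file -- grep must finish before wc starts
grep 9 nums.txt > matches.txt && wc -l < matches.txt
```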
(b) The file "trips-2018-4.csv" contains 5,000 electric vehicle trips.
> trips <- read.csv("/course/data.austintexas.gov/trips-2018-4.csv")
The following code, “code A”, generates a vector folds that divides the rows into 10 groups of 500.
> rows <- 1:nrow(trips)
> indices <- NULL
> for (i in 1:10) {
+ index <- sample(1:length(rows), 500)
+ indices <- c(indices, list(rows[index]))
+ rows <- rows[-index]
+ }
> folds <- numeric(5000)
> for (i in 1:10) {
+ folds[indices[[i]]] <- i
+ }
The following output shows the result after the code has been run.
> head(folds)
[1] 7 1 10 1 9 4
> table(folds)
folds
1 2 3 4 5 6 7 8 9 10
500 500 500 500 500 500 500 500 500 500
The following code, “code B”, performs exactly the same calculation.
> folds <- sample(rep(1:10, length.out=nrow(trips)))
Describe at least two ways that “code B” is more efficient than “code A”. [3 marks]
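As a quick check of what code B is doing: rep() recycles the fold labels 1–10 across all the rows (500 copies of each) and sample() permutes them, so the balanced grouping comes out in one vectorised step:

```r
# One call replaces both loops of code A (5000 stands in for nrow(trips))
folds <- sample(rep(1:10, length.out = 5000))
table(folds)    # every fold contains exactly 500 rows
```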
(c) Explain why we might sometimes deliberately choose to write code that is less efficient or runs more slowly. [3 marks]
7. [10 marks]
(a) The following code reads 11 CSV files containing data on electric vehicle trips and calculates the average trip distance for each day of the week for each file.
> years <- rep(2018:2019, c(9, 2))
> months <- c(4:12, 1:2)
> filenames <- paste0("/course/data.austintexas.gov/trips-",
+                     years, "-", months, ".csv")
> flights <- lapply(filenames, read.csv)
> dayMeans <- lapply(flights,
+                    function(x) tapply(x$distance, x$day, mean))
> round(do.call(rbind, dayMeans))
0 1 2 3 4 5 6
[1,] 2753 -12872 1547 1566 1604 1814 1970
[2,] 3506 3069 2159 2112 1941 2183 2927
[3,] 2510 1754 1602 1657 1646 1269 2152
[4,] 2021 1656 1626 1510 1645 1639 2088
[5,] 1992 1542 1294 1378 1451 1459 1724
[6,] 1338 1159 1156 1067 1154 1132 1447
[7,] 1353 970 928 898 944 1268 1473
[8,] 1520 1122 1110 1330 1229 1346 1558
[9,] 1445 1494453 1438 1188 1088 1195 1424
[10,] 1673 1250 1311 1132 1300 1405 1532
[11,] 1660 1273 1200 19886 1094 1282 1509
Write R code to perform the same calculation, but using parallel execution as much as possible.
You can assume that the code will run on one of the virtual machines for this course, i.e., you have 20 cores available, but you should think about a sensible number of cores to use. [6 marks]
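One possible answer using the parallel package (part of base R): mclapply() distributes the files across worker processes. With only 11 files, more than 11 workers cannot help, so mc.cores = 11 is a sensible choice even with 20 cores available. This is a sketch; it assumes the course data files are present:

```r
library(parallel)

years <- rep(2018:2019, c(9, 2))
months <- c(4:12, 1:2)
filenames <- paste0("/course/data.austintexas.gov/trips-",
                    years, "-", months, ".csv")

# Read the 11 files in parallel, one file per worker process
flights <- mclapply(filenames, read.csv, mc.cores = 11)

# Compute the per-day means in parallel as well
dayMeans <- mclapply(flights,
                     function(x) tapply(x$distance, x$day, mean),
                     mc.cores = 11)
round(do.call(rbind, dayMeans))
```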
(b) Explain the difference between the function mclapply() and the function parLapply(). [4 marks]
8. [10 marks]
(a) The following code and output shows an R session using the sparklyr package to perform calculations with Apache Spark.
> sc <-
+     spark_connect(master = "local",
+                   spark_home="/course/spark/spark-2.1.0-bin-hadoop2.7")
> flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
> delay_result <- flights_tbl %>%
+     group_by(tailnum) %>%
+     summarise(count = n(), dist = mean(distance),
+               delay = mean(arr_delay)) %>%
+     filter(count > 20, dist < 2000, !is.na(delay))
> head(delay_result)
# Source: spark<?> [?? x 4]
tailnum count dist delay
<chr> <dbl> <dbl> <dbl>
1 N3ALAA 63 1078. 3.59
2 N657JB 285 1286. 5.03
3 N3GKAA 77 1247. 4.97
4 N562JB 315 1244. 11.0
5 N4WRAA 21 1010. 2.32
6 N563JB 274 1352. 8.70
> object.size(delay_result)
20832 bytes
> delay <- collect(delay_result)
> head(delay)
# A tibble: 6 x 4
tailnum count dist delay
<chr> <dbl> <dbl> <dbl>
1 N3ALAA 63 1078. 3.59
2 N657JB 285 1286. 5.03
3 N3GKAA 77 1247. 4.97
4 N562JB 315 1244. 11.0
5 N4WRAA 21 1010. 2.32
6 N563JB 274 1352. 8.70
> object.size(delay)
261792 bytes
Explain what the copy_to() and collect() functions are doing and why the delay_result and delay objects have different sizes. [5 marks]
(b) The following code and output shows an R session using the rmr2 package to perform calculations with Hadoop MapReduce.
> library(rmr2)
> tripCSV <-
+     make.input.format("csv", sep=",",