STATS 769 Data Science Practice SEMESTER 2, 2019
STATS 769
TERM TEST - SEMESTER 2, 2019
STATISTICS
Data Science Practice
1. [10 marks]
This question makes use of the R data frame trips, which is shown below.
> trips
      type duration distance hour day month year
1  scooter      187      308   18   0     7 2018
2  scooter      822     1828   20   4     7 2018
3  scooter      221      646   23   0     7 2018
4  scooter      299      626   20   0     7 2018
5  scooter      636     2612   11   0     7 2018
6  scooter      283      278   13   1     7 2018
7  scooter     2213     3351   16   0     7 2018
8  bicycle     2276     5601   15   6     7 2018
9  scooter      349      565   19   1     7 2018
10 scooter      758     1373   16   6     7 2018
(a) Write an R function, testError(), to perform the following steps:
(i) Randomly select one row of the data frame trips to act as a test set. The remainder of the data frame (nine rows) will act as a training set.
(ii) Fit a simple linear regression to predict duration from distance using the training set.
(iii) Use the fitted model to predict duration for the test set.
(iv) Calculate (and return) the squared difference between the prediction and the real duration in the test set.
Your function would be used like this:
> testError()
[1] 22399.27
[7 marks]
(b) Explain what the following R code is doing.
> sqrt(mean(sapply(1:100, function(i) testError())))
[3 marks]
2. [10 marks]
(a) Explain what the following shell code is doing and write down the result of running the code.
head -1 trips.csv > subset.csv
grep scooter trips.csv >> subset.csv
wc -l subset.csv
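The commands above can be tried out directly. The following is only a minimal sketch, assuming an empty scratch directory (it writes trips.csv and subset.csv there) and recreating trips.csv from the file contents listed in this question. It uses head -n 1, which is the portable spelling of head -1.

```shell
# Assumption: run in an empty scratch directory, because this
# creates (and overwrites) trips.csv and subset.csv.

# Recreate trips.csv with the contents given in the question.
cat > trips.csv <<'EOF'
"type","duration","distance","hour","day","month","year"
"scooter",187,308,18,0,7,2018
"scooter",822,1828,20,4,7,2018
"scooter",221,646,23,0,7,2018
"scooter",299,626,20,0,7,2018
"scooter",636,2612,11,0,7,2018
"scooter",283,278,13,1,7,2018
"scooter",2213,3351,16,0,7,2018
"bicycle",2276,5601,15,6,7,2018
"scooter",349,565,19,1,7,2018
"scooter",758,1373,16,6,7,2018
EOF

# Copy the header line into a new file (">" overwrites).
head -n 1 trips.csv > subset.csv

# Append (">>") every line that contains the text "scooter".
grep scooter trips.csv >> subset.csv

# Count the lines in the new file.
wc -l subset.csv
```

With this input, subset.csv should hold the header plus the nine scooter rows, so the final command should report 10 lines (the exact spacing of the wc output varies between systems).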
The contents of the CSV file "trips.csv" are shown below.
"type","duration","distance","hour","day","month","year"
"scooter",187,308,18,0,7,2018
"scooter",822,1828,20,4,7,2018
"scooter",221,646,23,0,7,2018
"scooter",299,626,20,0,7,2018
"scooter",636,2612,11,0,7,2018
"scooter",283,278,13,1,7,2018
"scooter",2213,3351,16,0,7,2018
"bicycle",2276,5601,15,6,7,2018
"scooter",349,565,19,1,7,2018
"scooter",758,1373,16,6,7,2018
(b) Explain the meaning of the following Makefile. What is the purpose of each line of code?
report.html: report.Rmd
	Rscript -e "rmarkdown::render(\"report.Rmd\")"
Describe the result of running the following shell code (assuming that the Makefile shown above is in the current directory and there is also a file report.Rmd in the current directory).
touch report.Rmd
make
make
The content of the file report.Rmd is shown below.
# A report

```{r}
mean(read.csv("trips.csv")$distance)
```
3. [10 marks]
(a) Explain the meaning of the following XQuery expression. What is the purpose of each line of code?
{
for $i in doc("pets.xml")//row/row
let $n := number($i/pets-adopted)
where $n < 200
order by $n
return $i/month
}
[5 marks]
(b) Given the following XML document, "pets.xml", write down the result of evaluating the XQuery expression above.
<?xml version="1.0" encoding="UTF-8"?>
4. [10 marks]
This question relates to the JSON file ("luke.json") shown below.
{
  "name": "Luke Skywalker",
  "height": "172",
  "mass": "77",
  "hair-color": "blond",
  "skin-color": "fair",
  "eye-color": "blue",
  "gender": "male",
  "homeworld": "https://swapi.co/api/planets/1/",
  "films": [
    "https://swapi.co/api/films/2/",
    "https://swapi.co/api/films/6/",
    "https://swapi.co/api/films/3/",
    "https://swapi.co/api/films/1/",
    "https://swapi.co/api/films/7/"
  ]
}
(a) Write down the result of the following R code.
> library(jsonlite)
> fromJSON(readLines("luke.json"))
[3 marks]
(b) A MongoDB database called starwars contains a large number of documents for all of the characters in the Star Wars universe. Each record in the database has the same structure as the file "luke.json".
Write R code to query the starwars database and extract the name, height, and mass for the first 5 records with gender equal to male.
The output of your code would look like this:
               name height mass
1    Luke Skywalker    172   77
2       Darth Vader    202  136
3         Owen Lars    178  120
4 Biggs Darklighter    183   84
5    Obi-Wan Kenobi    182   77
[7 marks]