STAT 240: Introduction to Data Science

Final Exam Spring 2021


This final exam consists of 5 questions. The last question is a choose-your-own-adventure question (you must upload only one of the two options). Aspects of this final exam must be handed in through Crowdmark. This final exam is open book and take-home, and is due Wednesday April 28th at 6:30PM PST. You may work on this final exam for any amount of time between Wednesday April 21st at 3:30PM PST and the deadline. You may access any texts, notes or lectures while completing this final exam. You may access resources on the internet provided that you don’t communicate any aspect of this final exam, distribute the final exam material in any way, or confer with other students or third parties regarding this final exam, as formalized in the honour code below. This final exam is out of 40 marks.


Honour Code

In taking this final exam you are required to affirm your willingness to abide by the course policies. By signing your name below, you affirm that you are abiding by the following honour code:

I understand that the following activities are prohibited and will be considered cheating. I agree that I will not participate in any of the following activities:

• Looking at or copying from another student’s exam or materials while writing the exam.

• Conferring with other students.

• Having someone else take the exam in your place.

• Distributing the exam materials in any way or discussing final exam materials with anyone in any form or media.

The above honour code is an undertaking for students to abide by both individually and collectively. You must uphold both the spirit and the letter of this honour code. Please sign this honour code and upload it to Crowdmark.


Signature:

Full Name:

Student Number:

Please complete all questions below and provide your solutions on Crowdmark. For questions that require code, provide all of your code. You may use any function or package in completing these questions, with no limitations.


Question 1: JSON (10 marks)

a) What does JSON stand for, in the context of databases? (1 mark)

b) Is JSON considered to be a NoSQL database format? (1 mark)

c) List a drawback of, and a benefit of, using a NoSQL database format. (2 marks)

d) List a drawback and benefit of using a relational database such as MySQL. (2 marks)

e) Which of the following files are valid JSON files? (4 marks) 


file1.json:

[

{ a : 1, b : 2, c : 3 }

]

file2.json:

{

[ a : 1, b : 2, c : 3 ]

}

file3.json:

[

{ "a" : 1, "b" : 2, "c" : 3 }

]

file4.json:

{

[ "a" : 1, "b" : "2", "c" : 3 ]

}


Question 2: JSON to SQL (5 marks)

Consider the following R code and fill in the blanks between # <<< and # >>> so that the postcondition is satisfied. Consider using the packages rjson and sqldf. (5 marks)

library(rjson)

library(sqldf)

# Precondition: 'outfile' is a valid filename string, and 'name', if it is provided, is a valid SQLite table name string. 'infile' is the filename of a JSON file containing a list. The first item of the list is a list of strings specifying column names. Subsequent items of the list are lists of numerical values. Each item of the list has the same number of elements. For example, the following file is in the format specified by this precondition:


infile.json:

[ [ "C1", "C2", "C3" ],

[ 1.1, 2.2, 3.3],

[ 4.4, 5.5, 6.6] ]

# Postcondition: 'outfile' is an SQLite database with a table faithfully representing the data in infile. In the above example this would be a table with three columns "C1", "C2", "C3" and two rows given by (1.1,2.2,3.3) and (4.4,5.5,6.6). If the parameter 'name' is provided then the name of the table is the value of 'name'. If the parameter 'name' is not provided then the name of the table is "test".


convert = function(infile, outfile, name = "test") {

    data = fromJSON(file = infile)

    # <<<


    # >>>

    db = dbConnect(SQLite(), dbname = outfile)

    dbWriteTable(conn = db,

        name = name,

        value = result,

        row.names = FALSE,

        overwrite = TRUE)

    dbDisconnect(db)  # close the connection once the table has been written

}


Question 3: Data Science (5 marks)

In 240 characters or fewer, name a field that you are interested in, and describe how data science will make an impact on that field. (5 marks)


Question 4: A Histogram (10 marks)

Consider a dataset of real values specified by the following variable in R:

x = c(0.2, 0.25, 0.1, 0.7, 0.6, 0.6, 0.61, 0.8)

Plot a histogram of the dataset such that the histogram has 4 equally sized bins spanning exactly the range of x. The bins must extend to the edges of the x-axis. The x-axis label must be x. The y-axis label must be Counts. The bins (aside from the first) must be right inclusive. (10 marks)


Question 5: Choose Your Own Adventure

Solve one of the two options below. Submit to Crowdmark in either the Option A slot or the Option B slot, according to your choice. Submit to only one of the two Option slots and leave the other Option slot blank.


Option A: Regular Expressions (10 marks)

The human genome is a sequence of around three billion “basepairs” (a long string with the letters A, T, C, G) distributed over 22 autosomal chromosomes and 2 sex chromosomes. In the X chromosome (a sex chromosome), fragile X syndrome is indicated (roughly speaking) by contiguous repeats of CGG or AGG bracketed immediately on the left by GCG and on the right by CTG (the bracketed portion must be repeats of the triplets CGG or AGG without any intervening letters). The number of such repeats of CGG or AGG is roughly correlated with the chance of having fragile X syndrome.

For example, in this sequence: “GATGCGAGGCGGCGGCGGCGGAGGCGGCTGTACA” 7 repeats of CGG or AGG are bracketed on the left by GCG and on the right by CTG. And in this sequence: “GCGAGGCTG” one repeat of CGG or AGG is bracketed on the left by GCG and on the right by CTG. And in the following three sequences: “GATGCGCTGTACA”, “TTTT”, “GCGTAGGTCGGCTG” no contiguous repeats of CGG or AGG are bracketed on the left by GCG and on the right by CTG (in the last sequence, AGG occurs between GCG and CTG but there are intervening T letters that are not part of immediately bracketed contiguous repeats of CGG or AGG).

a) Write a function named fragile in R that takes a string and uses regular expressions to return the number of contiguous repeats of CGG or AGG (with no intervening letters that are not part of contiguous repeats) that are immediately bracketed by GCG on the left and CTG on the right. Demonstrate your function on the 5 strings above (showing that you return 7, 1, 0, 0, 0, respectively) and also on a few other strings of your choosing. (5 marks)

b) Imagine you are given an R dataframe in a variable called data with columns seq and fragile. The column seq indicates a sequence for a subject (a string). The column fragile indicates whether or not the subject has fragile X syndrome (an integer: 0 = they do not have fragile X syndrome, 1 = they have fragile X syndrome). Write code to add a column to the dataframe named count. The column count should hold the output of your fragile function for each of the strings in the seq column (i.e., the number of contiguous and uninterrupted CGGs and AGGs bracketed, as described above). Then, write code to perform a linear regression with fragile as the y-variable (dependent variable) and count as the x-variable (independent variable), and write code to print out the regression coefficient (slope) for count. That is, in Y = MX + B, where Y is the fragile X indicator, X is the count, M is the regression coefficient for count, and B is a constant, print out M. Demonstrate your code on an example dataframe that you construct yourself. Hint: Use the R function lm. (4 marks)

c) For the question above, we have considered a linear regression for fragile on count. The variable fragile has range in the set {0, 1} and the variable count has range in the non-negative integers. Is this an appropriate thing to do, in order to determine the effect of the repeats on fragile X syndrome? Why or why not? (1 mark)


Option B: Rolling Linear Regression for Local Weather (10 marks)

This question pertains to predicting the weather locally and on a small timescale. We will investigate two ways of predicting the next day’s average temperature in Vancouver based on historical data. The two ways are as follows: 1) A linear fit (linear regression) to the previous three days, extrapolated to the next day (the day right after those three days). 2) A prediction that the next day’s average temperature in Vancouver is exactly the same as the average temperature in Vancouver of the immediately preceding day. We will score these predictions based on root mean squared error (RMSE). This is the L-2 norm of the difference between the predicted and actual average daily temperature values, divided by the square root of the number of values (if y is a vector, and y0 is a prediction of y, then the RMSE between y and y0 is sqrt(mean((y-y0)^2)) in R code; note that y and y0 must have the same length).
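As a minimal sketch of the RMSE formula quoted above, the following R code evaluates sqrt(mean((y-y0)^2)) on two short vectors. The values in y and y0 and the helper name rmse are hypothetical and chosen only for illustration. They are not weather data and not part of any expected solution.

```r
# Hypothetical values for illustration only (not weather data).
y  = c(10, 12, 11, 13)   # "actual" values
y0 = c(11, 12, 9, 13)    # "predicted" values

# rmse is a hypothetical helper name; it applies the formula from the text.
rmse = function(y, y0) {
    stopifnot(length(y) == length(y0))   # y and y0 must have the same length
    sqrt(mean((y - y0)^2))
}

rmse_value = rmse(y, y0)

# Same quantity written as the L-2 norm of the differences
# divided by the square root of the number of values.
rmse_norm = sqrt(sum((y - y0)^2)) / sqrt(length(y))

print(rmse_value)
print(all.equal(rmse_value, rmse_norm))
```

Here the differences are -1, 0, 2, 0, so the mean squared error is 5/4 and the RMSE is sqrt(1.25), about 1.118.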

a) Download daily climate data (Climate Daily/Forecast/Sun) from https://vancouver.weatherstats.ca/download.html including the 7-day period between Monday the 12th of April 2021 and Sunday the 18th of April 2021 (inclusive). Extract this 7-day period (and provide it in a listing) and report the mean of the average hourly temperature (avg_hourly_temperature) over all 7 days, and the standard deviation of the average hourly temperature over all 7 days. (4 marks)

b) For each day between Thursday 15th of April 2021 and Sunday 18th of April 2021 (inclusive), fit a linear regression model (using the R function lm for example) with x-values “1, 2, 3” and y-values given by the avg_hourly_temperature for the three days immediately preceding that day. Then, predict the avg_hourly_temperature for that day by extrapolating the linear regression for the x-value “4”. What are the predictions for the 4 days under question? And what is the RMSE between the predictions and the actual avg_hourly_temperature values? (2 marks)

c) Consider again the 4 days between Thursday 15th of April 2021 and Sunday 18th of April 2021 (inclusive). What if we predict the avg_hourly_temperature for each day using the avg_hourly_temperature of the previous day (i.e., predict tomorrow’s avg_hourly_temperature with today’s avg_hourly_temperature)? Compute and report the RMSE between these predictions and the actual avg_hourly_temperature values. This is the martingale assumption (this assumption is often applied in finance). (2 marks)

d) Which is better for predicting weather, according to this analysis: the predictions using the three-day rolling linear regression (b), or the predictions using the previous day’s value (c)? Provide a reason why this may be the case. (2 marks)