IRE2004 Homework 1 - Estimating Means from Samples
Figure 1: The Billy Bishop Toronto City Airport (YTZ)
Introduction
This assignment applies basic inferential statistics in R. You learned these statistical techniques in IRE1002 or a similar course: computing means and variances of a sample, calculating the confidence interval around a mean estimate, and analyzing the relationship between sample size and confidence intervals. It may be helpful to refer to your notes or the textbook from that course.
You will submit your work as an R script (a file ending in .R) on Quercus. A template script file is provided on Quercus: [studentno]_[lastname]_[firstinitial]_hw1.R. Open this file in RStudio. It will appear in your ‘Source’ pane. Fill in your answers to the exercises in the assigned sections. Then save the file, using your own student number and name, and submit that file as your completed homework.
1 Computing the mean and variance of a sample
The following seven temperature readings were taken from the Billy Bishop Toronto City Airport weather station at 1 P.M. for the first seven days of September 2018: 20, 22, 23, 23, 24, 22, and 20 degrees C.
For the following exercises, you can only use the following operators and functions in R.
● the assignment operator to save things to memory: = or <-
● math operators like +, -, *, /, and ^
● these functions: c() and sum()
1.1 Calculate the mean by hand.
Use = (or <-) and c() to save the temperature readings in a vector named temps. Then compute the sample mean by hand. (Don’t use R’s built-in mean() function.) The sample mean ($\bar{x}$) is the sum of all observations divided by the count of observations:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
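As a minimal sketch of this formula, here is the same calculation on a made-up vector. The names x_demo, n_demo, and x_demo_mean are hypothetical and the values are not the assignment's readings; only the permitted operators and functions are used.

```r
# Hypothetical data for illustration only (not the assignment's readings)
x_demo = c(2, 4, 6, 8)

# Count of observations, entered by hand
n_demo = 4

# Sample mean: sum of all observations divided by the count
x_demo_mean = sum(x_demo) / n_demo
x_demo_mean  # 5
```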
1.2 Calculate the sample variance of temps by hand.
The sample variance formula is the sum of squared differences between each observation and the mean (calculated above), divided by the total number of observations minus 1:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Step 1. Store the sample mean in a new variable temps_mean
Step 2. Calculate a vector of differences $(x_i - \bar{x})$ and store in temps_diffs
Hint: if you tell R to subtract a number from a vector, it will subtract that number from each element of the vector.
Step 3. Calculate the vector of squared differences, store in temps_diffs_squared
Hint: if you tell R to square a vector, it will square each element of that vector.
Step 4. Calculate the sum of squared differences, store in temps_ssd
Step 5. Divide the sum of squared differences by $(n - 1)$ to obtain the sample variance estimate. Store in temps_var_byhand
What is the sample variance of the week of temperature observations?
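The two hints above rely on R applying arithmetic to every element of a vector. A small sketch on made-up data shows the pattern; all names ending in _demo are hypothetical and the values are not the assignment's readings.

```r
# Hypothetical data for illustration only
x_demo = c(2, 4, 6, 8)
n_demo = 4
x_demo_mean = sum(x_demo) / n_demo       # 5

# Subtracting a single number from a vector subtracts it from each element
x_demo_diffs = x_demo - x_demo_mean      # -3 -1  1  3

# Squaring a vector squares each element
x_demo_diffs_squared = x_demo_diffs ^ 2  # 9 1 1 9

# Sum of squared differences, then divide by (n - 1)
x_demo_ssd = sum(x_demo_diffs_squared)   # 20
x_demo_var = x_demo_ssd / (n_demo - 1)   # 6.666667
```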
1.3 Calculate sample variance by hand in one line of code.
You can do all five steps above in just one line of R code. Write one line of R code that calculates the sample
variance using only temps, sum(), and the operators listed above. (Hint: use brackets to control the order of operations.)
Thankfully, we don’t always have to compute these sample statistics by hand. R has built-in functions mean() and var(). You can check your work above using R’s built-in functions.
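One possible way to run that check, sketched on made-up data (the _demo and _byhand names are hypothetical), is to compare the by-hand results against mean() and var() with all.equal():

```r
# Hypothetical data for illustration only
x_demo = c(2, 4, 6, 8)
n_demo = 4

# By-hand results using only sum() and math operators
mean_byhand = sum(x_demo) / n_demo
var_byhand = sum((x_demo - mean_byhand) ^ 2) / (n_demo - 1)

# Compare against R's built-in functions
all.equal(mean_byhand, mean(x_demo))  # TRUE
all.equal(var_byhand, var(x_demo))    # TRUE
```

all.equal() is used here rather than == because it tolerates tiny floating-point differences between two routes to the same number.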
2 The precision of mean estimates: confidence intervals
How warm was it on average at Billy Bishop Airport in September 2018? One way to answer this question is to draw a random sample of temperature readings from that month and calculate the mean. However, the precision of that estimate depends on the size of your sample: the bigger your sample, the more precise your mean estimate will be.
Here we take a random sample of temperature readings from that month and estimate both the mean temperature and the uncertainty of our estimate. One way to describe that uncertainty is with a confidence interval.
2.1 Read September temperatures sample into R
The file sample_temps_sep.csv on Quercus contains a random sample of all hourly temperature readings
from the month of September 2018. You can read this data into R by placing the file in your working directory
and using the read_csv() function, which is part of the package tidyverse. The Tidyverse supplies many of the tools we will use in this course. You should install this package on your system. You only need to install it once. In the future, you can load it into memory using library().
install.packages("tidyverse") # You only need to do this once, ever
library(tidyverse) # Do this every new R session
After executing the above, use read_csv() to read the CSV into a variable named sep. Note the underscore (not period) in read_csv().
2.2 Summarize the temperature data: sample size, mean, and variance
Type sep into the console to see a summary of the data. There are two columns in this dataset:
● sep$datetime - the date and time of the temperature reading
● sep$temp - the temperature reading in Celsius
You can access columns in a dataset using the $ operator, as shown above. Store the temperature variable from the September sample in a new vector: temps_sep. Then compute three things:
● Sample size: How many observations are in the data?
● Sample mean
● Sample variance
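The sketch below illustrates column access with $ on a small made-up data frame that borrows the same two column names. The object demo and its values are hypothetical. It is built with base R's data.frame() so the example runs without any files, and $ is used the same way on the object returned by read_csv().

```r
# Hypothetical stand-in dataset with the same column names as sep
demo = data.frame(
  datetime = as.POSIXct(c("2018-09-01 13:00", "2018-09-02 13:00",
                          "2018-09-03 13:00"), tz = "UTC"),
  temp = c(20, 22, 23)
)

# Pull one column out as a plain vector
temps_demo = demo$temp

length(temps_demo)  # sample size: 3
mean(temps_demo)    # sample mean: 21.66667
var(temps_demo)     # sample variance: 2.333333
```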
2.3 Estimating uncertainty: the confidence interval around your mean estimate
In the previous step, you estimated the mean temperature in September 2018, based on this sample. How precise is that estimate? This problem walks you through a confidence interval calculation using the t distribution.
Step 1. Compute the standard error of the mean estimate.
Above we calculated both the mean ($\bar{x}$) and variance ($s^2$) of a sample. The standard error of a mean estimate is given by the following formula:
$$s_{\bar{x}} = \frac{\sqrt{s^2}}{\sqrt{n}}$$
... where $s^2$ is the sample variance and $n$ is the number of observations in the sample.
Compute the standard error of the mean estimate of sample temperatures in September, store in se_sep. (You can compute square roots in R using the function sqrt().)
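A minimal sketch of the standard error calculation on made-up readings (temps_demo, n_demo, and se_demo are hypothetical names, and the values are not from the September sample):

```r
# Hypothetical readings for illustration only
temps_demo = c(20, 22, 23, 21, 24)
n_demo = length(temps_demo)

# Standard error: square root of (sample variance / sample size)
se_demo = sqrt(var(temps_demo) / n_demo)
se_demo  # 0.7071068

# Equivalent form: sample standard deviation divided by sqrt(n)
all.equal(se_demo, sd(temps_demo) / sqrt(n_demo))  # TRUE
```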
Step 2. Combine the standard error and the t distribution to compute the 95% confidence interval
We will generally use the Student t distribution to put confidence intervals around mean estimates. You can
look up the upper and lower bounds of the t distribution using the qt() function.
qt(0.975, df = length(temps_sep) - 1)
## [1] 2.009575
This shows us how many standard deviations we need to move down the t distribution until there is only 2.5% of probability density remaining to the right. If we set this as the bound in both directions, there is exactly
2.5% on the left + 2.5% on the right = 5% of probability outside our range. The remaining probability between these bounds will be 95%. These bounds define our 95% confidence interval.
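The symmetry described above can be checked directly. This sketch assumes 49 degrees of freedom (a sample of 50 observations); df_demo, upper, and lower are hypothetical names.

```r
# Assumed degrees of freedom: a sample of 50 observations minus 1
df_demo = 49

# Bounds that leave 2.5% of probability in each tail
upper = qt(0.975, df = df_demo)  #  2.009575
lower = qt(0.025, df = df_demo)  # -2.009575

# Probability between the two bounds, using the t cumulative distribution
pt(upper, df = df_demo) - pt(lower, df = df_demo)  # 0.95
```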
We multiply the standard error of the mean computed in Step 1 ($s_{\bar{x}}$) by 2.0095752 to obtain the distance from the mean to the upper and
lower bounds of the 95% confidence interval. Calculate this and store in error_sep.
The upper bound of the 95% confidence interval is the sample mean plus the distance calculated above. The lower bound is the sample mean minus this. What are the bounds of the 95% confidence interval of our
estimate of the mean temperature in September 2018?
Again, you can check your work by comparing the confidence interval computed above to one of R’s built-in functions: t.test(temps_sep)
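As one way to make that comparison, the sketch below builds a 95% confidence interval by hand on made-up readings and compares it with the interval stored in the result of t.test(). All names ending in _demo, _byhand, or _builtin are hypothetical.

```r
# Hypothetical readings for illustration only
temps_demo = c(20, 22, 23, 21, 24)
n_demo = length(temps_demo)

mean_demo = mean(temps_demo)
se_demo = sqrt(var(temps_demo) / n_demo)

# Distance from the mean to each bound of the 95% interval
error_demo = qt(0.975, df = n_demo - 1) * se_demo

# Lower and upper bounds computed by hand
ci_byhand = c(mean_demo - error_demo, mean_demo + error_demo)

# t.test() stores its interval in the conf.int element of its result
ci_builtin = t.test(temps_demo)$conf.int

all.equal(ci_byhand, as.numeric(ci_builtin))  # TRUE
```

as.numeric() drops the confidence-level attribute that t.test() attaches, so only the two bounds are compared.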
3 Increasing sample size to increase precision of mean estimates
One way to generate a more precise estimate of the mean temperature in September 2018 would be to increase the size of our random sample. In this problem, we set a precision goal and draw a new sample that will generate a mean estimate of the desired precision.
3.1 Calculate the needed sample size
What sample size would we need to reduce the confidence interval to ±0.5 degrees Celsius? Using the equation defining the confidence interval above, work backwards to calculate the sample size n. Assume that:
● As you increase sample size, the sample variance ($s^2$) remains the same as you calculated in the previous problem.
● Use the same approximation of the 95% bounds of the t distribution as calculated above using qt().
You will need to do some algebra to rearrange the confidence interval equation to solve first for a target standard error $s_{\bar{x}}$, and then use this to solve for a target sample size (n).
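One way to organize the algebra is sketched below. The symbols $E$ (the target distance from the mean to each bound) and $t^{*}$ (the bound from qt()) are notation introduced here for illustration, not taken from the assignment.

```latex
% Distance from the mean to each bound of the interval
E = t^{*} \, s_{\bar{x}}
\quad\Longrightarrow\quad
s_{\bar{x}} = \frac{E}{t^{*}}

% Substitute the standard error formula and solve for n
s_{\bar{x}} = \frac{\sqrt{s^2}}{\sqrt{n}}
\quad\Longrightarrow\quad
n = \frac{s^2}{s_{\bar{x}}^{2}} = \frac{(t^{*})^{2} \, s^2}{E^{2}}
```

Because a sample size must be a whole number, the result would normally be rounded up.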
3.2 Draw and analyze a new random sample
Draw a sample of that size from the population of September temperature measurements. The file
all_temps_sep.csv on Quercus contains all the temperature measurements from the Billy Bishop monitoring station from Sep 2018. Read it using read_csv() as shown below. Then use sample() to draw a random sample of those measurements, replacing ??? below with your target sample size for obtaining the new confidence interval, calculated in the previous step.
all_temps_sep = read_csv("all_temps_sep.csv")
temps_sep_big = sample(all_temps_sep$temp, size = ???)
After drawing your random sample, calculate the sample mean, standard error of the mean, and confidence interval by hand. You can follow similar steps to those in the previous problem. Compare your confidence interval to that generated by t.test() to confirm you did it correctly. (Note that because the code above draws a random sample, your results will vary each time you run your code. Each classmate will also get slightly different results.)
3.3 Did you hit your confidence interval target?
The goal was to generate a mean estimate that had a confidence interval of ±0.50 degrees Celsius. Did you achieve this level of precision? If not, please explain the possible reasons here. (There are good reasons you might not hit your goal precision.)
Notes
The notes section contains additional information about the problem set.
Controlling random samples
Computers generally draw samples that are only pseudo-random. They use a “seed” number that varies over time (e.g. the date-time) and then feed that seed to an equation that produces an as-good-as-random result.
This allows us to control a random sample to produce the same result every time, which is useful if you don’t
want your sample to change every time you run your code (for example, while working through Problem 3). To ensure that your call to sample() produces the same sample every time, use set.seed():
set.seed([any number here])
myvec = sample(...)
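A small sketch of this behaviour follows. The seed value 42 and the names first_draw and second_draw are arbitrary choices for illustration.

```r
# Setting the same seed before each call reproduces the same sample
set.seed(42)  # 42 is an arbitrary seed chosen for illustration
first_draw = sample(1:100, size = 5)

set.seed(42)
second_draw = sample(1:100, size = 5)

identical(first_draw, second_draw)  # TRUE
```

Calling sample() again without resetting the seed continues the random sequence, so a later draw is not tied to the earlier ones.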
Weather data
The weather data in this problem set was obtained from the Billy Bishop weather station using the R package
riem: https://ropensci.github.io/riem/articles/riem_package.html. The following code shows how the data for this problem set was obtained, cleaned, and sampled.
install.packages("riem")
library(riem)
library(tidyverse)
weathernet = riem_networks()
View(weathernet) # look for weather networks in Canada
on.stations = riem_stations("CA_ON_ASOS") # Ontario network
View(on.stations) # look for Toronto weather stations
# get Billy Bishop Airport (YTZ) 2018 weather data (obtained 12/22/2018)
weather.ytz = riem_measures("CYTZ", date_start = "2018-01-01", date_end = "2018-12-21")
# relevant data:
# valid - date
# tmpf - temp in Fahrenheit
# Add Celsius temps - tmpc
weather.ytz$tmpc = (weather.ytz$tmpf - 32) * 5/9
# Problem 1 Data: 1 PM measures from first 7 days of Sep 2018.
# Filter: only observations from first 7 days of Sep 2018
d.week = filter(weather.ytz,
                valid >= as.Date("2018-09-01") &
                valid < as.Date("2018-09-08"))
# Filter: only the observations taken at 1 PM (1300 hours)
# grepl() does pattern matching, see ?grepl
d.week.1pm = filter(d.week, grepl(" 13:00", d.week$valid))
# temperatures in Celsius from first seven days of Sep 2018
print(d.week.1pm$tmpc)
# Problems 2 and 3.
# All September readings for Problem 3
d.sep = filter(weather.ytz,
               valid >= as.Date("2018-09-01") &
               valid < as.Date("2018-10-01"))
d.sep = d.sep[, c("valid", "tmpc")] # keep only the date and temperature
d.sep = rename(d.sep, datetime = valid, temp = tmpc)
write_csv(d.sep, "all_temps_sep.csv") # save to file
# Sample of September temperature readings for Problem 2
set.seed(0716) # needed to generate same random sample each time
d.sep.sample = sample_n(d.sep, 50)
# d.sep.sample
write_csv(d.sep.sample, "sample_temps_sep.csv")
2023-01-18