Big Data Analytics: Assignment 1 – Hurricane Sandy and Flickr



The exercise is designed to walk you through key parts of the technical work you will do for your final project, so you can make sure you understand the programming concepts you need now, rather than during your final project. The tasks build on what you have learnt in the R seminars, and you can use and edit the code you wrote there to help you solve this assignment.

Please note that later parts of this assignment build on earlier parts, so make sure you get started straight away.

Along with this description of the assignment, we have made two data files available for you: the mysteriously named “3922327258060dat.txt” and “3922327258060doc.txt”. See Part 3 to find out what these are.

In addition, we have provided an R file called “Assignment1Answers.R”. To complete this assignment, you should edit this file and submit it back to us for assessment. You should not change the name of the file, or submit any other files for assessment. We have marked an area at the top of the file where you can write your student number.

Each task involves you editing a function that is already in the R file. Only edit between the marked lines: do not change the name of the function or the arguments to the function. We need to be able to find functions with these exact names and arguments when we mark your code to give you credit for your work.

At the top of the file, you will find an area marked for editing where you can add library() calls to load any packages needed for your solution to work. Please do not add install.packages() calls to your code. We will make sure that we have installed all the packages that your code loads.

At the bottom of the R file, there is also some code that you should not edit (highlighted as such). Please do not change the code at the bottom of the file to solve the assignment.

These points are important. If you do make changes where we have indicated that you should not, you will not get credit for your efforts and your code may not work as you expect when we test it.

You also need to make sure that there are no errors when you type the following into the R console:


If this command produces errors, we will not be able to mark your answers for this assignment at all.

With all of those warnings out of the way, we hope this helps you develop and test your skills. Have fun!

Getting started

To start this exercise, load the assignment answers file like this:


You should not see any errors or warnings when you do this.

Goal of your investigation

Humans around the world are uploading increasing amounts of information to social media services such as Twitter, Flickr and Instagram. To what extent can we exploit this information during catastrophic events such as natural disasters, to gather data about changes to our world at a time when good decisions must be reached quickly and effectively?

The subject of your current investigation is Hurricane Sandy, a hurricane that devastated portions of the Caribbean and the Mid-Atlantic and Northeastern United States during late October 2012. As a hurricane approaches, air pressure drops sharply. Your goal is to begin an investigation into whether a relationship might exist between the progression of Hurricane Sandy, as measured by air pressure, and user behaviour on the photo-sharing site Flickr.

If there were a simple relationship between changes in air pressure, and changes in photos taken and then uploaded to Flickr, then perhaps further investigation of these social media data would give insight into problems resulting from a hurricane that are harder to measure using environmental sensors alone. This might include the existence of burst pipes, fires, collapsed trees or damaged property. Such information could be of interest both to policy makers charged with emergency crisis management, and insurance companies too.

Part 1: Acquiring the Flickr data (3%)

Hurricane Sandy, classified as the eighteenth named storm and tenth hurricane of the 2012 Atlantic hurricane season, made landfall near Atlantic City, New Jersey at 00:00 Coordinated Universal Time (UTC) on 30 October 2012.

You have decided to have a look at how Flickr users behaved around this date, from the beginning of 29 October 2012, to the end of 1 November 2012. In particular, you are going to look at data on photos uploaded to Flickr with the text “hurricane sandy”. When were photos with this text taken?

In this task, you want to write code which would let you download hourly counts from Flickr of the number of photos taken and uploaded to Flickr with labels which include the text “hurricane sandy”, for the period 29 October 2012 00:00 to 01 November 2012 23:59.

To solve the first problem of downloading the hourly photo counts, let's break it down into a few sub-problems:

A)  Working out what the URL would be for the first JSON page for one hour

B)  Using this URL to retrieve data from Flickr on how many photos were taken in a given hour with the specified text

C)  Writing some code to download data for all of the hours you are interested in, using the code from step B

1A. Building a URL to get one hour's data on Flickr (1%)

First, we want to work out what URL we need to use to download one JSON page of data on Flickr photos with the text “hurricane sandy” which were taken in a given hour.

To do this, edit the function buildFlickrURLOneHour. This function has two arguments:

-    startOfHour: a POSIXct date and time object, to specify the beginning of the hour you would like data for, and

-    text: the text which should be attached to the photograph.

You can see an example of the kind of date and time that would be passed to startOfHour by typing

testStartShort

into the R console. This should print

[1] "2012-10-29 01:00:00 UTC"

This is an example date and time which we have defined at the bottom of the file – see if you can find it. It is a POSIXct date and time object so, unlike Date objects, it can store the time too. Don't change the code at the bottom of the file! You could, however, create your own test dates too.

You need to change the function to take those two arguments and create the URL which you would use to download one JSON page of data on Flickr photos in an hour, starting at the time specified by startOfHour, with the text attached as specified by text.

You can test your function by typing in:

buildFlickrURLOneHour(startOfHour=testStartShort, text="hurricane sandy")

Importantly however, note that the function should work for other times and text values too (indeed, we will test this).

Hint 1: You wrote a function that was very similar to this in Week 3's R seminar and the exercises. You do need to change that code, however. What is the key difference between what this function is doing and what the relevant function in the exercise was doing? What might you want to change?

Hint 2: Make sure you are not using a temporary API key, as this will not work when we mark your code. If your API key is correct (and not temporary), you should be able to see it at this URL:


If no keys show up at this URL, go back to page 2 of Week 3's R seminar exercise and follow the instructions to sign up for an API key.

Hint 3: In downloading data, you only need to be concerned about min_taken_date and max_taken_date - you can ignore min_upload_date and max_upload_date.

Hint 4: The time taken on a photo is in the photographer's local time. For the purposes of this exercise, don't worry about time zones; just use the times that Flickr specifies.
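
As a rough illustration of the date handling involved (not a solution to the task), the sketch below shows one way to turn a POSIXct hour start into strings for the start and end of a one-hour window, and to encode a search text so that it is safe to place in a URL. The helper name hourWindowStrings and the "YYYY-MM-DD HH:MM:SS" layout are assumptions made for this example; check the Flickr API documentation for the date formats it accepts.

```r
# Hypothetical helper (not part of Assignment1Answers.R): returns the first
# and last second of the hour beginning at startOfHour as character strings.
hourWindowStrings <- function(startOfHour) {
  endOfHour <- startOfHour + 3599  # adding a number to POSIXct adds seconds
  fmt <- "%Y-%m-%d %H:%M:%S"
  c(start = format(startOfHour, fmt, tz = "UTC"),
    end   = format(endOfHour, fmt, tz = "UTC"))
}

exampleStart <- as.POSIXct("2012-10-29 01:00:00", tz = "UTC")
window <- hourWindowStrings(exampleStart)
print(window)

# Spaces and other special characters must be encoded before they go in a URL.
encodedText <- utils::URLencode("hurricane sandy", reserved = TRUE)
print(encodedText)
```

The same encoding step would also apply to any date strings placed in a URL, since they contain spaces and colons.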

1B. Downloading one hour of data from Flickr and counting photos (1%)

Excellent: you now have a URL for one hour's data. Next, you need to work out how to get that data into R and extract the photo count.

To do this, edit the function downloadFlickrDataOneHour. Again, this function has two arguments:

-    startOfHour: a POSIXct date and time object, to specify the beginning of the hour you would like data for, and

-    text: the text which should be attached to the photograph.

You need to change the function to take those two arguments and download one JSON page of data on Flickr photos in an hour, starting at the time specified by startOfHour, with the text attached as specified by text. You then need to work out the total number of photos that were taken in that hour, from the one page which has been returned. You can use your buildFlickrURLOneHour to create the URL you need.

To gain credit for your work on this function, you need to return a data frame with one row and two columns. The first column should be called Date, and the second column should be called PhotoCount. The columns must be in this order.

For the one row of data you create:

-    the Date column should contain a POSIXct date and time object which specifies the beginning of the hour you have downloaded data for, and

-    the PhotoCount column should contain a count of the total number of photos which were taken in that hour and uploaded to Flickr with the text specified by text. This count will be a number, and so data in this column should be numeric.
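
The shape of the required return value can be sketched as follows. This is only an illustration with invented values: in the real task the count comes from the downloaded JSON page, where it may arrive as text and need converting. The variable names are assumptions made for this example.

```r
# Stand-in values: in the real task the count comes from the downloaded JSON.
exampleHour  <- as.POSIXct("2012-10-29 01:00:00", tz = "UTC")
exampleTotal <- "42"   # invented value; counts parsed from JSON may be text

# One row, two columns, with the count converted to a number.
oneHour <- data.frame(Date = exampleHour,
                      PhotoCount = as.numeric(exampleTotal))
str(oneHour)
```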

You can test this function by typing in:


downloadFlickrDataOneHour(startOfHour=testStartShort,
                          text="hurricane sandy")

but again, note that it should work for other times and text values too.

Hint: Remember that you learnt in Week 3’s R seminar that the data from the Flickr API can be a bit noisy. If you get slightly different numbers on running this command multiple times, don’t worry. See the note at the bottom of page 5 in the Week 3 R seminar.

1C. Working out photo counts for multiple hours of data from Flickr (1%)

Brilliant – you know how to work out the number of photos taken in one hour. Now you want to write some code that can work out the number of photos taken in each of a sequence of hours, and put it all together in a data frame for you.

To do this, edit the function downloadFlickrDataMultipleHours. This function has three arguments:

-    minHourStart: a POSIXct date and time object, to specify the beginning of the first hour you would like to download data for,

-    maxHourEnd: a POSIXct date and time object, to specify the end of the last hour you would like to download data for, and

-    text: the text which should be attached to the photographs.

You need to change the function to take those three arguments, and get counts of the number of Flickr photos with the text specified by text that were taken in each of the hours between minHourStart and maxHourEnd. You can use your downloadFlickrDataOneHour function to get the count of photos for each hour you specify.

To gain credit for your work on this function, you need to return a data frame with a row for each hour and two columns. The first column should be called Date, and the second column should be called PhotoCount. The columns must be in this order. For each row:

-    the Date column should contain a POSIXct date and time object which specifies the beginning of the hour you have downloaded data for, and

-    the PhotoCount column should contain a count of the total number of photos which were taken in that hour and uploaded to Flickr with the text specified by text. Note that these counts will be numbers, and so data in this column should be numeric (not text).

The data frame should be ordered such that the first row has the earliest time, and the last row has the latest time.
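
One possible pattern for looping over a sequence of hours is sketched below, using a made-up stand-in function rather than a real download so that it runs offline. The names fakeCountForHour, firstHourStart and lastHourEnd are hypothetical and exist only for this example.

```r
# Hypothetical stand-in for a per-hour download: returns a one-row data frame.
fakeCountForHour <- function(startOfHour) {
  data.frame(Date = startOfHour, PhotoCount = 10)  # invented count
}

firstHourStart <- as.POSIXct("2012-10-29 01:00:00", tz = "UTC")
lastHourEnd    <- as.POSIXct("2012-10-29 04:59:59", tz = "UTC")

# One element per hour start: 01:00, 02:00, 03:00 and 04:00.
hourStarts <- seq(from = firstHourStart, to = lastHourEnd, by = "hour")

# Index by position so each element keeps its POSIXct class.
rowsPerHour <- lapply(seq_along(hourStarts),
                      function(i) fakeCountForHour(hourStarts[i]))

# Stack the one-row data frames in order, earliest hour first.
allHours <- do.call(rbind, rowsPerHour)
print(allHours)
```

Because seq() generates the hour starts in increasing order, the stacked rows are already sorted from earliest to latest.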

To save time in testing, let's test this function on a short period of time to start with – for example, 01:00:00 UTC to 04:59:59 UTC on 29 October 2012.

We already have the first time saved in testStartShort. At the bottom of the file, we have saved the second time in testEndShort for you. Don't edit the code at the bottom of the file!

You can test this function by typing in:



downloadFlickrDataMultipleHours(minHourStart=testStartShort,
                                maxHourEnd=testEndShort,
                                text="hurricane sandy")

but again, note that it should work for other times and text values too.

Hint: Again, remember that you learnt in Week 3’s R seminar that the data from the Flickr API can be a bit noisy. If you get slightly different numbers on running this command multiple times, don’t worry. See the note at the bottom of page 5 in the Week 3 R seminar.

Part 2: Processing the Flickr data (2%)

The hurricane might not be the only influence on the number of photos people take. Perhaps people take more photos at the weekend or at certain times of day, for example.

We should account for this by finding out how many photos were taken in total during each hour, and using this data to normalise our hourly counts of “hurricane sandy” photographs.

To solve this problem, let's split this task up into two sub-problems:

A)  Downloading counts for both “hurricane sandy” photos and all photos taken for a given sequence of hours

B)  Using this data to normalise the “hurricane sandy” counts

2A. Downloading counts for hurricane sandy photos and all photos taken (1%)

You've already written some code to download counts of photos taken in a given hour with specified text attached to them.

Go back to the Flickr API Explorer for flickr.photos.search:


What value of "text" do you need to specify to get data on all photos taken? (Important hint: you don't need to remove this parameter completely from the URL.)

Use this information to write code to download data on both how many “hurricane sandy” Flickr photos were taken in one hour, and how many photos were taken in that hour in total.

To do this, edit the function downloadAllFlickrData. This function has two arguments:

-    minHourStart: a POSIXct date and time object, to specify the beginning of the first hour you would like to download data for, and

-    maxHourEnd: a POSIXct date and time object, to specify the end of the last hour you would like to download data for.

You need to change the function to take those two arguments, and create a data frame with three columns, and a row for each of the hours between minHourStart and maxHourEnd.

The first column should be called Date, the second column should be called SandyPhotoCount, and the third column should be called AllPhotosCount. The columns must be in this order.

For each row:

-    the Date column should contain a POSIXct date and time object which specifies the beginning of the hour you have downloaded data for,

-    the SandyPhotoCount column should contain a count of the total number of photos which were taken in that hour and uploaded to Flickr with the text “hurricane sandy” attached. Note that this count will be a number, and so data in this column should be numeric (not text).

-    the AllPhotosCount column should contain a count of the total number of photos which were taken in that hour and uploaded to Flickr. Note that this count will be a number, and so data in this column should be numeric (not text).

You can use your downloadFlickrDataMultipleHours function to download both the “hurricane sandy” counts and the total counts, if you specify the right values for text. You then need to combine these counts to make the data frame described above. Data in the photo count columns should again be numeric.

You can test this function by typing in:

downloadAllFlickrData(minHourStart=testStartShort, maxHourEnd=testEndShort)



but again, note that it should work for other times too.

Once you've got this working, save the data for this short period in shortFlickrData as follows, so that you can use it to solve the next task.

shortFlickrData <- downloadAllFlickrData(minHourStart=testStartShort,
                                         maxHourEnd=testEndShort)


Hint 1: The command merge will help you combine the datasets.

Hint 2: Once again, remember that you learnt in Week 3’s R seminar that the data from the Flickr API can be a bit noisy. This is particularly obvious when requesting data on the full number of photographs in an hour, probably because this is a particularly large number. If you get slightly different numbers on running this command multiple times, don’t worry. See the note at the bottom of page 5 in the Week 3 R seminar.
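
To illustrate how merge can combine two sets of hourly counts, here is a small sketch on invented data. The data frame names and the counts are assumptions made for this example; only the column names follow the description above.

```r
hours <- as.POSIXct(c("2012-10-29 01:00:00", "2012-10-29 02:00:00"), tz = "UTC")

# Invented counts standing in for two separate downloads.
sandyCounts <- data.frame(Date = hours, PhotoCount = c(3, 5))
allCounts   <- data.frame(Date = hours, PhotoCount = c(1200, 1500))

# Rename the count columns first so they stay distinguishable after merging.
names(sandyCounts)[names(sandyCounts) == "PhotoCount"] <- "SandyPhotoCount"
names(allCounts)[names(allCounts) == "PhotoCount"]     <- "AllPhotosCount"

# Rows are matched on the shared Date column.
combined <- merge(sandyCounts, allCounts, by = "Date")
print(combined)
```

In the result, the key column comes first, followed by the remaining columns of the first data frame and then those of the second.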

2B. Normalising the hurricane sandy photo counts (1%)

You now want to use this data to calculate normalised counts of “hurricane sandy” photographs. To do this, edit the function normaliseFlickrCounts. This function has one argument:

-    allFlickrData: a data frame created by downloadAllFlickrData which has hourly counts of both how many “hurricane sandy” Flickr photos were taken in a given hour, and how many photos were taken in that hour in total.

You can use the shortFlickrData data frame you created above to test this function.

You need to change the function to take this argument, and add an extra column to this data frame called SandyNormalised. For each row, this column should contain the result of dividing the value in SandyPhotoCount by the value in AllPhotosCount.

The function should then return the whole data frame, which should now have four columns, called Date, SandyPhotoCount, AllPhotosCount, and SandyNormalised. The columns must be in this order. The data frame should have the same number of rows as it had before.

Again, data in the columns SandyPhotoCount, AllPhotosCount, and SandyNormalised will be numbers, and so data in these columns should be numeric (not text).
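
The normalisation is an element-wise division of one column by another, which might look like this on invented data (the data frame name exampleCounts and its values are assumptions made for this example):

```r
exampleCounts <- data.frame(
  Date = as.POSIXct(c("2012-10-29 01:00:00", "2012-10-29 02:00:00"), tz = "UTC"),
  SandyPhotoCount = c(3, 5),
  AllPhotosCount  = c(1200, 1500)
)

# Dividing two columns works row by row; the new column is added at the end.
exampleCounts$SandyNormalised <-
  exampleCounts$SandyPhotoCount / exampleCounts$AllPhotosCount
print(exampleCounts)
```

Note that if AllPhotosCount were ever 0 for some hour, R would produce Inf or NaN for that row rather than raising an error.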

You can test this function by typing in:

normaliseFlickrCounts(allFlickrData=shortFlickrData)


but again, note that it should work for other data frames created by downloadAllFlickrData too.

If you have got all of this working, then now is the time to try downloading all of the Flickr data you need for the assignment, for the period 29 October 2012 00:00 to 01 November 2012 23:59.

At the bottom of the file, we have saved the time at which this period begins in testStartFull, and the time at which this period ends in testEndFull. Don't edit the code at the bottom of the file!

You can save the full data set by typing this into the console:

fullFlickrData <- downloadAllFlickrData(minHourStart=testStartFull,
                                        maxHourEnd=testEndFull)


If everything is working correctly, then this should take around 10 minutes on a good broadband connection.

You can now add the normalised counts by typing this into the console:

fullFlickrData <- normaliseFlickrCounts(allFlickrData=fullFlickrData)

Well done – you have successfully acquired and processed the Flickr data. Now for the environmental data!

Part 3: Processing the environmental data (1%)

As a hurricane approaches an area, atmospheric pressure falls. We can therefore use data on atmospheric pressure as a measure of the hurricane’s progress.

3A. Accessing data on atmospheric pressure

Hurricane Sandy made landfall very close to Atlantic City in New Jersey.

Data on atmospheric pressure is made available by the National Oceanic and Atmospheric Administration in the US. Here is their website:


We have retrieved hourly atmospheric pressure observations for all of you from NOAA’s Atlantic City weather station, from the beginning of 29 October 2012 to the final hour of 1 November 2012 – the same as the Flickr data.

The results of this order are attached to this assignment in exactly the form that NOAA provided them, in the file “3922327258060dat.txt”.

Open this file and have a look at the data.

Hint: What format is this data in?

Save this file in your R working directory.

Hint: Look at previous R seminar sheets or ask Google if you can't remember what your R working directory is.

To use this data in the next tasks, find the variable noaaFilename in your R answers file. Change the value of this variable to the name of the data file.

3B. Reading in the atmospheric pressure data

Now you've got the atmospheric pressure data, you need to read it in. Have another look at the file. The information you require is the date on which each reading was taken, the time at which it was taken, and the atmospheric pressure measurement.

The column headings might be a little tricky to understand. NOAA has provided some documentation to try and help with this. This is in the file “3922327258060doc.txt”. Open this documentation file up and take a look at the guidance NOAA has given you.

Use the documentation file to identify which columns contain the data you require in the data file. You want to read in the data file and return a data frame containing these three columns.

To do this, edit the function readNOAAData. This function has one argument:

-    filename: the name of the NOAA data file.

You need to change this function to read in the file at the location given by filename, extract the three columns you are interested in, and rename them, so that you can return a data frame with a row for each atmospheric pressure reading and the following three columns:

-    Date: the date at which the reading was taken. By default, R will read this in as a number. You should leave it as a number for now. Each of the dates will have the following format: 20121029

-    Time: the time at which the reading was taken. By default, R will read this in as a number. You should leave it as a number for now. The times will look a little strange, due to them being numbers: for example, you will see times such as 0 (midnight), 100 (1am) and 200 (2am).

-    AtmosPressure: the atmospheric pressure reading for the given date and time.

The columns must be in the order they are listed above.
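
The select-and-rename step can be sketched on a small invented sample. Everything about the sample is an assumption made for this example: the real NOAA file may use a different separator, and its column names (explained in the documentation file) are not the ones used here.

```r
# Invented sample in a comma-separated layout; the real NOAA file may use a
# different separator and different column names (see the documentation file).
sampleText <- "STATION,YYYYMMDD,HHMM,PRESSURE
12345,20121029,0000,1000.2
12345,20121029,0100,999.4
12345,20121029,0200,998.9"

# With a real file, the text argument would be replaced by the file path.
rawData <- read.csv(text = sampleText, header = TRUE)

# Keep only the three columns of interest, in the required order, then rename.
pressureData <- rawData[, c("YYYYMMDD", "HHMM", "PRESSURE")]
names(pressureData) <- c("Date", "Time", "AtmosPressure")
str(pressureData)
```

Because the times are read as numbers, leading zeros are dropped, which is why midnight appears as 0 and 1am as 100.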

You can test this function by typing in:

readNOAAData(filename=noaaFilename)


but again, note that the function should work for any filename. (Specifically, the function needs to work for us if we give it a path for a data file stored on our computers, potentially with a different name.)

If you've got this working, save the data in noaaData as follows, so that you can use it to solve the next task.

noaaData <- readNOAAData(filename=noaaFilename)

If your code is correct, when you type in the following command

str(noaaData)


you should see the following output:

'data.frame':  93 obs. of  3 variables:
$ Date         : int  20121029 20121029 20121029 20121029 20121029 20121029 20121029 20121029 20121029 20121029 ...
$ Time         : int  0 100 200 300 400 500 600 700 800 900 ...
$ AtmosPressure: num  1000 999 999 998 998 ...

Hint 1: In one of the seminar exercises, you read in data in a very similar format. You need to change your code a little bit to be able to read this file in however.

Hint 2: Remember to rename your column names once you've loaded the data!

Part 4: Combining the Flickr and environmental data (2%)

Now you have the Flickr data and the environmental data. To start to investigate how these data sets relate, you need to combine these datasets into one table.

For each hour from the beginning of 29 October 2012 to the end of 1 November 2012, you have both a normalised count of the number of Hurricane Sandy Flickr photos taken, and a measurement of atmospheric pressure in Atlantic City. However, the atmospheric pressure data uses a different format than the Flickr data for specifying the date and time.

You need to work out how to change the format of the atmospheric pressure date and time, so that it matches the format used in the Flickr data, and then use the dates to combine the datasets.

To solve this problem, let's split this task up into two sub-problems:

A)  Changing the format of the atmospheric pressure date and time

B)  Combining the Flickr and environmental datasets

4A. Changing the format of the atmospheric pressure date and time (1%)

You just had a look at the structure of the noaaData data frame by using the following command:

str(noaaData)


Have a look at the structure of the fullFlickrData data frame too:

str(fullFlickrData)


You can see that dates in the Flickr dataset are in POSIXct format, such that the date and time are both stored in the Date column.

You need to take the date and time information in the noaaData data frame and combine it to create a new column with POSIXct date-time objects in the noaaData data frame.

To do this, edit the function changeNOAADateFormat. This function has one argument:

-    noaaData: the atmospheric pressure data you read in with readNOAAData

You need to change this function to process the date and time information in noaaData, and for each reading, create a POSIXct object that represents the time and date of the reading. These new timestamps should go in a new column called DateTime. You should then remove the old Date and Time columns.

The function should return a data frame with a row for each reading in noaaData, and the following two columns:

-    AtmosPressure: the atmospheric pressure readings from noaaData, and

-    DateTime: the time and date of each reading as a POSIXct object.

The columns must be in the order they are listed above.

You can test this function by typing in:

changeNOAADateFormat(noaaData=noaaData)


If you've got this working, save the processed data in noaaData as follows, so that you can use it to solve the next task.

noaaData <- changeNOAADateFormat(noaaData=noaaData)

If your code is correct, when you type in the following command

str(noaaData)


Suzy Moat and Tobias Preis

you should see the following output:

'data.frame':  93 obs. of  2 variables:
$ AtmosPressure: num  1000 999 999 998 998 ...
$ DateTime     : POSIXct, format: "2012-10-29 00:00:00" "2012-10-29 01:00:00" "2012-10-29 02:00:00" ...

Hint 1: The formatC function looks a bit complicated, but is a simple way of converting a number into a string with a certain number of characters (e.g., 4). Might this be useful to process the times?

Hint 2: Remember the as.POSIXct() function: it will come in handy here.
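
The two hints can be seen working together in the following sketch on invented dates and times. The vector names are hypothetical, and the choice of tz = "UTC" is an assumption made for this example.

```r
exampleDates <- c(20121029L, 20121029L, 20121030L)
exampleTimes <- c(0L, 100L, 1300L)

# Pad the times to four digits: 0 -> "0000", 100 -> "0100".
paddedTimes <- formatC(exampleTimes, width = 4, format = "d", flag = "0")

# Paste date and time together, then parse with a matching format string.
# tz = "UTC" is an assumption made for this example.
stamps <- as.POSIXct(paste(exampleDates, paddedTimes),
                     format = "%Y%m%d %H%M", tz = "UTC")
print(stamps)
```

Padding first matters because a time such as 100 cannot be parsed as hours and minutes until it is written as "0100".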

4B. Combining the Flickr and environmental datasets (1%)

Great: you’ve got data on Hurricane Sandy related photos from Flickr, and data on atmospheric pressure from NOAA. If you wanted to carry out an analysis of how these two datasets related, you would want to put them into the same data frame.

To do this, edit the function mergeFlickrAndNOAAData. This function has two arguments:

-    allFlickrData: a data frame containing counts of Flickr photos which you created using normaliseFlickrCounts, and

-    noaaData: a data frame containing atmospheric pressure data which you created using changeNOAADateFormat

You want to change this function to merge the datasets specified by allFlickrData and noaaData and create one data frame, where each row represents one hour, and contains a measurement of the atmospheric pressure and the normalised count of Hurricane Sandy Flickr photos. This is one line of code (on top of the return function).

The function should return a data frame with a row for every atmospheric pressure reading you downloaded, and the following five columns:

-    Date: the time and date of the hour for which you have Flickr data and atmospheric pressure data,

-    SandyPhotoCount: a count of the total number of photos which were taken in that hour and uploaded to Flickr with the text “hurricane sandy” attached, as in allFlickrData,

-    AllPhotosCount: a count of the total number of photos which were taken in that hour and uploaded to Flickr, as in allFlickrData,

-    SandyNormalised: the normalised count of “hurricane sandy” Flickr photos for that hour, as in allFlickrData, and

-    AtmosPressure: the atmospheric pressure reading for that hour, as in noaaData

The columns must be in the order they are listed above.
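
When the key column has a different name in each data frame, merge accepts by.x and by.y. The sketch below shows this on invented values; the data frame names flickrExample and noaaExample are assumptions made for this example.

```r
hours <- as.POSIXct(c("2012-10-29 00:00:00", "2012-10-29 01:00:00"), tz = "UTC")

# Invented values with the column layouts described in Parts 2B and 4A.
flickrExample <- data.frame(Date = hours,
                            SandyPhotoCount = c(2, 4),
                            AllPhotosCount  = c(1000, 1600),
                            SandyNormalised = c(0.002, 0.0025))
noaaExample <- data.frame(AtmosPressure = c(1000.2, 999.4),
                          DateTime = hours)

# by.x / by.y name the key column in the first and second data frame.
mergedExample <- merge(flickrExample, noaaExample,
                       by.x = "Date", by.y = "DateTime")
print(mergedExample)
```

The merged key column takes its name from by.x. By default only rows whose keys appear in both data frames are kept; the all.x and all.y arguments of merge change that behaviour.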

You can test this function by typing in:

mergeFlickrAndNOAAData(allFlickrData=fullFlickrData, noaaData=noaaData)

If you've got this working, save the processed data in allData as follows, so that you can use it to solve the next task.

allData <- mergeFlickrAndNOAAData(allFlickrData=fullFlickrData,
                                  noaaData=noaaData)


Part 5: Visualising the data (2%)

Excellent – you have all the data!

The first thing you would want to do when working out whether there was a relationship between the two data sets is to plot the data. For now, let’s make one graph of the Flickr data and one graph of the atmospheric pressure data, and make sure they are clear and easy to read.

5A. Visualising the Flickr data (1%)

It would be helpful to see a line graph of the normalised counts of “hurricane sandy” Flickr photographs across time.

To do this, edit the function createFlickrPlot. This function has one argument: