Statistical Methods for Big Data (MATH70072)
Coursework Assignment
Introduction
This is the coursework assignment associated with the Big Data in Statistics module. A report response is to be submitted by 1700hrs on 4th May 2022. An electronic copy of the report (in Word, PDF or iPython notebook format) should be submitted via the module’s Blackboard page.
Data
The dataset is a modified version of the VAST 2016 Mini-Challenge 1 data. Some information taken from the VAST 2016 website is repeated here, to ensure that these instructions are self-contained.
At the end of 2015, a [fictitious] growing organisation, GAStech, moved into a new, state-of-the-art, three-story building near to their previous location. The new office is built to the highest energy efficiency standard, but as with any new building, there are still several heating, ventilation, and air conditioning (HVAC) issues to work out. The building is divided into several HVAC zones. Each zone is instrumented with sensors that report building temperatures, heating and cooling system status values, and concentration levels of various chemicals such as Carbon Dioxide (abbreviated CO2) and Hazium (abbreviated Haz), a recently discovered and possibly dangerous chemical. CEO Sten Sanjorge Jr. has read about Hazium and requested that these sensors be included. However, they are very new and very expensive, so GAStech can afford only a small number of sensors.
With their move into the new building, GAStech also introduced new security procedures, which staff members are not necessarily adopting consistently. Staff members are required to wear proximity (prox) cards while in the building. The building is instrumented with passive prox card readers that cover individual building zones. The prox card zones do not generally correspond with the HVAC zones. When a prox card passes into a new zone, it is detected and recorded. As part of the deal to entice GAStech to move into this new building, the builders included a free robotic mail delivery system. This robot, nicknamed Rosie, travels the halls periodically, moving between floors in a specially designed chute. Rosie is equipped with a mobile prox sensor, which identifies the prox cards in the areas she travels through.
The building is partitioned into different zones, across three floors, as depicted in the three figures below.
There are four datasets provided, covering May 31 to June 13, 2016. The data are as follows:
• Fixed proximity sensor data reading employees’ prox cards (prox-fixed.csv);
• Mobile proximity sensor data (from Rosie) reading employees’ prox cards (prox-mobile.csv);
• Environmental conditions of the building (bldg-measurements.csv) – see Annex A for further details;
• Hazium concentration within the building (f2z2-haz.csv), containing the Hazium concentration on floor 2, zone 2.
Acquiring the data
These instructions assume that you have successfully completed Exercise 1 of Week 1. If you have not done so then please complete this exercise before proceeding with the coursework.
Please log on to bazooka. A unique dataset is to be generated for each student, using the following commands:
$ cd ~/bd-sp-2017
$ cd coursework
$ chmod +x *.py
$ ./process_data.py /tmp/coursework/prox-fixed.csv \
    /home/USERNAME/bd-sp-2017/data/prox-fixed.csv
$ ./process_data.py /tmp/coursework/prox-mobile.csv \
    /home/USERNAME/bd-sp-2017/data/prox-mobile.csv
$ ./process_data.py /tmp/coursework/bldg-measurements.csv \
    /home/USERNAME/bd-sp-2017/data/bldg-measurements.csv
$ ./process_data.py /tmp/coursework/f2z2-haz.csv \
    /home/USERNAME/bd-sp-2017/data/f2z2-haz.csv
$ cd ../data
$ ls -la
Note that USERNAME will need to be replaced with your actual username.
You should see four new files in the data directory corresponding to the four data files (prox-fixed.csv, prox-mobile.csv, bldg-measurements.csv, f2z2-haz.csv). Please run the following commands and record the output of each command at the top of your coursework report submission.
$ md5sum prox-fixed.csv
$ md5sum prox-mobile.csv
$ md5sum bldg-measurements.csv
$ md5sum f2z2-haz.csv
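As an optional cross-check on the md5sum output, the same digest can be computed in Python with the standard hashlib module. This is only a sketch: the function name md5_of_file and the throwaway demo file are illustrative assumptions and are not part of the coursework tooling.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Illustrative use on a throwaway file (NOT one of the coursework files).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello\n")
demo_path = tmp.name
demo_digest = md5_of_file(demo_path)
os.remove(demo_path)
print(demo_digest)
```

Calling md5_of_file("prox-fixed.csv") from the data directory should give the same hexadecimal string that md5sum prints before the file name.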
Create a folder in HDFS called coursework. You should now upload these four data files to your coursework folder on HDFS.
Map Reduce
For questions 1-4 below, write a Map Reduce program to compute the required answer. Your response to each of these questions should consist of three components: (1) your answer to the question; (2) the Shell command used to execute the Map Reduce program; (3) Python code developed and used to compute the answer. The code will be checked for execution quality, so please ensure that the code is self-contained and executable. (Marks will be deducted for code that does not execute using the commands provided via component (2).)
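To illustrate the mapper/reducer shape that component (3) calls for, the following is a minimal sketch that counts distinct prox-ids per day and can be run locally without Hadoop. It makes several assumptions that are not stated in this document: that column 0 of each record is a timestamp of the form "YYYY-MM-DD HH:MM:SS", that column 2 is the prox-id, and that the header line begins with "timestamp". The sample records are invented. Under Hadoop Streaming the mapper and reducer would normally be separate scripts reading sys.stdin; here they are plain functions so the logic can be checked in isolation.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'day<TAB>prox_id' for each CSV record, skipping header/short lines."""
    for line in lines:
        fields = [field.strip() for field in line.split(",")]
        # ASSUMPTION: column 0 = 'YYYY-MM-DD HH:MM:SS', column 2 = prox-id.
        if len(fields) < 3 or fields[0].lower().startswith("timestamp"):
            continue
        day = fields[0].split(" ")[0]
        yield f"{day}\t{fields[2]}"


def reducer(sorted_pairs):
    """Count distinct prox-ids per day; input must already be sorted by key."""
    keyed = (pair.split("\t", 1) for pair in sorted_pairs)
    for day, group in groupby(keyed, key=lambda kv: kv[0]):
        yield day, len({prox_id for _, prox_id in group})


# Invented sample records, for illustration only.
sample = [
    "timestamp, type, prox-id, floor, zone",
    "2016-05-31 08:01:00, check-in, aaa001, 1, 1",
    "2016-05-31 09:15:00, check-in, aaa001, 2, 4",
    "2016-05-31 09:20:00, check-in, bbb002, 1, 1",
    "2016-06-01 08:05:00, check-in, aaa001, 1, 1",
]

# sorted() stands in for the shuffle/sort phase between map and reduce.
result = dict(reducer(sorted(mapper(sample))))
print(result)
```

The sort between the two stages matters: the reducer relies on all records for one key arriving contiguously, which is what the Hadoop shuffle guarantees.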
1. Using both prox-fixed and prox-mobile datasets, produce a diagram that displays the number of staff members present in the building on each day (i.e. the number of unique prox-ids on each day). NB: The x-axis may be marked with day number (i.e. 0, 1, 2, …) from the beginning of the dataset. [8 marks]
2. Using the prox-fixed dataset, what is the (floor, zone) of the most visited location in the building? [5 marks]
3. Using both datasets, what is the prox-ID of the most active staff member (i.e. the staff member with the greatest number of prox card readings) on 2nd June 2016? [5 marks]
4. Using the bldg-measurements dataset, produce a time series plot of the average hourly “Total Electric Demand Power”. (This should be a single plot, with the x-axis denoting hour of day, with a range of 0hrs-23hrs.) What does this plot indicate about power usage throughout the day? [5 marks]
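An average per key, as in question 4, is usually computed in Map Reduce by carrying (sum, count) pairs rather than partial averages, because partial averages cannot be combined correctly without their weights. The sketch below shows that pattern on invented rows. The function names, the row layout (timestamp first, value at a caller-supplied index) and the timestamp format "YYYY-MM-DD HH:MM:SS" are assumptions made for illustration, not details taken from the coursework data.

```python
from collections import defaultdict


def map_hour_value(rows, value_index):
    """Emit (hour_of_day, (value, 1)) for each row."""
    for row in rows:
        # ASSUMPTION: row[0] looks like 'YYYY-MM-DD HH:MM:SS'.
        hour = int(row[0].split(" ")[1].split(":")[0])
        yield hour, (float(row[value_index]), 1)


def reduce_mean(pairs):
    """Sum values and counts per hour, then divide once at the end."""
    totals = defaultdict(lambda: [0.0, 0])
    for hour, (value, count) in pairs:
        totals[hour][0] += value
        totals[hour][1] += count
    return {hour: total / n for hour, (total, n) in sorted(totals.items())}


# Invented sample rows: (timestamp, power reading).
rows = [
    ("2016-05-31 00:00:00", "10.0"),
    ("2016-05-31 00:05:00", "20.0"),
    ("2016-06-01 00:00:00", "30.0"),
    ("2016-05-31 13:00:00", "5.0"),
]

hourly_mean = reduce_mean(map_hour_value(rows, value_index=1))
print(hourly_mean)
```

Because sums and counts are both associative, the same reducer logic could also serve as a combiner that pre-aggregates on each mapper node before the shuffle.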
Spark
With the exception of question 8, for the following questions, write a sequence of Spark commands (that are executed in the Spark REPL) to compute the required answer. For each question, the full sequence of Scala commands should be pasted into your submission, together with the computed answer, and any other information requested. The code will be checked for execution quality, so please ensure that the code is self-contained and executable. (Marks will be deducted for code that does not execute using the sequence of commands provided in your coursework submission.)
5. Parse the prox-fixed.csv data file into an RDD[ProxReading], where ProxReading is defined as: case class ProxReading(timeStamp: org.joda.time.DateTime, id: String, floorNum: String, zone: String). In this class, timeStamp is a joda DateTime object (see footnote 2), id corresponds to prox-id, floorNum corresponds to the floor number, and zone corresponds to the zone id. [2 marks]
6. Using the prox-fixed dataset, what is the (floor, zone) of the most visited location in the building across the complete dataset? [3 marks]
7. Using both datasets, what is the prox-ID of the most active staff member (i.e. the staff member with the greatest number of prox card readings) on 7th June 2016? [3 marks]
8. Provide a concise summary of your experiences writing Map Reduce programmes and Spark commands for questions 6 and 7. Comment on the differences between the two computational platforms. [2 marks]
9. Construct an appropriate RDD from the bldg-measurements.csv data, containing the “Date/Time” column (number 1) and “F_2_Z_1 VAV REHEAT Damper Position” column (number 193). What is the date and time of the first occurrence of the F_2_Z_1 VAV REHEAT Damper Position being fully open, i.e. the earliest date and time of variable “F_2_Z_1 VAV REHEAT Damper Position” being set to its maximum value of 1.0? [3 marks]
10. A rogue employee is believed to be increasing the Hazium concentration in the building by modifying the Reheat Damper position (“F_2_Z_1 VAV REHEAT Damper Position”). By using the Spark package MLlib (see footnote 3) or another Spark command sequence, demonstrate a statistical association between the Hazium concentration (from f2z2-haz.csv) and the “F_2_Z_1 VAV REHEAT Damper Position” variable. Provide a concise summary of your statistical findings, using diagrams where appropriate. [10 marks]
11. By using the (fixed) proximity location data, determine the employee IDs for those that entered the Server Room (the HVAC control location) prior to the sudden increase in Hazium concentration at the end of the dataset (i.e. employees in the Server Room on 10th June 2016). [3 marks]
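One common measure of statistical association of the kind question 10 asks about is the Pearson correlation coefficient, computed after aligning the two series on their timestamps. The sketch below works through that calculation in plain Python on invented readings. It is not the required Spark solution, and the dictionaries, values and function name are hypothetical. MLlib's Statistics.corr computes the same coefficient by default.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented readings keyed by timestamp (illustration only).
damper = {
    "2016-06-01 00:00:00": 0.0,
    "2016-06-01 00:05:00": 0.5,
    "2016-06-01 00:10:00": 1.0,
    "2016-06-01 00:15:00": 1.0,
}
hazium = {
    "2016-06-01 00:00:00": 0.2,
    "2016-06-01 00:05:00": 1.1,
    "2016-06-01 00:10:00": 2.3,
    "2016-06-01 00:20:00": 9.9,  # no matching damper reading; dropped below
}

# Inner join on timestamp so each damper value is paired with a Hazium value.
shared = sorted(damper.keys() & hazium.keys())
r = pearson([damper[t] for t in shared], [hazium[t] for t in shared])
print(len(shared), r)
```

The join step is the part most easily overlooked: the two files may be sampled at different times, and a correlation over misaligned rows is meaningless. A lagged comparison, pairing each damper reading with a later Hazium reading, is a natural extension if the effect is expected to be delayed.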
General
12. Write a question that could appear in next year’s coursework paper and that tests a student’s understanding of the opportunities and problems of using Big Data technology. [10 marks]
13. Identify, download, and perform a statistical analysis of any suitable data that is available on the internet, and write a one-page summary of your findings. Your analysis should use Hadoop, Spark, or both tools. Please note that the data need not be “big” – the question is intended to assess your approach to the analysis, and how you utilise Big Data technology in performing a statistical analysis. [20 marks]
14. Please read the following research paper:
https://statistics.fas.harvard.edu/files/statistics-2/files/statistical_paradises_and_paradoxes.pdf
a. Write a short (less than one side of A4) synopsis of the paper, extracting the key statistical contributions of the paper. [15 marks]
b. Discuss how the key points raised in the paper could be relevant (if at all) to the statistical analysis performed in question 13, linking to other research if and where appropriate. [25 marks]
2 http://joda-time.sourceforge.net/apidocs/org/joda/time/DateTime.html
3 https://spark.apache.org/docs/1.2.1/mllib-guide.html
2022-04-21