
COMPSCI 4064 (H) / 5088 (M) Big Data: Systems, Prog & Mgt

Published: 2023-08-01


TODO DATE AND TIME

Duration: H-Level: 1.5 hours, M-Level: 2 hours

Additional time: 30 minutes

Timed exam – fixed start time

DEGREE of MSc

Big Data: Systems, Prog & Mgt

COMPSCI 4064 (H) / 5088 (M)

(H-Level: Answer the first 3 Questions, M-Level: Answer all 4 Questions)

This examination paper is an open-book, online assessment and is worth a total of 75 marks.

1.    (a)  Context: Consider the following scenario. You are working at the British Broadcasting Corporation (BBC) and are responsible for preparing on-demand video recordings from live programming. An hour of broadcast material uses around 45GB of storage space, and the BBC runs 35 concurrent channels at peak times. Audio and video channels are stored separately, and around half of the recorded content has subtitles in multiple languages. You need to take the high-quality recordings, then encode and downsample them into 6 resolution and bit-rate options. The BBC has a target of having on-demand videos published to their iPlayer offering within 60 minutes of the live recording ending.

Question: The preparation of these videos constitutes a big data task. List the dimensions/properties of the task that make it a big data problem and discuss why.

Guidance: The question is worth 6 marks. There are 3 marks for correctly identifying the dimensions/properties and 3 marks for the associated discussion (1 mark per correct point made). [6]

Context: When building a solution to this problem, you might tackle it using either batch or stream processing.

Question: Discuss which is a better solution and why.

Guidance:  The question is worth 4 marks.  You must state what type of solution you are arguing for, and then provide 2 distinct points for why this solution is superior to the alternative, 2 marks per point. For each point you need to contrast against the alternative to receive the marks. [4]

Context: The BBC decides to develop a horizontally scalable batch processing solution using Hadoop MapReduce. The design for the solution is as follows. Each live video recording will be split into batches of image frames to be processed, where each batch ends in a key frame. Each batch of frames is sent to an encoder, which downsamples each frame to a lower resolution. Once all frames in a batch have been processed, they are combined to create a key-frame block, which also involves cropping out (removing) static content between frames to enable compression. The encoded video is a sequence of these key-frame blocks.

Question: Describe using MapReduce terminology how a single video would be processed.

Guidance: The question is worth 14 marks. You are expected to describe 7 steps. There are 7 marks for correctly identifying the steps, and 7 marks for describing what happens within each step. [14]
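To make the Map/Reduce framing concrete, here is a minimal, illustrative sketch in plain Python (the frame fields, the halving encoder, and the string pixel data are all invented for demonstration; a real deployment would run over image data in HDFS via Hadoop Streaming or similar):

```python
# Illustrative simulation of the described pipeline -- not real Hadoop.

def split_into_batches(frames):
    """Input-splitting stage: divide the video into batches,
    each batch ending at a key frame."""
    batch, batches = [], []
    for frame in frames:
        batch.append(frame)
        if frame["key"]:            # a key frame closes the batch
            batches.append(batch)
            batch = []
    if batch:                       # trailing frames with no key frame
        batches.append(batch)
    return batches

def map_encode(batch_id, batch):
    """Map stage: downsample every frame in the batch,
    keyed by the batch it belongs to."""
    for frame in batch:
        yield batch_id, {**frame, "resolution": frame["resolution"] // 2}

def reduce_combine(batch_id, frames):
    """Reduce stage: combine the encoded frames into a key-frame block,
    dropping frames identical to their predecessor (static content)."""
    block, prev = [], None
    for frame in frames:
        if frame["pixels"] != prev:
            block.append(frame)
        prev = frame["pixels"]
    return batch_id, block

# Toy "video": three frames, the last one a key frame.
video = [
    {"key": False, "resolution": 1080, "pixels": "A"},
    {"key": False, "resolution": 1080, "pixels": "A"},  # static, cropped out
    {"key": True,  "resolution": 1080, "pixels": "B"},
]

encoded_video = []
for i, batch in enumerate(split_into_batches(video)):
    # The shuffle/sort phase groups mapper output by batch id; trivial here.
    mapped = [f for _, f in map_encode(i, batch)]
    encoded_video.append(reduce_combine(i, mapped))

print(encoded_video)
```

The sketch keeps the batch id as the intermediate key so that all frames of one batch meet at a single reducer, mirroring how a key-frame block can only be assembled once every frame in its batch has been encoded.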

2.    (a)  Context: A start-up company is designing a new low-cost cloud compute platform using Apache Spark. The core idea of the start-up is to allow members of the public to sell compute time on their mobile phones when those devices are idle, where the compute capacity of multiple phones can be bundled and sold on to a business for profit. The start-up company runs a series of Spark Masters on its own servers hosted at its headquarters, while the Spark Workers are deployed on thousands of mobile phones spread out over the UK.

Question: Discuss why the company might have chosen to use Apache Spark rather than Hadoop as its big data processing framework.

Guidance: This question is worth 8 marks. 4 points are expected; for each, 1 mark is available for providing a relevant difference between Spark and Hadoop and 1 mark for why this is important in the context of this application. [8]

Question: The company runs a series of tests and discovers that tasks are often executed multiple times, which wastes resources. Given your understanding of how Spark executes jobs, what reasons might cause this to happen?

Guidance:  The question is worth 9 marks.  We expect you to identify three different possible causes. For each, there is 1 mark for describing a valid cause, 1 mark for specifying where in the Spark job execution this might happen, and 1 mark for discussing why this is a particular problem for the company’s scenario. [9]
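As background for reasoning about re-execution, recall that Spark transformations are lazy: an uncached lineage is recomputed by every action that touches it. A plain-Python sketch of that behaviour (this is not real Spark; the names and counter are invented for illustration):

```python
# Plain-Python sketch of lazy lineage re-execution -- not real Spark.
compute_count = 0

def lazy_map(data, fn):
    """Like a Spark transformation: returns a generator, nothing runs yet."""
    return (fn(x) for x in data)

def transform(x):
    global compute_count
    compute_count += 1          # track how many times work is actually done
    return x * 2

data = [1, 2, 3]

# Two "actions" over the same uncached lineage:
list(lazy_map(data, transform))   # first action: 3 computations
list(lazy_map(data, transform))   # second action: 3 more computations
print(compute_count)              # the work was done twice over
```

In real Spark the fix would be to persist the intermediate RDD/DataFrame before running multiple actions over it; without that, each action rebuilds the lineage from its source.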

Context: An engineer at the company suggests that moving to an Apache YARN based management system might reduce the need for services managed at the headquarters.

Question: Is this true? Explain why. Then provide a short discussion of how this would affect the robustness of the company's solution.

Guidance: This question is worth 6 marks. We expect you to specify whether the statement is true [1 Mark], and then summarize why [2 Marks]. You should then include a paragraph describing how using YARN changes how applications are managed using Spark [2 Marks], and how this affects robustness in this scenario [1 Mark]. [6]

3.    (a)  Context: The School of Computing Science is preparing to purchase a new compute cluster and data storage solution to support undergraduate and PhD students. The system is expected to support 1000 undergraduate students and 100 PhD students. A typical undergraduate student is expected to require 50GB of file storage, while a PhD student is expected to need 1TB. Undergraduate students rarely access their data, but when they do, read/write latency is important. Meanwhile, PhD students typically run experiments on large datasets, which requires high read throughput of large blocks of data.

Question: The School is considering buying a RAID1 solution. Why might the School want a RAID6 solution over a RAID1 solution?

Guidance:  This question is worth 4 marks.   2 marks are available for describing the differences between RAID1 and RAID6 and 2 marks are available for valid reasons for why RAID6 is superior for this scenario. [4]
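As a point of comparison, usable capacity differs markedly between the two RAID levels; a quick worked calculation in Python (the disk count and size are illustrative, and hot spares are ignored):

```python
# Illustrative usable-capacity comparison: 8 disks of 4 TB each.
disks, size_tb = 8, 4

raid1_usable = (disks // 2) * size_tb   # mirrored pairs: half the raw capacity
raid6_usable = (disks - 2) * size_tb    # double distributed parity: all but two disks

print(raid1_usable, raid6_usable)       # -> 16 24
```

The capacity gap widens as the array grows, since RAID6's two-disk parity overhead is fixed while RAID1's mirroring overhead scales with the array.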

Question: A member of staff suggested that instead of using RAID-based storage they should instead use an HBase installation. Is this a good idea given the requirements?

Guidance: This question is worth 4 marks. You are expected to make two points in your answer; marks are awarded for highlighting the differences between the two solutions and why these are important given the requirements. [4]

(c)  Context: The School buys 50 compute nodes with 24 CPU cores per node, each with an internal 1TB SSD for use with a Cassandra installation that was requested by one of the research groups.

Question: When configuring Cassandra, should the School use virtual nodes or not?

Guidance: This question is worth 5 marks. 2 marks are for describing what virtual nodes do, 1 mark is for correctly identifying whether this is needed and 2 marks are for explaining why. [5]
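For reference, virtual nodes are controlled by the `num_tokens` setting in `cassandra.yaml`; a sketch with illustrative values:

```yaml
# cassandra.yaml (excerpt) -- values are illustrative only
num_tokens: 256    # >1 enables virtual nodes: each node owns many small token ranges
# num_tokens: 1    # single-token mode: one contiguous range per node, balanced manually
```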

4.    (a)  Context: The typical replication factor in HDFS is 3, meaning that each block will be stored on at least 3 nodes. Keeping in mind the various protocols employed by HDFS (read, write, replication, recovery, etc.):

Question: What would you expect the effects of a higher replication factor to be on write throughput?

Guidance: This question is worth 5 marks, of which 2 marks are for identifying the key aspect of HDFS that affects write throughput, 2 marks are for describing how changing the replication factor affects this, and 1 mark is for stating what effect a higher-than-3 replication factor would have. [5]
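For reference, the replication factor is set per cluster (and can be overridden per file) via the `dfs.replication` property in `hdfs-site.xml`; an illustrative excerpt:

```xml
<!-- hdfs-site.xml (excerpt) -- illustrative -->
<property>
  <name>dfs.replication</name>
  <value>3</value> <!-- default: each block is written through a pipeline of 3 DataNodes -->
</property>
```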

(b)  Question: What would you expect the effects of a higher replication factor to be on read latency?

Guidance: This question is worth 4 marks, of which 2 marks are for identifying the key aspect of HDFS that affects read latency and 2 marks are for describing how changing the replication factor affects this. [4]

Question: How would you expect a higher replication factor to affect the NameNode and DataNodes in terms of resource usage?

Guidance: This question is worth 6 marks. Three points are expected, two marks per point. [6]