Tuesday 25 April 2023

14.00 - 16.00 BST

Duration: 2 hours

Additional time: 30 minutes

Timed exam - fixed start time

DEGREES OF MSc, MSci, MEng, BEng, BSc, MA and MA (Social Sciences)

Big Data: Systems, Programming, and Management (M)

COMPSCI5088

Answer all questions

This examination paper is an open book, online assessment and is worth a total of 75 marks.

1.    (a)  Context: Consider the following scenario. You are a big data engineer working at a satellite imaging company, which is contracted to develop an alerting system for natural disasters based on the images captured by its satellites. When active, a satellite sends a high-resolution image to a base station on the ground each second, which is then streamed to the company headquarters to be processed. Each image is around 10 MB in size. The company owns 100 satellites, with the goal of doubling that in the next 5 years. Satellites are set in fixed orbital paths and are tasked with taking pictures of specific regions on Earth. When not over a target region, a satellite is inactive (to save power). Typically, around 20% of the satellites are active at any one time, although this may increase to 50% during busy periods of the year (e.g. the US hurricane season).

Question: Processing the images produced by these satellites for disaster alerting constitutes a big data task. List the dimensions/properties of the task that make it a big data problem and discuss why.

Guidance: The question is worth 6 marks. There are 3 marks for correctly identifying the dimensions/properties and 3 marks for the associated discussion (1 mark per correct point made). [6]
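
Illustrative note: the figures in the context (100 satellites, one 10 MB image per second per active satellite, 20% to 50% of satellites active) can be turned into rough aggregate ingest rates. The sketch below is only a back-of-the-envelope aid; the function name `ingest_rate_mb_per_s` is hypothetical, and it assumes decimal units (1 TB = 1,000,000 MB) and a constant active fraction over a full day.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def ingest_rate_mb_per_s(satellites: int, active_fraction: float,
                         image_mb: float = 10.0) -> float:
    """Aggregate ingest rate in MB/s, one image per second per active satellite."""
    return satellites * active_fraction * image_mb


def daily_volume_tb(rate_mb_per_s: float) -> float:
    """Data volume per day in TB (assumes 1 TB = 1,000,000 MB)."""
    return rate_mb_per_s * SECONDS_PER_DAY / 1_000_000


if __name__ == "__main__":
    scenarios = [
        ("today, typical (20% active)", 100, 0.2),
        ("today, busy (50% active)", 100, 0.5),
        ("after doubling, busy (50% active)", 200, 0.5),
    ]
    for label, satellites, fraction in scenarios:
        rate = ingest_rate_mb_per_s(satellites, fraction)
        print(f"{label}: {rate:.0f} MB/s, about {daily_volume_tb(rate):.1f} TB/day")
```

Under these assumptions the typical load is 200 MB/s, rising to 500 MB/s in busy periods with the current fleet.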

(b)  Context: The company is approached by a large hardware manufacturer offering a single mainframe computer to perform the processing of the satellite data.

Question: Discuss why this is not a good option given their requirements.

Guidance: The question is worth 8 marks. You are expected to make 4 distinct points in your discussion, 2 marks per point. [8]

(c)  Context: The company instead decides to develop a horizontally scalable solution to this problem composed of three components, each of which can be individually parallelised. These components are: 1) image splitting into smaller regions (converts each image into 24 smaller regional images); 2) region image variant generation (applies 4 different filters, e.g. grayscale, to each regional image, creating 4 variants of that image); and 3) disaster type classification on each regional image variant. If any of the regional image variants is classified as showing a natural disaster, then an alert is generated.

Question: Discuss the main challenges that the company will face when developing this solution.

Guidance: The question is worth 8 marks. You are expected to discuss 4 challenges. There is 1 mark for discussing a relevant parallelism concept and 1 mark for describing why it is a relevant challenge for this proposed system. [8]
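
Illustrative note: the three components multiply the amount of work per incoming image, since each image becomes 24 regional images and each regional image becomes 4 variants. The sketch below simply works through that arithmetic; the names `REGIONS_PER_IMAGE`, `VARIANTS_PER_REGION` and `classifications_per_second` are hypothetical, and the image rates assume the fleet figures from question 1(a) (20 to 50 active satellites out of 100, one image per second each).

```python
REGIONS_PER_IMAGE = 24      # component 1: image splitting
VARIANTS_PER_REGION = 4     # component 2: one variant per filter


def classifications_per_image() -> int:
    """Number of classifier invocations (component 3) triggered by one image."""
    return REGIONS_PER_IMAGE * VARIANTS_PER_REGION


def classifications_per_second(images_per_second: int) -> int:
    """Classifier invocations per second for a given incoming image rate."""
    return images_per_second * classifications_per_image()


if __name__ == "__main__":
    for active_satellites in (20, 50):
        # One image per second per active satellite.
        total = classifications_per_second(active_satellites)
        print(f"{active_satellites} active satellites: {total} classifications/s")
```

Each image therefore fans out into 96 classification tasks, so the last component sees far more work items than the first.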

2.    (a)  Context: A news aggregator company is experimenting with automatic summarization of news articles. They pay major news outlets to receive copies of news articles that those outlets publish and then algorithmically convert multiple news articles on a particular topic into a short summary for customers. Each article is stored in a separate file. The summarization process has four main stages: 1) content extraction from the news articles (converts the article into paragraphs); 2) topic identification, where each paragraph is assigned one or more topics (e.g. War in Ukraine, US Election, etc.); 3) paragraph grouping by topic; and 4) summarization of a group of paragraphs on each topic (converts multiple paragraphs into a single summary).

Question: The news aggregator company has decided to implement their automatic summarizer in MapReduce using Hadoop. Describe how you would implement this as a Hadoop program and why.

Guidance: The question is worth 9 marks. We are looking for you to describe how you convert the above stages into MapReduce operations. You should state what the type of each operation is and what that operation does in the context of this solution, along with its inputs and outputs (worth 6 marks). After describing your operations, you should then include a short paragraph (2-3 sentences) summarizing why you chose this design (worth 3 marks). [9]

(b)  Question: The company implements the above system in Java via Hadoop and then profiles it to determine where the main overhead costs are when processing. Describe what you expect them to find and why.

Guidance: This question is worth 8 marks. We expect you to list four main observations, where one mark is available for the observation and one mark is available for why. We do not expect you to discuss the cost of the operations themselves (since you won’t know the relative cost of content extraction, topic identification or summarization), but rather aspects of Hadoop that will result in efficiency overheads. [8]

(c)  Context: Given the overhead costs observed, the company decides that they should test a Java implementation in Apache Spark.

Question: Discuss the primary advantages that a Spark implementation of this system would have over a Hadoop implementation.

Guidance: This question is worth 8 marks. 4 points are expected, where 1 mark is available for what the advantage is, and 1 mark is available for why this is better than what Hadoop provides. [8]

3.    (a)  Context: A large shipping company is in the process of upgrading the storage infrastructure that it uses to store data produced by the ships in its fleet. The company needs to store two types of data: 1) audio logs of communication between its ships and third parties (coastguards, docking facilities, etc.), which are relatively large files; and 2) sensor data produced by the ships themselves (GPS coordinate data, engine temperature, etc.), which comprises many small files.

Question: The company currently uses an NFS backed by RAID6 storage, which is replicated to an off-site back-up at the end of each week, and is considering moving to an HDFS-based solution. What are the advantages and disadvantages of such a move?

Guidance: This question is worth 8 marks. 4 distinct points are expected. For each point, one mark is available for what the difference between the two solutions would be and a further mark is available for what impact this would have. You should connect these differences to Big Data aspects like resiliency, scalability, etc. [8]

(b)  Context: One of the engineers at the company is worried about the NameNode as a single point of failure and instead suggests using a decentralized solution.

Question: Summarize the primary difference between a distributed and a decentralized storage solution.

Guidance: This question is worth 3 marks. You should define both distributed and decentralized storage (2 marks) and then contrast them (1 mark). [3]

(c)  Context: One of the foundations of a decentralized solution like Cassandra is that it uses hashing schemes to enable fast indexing of its available storage.

Question: Describe consistent hashing and order-preserving hashing, and then discuss how they are combined in Cassandra.

Guidance: This question is worth 6 marks. 4 marks are available for explaining how the two hashing schemes function, and 2 marks are available for describing how they can be combined. [6]

4.    (a)  Context: In Hadoop 2.0, Yet Another Resource Negotiator (YARN) was introduced.

Question: Discuss why a YARN-based Hadoop environment is superior to a traditional Hadoop configuration. [3]

(b)  Context: The original YARN paper refers to the place where work is done as a ‘container’. However, this is not what the term commonly means today.

Question: What are the differences between a YARN container and containers that are common today?

Guidance: The question is worth 8 marks. 4 differences are expected, 2 marks per difference. You need to specify how the two types of containers differ to receive the marks. [8]