7CCSMBDT

Big Data Technologies

2018

Question One

(a) What are the categories of data based on who creates them? For each category, provide an example of data and justify why the data in your example belongs to a certain category.

[4 marks]

(b) What are the categories of data based on their format? For each category, provide an example of data and justify why the data in your example belongs to a certain category.

[6 marks]

[4 marks]

(d) For each of the following analytic tasks, identify its type and provide justification for the type you have identified.

(i) Computing the median of a given stream of data comprised of temperature readings.

(ii) Applying Principal Component Analysis (PCA) to a matrix of 50 rows and 50 columns, whose elements are integers from 0 to 10.

(iii) Clustering the records of a dataset that is stored over 50 machines (nodes of a computing cluster), based on their pairwise similarity.

(iv) Computing the shortest path between two nodes of a graph, representing a social network. The graph is comprised of 10 nodes and 100 edges.

[8 marks]

(e) Identify the most appropriate analytic setting for performing each of the tasks (ii), (iii), and (iv) in part (d) of Question One. Provide justification for your answers.

[3 marks]

Question Two

(a) What is a publish-subscribe messaging data access connector? Briefly describe its main components. How does a publish-subscribe messaging data access connector differ from a Source-sink data access connector?

[6 marks]

(b) Consider the following Apache Sqoop command:

sqoop import --connect jdbc:mysql://localhost/hadoop --username U --password P --table demographics -m 5 --columns "age, gender"

Describe the process that is performed when this command is executed.

[6 marks]

(c) Consider the task of using Apache Flume for transferring Twitter data from Twitter into memory and then writing the data into HDFS and into local files.

Describe the main architectural components of the Apache Flume system that are used to perform the task, providing their names and types (if any), as well as their main functionality for the given task.

Draw a diagram to illustrate how these components are connected.

[7 marks]

(d) Consider the following part of a MapReduce program (using mrjob), where the mapper and reducer are as follows:

def mapper(self, _, line):
    yield "chars", len(line)

def reducer(self, key, values):
    yield key, f(values)

and f() is a given function.

Provide an example of a function f(), for which a combiner can be used to improve the efficiency of the program. Justify your choice of f() and explain the input and output of the combiner that uses the function f() you chose.

[6 marks]
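For readers who want to experiment with part (d), the following is a minimal sketch, not a model answer. It simulates the map, combine, and reduce phases in plain Python instead of running mrjob, and it assumes that f is the built-in sum, which is associative and commutative. The helper run_job and the two input splits are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def mapper(line):
    # Same logic as the mapper in the question: one ("chars", length) pair per line.
    yield "chars", len(line)


def combiner(key, values):
    # Assumption: f is sum. Runs once per input split, on local mapper output only.
    yield key, sum(values)


def reducer(key, values):
    # Receives the values for a key from every split and adds them up.
    yield key, sum(values)


def run_job(splits, use_combiner):
    """Simulate the job; return (reducer output, number of pairs shuffled)."""
    shuffled = defaultdict(list)
    pairs_sent = 0
    for split in splits:
        # Group the mapper output of this split by key.
        local = defaultdict(list)
        for line in split:
            for key, value in mapper(line):
                local[key].append(value)
        # Either pre-aggregate locally or forward every pair unchanged.
        for key, values in local.items():
            if use_combiner:
                out = combiner(key, values)
            else:
                out = ((key, v) for v in values)
            for k, v in out:
                shuffled[k].append(v)
                pairs_sent += 1
    result = {}
    for key, values in shuffled.items():
        for k, v in reducer(key, values):
            result[k] = v
    return result, pairs_sent


splits = [["ab", "cde"], ["f", "ghij", "kl"]]  # two hypothetical input splits
without_combiner = run_job(splits, use_combiner=False)
with_combiner = run_job(splits, use_combiner=True)
print(without_combiner)  # ({'chars': 12}, 5)
print(with_combiner)     # ({'chars': 12}, 2)
```

In this sketch the final result is the same in both runs, while the combiner reduces the number of pairs passed to the reducer from one per line to one per split.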

Question Three

(a) Briefly describe three differences between Relational Database Management Systems and NoSQL databases.

[6 marks]

(b) For each of the following example applications, select an appropriate type of NoSQL database. Justify your choice.

(i) An application that needs to efficiently read key/value pairs representing customers, where the key is a customer’s id and the value is a string containing customer information.

Key	Value
111	“store 1, 13:00, 10.5GBP, tip”
324	“12:00, 14:00, waiting time=10”
742	“part 3, 120GBP, 30 mins”
…	…

(ii) An application that needs to perform filtering operations (e.g., regular expressions) on customer information.

(iii) A social application that predicts the behaviour of customers based on their connections and interactions on a social network.

(iv) An application that needs to perform queries on data represented as a high-dimensional and sparse matrix (i.e., a matrix with many columns and many zero values).

(v) An application involving queries based on relationships between customers (e.g., to find groups of interconnected customers).

[10 marks]

[4 marks]

(d) What is a replica set in MongoDB? Briefly describe the components of a replica set in MongoDB and their main role.

[5 marks]

Question Four

(a) What are transformations and actions in Apache Spark? How do they differ?

[5 marks]

(b) Consider the following lines of pySpark code:

MyRDD=sc.parallelize([1,2,3,4,5,6,7,8,9,10])

MyRDD2=MyRDD.filter(lambda x: x*x+1>4)

MyRDD2.collect()

(i) Which of the above commands (if any) are NOT computed instantly?

[2 marks]

(ii) What benefits does this “lazy evaluation” offer?

[2 marks]

(iii) What is the result of executing these lines of code and why?

[2 marks]
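The deferred evaluation that part (b) asks about can be explored without a Spark installation. The sketch below is only an analogy in plain Python, not pySpark: the built-in filter returns a lazy iterator, so the predicate does not run until the result is consumed, which loosely mirrors a transformation being deferred until an action is called. The call counter is a hypothetical addition used to make the deferral visible.

```python
calls = {"n": 0}  # hypothetical counter, not part of the exam code


def predicate(x):
    # Same condition as the lambda in the question, plus the call counter.
    calls["n"] += 1
    return x * x + 1 > 4


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
lazy = filter(predicate, data)  # analogous to a transformation: nothing runs yet
calls_before = calls["n"]       # the predicate has not been called so far
result = list(lazy)             # analogous to collect(): forces the evaluation
calls_after = calls["n"]

print(calls_before, calls_after)  # 0 10
print(result)                     # [2, 3, 4, 5, 6, 7, 8, 9, 10]
```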

(i) Assuming the most basic type of estimate (no improvements to the estimate), what is the estimate that is output by a single Flajolet-Martin (FM) sketch when it is applied to D, using h(x), and the hash values (i.e., outputs of h(x)) are stored using 4 bits? Justify your answer.

[10 marks]

(ii) Propose two ways to improve the estimate. Your answers can assume multiple Flajolet-Martin sketches.

[4 marks]
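The basic Flajolet-Martin estimate can be sketched in a few lines. The definitions of D and h(x) do not appear in this copy of the paper, so the stream and hash function below are hypothetical stand-ins and NOT the values from the exam. The sketch assumes the common formulation in which R is the largest number of trailing zero bits over all hash values and the estimate is 2^R. How an all-zero hash value is treated is also an assumption.

```python
def trailing_zeros(value, bits=4):
    # Number of trailing zero bits in a `bits`-wide value.
    # Assumed convention: an all-zero hash counts as `bits` trailing zeros.
    if value == 0:
        return bits
    count = 0
    while value & 1 == 0:
        value >>= 1
        count += 1
    return count


def fm_estimate(stream, h, bits=4):
    # Basic FM estimate: 2^R, where R is the largest trailing-zero count seen.
    r = max(trailing_zeros(h(x) % (1 << bits), bits) for x in stream)
    return 2 ** r


def h(x):
    # Hypothetical hash function, NOT the one from the exam.
    return (3 * x + 1) % 16


D = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]  # hypothetical stream, NOT the exam's D
estimate = fm_estimate(D, h)
print(estimate)  # 4
```

With these stand-in values the hashes are 4 (0100), 7 (0111), 10 (1010), and 13 (1101), so R = 2 and the estimate is 4, which happens to equal the true number of distinct elements in this stream.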