
2022-23 Semester 1 Examination

COM6002

Big Data Management

Question 1: Role-play [60 marks]

A. Background:

You are a data scientist in the data research team at XYZ Limited.

The flagship product of XYZ is a data platform providing real-time and historical financial data of cryptocurrencies, e.g., bitcoin. There are two kinds of users of the data platform: (1) paid subscribers: who can access any real-time data and all historical data of 1,000 cryptocurrencies; and (2) free users: who can only access data of the past 7 days for 10 cryptocurrencies and cannot access real-time data of the recent hour.

At the moment, there are 1,000 paid subscribers and 10,000 registered free users. The maximum number of concurrent users of the data platform is 500.

B. Current situation:

XYZ’s current database system has been MySQL (on a single server machine) since the company was founded in Sep 2012. Recently, the CEO has been discussing a partnership with ABC Limited, a financial consulting firm with a large customer base. To facilitate the partnership, ABC requires XYZ to provide social sentiment data that analyzes whether people on social networks are bullish or bearish on each cryptocurrency.

If the partnership is successful, XYZ’s platform will also be used by ABC’s customers. That would mean the number of concurrent users will increase to 10,000.

ABC also commented that MySQL is outdated technology and should be replaced. The CEO is not from a technical background and is not able to reply to ABC. Thus, the CEO requested the data research team to submit a proposal to review and, if appropriate, revamp the existing data management system. The proposal should of course address the need for the coming social sentiment feature.

C. Your task:

The head of the data research team has appointed you to study the technical part of the proposal. Other aspects, like budget and schedule, will be handled by your colleagues. Your head does not have a preferred solution. She will adopt your idea as long as it makes sense to her. Write the technical part of the proposal that includes at least the following parts:

1.   Your proposed data management solution

a.   How the existing data platform can be migrated to use your proposed solution

b.   How the new feature of social network sentiment is supported by your proposed solution

2.   Technical justifications for using your proposed system

a.   Make sure you have also addressed all the concerns / comments from your colleagues (See Section D)

3.   Good alternative approaches that you have considered

4.   Reasons why the alternatives are not used

Note:

1.   You should propose only ONE solution. Your solution should be specific, e.g., don’t say you use a relational database management system, but give the actual system name in this case, e.g., MySQL.

2.   Your solution should include implementation information like what data to store and where data are stored etc.

3.   The primary reader of the proposal is your team head who is very technical. Make sure your justifications are technically sound and clear.

4.   Your proposal may be later read by your CEO or other parties who are not technical. Make sure your content can be understood by a layman.

5.   This is a formal proposal. Use a proper format for your proposal.

D. Information and views from your colleagues:

Your colleagues have given you more information about the current situation and their views for your reference. Note that their opinions may not be the best. Please use your own judgement to design your solution.

1.   CEO

ABC wants to have sentiment data updated every hour. Similar to our existing data platform, historical sentiment data should be provided.

2.   CTO

I have a concern about the risk of migration. If the current system is working well now, we should try to make use of the current system as much as possible.

3.   Head of IT support

The current server (one machine) can only handle at most 1,000 concurrent users. You need to think about how to handle 10,000 concurrent users in the future.

4.   Head of Sales

Some customers ask for extending the functions of our current platform. For example, adding more data attributes like a 5-day moving average. Adding these attributes will definitely increase the competitiveness of our platform.

5.   Head of Data Research

I suggest we only work on the text part of messages on social networks and only in English. We ignore images and videos. Even so, we are talking about around 50GB of raw data per day. There are at least 5 years of social network data available. We can always download historical data from social networks at any time. However, the download speed is slow, around 50GB per hour.

6.   Data scientist for NLP modelling

The text on social networks is often non-standard English. There are many typos too. I don’t know if the NLP model will work well.
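The figures quoted by the Head of IT support and the Head of Data Research imply some back-of-envelope numbers that a proposal may want to state explicitly. The sketch below works them out. It assumes 365-day years, decimal units (1 TB = 1,000 GB), a constant download speed, new data arriving evenly over the day, and load spreading evenly across identical servers; none of these assumptions is stated above.

```python
import math

# Figures quoted in Sections B and D.
RAW_GB_PER_DAY = 50              # English text only
YEARS_AVAILABLE = 5              # "at least 5 years"
DOWNLOAD_GB_PER_HOUR = 50
USERS_PER_SERVER = 1_000         # limit of the current single machine
TARGET_CONCURRENT_USERS = 10_000

# Assumption: 365-day years and 1 TB = 1,000 GB.
backlog_gb = RAW_GB_PER_DAY * 365 * YEARS_AVAILABLE
backlog_tb = backlog_gb / 1_000

# Assumption: the download speed stays constant for the whole backlog.
download_hours = backlog_gb / DOWNLOAD_GB_PER_HOUR
download_days = download_hours / 24

# Assumption: new data arrives evenly, so each hourly update
# needs 1/24 of the daily volume.
minutes_per_hourly_batch = RAW_GB_PER_DAY / 24 / DOWNLOAD_GB_PER_HOUR * 60

# Assumption: load spreads evenly across identical servers.
min_servers = math.ceil(TARGET_CONCURRENT_USERS / USERS_PER_SERVER)

print(f"Historical backlog: {backlog_gb:,} GB (about {backlog_tb:.2f} TB)")
print(f"Time to download backlog: {download_hours:,.0f} h (about {download_days:.0f} days)")
print(f"Download time per hourly update: {minutes_per_hourly_batch:.1f} min")
print(f"Servers needed at current per-machine capacity: {min_servers}")
```

These are lower bounds only: they ignore processing time, storage overhead such as replication, and headroom for peak load.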

E. Database schema and sample data

The current data storage in MySQL has two tables. One table keeps real-time data. The other table keeps historical data. As our platform only provides hourly versions or daily versions for historical data, the current data storage is around 200GB.

Table 1: Realtime

Schema: (symbol: varchar(20), open: float, high: float, low: float, close: float, volume: float, quoteAssestVolume: float, numOfTrades: int, takerBaseVolume: float, takerQuoteVolume: float)

Sample data:

Table 2: Historical

Schema: (date: datetime, open: float, high: float, low: float, close: float, volume: float, quoteAssestVolume: float, numOfTrades: int, takerBaseVolume: float, takerQuoteVolume: float, symbol: varchar(20))
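Derived attributes such as the 5-day moving average raised by the Head of Sales can be computed from the Historical columns. The following is a minimal sketch with pandas, assuming daily rows with one row per symbol per day. The output column name `close_ma5`, the symbols, and all sample values are made up for illustration.

```python
import pandas as pd

# Made-up daily rows using a subset of the Historical columns.
days = ["2022-10-01", "2022-10-02", "2022-10-03",
        "2022-10-04", "2022-10-05", "2022-10-06"]
historical = pd.DataFrame({
    "date": pd.to_datetime(days * 2),
    "symbol": ["BTC"] * 6 + ["ETH"] * 6,
    "close": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0,
              1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Order rows by symbol and date so each rolling window covers
# consecutive days of a single cryptocurrency.
historical = historical.sort_values(["symbol", "date"]).reset_index(drop=True)

# 5-day simple moving average of the closing price, computed per symbol.
# The hypothetical column name close_ma5 is not part of the schema above.
historical["close_ma5"] = (
    historical.groupby("symbol")["close"]
    .transform(lambda s: s.rolling(window=5).mean())
)

print(historical)
```

The first four rows of each symbol have no value in `close_ma5` because fewer than five days of history are available at that point.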

F. Marking criteria

Item | Description | Marks
Feasibility | Does your solution work? Have you described your solution clearly? | 15
Justifications | Have you provided enough justifications? Have you addressed all colleagues’ concerns and comments? Are your justifications correct and logical? | 25
Alternatives | Have you considered alternatives in the proposal? Do they work? Are your justifications about why your proposal is better logical and sound? | 15
Presentation | Is your proposal clear and properly organized? | 5

Question 2: ETL [40 marks]

A. Background

Your team is given a task to analyze the relationship between the NASDAQ index and crude oil prices. You are responsible for the ETL part: preparing the raw data properly so that it can be fed into a machine learning algorithm for training.

Primarily, you may assume the machine learning algorithm is the k-nearest neighbor (kNN) algorithm for regression.
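Because kNN predicts from distances between rows, a feature with a large numeric range can dominate one with a small range. The sketch below illustrates the effect with made-up numbers and hypothetical column names (`oil_price`, `oil_volume`), then rescales each column to [0, 1] with min-max scaling. Min-max scaling is only one option here; standardization is an equally plausible choice.

```python
import numpy as np
import pandas as pd

# Made-up feature rows; the column names are hypothetical.
features = pd.DataFrame({
    "oil_price": [80.0, 95.0, 81.0, 100.0],
    "oil_volume": [300_000.0, 300_500.0, 310_000.0, 400_000.0],
})


def nearest_to_first(df):
    """Return the index label of the row closest to row 0 (Euclidean distance)."""
    diffs = df.iloc[1:] - df.iloc[0]
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    return distances.idxmin()


# On the raw values the volume column dominates the distance,
# so row 1 (similar volume, very different price) is the nearest.
raw_nearest = nearest_to_first(features)

# Min-max scaling maps every column to the range [0, 1].
scaled = (features - features.min()) / (features.max() - features.min())

# After scaling both columns contribute comparably,
# and row 2 (similar price, moderately different volume) is the nearest.
scaled_nearest = nearest_to_first(scaled)

print("Nearest neighbour of row 0 on raw data:", raw_nearest)
print("Nearest neighbour of row 0 on scaled data:", scaled_nearest)
```

The neighbour changes from row 1 to row 2 once the columns share a common scale, which is why the choice of scaling belongs in the ETL documentation.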

B. Deliverables

1.   A document describing how you perform ETL on the raw data

2.   The processed data (as ONE SINGLE csv file)

C. Data sources

1.  NASDAQ data

Data source:

https://finance.yahoo.com/quote/%5EIXIC/history?p=%5EIXIC

2.   Crude oil data

Data source:

https://www.investing.com/commodities/crude-oil-historical-data

You can download a copy of raw data from online sources directly. A copy of the data (prepared in October) can be found on Moodle.

D. Hints and tips

1.   There are missing data. For example, there is no record of crude oil data on 16 Oct 2022. There is no record of NASDAQ data on 22 Oct 2022. The value of Vol. (volume) is missing on 23 Oct 2022 for the crude oil data.

2.   The processed data is directly fed to a machine learning algorithm, so

a.   Remove all inappropriate fields

b.   Add appropriate fields if you think it is helpful for the machine learning algorithm

c.   Keep in mind that the machine learning algorithm is kNN

3.   You are allowed to use any tool(s) to do your ETL actions, not only Python

4.   In Python, you can load a CSV file with pandas.read_csv and save one with DataFrame.to_csv. The data will be kept in the DataFrame structure of pandas.
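The load and save steps from the hints above can be sketched as follows, with one possible treatment of missing values in between. The file names, column names, and sample values are hypothetical, and forward-filling is only one of several reasonable choices, not a prescribed method.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical file names inside a temporary folder.
workdir = Path(tempfile.mkdtemp())
raw_path = workdir / "crude_oil_raw.csv"
out_path = workdir / "processed.csv"

# Write a tiny made-up raw file so the example is self-contained.
# It has no rows for 22 and 23 Oct and no Vol. value on 21 Oct.
raw_path.write_text(
    "Date,Price,Vol.\n"
    "2022-10-20,85.0,100\n"
    "2022-10-21,86.0,\n"
    "2022-10-24,84.0,120\n"
)

# Load: parse the date column and use it as the index.
df = pd.read_csv(raw_path, parse_dates=["Date"], index_col="Date")

# One possible cleaning step: add a row for every calendar day,
# then carry the last known value forward into the gaps.
full_range = pd.date_range(df.index.min(), df.index.max(), freq="D")
df = df.reindex(full_range).ffill()
df.index.name = "Date"

# Save the processed data as a single CSV file.
df.to_csv(out_path)

# Reload to confirm that the saved file round-trips.
reloaded = pd.read_csv(out_path, parse_dates=["Date"], index_col="Date")
print(reloaded)
```

Whichever treatment is chosen for missing rows and values, the documentation should state it so that the processed CSV can be checked against it.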

E. Marking criteria

Item | Description | Marks
Documentation: Data cleaning | How do you address data quality problems? | 10
Documentation: Appropriateness | Is your transformed data appropriate for machine learning (kNN)? | 15
Documentation: Presentation | Is your documentation clear and properly organized? | 5
Processed CSV | Are you able to implement what you describe in the documentation? Is the processed data consistent with the documentation? | 10