Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

MSc ANALYTICS

DATA ENGINEERING PLATFORMS (MSCA 31012)

ASSIGNMENT 1

Assignment Objectives

· Data Preparation

· Relational Algebra

· Relational Modeling

· Database Normalization

Software Installation

Follow the installation guides in the course material and install the following software on your computer: OpenRefine, MySQL database, MySQL workbench

 Data

· Download the sandyrelated.csv dataset which can be found under the data folder on the course portal.

· Create database using sql scripts located under Files>Data>Sakila folder in the course material

· First run sakila-schema.sql and then sakila-data.sql

· Further documentation : h ttps://dev.mysql.com/doc/sakila/en/

Submissions

· Solutions and screenshots should be submitted as a single PDF or Word document.

· Submit other additional artifacts such as excel (for normal forms) as required.

· Please d o not submit the cleaned up dataset(s) or the OpenRefine project

Data Wrangling Exercise:

Run the following data preparation steps on the dataset below and s ubmit screenshots for questions d- h (screenshots should show configuation with all checkboxes and options and not summary results).

a. Import the dataset into OpenRefine and create a new project "SandyCleanup"

b. Trim white spaces on all address related columns and transform addresses into title case

c. Remove columns where majority of the cells are empty or have "Unspecified" or “NA” values. Do not remove columns which are being used in subsequent questions.

d. Convert the City column to title case, then Cluster and Merge the column

e. Clean up the Descriptor Column - Cluster and Merge the following text categories:

1. "Other Water problem(WZZ)", "Other Water problem(QZZ)" as "Other Water Problem"

2. "Commercial 421 A/B Exemptions" as "Commercial Exemption"

3. "Commercial Exemption" "Commercial Other Exemption" as "Commercial Exemption"

4. "Personal DRIE Exemption", "Personal SCHE Exemption", "Personal DHE Exemption" as "Personal Exemption"

f. Clean up the Location Type - Cluster and Merge the following text categories:

1. "Comercial", "Commercial", "Store/Commercial" as "Commercial"

2. "RESIDENTIAL BUILDING", "Residential Building", "Residence" as "Residential"

3. "Club/Bar/Restaurant", "Bar/Restaurant", "Restaurant" as "Club/Bar/Restaurant"

4. "3+ Family Apt. Building", "3+ Family Apartment Building" as "3+ Family Apartment"

5. "Street/Sidewalk", "Street and Sidewalk" as "Street/Sidewalk"

g. Online web services such as the following can be used to fetch the address given a geocode:  O pen Street Maps or G oogle Reverse Geocoding for a Latitude/Longitude

Web Service API Example:

h ttps://nominatim.openstreetmap.org/reverse?format=json&lat=40.714224&lon=-73.961452 Formulate the URL expression in OpenRefine that would fetch the complete JSON results from this web service API (You do not need to invoke the API or download the results of the web service call – but the resultant URLs need to be valid and work)

h. Look for at least two other clean up opportunities and execute using OpenRefine

i. Export final project into a CSV file on your local computer (you do not need to submit this file)

2. Relational Modeling

a. Download Sakila dataset s akila-db.zip from the course material and unzip on your computer

b. Open MySQL workbench and first execute the scripts - sakila-schema.sql followed by sakila-data.sql

c. Use MySQL workbench to Reverse Engineer the database and generate the EER model

d. Modify the EER model to add a new lookup table : payment_type

· This table will have a 1 to Many relationship with the Payment table.

· Attributes of payment_type table:

· payment_type_id (Primary Key) : SMALLINT(6)

· method - varchar (10)

· description – varchar (45)

· Add payment_type_id as a foreign key in the Payment table as follows:

0. payment_type_id (Foreign Key) : SMALLINT(6)

Note: Submit the screenshot for the above change in the EER model. You do not need to make changes to the physical tables or add data in the database.

e. For the Payment table fill out the form below:

Table Name: Payment

Field

( Attributes)

Primary Key (Y/N)

Foreign Key (Y/N)

Related Table(s)

and Cardinality between tables

 

 

 

 

 

 

 

 

 

 

 

 

3. Normalization

a. For the table below, provide examples of insertion, deletion, and modification anomalies.

b. Normalize this data to 3NF and list any assumptions made during the normalization process.

Submit an Excel workbook and with a separate tab for each normal form

Item ID

Product Name

Description

Category

Store ID

Store Location

Store Count

Staff Id

Staff Login

Staff Password

Staff Name

6

iphone 12

Apple iPhone 12

Mobile Phone

1

Chicago

43

121

Wab

Adsfw

Walter Burns

6

iphone 12

Apple iPhone 12

Mobile Phone

2

Los Angeles

70

210

Gha

qwerty23

Gloria Alana

4

iwatch series 6

 

Smart Watch

3

New York

21

1213

Fsa

Ksdljaslf

Frank Avalia

9

Iwatch series 5

Apple iWatch 5

Smart Watch

4

San Francisco

12

421

Awt

gohawks1

Adam Turner

3

ipad 6

 

Tablet

5

Washington DC

32

632

Sam

milton41

Sandeep Mathew

2

Power Cord USB

Power Cord

Adapter

5

Washington DC

100

456

Tw

Hung@ry

Tony Winters

8

ear pods

 

Headphones

6

Seattle

40

325

Alv

blimp123

Alba Victors

8

ear pods

 

Headphones

1

Chicago

50

1425

bmb

asliwn

Brendan Buck

4. Data Modeling:

Design a data model that can be used for property management and monitoring of single-family homes for investors and owners. Consider data for the following entities/attributes that need to be captured by business:

a. Home location

b. Age of the house

c. Construction material used

d. Type of residence (apt, condo, etc.)

e. Home layout (number of roomes, sq footage, etc.)

f. Number and Types of Appliances (Heating, Fridge etc.)

g. Name and other details of the renters/leasers/resident (s)

h. Rental Payments made against the house

i. Add other entities (and/or collection of attributes) that you think could add insights for the investors and business users

Submit a design presentation with the following in a 4-5 slide deck in ppt (powerpoint) or pdf format.

· Entity Relationship (EER) diagram (with tables, attributes, relationships, cardinality)

· A short summary of design considerations that includes design best practices, data integrity, constraints, database performance, etc.