Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

MSc ANALYTICS

DATA ENGINEERING PLATFORMS (MSCA 31012)

ASSIGNMENT 1

Assignment Objectives

 Data Preparation

 Relational Algebra

 Relational Modeling

 Database Normalization

Software Installation

Follow the installation guides in the course material and install the following software on your computer: OpenRefine, MySQL database, MySQL workbench

Data

 Download the sandyrelated.csv dataset which can be found under the data folder on the course portal.

 Create database using sql scripts located under Files>Data>Sakila folder in the course material

 First run sakila-schema.sql and then sakila-data.sql

 Further documentation : https://dev.mysql.com/doc/sakila/en/

Submissions

 Solutions and screenshots should be submitted as a single PDF or Word document.

 Submit other additional artifacts such as excel (for normal forms) as required.

 Please do not submit the cleaned up dataset(s) or the OpenRefine project

1. Data Wrangling Exercise:

Run the following data preparation steps on the dataset below and submit screenshots for questions d-h (screenshots should show configuation with all checkboxes and options and not summary results).

a. Import the dataset into OpenRefine and create a new project "SandyCleanup"

b. Trim white spaces on all address related columns and transform addresses into title case

c. Remove columns where majority of the cells are empty or have "Unspecified" or “NA” values. Do not remove columns which are being used in subsequent questions.

d. Convert the City column to title case, then Cluster and Merge the column

e. Clean up the Descriptor Column - Cluster and Merge the following text categories:

1. "Other Water problem(WZZ)", "Other Water problem(QZZ)" as "Other Water Problem"

2. "Commercial 421 A/B Exemptions" as "Commercial Exemption"

3. "Commercial Exemption" "Commercial Other Exemption" as "Commercial Exemption"

4. "Personal DRIE Exemption", "Personal SCHE Exemption", "Personal DHE Exemption" as "Personal Exemption"

f. Clean up the Location Type - Cluster and Merge the following text categories:

1. "Comercial", "Commercial", "Store/Commercial" as "Commercial" Assignment 1

2. "RESIDENTIAL BUILDING", "Residential Building", "Residence" as "Residential"

3. "Club/Bar/Restaurant", "Bar/Restaurant", "Restaurant" as "Club/Bar/Restaurant"

4. "3+ Family Apt. Building", "3+ Family Apartment Building" as "3+ Family Apartment"

5. "Street/Sidewalk", "Street and Sidewalk" as "Street/Sidewalk"

g. Online web services such as the following can be used to fetch the address given a geocode:

Open Street Maps or Google Reverse Geocoding for a Latitude/Longitude

Web Service API Example:

https://nominatim.openstreetmap.org/reverse?format=json&lat=40.714224&lon=-73.961452

Formulate the URL expression in OpenRefine that would fetch the complete JSON results from this web service API (You do not need to invoke the API or download the results of the web service call – but the resultant URLs need to be valid and work)

h. Look for at least two other clean up opportunities and execute using OpenRefine

i. Export final project into a CSV file on your local computer (you do not need to submit this file)

2. Relational Modeling

a. Download Sakila dataset sakila-db.zip from the course material and unzip on your computer

b. Open MySQL workbench and first execute the scripts - sakila-schema.sql followed by sakila-data.sql

c. Use MySQL workbench to Reverse Engineer the database and generate the EER model

d. Modify the EER model to add a new lookup table : payment_type

 This table will have a 1 to Many relationship with the Payment table.

 Attributes of payment_type table:

 payment_type_id (Primary Key) : SMALLINT(6)

 method - varchar (10)

 description – varchar (45)

 Add payment_type_id as a foreign key in the Payment table as follows:

0. payment_type_id (Foreign Key) : SMALLINT(6)

Note: Submit the screenshot for the above change in the EER model. You do not need to make changes to the physical tables or add data in the database.

e. For the Payment table fill out the form below:

3. Normalization

a. For the table below, provide examples of insertion, deletion, and modification anomalies.

b. Normalize this data to 3NF and list any assumptions made during the normalization process.

4. Data Modeling:

Design a data model that can be used for property management and monitoring of single-family homes for investors and owners. Consider data for the following entities/attributes that need to be captured by business:

a. Home location

b. Age of the house

c. Construction material used

d. Type of residence (apt, condo, etc.)

e. Home layout (number of roomes, sq footage, etc.)

f. Number and Types of Appliances (Heating, Fridge etc.)

g. Name and other details of the renters/leasers/resident (s)

h. Rental Payments made against the house

i. Add other entities (and/or collection of attributes) that you think could add insights for the investors and business users

Submit a design presentation with the following in a 4-5 slide deck in ppt (powerpoint) or pdf format.

 Entity Relationship (EER) diagram (with tables, attributes, relationships, cardinality)

 A short summary of design considerations that includes design best practices, data integrity, constraints, database performance, etc.