Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

ALY 6030

Final Project (Assignment 4)

EdX Data Modeling

Assignment Instructions:

Submissions for this Assignment consist of:

(i) The completed Word document.

(ii) The completed SQL script file (with file extension “.sql”) that contains all the queries written for creating the database, creating the tables, and the queries written for answering the questions. Note that you must write all your queries in the designated spaces in the SQL script file provided. Do not use any other script file as it will not be accepted.

You will create a normalized database by (i) determining the entities, (ii) determining primary keys, (iii) determining relations and foreign keys, (iv) sketching the ERD diagram for the database, (v) explain each step of the Normalization process. Save your both your Word document the the completed SQL script file as: ALY6030_FinalProject_Your Last Name_Your First Initial;

for example, ALY6030_FinalProject_Doe_J.

Overview

In this project, you will build the data model from a dataset that contains users who took courses taught on the edX platform. These data are provided to you as a single CSV file, almost exactly as they were made available from the original website.

Part 1:

In order to make inferences from this dataset, we’ll need to clean up certain fields and also make some adjustments to the data model. This is a common task to undertake for customers or clients. They’ll often hand you data for analysis in a giant Excel sheet, and it will be your job to understand the business rules underpinning the dataset.

Exploring the dataset

If you open up the dataset in Excel, you can take a look at it quickly to see what it looks like and what data are included.

 

Figure 1: Data for edX database

The data that are available as you build the database include:

Column Name

Description

course_id

three-part identifier for a course number

course short title

Short title for the course

course long title

Long title for the course

userid_DI

Individual user ID

registered

Whether the user is registered (1/0)

viewed

Whether the user has viewed the contents (1/0)

explored

Whether the user has explored the course (1/0)

certified

Whether the user is certified (1/0)

country

User’s country of origin

loe

User’s level of education

yob

User’s year of birth

age

User’s age

gender

User’s gender

grade

User’s grade in the course

nevents

Number of events the user has done on the site

ndays_act

Number of actions taken by the user

nplay_video

Number of video plays done by the user

nchapters

Number of chapters read by the user

nforum_posts

Number of forum posts made by the user

roles

Any roles the user has

incomplete_flag

Whether the user has an incomplete for the course

Start The assignment:

1. In the sql file, write sql queries in the designated spaces to:

a. Create the database edX.

b. Create a table named “temp” with the identical columns as those in the CSV file.

c. Next, use the  “LOAD DATA LOCAL INFILE” query to dump data from CSV into the “temp” table.

Note: The database creation, the table creation, and the data dumping should be done by executing the queries written in the designated spaces in the SQL script file. No other means of database creation, table creation, or data dumping will be accepted

2. There are a few things that look tricky about this data that might make it hard to put in a database correctly. Some fields are composites of several pieces of data. For example, the very first field, course_id, seems to have three parts, which actually relate to three different pieces of data. The first part is the institution, the second part is the course number, and the third part is the academic term.

a. In the sql file , write sql queries to add 3 new columns to the ‘temp’ table, located immediately and successively after the “course_id” column, and labeled respectively as “institution”, “course_number”, and “course_term”.

b. In the sql file, write queries to dump data from the first column (course_id) into the new columns created in part (a) above. Note that the first portion of course_id should occupy the “institution” column, the second portion should occupy the “course” column, and the third portion of course_id should populate the “term” column. 

Course_id                                    à  institution    course_number   course_term

HarvardX/CB22x/2013_Spring    à  HarvardX          CB22x           2013_Spring

c. In the space box below, describe the reason as to why the course_id key had to be decomposed into three pieces. Double-click into the box to start writing. Please do not write outside of the box. Only the text inside the box will be reviewed for grading. (Nte: Double-click into the box to start typing).

3. In the space below, describe how many dimensions you recognize in the data, and explain what those dimensions are.

4. Next, your objective is to create your edX database according to the following dimensional model:

 

Figure 2: Dimensional modeling of edX database

5. In the space below, explain (with reason) whether in your opinion, the above database is 1NF or 2NF or 3NF or none of the above.

6. In the space below, describe why the table “course_users” should have been created in this database.

7. In the space below, describe why the attribute “age” has not been included with the “users” table.

8. In the sql file and after you have created the edX database according to the above model (no more and no less), write queries to remove the “temp” table from your database. Note that for the remainder of this assignment, the “temp” table can no longer be used.

9. In the space below, insert the image of the relationship diagram (the ERD diagram) obtained from the Workbench for your edX database. Note that your ERD should be consistent with your queries written in the sql file (after creating the database and all key assignments, use the “reverse engineer” of the workbench to generate an ERD of your database).  

Part 2:

Instructions

For each question,

(i) IN the sql script file provided, enter your queries needed to obtain the required answer in the.

(ii) For each question, insert (or copy/paste) an image of the output obtained in Workbench in the space provided in this word document. Do not include your query.

Note: Only the database tables may be used to answer the following questions. In particular, the “temp” table cannot be used to answer any of the questions below

1.  How many students were registered in MITx/6.00x in fall 2012 and Spring 2013 combined, only including those whose grades were not zero or Null.   

2.  List the countries with doctoral students, along with the number of doctoral enrollees (registered) corresponding to each country and their average grades (the average grade corresponding to each country). Exclude the grades that are either zero or Null, and sort according to the number of enrollees in descending order.

3. How many German students were registered in courses that were offered by HARVARD in 2013? What was their average score in those courses (not including those whose grades were zero or Null)?

4. List all countries with students registered in a course offered by MIT in 2013, along with the average grade for each country. Exclude the grades that are either zero or Null, and order alphabetically according to the countries. Do not include the counties whose names contain “Unknown”, “Other”, or both.

5.  What was the average grade in MITx/6.00x/2013_Spring for users whose countries were US, China, India, France, Brazil, or Japan? Exclude the grades that were zero or Null, and order according to the average grades.

6.  What was the total number of people registered in each course offered? Display the courses according to the course long title and order by the number of registered students in each course.

7. Find the list of courses that are hosted at HarvardX and have more than 10000 enrollees (i.e., registered)? Order these courses according to the number of enrollees from the largest to the smallest.

8. What fraction of users view, explore, or certify in the content in each course once they have registered? Order the courses according to the fraction viewed – from the largest to the smallest.

9. List Users who have registered for more than two courses?  Include the user's ID, age, country, level of education, and the number of courses registered. Order the list according to the number of courses that the user has registered in.

10. Use your tables to determine how many doctoral users are there by country, ordered alphabetically by country name, and not including those whose 'country' is indicated to be “Unknown”, “Other”, or both.

11. For each country, what was the average grade of users who were certified in a Harvard course? Do not include those whose 'country' is labeled as “Unknown”, “Other”, or both. Order according to the average grade for each country in descending order.

12. Create a view with the following three columns

a. Column 1: A column of all the (unique) courses listed by the “course_long_tile”

b. Column 2: For each course, list the country whose registered users had the highest average grade in that course. Do not include countries whose names contain “Unknown”, “Other”, or both. Also, do not include the users whose grades were either zero or Null.

c. Column 3: The average score of the users corresponding to each country listed in Column 2.

13. Use your database tables with JOIN statements to retrieve, as a VIEW, the original “temp” table used in creating this database; excluding the following columns of the temp table:

`course_short_title`, `nevents`,  `ndays_act` , `nplay_video`, `nchapters` , `nforum_posts`, `roles`, and `incomplete_flag`.