
DTS106TC Introduction to Database

Coursework- Assessment Task 002

Due: 17:00, June 2nd, 2023

Weight: 40%

Maximum Marks: 100

This coursework will be assessed against the following learning outcomes:

Demonstrate a basic understanding of the design of databases.

Show a fundamental grounding in the operation and usage of database management systems including "hands-on" experience of a basic database management system.

Demonstrate in-depth knowledge of the database language, SQL.

Overview:

Your task is to create a database to represent scratch-off (lottery) ticket data from a simulator, in order to be able to answer questions and perform simple analysis. This will involve creating and normalizing a database to hold the data from a dataset and constructing queries against the data once it is in a suitable form.

This coursework consists of the following five parts:

The Relational Model (20%)

Normalization (25%)

Database construction (15%)

Querying (30%)

Visualization (10%)

You will be required to write a brief report that answers the questions and records your process, thoughts, and any assumptions you made. You will also need to construct a database using Oracle Application Express (APEX) or SQLite and produce a set of queries that can be run against that database.

Marking Criteria

This coursework will be marked out of 100. There are 20 questions, with 5 marks available for each question. Marks will be awarded based on the level of correctness of each answer: for example, 100% for a fully correct answer with the required explanation/justification, and partial marks where an answer is partially incorrect or missing required details. Marks will be awarded as follows:

Full marks: Complete and correct answer with rationale

80% marks: The answer is correct but the rationale/justification is only partially correct

60% marks: Partially attempted/correct

40% and 20% marks: Based on the level of attempt/correctness

0 marks: Not attempted/wrong

Dataset

The dataset to be used for this assessment is scratch-off (lottery) ticket data. The dataset contains the number of lottery tickets distributed in different regions and the number of winners of different prizes.

 

The dataset and data dictionary can be downloaded from LMO under Assessment 002 - Coursework.

Setup

You can complete this coursework using either Oracle Application Express (APEX) or SQLite. All you need to submit is the SQL scripts for constructing and querying the database, along with the report.

Q1: The Relational Model (4*5=20 Marks)

Q1(a): Write down the relations that directly map the dataset file into a mega table/relation. You do not have to create SQL code for transforming the dataset into a table; just list the relations and attributes with appropriate data types. Here is an example:

R1 (A1 [data type], A2[data type]…An[data type])

where R1 refers to the relation and A1 to An are its set of attributes.

Note: You must use the attribute names exactly as they appear in the dataset.

Types: To ease the modeling process, you can use either SQLite data types (INTEGER, TEXT, BLOB, REAL, and NUMERIC); please refer to the following link for more information: https://www.sqlite.org/datatype3.html

or Oracle data types (NUMBER, VARCHAR2, CHAR, CLOB, BLOB, etc.): https://docs.oracle.com/cd/B12037_01/server.101/b10759/sql_elements001.htm

Q1(b): List the minimal set of Functional Dependencies (FDs) and provide an explanation.

· Every FD must be minimal on its LHS (left-hand side)

o There can be more than one attribute on the LHS, but there should be no attributes on the LHS that add no further information (e.g. if A -> C holds, then A, B -> C would not be minimal, as B adds nothing further)

· There should be no trivial FDs (A -> A adds nothing of value)

· There must be no redundant FDs (those which are already implied by other FDs)

· Tip: If A -> B and C -> B and C -> A then C -> B is redundant (as this is implied by C -> A -> B)

Tip: Explain any assumptions you make by applying what you know of the domain to the data, and consider future data and the impact it may have as well.

Tip: You will need to determine whether values are 'blank' (a known value of blank) or NULL (an as-yet-unknown value), as this may have an impact on your dependencies. Explain any assumptions and decisions you make in the report.

Q1(c): List all potential candidate keys with justification

Tip: Remember, a candidate key contains no redundant attributes

Q1(d): Identify a suitable primary key, and justify your decision

Q2: Normalization (5*5=25 Marks)

In this exercise, you will decompose the mega relation into a set of normalized schemas and explain the role of each schema in the database.

You should apply a step-by-step procedure to transform the unnormalized table created in Q1 into a normalized database. Each schema must be in 3rd Normal Form (3NF). Please consider the following guidelines while answering the questions below:

Keys: You should introduce new surrogate keys only where necessary to avoid anomalies, and you should explain this in your report with a justification.

Attributes: While you may introduce new attributes if you wish, you must not rename, remove, or change the values of any of the attributes in the original relation. All must appear, as originally named, in your decomposed relations.

NULL values: NULL values are not values in themselves, but represent unknown values in the dataset (you cannot treat all NULL values as the same 'null' value). NULL values can be present throughout the normalization process; you do not need to remove them. However, you may find you need to introduce surrogate keys where a NULL could be, or is, present in something you would want to be a key or split into a relation.

Q2(a): List any partial-key dependencies from the mega relation as it stands in the dataset file and any resulting additional relations you should create as part of the decomposition.
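
Tip: As a generic illustration only (these attribute names are hypothetical, not the dataset's): given a relation R (A, B, C, D) with candidate key {A, B} and the FD B -> D, the attribute D depends on only part of the key, so B -> D is a partial-key dependency. The usual decomposition is R1 (A, B, C) with key {A, B} and R2 (B, D) with key B, where B in R1 references R2.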

Q2(b): Convert the relation into 2nd Normal Form using your answer to the above. List the new relations and their fields, types, and keys. Explain the process you took.

Q2(c): List the transitive dependencies, if any, in your new relations and justify your answer.

Q2(d): Convert the relation into 3rd Normal Form using your answers to the above. List the new relations and their fields, types, and keys. Explain the process you took.

Note: Depending on assumptions you have made in earlier processes, tables may already be in 3NF; in that case, demonstrate why a table is in 3NF.

Q2(e): Is your relation in Boyce-Codd Normal Form? Justify your answer.

Q3: Database Construction (3*5=15 Marks)

Q3(a): Import the raw dataset into a single table called 'dataset'.

Tip: To import the dataset in SQLite DB Browser, you should create a database called lottery.db and then use File -> Import -> Table from CSV file.

Note: You should not change the CSV file - it must be the original provided dataset file. The attribute names must be the same as in the original file.

Export this table as dataset.sql (including CREATE and INSERT statements), such that running it will import the full dataset into a fresh database.

The entire database at this point should be saved as dataset.sql.
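
Tip: If you prefer the sqlite3 command line to DB Browser, a minimal sketch along the following lines achieves the same import and export (this assumes the CSV file is named dataset.csv; adjust the file name to match the provided dataset file):

    .open lottery.db
    .mode csv
    .import dataset.csv dataset
    .output dataset.sql
    .dump dataset
    .output stdout

Here .import creates the 'dataset' table using the CSV header row for the column names, and .dump writes the corresponding CREATE and INSERT statements to dataset.sql.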

In the report, write down the process you took, and submit dataset.sql along with the report.

Q3(b): Write the SQL to create the full normalized representation, including all tables as a result of the decomposition process you took for Q2, excluding the mega dataset table.

The SQL should contain CREATE statements to create all new tables. You should include foreign keys where appropriate, and list and justify these in your answer.
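
For instance, a minimal sketch of the expected form, using hypothetical table and column names rather than your actual decomposition (SQLite syntax):

    CREATE TABLE Region (
        Region_ID  INTEGER PRIMARY KEY,  -- hypothetical surrogate key introduced during normalization
        Province   TEXT,
        City       TEXT
    );

    CREATE TABLE Draw (
        Draw_ID    INTEGER PRIMARY KEY,
        Draw_Date  TEXT,
        Region_ID  INTEGER,
        FOREIGN KEY (Region_ID) REFERENCES Region (Region_ID)  -- links each draw to its region
    );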

Q3(c): Write INSERT statements using SELECT to populate the new tables from the 'dataset' table.
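
For instance, a sketch of the pattern using the same hypothetical names as above (your actual tables and columns will differ):

    INSERT INTO Region (Province, City)
    SELECT DISTINCT Province, City
    FROM dataset;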

Save both the CREATE and INSERT statements in a single file, Normalized_Data.sql, and run them to populate the normalized database.

Each SQL statement should be written in the report, as well as saved. You should explain the steps for each of the above questions. Briefly write down the process you went through, in enough detail that a person with just your report could reproduce what you have done.

Submit dataset.sql (containing the original dataset) and Normalized_Data.sql along with your report.

Q4: Querying (6*5=30 Marks)

For each exercise in this question, you will need to write a SQL query against the newly created normalized tables in your database. Queries written against the unnormalized mega dataset table will be considered wrong.

Each SQL statement, its explanation, and a snapshot of its results should be written in the report. In addition, all queries, numbered 4a, 4b, and so on, should be saved in a single file named sqlScript.sql that can be run against your normalized database as it stands at the end of Q3. You should also briefly describe your approach for each query in the report.

Write the SQL statement, an explanation, and a snapshot of the results for each of the following questions:

Q4(a): The worldwide total number of winners and first prizes with total winners and total first prizes as columns.

Q4(b): The number of WinnerNum by date, in increasing date order, for the city ‘Jnddanbbrr’ with date and number of WinnerNum as columns.

Q4(c): The total number of winners and first prize by date, in increasing date order, for each province (with the province, date, number of winners, and number of first prize as columns)

Q4(d): The number of winners and first prizes as a percentage of the Total_tickets, for each city (with city, % winners of total scratch-off (lottery) tickets, and % first prizes of total scratch-off (lottery) tickets as columns).

Q4(e): A descending list of the top 10 cities, by percentage total first prizes out of total scratch-off (lottery) tickets in that city (with city name, and % first prizes of city scratch-off (lottery) tickets as columns)

Q4(f): The date against a cumulative running total of the number of first prizes by day and winners by day for the city ‘Jnddanbbrr’ (i.e. date, City_daily_winners, City_cumulative_winners, City_daily_FirstPrizes, and City_cumulative_FirstPrizes as columns)

Note: Cumulative winners refers to the sum of all the winners up to a specific date. For example, if the daily winners are 1, 6, 10, 20, the cumulative winners would be 1, 7, 17, 37.
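
If you compute the running totals in SQL, a window function is one possible approach. A sketch with hypothetical table and column names (window functions require SQLite 3.25 or later; Oracle also supports this syntax):

    SELECT Draw_Date,
           WinnerNum AS City_daily_winners,
           SUM(WinnerNum) OVER (ORDER BY Draw_Date) AS City_cumulative_winners
    FROM city_daily_totals
    ORDER BY Draw_Date;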

Q5: Visualization (2*5=10 Marks)

Q5(a): Write a SQL script to plot the top 10 cities in terms of overall cumulative first prizes only. Include an explanation of your script in the report.
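
As a sketch of the general shape only (hypothetical table and column names; LIMIT is SQLite syntax, while Oracle would use FETCH FIRST 10 ROWS ONLY):

    SELECT City, SUM(FirstPrizeNum) AS total_first_prizes
    FROM city_daily_totals
    GROUP BY City
    ORDER BY total_first_prizes DESC
    LIMIT 10;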

Q5(b): Produce a graph that has the date on the horizontal axis and the cumulative number of first prizes by city on the vertical axis. You should represent only the top 10 cities in terms of overall cumulative first prizes.

The full script and resulting screenshot of the graph should be included in the report.

Report Submission Guidelines

Report

You must write your report in MS Word and submit it as a PDF. The name of the PDF file should be yourID_report.pdf.

Your report should contain:

The first page of this document with your ID, for grading purposes

You must have a section for each part of the coursework, and they must be named as follows:

1. The Relational Model

2. Normalization

3. Database construction

4. Querying

5. Visualization

Files

Your submission should be a single zip file. Your submission must include the following in this structure:

Report.pdf

dataset.sql: the CREATE and INSERT statements of the unnormalized dataset after importing the dataset

Normalized_Data.sql: The full database after creating and populating the normalized tables (CREATE and INSERT statements of the normalized dataset)

sqlScript.sql: SQL statements to run the queries against the normalized tables

Generic Marking Criteria and Assessment Rubric

Below are the generic marking criteria and assessment rubric.

Support

Any questions and answers will be added to the FAQ. Please use the coursework discussion channel on LMO as the first port of call for any questions, problems, clarifications, or anything you would like us to go over. If you prefer to ask privately, please send an email to the relevant instructor.