Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

CIS 4500/5500: Database & Information Systems

Course Project Guidelines

A. OVERVIEW

1. Description

The goal of this project is to identify at least two large, overlapping datasets of interest, import and  integrate them  into  a  SQL  relational  database,  then  create  a web-enabled application of your own choosing over the database.   You will work in teams of four, and you can use any SQL database technology and any web-development stack.

The objective of this project is to assess whether you’re able to apply the concepts you’re learning  about  in  this  course  to  solve  unstructured  problems  with  limited  hands-on guidance. Specifically, the project will require you to practice:

●   wrangling and cleaning data,

●   performing entity resolution (ER),

●   designing schema,

●   writing SQL queries,

●   choosing database indexes,

●   designing a web application,

●   optimizing SQL queries.

If you use this as an opportunity to build something you’re proud of, you are welcome to use the final product to demonstrate your skills to potential employers. Please note that these repositories should be private, and can still be shared with recruiters via a private link in your portfolio.

We  also  want you to  enjoy  completing the  project.  This  is  one of the most open-ended projects  in  the  Penn  computer  science  curriculum.  You  can  build  just  about  any  game, application,  or  website you want,  as  long  as  it’s  built  on top  of a SQL  database and has multiple  distinct  pages.  For  example,  past  students  have  built trading  card games,  recipe recommendation platforms, and many other interesting apps. So we encourage you to have fun with it and be creative. Check out the resources section for more inspiration.

In this  document, you  will find  deliverables you’ ll  produce while completing the  project, each project milestone described in detail, the grading criteria and several sources of extra credit, and several resources that might benefit you, including dataset sources, descriptions of past students’ projects, and links to useful tools.

You will also be assigned a TA as a project mentor, who will be your primary point-of-contact on staff for questions about your project. You may also post your questions relating to this project on Ed Discussion. Goodluck and have fun!

2. Deliverables

The table below provides a brief description of each deliverable you’ ll need to submit to complete the project. Please see the milestone section for more detailed instructions.

Check Canvas module Project Information and Tutorials for past project examples and video overviews of the project goals and expectations.

*Milestone deliverables due dates are tentative and subject to change*

Deliverable

Description

Milestone

Project Group

 Form Groups of 4 on Canvas: People > Project Groups

Milestone 0

(Oct. 1)

Project

Proposal

A 1-2 page document that identifies the datasets you will use, gives a rough idea of what your application may do, and demonstrates you    have performed some basic descriptive analysis on the data

Milestone 1

(Oct. 15)

Project Outline

A 2-3 page document that describes what your application will do, explains its significance, and gives the schema your database will    implement

Milestone 2

(Nov. 5)

Database

Population

A 3NF- or BCNF-normalized MySQL database hosted on AWS RDS that contains all your data

Milestone 3

(Nov. 19)

SQL Queries

A list of SQL queries that can run on your database with short

explanations of what each query is supposed to do. At least two queries must be complex.

Milestone 3

(Nov. 19)

Mentor

Meeting and

API Specs

An informal Zoom meeting with your project mentor where you

demonstrate that you have implemented some basic functionality of your application, backend API should be done and testable.

Milestone 4

(Dec. 3)

Final Report

A 6-10 page document that thoroughly describes the problem you  tried to solve, your application functionality, your database design, your query optimization efforts, etc.

Milestone 5

(Dec. 16)

Application

Code

A zip file that contains your application code, a list of dependencies,   instructions for building the app and any code you wrote to populate the database or wrangle data

Milestone 5

(Dec. 16)

Application

Demo

A 2-4 minute screen-captured video that demonstrates your

application’s main functionalities and includes narration from at least one group member. Upload the video to your UPenn

account-associated Google drive and attach the link to Canvas.

Milestone 5

(Dec. 16)

Final Demo

A video conference call with your project mentor, another TA, and the professor where you demonstrate functionality of your application.

We use this to test edge cases and ensure your video demo wasn’t heavily edited.

(Dec. 14 - Dec. 16)

B. MILESTONES AND FINAL PRESENTATION

1. Milestone 1 - Project Proposal (Oct. 15)

Develop an initial idea.

The initial step is to select your teammates, then brainstorm ideas and search for datasets. Remember your application must have several distinct pages and a SQL database with at least several tables. Your datasets must be:

1.   Large  --  We   recommend  that  you   have  at  least  2  large  datasets  containing  at  least 100,000 rows (post-cleaning) to achieve meaningful query optimizations.

2.   Overlapping    -- Your application will  need to  include queries that require information from both datasets. This means the datasets need to be related and probably need to contain references to the same entities.

Before finalizing your  dataset  choices, you should conduct some basic exploratory analysis (EDA). Poke around, compute summary statistics, and try to get a sense of how clean, large, and complete the data is. These factors will affect how difficult it will be to clean/pre-process the data,  create  and  query  a  database,  and  perform  entity  resolution  later.  We  provide  an introduction to Pandas--a Python library for data analysis-- in this tutorial and demonstrate how to perform EDA with Pandas in this tutorial.

Next, write your project proposal. The project proposal should contain the following:

1.     A list of group members, email addresses, and GitHub usernames

2.     A description of application/website idea

3.      For each dataset you’ve chosen:

a.   A 1-2 sentence description of the dataset

b.   A link to where you found the dataset

c.    If you’re scraping the data, a description of how you will scrape it

d.   If you’re not scraping the data:

i.   relevant size statistics (e.g. For a table, mb/gb, number of rows, and number of attributes. For a graph, mb, number of nodes, and number of edges)

ii.   summary statistics of several attributes (e.g. report mean, standard deviation)

4.      A list of at least 5 queries (in natural language) you could write for your datasets. Some of these should require complex SQL  (aggregations, subqueries, joins, etc.)

Milestone  1  Submission  Instructions:  One  member  should  upload  the  proposal  as  PDF  to Gradescope and add the other three members of the group to the submission as teammates.

Based on your project proposal, we will assign each group a TA who will serve as your project mentor for the  remainder  of  the  semester.  Your  project  mentor will  email your group with feedback  on  your  proposal.  One  person  should  add  every  group  member  and  your  project mentor to a private GitHub repository before the next milestone.

2. Milestone 2 - Project Outline (Nov. 5)

Detailed functionality description and schema design, and initial set up.

In this phase, you will set up your version control environment, explain your project idea in more detail, and assign responsibilities to each group member.

First,  make  sure  to  create  a  private  GitHub  repository  and  share  it  with  your  Project Mentor. During development, you will use the repository to share code and perform version control. Your project mentor will also use the repository during grading to review the quality of your code and ensure you haven’t plagiarized any of it.

Next, write your project outline. The outline should contain the following:

1. Motivation for the idea/description of the problem the application solves 2. List of features you will definitely implement in the application

3. List of features you might implement in the application, given enough time

4. List of pages the application will have and a 1-2 sentence description of each page. We expect that the functionality of each page will be meaningfully different from the functionality of the other pages.

5. Relational schema as an ER diagram

6. SQL DDL for creating the database

7. Explanation   of    how   you    will    clean   and    pre-process   the    data.   This   tutorial demonstrates how to do simple pre-processing in Python.

8. List  of  technologies  you  will  use.  You  must  use  some  kind  of  SQL  database.  We recommend using MySQL or Oracle specifically because you will use MySQL in HW2, and we will provide guidance for setting up a MySQL database. Some groups in the past have had issues with MySQL, but Oracle is another option.

9. Description of what each group member will be responsible for

Milestone 2 Submission Instructions: One group member should upload the project outline to Gradescope as a PDF and add all other group members to the submission as teammates.

IMPORTANT NOTE ABOUT AWS: For the next milestone,  you will populate and host your database on AWS. We recommend that you use AWS Academy over AWS Free Tier to avoid students getting  charged for overages. Separate instructions on getting started with AWS Academy  will  be  made  available  on  Canvas  and  through  Ed  Discussion.  However,  if  you choose to host your database using your AWS Free Tier account, you will be responsible for overages.

3. Milestone 3 - Database Population/SQL Queries (Nov. 19)

Pre-process   your   data,   perform   entity   resolution,   populate   the database, and write queries.

Now that you have a schema, clean your data, perform entity resolution as needed (make sure your ER diagram is connected), create a database and populate it with your data.  You may find the following resources useful for performing these steps (check Canvas):

1.   Processing   Tutorial    -    Interactive   Google    Cola b    notebook   that   demonstrates pre-processing and entity resolution techniques in Python

2.   RDS setup instructions and data upload - from past exercises and homeworks

Next, create a .txt file containing the following:

1.   Queries – A list of 10 SQL queries. At least 4 of these must be complex ones. Complex queries  draw  on  multiple  tables  to  find  interesting  insights  about  your  data.  Our expectation on complex queries is explained below.

2.   Descriptions -- A 1-2 sentence description of what each query is supposed to do

3.   Credentials -- Instructions and guest credentials for accessing the database

Clarification about complex queries:

A complex query uses a combination of elements such as but not limited to : multiple joins, subqueries or views, aggregations, universal/existential checks. One or two aggregations or a simple JOIN between two tables wouldn’t suffice. A tip on designing a complex query is that you  should  start  with  a  complex  idea  to  implement;  if  you  try  to overcomplicate a simple idea just to meet this requirement, you’ ll struggle and most importantly, we’ ll notice. At   least  2  of  your  complex  queries  should   have  a   non-trivial   runtime   (>15s  before optimization, >1s after optimization). Check out this doc to get a sense of what a complex query should look like. Query 1 would be on the border of complex and trivial, but Queries 2, 3, and 4 meet our expectations.

Milestone 3 Submission Instructions: One group member should upload the .txt file of SQL queries to Gradescope and add all other group members to the submission as teammates.

Make  sure that your  Github is up to date (TAs will be looking at commits and individual contributions), and that you have a README file in your Github repo including a description of your project and what each directory in your repo corresponds to.

Now, begin building your application! Here are a few recommendations for getting started:

1.   Use the  code from  HW2 as a starting point.  Feel free to copy and paste it into your project directory just as long as you mention that you used it, in your final report.

2.   Before  building  anything,  make sure you can retrieve data from your database and display  it  on  a  webpage.  This  requires  setting  up  a  server,  connecting  it  to  your database, and creating one route that   queries your  database then sends the data to the web client. You shouldn’t need to write much code for any of these steps, but if you’re  not  familiar  with  web-development,  putting  all  the  pieces  together  can  be challenging.  Once  you’ve   done  it  once,  it’s  easy  to  replicate  and  modify  for  your different features. Reach out to your project mentor early if you struggle with this step.