CIS 4500/5500: Database & Information Systems
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit
CIS 4500/5500: Database & Information Systems
Course Project Guidelines
A. OVERVIEW
1. Description
The goal of this project is to identify at least two large, overlapping datasets of interest, import and integrate them into a SQL relational database, then create a web-enabled application of your own choosing over the database. You will work in teams of four, and you can use any SQL database technology and any web-development stack.
The objective of this project is to assess whether you’re able to apply the concepts you’re learning about in this course to solve unstructured problems with limited hands-on guidance. Specifically, the project will require you to practice:
● wrangling and cleaning data,
● performing entity resolution (ER),
● designing schema,
● writing SQL queries,
● choosing database indexes,
● designing a web application,
● optimizing SQL queries.
If you use this as an opportunity to build something you’re proud of, you are welcome to use the final product to demonstrate your skills to potential employers. Please note that these repositories should be private, and can still be shared with recruiters via a private link in your portfolio.
We also want you to enjoy completing the project. This is one of the most open-ended projects in the Penn computer science curriculum. You can build just about any game, application, or website you want, as long as it’s built on top of a SQL database and has multiple distinct pages. For example, past students have built trading card games, recipe recommendation platforms, and many other interesting apps. So we encourage you to have fun with it and be creative. Check out the resources section for more inspiration.
In this document, you will find deliverables you’ ll produce while completing the project, each project milestone described in detail, the grading criteria and several sources of extra credit, and several resources that might benefit you, including dataset sources, descriptions of past students’ projects, and links to useful tools.
You will also be assigned a TA as a project mentor, who will be your primary point-of-contact on staff for questions about your project. You may also post your questions relating to this project on Ed Discussion. Goodluck and have fun!
The table below provides a brief description of each deliverable you’ ll need to submit to complete the project. Please see the milestone section for more detailed instructions.
Check Canvas module Project Information and Tutorials for past project examples and video overviews of the project goals and expectations.
*Milestone deliverables due dates are tentative and subject to change*
Deliverable |
Description |
Milestone |
Project Group |
Form Groups of 4 on Canvas: People > Project Groups
|
Milestone 0 (Oct. 1) |
Project Proposal |
A 1-2 page document that identifies the datasets you will use, gives a rough idea of what your application may do, and demonstrates you have performed some basic descriptive analysis on the data |
Milestone 1 (Oct. 15) |
Project Outline |
A 2-3 page document that describes what your application will do, explains its significance, and gives the schema your database will implement |
Milestone 2 (Nov. 5) |
Database Population |
A 3NF- or BCNF-normalized MySQL database hosted on AWS RDS that contains all your data |
Milestone 3 (Nov. 19) |
SQL Queries |
A list of SQL queries that can run on your database with short explanations of what each query is supposed to do. At least two queries must be complex. |
Milestone 3 (Nov. 19) |
Mentor Meeting and API Specs |
An informal Zoom meeting with your project mentor where you demonstrate that you have implemented some basic functionality of your application, backend API should be done and testable. |
Milestone 4 (Dec. 3) |
Final Report |
A 6-10 page document that thoroughly describes the problem you tried to solve, your application functionality, your database design, your query optimization efforts, etc. |
Milestone 5 (Dec. 16) |
Application Code |
A zip file that contains your application code, a list of dependencies, instructions for building the app and any code you wrote to populate the database or wrangle data |
Milestone 5 (Dec. 16) |
Application Demo |
A 2-4 minute screen-captured video that demonstrates your application’s main functionalities and includes narration from at least one group member. Upload the video to your UPenn account-associated Google drive and attach the link to Canvas. |
Milestone 5 (Dec. 16) |
Final Demo |
A video conference call with your project mentor, another TA, and the professor where you demonstrate functionality of your application. We use this to test edge cases and ensure your video demo wasn’t heavily edited. |
(Dec. 14 - Dec. 16) |
B. MILESTONES AND FINAL PRESENTATION
1. Milestone 1 - Project Proposal (Oct. 15)
Develop an initial idea.
The initial step is to select your teammates, then brainstorm ideas and search for datasets. Remember your application must have several distinct pages and a SQL database with at least several tables. Your datasets must be:
1. Large -- We recommend that you have at least 2 large datasets containing at least 100,000 rows (post-cleaning) to achieve meaningful query optimizations.
2. Overlapping -- Your application will need to include queries that require information from both datasets. This means the datasets need to be related and probably need to contain references to the same entities.
Before finalizing your dataset choices, you should conduct some basic exploratory analysis (EDA). Poke around, compute summary statistics, and try to get a sense of how clean, large, and complete the data is. These factors will affect how difficult it will be to clean/pre-process the data, create and query a database, and perform entity resolution later. We provide an introduction to Pandas--a Python library for data analysis-- in this tutorial and demonstrate how to perform EDA with Pandas in this tutorial.
Next, write your project proposal. The project proposal should contain the following:
1. A list of group members, email addresses, and GitHub usernames
2. A description of application/website idea
3. For each dataset you’ve chosen:
a. A 1-2 sentence description of the dataset
b. A link to where you found the dataset
c. If you’re scraping the data, a description of how you will scrape it
d. If you’re not scraping the data:
i. relevant size statistics (e.g. For a table, mb/gb, number of rows, and number of attributes. For a graph, mb, number of nodes, and number of edges)
ii. summary statistics of several attributes (e.g. report mean, standard deviation)
4. A list of at least 5 queries (in natural language) you could write for your datasets. Some of these should require complex SQL (aggregations, subqueries, joins, etc.)
Milestone 1 Submission Instructions: One member should upload the proposal as PDF to Gradescope and add the other three members of the group to the submission as teammates.
Based on your project proposal, we will assign each group a TA who will serve as your project mentor for the remainder of the semester. Your project mentor will email your group with feedback on your proposal. One person should add every group member and your project mentor to a private GitHub repository before the next milestone.
2. Milestone 2 - Project Outline (Nov. 5)
Detailed functionality description and schema design, and initial set up.
In this phase, you will set up your version control environment, explain your project idea in more detail, and assign responsibilities to each group member.
First, make sure to create a private GitHub repository and share it with your Project Mentor. During development, you will use the repository to share code and perform version control. Your project mentor will also use the repository during grading to review the quality of your code and ensure you haven’t plagiarized any of it.
Next, write your project outline. The outline should contain the following:
1. Motivation for the idea/description of the problem the application solves 2. List of features you will definitely implement in the application
3. List of features you might implement in the application, given enough time
4. List of pages the application will have and a 1-2 sentence description of each page. We expect that the functionality of each page will be meaningfully different from the functionality of the other pages.
5. Relational schema as an ER diagram
6. SQL DDL for creating the database
7. Explanation of how you will clean and pre-process the data. This tutorial demonstrates how to do simple pre-processing in Python.
8. List of technologies you will use. You must use some kind of SQL database. We recommend using MySQL or Oracle specifically because you will use MySQL in HW2, and we will provide guidance for setting up a MySQL database. Some groups in the past have had issues with MySQL, but Oracle is another option.
9. Description of what each group member will be responsible for
Milestone 2 Submission Instructions: One group member should upload the project outline to Gradescope as a PDF and add all other group members to the submission as teammates.
IMPORTANT NOTE ABOUT AWS: For the next milestone, you will populate and host your database on AWS. We recommend that you use AWS Academy over AWS Free Tier to avoid students getting charged for overages. Separate instructions on getting started with AWS Academy will be made available on Canvas and through Ed Discussion. However, if you choose to host your database using your AWS Free Tier account, you will be responsible for overages.
3. Milestone 3 - Database Population/SQL Queries (Nov. 19)
Pre-process your data, perform entity resolution, populate the database, and write queries.
Now that you have a schema, clean your data, perform entity resolution as needed (make sure your ER diagram is connected), create a database and populate it with your data. You may find the following resources useful for performing these steps (check Canvas):
1. Processing Tutorial - Interactive Google Cola b notebook that demonstrates pre-processing and entity resolution techniques in Python
2. RDS setup instructions and data upload - from past exercises and homeworks
Next, create a .txt file containing the following:
1. Queries – A list of 10 SQL queries. At least 4 of these must be complex ones. Complex queries draw on multiple tables to find interesting insights about your data. Our expectation on complex queries is explained below.
2. Descriptions -- A 1-2 sentence description of what each query is supposed to do
3. Credentials -- Instructions and guest credentials for accessing the database
Clarification about complex queries:
A complex query uses a combination of elements such as but not limited to : multiple joins, subqueries or views, aggregations, universal/existential checks. One or two aggregations or a simple JOIN between two tables wouldn’t suffice. A tip on designing a complex query is that you should start with a complex idea to implement; if you try to overcomplicate a simple idea just to meet this requirement, you’ ll struggle and most importantly, we’ ll notice. At least 2 of your complex queries should have a non-trivial runtime (>15s before optimization, >1s after optimization). Check out this doc to get a sense of what a complex query should look like. Query 1 would be on the border of complex and trivial, but Queries 2, 3, and 4 meet our expectations.
Milestone 3 Submission Instructions: One group member should upload the .txt file of SQL queries to Gradescope and add all other group members to the submission as teammates.
Make sure that your Github is up to date (TAs will be looking at commits and individual contributions), and that you have a README file in your Github repo including a description of your project and what each directory in your repo corresponds to.
Now, begin building your application! Here are a few recommendations for getting started:
1. Use the code from HW2 as a starting point. Feel free to copy and paste it into your project directory just as long as you mention that you used it, in your final report.
2. Before building anything, make sure you can retrieve data from your database and display it on a webpage. This requires setting up a server, connecting it to your database, and creating one route that queries your database then sends the data to the web client. You shouldn’t need to write much code for any of these steps, but if you’re not familiar with web-development, putting all the pieces together can be challenging. Once you’ve done it once, it’s easy to replicate and modify for your different features. Reach out to your project mentor early if you struggle with this step.
2023-12-16