Project Introduction

发布时间：2024-02-24

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

Project Introduction

The project is designed to help you develop and practice your database access and analysis skills using SQL and NoSQL databases, in combination with Python. Working in a team, limited to 4 members, you will first develop a series of business questions based on provided scenarios using value chain analysis. You will then develop database queries and additional analysis to answer the questions in a business context. You will access both SQL and NoSQL databases using specific interface programs and augmented by Python to provide data analysis and visualization.

The project will be implemented as two sprints. Each sprint will cover two different business and data scenarios.

The first sprint will focus on SQL and the second sprint will focus on NoSQL.

Download Jupyter Notebooks for Project [Zip, 187 KB]

Available Databases

The following databases are available to support the four project scenarios.

Scenario 1

The first scenario is based on data stored in an SQLite relational database, accessible using SQL that is coded within a Python program.

This scenario is about a Video Rental Store company. The database is called “sakila.”

This database will be implemented using the SQLite database. The physical database will be stored as a file on your local computer and accessed using SQL commands via Python.

Scenario 2

The second scenario is based on data stored in a MySQL database hosted on the Google Cloud Platform (GCP) This database describes the activities of a Soft Drink Distribution company. The database is relational and accessible by SQL. The physical schema on GCP is called “user.”

The MySQL database will be accessed using SQL executed via the MySQL Workbench application and via Python. Note that the MySQL Workbench application and Python will be hosted on your local computer.

Scenario 3

The third scenario is based on data describing the names, locations and cuisine categories for restaurants in New York City. This data is stored in MongoDB, a document NoSQL database. It is hosted on the Google Cloud Platform (GCP). You will access MongoDB using the Robo-3T application and Python. Both Robo-3T and Python will be hosted on your local computer.

Scenario 4

The fourth scenario is based on a movie dataset showing which actors and directors have been involved with various movie titles. The data is stored in Neo4j, a graph NoSQL database. It is hosted on Neo4j’s sandbox accessed via Neo4j’s website. You will access this database using the Neo4j Browser application accessed via your local web browser.

Project Structure

The project will be executed in two sections, called sprints. Each sprint will be treated as a mini-project. A final report will be required that describes the work of both sprints. An in-class presentation will also be delivered as an executive level summary of your project activities and results.

· Sprint 1 - Focus on SQL data access based on Scenario 1 and 2.

· Sprint 2 - Focus on NoSQL document and graph data access based on Scenario 3 and 4.

Project Approach

Sprint 1

a. Explore each of the two SQL databases. Review the database diagrams and carry out exploratory analysis by understanding what data is available in each database.

b. For each of the two SQL scenarios, use Value Chain analysis to identify THREE key business questions that your group develops. The questions should be analytical in nature and identified by the “value” statement of your value chan. The questions should be answered by data in the related database. This will amount to three queries per scenario and six for each sprint. The questions should be analytical in nature based on creating insights and improvements to each business scenario.

c. For each scenario, develop and test the necessary queries to answer each of the THREE scenario questions and execute them via the appropriate tool.

d. For each scenario, use Python to present the answers to each question in a visual format.

e. Prepare a draft report for this spring summarizing your work and results. The draft report is not to be submitted. This content will be a section of your final report.

Sprint 2

a. Explore each of the two NoSQL databases. Carry out exploratory queries to become familiar with the data in the MongoDB and Neo4j databases.

b. For each of the two NoSQL scenarios use Value Chain analysis to identify THREE key business questions. Similar to the previous sprint, the questions should be identified based on starting with business value and working down through the model to business questions. The questions should be analytical in nature based on creating insights and improved decision making identified in your value chain analysis.

c. For each scenario, develop and test the necessary queries to answer each of the THREE scenario questions and execute them via the appropriate tool.

d. Use Python to present the answers to the MongoDB related questions. You are free to use the Neo4j Browser to present the answers to the Neo4j questions. You can also use Python to show results in an programming environment.

e. Prepare a draft report summarizing your work and results. The draft report is not to be submitted. This content will be a section of your final report.

Final Report

· The final report should include the details describing all four scenarios according to the following content guidelines:
Note each report section described below will contribute equally to the grade your team receives for the project.
The Final Report should include the following sections for each of the four scenarios.

1. Data Exploration

§ Explore the available database and content.

§ Perform simple queries on several different tables or documents or nodes (depending on the type of database) to understand the available data.

§ Provide a high-level narrative about the main categories of data you discovered. Provide the narrative in a paragraph outlining the data subjects you found.

2. Business Objective(s)

§ Based on the data you discovered, identify one or two business objectives that could achieved from querying each database. For example, reduce the number of late video returns or increase the number of repeat customers.

3. Decisions

§ Based on your chosen business objectives, use value chain analysis to identify up to 3 decisions needed to achieve that goal.

4. Business Questions

§ Identify a maximum of 3 business questions per scenario needed to enable those decisions.

5. Database Queries

§ Design and develop 3 database queries that will retrieve the data needed to answer the questions.

6. Analysis

§ After the queries are tested, execute them via Python for scenario 1, 2 and 3. For scenario 4 you can use the Neo4j browser for your visuals. Create some visuals to support the results needed to answer the questions.
Note: Accessing the Neo4j database via Python is optional and not required in this project.

7. Results

§ Show the results of the analysis including tabular and graphical perspectives that answer the original questions.

§ Provide your results including each section shown in the above guidelines in a format of your choice. You are free to use Word, Powerpoint or HTML versions of Jupyter notebooks.
Note: Your report should be designed to be a final deliverable to a client. Any rough work should be in an appendix and not the main report body.

8. Conclusions

§ For each scenario create conclusions from your data access and analysis and make recommendations to the decision maker.

Database Connection Information

1. Scenario 1 Database - SQLite

a. Download the file “sqlite-sakila.db” and ensure it is in the same directory as your Jupyter notebook. This is the database file link. Download Link

b. As an example for accessing this database, download the sample Jupyter notebook, Project_Scenario_1_Sample.ipynb [Zip file, 141 KB] and ensure it is in the same directory as the SQLite database from the previous step.

c. Use the sample notebook to see how to import and use SQLite within a Python context.

2. Scenario 2 Database - MySQL

a. Ensure you have created your Google Cloud Platform (GCP) and been granted access to the MySQL and MongoDB databases instances.

b. Ensure you have installed MySQL Workbench on your local computer.

c. Follow the instructions in the DA_Setup.pdf document to connect to the MySQL database on GCP.

d. Download the sample Jupyter notebook, Project_Scenario_2_Sample.ipynb [Zip file, 2 KB] for an example on connecting to MySQL via Python.

3. Scenario 3 Database - MongoDB

a. Ensure you have created your Google Cloud Platform (GCP) and been granted access to the MySQL and MongoDB databases instances.

b. Ensure you have installed Robo-3T on your local computer.

c. Follow the instructions in the DA_Setup.pdf document to connect to the MongoDB database on GCP.

d. Download the sample Jupyter notebook, Project_Scenario_3_Sample.ipynb [Zip file, 3 KB] for an example on connecting to MongoDB via Python.

4. Scenario 4 Database - Neo4j

a. Access Neo4j’s website [new window] and follow the instructions to “Get Started” and select the Neo4j Sandbox.

b. Follow the instructions to create your Sandbox account.

c. After you are connected, you will have access to several different datasets. Select the “Movie” dataset for this project.

d. Use Neo4j Browser to connect to the Movie dataset in the Sandbox to execute queries and monitor results.

e. Download the sample Jupyter notebook showing how to connect to Neo4j using Python. Download Link