MSCI623 – Big Data Analytics

Course description: This course focuses on methods and algorithms for turning very large
collections of data into actionable insight. Topics may include data profiling, transformation and
cleaning, data mining, data warehousing and cloud computing. Applications will be drawn from
various areas such as smart grid analytics and ubiquitous computing.

Students will complete a research project in which they use Python and machine learning to solve a business problem.

Project Description:
The purpose of the project is to apply supervised machine learning to a real-world problem of your choosing. Your problem must be complex enough to require a machine learning solution. For example, it is not easy to guess which engineering program is right for a prospective student.
If you can solve your problem using data retrieval tools such as SQL or Excel, it is probably not complex enough. For example, suppose you wanted to recommend a restaurant based on location, cuisine and price. If someone tells you they want an inexpensive (say under $20) Italian restaurant in Waterloo, you could simply return all Italian restaurants in your database whose average dinner price is <$20. You don’t need a model to learn which restaurants in Waterloo serve inexpensive Italian food.
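To make the restaurant example concrete, here is a minimal sketch of the lookup as a plain filter. The restaurant rows, the field names and the find_restaurants helper are all hypothetical and made up for illustration; the point is only that no learning step is involved.

```python
# Hypothetical in-memory "database"; every row here is made up for illustration.
restaurants = [
    {"name": "Trattoria A", "city": "Waterloo", "cuisine": "Italian", "avg_dinner_price": 18.0},
    {"name": "Ristorante B", "city": "Waterloo", "cuisine": "Italian", "avg_dinner_price": 35.0},
    {"name": "Noodle House C", "city": "Waterloo", "cuisine": "Chinese", "avg_dinner_price": 15.0},
    {"name": "Osteria D", "city": "Toronto", "cuisine": "Italian", "avg_dinner_price": 17.0},
]


def find_restaurants(rows, city, cuisine, max_price):
    """Return the rows that match city and cuisine and cost less than max_price."""
    return [
        r for r in rows
        if r["city"] == city
        and r["cuisine"] == cuisine
        and r["avg_dinner_price"] < max_price
    ]


# Equivalent in spirit to a SQL WHERE clause: nothing is learned from the data.
matches = find_restaurants(restaurants, "Waterloo", "Italian", 20.0)
print([r["name"] for r in matches])
```

If your whole problem reduces to a filter like this one, it is probably not a good fit for the project.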

Deliverable #1: proposal, due June 21 at 2:30pm, and worth 20% of the project grade. Please include the following sections:
• Problem statement: clearly explain the problem and justify why a prediction model is needed
• Dataset: describe your dataset and include a link if it is publicly available. Include information such as: the number of rows, a few sample rows, column names and datatypes.
• Variables: point out the variable that you will predict. Then point out the explanatory variables (features). Good explanatory variables are independent of each other and correlated with the variable you want to predict.
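
The dataset and variable information requested above could be gathered along these lines. This is only a sketch assuming pandas is available; the toy DataFrame, its column names and the choice of "passed" as the target are hypothetical and stand in for your own data.

```python
import pandas as pd

# Hypothetical toy dataset; replace with your own data, e.g. loaded via pd.read_csv.
df = pd.DataFrame({
    "hours_studied": [2, 4, 6, 8, 10, 12],
    "hours_slept": [8, 7, 7, 6, 6, 5],
    "passed": [0, 0, 0, 1, 1, 1],  # assumed target variable
})

# Dataset section: number of rows, a few sample rows, column names and datatypes.
print("rows:", df.shape[0])
print(df.head(3))
print(df.dtypes)

# Variables section: pairwise Pearson correlations between all numeric columns.
corr = df.corr()

# Correlation of each candidate feature with the target.
print(corr["passed"].drop("passed"))

# Correlation between the two features; a large magnitude suggests redundancy.
print(corr.loc["hours_studied", "hours_slept"])
```

Pearson correlation only captures linear relationships, so treat it as a first check rather than a full test of independence.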

NOTE: To avoid potential plagiarism, please do NOT use data from Kaggle if sample solutions are provided with the dataset.

Marking scheme: 5 points for presentation, organization and clarity of writing, 15 points for problem selection (creativity) and technical content (understanding of what machine learning is and what it can do).

Deliverable #2: Jupyter notebook with code and explanations, due July 26 at 2:30pm, and worth 80% of the project grade. Please include the following sections:
1. Introduction: restate your business problem and motivation.
2. Data exploration: describe the data exploration you did before building your models. Include tables and graphs.
3. Models: you may experiment with multiple classifiers and multiple sets of features.
4. Discussion: comment on the accuracy of your models. Which models work well? Which features are important? Did you learn anything interesting/surprising?
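
As an illustration of the Models and Discussion sections, the sketch below compares two classifiers on held-out data and inspects feature importances. It assumes scikit-learn is installed; the synthetic data from make_classification, the two model choices and the 75/25 split are assumptions for illustration, not requirements of the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset (assumption: binary target, 8 numeric features).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

# Hold out 25% of the rows so accuracy is measured on data the models did not see.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Two example classifiers; any other scikit-learn classifier could be added here.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: test accuracy = {results[name]:.3f}")

# Impurity-based importances from the random forest, one value per feature.
importances = models["random_forest"].feature_importances_
for i, value in enumerate(importances):
    print(f"feature {i}: importance = {value:.3f}")
```

On a real dataset, you would repeat the loop for each feature set you want to compare and report the results in a table or graph.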

Marking scheme: 50 points for technical content; 30 points for presentation, organization and clarity.

Data sources for the project:

I have gathered some dataset sources for you to explore in case you are having a hard time finding a dataset outside of Kaggle. FYI: to find them, I searched for "open data" together with the name of a country, city, university or topic. I encourage you to do the same if you are interested in a demographic that these sources do not cover.

Dataset Search: https://datasetsearch.research.google.com/

City of New York: https://opendata.cityofnewyork.us/data/
City of Toronto: https://open.toronto.ca/
City of Vancouver: https://opendata.vancouver.ca/pages/home/
City of Waterloo: https://data.waterloo.ca/
City of Ottawa: https://open.ottawa.ca/

Government of Canada: https://open.canada.ca/en
Gov. Canada (Healthcare): https://open.canada.ca/data/en/dataset?organization=hc-sc
Gov. of U.S.: https://catalog.data.gov/dataset
Gov. of U.K.: https://data.gov.uk/

Stanford: https://stanfordopendata.org/#/datasets
Harvard: https://www.hodp.org/data

COVID-19: https://gisumd.github.io/COVID-19-API-Documentation/
Open Images: https://storage.googleapis.com/openimages/web/index.html
NLP Datasets: https://odsc.medium.com/20-open-datasets-for-natural-language-processing-538fbfaf8e38