ASSESSMENT 2
Deadline: Hand in by midnight 8 May 2022
Evaluation: Part 1: 7.5% of your final course grade. Part 2: 7.5% of your final course grade.
Late Submission: Refer to the course guide.
Work: This assignment is to be done individually.
Purpose: Implement the entire data science/analytics workflow. Use regression techniques to solve real-world problems. Gain skills in extracting data from the web using APIs and web scraping. Build on the data wrangling, data visualization and introductory data analysis skills gained up to this point, as well as problem formulation and presentation of findings. Gain skills in kNN regression modelling and supervised and unsupervised learning. Learning outcomes 1 - 5 from the course outline.
Please note that all data manipulation must be written in Python code in the Jupyter Notebook environment. No marks will be awarded for any data wrangling completed in Excel.
Please submit a separate notebook for each part of this assignment.
These assignments will take longer than you think, so …
Do not leave starting these assignments until the last minute. You have the tools you need to start now.
As of the week 5 lecture, you will have been introduced to tools that will assist you in completing Part 1.
By week 7 (before semester break) you will be able to complete most of Part 2, except for task 3, which you will be able to complete after the week 8 lecture.
PART 1: DATA ACQUISITION AND REGRESSION
Here you will be integrating data from two sources:
• The World Happiness Index
and one of:
• The World Bank API
• A web-scraped source of your choosing
Your goal is to build Regression models for predicting happiness, following a good process, including:
• careful selection of explanatory variables (features) through engaging your critical thinking in choosing data sources, exploratory data analysis and optional feature set expansion;
• good problem formulation;
• good model experimentation (including explanation of your experimentation approach), and
• thoughtful model interpretation
TASK 1: DATA ACQUISITION AND INTEGRATION (25 MARKS)
a) Static Data: Import Table 2.1 of the World Happiness Report data (1 mark)
You can download the “WHRData2022.xls” static dataset from the Stream site. This dataset is from the 2022 World Happiness Report. You can learn more about this report here:
https://worldhappiness.report/ed/2022/
Data definitions and other variable documentation can be found here:
https://happiness-report.s3.amazonaws.com/2022/Appendix_1_StatiscalAppendix_Ch2.pdf
You should familiarise yourself with the data documentation before proceeding. As a bare minimum, you will need to identify which variable represents ‘Happiness’.
Note: if you are unable to meet the challenges laid out in Task 1 b) and c) you will still be able to continue with Tasks 2 and 3 with only the static dataset.
b) Dynamic data (14 marks)
Do ONE of either option 1 or option 2:
OPTION 1:
API Data: Identify, import and wrangle indicators of your choosing from the World Bank API
The World Bank API is briefly introduced in Lecture 5. Your task is to identify and import 5 or more World Bank indicators (features) that you would like to have as options for inclusion in your models for predicting happiness.
• Identify: To identify 5 or more appropriate indicators, you will need to explore the World Bank API documentation and figure out for yourself how to find which indicators are available, and then how to identify and request them. Finding your own way through the documentation is a deliberate part of this challenge. Briefly explain your process and why you chose your features. These links will provide you with a start:
https://datahelpdesk.worldbank.org/knowledgebase/articles/889392-api-documentation
https://datahelpdesk.worldbank.org/knowledgebase/articles/898599-api-indicator-queries
• Import and wrangle your chosen indicators so that they are in the right shape for integration with the WHR data. In Lecture 5, only one indicator is imported. To import many indicators in a tidy fashion (i.e. without repeating code) will possibly involve the use of a loop and/or function, depending on your approach.
Note that by default you may not be returned all the data you require - you may have to set arguments to obtain the full range (keep an eye out for the ‘per_page’ argument). Also note that you can specify a date range.
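As a rough sketch of that loop-based approach, the code below requests several indicators with a raised `per_page` and a date range. The indicator codes, column names and helper functions here are illustrative placeholders, not the codes given in Lecture 5 — identify and substitute your own choices.

```python
import requests
import pandas as pd

BASE_URL = "https://api.worldbank.org/v2/country/all/indicator/{code}"

# Hypothetical example codes -- substitute the indicators you identified.
INDICATORS = {
    "NY.GDP.PCAP.CD": "gdp_per_capita",
    "SP.DYN.LE00.IN": "life_expectancy",
}

def records_from_response(payload):
    """Flatten one World Bank JSON response into plain dicts."""
    return [
        {"country": row["country"]["value"],
         "year": int(row["date"]),
         "value": row["value"]}
        for row in payload[1]          # payload[0] is paging metadata
        if row["value"] is not None    # drop missing observations
    ]

def fetch_indicator(code, name, start=2015, end=2022):
    """Request one indicator for all countries across a date range."""
    params = {"format": "json",
              "date": f"{start}:{end}",
              "per_page": 20000}       # default page size is small
    payload = requests.get(BASE_URL.format(code=code), params=params).json()
    return pd.DataFrame(records_from_response(payload)).rename(columns={"value": name})

# One tidy frame per indicator, ready to merge on (country, year):
# frames = [fetch_indicator(code, name) for code, name in INDICATORS.items()]
```

Looping over a dict of codes keeps the import down to one function call per indicator rather than one copied-and-pasted block each.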
Task 1b) Option 1 marking:
❖ 6 marks for identification of features and explanation of why you chose them. We are looking for your curiosity and initiative in exploring the World Bank API and figuring out how to use it to effectively identify appropriate indicators. 0/6 marks will be awarded if you simply import a subset of the indicators that you have been given codes for in Lecture 5.
❖ 8 marks for the import and wrangling of the data – the more elegant and tidy the solution, the higher the marks
OPTION 2:
Web-scraped data: Source, import, parse and wrangle web data
• Source: Go to the internet and find another data source with which to expand your feature set that:
o can be web-scraped,
o you think may improve your predictive model, and
o can be meaningfully integrated with the WHR data and your World Bank data.
In case it is not obvious, you will be looking for data that can be linked on both country name and one or more years of the data you have already acquired.
• Import, parse, wrangle: Scrape the data and wrangle it into the shape it needs to be in in order to integrate it later.
• Explain: Include a brief explanation of your wrangling process at the beginning of wrangling.
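As a hedged sketch of the import/parse/wrangle step: the page, table and column names below are invented for illustration, and `pandas.read_html` requires an HTML parser such as lxml or BeautifulSoup to be installed. In practice you would fetch the live page first (e.g. `html = requests.get(url).text`).

```python
from io import StringIO
import pandas as pd

# A toy page standing in for whatever site you choose to scrape.
html = """
<table>
  <tr><th>Country</th><th>Year</th><th>Internet users (%)</th></tr>
  <tr><td>Norway</td><td>2021</td><td>99</td></tr>
  <tr><td>Chile</td><td>2021</td><td>90</td></tr>
</table>
"""

# Parse every <table> on the page; take the one you want.
scraped = pd.read_html(StringIO(html))[0]

# Wrangle into the shape needed for integration: consistent column
# names and one row per (country, year).
scraped = scraped.rename(columns={"Country": "country",
                                  "Year": "year",
                                  "Internet users (%)": "internet_pct"})
```

The key point is that the scraped frame ends up with the same join keys (country, year) as your other datasets.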
Task 1b) Option 2 marking:
❖ 3 marks for finding an appropriate and good quality data source and explanation of why you chose it.
❖ 8 marks for effective and tidy import/parse/wrangle code
❖ 3 marks for briefly explaining your wrangling process before you import your data
c) Integration: By whichever means appropriate, clean labels and integrate the two datasets from a), b) into one dataframe (10 marks)
• Inspect and clean labels for integration: To integrate your data without losing rows, you will need to make sure the labels you are joining on are compatible. This may involve some data cleaning/updating using good old-fashioned gruntwork. For instance, the same country can have two different names in two different datasets (e.g. Democratic People's Republic of Korea vs North Korea). Do some data checks pre- and post-integration to ensure you have not lost data. Data loss due to some countries being present in one dataset but genuinely not in another is acceptable.
• Include a brief explanation of your process at the beginning.
• Integrate your data into one dataframe.
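The check-clean-integrate sequence above could look roughly like this; the country names and values are toy stand-ins for your WHR and World Bank frames.

```python
import pandas as pd

# Toy frames standing in for the two datasets being integrated.
whr = pd.DataFrame({"country": ["Norway", "South Korea"],
                    "happiness": [7.4, 5.9]})
wb = pd.DataFrame({"country": ["Norway", "Korea, Rep."],
                   "gdp_per_capita": [89_000, 35_000]})

# 1. Check label compatibility: names appearing in only one dataset.
mismatches = set(whr["country"]) ^ set(wb["country"])
print(mismatches)

# 2. Clean: map divergent spellings onto one canonical name.
wb["country"] = wb["country"].replace({"Korea, Rep.": "South Korea"})

# 3. Integrate, then verify no rows were silently dropped.
merged = whr.merge(wb, on="country", how="inner", validate="one_to_one")
assert len(merged) == len(whr)
```

With the labels reconciled up front, the final integration really is a single `merge` line, as the marking note below suggests.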
Task 1c) Marking:
❖ 6 marks for checking label compatibility for integration (via scripting) and, if required, cleaning/updating those labels
❖ 2 marks for briefly explaining your process
❖ 2 marks for the final integration (at this point the final integration should be a straightforward line (or a few lines) of code.)
TASK 2: DATA CLEANING AND EXPLORATORY DATA ANALYSIS (EDA) (24 MARKS)
a) EDA – data quality inspection (8 marks)
• Explore: Explore your data with a view to identifying data quality issues. This could involve looking at summary statistics, plots, inspection of nulls and duplicates – whatever you think is appropriate; there is no single correct way of doing this. Clean your data if and as required and save the cleaned dataset to csv.
• Explain: Include a brief explanation of your process at the beginning.
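One possible shape for this inspect-and-clean pass is sketched below on a toy frame (the countries and values are invented); your own checks should follow whatever issues your data actually shows.

```python
import pandas as pd

# Toy frame containing the kinds of problems to look for.
df = pd.DataFrame({
    "country": ["Norway", "Norway", "Chile", "Fiji"],
    "year": [2021, 2021, 2021, 2021],
    "happiness": [7.4, 7.4, 6.2, None],
})

print(df.describe())            # ranges and obviously impossible values
print(df.isna().sum())          # nulls per column
print(df.duplicated().sum())    # exact duplicate rows

df = df.drop_duplicates()
df = df.dropna(subset=["happiness"])   # or impute, depending on your judgement
df.to_csv("cleaned_dataset.csv", index=False)
```

Saving the cleaned frame to csv here is what lets Part 2 pick up exactly where this task leaves off.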
Task 2a) marking:
❖ 6 marks for your code/outputs: Did you produce outputs appropriate for inspecting and addressing data quality issues?
❖ 2 marks for briefly explaining your process.
b) EDA – the search for good predictors (16 marks)
• Explore: Explore your data with the goal of finding explanatory variables/features that could be good predictors of your target variable (Happiness). This should include:
o Inspection of correlations between features
o Pairs plot/scatter matrix
o Any other visualisation that you deem appropriate
• Explain: Include a brief explanation of your process at the beginning.
• Inspect and transform: Inspect your chosen subset of potential explanatory variables more closely with some visualisations and/or summary statistics. Do any of them look like they need transformation to conform to a normal distribution? Transform any variables that need transformation with an appropriate transformation for normality (e.g. log, square, quarter root etc). Go back and check correlations as required – searching for predictors will likely be an iterative process.
• Discuss: Briefly discuss your findings, e.g. “I have chosen this subset of variables as good candidates for model predictors because …” (warning: do not copy and paste this text into your report, we will deduct marks if you do.) It is also OK to choose variables for reasons other than them being the best possible predictors – perhaps you are curious as to whether a given variable would have any effect in a model.
Note: You are looking for features that are well correlated with the target variable. You are also looking out for features that are highly correlated with each other. Be aware that while models can have predictive power even when they include highly correlated explanatory features (multicollinearity), the effects of those correlated features will be masked by each other. Where there is multicollinearity, interpretation of specific feature coefficients is uncertain. Bear this in mind later when interpreting your models.
Note: You may find that all your chosen explanatory variables end up coming from the same data source. That is OK.
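The correlate-transform-recheck loop described above might look like the sketch below. The data is synthetic (a deliberately right-skewed stand-in for something like GDP per capita), so treat the variable names as placeholders for your own features.

```python
import numpy as np
import pandas as pd

# Synthetic data: a skewed feature whose log is linearly related to the target.
rng = np.random.default_rng(0)
df = pd.DataFrame({"happiness": rng.normal(5.5, 1.0, 200)})
df["gdp_per_capita"] = np.exp(df["happiness"] + rng.normal(0, 0.3, 200))

# Inspect correlations with the target (and, in a fuller frame, between features).
print(df.corr()["happiness"].sort_values())

# Strong positive skew suggests a transformation toward normality.
print(df["gdp_per_capita"].skew())
df["log_gdp"] = np.log(df["gdp_per_capita"])

# Go back and re-check correlations after transforming.
print(df[["happiness", "log_gdp"]].corr())
```

A pairs plot (`seaborn.pairplot(df)`) over the candidate columns complements the correlation matrix by exposing non-linear shapes the numbers hide.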
Task 2b) marking:
❖ 12 marks for your code/outputs (explore, inspect and transform): Did you produce outputs appropriate for finding good predictors? Did you transform where appropriate? Is your code elegant and concise?
❖ 4 marks for your words (explain and discuss): Did you explain your process and discuss your findings? Are your words elegant and concise?
**BONUS QUESTION**
Up to 10 marks will be awarded for feature set expansion via the creation of derived variable/s that make a significant and novel contribution to your final model. How you do this is completely up to you and being a bonus question, no further guidance will be given. Ingenuity and initiative will be rewarded. As this is an extension task, a very high standard is set for achieving maximum marks.
TASK 3: MODELLING (44 MARKS)
Build the best regression model you can, with Happiness as the target variable, within whichever bounds you set yourself in your problem formulation.
• Formulate a problem: You know ‘Happiness’ is your target variable, but what else are you interested in with respect to this problem? Would you like to simply find the model with the most predictive power? Are you interested in understanding how particular features of interest to you affect Happiness? Or perhaps you are interested in finding the most parsimonious model possible, while still retaining predictive power? Another approach is to look at models for a particular group or groups. Perhaps you would like to filter your dataset to include only OECD countries? Or perhaps you would like to build different models for developed, developing and underdeveloped countries? (the World Bank API has this data). Maybe you have some other ideas? Briefly explain how you will be approaching this regression problem. This will help you to focus your experimentation.
• Experiment: Explore different regression models in a way that is appropriate to your problem formulation. Experiment with linear and multiple linear regression as appropriate. Consider a form of the step-wise algorithm. Optionally, look at a polynomial regression (this is not expected).
o Do not use joint plots as a substitute for regression modelling. Zero marks will be given to any model experimentation that relies on joint plots.
o Do use a module for modelling, and do not code up your regression model from scratch.
o Do consider ‘Year’ as a feature to include in your model.
o Do display model statistics
Note: If you are interested in the predictive power of your model, your best model is likely to include multiple explanatory variables so don’t waste time bulking out the assignment with single variable models.
Note: when you have more than one explanatory variable in your model, you will not be able to produce the regression plots from Lecture 4 because they are two dimensional (target vs one explanatory). That is OK. There are other ways to visualise if you want to produce plots, for instance you could use a visualisation to compare certain model summary statistics (like RMSE, prob(F), RSq) that you have collated into a dataframe from multiple different model outputs.
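One way to run that kind of experiment with a function and a loop, collating comparable statistics (R-squared, prob(F), RMSE) into a dataframe, is sketched below using statsmodels and synthetic data; the feature names and formulas are illustrative stand-ins for your own candidates.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a known relationship, standing in for your real frame.
rng = np.random.default_rng(1)
n = 150
df = pd.DataFrame({
    "log_gdp": rng.normal(10, 1, n),
    "social_support": rng.normal(0.8, 0.1, n),
})
df["happiness"] = (2 + 0.4 * df["log_gdp"] + 3 * df["social_support"]
                   + rng.normal(0, 0.3, n))

def evaluate(formula, data):
    """Fit one OLS model and collect comparable summary statistics."""
    model = smf.ols(formula, data=data).fit()
    rmse = np.sqrt(np.mean(model.resid ** 2))
    return {"formula": formula, "r_squared": model.rsquared,
            "prob_f": model.f_pvalue, "rmse": rmse}

formulas = ["happiness ~ log_gdp",
            "happiness ~ social_support",
            "happiness ~ log_gdp + social_support"]
results = pd.DataFrame([evaluate(f, df) for f in formulas])
print(results.sort_values("rmse"))
```

The `results` frame is exactly the kind of collated model-comparison table the note above suggests visualising.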
• Write elegant code: Experimenting with many different models will involve repetition of code, so employ loops and functions for model creation and evaluation. Functions and loops = less code = easier-to-read reports and easier and more effective experimentation.
• Evaluate/interpret: To compare models, model outputs must be interpreted. For instance, the probability of the F statistic tells us whether there is a significant relationship between the response and explanatory variables as expressed by the model. R-Squared tells us about the strength of that relationship (and how good our model would be for prediction). Consider the coefficients for your explanatory variables – are they significant and doing heavy lifting in the model, or are they surprisingly superfluous? Can the coefficients be interpreted or is multicollinearity an issue? You may like to calculate RMSE and interpret that in context.
• Present preferred/final model: settle on a preferred or final model for further inspection.
o Residuals: Produce a plot of residuals and fitted values and explain whether it is likely that this model fulfils the necessary assumptions of homoscedasticity (homoscedastic residuals should not fan out) and linearity (the residuals should randomly scatter around the fitted line and not follow a curved shape). You could find code for this online, or you could look up the code in the exercise hints for Lecture 4. For the purposes of this assignment you are not expected to analyse the residuals beyond a visual inspection. We would usually inspect residuals before interpreting any model output. That requirement is waived here to pare down the scope.
o Describe what the coefficients of the model mean, remembering to mention what units they are in (e.g. sealevel = 0.58*temp_celsius: ‘for every degree Celsius increase in average global temperature, sea level rises by 58 centimetres’).
o Explain how reliable the model was. Was it a good fit and good for prediction? How did the residuals look, do you think they conformed well enough with assumptions? Could you recommend this predictive model to a client?
o Optional - Plot the confidence intervals and prediction bands for that model and describe what they tell you (there are no extra marks for this option)
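A residuals-versus-fitted plot of the kind described above could be sketched as follows. The arrays here are randomly generated stand-ins; in your notebook they would come from your fitted model (e.g. `model.fittedvalues` and `model.resid` from a statsmodels fit).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs headless
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical fitted values and residuals standing in for model output.
rng = np.random.default_rng(2)
fitted = rng.uniform(3, 8, 120)
residuals = rng.normal(0, 0.4, 120)

fig, ax = plt.subplots()
ax.scatter(fitted, residuals, alpha=0.6)
ax.axhline(0, color="red", linewidth=1)
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
ax.set_title("Residuals vs fitted")
# In a notebook, displaying `fig` (or plt.show()) renders the plot.
```

If the points scatter randomly about the red zero line with roughly constant spread, the homoscedasticity and linearity assumptions look plausible; fanning or a curved band suggests they do not hold.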
Note: As we do not delve deeply into statistics in this course, and to keep the assignment scope manageable, we will not be holding your work in this assignment to a high statistical standard (for instance, looking for outliers, high leverage points, inspection of residuals etc). We are more interested in you demonstrating some curiosity, your ability to use the tools provided and showing that you can select good predictive features and evaluate a model.
Task 3 Marking:
❖ 4 marks for problem formulation
❖ 14 marks for model experimentation
❖ 5 marks for elegance of code (use of loops/functions)
❖ 10 marks for appropriate interpretations
❖ 11 marks for presentation of preferred model:
o Residuals plot – 4 marks
o Interpretation of residuals plot – 2 marks
o Coefficient explanation – 2 marks
o Discussion of model reliability – 3 marks
TASK 4: PRESENTATION - ‘REPORT-ERIZE’ YOUR WORK (7 MARKS)
Go back through what you have done and turn your Part 1 work into something that looks like a report you could hand to a client (a technically savvy client, as you still need to include your scripting for marking). Include a brief introduction that describes the modelling problem you formulated and a brief description of the datasets that you use, and a conclusion. Use formatted markdown cells that include headings. It is OK to include text that clearly delineates the different tasks of the assignment (e.g. ‘Task 1b’). In fact, any formatting that makes the task of marking easier would be most appreciated.
Clear out any unnecessary code and outputs that clutter your work. Run your text through a spell checker extension. See the end of Part 2 for more tips on how to tidy up a report.
HAND-IN :
Zip up all your notebooks, Python files and dataset(s) into a single file. Submit this file via Stream. Make sure that your Jupyter Notebook has been run with all outputs visible. Download an HTML version of your notebook (with outputs showing) and include this in your zip file.
PART 2 ASSESSMENT
PART 2: KNN REGRESSION, SUPERVISED AND UNSUPERVISED LEARNING
PROJECT OUTLINE
In this project you will be producing another Jupyter Notebook report. This project requires that you apply the techniques taught so far to build either kNN regression models or supervised learning models. You will also build unsupervised learning models. You will be using the dataset you developed in Part 1, which you may optionally expand. If you choose the supervised learning option, you may use a different dataset of your choosing, if you wish.
You do not need to repeat any of the analysis from Part 1. Consider Part 2 to be an extension of the work you did in Part 1.
You may nonetheless find that further data wrangling and analysis is required to pick and use features for modelling in Part 2. If that is the case, then this should be included and will be considered in the marking.
TASK 1 – IMPORT THE CSV YOU SAVED IN TASK 2A) OF PART 1 (NO MARKS FOR THIS)
TASK 2 – BUILD KNN REGRESSION MODELS OR SUPERVISED LEARNING MODELS (50 MARKS)
OPTION 1 – KNN REGRESSION MODELS
• Formulate: Using your Part 1 dataset, creatively formulate a problem that enables you to perform kNN regression for prediction. It is acceptable if this problem is the same problem you explored in your regression analysis in Part 1. Describe this problem in your introduction.
2022-04-16