
ASSESSMENT 2

Deadline:

Hand in by midnight 8 May 2022

Evaluation:

Part 1: 7.5% of your final course grade.

Part 2: 7.5% of your final course grade.

Late Submission:

Refer to the course guide.

Work

This assignment is to be done individually.

Purpose:

Implement the entire data science/analytics workflow. Use regression techniques to solve real-world problems. Gain skills in extracting data from the web using APIs and web scraping. Build on the data wrangling, data visualization and introductory data analysis skills gained up to this point, as well as problem formulation and presentation of findings. Gain skills in kNN regression modelling and supervised and unsupervised learning.

Learning outcomes 1 - 5 from the course outline.


Please note that all data manipulation must be written in Python code in the Jupyter Notebook environment. No marks will be awarded for any data wrangling that is completed in Excel.

Please submit a separate notebook for each part of this assignment.

These assignments will take longer than you think, so do not leave starting them until the last minute. You have the tools you need to start now.

As of the week 5 lecture, you will have been introduced to tools that will assist you in completing Part 1.

By week 7 (before semester break) you will be able to complete most of Part 2, except for task 3, which you will be able to complete after the week 8 lecture.


PART 1: DATA ACQUISITION AND REGRESSION

Here you will be integrating data from two sources:

•    The World Happiness Index

and one of:

•    The World Bank API

•    A web-scraped source of your choosing

Your goal is to build Regression models for predicting happiness, following a good process, including:

•    careful selection of explanatory variables (features) through engaging your critical thinking in choosing data sources, exploratory data analysis and optional feature set expansion;

•    good problem formulation;

•    good model experimentation (including explanation of your experimentation approach); and

•    thoughtful model interpretation.

TASK 1: DATA ACQUISITION AND INTEGRATION (25 MARKS)

a) Static Data: Import Table 2.1 of the World Happiness Report data (1 mark)

You can download the WHRData2022.xls” static dataset from the Stream site.   This dataset is from the 2022 World Happiness Report.  You can learn more about this report here:

https://worldhappiness.report/ed/2022/

Data definitions and other variable documentation can be found here:

https://happiness-report.s3.amazonaws.com/2022/Appendix_1_StatiscalAppendix_Ch2.pdf

You should familiarise yourself with the data documentation before proceeding. As a bare minimum, you will need to identify which variable represents 'Happiness'.

Note: if you are unable to meet the challenges laid out in Task 1 b) and c) you will still be able to continue with Tasks 2 and 3 with only the static dataset.

b) Dynamic data (14 marks)

Do ONE of either option 1 or option 2:

OPTION 1:

API Data: Identify, import and wrangle indicators of your choosing from the World Bank API

The World Bank API is briefly introduced in Lecture 5. Your task is to identify and import 5 or more World Bank indicators (features) that you would like to have as options for inclusion in your models for predicting happiness.

Identify: To identify 5 or more appropriate indicators, you will need to explore the World Bank API documentation and figure out for yourself how to find which indicators are available, and then how to identify and request them. Finding your own way through the documentation is a deliberate part of this challenge. Briefly explain your process and why you chose your features. These links will provide you with a start:

https://datahelpdesk.worldbank.org/knowledgebase/articles/889392-api-documentation

https://datahelpdesk.worldbank.org/knowledgebase/articles/898599-api-indicator-queries


Import and wrangle your chosen indicators so that they are in the right shape for integration with the WHR data. In Lecture 5, only one indicator is imported. To import many indicators in a tidy fashion (i.e. without repeating code) will likely involve the use of a loop and/or function, depending on your approach.

Note that by default you may not be returned all the data you require; you may have to set arguments to obtain the full range (keep an eye out for the 'per_page' argument). Also note that you can specify a date range.
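For example, a minimal sketch of this looped import pattern. The indicator codes and column names here are placeholders for illustration only: you must identify and justify your own. The network call itself is shown only in a comment, with a one-row payload in the API's shape standing in for the response:

```python
import pandas as pd

# Placeholder indicator codes mapped to short column names -- choose your own.
INDICATORS = {"NY.GDP.PCAP.CD": "gdp_pc", "SP.DYN.LE00.IN": "life_exp"}

def wb_url(indicator, start=2005, end=2021, per_page=20000):
    """Build a World Bank API query URL; a large per_page avoids pagination."""
    return (f"https://api.worldbank.org/v2/country/all/indicator/{indicator}"
            f"?format=json&date={start}:{end}&per_page={per_page}")

def tidy(records, name):
    """Reshape the API's list-of-dicts payload into country/year/value rows."""
    return pd.DataFrame([{"country": r["country"]["value"],
                          "year": int(r["date"]),
                          name: r["value"]} for r in records])

# The real call would be something like:
#     records = requests.get(wb_url(code)).json()[1]
# run in a loop over INDICATORS. A sample payload stands in here:
sample = [{"country": {"value": "New Zealand"}, "date": "2020", "value": 41441.5}]
gdp = tidy(sample, "gdp_pc")
```

Wrapping the URL construction and reshaping in functions means the loop over your chosen indicators stays to a few lines, which is exactly the tidiness the marking rewards.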

Task 1b) Option 1 marking:

6 marks for identification of features and explanation of why you chose them. We are looking for your curiosity and initiative in exploring the World Bank API and figuring out how to use it to effectively identify appropriate indicators. Zero marks will be awarded if you simply import a subset of the indicators that you have been given codes for in Lecture 5.

8 marks for the import and wrangling of the data: the more elegant and tidy the solution, the higher the marks.

OPTION 2:

Web-scraped data: Source, import, parse and wrangle web data

Source: Go to the internet and find another data source with which to expand your feature set that:

o  can be web-scraped,

o  you think may improve your predictive model, and

o  can be meaningfully integrated with the WHR data and your World Bank data.

In case it is not obvious, you will be looking for data that can be linked on both country name and one or more years of the data you have already acquired.

Import, parse, wrangle: Scrape the data and wrangle it into the shape it needs to be in for integration later.

Explain: Include a brief explanation of your wrangling process at the beginning of wrangling.

Task 1b) Option 2 marking:

3 marks for finding an appropriate and good quality data source and explanation of why you chose it.

8 marks for effective and tidy import/parse/wrangle code

3 marks for briefly explaining your wrangling process before you import your data

c) Integration: By whichever means appropriate, clean labels and integrate the two datasets from a), b) into one dataframe (10 marks)

Inspect and clean labels for integration: To integrate your data without losing rows, you will need to make sure the labels you are joining on are compatible. This may involve some data cleaning/updating using good old-fashioned gruntwork. For instance, the same country can have two different names in two different datasets (e.g. Democratic People's Republic of Korea vs North Korea). Do some data checks pre- and post-integration to ensure you have not lost data. Data loss due to some countries being present in one dataset but genuinely not in another is acceptable.

Include a brief explanation of your process at the beginning.

Integrate your data into one dataframe.
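A minimal sketch of the check, clean and integrate steps, using two-row toy frames (the column names and the single name mismatch are illustrative; your datasets will have many more):

```python
import pandas as pd

# Toy stand-ins for the WHR and World Bank frames.
whr = pd.DataFrame({"country": ["North Korea", "France"],
                    "year": [2020, 2020],
                    "happiness": [3.6, 6.7]})
wb = pd.DataFrame({"country": ["Democratic People's Republic of Korea", "France"],
                   "year": [2020, 2020],
                   "gdp_pc": [640.0, 39030.0]})

# 1. Check label compatibility: which country names appear in only one dataset?
only_whr = set(whr["country"]) - set(wb["country"])
only_wb = set(wb["country"]) - set(whr["country"])
print(only_whr, only_wb)

# 2. Clean: map mismatched names onto a common spelling (the gruntwork step).
wb["country"] = wb["country"].replace(
    {"Democratic People's Republic of Korea": "North Korea"})

# 3. Integrate on country AND year, then check row counts pre/post
#    to confirm nothing was lost accidentally.
merged = whr.merge(wb, on=["country", "year"], how="inner")
assert len(merged) == len(whr)
```

Once the labels are clean, the final integration really is the one `merge` line, as the marking scheme suggests.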

Task 1c) Marking:

6 marks for checking label compatibility for integration (via scripting) and, if required, cleaning/updating those labels

2 marks for briefly explaining your process


2 marks for the final integration (at this point the final integration should be a straightforward line (or few lines) of code.)


TASK 2: DATA CLEANING AND EXPLORATORY DATA ANALYSIS (EDA) (24 MARKS)

a) EDA data quality inspection (8 marks)

Explore: Explore your data with a view to looking for data quality issues. This could involve looking at summary statistics, plots, inspection of nulls and duplicates; whatever you think is appropriate, there is no single correct way of doing this. Clean your data if and as required, and save the cleaned dataset to CSV.

Explain: Include a brief explanation of your process at the beginning.

Task 2a) marking:

6 marks for your code/outputs: Did you produce outputs appropriate for inspecting and addressing data quality issues?

2 marks for briefly explaining your process.

b) EDA: the search for good predictors (16 marks)

Explore: Explore your data with the goal of finding explanatory variables/features that could be good predictors of your target variable (Happiness). This should include:

o  Inspection of correlations between features

o  Pairs plot/scatter matrix

o  Any other visualisation that you deem appropriate
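The first two items above can be sketched as follows, on a tiny stand-in frame (the feature names `gdp_pc` and `life_exp` are placeholders; in the assignment you would run this on your integrated dataset):

```python
import matplotlib
matplotlib.use("Agg")           # non-interactive backend; safe in scripts
import pandas as pd

# Tiny stand-in frame with placeholder feature names.
df = pd.DataFrame({"happiness": [3.0, 5.0, 6.0, 7.0],
                   "gdp_pc":    [1.0, 4.0, 6.0, 9.0],
                   "life_exp":  [50.0, 65.0, 70.0, 80.0]})

# Correlation of every numeric feature with the target, strongest first.
corr = (df.corr(numeric_only=True)["happiness"]
          .drop("happiness")
          .sort_values(ascending=False))
print(corr)

# Pairs plot / scatter matrix of candidate predictors and the target.
axes = pd.plotting.scatter_matrix(df, figsize=(6, 6))
```

The full correlation matrix (`df.corr()`) is also worth printing, since it is where correlations *between* features, and hence potential multicollinearity, show up.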

Explain: Include a brief explanation of your process at the beginning.

Inspect and transform: Inspect your chosen subset of potential explanatory variables more closely with some visualisations and/or summary statistics. Do any of them look like they need transformation to conform to a normal distribution? Transform any variables that need it with an appropriate transformation for normality (e.g. log, square, quarter root etc.). Go back and check correlations as required; searching for predictors will likely be an iterative process.
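A minimal sketch of one such transform-and-recheck iteration, on a hypothetical income-like feature (the values and the name `gdp_pc` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature, as income-like variables often are.
df = pd.DataFrame({"happiness": [3.0, 4.5, 5.5, 6.5, 7.5],
                   "gdp_pc": [500.0, 2000.0, 8000.0, 30000.0, 90000.0]})

print(df["gdp_pc"].skew())                  # strong positive skew: transform
df["log_gdp_pc"] = np.log(df["gdp_pc"])     # log transform toward normality

# Re-check correlations with the target after transforming.
print(df.corr()["happiness"])
```

On data like this, the log-transformed feature typically correlates with the target more strongly than the raw one, which is exactly what the iterative recheck is looking for.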

Discuss: Briefly discuss your findings, e.g. "I have chosen this subset of variables as good candidates for model predictors because …" (warning: do not copy and paste this text into your report; we will deduct marks if you do.) It is also OK to choose variables for reasons other than them being the best possible predictors; perhaps you are curious as to whether a given variable would have any effect in a model.

Note: You are looking for features that are well correlated with the target variable. You are also looking out for features that are highly correlated with each other. Be aware that while models can have predictive power while including highly correlated explanatory features (multicollinearity), the effects of those correlated features will be masked by each other. Where there is multicollinearity, interpretation of specific feature coefficients is uncertain. Bear this in mind later when interpreting your models.

Note: You may find that all your chosen explanatory variables end up coming from the same data source. That is OK.

Task 2b) marking:

❖  12 marks for your code/outputs (explore, inspect and transform): Did you produce outputs appropriate for finding good predictors? Did you transform where appropriate? Is your code elegant and concise?

❖ 4 marks for your words (explain and discuss): Did you explain your process and discuss your findings? Are your words elegant and concise?


**BONUS QUESTION**

Up to 10 marks will be awarded for feature set expansion via the creation of derived variable/s that make a significant and novel contribution to your final model. How you do this is completely up to you and, being a bonus question, no further guidance will be given. Ingenuity and initiative will be rewarded. As this is an extension task, a very high standard is set for achieving maximum marks.

TASK 3: MODELLING (44 MARKS)

Build the best regression model you can, with Happiness as the target variable, within whichever bounds you set yourself in your problem formulation.

Formulate a problem: You know 'Happiness' is your target variable, but what else are you interested in with respect to this problem? Would you like to simply find the model with the most predictive power? Are you interested in understanding how particular features of interest to you affect Happiness? Or perhaps you are interested in finding the most parsimonious model possible, while still retaining predictive power? Another approach is to look at models for a particular group or groups. Perhaps you would like to filter your dataset to include only OECD countries? Or perhaps you would like to build different models for developed, developing and underdeveloped countries? (The World Bank API has this data.) Maybe you have some other ideas? Briefly explain how you will be approaching this regression problem. This will help you to focus your experimentation.

Experiment: Explore different regression models in a way that is appropriate to your problem formulation. Experiment with linear and multiple linear regression as appropriate. Consider a form of the step-wise algorithm. Optionally, look at polynomial regression (this is not expected).

o Do not use joint plots as a substitute for regression modelling. Zero marks will be given to any model experimentation that relies on joint plots.

o Do use a module for modelling, and do not code up your regression model from scratch.

o Do consider 'Year' as a feature to include in your model.

o Do display model statistics.

Note: If you are interested in the predictive power of your model, your best model is likely to include multiple explanatory variables, so don't waste time bulking out the assignment with single-variable models.

Note: when you have more than one explanatory variable in your model, you will not be able to produce the regression plots from Lecture 4, because they are two-dimensional (target vs one explanatory). That is OK. There are other ways to visualise if you want to produce plots; for instance, you could use a visualisation to compare certain model summary statistics (like RMSE, prob(F), RSq) that you have collated into a dataframe from multiple different model outputs.

Write elegant code: Experimenting with many different models will involve repetition of code, so employ loops and functions for model creation and evaluation. Functions and loops = less code = easier-to-read reports and easier, more effective experimentation.
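One way this loop-over-models pattern might look, using `statsmodels` on randomly generated stand-in data (the feature names and coefficients are invented purely so the example runs end to end):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; in the assignment this is your integrated dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({"log_gdp_pc": rng.normal(9, 1, 100),
                   "social_support": rng.normal(0.8, 0.1, 100)})
df["happiness"] = (2 + 0.5 * df["log_gdp_pc"] + 3 * df["social_support"]
                   + rng.normal(0, 0.3, 100))

def fit_and_summarise(formula, data):
    """Fit one OLS model and collect the statistics used for comparison."""
    m = smf.ols(formula, data=data).fit()
    return {"formula": formula,
            "rsq": m.rsquared,
            "prob_f": m.f_pvalue,
            "rmse": np.sqrt(np.mean(m.resid ** 2))}

formulas = ["happiness ~ log_gdp_pc",
            "happiness ~ social_support",
            "happiness ~ log_gdp_pc + social_support"]
results = pd.DataFrame([fit_and_summarise(f, df) for f in formulas])
print(results)
```

Collating the summary statistics into one dataframe, rather than printing full summaries for every model, is what makes side-by-side comparison (and the visualisation suggested above) straightforward.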

Evaluate/interpret: To compare models, model outputs must be interpreted. For instance, the probability of the F statistic tells us whether there is a significant relationship between the response and explanatory variables as expressed by the model. R-squared tells us about the strength of that relationship (and how good our model would be for prediction). Consider the coefficients for your explanatory variables: are they significant and doing heavy lifting in the model, or are they surprisingly superfluous? Can the coefficients be interpreted, or is multicollinearity an issue? You may like to calculate RMSE and interpret that in context.

Present preferred/final model: settle on a preferred or final model for further inspection.

o Residuals: Produce a plot of residuals and fitted values and explain whether it is likely that this model fulfils the necessary assumptions of homoscedasticity (homoscedastic residuals should not fan out) and linearity (the residuals should randomly scatter around the fitted line and not follow a curved shape). You could find code for this online, or you could look up the code in the exercise hints for Lecture 4. For the purposes of this assignment you are not expected to analyse the residuals beyond a visual inspection. We would usually inspect residuals before interpreting any model output; that requirement is waived here to pare down the scope.

o Describe what the coefficients of the model mean, remembering to mention what units they are in (e.g. sealevel = 0.58*temp_celsius: 'for every degree Celsius increase in average global temperature, sea level rises by 58 centimetres').

o Explain how reliable the model was. Was it a good fit and good for prediction? How did the residuals look; do you think they conformed well enough with assumptions? Could you recommend this predictive model to a client?

o Optional - Plot the confidence intervals and prediction bands for that model and describe what they tell you (there are no extra marks for this option)

Note: As we do not delve deeply into statistics in this course, and to keep the assignment scope manageable, we will not be holding your work in this assignment to a high statistical standard (for instance, looking for outliers, high leverage points, inspection of residuals etc). We are more interested in you demonstrating some curiosity, your ability to use the tools provided and showing that you can select good predictive features and evaluate a model.

Task 3 Marking:

4 marks for problem formulation

14 marks for model experimentation

5 marks for elegance of code (use of loops/functions)

10 marks for appropriate interpretations

11 marks for presentation of preferred model:

o  Residuals plot – 4 marks

o  Interpretation of residuals plot – 2 marks

o  Coefficient explanation – 2 marks

o  Discussion of model reliability – 3 marks

TASK 4: PRESENTATION - ‘REPORT-ERIZE’ YOUR WORK (7 MARKS)

Go back through what you have done and turn your Part 1 work into something that looks like a report that you could hand to a client (a technically savvy client, as you still need to include your scripting for marking). Include a brief introduction that describes the modelling problem you formulated and a brief description of the datasets that you use, and a conclusion. Use formatted markdown boxes that include headings. It is OK to include text that clearly delineates the different tasks of the assignment (e.g. 'Task 1b'). In fact, any formatting that makes the task of marking easier would be most appreciated.

Clear out any unnecessary code and outputs that clutter your work. Run your text through a spell-checker extension. See the end of Part 2 for more tips on how to tidy up a report.

HAND-IN:

Zip up all your notebooks, Python files and dataset(s) into a single file. Submit this file via Stream. Make sure that your Jupyter notebook has been run with all outputs visible. Download an HTML version of your notebook (with outputs showing) and include this in your zip file.



PART 2 ASSESSMENT


PART 2: KNN REGRESSION, SUPERVISED AND UNSUPERVISED LEARNING

PROJECT OUTLINE

In this project you will be producing another Jupyter Notebook report. This project requires that you apply techniques taught so far to either build kNN regression models or supervised learning models. You will also build unsupervised learning models. You will be using the dataset you developed in Part 1, which you may optionally expand. If you choose the supervised learning option, you may use a different dataset of your choosing, if you wish.

You do not need to repeat any of the analysis from Part 1. Consider Part 2 to be an extension of the work you did in the previous part.

You may nonetheless find that further data wrangling and analysis is required to pick and use features for modelling in Part 2.  If that is the case, then this should be included and will be considered in the marking.

TASK 1 – IMPORT THE CSV YOU SAVED IN TASK 2A) OF PART 1 (NO MARKS FOR THIS)

TASK 2 – BUILD KNN REGRESSION MODELS OR SUPERVISED LEARNING MODELS (50 MARKS)

OPTION 1 – KNN REGRESSION MODELS

Formulate: Using your Part 1 dataset, creatively formulate a problem that enables you to perform kNN regression for prediction. It is acceptable if this problem is the same problem you explored in your regression analysis in Part 1. Describe this problem in your introduction.
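As a starting point, a minimal kNN regression sketch with scikit-learn on synthetic stand-in data (the two features and their coefficients are invented so the example runs end to end; you would substitute your Part 1 features and target):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: two features driving a continuous target.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = 5 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.2, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# kNN is distance-based, so scale the features before fitting.
scaler = StandardScaler().fit(X_train)
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(scaler.transform(X_train), y_train)

# R-squared on held-out data; experiment with n_neighbors in practice.
score = knn.score(scaler.transform(X_test), y_test)
print(score)
```

Fitting the scaler on the training split only, then applying it to the test split, keeps the held-out evaluation honest; trying a range of `n_neighbors` values is the natural experimentation axis for this model.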