
INF6027 Introduction to Data Science (2021-22)

Analysis of the Our World in Data COVID-19 Dataset

1. Introduction

This part of the assessment for INF6027 Introduction to Data Science comprises a piece of individual coursework to assess your ability to analyse data using R/RStudio and to then communicate your findings. Given a specific topic and dataset (see Section 2), you should identify a specific problem or topic you would like to investigate. You will then need to pre-process and analyse the dataset to identify patterns and relationships that address your selected problem/topic. This should involve using techniques learned throughout the practical sessions that will help you to demonstrate your R skills, such as summarising datasets, statistical modelling or data visualisation, to highlight and illustrate particular aspects of the data you want to communicate (e.g., particular patterns or trends).

This coursework aims to follow the stages involved in a ‘typical’ data science process: (i) define the question(s) to address (note, sometimes this does not come at the start of the process, but after initial exploration of the data); (ii) gather data; (iii) transform, clean and structure the data; (iv) explore and analyse the data; and (v) communicate the findings of the data analysis. This often occurs in an iterative manner and is centred on one or more questions you are seeking to address. For example, Figure 1 presents the stages involved in data discovery as an iterative process¹ and you can find more details in Section 3. This is also similar to the data science process we have been using in class from the “Doing Data Science” book (O’Neil & Schutt, 2013).

Fig. 1 Example data discovery process (Jones, 2014: p.2)


You should write a 3,000-word structured report (see Section 4) that describes the approach you have taken to explore and analyse the data for the selected problem/topic. Your report should clearly communicate the results of your data analysis and be written in a way that helps the reader interpret your findings. Note: charts, tables, and appendices are not included in the word count.

This assessment is worth 100% of the overall module mark for INF6027. A pass mark of 50 is required to pass the module as a whole. Submission deadline: 10am Monday 17th January 2022 via Turnitin. See Section 5 for more general information about Coursework Submission Requirements within the Information School.


2. Our World in Data COVID-19 Dataset

There has been a lot of recent interest in analysing publicly available datasets to identify patterns and gain insights into the impact of COVID-19; see, for example, the Coronavirus Resource Center by Johns Hopkins University.² COVID-19 data is also increasingly used in the media to highlight aspects of the pandemic and related activities (see, e.g., https://www.bbc.co.uk/news/uk-51768274). The dataset to be used in this assessment is the Our World in Data COVID-19 dataset, a collection of public global COVID-19 data (this is an example of Open Data). A description of the data is available here: https://ourworldindata.org/coronavirus. The data is provided as CSV files and can be downloaded from GitHub.³


The dataset is a collection of COVID-19 data and includes the following data: vaccinations, tests & positivity, hospital & ICU, confirmed cases & deaths, reproduction rates, policy responses, and other variables of interest.

You can select any data from the Our World in Data COVID-19 dataset. (This may require multiple downloads.) You can also combine the dataset with other open data sources if you want (e.g., census data), which would demonstrate your ability to join datasets. You do not have to do this to pass the coursework, as the emphasis is on how you carry out your analysis in R/RStudio and communicate your findings on the Our World in Data COVID-19 Dataset.


3. What you need to do

The following sections describe what you need to do in order to carry out the coursework. This roughly follows the steps shown in Fig. 1, but you don’t have to be constrained by this or follow them in this particular order; it is just a suggestion. Also, all the R we have done in the practical sessions (and the final sessions) should be enough to conduct the coursework, although you may need to investigate certain areas further that relate specifically to the problem you tackle in your investigation.


3.1. Review the literature and identify research question(s)

As mentioned previously, you should select a specific problem/topic related to the data (the ‘question’ stage in Fig. 1). To decide what area to focus on, you could start by undertaking a brief review of the relevant literature in areas such as analysis of infection data, geographical analysis of infections, predictive modelling, analysis of vaccination statistics, etc. For example, these articles may be a useful starting point:

Latif, S. et al. (2020). Leveraging Data Science to Combat COVID-19: A Comprehensive Review. IEEE Transactions on Artificial Intelligence, Volume 1, Issue 1, pp. 85-103, IEEE. (Available online: https://doi.org/10.1109/TAI.2020.3020521)

Callaghan, S. (2020). COVID-19 is a Data Science Issue. Patterns, Volume 1, Issue 2, pp. 100022. (Available online: https://dx.doi.org/10.1016%2Fj.patter.2020.100022)


Reviewing past literature will help you understand what kinds of analyses are undertaken using COVID-19 data and provide a possible source of ideas for what you could do with the dataset mentioned in Section 2. Examples of possible topics include, but are not restricted to, the following:

•    Evolution of COVID-19 infections in an area over time;

•    Models and predictions of infection rates;

•    Analysis of infections and vaccinations;

•    Comparisons of the spread of variants in various regions;

•    Clustering and classification of data;

•    Normalisation and integration with other datasets (e.g., LSOA census statistics);

•    Focus on a certain census dimension (e.g., demographics in the area);

•    Visualisation of the data (e.g., on maps).


3.2. Download, pre-process and explore the data

As well as reviewing relevant academic literature, you should also download some data as described in Section 2 and perform an exploratory analysis (i.e. ‘play’ with the data), to better understand the dataset and to help you identify a particular problem or topic you might want to focus on.

This part of your investigation will include steps to pre-process and transform the data, such as cleaning up the data, dealing with missing values, standardising numeric values, etc. This may also include combining or joining the data with further datasets, e.g. census or deprivation data. This reflects the ‘gather’ and ‘structure’ stages in Fig. 1. (Note: this part of the analysis could take a lot of time so don’t underestimate how much time you will need to spend on this part of the coursework.)
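As a rough illustration of the pre-processing steps mentioned above (missing values, joining and standardising), the following is a minimal sketch in base R. It assumes a toy data frame whose column names (location, date, new_cases) mirror columns in the Our World in Data CSV; all values, and the second `population` data frame standing in for census-style data, are invented for illustration. Dropping rows with missing values is only one option; whether to drop or impute depends on your question.

```r
# Toy data frame mimicking a few columns of the OWID CSV (values are invented).
covid <- data.frame(
  location  = c("A", "A", "A", "B", "B", "B"),
  date      = as.Date(c("2021-01-01", "2021-01-02", "2021-01-03",
                        "2021-01-01", "2021-01-02", "2021-01-03")),
  new_cases = c(100, NA, 140, 20, 30, NA)
)

# Hypothetical second dataset to join with (e.g. census-style population counts).
population <- data.frame(
  location   = c("A", "B"),
  population = c(1e6, 2e5)
)

# 1. Count missing values in each column.
missing_counts <- colSums(is.na(covid))

# 2. Handle missing values: here rows with a missing new_cases are dropped.
covid_clean <- covid[!is.na(covid$new_cases), ]

# 3. Join the two datasets on their shared key column.
covid_joined <- merge(covid_clean, population, by = "location")

# 4. Normalise to cases per 100,000 people so locations are comparable.
covid_joined$cases_per_100k <-
  covid_joined$new_cases / covid_joined$population * 1e5

# 5. Standardise the normalised column to z-scores (mean 0, sd 1).
covid_joined$cases_z <- as.numeric(scale(covid_joined$cases_per_100k))

print(covid_joined)
```

With real data you would replace the toy data frames with the result of `read.csv()` on the downloaded files.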


3.3. Analyse and explore the data

Once you have identified a topic of interest, you should identify the most appropriate techniques (using R and associated packages) for carrying out your analysis and exploring the data, e.g. you might want to predict infection rates using regression or compare recovery rates using statistical tests. This might also be an iterative process whereby you perform some analysis and then gather (or remove) more data. Where possible, relate your analysis to the relevant literature. This relates to the ‘exploring data’ stage in Fig. 1.
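The two techniques named above as examples (regression and a statistical test) can be sketched in base R as follows. This is only an illustration under simplifying assumptions: the numbers are invented, a straight-line model is used even though real case counts rarely grow linearly, and the two "groups" are hypothetical. It is not a recommended analysis for any particular question.

```r
# Invented daily case counts for one hypothetical location.
cases <- data.frame(
  day       = 1:10,
  new_cases = c(12, 15, 19, 22, 24, 30, 33, 35, 41, 44)
)

# Linear regression: model new_cases as a linear function of day.
fit <- lm(new_cases ~ day, data = cases)
print(summary(fit))

# Use the fitted model to predict the next three days.
future    <- data.frame(day = 11:13)
predicted <- predict(fit, newdata = future)
print(predicted)

# Invented rates for two hypothetical groups of locations.
rate_group_a <- c(0.91, 0.88, 0.93, 0.90, 0.89)
rate_group_b <- c(0.84, 0.86, 0.82, 0.85, 0.83)

# Welch two-sample t-test (the default for t.test): do the group means differ?
test_result <- t.test(rate_group_a, rate_group_b)
print(test_result$p.value)
```

In a real analysis you would also check the model assumptions (for example by inspecting residuals) before interpreting the coefficients or the p-value.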



Note that this is often an iterative process: as you explore the data you may end up re-designing your research questions, having to gather more data or having to perform further cleaning as more data quality issues arise. Again, this is all a part of the data discovery process.


3.4. Write up your findings

Once you have performed analysis on the data and have some results, you need to write up your investigation into a report (this is the ‘communicate’ stage of Fig. 1). The report should be structured as outlined in Section 4. You will be evaluated on your ability to plan and undertake data analysis and exploration of the pandemic based on the named dataset, your ability to engage with the relevant literature, your use of R (and appropriate packages) and RStudio to process and analyse the data, and the way in which you communicate your findings within the report for your given problem/topic.

You should also provide your R code as an appendix. Marks will be awarded for the clarity and consistency of your R code and the way in which you comment it (see, e.g., http://stat405.had.co.nz/r-style.html). The specific style you use is not as important as commenting your code well enough that someone else can follow what you have done, and being consistent in whichever style you adopt.
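As one possible illustration of commenting (not a prescribed template), the sketch below shows a small, documented helper function. The function name `cases_per_100k` and its arguments are hypothetical and the example numbers are invented; the point is only that the comments say what the code is for, what it expects and what it returns.

```r
# Convert raw case counts to cases per 100,000 people.
#
# Args:
#   new_cases:  numeric vector of case counts.
#   population: numeric vector (or single value) of population sizes (> 0).
#
# Returns:
#   Numeric vector of cases per 100,000 people.
cases_per_100k <- function(new_cases, population) {
  # Fail early on invalid input rather than silently returning Inf or NaN.
  stopifnot(is.numeric(new_cases), is.numeric(population),
            all(population > 0))

  new_cases * 1e5 / population
}

# Example call with invented numbers.
rate <- cases_per_100k(new_cases = c(50, 120), population = c(1e5, 6e5))
print(rate)
```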

The minimum requirement to pass is to perform at least one type of data analysis (e.g., clustering, prediction, time-series analysis, etc.) and include at least two visualisations (e.g., charts, maps, etc.) in the report. To obtain a higher mark and more effectively communicate your findings, you may decide to use more than one dataset or present more than one type of data analysis and/or use multiple visualisations. Again, you should also engage as much as possible with the appropriate literature.


4. Report structure

You are required to produce a structured report that includes the sections detailed in Table 1. You must state the word count on the first page of the report. As there is a word count limit (3,000 words) you should aim to make your writing as concise and informative as possible. Also note that your work will be assessed taking into account the word limit; therefore, we are not expecting detailed multiple analyses in the report; rather the emphasis should be on the clarity, accuracy and quality in communicating your findings. Note that words within tables and appendices are not included in the word count.