Big Data Analytics:

Assignment 3 – Your data science project

Overview

In this course, you have been learning how our everyday interactions with technology are creating huge amounts of data capturing human behaviour worldwide. You have learned how this sort of data can help data scientists measure what is going on in the world, and even make better predictions about how people might behave in the future.

In this final assignment, you are asked to pose an interesting question that can be answered using these new datasets and the data science skills you now possess. You then need to acquire the relevant data, process it into a form that you can analyse, carry out the statistical analysis, and produce relevant visualisations to illustrate your results. You also need to write up your results in a clear and engaging style.

The aim of this project is for you to have an opportunity to apply your skills to a question that you are interested in, and at the same time, produce a document that you can use to demonstrate your skills to future employers. Good luck and have fun!

What to submit

Please submit your final write-up as a PDF. This should be uploaded to my.wbs under Individual Assignment (15 CATS).

Please also submit:

•    The R code which you have written to implement your final project.

You should provide clear comments in your code so that it is easy to understand what it does. Save your code in a script. Do not submit your R workspace, your command history or your RStudio project.

•    Any datasets you have used in your analysis.

Please also include a PDF document explaining what data is contained within the dataset files.

Combine the R code and dataset files into one zip file. Upload this zip file under Code and Data Files (15 CATS). If you wish, you can arrange your files in directories before you zip them, to help us understand what is there.

If you expect your zip file to be greater than 50MB, please speak to us about this at least one week in advance of the deadline.

You must upload both your final write-up PDF and your Code and Data Files zip to my.wbs.

We cannot mark your assignment if you do not upload your R code or your data (or indeed, your write-up).

Further guidance

Your question

This assignment builds on the project design you carried out for Assignment 2. As such, your question must involve one of the following kinds of online data: Google Trends data, data on Wikipedia page views or data retrieved from the Flickr API.

Your question must link this online data with another source of data which reflects human behaviour in the offline world: for example, financial data, national statistics, or any other data source you find interesting. Consult Assignment 2 for further guidance on appropriate project questions.

You are strongly recommended to use the question you identified in Assignment 2, which you will have received feedback about, with any changes which have been suggested. This will help you avoid unexpected difficulties in explaining the value of your question, acquiring the data or carrying out a suitable analysis of the data.

Your analysis

To develop and demonstrate the skills you have acquired during this course, you must carry out your analysis in R. We cannot give you credit for a project that is not implemented in R.

Your aim is to carry out your analysis in such a way that third parties could easily replicate the analysis and verify your findings. You should therefore write your code in as clear a style as possible, with comments to help explain what your code does where necessary. You should also provide clear documentation of the data sets which you have used and which you submit: for example, what data is contained in each file and where this data was acquired from.
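
As a hedged illustration of the kind of commented, replicable script described above, the sketch below links a weekly online series to a weekly offline series by date. All data here is simulated, and every name (`online`, `offline`, `search_index`, `units_sold`) is a hypothetical placeholder; in a real project the two data frames would instead be read from the data files you submit.

```r
# Minimal sketch: link an online data series to an offline one by date.
# All values are simulated; all column names are hypothetical placeholders.

set.seed(1)  # fix the random seed so the simulated data is reproducible

# Weekly dates covering 20 weeks
weeks <- seq(as.Date("2020-01-05"), by = "week", length.out = 20)

# Stand-in for online data (e.g. a weekly search-volume index from 0 to 100)
online <- data.frame(
  week = weeks,
  search_index = round(runif(20, min = 0, max = 100))
)

# Stand-in for offline data; it starts two weeks later than the online series
offline <- data.frame(
  week = weeks[3:20],
  units_sold = round(rnorm(18, mean = 500, sd = 50))
)

# Keep only the weeks that are present in both sources
linked <- merge(online, offline, by = "week")

# Basic checks that the merge behaved as intended
stopifnot(nrow(linked) == 18, !anyNA(linked))

head(linked)
```

Checks such as the `stopifnot()` call are one possible way to let a third party confirm that each processing step did what the comments say.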

See further guidance below for what you need to explain in the write-up itself. (For example, you should not copy-paste large chunks of output from R into your write-up. It is also extremely unlikely that you need to mention specific R functions either. Have a look at examples of papers from the reading list that we have written if you are unsure about this.)

Your write-up

Remember that the goal of this assignment is to carry out the analysis required to evaluate an interesting question. As long as your question is well-motivated, do not worry if your results do not turn out as you hoped. Just make sure that, in your write-up, you provide a clear motivation for your question; a clear argument for why one might expect the result you hypothesised to hold; a clear description of the analysis you carried out; and a clear evaluation of your findings, including why you may not have found what you expected. If there was a good reason to suppose you might find evidence for your hypothesis, it is useful to discover no evidence for the hypothesis too.

Your write-up should be no longer than 3,000 words and structured as follows:

•    Title

o Your title should convey the main thrust of your analysis and results, but crucially should also catch the reader’s attention.

o Your title should have a maximum of 15 words (but good titles are normally shorter).

•    Abstract

o In your abstract, you should briefly explain the problem your question is addressing, and the opportunity you have identified to address this problem.

o You should then clearly state what your question therefore is.

o You should give an overview of the analysis you are carrying out to address this question, and you should then explain the results of your analysis.

o Finally, you should describe the conclusions of your analysis. In other words, what do your results mean? What is the takeaway message from your analysis?

o Your abstract should be no longer than 150 words.


•    Introduction

o The main goal of your introduction is to motivate your question and introduce your analysis.

o You should therefore provide enough background to make the value of your analysis clear. Who does the problem you are addressing affect?

o You should cite between 5 and 10 scientific papers that are related to your analysis. (For example, there are many papers on the course reading list that explore the relationship between online data and offline behaviour, or that give a broader background to analyses of human behaviour with big data. We have discussed a number of them in the weekly videos and discussions.)

o You should then clearly explain what your analysis sets out to do. What is your question? What do you expect to find? Why do you expect to find this?

o You may wish to give an initial indication of the results you uncover, but this is a stylistic decision.

o There is no word limit for your introduction, but make sure your writing style is concise.

•     Methods and results

o In the methods and results section, you should very clearly explain what analysis steps you carried out, and what the results were.

o As a guide to the level of detail required, you should include enough information in this section to enable someone else to reproduce your analysis without access to your code or the data you downloaded.

o To achieve this, you should make the source of your data clear, including providing references for websites from which you have downloaded the data where appropriate. You should also clearly describe any calculations you carried out on the raw data you downloaded to reach your final results. You do not need to make reference to the specific R functions that you used to do this, however.

o All statistical tests should be reported appropriately, including at least details of the sample size (or degrees of freedom), the value of the test statistic calculated and, where calculated, the p-value.

o You should also describe any assumptions of the analyses you carried out (e.g., should your data be normally distributed?) and show how you checked that these assumptions hold.

o You should provide at least two figures that visualise your findings. We will give you 20% of your marks for visualisation, as detailed below.

o If appropriate, you can provide up to four figures. (Do not provide more than four figures.) You can also construct figures which contain multiple subfigures. However, only include important figures which help you tell your story. You need to be as concise with your figures as you are with your words.

o Figures should always have appropriately labelled axes, with the units of measurement specified. Legends should be provided to explain different colours or line types used, and font sizes should not be too small. As a guide, ensure that any text in your figures is at least as big as text used in the body of your assignment. Check that this is still the case when you have included the figure in your assignment. Make sure that your figure does not get stretched horizontally or vertically when you add it to your assignment.

o Under each figure, provide a caption which clearly outlines to the reader what data the figure shows, and what patterns the reader should note in the data. Each caption should be no longer than 350 words.

o To capture the attention of busy readers and to help them understand your analysis, you should produce figures and figure captions that convey the basic story of your analysis on their own.

o There is no word limit for your full methods and results section, but make sure your writing style is concise.
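
To make the reporting requirements above more concrete, here is a minimal sketch in R of checking an assumption and then collecting the quantities to report. The data is simulated and the variable names are hypothetical. Choosing between a Pearson and a Spearman correlation using a Shapiro-Wilk test at the 0.05 level is only one common convention, assumed here for illustration, and is not a rule prescribed by this assignment.

```r
# Minimal sketch: check an assumption, run a test, collect values to report.
# Simulated data; variable names are hypothetical placeholders.

set.seed(42)  # make the simulated data reproducible
n <- 30
search_index <- rnorm(n, mean = 50, sd = 10)
units_sold <- 5 * search_index + rnorm(n, mean = 0, sd = 40)

# Assumption check: a Pearson correlation test assumes roughly normal data
sw_x <- shapiro.test(search_index)
sw_y <- shapiro.test(units_sold)
both_normal <- sw_x$p.value > 0.05 && sw_y$p.value > 0.05

# Fall back to a rank-based test if normality looks doubtful
method <- if (both_normal) "pearson" else "spearman"
result <- cor.test(search_index, units_sold, method = method)

# Gather the numbers a reader needs: sample size, estimate,
# test statistic and p-value (plus degrees of freedom for Pearson)
report <- list(
  method = method,
  n = n,
  estimate = unname(result$estimate),
  statistic = unname(result$statistic),
  p_value = result$p.value
)
if (method == "pearson") {
  report$df <- unname(result$parameter)
}

print(report)
```

In the write-up itself these values would be reported in a sentence following the usual statistical conventions, rather than pasted in as raw R output.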




•   Discussion

o The discussion should briefly summarise what you have done, and discuss what your findings mean.

o To make your document as accessible as possible to busy readers, it is a good idea to ensure that your discussion would make sense if the reader had not read the rest of the document.

o You may wish to begin by briefly summarising the motivation for your study once again. What is the problem you are addressing and what is the opportunity you have identified to address it? You can then restate your research question.

o Next, give a brief indication of the nature of your analyses and summarise what your analyses found.

o Indicate which answer to your research question your findings provide support for. Is this what you expected?

o Try to offer a potential explanation for your findings. If you have found the pattern you expected, you may have already hinted towards this explanation in your introduction. If you did not find what you expected, why do you think this is?

o It is not a problem if you are not sure why you found a particular pattern; simply suggest some possible ideas. It is very important that you are careful not to overstate your case. In particular, be aware that most investigations do not “prove” anything on their own, but you may have found new strong or weak support for a given idea.

o Indicate what the implications of your investigation are. For example, have you highlighted a new opportunity to use a certain dataset to measure or forecast a certain type of behaviour? Have you provided evidence of an interesting behavioural pattern? Have you helped explain a previously observed behavioural pattern? Have you provided evidence that a particular line of enquiry may not be worth following further? What might people be able to do once they have read your results that they might not have been able to do before?

o There is no word limit for your discussion, but make sure your writing style is concise.

•   References

o You should provide full references for all papers you have cited.

o Following the guidance given under Introduction, you should have no fewer than 5 and no more than 10 references for scientific papers. All references you give should be cited in your write-up. You should not cite anything you do not provide a reference for. Make sure that you are citing high-quality sources such as journal papers, not webpages.

o You may have further references for data sources from the Methods and Results section. These will not count towards the 5 to 10 reference limits.

o Please use the Harvard style of referencing for this assignment. You can find more guidance here:

https://www2.warwick.ac.uk/services/library/students/referencing/referencing-styles

o Do not rely on a computer program to correctly format your references for you. For example, automatic formatting provided by Microsoft Word is frequently incorrect. Check the references yourself against the Harvard style to ensure that they are correct.

Please do not include any appendices for this assignment. However, in your Code and Data Files submission, do remember to include a PDF explaining what data is contained within the dataset files (see "What to submit" above for further information).

Your assignment should be formatted in 11pt font, with 1.5 lines spacing and with 2.54cm margins. For any queries about whether certain sections of text count towards the word limit (e.g., figure captions, references), please consult the WBS Policy on Word Count and Formatting:

https://my.wbs.ac.uk/-/academic/37360/resources/in/381545,786874/item/786880/


How marks will be allocated

You will receive marks for the following:


•    Quality of question

o This area is worth 20% of your final mark for the module.

o You will be awarded marks for choosing a question which was interesting and feasible to answer.

o You can emphasise how interesting your question is by stating your question clearly and motivating it well in the abstract and introduction. Who would be interested in the answer, and why? You may be able to provide more evidence of the value of your question in the discussion as well.

o Again, if you have provided a good motivation for why your question was worth investigating and why you believed you might find an interesting answer, do not worry if your results do not turn out as you hoped.

o You can emphasise how feasible your question was to answer by completing an appropriate analysis in the methods and results, and crucially, not overstating your findings in the discussion. Your assignment as a whole needs to provide clear evidence that the question you proposed could be answered from the data you identified and the analysis methods you chose, without a leap of faith.

•    Quality of analysis

o This area is worth 20% of your final mark for the module.

o You will be awarded marks for choosing an analysis method appropriate for answering your question; verifying that assumptions made by this analysis method hold (e.g., should your data be normally distributed?); carrying out the analysis correctly; and correctly interpreting the results of the analysis.

o You will also be assessed on whether you have motivated any pre-processing steps well (e.g., you have not left out half of your dataset without explaining why).

o Finally, you will be awarded marks for clearly documenting your code, and providing clear pointers to where the data you analyse can be obtained, in order to support replication of your study.

o You can make it easier for your analysis to be correctly assessed by providing a clear and concise description of your analysis in the methods and results.

•    Quality of visualisation

o This area is worth 20% of your final mark for the module.

o Crucially, you should provide visualisations which tell the story of your analysis in a clear, concise and engaging fashion.

o You will be awarded marks for choosing appropriate visualisations for your data and analysis. Remember, you should only include the visualisations which help tell your story. Do not simply include every possible visualisation you can think of. Make sure you include at least two figures and no more than four.

o You will be awarded marks for providing legible visualisations, and labelling your visualisations well (e.g., all axes are labelled, including units of measurements, legends are provided to explain different colours or line types used, and font sizes are not too small).

o You will be awarded marks for creating an attractive visualisation. The base level of plots generated by the ggplot2 library is good, but it will also allow you to change many different aspects of your visualisation where you feel this is appropriate, from colours, to line thickness, to font used, and more.

o For the purposes of this assignment, please make all changes to your figures by writing code in R, apart from assembly of multi-panel figures which you can do in an external program (e.g., Word). You should not postprocess your figures in Adobe Illustrator or similar programs.


o You will also be awarded marks for good figure captions. Do your figure captions meet the specification detailed in the structure above, describing the data shown in the figure and highlighting the key patterns that readers should note in the data? Do your figures and figure captions together successfully tell the main story of your analysis? (Crucially, do not forget your captions!)
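
As a rough illustration of the labelling points above, the following sketch builds and saves a figure with ggplot2 using simulated data. The variable names, axis labels, units and output file name are all hypothetical, and the 11-point base font is simply assumed to match the body-text size requested in this brief. The sketch assumes the ggplot2 package is installed.

```r
# Minimal sketch: a labelled scatter plot with a legend, saved at a fixed size.
# Simulated data; all names, labels and units are hypothetical placeholders.

library(ggplot2)

set.seed(7)  # make the simulated data reproducible
plot_data <- data.frame(
  search_index = runif(60, min = 0, max = 100),
  year = factor(rep(c("2019", "2020"), each = 30))
)
plot_data$units_sold <- 300 + 2 * plot_data$search_index + rnorm(60, sd = 30)

fig <- ggplot(plot_data,
              aes(x = search_index, y = units_sold, colour = year)) +
  geom_point(size = 2) +
  labs(
    x = "Search volume (index, 0 to 100)",  # axis label with units
    y = "Units sold (per week)",            # axis label with units
    colour = "Year"                         # legend title for the colours
  ) +
  theme_minimal(base_size = 11)             # base font size in points

# Save with explicit width and height so the figure can be placed in the
# write-up at its saved size, without stretching it
out_file <- file.path(tempdir(), "figure1.png")
ggsave(out_file, plot = fig, width = 16, height = 10, units = "cm", dpi = 300)
```

Setting the size at save time, rather than resizing later in a word processor, is one way to keep the aspect ratio and the text size under control.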

•    Quality of written description

o This area is worth 20% of your final mark for the module.

o You should provide a clear, concise and engaging written description of your investigation.

o You will be awarded marks for using the structure described above and covering all the points highlighted in the structure description.

o Within individual sections, you will be awarded marks for structuring your writing well, to make your arguments and descriptions easy to follow.

o You will be awarded marks for the style of your writing. Is it clear, concise, and engaging? Have you kept your sentences short where possible? Have you used correct grammar and appropriate vocabulary? (Simple vocabulary is often easier to understand; do not use complicated words for the sake of it.)

o You will be assessed on whether you have correctly observed conventions for reporting statistical results, including formatting.

o Finally, you will be assessed on whether you have correctly integrated references into your writing, and listed all references correctly at the end of your assignment. This will again include the formatting of your references.

Plagiarism: how to avoid losing marks

Please make sure you observe the WBS plagiarism guidelines to ensure you do not needlessly lose marks. You can see these in full under WBS Plagiarism Policy below.

In particular, it is extremely important that you do not copy text from existing sources or your classmates. For this assignment, you are also strongly recommended to avoid including any quotes – this should not be necessary. Write everything in your own words and provide clear references where you refer to ideas and results you have read about elsewhere.

Similarly, do not copy an analysis design from an existing paper (whether on the course reading list or not) without clearly referencing the paper and explaining how your work adds to what had previously been found.

In summary, make sure that it could not be suggested that you have copied any text, analysis design or other ideas from any work that you have not provided appropriate references to. We want you to get good marks for your work and do not want to have any difficulty in being able to argue that it is clearly your own. If you are at all unsure about this, just ask for guidance before submission.

Getting help

We are here to help you with implementing your final project. You can ask for support in the seminars, in the office hours, or via the course forum throughout the week. Please feel free to approach us with any queries you may have.

Good luck!

We have seen some great work and great questions on this course. We are looking forward to you submitting some excellent data science projects!

WBS Plagiarism Policy

Please ensure that any work submitted by you for assessment has been correctly referenced. WBS expects all students to demonstrate the highest standards of academic integrity at all times and treats all cases of poor academic practice and suspected plagiarism very seriously. You can find information on these matters on my.wbs, in your student handbook and on the University’s library web pages:

https://warwick.ac.uk/services/library/students/referencing

The University’s Regulation 11 (see link below) clarifies that “... ‘cheating’ means an attempt to benefit oneself or another by deceit or fraud. This shall include reproducing one’s own work...” It is important to note that it is not permissible to reuse work which has already been submitted by you for credit either at WBS or at another institution (unless you have been explicitly told that you can do so). This is considered self-plagiarism and could result in significant mark reductions.

Upon submission of assignments, students will be asked to agree to one of the following declarations:

Individual work submissions:

"I declare that this work is entirely my own in accordance with the University's Regulation 11 and the WBS guidelines on plagiarism and collusion. All external references and sources are clearly acknowledged and identified within the contents. No substantial part(s) of the work submitted here has also been submitted by me in other assessments for accredited courses of study, and I acknowledge that if this has been done it may result in me being reported for self-plagiarism and an appropriate reduction in marks may be made when marking this piece of work."

Group work submissions:

"I declare that this work is being submitted on behalf of my group, in accordance with the University's Regulation 11 and the WBS guidelines on plagiarism and collusion. All external references and sources are clearly acknowledged and identified within the contents. No substantial part(s) of the work submitted here has also been submitted in other assessments for accredited courses of study and if this has been done it may result in us being reported for self-plagiarism and an appropriate reduction in marks may be made when marking this piece of work."

By agreeing to these declarations you are acknowledging that you have understood the rules about plagiarism and self-plagiarism and have taken all possible steps to ensure that your work complies with the requirements of WBS and the University.

You should only indicate your agreement with the relevant statement once you have satisfied yourself that you have fully understood its implications. If you are in any doubt, you must consult with the NIE of the relevant module, because once you have indicated your agreement it will not be possible to later claim that you were unaware of these requirements in the event that your work is subsequently found to be problematic in respect to suspected plagiarism or self-plagiarism.

Regulation 11: https://warwick.ac.uk/services/gov/calendar/section2/regulations/academic_integrity/