
Introduction to Data Science

Coursework

Deadline: Wednesday November 30th

Submission: eBART

This assessment is worth 60% of the overall mark. This is an individual assessment. You are reminded of the University's regulations on plagiarism.

Brief

In this coursework you are asked to explore a large dataset of tweets collected from the Twitter API. Most of the tasks below should be possible to accomplish using the skills you have learned in this course. The assignment tests your ability to apply your skills and knowledge of data science to an unseen dataset. You may also need to do some self-guided learning to work out how to complete some of the tasks – this is an important part of data science in the real world!

There are several data analysis and critical reflection tasks to complete. Your submission should be a single PDF containing code, results, figures and text where appropriate. Do not simply copy-paste your script/notebook into a document; include only the important code snippets that show your working. The majority of the document should be text and images. Please separate and label the different tasks clearly to aid assessment.

NOTES

1.  The aim of this coursework is to give you experience of analysing a realistic (but still manageable) “BIG” dataset. You may have to do some self-guided learning on data management strategies.

2.  Expensive hardware is not necessary to complete this coursework; I can run the whole analysis in a couple of hours on a laptop from 2013. If you find your code taking significantly longer than this, it’s likely there’s a better approach.

3.  Use all of the data. If you present an analysis of only a subset you will lose a lot of marks.

4.  Include discussion and working for every question, even if not explicitly asked!

Dataset

The dataset you will use consists of tweets collected from the Twitter API during the period June 1st to June 30th 2022. These tweets were collected by applying a geographical filter to return only tweets in Europe. The bounding box used is specified by (longitude, latitude) coordinates. The lower-left corner is at (-24.5, 34.8) and the upper-right is at (69.1, 81.9). Assume this box defines “Europe”. No keyword or other thematic filters were applied, so the dataset should contain all tweets that Twitter can identify as originating from the specified region, irrespective of their topic/content. The data is available for download as a number of compressed files, each covering 1 hour of data. The whole dataset contains millions of tweets.
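The bounding box above can be encoded directly as a sanity check on any coordinates you extract. A minimal sketch (the constant and function names are illustrative, not part of the brief):

```python
# Bounding box from the brief, in (longitude, latitude) order.
LOWER_LEFT = (-24.5, 34.8)
UPPER_RIGHT = (69.1, 81.9)

def in_europe(lon, lat):
    """True if the point lies inside the coursework's 'Europe' box."""
    return (LOWER_LEFT[0] <= lon <= UPPER_RIGHT[0]
            and LOWER_LEFT[1] <= lat <= UPPER_RIGHT[1])
```

Any tweet whose coordinates fall outside this box is worth flagging as an anomaly rather than silently keeping.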

Each file contains one tweet per line. Tweets are stored as JSON objects, as described in the Twitter developer documentation. In particular, as these are located tweets, pay attention to the difference between “place” tags and “coordinates” tags.
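To illustrate the distinction, here is a hedged sketch of pulling a location out of a single tweet object. The field names follow the documented Twitter v1.1 tweet format; `tweet_location` itself is an illustrative helper, not something the brief mandates, and falling back to the centre of the place bounding box is just one possible choice:

```python
def tweet_location(tweet):
    """Return (lon, lat) for a GPS-tagged tweet, or the centre of the
    'place' bounding box as a fallback; None if neither is present."""
    coords = tweet.get("coordinates")
    if coords and coords.get("type") == "Point":
        lon, lat = coords["coordinates"]  # GeoJSON order: (lon, lat)
        return lon, lat
    place = tweet.get("place")
    if place and place.get("bounding_box"):
        ring = place["bounding_box"]["coordinates"][0]
        lons = [p[0] for p in ring]
        lats = [p[1] for p in ring]
        return sum(lons) / len(lons), sum(lats) / len(lats)
    return None
```

Note that GeoJSON stores coordinates as (longitude, latitude), the reverse of the common spoken order.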

Files can be downloaded at the link below (you will need to login via your University account): TwitterJune2022

WARNING: With careful management it is possible to do this analysis without a huge hard drive. The optimal strategy is not to simply decompress all the files.

HINT: Look at the json and zipfile modules in Python for processing the tweets.

HINT: One of the challenges with this coursework is handling large data files. You may wish to think about your strategy. The JSON files contain a lot of redundant information. A good data management strategy will require less storage and make the files quicker to process.

HINT: Check correctness, don’t assume it. Look at the JSON objects carefully. Check things like timestamps and mandatory fields for consistency.

Tasks

Part 1. Basic Stats (20 marks)

1.  Count the total number of tweets, describing how you deal with duplicates or other anomalies in the data set. [5 marks]

2.  Plot a time-series of the number of tweets by day using the whole dataset and comment on what you see. [5 marks]

3.  Use box and whisker diagrams to compare the average number of tweets on weekdays with that on weekend days. Are there statistically significant differences between the number of tweets on weekdays and weekends? [5 marks]

4.  Plot a time-series of the number of tweets by hour, averaged over all weekdays and comment on what you see. [5 marks]

Part 2. Users (20 marks)

1.  Make a histogram with the number of users on the y-axis and the number of tweets they make on the x-axis. Discuss the distribution that you see. All the users in the data set should be included! [5 marks]

2.  Find the top-5 users by total number of tweets. Do you think any are automated accounts (i.e. bots)? Justify your answer. [5 marks]

3.  Find the 5 users who receive the most mentions and comment on this. [5 marks]

4.  Choose 4 countries and compute how often they mention each other. This means you should compute 16 numbers, e.g. UK mentions UK, UK mentions France, France mentions UK, etc. Comment on any patterns you observe. [5 marks]

Part 3. Mapping (30 marks)

1.  Draw a map of Europe that displays the use of Twitter across the continent. Use only the GPS-tagged tweets; these are tweets which have a “coordinates” field in the metadata. The exact form of the map is up to you: marks will be given for accuracy, clarity and presentation. [10 marks]

2.  Explain any patterns you observe.  [5 marks]

3.  The rest of the tweets should have a “place” tag. For these tweets, plot the CDF of the bounding box diagonals and comment. [5 marks]

4.  Find one additional spatial dataset, produce a map comparing Twitter activity with the other dataset, and discuss. Your secondary dataset doesn’t have to cover the entire bounding box, e.g. it could be for a single city or nation. [10 marks]

Part 4. Events (20 marks)

1.  Identify 3 days with unusually high activity in 3 different countries of your choosing. For example, you could choose one day in the UK, one in France and one in Turkey. Describe and justify how you identify ‘unusual’ days. [5 marks]

2.  Characterise each of these three days by

a.  Making a word cloud from the tweet text. [5 marks]

b.  Any other method. [5 marks]

3.  Summarise the events you have detected and validate your discussion with some other source of data e.g. news articles. [5 marks]

Part 5. Reflection (10 marks)

Using social media to study the real world is very common in academia, the media and in industry. Now that you have some experience analysing Twitter data, discuss the use and misuse of Twitter. In particular:

-    What are the drawbacks of Twitter data?

-    What ethical concerns might there be in using Twitter data in different applications?

Write no more than 500 words in total. Remember, this is an academic writing exercise. You should be citing sources and justifying your opinions with evidence/analysis.

Presentation

Documents which are extremely hard to navigate, messy or otherwise poorly presented will be penalised up to 10%.

Marking Guide

This is a general marking guide to how your document will be assessed. All criteria apply to all relevant questions.

Important:

-     If 5 marks are available for a question, there may be 3 for the numeric or graphical output and 2 for the discussion. The ratio may be different for different questions.

-    Partial marks are available for correct methods, so include working. This does not mean including every line of code: summarise in words and code snippets what you did. Including large chunks of code will be penalised as bad presentation.

For each criterion below, the description states what is expected for a good mark.

Writing

Your writing should be clear, well-structured and concise.

Structure of the document

The structure should be clear, easy to navigate and with useful headings.

Presentation

Your document should conform to a clear and consistent visual style, be well-spaced, with appropriate font sizes and consistent and complete labelling and captions.

Code

We are interested in seeing relevant short code snippets that add meaning and context to your document.

Your code should be well-structured and readable, with consistent naming conventions, good use of object-oriented or functional programming principles where needed, and with an appropriate level of commenting to add context or explanation where it is needed.

Graphs and maps

Your graphic outputs should be well labelled and captioned, readable, meaningful and relevant.

Analysis

This is the most important area in which you will be assessed. Your analysis of the data should be thorough, and we are looking to be impressed by your background research, verification of conclusions and exploration of the available techniques.

Explanation

Your methods and approaches should be clearly described and justified, and your comments and conclusions should be robust, valid and verified against additional sources where possible.

Submission

Your work must be submitted by 12pm (noon) on the hand-in date shown at the top of this descriptor. Please allow time for the submission process.

You should submit one PDF document containing all of your answers.