Introduction to Data Science
Coursework
Deadline: Wednesday November 30th
Submission: eBART
This assessment is worth 60% of the overall mark. This is an individual assessment. You are reminded of the University's regulations on plagiarism.
Brief
In this coursework you are asked to explore a large dataset of tweets collected from the Twitter API. Most of the tasks below should be possible to accomplish using the skills you have learned in this course. The assignment tests your ability to apply your skills and knowledge of data science to an unseen dataset. You may also need to do some self-guided learning to work out how to complete some of the tasks – this is an important part of data science in the real world!
There are several data analysis and critical reflection tasks to complete. Your submission should be in the form of a single PDF containing code, results, figures and text where appropriate. Do not simply copy and paste your script/notebook into a document. Include only important code snippets to show your working; the majority of the document should be text and images. Please separate and label the different tasks clearly to aid assessment.
NOTES
1. The aim of this coursework is to give you experience of analysing a realistic (but still manageable) “BIG” data set. You may have to do some self-guided learning on data management strategies.
2. Expensive hardware is not necessary to complete this coursework: I can run the whole analysis in a couple of hours on a laptop from 2013. If your code takes significantly longer than this, there is likely a better approach.
3. Use all of the data. If you present an analysis of only a subset you will lose a lot of marks.
4. Include discussion and working for every question, even if not explicitly asked!
Dataset
The dataset you will use consists of tweets collected from the Twitter API during the period June 1st to June 30th 2022. These tweets were collected by applying a geographical filter to return only tweets in Europe. The bounding box used is specified by (longitude, latitude) coordinates. The lower-left corner is at (-24.5, 34.8) and the upper-right is at (69.1, 81.9). Assume this box defines “Europe”. No keyword or other thematic filters were applied, so the dataset should contain all tweets that Twitter can identify as originating from the specified region, irrespective of their topic/content. The data is available for download as a number of compressed files, each covering 1 hour of data. The whole dataset contains millions of tweets.
Each file contains a tweet on every line. Tweets are stored as JSON objects, as described in the Twitter developer documentation. In particular, as these are located tweets, pay attention to the difference between “place” tags and “coordinates” tags.
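The difference between the two tags can be sketched as below. This is a minimal illustration, assuming the Twitter API v1.1 tweet layout in which "coordinates" is a GeoJSON point in (longitude, latitude) order and "place" carries a "bounding_box" whose first ring lists the corner points. The names EUROPE, in_europe and tweet_location are illustrative and not part of any library; check the layout against the actual files before relying on it.

```python
# Bounding box from the brief: lon_min, lat_min, lon_max, lat_max.
EUROPE = (-24.5, 34.8, 69.1, 81.9)


def in_europe(lon, lat, box=EUROPE):
    """Return True if the point lies inside the bounding box (edges included)."""
    lon_min, lat_min, lon_max, lat_max = box
    return lon_min <= lon <= lon_max and lat_min <= lat <= lat_max


def tweet_location(tweet):
    """Classify a tweet dict as ('point', (lon, lat)), ('place', corners) or (None, None).

    Assumed layout (v1.1): tweet['coordinates']['coordinates'] == [lon, lat] and
    tweet['place']['bounding_box']['coordinates'][0] is a list of [lon, lat] corners.
    """
    coords = tweet.get("coordinates")
    if coords and coords.get("coordinates"):
        lon, lat = coords["coordinates"]
        return "point", (lon, lat)
    place = tweet.get("place")
    if place and place.get("bounding_box"):
        corners = place["bounding_box"]["coordinates"][0]
        return "place", [tuple(corner) for corner in corners]
    return None, None
```

A tweet can carry both tags, in which case this sketch prefers the exact point; whether that is the right choice depends on the task.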
Files can be downloaded at the link below (you will need to login via your University account): TwitterJune2022
WARNING: With careful management it is possible to do this analysis without a huge hard drive. The optimal strategy is not to simply decompress all the files.
HINT: Look at the json and zipfile modules in Python for processing the tweets.
HINT: One of the challenges with this coursework is handling large data files. You may wish to think about your strategy. The JSON files contain a lot of redundant information. A good data management strategy will require less storage and make the files quicker to process.
HINT: Check correctness, don’t assume it. Look at the JSON objects carefully. Check things like timestamps and mandatory fields for consistency.
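One possible reading of the hints above is to stream each archive line by line and keep only the fields needed later, rather than extracting everything to disk. The sketch below assumes each download is a zip archive whose members are text files with one JSON object per line, and that tweets follow the v1.1 field names ("id_str", "created_at", "text", "user"). The function name iter_tweets and the choice of fields in KEEP are illustrative assumptions, not a prescribed method.

```python
import io
import json
import zipfile

# Fields kept from each tweet; this selection is an assumption, adjust it to the tasks.
KEEP = ("id_str", "created_at", "text", "coordinates", "place")


def iter_tweets(zip_path):
    """Yield slimmed-down tweet dicts from one archive without extracting it to disk."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            with archive.open(name) as raw:
                for line in io.TextIOWrapper(raw, encoding="utf-8"):
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        tweet = json.loads(line)
                    except json.JSONDecodeError:
                        continue  # malformed line; a real analysis should count these
                    if "id_str" not in tweet:
                        continue  # not a tweet object (assumed mandatory field missing)
                    slim = {key: tweet.get(key) for key in KEEP}
                    slim["user_id"] = (tweet.get("user") or {}).get("id_str")
                    yield slim
```

Because the function is a generator, only one line is held in memory at a time. The slimmed records could then be written to a much smaller intermediate file for the later tasks.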
Tasks
Part 1. Basic Stats (20 marks)
1. Count the total number of tweets, describing how you deal with duplicates or other anomalies in the data set. [5 marks]
2. Plot a time-series of the number of tweets by day using the whole dataset and comment on what you see. [5 marks]
3. Use box and whisker diagrams to compare the average number of tweets on weekdays to the numbers for weekend days. Are there statistically significant differences between the number of tweets on weekdays and weekends? [5 marks]
4. Plot a time-series of the number of tweets by hour, averaged over all weekdays and comment on what you see. [5 marks]
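As a rough illustration of the counting steps in Part 1, the sketch below removes repeated tweet ids, parses timestamps and groups the totals by day. It assumes "created_at" strings in the v1.1 style (for example "Wed Jun 01 00:00:05 +0000 2022") and treats a repeated "id_str" as a duplicate; both are assumptions to verify against the data, and the function names are illustrative.

```python
from collections import Counter
from datetime import datetime, timezone

# Assumed v1.1 timestamp format, e.g. 'Wed Jun 01 00:00:05 +0000 2022'.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def daily_counts(tweets):
    """Count unique tweets per calendar day (UTC), skipping repeated ids."""
    seen = set()
    counts = Counter()
    for tweet in tweets:
        tweet_id = tweet["id_str"]
        if tweet_id in seen:
            continue  # duplicate of a tweet already counted
        seen.add(tweet_id)
        when = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
        counts[when.astimezone(timezone.utc).date()] += 1
    return counts


def split_weekday_weekend(counts):
    """Split daily totals into two lists: Monday-Friday and Saturday-Sunday."""
    weekdays = [n for day, n in counts.items() if day.weekday() < 5]
    weekends = [n for day, n in counts.items() if day.weekday() >= 5]
    return weekdays, weekends
```

The two lists returned by split_weekday_weekend can feed a box plot or a significance test directly. Note that days are taken in UTC here, whereas local time zones differ across the bounding box; that choice deserves a sentence of justification in the write-up.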
Part 2. Users (20 marks)
1. Make a histogram with the number of users on the y-axis and number of tweets they make on the x-axis. Discuss the distribution that you see. All the users in the data set should be included! [5 marks]
2. Find the top-5 users by total number of tweets. Do you think any are automated accounts (aka. bots)? Justify your answer. [5 marks]
3. Find the 5 users who receive the most mentions and comment on this. [5 marks]
4. Choose 4 countries and compute how often they mention each other. This means you should compute 16 numbers e.g. UK mentions UK, UK mentions France, France mentions UK etc. Comment on any patterns you observe. [5 marks]
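For the mention counts in Part 2, one starting point is sketched below. It assumes the v1.1 layout, in which each tweet has an "entities" object whose "user_mentions" list holds one dict per mentioned account with a "screen_name" key. The function name count_mentions is illustrative. Longer tweets may carry their full entity list elsewhere in the object, which this sketch does not check.

```python
from collections import Counter


def count_mentions(tweets):
    """Count how often each account is mentioned, keyed by screen name.

    Assumed layout (v1.1): tweet['entities']['user_mentions'] is a list of
    dicts that each carry a 'screen_name'.
    """
    mentions = Counter()
    for tweet in tweets:
        entities = tweet.get("entities") or {}
        for mention in entities.get("user_mentions") or []:
            mentions[mention["screen_name"]] += 1
    return mentions
```

Calling count_mentions(tweets).most_common(5) then gives the five most-mentioned screen names with their counts. Screen names can change over time, so keying on a stable user id instead may be more robust if the data provides one.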
Part 3. Mapping (30 marks)
1. Draw a map of Europe that displays the use of Twitter across the continent. Use only the GPS-tagged tweets, i.e. those which have a “coordinates” field in the metadata. The exact form of the map is up to you: marks will be given for accuracy, clarity and presentation. [10 marks]
2. Explain any patterns you observe. [5 marks]
3. The rest of the tweets should have a “place” tag. For these tweets, plot the CDF of the bounding box diagonals and comment. [5 marks]
4. Find one additional spatial dataset, produce a map comparing Twitter activity with the other dataset and discuss. Your secondary dataset doesn’t have to cover the entire bounding box e.g. it could be for a single city or nation. [10 marks]
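For the bounding-box diagonals in task 3 above, one way to measure a diagonal is the great-circle distance between the lower-left and upper-right corners, followed by an empirical CDF of the results. The sketch below uses the haversine formula on a spherical Earth of radius 6371 km, which is an approximation, and assumes corners are given as (longitude, latitude) pairs. All function names are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def diagonal_km(corners):
    """Diagonal of a bounding box given as a list of (lon, lat) corners."""
    lons = [corner[0] for corner in corners]
    lats = [corner[1] for corner in corners]
    return haversine_km(min(lons), min(lats), max(lons), max(lats))


def empirical_cdf(values):
    """Return the sorted values and cumulative fractions suitable for a step plot."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]
```

Diagonals span several orders of magnitude (a point of interest versus a whole country), so a logarithmic x-axis is worth considering when plotting the CDF.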
Part 4. Events (20 marks)
1. Identify 3 days with unusually high activity in 3 different countries of your choosing. For example you could choose one day in the UK, one in France and one in Turkey. Describe and justify how you identify ‘unusual’ days. [5 marks]
2. Characterise each of these three days by:
a. Making a word cloud from the tweet text. [5 marks]
b. Any other method. [5 marks]
3. Summarise the events you have detected and validate your discussion with some other source of data e.g. news articles. [5 marks]
Part 5. Reflection (10 marks)
Using social media to study the real world is very common in academia, the media and in industry. Now that you have some experience analysing Twitter data, discuss the use and misuse of Twitter. In particular:
- What are the drawbacks of Twitter data?
- What ethical concerns might there be in using Twitter data in different applications?
Write no more than 500 words in total. Remember, this is an academic writing exercise. You should be citing sources and justifying your opinions with evidence/analysis.
Presentation
Documents which are extremely hard to navigate, messy or otherwise poorly presented will be penalised by up to 10%.
Marking Guide
This is a general marking guide to how your document will be assessed. All criteria apply to all relevant questions.
Important:
- If 5 marks are available for a question, there may be 3 for the numeric or graphical output and 2 for the discussion. The ratio may be different for different questions.
- Partial marks are available for correct methods, so include working. This does not mean including every line of code - summarise in words and code snippets what you did. Including large chunks of code will be penalised as bad presentation.
Criterion | What is expected for a good mark?
Writing | Your writing should be clear, well-structured and concise.
Structure of the document | The structure should be clear, easy to navigate and with useful headings.
Presentation | Your document should conform to a clear and consistent visual style, be well-spaced, with appropriate font sizes and consistent and complete labelling and captions.
Code | We are interested in seeing relevant short code snippets that add meaning and context to your document. Your code should be well-structured and readable, with consistent naming conventions, good use of object-oriented or functional programming principles where needed and with an appropriate level of commenting to add context or explanation where it is needed.
Graphs and maps | Your graphic outputs should be well labelled and captioned, readable, meaningful and relevant.
Analysis | This is the most important area in which you will be assessed. Your analysis of the data should be thorough and we are looking to be impressed by your background research, verification of conclusions and exploration of the available techniques.
Explanation | Your methods and approaches should be clearly described and justified, and your comments and conclusions should be robust, valid and verified against additional sources where possible.
Submission
Your work must be submitted by 12pm (noon) on the hand-in date shown at the top of this descriptor. Please allow time for the submission process.
You should submit one PDF document containing all of your answers.
2022-11-06