1 Submission and Assessment

This assignment is worth 60% of the marks for CSM0120. You should submit a single zip file containing your code, in as many files as you wish, and your reports (in PDF format). The file naming and directory structure within the zip should be self-explanatory: you may include a parent index file named README.txt explaining the structure of your submission if you wish. The zip file is to be submitted via Blackboard before 1pm on 10th January 2020; any work submitted after the deadline will receive zero marks unless an extension has been agreed before submission.

By submitting via Blackboard, you are implicitly declaring the work to be your own. The body of reports must be in your own words; it is not acceptable to construct your report by copying and pasting chunks of text from the web. It is important to indicate clearly in your own work where you have included the work of others: in Computer Science this could include reuse of designs and code as well as copying or quoting text.

Marking will be anonymous and will be according to the assessment criteria for Development, Appendix AA of the student handbook (https://www.aber.ac.uk/~dcswww/Dept/Teaching/Handbook/AppendixAA.pdf).

Feedback will be returned on or before 31st January 2020.

In case of personal, financial or health problems affecting this coursework, please provide a special circumstances form http://www.aber.ac.uk/~dcswww/Dept/Teaching/AdvisingResits/spec-circ.htm to your year coordinator (for Masters students taking computer science modules this is Edel Sherratt, [email protected]). If you have specific questions relating to the assignment itself, please contact Roger Boyle, [email protected].

2 The Assignment

Download and unpack the zipfile attached to this assignment: there are two independent parts.

2.1 Part 1: Mortality analysis

This part of the work carries 20 marks overall, of which 15% (i.e., 3) are given to the report. 

  • The directory Ddata contains certain data on death rates for 17 different countries. Inspect these data and be sure you understand them.
  • Develop Python code that will read all the data in this directory into a suitable data structure. The code should still function correctly if data from more countries are added to the directory, or if some are deleted (you may assume data files are named as CountryName.txt, and that the internal format is consistent). Note that there may be an end-point effect in some data if the recorded year intervals do not match the pattern xxx0-xxx4, xxx5-xxx9.
  • Develop a dialogue with the user that will plot, in one or more graphs, a subset of the data specified by:

      • Country or countries.
      • A contiguous span of years. This span’s end-point may not align with the intervals seen in the data.
      • A contiguous age range.
      • A gender specification, which may be male, female or both.

    There should be different plot lines for each gender and each country.

  • It will be clear that differences in population size between countries impair easy comparison of the data. Devise and implement a normalisation approach that allows data from different countries to be co-plotted on the same axes, so that such comparisons can be made at a glance.
  • If the user chooses more than two countries, then for the chosen year interval and countries, determine a mean of your normalised measure for the chosen gender(s). Identify the chosen country whose data shows the most variance from this mean, and co-plot that country and the mean response (there will be two such plots if both genders have been selected).
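One possible reading of the last two bullets, offered as a sketch and not as a model answer. It assumes that each country's selected data has already been reduced to a list of death counts over the same sequence of (year interval, age band) cells. Dividing each count by the country's own total turns counts into shares, which removes the population-size effect. A per-capita rate would be an alternative, but it needs population figures that may not be present in the data. The function names and the sample numbers below are hypothetical.

```python
def normalise(counts):
    """Convert raw death counts into shares of the country's own total."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalise a series that sums to zero")
    return [count / total for count in counts]


def mean_profile(profiles):
    """Cell-by-cell mean across all countries' normalised series."""
    series = list(profiles.values())
    return [sum(cell) / len(cell) for cell in zip(*series)]


def most_divergent(profiles):
    """Return (country, mean), where country has the largest mean squared
    deviation from the cross-country mean profile."""
    mean = mean_profile(profiles)

    def spread(name):
        return sum((x - m) ** 2 for x, m in zip(profiles[name], mean)) / len(mean)

    return max(profiles, key=spread), mean


# Made-up counts for three hypothetical countries of very different size.
raw = {
    "A": [10, 20, 30, 40],
    "B": [100, 200, 300, 400],
    "C": [40, 30, 20, 10],
}
profiles = {name: normalise(counts) for name, counts in raw.items()}
country, mean = most_divergent(profiles)  # "C" for these made-up numbers
```

Countries A and B differ tenfold in raw counts but have identical shares, so only C stands out. Whichever measure you choose, justify it in the report.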

2.2 Part 2: Twitter analysis

This part of the work carries 40 marks overall, of which 15% (i.e., 6) are given to the report.

Twitter (www.twitter.com) is an exceptionally popular and influential system. Unsurprisingly, different people have markedly different styles when writing within the constraints that Twitter enforces. It is reasonable to enquire whether these different styles may be characterised in some way, perhaps as a first step in author identification or verification. This work will engage you in the very early stages of such a project in

  • automatically extracting Twitter data for a given user or users
  • starting to clean it
  • suggesting and partially implementing feature extraction from such data

2.2.1 Data extraction

You will write Python code that takes as input a list of Twitter usernames and an integer N, and outputs the most recent N tweets of each user (or fewer, if the user has not produced that number). The output should take the form of one file per user, in which the tweets appear separated by a line of 50 ‘>’ characters.
The dialogue would thus be:

>>> run tweetextractor

TweetExtractor version 3.1, 25/12/19.

Written by Montague Burton

Twitter names: realdonaldtrump, rogerdboyle, jacindaarden
Tweets to retrieve: 100

– and the program would create three files realdonaldtrump.txt, rogerdboyle.txt, jacindaarden.txt, each containing the text of up to 100 tweets, separated by lines of > as specified. If the user does not exist, no file would be created for her/him. 
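The output format can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `write_tweets` is hypothetical, the tweets are assumed to be available already as a list of strings, and the behaviour for a user with zero tweets (left unspecified above) is not addressed.

```python
from pathlib import Path

# A line of 50 '>' characters, as required by the brief.
SEPARATOR = ">" * 50


def write_tweets(username, tweets, directory="."):
    """Write the tweets to <username>.txt, one separator line between
    consecutive tweets, and return the path of the file written."""
    path = Path(directory) / f"{username}.txt"
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(f"\n{SEPARATOR}\n".join(tweets))
        handle.write("\n")
    return path
```

A call such as `write_tweets("example_user", ["first tweet", "second tweet"])` would create example_user.txt in the current directory. Writing with an explicit UTF-8 encoding avoids platform-dependent failures on emoji and accented characters.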

To do this work, you will need to engage with a Twitter/Python API. There are many of these and you may use whichever you like. The recommended route is Tweepy (see https://github.com/tweepy/tweepy): problems encountered with other APIs will probably not receive support.

Assuming you do use Tweepy, you will proceed as follows: 

  • If the machine you are using does not have it installed, install it. This is best done from within Anaconda with the command

pip install tweepy

Alternatively, download Tweepy: https://github.com/tweepy/tweepy.

  • Acquire a Twitter ID, if you do not have one: https://twitter.com/i/flow/signup.
  • Register as a Twitter developer: https://developer.twitter.com/en/apply-for-access.html. This will give you permission to develop programs that communicate with Twitter. The registration process is short but requires you to answer a number of questions about your intended use of the facility. Be cautious and honest in your replies: Twitter will quite rightly take a dim view of frivolous or malicious comments. (Guidance is in a Blackboard announcement.)
  • Create an App from the developer dashboard.
  • Get the keys and tokens associated with your App: these will permit your code to communicate directly with Twitter.
  • Study and run the program T.py in the zipfile associated with this work.

(This list is a little terse: you will find a lot of good guidance online on the use of Twitter and Tweepy, so look for it! Just one example is https://realpython.com/twitter-bot-python-tweepy/.) Using T.py as a base (or otherwise), construct a program that meets the requirements. You will need to:

  • insert your own keys and tokens
  • determine how to acquire tweets from other users
  • filter out tweets that are ‘retweets’ (so not authored by the specified user), and anything else you can identify as not authored by the target user
  • isolate the text of the tweet from meta-information: the attribute display_text_range may help you
  • output files in the given format
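The filtering and text-isolation steps might look like the sketch below. It makes several assumptions that you should check against the Tweepy version you install: that statuses are fetched in extended mode (for example `api.user_timeline(screen_name=name, count=n, tweet_mode="extended")`), that each status then carries a `full_text` attribute, that retweets carry a `retweeted_status` attribute, and that `display_text_range` is a `[start, end]` pair of indices into the text. The function names are hypothetical, and stand-in objects are used so that the sketch runs without Twitter credentials.

```python
from types import SimpleNamespace


def is_retweet(status):
    """True if the status looks like a retweet and not original writing."""
    text = getattr(status, "full_text", "")
    return hasattr(status, "retweeted_status") or text.startswith("RT @")


def visible_text(status):
    """Return the part of full_text inside display_text_range, if present;
    otherwise return full_text unchanged."""
    text = status.full_text
    text_range = getattr(status, "display_text_range", None)
    if text_range:
        start, end = text_range
        return text[start:end]
    return text


def own_tweets(statuses):
    """Keep only non-retweets and reduce each to its visible text."""
    return [visible_text(s) for s in statuses if not is_retweet(s)]


# Stand-in objects that mimic the assumed attributes of a status.
reply = SimpleNamespace(
    full_text="@someone hello there https://t.co/x",
    display_text_range=[9, 20],
)
retweet = SimpleNamespace(
    full_text="RT @other: something",
    retweeted_status=object(),
)
texts = own_tweets([reply, retweet])  # ["hello there"]
```

Quoted tweets and replies are further candidates for "not authored by the target user"; how you treat them is a decision to record in the report.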

2.2.2 Feature extraction

Feature extraction is an early stage of most pattern recognition systems, and will often involve taking an input object (e.g., a face, a tweet, a game of rugby, an episode of Game of Thrones, a Montague Burton art deco structure) and deriving some ‘suitable’ measures from it that will probably be real numbers. Thus the object can be described by a vector of real numbers; these vectors are then passed to one of a number of classification algorithms (such as SVMs, neural nets, ...) for subsequent processing.

There is a mass of literature on this topic for text characterisation and you are not required to comprehend it all or implement it. Nevertheless, look at the samples you have collected and see if you can identify stylistic differences of any kind. Use these to define a minimum of three features that might be derived from the text streams. 

Develop a Python program that will search the current directory for files of the form Name.txt that are output of the preceding section (that is, tweets separated by lines of > characters). The program will then derive your chosen features and output to the screen a sequence of lines of the form

(Name; f1; f2; f3)
– that is, the Twitter ID followed by 3 numbers. There will be one such line for each valid tweet of each user.
[Enthusiasts for Machine Learning might think at this point about what they would do next].
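The overall shape of such a program can be sketched as below. The three features used here (length in characters, mean word length, fraction of letters in upper case) are placeholders chosen only to make the sketch runnable; they are not a recommendation, and you are expected to propose and justify your own. The function names are hypothetical.

```python
from pathlib import Path

# Tweets in each file are separated by a line of 50 '>' characters.
SEPARATOR = ">" * 50


def read_tweets(path):
    """Split a Name.txt file into individual tweets on separator lines."""
    tweets, current = [], []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip() == SEPARATOR:
            tweets.append("\n".join(current))
            current = []
        else:
            current.append(line)
    tweets.append("\n".join(current))
    # Discard empty fragments, e.g. from a trailing separator.
    return [tweet for tweet in tweets if tweet.strip()]


def features(tweet):
    """Three illustrative features: length in characters, mean word
    length, and fraction of letters that are upper case."""
    words = tweet.split()
    letters = [ch for ch in tweet if ch.isalpha()]
    mean_word = sum(len(w) for w in words) / len(words) if words else 0.0
    upper = sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
    return len(tweet), mean_word, upper


def report(directory="."):
    """Build one '(Name; f1; f2; f3)' line per tweet per file."""
    lines = []
    for path in sorted(Path(directory).glob("*.txt")):
        for tweet in read_tweets(path):
            f1, f2, f3 = features(tweet)
            lines.append(f"({path.stem}; {f1}; {f2:.3f}; {f3:.3f})")
    return lines
```

Calling `print("\n".join(report()))` would process the current directory. Note that a plain `*.txt` pattern also matches files such as README.txt, so a real program would need some way of recognising files that are not tweet output.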

2.3 Submission

Submit a zipfile with a sensibly named file structure, observing the following:

  • For each part of this assignment, include files of Python code that will run according to the requirements given herein. Code should output suitable summary progress for user confirmation.
  • Python code should be properly commented throughout.
  • Graphs and diagrams should be fully and properly annotated.
  • Exception handlers should be provided to deal hygienically with any errors that might occur at runtime; in particular, input operations should be ‘bomb-proof’.
  • For each section, include a structured report that outlines what you have done, gives a technical overview, explains what testing you have performed, and gives some suggestions about how this analysis might be taken further.

Any textual document should be either raw text (a txt file) or PDF. You can include diagrams and screenshots if these are helpful. A good report will be clear, concise and well-illustrated. Note that overly long reports will be penalised.
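As an illustration of what ‘bomb-proof’ input might mean in practice, the sketch below keeps asking until it receives a positive whole number, and gives up cleanly after a fixed number of attempts. The function name, its parameters and the limit of three attempts are all hypothetical choices; the `read` parameter exists only so that the function can be exercised without a keyboard.

```python
def ask_positive_int(prompt, read=input, attempts=3):
    """Ask for a positive integer, re-prompting on bad input.

    Raises ValueError if no valid number is supplied within `attempts`
    tries, or if the input stream ends early."""
    for _ in range(attempts):
        try:
            raw = read(prompt)
        except EOFError:
            break  # input stream closed: stop asking
        try:
            value = int(raw.strip())
        except ValueError:
            print(f"'{raw}' is not a whole number; please try again.")
            continue
        if value > 0:
            return value
        print("Please enter a number greater than zero.")
    raise ValueError("no valid number supplied")
```

For example, `ask_positive_int("Tweets to retrieve: ")` would reject replies such as `abc` or `-5` with a message and ask again. The same pattern (validate, explain, retry, then fail cleanly) applies to country names, year spans and age ranges in Part 1.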