
CITS 2401

Computer Analysis and Visualisation

Assignment 3

Tweet Data Processing and Visualisation

1. Outline

In assignment 2, we cleaned a bunch of tweets to prepare them for analysis. However, reading values and inspecting the cleaned data is still not sufficient to understand what the data is trying to tell us, or what hidden information it may contain. It is therefore useful to understand how data can be further processed for analysis and visualisation, which makes it easier to see things like trends, patterns and any other useful information.

In this assignment, we will do some simple data processing and visualisation. In particular, we will structure the cleaned data, reformat it as necessary (i.e., datatypes), and apply a few different visualisation techniques to depict the information contained in the dataset provided. We will also explore a widely used NLP metric and implement it from scratch.

The materials necessary to complete this assignment include lectures up to week 11, Interpolation and Curve Fitting (i.e., visualisation), and their relevant labs.

You have also been provided with skeleton code that you can complete with your answers and test in your IDE. This file can be used as your submission file; make sure you change the filename as per the submission instructions.

Note: This is an individual assignment, so please don't share your solution/code/files with others (only high-level discussion is allowed, e.g., the syntax of a formula, the use of modules with other examples, etc.). If the work is found not to be your original work, you may be penalised.

Note2: This assignment takes time to complete, please start early!

2. Tasks

Task 1 Load metric data

Write a function load_metrics(filename) that, given filename (a string, always a csv file with the same columns as the sample metric data file), extracts the following columns in this order:

1.  created_at

2.  tweet_ID

3.  valence_intensity

4.  anger_intensity

5.  fear_intensity

6.  sadness_intensity

7.  joy_intensity

8.  sentiment_category

9.  emotion_category

The extracted data should be stored in NumPy array format (i.e., a numpy.ndarray). No other post-processing is needed at this point. The resulting output will now be known as data.

Note: when importing, set the delimiter to be ',' (i.e., a comma) and the quotechar to be '"' (i.e., a double quotation mark).
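As a rough guide, one possible (not the only) approach is sketched below using Python's csv module, which accepts both the delimiter and quotechar settings mentioned in the note. The column names are taken from the list above; how you read the file is up to you.

import csv
import numpy as np

# A minimal sketch, assuming the csv module is an acceptable way to read the file.
def load_metrics(filename):
    wanted = ['created_at', 'tweet_ID', 'valence_intensity', 'anger_intensity',
              'fear_intensity', 'sadness_intensity', 'joy_intensity',
              'sentiment_category', 'emotion_category']
    with open(filename, newline='') as f:
        reader = csv.reader(f, delimiter=',', quotechar='"')
        rows = list(reader)
    header = rows[0]
    keep = [header.index(name) for name in wanted]   # positions of the wanted columns
    # Keep the header row for now; Task 2 removes it.
    return np.array([[row[i] for i in keep] for row in rows])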

Task 2 Structuring the data

The NumPy array you created in task 1 is unstructured because we let NumPy decide what the datatype for each value should be. It also contains the header row, which is not needed for the analysis. The data consists mostly of float values, with some descriptive columns like created_at. So we are going to remove the header row, and we are going to explicitly tell NumPy to convert all columns to type float, apart from the columns specified by indexes, which should be Unicode of length 30 characters. Write a function unstructured_to_structured(data, indexes) that achieves the above goal.
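A minimal sketch of one way to do this is shown below. It assumes data still contains the header row from load_metrics and that indexes lists the positions of the columns that should remain strings.

import numpy as np

def unstructured_to_structured(data, indexes):
    header, body = data[0], data[1:]           # drop the header row
    # Unicode of length 30 for the listed columns, float for everything else
    dtype = [(name, 'U30' if i in indexes else 'float64')
             for i, name in enumerate(header)]
    structured = np.zeros(body.shape[0], dtype=dtype)
    for i, name in enumerate(header):
        column = body[:, i]
        structured[name] = column if i in indexes else column.astype(float)
    return structured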

Task 3 Converting timestamps into better format

The created_at column contains the timestamp of the tweet each row refers to, but its current format is not ideal for comparing which one is earlier or later (why?). To change this, we are going to reformat the column (it remains Unicode of length 30 after conversion).

Write a function converting_timestamps(array) that converts the original timestamp format into a new format as follows:

Current format : [day of week] [month] [day value] [hour]:[minute]:[second] [time zone difference] [year]

New format     : [year]-[month value]-[day value] [hour]:[minute]:[second]

For example, a current-format value 'Tue Feb 04 17:04:01 +0000 2020' will be converted to '2020-02-04 17:04:01'.
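For a single value, a minimal sketch of the conversion (assuming the timestamps follow the example format exactly) could use datetime; the full function would apply this to every entry in the created_at column of array.

from datetime import datetime

def convert_timestamp(value):
    # Parse the original format, then re-emit it in the new format.
    parsed = datetime.strptime(value, '%a %b %d %H:%M:%S %z %Y')
    return parsed.strftime('%Y-%m-%d %H:%M:%S')

convert_timestamp('Tue Feb 04 17:04:01 +0000 2020')   # -> '2020-02-04 17:04:01'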

Task 4 Replacing nan values

Sometimes data gets corrupted, and for us it will be indicated by value np.nan (or simply, nan). You are given a code that randomly introduces the nan value into your current dataset for testing. For our analysis, we will convert nan values to be the mean value of the rest of the data. For example, if we have data [1, 2, nan, 4, 5, nan], then those nan values will be replaced by (1 + 2 + 4 + 5)/4 = 3 -> [1, 2, 3, 4, 5, 3].

Write a function replace_nan(data) where you replace all nan values in the columns 'valence_intensity', 'anger_intensity', 'fear_intensity', 'sadness_intensity', and 'joy_intensity' to be the mean value of each column.
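One possible sketch, assuming data is the structured array from the earlier tasks, is shown below; np.nanmean conveniently ignores nan entries when computing the column mean.

import numpy as np

def replace_nan(data):
    for col in ('valence_intensity', 'anger_intensity', 'fear_intensity',
                'sadness_intensity', 'joy_intensity'):
        values = data[col]                  # view into the structured array
        mean = np.nanmean(values)           # mean of the non-nan entries
        values[np.isnan(values)] = mean     # replace nan values in place
    return data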

Task 5 Box plot data

Our data is a series of measurements for each tweet (i.e., scores), and box (or boxes and whiskers) plot is a good choice to visualise and compare them.

Write a function boxplot_data(data, output_name) that achieves the following requirements:

•     We will be making a boxplot for each of the 'valence_intensity', 'anger_intensity', 'fear_intensity', 'sadness_intensity', and 'joy_intensity' from the data, in this particular order.

•     Figure size should be set to (10, 7) .

•     The linestyle is set to '-', linewidth = 1, and color to black (hint: lookup medianprops).

•     patch_artist of the boxplot is set to True.

•     Set the facecolor to be green, red, purple, blue and yellow in this order.

•     Set the title to be 'Distribution of Sentiment'.

•     Add x-axis data labels 'Valence', 'Anger', 'Fear', 'Sadness' and 'Joy' in this order.

•     Set the yaxis grid to True .

•     Set the x-axis label to 'Sentiment'.

•     Set the y-axis label to 'Values' .

Finally, save the graph as output_name with a default value "output.png".

The expected output is shown in Figure 1, generated using the provided dataset.


Figure 1. Output for task 5.
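A minimal sketch attempting to satisfy the requirements above is shown below; the matplotlib keyword names (medianprops, patch_artist, labels) are the standard ones, but how you structure the function is up to you.

import matplotlib.pyplot as plt

def boxplot_data(data, output_name='output.png'):
    cols = ['valence_intensity', 'anger_intensity', 'fear_intensity',
            'sadness_intensity', 'joy_intensity']
    fig, ax = plt.subplots(figsize=(10, 7))
    box = ax.boxplot([data[c] for c in cols],
                     patch_artist=True,
                     medianprops={'linestyle': '-', 'linewidth': 1, 'color': 'black'},
                     labels=['Valence', 'Anger', 'Fear', 'Sadness', 'Joy'])
    # Colour each box in the required order.
    for patch, colour in zip(box['boxes'], ['green', 'red', 'purple', 'blue', 'yellow']):
        patch.set_facecolor(colour)
    ax.set_title('Distribution of Sentiment')
    ax.yaxis.grid(True)
    ax.set_xlabel('Sentiment')
    ax.set_ylabel('Values')
    plt.savefig(output_name)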

Task 6 NumPy to Pandas

Pandas provides various methods that can be used to handle data more efficiently, so we will convert our NumPy data into a Pandas dataframe. Write a function convert_to_df(data) that uses the data's dtype names as column headers and fills the dataframe with their associated data values.

Note: You cannot use the pd.DataFrame() function for this task.
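A minimal sketch is shown below; it assumes the from_records constructor (as opposed to calling pd.DataFrame() directly) is an acceptable route, since it accepts a structured NumPy array and uses the dtype field names as column headers.

import pandas as pd

def convert_to_df(data):
    # from_records reads the field names out of the structured array's dtype.
    return pd.DataFrame.from_records(data)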

Task 7 Loading tweets into Pandas and merging with Metrics dataframe

You probably noticed that in the original data the tweets and usernames are masked. You are provided with another file of format .tsv (tab separated values) that contains the original tweets and usernames, to be combined with the metrics dataframe you produced in task 6. Also note that the tweets have already been cleaned (i.e., what you were doing in Assignment 2!), so you don't have to clean them again. Because both files share a common tweet ID, we will use that information to join them into a single dataframe.

Write two functions:

•     load_tweets(filename): returns the Pandas dataframe of the data from filename, which will always be a .tsv file. You can assume there will always be data in the file provided to this function.

•    merge_dataframes(df_metrics, df_tweets): takes inputs df_metrics (from task 6) and df_tweets (which you will load using load_tweets()) and joins the metrics with the tweets (i.e., tweets.join(metrics)) using the tweet_ID.

The join method is 'inner' (this will come in handy). Because not all rows will match, make sure to drop NA values. Finally, the function returns a single dataframe: the result of joining the two input dataframes.

Note: depending on your approach, this can be processed anywhere between 1 second to 60+ seconds. Although the efficiency is not directly tested, you should think about ways to improve your speed, and implement it if you can.
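One possible sketch is shown below. It assumes the tweet ID column is named tweet_ID in both files; setting it as the index before joining is one way to keep the join fast.

import pandas as pd

def load_tweets(filename):
    # .tsv files are tab separated
    return pd.read_csv(filename, sep='\t')

def merge_dataframes(df_metrics, df_tweets):
    # Index both frames by tweet_ID, inner-join them, and drop any NA rows.
    joined = df_tweets.set_index('tweet_ID').join(
        df_metrics.set_index('tweet_ID'), how='inner')
    return joined.dropna()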

Task 8 Pie Chart of Emotions

We want to know how many tweets fall under each emotion category. A great way to visualise this is with a pie chart. Write a function pie_chart_emotions(df_merged, output_name) that achieves the following requirements:

•    Each segment should represent the proportion of the number of tweets in each category relative to every other category (this is slightly different to the requirement in Assignment 1, in which you only considered tweets from the top percentile of accounts based on followers).

•    Use the default ordering for the emotion categories

•    The 'no specific emotion' category should be renamed to 'neutral'

•    Figure size should be set to (10, 10)

•    Segment colours should be:

o  anger: red, fear: purple, joy: yellow, neutral: green, sadness: blue

•    Each segment should display on it its percentage proportion, rounded to 1 d.p.

•    Chart shadow should be turned on

•    Starting angle should be 0

•    Axis labels should be turned off

•    Plot should have a tight layout

•    Plot title should be 'Emotion Category Breakdown'

•    Finally, save the graph as output_name with a default value "output.png".

The expected output is shown in Figure 2 below:

Figure 2. Output for task 8.
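A minimal sketch attempting the requirements above is given below; it assumes the emotion labels listed in the task and uses the order in which the categories appear in the data as the "default ordering".

import matplotlib.pyplot as plt

def pie_chart_emotions(df_merged, output_name='output.png'):
    counts = df_merged['emotion_category'].value_counts(sort=False)
    counts = counts.rename({'no specific emotion': 'neutral'})
    colours = {'anger': 'red', 'fear': 'purple', 'joy': 'yellow',
               'neutral': 'green', 'sadness': 'blue'}
    plt.figure(figsize=(10, 10))
    plt.pie(counts, labels=counts.index,
            colors=[colours[label] for label in counts.index],
            autopct='%1.1f%%', shadow=True, startangle=0)
    plt.axis('off')               # no axis labels on the pie
    plt.title('Emotion Category Breakdown')
    plt.tight_layout()
    plt.savefig(output_name)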

Task 9 Term Frequency (TF)

Continuing with the NLP theme of the assignments, we will explore a widely used measure called term frequency-inverse document frequency (TF-IDF) over the next few tasks. In short, TF-IDF "provides a numerical statistic that is intended to reflect how important a word is to a document in a corpus". A document is simply a piece of distinct text (e.g. a sentence) and a corpus is a collection of documents. TF-IDF is a combined metric based on two constituents: term frequency (TF) and inverse document frequency (IDF). We will perform the TF calculations in this task.

TF is the measurement of how frequently a term occurs within a document. The formula for TF is:

tf(t, d) = (count of t in d) / (number of words in d)

where t = a particular term (or word) in a document and d = the document itself. For example, in the document d0 "The sky is really blue", each word appears once so each word has a tf score of (1/5). In the document d1 "Hey, is that really that cool?", the tf score for "that" is (2/6) while every other word has a score of (1/6).

To be able to compare the tf scores for all words across multiple documents within a corpus, we can produce a matrix where the columns are every unique word in the corpus and every row is a document. For example, the tf scores for the above examples can be laid out like so:

      Hey,    The    blue   cool?   is     really  sky    that
0     0.0     0.2    0.2    0.0     0.2    0.2     0.2    0.0
1     0.167   0.0    0.0    0.167   0.167  0.167   0.0    0.333

Write a function term_frequency(df_merged, top_n) that takes in df_merged as the merged dataframe from Task 7 and top_n as the number of first rows of this dataframe to use, and outputs a dataframe of the tf scores for every word in the corpus. The resulting scores should be rounded to 4 d.p. The columns must be in sorted alphabetical order.

Tips:

•    We've given you the tokenize_words(df_merged, top_n) function to use, which returns a sorted list of all the unique words found in the corpus.

•    Don't worry about stop-words, punctuation, capitals, etc. like you did in the previous assignment.

•    The corpus in this case is the tweets found in the ['text'] column of df_merged.

•    The documents in the resulting dataframe should be in the same order as they're found in df_merged.

•    Row indexes can just be the index number (0, 1, 2, 3, ...) as opposed to (d0, d1, d2, ...) like in the table above

•    A process for performing this task can be loosely described as follows (not necessarily the only way to do this):

o  create an empty dataframe using the unique tokens as columns and the rows represent each document.

o  for each row fill in this dataframe with the tf score of each word as it appears in each document

NOTE: You cannot import any python packages other than pandas, numpy and math for Tasks 9, 10 & 11.
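As a rough guide, a sketch of term_frequency along the lines of the process described in the tips is shown below. It assumes tokenize_words behaves as described above and that documents are split into words on whitespace.

import pandas as pd

def term_frequency(df_merged, top_n):
    tokens = tokenize_words(df_merged, top_n)       # provided in the skeleton
    docs = df_merged['text'].head(top_n)
    rows = []
    for doc in docs:
        words = doc.split()
        # tf = count of the token in this document / number of words in the document
        rows.append([round(words.count(t) / len(words), 4) for t in tokens])
    return pd.DataFrame(rows, columns=tokens)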

Task 10 Inverse Document Frequency (IDF)

IDF measures how important a term is throughout the corpus. While computing TF, all terms are considered equally important. However, it is known that certain terms, such as "is", "of", and "that", may appear many times but have little importance. TF will tend to incorrectly emphasise documents which happen to use common words more frequently, without giving enough weight to the more meaningful terms. In computing IDF, we weigh down the frequent terms while scaling up the rare ones. The formula for IDF is:

idf(t, D) = log(total number of documents / number of documents with term t in it), where t = a particular term (or word) in a document and D is the corpus. Consider the following examples:

•    d0 : "this is the first document"

•     d1 : "this document is the second document"

•     d2 : "this is the third one"

•     d3 : "is this the first document"
We have a corpus of 4 documents in total. For t = "document", t appears in 3 different documents, so the idf score for "document" is log(4/3) = 0.2877. Similarly, for t = "first", t appears in 2 different documents, so the idf score for "first" is log(4/2) = 0.6931. The idf score for t = "this" is log(4/4) = 0. The resulting idf scores for the above examples can be laid out like so:

      document   first    is     one      second   the    third    this
0     0.2877     0.6931   0.0    0.0      0.0      0.0    0.0      0.0
1     0.2877     0.0      0.0    0.0      1.3863   0.0    0.0      0.0
2     0.0        0.0      0.0    1.3863   0.0      0.0    1.3863   0.0
3     0.2877     0.6931   0.0    0.0      0.0      0.0    0.0      0.0

We can get the max values in each column to compare the IDF scores more easily per term. We can see how terms that appear across many documents are weighted less using IDF.

          document   first    is     one      second   the    third    this
idfmax    0.2877     0.6931   0.0    1.3863   1.3863   0.0    1.3863   0.0

Write a function inverse_document_frequency(df_merged, top_n) that takes in df_merged as the merged dataframe from Task 7 and top_n as the number of first rows of this dataframe to use, and outputs a dataframe of the idf scores for every word in the corpus (like in the first table in this description). The resulting scores should be rounded to 4 d.p. The columns must be in sorted alphabetical order.

Tips:

•    This task can be performed using a similar methodology to the previous task.

•    In addition to the tokenize_words(df_merged, top_n) function, we've also given you the document_frequency(df_merged, top_n) function, which returns a dictionary where each key is a unique token found in the corpus and each value is the number of unique documents the token occurs in across the corpus.

•    The log function from the math package can be used and has been imported for you.
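As with the previous task, one possible sketch is shown below. It assumes tokenize_words and document_frequency behave as described in the tips, and that a word scores 0 for a document it does not appear in (matching the worked example above).

import pandas as pd
from math import log

def inverse_document_frequency(df_merged, top_n):
    tokens = tokenize_words(df_merged, top_n)           # provided in the skeleton
    doc_freq = document_frequency(df_merged, top_n)     # provided in the skeleton
    docs = df_merged['text'].head(top_n)
    n_docs = len(docs)
    rows = []
    for doc in docs:
        words = set(doc.split())
        # idf = log(total documents / documents containing the term),
        # reported only for terms that appear in this document.
        rows.append([round(log(n_docs / doc_freq[t]), 4) if t in words else 0.0
                     for t in tokens])
    return pd.DataFrame(rows, columns=tokens)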