

COMP20008 Elements of Data Processing

Assignment 1


Background

You have access to match results and news articles relating to soccer matches in the English Premier League. This assignment requires you to analyse and extract information from these sources. Parts of this assignment will require you to interpret match scores in soccer. Match scores are usually expressed as the number of goals scored by the first team, followed by a hyphen, followed by the number of goals scored by the second team (e.g. 2-1). For the purposes of this assignment, we will assume that a team can score at most 99 goals in a match and thus any score from 0-0 to 99-99 is considered valid.


Learning outcomes

The learning objectives of this assignment are to:

● Gain practical experience in written communication skills for documenting data science projects.

● Practice a selection of processing and exploratory analysis techniques through visualisation.

● Practice text processing techniques using Python.

● Practice widely used Python libraries and gain experience in consultation of additional documentation from Web resources.


Getting started

Before starting the assignment you must do the following:

● Create a github account at https://www.github.com if you don’t already have one.

● Visit https://classroom.github.com/a/VdnT1_8s and accept the assignment. This will create your personal assignment repository on github.

● Clone your assignment repository to your local machine. The repository contains important files that you will need in order to complete the assignment.


Repository Files

The starting repository contains the following files:

● main.py: This file contains code to verify your answers. You must not edit anything in main.py

● assignment1.py: This file contains one function for each task in the assignment. You should fill in the relevant function to complete the task. You may choose to create additional functions to segment your code, but all the code you write must be contained in this file.

● data/data.json: This file contains details of recent soccer matches in the English Premier League, which you will need in order to complete your assignment.

● data/football: This folder contains a number of news articles about soccer matches in the English Premier League. You will need to load these files in order to complete your assignment.


Your Tasks (Total 20 marks)

Task 1 Loading and interpreting a JSON file (1 mark)

Write a function task1() that loads the data/data.json file into Python. Your function should return a list of team codes, sorted in alphabetical order by team code.

        You can test your implementation with the following command: python main.py task1
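As a minimal sketch of the loading step, the following reads a JSON file and returns one of its lists in sorted order. The key name teams_codes and the helper name load_team_codes are assumptions made for illustration only; inspect data/data.json to find the actual structure.

```python
import json


def load_team_codes(path, key="teams_codes"):
    """Return the list stored under `key`, sorted alphabetically.

    `key` is a hypothetical field name; check data/data.json for the real one.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return sorted(data[key])
```

In assignment1.py, task1() could call a helper like this with the path data/data.json and return the result.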


Task 2 Data Aggregation (2 marks)

Write a function task2() that uses the information contained in the clubs objects to work out how many goals were scored by and against each team in total throughout the season. Your function should output this information to a csv file called task2.csv. Your csv file should contain the following headings: team code, goals scored by team, goals scored against team. Each row in the file should contain the details for one team, sorted in alphabetical order by team code.

        You can test your implementation with the following command: python main.py task2
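The csv output described in Task 2 can be sketched as follows. The record keys club_code, goals_scored and goals_conceded are hypothetical, as is the underscore spelling of the column headings; confirm both against data/data.json and main.py before relying on them.

```python
import csv


def write_goal_totals(clubs, out_path):
    """Write one row per club, sorted by team code.

    `clubs` is assumed to be a list of dicts with the hypothetical keys
    'club_code', 'goals_scored' and 'goals_conceded'.
    """
    ordered = sorted(clubs, key=lambda club: club["club_code"])
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Heading spelling is an assumption; match whatever main.py expects.
        writer.writerow(
            ["team_code", "goals_scored_by_team", "goals_scored_against_team"]
        )
        for club in ordered:
            writer.writerow(
                [club["club_code"], club["goals_scored"], club["goals_conceded"]]
            )
```

Opening the file with newline="" avoids blank lines between rows on some platforms.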


Task 3 Regular Expressions (2 marks)

In addition to the information contained in the data.json file, we also have a number of news articles written about soccer matches. Each article is located in a separate text file in the data/football folder. For this task we will assume that each article is written about a match. Write a function task3() to extract the largest match score identified in the article.

Add the number of goals scored by each side together to produce the total number of goals scored in the match.

        For example, if the largest match score mentioned in an article is 14-6, your program should calculate 20 as the total number of goals. For this task we define the largest match score as the one with the highest total number of goals, so a score of 14-6 is considered larger than a score of 16-2.

        If a suitable score cannot be found in the article, your function should return 0 as the total number of goals for that article. You will need to use regular expressions to accomplish this.

        Your function should produce a csv file containing the filename and the total number of goals for each article. Your csv file should contain two columns, filename and total goals. Each row in the file should contain the detail for one article, sorted in ascending alphabetic order by filename. Save this file as task3.csv

        You can test your implementation with the following command: python main.py task3
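One possible way to express the score rule from Task 3 as a regular expression is sketched below. The pattern and the helper name largest_total_goals are illustrative assumptions, not the expected solution. The lookarounds reject candidates that are attached to further digits or hyphens, so season labels such as 2019-20 are ignored. Two-part dates such as 10-12 would still match, which is the kind of weakness worth discussing in Task 10.

```python
import re

# One or two digits, a hyphen, one or two digits, with no digit or hyphen
# touching either side (this excludes 100-2, 2019-20 and 3-2-1).
SCORE_PATTERN = re.compile(r"(?<![\d-])(\d{1,2})-(\d{1,2})(?![\d-])")


def largest_total_goals(text):
    """Return the highest combined goal count of any score found, or 0."""
    totals = [int(home) + int(away) for home, away in SCORE_PATTERN.findall(text)]
    return max(totals, default=0)
```

For instance, a text mentioning both 14-6 and 16-2 gives 20, because 14 + 6 exceeds 16 + 2.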


Task 4 Visualising Scores (1 mark)

We now wish to understand whether there are outliers present in the number of goals we calculated in Task 3. Write a function task4() that produces a boxplot showing the distribution of values for total goals. Any values more than 1.5 interquartile ranges above Q3 should be identified as outliers on the plot. This boxplot should be saved as task4.png

        For all tasks involving visualisations, you should ensure that your plots contain a title and labels for all relevant axes.

        You can test your implementation with the following command: python main.py task4
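The outlier rule in Task 4 can be checked numerically before plotting. The sketch below uses only the standard library; the helper name upper_outliers is hypothetical. Quartiles are computed with the "inclusive" method of statistics.quantiles, which is one of several quartile conventions, so a plotting library may place the fence slightly differently.

```python
import statistics


def upper_outliers(values):
    """Return the values lying more than 1.5 IQRs above the third quartile."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    upper_fence = q3 + 1.5 * (q3 - q1)
    return [v for v in values if v > upper_fence]
```

With the totals 1, 2, 3, 4, 5, 6, 7, 8 and 20, this convention gives Q1 = 3 and Q3 = 7, so the fence sits at 7 + 1.5 × 4 = 13 and only 20 is flagged.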


Task 5 Extracting information from text data (2 marks)

We now wish to understand how often each club is mentioned by the media. The data.json file also contains a list of club names. Write a function task5() that searches through each of the news articles for mentions of each club and counts the number of articles in which each club is mentioned at least once. Your function should produce a csv file containing the club name and number of mentions for each club. Your csv file should contain the following column headings: club name and number of mentions. Save this file as task5.csv. Each row in the file should contain the details for one team, sorted in ascending alphabetic order by club name. Your function should also produce a bar chart conveying this information, saved as task5.png.

        You can test your implementation with the following command: python main.py task5
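The counting step in Task 5 is per article, not per occurrence: an article that names a club five times still adds one to that club's count. A minimal sketch, assuming the articles have already been read into a dict of filename to text and that plain case-sensitive substring matching is acceptable (both are assumptions, as is the helper name):

```python
def count_article_mentions(articles, club_names):
    """Map each club name to the number of articles that mention it.

    `articles` is assumed to map filename -> full article text.
    Clubs are returned in ascending alphabetical order.
    """
    counts = {}
    for club in sorted(club_names):
        counts[club] = sum(1 for text in articles.values() if club in text)
    return counts
```

Substring matching is simple but crude; a name that is a prefix of a longer word or of another club name would also be counted.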


Task 6 Extracting information from text data (1 mark)

We also wish to understand which clubs are commonly mentioned together in the same news articles, as we believe that clubs that are commonly mentioned together in the same news articles are similar. We can produce a similarity score for clubs using the following formula:

Write a function task6() that calculates the similarity score for each pair of clubs and produces a heatmap conveying this information. This heatmap should be saved as task6.png

        You can test your implementation with the following command: python main.py task6


Task 7 Comparing Information (2 marks)

We now wish to understand whether the number of times a team is mentioned in the media is related to its performance. Write a function task7() that produces a scatterplot comparing the number of articles mentioning each team (as calculated in Task 5) with the total number of goals scored by each team (as calculated in Task 2). This scatterplot should be saved as task7.png

        You can test your implementation with the following command: python main.py task7


Task 8 Text preprocessing (2 marks)

We now wish to perform the following preprocessing on each article in order to make them easier to analyse. Write a function task8(filepath) that performs the following preprocessing steps. This function takes a single argument, filepath, which specifies the file to be processed.

● Remove all non-alphabetic characters (for example, numbers, apostrophes and punctuation characters), except for spacing characters such as whitespaces, tabs and newlines.

● Convert all spacing characters such as tabs and newlines to whitespace and ensure that only one whitespace character exists between each word

● Change all uppercase characters to lower case

● Tokenize the resulting string into words

● Remove all stopwords in nltk’s list of English stopwords from the resulting list

● Remove all remaining words that are only a single character long from the resulting list

You can test your implementation with the following command: python main.py task8
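The preprocessing steps listed for Task 8 can be sketched on a plain string as follows. The helper name preprocess_text is hypothetical, and the stopword collection is passed in as a parameter so the example runs without nltk; in the assignment it would be built from nltk's English stopword list (stopwords.words("english") from nltk.corpus).

```python
import re


def preprocess_text(text, stop_words):
    """Apply the Task 8 steps to `text` and return the remaining tokens."""
    # 1. Drop everything that is neither a letter nor a spacing character.
    text = re.sub(r"[^A-Za-z\s]", "", text)
    # 2. Collapse tabs, newlines and repeated spaces into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # 3. Lower-case, then 4. tokenize on the single spaces.
    tokens = text.lower().split(" ") if text else []
    # 5. Remove stopwords, then 6. remove single-character words.
    tokens = [t for t in tokens if t not in stop_words]
    return [t for t in tokens if len(t) > 1]
```

Note that removing an apostrophe joins the two halves of a word (team's becomes teams), which follows from the first step as stated.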


Task 9 Detecting similar articles (4 marks)

We now wish to understand which news articles in our dataset are most similar to each other. One way in which we can do this is to generate a TF-IDF vector for each article in the dataset and then use the cosine similarity measure discussed in lectures to compare each pair of articles. This should be done after applying the preprocessing in Task 8. Write a function task9() that uses this technique to produce a csv file containing the filenames of the 10 pairs of articles with the highest similarities and their similarity score. This file should be saved as task9.csv. Your csv file should have the following headings: article1, article2, similarity. It should be ordered in descending order of similarity score (i.e. highest similarity score first). Note that there should not be any duplicate entries. For example, if articles 001.txt and 002.txt were judged to be highly similar with a score of 0.9, we should have only one entry for that pair of articles:

001.txt 002.txt 0.9

And not a second entry:

002.txt 001.txt 0.9

        You can test your implementation with the following command: python main.py task9
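The pairing logic in Task 9 can be illustrated with a small standard-library sketch. The TF-IDF weighting used here (raw term count multiplied by a smoothed inverse document frequency) is one common variant and is an assumption; the weighting presented in lectures or applied by a library may differ. All function names are hypothetical. Iterating over combinations of sorted filenames visits each unordered pair exactly once, which avoids the duplicate entries described above.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """docs maps filename -> list of tokens; returns filename -> {term: weight}."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        vectors[name] = {
            # Smoothed idf (an assumed variant): ln((1 + N) / (1 + df)) + 1
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_similar_pairs(docs, k=10):
    """Return the k most similar (article1, article2, similarity) triples."""
    vectors = tfidf_vectors(docs)
    pairs = [
        (a, b, cosine(vectors[a], vectors[b]))
        for a, b in combinations(sorted(vectors), 2)
    ]
    pairs.sort(key=lambda pair: pair[2], reverse=True)
    return pairs[:k]
```

The token lists passed in would be the output of the Task 8 preprocessing for each article.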


Task 10 Communicating your results (5 marks)

Write a brief report of not more than 500 words to convey your key findings. Your report should include:

● A discussion of the appropriateness of the regular expression used in Task 3, including some examples of where you might expect it to perform poorly (2 marks)

● An analysis of the visualisations produced in Tasks 4, 5, 6 and 7 explaining what information can be interpreted from them (3 marks)

● Each of the visualisations produced in Tasks 4, 5, 6 and 7 must be included in your report. Failure to do so will result in a mark of 0 for each of those tasks.


Before submitting

You can run all your code using the python main.py all command. You should test your code in this way on the JupyterHub server, as that is the method we will be using to mark your final submission.


Submission Instructions

Your code must be uploaded via GitHub. Ensure all of your completed code files as well as your report have been pushed to the github repository you created in the ’Getting Started’ section. We strongly encourage you to push an updated version of your code to your github repository each time you make a major change. Your repository must also contain a README file, which must contain your name and student ID. It must also contain a brief description of your project and a list of dependencies.

        You must also complete the following form: https://forms.office.com/r/nnydQTn35Z This will allow us to link your GitHub account to your Student ID so that we can mark your assignment.

        You must also submit the following files via Canvas:

1. Your Task 10 report, in PDF or Word format

2. A copy of your assignment1.py file. This will be used as a backup in case there are any issues with your GitHub submission.

        Note that when marking your assignment we will copy your assignment1.py file into an empty directory containing only the data folder. This means that you may not change the folder structure (e.g. by renaming the data folder), nor write any code in files other than assignment1.py. Note that we will use slightly different news articles and match scores from those provided when marking your final submission.


Extensions and late submission penalties

If requesting an extension due to illness, please submit a medical certificate to the lecturer. If there are any other exceptional circumstances, please contact the lecturer with plenty of notice. Late submissions without an approved extension will attract the following penalties:

● 0 < hourslate <= 24: (2 marks deduction)

● 24 < hourslate <= 48: (4 marks deduction)

● 48 < hourslate <= 72: (6 marks deduction)

● 72 < hourslate <= 96: (8 marks deduction)

● 96 < hourslate <= 120: (10 marks deduction)

● 120 < hourslate <= 144: (12 marks deduction)

● 144 < hourslate: (20 marks deduction)

where hourslate is the elapsed time in hours (or fractions of hours).

This project is expected to require 15-20 hours of work.


Academic honesty

You are expected to follow the academic honesty guidelines on the University website https://academichonesty.unimelb.edu.au


Further information

A project discussion forum has also been created on the Ed forum. Please use this in the first instance if you have questions, since it will allow discussion and responses to be seen by everyone. There will also be a list of frequently asked questions on the project page.