
Assignment 1 – News Classifier

Deadline: 12:00pm, Nov 10, 2023

1    Introduction

In this assignment, you are tasked with constructing a classifier that can categorise news articles based on their content.

Specifically, you will be working with a dataset consisting of 20 news articles sourced from the Sky News website (included in the "resources" folder). These articles can be broadly classified into two distinct categories, each representing a different topic. For instance, the first category encompasses articles like the one titled "Osiris-Rex's sample from asteroid Bennu will reveal secrets of our solar system". Conversely, the second category includes articles such as the one headlined "Bitcoin slides to five-month low amid wider sell-off".

The main idea here is to assess the semantic closeness of these 20 news articles by using the Term Frequency-Inverse Document Frequency (TF-IDF) embedding.
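The brief does not prescribe a particular closeness measure, but cosine similarity is a common choice for comparing TF-IDF vectors. The sketch below (the function name is illustrative, not part of the assignment starter code) shows the idea:

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two document vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)   # assumes neither vector is all zeros
```

Articles on the same topic should share many high-weight terms, so their TF-IDF vectors should have a cosine similarity closer to 1 than articles on different topics.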

1.1 Term Frequency-Inverse Document Frequency Embedding

TF-IDF, which stands for Term Frequency-Inverse Document Frequency, is a popular numerical statistic that reflects how important a word is to a document in a collection or corpus. It is a widely used technique in information retrieval and text mining for evaluating the relevance of words within the documents of a dataset.

TF-IDF Embedding is a technique where text documents are converted into vector representations such that each document is represented as a vector in a multidimensional space. Each dimension in this space corresponds to a unique word in the corpus vocabulary, and the value in each dimension is the TF-IDF weight of that word in the respective document.

A major advantage of using high-dimensional vectors for document representation is their compatibility with further numerical processing tasks, such as serving as input for neural networks (NN). In essence, TF-IDF Embedding acts as a vectorisation procedure. Unlike one-hot encoding, which merely assigns each vocabulary word a unique numerical identifier (so the largest identifier equals the vocabulary size), TF-IDF Embedding preserves each word's intrinsic relevance (or weight) throughout the transformation.

1.2 A Step-by-Step Guidance of TF-IDF Embedding

As suggested by its name, TF-IDF assigns a score to (or vectorises) a word by calculating the product of the word's Term Frequency (TF) and its Inverse Document Frequency (IDF).

Term Frequency: The TF represents the occurrence of a term or word in relation to the document’s total word count, expressing how frequently a specific term appears within it. TF is calculated by:

TF(t, d) = f_{t,d} / Σ_{t′ ∈ d} f_{t′,d}                                (1)

where f_{t,d} is the number of times a word t appears in a document d, and Σ_{t′ ∈ d} f_{t′,d} is the total number of words in that document.
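As a minimal sketch (the function name and whitespace tokenisation are illustrative assumptions, not part of the assignment), Equation (1) can be computed directly from a tokenised document:

```python
from collections import Counter

def term_frequency(term: str, document: list[str]) -> float:
    """TF(t, d): occurrences of `term` divided by the total word count of `document`."""
    counts = Counter(document)           # f_{t,d} for every term t in d
    return counts[term] / len(document)  # normalise by the document's length

doc = "harry_potter is a student at hogwarts".split()
print(term_frequency("hogwarts", doc))   # 1/6 ≈ 0.167
```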

Inverse Document Frequency: The IDF characterises a term by its occurrence across the documents in a corpus. It quantifies the rarity of a term by determining how many documents it appears in, offering insight into the term's uniqueness or commonality within the corpus. It is calculated by:

IDF(t, D) = log( N / |{d ∈ D : t ∈ d}| ) + 1                                (2)

where N is the total number of documents in the corpus, and |{d ∈ D : t ∈ d}| is the number of documents in which the word t appears.
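Continuing the sketch above, Equation (2) can be written as follows, assuming a natural logarithm and the +1 added outside the log as in the formula:

```python
import math

def inverse_document_frequency(term: str, corpus: list[list[str]]) -> float:
    """IDF(t, D) = log(N / |{d in D : t in d}|) + 1; assumes `term` occurs somewhere in D."""
    n_docs = len(corpus)                                # N
    doc_freq = sum(1 for doc in corpus if term in doc)  # |{d in D : t in d}|
    return math.log(n_docs / doc_freq) + 1
```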

Then the final TF-IDF is calculated as:

TFIDF(t, d) = TF(t, d) · IDF(t, D)                                (3)
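Combining the two helper functions sketched above gives Equation (3), and applying it to every vocabulary word yields the per-document vector described in Section 1.1 (again a sketch under the same assumptions, not prescribed starter code):

```python
def tfidf(term: str, document: list[str], corpus: list[list[str]]) -> float:
    """TFIDF(t, d) = TF(t, d) * IDF(t, D), as in Equation (3)."""
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)

def embed(document: list[str], corpus: list[list[str]]) -> list[float]:
    """Represent `document` as a vector with one TF-IDF dimension per vocabulary word."""
    vocabulary = sorted({term for doc in corpus for term in doc})
    return [tfidf(term, document, corpus) for term in vocabulary]
```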

To illustrate how TF-IDF Embedding works, suppose we have three documents in a corpus (N = 3).

Document   Contents
D1         harry_potter is a student at hogwarts
D2         voldemort used to be a student at hogwarts but graduated already
D3         the parents of harry_potter studied at hogwarts as well

Table 2: Documents in the corpus
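Working through Table 2 by hand (under the Equation (2) convention above): "hogwarts" appears in all three documents, so IDF = log(3/3) + 1 = 1, and its weight in D1 (6 words) is (1/6) × 1 ≈ 0.167. By contrast, "voldemort" appears only in D2 (11 words), so IDF = log(3/1) + 1 ≈ 2.099 and its weight is (1/11) × 2.099 ≈ 0.191; the rarer, more discriminative word receives the higher weight. The same numbers fall out of the sketch functions above:

```python
corpus = [
    "harry_potter is a student at hogwarts".split(),                             # D1
    "voldemort used to be a student at hogwarts but graduated already".split(),  # D2
    "the parents of harry_potter studied at hogwarts as well".split(),           # D3
]

print(tfidf("hogwarts", corpus[0], corpus))   # (1/6)  * (log(3/3) + 1) ≈ 0.167
print(tfidf("voldemort", corpus[1], corpus))  # (1/11) * (log(3/1) + 1) ≈ 0.191
```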