COMP5423 Natural Language Processing

Lecture 2: Text Normalization and Representation

Friday, 20 January 2023

Exercise 1:

Segment the character string “後天我們去北京” using the backward maximum matching approach (MaxLen=4) and the given dictionary {後天, 我們, 去, 北京}.
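The procedure can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes the intended method is backward maximum matching, as the MaxLen parameter suggests, and it assumes that a single character with no dictionary match is emitted as its own token. The function name `backward_max_match` and the variable `segments` are hypothetical and do not come from the lecture.

```python
def backward_max_match(text, dictionary, max_len=4):
    """Segment text by scanning from the right and taking the longest match."""
    tokens = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end` first, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # Assumption: an unmatched single character becomes a token.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                end -= size
                break
    tokens.reverse()  # tokens were collected right to left
    return tokens


segments = backward_max_match("後天我們去北京", {"後天", "我們", "去", "北京"})
print(segments)
```

At each step the window covers at most MaxLen characters ending at the current position and shrinks from the left until a dictionary entry is found.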

Exercise 2:

Consider the document frequencies of the four terms, “car”, “auto”, “insurance” and “best”, in the Reuters collection of 806,791 documents and their term frequencies for 3 documents denoted as Doc1, Doc2 and Doc3 in the following tables.

term        df
car         18165
auto        6723
insurance   19241
best        25235

tf          Doc1   Doc2   Doc3
car         27     4      24
auto        3      33     0
insurance   0      33     29
best        14     0      17

(a) Compute the tf-idf weights for these four terms and three documents.

(b) What are the tf-idf weighted vector representations of the three documents in terms of the given four terms?
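One way to check the arithmetic for (a) and (b) is the short Python sketch below. The exercise does not state which weighting variant to use, so the sketch assumes raw term frequency multiplied by idf = log10(N / df); if the lecture uses a log-scaled tf or a different logarithm base, the formula needs to be adjusted. The variable names (`idf`, `vectors`, and so on) are illustrative only.

```python
import math

N = 806_791  # number of documents in the Reuters collection

# Document frequencies from the first table.
df = {"car": 18165, "auto": 6723, "insurance": 19241, "best": 25235}

# Term frequencies from the second table.
tf = {
    "Doc1": {"car": 27, "auto": 3, "insurance": 0, "best": 14},
    "Doc2": {"car": 4, "auto": 33, "insurance": 33, "best": 0},
    "Doc3": {"car": 24, "auto": 0, "insurance": 29, "best": 17},
}

terms = ["car", "auto", "insurance", "best"]

# Assumption: idf_t = log10(N / df_t).
idf = {t: math.log10(N / df[t]) for t in terms}

# Assumption: tf-idf weight = raw tf * idf.
# Each document becomes a 4-dimensional vector in the order given by `terms`.
vectors = {doc: [tf[doc][t] * idf[t] for t in terms] for doc in tf}

for t in terms:
    print(f"idf({t}) = {idf[t]:.2f}")
for doc, vec in vectors.items():
    print(doc, [round(w, 2) for w in vec])
```

The printed idf values answer the first half of (a), and each printed list is the vector asked for in (b) under the stated assumptions.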