COMP5423 Natural Language Processing Lecture 2
Lecture 2: Text Normalization and Representation
Friday, 20 January 2023
Exercise 1:
Segment the character string “後天我們去北京” using the backward minimum matching approach (MaxLen=4) and the given dictionary {後天, 我們, 去, 北京}.
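The right-to-left dictionary lookup in the exercise above can be sketched in Python as follows. This is a minimal sketch, not the lecture's reference solution. The function name `backward_match` and the `longest_first` flag are hypothetical. The sketch assumes that "minimum matching" means trying the shortest candidate first, while `longest_first=True` gives the more common backward maximum matching. It also assumes that a character with no dictionary match is emitted as a single-character token.

```python
def backward_match(text, dictionary, max_len=4, longest_first=True):
    """Segment `text` from right to left by dictionary lookup.

    max_len caps the candidate length (MaxLen in the exercise).
    longest_first=True tries long candidates first (maximum matching);
    longest_first=False tries short candidates first (assumed reading
    of minimum matching).
    """
    tokens = []
    end = len(text)
    while end > 0:
        longest = min(max_len, end)
        if longest_first:
            sizes = range(longest, 0, -1)
        else:
            sizes = range(1, longest + 1)
        for size in sizes:
            candidate = text[end - size:end]
            if candidate in dictionary:
                break
        else:
            # Assumption: nothing matched, so fall back to one character.
            candidate = text[end - 1:end]
        tokens.append(candidate)
        end -= len(candidate)
    tokens.reverse()  # tokens were collected right to left
    return tokens


dictionary = {"後天", "我們", "去", "北京"}
sentence = "後天我們去北京"
print(backward_match(sentence, dictionary, max_len=4, longest_first=False))
print(backward_match(sentence, dictionary, max_len=4, longest_first=True))
```

Running both settings on the same string makes it easy to see whether the two matching orders disagree for a given dictionary.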
Exercise 2:
Consider the document frequencies of the four terms, “car”, “auto”, “insurance” and “best”, in the Reuters collection of 806,791 documents and their term frequencies for 3 documents denoted as Doc1, Doc2 and Doc3 in the following tables.
term | df
car | 18165
auto | 6723
insurance | 19241
best | 25235

tf | Doc1 | Doc2 | Doc3
car | 27 | 4 | 24
auto | 3 | 33 | 0
insurance | 0 | 33 | 29
best | 14 | 0 | 17
(a) Compute the tf-idf weights for these four terms and three documents.
(b) What are the tf-idf weighted vector representations of the three documents in terms of the given four terms?
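One way to check the arithmetic for parts (a) and (b) is the short Python sketch below. It assumes the weighting tf-idf = tf × log10(N/df) with raw term counts and N = 806,791. The lecture may define tf or idf differently (for example, a log-scaled tf), in which case only the two marked lines need to change. The variable names are illustrative.

```python
import math

N = 806_791  # number of documents in the Reuters collection

df = {"car": 18165, "auto": 6723, "insurance": 19241, "best": 25235}
tf = {
    "Doc1": {"car": 27, "auto": 3, "insurance": 0, "best": 14},
    "Doc2": {"car": 4, "auto": 33, "insurance": 33, "best": 0},
    "Doc3": {"car": 24, "auto": 0, "insurance": 29, "best": 17},
}
terms = ["car", "auto", "insurance", "best"]  # fixed vector dimension order

# Assumption: idf = log10(N / df)
idf = {t: math.log10(N / df[t]) for t in terms}

# Assumption: weight = raw tf * idf
weights = {
    doc: {t: counts[t] * idf[t] for t in terms}
    for doc, counts in tf.items()
}

# Part (b): each document as a vector over the four terms
vectors = {doc: [weights[doc][t] for t in terms] for doc in tf}

for t in terms:
    print(f"idf({t}) = {idf[t]:.2f}")
for doc, vec in vectors.items():
    print(doc, [round(w, 2) for w in vec])
```

Keeping `terms` as an ordered list fixes the dimension order, so the three vectors are directly comparable.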
2023-03-27