CE 314/887 Natural Language Engineering Assignment 1

Published: 2022-11-16



Regular expressions (40%) (You can store your code in the file part1_regex_studentID.py)

1: Write a regular expression that finds all amounts of money in a text. Your expression should handle different formats and currencies, for example £50,000 and £117.3m as well as 30p, 500m euro, 338bn euros, $15bn and $92.88. Make sure that you can at least detect amounts in pounds, dollars and euros. (You should write a Python program to check the matching results, 20pts)

For full marks: include the output of a Python program that applies your regular expression to the following BBC News web page:

https://www.bbc.co.uk/news/business-41779341
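One possible starting point for question 1 is sketched below. It is not a model answer: the pattern name MONEY_RE, the helper find_money, and the list of scale words (billion, million, thousand, bn, m, k) are assumptions made for illustration, and the pattern only covers the formats named in the question.

```python
import re

# Hypothetical pattern: two alternatives, symbol-first and number-first.
MONEY_RE = re.compile(
    r"""
    [£$€]\s?\d+(?:,\d{3})*(?:\.\d+)?                 # symbol first: £50,000  $92.88
    (?:\s?(?:billion|million|thousand|bn|m|k)\b)?    # optional scale: £117.3m  $15bn
    |
    \d+(?:,\d{3})*(?:\.\d+)?                         # number first: 30  500  338
    (?:\s?(?:billion|million|thousand|bn|m|k))?      # optional scale: 500m  338bn
    (?:p\b|\s?(?:euros?|pounds?|dollars?|pence)\b)   # required unit: 30p  500m euro
    """,
    re.VERBOSE | re.IGNORECASE,
)


def find_money(text):
    """Return every substring of text that the pattern treats as an amount of money."""
    return MONEY_RE.findall(text)


if __name__ == "__main__":
    sample = ("for example £50,000 and £117.3m as well as 30p, "
              "500m euro, 338bn euros, $15bn and $92.88.")
    for amount in find_money(sample):
        print(amount)
```

To apply the same function to the BBC page, the page text would first have to be downloaded, for example with urllib.request.urlopen(url).read().decode("utf-8"), and the results checked by hand for false positives.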

2: Write a regular expression that matches all the phone numbers listed below. (You should write a Python program to check the matching results, 20pts)

555.123.4565

+1-(800)-545-2468

2-(800)-545-2468

3-800-545-2468

555-123-3456

555 222 3342

(234) 234 2442

(243)-234-2342

1234567890

123.456.7890

123.4567

1234567900

12345678900
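A minimal sketch of one way to approach question 2 follows. The names PHONE_RE and is_phone_number are hypothetical, and the pattern assumes a number is an optional one-digit country code, an optional three-digit area code, and a 3+4 digit local number, separated by a dash, dot, space, or nothing. It does not check that parentheses are balanced, so it is more permissive than a production validator would be.

```python
import re

# Hypothetical pattern built from three parts; each separator is optional.
PHONE_RE = re.compile(
    r"""
    (?:\+?\d[-.\s]?)?          # optional one-digit country code: +1-  2-  1
    (?:\(?\d{3}\)?[-.\s]?)?    # optional area code: (800)-  555.  123
    \d{3}[-.\s]?\d{4}          # local number: 545-2468  123.4567
    """,
    re.VERBOSE,
)

# The numbers listed in the assignment.
PHONE_NUMBERS = [
    "555.123.4565", "+1-(800)-545-2468", "2-(800)-545-2468",
    "3-800-545-2468", "555-123-3456", "555 222 3342",
    "(234) 234 2442", "(243)-234-2342", "1234567890",
    "123.456.7890", "123.4567", "1234567900", "12345678900",
]


def is_phone_number(text):
    """Return True if the whole string matches the pattern."""
    return PHONE_RE.fullmatch(text) is not None


if __name__ == "__main__":
    for number in PHONE_NUMBERS:
        print(f"{number!r:22} {is_phone_number(number)}")
```

Using fullmatch rather than search means a string only counts when the entire string is a phone number, which makes it easier to spot a pattern that is too loose.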

NLTK (10%)

1: Find the 50 most frequent words in the Wall Street Journal corpus in nltk.book (text7), with all punctuation removed and all words lowercased. Submit your code as part2_NLTK_studentID.py.
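The counting step can be sketched as follows. The helper names top_words and load_wsj_tokens are hypothetical, and "punctuation removed" is interpreted here as keeping only purely alphabetic tokens, which is an assumption: it also drops numbers and tokens such as "n't". Loading text7 assumes the NLTK "book" data collection has already been downloaded.

```python
from collections import Counter


def top_words(tokens, n=50):
    """Return the n most frequent lowercased tokens as (word, count) pairs.

    Assumption: a token counts as a word only if it is purely alphabetic.
    """
    words = (tok.lower() for tok in tokens if tok.isalpha())
    return Counter(words).most_common(n)


def load_wsj_tokens():
    """Load the Wall Street Journal text (text7) from nltk.book.

    The import is done here so that top_words can be used without NLTK.
    """
    from nltk.book import text7
    return list(text7)


if __name__ == "__main__":
    for word, count in top_words(load_wsj_tokens(), 50):
        print(f"{word}\t{count}")
```

nltk.FreqDist is a Counter subclass, so it could be substituted for Counter here without changing the rest of the code.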

Language modelling:

You should write a Python program for this part and name it part3_LM_studentID.py.

1: Build an n-gram language model based on NLTK's Reuters corpus (from nltk.corpus import reuters) and provide the code. (You can build a language model in a few lines of code using the NLTK package; you may use bigrams, trigrams or higher-order n-grams.) (20pts)
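As an illustration of what such a model can look like, here is a minimal trigram sketch using maximum-likelihood estimates. The function names build_trigram_model and load_reuters_sentences are hypothetical, None is used as an assumed padding marker for sentence boundaries, and tokens are kept in their original case. Loading the corpus assumes the Reuters data (and the sentence tokenizer data it depends on) has been downloaded with nltk.download.

```python
from collections import Counter, defaultdict


def build_trigram_model(sentences):
    """Map each (w1, w2) context to a dict {next_word: probability}.

    Sentences are padded with None on both sides so that sentence
    starts and ends are modelled as well.
    """
    counts = defaultdict(Counter)
    for sent in sentences:
        padded = [None, None] + list(sent) + [None, None]
        for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
            counts[(w1, w2)][w3] += 1

    model = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        # Maximum-likelihood estimate: count / total count for this context.
        model[context] = {word: c / total for word, c in counter.items()}
    return model


def load_reuters_sentences():
    """Return the tokenized sentences of the Reuters corpus."""
    from nltk.corpus import reuters
    return reuters.sents()


if __name__ == "__main__":
    model = build_trigram_model(load_reuters_sentences())
    print(len(model), "contexts")
```

No smoothing is applied, so any trigram that never occurs in the corpus has probability zero; that is usually acceptable for a first demonstration but not for evaluation on unseen text.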

2: After step 1, make simple predictions with the language model you built in question 1. Start with two simple words, “he is”, and have your n-gram model predict the next word. Show both the code and the model-generated results. (15pts)
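Prediction then amounts to looking up a context and ranking the candidates. The sketch below assumes the model is a dictionary that maps a two-word context to a dictionary of next-word probabilities; that layout and the name predict_next are assumptions, and the toy model is made-up data used only to show the call.

```python
def predict_next(model, w1, w2, k=5):
    """Return the k most probable next words after (w1, w2).

    The result is a list of (word, probability) pairs, most probable
    first, or an empty list if the context was never seen.
    """
    candidates = model.get((w1, w2), {})
    ranked = sorted(candidates.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Made-up toy model, for illustration only.
    toy_model = {("he", "is"): {"a": 0.5, "not": 0.3, "the": 0.2}}
    for word, prob in predict_next(toy_model, "he", "is"):
        print(f"{word}\t{prob:.2f}")
```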

3: Based on your work for questions 1 and 2, generate a few sentences that start with “he is”. (15pts)
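Sentence generation can be sketched as repeated sampling from the model, one word at a time. As before, the model layout (two-word context mapped to next-word probabilities, with None marking the end of a sentence) and the name generate_sentence are assumptions, and the toy model is made-up data.

```python
import random


def generate_sentence(model, start=("he", "is"), max_words=30, rng=None):
    """Extend start word by word by sampling from the model.

    Generation stops at the assumed end marker (None), at an unseen
    context, or once max_words words have been produced.
    """
    if rng is None:
        rng = random.Random()
    words = list(start)
    while len(words) < max_words:
        candidates = model.get((words[-2], words[-1]))
        if not candidates:
            break
        choices = list(candidates.keys())
        weights = list(candidates.values())
        # Sample the next word in proportion to its probability.
        next_word = rng.choices(choices, weights=weights, k=1)[0]
        if next_word is None:
            break
        words.append(next_word)
    return " ".join(words)


if __name__ == "__main__":
    # Made-up toy model, for illustration only.
    toy_model = {
        ("he", "is"): {"very": 0.5, "late": 0.5},
        ("is", "very"): {"late": 1.0},
        ("very", "late"): {None: 1.0},
        ("is", "late"): {None: 1.0},
    }
    for _ in range(3):
        print(generate_sentence(toy_model))
```

Sampling in proportion to probability gives varied sentences across runs; always taking the single most probable word would instead return the same sentence every time.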

Hints:

For building n grams, you can refer to this link:

https://medium.com/swlh/language-modelling-with-nltk-20eac7e70853#:~:text=An%20n%2Dgram%20model%20is,'There%20was%20huge%20rainfall'.

Writing code with comments is a good habit.