COMP90042 Natural Language Processing Final Exam Semester 1 2021
Section A: Short Answer Questions [45 marks]
Answer each of the questions in this section as briefly as possible. Expect to answer each sub-question in no more than several lines.
Question 1: General Concepts [24 marks]
a) What is a “sequence labelling” task and how does it differ from independent prediction? Explain using “part-of-speech tagging” as an example. [6 marks]
b) Compare and contrast “antecedent restrictions” and “preferences” in “anaphora resolution”. You should also provide examples of these restrictions and preferences. [6 marks]
c) What is the “exposure bias” problem in “machine translation”? [6 marks]
d) Why do we use the “IOB tagging scheme” in “named entity recognition”? [6 marks]
Question 2: Distributional Semantics [9 marks]
a) How can we learn “word vectors” using “count-based methods”? [6 marks]
b) Qualitatively, how will the word vectors differ when we use “document” vs. “word context”? [3 marks]
Question 3: Context-Free Grammar [12 marks]
a) Explain two limitations of the “context-free” assumption as part of a “context-free grammar”, with the aid of an example for each limitation. [6 marks]
b) What negative effect does “head lexicalisation” have on the grammar? Does “parent conditioning” have a similar issue? You should provide examples as part of your explanation. [6 marks]
Section B: Method Questions [45 marks]
In this section you are asked to demonstrate your conceptual understanding of the methods that we have studied in this subject.
Question 4: Dependency Grammar [18 marks]
a) What is “projectivity” in a dependency tree, and why is this property important in dependency parsing? [3 marks]
b) Which arc or arcs are “non-projective” in the following tree? Explain why they are non-projective. [6 marks]
c) Show a sequence of parsing steps using a “transition-based parser” that will produce the dependency tree below. Be sure to include the state of the stack and buffer at every step. [9 marks]
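(The dependency tree referred to above appears as a figure in the original paper and is not reproduced here. As a reminder of the mechanics the question asks for, below is a minimal sketch of an arc-standard transition-based parser run on a hypothetical three-word sentence — the sentence, transitions, and function name are illustrative assumptions, not the exam's tree.)

```python
# Arc-standard transition system: SHIFT moves the next buffer word onto the
# stack; LEFT_ARC attaches stack[-2] as a dependent of stack[-1];
# RIGHT_ARC attaches stack[-1] as a dependent of stack[-2].
def parse(words, transitions):
    stack, buffer, arcs = [], list(words), []
    trace = []
    for t in transitions:
        if t == "SHIFT":
            stack.append(buffer.pop(0))
        elif t == "LEFT_ARC":
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))  # (head, dependent)
        elif t == "RIGHT_ARC":
            dep = stack.pop()
            arcs.append((stack[-1], dep))
        trace.append((t, list(stack), list(buffer)))  # state after each step
    return arcs, trace

# Hypothetical sentence "she ate fish" with gold arcs ate->she and ate->fish.
arcs, trace = parse(["she", "ate", "fish"],
                    ["SHIFT", "SHIFT", "LEFT_ARC", "SHIFT", "RIGHT_ARC"])
for step in trace:
    print(step)
print(arcs)
```

An exam answer would present the same information as a table of (transition, stack, buffer) rows, which is exactly what `trace` records.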
Question 5: Loglikelihood Ratio [15 marks]
The “loglikelihood ratio” is used in summarisation to measure the “saliency” of a word relative to a background corpus. In the second task of the project, one way to understand the nature of rumour vs. non-rumour source tweets is to extract the salient hashtags in each set of source tweets, revealing the topical differences between them. Illustrate with an example and equations how you can apply the loglikelihood ratio to extract salient hashtags in these two types of source tweets.
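(For reference, Dunning's log-likelihood ratio for comparing a hashtag's frequency across two corpora can be sketched as follows; the hashtag counts in the usage example are hypothetical, not drawn from any real tweet data.)

```python
import math

def log_likelihood_ratio(k1, n1, k2, n2):
    """-2 log lambda for a hashtag seen k1 times among n1 hashtag tokens in
    rumour source tweets and k2 times among n2 in non-rumour source tweets."""
    def log_l(k, n, p):
        # Binomial log-likelihood, guarding against log(0).
        if p in (0.0, 1.0):
            return 0.0 if k in (0, n) else float("-inf")
        return k * math.log(p) + (n - k) * math.log(1 - p)
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)  # null hypothesis: same rate in both corpora
    return 2 * (log_l(k1, n1, p1) + log_l(k2, n2, p2)
                - log_l(k1, n1, p) - log_l(k2, n2, p))

# A hashtag appearing 50 times in 10,000 rumour hashtag tokens but only
# 5 times in 10,000 non-rumour tokens scores high (salient for rumours);
# one appearing at the same rate in both scores ~0 (not salient).
print(log_likelihood_ratio(50, 10000, 5, 10000))
print(log_likelihood_ratio(50, 10000, 50, 10000))
```

Ranking hashtags by this score and thresholding (e.g. against a chi-squared critical value) yields the salient hashtags for each side.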
Question 6: Ethics [12 marks]
You’re tasked to develop an NLP application to predict the “intelligence quotient (IQ)” scores of high school students based on their essays written for a range of topics. Discuss at least three ethical implications of this application.
Section C: Algorithmic Questions [30 marks]
In this section you are asked to demonstrate your understanding of the methods that we have studied in this subject, in being able to perform algorithmic calculations.
Question 7: N-gram Language Models [15 marks]
This question asks you to calculate the probability for “N-gram language models”. You should leave your answers as fractions. Consider the following table, which collects the counts of words that occur after salted in a corpus.
| Word    | Count | Unsmoothed Probability | Absolute Discounting | Katz Backoff |
|---------|-------|------------------------|----------------------|--------------|
| egg     | 6     | ?                      | ?                    | ?            |
| caramel | 4     | ?                      | ?                    | ?            |
| fish    | 3     | ?                      | ?                    | ?            |
| peanuts | 2     | ?                      | ?                    | ?            |
| butter  | 0     | ?                      | ?                    | ?            |
| salted  | 0     | ?                      | ?                    | ?            |
E.g. the bigram salted egg occurs 6 times, while salted caramel occurs 4 times.
a) Assuming the 6 distinct words in the table are all the words in vocabulary, compute the bigram probabilities for all the bigrams listed in the table without any smoothing. Hint: you should fill in the missing values for the “Unsmoothed Probability” column in the table, and demonstrate how you arrive at these values. [3 marks]
b) Compute the bigram probabilities for all bigrams listed in the table using “absolute discounting”, with a discount factor of 0.2. Hint: you should fill in the missing values for the “Absolute Discounting” column in the table, and demonstrate how you arrive at these values. [6 marks]
c) Compute the bigram probabilities for all bigrams listed in the table using “Katz Backoff”, with the same discount factor of 0.2. Use the corpus below (2 sentences) for computing unigram probabilities. For simplicity, you do not need to consider special tokens (ending or starting tokens), and may assume all the unique words in the 2 sentences as your vocabulary when computing the unigram probabilities. Hint: you should fill in the missing values for the “Katz Backoff” column in the table, and demonstrate how you arrive at these values. [6 marks]
butter in batter will make batter salted
but better butter will make batter better
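(As a sanity check on the arithmetic for parts (a) and (b) — not the official solution — the unsmoothed and absolute-discounting columns can be computed mechanically from the counts in the table; the function names here are illustrative.)

```python
# Counts of words observed after "salted", from the table above.
counts = {"egg": 6, "caramel": 4, "fish": 3, "peanuts": 2,
          "butter": 0, "salted": 0}

def unsmoothed(counts):
    total = sum(counts.values())  # 15 continuations of "salted"
    return {w: c / total for w, c in counts.items()}

def absolute_discounting(counts, d=0.2):
    # Subtract d from every observed count, then share the reserved
    # probability mass equally among the unseen words.
    total = sum(counts.values())
    seen = [w for w, c in counts.items() if c > 0]
    unseen = [w for w, c in counts.items() if c == 0]
    reserved = d * len(seen) / total
    return {w: (c - d) / total if c > 0 else reserved / len(unseen)
            for w, c in counts.items()}

print(unsmoothed(counts))            # e.g. P(egg | salted) = 6/15
print(absolute_discounting(counts))  # e.g. P(egg | salted) = 5.8/15
print(sum(absolute_discounting(counts).values()))  # mass still sums to 1
```

Katz Backoff reuses the same reserved mass but redistributes it to the unseen words in proportion to their unigram probabilities from the 2-sentence corpus, rather than equally.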
Question 8: Topic Models [15 marks]
Consider training a “latent Dirichlet allocation” (LDA) topic model using the following corpus with 3 documents (d1, d2, d3). To initialise the training process, each word token is randomly allocated to a topic (e.g. peck/t3 means peck is assigned topic t3). Hyper-parameters of the topic model are set as follows: (1) number of topics T = 3; (2) document-topic prior α = 0.5; and (3) topic-word prior β = 0.1.
d1: peck/t3 pickled/t1 peppers/t1
d2: peter/t1 piper/t2 picked/t3 peppers/t2
d3: peppers/t2 piper/t3 peck/t3 peppers/t1
a) Compute the probability over the topics (t1, t2, t3) if you were to sample a new topic for the first word (peck) in d1 for a training step. You should show co-occurrence tables that are relevant to producing your solution. [9 marks]
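(The quantity asked for in (a) is the collapsed Gibbs sampling update, P(t | rest) ∝ (n_{d,t} + α)(n_{t,w} + β)/(n_t + V β), with the word being resampled excluded from all counts. A small sketch of that update on the corpus above — illustrative code, not the official worked solution:)

```python
from collections import Counter

# Corpus with initial topic assignments, as given in the question.
docs = [
    [("peck", "t3"), ("pickled", "t1"), ("peppers", "t1")],
    [("peter", "t1"), ("piper", "t2"), ("picked", "t3"), ("peppers", "t2")],
    [("peppers", "t2"), ("piper", "t3"), ("peck", "t3"), ("peppers", "t1")],
]
topics = ["t1", "t2", "t3"]
alpha, beta = 0.5, 0.1
vocab = {w for d in docs for w, _ in d}  # V = 6 word types

def topic_probs(doc_idx, word_idx):
    word, _ = docs[doc_idx][word_idx]
    # Build the co-occurrence counts, excluding the word being resampled.
    assignments = [(d_i, w, t) for d_i, d in enumerate(docs)
                   for i, (w, t) in enumerate(d)
                   if not (d_i == doc_idx and i == word_idx)]
    doc_topic = Counter(t for d_i, _, t in assignments if d_i == doc_idx)
    word_topic = Counter(t for _, w, t in assignments if w == word)
    topic_total = Counter(t for _, _, t in assignments)
    scores = [(doc_topic[t] + alpha)
              * (word_topic[t] + beta) / (topic_total[t] + len(vocab) * beta)
              for t in topics]
    z = sum(scores)
    return [s / z for s in scores]

print(topic_probs(0, 0))  # distribution over (t1, t2, t3) for "peck" in d1
```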
b) Assume now that the topic model is trained. You are now given a new document: pickled peppers popped. Describe how LDA infers the topics for this new document. Note: you do not need to show equations or tables here. [6 marks]