COM3110
DEPARTMENT OF COMPUTER SCIENCE
TEXT PROCESSING
Autumn Semester 2013-2014
1. In the context of Information Retrieval, given the following documents:
Document 1: Sea shell, buy my sea shell!
Document 2: You may buy lovely SEA SHELL at the sea produce market.
Document 3: Product marketing in the Shelly sea is an expensive market.

and the query:
Query 1: sea shell produce market
a) Apply the following term manipulations on document terms: stoplist removal, capitalisation and stemming, showing the transformed documents. Explain each of these manipulations. Provide the stoplist used, making sure it includes punctuation, but no content words. [20%]
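A minimal Python sketch of the three manipulations, assuming an illustrative stoplist and a toy suffix-stripping stemmer (a real system would use e.g. the Porter stemmer); this is not the model answer:

```python
# Illustrative stoplist: punctuation plus a few function words, no content words.
STOPLIST = {",", ".", "!", "?", "my", "you", "may", "at", "the",
            "in", "is", "an", "and"}

def stem(token):
    # Toy stemmer: strip a few common suffixes (illustrative only).
    for suffix in ("ing", "ly", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: len(token) - len(suffix)]
    return token

def preprocess(text):
    # Separate punctuation so it can be removed by the stoplist.
    for p in ",.!?":
        text = text.replace(p, f" {p} ")
    tokens = [t.lower() for t in text.split()]           # capitalisation
    tokens = [t for t in tokens if t not in STOPLIST]    # stoplist removal
    return [stem(t) for t in tokens]                     # stemming

print(preprocess("Sea shell, buy my sea shell!"))
```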
b) Show how Document 1, Document 2 and Document 3 would be represented using an inverted index which includes term frequency information. [10%]
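A sketch of an inverted index with term frequency information, built over already-preprocessed documents; the token lists below are one plausible output of part (a), not the official answer:

```python
from collections import Counter, defaultdict

# Assumed preprocessed token lists for the three documents.
docs = {
    1: ["sea", "shell", "buy", "sea", "shell"],
    2: ["buy", "lovely", "sea", "shell", "sea", "produce", "market"],
    3: ["product", "market", "shelly", "sea", "expensive", "market"],
}

index = defaultdict(dict)          # term -> {doc_id: term frequency}
for doc_id, tokens in docs.items():
    for term, tf in Counter(tokens).items():
        index[term][doc_id] = tf

print(index["sea"])
```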
c) Using term frequency (TF) to weight terms, represent the documents and query as vectors. Produce rankings of Document 1, Document 2 and Document 3 according to their relevance to Query 1 using two metrics: Cosine Similarity and Euclidean Distance. Show which document is ranked first according to each of these metrics. [30%]
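The two metrics can be sketched as below. The vocabulary and TF counts are one plausible reading of the documents after preprocessing (assuming "product"/"produce" and "marketing"/"market" stem together and "shelly" stays distinct from "shell"), so the vectors are illustrative rather than the model answer:

```python
import math

vocab = ["sea", "shell", "buy", "produce", "market"]
d1 = [2, 2, 1, 0, 0]   # Document 1
d2 = [2, 1, 1, 1, 1]   # Document 2
d3 = [1, 0, 0, 1, 2]   # Document 3
q  = [1, 1, 0, 1, 1]   # Query 1

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Cosine: higher is more similar; Euclidean: lower is closer.
by_cos = sorted([(1, d1), (2, d2), (3, d3)], key=lambda x: -cosine(x[1], q))
by_euc = sorted([(1, d1), (2, d2), (3, d3)], key=lambda x: euclidean(x[1], q))
print([i for i, _ in by_cos], [i for i, _ in by_euc])
```

Note that with these counts the two metrics need not agree, which is the point of the question: cosine normalises for document length while Euclidean distance does not.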
d) Explain the intuition behind using TF.IDF (term frequency inverse document frequency) to weight terms in documents. Include the formula (or formulae) for computing TF.IDF values as part of your answer. For the ranking in the previous question using cosine similarity, discuss whether and how using TF.IDF to weight terms instead of TF only would change the results. [20%]
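One common formulation is w(t, d) = tf(t, d) × log₂(N / df(t)); other variants exist, so treat this as an assumption about the formula used in the module:

```python
import math

def tf_idf(tf, df, n_docs):
    # tf: term frequency in the document; df: number of documents
    # containing the term; n_docs: total number of documents.
    return tf * math.log2(n_docs / df)

# A term occurring in all 3 documents (e.g. "sea") gets IDF log2(3/3) = 0,
# so it stops contributing to any similarity score.
print(tf_idf(2, 3, 3))
```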
e) Explain the metrics Precision, Recall and F-measure in the context of evaluation in Information Retrieval against a gold-standard set, assuming a boolean retrieval model. Discuss why it is not feasible to compute recall in the context of searches performed on very large collections of documents, such as the Web. [20%]
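The set-based definitions assumed in boolean retrieval can be sketched as follows (F-measure shown here with equal weighting, i.e. F1):

```python
def prf(retrieved, relevant):
    # retrieved: set of documents returned by the system;
    # relevant: gold-standard set of relevant documents.
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = prf({1, 2, 3, 4}, {2, 4, 5, 6, 7, 8})
print(p, r, f)
```

Recall requires knowing the full relevant set, which is exactly what is unavailable for Web-scale collections.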
2. a) List and explain the three paradigms of Machine Translation. What is the dominant (most common) paradigm for open-domain systems nowadays and why is this paradigm more appealing than others, especially in scenarios such as online Machine Translation systems? [20%]
b) Lexical ambiguity is known to be one of the most challenging problems in any approach for Machine Translation. Explain how this problem is addressed in Phrase-based Statistical Machine Translation approaches. [20%]
c) List and explain two metrics that can be used for evaluating Machine Translation systems (either manually or automatically). Discuss the advantages of automatic evaluation metrics over manual evaluation metrics. [20%]
d) Given the two scenarios:
Scenario 1: English-Arabic language pair, 50,000 examples of translations of very short sentences, on very repetitive material (technical documentation of a product).
Scenario 2: English-French, 500,000 examples of translations for open-domain and creative texts, like novels from many different writers.
In which of these scenarios would Statistical Machine Translation work better? Why would it work better than in the other scenario? [10%]
e) Explain the main advantage of Hierarchical Phrase-based Machine Translation models over standard Phrase-based Statistical Machine Translation models. What does the phrase table of Hierarchical Phrase-based Machine Translation models look like? Given the following sentence pair and the existing phrases (in the phrase-table), which additional phrases could be generated with a Hierarchical Phrase-based Machine Translation model?
Source: shall be passing on to you some comments
Target: werde Ihnen die entsprechenden Anmerkungen aushändigen
Existing phrases:
Source                                   | Target
shall be passing on to you some comments | werde Ihnen die entsprechenden Anmerkungen aushändigen
to you some comments                     | Ihnen die entsprechenden Anmerkungen
[30%]
3. a) Differentiate subjectivity from sentiment. How are the tasks of Subjectivity Classification and Sentiment Analysis related? [10%]
b) Explain the steps involved in the lexicon-based approach to Sentiment Analysis of features in a sentence (e.g. features of a product, such as the battery of a mobile phone). Discuss the limitations of this approach. [20%]
c) Given the following sentences and opinion lexicon (adjectives only), apply the weighted lexicon-based approach to classify EACH sentence as positive, negative or objective. Show the final emotion score for each sentence. In addition to using the lexicon, make sure you consider any general rules that have an impact on the final decision. Explain these rules when they are applied. [20%]
Lexicon:
  boring    -3
  brilliant  2
  good       3
  horrible  -5
  happy      5
(S1) He is brilliant but boring.
(S2) I am not good today.
(S3) I am feeling HORRIBLE today, despite being happy with my achievement.
(S4) He is extremely brilliant but boring, boring.
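A weighted lexicon-based scorer with two illustrative rules (negation flips the next opinion word's sign; an intensifier such as "extremely" scales it) can be sketched as below. The module's exact rule set and weights may differ, so this is an assumption-laden sketch rather than the marking scheme:

```python
LEXICON = {"boring": -3, "brilliant": 2, "good": 3, "horrible": -5, "happy": 5}
NEGATORS = {"not", "never"}
INTENSIFIERS = {"extremely": 2.0, "very": 1.5}   # assumed multipliers

def score(sentence):
    total, flip, boost = 0.0, False, 1.0
    for raw in sentence.replace(",", " ").replace(".", " ").split():
        word = raw.lower()
        if word in NEGATORS:
            flip = True                       # flip the next opinion word
        elif word in INTENSIFIERS:
            boost = INTENSIFIERS[word]        # scale the next opinion word
        elif word in LEXICON:
            value = LEXICON[word] * boost
            total += -value if flip else value
            flip, boost = False, 1.0
    return total

# S1 "He is brilliant but boring." -> 2 + (-3) = -1 (negative)
print(score("He is brilliant but boring."), score("I am not good today."))
```

Further rules the question hints at, such as capitalisation as intensification (S3's "HORRIBLE") or "despite" clauses, would need additional handling.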
d) Specify the five elements of Bing Liu’s model for Sentiment Analysis, and exemplify them with respect to the following text. Identify the features present in the text, and for each indicate its sentiment value as either positive or negative. Discuss two language processing challenges in automating the identification of such elements. [30%]
“I am in love with my new Toshiba Portege z830-11j. With its i7 core processors, it is extremely fast. It is the lightest laptop I have ever had, weighting only 1 Kg. The SSD disk makes reading/writing operations very efficient. It is also very silent, the fan is hardly ever used. The only downside is the price: it is more expensive than any Mac. Lucia Specia, 10/04/2012.”
e) Differentiate direct from comparative Sentiment Analysis. What are the elements necessary in comparative models of Sentiment Analysis? [20%]
4. a) (i) Explain how the LZ77 compression method works. [30%]
(ii) Assuming the encoding representation presented in class (i.e. in the lectures of the Text Processing module), show what output would be produced by the LZ77 decoder for the following representation. Show how your answer is derived. [15%]
(0, 0, b)(0, 0, e)(2, 2, n)(4, 4, e)(1, 3, b)(2, 1, n)
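A decoder for this representation can be sketched as below, assuming (offset, length, next-char) triples with the offset counted back from the current end of the output and overlapping copies allowed; the module's exact conventions may differ slightly:

```python
def lz77_decode(triples):
    out = []
    for offset, length, ch in triples:
        start = len(out) - offset
        for i in range(length):
            # Copy character by character so overlapping copies
            # (length > offset) behave as run-length repetition.
            out.append(out[start + i])
        out.append(ch)
    return "".join(out)

msg = lz77_decode([(0, 0, "b"), (0, 0, "e"), (2, 2, "n"),
                   (4, 4, "e"), (1, 3, "b"), (2, 1, "n")])
print(msg)
```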
b) The writing script of the (fictitious) language Sinbada employs only the letters (s,i,n,b,a,d) and the symbol ~, used as a ‘space’ between words. Corpus analysis shows that the probabilities of these seven characters are as follows:
Symbol Probability
s 0.04
i 0.1
n 0.2
b 0.04
a 0.3
d 0.26
~ 0.06
(i) Sketch the algorithm for Huffman coding. Illustrate your answer by constructing a code tree for Sinbada, based on the above probabilities for its characters. [30%]
(ii) Given the code you have generated in 4(b)(i), what is the average bits-per-character rate that you could expect to achieve, if the code was used to compress a large corpus of Sinbada text? [10%]
(iii) Use your code tree to encode the message “niad ~ badasina” and show the resulting binary encoding. How does the bits-per-character rate achieved on this message compare to the rate that you calculated in 4(b)(ii)? [15%]
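The Huffman construction for Sinbada can be sketched as below: repeatedly merge the two lowest-probability nodes. Tie-breaking (and hence the exact codewords) can vary between valid trees, but the codeword lengths, and therefore the expected bits-per-character rate, do not:

```python
import heapq
from itertools import count

probs = {"s": 0.04, "i": 0.1, "n": 0.2, "b": 0.04,
         "a": 0.3, "d": 0.26, "~": 0.06}

def huffman_lengths(probs):
    tie = count()                              # deterministic tie-breaker
    heap = [(p, next(tie), {sym: 0}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        # Merging pushes every symbol in both subtrees one level deeper.
        merged = {s: d + 1 for s, d in {**left, **right}.items()}
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]                          # symbol -> codeword length

lengths = huffman_lengths(probs)
avg_bits = sum(probs[s] * lengths[s] for s in probs)
print(lengths, round(avg_bits, 2))             # expect about 2.46 bits/char
```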