Natural Language Processing COMPSCI 423/723

Natural Language ProcessingCOMPSCI 423/723 Rohit Kate

Information Retrieval Some of the slides have been adapted from Ray Mooney’s IR course at UT Austin.

Information Retrieval(IR) • The indexing and retrieval of textual documents. • Searching for pages on the World Wide Web is the most recent “killer app.” • Concerned firstly with retrieving relevantdocuments to a query. • Concerned secondly with retrieving from large sets of documents efficiently.

Document corpus Query String 1. Doc1 2. Doc2 3. Doc3 . . Ranked Documents IR System IR System

Typical IR Task • Given: • A corpus of textual natural-language documents. • A user query in the form of a textual string. • Find: • A ranked set of documents that are relevant to the query.

IR Models • An information retrieval model specifies the details of: • Document representation • Query representation • Retrieval function • Determines a notion of relevance. • Notion of relevance can be binary or continuous (i.e. ranked retrieval).

IR Models • A document is typically represented by a bag of words (unordered words with frequencies). • Bag = set that allows multiple occurrences of the same element. • User specifies a set of desired terms with optional weights: • Weighted query terms: Q = < database 0.5; text 0.8; information 0.2 > • Unweighted query terms: Q = < database; text; information >

Retrieval • Retrieval based on similarity between query and documents. • Output documents are ranked according to similarity to query.

Issues for an IR Model • How to determine important words in a document? • Word n-grams (and phrases, idioms,…)  terms • How to determine the degree of importance of a term within a document and within the entire collection? • How to determine the degree of similarity between a document and the query? • In the case of the web, what is a collection and what are the effects of links, formatting information, etc.?

The Vector-Space Model • Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary. • These “orthogonal” terms form a vector space. Dimension = t = |vocabulary| • Each term, i, in a document or query, j, is given a real-valued weight, wij. • Both documents and queries are expressed as t-dimensional vectors: dj = (w1j, w2j, …, wtj)

Term Weights: Term Frequency • More frequent terms in a document are more important, i.e. more indicative of the topic. fij = frequency of term i in document j • May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document: tfij =fij / maxi{fij}

Term Weights: Inverse Document Frequency • Terms that appear in many different documents are less indicative of overall topic. df i = document frequency of term i = number of documents containing term i idfi = inverse document frequency of term i, = log2 (N/ df i) (N: total number of documents) • An indication of a term’s discrimination power. • Log used to dampen the effect relative to tf.

TF-IDF Weighting • A typical combined term importance indicator is tf-idf weighting: wij = tfij idfi = tfijlog2 (N/ dfi) • A term occurring frequently in the document but rarely in the rest of the collection is given high weight. • Many other ways of determining term weights have been proposed. • Experimentally, tf-idf has been found to work well.

Computing TF-IDF -- An Example Given a document containing terms with given frequencies: A(3), B(2), C(1) Assume collection contains 10,000 documents and document frequencies of these terms are: A(50), B(1300), C(250) Then: A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6 B: tf = 2/3; idf = log2 (10000/1300) = 2.9; tf-idf = 2.0 C: tf = 1/3; idf = log2 (10000/250) = 5.3; tf-idf = 1.8

Query Vector • Query vector is typically treated as a document and also tf-idf weighted. • Alternative is for the user to supply weights for the given query terms.

Similarity Measure • A similarity measure is a function that computes the degree of similarity between two vectors. • Using a similarity measure between the query and each document: • It is possible to rank the retrieved documents in the order of presumed relevance. • It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.

T3 5 D1 = 2T1+ 3T2 + 5T3 Q = 0T1 + 0T2 + 2T3 2 3 T1 D2 = 3T1 + 7T2 + T3 7 T2 Graphic Representation Example: D1 = 2T1 + 3T2 + 5T3 D2 = 3T1 + 7T2 + T3 Q = 0T1 + 0T2 + 2T3 • Is D1 or D2 more similar to Q? • How to measure the degree of similarity? Distance? Angle? Projection?

Similarity Measure - Inner Product • Similarity between vectors for the document di and query q can be computed as the vector inner product (a.k.a. dot product): sim(dj,q) = dj•q = where wijis the weight of term i in document j andwiq is the weight of term i in the query

Properties of Inner Product • The inner product is unbounded. • Favors long documents with a large number of unique terms. • Measures how many terms matched but not how many terms are not matched.

Inner Product: Example D1 = 2T1 + 3T2 + 5T3 D2 = 3T1 + 7T2 + 1T3 Q = 0T1 + 0T2 + 2T3 sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10 sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2

t3 1 D1 Q 2 t1 t2 D2 Cosine Similarity Measure • Cosine similarity measures the cosine of the angle between two vectors. • Inner product normalized by the vector lengths. CosSim(dj, q) = D1 = 2T1 + 3T2 + 5T3 CosSim(D1, Q) = 10 / (4+9+25)(0+0+4) = 0.81 D2 = 3T1 + 7T2 + 1T3 CosSim(D2, Q) = 2 / (9+49+1)(0+0+4) = 0.13 Q = 0T1 + 0T2 + 2T3 D1 is 6 times better than D2 using cosine similarity but only 5 times better using inner product.

Comments on Vector Space Models • Simple, mathematically based approach. • Considers both local (tf) and global (idf) word occurrence frequencies. • Provides partial matching and ranked results. • Tends to work quite well in practice despite obvious weaknesses. • Allows efficient implementation for large document collections.

Problems with Vector Space Model • Missing semantic information (e.g. word sense). • Missing syntactic information (e.g. phrase structure, word order, proximity information). • Assumption of term independence (e.g. ignores synonomy). • Lacks control (e.g., requiring a term to appear in a document). • Given a two-term query “A B”, may prefer a document containing A frequently but not B, over a document that contains both A and B, but both less frequently.

retrieved & irrelevant Not retrieved & irrelevant Entire document collection irrelevant Relevant documents Retrieved documents retrieved & relevant not retrieved but relevant relevant retrieved not retrieved Precision and Recall

Precision and Recall • Precision • The ability to retrievetop-ranked documents that are mostly relevant. • Recall • The ability of the search to find all of the relevant items in the corpus.

Determining Recall is Difficult • Total number of relevant items is sometimes not available: • Sample across the database and perform relevance judgment on these items. • Apply different retrieval algorithms to the same database for the same query. The aggregate of relevant items is taken as the total relevant set.

Returns relevant documents but misses many useful ones too The ideal Returns most relevant documents but includes lots of junk Trade-off between Recall and Precision 1 Precision 0 1 Recall

Computing Recall/Precision Points • For a given query, produce the ranked list of retrievals. • Adjusting a threshold on this ranked list produces different sets of retrieved documents, and therefore different recall/precision measures. • Mark each document in the ranked list that is relevant according to the gold standard. • Compute a recall/precision pair for each position in the ranked list that contains a relevant document.

Computing Recall/Precision Points: Example 1 Let total # of relevant docs = 6 Check each new recall point: R=1/6=0.167; P=1/1=1 R=2/6=0.333; P=2/2=1 R=3/6=0.5; P=3/4=0.75 R=4/6=0.667; P=4/6=0.667 Missing one relevant document. Never reach 100% recall R=5/6=0.833; p=5/13=0.38

Computing Recall/Precision Points: Example 2 Let total # of relevant docs = 6 Check each new recall point: R=1/6=0.167; P=1/1=1 R=2/6=0.333; P=2/3=0.667 R=3/6=0.5; P=3/5=0.6 R=4/6=0.667; P=4/8=0.5 R=5/6=0.833; P=5/9=0.556 R=6/6=1.0; p=6/14=0.429

Compare Two or More Systems • The curve closest to the upper right-hand corner of the graph indicates the best performance

Machine Translation Some of the slides have been adapted from Ray Mooney’s NLP course at UT Austin.

Machine Translation • Automatically translate one natural language into another. Mary didn’t slap the green witch. Maria no dió una bofetada a la bruja verde.

Word Alignment • Shows mapping between words in one language and the other. Mary didn’t slap the green witch. Maria no dió una bofetada a la bruja verde.

Translation Quality • Achieving literary quality translation is very difficult. • Existing MT systems can generate rough translations that frequently at least convey the gist of a document. • High quality translations possible when specialized to narrow domains, e.g. weather forcasts. • Some MT systems used in computer-aided translation in which a bilingual human post-edits the output to produce more readable accurate translations. • Frequently used to aid localization of software interfaces and documentation to adapt them to other languages.

Linguistic Issues Making MT Difficult • Morphological issues with complex word structure. • Syntactic variation between SVO (e.g. English), SOV (e.g. Hindi), and VSO (e.g. Arabic) languages. • SVO languages use prepositions • SOV languages use postpositions • Pro-drop languages regularly omit subjects that must be inferred.

Lexical Gaps • Some words in one language do not have a corresponding term in the other. • Rivière (river that flows into ocean) and fleuve (river that does not flow into ocean) in French • Schedenfraude (feeling good about another’s pain) in German. • Oyakoko (filial piety) in Japanese

Ambiguity and MachineTranslation • Sometimes syntactic and semantic ambiguities must be properly resolved for correct translation: • “John plays the guitar.” → “John toca la guitarra.” • “John plays soccer.” → “John juega el fútbol.” • Sometimes it’s best to preserve ambiguity: • “I saw the man on the hill with a telescope” is ambiguous in French as well

An Apocryphal Story • An apocryphal story is that an early MT system gave the following results when translating from English to Russian and then back to English: • “The spirit is willing but the flesh is weak.”  “The vodka is good but the meat is rotten.” • “Out of sight, out of mind.”  “Invisible idiot.”

Vauquois Triangle Interlingua Semantic Parsing Increasing depth of analysis Semantic Transfer Semantic structure Semantic structure Increasing amount of transfer knowledge SRL & WSD Syntactic structure Syntactic structure Syntactic Transfer parsing Direct translation Words Words Target Language Source Language

Direct Transfer • Morphological Analysis • Mary didn’t slap the green witch. → Mary DO:PAST not slap the green witch. • Lexical Transfer • Mary DO:PAST not slap the green witch. • Maria no dar:PAST una bofetada a la verde bruja. • Lexical Reordering • Maria no dar:PAST una bofetada a la bruja verde. • Morphological generation • Maria no dió una bofetada a la bruja verde.

Syntactic Transfer • Simple lexical reordering does not adequately handle more dramatic reordering such as that required to translate from an SVO to an SOV language. • Need syntactic transfer rules that map parse tree for one language into one for another. • English to Spanish: • NP → Adj Nom  NP → Nom ADJ • English to Japanese: • VP → V NP  VP → NP V • PP → P NP  PP → NP P

Semantic Transfer • Some transfer requires semantic information. • Semantic roles can determine how to properly express information in another language. • In Chinese, PPs that express a goal, destination, or benefactor occur before the verb but those expressing a recipient occur after the verb. • Transfer Rule • English to Chinese • VP → V PP[+benefactor]  VP → PP[+benefactor] V

Interlingua • Ideally the best way of doing MT • No need to have transfer rules between every pair of languages • Have a way to translate every language into a “meaning representation” and a way translate a “meaning representation” into every language • Huge savings in a multilingual environments • If N languages then instead of NxN transformation mechanisms only 2N transformation mechanisms needed • May lead to unnecessary disambiguation and too detailed semantic analysis than needed • Difficult to come up with one general “meaning representation”, but works for sublanguage domains (like weather forecasting)

Statistical MT • Manually encoding comprehensive bilingual lexicons and transfer rules is difficult. • SMT acquires knowledge needed for translation from a parallel corpus or bitext that contains the same set of documents in two languages. • The Canadian Hansards (parliamentary proceedings in French and English) is a well-known parallel corpus. • First align the sentences in the corpus based on simple methods that use coarse cues like sentence length to give bilingual sentence pairs.

Picking a Good Translation • A good translation should be faithful and correctly convey the information and tone of the original source sentence. • A good translation should also be fluent, grammatically well structured and readable in the target language. • Final objective:

Noisy Channel Model • Based on analogy to information-theoretic model used to decode messages transmitted via a communication channel that adds errors. • Assume that source sentence was generated by a “noisy” transformation of some target language sentence and then use Bayesian analysis to recover the most likely target sentence that generated it. Translate foreign language sentenceF=f1, f2, …fm to an English sentence Ȇ = e1, e2, …eI that maximizes P(E | F)

Bayesian Analysis of Noisy Channel Translation Model Language Model A decoder determines the most probable translation ȆgivenF

Language Model • Use a standard n-gram language model for P(E). • Can be trained on a large, unsupervised mono-lingual corpus for the target language E. • Could use a more sophisticated PCFG language model to capture long-distance dependencies. • Terabytes of web data have been used to build a large 5-gram model of English.

Word Alignment • Directly constructing phrase alignments is difficult, so rely on first constructing word alignments. • Can learn to align from supervised word alignments, but human-aligned bitexts are rare and expensive to construct. • Typically use an unsupervised EM-based approach to compute a word alignment from unannotated parallel corpus.

Natural Language Processing COMPSCI 423/723