COMP 791A: Statistical Language Processing

Corpus-Based Work (Chap. 4)

Using a Corpus • To approximate the probability distribution of language events, we use a training corpus • Statistical NLP seeks to automatically learn lexical and structural preferences from corpora.
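As a minimal sketch of what "approximating the probability distribution from a training corpus" can mean in the simplest case, the snippet below estimates word probabilities by relative frequency. The toy corpus and the function name `unigram_probabilities` are illustrative assumptions, not part of the course material.

```python
from collections import Counter


def unigram_probabilities(tokens):
    """Relative-frequency estimate of P(word) from a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Hypothetical toy training corpus, split naively on whitespace.
corpus = "the dog saw the cat and the cat saw the dog".split()
probs = unigram_probabilities(corpus)

# "the" occurs 4 times among 11 tokens.
print(probs["the"])
```

With a corpus this small the estimates are unreliable, which is one way to see why the corpus needs to be large and representative.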

Corpus • Large database of text & speech • Many types of text corpora exist • plain text, domain-specific, tagged, parsed, parallel bilingual, … • Major suppliers: • Linguistic Data Consortium (LDC) -- www.ldc.upenn.edu • European Language Resources Association (ELRA) -- www.icp.grnet.fr/ELRA • To derive the needed probabilities, a corpus needs to be: • large • a representative sample of the population of interest

Low-Level Formatting Issues • Junk formatting & content • Removal of typesetter codes (ex. HTML tags), diagrams, tables, foreign words etc. • Also other problems if data was retrieved through OCR (unrecognized words) • Uppercase and Lowercase • should we keep the case or not? • “the”, “The” and “THE” should all be treated the same? • but in “George Brown” and “brown dog”, “brown” should be treated separately…

Finding Tokens and Sentences • Tokenization • divide the input text into units (called tokens) • each token is either a word or something else (ex. a number or a punctuation mark) • Mark sentence boundaries • Most sentences end with ‘.’, ‘?’ or ‘!’ • Can be confused by abbreviations

Tokenization -- What is a word? • Graphic word (Kučera and Francis, 1967): • “A string of contiguous alphanumeric characters with white spaces on either side; may include hyphens and apostrophes, but no other punctuation marks” • But what about: “$22.50” “C++” “:-)” • Main problems: • Periods • Abbreviation or end of sentence? • “etc.” “Calif.” “Wash.” • Is the period part of the word or not? • “Wash.” (Washington) needs to keep its period to distinguish it from “wash” (the verb) • Single apostrophes • Part of the word or not? • “Peter’s sick” --> 1 word? or 2 words? • If 1 word, then problems in parsing… S --> NP VP • If 2 words, then should “Peter’s” in “Peter’s house” also be split into 2 words?
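The graphic-word definition quoted above can be checked mechanically. The sketch below is one possible reading of it: a whitespace-delimited chunk counts as a graphic word if it contains at least one alphanumeric character and nothing else except hyphens and apostrophes. The helper names are hypothetical.

```python
def is_graphic_word(chunk):
    """True if chunk is alphanumeric, optionally with hyphens/apostrophes."""
    allowed_punct = "-'’"
    has_alnum = any(ch.isalnum() for ch in chunk)
    only_allowed = all(ch.isalnum() or ch in allowed_punct for ch in chunk)
    return has_alnum and only_allowed


def split_graphic_words(text):
    """Split on whitespace and separate graphic words from everything else."""
    accepted, rejected = [], []
    for chunk in text.split():
        (accepted if is_graphic_word(chunk) else rejected).append(chunk)
    return accepted, rejected


accepted, rejected = split_graphic_words("Peter's e-mail cost $22.50 in C++ :-)")
print(accepted)  # chunks that satisfy the definition
print(rejected)  # chunks the definition does not cover
```

On this input the problem cases from the slide (“$22.50”, “C++”, “:-)”) all land in `rejected`, which shows why the definition alone is not enough for a practical tokenizer.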

Tokenization -- What is a word? (con’t) • Hyphens • Inserted at a line break to improve text justification, or part of the word? • Ex: “e-mail” “pro-life” “data-base”/“database”/“data base” • Diacritics • Remove them? • Homographs • Should we distinguish 2 words that have the same spelling but unrelated senses? • “Bow”: part of a ship / a knot of ribbon • “Saw”: instrument / past tense of “to see” • Word segmentation in other languages: • Some languages do not use whitespace between words • Ex: East-Asian languages • In German: “life insurance company employee” = “Lebensversicherungsgesellschaftsangestellter”
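For the diacritics question above, one common way to remove them (if that is the chosen policy) is Unicode decomposition followed by dropping the combining marks. This is a sketch using Python's standard `unicodedata` module; the function name `strip_diacritics` is an assumption.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("Kučera"))
print(strip_diacritics("résumé"))
```

The second call illustrates the cost of this normalization: “résumé” becomes indistinguishable from the verb “resume”, so removing diacritics can create new homographs.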

Tokenization -- What is a word? (con’t) • Whitespace does not always indicate a word break • Ex: Do we really want to split phrases such as • “in spite of” • “as a matter of fact” • “work out” • If not, then what do we do with non-adjacent phrasal verbs? • “I could not work the answer out” • Variant forms of some semantic types • Ex. Telephone numbers • (514) 848-3074 • +1 514 848 3074 • +1 (514) 848 3074 • Speech corpora • More contractions, fillers (ex. “Um” “well” “euh”)
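The telephone-number variants above can be mapped to a single canonical form once they have been recognized as one token. The sketch below keeps only the digits and, as a labeled assumption, treats a 10-digit number as North American and prepends the country code 1. The function name `normalize_phone` is hypothetical.

```python
import re


def normalize_phone(text):
    """Reduce a phone number string to the form +<digits>."""
    digits = re.sub(r"\D", "", text)  # drop everything that is not a digit
    if len(digits) == 10:
        # Assumption: no country code given, so treat it as North American.
        digits = "1" + digits
    return "+" + digits


variants = ["(514) 848-3074", "+1 514 848 3074", "+1 (514) 848 3074"]
print({normalize_phone(v) for v in variants})
```

All three variants collapse to the same string, so they can be counted as one event in the corpus.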

Tokenization -- Lemmatizer • What about morphological variants? • Should “give”, “gives”, “given”, “giver”… be considered different words? • Goal: “normalize” similar words • Two main approaches: • Stemming • Morphological Analysis

Stemming • Very “dumb” rules work well (for English and Romance languages) • Ex: the Porter stemmer • Strips off affixes and leaves the stem • give --> give, gives --> give + s, given --> give + en, … • uses simple rules such as: • IF word ends with “ies” but not with “eies” or “aies” THEN replace “ies” by “y” • IF word ends with “es” but not with “aes”, “ees” or “oes” THEN replace “es” by “e” • IF word ends with “s” but not with “us” or “ss” THEN remove “s” • the first applicable rule is applied • Advantage: Fast • Disadvantages: • Rules depend on the language • Unreadable results: • Ex: “computers”, “computation”, “computational” --> “comput” • May reduce different words to the same stem although they are actually distinct • stocks --> stock • stockings --> stock • arm --> arm • army --> arm • organization --> organ • university --> universe
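The three IF/THEN rules listed above can be sketched directly. This covers only those three plural-stripping rules, not the full Porter algorithm, which has several further steps. It also assumes that "applicable" means the suffix matches and none of the exceptions do; the names `RULES` and `s_stem` are illustrative.

```python
# (suffix, exception endings, replacement), tried in this order.
RULES = [
    ("ies", ("eies", "aies"), "y"),
    ("es", ("aes", "ees", "oes"), "e"),
    ("s", ("us", "ss"), ""),
]


def s_stem(word):
    """Apply the first applicable rule; leave the word unchanged otherwise."""
    for suffix, exceptions, replacement in RULES:
        if word.endswith(suffix) and not word.endswith(exceptions):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["ponies", "files", "gives", "stocks", "trees", "glass", "corpus"]:
    print(w, "-->", s_stem(w))
```

Note how “trees” skips the second rule because of the “ees” exception and falls through to the third, while “glass” and “corpus” are protected by the “ss” and “us” exceptions.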

Morphological Analyzer • Apply morphological rules • (XXXes,V) --> (XXXe,V) • (XXXes,N) --> (XXXe,N) • files --> (file,N) (file,V) • Check that (file,N) and (file,V) are in the dictionary • Advantages: • Identifies the root, which is an actual word • Fewer errors than stemming • Disadvantage: • More complex and slower than stemming
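The rule-then-dictionary-check procedure above can be sketched as follows. The tiny lexicon is a hypothetical stand-in for a real dictionary, and only the two “XXXes” rules from the slide are encoded.

```python
# Hypothetical toy lexicon of (root, part of speech) pairs.
LEXICON = {("file", "N"), ("file", "V"), ("give", "V"), ("house", "N")}

# (suffix to strip, replacement, part of speech), from the two rules above.
RULES = [("es", "e", "V"), ("es", "e", "N")]


def analyze(word):
    """Return every (root, POS) analysis that the lexicon confirms."""
    analyses = []
    for suffix, replacement, pos in RULES:
        if word.endswith(suffix):
            candidate = (word[: -len(suffix)] + replacement, pos)
            if candidate in LEXICON:  # keep only roots that are real words
                analyses.append(candidate)
    return analyses


print(analyze("files"))
print(analyze("gives"))
print(analyze("buses"))
```

The dictionary check is what separates this from stemming: “buses” yields no analysis because the candidate root “buse” is not in the lexicon, whereas a stemmer would return it anyway.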

Sentences: What is a sentence? • Something ending with a ‘.’, ‘?’ or ‘!’ • True in 90% of the cases • But sentences can be split up by other punctuation marks or quotes • Ex: nested phrases: • “You remind me,” she remarked, “of your mother.” • We usually use heuristic methods • But hand-coded heuristics… • Some effort to use statistical methods for sentence-boundary detection • Typical classification problem… • Classify a period as an end-of-sentence marker or not • Use features such as case, length, POS tag of words preceding the period,… • Use decision trees, neural networks… • Some techniques reach 98-99% correct classification
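A hand-coded heuristic of the kind mentioned above might look like the sketch below: a token ending in ‘.’, ‘?’ or ‘!’ closes a sentence unless it is a known abbreviation or the next token is not capitalized. The abbreviation list and the function name are assumptions made for illustration; this is not a statistical classifier.

```python
# Hypothetical abbreviation list (lowercased, with the trailing period).
ABBREVIATIONS = {"etc.", "calif.", "wash.", "dr.", "mr."}


def split_sentences(text):
    """Split text into sentences with a simple punctuation/case heuristic."""
    sentences, current = [], []
    tokens = text.split()
    for i, token in enumerate(tokens):
        current.append(token)
        ends_with_mark = token.endswith((".", "?", "!"))
        is_abbreviation = token.lower() in ABBREVIATIONS
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        next_is_capitalized = next_token is None or next_token[0].isupper()
        if ends_with_mark and not is_abbreviation and next_is_capitalized:
            sentences.append(" ".join(current))
            current = []
    if current:  # flush any trailing material without a final mark
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("He moved to Calif. in May. Is it far? Yes!"))
print(split_sentences("She lives in Wash. Her sister does too."))
```

The second call shows the weakness of such rules: an abbreviation that also ends a sentence is never split, so both sentences come back as one. Cases like this are what motivate treating boundary detection as a classification problem.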

Marked-Up Data: Mark-up Schemes • Schemes developed to mark up the structure of text • Different Mark-up schemes: • COCOA format • older, and rather ad-hoc • SGML • And other related encodings: HTML, XML

Example • Input text: Bogota, 9 jan 90 (EFE) -- [text] Ricardo Alfonso Castellar, mayor of Achi, in the northern department of Bolivar, who was kidnapped on 5 January, apparently by army of national liberation (ELN) guerrillas, was found dead today, according to authorities. Castellar was kidnapped on 5 january on the outskirts of Achi, about 850 km north of Bogota, by a group of armed men, who forced him to accompany them to an undisclosed location. Police sources in Cartagena reported that Castellar's body showed signs of torture and several bullet wounds. Castellar was attacked by ELN guerrillas while he was traveling in a boat down the Cauca river to the tenche area, a region within his jurisdiction. In Cartagena it was reported that Castellar faced a “revolutionary trial” by the ELN and that he was found guilty and executed.

Example (con’t) • Text with named entity tags: <ENAMEX TYPE= LOCATION> Bogota </ENAMEX>,<TIMEX TYPE= DATE> 9 jan 90 </TIMEX> ( <ENAMEX TYPE= ORGANIZATION> EFE </ENAMEX>) -- [text] <ENAMEX TYPE= PERSON> Ricardo Alfonso Castellar </ENAMEX>, mayor of <ENAMEX TYPE= LOCATION> Achi </ENAMEX>, in the northern department of <ENAMEX TYPE= LOCATION> Bolivar </ENAMEX>, who was kidnapped on <TIMEX TYPE= DATE>5 January</TIMEX>, apparently by <ENAMEX TYPE= ORGANIZATION> army of national liberation (ELN) </ENAMEX> guerrillas, was found dead today, according to authorities. <ENAMEX TYPE= PERSON> Castellar </ENAMEX> was kidnapped on <TIMEX TYPE= DATE> 5 january </TIMEX>on the outskirts of <ENAMEX TYPE= LOCATION> Achi </ENAMEX>, about 850 km north of <ENAMEX TYPE= LOCATION> Bogota </ENAMEX>, by a group of armed men, who forced him to accompany them to an undisclosed location. ...
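Once text carries named-entity tags like those above, the entities can be pulled out with a small pattern. The sketch below assumes flat (non-nested) ENAMEX/TIMEX elements and tolerates the unquoted `TYPE= LOCATION` spelling used on the slide; the regular expression and function name are illustrative, not an official parser for this format.

```python
import re

# Matches <ENAMEX TYPE= X> ... </ENAMEX> and <TIMEX TYPE= X> ... </TIMEX>,
# with or without quotes around the type value.
TAG = re.compile(r'<(ENAMEX|TIMEX)\s+TYPE=\s*"?(\w+)"?\s*>(.*?)</\1>', re.DOTALL)


def extract_entities(marked_up):
    """Return a list of (entity type, entity text) pairs in document order."""
    return [(m.group(2), m.group(3).strip()) for m in TAG.finditer(marked_up)]


sample = (
    "<ENAMEX TYPE= LOCATION> Bogota </ENAMEX>,"
    "<TIMEX TYPE= DATE> 9 jan 90 </TIMEX> "
    "( <ENAMEX TYPE= ORGANIZATION> EFE </ENAMEX>)"
)
print(extract_entities(sample))
```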

Example (con’t) • Text with coreference tags: Bogota, 9 jan 90 (EFE) -- [text] <COREF ID="1" MIN="Ricardo Alfonso Castellar "> Ricardo Alfonso Castellar </COREF>,<COREF ID= "2" MIN=" mayor" REF="1" TYPE="IDENT"> mayor of Achi </COREF>, in the northern department of Bolivar, who was kidnapped on 5 January, apparently by <COREF ID= "3" MIN="army "> army of national liberation (ELN) guerrillas </COREF>, was found dead today, according to authorities. <COREF ID="4" MIN="Castellar" REF="1" TYPE="IDENT"> Castellar </COREF> was kidnapped on 5 january on the outskirts of Achi, about 850 km north of Bogota, by <COREF ID="5" MIN= "group" REF="3" TYPE="IDENT"> a group of armed men </COREF>, who forced <COREF ID="6" MIN="him" REF="1" TYPE="IDENT"> him </COREF>...
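The ID and REF attributes above link each mention to an earlier one, so mentions of the same entity form a chain. The sketch below groups mentions by following REF links back to the first mention. It assumes flat, non-nested COREF elements and no circular references; the names are hypothetical.

```python
import re
from collections import defaultdict

COREF = re.compile(r"<COREF\s+([^>]*)>(.*?)</COREF>", re.DOTALL)
ATTR = re.compile(r'(\w+)=\s*"([^"]*)"')


def coreference_chains(marked_up):
    """Group mention texts by the ID of the first mention they refer back to."""
    mentions, refers_to = {}, {}
    for m in COREF.finditer(marked_up):
        attrs = dict(ATTR.findall(m.group(1)))
        mentions[attrs["ID"]] = m.group(2).strip()
        if "REF" in attrs:
            refers_to[attrs["ID"]] = attrs["REF"]

    def root(mention_id):
        # Follow REF links until reaching a mention with no antecedent.
        while mention_id in refers_to:
            mention_id = refers_to[mention_id]
        return mention_id

    chains = defaultdict(list)
    for mention_id, text in mentions.items():
        chains[root(mention_id)].append(text)
    return dict(chains)


sample = (
    '<COREF ID="1" MIN="Ricardo Alfonso Castellar "> Ricardo Alfonso Castellar </COREF>,'
    '<COREF ID= "2" MIN=" mayor" REF="1" TYPE="IDENT"> mayor of Achi </COREF>, was found dead. '
    '<COREF ID="4" MIN="Castellar" REF="1" TYPE="IDENT"> Castellar </COREF> was kidnapped'
)
print(coreference_chains(sample))
```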

Example (con’t) • Interpretation of coreference tags: Bogota, 9 jan 90 (EFE) -- [text] Ricardo Alfonso Castellar, mayor of Achi, in the northern department of Bolivar, who was kidnapped on 5 January, apparently by army of national liberation (ELN) guerrillas, was found dead today, according to authorities. Castellar was kidnapped on 5 january on the outskirts of Achi, about 850 km north of Bogota, by a group of armed men, who forced him to accompany them to an undisclosed location. Police sources in Cartagena reported that Castellar's body showed signs of torture and several bullet wounds. Castellar was attacked by ELN guerrillas while he was traveling in a boat down the Cauca river to the tenche area, a region within his jurisdiction. In Cartagena it was reported that Castellar faced a “revolutionary trial” by the ELN and that he was found guilty and executed.

Marked-Up Data: Grammatical Coding • to indicate the various parts of speech of tokens • Different Tag Sets have been used • Brown Tag Set: 87/179 tags • Penn Treebank (most used): 45 tags • London-Lund: 197 tags • CLAWS1: 132 tags • CLAWS2: 166 tags • CLAWS C5: 62 tags

The design of a Tag Set • Target feature (classification): • Tags are used to tell (the user) useful information about the grammatical class of a word • Predictive feature: • Tags are used (by the system) to predict the behavior of other words in the context • Example, in the Brown tag set: • VBG: verb, present participle • But a word tagged VBG can be used as a gerund or as a noun • Gerund: “While purchasing/VBG a gift, I noticed that I was out of money.” • Noun: “Concordia’s purchasing/VBG? department is closed.” • 2 conflicting goals: • splitting a tag improves prediction but makes classification harder
