
How to add a new language on the NLP map: Tools and resources you can build — Daniela GÎFU

Learn how to construct basic resources and tools for a new language, including language models, unsupervised syntactic analysis, and clustering of similar entities.



Presentation Transcript


  1. “ALEXANDRU IOAN CUZA” UNIVERSITY OF IAŞI, FACULTY OF COMPUTER SCIENCE How to add a new language on the NLP map: Tools and resources you can build Daniela GÎFU http://profs.info.uaic.ro/~daniela.gifu/

  2. 1. Monolingual NLP – Building resources and tools for a new language • Construction of basic resources and tools, starting with a corpus: • language models • unsupervised syntactic analysis – POS tagging • clustering of similar entities – words, phrases, texts

  3. Part-of-speech Tagging • Probabilistic methods • (minimally) supervised • unsupervised

  4. What are Parts-of-speech? • POS aka grammatical categories, syntactic tags, POS tags, word classes, … • Syntactic categories that words belong to: N, V, Adj, Adv, Prep, Aux, … • POS classes: open / closed • Categories: lexical / functional

  5. POS Examples • Open class: N noun (baby, toy) • V verb (see, kiss) • ADJ adjective (tall, grateful, alleged) • ADV adverb (quickly, frankly, …) • Closed class: P preposition (in, on, near) • DET determiner (the, a, that) • WhPron wh-pronoun (who, what, which, …) • COORD coordinator (and, or)

  6. Substitution Test • Two words belong to the same category if replacing one with another does not change the grammaticality of a sentence. • The _____ is angry. • The ____ dog is angry. • Fifi ____ . • Fifi ____ the book.

  7. POS Tags • There is no standard set of POS tags • Some use coarse classes: e.g., N, V, A, Aux, …. • Others prefer finer distinctions (e.g., Penn Treebank): • PRP: personal pronouns (you, me, she, he, them, him, her, …) • PRP$: possessive pronouns (my, our, her, his, …) • NN: singular common nouns (sky, door, theorem, …) • NNS: plural common nouns (doors, theorems, women, …) • NNP: singular proper names (Fifi, IBM, Canada, …) • NNPS: plural proper names (Americas, Carolinas, …)

  8. [Figure: examples of PRP and PRP$ usage]

  9. Part of Speech Tagging • Words often have more than one POS: back • The back door • On my back • Win the voters back • Promised to back the bill • The POS tagging problem is to determine the POS tag for a particular instance of a word.

  10. POS Ambiguity (in the Brown Corpus) • Unambiguous (1 tag): 35,340 word types • Ambiguous (2–7 tags): 4,100 word types (DeRose, 1988)

  11. Applications of POS Tagging • text-to-speech (how do we pronounce “lead”?) • can write regexps like Det Adj* N* over the output • preprocessing to speed up parser (but a little dangerous) • if you know the tag, you can back off to it in other tasks • Back-off: trim the info you know at that point

  12. POS Tagging Examples • Input: the lead paint is unsafe • Output: the/Det lead/N paint/N is/V unsafe/Adj • Can be challenging: • I know that • I know that block • I know that blocks the sun • new words (OOV= out of vocabulary); words can be whole phrases (“I can’t believe it’s not butter”)

  13. Why POS Tagging? • The “simplest” case of recovering an underlying form (tags) from a surface form (words) by statistical means • We model p(word sequence, tag sequence) • The tags are hidden, but we see the words • Is tag sequence X likely with these words?

  14. Current Performance • How many tags are correct? • About 97% currently – fully supervised • But baseline is already 85-90% • Baseline algorithm: • Tag every word with its most frequent tag • Tag unknown words as nouns • How well do people do?
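The baseline algorithm described above can be sketched in a few lines. A minimal illustration; the tiny tagged training set here is invented, not from the slides:

```python
from collections import Counter, defaultdict

# Hypothetical miniature tagged corpus: (word, tag) pairs.
train = [("the", "DT"), ("back", "NN"), ("door", "NN"),
         ("win", "VB"), ("the", "DT"), ("voters", "NNS"), ("back", "RB"),
         ("the", "DT"), ("back", "NN")]

# Count how often each word carries each tag.
tag_counts = defaultdict(Counter)
for word, tag in train:
    tag_counts[word][tag] += 1

def baseline_tag(word):
    """Most-frequent-tag baseline; unknown words default to NN (noun)."""
    if word in tag_counts:
        return tag_counts[word].most_common(1)[0][0]
    return "NN"

print(baseline_tag("back"))   # "back" was seen twice as NN, once as RB
print(baseline_tag("xyzzy"))  # unseen word falls back to NN
```

Despite its simplicity, this is the 85–90% baseline that any HMM tagger must beat.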

  15. What Information to Use in Tagging? • Each unknown tag is constrained by its word and by the tags to its immediate left and right. • But those tags are unknown too … PN Verb Det Noun Prep Noun Prep Det Noun Bill directed a cortege of autos through the dunes PN Adj Det Noun Prep Noun Prep Det Noun Verb Verb Noun Verb

  16. POS Tagging with HMM S1 S2 S3 S4 S5

  17. Building HMM for POS Tagging • Supervised training: • Use a tagged corpus to obtain transition and emission probabilities • Unsupervised training: • Use a plain-text corpus and a lexicon to obtain transition and emission probabilities by applying the EM algorithm. • Co-training: partly supervised • Training with a small tagged corpus and a large plain-text corpus.

  18. Training with Tagged Corpus • Pierre NNP Vinken NNP , , 61 CD years NNS old JJ , , will MD join VB the DT board NN as IN a DT nonexecutive JJ director NN Nov. NNP 29 CD . . • Mr. NNP Vinken NNP is VBZ chairman NN of IN Elsevier NNP N.V. NNP , , the DT Dutch NNP publishing VBG group NN . . • Rudolph NNP Agnew NNP , , 55 CD years NNS old JJ and CC former JJ chairman NN of IN Consolidated NNP Gold NNP Fields NNP PLC NNP , , was VBD named VBN a DT nonexecutive JJ director NN of IN this DT British JJ industrial JJ conglomerate NN . . c(JJ)=7 c(JJ, NN)=4, P(NN|JJ)=4/7
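The count-and-divide estimation on this slide can be reproduced directly. A minimal sketch; the corpus is exactly the three tagged sentences above:

```python
from collections import Counter

# The three tagged sentences from the slide, as word/TAG tokens.
corpus = """Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB
the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
Mr./NNP Vinken/NNP is/VBZ chairman/NN of/IN Elsevier/NNP N.V./NNP ,/,
the/DT Dutch/NNP publishing/VBG group/NN ./.
Rudolph/NNP Agnew/NNP ,/, 55/CD years/NNS old/JJ and/CC former/JJ chairman/NN
of/IN Consolidated/NNP Gold/NNP Fields/NNP PLC/NNP ,/, was/VBD named/VBN
a/DT nonexecutive/JJ director/NN of/IN this/DT British/JJ industrial/JJ
conglomerate/NN ./."""

# Strip the words, keep the tag stream, and count unigrams and bigrams.
tags = [tok.rsplit("/", 1)[1] for tok in corpus.split()]
unigrams = Counter(tags)
bigrams = Counter(zip(tags, tags[1:]))

# Maximum-likelihood transition estimate: P(NN|JJ) = c(JJ, NN) / c(JJ)
p = bigrams[("JJ", "NN")] / unigrams["JJ"]
print(unigrams["JJ"], bigrams[("JJ", "NN")], p)  # 7 4 0.5714...
```

This recovers the slide's c(JJ)=7, c(JJ, NN)=4, P(NN|JJ)=4/7.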

  19. Estimation of Probabilities • Transition Probability • Emission Probability • Smoothing • Dealing with unknown words

  20. Viterbi Example: Using HMM for Part-of-Speech Tagging • Example: The/DT students/N went/V to/P class/N • Why POS tagging is difficult: • words often have more than one word class • ambiguity: Are the fish biting today? (fish = noun) vs. They fish for pearls in the lake. (fish = verb)

  21. Viterbi Example: Using HMM for Part-of-Speech Tagging • Assumption: a word depends probabilistically only on its own part of speech (tag), which in turn depends on the tag of the preceding word

  22. Viterbi Example: Using HMM for Part-of-Speech Tagging • M: number of POS tag types (N, V, P) • V: vocabulary size (the alphabet corresponds to the English vocabulary) • S: states represent POS tags • X: observations are the words • Goal: calculate the most likely sequence of POS tags (states) that produced the observations, e.g., “fish sleep”

  23. Viterbi Algorithm • The Viterbi algorithm computes the most likely tag sequence in O(W × T²) time, where T is the number of possible part-of-speech tags and W is the number of words in the sentence. • The algorithm sweeps through all tag possibilities for each word, computing the best sequence leading to each possibility. The key to its efficiency is that, because of the Markov assumption, we only need the best sequences leading to the previous word.

  24. Computing the Probability of a Sentence and Tags • We want the sequence of tags that maximizes P(T1…Tn | w1…wn), which can be estimated as the product ∏i P(Ti | Ti−1) × P(wi | Ti) • P(Ti | Ti−1) is computed by multiplying the arc values in the HMM • P(wi | Ti) is computed by multiplying the lexical generation probabilities associated with each word
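As an illustration, the product of transition and emission probabilities can be evaluated directly for the small fish/sleep model used later in the deck (transition and emission values taken from slides 27–28):

```python
# Transition and emission tables for the two-word fish/sleep model
# (Start/Noun/Verb/End states), as given later in the deck.
trans = {("Start", "Noun"): 0.8, ("Start", "Verb"): 0.2,
         ("Noun", "Noun"): 0.1, ("Noun", "Verb"): 0.8, ("Noun", "End"): 0.1,
         ("Verb", "Noun"): 0.2, ("Verb", "Verb"): 0.1, ("Verb", "End"): 0.7}
emit = {("Noun", "fish"): 0.8, ("Noun", "sleep"): 0.2,
        ("Verb", "fish"): 0.5, ("Verb", "sleep"): 0.6}

def joint_prob(tags, words):
    """P(tags, words) = prod of P(T_i | T_{i-1}) * P(w_i | T_i), Start ... End."""
    p = 1.0
    prev = "Start"
    for t, w in zip(tags, words):
        p *= trans[(prev, t)] * emit[(t, w)]
        prev = t
    return p * trans[(prev, "End")]

print(joint_prob(["Noun", "Verb"], ["fish", "sleep"]))  # ≈ 0.21504
print(joint_prob(["Noun", "Noun"], ["fish", "sleep"]))  # ≈ 0.00128
```

Comparing the two products shows why Noun–Verb is the preferred tagging for “fish sleep”.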

  25. The Viterbi Algorithm • Let T = number of part-of-speech tags, W = number of words in the sentence • /* Initialization Step */ • for t = 1 to T • Score(t, 1) = Pr(Word_1 | Tag_t) × Pr(Tag_t | φ) • BackPtr(t, 1) = 0 • /* Iteration Step */ • for w = 2 to W • for t = 1 to T • Score(t, w) = Pr(Word_w | Tag_t) × MAX_{j=1..T} (Score(j, w−1) × Pr(Tag_t | Tag_j)) • BackPtr(t, w) = index of the j that gave the max

  26. The Viterbi Algorithm • /* Sequence Identification */ • Seq(W) = t that maximizes Score(t, W) • for w = W−1 down to 1 • Seq(w) = BackPtr(Seq(w+1), w+1)

  27. Simple POS HMM • States: Start, Noun, Verb, End • Transition probabilities: Start→Noun 0.8, Start→Verb 0.2, Noun→Noun 0.1, Noun→Verb 0.8, Noun→End 0.1, Verb→Noun 0.2, Verb→Verb 0.1, Verb→End 0.7 • www.cs.nyu.edu/courses/spring03/G22.2590-001/Viterbi.ppt

  28. Word Emission Probabilities P(word | state) • A two-word language: “fish” and “sleep” • Noun: fish 0.8, sleep 0.2 • Verb: fish 0.5, sleep 0.6

  29. [Figure: trellis with states Start, Noun, Verb, End and observations fish, sleep for the sequence “fish sleep”]

  30. Sequence: fish sleep • Viterbi scores: • Start: 1 • Noun: fish 0.64, sleep 0.0128 • Verb: fish 0.10, sleep 0.3072 • End: 0.21504 • Decode: fish → NOUN, sleep → VERB
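The Viterbi pseudocode from slides 25–26 can be turned into a short runnable sketch that reproduces these scores; the state names and probability tables are those of the fish/sleep model from slides 27–28:

```python
def viterbi(words, states, trans, emit):
    """Best tag sequence in O(W * T^2), following the pseudocode of slides 25-26."""
    # Initialization: Score(t, 1) = P(word_1 | t) * P(t | Start)
    score = [{t: trans[("Start", t)] * emit[(t, words[0])] for t in states}]
    back = [{t: None for t in states}]
    # Iteration: extend the best path into each tag, word by word.
    for w in words[1:]:
        prev = score[-1]
        col, ptr = {}, {}
        for t in states:
            best = max(states, key=lambda j: prev[j] * trans[(j, t)])
            col[t] = emit[(t, w)] * prev[best] * trans[(best, t)]
            ptr[t] = best
        score.append(col)
        back.append(ptr)
    # Termination: transition into End, then follow back-pointers.
    last = max(states, key=lambda t: score[-1][t] * trans[(t, "End")])
    final = score[-1][last] * trans[(last, "End")]
    tags = [last]
    for ptr in reversed(back[1:]):
        tags.append(ptr[tags[-1]])
    return list(reversed(tags)), final

trans = {("Start", "Noun"): 0.8, ("Start", "Verb"): 0.2,
         ("Noun", "Noun"): 0.1, ("Noun", "Verb"): 0.8, ("Noun", "End"): 0.1,
         ("Verb", "Noun"): 0.2, ("Verb", "Verb"): 0.1, ("Verb", "End"): 0.7}
emit = {("Noun", "fish"): 0.8, ("Noun", "sleep"): 0.2,
        ("Verb", "fish"): 0.5, ("Verb", "sleep"): 0.6}

tags, p = viterbi(["fish", "sleep"], ["Noun", "Verb"], trans, emit)
print(tags, p)  # ['Noun', 'Verb'], ≈ 0.21504
```

The intermediate columns match the trellis on this slide: Noun 0.64 → 0.0128, Verb 0.10 → 0.3072, End 0.21504.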

  31. Unsupervised Training • Suppose we have a corpus with the following two sentences, where some of the words have more than one possible part of speech. Can we tag the corpus without examples? • S1: A/D lion/N ran/V to/P|Aux the/D rock/N|V • S2: The/D cat/N|V slept/V on/P|R the/D mat/N

  32. Fractional Counts • S1 (“A lion ran to the rock”) has 4 possible tag sequences; let’s say they are equally likely (0.25 each): • D N V P D N • D N V Aux D N • D N V P D V • D N V Aux D V • Expected counts: c(D)=2, c(N)=1.5, c(P)=0.5, c(V)=1.5 • c(D,N)=1.5, c(D,V)=0.5, c(V,Aux)=0.5

  33. Similarly for S2 (“The cat slept on the mat”), with 4 equally likely sequences (0.25 each): • D N V P D N • D V V P D N • D N V R D N • D V V R D N • c(D)=2, c(N)=1.5, c(P)=0.5, c(V)=1.5, … • c(D,N)=1.5, c(D,V)=0.5, c(V,P)=0.5, …
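These fractional counts can be verified by enumerating the tag sequences explicitly, a brute-force sketch of the expectation step for sentence S1:

```python
from itertools import product
from collections import defaultdict

# S1: "A lion ran to the rock"; possible tags per word, as on slide 31.
words = ["A", "lion", "ran", "to", "the", "rock"]
options = [["D"], ["N"], ["V"], ["P", "Aux"], ["D"], ["N", "V"]]

sequences = list(product(*options))   # the 4 possible tag sequences
weight = 1.0 / len(sequences)         # assumed equally likely: 0.25 each

# Accumulate fractional (expected) unigram and bigram tag counts.
unigram = defaultdict(float)
bigram = defaultdict(float)
for seq in sequences:
    for t in seq:
        unigram[t] += weight
    for a, b in zip(seq, seq[1:]):
        bigram[(a, b)] += weight

print(unigram["D"], unigram["N"], unigram["P"], unigram["V"])  # 2.0 1.5 0.5 1.5
print(bigram[("D", "N")], bigram[("D", "V")], bigram[("V", "Aux")])  # 1.5 0.5 0.5
```

Enumerating all sequences is exponential in general; the EM algorithm with Forward–Backward computes the same expected counts efficiently.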

  34. Expectation Maximization • Expectation Maximization (EM) is an algorithm for unsupervised learning. • Big question: if we don’t know what the correct answer is, how do we learn? • Before answering the big question, let’s first ask what we are going to learn: in many applications, we want to learn a probabilistic model of the data.

  35. POS as an Example • Given: • a large text corpus (with no tags) • a lexicon that specifies the possible POS categories of each word • Learn a probability model q such that, for any input sentence of words w1:N, we can find the sequence of tags l1:N that has the maximum probability given the words: argmax_{l1:N} P_q(l1:N | w1:N)

  36. EM Magic • The Big Question (in terms of POS tagging): if we do not have examples of which word is assigned which tag, what is the criterion for determining which probability model q is best? • Answer: the best model is the one that assigns the highest probability to the text corpus: argmax_q P_q(corpus)

  37. HMM Taggers: Supervised vs. Unsupervised Training • Supervised: relative frequency – read the tagged corpus, take counts, and build the transition and emission tables • Unsupervised: maximum-likelihood training with a random start – use Forward–Backward or Viterbi to estimate lexical probabilities, compute the most likely hidden state sequence, and determine which POS role each state most likely plays

  38. Unsupervised HMM Taggers • When? • To tag text in a foreign language for which no training corpora exist at all • To tag text from a specialized domain whose word generation probabilities differ from those in available training texts • Initialization of model parameters: • randomly initialize all lexical probabilities involved in the HMM, or • use dictionary information: Jelinek’s method, Kupiec’s method

  39. Unsupervised HMM Taggers • Equivalence classes: group all the words according to the set of their possible tags • Ex) bottom, top → JJ-NN class • Jelinek’s method: assume that each word occurs equally likely with each of its possible tags
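A minimal sketch of initializing emission probabilities under Jelinek's equal-likelihood assumption; the lexicon entries and word frequencies below are invented for illustration:

```python
# Lexicon: possible tags per word (hypothetical entries).
lexicon = {"bottom": ["JJ", "NN"], "top": ["JJ", "NN"],
           "fish": ["NN", "VB"], "the": ["DT"]}
# Word frequencies from some untagged corpus (invented numbers).
freq = {"bottom": 20, "top": 30, "fish": 10, "the": 200}

# Jelinek-style initialization: split each word's count equally
# among its possible tags, then normalize per tag to obtain
# initial emission probabilities P(word | tag).
expected = {(w, t): freq[w] / len(tags)
            for w, tags in lexicon.items() for t in tags}
tag_total = {}
for (w, t), c in expected.items():
    tag_total[t] = tag_total.get(t, 0.0) + c
emit0 = {(w, t): c / tag_total[t] for (w, t), c in expected.items()}

print(emit0[("bottom", "JJ")])  # 10 / (10 + 15) = 0.4
```

These initial probabilities then serve as the starting point for Forward–Backward re-estimation.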

  40. Unsupervised HMM Taggers • Kupiec’s method: group words into equivalence-class ‘metawords’ (all words with the same set of possible POS tags), so the total number of parameters is reduced and can be estimated more reliably • The 100 most frequent words are not placed into equivalence classes but are treated as one-word classes

  41. Conclusions I • There are about 7,000 languages worldwide (http://www.ethnologue.com), but resources exist for only a few dozen [Bird, S. et al., 2009, Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit – http://victoria.lviv.ua/html/fl5/NaturalLanguageProcessingWithPython.pdf] • A hopeless situation? • No! The multilingual Web can provide the resources required to build basic tools for a new language “in one person-day”

  42. Conclusions II • Languages share similarities and have differences: Why? • because of similarities we can port algorithms from one language to the next; • differences show us how rich and wonderful languages are, and give us interesting research topics.

  43. Thank you!
