Introduction to (still more) Computational Linguistics

Pawel Sirotkin, 28.11-01.12.2008, Riga.

Presentation Transcript


  1. Introduction to (still more) Computational Linguistics
     Pawel Sirotkin, 28.11-01.12.2008, Riga

  2. Rule-based CL
     • Rules have to be generated by hand
     • Easily tailored to fit (or test) a particular theory
     • First results with just a handful of rules
     • But:
       • Very hard to get “all” the rules
       • Rules may conflict
       • Rules are language- and domain-specific
     Computational Linguistics, NLL Riga 2008, by Pawel Sirotkin

  3. Statistical CL
     • Needed: an algorithm that can create rules
     • Algorithm needs training data to learn
     • More and more data around
       • Digitized literature, official documents, corpora
     • These rules can be applied to new texts
     • Good points:
       • Largely independent of language, domain etc.
       • Computational power available in abundance

  4. A brief aside: Corpora
     • First major corpus: Brown Corpus (mid-1960s)
       • 500 samples of 2000 words each
       • From newspapers, fiction and non-fiction books
       • Around 80 part-of-speech tags
       • Tagging took over 15 years to complete
     • Modern corpora: BNC, COCA, ...
       • Sometimes hundreds of millions of words
       • Written and spoken texts
       • More or less syntactic and semantic annotation

  5. Part-of-Speech Tagging
     • Linguistic background
       • What are parts of speech?
       • How do we recognize them?
     • Practical usage
       • What are POS taggers good for?
       • What should they do?
     • Implementation
       • What are the possible problems?
       • What are the possible solutions?

  6. Parts of speech
     • Nouns, verbs, adjectives…
     I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. (Martin Luther King)
     • How many nouns are there in this text?

  7. Parts of speech
     • Nouns, verbs, adjectives…
     I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. (Martin Luther King)
     • What defines a noun?

  8. What defines a part of speech?
     • Noun
       • "a word (other than a pronoun) used to identify any of a class of people, places, or things (common noun), or to name a particular one of these (proper noun)" [OED]
         • Semantic definition
       • "any member of a class of words that typically can be combined with determiners to serve as the subject of a verb, can be interpreted as singular or plural, can be replaced with a pronoun, and refer to an entity, quality, state, action, or concept" [Merriam-Webster]
         • Syntactic and semantic definition

  9. What parts of speech are there?
     • More (closed) word classes in English
     • More (or fewer, or different) word classes in other languages
     • Different word classes in different linguistic models

  10. How to recognize word classes?
      • Substitution test
      • The small boy sits in a car.
        • The, a, this: determiners
        • Small, big, angry, clever: adjectives
        • Boy, girl, cat, doll: nouns
        • Sits, cries, sleeps: verbs
        • In, on, outside: prepositions

  11. Why do we need POS tags?
      • Main aim: disambiguation
      • Useful for most advanced CLP applications
        • Machine translation
        • Named Entity Recognition/Extraction
        • Anaphora resolution
        • etc.

  12. Part-of-Speech Tagger
      • Not surprisingly, an application for determining parts of speech in a text
      • Not/ADV surprisingly/ADV, an/DET application/N for/PREP determining/V parts/N of/PREP speech/N in/PREP a/DET text/N

  13. Part-of-Speech Tagging – rules?
      • Rule-based POS Tagging?
      • Possible rules (simplified):
        • If a word ends in "est", then it's an adjective (superlative form)
          • Pest? Rest?
        • If a word ends in "ed", it's a verb (past or participle form)
          • Bed? Sled?
      • Rules of this kind are few and unreliable
      • Largest problem: they don’t help with the ambiguous words!
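The two simplified suffix rules from this slide can be written down in a few lines to see how quickly they misfire. This is only a sketch: the function name `naive_suffix_tag` and the tag labels are chosen here for illustration and do not come from the slides.

```python
def naive_suffix_tag(word):
    """Guess a tag from the word ending alone (deliberately naive)."""
    w = word.lower()
    if w.endswith("est"):
        return "ADJ"   # intended for superlatives such as "smallest"
    if w.endswith("ed"):
        return "VERB"  # intended for past forms such as "walked"
    return "UNKNOWN"   # the rules say nothing about all other words


for word in ["smallest", "walked", "pest", "rest", "bed", "sled", "wind"]:
    print(word, naive_suffix_tag(word))
```

"Pest" and "rest" come out as adjectives and "bed" and "sled" as verbs, while an ambiguous word such as "wind" gets no tag at all, which is the larger problem the slide points to.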

  14. Part-of-Speech Tagging – rules!
      • The wind is blowing.
      • How do we know wind is a noun and not a verb?
        • Because it appears after an article and before a verb
        • ART ___ VERB → ART NOUN VERB
      • We need rules about inter-word relations
      • Hard to say what the rules are:
        • The cromulent wind
        • The cromulent wind up
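A contextual rule of the kind "ART ___ VERB → ART NOUN VERB" might be sketched as follows. The representation is an assumption made for this example: a sentence is a list of tags in which `None` marks a word whose tag is still undecided, and `apply_context_rule` is a hypothetical helper name.

```python
def apply_context_rule(tags):
    """Fill an undecided slot (None) between ART and VERB with NOUN."""
    resolved = list(tags)
    for i in range(1, len(resolved) - 1):
        if (resolved[i] is None
                and resolved[i - 1] == "ART"
                and resolved[i + 1] == "VERB"):
            resolved[i] = "NOUN"
    return resolved


# "The wind is blowing": only "wind" is undecided
print(apply_context_rule(["ART", None, "VERB", "VERB"]))
# "The cromulent wind up": the neighbours are undecided too, so the rule never fires
print(apply_context_rule(["ART", None, None, None]))
```

The second call returns its input unchanged: a rule that needs known neighbours cannot help when several adjacent words are ambiguous or unknown.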

  15. Part-of-Speech Tagging: Stats
      • Wind: 76% noun usage, 24% verb usage
      • ART ___ VERB: 72% noun, 1% adverb
      • The wind blows:
        • Verb probability: 24% x 0% = 0%
        • Adverb probability: 0% x 1% = 0%
        • Noun probability: 76% x 72% = 55%
      Careful! The numbers are invented, and the calculation is more complex than that.
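The arithmetic on this slide multiplies, for each candidate tag, the word's own tag probability by the probability of that tag in the context ART ___ VERB. A minimal sketch with the slide's invented numbers (the dictionary names are made up for this example):

```python
# How often "wind" carries each tag (invented numbers from the slide)
lexical = {"NOUN": 0.76, "VERB": 0.24, "ADV": 0.00}
# How often each tag fills the slot in ART ___ VERB (invented numbers from the slide)
context = {"NOUN": 0.72, "VERB": 0.00, "ADV": 0.01}

scores = {tag: lexical[tag] * context[tag] for tag in lexical}
best = max(scores, key=scores.get)

for tag, score in scores.items():
    print(f"{tag}: {score:.0%}")
print("best tag:", best)
```

As the slide warns, this is a simplification: a real tagger scores whole tag sequences rather than one slot at a time.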

  16. What do we need?
      This is a simple sentence.
      This text, excogitated by Dr. Samākslots of New York, is a bit more complicated. It consists of a few longer-than-usual sentences; also, it has punctuation etc. It will help us to learn the complexities of part-of-speech tagging, or POST.

  17. We need…
      • A tokenizer to split the text into tokens
      • Tag probabilities for the tokens
        • E.g. left: 46% adjective, 31% noun, 23% verb
      • Tag sequence probabilities
        • E.g. ADJ ___ NOUN: 57% noun, 43% adjective
        • How long should the sequences be?
      • Methods for estimating unknown words
        • E.g. 80% proper noun probability if capitalized
        • No closed word classes
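The first item on this list, the tokenizer, can be sketched with a single regular expression. The pattern below is an assumption made for illustration, not a complete solution: it only knows a tiny, hypothetical list of abbreviations and keeps hyphenated words together.

```python
import re

# Order matters: known abbreviations first, then (possibly hyphenated) words,
# then any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"(?:Dr|Mr|Mrs|etc)\.|\w+(?:-\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into word, abbreviation and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


sample = ("This text, excogitated by Dr. Samākslots of New York, "
          "is a bit more complicated.")
print(tokenize(sample))
```

Even this short sample shows the open questions: "Dr." has to be listed explicitly so that its period is not read as the end of a sentence, and "New York" still comes out as two tokens.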

  18. Tag probabilities
      The wind blows.
      • The: 98% article, 2% adverb
      • Wind: 76% noun, 24% verb
      • Blows: 53% verb, 47% noun
      • Article → Noun 72%, Article → Verb 1%
      • Adverb → Noun 0%, Adverb → Verb 6%
      • Noun → Verb 61%, Noun → Noun 4%
      • Verb → Verb 3%, Verb → Noun 59%

  19. Tag probability calculation
      The wind blows.
      • Article – noun – verb: 98% x 72% x 76% x 61% x 53% = 17%
      • Article – noun – noun: 98% x 72% x 76% x 4% x 47% = 1%
      • Article – verb – noun: 98% x 1% x 24% x 59% x 47% = 0.07%
      • Article – verb – verb: 98% x 1% x 24% x 3% x 53% = 0.004%
      • …
      • The complexity of the calculation explodes as sentences get longer and the number of tags increases.
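The exhaustive calculation on this slide can be reproduced by enumerating every combination of candidate tags. The sketch below uses the invented probabilities from the "Tag probabilities" slide; the names `TAG_PROB`, `TRANSITION`, `score` and `brute_force` are chosen here for illustration.

```python
from itertools import product

# Per-token tag probabilities (invented numbers from the slides)
TAG_PROB = {
    "the": {"ART": 0.98, "ADV": 0.02},
    "wind": {"NOUN": 0.76, "VERB": 0.24},
    "blows": {"VERB": 0.53, "NOUN": 0.47},
}
# Tag sequence probabilities (invented numbers from the slides)
TRANSITION = {
    ("ART", "NOUN"): 0.72, ("ART", "VERB"): 0.01,
    ("ADV", "NOUN"): 0.00, ("ADV", "VERB"): 0.06,
    ("NOUN", "VERB"): 0.61, ("NOUN", "NOUN"): 0.04,
    ("VERB", "VERB"): 0.03, ("VERB", "NOUN"): 0.59,
}


def score(tokens, tags):
    """Multiply tag and tag sequence probabilities along one tag sequence."""
    p = TAG_PROB[tokens[0]][tags[0]]
    for i in range(1, len(tokens)):
        p *= TRANSITION[(tags[i - 1], tags[i])] * TAG_PROB[tokens[i]][tags[i]]
    return p


def brute_force(tokens):
    """Score every possible tag sequence; the count grows exponentially."""
    candidates = [TAG_PROB[token].keys() for token in tokens]
    return {tags: score(tokens, tags) for tags in product(*candidates)}


scores = brute_force(["the", "wind", "blows"])
for tags, p in sorted(scores.items(), key=lambda item: -item[1]):
    print(" - ".join(tags), f"{p:.4%}")
```

With two candidate tags per token, three tokens already give 2 x 2 x 2 = 8 sequences; a 20-token sentence would give over a million, which is why exhaustive enumeration does not scale.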

  20. Hidden Markov Models
      [Figure: trellis for "The wind blows" with the candidate states article/adverb for "The", noun/verb for "wind" and verb/noun for "blows". The states are labelled with tag probabilities and the arrows between them with tag sequence probabilities.]

  21. Viterbi Algorithm
      [Figure: trellis for "The wind blows", annotated step by step with the probability of the best path into each state.]
      • After "The": article: 98%, adverb: 2%
      • After "wind": article – noun: 54%, article – verb: 0.2%, adverb – noun: 0%, adverb – verb: 0.02%
      • After "blows": article – noun – verb: 17%, article – noun – noun: 1%; the competing paths through the verb reading of "wind" all stay well below 0.1%
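The step-by-step idea behind this slide can be sketched in code: for each token and each candidate tag, keep only the best path that ends in that tag, instead of scoring every complete sequence. This is a minimal sketch using the invented probabilities from the "Tag probabilities" slide; the function and table names are assumptions made for this example.

```python
TAG_PROB = {
    "the": {"ART": 0.98, "ADV": 0.02},
    "wind": {"NOUN": 0.76, "VERB": 0.24},
    "blows": {"VERB": 0.53, "NOUN": 0.47},
}
TRANSITION = {
    ("ART", "NOUN"): 0.72, ("ART", "VERB"): 0.01,
    ("ADV", "NOUN"): 0.00, ("ADV", "VERB"): 0.06,
    ("NOUN", "VERB"): 0.61, ("NOUN", "NOUN"): 0.04,
    ("VERB", "VERB"): 0.03, ("VERB", "NOUN"): 0.59,
}


def viterbi(tokens, tag_prob, transition):
    """Return (probability, tag path) of the best-scoring tag sequence."""
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {tag: (p, [tag]) for tag, p in tag_prob[tokens[0]].items()}
    for token in tokens[1:]:
        new_best = {}
        for tag, p_tag in tag_prob[token].items():
            # Extend every surviving path by this tag and keep only the best one
            new_best[tag] = max(
                ((p_prev * transition.get((prev, tag), 0.0) * p_tag, path + [tag])
                 for prev, (p_prev, path) in best.items()),
                key=lambda item: item[0],
            )
        best = new_best
    return max(best.values(), key=lambda item: item[0])


probability, path = viterbi(["the", "wind", "blows"], TAG_PROB, TRANSITION)
print(path, f"{probability:.0%}")
```

Because only one path per tag survives each step, the work grows linearly with sentence length (times the square of the number of tags) rather than exponentially.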

  22. HMMs – the theory
      • A five-tuple (S, K, Π, A, B)
      • Set of states S
        • here: the possible tags at any point
      • Output alphabet K
        • here: the possible tokens
      • Initial probabilities Π
        • here: probabilities for the first item in a sentence/text
      • State transition probabilities A
        • here: tag sequence probabilities
      • Symbol emission probabilities B
        • here: token-tag probabilities
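The five-tuple maps naturally onto a small data structure. In the sketch below the class name `HMM`, its field names and all numbers in the toy model are hypothetical. It also assumes the usual HMM convention that B holds the probability of a token given a tag, which is not the same quantity as the tag-given-token percentages used on the earlier slides.

```python
from dataclasses import dataclass


@dataclass
class HMM:
    states: set        # S: the possible tags
    alphabet: set      # K: the possible tokens
    initial: dict      # Π: tag -> probability of starting a sentence
    transition: dict   # A: (previous tag, tag) -> probability
    emission: dict     # B: (tag, token) -> probability of the token given the tag

    def joint_probability(self, tokens, tags):
        """Probability of producing these tokens along this tag path."""
        p = self.initial.get(tags[0], 0.0)
        p *= self.emission.get((tags[0], tokens[0]), 0.0)
        for i in range(1, len(tokens)):
            p *= self.transition.get((tags[i - 1], tags[i]), 0.0)
            p *= self.emission.get((tags[i], tokens[i]), 0.0)
        return p


# Toy model; every number here is invented for the example
toy = HMM(
    states={"ART", "NOUN", "VERB"},
    alphabet={"the", "wind", "blows"},
    initial={"ART": 0.9},
    transition={("ART", "NOUN"): 0.72, ("NOUN", "VERB"): 0.61},
    emission={("ART", "the"): 0.5, ("NOUN", "wind"): 0.1, ("VERB", "blows"): 0.1},
)
p = toy.joint_probability(["the", "wind", "blows"], ["ART", "NOUN", "VERB"])
print(p)
```

Missing entries are treated as probability zero, so any tag path that uses an unlisted transition or emission scores 0.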

  23. POST: Current state
      • Baseline approach (tagging each token with its most frequent tag) delivers up to 90% accuracy
      • State-of-the-art taggers reach 96-97% accuracy
      • But: given an average sentence length of 20 words in newspaper text, roughly every second sentence still contains at least one error!
      • POS taggers are used as a first step in most complex CL applications
      • Some free online taggers: CLAWS, CST, CCG…
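The step from per-token accuracy to per-sentence accuracy can be checked with a short calculation. The sketch assumes, for simplicity, that tagging errors are independent of each other, which real taggers do not guarantee; `sentence_accuracy` is a name made up for this example.

```python
def sentence_accuracy(token_accuracy, sentence_length):
    """Chance that every token in a sentence is tagged correctly,
    assuming independent errors (a simplification)."""
    return token_accuracy ** sentence_length


for accuracy in (0.90, 0.96, 0.97):
    fully_correct = sentence_accuracy(accuracy, 20)
    print(f"{accuracy:.0%} per token: "
          f"{fully_correct:.0%} of 20-word sentences fully correct")
```

Under this assumption a tagger with 96-97% token accuracy gets only about half of all 20-word sentences entirely right, and the 90% baseline only about one in eight.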
