
SIMS 290-2: Applied Natural Language Processing

SIMS 290-2: Applied Natural Language Processing. Marti Hearst Sept 8, 2004. Today. Tokenizing using Regular Expressions Elementary Morphology Frequency Distributions in NLTK. Tokenizing in NLTK. The Whitespace Tokenizer doesn’t work very well What are some of the problems?



  1. SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004

  2. Today • Tokenizing using Regular Expressions • Elementary Morphology • Frequency Distributions in NLTK

  3. Tokenizing in NLTK • The Whitespace Tokenizer doesn’t work very well • What are some of the problems? • NLTK provides an easy way to incorporate regex’s into your tokenizer • Uses python’s regex package (re) • http://docs.python.org/lib/re-syntax.html Modified from Dorr and Habash (after Jurafsky and Martin)

  4. Regex’s for Tokenizing • Build up your recognizer piece by piece • Make a string of regex’s combined with OR’s • Put each one in a group (surrounded by parens) • Things to recognize: • URLs • words with hyphens in them • words in which hyphens should be removed (end-of-line hyphens) • Numerical terms • Words with apostrophes

  5. Regex’s for Tokenizing • Here are some I put together: • url = r'((http:\/\/)?[A-Za-z]+(\.[A-Za-z]+){1,3}(\/)?(:\d+)?)' • Allows port number but no argument variables. • hyphen = r'(\w+\-\s?\w+)' • Allows for a space after the hyphen • apostro = r'(\w+\'\w+)' • numbers = r'((\$|#)?\d+(\.)?\d+%?)' • Needs to handle large numbers with commas • punct = r'([^\w\s]+)' • wordr = r'(\w+)' • A nice Python trick: • regexp = string.join([url, hyphen, apostro, numbers, wordr, punct],"|") • Makes one string in which a “|” goes in between each substring

  6. Regex’s for Tokenizing • More code: • import string • from nltk.token import * • from nltk.tokenizer import * • t = Token(TEXT='This is the girl\'s depart- ment store.') • regexp = string.join([url, hyphen, apostro, numbers, wordr, punct],"|") • RegexpTokenizer(regexp,SUBTOKENS='WORDS').tokenize(t) • print t['WORDS'] • Output: [<This>, <is>, <the>, <girl's>, <depart- ment>, <store>, <.>]
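The slide code targets the NLTK release used in the course. As a rough sketch of the same "join the patterns with |" idea using only Python's standard `re` module, something like the following should behave similarly. The patterns are adapted from the slides, not copied: the groups are non-capturing, the port is moved before the trailing slash, and the number pattern accepts comma-grouped thousands. Those tweaks and the `tokenize` helper are assumptions made for illustration.

```python
import re

# Alternatives are tried left to right at each position, so the more
# specific patterns (URLs, hyphenated words) must come before plain words.
url = r"(?:http://)?[A-Za-z]+(?:\.[A-Za-z]+){1,3}(?::\d+)?/?"
hyphen = r"\w+-\s?\w+"          # allows one space after the hyphen
apostro = r"\w+'\w+"            # girl's, don't
numbers = r"[$#]?\d+(?:,\d{3})*(?:\.\d+)?%?"
wordr = r"\w+"
punct = r"[^\w\s]+"

# One big regex in which "|" separates the individual recognizers.
TOKEN_RE = re.compile("|".join([url, hyphen, apostro, numbers, wordr, punct]))


def tokenize(text):
    """Return the list of non-overlapping matches, scanning left to right."""
    return [m.group(0) for m in TOKEN_RE.finditer(text)]


print(tokenize("This is the girl's depart- ment store."))
```

Note that the URL pattern also fires on any letters.letters sequence (for example a missing space after a sentence-final period), which is one of the problems this kind of tokenizer has to be tuned for.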

  7. Tokenization Issues • Sentence Boundaries • Include parens around sentences? • What about quotation marks around sentences? • Periods – end of line or not? • We’ll study this in detail in a couple of weeks. • Proper Names • What to do about • “New York-New Jersey train”? • “California Governor Arnold Schwarzenegger”? • Clitics and Contractions

  8. Morphology • Morphology: • The study of the way words are built up from smaller meaning units. • Morphemes: • The smallest meaningful unit in the grammar of a language. • Contrasts: • Derivational vs. Inflectional • Regular vs. Irregular • Concatenative vs. Templatic (root-and-pattern) • A useful resource: • Glossary of linguistic terms by Eugene Loos • http://www.sil.org/linguistics/GlossaryOfLinguisticTerms/contents.htm

  9. Examples (English) • “unladylike” • 3 morphemes, 4 syllables un- ‘not’ lady ‘(well behaved) female adult human’ -like ‘having the characteristics of’ • Can’t break any of these down further without distorting the meaning of the units • “technique” • 1 morpheme, 2 syllables • “dogs” • 2 morphemes, 1 syllable -s, a plural marker on nouns

  10. Morpheme Definitions • Root • The portion of the word that: • is common to a set of derived or inflected forms, if any, when all affixes are removed • is not further analyzable into meaningful elements • carries the principal portion of meaning of the words • Stem • The root or roots of a word, together with any derivational affixes, to which inflectional affixes are added. • Affix • A bound morpheme that is joined before, after, or within a root or stem. • Clitic • A morpheme that functions syntactically like a word, but does not appear as an independent phonological word • Spanish: un beso, las aguas • English: Hal’s (genitive marker)

  11. Inflectional vs. Derivational • Word Classes • Parts of speech: noun, verb, adjective, etc. • Word class dictates how a word combines with morphemes to form new words • Inflection: • Variation in the form of a word, typically by means of an affix, that expresses a grammatical contrast. • Doesn’t change the word class • Usually produces a predictable, nonidiosyncratic change of meaning. • Derivation: • The formation of a new word or inflectable stem from another word or stem.

  12. Inflectional Morphology • Adds: • tense, number, person, mood, aspect • Word class doesn’t change • Word serves new grammatical role • Examples • come is inflected for person and number: The pizza guy comes at noon. • las and rojas are inflected for agreement with manzanas in grammatical gender by -a and in number by -s: las manzanas rojas (‘the red apples’)

  13. Derivational Morphology • Nominalization (formation of nouns from other parts of speech, primarily verbs in English): • computerization • appointee • killer • fuzziness • Formation of adjectives (primarily from nouns) • computational • clueless • embraceable • Difficult cases: • building → from which sense of “build”? • A resource: • CatVar: Categorial Variation Database http://clipdemos.umiacs.umd.edu/catvar

  14. Concatenative Morphology • Morpheme+Morpheme+Morpheme+… • Stems: also called lemma, base form, root, lexeme • hope+ing → hoping, hop+ing → hopping • Affixes • Prefixes: anti-, dis- in antidisestablishmentarianism • Suffixes: -ment, -arian, -ism in antidisestablishmentarianism • Infixes: hingi (borrow) – humingi (borrower) in Tagalog • Circumfixes: sagen (say) – gesagt (said) in German • Agglutinative Languages • uygarlaştıramadıklarımızdanmışsınızcasına • uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına • Behaving as if you are among those whom we could not cause to become civilized
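The hoping/hopping contrast shows that concatenation is not plain string gluing: spelling rules apply at the morpheme boundary. A minimal sketch of two such rules (e-deletion and consonant doubling) is given below. The `add_ing` function and its conditions are illustrative assumptions, not a full account of English spelling; for instance, it only doubles in stems with a single vowel letter, so it gets "visit" right but misses "begin".

```python
VOWELS = "aeiou"


def add_ing(stem):
    """Attach -ing to a verb stem using two simplified spelling rules."""
    # e-deletion: hope + ing -> hoping (but see -> seeing, dye -> dyeing)
    if stem.endswith("e") and len(stem) > 2 and stem[-2] not in "eoy":
        return stem[:-1] + "ing"
    # Consonant doubling for short consonant-vowel-consonant stems:
    # hop + ing -> hopping. Final w, x, y are never doubled.
    one_vowel = sum(ch in VOWELS for ch in stem) == 1
    if (one_vowel and len(stem) >= 3
            and stem[-1] not in VOWELS + "wxy"
            and stem[-2] in VOWELS
            and stem[-3] not in VOWELS):
        return stem + stem[-1] + "ing"
    # Default: plain concatenation, walk + ing -> walking
    return stem + "ing"


for verb in ["hope", "hop", "walk", "see"]:
    print(verb, "->", add_ing(verb))
```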

  15. Templatic Morphology • Roots and Patterns • Example: Hebrew verbs • Root: • Consists of 3 consonants CCC • Carries basic meaning • Template: • Gives the ordering of consonants and vowels • Specifies semantic information about the verb • Active, passive, middle voice • Example: • lmd (to learn or study) • CaCaC -> lamad (he studied) • CiCeC -> limed (he taught) • CuCaC -> lumad (he was taught)
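Root-and-pattern interleaving is easy to mimic in code: each C slot in the template is filled with the next root consonant, and the vowels come from the template. The sketch below uses the lmd example from the slide; the function name `apply_template` is a hypothetical helper, and the transliteration is treated as a plain string.

```python
def apply_template(root, template):
    """Fill each 'C' slot of the template with the next consonant of the root."""
    if template.count("C") != len(root):
        raise ValueError("template must have one C slot per root consonant")
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in template)


for template in ["CaCaC", "CiCeC", "CuCaC"]:
    print(template, "->", apply_template("lmd", template))
```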

  16. Nouns and Verbs (in English) • Nouns have simple inflectional morphology • cat • cat+s, cat+’s • Verbs have more complex morphology

  17. Nouns and Verbs (in English) • Nouns • Have simple inflectional morphology • Cat/Cats • Mouse/Mice, Ox/Oxen, Goose/Geese • Verbs • More complex morphology • Walk/Walked • Go/Went, Fly/Flew

  18. Regular (English) Verbs

  19. Irregular (English) Verbs

  20. “To love” in Spanish

  21. Syntax and Morphology • Phrase-level agreement • Subject-Verb • John studies hard (STUDY+3SG) • Noun-Adjective • Las vacas hermosas • Sub-word phrasal structures • שבספרינו • ש+ב+ספר+ים+נו • That+in+book+PL+Poss:1PL • Which are in our books

  22. Phonology and Morphology • Script Limitations • Spoken English has 14 vowels • heed hid hayed head had hoed hood who’d hide how’d taught Tut toy enough • English alphabet has 5 • Use vowel combinations: far fair fare • Consonantal doubling (hopping vs. hoping)

  23. Computational Morphology • Approaches • Lexicon only • Rules only • Lexicon and Rules • Finite-state Automata • Finite-state Transducers • Systems • WordNet’s morphy • PCKimmo • Named after Kimmo Koskenniemi, much work done by Lauri Karttunen, Ron Kaplan, and Martin Kay • Accurate but complex • http://www.sil.org/pckimmo/ • Two-level morphology • Commercial version available from InXight Corp. • Background • Chapter 3 of Jurafsky and Martin • A short history of Two-Level Morphology • http://www.ling.helsinki.fi/~koskenni/esslli-2001-karttunen/

  24. Porter Stemmer • Discount morphology • So not all that accurate • Uses a series of cascaded rewrite rules • ATIONAL -> ATE (relational -> relate) • ING -> ε if stem contains vowel (motoring -> motor)

  25. Porter Stemmer Step 4: Derivational Morphology I: Multiple Suffixes
      (m>0) ATIONAL -> ATE     relational -> relate
      (m>0) TIONAL  -> TION    conditional -> condition, rational -> rational
      (m>0) ENCI    -> ENCE    valenci -> valence
      (m>0) ANCI    -> ANCE    hesitanci -> hesitance
      (m>0) IZER    -> IZE     digitizer -> digitize
      (m>0) ABLI    -> ABLE    conformabli -> conformable
      (m>0) ALLI    -> AL      radicalli -> radical
      (m>0) ENTLI   -> ENT     differentli -> different
      (m>0) ELI     -> E       vileli -> vile
      (m>0) OUSLI   -> OUS     analogousli -> analogous
      (m>0) IZATION -> IZE     vietnamization -> vietnamize
      (m>0) ATION   -> ATE     predication -> predicate
      (m>0) ATOR    -> ATE     operator -> operate
      (m>0) ALISM   -> AL      feudalism -> feudal
      (m>0) IVENESS -> IVE     decisiveness -> decisive
      (m>0) FULNESS -> FUL     hopefulness -> hopeful
      (m>0) OUSNESS -> OUS     callousness -> callous
      (m>0) ALITI   -> AL      formaliti -> formal
      (m>0) IVITI   -> IVE     sensitiviti -> sensitive
      (m>0) BILITI  -> BLE     sensibiliti -> sensible
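To make the (m>0) condition concrete, here is a small sketch of this one step only, not the whole stemmer. It assumes Porter's usual definitions: a word has the shape [C](VC)^m[V], where m is the measure, and "y" counts as a vowel only when it follows a consonant. Only the longest matching suffix is tried, and the rewrite happens only if the remaining stem has m > 0. The function names are hypothetical.

```python
import re

# Suffix rewrite rules from the slide; each requires measure(stem) > 0.
RULES = {
    "ational": "ate", "tional": "tion", "enci": "ence", "anci": "ance",
    "izer": "ize", "abli": "able", "alli": "al", "entli": "ent",
    "eli": "e", "ousli": "ous", "ization": "ize", "ation": "ate",
    "ator": "ate", "alism": "al", "iveness": "ive", "fulness": "ful",
    "ousness": "ous", "aliti": "al", "iviti": "ive", "biliti": "ble",
}


def is_consonant(word, i):
    """Porter's definition: 'y' is a consonant unless preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the VC sequences (m) in the form [C](VC)^m[V]."""
    shape = "".join("c" if is_consonant(stem, i) else "v"
                    for i in range(len(stem)))
    return len(re.findall(r"v+c+", shape))


def apply_step(word):
    """Rewrite the longest matching suffix if the remaining stem has m > 0."""
    for suffix in sorted(RULES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if measure(stem) > 0:
                return stem + RULES[suffix]
            return word  # condition failed: leave the word unchanged
    return word


for w in ["relational", "conditional", "rational", "digitizer"]:
    print(w, "->", apply_step(w))
```

The "rational" case shows why the measure condition matters: the stem left after stripping the suffix is too short, so the word is left alone.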

  26. Porter Stemmer • Errors of Omission (related forms left unconflated) • European / Europe • analysis / analyzes • matrices / matrix • noise / noisy • explain / explanation • Errors of Commission (unrelated forms conflated) • organization / organ • doing / doe • generalization / generic • numerical / numerous • university / universe

  27. Computational Morphology • WORD → STEM (+FEATURES)* • cats → cat +N +PL • cat → cat +N +SG • cities → city +N +PL • geese → goose +N +PL • ducks → (duck +N +PL) or (duck +V +3SG) • merging → merge +V +PRES-PART • caught → (catch +V +PAST-PART) or (catch +V +PAST)

  28. Lexicon-only Morphology • The lexicon lists all surface level and lexical level pairs • No rules … • Analysis/Generation is easy • Very large for English • What about • Arabic or • Turkish or • Chinese?
      Surface form    Lexical form
      acclaim         acclaim $N$
      acclaim         acclaim $V+0$
      acclaimed       acclaim $V+ed$
      acclaimed       acclaim $V+en$
      acclaiming      acclaim $V+ing$
      acclaims        acclaim $N+s$
      acclaims        acclaim $V+s$
      acclamation     acclamation $N$
      acclamations    acclamation $N+s$
      acclimate       acclimate $V+0$
      acclimated      acclimate $V+ed$
      acclimated      acclimate $V+en$
      acclimates      acclimate $V+s$
      acclimating     acclimate $V+ing$
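With a lexicon-only approach, analysis and generation both reduce to table lookup, which is why they are "easy". A minimal sketch using a few of the entries above follows; the dictionary layout and the `analyze` and `generate` helpers are assumptions for illustration.

```python
# Surface form -> list of (lexical form, features) pairs, taken from the slide.
LEXICON = {
    "acclaim": [("acclaim", "N"), ("acclaim", "V+0")],
    "acclaimed": [("acclaim", "V+ed"), ("acclaim", "V+en")],
    "acclaiming": [("acclaim", "V+ing")],
    "acclaims": [("acclaim", "N+s"), ("acclaim", "V+s")],
    "acclimate": [("acclimate", "V+0")],
    "acclimating": [("acclimate", "V+ing")],
}


def analyze(surface):
    """Analysis: look up every lexical reading of a surface form."""
    return LEXICON.get(surface, [])


def generate(lemma, features):
    """Generation: find the surface forms listed for a lexical form."""
    return [surface for surface, readings in LEXICON.items()
            if (lemma, features) in readings]


print(analyze("acclaims"))               # ambiguous: noun plural or verb 3sg
print(generate("acclimate", "V+ing"))
print(analyze("acclaimer"))              # unlisted word: no analysis at all
```

The last call illustrates the cost of having no rules: any form not listed gets no analysis, so the table must enumerate every inflected form, which becomes impractical for highly inflected or agglutinative languages.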

  29. For Next Week • Software status: • Software on 3 lab machines, more coming • Lecture on Monday Sept 13: • Part of speech tagging • For Wed Sept 15 • Do exercises 1-3 in Tutorial 2 (Tokenizing) • Do the following exercises from Tutorial 3 (Tagging): 1a-h, 2, 3, 4, 5a-b • Turn them in online (I’ll have something available for this by then)
