
Stemming, tagging and chunking


  1. Stemming, tagging and chunking: Text analysis short of parsing

  2. Word-based analysis • Whereas parsing gives a full syntactic analysis, sometimes it is sufficient to have less detailed information • In many applications we are more interested in words • But what do we mean by “word”?

  3. Words • Naïve definition of a word: a sequence of characters separated from other such sequences by spaces • But punctuation marks are usually attached to words • Though not all punctuation marks are word-delimiters, e.g. possessive apostrophe, hyphen
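A minimal sketch in Python of the contrast described here: splitting on spaces leaves punctuation attached to words, while a simple regular expression detaches it but keeps word-internal apostrophes and hyphens. The function names and the pattern are illustrative assumptions, not something the slides specify.

```python
import re

def naive_tokens(text):
    """Split on whitespace only; punctuation stays attached to words."""
    return text.split()

# Illustrative pattern (an assumption): a word may contain internal
# apostrophes or hyphens; any other non-space character is its own token.
TOKEN_RE = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")

def regex_tokens(text):
    """Detach punctuation, but keep possessives and hyphenated words whole."""
    return TOKEN_RE.findall(text)

sentence = "John's well-known book, sadly, is lost."
print(naive_tokens(sentence))
print(regex_tokens(sentence))
```

Whether "well-known" should count as one word or two is a design decision; this pattern arbitrarily chooses one.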

  4. Words • We may want to treat hyphenated and compound words as one word, or two • By the same token we may want to treat word sequences as if they were a single word • In addition, a given “word” can have different word forms, depending on inflections, or even conventions of orthography

  5. Tokenization • The simplest form of analysis is to reduce different word forms into tokens • Also called “normalization” • For example, if you want to count how many times a given word occurs in a text • Or you want to search for texts containing certain words (e.g. Google)
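The word-counting use case can be sketched as follows. Normalization is reduced here to lower-casing, which is only one possible choice and is an assumption of this example; `normalize` and `count_words` are hypothetical names.

```python
import re
from collections import Counter

def normalize(token):
    """Map a word form to a token; here just lower-casing (an assumption)."""
    return token.lower()

def count_words(text):
    """Count occurrences of each normalized word in the text."""
    words = re.findall(r"\w+", text)
    return Counter(normalize(w) for w in words)

counts = count_words("The cat saw the other cat. THE END")
# 'The', 'the' and 'THE' are counted under the same token.
print(counts["the"], counts["cat"])
```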

  6. Stemming • Stemming is the particular case of tokenization which reduces inflected forms to a single base form or stem • (Recall our discussion of stem ~ base form ~ dictionary form ~ citation form) • Stemming algorithms are basic string-handling algorithms, which depend on rules which identify affixes that can be stripped

  7. Stemming • As we know, morphology can be less than straightforward, so a stemmer has to “know” about rules such as consonant doubling, y→i, etc. • Also has to know about irregularities • And to avoid overgeneration • For this it probably needs a dictionary
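As a rough illustration of these points, here is a toy suffix stripper in Python. It is not Porter's algorithm: the rules, the tiny exception dictionary and the name `toy_stem` are all assumptions made for the example, and it still gets many words wrong (for instance, it reduces "hoped" to "hop").

```python
# Tiny exception dictionary for irregular forms (hypothetical contents).
IRREGULAR = {"went": "go", "ran": "run", "mice": "mouse"}
VOWELS = set("aeiou")

def undouble(stem):
    """Undo consonant doubling: 'stopp' -> 'stop' (but keep 'll', 'ss', 'zz')."""
    if (len(stem) >= 3 and stem[-1] == stem[-2]
            and stem[-1] not in VOWELS and stem[-1] not in "lsz"):
        return stem[:-1]
    return stem

def toy_stem(word):
    word = word.lower()
    if word in IRREGULAR:                      # irregularities come first
        return IRREGULAR[word]
    # y -> i alternation: 'carries' / 'carried' -> 'carry'
    for suffix in ("ies", "ied"):
        if word.endswith(suffix) and len(word) > 4:
            return word[:-3] + "y"
    for suffix in ("ing", "ed"):
        stem = word[:-len(suffix)]
        # Guard against overgeneration: the remainder must be at least three
        # letters and contain a vowel, so 'sing' and 'red' are left alone.
        if word.endswith(suffix) and len(stem) >= 3 and VOWELS & set(stem):
            return undouble(stem)
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

for w in ["stopped", "running", "carries", "sing", "went", "books", "glass"]:
    print(w, "->", toy_stem(w))
```

The length and vowel checks are a crude substitute for the dictionary that a more careful stemmer would consult.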

  8. Stemming • Best known stemming algorithm for English is Martin Porter’s stemmer, published in 1980 • Original use was in information retrieval • In computational terms, it is really just a sophisticated string-handling algorithm • In linguistic terms, it is interesting in that it captures generalisations about English morphology

  9. Word categories • A.k.a. parts of speech (POSs) • Important and useful to identify words by their POS • To distinguish homonyms • To enable more general word searches • POS familiar (?) from school and/or language learning (noun, verb, adjective, etc.)

  10. Word categories • Recall that we distinguished • Open-class categories (noun, verb, adjective, adverb) • Closed-class categories (preposition, determiner, pronoun, conjunction, …) • While the big four are fairly clear-cut, it is less obvious exactly what and how many closed-class categories there may be

  11. POS tagging • Labelling words for POS can be done by dictionary lookup and/or some sort of process • Identifying POS can be seen as a prerequisite to parsing, and/or a result of morphological analysis in its own right • However, there are some differences: • Parsers often work with the simplest set of word categories, subcategorized by feature (or attribute-value) schemes • Indeed, the parsing procedure may contribute to the disambiguation of homonyms
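Dictionary lookup on its own can be sketched as below. The lexicon and the tag names are hypothetical; the point is only that lookup returns every candidate category for a homonym and leaves the disambiguation to some later step.

```python
# Tiny hypothetical lexicon: each word maps to the set of tags it can carry.
LEXICON = {
    "the": {"DET"},
    "book": {"NOUN", "VERB"},   # homonym: 'a book' vs 'to book a table'
    "table": {"NOUN", "VERB"},
    "on": {"PREP"},
    "saw": {"NOUN", "VERB"},
}

def lookup_tags(words, lexicon=LEXICON, default=frozenset({"UNKNOWN"})):
    """Return every candidate tag for each word; no disambiguation is attempted."""
    return [(w, sorted(lexicon.get(w.lower(), default))) for w in words]

print(lookup_tags("John saw the book".split()))
```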

  12. POS tagging • POS tagging, per se, aims to identify word-category information somewhat independently of sentence structure … • … and typically uses rather different means • POS tags are generally shown as labels on words: John/NPN saw/VB the/AT book/NCN on/PRP the/AT table/NN ./PNC • We’ll return to tagging in detail, but first let’s mention …
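Text in this word/TAG notation is easy to read back into a program. The sketch below, with the hypothetical helper `parse_tagged`, assumes every item contains at least one slash and that the tag follows the last slash.

```python
def parse_tagged(text):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # Split on the LAST slash, so a word that itself contains '/' survives.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

tagged = parse_tagged("John/NPN saw/VB the/AT book/NCN on/PRP the/AT table/NN ./PNC")
print(tagged)
# A more general word search: every word carrying a given tag.
print([word for word, tag in tagged if tag == "AT"])
```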

  13. Chunking • Like parsing except that it aims only to identify major constituents • And does not attempt to identify structure, either internal (within the chunk) or external (between chunks) • Chunking will leave some parts of the text unanalysed • Example: [NP [NP G.K. Chesterton ], [NP [NP author ] of [NP [NP The Man ] who was [NP Thursday ] ] ] ]

  14. Chunking • Chunks can be represented like tags or like parse trees
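The two representations can be sketched side by side in Python. The tag-like form below uses the B/I/O labelling convention (begin, inside, outside), which is one common choice and an assumption here rather than something the slide specifies; the data layout and function names are also illustrative.

```python
def to_iob(chunked):
    """Flatten a chunked sentence into (word, IOB-tag) pairs.

    `chunked` is a list whose items are either a plain word (outside any
    chunk) or a (label, [words]) pair for a chunk.
    """
    rows = []
    for item in chunked:
        if isinstance(item, str):
            rows.append((item, "O"))
        else:
            label, words = item
            for i, word in enumerate(words):
                prefix = "B" if i == 0 else "I"
                rows.append((word, f"{prefix}-{label}"))
    return rows

def to_brackets(chunked):
    """Render the same structure as a flat bracketed tree."""
    parts = []
    for item in chunked:
        if isinstance(item, str):
            parts.append(item)
        else:
            label, words = item
            parts.append(f"({label}: {' '.join(words)})")
    return "(S: " + " ".join(parts) + ")"

sentence = [("NP", ["I"]), "saw", ("NP", ["the", "big", "dog"]), "."]
print(to_iob(sentence))
print(to_brackets(sentence))
```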

  15. Chunk parser • A “chunk” is a contiguous, non-overlapping sequence of words • A chunker finds such sequences, often using tagged text as input • Chunk rules can be as simple as regular expressions • Chunkers can allow embedding, but typically only to a shallow level • Another example: (S: (NP: I) saw (NP: the big dog) . )
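A chunk rule written as a regular expression over tags might look like the following sketch. The Penn-Treebank-style tags (DT, JJ, NN, PRP, VBD), the single NP rule and the function name `chunk` are assumptions for illustration; the slides do not give a rule set.

```python
import re

# Illustrative NP rule: optional determiner, any adjectives, one or more
# nouns; or a lone pronoun. Tags are Penn-Treebank-style (an assumption).
NP_RULE = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NN>)+|<PRP>")

def chunk(tagged):
    """Group (word, tag) pairs into NP chunks; other words stay unanalysed."""
    # Encode the tag sequence as one string, e.g. '<PRP><VBD><DT><JJ><NN><.>'
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    result, pos = [], 0
    for m in NP_RULE.finditer(tag_string):
        start = tag_string[:m.start()].count("<")   # index of first word in chunk
        end = start + m.group().count("<")          # one past the last word
        result.extend(word for word, _ in tagged[pos:start])
        result.append(("NP", [word for word, _ in tagged[start:end]]))
        pos = end
    result.extend(word for word, _ in tagged[pos:])
    return result

tagged = [("I", "PRP"), ("saw", "VBD"), ("the", "DT"),
          ("big", "JJ"), ("dog", "NN"), (".", ".")]
print(chunk(tagged))
```

Because the rule matches flat tag sequences only, the output has no embedding at all, which is the shallowest case of what a chunker may produce.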
