
Intelligent Information Retrieval CS 336 – Lecture 3: Text Operations



Presentation Transcript


  1. Intelligent Information Retrieval CS 336 – Lecture 3: Text Operations • Xiaoyan Li • Spring 2006

  2. Topics • Five-step document preprocessing • Porter stemming algorithm • Text compression

  3. Five-Step Document Preprocessing • Lexical analysis of the text • How to treat digits, hyphens, punctuation marks, and the case of letters • Elimination of stopwords • Words with low discrimination value • Stemming • Removing prefixes and suffixes • Selection of index terms • Determine which words/stems will be used as indexing elements • Construction of term categorization structures • A thesaurus

  4. Step 1: Lexical analysis of the text • Converting the text of a document (a large string or a stream of characters) into a stream of words • Word separators (English, Chinese) • How to deal with digits, punctuation marks, hyphens, and the case of letters
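
The lexical-analysis step can be sketched in a few lines. This is only a minimal illustration for English text: the function name `tokenize` and its `keep_digits` flag are assumptions, not part of the lecture. It folds case, treats hyphens and punctuation as separators, and drops digits by default. Languages without explicit word separators, such as Chinese, need a real segmenter instead.

```python
import re

def tokenize(text, keep_digits=False):
    """Turn a character stream into a list of lowercase word tokens.

    Hyphens and punctuation act as separators; digits are dropped
    unless keep_digits is True. (Illustrative policy choices only.)
    """
    text = text.lower()  # case folding
    pattern = r"[a-z0-9]+" if keep_digits else r"[a-z]+"
    return re.findall(pattern, text)

print(tokenize("State-of-the-art IR, since 1998!"))
print(tokenize("State-of-the-art IR, since 1998!", keep_digits=True))
```

Each policy here is a trade-off: splitting on hyphens separates "state-of-the-art" into four tokens, and dropping digits loses dates that some queries may need.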

  5. Step 2: Elimination of stopwords • Frequent words in the collection • Not good discriminators • Filtered out as potential index terms • Elimination of stopwords reduces the size of the indexing structure considerably (by 40% or more) • Examples • Articles, prepositions, conjunctions, etc. • Even some verbs, adverbs, and adjectives
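
As a rough sketch of stopword elimination, the code below derives a stoplist from document frequency and then filters tokens against it. The names `build_stoplist` and `remove_stopwords` and the 0.8 threshold are assumptions chosen for illustration; practical systems usually start from a hand-curated list.

```python
from collections import Counter

def build_stoplist(docs, min_doc_fraction=0.8):
    """Return terms that occur in at least min_doc_fraction of the documents.

    docs is a list of token lists. The threshold is an illustrative
    assumption, not a value from the lecture.
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    cutoff = min_doc_fraction * len(docs)
    return {term for term, n in doc_freq.items() if n >= cutoff}

def remove_stopwords(tokens, stoplist):
    """Filter stopwords out of a token list, preserving order."""
    return [t for t in tokens if t not in stoplist]

docs = [["the", "cat"], ["the", "dog"], ["the", "cat", "sat"]]
stoplist = build_stoplist(docs)
print(stoplist)
print(remove_stopwords(["the", "cat", "sat"], stoplist))
```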

  6. Step 3: Stemming • Problem with exact matching: • A query word “connect” and its variants “connected”, “connecting”, “connects” appear in different documents • Stemming: reduce variants of the same root word to a common concept • Stemming also reduces the number of distinct index terms • The Porter Algorithm

  7. Stemming Approaches • Table lookup • Generation is complex • Final tables are often incomplete • Affix removal • Suffix vs. prefix (e.g. mega-volt) • Doesn’t always work, esp. not in German • Successor variety stemming • More complex than suffix removal • Uses (e.g.) linguistic approaches and techniques from morphology • N-grams • General clustering approach which can also be used for stemming
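
To make the n-gram approach concrete, here is a small sketch that scores two words by their shared character bigrams using the Dice coefficient. Dice over bigrams is one common choice and is an assumption here; the lecture does not name a specific similarity measure. Words whose score exceeds some threshold could then be clustered and treated as variants of one term.

```python
def bigrams(word):
    """Return the set of distinct adjacent character pairs in word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(a, b):
    """Dice coefficient over the distinct bigrams of two words."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0  # words too short to have any bigram
    return 2 * len(x & y) / (len(x) + len(y))

# "statistics" has 7 distinct bigrams, "statistical" has 8, and 6 are shared.
print(dice("statistics", "statistical"))
print(dice("connect", "police"))
```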

  8. Step 4: Selection of index terms • Full text representation vs. selected set of terms as index terms • Many distinct automatic approaches • The identification of noun groups (Inquery system) • Most of the semantics is carried by the noun words in a sentence • Combine nearby nouns into noun groups.

  9. Step 5: Construction of term categorization structures • A thesaurus • A standard vocabulary for indexing and searching • Relationships among indexed terms • Assist users with locating terms for proper query formulation • An example of an entry in Roget’s thesaurus • Cowardly adjective • Ignobly lacking in courage: cowardly turncoats • Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang).

  10. Thesauri • Indexed terms • Denote a concept, the basic semantic unit • Can be individual words, groups of words, or phrases • Terms are basically nouns • Terms can also be verbs in gerund form when they are used as nouns (teaching, acting, etc.) • Relationships • The set of terms related to an entry is mostly composed of synonyms or near-synonyms.

  11. The Use of Thesauri in IR • Selecting related terms in a thesaurus to reformulate a query when the initial query words are erroneous or improper • Unfortunately, this approach does not work well in general • Relationships captured in a thesaurus are often not valid in the local context of a given query • An alternative: determine thesaurus-like relationships at query time • Challenging for web search: can’t afford the effort for each individual query
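
A toy sketch of thesaurus-based query reformulation follows. The dictionary `THESAURUS`, the function `expand_query`, and the `max_related` cap are all hypothetical; the single entry reuses a few synonyms from the Roget's example on slide 9.

```python
# Hypothetical one-entry thesaurus: term -> related terms.
THESAURUS = {
    "cowardly": ["craven", "gutless", "faint-hearted"],
}

def expand_query(terms, thesaurus=THESAURUS, max_related=2):
    """Append up to max_related thesaurus terms after each query term."""
    expanded = []
    for term in terms:
        for candidate in [term] + thesaurus.get(term, [])[:max_related]:
            if candidate not in expanded:  # keep order, skip duplicates
                expanded.append(candidate)
    return expanded

print(expand_query(["cowardly", "soldier"]))
```

The sketch also shows the weakness noted on the slide: the added terms are chosen without looking at the rest of the query, so they may not fit its local context.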

  12. The Porter Algorithm • Special algorithm for the English language based on suffix removal • Five successive phases, applied to each word one after another • Example: remove plural ‘s’ and ‘sses’ • Rules: sses -> ss, s -> NIL (order matters: try the longer suffix first)
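
The plural-removal example can be sketched as an ordered rule list. The slide names only the sses and s rules; the two additional rules below (ies -> i and ss -> ss) come from step 1a of Porter's published algorithm and are included so that words such as "caress" are not truncated. The function name `step_1a` is an assumption.

```python
# Rules are tried in order, longest suffix first; only the first match applies.
STEP_1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),  # protects words ending in "ss" from the final rule
    ("s", ""),     # s -> NIL
]

def step_1a(word):
    """Apply the first matching plural-removal rule to a lowercase word."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step_1a(w))
```

If the s -> NIL rule were tried first, "caresses" would become "caresse", which is why the rule order must be obeyed.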

  13. Porter Algorithm • Conventions • C: consonant, V: vowel, L: consonant or vowel • Combinations of C, V, L define patterns • Operators “+” and “*” form complex patterns • *: zero or more repetitions of a given pattern: (V*C) • +: one or more repetitions of a given pattern: ((C)*((V)+(C)+)+(V)*) • Statements/commands • Rule-based statements • Single rule: if (*V*L) then ed -> NIL (remove “ed”) • Multiple rules: select the rule with the longest matching suffix { sses -> ss; ies -> i; ss -> ss; s -> NIL }
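
The consonant/vowel conventions can be illustrated with a short sketch. It follows Porter's published definitions, in which "y" counts as a vowel only when it follows a consonant, and a word's measure m is the number of vowel-run/consonant-run pairs in its collapsed C/V form. The helper names (`is_consonant`, `cv_form`, `measure`, `strip_ed`) are assumptions, and `strip_ed` covers only the single conditional rule quoted on the slide, not a full Porter step.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # "y" is a vowel only when preceded by a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def cv_form(word):
    """Collapse a lowercase word into alternating runs, e.g. 'trouble' -> 'CVCV'."""
    runs = []
    for i in range(len(word)):
        tag = "C" if is_consonant(word, i) else "V"
        if not runs or runs[-1] != tag:
            runs.append(tag)
    return "".join(runs)

def measure(word):
    """Porter's measure m: the number of VC pairs in the collapsed form."""
    return cv_form(word).count("VC")

def strip_ed(word):
    """Remove a trailing 'ed' only if the remaining stem contains a vowel."""
    if word.endswith("ed"):
        stem = word[:-2]
        if any(not is_consonant(stem, i) for i in range(len(stem))):
            return stem
    return word

print(cv_form("trouble"), measure("trouble"))
print(strip_ed("plastered"), strip_ed("bled"))
```

The vowel condition is what keeps "bled" intact: removing "ed" would leave "bl", which has no vowel.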

  14. Try Porter Algorithm • Played • Classes • Policy • Position • Capability • Active, actively, activity

  15. The Porter Algorithm: advantages & disadvantages • Advantage: a simple algorithm with good results • abate, abated, abatement, abatements, abates --> abat • Disadvantage: not always correct, e.g. • Same root for police – policy, execute – executive, … • Different roots for european – europe, search – searcher

  16. Next Lecture: • Compression. Ch. 7
