Intelligent Information Retrieval CS 336 –Lecture 3: Text Operations

Intelligent Information Retrieval CS 336 –Lecture 3: Text Operations Xiaoyan Li Spring 2006

Topics • Five-step document preprocessing • Porter stemming algorithm • Text compression

Five-Step Document Preprocessing • Lexical analysis of the text • How to treat digits, hyphens, punctuation marks, the case of letters • Elimination of stopwords • Words with low discrimination values • Stemming • Removing prefixes and suffixes • Selection of index terms • Determine which words/stems will be used as indexing elements • Construction of term categorization structures • For example, a thesaurus

Step 1: Lexical analysis of the text • Converting the text of a document (a large string or a stream of characters) to a stream of words • Word separators differ by language (e.g., English vs. Chinese) • How to deal with digits, punctuation marks, hyphens, and the case of letters
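A minimal sketch of Step 1 for English text. The function name `tokenize` and the `keep_digits` flag are illustrative, not from the lecture, and the policy shown (lowercase everything, keep internal hyphens, drop punctuation, optionally drop digits) is only one of several reasonable choices. It does not handle languages without explicit word separators, such as Chinese.

```python
import re

def tokenize(text, keep_digits=False):
    """Turn a raw character stream into a list of lowercase word tokens."""
    text = text.lower()  # case folding
    # A token is a run of letters (optionally digits); internal hyphens are kept.
    chars = "a-z0-9" if keep_digits else "a-z"
    pattern = rf"[{chars}]+(?:-[{chars}]+)*"
    # Punctuation and any other characters act as separators.
    return re.findall(pattern, text)

print(tokenize("State-of-the-art IR, since 2006!"))
print(tokenize("State-of-the-art IR, since 2006!", keep_digits=True))
```

Whether digits and hyphenated forms are kept is a per-collection decision; the flag only makes that decision explicit.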

Step 2: Elimination of stopwords • Frequent words in the collection • Not good discriminators • Filtered out as potential index terms • Elimination of stopwords reduces the size of the indexing structure considerably • By 40% or more • Examples • Articles, prepositions, conjunctions, etc. • Even some verbs, adverbs and adjectives
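Stopword elimination can be sketched as a simple set-membership filter. The tiny `STOPWORDS` set below is a made-up illustration; real systems use much longer lists, often derived from collection frequencies.

```python
# Illustrative stopword list (an assumption, not the list used in the course).
STOPWORDS = frozenset({"a", "an", "the", "of", "in", "on", "and", "or", "to", "is"})

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]

print(remove_stopwords(["the", "retrieval", "of", "information"]))
```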

Step 3: Stemming • Problem with exact matching: • A query word “connect” has multiple variants (“connected”, “connecting”, “connects”) in different documents • Stemming: Reduce variants of the same root word to a common concept • Stemming also reduces the number of distinct index terms • The Porter Algorithm

Stemming Approaches • Table lookup • Generation is complex • Final tables are often incomplete • Affix removal • Suffix vs. prefix (e.g. mega-volt) • Doesn’t always work, esp. not in German • Successor variety stemming • More complex than suffix removal • Uses (e.g.) linguistic approaches and techniques from morphology • N-grams • General clustering approach which can also be used for stemming
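To illustrate the n-gram approach, one common choice (an assumption here, since the slide does not name a measure) is to compare the sets of character bigrams of two words with the Dice coefficient and cluster words whose similarity is high. The helper names `bigrams` and `dice` are illustrative.

```python
def bigrams(word):
    """Set of distinct adjacent character pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(a, b):
    """Dice coefficient over the unique bigrams of two words (0.0 to 1.0)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))

# "statistics" has 7 unique bigrams, "statistical" has 8, and they share 6,
# so the similarity is 2*6 / (7+8) = 0.8.
print(dice("statistics", "statistical"))
```

Words whose pairwise similarity exceeds a chosen threshold would then be grouped and treated as variants of one term; the threshold itself has to be tuned per collection.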

Step 4: Selection of index terms • Full text representation vs. selected set of terms as index terms • Many distinct automatic approaches • The identification of noun groups (Inquery system) • Most of the semantics is carried by the noun words in a sentence • Combine nearby nouns into noun groups.
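The noun-group idea can be sketched as follows, assuming the text has already been part-of-speech tagged by some other component. The `(word, tag)` input format and the `"NOUN"` tag name are assumptions for illustration, and grouping only directly adjacent nouns is a simplification of "nearby"; this is not the Inquery implementation.

```python
def noun_groups(tagged_tokens):
    """Merge runs of consecutive nouns into single candidate index terms."""
    groups, current = [], []
    for word, tag in tagged_tokens:
        if tag == "NOUN":
            current.append(word)
        else:
            if current:  # a non-noun closes the current group
                groups.append(" ".join(current))
            current = []
    if current:  # flush a group that ends the sentence
        groups.append(" ".join(current))
    return groups

sentence = [("the", "DET"), ("computer", "NOUN"), ("science", "NOUN"),
            ("department", "NOUN"), ("offers", "VERB"), ("courses", "NOUN")]
print(noun_groups(sentence))
```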

Step 5: Construction of term categorization structures • A thesaurus • A standard vocabulary for indexing and searching • Relationships among indexed terms • Assist users with locating terms for proper query formulation • An example of an entry in Roget’s thesaurus • Cowardly adjective • Ignobly lacking in courage: cowardly turncoats • Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang).

Thesauri • Indexed terms • Denote a concept, the basic semantic unit • Can be individual words, groups of words, or phrases • Terms are basically nouns • Terms can also be verbs in gerund form whenever they are used as nouns (teaching, acting, etc.) • Relationships • The set of terms related to an entry is mostly composed of synonyms or near-synonyms.

The Use of Thesauri in IR • Selecting related terms in a thesaurus to reformulate a query when the initial query words are erroneous or improper • Unfortunately, this approach does not work well in general • Relationships captured in a thesaurus are often not valid in the local context of a given query • An alternative: determine thesaurus-like relationships at query time • Challenging for web search: can’t afford the effort for each individual query
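A rough sketch of thesaurus-based query reformulation, using a toy in-memory thesaurus. The `THESAURUS` dictionary and the function name `expand_query` are hypothetical; the synonyms are taken from the "cowardly" example entry.

```python
# Toy thesaurus: term -> related terms (illustrative only).
THESAURUS = {"cowardly": ["craven", "dastardly", "gutless"]}

def expand_query(terms, thesaurus=THESAURUS):
    """Append each query term's related terms, preserving order, no duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query(["cowardly", "soldier"]))
```

The sketch adds every related term regardless of the query's context, which is exactly why this kind of global expansion tends to perform poorly in general.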

The Porter Algorithm • Special algorithm for the English language based on suffix removal • Five distinct phases, applied to each word one after another • Example: Remove plural ‘s’ and ‘sses’ • Rules: sses -> ss, s -> NIL (rule order matters!)
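The two example rules and the importance of their order can be sketched like this. The function name `strip_plural` and the rule-list representation are illustrative; this covers only the two rules on the slide, not the full Porter stemmer.

```python
# Ordered rules: (suffix, replacement). The first matching rule wins.
PLURAL_RULES = [("sses", "ss"), ("s", "")]

def strip_plural(word, rules=PLURAL_RULES):
    """Apply the first rule whose suffix matches the end of the word."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word

print(strip_plural("classes"))   # matches "sses" first
print(strip_plural("connects"))  # falls through to "s"
# With the order reversed, "s" fires first and "classes" is stemmed wrongly.
print(strip_plural("classes", rules=list(reversed(PLURAL_RULES))))
```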

Porter Algorithm • Conventions • C: consonant, V: vowel, L: consonant or vowel • Combinations of C, V, L define patterns • Operators “+” and “*” form complex patterns • *: zero or more repetitions of a given pattern: (V*C) • +: one or more repetitions of a given pattern: (C)*((V)+(C)+)+(V)* • Statements/commands • Rule-based statements • Single rule: If (*V*L) then ed -> NIL (remove ed) • Multiple rules: • Select the rule with the longest matching suffix: { sses -> ss; ies -> i; ss -> ss; s -> NIL }
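A small sketch of the two kinds of rule statements: a rule set resolved by longest matching suffix, and a single conditional rule for "ed". Two simplifications are assumptions of this sketch: the condition is read as "the remaining stem contains a vowel", and only a, e, i, o, u count as vowels (the full algorithm also treats "y" as a vowel in some positions). The names `apply_longest` and `remove_ed` are illustrative.

```python
import re

# Rule set from the slide: suffix -> replacement.
STEP_RULES = {"sses": "ss", "ies": "i", "ss": "ss", "s": ""}

def apply_longest(word, rules=STEP_RULES):
    """Among the rules whose suffix matches, apply the one with the longest suffix."""
    for suffix in sorted(rules, key=len, reverse=True):
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + rules[suffix]
    return word

def remove_ed(word):
    """Strip a final 'ed' only if the remaining stem contains a vowel."""
    if word.endswith("ed"):
        stem = word[:-2]
        if re.search(r"[aeiou]", stem):
            return stem
    return word

for w in ["classes", "ponies", "caress", "cats"]:
    print(w, "->", apply_longest(w))
for w in ["played", "bed"]:
    print(w, "->", remove_ed(w))
```

The "ss -> ss" rule looks like a no-op, but it is what stops the shorter "s" rule from turning "caress" into "cares".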

Try Porter Algorithm • Played • Classes • Policy • Position • Capability • Active, actively, activity

The Porter Algorithm: advantages & disadvantages • Advantage: Easy algorithm with good results • abate, abated, abatement, abatements, abates --> abat • Disadvantage: Not always correct, e.g. • Same root for police – policy, execute – executive, … • Different root for european – europe, search – searcher, …

Next Lecture: • Compression. Ch. 7
