Terms and Query Operations. Information Retrieval: Data Structures and Algorithms by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992. Chapter 7 - 9. Lexical Analysis and Stoplists. Chapter 7. Lexical Analysis for Automatic Indexing.
association measures are calculated between pairs of terms
1. Construction of vocabulary
normalization and selection of terms
phrase construction depending on the coordination level desired
2. Similarity computations between terms
identify the significant statistical associations between terms
3. Organization of vocabulary
organize the selected vocabulary into a hierarchy on the basis
of the associations computed in step 2.
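Step 2 can be illustrated with a small sketch. The notes do not say which association measure is used, so the cosine measure over term-by-document frequency vectors below is an assumption chosen for illustration; the function names (term_vectors, cosine, pairwise_similarities) are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def term_vectors(docs):
    """Map each term to a Counter of {document index: term frequency}."""
    vectors = {}
    for doc_id, doc in enumerate(docs):
        for term in doc.lower().split():
            vectors.setdefault(term, Counter())[doc_id] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors (assumed measure, not from the notes)."""
    dot = sum(u[d] * v[d] for d in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pairwise_similarities(docs):
    """Compute an association value for every pair of vocabulary terms."""
    vectors = term_vectors(docs)
    return {
        (a, b): cosine(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }
```

Terms that always occur in the same documents score close to 1, and terms that never share a document score 0. Any other pairwise association measure could be swapped in for cosine without changing the overall shape of the step.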
COHESION(ti, tj) = size-factor * co-occurrence-frequency / (frequency(ti) * frequency(tj))
where size-factor is the size of the thesaurus vocabulary.
4. If cohesion is above a second threshold, retain the phrase.
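The cohesion formula can be worked through in a few lines of code. This is a minimal sketch under stated assumptions: frequency(t) is taken to be the number of documents containing t, co-occurrence-frequency is the number of documents containing both terms, and size-factor is the number of distinct terms. The notes do not define these counts precisely, and the function name cohesion_scores is hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohesion_scores(docs, threshold):
    """Return {(ti, tj): cohesion} for term pairs whose cohesion exceeds threshold."""
    term_freq = Counter()  # assumed: number of documents containing each term
    pair_freq = Counter()  # assumed: number of documents containing both terms
    for doc in docs:
        terms = sorted(set(doc.lower().split()))
        term_freq.update(terms)
        pair_freq.update(combinations(terms, 2))

    size_factor = len(term_freq)  # size of the vocabulary
    scores = {}
    for (ti, tj), co_occurrence in pair_freq.items():
        score = size_factor * co_occurrence / (term_freq[ti] * term_freq[tj])
        if score > threshold:  # retain the phrase only above the threshold
            scores[(ti, tj)] = score
    return scores
```

With the four documents "information retrieval", "information retrieval", "information theory", and "graph theory", the pair (graph, theory) scores 4 * 1 / (1 * 2) = 2.0, while (information, theory) scores 4 * 1 / (3 * 2), about 0.67, and would be dropped at a threshold of 1.0.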
1. Identify a set of frequency ranges.
2. Group the vocabulary terms into different classes based on
their frequencies and the ranges selected in step 1.
3. The highest frequency class is assigned level 0, the next, level
1, and so on.
4. Parent-child links are determined between adjacent levels as follows. For each term t in level i, compute the similarity between t and every term in level i-1. Term t becomes the child of the most similar term in level i-1. If more than one term in level i-1 qualifies for this, then each becomes a parent of t. In other words, a term is allowed to have multiple parents.
5. After all terms in level i have been linked to level i-1 terms,
check the level i-1 terms and identify those that have no children.
Propagate each such term to level i by creating an identical
“dummy” term as its child.
6. Perform steps 4 and 5 for each level starting with level 1.
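The six steps above can be sketched as a short routine. This is an illustrative sketch, not a definitive implementation: the frequency ranges are passed as descending lower bounds (one per level), the similarity function sim is supplied by the caller, terms below every bound are simply dropped, and linking is assumed to start at level 1 because level 0 has no level above it. The name build_hierarchy and its parameters are hypothetical.

```python
def build_hierarchy(freq, bounds, sim):
    """Build a frequency-layered term hierarchy.

    freq:   {term: frequency}
    bounds: descending lower frequency bounds, one per level (level 0 first)
    sim:    function (term, term) -> similarity value
    Returns (levels, parents), where parents maps (term, level) to the
    list of its parent terms in the level above.
    """
    # Steps 1-3: group terms into levels; highest frequency class is level 0.
    levels = [[] for _ in bounds]
    for term, f in sorted(freq.items()):
        for i, low in enumerate(bounds):
            if f >= low:
                levels[i].append(term)
                break

    parents = {}
    # Step 6: repeat steps 4 and 5 for each level, starting at level 1.
    for i in range(1, len(levels)):
        has_child = set()
        # Step 4: link each term to its most similar term(s) one level up.
        for t in levels[i]:
            scores = {p: sim(t, p) for p in levels[i - 1]}
            if not scores:
                continue
            best = max(scores.values())
            chosen = [p for p, s in scores.items() if s == best]  # ties: multiple parents
            parents[(t, i)] = chosen
            has_child.update(chosen)
        # Step 5: propagate childless terms downward as "dummy" copies.
        for p in list(levels[i - 1]):
            if p not in has_child:
                levels[i].append(p)
                parents[(p, i)] = [p]
    return levels, parents
```

A dummy copy keeps the same term name as its parent, so it can still attract children at the next level down; its similarity to other terms is the same as the original term's.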