CS533 Information Retrieval Dr. Michal Cutler Lecture #22 April 21, 1999
This lecture • Concept-based retrieval • Phrase-based thesaurus • Phrasefinder
Concept-based retrieval • Concept indexing • Concept dictionary • Concept-based ranking functions
Concept indexing • The index should contain concepts • Concepts are usually phrases • Indexing requires: • a phrase dictionary and • phrase recognition procedures
Concept indexing • Question: should the occurrence of individual phrase words be indexed as well as the phrase? • In Smart, the occurrence of the phrase “neural networks” in a document is added to the inverted lists for “neural network”, “neural” and “networks”
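A minimal sketch of this double indexing, assuming a simple in-memory inverted index keyed by term and document id (not Smart's actual data structures):

```python
from collections import defaultdict

# Hypothetical in-memory inverted index: term -> {doc_id: frequency}
inverted = defaultdict(lambda: defaultdict(int))

def index_phrase(doc_id, phrase_terms):
    """Add one phrase occurrence to the posting list of the phrase itself
    and to the posting lists of each of its component words."""
    phrase_key = " ".join(phrase_terms)
    inverted[phrase_key][doc_id] += 1      # the phrase concept
    for term in phrase_terms:              # ...and its single words
        inverted[term][doc_id] += 1

index_phrase("doc42", ["neural", "networks"])
print(dict(inverted["neural networks"]), dict(inverted["neural"]))
```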
Dictionaries • Free on-line dictionary of computing (FOLDOC) http://wombat.doc.ic.ac.uk/ • Computing concepts • Webster http://c.gp.cs.cmu.edu:5103/prog/webster • word based
Generating a phrase dictionary • Dictionaries are usually word based • Phrases are often domain specific • New concepts/collocational expressions are constantly being coined
Collocational expressions: sequences of words whose exact and unambiguous meaning cannot be derived directly from the meanings of their components
Collocational expressions • “foreign minister” - does it mean a clergyman who is foreign? • “pencil sharpener” - a person whose profession is to sharpen items, and who specializes in pencils? • “high school” - a school with a tall building? • “abortion rights”?
New Collocations • Some short lived, others may have a longer life: • “Iran-contra”, “million man march” • “information highway”, • “world wide web” and • “mobile computing” • Collocations are easier to recognize than general phrases
Phrase recognition • The phrase “text analysis system” may appear as: • system analyzes the text • text is analyzed by the system • system carries out text analysis • text is subject to analysis by the system • text is subjected to system analysis
Phrase recognition • The phrase may also use synonyms such as • “document” and • “information item” instead of • “text”, etc.
Phrase recognition • Inverted indexes with position information provide limited phrase recognition • Recognizing all occurrences requires natural language understanding techniques
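To make the limitation concrete, here is a small sketch of phrase lookup over a hypothetical positional index; it finds only literal adjacent occurrences, which is why the paraphrases and synonyms above are missed:

```python
def adjacent_phrase_docs(postings, w1, w2):
    """Return documents in which w2 occurs immediately after w1, using a
    positional index: postings[word][doc] -> sorted list of positions.
    Only literal adjacency is caught, not the paraphrases listed above."""
    result = []
    for doc in set(postings.get(w1, {})) & set(postings.get(w2, {})):
        first_positions = set(postings[w1][doc])
        if any(p - 1 in first_positions for p in postings[w2][doc]):
            result.append(doc)
    return result

postings = {
    "text":     {"d1": [3, 17], "d2": [5]},
    "analysis": {"d1": [4],     "d2": [40]},
}
print(adjacent_phrase_docs(postings, "text", "analysis"))   # ['d1']
```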
Query/document similarity • A weight must be assigned to concepts (in the query and in the documents) • The similarity function should be based on concept similarity
Similarity in Smart • Smart uses tf×idf weights for phrases as well as for single words • Phrases are added to the dictionary • Similarity = “word similarity” + “phrase similarity”/2 • Normalization is based only on words
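A rough sketch of the combined score, assuming sparse tf×idf vectors stored as dictionaries; Smart's exact weighting and normalization details are not reproduced here:

```python
def inner_product(query_vec, doc_vec):
    """Inner product of two sparse tf*idf vectors stored as term -> weight dicts."""
    return sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())

def smart_similarity(q_words, d_words, q_phrases, d_phrases):
    """Word similarity plus half the phrase similarity, as on the slide.
    The word-only normalization mentioned above is omitted."""
    return inner_product(q_words, d_words) + inner_product(q_phrases, d_phrases) / 2

q_words  = {"neural": 0.4, "network": 0.7}
d_words  = {"neural": 0.3, "network": 0.5, "training": 0.2}
q_phrase = {"neural network": 0.9}
d_phrase = {"neural network": 0.6}
print(smart_similarity(q_words, d_words, q_phrase, d_phrase))   # 0.47 + 0.27 = 0.74
```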
Phrase generation - for the phrase dictionary • Methods: • Co-occurrence information • Syntactic analysis based on a dictionary • Part-of-speech (POS) tagging + syntactic phrases • the phrases are syntactic units
Co-occurrence information • Not useful for small collections • When word sequences co-occur only a few times, it is difficult to differentiate between phrases and chance co-occurrences • More refined methods are needed to select good phrases
Co-occurrence information • Works well with very large databases (TREC) • When sequences of words co-occur near each other many times, the co-occurrence is usually meaningful
Smart’s phrase generation • Takes every pair of consecutive index terms, in each document, and counts how many times they co-occurred in the whole collection. • Word order is ignored • “inform(ation) retriev(al)” and “retriev(al of) inform(ation)” are considered equal • Are “blind Venetian” and “Venetian blind” equal?
Smart’s phrase generation • For the TREC collection a co-occurrence count of 25+ indicated a phrase • Process takes 5.8 hours, for 800,000 full text items • 4,700,000 phrases are generated, 158,000 phrases appearing 25+ times remain
Smart’s phrase generation • For pairs that qualify, compute the cohesion value • The size-factor is related to thesaurus size • Retain a pair if its cohesion is above a threshold
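A sketch of the pair counting and cohesion filtering described on the last three slides; the cohesion form shown here (size-factor × co-occurrences / (f(x)·f(y))) and the thresholds are illustrative assumptions, not Smart's exact values:

```python
from collections import Counter
from itertools import pairwise   # Python 3.10+

def count_pairs(documents):
    """Count unordered pairs of consecutive index terms over the whole collection."""
    pair_counts, term_counts = Counter(), Counter()
    for terms in documents:                           # terms: stemmed, stopped tokens
        term_counts.update(terms)
        for a, b in pairwise(terms):
            pair_counts[tuple(sorted((a, b)))] += 1   # word order is ignored
    return pair_counts, term_counts

def cohesion(pair, pair_counts, term_counts, size_factor=1.0):
    """Assumed cohesion form: size_factor * co-occurrences / (f(x) * f(y))."""
    x, y = pair
    return size_factor * pair_counts[pair] / (term_counts[x] * term_counts[y])

docs = [["inform", "retriev", "system"], ["retriev", "inform", "fast"]]
pairs, terms = count_pairs(docs)
kept = {p: c for p, c in pairs.items()
        if c >= 2 and cohesion(p, pairs, terms) > 0.1}   # thresholds are illustrative
print(kept)   # {('inform', 'retriev'): 2}
```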
Collocation identification • Co-occurrence information used in NLP to find associations between terms • Needed in both natural language understanding and language generation: • “strong tea”, and “powerful car” • not “powerful tea” and “strong car”
Mutual information statistic • I(x;y) = -log2 P(x) - (-log2 P(x|y)) = log2(P(x|y)/P(x)) = log2(P(x,y)/(P(x)P(y))) • P(x), P(y) and P(x,y) are the probabilities of x, y, and “x and y”
Mutual information statistic • If x is associated with y, P(x,y) will be greater than the chance value P(x)P(y), and therefore I(x;y)>0 • Example: if pancake is associated with syrup, P(pancake, syrup) > P(pancake)P(syrup), so I(pancake; syrup) > 0
Mutual information statistic • Extreme case 1: pancake is always followed by syrup • Then P(pancake, syrup) = P(pancake), so I(pancake; syrup) = log2(1/P(syrup)) = -log2 P(syrup), a large positive value when syrup is rare
Mutual information statistic • Extreme case 2: no association between x and y, P(x|y)=P(x) and I(x;y) = log2(1) = 0
Mutual information statistic • There is a negative association between x and y, when P(x|y)<P(x) and I(x;y)<0
Mutual information estimation • Estimate using occurrence information in a corpus: • N = number of words in the corpus • f(x) = number of occurrences of x • f(y) = number of occurrences of y • f(x,y) = number of occurrences of x followed by y
Mutual information estimation • P(x)=f(x)/N • P(y)=f(y)/N • P(x,y)=f(x,y)/N • With this estimate P(x,y)=0 when f(x,y)=0 • I(x;y) = log2(N·f(x,y)/(f(x)·f(y)))
Mutual information estimation • Note: even when “x followed by y” does not appear in the N words of the corpus, there may be some probability of “x followed by y” occurring in the language • (In experiments, pairs with I > 16 were noun phrases)
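A small sketch of the estimate above; the counts in the example are invented, and pairs that never co-occur are simply given -∞ rather than a smoothed value:

```python
import math

def mutual_information(f_xy, f_x, f_y, N):
    """I(x;y) = log2( N * f(x,y) / (f(x) * f(y)) ); pairs never seen together
    get -infinity here rather than a smoothed estimate."""
    if f_xy == 0:
        return float("-inf")
    return math.log2(N * f_xy / (f_x * f_y))

# Invented counts for "strong" followed by "tea" in a 10-million-word corpus
I = mutual_information(f_xy=20, f_x=50, f_y=40, N=10_000_000)
print(I)   # ~16.6, above the I > 16 noun-phrase threshold mentioned above
```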
Syntactic analysis • Eliminate sequences such as “adverb-adjective” or “adverb-noun” that are generated when using co-occurrence information alone
Syntactic analysis • Used by Cornell group, before part-of-speech taggers became available. • A parser was used to generate subject, object and verb phrases • The index phrases were chosen from within these syntactic phrases
Syntactic analysis • Problem: a sentence can have many possible derivations (a sentence of 10 words can have 100 derivations) • The syntactic parser cannot resolve all ambiguities • Phrases formed using the parser did not provide better retrieval than those found using co-occurrence
POS tagging is a hard problem • Words may have more than one possible part of speech • The word cook can be a noun or a verb • The word dish has a sense as a noun and a sense as a verb • The word still has senses as a noun, verb, adjective and adverb
Part-of-speech-tagging • Human readers are able to assign part-of-speech tags to words in a sentence • Human readers use the context provided by the sentence to determine a word’s tag
Part-of-speech-tagging • POS tagging algorithms are based on a similar idea • A word and its immediate neighbors are considered • The algorithm computes the most probable part-of-speech tag using statistical probabilities derived from a tagged subcollection
A simple tagging solution • Estimate the probability that word W belongs to POS category C by computing the maximum of P(C|W) over all possible categories for W • W=spring; corpus of 1,273,000 words, with 1,000 occurrences of spring: 600 as a noun N, 400 as a verb V • P(N|spring)=0.6 > P(V|spring)=0.4, so conclude a new occurrence of spring is a noun
A simple tagging solution • Using the maximum likelihood estimator gives about 90% accuracy • Very high probability that a sentence with 10 words will have at least one POS tag error (1 - 0.9^10 ≈ 65%)
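A minimal sketch of this most-frequent-tag baseline; the tiny tagged corpus below is made up:

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sentences):
    """For each word, remember the tag it received most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

# Made-up training data: "spring" is tagged N more often than V
train = [[("spring", "N")], [("spring", "N")], [("spring", "V")]]
tagger = train_unigram_tagger(train)
print(tagger["spring"])   # 'N' -- every new occurrence gets the majority tag
```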
A POS tagger that uses local context • Given sentence w1,…,wT, find C1,…,CT that maximizes P(C1,…,CT | w1,…,wT) • Using Bayes rule: P(C1,…,CT | w1,…,wT) = P(w1,…,wT | C1,…,CT) P(C1,…,CT) / P(w1,…,wT)
Assumptions • We can discard the denominator (it does not affect the maximum) and maximize P(w1,…,wT | C1,…,CT) P(C1,…,CT) • Approximate P(C1,…,CT) using an n-gram model • bigram: P(C1,…,CT) ≈ ∏i P(Ci | Ci-1) • trigram: P(C1,…,CT) ≈ ∏i P(Ci | Ci-2, Ci-1)
Example • Assuming the bigram model: • Assume that there are 558 occurrences of the sequence V N and 1,000 occurrences of V; then P(N|V)=558/1000
Assumptions • Approximate P(w1,…,wT | C1,…,CT) ≈ ∏i P(wi | Ci) • Example: P(the | ART) = (# times the is tagged ART)/(# times ART occurs) = 300/600 = 1/2
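A sketch of estimating both probability tables from a tagged corpus by maximum likelihood; smoothing of unseen events is omitted and the toy corpus is invented:

```python
from collections import Counter

def estimate_probabilities(tagged_sentences):
    """Maximum-likelihood estimates from a tagged corpus (no smoothing):
    trans[(prev_tag, tag)] = P(tag | prev_tag), lex[(word, tag)] = P(word | tag)."""
    tag_count, bigram_count, word_tag_count = Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        prev = "<start>"
        for word, tag in sentence:
            bigram_count[(prev, tag)] += 1
            word_tag_count[(word, tag)] += 1
            tag_count[tag] += 1
            prev = tag
    n_sentences = len(tagged_sentences)
    trans = {(p, t): c / (n_sentences if p == "<start>" else tag_count[p])
             for (p, t), c in bigram_count.items()}
    lex = {(w, t): c / tag_count[t] for (w, t), c in word_tag_count.items()}
    return trans, lex

corpus = [[("the", "ART"), ("dog", "N"), ("barks", "V")],
          [("the", "ART"), ("cook", "N"), ("cooks", "V")]]
trans, lex = estimate_probabilities(corpus)
print(trans[("ART", "N")], lex[("the", "ART")])   # 1.0 1.0
```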
Formula for Bigram approximation • Compute the maximum over C1,…,CT of ∏i=1..T P(Ci | Ci-1) P(wi | Ci)
Viterbi Algorithm for bigrams • Given a word sequence w1,…,wT and a list of lexical categories L1,…,LN • The lexical probabilities P(wt | Li) • The bigram probabilities P(Lk | Lj) • Determine the most likely sequence of categories C1,…,CT for the word sequence
Viterbi • The algorithm computes the maximum probability SCORE(i,t) for i=1,…,N and t=1,…,T:
SCORE(i,t) = max{ P(C1,…,Ct-1=Lk, Ct=Li) * P(w1,…,wt | C1,…,Ct-1=Lk, Ct=Li) | k=1,…,N }
= max{ P(C1,…,Ct-1=Lk) * P(w1,…,wt-1 | C1,…,Ct-1=Lk) * P(Li|Lk) * P(wt|Li) | k=1,…,N }
= max{ SCORE(k,t-1) * P(Li|Lk) * P(wt|Li) | k=1,…,N }
Viterbi • BPTR(i,t) saves the k that gave the maximum • To derive the categories: C(T) = the k that maximizes SCORE(k,T); for t = T-1 down to 1 do C(t) = BPTR(C(t+1), t+1)
Viterbi
for i = 1 to N do                         // initialize column 1
    SCORE(i,1) = P(Li | start) * P(w1 | Li)
    BPTR(i,1) = 0
for t = 2 to T do                         // columns 2 to T
    for i = 1 to N do
        SCORE(i,t) = max{ SCORE(k,t-1) * P(Li|Lk) | k=1,…,N } * P(wt|Li)
        BPTR(i,t) = the index k that gave the maximum
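A runnable sketch of the algorithm above; the tag set, transition and lexical probabilities in the example are toy values, and unseen events default to probability 0:

```python
def viterbi(words, tags, trans, lex, start="<start>"):
    """Most likely tag sequence under the bigram model.
    trans[(prev_tag, tag)] = P(tag | prev_tag), lex[(word, tag)] = P(word | tag);
    unseen events default to probability 0."""
    T, N = len(words), len(tags)
    score = [[0.0] * N for _ in range(T)]
    bptr = [[0] * N for _ in range(T)]
    for i, tag in enumerate(tags):                       # column 1
        score[0][i] = trans.get((start, tag), 0.0) * lex.get((words[0], tag), 0.0)
    for t in range(1, T):                                # columns 2 to T
        for i, tag in enumerate(tags):
            best_k, best = 0, -1.0
            for k, prev in enumerate(tags):
                s = score[t - 1][k] * trans.get((prev, tag), 0.0)
                if s > best:
                    best_k, best = k, s
            score[t][i] = best * lex.get((words[t], tag), 0.0)
            bptr[t][i] = best_k
    # back-trace the highest-scoring path
    i = max(range(N), key=lambda j: score[T - 1][j])
    path = [i]
    for t in range(T - 1, 0, -1):
        i = bptr[t][i]
        path.append(i)
    return [tags[i] for i in reversed(path)]

tags = ["ART", "N", "V"]
trans = {("<start>", "ART"): 0.7, ("ART", "N"): 0.8, ("N", "V"): 0.6, ("N", "N"): 0.2}
lex = {("the", "ART"): 0.5, ("cook", "N"): 0.3, ("cook", "V"): 0.1,
       ("cooks", "V"): 0.2, ("cooks", "N"): 0.05}
print(viterbi(["the", "cook", "cooks"], tags, trans, lex))   # ['ART', 'N', 'V']
```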
Croft’s - Phrasefinder • Phrasefinder deals successfully with some of the drawbacks of corpus-based thesauri • Uses “paragraphs” instead of whole documents for computing co-occurrences of terms
Croft’s - Phrasefinder • Creates good multi-word phrases • Includes phrases in the thesaurus • Multi-word phrases are much more specific than single words and tend to have a single meaning