CS533 Information Retrieval Dr. Michal Cutler Lecture #22 April 21, 1999
This lecture • Concept-based retrieval • Phrase-based thesaurus • Phrasefinder
Concept-based retrieval • Concept indexing • Concept dictionary • Concept-based ranking functions
Concept indexing • The index should contain concepts • Concepts are usually phrases • Indexing requires: • a phrase dictionary and • phrase recognition procedures
Concept indexing • Question: should the occurrence of individual phrase words be indexed as well as the phrase? • In Smart, the occurrence of the phrase “neural networks” in a document is added to the inverted lists for “neural network”, “neural” and “networks”
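A minimal sketch of this double indexing, assuming a simple in-memory inverted index keyed by term and document id (not Smart's actual data structures):

```python
from collections import defaultdict

# Hypothetical in-memory inverted index: term -> {doc_id: frequency}
inverted = defaultdict(lambda: defaultdict(int))

def index_phrase(doc_id, phrase_terms):
    """Add one phrase occurrence to the posting list of the phrase itself
    and to the posting lists of each of its component words."""
    phrase_key = " ".join(phrase_terms)
    inverted[phrase_key][doc_id] += 1      # the phrase concept
    for term in phrase_terms:              # ...and its single words
        inverted[term][doc_id] += 1

index_phrase("doc42", ["neural", "networks"])
print(dict(inverted["neural networks"]), dict(inverted["neural"]))
```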
Dictionaries • Free on-line dictionary of computing (FOLDOC) http://wombat.doc.ic.ac.uk/ • Computing concepts • Webster http://c.gp.cs.cmu.edu:5103/prog/webster • word based
Generating a phrase dictionary • Dictionaries are usually word based • Phrases are often domain specific • New concepts/collocational expressions are constantly being coined
Collocational expressions: sequences of words whose exact and unambiguous meaning cannot be derived directly from the meanings of their components
Collocational expressions • “foreign minister” - does it mean a clergyman who is foreign? • “pencil sharpener” - a person whose profession is to sharpen items, and who specializes in pencils? • “high school” - a school with a tall building? • “abortion rights”?
New Collocations • Some short lived, others may have a longer life: • “Iran-contra”, “million man march” • “information highway”, • “world wide web” and • “mobile computing” • Collocations are easier to recognize than general phrases
Phrase recognition • The phrase “text analysis system” may appear as: • system analyzes the text • text is analyzed by the system • system carries out text analysis • text is subject to analysis by the system • text is subjected to system analysis
Phrase recognition • The phrase may also use synonyms such as • “document” and • “information item” instead of • “text”, etc.
Phrase recognition • Inverted indexes with position information provide limited phrase recognition • Recognizing all occurrences requires natural language understanding techniques
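To make the limitation concrete, here is a small sketch of phrase lookup over a hypothetical positional index; it finds only literal adjacent occurrences, which is why the paraphrases and synonyms above are missed:

```python
def adjacent_phrase_docs(postings, w1, w2):
    """Return documents in which w2 occurs immediately after w1, using a
    positional index: postings[word][doc] -> sorted list of positions.
    Only literal adjacency is caught, not the paraphrases listed above."""
    result = []
    for doc in set(postings.get(w1, {})) & set(postings.get(w2, {})):
        first_positions = set(postings[w1][doc])
        if any(p - 1 in first_positions for p in postings[w2][doc]):
            result.append(doc)
    return result

postings = {
    "text":     {"d1": [3, 17], "d2": [5]},
    "analysis": {"d1": [4],     "d2": [40]},
}
print(adjacent_phrase_docs(postings, "text", "analysis"))   # ['d1']
```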
Query/document similarity • A weight must be assigned to concepts (in the query and in the documents) • The similarity function should be based on concept similarity
Similarity in Smart • Smart uses tf×idf weights for phrases as well as for single words • Phrases are added to the dictionary • Similarity = “word similarity” + “phrase similarity”/2 • Normalization is based only on words
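A rough sketch of the combined score, assuming sparse tf×idf vectors stored as dictionaries; Smart's exact weighting and normalization details are not reproduced here:

```python
def inner_product(query_vec, doc_vec):
    """Inner product of two sparse tf*idf vectors stored as term -> weight dicts."""
    return sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())

def smart_similarity(q_words, d_words, q_phrases, d_phrases):
    """Word similarity plus half the phrase similarity, as on the slide.
    The word-only normalization mentioned above is omitted."""
    return inner_product(q_words, d_words) + inner_product(q_phrases, d_phrases) / 2

q_words  = {"neural": 0.4, "network": 0.7}
d_words  = {"neural": 0.3, "network": 0.5, "training": 0.2}
q_phrase = {"neural network": 0.9}
d_phrase = {"neural network": 0.6}
print(smart_similarity(q_words, d_words, q_phrase, d_phrase))   # 0.47 + 0.27 = 0.74
```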
Phrase generation - for the phrase dictionary • Methods: • Co-occurrence information • Syntactic analysis based on a dictionary • Part-of-speech (POS) tagging + syntactic phrases • the phrases are syntactic units
Co-occurrence information • Not useful for small collections • When word sequences co-occur only a few times, it is difficult to differentiate between phrases and chance co-occurrences • More refined methods are needed to select good phrases
Co-occurrence information • Works well with very large databases (TREC) • When sequences of words co-occur near each other many times, the co-occurrence is usually meaningful
Smart’s phrase generation • Takes every pair of consecutive index terms, in each document, and counts how many times they co-occurred in the whole collection. • Word order is ignored • “inform(ation) retriev(al)” and “retriev(al of) inform(ation)” are considered equal • Are “blind Venetian” and “Venetian blind” equal?
Smart’s phrase generation • For the TREC collection a co-occurrence count of 25+ indicated a phrase • Process takes 5.8 hours, for 800,000 full text items • 4,700,000 phrases are generated, 158,000 phrases appearing 25+ times remain
Smart’s phrase generation • For pairs that qualify, compute the cohesion value • The size-factor is related to thesaurus size • Retain a pair if its cohesion is above a threshold
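A sketch of the pair counting and cohesion filtering described on the last three slides; the cohesion form shown here (size-factor × co-occurrences / (f(x)·f(y))) and the thresholds are illustrative assumptions, not Smart's exact values:

```python
from collections import Counter
from itertools import pairwise   # Python 3.10+

def count_pairs(documents):
    """Count unordered pairs of consecutive index terms over the whole collection."""
    pair_counts, term_counts = Counter(), Counter()
    for terms in documents:                           # terms: stemmed, stopped tokens
        term_counts.update(terms)
        for a, b in pairwise(terms):
            pair_counts[tuple(sorted((a, b)))] += 1   # word order is ignored
    return pair_counts, term_counts

def cohesion(pair, pair_counts, term_counts, size_factor=1.0):
    """Assumed cohesion form: size_factor * co-occurrences / (f(x) * f(y))."""
    x, y = pair
    return size_factor * pair_counts[pair] / (term_counts[x] * term_counts[y])

docs = [["inform", "retriev", "system"], ["retriev", "inform", "fast"]]
pairs, terms = count_pairs(docs)
kept = {p: c for p, c in pairs.items()
        if c >= 2 and cohesion(p, pairs, terms) > 0.1}   # thresholds are illustrative
print(kept)   # {('inform', 'retriev'): 2}
```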
Collocation identification • Co-occurrence information used in NLP to find associations between terms • Needed in both natural language understanding and language generation: • “strong tea”, and “powerful car” • not “powerful tea” and “strong car”
Mutual information statistic • I(x;y) = -log2 P(x) - (-log2 P(x|y)) = log2(P(x|y)/P(x)) = log2(P(x,y)/(P(x)P(y))) • P(x), P(y) and P(x,y) are the probabilities of x, y, and “x and y”
Mutual information statistic • If x is associated with y, P(x,y) will be greater than the chance value P(x)P(y), and therefore I(x;y)>0 • Example: if pancake is associated with syrup, P(pancake, syrup) > P(pancake)P(syrup), so I(pancake; syrup) > 0
Mutual information statistic • Extreme case 1: pancake is always followed by syrup • Then P(pancake, syrup) = P(pancake), so I(pancake; syrup) = log2(1/P(syrup)) = -log2 P(syrup), a large positive value when syrup is rare
Mutual information statistic • Extreme case 2: no association between x and y, P(x|y)=P(x) and I(x;y) = log2(1) = 0
Mutual information statistic • There is a negative association between x and y, when P(x|y)<P(x) and I(x;y)<0
Mutual information estimation • Estimate using occurrence information in a corpus: • N = number of words in the corpus • f(x) = number of occurrences of x • f(y) = number of occurrences of y • f(x,y) = number of occurrences of x followed by y
Mutual information estimation • P(x)=f(x)/N • P(y)=f(y)/N • P(x,y)=f(x,y)/N • With this estimate P(x,y)=0 when f(x,y)=0 • I(x;y) = log2(N·f(x,y)/(f(x)·f(y)))
Mutual information estimation • Note: even when “x followed by y” does not appear in the N words of the corpus, there may be some probability of “x followed by y” occurring in the language • (In experiments, pairs with I > 16 were noun phrases)
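A small sketch of the estimate above; the counts in the example are invented, and pairs that never co-occur are simply given -∞ rather than a smoothed value:

```python
import math

def mutual_information(f_xy, f_x, f_y, N):
    """I(x;y) = log2( N * f(x,y) / (f(x) * f(y)) ); pairs never seen together
    get -infinity here rather than a smoothed estimate."""
    if f_xy == 0:
        return float("-inf")
    return math.log2(N * f_xy / (f_x * f_y))

# Invented counts for "strong" followed by "tea" in a 10-million-word corpus
I = mutual_information(f_xy=20, f_x=50, f_y=40, N=10_000_000)
print(I)   # ~16.6, above the I > 16 noun-phrase threshold mentioned above
```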
Syntactic analysis • Eliminate sequences such as “adverb-adjective” or “adverb-noun” that are generated when using co-occurrence information alone
Syntactic analysis • Used by Cornell group, before part-of-speech taggers became available. • A parser was used to generate subject, object and verb phrases • The index phrases were chosen from within these syntactic phrases
Syntactic analysis • Problem: a sentence can have many possible derivations (a sentence of 10 words can have 100 derivations) • The syntactic parser cannot resolve all ambiguities • Phrases formed using the parser did not provide better retrieval than those found using co-occurrence
POS tagging is a hard problem • Words may have more than one possible part of speech • The word cook can be a noun or a verb • The word dish has a sense as a noun and a sense as a verb • The word still has senses as a noun, verb, adjective and adverb
Part-of-speech-tagging • Human readers are able to assign part-of-speech tags to words in a sentence • Human readers use the context provided by the sentence to determine a word’s tag
Part-of-speech-tagging • POS tagging algorithms are based on a similar idea • A word and its immediate neighbors are considered • The algorithm computes the most probable part-of-speech tag using statistical probabilities derived from a tagged subcollection
A simple tagging solution • Estimate the probability that word W belongs to POS category C by computing the maximum of P(C|W) over all possible categories for W • W=spring; corpus of 1,273,000 words, with 1,000 occurrences of spring: 600 as a noun N, 400 as a verb V • P(N|spring)=0.6 > P(V|spring)=0.4, so conclude a new occurrence of spring is a noun
A simple tagging solution • Using the maximum likelihood estimator gives about 90% accuracy • Very high probability that a sentence with 10 words will have at least one POS tag error (1 - 0.9^10 ≈ 65%)
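A minimal sketch of this most-frequent-tag baseline; the tiny tagged corpus below is made up:

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sentences):
    """For each word, remember the tag it received most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

# Made-up training data: "spring" is tagged N more often than V
train = [[("spring", "N")], [("spring", "N")], [("spring", "V")]]
tagger = train_unigram_tagger(train)
print(tagger["spring"])   # 'N' -- every new occurrence gets the majority tag
```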
A POS tagger that uses local context • Given sentence w1,…,wT, find C1,…,CT that maximizes P(C1,…,CT | w1,…,wT) • Using Bayes rule: P(C1,…,CT | w1,…,wT) = P(w1,…,wT | C1,…,CT) P(C1,…,CT) / P(w1,…,wT)
Assumptions • We can discard the denominator (it does not affect the maximum) and maximize P(w1,…,wT | C1,…,CT) P(C1,…,CT) • Approximate P(C1,…,CT) using an n-gram model • bigram: P(C1,…,CT) ≈ ∏i P(Ci | Ci-1) • trigram: P(C1,…,CT) ≈ ∏i P(Ci | Ci-2, Ci-1)
Example • Assuming the bigram model: • Assume that there are 558 occurrences of the sequence V N and 1,000 occurrences of V; then P(N|V)=558/1000
Assumptions • Approximate P(w1,…,wT | C1,…,CT) ≈ ∏i P(wi | Ci) • Example: P(the | ART) = (# times the is tagged ART)/(# times ART occurs) = 300/600 = 1/2
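A sketch of estimating both probability tables from a tagged corpus by maximum likelihood; smoothing of unseen events is omitted and the toy corpus is invented:

```python
from collections import Counter

def estimate_probabilities(tagged_sentences):
    """Maximum-likelihood estimates from a tagged corpus (no smoothing):
    trans[(prev_tag, tag)] = P(tag | prev_tag), lex[(word, tag)] = P(word | tag)."""
    tag_count, bigram_count, word_tag_count = Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        prev = "<start>"
        for word, tag in sentence:
            bigram_count[(prev, tag)] += 1
            word_tag_count[(word, tag)] += 1
            tag_count[tag] += 1
            prev = tag
    n_sentences = len(tagged_sentences)
    trans = {(p, t): c / (n_sentences if p == "<start>" else tag_count[p])
             for (p, t), c in bigram_count.items()}
    lex = {(w, t): c / tag_count[t] for (w, t), c in word_tag_count.items()}
    return trans, lex

corpus = [[("the", "ART"), ("dog", "N"), ("barks", "V")],
          [("the", "ART"), ("cook", "N"), ("cooks", "V")]]
trans, lex = estimate_probabilities(corpus)
print(trans[("ART", "N")], lex[("the", "ART")])   # 1.0 1.0
```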
Formula for Bigram approximation • Compute the maximum over C1,…,CT of ∏i=1..T P(Ci | Ci-1) P(wi | Ci)
Viterbi Algorithm for bigrams • Given a word sequence w1,…,wT and a list of lexical categories L1,…,LN • The lexical probabilities P(wt | Li) • The bigram probabilities P(Lk | Lj) • Determine the most likely sequence of categories C1,…,CT for the word sequence
Viterbi • The algorithm computes the maximum probability SCORE(i,t) for i=1,…,N and t=1,…,T:
SCORE(i,t) = max{ P(C1,…,Ct-1=Lk, Ct=Li) * P(w1,…,wt | C1,…,Ct-1=Lk, Ct=Li) | k=1,…,N }
= max{ P(C1,…,Ct-1=Lk) * P(w1,…,wt-1 | C1,…,Ct-1=Lk) * P(Li|Lk) * P(wt|Li) | k=1,…,N }
= max{ SCORE(k,t-1) * P(Li|Lk) * P(wt|Li) | k=1,…,N }
Viterbi • BPTR(i,t) saves the k that gave the maximum • To derive the categories: C(T) = the k that maximizes SCORE(k,T); for t = T-1 down to 1 do C(t) = BPTR(C(t+1), t+1)
Viterbi
for i = 1 to N do                         // initialize column 1
    SCORE(i,1) = P(Li | start) * P(w1 | Li)
    BPTR(i,1) = 0
for t = 2 to T do                         // columns 2 to T
    for i = 1 to N do
        SCORE(i,t) = max{ SCORE(k,t-1) * P(Li|Lk) | k=1,…,N } * P(wt|Li)
        BPTR(i,t) = the index k that gave the maximum
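A runnable sketch of the algorithm above; the tag set, transition and lexical probabilities in the example are toy values, and unseen events default to probability 0:

```python
def viterbi(words, tags, trans, lex, start="<start>"):
    """Most likely tag sequence under the bigram model.
    trans[(prev_tag, tag)] = P(tag | prev_tag), lex[(word, tag)] = P(word | tag);
    unseen events default to probability 0."""
    T, N = len(words), len(tags)
    score = [[0.0] * N for _ in range(T)]
    bptr = [[0] * N for _ in range(T)]
    for i, tag in enumerate(tags):                       # column 1
        score[0][i] = trans.get((start, tag), 0.0) * lex.get((words[0], tag), 0.0)
    for t in range(1, T):                                # columns 2 to T
        for i, tag in enumerate(tags):
            best_k, best = 0, -1.0
            for k, prev in enumerate(tags):
                s = score[t - 1][k] * trans.get((prev, tag), 0.0)
                if s > best:
                    best_k, best = k, s
            score[t][i] = best * lex.get((words[t], tag), 0.0)
            bptr[t][i] = best_k
    # back-trace the highest-scoring path
    i = max(range(N), key=lambda j: score[T - 1][j])
    path = [i]
    for t in range(T - 1, 0, -1):
        i = bptr[t][i]
        path.append(i)
    return [tags[i] for i in reversed(path)]

tags = ["ART", "N", "V"]
trans = {("<start>", "ART"): 0.7, ("ART", "N"): 0.8, ("N", "V"): 0.6, ("N", "N"): 0.2}
lex = {("the", "ART"): 0.5, ("cook", "N"): 0.3, ("cook", "V"): 0.1,
       ("cooks", "V"): 0.2, ("cooks", "N"): 0.05}
print(viterbi(["the", "cook", "cooks"], tags, trans, lex))   # ['ART', 'N', 'V']
```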
Croft’s - Phrasefinder • Phrasefinder deals successfully with some of the drawbacks of corpus-based thesauri • Uses “paragraphs” instead of whole documents for computing co-occurrences of terms
Croft’s - Phrasefinder • Creates good multi-word phrases • Includes phrases in the thesaurus • Multi-word phrases are much more specific than single words and tend to have a single meaning