
CS533 Information Retrieval


Presentation Transcript


  1. CS533 Information Retrieval Dr. Michal Cutler Lecture #22 April 21, 1999

  2. This lecture • Concept based retrieval • Phrase-based thesaurus • Phrasefinder

  3. Concept based retrieval • Concept indexing • Concept dictionary • Concept based ranking functions

  4. Concept indexing • The index should contain concepts • Concepts are usually phrases • To index them we need: • a phrase dictionary and • phrase recognition procedures

  5. Concept indexing • Question: should the occurrence of individual phrase words be indexed as well as the phrase? • In Smart, the occurrence of the phrase “neural networks” in a document is added to the inverted lists for “neural network”, “neural” and “networks”
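A rough sketch of that indexing behaviour, with hypothetical data structures (the real Smart system stems terms and stores weighted postings, which is omitted here): the phrase occurrence is posted both under the phrase entry and under each component word.

```python
# Sketch of the Smart behaviour described above: when the phrase
# "neural networks" occurs in a document, a posting is added for the
# phrase entry and for each component word. Structures are hypothetical.
from collections import defaultdict

inverted = defaultdict(list)   # term or phrase -> list of doc ids

def index_phrase_occurrence(doc_id, words):
    inverted[" ".join(words)].append(doc_id)   # the phrase itself
    for w in words:                            # ...and each component word
        inverted[w].append(doc_id)

index_phrase_occurrence(12, ["neural", "networks"])
print(dict(inverted))
# {'neural networks': [12], 'neural': [12], 'networks': [12]}
```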

  6. Dictionaries • Free on-line dictionary of computing (FOLDOC) http://wombat.doc.ic.ac.uk/ • Computing concepts • Webster http://c.gp.cs.cmu.edu:5103/prog/webster • word based

  7. Generating a phrase dictionary • Dictionaries are usually word based • Phrases are often domain specific • New concepts/collocational expressions are constantly being coined

  8. Collocational expressions • Sequences of words whose exact and unambiguous meaning cannot be derived directly from the meaning of their components

  9. Collocational expressions • “foreign minister” - does it mean a clergyman who is foreign? • “pencil sharpener” - a person whose profession is to sharpen items, and who specializes in pencils? • “high school” - a school with a tall building? • “abortion rights”?

  10. New Collocations • Some short lived, others may have a longer life: • “Iran-contra”, “million man march” • “information highway”, • “world wide web” and • “mobile computing” • Collocations are easier to recognize than general phrases

  11. Phrase recognition • The phrase “text analysis system” may appear as: • system analyzes the text • text is analyzed by the system • system carries out text analysis • text is subject to analysis by the system • text is subjected to system analysis

  12. Phrase recognition • The phrase may also use synonyms such as • “document” and • “information item” instead of • “text”, etc.

  13. Phrase recognition • Inverted indexes with position information provide limited phrase recognition • Recognizing all occurrences requires natural language understanding techniques
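As a concrete illustration of that limitation, here is a minimal sketch (with hypothetical toy postings) of phrase matching over a positional inverted index: the phrase is found only where its words occur adjacently, so paraphrases such as “the text is analyzed by the system” are missed.

```python
# Phrase matching with a positional inverted index (toy data).
postings = {
    "text":     {7: [3, 41]},      # term -> {doc_id: [positions]}
    "analysis": {7: [4]},
    "system":   {7: [5, 19]},
}

def phrase_docs(terms):
    """Return doc ids where the terms appear as a contiguous phrase."""
    docs = set.intersection(*(set(postings[t]) for t in terms))
    hits = set()
    for d in docs:
        # phrase starts at a position p only if term i sits at p + i
        if any(all(p + i in postings[t][d] for i, t in enumerate(terms))
               for p in postings[terms[0]][d]):
            hits.add(d)
    return hits

print(phrase_docs(["text", "analysis", "system"]))   # {7}
```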

  14. Query/document similarity • A weight must be assigned to concepts (in the query and in the documents) • The similarity function should be based on concept similarity

  15. Similarity in Smart • Smart uses tf×idf weights for phrases as well • Phrases are added to the dictionary • Similarity = word similarity + (phrase similarity)/2 • Normalization is based only on words
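A small sketch of that combination rule, under the assumption (not spelled out on the slide) that the word-level and phrase-level similarities are computed separately as inner products of tf×idf vectors and then combined:

```python
# Toy sketch: combine word-level and phrase-level similarity as above.
# Vectors are plain dicts of term -> tf*idf weight (hypothetical values).
def inner_product(q, d):
    return sum(w * d.get(t, 0.0) for t, w in q.items())

def smart_similarity(q_words, d_words, q_phrases, d_phrases):
    word_sim = inner_product(q_words, d_words)
    phrase_sim = inner_product(q_phrases, d_phrases)
    return word_sim + phrase_sim / 2.0   # phrase contribution is halved

print(smart_similarity({"neural": 0.5, "network": 0.4},
                       {"neural": 0.3, "network": 0.2},
                       {"neural network": 0.6},
                       {"neural network": 0.5}))   # 0.23 + 0.30/2 = 0.38
```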

  16. Phrase generation for the phrase dictionary • Methods: • Co-occurrence information • Syntactic analysis based on a dictionary • Part-of-speech tagging (POS tagging) + syntactic phrases • the phrases are syntactic units

  17. Co-occurrence information • Not useful for small collections • When word sequences co-occur only a few times, it is difficult to differentiate between phrases and chance co-occurrences • More refined methods needed to select good phrases.

  18. Co-occurrence information • Works well with very large databases (TREC) • When sequences of words co-occur near each other many times, the co-occurrence is usually meaningful.

  19. Smart’s phrase generation • Take every pair of consecutive index terms in each document, and count how many times the pair occurs in the whole collection • Word order is ignored • “inform(ation) retriev(al)” and “retriev(al of) inform(ation)” are considered equal • Are “blind Venetian” and “Venetian blind” equal?

  20. Smart’s phrase generation • For the TREC collection a co-occurrence count of 25+ indicated a phrase • Process takes 5.8 hours, for 800,000 full text items • 4,700,000 phrases are generated, 158,000 phrases appearing 25+ times remain
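A rough sketch of that pair-counting step on a hypothetical toy collection (the real run was over 800,000 TREC documents with stemmed index terms):

```python
# Count unordered pairs of consecutive index terms over a collection and
# keep pairs occurring 25+ times as candidate phrases (toy data below).
from collections import Counter

def candidate_phrases(docs, threshold=25):
    counts = Counter()
    for terms in docs:                       # terms: index terms of one document
        for a, b in zip(terms, terms[1:]):   # consecutive pairs
            counts[frozenset((a, b))] += 1   # frozenset -> word order ignored
    return {tuple(sorted(p)): c for p, c in counts.items() if c >= threshold}

docs = [["inform", "retriev", "system"], ["retriev", "inform"]] * 20
print(candidate_phrases(docs))
# {('inform', 'retriev'): 40}  -- ('retriev', 'system') occurs only 20 times
```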

  21. Smart’s phrase generation • For pairs that qualify, compute the cohesion value • The size-factor is related to thesaurus size • Retain the pair if its cohesion is above a threshold
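The slide does not reproduce the cohesion formula itself. A commonly cited form (stated here as an assumption, following Salton-style phrase construction) is:

```latex
% Assumed cohesion measure (not shown on the slide): pairs whose value
% exceeds a preset threshold are retained as phrases.
\mathrm{COHESION}(t_i, t_j) = \text{SIZE-FACTOR} \times
  \frac{\mathrm{cooccur}(t_i, t_j)}{\mathrm{freq}(t_i)\,\mathrm{freq}(t_j)}
```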

  22. Collocation identification • Co-occurrence information used in NLP to find associations between terms • Needed in both natural language understanding and language generation: • “strong tea”, and “powerful car” • not “powerful tea” and “strong car”

  23. Mutual information statistic • I(x;y) = -log2 P(x) - (-log2 P(x|y)) = log2( P(x|y) / P(x) ) = log2( P(x,y) / (P(x)P(y)) ) • P(x), P(y) and P(x,y) are the probabilities of x, y, and “x and y”

  24. Mutual information statistic • If x is associated with y, P(x,y) will be greater than chance P(x)P(y) and therefore I(x;y) > 0 • For example, if pancake is associated with syrup, P(pancake, syrup) exceeds P(pancake)P(syrup), so I(pancake; syrup) > 0

  25. Mutual information statistic • Extreme case 1: pancake is always followed by syrup • Then P(pancake, syrup) = P(pancake), so I(pancake; syrup) = log2( 1 / P(syrup) ), a large positive value

  26. Mutual information statistic • Extreme case 2: No association between x and y, P(x|y) = P(x) and I(x;y) ≈ log2 1 = 0

  27. Mutual information statistic • There is a negative association between x and y, when P(x|y)<P(x) and I(x;y)<0

  28. Mutual information statistic

  29. Mutual information estimation • Estimate using occurrence information in a corpus: • N = number of words in the corpus • f(x) = number of occurrences of x • f(y) = number of occurrences of y • f(x,y) = number of occurrences of x followed by y

  30. Mutual information estimation • P(x) = f(x)/N • P(y) = f(y)/N • P(x,y) = f(x,y)/N • With this estimate P(x,y) = 0 when f(x,y) = 0 • I(x;y) = log2( N·f(x,y) / (f(x)·f(y)) )
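A small sketch of that estimate; the counts in the example call are hypothetical:

```python
# Mutual information estimate: I(x;y) = log2( N * f(x,y) / (f(x) * f(y)) ).
import math

def mutual_information(N, f_x, f_y, f_xy):
    if f_xy == 0:
        return float("-inf")   # the estimate breaks down when f(x,y) = 0
    return math.log2(N * f_xy / (f_x * f_y))

# e.g. "strong" followed by "tea" in a hypothetical corpus of 1,000,000 words
print(mutual_information(N=1_000_000, f_x=500, f_y=800, f_xy=40))
# log2(1e6 * 40 / 400000) = log2(100) ≈ 6.64
```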

  31. Mutual information estimation • Note: Even when “x followed by y” does not appear in the N words of the corpus, there may be some probability of “x followed by y” occurring in the language • (In experiments, pairs with I > 16 were noun phrases)

  32. Syntactic analysis • Eliminate sequences such as “adverb-adjective” or “adverb-noun” that are generated when using co-occurrence information alone

  33. Syntactic analysis • Used by Cornell group, before part-of-speech taggers became available. • A parser was used to generate subject, object and verb phrases • The index phrases were chosen from within these syntactic phrases

  34. Syntactic analysis • Problem: a sentence can have many possible derivations (a sentence of 10 words can have 100 derivations) • The syntactic parser cannot resolve all ambiguities • Phrases formed using the parser did not provide better retrieval than those found using co-occurrence

  35. POS tagging is a hard problem • Words may have more than one possible part of speech • The word cook can be a noun or a verb • The word dish has a sense as a noun and a sense as a verb • The word still has senses as a noun, verb, adjective and adverb

  36. Part-of-speech-tagging • Human readers are able to assign part-of-speech tags to words in a sentence • Human readers use the context provided by the sentence to determine a word’s tag

  37. Part-of-speech-tagging • POS tagging algorithms are based on a similar idea • A word and its immediate neighbors are considered • The algorithm computes the most probable part-of-speech tag based on statistical probabilities derived from a tagged subcollection

  38. A simple tagging solution • Estimate the probability that word W belongs to POS category C by computing the maximum of P(C|W) over all possible categories for W • W = spring; corpus of 1,273,000 words with 1,000 occurrences of spring, 600 as a noun N and 400 as a verb V • P(N|spring) = 600/1000 = 0.6 > P(V|spring) = 0.4, so conclude a new occurrence of spring is a noun
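A minimal sketch of this baseline, using the slide's counts for spring (the data structures themselves are hypothetical):

```python
# "Most likely tag" baseline: tag each word with argmax_C P(C | W),
# where P(C | W) = count(W tagged C) / count(W) in the tagged subcollection.
from collections import Counter

tag_counts = {"spring": Counter({"N": 600, "V": 400})}

def most_likely_tag(word):
    counts = tag_counts[word]
    total = sum(counts.values())
    tag, count = counts.most_common(1)[0]   # category with the highest count
    return tag, count / total

print(most_likely_tag("spring"))   # ('N', 0.6) -> tag "spring" as a noun
```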

  39. A simple tagging solution • Using this maximum likelihood estimator gives about 90% accuracy • So there is a very high probability that a sentence with 10 words will have at least one POS tag error (1 - 0.9^10 ≈ 0.65)

  40. A POS tagger that uses local context • Given sentence w1,…,wT, find C1,…,CT that maximizes P(C1,…,CT | w1,…,wT) • Using Bayes rule: P(C1,…,CT | w1,…,wT) = P(w1,…,wT | C1,…,CT) P(C1,…,CT) / P(w1,…,wT)

  41. Assumptions • We can discard the denominator P(w1,…,wT) (it does not affect the maximum) and maximize P(w1,…,wT | C1,…,CT) P(C1,…,CT) • Approximate P(C1,…,CT) using an n-gram model • bigram: P(C1,…,CT) ≈ ∏i P(Ci | Ci-1) • trigram: P(C1,…,CT) ≈ ∏i P(Ci | Ci-2, Ci-1)

  42. Example • Assuming the bigram model: • Assume that the category sequence V N occurs 558 times and that V occurs 1,000 times in the tagged corpus; then P(N|V) = 558/1000

  43. Assumptions • Approximate P(w1,…,wT | C1,…,CT) ≈ ∏i P(wi | Ci) • Example: P(the | ART) = (# times “the” is tagged ART)/(# times ART occurs) = 300/600 = 1/2

  44. Formula for the bigram approximation • Compute the maximum over C1,…,CT of ∏i=1..T P(Ci | Ci-1) P(wi | Ci)

  45. Viterbi Algorithm for bigrams • Given a word sequence w1,…,wT and a list of lexical categories L1,…,LN • The lexical probabilities P(wt | Li) • The bigram probabilities P(Lk | Lj) • Determine the most likely sequence of categories C1,…,CT for the word sequence

  46. Viterbi • The algorithm computes the maximum probability SCORE(i, t) for i = 1,…,N and t = 1,…,T: • SCORE(i, t) = max{ P(C1,…,Ct-1 = Lk, Ct = Li) · P(w1,…,wt | C1,…,Ct-1 = Lk, Ct = Li) | k = 1,…,N } • = max{ P(C1,…,Ct-1 = Lk) · P(w1,…,wt-1 | C1,…,Ct-1 = Lk) · P(Li | Lk) · P(wt | Li) | k = 1,…,N } =

  47. Viterbi • = max{ SCORE(k, t-1) · P(Li | Lk) · P(wt | Li) | k = 1,…,N } • BPTR(i, t) saves the k that gave the maximum • To derive the categories: • C(T) = the k that maximized SCORE(k, T) • for i = T-1 down to 1 do C(i) = BPTR(C(i+1), i+1)

  48. Viterbi
  for i = 1 to N do                    // initialize column 1
      SCORE(i, 1) = P(Li | <start>) * P(w1 | Li)
      BPTR(i, 1) = 0
  for t = 2 to T do                    // columns 2 to T
      for i = 1 to N do
          SCORE(i, t) = max{ SCORE(k, t-1) * P(Li | Lk) * P(wt | Li) | k = 1,…,N }
          BPTR(i, t) = the index k that gave the maximum
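A compact, runnable sketch of this bigram Viterbi tagger, working in log space; the category set, probability tables and example sentence below are hypothetical toy values, only meant to exercise the SCORE/BPTR recurrence above.

```python
# Bigram Viterbi tagging: score[t][i] is the log probability of the best
# tag sequence ending in tags[i] at position t; bptr stores the backtrace.
import math

def viterbi(words, tags, p_start, p_trans, p_emit):
    score = [[p_start[i] + p_emit[i].get(words[0], -30.0) for i in range(len(tags))]]
    bptr = [[0] * len(tags)]
    for t in range(1, len(words)):
        col, back = [], []
        for i in range(len(tags)):
            best_k = max(range(len(tags)), key=lambda k: score[-1][k] + p_trans[k][i])
            col.append(score[-1][best_k] + p_trans[best_k][i]
                       + p_emit[i].get(words[t], -30.0))   # -30.0: unseen word fallback
            back.append(best_k)
        score.append(col)
        bptr.append(back)
    best = max(range(len(tags)), key=lambda i: score[-1][i])
    path = [best]
    for t in range(len(words) - 1, 0, -1):    # trace back the best path
        path.append(bptr[t][path[-1]])
    return [tags[i] for i in reversed(path)]

tags = ["N", "V"]
p_start = [math.log(0.7), math.log(0.3)]                    # P(L_i | <start>)
p_trans = [[math.log(0.3), math.log(0.7)],                  # P(L_i | L_k)
           [math.log(0.8), math.log(0.2)]]
p_emit = [{"dogs": math.log(0.6), "bark": math.log(0.1)},   # P(w | N)
          {"dogs": math.log(0.1), "bark": math.log(0.7)}]   # P(w | V)
print(viterbi(["dogs", "bark"], tags, p_start, p_trans, p_emit))  # ['N', 'V']
```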

  49. Croft’s Phrasefinder • Phrasefinder deals successfully with some of the drawbacks of corpus-based thesauri • It uses “paragraphs” instead of whole documents for computing co-occurrences of terms

  50. Croft’s Phrasefinder • Creates good multi-word phrases • Includes phrases in the thesaurus • Multi-word phrases are much more specific than single words • They tend to have a single meaning
