
Information Retrieval


Presentation Transcript


  1. Information Retrieval

  2. Information Retrieval Process
  [diagram: an information need is turned into a query (how is the query constructed?); text input from the collections is pre-processed and parsed into an index (how is the text processed?); the query is matched against the index and the results are ranked]

  3. Example: Information Needs • Sometimes very specific • <title> Falkland petroleum exploration • <desc> Description: What information is available on petroleum exploration in the South Atlantic near the Falkland Islands? • <narr> Narrative: Any document discussing petroleum exploration in the South Atlantic near the Falkland Islands is considered relevant. Documents discussing petroleum exploration in continental South America are not relevant. • Sometimes very vague • I am going to Kyoto, Japan for a conference in two months. What should I know?

  4. Relevance • In what ways can a document be relevant to a query? • Answer precise question precisely. • Partially answer question. • Suggest a source for more information. • Give background information. • Remind the user of other knowledge. • Others ...

  5. Relevance • How relevant is the document • for this user, for this information need • Subjective, but • Measurable to some extent • How often do people agree a document is relevant to a query? • How well does it answer the question? • Complete answer? Partial? • Background information? • Hints for further exploration?

  6. Document Representation • Information needs and documents are usually represented as sets/bags of terms. • Bag: allow multiple instances of the same element • Terms: words, phrases • To stem or not to stem • Annotation with location information: title, heading

  7. Bag of Words Example (courtesy of Phillip Resnik)
  Document 1: “The quick brown fox jumped over the lazy dog’s back.”
  Document 2: “Now is the time for all good men to come to the aid of their party.”
  Stop word list: for, is, of, ’s, the, to

  Indexed term   Document 1   Document 2
  aid                0            1
  all                0            1
  back               1            0
  brown              1            0
  come               0            1
  dog                1            0
  fox                1            0
  good               0            1
  jump               1            0
  lazy               1            0
  men                0            1
  now                0            1
  over               1            0
  party              0            1
  quick              1            0
  their              0            1
  time               0            1
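A minimal sketch of pre-processing that reproduces the table above; the trailing “-ed” stripping is only a crude stand-in for real stemming, and the helper names are illustrative:

```python
from collections import Counter

# Stop word list taken from the slide above
STOP_WORDS = {"for", "is", "of", "the", "to", "'s"}

def bag_of_words(text):
    # Lowercase, tokenize, drop stop words; stripping a trailing "-ed" is only a
    # crude stand-in for real stemming ("jumped" -> "jump").
    bag = Counter()
    for tok in text.lower().replace("'s", " 's").split():
        tok = tok.strip(".,!?")
        if tok in STOP_WORDS:
            continue
        bag[tok[:-2] if tok.endswith("ed") and len(tok) > 4 else tok] += 1
    return bag

d1 = bag_of_words("The quick brown fox jumped over the lazy dog's back.")
d2 = bag_of_words("Now is the time for all good men to come to the aid of their party.")
print(sorted(d1))  # ['back', 'brown', 'dog', 'fox', 'jump', 'lazy', 'over', 'quick']
print(sorted(d2))  # ['aid', 'all', 'come', 'good', 'men', 'now', 'party', 'their', 'time']
```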

  8. Types of Queries • Boolean Query • Does the document satisfy the Boolean expression? • “java” AND “compilers” AND (“unix” OR “linux”) • Vector Query • How similar is the document to the query? • [(java 3) (compiler 2) (unix 1) (linux 1)] • Probabilistic Query • What is the probability that the document is generated by the query?

  9. Boolean Model of Retrieval • Pros • Easy to understand/clear semantics • AND means ‘all’, OR means ‘any’ • Usually computationally efficient • Cons • Difficult to rank results • Rigid: either get too much or too little • AND means ‘all’, OR means ‘any’ • When the information need is complex, it is hard to formulate it as a Boolean query.
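A minimal sketch of Boolean evaluation, using the example query from slide 8 against illustrative documents:

```python
def matches(doc_terms):
    # Evaluate: "java" AND "compilers" AND ("unix" OR "linux")
    t = set(doc_terms)
    return "java" in t and "compilers" in t and ("unix" in t or "linux" in t)

print(matches(["java", "compilers", "linux", "tutorial"]))  # True
print(matches(["java", "compilers", "windows"]))            # False -- and no way to say how close it came
```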

  10. Vector Space Model
  • A collection of n documents with t distinct terms can be represented by a (sparse) matrix:

           T1    T2    ...   Tt
     D1    w11   w21   ...   wt1
     D2    w12   w22   ...   wt2
     :      :     :           :
     Dn    w1n   w2n   ...   wtn

  • A query can also be represented as a vector, like a document.

  11. Docs as Vectors
  [figure: documents plotted in a two-dimensional term space with axes “Star” and “Diet”: a doc about movie stars, a doc about astronomy, and a doc about mammal behavior]

  12. Geometric Interpretation
  [figure: D1, D2, and Q drawn as vectors in the three-dimensional term space T1, T2, T3]
  Example: D1 = 2T1 + 3T2 + 5T3, D2 = 3T1 + 7T2 + T3, Q = 0T1 + 0T2 + 2T3
  • Is D1 or D2 more similar to Q?
  • How to measure the degree of similarity? Distance? Angle? Projection?
  • Assumption: Documents that are “close together” in space are similar in meaning.
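Using the angle is the most common choice; a minimal sketch of the cosine measure applied to the vectors on this slide:

```python
import math

def cosine(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

D1 = [2, 3, 5]   # 2T1 + 3T2 + 5T3
D2 = [3, 7, 1]   # 3T1 + 7T2 + T3
Q  = [0, 0, 2]   # 0T1 + 0T2 + 2T3

print(cosine(D1, Q))  # ~0.81
print(cosine(D2, Q))  # ~0.13  -> D1 is more similar to Q by angle
```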

  13. Term Weights • The weight wij reflects the importance of the term Ti in document Dj. • Intuitions: • A term that appears in many documents is not important: e.g., the, going, come, … • If a term is frequent in a document, it is probably important in that document.

  14. Assigning Weights to Terms • Binary Weights • Raw term frequency • tf x idf • Recall the Zipf distribution • Want to weight terms highly if they are • frequent in relevant documents … BUT • infrequent in the collection as a whole • Pointwise Mutual Information

  15. Binary Weights • Only the presence (1) or absence (0) of a term is included in the vector

  16. Raw Term Weights • The frequency of occurrence for the term in each document is included in the vector

  17. Inverse Document Frequency • IDF provides high values for rare words and low values for common words • [example table of IDF values for a collection of 10000 documents]
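The example values themselves did not survive the transcript, but with the usual definition idf(t) = log(N / df(t)) the pattern is easy to reproduce; a minimal sketch (assuming base-10 logarithms):

```python
import math

N = 10_000  # documents in the collection

def idf(df, n_docs=N):
    # Inverse document frequency: rare terms (small df) get large weights.
    return math.log10(n_docs / df)

for df in (10_000, 1_000, 100, 1):
    print(f"df={df:>6}  idf={idf(df):.2f}")
# df= 10000  idf=0.00   (appears in every document: useless for discrimination)
# df=  1000  idf=1.00
# df=   100  idf=2.00
# df=     1  idf=4.00   (appears in one document: highly discriminative)
```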

  18. Term Weights: tf x idf • Term frequency (tf) • the frequency count of a term in a document • Inverse document frequency (idf) • The amount of information contained in the statement “Document X contains the term Ti”. • Assign a tf * idf weight to each term in each document

  19. tf x idf
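The formula on this slide did not survive the transcript; a common formulation weights term i in document j as tf_ij × log(N / df_i). A minimal sketch over a toy two-document collection (illustrative, not the slide's own code):

```python
import math
from collections import Counter

# toy collection (illustrative)
docs = [
    "the quick brown fox jumped over the lazy dog".split(),
    "now is the time for all good men to come to the aid of their party".split(),
]

N = len(docs)
df = Counter()                      # document frequency of each term
for doc in docs:
    df.update(set(doc))             # count each term once per document

def tf_idf(term, doc):
    tf = doc.count(term)            # raw term frequency in this document
    return tf * math.log(N / df[term])

print(tf_idf("fox", docs[0]))       # rare term -> positive weight (~0.69)
print(tf_idf("the", docs[0]))       # occurs in every document -> weight 0.0
```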

  20. Term Weights: Pointwise Mutual Information • Pointwise Mutual Information measures the strength of association between two elements (a document and a term). • Observed frequency vs. expected frequency • MI weight is insensitive to stemming and the use of stop word list [Pantel and Lin 02]
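A minimal sketch of the observed-vs-expected idea behind PMI (the counts below are illustrative):

```python
import math

def pmi(count_dt, count_d, count_t, total):
    # Pointwise mutual information between a document d and a term t:
    # log( observed joint frequency / frequency expected if d and t were independent )
    observed = count_dt / total
    expected = (count_d / total) * (count_t / total)
    return math.log(observed / expected)

# e.g. the term occurs 8 times in this 200-token document, and 50 times
# in a 1,000,000-token collection (illustrative numbers)
print(pmi(count_dt=8, count_d=200, count_t=50, total=1_000_000))  # strongly positive
```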

  21. What Else can be Terms? • Letter n-grams • Phrases • Relations • Semantic categories

  22. Similarity Measure • Define a similarity measure between a query and a document • Cosine • Dice • Return the documents that are the most similar to the query

  23. Similarity Measures • Simple matching (coordination level match) • Dice’s Coefficient • Jaccard’s Coefficient • Cosine Coefficient • Overlap Coefficient
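The formulas themselves did not survive the transcript; the standard set-based versions look like this (a minimal sketch over binary term vectors):

```python
import math

def similarities(q, d):
    """Set-based versions of the classic query/document similarity measures."""
    q, d = set(q), set(d)
    inter = len(q & d)
    return {
        "simple_matching": inter,                           # coordination level match
        "dice":    2 * inter / (len(q) + len(d)),
        "jaccard": inter / len(q | d),
        "cosine":  inter / math.sqrt(len(q) * len(d)),
        "overlap": inter / min(len(q), len(d)),
    }

print(similarities({"java", "compiler", "unix"}, {"java", "unix", "kernel", "linux"}))
```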

  24. Implementation of VS Model • Does the query vector need to be compared with EVERY document vector? • Or does it only need to be compared with the vectors of documents that contain at least one of the query terms?
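No: with an inverted index, only documents sharing at least one term with the query ever need to be scored. A minimal sketch (illustrative, not the slide's own code):

```python
from collections import defaultdict

def build_inverted_index(docs):
    # term -> set of ids of documents containing the term
    index = defaultdict(set)
    for doc_id, terms in enumerate(docs):
        for term in terms:
            index[term].add(doc_id)
    return index

def candidate_docs(query_terms, index):
    # Only these documents can have a non-zero similarity to the query.
    candidates = set()
    for term in query_terms:
        candidates |= index.get(term, set())
    return candidates

docs = [["java", "compiler"], ["unix", "kernel"], ["cooking", "recipes"]]
index = build_inverted_index(docs)
print(candidate_docs(["java", "unix"], index))   # {0, 1}; document 2 is never touched
```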

  25. What to Evaluate? • What can be measured that reflects users’ ability to use the system? (Cleverdon 66) • Coverage of Information • Form of Presentation • Effort required/Ease of Use • Time and Space Efficiency • Effectiveness: • Recall • proportion of relevant material actually retrieved • Precision • proportion of retrieved material actually relevant
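A minimal sketch of the two effectiveness measures over sets of document ids (the ids are illustrative):

```python
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved)   # fraction of retrieved docs that are relevant
    recall = hits / len(relevant)       # fraction of relevant docs that were retrieved
    return precision, recall

print(precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7]))  # (0.5, 0.4)
```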

  26. Relevant vs. Retrieved [Venn diagram: the retrieved set and the relevant set overlap within the set of all docs]

  27. Precision vs. Recall [the same Venn diagram, contrasting the two measures over the retrieved and relevant sets]

  28. Why Precision and Recall? • Get as much of the good stuff as possible while at the same time getting as little junk as possible.

  29. Retrieved vs. Relevant Documents [Venn diagram: very high precision, very low recall]

  30. Retrieved vs. Relevant Documents [Venn diagram: high recall, but low precision]

  31. Retrieved vs. Relevant Documents [Venn diagram: high precision, high recall (at last!)]

  32. Precision/Recall Curves • There is a tradeoff between Precision and Recall, so measure Precision at different levels of Recall • Note: this is an AVERAGE over MANY queries [plot: measured precision points (y-axis) at increasing recall levels (x-axis)]

  33. Precision/Recall Curves • Difficult to determine which of these two hypothetical results is better [plot: two hypothetical precision/recall curves]

  34. Average Precision • IR systems typically output a ranked list of documents • For each relevant document, compute the precision up to that point • Average over all precision values computed this way.

  35. Interpolated Average Precision • Precision may go up when going down the ranked list. • Intuitively, this should only go down. • Interpolated Average Precision • for each recall level in 0%, 10%, 20%, … • compute the highest precision after recall reached that point • take the average of the max precision scores
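A minimal sketch of both measures over a ranked result list (the documents and relevance judgments are illustrative):

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant document."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)   # precision "up to that point"
    return sum(precisions) / len(relevant)

def interpolated_precision(ranked, relevant, levels=11):
    """Interpolated precision at recall levels 0%, 10%, ..., 100%."""
    relevant = set(relevant)
    hits, points = 0, []
    for rank, doc in enumerate(ranked, start=1):
        hits += doc in relevant
        points.append((hits / len(relevant), hits / rank))   # (recall, precision)
    # highest precision achieved at or after each recall level
    return [max((p for r, p in points if r >= level / (levels - 1)), default=0.0)
            for level in range(levels)]

ranked = ["d3", "d7", "d1", "d5", "d9"]   # system's ranked output
relevant = {"d3", "d5", "d9"}
print(average_precision(ranked, relevant))      # (1/1 + 2/4 + 3/5) / 3 = 0.7
print(interpolated_precision(ranked, relevant))
```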

  36. F-Measure • Sometimes only one pair of precision and recall is available. • e.g., filtering task • F-Measure • β > 1: precision is more important • β < 1: recall is more important • Usually β = 1
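With β = 1, the usual case, the F-measure is simply the harmonic mean of precision and recall; a minimal sketch:

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall (the beta = 1 case).
    return 2 * precision * recall / (precision + recall)

print(f1(0.5, 0.4))   # 0.44...; a low value on either side drags the score down
```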

  37. Text Categorization • Goal: classify documents into predefined categories • Approaches: • Naïve Bayes • Nearest Neighbor • SVM

  38. Naïve Bayes Method • Knowledge Base contains • A set of hypotheses • A set of evidences • Probability of an evidence given a hypothesis • Given • A subset of the evidences known to be present in a situation • Find • The hypothesis with the highest posterior probability: P(H|E1, E2, …, Ek). • The probability itself does not matter so much.

  39. Naïve Bayes Method • Assumptions • Hypotheses are exhaustive and mutually exclusive • H1 v H2 v … v Hk • ¬ (Hi ^ Hj) for any i≠j • Evidences are conditionally independent given a hypothesis • P(E1, E2,…, Ek|H) = P(E1|H)…P(Ek|H) • P(H | E1, E2,…, Ek) = P(E1, E2,…, Ek, H)/P(E1, E2,…, Ek) = P(E1, E2,…, Ek|H)P(H)/P(E1, E2,…, Ek)

  40. Naïve Bayes Method • The goal is to find the H that maximizes P(H|E1, E2,…, Ek) • Since P(H|E1, E2,…, Ek) = P(E1, E2,…, Ek|H)P(H)/P(E1, E2,…, Ek) and P(E1, E2,…, Ek) is the same for all hypotheses, • Maximizing P(H|E1, E2,…, Ek) is equivalent to maximizing P(E1, E2,…, Ek|H)P(H) = P(E1|H)…P(Ek|H)P(H) • Naïve Bayes Method • Find the hypothesis that maximizes P(E1|H)…P(Ek|H)P(H)
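A minimal sketch of that decision rule (the dictionary layout is illustrative, and the log transform is only there to avoid underflow when many small probabilities are multiplied):

```python
import math

def naive_bayes(evidences, priors, likelihoods):
    """Return the hypothesis maximizing P(E1|H)...P(Ek|H)P(H).

    priors:      {hypothesis: P(H)}
    likelihoods: {hypothesis: {evidence: P(E|H)}}
    """
    best, best_score = None, float("-inf")
    for h, prior in priors.items():
        # sum of logs == log of the product P(E1|H)...P(Ek|H)P(H)
        score = math.log(prior) + sum(math.log(likelihoods[h][e]) for e in evidences)
        if score > best_score:
            best, best_score = h, score
    return best
```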

  41. Example: Play Tennis? Predict playing tennis when <sunny, cool, high, strong> What probability should be used to make the prediction? How to compute the probability?

  42. Probabilities of Individual Attributes • Given the training set, we can compute the probabilities

  43. Example: Play Tennis P(+| sunny, cool, high, strong) vs. P(−| sunny, cool, high, strong) P(sunny|+)P(cool|+)P(high|+)P(strong|+)P(+) vs. P(sunny|−)P(cool|−)P(high|−)P(strong|−)P(−)

  44. Application: Spam Detection • Spam • Dear sir, We want to transfer to overseas ($ 126,000.000.00 USD) One hundred and Twenty six million United States Dollars) from a Bank in Africa, I want to ask you to quietly look for a reliable and honest person who will be capable and fit to provide either an existing …… • Legitimate email • Ham: for lack of a better name.

  45. Hypotheses: {Spam, Ham} • Evidence: a document • The document is treated as a set (or bag) of words • Knowledge • P(Spam) • The prior probability of an e-mail message being spam • How to estimate this probability? • P(w|Spam) • The probability that a word drawn from a spam message is w • How to estimate this probability?
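One standard answer to both questions is relative-frequency estimation from a labeled training set, with add-one (Laplace) smoothing so unseen words do not get zero probability. A minimal sketch (the function and variable names are illustrative):

```python
from collections import Counter

def train(messages):
    """messages: list of (label, list_of_words) pairs with label 'spam' or 'ham'."""
    label_counts = Counter(label for label, _ in messages)
    word_counts = {"spam": Counter(), "ham": Counter()}
    for label, words in messages:
        word_counts[label].update(words)

    # P(Spam): fraction of training messages that are spam
    prior_spam = label_counts["spam"] / len(messages)

    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_spam_words = sum(word_counts["spam"].values())

    def p_word_given_spam(w):
        # P(w|Spam) with add-one smoothing over the vocabulary
        return (word_counts["spam"][w] + 1) / (total_spam_words + len(vocab))

    return prior_spam, p_word_given_spam
```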

  46. Other Text Categorization Algorithms • Support Vector Machine • often has the best performance. • K-Nearest Neighbor
