1 / 66

Vector and Probabilistic Ranking

Vector and Probabilistic Ranking. Ray Larson & Marti Hearst University of California, Berkeley School of Information Management and Systems SIMS 202: Information Organization and Retrieval. Review. Inverted files The Vector Space Model Term weighting. Inverted indexes.

tadhg
Download Presentation

Vector and Probabilistic Ranking

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Vector and Probabilistic Ranking Ray Larson & Marti Hearst University of California, Berkeley School of Information Management and Systems SIMS 202: Information Organization and Retrieval Information Organization and Retrieval

  2. Review • Inverted files • The Vector Space Model • Term weighting Information Organization and Retrieval

  3. Inverted indexes • Permit fast search for individual terms • For each term, you get a list consisting of: • document ID • frequency of term in doc (optional) • position of term in doc (optional) • These lists can be used to solve Boolean queries: • country -> d1, d2 • manor -> d2 • country AND manor -> d2 • Also used for statistical ranking algorithms Information Organization and Retrieval

  4. How Inverted Files are Used Query on “time” AND “dark” 2 docs with “time” in dictionary -> IDs 1 and 2 from posting file 1 doc with “dark” in dictionary -> ID 2 from posting file Therefore, only doc 2 satisfied the query. Dictionary Postings Information Organization and Retrieval

  5. Vector Space Model • Documents are represented as vectors in term space • Terms are usually stems • Documents represented by binary vectors of terms • Queries represented the same as documents • Query and Document weights are based on length and direction of their vector • A vector distance measure between the query and documents is used to rank retrieved documents Information Organization and Retrieval

  6. Documents in 3D Space Assumption: Documents that are “close together” in space are similar in meaning. Information Organization and Retrieval

  7. Vector Space Documentsand Queries t1 t3 D2 D9 D1 D4 D11 D5 D3 D6 D10 D8 t2 D7 Boolean term combinations Q is a query – also represented as a vector Information Organization and Retrieval

  8. Documents in Vector Space t3 D1 D9 D11 D5 D3 D10 D4 D2 t1 D7 D6 D8 t2 Information Organization and Retrieval

  9. Assigning Weights to Terms • Binary Weights • Raw term frequency • tf x idf • Recall the Zipf distribution • Want to weight terms highly if they are • frequent in relevant documents … BUT • infrequent in the collection as a whole • Automatically derived thesaurus terms Information Organization and Retrieval

  10. Binary Weights • Only the presence (1) or absence (0) of a term is included in the vector Information Organization and Retrieval

  11. Raw Term Weights • The frequency of occurrence for the term in each document is included in the vector Information Organization and Retrieval

  12. Assigning Weights • tf x idf measure: • term frequency (tf) • inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution • Goal: assign a tf * idf weight to each term in each document Information Organization and Retrieval

  13. tf x idf Information Organization and Retrieval

  14. Inverse Document Frequency • IDF provides high values for rare words and low values for common words For a collection of 10000 documents Information Organization and Retrieval

  15. Similarity Measures Simple matching (coordination level match) Dice’s Coefficient Jaccard’s Coefficient Cosine Coefficient Overlap Coefficient Information Organization and Retrieval

  16. Vector Space Visualization Information Organization and Retrieval

  17. Text Clustering Clustering is “The art of finding groups in data.” -- Kaufmann and Rousseeu Term 1 Term 2 Information Organization and Retrieval

  18. K-Means Clustering • 1 Create a pair-wise similarity measure • 2 Find K centers using agglomerative clustering • take a small sample • group bottom up until K groups found • 3 Assign each document to nearest center, forming new clusters • 4 Repeat 3 as necessary Information Organization and Retrieval

  19. Scatter/Gather Cutting, Pedersen, Tukey & Karger 92, 93 Hearst & Pedersen 95 • Cluster sets of documents into general “themes”, like a table of contents • Display the contents of the clusters by showing topical terms andtypical titles • User chooses subsets of the clusters and re-clusters the documents within • Resulting new groups have different “themes” Information Organization and Retrieval

  20. S/G Example: query on “star” Encyclopedia text 14 sports 8 symbols 47 film, tv 68 film, tv (p) 7 music 97 astrophysics 67 astronomy(p) 12 stellar phenomena 10 flora/fauna 49 galaxies, stars 29 constellations 7 miscelleneous Clustering and re-clustering is entirely automated Information Organization and Retrieval

  21. Another use of clustering • Use clustering to map the entire huge multidimensional document space into a huge number of small clusters. • “Project” these onto a 2D graphical representation: Information Organization and Retrieval

  22. Clustering Multi-Dimensional Document Space(image from Wise et al 95) Information Organization and Retrieval

  23. Clustering Multi-Dimensional Document Space(image from Wise et al 95) Information Organization and Retrieval

  24. Concept “Landscapes” Disease Pharmocology Anatomy Legal Hospitals • (e.g., Lin, Chen, Wise et al.) • Too many concepts, or too coarse • Single concept per document • No titles • Browsing without search Information Organization and Retrieval

  25. Clustering • Advantages: • See some main themes • Disadvantage: • Many ways documents could group together are hidden • Thinking point: what is the relationship to classification systems and facets? Information Organization and Retrieval

  26. Today • Vector Space Ranking • Probabilistic Models and Ranking • (lots of math) Information Organization and Retrieval

  27. tf x idf normalization • Normalize the term weights (so longer documents are not unfairly given more weight) • normalize usually means force all values to fall within a certain range, usually between 0 and 1, inclusive. Information Organization and Retrieval

  28. Vector space similarity(use the weights to compare the documents) Information Organization and Retrieval

  29. Vector Space Similarity Measurecombine tf x idf into a similarity measure Information Organization and Retrieval

  30. To Think About • How does this ranking algorithm behave? • Make a set of hypothetical documents consisting of terms and their weights • Create some hypothetical queries • How are the documents ranked, depending on the weights of their terms and the queries’ terms? Information Organization and Retrieval

  31. Computing Similarity Scores 1.0 0.8 0.6 0.4 0.2 0.2 0.4 0.6 0.8 1.0 Information Organization and Retrieval

  32. Computing a similarity score Information Organization and Retrieval

  33. Vector Space with Term Weights and Cosine Matching Di=(di1,wdi1;di2, wdi2;…;dit, wdit) Q =(qi1,wqi1;qi2, wqi2;…;qit, wqit) Term B 1.0 Q = (0.4,0.8) D1=(0.8,0.3) D2=(0.2,0.7) Q D2 0.8 0.6 0.4 D1 0.2 0 0.2 0.4 0.6 0.8 1.0 Term A Information Organization and Retrieval

  34. Weighting schemes • We have seen something of • Binary • Raw term weights • TF*IDF • There are many other possibilities • IDF alone • Normalized term frequency Information Organization and Retrieval

  35. Term Weights in SMART • SMART is an experimental IR system developed by Gerard Salton (and continued by Chris Buckley) at Cornell. • Designed for laboratory experiments in IR • Easy to mix and match different weighting methods • Really terrible user interface • Intended for use by code hackers (and even they have trouble using it) Information Organization and Retrieval

  36. Term Weights in SMART • In SMART weights are decomposed into three factors: Information Organization and Retrieval

  37. SMART Freq Components Binary maxnorm augmented log Information Organization and Retrieval

  38. Collection Weighting in SMART Inverse squared probabilistic frequency Information Organization and Retrieval

  39. Term Normalization in SMART sum cosine fourth max Information Organization and Retrieval

  40. Problems with Vector Space • There is no real theoretical basis for the assumption of a term space • it is more for visualization that having any real basis • most similarity measures work about the same regardless of model • Terms are not really orthogonal dimensions • Terms are not independent of all other terms Information Organization and Retrieval

  41. Probabilistic Models • Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query • Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle) • Rely on accurate estimates of probabilities Information Organization and Retrieval

  42. Probability Ranking Principle • If a reference retrieval system’s response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data. Stephen E. Robertson, J. Documentation 1977 Information Organization and Retrieval

  43. Model 1 – Maron and Kuhns • Concerned with estimating probabilities of relevance at the point of indexing: • If a patron came with a request using term ti, what is the probability that she/he would be satisfied with document Dj? Information Organization and Retrieval

  44. Bayes Formula • Bayesian statistical inference used in both models… Information Organization and Retrieval

  45. Model 1 Bayes • A is the class of events of using the library • Di is the class of events of Document i being judged relevant • Ij is the class of queries consisting of the single term Ij • P(Di|A,Ij) = probability that if a query is submitted to the system then a relevant document is retrieved Information Organization and Retrieval

  46. Model 2 – Robertson & Sparck Jones Given a term t and a query q Document Relevance + - + r n-r n - R-r N-n-R+r N-n R N-R N Document indexing Information Organization and Retrieval

  47. Robertson-Spark Jones Weights • Retrospective formulation -- Information Organization and Retrieval

More Related