
Level Search Filtering for IR Model Reduction



Presentation Transcript


  1. Level Search Filtering for IR Model Reduction Michael W. Berry Xiaoyan (Kathy) Zhang Padma Raghavan Department of Computer Science University of Tennessee

  2. Computational Models for IR 1. Need framework for designing concept-based IR models. 2. Can we draw upon backgrounds and experiences of computer scientists and mathematicians? 3. Effective indexing should address issues of scale and accuracy. IMA Hot Topics Workshop: Text Mining, Apr 17, 2000

  3. The Vector Space Model • Represent terms and documents as vectors in k-dimensional space • Similarity computed by measures such as cosine or Euclidean distance • Early prototype - SMART system developed by Salton et al. [70’s, 80’s]
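
The cosine measure mentioned above scores a document by the angle between its term vector and the query vector. A minimal sketch over a toy three-term vocabulary (all vectors here are illustrative, not from any real collection):

```python
import numpy as np

def cosine_similarity(query, doc):
    """Cosine of the angle between a query vector and a document vector."""
    denom = np.linalg.norm(query) * np.linalg.norm(doc)
    return float(query @ doc) / denom if denom else 0.0

# Toy 3-term vocabulary; vectors hold term frequencies.
q  = np.array([1.0, 1.0, 0.0])
d1 = np.array([2.0, 2.0, 0.0])  # points in the same direction as q
d2 = np.array([0.0, 0.0, 3.0])  # shares no terms with q
```

Here cosine_similarity(q, d1) is 1.0 and cosine_similarity(q, d2) is 0.0, matching the intuition that documents sharing the query's terms rank higher.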

  4. Motivation for LSI Two fundamental query matching problems: synonymy (image, likeness, portrait, facsimile, icon) polysemy (Adam’s apple, patient’s discharge, culture)

  5. Motivation for LSI Approach Treat word-to-document association data as an unreliable estimate of a larger set of applicable words. Goal Cluster similar documents which may share no terms in a low-dimensional subspace (improve recall).

  6. LSI Approach • Preprocessing Compute a low-rank approximation to the original term-by-document (sparse) matrix • Vector Space Model Encode terms and documents using factors derived from the SVD (ULV, SDD) • Postprocessing Rank similarity of terms and docs to the query via Euclidean distances or cosines

  7. SVD Encoding: A_k = U_k S_k V_k^T, where A_k is the best rank-k approximation to the term-by-document matrix A. [Figure: the rows of the factorization give term vectors; the columns give document vectors.]
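
The rank-k truncation A_k = U_k S_k V_k^T falls directly out of the SVD: keep the k largest singular values and the corresponding singular vectors. A small NumPy sketch on an assumed toy term-by-document matrix:

```python
import numpy as np

def rank_k_approx(A, k):
    """Best rank-k approximation A_k = U_k S_k V_k^T via the SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Assumed toy term-by-document matrix (4 terms x 3 docs).
A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.],
              [0., 0., 1.]])
A2 = rank_k_approx(A, 2)  # same shape as A, but rank at most 2
```

By the Eckart-Young theorem, no other rank-2 matrix is closer to A in the 2-norm or Frobenius norm, which is what justifies using A_k in place of A.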

  8. Vector Space Dimension • Want the minimum no. of factors (k) that discriminates most concepts • In practice, k ranges between 100 and 300 but could be much larger. • Choosing the optimal k for different collections is challenging.

  9. Strengths of LSI • Completely automatic: no stemming required, allows misspellings • Multilanguage search capability Landauer (Colorado), Littman (Duke) • Conceptual IR capability (Recall): retrieve relevant documents that do not contain any search terms

  10. Changing the LSI Model • Updating Folding-in new terms or documents [Deerwester et al. ‘90] SVD-updating [O’Brien ‘94], [Simon & Zha ‘97] • Downdating Modify SVD w.r.t. term or document deletions [Berry & Witter ‘98]
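
Folding-in, as described by Deerwester et al., projects a new document's term vector d into the existing k-dimensional space as d_hat = d^T U_k S_k^(-1), avoiding a full SVD recomputation (at the cost of not updating the factors themselves). A sketch under an assumed toy setup (the matrix and choice of k below are illustrative):

```python
import numpy as np

def fold_in_document(d, Uk, sk):
    """Project a new document's term vector d into the existing
    k-dimensional LSI space: d_hat = d^T U_k diag(s_k)^(-1)."""
    return (d @ Uk) / sk

# Assumed toy setup: SVD factors of a 4-term x 3-doc matrix, k = 2.
A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.],
              [0., 0., 1.]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
d_new = np.array([1., 0., 1., 0.])  # new doc using terms 1 and 3
d_hat = fold_in_document(d_new, U[:, :k], s[:k])
```

A useful sanity check: folding in a column of A itself reproduces that document's existing coordinates (the corresponding column of V_k^T).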

  11. Recent LSI-based Research • Implementation of kd-trees to reduce query matching complexity (Hughey & Berry ’00, Info. Retrieval) • Unsupervised learning model for data mining electronic commerce data (J. Jiang et al. ’99, IDA)

  12. Recent LSI-based Research • Nonlinear SVD approach for constraint-based feedback (E. Jiang & Berry ‘00, Lin. Alg. & Applications) • Future incorporation of up- and down-dating into LSI-based client/servers

  13. Information Filtering Concept: reduce a large document collection to a reasonably sized set of potentially retrievable documents. Goal: produce a relatively small subset containing a high proportion of relevant documents.

  14. Approach: Level Search Reduce the sparse SVD computation cost by selecting a small submatrix of the original term-by-document matrix. Use an undirected graph model: terms and documents are vertices; term weights are edge weights; a term occurring in a document (equivalently, a document containing a term) defines an edge.
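
The level-search idea above can be read as a breadth-first traversal of the bipartite term-document graph: start from the query's terms and alternate term-to-document and document-to-term hops for a fixed number of levels, keeping every vertex reached. A hypothetical minimal version (the paper's exact traversal, pruning, and weighting rules may differ; the toy collection is made up):

```python
from collections import deque

def level_search(term_to_docs, doc_to_terms, query_terms, max_levels=2):
    """BFS 'level search' on the bipartite term-document graph:
    collect all terms and documents within max_levels hops of the
    query's terms; these index the reduced submatrix."""
    terms = set(t for t in query_terms if t in term_to_docs)
    docs = set()
    frontier = deque((t, 0) for t in terms)
    while frontier:
        vertex, level = frontier.popleft()
        if level >= max_levels:
            continue
        if level % 2 == 0:  # term vertex -> documents containing it
            for d in term_to_docs.get(vertex, []):
                if d not in docs:
                    docs.add(d)
                    frontier.append((d, level + 1))
        else:               # document vertex -> its terms
            for t in doc_to_terms.get(vertex, []):
                if t not in terms:
                    terms.add(t)
                    frontier.append((t, level + 1))
    return terms, docs

# Hypothetical toy collection.
t2d = {"apple": ["d1"], "fruit": ["d1", "d2"], "car": ["d3"]}
d2t = {"d1": ["apple", "fruit"], "d2": ["fruit"], "d3": ["car"]}
terms, docs = level_search(t2d, d2t, ["apple"], max_levels=2)
```

Starting from "apple" with two levels, the search reaches document d1 and its other term "fruit", while the unrelated d3/"car" component is filtered out; the SVD would then be computed only on the rows and columns indexed by the returned sets.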

  15. Level Search

  16. Evaluation Measures • Recall: ratio of the no. of relevant documents retrieved to the total no. of relevant documents. • Precision: ratio of the no. of relevant documents retrieved to the total no. of documents retrieved.
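
Both ratios are straightforward to compute from a retrieved list and a relevance-judged set; a minimal sketch with made-up document IDs:

```python
def recall_precision(retrieved, relevant):
    """Recall = |retrieved & relevant| / |relevant|;
    precision = |retrieved & relevant| / |retrieved|."""
    hits = len(set(retrieved) & set(relevant))
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Made-up IDs: 2 of the 3 relevant docs appear among 4 retrieved docs.
r, p = recall_precision(retrieved=["d1", "d2", "d3", "d4"],
                        relevant=["d1", "d2", "d5"])
```

This gives recall 2/3 and precision 1/2, illustrating the usual tension: retrieving more documents can only raise recall but tends to lower precision.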

  17. Test Collections

  Collection   Docs   Terms   Non-zeros
  MEDLINE      1033    5831       52009
  TIME          425   10804       68240
  CISI         1469    5609       83602
  FBIS         4974   42500     1573306

  18. Avg Recall & Submatrix Sizes for Level Search

  Collection   Avg R    %D     %T     %N
  MEDLINE       85.7   24.8   63.2   27.8
  TIME          69.4   15.3   61.9   22.7
  CISI          55.1   21.4   64.1   25.2
  FBIS          82.1   28.5   55.0   52.9
  Mean          67.8   18.2   53.4   27.0

  19. Results for MEDLINE (5,831 terms, 1,033 docs)

  20. Results for CISI (5,609 terms, 1,469 docs)

  21. Results for TIME (10,804 terms, 425 docs)

  22. Results for FBIS (TREC-5) (42,500 terms, 4,974 docs)

  23. Level Search with Pruning

  24. Effects of Pruning (17,903 terms, 1,086 docs; TREC-5)

  25. Effects of Pruning (230 terms/doc, 29 terms/query)

  26. Impact • Level Search is a simple and cost-effective filtering method for LSI; scalable IR. • May reduce the effective term-by-document matrix size by 75% with no significant loss of LSI precision (less than 5%).

  27. Some Future Challenges for LSI • Agent-based software for indexing remote/distributed collections • Effective updating with global weighting • Incorporate phrases and proximity • Expand cosine matching to incorporate other similarity-based data (e.g., images) • Optimal number of dimensions

  28. LSI Web Site http://www.cs.utk.edu/~lsi Investigators Papers Demos Software

  29. SIAM Book (June ’99) Document File Prep. Vector Space Models Matrix Decompositions Query Management Ranking & Relevance Feedback User Interfaces A Course Project Further Reading

  30. CIR00 Workshop http://www.cs.utk.edu/cir00 10-22-00, Raleigh NC Invited Speakers: I. Dhillon (Texas) C. Ding (NERSC) K. Gallivan (FSU) D. Martin (UTK) H. Park (Minnesota) B. Pottenger (Lehigh) P. Raghavan (UTK) J. Wu (Boeing)
