
Adding Semantics to Information Retrieval



Presentation Transcript


  1. Adding Semantics to Information Retrieval By Kedar Bellare 20th April 2003

  2. Motivation • Current IR techniques are term-based • Semantics of the document and the query are not considered • Problems like polysemy and synonymy • Many advances in NLP and statistical modeling of semantics • Is semantic IR really required?

  3. Organization • Traditional IR • Statistics for Semantics – Latent Semantic Indexing • Semantic Resources for Semantics – Use of Semantic Nets, Conceptual Graphs, WordNet etc. in IR • Conclusion

  4. Information Retrieval An information retrieval system does not inform the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request.

  5. A Typical IR System

  6. Current IR • Preprocessing of Documents • Inverted Index • Removing stopwords and Stemming • Representation of Documents • Vector Space Model – TF and IDF • Document Clustering • Improvements to the above • Better weighting of Document Vectors • Link analysis – PageRank and Anchor Text
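The preprocessing steps above can be sketched in a few lines. The corpus, stopword list, and crude suffix-stripping “stemmer” below are toy stand-ins (a real system would use something like the Porter stemmer):

```python
# Minimal inverted-index sketch: stopword removal, crude stemming, and a
# term -> {document ids} map. Corpus and word lists are toy assumptions.
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "in", "and", "to"}

def stem(token):
    # Crude suffix stripping; a stand-in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def build_inverted_index(docs):
    """Map each preprocessed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            if token not in STOPWORDS:
                index[stem(token)].add(doc_id)
    return index

docs = ["The car is parked",
        "Retrieving documents about cars",
        "Indexing and retrieval of text"]
index = build_inverted_index(docs)
print(sorted(index["car"]))   # documents mentioning "car" or "cars"
```

With the index in place, term weighting (TF, IDF) and document vectors are computed over these postings rather than over raw text.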

  7. Latent Semantic Indexing • Problems with traditional approaches • Synonymy – automobile and car • Polysemy – Jaguar refers to both a car and an animal • LSI – linear algebra for capturing the “latent semantics” of documents • A method of dimensionality reduction

  8. LSI • Compares document vectors in the latent semantic space • Two documents can have a high similarity value even if they share no terms • Attempts to remove minor differences in terminology during indexing • Truncated SVD – used to construct the latent semantic space
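The “high similarity even with no shared terms” effect can be reproduced on a toy corpus. The matrix below and the rank-1 choice are illustrative assumptions: two documents, one containing only “car” and one containing only “automobile”, share no terms, yet both co-occur with “engine” elsewhere in the collection.

```python
# Toy LSI demo: documents sharing no terms can still be similar
# in latent space. Corpus and rank choice are illustrative assumptions.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Term-document matrix A (rows: car, automobile, engine; columns: d1..d4).
A = np.array([[1., 0., 1., 0.],   # "car"        in d1, d3
              [0., 1., 0., 1.],   # "automobile" in d2, d4
              [1., 1., 0., 0.]])  # "engine"     in d1, d2

T, s, Dt = np.linalg.svd(A, full_matrices=False)

k = 1                                        # keep only the strongest factor
docs_latent = (np.diag(s[:k]) @ Dt[:k]).T    # document coords in latent space

raw = cosine(A[:, 2], A[:, 3])               # d3 vs d4 in term space: 0.0
latent = cosine(docs_latent[2], docs_latent[3])
print(raw, latent)                           # latent similarity is high
```

The bridge document containing both “car” and “engine” (and the one with “automobile” and “engine”) is what lets the dominant factor pull the two single-term documents together.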

  9. Singular Value Decomposition • Given a t × d term-document matrix A, SVD factors it as A = T S Dᵀ, where T is t × r, S is r × r and D is d × r • T and D have orthonormal columns, S is diagonal, and r is the rank of A • The reduced space corresponds to the axes of greatest variation
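The stated properties of the decomposition can be checked numerically on a small made-up term-document matrix:

```python
# Sketch: verify the SVD properties on a tiny (made-up) term-document matrix.
import numpy as np

A = np.array([[2., 0., 1.],
              [0., 1., 0.],
              [1., 0., 1.],
              [0., 2., 0.]])     # 4 terms x 3 documents

T, s, Dt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)

# T and D have orthonormal columns, S is diagonal, and A = T S D^T.
assert np.allclose(T.T @ T, np.eye(3))
assert np.allclose(Dt @ Dt.T, np.eye(3))
assert np.allclose(T @ S @ Dt, A)
print(np.round(s, 3))            # singular values, in decreasing order
```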

  10. What LSI does • Uses truncated SVD • Instead of the full r-dimensional space, keeps only k factors: Ā = T_k S_k D_kᵀ, where T_k is t × k, S_k is k × k and D_k is d × k • Truncated SVD captures the underlying structure in the association of terms and documents

  11. Using the SVD model • Comparison of terms – entries of the matrix T S² Tᵀ • Comparison of documents – entries of the matrix D S² Dᵀ • Comparison of a term and a document – entries of the matrix T S Dᵀ • Query in the SVD model – q′ = qᵀ T S⁻¹
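Folding a query into the truncated space via q′ = qᵀ T_k S_k⁻¹ can be sketched as follows. The matrix and query are toy assumptions, and comparing the folded query to document coordinates by plain cosine similarity is one common convention:

```python
# Sketch: fold a query into the k-dimensional LSI space and rank documents.
# Corpus and query are toy assumptions.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

A = np.array([[1., 0., 1.],   # term 0
              [1., 1., 0.],   # term 1
              [0., 1., 1.],   # term 2
              [0., 0., 1.]])  # term 3   (4 terms x 3 documents)

T, s, Dt = np.linalg.svd(A, full_matrices=False)
k = 2
Tk, Sk, Dk = T[:, :k], np.diag(s[:k]), Dt[:k].T  # rows of Dk: doc coordinates

q = np.array([1., 1., 0., 0.])       # query mentions terms 0 and 1 (like doc 0)
q_hat = q @ Tk @ np.linalg.inv(Sk)   # q' = q^T T_k S_k^{-1}

sims = [cosine(q_hat, Dk[j]) for j in range(A.shape[1])]
print(np.round(sims, 3))             # doc 0 ranks above doc 2
```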

  12. Example of LSI

  13. Why does LSI work? • Despite a lot of empirical evidence, there is no concrete proof of why LSI works • No major degradation – the Eckart–Young theorem states that the rank-k truncated SVD is the rank-k matrix closest to A (minimum Frobenius-norm distance) • This bounds the approximation error but still does not explain the improvements in recall and precision
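The Eckart–Young property is easy to verify numerically (random matrix; a sketch, not a proof): the truncation error equals the root of the sum of the discarded squared singular values, and any other rank-k choice does at least as badly.

```python
# Numeric check of the Eckart-Young property on a random matrix (sketch only).
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]     # rank-k truncated SVD
err = np.linalg.norm(A - A_k)                # Frobenius norm by default

# Error equals sqrt of the sum of the discarded squared singular values.
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))

# A different rank-k approximation (keeping the *smallest* k factors
# instead) does at least as badly.
B = U[:, -k:] @ np.diag(s[-k:]) @ Vt[-k:]
assert np.linalg.norm(A - B) >= err
print(round(err, 4))
```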

  14. Why does LSI work? (contd.) • Papadimitriou et al. • Assume documents are generated from a set of topics with disjoint vocabularies • Prove that if the term-document matrix A is perturbed, LSI recovers the topic information and removes the noise • Kontostathis et al. • Essentially claim that LSI’s ability to trace term co-occurrences is what improves recall

  15. Advantages & Disadvantages • Advantages • Synonymy • Term Dependence • Disadvantages • Storage • Efficiency

  16. Semantic Resources • Semantic Nets – e.g. John gave Mary the book • Applied in UNL – e.g. Only a few farmers could use information technology in the early 1990s
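One possible in-memory encoding of such a semantic net is a set of labeled edges. The node names and relation labels below are my own illustrative choices; the slide only gives the sentence “John gave Mary the book”:

```python
# Illustrative semantic net for "John gave Mary the book":
# the giving event is a node linked to its participants by labeled relations.
# Node names and relation labels are assumptions, not from the slides.
semantic_net = {
    ("give1", "is_a"):      "giving_event",
    ("give1", "agent"):     "John",
    ("give1", "recipient"): "Mary",
    ("give1", "object"):    "book1",
    ("book1", "is_a"):      "book",
}

def related(node, relation):
    """Follow one labeled edge from a node, if present."""
    return semantic_net.get((node, relation))

print(related("give1", "agent"))   # John
```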

  17. Semantic Resources (contd.) • Conceptual Graphs – e.g. A bird is singing in a sycamore tree • Conceptual Dependency – e.g. I gave the man a book • Lexical Resources – WordNet

  18. Applications of Semantic Resources in IR • UNL • Used in improving document vectors • Conceptual Graphs • Graph matching of query and document • CDs • FERRET – Comparison of CD patterns • WordNet • Query Expansion using WordNet
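A minimal sketch of query expansion: a real system would draw synonyms from WordNet synsets, whereas the hand-made SYNONYMS table here is only a stand-in.

```python
# Query-expansion sketch. The SYNONYMS table is a toy stand-in for a
# lexical resource such as WordNet.
SYNONYMS = {
    "car":  {"automobile", "auto"},
    "fast": {"quick", "rapid"},
}

def expand_query(terms):
    """Return the query terms plus any known synonyms."""
    expanded = set(terms)
    for t in terms:
        expanded |= SYNONYMS.get(t, set())
    return expanded

print(sorted(expand_query(["fast", "car"])))
# -> ['auto', 'automobile', 'car', 'fast', 'quick', 'rapid']
```

Expanding the query with synonyms lets a term-based index match documents that use “automobile” when the user typed “car”, directly attacking the synonymy problem from slide 7.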

  19. Conclusion • Several things need to be considered before applying these methods to the Web • Storage • Efficiency • Knowledge content of the query • Clearly, a semantic method is needed to handle synonymy and polysemy • Currently, traditional models with minor hacks serve the purpose • In conclusion: a statistical or conceptual approach (or a combination of both) to modeling document semantics is definitely required

  20. References [1] M. W. Berry, S. T. Dumais, and G. W. O’Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4), pages 573–595, 1995. [2] S. Chakrabarti. Mining the Web – Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publishers, San Francisco, 2002. [3] S. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6), pages 391–407, 1990. [4] A. Kontostathis and W. M. Pottenger. A mathematical view of Latent Semantic Indexing: Tracing term co-occurrences. Technical report, Lehigh University, 2002. [5] R. Mandala, T. Takenobu, and T. Hozumi. The use of WordNet in Information Retrieval. In COLING/ACL Workshop on the Usage of WordNet in Natural Language Processing Systems, pages 31–37, 1998.

  21. References (contd.) [6] M. L. Mauldin. Retrieval performance in FERRET: a conceptual information retrieval system. In Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 347–355. ACM Press, 1991. [7] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. Introduction to WordNet: an on-line lexical database. International Journal of Lexicography, 3(4), pages 235–244, 1990. [8] M. Montes-y-Gomez, A. Lopez, and A. F. Gelbukh. Information retrieval with Conceptual Graph matching. In Database and Expert Systems Applications, pages 312–321, 2000. [9] C. H. Papadimitriou, H. Tamaki, P. Raghavan, and S. Vempala. Latent Semantic Indexing: A probabilistic analysis. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS), pages 159–168, 1998. [10] E. Rich and K. Knight. Artificial Intelligence. Tata McGraw-Hill Publishers, New Delhi, 2002.

  22. References (contd.) [11] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975. [12] C. Shah, B. Chowdhary, and P. Bhattacharyya. Constructing better Document Vectors using Universal Networking Language (UNL). In Proceedings of the International Conference on Knowledge-Based Computer Systems (KBCS) 2002. NCST, Navi Mumbai, India, 2002. [13] H. Uchida, M. Zhu, and S. T. Della. UNL: A gift for a millennium. Technical report, The United Nations University, 2000. [14] C. J. van Rijsbergen. Information Retrieval. Butterworths, London, 1979.
