
Latent Semantic Analysis


Presentation Transcript


  1. Latent Semantic Analysis

  2. Problem Introduction • Traditional term-matching methods don’t work well in information retrieval • We want to capture concepts instead of words; concepts are reflected in the words. However: • One term may have multiple meanings • Different terms may have the same meaning.

  3. The Problem • Two problems that arise with the vector space model: • synonymy: many ways to refer to the same object, e.g. car and automobile • leads to poor recall • polysemy: most words have more than one distinct meaning, e.g. model, python, chip • leads to poor precision

  4. The Problem • Example: Vector Space Model (from Lillian Lee), three documents as term vectors: • d1: auto engine bonnet tyres lorry boot • d2: car emissions hood make model trunk • d3: make hidden Markov model emissions normalize • Synonymy: d1 and d2 are related but share no terms, so their cosine is small • Polysemy: d2 and d3 share terms (make, model, emissions), so their cosine is large, but they are not truly related
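
A minimal sketch (Python with NumPy; the three toy documents follow the slide’s example, everything else is illustrative) of how raw term-matching cosines miss synonymy and are fooled by polysemy:

```python
import numpy as np

docs = {
    "d1": "auto engine bonnet tyres lorry boot",
    "d2": "car emissions hood make model trunk",
    "d3": "make hidden markov model emissions normalize",
}

# Build simple term-count vectors over the shared vocabulary.
vocab = sorted({w for text in docs.values() for w in text.split()})
index = {w: i for i, w in enumerate(vocab)}

def term_vector(text):
    v = np.zeros(len(vocab))
    for w in text.split():
        v[index[w]] += 1.0
    return v

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

X = {name: term_vector(text) for name, text in docs.items()}
print(cosine(X["d1"], X["d2"]))  # 0.0: related documents, but no shared terms (synonymy)
print(cosine(X["d2"], X["d3"]))  # 0.5: large overlap via "make", "model", "emissions", yet unrelated (polysemy)
```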

  5. LSI (Latent Semantic Analysis) • The LSI approach tries to overcome the deficiencies of term-matching retrieval by treating the unreliability of observed term-document association data as a statistical problem. • The goal is to find effective models to represent the relationship between terms and documents. Hence a set of terms, which is by itself incomplete and unreliable, will be replaced by some set of entities which are more reliable indicants. • Terms that did not appear in a document may still be associated with it. • LSI derives uncorrelated index factors that might be considered artificial concepts.

  6. Some History • Latent Semantic Indexing was developed at Bellcore (now Telcordia) in the late 1980s (1988). It was patented in 1989. • http://lsi.argreenhouse.com/lsi/LSI.html

  7. Some History • The first papers about LSI: • Dumais, S. T., Furnas, G. W., Landauer, T. K. and Deerwester, S. (1988), "Using latent semantic analysis to improve information retrieval." In Proceedings of CHI'88: Conference on Human Factors in Computing, New York: ACM, 281-285. • Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A. (1990) "Indexing by latent semantic analysis." Journal of the American Society for Information Science, 41(6), 391-407. • Foltz, P. W. (1990) "Using Latent Semantic Indexing for Information Filtering". In R. B. Allen (Ed.) Proceedings of the Conference on Office Information Systems, Cambridge, MA, 40-47.

  8. LSA • But first: • What is the difference between LSI and LSA??? • LSI refers to using it for indexing or information retrieval. • LSA refers to everything else.

  9. LSA • Idea (Deerwester et al): “We would like a representation in which a set of terms, which by itself is incomplete and unreliable evidence of the relevance of a given document, is replaced by some other set of entities which are more reliable indicants. We take advantage of the implicit higher-order (or latent) structure in the association of terms and documents to reveal such relationships.”

  10. SVD (Singular Value Decomposition) • How do we learn the concepts from data? • SVD is applied to the term-document matrix to derive the latent semantic structure model. • What is SVD?

  11. SVD Basics • Full SVD of the t x d term-document matrix X: X = T0 S0 D0^T • T0 is t x m (term vectors), S0 is m x m (diagonal matrix of singular values), D0^T is m x d (document vectors) • Select the first k singular values to get the reduced model: X̂ = T S D^T, with T t x k, S k x k, D^T k x d
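
A sketch of the full decomposition above (toy random matrix and NumPy are assumptions of this sketch; the shapes follow the slide’s t x m, m x m, m x d convention):

```python
import numpy as np

t, d = 6, 4                                  # terms x documents
X = np.random.default_rng(0).integers(0, 3, size=(t, d)).astype(float)

# Full (thin) SVD: X = T0 S0 D0^T, with m = min(t, d)
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
S0 = np.diag(s0)                             # m x m diagonal, singular values in decreasing order

assert np.allclose(X, T0 @ S0 @ D0t)         # the full SVD reconstructs X exactly
print(T0.shape, S0.shape, D0t.shape)         # (6, 4) (4, 4) (4, 4)
```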

  12. SVD Basics II • Rank-reduced Singular Value Decomposition (SVD) performed on the term-document matrix • all but the k highest singular values are set to 0 • produces a k-dimensional approximation of the original matrix (in the least-squares sense) • this is the “semantic space” • Compute similarities between entities in the semantic space (usually with the cosine)
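
A minimal sketch of the rank-k reduction and a cosine comparison in the resulting semantic space (the toy matrix, k = 2, and the variable names are assumptions of this sketch):

```python
import numpy as np

X = np.random.default_rng(1).random((8, 5))          # toy term-document matrix
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)

k = 2
T, S, Dt = T0[:, :k], np.diag(s0[:k]), D0t[:k, :]    # keep only the k largest singular values
X_hat = T @ S @ Dt                                   # rank-k least-squares approximation of X

doc_coords = (S @ Dt).T                              # documents as rows in the k-dimensional space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(doc_coords[0], doc_coords[1]))          # document-document similarity in the semantic space
```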

  13. SVD • SVD of the term-by-document matrix X: X = T0 S0 D0^T • If the singular values in S0 are ordered by size, we keep only the first k largest values and get a reduced model X̂ = T S D^T • X̂ doesn’t exactly match X, and it gets closer as more singular values are kept • This is what we want: we don’t want a perfect fit, since we think some of the 0s in X should be 1s and vice versa • The reduced model reflects the major associative patterns in the data, and ignores the smaller, less important influences and noise.
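
A short sketch illustrating that the approximation X̂ approaches X as more singular values are kept (toy matrix; measuring the error with the Frobenius norm is an assumption of this sketch):

```python
import numpy as np

X = np.random.default_rng(2).random((10, 7))
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)

for k in range(1, len(s0) + 1):
    X_hat = T0[:, :k] @ np.diag(s0[:k]) @ D0t[:k, :]
    print(k, np.linalg.norm(X - X_hat))   # reconstruction error shrinks toward 0 as k grows
```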

  14. Fundamental Comparison Quantities from the SVD Model • Comparing Two Terms: the dot product between two row vectors of X̂ reflects the extent to which two terms have a similar pattern of occurrence across the set of documents • Comparing Two Documents: the dot product between two column vectors of X̂ • Comparing a Term and a Document: the value of the corresponding individual cell of X̂
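
A sketch of the three comparison quantities (toy data and k = 2 are assumptions; the identities X̂X̂^T = T S² T^T and X̂^T X̂ = D S² D^T are what make the row-of-TS / row-of-DS shortcuts work):

```python
import numpy as np

X = np.random.default_rng(3).random((8, 5))
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T
X_hat = T @ S @ D.T

term_coords = T @ S          # term-term dot products are dot products of rows of T S
doc_coords = D @ S           # document-document dot products are dot products of rows of D S

print(term_coords[0] @ term_coords[2])   # compare term 0 with term 2
print(doc_coords[1] @ doc_coords[3])     # compare document 1 with document 3
print(X_hat[0, 1])                       # compare term 0 with document 1 (a single cell of X_hat)
```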

  15. LSI Paper Example • Document titles with index terms shown in italics

  16. term-document Matrix
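
For concreteness, a minimal sketch of building such a term-by-document count matrix from titles; the titles and stopword list below are only loosely modeled on the paper’s example and should be treated as hypothetical:

```python
import numpy as np

titles = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "the generation of random binary ordered trees",
    "the intersection graph of paths in trees",
]
stopwords = {"a", "of", "the", "for", "in"}

vocab = sorted({w for t in titles for w in t.split() if w not in stopwords})
X = np.zeros((len(vocab), len(titles)))          # terms are rows, documents are columns
for j, t in enumerate(titles):
    for w in t.split():
        if w not in stopwords:
            X[vocab.index(w), j] += 1
print(X.shape)                                   # (number of index terms, number of documents)
```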

  17. Latent Semantic Indexing • The same decomposition used for indexing: X = T0 S0 D0^T (T0: t x m, S0: m x m, D0^T: m x d) • Select the first k singular values: X̂ = T S D^T (T: t x k, S: k x k, D^T: k x d)

  18. T0

  19. S0

  20. D0

  21. SVD with minor terms dropped • Rows of TS give term coordinates, and rows of DS give document coordinates, in the latent space

  22. Terms Graphed in Two Dimensions

  23. Documents and Terms

  24. Change in Text Correlation

  25. Summary • Some Issues • SVD algorithm complexity is O(n^2k^3) • n = number of terms • k = number of dimensions in the semantic space (typically small, ~50 to 350) • for a stable document collection, the SVD only has to be run once • dynamic document collections: may need to rerun the SVD, but new documents can also be “folded in”
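
A sketch of the fold-in step, projecting a new document’s term vector into the existing k-dimensional space via d̂ = d^T T S^-1 (the standard fold-in formula; the toy matrix, k, and variable names are assumptions of this sketch):

```python
import numpy as np

X = np.random.default_rng(4).random((8, 5))          # existing collection
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
T, S = T0[:, :k], np.diag(s0[:k])

d_new = np.zeros(8)                       # term vector of a new document over the same 8 terms
d_new[[0, 3, 4]] = 1.0

d_coords = d_new @ T @ np.linalg.inv(S)   # coordinates of the new document in the semantic space
print(d_coords)                           # comparable to the existing document coordinates (rows of D)
```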

  26. Summary • Some issues • Finding the optimal dimension for the semantic space • precision and recall improve as the dimension is increased until it reaches the optimum, then slowly decrease until they match the standard vector model • run the SVD once with a large dimension, say k = 1000 • then any dimension <= k can be tested • in many tasks 150-350 dimensions work well; there is still room for research
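
A sketch of testing several candidate dimensions from a single decomposition by slicing the factors (toy sizes; the retrieval evaluation itself is left as a placeholder):

```python
import numpy as np

X = np.random.default_rng(5).random((100, 60))   # toy matrix; a real collection is far larger and sparser
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)   # computed once with the largest dimension needed

for k in (10, 25, 50):
    doc_coords = (np.diag(s0[:k]) @ D0t[:k, :]).T     # document vectors for this candidate dimension
    # ...evaluate retrieval precision/recall with doc_coords here...
    print(k, doc_coords.shape)
```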

  27. Summary • Some issues • SVD assumes normally distributed data • term occurrence is not normally distributed • matrix entries are weights, not counts, and the weights may be closer to normally distributed even when the counts are not
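
As an illustration of weighting the entries before the SVD, a sketch of log-entropy weighting, one common choice in the LSI literature (this particular formulation is an assumption, not necessarily the scheme the slide refers to):

```python
import numpy as np

counts = np.random.default_rng(6).integers(0, 4, size=(8, 5)).astype(float)  # toy term-document counts
n_docs = counts.shape[1]

gf = counts.sum(axis=1, keepdims=True)              # global frequency of each term
p = counts / np.maximum(gf, 1.0)                    # P(document | term), safe for all-zero rows
plogp = p * np.log(np.where(p > 0, p, 1.0))         # p log p, with 0 log 0 := 0
entropy = 1.0 + plogp.sum(axis=1) / np.log(n_docs)  # global weight, near 0 for evenly spread terms

weighted = entropy[:, None] * np.log1p(counts)      # weighted entries fed to the SVD instead of raw counts
print(weighted.shape)
```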
