Latent Semantic Indexing (LSI) is a powerful technique addressing significant issues in information retrieval, namely synonymy and polysemy. It captures implicit associations among terms, enabling better document retrieval; this presentation also gives a probabilistic analysis of why it works. LSI employs Singular Value Decomposition (SVD) to reduce the dimensionality of the term-document matrix, improving similarity measurement and clustering. Applications range from querying and topic identification to synonym tests (e.g. TOEFL) and essay scoring, showcasing LSI's versatility across fields.
Latent Semantic Indexing: A Probabilistic Analysis
Christos Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala
Motivation • Applications in several areas: • querying • clustering, identifying topics • Other: • synonym recognition (TOEFL, etc.) • psychology tests • essay scoring
Motivation • Latent Semantic Indexing is • Latent: Captures associations which are not explicit • Semantic: Represents meaning as a function of similarity to other entities • Cool: Lots of spiffy applications, and the potential for some good theory too
Overview • IR and two classical problems • How LSI works • Why LSI is effective: A probabilistic analysis
Information Retrieval • Text corpus with many documents (docs) • Given a query, find relevant docs • Classical problems: • synonymy: missing docs that refer to "automobile" when querying on "car" • polysemy: retrieving docs about the internet when querying on "surfing" • Solution: Represent docs (and queries) by their underlying latent concepts
Information Retrieval • Represent each document as a word vector • Represent corpus as term-document matrix (T-D matrix) • A classical method: • Create new vector from query terms • Find documents with highest dot-product
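A minimal sketch of this classical dot-product scheme; the tiny corpus, vocabulary, and query below are hypothetical, invented for illustration:

```python
# Classical retrieval: term-document matrix + dot-product scoring.
import numpy as np

docs = [
    "car engine repair",
    "automobile engine",
    "surfing the internet",
]
vocab = sorted({w for d in docs for w in d.split()})
t_index = {t: i for i, t in enumerate(vocab)}

# Term-document matrix A: rows = terms, columns = docs (raw counts).
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[t_index[w], j] += 1

def retrieve(query):
    # Represent the query as a term vector, rank docs by dot product.
    q = np.zeros(len(vocab))
    for w in query.split():
        if w in t_index:
            q[t_index[w]] += 1
    scores = q @ A            # one dot product per document
    return np.argsort(-scores)

print(retrieve("car"))  # scores the "automobile" doc 0: the synonymy problem
```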
Latent Semantic Indexing (LSI) • Process the term-document (T-D) matrix to expose statistical structure • Convert the high-dimensional space to a lower-dimensional one, throw out noise, keep the good stuff • Related to principal component analysis (PCA), multidimensional scaling (MDS)
Parameters • U = universe of terms • n = number of terms • m = number of docs • A = n × m matrix with rank r • columns represent docs • rows represent terms
Singular Value Decomposition (SVD) • LSI uses SVD, a linear-algebraic factorization of the T-D matrix: A = U D V^T
SVD • r is the rank of A • D: r × r diagonal matrix of the r singular values • U (n × r) and V (m × r): matrices composed of orthonormal columns • SVD is always possible • numerical methods for SVD exist • run time: O(mnc), where c denotes the average number of words per document
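A minimal sketch of rank-k LSI via numpy's SVD, reusing the matrix A from the sketch above; the choice k = 2 is arbitrary for this toy corpus:

```python
# Rank-k LSI: factor A, keep the k largest singular values, drop the rest.
import numpy as np

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U @ np.diag(s) @ Vt
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # best rank-k approximation

# Docs (and projected queries) can now be compared in k dimensions:
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]          # k x m, one column per doc
```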
Synonymy • LSI used in several ways: e.g. detecting synonymy • A measure of similarity for two terms t1 and t2: • In the original space: dot product of rows t1 and t2 of A (the (t1, t2) entry of AA^T) • Better: dot product of rows t1 and t2 of A_k (the (t1, t2) entry of A_k A_k^T)
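A minimal sketch of the two similarity measures, reusing A, A_k, and t_index from the sketches above:

```python
def term_similarity(M, t1, t2):
    # Dot product of rows t1 and t2 of M, i.e. the (t1, t2) entry of M @ M.T
    return M[t1] @ M[t2]

i, j = t_index["car"], t_index["automobile"]
print(term_similarity(A, i, j))    # 0 in the original space: never co-occur
print(term_similarity(A_k, i, j))  # typically nonzero after rank reduction
```

In the toy corpus "car" and "automobile" never co-occur, so their original-space similarity is 0; after rank reduction they usually pick up similarity through shared neighbors such as "engine".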
Synonymy (intuition) • Consider the term-term autocorrelation matrix AA^T • If two terms co-occur (e.g. supply-demand), we get nearly identical rows for them • The difference of those rows then yields only a small eigenvalue of AA^T • That eigenvector will likely be projected out in A_k, as it carries a weak eigenvalue, so the two terms become nearly interchangeable in the reduced space
A Performance Evaluation • Landauer & Dumais • Perform LSI on 30,000 encyclopedia articles • Take the synonym test portion of the TOEFL • Choose the most similar word for each test item • LSI: 64.4% (52.2% corrected for guessing) • People: 64.5% (52.7% corrected for guessing) • LSI's choices among the incorrect alternatives correlated 0.44 with those of test-takers
A Probabilistic Analysis (overview) • The model: • Topics sufficiently disjoint • Each doc drawn from a single (random) topic • Result: • With high probability (whp): • Docs from the same topic will be similar • Docs from different topics will be dissimilar
The Probabilistic Model • k topics, each corresponding to a set of words • The sets are mutually disjoint • Below, all random choices are made uniformly at random • A corpus of m docs, each doc created as follows:
The Probabilistic Model (cont.) • choosing a doc: • choose the length ℓ of the doc • choose a topic T • Repeat ℓ times: • With prob 1 − ε choose a word from topic T • With prob ε choose a word from the other topics
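A minimal sketch of this generative model; the topic word sets, document lengths, and ε value below are hypothetical illustrative choices:

```python
# Draw documents from k disjoint topics with noise rate eps.
import random

topics = [["car", "engine", "repair"],       # topic 0 (disjoint word sets)
          ["internet", "surfing", "web"]]    # topic 1

def make_doc(length, eps=0.1):
    t = random.randrange(len(topics))        # choose a topic T
    words = []
    for _ in range(length):                  # repeat length times
        if random.random() > eps:            # with prob 1 - eps: word from T
            words.append(random.choice(topics[t]))
        else:                                # with prob eps: word from others
            other = random.choice([s for s in topics if s is not topics[t]])
            words.append(random.choice(other))
    return t, words

corpus = [make_doc(random.randint(5, 20)) for _ in range(100)]
```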
Set up • Let v_d be the vector assigned to doc d by the rank-k LSI performed on the corpus • The rank-k LSI is δ-skewed if, for all docs d, d′ from different topics, v_d · v_d′ ≤ δ |v_d| |v_d′| • (intuition) Docs from the same topic should be similar (high dot product), docs from different topics dissimilar (low dot product)
The Result • Theorem: Assume the corpus is created from the model just described (k topics, noise parameter ε, etc.). Then the rank-k LSI is O(ε)-skewed with high probability
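A minimal sketch that checks the claim empirically, drawing a corpus from the generator above (make_doc and topics) and comparing intra- vs. inter-topic dot products of the LSI vectors; corpus size and document length are arbitrary:

```python
import numpy as np

labels, texts = zip(*(make_doc(30) for _ in range(50)))
vocab = sorted({w for words in texts for w in words})
A = np.zeros((len(vocab), len(texts)))
for j, words in enumerate(texts):
    for w in words:
        A[vocab.index(w), j] += 1

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = len(topics)
V_k = (np.diag(s[:k]) @ Vt[:k, :]).T              # LSI vector v_d per doc
V_k /= np.linalg.norm(V_k, axis=1, keepdims=True) # normalize for comparison

sims = V_k @ V_k.T
same = np.array([[a == b for b in labels] for a in labels])
print("mean intra-topic dot product:", sims[same].mean())   # should be high
print("mean inter-topic dot product:", sims[~same].mean())  # should be low
```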
Proof Sketch • Show that with k topics, we obtain k orthogonal subspaces • Assume strictly disjoint topics (ε = 0): • show that whp the k highest eigenvalues of AA^T indeed correspond to the k topics (are not intra-topic) • (ε > 0): relax the disjointness by using a matrix perturbation analysis
Extensions • Ideally, theory should go beyond explaining • Potential for a speed-up: • project the doc vectors onto a suitably small space (see the sketch below) • perform LSI in this smaller space • Yields O(m(n + c log n)) compared to O(mnc)
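A minimal sketch of the idea, reusing A from the first sketch: project the term dimension down with a random Gaussian matrix before the SVD. The choice d = 4 is an arbitrary toy value, and this is only an illustration of the approach, not the paper's exact construction:

```python
# Random projection first, then SVD on the much smaller d x m matrix.
import numpy as np

n, m = A.shape
d = 4
R = np.random.randn(d, n) / np.sqrt(d)    # random projection matrix
B = R @ A                                 # d x m, cheap to compute
U, s, Vt = np.linalg.svd(B, full_matrices=False)
doc_vectors = np.diag(s[:2]) @ Vt[:2, :]  # rank-2 LSI in the projected space
```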
Future work • Learn more abstract algebra (math)! • Extensions: • docs spanning multiple topics? • polysemy? • other positive properties? • Another important role of theory: • Unify and generalize: spectral analysis has found applications elsewhere in IR