Latent Semantic Indexing (LSI) is a powerful technique addressing significant issues in information retrieval, namely synonymy and polysemy. It captures implicit associations among terms, enabling better document retrieval; this presentation also gives a probabilistic analysis of why it works. LSI employs Singular Value Decomposition (SVD) to reduce the dimensionality of the term-document matrix, improving similarity measurement and clustering. Applications range from querying and topic identification to synonym tests (e.g. TOEFL) and essay scoring, showcasing LSI's versatility across fields.
Latent Semantic Indexing: A Probabilistic Analysis
Christos Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala
Motivation • Applications in several areas: • querying • clustering, identifying topics • Other: • synonym recognition (TOEFL, etc.) • psychology tests • essay scoring
Motivation • Latent Semantic Indexing is • Latent: Captures associations which are not explicit • Semantic: Represents meaning as a function of similarity to other entities • Cool: Lots of spiffy applications, and the potential for some good theory too
Overview • IR and two classical problems • How LSI works • Why LSI is effective: A probabilistic analysis
Information Retrieval • Text corpus with many documents (docs) • Given a query, find relevant docs • Classical problems: • synonymy: missing docs that refer to "automobile" when querying on "car" • polysemy: retrieving docs about the internet when querying on "surfing" • Solution: Represent docs (and queries) by their underlying latent concepts
Information Retrieval • Represent each document as a word vector • Represent corpus as term-document matrix (T-D matrix) • A classical method: • Create new vector from query terms • Find documents with highest dot-product
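A minimal sketch of this classical dot-product scheme; the tiny corpus, vocabulary, and query below are hypothetical, invented for illustration:

```python
# Classical retrieval: term-document matrix + dot-product scoring.
import numpy as np

docs = [
    "car engine repair",
    "automobile engine",
    "surfing the internet",
]
vocab = sorted({w for d in docs for w in d.split()})
t_index = {t: i for i, t in enumerate(vocab)}

# Term-document matrix A: rows = terms, columns = docs (raw counts).
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[t_index[w], j] += 1

def retrieve(query):
    # Represent the query as a term vector, rank docs by dot product.
    q = np.zeros(len(vocab))
    for w in query.split():
        if w in t_index:
            q[t_index[w]] += 1
    scores = q @ A            # one dot product per document
    return np.argsort(-scores)

print(retrieve("car"))  # scores the "automobile" doc 0: the synonymy problem
```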
Latent Semantic Indexing (LSI) • Process the term-document (T-D) matrix to expose statistical structure • Convert the high-dimensional space to a lower-dimensional one, throw out noise, keep the good stuff • Related to principal component analysis (PCA), multidimensional scaling (MDS)
Parameters • U = universe of terms • n = number of terms • m = number of docs • A = n × m matrix with rank r • columns represent docs • rows represent terms
Singular Value Decomposition (SVD) • LSI uses SVD, a linear-algebraic factorization of the T-D matrix: A = U D V^T
SVD • r is the rank of A • D: r × r diagonal matrix of the r singular values • U (n × r) and V (m × r): matrices composed of orthonormal columns • SVD is always possible • numerical methods for SVD exist • run time: O(mnc), where c denotes the average number of words per document
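A minimal sketch of rank-k LSI via numpy's SVD, reusing the matrix A from the sketch above; the choice k = 2 is arbitrary for this toy corpus:

```python
# Rank-k LSI: factor A, keep the k largest singular values, drop the rest.
import numpy as np

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U @ np.diag(s) @ Vt
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # best rank-k approximation

# Docs (and projected queries) can now be compared in k dimensions:
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]          # k x m, one column per doc
```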
Synonymy • LSI used in several ways: e.g. detecting synonymy • A measure of similarity for two terms t1 and t2: • In the original space: dot product of rows t1 and t2 of A (the (t1, t2) entry of AA^T) • Better: dot product of rows t1 and t2 of A_k (the (t1, t2) entry of A_k A_k^T)
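A minimal sketch of the two similarity measures, reusing A, A_k, and t_index from the sketches above:

```python
def term_similarity(M, t1, t2):
    # Dot product of rows t1 and t2 of M, i.e. the (t1, t2) entry of M @ M.T
    return M[t1] @ M[t2]

i, j = t_index["car"], t_index["automobile"]
print(term_similarity(A, i, j))    # 0 in the original space: never co-occur
print(term_similarity(A_k, i, j))  # typically nonzero after rank reduction
```

In the toy corpus "car" and "automobile" never co-occur, so their original-space similarity is 0; after rank reduction they usually pick up similarity through shared neighbors such as "engine".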
Synonymy (intuition) • Consider the term-term autocorrelation matrix AA^T • If two terms co-occur (e.g. supply-demand), we get nearly identical rows for them • The difference of those rows then yields only a small eigenvalue of AA^T • That eigenvector will likely be projected out in A_k, as it carries a weak eigenvalue, so the two terms become nearly interchangeable in the reduced space
A Performance Evaluation • Landauer & Dumais • Perform LSI on 30,000 encyclopedia articles • Take the synonym test portion of the TOEFL • Choose the most similar word for each test item • LSI: 64.4% (52.2% corrected for guessing) • People: 64.5% (52.7% corrected for guessing) • LSI's choices among the incorrect alternatives correlated 0.44 with those of test-takers
A Probabilistic Analysis (overview) • The model: • Topics sufficiently disjoint • Each doc drawn from a single (random) topic • Result: • With high probability (whp): • Docs from the same topic will be similar • Docs from different topics will be dissimilar
The Probabilistic Model • k topics, each corresponding to a set of words • The sets are mutually disjoint • Below, all random choices are made uniformly at random • A corpus of m docs, each doc created as follows:
The Probabilistic Model (cont.) • choosing a doc: • choose the length ℓ of the doc • choose a topic T • Repeat ℓ times: • With prob 1 − ε choose a word from topic T • With prob ε choose a word from the other topics
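A minimal sketch of this generative model; the topic word sets, document lengths, and ε value below are hypothetical illustrative choices:

```python
# Draw documents from k disjoint topics with noise rate eps.
import random

topics = [["car", "engine", "repair"],       # topic 0 (disjoint word sets)
          ["internet", "surfing", "web"]]    # topic 1

def make_doc(length, eps=0.1):
    t = random.randrange(len(topics))        # choose a topic T
    words = []
    for _ in range(length):                  # repeat length times
        if random.random() > eps:            # with prob 1 - eps: word from T
            words.append(random.choice(topics[t]))
        else:                                # with prob eps: word from others
            other = random.choice([s for s in topics if s is not topics[t]])
            words.append(random.choice(other))
    return t, words

corpus = [make_doc(random.randint(5, 20)) for _ in range(100)]
```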
Set up • Let v_d be the vector assigned to doc d by the rank-k LSI performed on the corpus • The rank-k LSI is δ-skewed if, for all docs d, d′ from different topics, v_d · v_d′ ≤ δ |v_d| |v_d′| • (intuition) Docs from the same topic should be similar (high dot product), docs from different topics dissimilar (low dot product)
The Result • Theorem: Assume the corpus is created from the model just described (k topics, noise parameter ε, etc.). Then the rank-k LSI is O(ε)-skewed with high probability
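A minimal sketch that checks the claim empirically, drawing a corpus from the generator above (make_doc and topics) and comparing intra- vs. inter-topic dot products of the LSI vectors; corpus size and document length are arbitrary:

```python
import numpy as np

labels, texts = zip(*(make_doc(30) for _ in range(50)))
vocab = sorted({w for words in texts for w in words})
A = np.zeros((len(vocab), len(texts)))
for j, words in enumerate(texts):
    for w in words:
        A[vocab.index(w), j] += 1

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = len(topics)
V_k = (np.diag(s[:k]) @ Vt[:k, :]).T              # LSI vector v_d per doc
V_k /= np.linalg.norm(V_k, axis=1, keepdims=True) # normalize for comparison

sims = V_k @ V_k.T
same = np.array([[a == b for b in labels] for a in labels])
print("mean intra-topic dot product:", sims[same].mean())   # should be high
print("mean inter-topic dot product:", sims[~same].mean())  # should be low
```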
Proof Sketch • Show that with k topics, we obtain k orthogonal subspaces • Assume strictly disjoint topics (ε = 0): • show that whp the k highest eigenvalues of AA^T indeed correspond to the k topics (are not intra-topic) • (ε > 0): relax the disjointness by using a matrix perturbation analysis
Extensions • Ideally, theory should go beyond explaining • Potential for a speed-up: • project the doc vectors onto a suitably small space (see the sketch below) • perform LSI in this smaller space • Yields O(m(n + c log n)) compared to O(mnc)
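A minimal sketch of the idea, reusing A from the first sketch: project the term dimension down with a random Gaussian matrix before the SVD. The choice d = 4 is an arbitrary toy value, and this is only an illustration of the approach, not the paper's exact construction:

```python
# Random projection first, then SVD on the much smaller d x m matrix.
import numpy as np

n, m = A.shape
d = 4
R = np.random.randn(d, n) / np.sqrt(d)    # random projection matrix
B = R @ A                                 # d x m, cheap to compute
U, s, Vt = np.linalg.svd(B, full_matrices=False)
doc_vectors = np.diag(s[:2]) @ Vt[:2, :]  # rank-2 LSI in the projected space
```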
Future work • Learn more abstract algebra (math)! • Extensions: • docs spanning multiple topics? • polysemy? • other positive properties? • Another important role of theory: • Unify and generalize: spectral analysis has found applications elsewhere in IR