Latent Semantic Analysis A Gentle Tutorial Introduction Tutorial Resources cis.paisley.ac.uk/giro-ci0/GU_LSA_TUT

Latent Semantic Analysis: A Gentle Tutorial Introduction
Tutorial Resources: http://cis.paisley.ac.uk/giro-ci0/GU_LSA_TUT
M.A. Girolami
University of Glasgow DCS Tutorial

Contents
• Latent Semantic Analysis
  – Motivation
  – Singular Value Decomposition
  – Term Document Matrix Structure
  – Query and Document Similarity in Latent Space
• Probabilistic Views on LSA
  – Factor Analytic Model
  – Generative Model Representation
  – Alternate Basis to the Principal Directions
• Latent Semantic & Document Clustering (In the Bar later)
  – Principal Direction Clustering
  – Hierarchic Clustering with LSA

Latent Semantic Analysis
• Motivation
• Lexical matching at term level is inaccurate (claimed)
• Polysemy – words with a number of ‘meanings’ – term matching returns irrelevant documents – impacts precision
• Synonymy – a number of words with the same ‘meaning’ – term matching misses relevant documents – impacts recall
• LSA assumes that there exists a LATENT structure in word usage – obscured by variability in word choice
• Analogous to the signal + additive noise model in signal processing

Latent Semantic Analysis
• Word usage defined by term and document co-occurrence – matrix structure
• Latent structure / semantics in word usage
• Clustering documents or words – no shared space
• Two mode factor analysis – dyadic decomposition into ‘latent semantic’ factor space – employing Singular Value Decomposition
• Cubic Computational Scaling – reasonable!

Singular Value Decomposition
• $M \times N$ Term × Document matrix ($M \gg N$): $D = [d_1, d_2, \ldots, d_N]$ and $d = [t_1, t_2, \ldots, t_M]^T$
Consider the linear combination of terms $u_1 t_1 + u_2 t_2 + \ldots + u_M t_M = u^T d$
which maximises $E\{(u^T d)^2\} = E\{u^T d d^T u\} = u^T E\{d d^T\} u \approx u^T D D^T u$
subject to $u^T u = 1$

Singular Value Decomposition
Maximise $u^T D D^T u$ s.t. $u^T u = 1$
Construct the Lagrangian $u^T D D^T u - \lambda u^T u$
Vector of partial derivatives set to zero: $D D^T u - \lambda u = (D D^T - \lambda I) u = 0$
As $u \neq 0$, $D D^T - \lambda I$ must be singular, i.e. $|D D^T - \lambda I| = 0$
This is a polynomial in $\lambda$ of degree $M$ with characteristic roots – called the eigenvalues (German eigen = own, unique to, particular to)

Singular Value Decomposition
The first root is called the principal eigenvalue, which has an associated orthonormal ($u^T u = 1$) eigenvector $u$
Subsequent roots are ordered such that $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_M$, with rank($D$) non-zero values.
Eigenvectors form an orthonormal basis, i.e. $u_i^T u_j = \delta_{ij}$
The eigenvalue decomposition of $D D^T = U \Sigma U^T$, where $U = [u_1, u_2, \ldots, u_M]$ and $\Sigma = \mathrm{diag}[\lambda_1, \lambda_2, \ldots, \lambda_M]$
Similarly, the eigenvalue decomposition of $D^T D = V \Sigma V^T$
The SVD is closely related to the above: $D = U \Sigma^{1/2} V^T$
The left singular vectors are $U$ (eigenvectors of $D D^T$), the right singular vectors are $V$ (eigenvectors of $D^T D$), and the singular values are the square roots of the eigenvalues.
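
The link between the two eigendecompositions and the SVD can be checked numerically. The following is a minimal NumPy sketch, assuming a small random matrix stands in for D; the sizes, the seed and all variable names are illustrative and not from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((6, 4))   # hypothetical 6 terms x 4 documents

# Eigenvalues of the symmetric matrices D D^T and D^T D.
# eigh returns them in ascending order, so reverse to get largest first.
lam_left = np.linalg.eigh(D @ D.T)[0][::-1]
lam_right = np.linalg.eigh(D.T @ D)[0][::-1]

# Thin SVD: D = U diag(s) V^T, singular values in descending order.
U, s, Vt = np.linalg.svd(D, full_matrices=False)

rank = np.linalg.matrix_rank(D)
# Singular values are the square roots of the non-zero eigenvalues.
sqrt_left = np.sqrt(lam_left[:rank])
sqrt_right = np.sqrt(lam_right[:rank])

# The first left singular vector attains the maximum of u^T D D^T u.
top_value = U[:, 0] @ D @ D.T @ U[:, 0]
reconstruction = U @ np.diag(s) @ Vt
```

Here `D @ D.T` has only rank(D) non-zero eigenvalues, which is why the comparison is restricted to the first `rank` entries.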

SVD Properties
• $D = U S V^T = \sum_{i=1}^{N} \sigma_i u_i v_i^T$ and $D_K = \sum_{i=1}^{K} \sigma_i u_i v_i^T = U_K S_K V_K^T$ with $K < N$: $U_K^T U_K = I_K = V_K^T V_K$
• Then $D_K$ is the best rank $K$ approximation to $D$, in the Frobenius norm sense
• $K$-dim orthonormal projections $S_K^{-1} U_K^T D = V_K^T$ preserve the maximum amount of variability
• Under the assumption that the columns of $D$ are multivariate Gaussian, $V$ defines the principal axes of the ellipse of constant variance $\lambda_i$ in the original space
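
The truncation and projection properties above can be illustrated with a short NumPy sketch. The matrix, its size and the choice K = 2 are assumptions made only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.standard_normal((8, 5))          # hypothetical term x document matrix
U, s, Vt = np.linalg.svd(D, full_matrices=False)

K = 2
U_K, S_K, Vt_K = U[:, :K], np.diag(s[:K]), Vt[:K, :]
D_K = U_K @ S_K @ Vt_K                   # rank-K approximation of D

# The Frobenius error equals the root-sum-square of the discarded singular values.
err = np.linalg.norm(D - D_K, "fro")
expected_err = np.sqrt(np.sum(s[K:] ** 2))

# Projecting D with S_K^{-1} U_K^T recovers V_K^T.
projected = np.linalg.inv(S_K) @ U_K.T @ D
```

Comparing `err` with `expected_err` shows how much of D is lost for a given K, which is one practical way to look at the effect of the latent dimension.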

SVD Example
D -- 10 x 2, U -- 10 x 2, S -- 2 x 2, V^T -- 2 x 2

         D                     U
  2.9002   3.6790      -0.2750  -0.1242
  4.0860   5.2366      -0.3896  -0.1846
  1.9954   3.3687      -0.2247  -0.2369
  3.5069   1.6748      -0.2150   0.3514
  4.4620   2.7684      -0.3005   0.3318
 -2.9444  -4.6447       0.3177   0.2906
 -4.1132  -4.7043       0.3682   0.0833
 -3.6208  -5.0181       0.3613   0.2319
 -3.0558  -4.1821       0.3027   0.1861
 -6.1204  -2.4790       0.3563  -0.6935

         S                    V^T
 16.9491   0           -0.6960  -0.7181
  0        3.8491       0.7181  -0.6960
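
The example can be reproduced with NumPy. This is a sketch that assumes the twenty values of D are laid out row by row as ten two-dimensional points; the signs of the columns of U and rows of V^T returned by a given SVD routine may be flipped relative to the slide.

```python
import numpy as np

# The 10 x 2 matrix D from the example, one row per data point.
D = np.array([
    [ 2.9002,  3.6790],
    [ 4.0860,  5.2366],
    [ 1.9954,  3.3687],
    [ 3.5069,  1.6748],
    [ 4.4620,  2.7684],
    [-2.9444, -4.6447],
    [-4.1132, -4.7043],
    [-3.6208, -5.0181],
    [-3.0558, -4.1821],
    [-6.1204, -2.4790],
])

# Thin SVD gives U (10 x 2), the singular values, and V^T (2 x 2).
U, s, Vt = np.linalg.svd(D, full_matrices=False)
print(U.shape, s.shape, Vt.shape)
print(np.round(s, 4))       # singular values, largest first
print(np.round(Vt, 4))      # rows are the right singular vectors
```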

SVD Properties
• There is an implicit assumption that the observed data distribution is multivariate Gaussian
• Can consider as a probabilistic generative model – latent variables are Gaussian – sub-optimal in likelihood terms for non-Gaussian distributions
• Employed in signal processing for noise filtering – dominant subspace contains the majority of the information-bearing part of the signal
• Similar rationale when applying SVD to LSI

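Computing SVD
• Power Method – one numerical approach
Random initialisation of vector $u_0$
Set $\tilde{u}_1 = D D^T u_0$ and $u_1 = \tilde{u}_1 / \sqrt{\tilde{u}_1^T \tilde{u}_1}$
then $\tilde{u}_2 = D D^T u_1$ and $u_2 = \tilde{u}_2 / \sqrt{\tilde{u}_2^T \tilde{u}_2}$
Then $\tilde{u}_i = D D^T u_{i-1}$ and $u_i = \tilde{u}_i / \sqrt{\tilde{u}_i^T \tilde{u}_i}$
As $i \to \infty$, $u_i$ converges to the principal eigenvector and $\sqrt{\tilde{u}_i^T \tilde{u}_i} \to \lambda_1$
• Subsequent eigenvectors use deflation: $\tilde{u}_1 = (D D^T - \lambda_1 u_1 u_1^T) u_0$, where $u_1$ is the principal eigenvector already found
• Note: for a term document matrix, computation of $u_1$ is inexpensive – subsequent eigenvectors require matrix-vector operations on a dense matrix.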
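
The iteration and the deflation step can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions: the function name `power_method`, the fixed iteration count and the random test matrix are hypothetical choices, and it forms the dense product D D^T explicitly, which a large sparse term document matrix would avoid.

```python
import numpy as np

def power_method(A, n_iter=2000, seed=0):
    """Return (eigenvalue, unit eigenvector) of the dominant eigenpair of a symmetric PSD matrix A."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(A.shape[0])   # random initialisation u_0
    u /= np.linalg.norm(u)
    lam = 0.0
    for _ in range(n_iter):
        u_tilde = A @ u                   # unnormalised iterate
        lam = np.linalg.norm(u_tilde)     # tends to the dominant eigenvalue
        u = u_tilde / lam                 # normalise to unit length
    return lam, u

rng = np.random.default_rng(42)
D = rng.standard_normal((6, 4))           # hypothetical small term x document matrix
A = D @ D.T

lam1, u1 = power_method(A)
# Deflation: subtract the component already found, then iterate again.
A_deflated = A - lam1 * np.outer(u1, u1)
lam2, u2 = power_method(A_deflated, seed=1)
```

Convergence speed depends on the ratio of consecutive eigenvalues, so a fixed iteration count is a simplification; a convergence test on the change in `u` would be the more usual stopping rule.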

Term Document Matrix Structure
• Create artificially heterogeneous collection
• 100 documents from each of 3 distinct newsgroups
• Indexed using standard stop word list
• 12418 distinct terms
• Term × Document Matrix (12418 × 300)
• 8% fill of sparse matrix
• Sort terms by rank – structure apparent
• Matrix of cosine similarity between documents
• Clear structure apparent
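
The construction steps above can be sketched on a toy scale. The four documents and the stop word list below are hypothetical stand-ins for the newsgroup collection, and raw term counts are assumed as the matrix entries.

```python
import numpy as np

# Hypothetical toy collection standing in for the newsgroup documents.
docs = [
    "the cat sat on the mat",
    "the cat chased the spider",
    "stock markets fell sharply",
    "markets rallied as stock prices rose",
]
stop_words = {"the", "on", "as"}          # illustrative stop word list

tokenised = [[w for w in d.split() if w not in stop_words] for d in docs]
vocab = sorted({w for doc in tokenised for w in doc})
index = {w: i for i, w in enumerate(vocab)}

# Term x Document matrix of raw counts (M terms x N documents).
D = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(tokenised):
    for w in doc:
        D[index[w], j] += 1

fill = np.count_nonzero(D) / D.size       # fraction of non-zero entries

# Cosine similarity between documents (columns of D).
norms = np.linalg.norm(D, axis=0)
cosine = (D.T @ D) / np.outer(norms, norms)
```

Even at this scale the cosine matrix shows a block structure: the two "cat" documents are similar to each other and have zero similarity to the two "markets" documents.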

Term Document Matrix Structure

Query and Document Similarity in Latent Space
• Rank 3: $D_3 = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \sigma_3 u_3 v_3^T$
• Projection into 3-d Latent Semantic Space of all documents achieved by $S_3^{-1} U_3^T D$
• A query $q$ in the LSA space: $S_3^{-1} U_3^T q$
• Similarity in LSA space
  $(S_3^{-1} U_3^T q)^T S_3^{-1} U_3^T D$
  $= q^T U_3 S_3^{-1} S_3^{-1} U_3^T D$
  $= q^T U_3 \Sigma_3^{-1} U_3^T D$
  $= q_{\mathrm{exp}}^T D = q^T \Theta D$
• LSA similarity metric $\Theta = U_3 \Sigma_3^{-1} U_3^T$ – term expansion
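
The chain of equalities can be verified numerically. The sketch below assumes a small random non-negative matrix and query in place of real term counts; `Theta` is built from the squared singular values, since the eigenvalue matrix is the square of the singular value matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
D = rng.random((12, 6))                   # hypothetical 12 terms x 6 documents
q = rng.random(12)                        # hypothetical query in term space

U, s, Vt = np.linalg.svd(D, full_matrices=False)
K = 3
U_K = U[:, :K]
S_K_inv = np.diag(1.0 / s[:K])

docs_latent = S_K_inv @ U_K.T @ D         # K x N, equals V_K^T
q_latent = S_K_inv @ U_K.T @ q            # K-dimensional query

# Similarity computed directly in the latent space.
sim_latent = q_latent @ docs_latent

# The same similarity written as q^T Theta D in the original term space.
Theta = U_K @ np.diag(1.0 / s[:K] ** 2) @ U_K.T
sim_expanded = q @ Theta @ D
```

The second form makes the "term expansion" reading concrete: `q @ Theta` is a dense vector over all terms, even when `q` itself mentions only a few.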

Query and Document Similarity in Latent Space
• Project documents into 3-D latent space
• Project query

Random Projections
• Important theoretical result
• Random projection from M-dim to L-dim space
• Where L << M
• Euclidean distances and angles (norms and inner products) are approximately preserved with high probability
• LSA can then be performed using SVD on the reduced dimensional L × N matrix (less costly)
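
A small experiment illustrates the distance-preservation claim. The sketch assumes a Gaussian random projection matrix, which is one common choice and not necessarily the one intended in the tutorial; the sizes M, L, N and the seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
M, L, N = 2000, 400, 20                   # illustrative sizes with L << M
D = rng.standard_normal((M, N))           # stand-in for a term x document matrix

# Gaussian random projection, scaled so squared norms are preserved in expectation.
R = rng.standard_normal((L, M)) / np.sqrt(L)
D_low = R @ D                             # reduced L x N matrix

def pairwise_distances(X):
    """Euclidean distances between all pairs of columns of X."""
    diff = X[:, :, None] - X[:, None, :]
    return np.linalg.norm(diff, axis=0)

i, j = np.triu_indices(N, k=1)
ratios = pairwise_distances(D_low)[i, j] / pairwise_distances(D)[i, j]
print(ratios.min(), ratios.max())         # both should be close to 1
```

The spread of `ratios` around 1 shrinks as L grows, so L trades projection cost against distortion.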

LSA Performance
• LSA consistently improves recall on standard test collections (precision/recall generally improved)
• Variable performance on larger TREC collections
• Dimensionality of Latent Space – a magic number – 300 to 1000 seems to work fine – no satisfactory way of assessing value.
• Computational cost – at present – prohibitive

Probabilistic Views on LSA
• Factor Analytic Model
• Generative Model Representation
• Alternate Basis to the Principal Directions

Factor Analytic Model
• $d = A f + n$
• $p(d) = \sum_f p(d|f)\,p(f)$
• This probabilistic representation underlies LSA, where prior and likelihood are both multivariate Gaussian.

Generative Model Representation
• Generate a document d with probability p(d)
• Having observed d, generate a semantic factor with probability p(f|d)
• Having observed a semantic factor, generate a word with probability p(w|f)

Generative Model Representation
[Figure: Documents, chosen with P(d), are linked to Factor 1, Factor 2 and Factor 3 through P(f|d); the factors generate words through P(w|f). Example document: “The cat sat on the mat and the quick brown fox jumped…”; example word: “spider”.]

Generative Model Representation
• Model representation as joint probability: $p(d,w) = p(d)\,p(w|d) = p(d) \sum_f p(w|f)\,p(f|d)$, with w and d conditionally independent given f
• $p(d,w) = \sum_f p(w|f)\,p(f)\,p(d|f)$
• Note similarity with $D_K = \sum_{i=1}^{K} \sigma_i u_i v_i^T$
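
The two forms of the joint probability, and the analogy with the truncated SVD, can be checked on random distributions. The sizes and the helper `random_distributions` are hypothetical; note that the joint is stored here as documents × words, the transpose of the Term × Document layout.

```python
import numpy as np

rng = np.random.default_rng(5)
n_docs, n_words, n_factors = 4, 6, 3      # illustrative sizes

def random_distributions(rows, cols):
    """Return a rows x cols array whose rows each sum to 1."""
    x = rng.random((rows, cols))
    return x / x.sum(axis=1, keepdims=True)

p_d = random_distributions(1, n_docs)[0]                 # p(d)
p_f_given_d = random_distributions(n_docs, n_factors)    # p(f|d), one row per document
p_w_given_f = random_distributions(n_factors, n_words)   # p(w|f), one row per factor

# First form: p(d,w) = p(d) * sum_f p(w|f) p(f|d)
joint_asym = p_d[:, None] * (p_f_given_d @ p_w_given_f)

# Second form via Bayes' rule: p(f) = sum_d p(f|d) p(d), p(d|f) = p(f|d) p(d) / p(f)
p_f = p_d @ p_f_given_d
p_d_given_f = (p_f_given_d * p_d[:, None]) / p_f         # entry [d, f]

# Matrix product with the same shape as U_K S_K V_K^T.
joint_sym = p_d_given_f @ np.diag(p_f) @ p_w_given_f
```

In the second form `p_d_given_f`, `diag(p_f)` and `p_w_given_f` play the roles of the two singular vector matrices and the singular values, except that every entry is a non-negative probability.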

[Figure: worked example for the word “spider” and the document “The cat sat on the mat and the quick brown fox jumped…”]
• P(d) = 0.003
• p(f=1|d) = 0.6, p(f=2|d) = 0.1, p(f=3|d) = 0.25, p(f=4|d) = 0.05
• P(w=spider|f) for the four factors, as listed: 0.6, 0.02, 0.01, 0.1
• $p(d,w) = p(d) \sum_f p(w|f)\,p(f|d) = 0.001$

Generative Model Representation
• Distributions of p(f|d) and p(w|f) are multinomial – counts in successive trials
• More appropriate than Gaussian
• Note that the Term × Document matrix is a sample from the true distribution $p_t(d, w)$
• $\sum_{ij} D(i,j) \log p(d_j, w_i)$ – cross-entropy between model and realisation – maximise the likelihood that the model $p(d_j, w_i)$ generated the realisation D – subject to conditions on p(f|d) and p(w|f)

Generative Model Representation
• Estimation of p(f|d) and p(w|f) requires use of a standard EM algorithm.
• Expectation Maximisation
  – General iterative method for ML parameter estimation
  – Ideal for ‘missing variable’ problems
• Estimate p(f|d,w) using current estimates of p(w|f) and p(f|d)
• Estimate new values of p(w|f) and p(f|d) using the current estimate of p(f|d,w)
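
The two alternating steps can be sketched as follows. This is a minimal, unoptimised illustration assuming plain EM updates on a dense count matrix; the function name `plsa_em`, the iteration count, the small constant `eps` and the toy counts are all hypothetical. The tracked log-likelihood omits the p(d) term, which does not depend on p(w|f) or p(f|d).

```python
import numpy as np

def plsa_em(D, n_factors, n_iter=50, seed=0):
    """Fit p(w|f) and p(f|d) to a term x document count matrix D by EM.

    Returns (p_w_f, p_f_d, log_likelihoods), with p_w_f of shape (M, F)
    and p_f_d of shape (F, N).
    """
    rng = np.random.default_rng(seed)
    M, N = D.shape
    p_w_f = rng.random((M, n_factors))
    p_w_f /= p_w_f.sum(axis=0, keepdims=True)      # each column sums to 1 over words
    p_f_d = rng.random((n_factors, N))
    p_f_d /= p_f_d.sum(axis=0, keepdims=True)      # each column sums to 1 over factors
    eps = 1e-12                                    # guards against division by zero
    log_likelihoods = []
    for _ in range(n_iter):
        # E-step: p(f|d,w) is proportional to p(w|f) p(f|d); shape (M, F, N)
        joint = p_w_f[:, :, None] * p_f_d[None, :, :]
        p_w_d = joint.sum(axis=1)                  # p(w|d), shape (M, N)
        log_likelihoods.append(np.sum(D * np.log(p_w_d + eps)))
        posterior = joint / (p_w_d[:, None, :] + eps)
        # M-step: re-estimate from the expected counts D(w,d) p(f|d,w)
        expected = D[:, None, :] * posterior
        p_w_f = expected.sum(axis=2)
        p_w_f /= p_w_f.sum(axis=0, keepdims=True) + eps
        p_f_d = expected.sum(axis=0)
        p_f_d /= p_f_d.sum(axis=0, keepdims=True) + eps
    return p_w_f, p_f_d, log_likelihoods

# Hypothetical toy counts: two groups of terms used by two groups of documents.
D = np.array([[3, 2, 0, 0],
              [2, 4, 0, 1],
              [0, 0, 5, 2],
              [0, 1, 3, 4]], dtype=float)
p_w_f, p_f_d, ll = plsa_em(D, n_factors=2)
```

The log-likelihood never decreases from one iteration to the next, but the result depends on the random initialisation, so different seeds can reach different local maxima.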

Generative Model Representation
• Once parameters are estimated
• p(f|d) gives the posterior probability that semantic factor ‘f’ is associated with d
• p(w|f) gives the probability of word ‘w’ being generated from semantic factor ‘f’
• Nice clear interpretation, unlike the U and V terms in SVD
• ‘Sparse’ representation – unlike SVD

Generative Model Representation
• Take the toy collection generated – estimate p(f|d) and p(w|f)
• Graphical representation of p(f|d)

Generative Model Representation
• Ordered representation of p(w|f)

Alternate Basis to the Principal Directions
• Similarity between query and documents can be assessed in ‘factor’ space – as in LSA
• $\mathrm{Sim} = \sum_f p(f|q)\,p(f|D)$ – averaged product of query and doc posterior probabilities over all ‘factors’ – latent space
• Alternately, note that D and q are sample instances from an unknown distribution
• All probabilities – word counts – estimated from D are ‘noisy’
• Employ $p(d_j, w_i)$ as a ‘smoothed’ version of tf and use the ‘cosine’ measure $\sum_i p(D, w_i) \times q_i$ – ‘query expansion’
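
Both matching schemes can be sketched with small hand-picked numbers. All parameter values below are hypothetical, and p(f|q) is assumed to have been estimated already; how it is obtained for a new query is not covered here.

```python
import numpy as np

# Hypothetical fitted parameters for 4 terms, 2 factors, 3 documents.
p_w_f = np.array([[0.5, 0.0],
                  [0.4, 0.1],
                  [0.1, 0.4],
                  [0.0, 0.5]])            # p(w|f), columns sum to 1
p_f_d = np.array([[0.9, 0.5, 0.1],
                  [0.1, 0.5, 0.9]])       # p(f|d), columns sum to 1
p_d = np.array([0.3, 0.3, 0.4])           # p(d)
p_f_q = np.array([0.8, 0.2])              # p(f|q), assumed already estimated

# Matching in factor space: Sim = sum_f p(f|q) p(f|d) for every document.
sim_factor = p_f_q @ p_f_d

# Smoothed term-document matrix p(d_j, w_i) = p(d_j) sum_f p(w_i|f) p(f|d_j).
smoothed = (p_w_f @ p_f_d) * p_d

q = np.array([1.0, 1.0, 0.0, 0.0])        # query term counts
# Cosine measure between the query and each smoothed document column.
sim_cosine = (q @ smoothed) / (np.linalg.norm(q) * np.linalg.norm(smoothed, axis=0))
```

In this toy setting both measures rank the documents in the same order, with the first document, which is dominated by the first factor, matching the query best.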

Alternate Basis to the Principal Directions
• Both forms of matching shown to improve on LSA (MED, CRAN, CACM)
• Elegant, statistically principled approach – can employ (in theory) Bayesian model assessment techniques.
• Likelihood is a nonlinear function of the parameters p(f|d) and p(w|f) – huge parameter space – relatively small number of samples – high bias and variance expected
• Assessment of correlation between likelihood and P/R – yet to be studied in depth

Conclusions
• SVD defined basis provides P/R improvements over term matching
  – Interpretation difficult
  – Optimal dimension – open question
  – Variable performance on LARGE collections
  – Supercomputing muscle required
• Probabilistic approaches provide improvements over SVD
  – Clear interpretation of decomposition
  – Optimal dimension – open question
  – High variability of results due to nonlinear optimisation over HUGE parameter space
  – Improvements marginal in relation to cost

Latent Semantic & Hierarchic Document Clustering
• Had enough? …
• … To the Bar…
