Latent Semantic Indexing, or: How I Learned to Stop Worrying and Love Math I Don't Understand
Adam Carlson, CS590Q W99
Outline
• Discourse Segmentation
• LSI Motivation
• Math - How to do LSI
• Applications
• More Math - Why does it work
• Wacky Ideas
Discourse Segmentation
• Some collections (like the web) have high variance in document length
• Sometimes units like sentences or paragraphs work; sometimes they don't
• Would like to segment documents according to topic
TextTiling
• Break the document into units of fixed length
• Score cohesion between adjacent units
• Look for patterns of low cohesion surrounded by high cohesion
  • Indicates a change of subject
• Found good agreement with human judges
• Possible application for LSI measures of coherence (a sketch of the idea follows)
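A rough sketch of the TextTiling idea, not the original implementation: fixed-size blocks of tokens, a plain bag-of-words cosine as the cohesion score, and a crude depth test for dips. The block size, threshold, and function names below are illustrative assumptions.

    from collections import Counter
    import math

    def cosine(a: Counter, b: Counter) -> float:
        """Cosine similarity between two bag-of-words vectors."""
        dot = sum(a[w] * b[w] for w in set(a) & set(b))
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    def cohesion_dips(tokens, block_size=20, min_depth=0.3):
        """Split tokens into fixed-size blocks, score cohesion between
        neighbouring blocks, and report gaps whose cohesion is much lower
        than the best scores on either side (candidate topic boundaries)."""
        blocks = [Counter(tokens[i:i + block_size])
                  for i in range(0, len(tokens), block_size)]
        scores = [cosine(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]
        boundaries = []
        for i in range(1, len(scores) - 1):
            depth = (max(scores[:i]) - scores[i]) + (max(scores[i + 1:]) - scores[i])
            if depth >= min_depth:
                boundaries.append((i + 1) * block_size)  # token offset of the dip
        return boundaries

Run over a long token stream, the low-cohesion dips are the candidate subject changes that TextTiling checks against human judges.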
Using Co-occurrence Information
• Major problems with word-matching
  • Synonymy (one meaning, many words)
  • Polysemy (one word, many meanings)
• Solutions
  • Concept search
  • Query expansion
  • Clustering
  • Latent Semantic Indexing (almost all of the above)
Latent Semantic Indexing is ...
• Latent
  • Captures underlying structure of the corpus
• Semantic
  • Groups words by "conceptual" similarity
• Cool
  • Lots of neat applications
• Not a Silver Bullet
  • Not really semantic; just MDS, and expensive
What is LSI?
• Restructures the vector space so that co-occurrences are mapped together
• Captures transitive co-occurrence relations
• An application of dimensionality reduction to the term-document matrix
  • Throw out da noise, bring in da regularities
• A form of clustering
Document vector space (figure)
Semantic Space (figure)
Singular Value Decomposition
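For reference, the decomposition the following slides rely on, written in the notation the later slides use (U for term vectors, V for document vectors, D for the diagonal matrix of singular values):

\[
A = U\,D\,V^{\top}, \qquad U^{\top}U = V^{\top}V = I, \qquad
D = \mathrm{diag}(\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0).
\]

Keeping only the k largest singular values gives the rank-k approximation used throughout:

\[
\hat{A} = U_k\,D_k\,V_k^{\top}.
\]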
Term-Document Matrix Approximation
Properties of Â
• Best least-squares approximation of A given only k dimensions
• Terms and documents which were similar in A are more similar in Â
• This measure of similarity is transitive
• So what can we do with this?
LSI Tricks and Tips
• Use Â to answer queries, with the standard cosine measure
• Use the rows of Uk·Dk for term similarity
• Use the columns of Dk·VkT for document similarity
(Here Dk is the diagonal matrix of the k largest singular values; a sketch of these steps follows.)
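A minimal numpy sketch of these steps on a toy term-document matrix. The toy data and variable names are illustrative; the query fold-in as q·Uk·Dk⁻¹ is the usual trick from the LSI literature rather than something stated on the slide.

    import numpy as np

    # Toy term-document matrix A: rows = terms, columns = documents.
    A = np.array([[1, 0, 1, 0],
                  [1, 1, 0, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 1]], dtype=float)

    k = 2
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, Dk, Vkt = U[:, :k], np.diag(s[:k]), Vt[:k, :]

    A_hat = Uk @ Dk @ Vkt          # rank-k approximation of A
    term_coords = Uk @ Dk          # rows: term positions in LSI space
    doc_coords = (Dk @ Vkt).T      # rows: document positions in LSI space

    # Fold a query (bag-of-words over the same terms) into LSI space
    # and rank documents by cosine similarity.
    q = np.array([1, 0, 0, 1], dtype=float)
    q_k = q @ Uk @ np.linalg.inv(Dk)
    cos = doc_coords @ q_k / (np.linalg.norm(doc_coords, axis=1) * np.linalg.norm(q_k))
    ranking = np.argsort(-cos)     # documents, best match first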
Applications
• Information Retrieval
  • Improve retrieval
  • Cross-language retrieval
  • Document routing/filtering
  • Measuring text coherence
• Cognitive Science
  • Learning synonyms
  • Subject matter knowledge
  • Word sorting behavior
  • Lexical priming
• Education
  • Essay grading
  • Text selection
Standard Vector Space Retrieval in LSI Space
• Improves recall at the expense of precision
• Compared to term-document vector space, SMART, and Voorhees [Deerwester et al. 1990]
  • LSI did best on the MED dataset
  • SMART did best on the CISI dataset
    • but LSI was comparable to SMART when stemming was added
Cross Language Retrieval
• Train on multilingual corpora using "combined" documents
• Add in single-language documents
• Query in LSI space
• [Landauer & Littman 1990] French & English
• [Landauer, Littman & Stornetta 1992] Japanese & English
• [Young 1994] Greek & English
• [Dumais, Landauer & Littman 1996] Comparisons between LSI, no-LSI and Machine Translation
Document Routing/Filtering
• Match reviewers with papers to be reviewed, based on the reviewers' publications [Dumais & Nielsen 1992]
• Select papers for researchers to read, based on other papers they liked [Foltz & Dumais 1992]
LSI goes to college
• Train LSI on encyclopedia articles
  • Test against the TOEFL synonym test
  • Results comparable to (non-native) college applicants [Landauer & Dumais 1996]
• Train on introductory psychology texts
  • Receives a passing grade on multiple-choice questions (but did worse than students) [Landauer, Foltz & Laham 1998]
Essay Grading
• Several techniques
  • Use the essay (or sentences from the essay) as a query into a textbook or a database of graded essays
  • Grade based on the cosine from the text or from the closest graded essay
• More consistent than expert human graders
  • Is that good?
• [Landauer, Laham & Foltz 1998]
Routing Meets Education
• Run LSA on a bunch of texts at different levels of sophistication
• Have the student write a short essay about the topic
• Use the essay as a query to select the most appropriate text for the student
• [Wolfe, Schreiner, Rehder, Laham, Foltz, Kintsch and Landauer 1998]
Measuring Text Coherence
• Use LSI to compute the cosine of each sentence with the following one [Foltz, Kintsch & Landauer 1998]
• Correlates highly with established methods
• Can indicate where coherence breaks down
• Can be used to measure how semantic content changes across a text (discourse segmentation?)
(A small sketch of the measure follows.)
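A small sketch of the coherence measure, assuming the term coordinates Uk·Dk from the earlier sketch (term_coords) and a term_index mapping each term to its row; both names are illustrative. Each sentence is represented by the sum of its term vectors, and the profile is the cosine between consecutive sentences.

    import numpy as np

    def sentence_vector(sentence_terms, term_index, term_coords):
        """Sum the LSI coordinates of a sentence's terms
        (terms outside the vocabulary are ignored)."""
        rows = [term_coords[term_index[t]] for t in sentence_terms if t in term_index]
        return np.sum(rows, axis=0) if rows else np.zeros(term_coords.shape[1])

    def coherence_profile(sentences, term_index, term_coords):
        """Cosine between each sentence vector and the next;
        low points suggest a break in coherence or a topic shift."""
        vecs = [sentence_vector(s, term_index, term_coords) for s in sentences]
        out = []
        for a, b in zip(vecs, vecs[1:]):
            denom = np.linalg.norm(a) * np.linalg.norm(b)
            out.append(float(a @ b / denom) if denom else 0.0)
        return out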
Outline
• Discourse Segmentation
• LSI Motivation
• Math - How to do LSI
• Applications
• More Math - Why does it work
• Wacky Ideas
Least Squares Approximation: Why does it work? (1st Attempt)
• Â is the best least-squares approximation to A using just k dimensions
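In symbols, this is the Eckart-Young property of the truncated SVD, using the notation from the SVD slide:

\[
\hat{A} = U_k\,D_k\,V_k^{\top}
        = \arg\min_{\mathrm{rank}(B)\le k}\;\lVert A - B \rVert_F .
\]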
Least Squares cont.
• Why does this work?
• Are these the regularities we want to capture?
• Why approximate at all? (hint: overfitting)
• Not very convincing
Neural Network Explanation: Why does it work? (2nd Attempt)
• Consider a fully connected 3-layer network
  • First layer is terms
  • Middle layer has k units
  • Last layer is documents
• Weights on the hidden layer will adjust to group terms that appear in similar documents, and documents containing similar terms
• This is analogous to the SVD matrices
Spectral Analysis: Why does it work? (3rd Attempt)
• Kleinberg's "Authoritative Sources"
  • A link provides evidence of authority
  • Authoritative sources are pointed to by hubs
  • Hubs point to authoritative sources
• Give every page some "weight"
  • Move weight back and forth across links
  • Stabilizes with authorities and hubs
• Equivalent to spectral analysis - eigenstuff
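A compact sketch of that weight-passing iteration (normalization and stopping rule are simplifications). The converged hub and authority weights are the leading left and right singular vectors of the link matrix, which is the "eigenstuff" connection the slide alludes to.

    import numpy as np

    def hits(adj, iters=50):
        """Kleinberg-style hub/authority iteration on an adjacency matrix
        (adj[i, j] = 1 if page i links to page j)."""
        n = adj.shape[0]
        hubs = np.ones(n)
        auths = np.ones(n)
        for _ in range(iters):
            auths = adj.T @ hubs          # authority = hub weight flowing in
            auths /= np.linalg.norm(auths)
            hubs = adj @ auths            # hub = authority weight pointed to
            hubs /= np.linalg.norm(hubs)
        return hubs, auths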
Spectral Analysis cont.
• Co-occurrence instead of authority
  • "Links" are documents containing the same word
  • Similar documents have many similar words
  • Similar words occur in similar documents
• Turn the Kleinberg crank and get:
  • Authoritative sources = similar documents
  • Hubs = words that occur in similar documents
• Doesn't exactly fit (asymmetric)
More Eigenexplanation: Why does it work? (4th Attempt)
• The rank of a matrix is a measure of how much information it contains
• Rows which are linear combinations of other rows can be removed
• In this case, some singular values will be 0
Eigenvalues cont.
• Consider vectors of terms X, Y and Z
  • X = [1 1 0 0 1 0 ...]
  • Y = [0 0 1 1 0 0 ...]
  • Z = [1 1 2 2 0 1 ...]
• Z ≈ X + 2Y
• Some singular value of A is low
• By forcing that singular value to 0, we merge X, Y and Z (verified in the snippet below)
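A tiny numpy check of the claim, using the slide's vectors: because Z is almost X + 2Y, the third singular value is small, and zeroing it leaves a rank-2 matrix in which the reconstructed Z is (nearly) exactly a combination of the reconstructed X and Y.

    import numpy as np

    X = np.array([1, 1, 0, 0, 1, 0], dtype=float)
    Y = np.array([0, 0, 1, 1, 0, 0], dtype=float)
    Z = np.array([1, 1, 2, 2, 0, 1], dtype=float)
    A = np.vstack([X, Y, Z])

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    print(s)                       # third singular value is much smaller

    s[-1] = 0.0                    # force the smallest singular value to 0
    A2 = U @ np.diag(s) @ Vt
    print(np.round(A2, 2))         # rank-2 "merged" version of X, Y, Z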
LSI Theory
• Under certain assumptions
  • The corpus has k topics
  • Each topic has n > l unique terms
  • Documents can cover multiple topics
  • 95% of the content words in a document are on-topic
• LSI is guaranteed to separate documents into the proper topics
• Speedup with random projection
• [Papadimitriou, Raghavan, Tamaki & Vempala 1998]
Related Techniques
• PCA / Factor analysis / Multi-dimensional scaling
• Neural nets
• Kohonen Maps
Dimensionality Reduction
• Dimensionality reduction takes high-dimensional data and re-expresses it in a lower dimension
• PCA
  • If you were only allowed one line to represent all the data, what would it be?
  • The one that explains the greatest variance
  • Recur on the remaining variance (a sketch follows)
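A short sketch of PCA via the SVD of the centered data matrix (equivalent to diagonalizing the covariance matrix); the function name and interface are illustrative, not from the slides.

    import numpy as np

    def pca(X, k):
        """Project rows of X onto the k directions of greatest variance.
        Component i captures the most variance left over by components 1..i-1,
        which is the "recur" step on the slide."""
        Xc = X - X.mean(axis=0)                 # center the data
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        components = Vt[:k]                     # top-k variance directions
        explained = (s ** 2) / (len(X) - 1)     # variance along each direction
        return Xc @ components.T, components, explained[:k]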
PCA cont. (figure)
Wacky ideas
• Hierarchical concept clustering
• Measure spatial deviations
• Communication barriers
• Language drift
• Statistical/Symbolic hybrids
Hierarchical Concept Clustering
• LSI doesn't handle polysemy well
• Find subspaces which separate polysemous words into different clusters
• Hopefully those subspaces correspond to topics
• Lather, rinse, repeat
Finding Communication Barriers
• Want to find terms which have different meanings in different corpora
• Judge words by the company they keep
• Look for words which sit in cohesive clusters in both corpora, but where the other terms in those clusters differ
Communication Barriers cont.
• Tried with pro-choice/pro-life corpora
• Poor results
  • Didn't use cohesive clusters
  • Not enough data
  • Highly variable data
• Possible fix: start with a baseline corpus and measure drift as other corpora are merged in
Tracking Language Drift
• Follow changes in clusters as a corpus grows
• Hierarchical Agglomerative Clustering may have discontinuities
• Use these to mark significant changes
Hybrid Approach
• Merge statistical analysis (LSI) with symbolic analysis (MindNet)
  • Use the LSI term-similarity metric to assign strengths to MindNet relations
• Incorporate syntactic information
  • Preprocess documents, adding POS or attachment information to words
  • Time-N Flies-V Like-AVP An-Det Arrow-N