
Latent Semantic Indexing and Beyond


Presentation Transcript


  1. Latent Semantic Indexing and Beyond Leif Grönqvist (lgr@msi.vxu.se) School of Mathematics and Systems Engineering The Swedish Graduate School of Language Technology NoDaLiDa 2003: Leif Grönqvist

  2. What is Latent Semantic Indexing? • LSI uses a kind of vector model • The classical IR vector model groups documents that have many terms in common • But: • Documents can have very similar content while using different vocabularies • The terms used in a document may not be the most representative ones • LSI uses the distribution of all terms in all documents when comparing two documents! NoDaLiDa 2003: Leif Grönqvist

  3. A traditional vector model for IR • The starting point is a term-document matrix, both for the traditional vector model and for LSI • We can calculate similarities between terms or between documents using the cosine (see the sketch below) • We can also (trivially) find relevant terms for a document • Problems: • The term “trees” seems relevant to the m-documents, but is not present in m4 • cos(c1,c5)=0, just as cos(c1,m3)=0 NoDaLiDa 2003: Leif Grönqvist
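A minimal sketch of the cosine computations on a term-document matrix. The small matrix and the term labels here are made-up placeholders rather than the toy example from the slides; only the mechanics (documents as columns, terms as rows, cosine between vectors) are meant to carry over.

```python
import numpy as np

# Made-up term-document counts: rows are terms, columns are documents.
X = np.array([
    [1, 0, 1, 0],   # "human"
    [1, 1, 0, 0],   # "interface"
    [0, 1, 1, 1],   # "system"
    [0, 0, 0, 1],   # "trees"
], dtype=float)

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(X[:, 0], X[:, 2]))   # document-document similarity (columns)
print(cosine(X[0, :], X[1, :]))   # term-term similarity (rows)
print(cosine(X[:, 0], X[:, 3]))   # 0.0 -- no terms in common, the problem case on the slide
```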

  4. A toy example NoDaLiDa 2003: Leif Grönqvist

  5. How does LSI work? • The idea is to try to use latent information like: • word1 and word2 are often found together, so maybe doc1 (containing word1) and doc2 (containing word2) are related? • doc3 and doc4 have many words in common, so maybe the words they don’t have in common are related? NoDaLiDa 2003: Leif Grönqvist

  6. How does LSI work? cont’d • In the classical vector model, a document vector (from our toy example) is 12-dimensional and the term vectors are 9-dimensional • What we want to do is to project these vectors into a vector space of lower dimensionality • One way is to use Singular Value Decomposition (SVD) • We decompose the original matrix into three new matrices NoDaLiDa 2003: Leif Grönqvist

  7. What SVD gives us: X = T0 S0 D0', where X is the t × d term-document matrix, T0 holds the term vectors, D0 holds the document vectors, and S0 is a diagonal matrix with the singular values NoDaLiDa 2003: Leif Grönqvist

  8. Using the SVD • The matrices make it easy to project term and document vectors into an m-dimensional space (m ≤ min(terms, docs)) using ordinary linear algebra • We can select m simply by using as many rows/columns of T0, S0 and D0 as we want • To get an idea, let’s use m=2 and recalculate a new (approximated) X – it will still be a t × d matrix (a numpy sketch of this truncation follows) NoDaLiDa 2003: Leif Grönqvist
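A minimal numpy sketch of that truncation step. The matrix below is random placeholder data with the toy example's shape (12 terms × 9 documents); only the shapes and the rank-m reconstruction are meant to match the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder counts with the toy example's shape: 12 terms x 9 documents.
X = rng.integers(0, 3, size=(12, 9)).astype(float)

# Full decomposition: X = T0 @ np.diag(S0) @ D0t
T0, S0, D0t = np.linalg.svd(X, full_matrices=False)

m = 2                      # keep only the m largest singular values
T = T0[:, :m]              # reduced term vectors
S = np.diag(S0[:m])
D = D0t[:m, :].T           # reduced document vectors

X_hat = T @ S @ D.T        # rank-m approximation -- still a t x d matrix
print(X_hat.shape)         # (12, 9)
```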

  9. We can recalculate X with m=2

             C1    C2    C3    C4    C5    M1    M2    M3    M4
  Human     .16   .40   .38   .47   .18  -.05  -.12  -.16  -.09
  Interface .14   .37   .33   .40   .16  -.03  -.07  -.10  -.04
  Computer  .15   .51   .36   .41   .24   .02   .06   .09   .12
  User      .26   .84   .61   .70   .39   .03   .08   .12   .19
  System    .45  1.23  1.05  1.27   .56  -.07  -.15  -.21  -.05
  Response  .16   .58   .38   .42   .28   .06   .13   .19   .22
  Time      .16   .58   .38   .42   .28   .06   .13   .19   .22
  EPS       .22   .55   .51   .63   .24  -.07  -.14  -.20  -.11
  Survey    .10   .53   .23   .21   .27   .14   .44   .44   .42
  Trees    -.06   .23  -.14  -.27   .14   .24   .77   .77   .66
  Graph    -.06   .34  -.15  -.30   .20   .31   .98   .98   .85
  Minors   -.04   .25  -.10  -.21   .15   .22   .71   .71   .62

  NoDaLiDa 2003: Leif Grönqvist

  10. What does the SVD give? • Susan Dumais 1995: “The SVD program takes the ltc transformed term-document matrix as input, and calculates the best "reduced-dimension" approximation to this matrix.” • Michael W Berry 1992: “This important result indicates that Ak is the best k-rank approximation (in a least squares sense) to the matrix A.” • Leif 2003: What Berry says is that SVD gives the best projection from n to k dimensions, that is, the projection that keeps distances in the best possible way. NoDaLiDa 2003: Leif Grönqvist

  11. Algorithms for dimensionality reduction • Singular Value Decomposition (SVD) • A mathematically complicated way (based on eigenvalues) to find an optimal vector space with a specific number of dimensions • Computationally heavy – maybe 20 hours for a one-million-document newspaper corpus • Often uses the entire document as context • Random Indexing (RI) • Selects some dimensions randomly • Not as heavy to compute, but it is less clear (to me) why it works • Uses a small context, typically 1+1 to 5+5 words (a rough sketch follows below) • Neural nets, Hyperspace Analogue to Language, etc. NoDaLiDa 2003: Leif Grönqvist
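A minimal sketch of the Random Indexing idea, assuming the usual setup of sparse ternary index vectors summed over a small sliding window. The dimensionality, number of non-zero elements and window size below are arbitrary illustrative choices, not values from the talk.

```python
import numpy as np

def random_indexing(tokens, dim=300, nonzeros=6, window=2, seed=0):
    """Each word gets a sparse random 'index vector'; a word's context
    vector is the sum of the index vectors of the words that occur
    within +/- `window` positions of it."""
    rng = np.random.default_rng(seed)
    index, context = {}, {}

    def index_vector(word):
        if word not in index:
            v = np.zeros(dim)
            pos = rng.choice(dim, size=nonzeros, replace=False)
            v[pos] = rng.choice([-1.0, 1.0], size=nonzeros)
            index[word] = v
        return index[word]

    for i, w in enumerate(tokens):
        ctx = context.setdefault(w, np.zeros(dim))
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                ctx += index_vector(tokens[j])
    return context

vectors = random_indexing("the cat sat on the mat".split())
```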

  12. Some applications • Automatic generation of a domain specific thesaurus • Keyword extraction from documents • Find sets of similar documents in a collection • Find documents related to a given document or a set of terms NoDaLiDa 2003: Leif Grönqvist

  13. Problems and questions • How can we interpret the similarities as different kinds of relations? • How can we include document structure and phrases in the model? • Terms are not really terms, but just words • Ambiguous terms pollute the vector space • How could we find the optimal number of dimensions for the vector space? NoDaLiDa 2003: Leif Grönqvist

  14. stefan edberg: edberg 0.918, cincinnatis 0.887, edbergs 0.883, världsfemman 0.883, stefans 0.883, tennisspelarna 0.863, stefan 0.861, turneringsseger 0.859, queensturneringen 0.858, växjöspelaren 0.852, grästurnering 0.847
  bengt johansson: johansson 0.852, johanssons 0.704, bengt 0.678, centerledare 0.674, miljöcentern 0.667, landsbygdscentern 0.667, implikationer 0.645, ickesocialistisk 0.643, centerledaren 0.627, regeringsalternativet 0.620, vagare 0.616
  An example based on 50 000 newspaper articles
  NoDaLiDa 2003: Leif Grönqvist

  15. bengt 1.000, westerberg 0.912, folkpartiledaren 0.899, westerbergs 0.893, fpledaren 0.864, socialminister 0.862, försvarsfrågorna 0.860, socialministern 0.841, måndagsresor 0.840, bulldozer 0.838, skattesubventionerade 0.833, barnomsorgsgaranti 0.829
  johansson 1.000, johanssons 0.800, olof 0.684, centerledaren 0.673, valperiod 0.668, centerledarens 0.654, betongpolitiken 0.650, downhill 0.640, centerfamiljen 0.635, centerinflytande 0.634, brokrisen 0.632, gödslet 0.628
  Bengt Johansson is just Bengt + Johansson – something is missing!
  NoDaLiDa 2003: Leif Grönqvist

  16. A small experiment • I want the model to know the difference between the name Bengt Johansson and the separate words Bengt + Johansson • Make a frequency list of all n-tuples up to n=5 with a frequency > 1 • Keep all words in the bags, but add the tuples, with spaces replaced by '-', as words (a sketch of this step follows below) • Run the LSI again • Now bengt-johansson is a term of its own, and it is NOT just Bengt + Johansson • The number of terms grows a lot! NoDaLiDa 2003: Leif Grönqvist
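A minimal sketch of that tuple step as I read it: count all n-tuples (2 ≤ n ≤ 5, since single words are already in the bags) over the tokenised documents, keep those seen more than once, and add them as extra hyphenated "words" while keeping the original words. The function name and the tiny example documents are mine, not from the talk.

```python
from collections import Counter

def add_tuples(doc_tokens_list, max_n=5):
    """Count n-tuples over all documents, keep those with frequency > 1,
    and append them -- with spaces replaced by '-' -- to every bag
    they occur in, on top of the original words."""
    counts = Counter()
    for tokens in doc_tokens_list:
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    keep = {t for t, c in counts.items() if c > 1}

    augmented = []
    for tokens in doc_tokens_list:
        extra = []
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                t = tuple(tokens[i:i + n])
                if t in keep:
                    extra.append("-".join(t))
        augmented.append(tokens + extra)   # the original words are kept
    return augmented

docs = [["bengt", "johansson", "coach"], ["bengt", "johansson", "handboll"]]
print(add_tuples(docs)[0])   # original words plus 'bengt-johansson'
```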

  17. bengt-johansson 1.000, dubbellandskamperna 0.954, pettersson-sävehof 0.952, kristina-jönsson 0.950, fanns-svenska-glädjeämnen 0.945, johan-pettersson-sävehof 0.942, martinsson-karlskrona 0.938, förbundskaptenen-bengt-bengan-johansson 0.932, förbundskaptenen-bengt-bengan 0.932, sjumålsskytt 0.931, svenska-damhandbollslandslaget 0.928, stankiewicz 0.926, em-par 0.925, västeråslaget 0.923, jan-stankiewicz 0.923, handbollslandslag 0.922, bengt-johansson-tt 0.921, st-petersburg-sverige 0.921, petersburg-sverige 0.921, sjuklistan 0.920, olsson-givetvis 0.920, emtruppen 0.919, …
  johansson 0.567, bengt 0.354, olof 0.181, centerledaren 0.146, westerberg 0.061, folkpartiledaren 0.052
  And the top list for Bengt-Johansson
  NoDaLiDa 2003: Leif Grönqvist

  18. The new vector space model • It is clear that it is now possible to find terms closely related to Bengt Johansson – the handball coach • But is the model better for single words and for document comparison as well? What do you think? • More “words” than before – hopefully it improves the result just as more data does • At least no reason for a worse result... Or? NoDaLiDa 2003: Leif Grönqvist

  19. An example document REGERINGSKRIS ELLER INTE PARTILEDARNA I SISTAMINUTEN ÖVERLÄGGNINGAR OM BRON Under onsdagskvällen satt partiledarna i regeringen i sista minutenöverläggningar om Öresundsbron Centerledaren Olof Johansson var den förste som lämnade överläggningarna På torsdagen ska regeringen ge ett besked Det måste dock enligt statsminister Carl Bildt inte innebära ett ja eller ett nej till bron … NoDaLiDa 2003: Leif Grönqvist

  20. Closest terms to the example document in each model
  Tuple model: 0.986 underkänner, 0.982 irhammar, 0.977 partiledarna, 0.970 godkände, 0.962 delade-meningar, 0.960 regeringssammanträde, 0.957 riksdagsledamot, 0.957 bengt-westerberg, 0.954 materialet, 0.952 diskuterade, 0.950 folkpartiledaren, 0.949 medierna, 0.947 motsättningarna, 0.946 vilar, 0.944 socialminister-bengt-westerberg
  Single-word model: 0.967 partiledarna, 0.921 miljökrav, 0.921 underkänner, 0.918 tolkar, 0.897 meningar, 0.888 centerledaren, 0.886 regeringssammanträde, 0.880 slottet, 0.880 rosenbad, 0.877 planminister, 0.866 folkpartiledaren, 0.855 thurdin, 0.845 brokonsortiet, 0.839 görel, 0.826 irhammar
  NoDaLiDa 2003: Leif Grönqvist

  21. Closest document in both models BILDT LOVAR BESKED OCH REGERINGSKRIS HOTAR Det blir ett besked under torsdagen men det måste inte innebära ett ja eller nej från regeringen till Öresundsbroprojektet Detta löfte framförde statsminister Carl Bildt under onsdagen i ett antal varianter Samtidigt skärptes tonen mellan honom och miljöminister Olof Johansson och stämningen tydde på annalkande regeringskris De båda har under den långa broprocessen undvikit att uttala sig kritiskt om varandra och därmed trappa upp motsättningarna Men nu menar Bildt att centern lämnar sned information utåt Johansson och planminister Görel Thurdin anser å andra sidan att regeringen bara kan säga nej till bron om man tar riktig hänsyn till underlaget för miljöprövningen … NoDaLiDa 2003: Leif Grönqvist

  22. NoDaLiDa 2003: Leif Grönqvist

  23. Documents with better ranking in the tuple model
  2602  .848 4  .492 12  BRON KAN BLI VALFRÅGA SÄGER JOHANSSON Om det lutar åt ett ja i regeringen av politiska skäl då är naturligtvis den här frågan en viktig valfråga …
  2367  .804 10  .434 19  INTE EN KRITISK RÖST BLAND CENTERPARTISTERNA TILL BROBESKEDET En etappseger för miljön och centern En eloge till Olof Johansson Görel Thurdin och Carl Bildt …
  NoDaLiDa 2003: Leif Grönqvist

  24. Documents with better ranking in the phrase model
  1567  .456 73  .601 5  ALF SVENSSON TOPPNAMN I STOCKHOLM Kds-ledaren Alf Svensson toppar kds riksdagslista för Stockholms stad och Michael Stjernström sakkunnig i statsrådsberedningen har en valbar andra plats …
  1371  .456 74  .601 6  BENGT WESTERBERG BARNPORREN MÅSTE STOPPAS Folkpartiledaren Bengt Westerberg lovade på onsdagen att regeringen ska göra allt för att stoppa barnporren …
  NoDaLiDa 2003: Leif Grönqvist

  25. Hmm, adding n-grams was maybe too simple... • If the bad result is due to overtraining, it could help to remove the words I build phrases from… • Another way to try is to use a dependency parser to find more meaningful phrases, not just n-grams • A new test following the first idea above (a rough sketch follows below): NoDaLiDa 2003: Leif Grönqvist
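A minimal sketch of that first idea, under the assumption that it means: where a kept tuple matches a span of the text, emit only the hyphenated tuple term and drop the words it is built from. Here `keep` is the set of retained tuples from the earlier sketch, and the greedy longest-match strategy is my own simplification.

```python
def replace_with_tuples(tokens, keep, max_n=5):
    """Replace matched spans by their hyphenated tuple term instead of
    keeping both the tuple and the words inside it."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(max_n, 1, -1):           # prefer the longest match
            t = tuple(tokens[i:i + n])
            if len(t) == n and t in keep:
                out.append("-".join(t))
                i += n
                break
        else:                                   # no tuple starts here
            out.append(tokens[i])
            i += 1
    return out

tokens = ["förbundskaptenen", "bengt", "johansson", "ledde", "laget"]
print(replace_with_tuples(tokens, {("bengt", "johansson")}))
# ['förbundskaptenen', 'bengt-johansson', 'ledde', 'laget']
```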

  26. bengt-johansson 1.000, tomas-svensson 0.931, sveriges-handbollslandslag 0.912, förbundskapten-bengt-johansson 0.898, handboll 0.897, svensk-handboll 0.896, handbollsem 0.894, carlen 0.883, lagkaptenen-carlen 0.869, förbundskapten-johansson 0.863, ola-lindgren 0.863, bengan-johansson 0.862, erik-hajas 0.854, mats-olsson 0.854, carlen-magnus-wislander 0.852, handbollens 0.851, magnus-andersson 0.851, halvlek-svenskarna 0.849, teka-santander 0.849, storskyttarna 0.849, förbundskaptenen-bengt-johansson 0.845, målvakten-mats-olsson 0.845, danmark-tvåa 0.843, handbollsspelare 0.839, sveriges-handbollsherrar 0.836, lag-ibland 0.835
  Ok, the words inside tuples are now removed
  NoDaLiDa 2003: Leif Grönqvist

  27. bengt-johansson 1.000, förbundskapten-bengt-johansson 0.907, förbundskaptenen-bengt-johansson 0.835, jonas-johansson 0.816, förbundskapten-johansson 0.799, johanssons 0.795, svenske-förbundskaptenen-bengt-johansson 0.792, bengan 0.786, carlen 0.777, bengan-johansson 0.767, johansson-andreas-dackell 0.765, förlorat-matcherna 0.750, ck-bure 0.748, daniel-johansson 0.748, målvakten-mats-olsson 0.747, jörgen-jönsson-mikael-johansson 0.744, kicki-johansson 0.744, mattias-johansson-aik 0.741, thomas-johansson 0.739, handbollsnation 0.738, mikael-johansson 0.737, förbundskaptenen-bengt-johansson-valde 0.736, johansson-mats-olsson 0.736, sveriges-handbollslandslag 0.736, ställningen-33-matcher 0.736
  And now pseudo documents are added for each tuple
  NoDaLiDa 2003: Leif Grönqvist
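The slide does not spell out how the pseudo documents are built; one possible reading, sketched below, is that every kept tuple gets one extra bag holding the hyphenated tuple term together with the words it is built from, so the tuple term and its parts stay connected in the space. The function name and the example are hypothetical.

```python
def tuple_pseudo_docs(keep):
    """For every kept tuple, build one small artificial document
    containing the hyphenated tuple term and its component words."""
    return [["-".join(t)] + list(t) for t in keep]

pseudo = tuple_pseudo_docs({("bengt", "johansson")})
print(pseudo)   # [['bengt-johansson', 'bengt', 'johansson']]
```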

  28. What I still have to do something about • Find a better LSI/SVD package than the one I have (old C code from 1990), or maybe write it myself... • Get the phrases into the model in some way When these things are done I could: • Try to interpret various relations from similarities in a vector space model • Try to solve the “optimal number of dimensions” problem • Explore what the length of the vectors means NoDaLiDa 2003: Leif Grönqvist
