
EE3J2 Data Mining Lecture 8 Latent Semantic Indexing Martin Russell



Presentation Transcript


  1. EE3J2 Data Mining, Lecture 8: Latent Semantic Indexing
  Martin Russell

  2. Objectives
  • To explain the vector representation of documents
  • To understand cosine distance
  • To understand, intuitively, Latent Semantic Indexing (LSI)
  • To understand, intuitively, how to do Latent Semantic Analysis (LSA)

  3. Vector Notation for Documents
  • Suppose that we have a set of documents D = {d1, d2, …, dN}; think of this as the corpus for IR
  • Suppose that the number of different words in the whole corpus is V
  • Now suppose a document d in D contains M different terms: {t_{i(1)}, t_{i(2)}, …, t_{i(M)}}
  • Finally, suppose term t_{i(m)} occurs n_{i(m)} times

  4. Vector Notation
  • The document d can be represented exactly as the V-dimensional vector vec(d), whose i(m)-th component is n_{i(m)} for m = 1, …, M, with zeros everywhere else
  • Notice that n_{i(m)} is just the term frequency f_{i(m),d} from text IR

  5. Example
  • d1 = the cat sat on the mat
  • d2 = the dog chased the cat
  • d3 = the mouse stayed at home
  • Vocabulary: the, cat, sat, on, mat, dog, chased, mouse, stayed, at, home
  • d1 = (2,1,1,1,1,0,0,0,0,0,0)
  • d2 = (2,1,0,0,0,1,1,0,0,0,0)
  • d3 = (1,0,0,0,0,0,0,1,1,1,1)
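
As a sanity check on this example, here is a minimal Python sketch (not part of the original slides; the helper name doc_vector is my own) that builds these count vectors from the vocabulary above:

    # Build the V-dimensional count vector vec(d) for each example document.
    vocab = ["the", "cat", "sat", "on", "mat", "dog",
             "chased", "mouse", "stayed", "at", "home"]
    docs = {
        "d1": "the cat sat on the mat",
        "d2": "the dog chased the cat",
        "d3": "the mouse stayed at home",
    }

    def doc_vector(text, vocab):
        """Count how often each vocabulary term occurs in the document."""
        words = text.split()
        return [words.count(term) for term in vocab]

    for name, text in docs.items():
        print(name, doc_vector(text, vocab))
    # d1 [2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    # d2 [2, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0]
    # d3 [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]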

  6. Document length revisited
  • Recall that the length of a vector v = (v_1, …, v_V) is given by:
  |v| = sqrt(v_1^2 + v_2^2 + … + v_V^2)

  7. Document length
  • In the case of a 'document vector', only the M components n_{i(1)}, …, n_{i(M)} are non-zero, so:
  |vec(d)| = sqrt(n_{i(1)}^2 + n_{i(2)}^2 + … + n_{i(M)}^2)
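
A small illustrative Python sketch of this length computation (mine, not the slides'):

    import math

    def length(vec):
        """Euclidean length: square root of the sum of squared components."""
        return math.sqrt(sum(x * x for x in vec))

    d1 = [2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # vec(d1) from the example
    print(length(d1))  # sqrt(4 + 1 + 1 + 1 + 1) = sqrt(8) ≈ 2.83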

  8. Document Similarity
  • Suppose d is a document and q is a query
  • If d and q contain the same words in the same proportions, then vec(d) and vec(q) will point in the same direction
  • If d and q contain different words, then vec(d) and vec(q) will point in different directions
  • Intuitively, the greater the angle between vec(d) and vec(q), the less similar the document d is to the query q

  9. Cosine similarity
  • Define the Cosine Similarity between document d and query q by: CSim(q,d) = cos(θ), where θ is the angle between vec(q) and vec(d)
  • Similarly, define the Cosine Similarity between documents d1 and d2 by: CSim(d1,d2) = cos(θ), where θ is the angle between vec(d1) and vec(d2)

  10. Cosine Similarity & Similarity
  [Figure: two vectors u and v separated by angle θ]
  • Let u = (x_1, y_1) and v = (x_2, y_2) be vectors in 2 dimensions, and let θ be the angle between them. Then:
  cos(θ) = (u · v) / (|u| |v|) = (x_1 x_2 + y_1 y_2) / (|u| |v|)
  • In fact, this result holds for vectors in any N-dimensional space

  11. Cosine Similarity & Similarity
  • Hence, if q is a query, d is a document, and θ is the angle between vec(q) and vec(d), then:
  CSim(q,d) = cos(θ) = (vec(q) · vec(d)) / (|vec(q)| |vec(d)|)
  • So cosine similarity 'is related to' the similarity measure from text IR, normalised by the lengths of vec(q) and vec(d)
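
Putting the two previous slides together, a minimal Python sketch of cosine similarity on the example vectors (the function name cosine_similarity is my own):

    import math

    def cosine_similarity(u, v):
        """CSim(u, v) = (u . v) / (|u| |v|): cosine of the angle between u and v."""
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    d1 = [2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    d2 = [2, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0]
    d3 = [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
    print(cosine_similarity(d1, d2))  # ≈ 0.67: d1 and d2 share 'the' and 'cat'
    print(cosine_similarity(d1, d3))  # ≈ 0.32: only 'the' in common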

  12. Latent Semantic Analysis (LSA)
  • Suppose we have a real corpus with a large number of documents
  • For each document d, the dimension of the vector vec(d) will typically be several (tens of) thousands
  • Let's focus on just 2 of these dimensions, corresponding, say, to the words 'sea' and 'beach'
  • Intuitively, when a document d includes 'sea' it will often also include 'beach'

  13. LSA continued
  • Equivalently, if vec(d) has a non-zero entry in the 'sea' component, it will often have a non-zero entry in the 'beach' component
  [Figure: document vectors plotted in the 'sea'-'beach' plane; documents "about the seaside" cluster along a common direction at angle θ]

  14. Latent Semantic Classes
  • If we can detect this type of structure, then we can discover relationships between words automatically, from data
  • In the example we have found an equivalence set of terms, including 'beach' and 'sea', which is "about the seaside"
  [Figure: the separate 'sea' and 'beach' axes replaced by a single direction at angle θ representing documents "about the seaside"]

  15. Finding Latent Semantic Classes
  • LSA involves some advanced linear algebra - the description here is just an outline
  • First construct the 'word-document' matrix A
  • Then decompose A using Singular Value Decomposition (SVD)
  • SVD is a standard technique from matrix algebra
  • Packages such as MATLAB have SVD functions: >> [U,S,V] = svd(A)

  16. 'Word-Document' Matrix
  • The rows of A are indexed by terms and the columns by documents
  • The entry A(n,m) is the number of times term t_n occurs in document d_m
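
A minimal sketch of constructing A for the three-document example, assuming NumPy is available:

    import numpy as np

    vocab = ["the", "cat", "sat", "on", "mat", "dog",
             "chased", "mouse", "stayed", "at", "home"]
    docs = ["the cat sat on the mat",
            "the dog chased the cat",
            "the mouse stayed at home"]

    # A[n, m] = number of times term t_n occurs in document d_m.
    A = np.array([[d.split().count(t) for d in docs] for t in vocab])
    print(A.shape)  # (11, 3): one row per term, one column per document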

  17. Singular Value Decomposition (SVD)
  • SVD factorises the word-document matrix as A = U S Vᵀ
  • The first column of U gives the direction of the most significant correlation between terms
  • The corresponding (largest) singular value in S gives the 'strength' of that correlation
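
The MATLAB call on slide 15 has a direct NumPy analogue; a sketch (note that np.linalg.svd returns the singular values as a vector s rather than a diagonal matrix S):

    import numpy as np

    vocab = ["the", "cat", "sat", "on", "mat", "dog",
             "chased", "mouse", "stayed", "at", "home"]
    docs = ["the cat sat on the mat",
            "the dog chased the cat",
            "the mouse stayed at home"]
    A = np.array([[d.split().count(t) for d in docs] for t in vocab])

    # A = U @ diag(s) @ Vt; singular values arrive sorted, largest first.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    print(s)        # 'strengths' of the correlations
    print(U[:, 0])  # direction of the most significant term correlation

    # Keeping only the k largest singular values gives a rank-k
    # 'latent semantic' approximation of A:
    k = 2
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]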

  18. More about LSA
  • See: Landauer, T.K. and Dumais, S.T., "A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction and representation of knowledge", Psychological Review, 104(2), 211-240 (1997)

  19. Summary
  • Vector space representation of documents
  • Intuitive description of Latent Semantic Analysis
