
Using Latent Semantic Analysis to Find Different Names for the Same Entity in Free Text

Using Latent Semantic Analysis to Find Different Names for the Same Entity in Free Text. Conley Read cread@cs.ucr.edu Computer Science & Engineering University of California - Riverside. Tim Oates, Vinay Bhat, Vishal Shanbhag, Charles Nicholas University of Maryland - Baltimore County


Presentation Transcript


  1. Using Latent Semantic Analysis to Find Different Names for the Same Entity in Free Text Conley Read cread@cs.ucr.edu Computer Science & Engineering University of California - Riverside Tim Oates, Vinay Bhat, Vishal Shanbhag, Charles Nicholas University of Maryland - Baltimore County ACM Web Information Data Management, 2002:31-35

  2. Overview • The problem – research motivation • The solution, LSA? • LSA doesn’t work so well • Let’s do it (LSA) again • Two-stage LSA works! • Create your own Corpus

  3. The Problem • Mumbai • Bombay

  4. Motivation • al Qaeda • al Qaida

  5. Motivation • Nutrasweet • aspartame

  6. Motivation • al Qaeda cells al Qaida network suspects Iraq bin Laden alleged cell warned terrorist

  7. An Old IR Problem … drove their car to … … minor car accident … … new 2002 car models … … car gets good mileage … … drove their automobile to … … minor automobile accident … … new 2002 automobile models … … automobile gets good mileage …

  8. Keyword Query: CAR … drove their car to … … minor car accident … … new 2002 car models … … car gets good mileage … … drove their automobile to … … minor automobile accident … … new 2002 automobile models … … automobile gets good mileage …

  9. Keyword Query: AUTOMOBILE … drove their car to … … minor car accident … … new 2002 car models … … car gets good mileage … … drove their automobile to … … minor automobile accident … … new 2002 automobile models … … automobile gets good mileage …

  10. Latent Semantic Analysis … drove their car to … … minor car accident … … new 2002 car models … … car gets good mileage … … drove their automobile to … … minor automobile accident … … new 2002 automobile models … … automobile gets good mileage …

  11. Term-Document Matrix • A is an m × n matrix with m terms (rows) and n documents (columns) • A(I, J) = number of times term I occurs in document J
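
As a minimal sketch of the matrix described on this slide, the count matrix can be built as follows. The whitespace tokenizer, the lowercasing, and the function name `term_document_matrix` are assumptions made for illustration; the slides do not say how the text was tokenized.

```python
from collections import Counter

import numpy as np


def term_document_matrix(docs):
    """Return (terms, A) where A[i, j] counts how often terms[i] occurs in docs[j]."""
    # Assumption: lowercase, whitespace-separated tokens.
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    A = np.zeros((len(terms), len(docs)), dtype=int)
    for j, doc_counts in enumerate(counts):
        for i, term in enumerate(terms):
            A[i, j] = doc_counts[term]  # Counter returns 0 for absent terms
    return terms, A


docs = [
    "drove their car to",
    "minor car accident",
    "drove their automobile to",
    "minor automobile accident",
]
terms, A = term_document_matrix(docs)
print(terms)
print(A)
```

Rows are terms and columns are documents, matching the m × n layout above.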

  12. Latent Semantic Analysis • Compute singular value decomposition (SVD) of A: A = U Σ V^T • Retain the k < n largest singular values • Set remainder to zero • Projects terms/docs into k-dimensional space • Compute similarity in that space

  13. Singular Value Decomposition: A = U Σ V^T • U – row corresponds to a word • Σ – singular values of A • V^T – column corresponds to a document [Berry & Fierro 1996] Numerical Linear Algebra with Applications 3(4):301-327

  14. Using SVD • U – Look only at k columns (words) • Σ_k – Set all but k largest to zero • V^T – Look only at k rows (documents) [Berry & Fierro 1996] Numerical Linear Algebra with Applications 3(4):301-327
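
The truncation step can be illustrated with NumPy's `numpy.linalg.svd`. This is a sketch, not the authors' implementation: the toy matrix, the choice k = 2, and the use of cosine similarity on the rows of U_k Σ_k are all assumptions. In the toy data, "car" and "automobile" never share a document, so their raw count vectors are orthogonal, yet they land on the same direction in the 2-dimensional space.

```python
import numpy as np


def lsa_term_vectors(A, k):
    """Project each term (row of A) into the k-dimensional latent space."""
    U, s, _ = np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)
    k = min(k, len(s))
    # Keep the k largest singular values; row i of the result is term i.
    return U[:, :k] * s[:k]


def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0


# Hypothetical toy corpus: 4 documents about vehicles, 3 about movies.
terms = ["car", "automobile", "drove", "accident", "movie", "actor"]
A = np.array([
    [1, 1, 0, 0, 0, 0, 0],  # car
    [0, 0, 1, 1, 0, 0, 0],  # automobile
    [1, 0, 1, 0, 0, 0, 0],  # drove
    [0, 1, 0, 1, 0, 0, 0],  # accident
    [0, 0, 0, 0, 1, 1, 1],  # movie
    [0, 0, 0, 0, 1, 1, 0],  # actor
])

T = lsa_term_vectors(A, k=2)
raw_sim = cosine(A[0], A[1])   # similarity of raw count rows
lsa_sim = cosine(T[0], T[1])   # similarity after projection
print(raw_sim, lsa_sim)
```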

  15. Using LSA to Find Aliases • Given name N and document collection D • Compute SVD of term-document matrix • Retain k largest singular values • Compute similarity of all terms to N • Report rank-ordered list of terms • True aliases for N must be high in list
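
The steps on this slide can be sketched as a single ranking function. The name `rank_aliases`, the cosine measure, and the toy matrix are illustrative assumptions; the slides only say that similarity is computed in the reduced space.

```python
import numpy as np


def rank_aliases(terms, A, name, k):
    """Return (term, similarity) pairs for all terms except `name`, best first."""
    U, s, _ = np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)
    k = min(k, len(s))
    T = U[:, :k] * s[:k]                  # term vectors in the k-dim space
    norms = np.linalg.norm(T, axis=1)
    norms[norms == 0] = 1.0               # avoid dividing by zero
    T = T / norms[:, None]                # unit length, so dot product = cosine
    sims = T @ T[terms.index(name)]
    order = np.argsort(-sims)             # descending similarity
    return [(terms[i], float(sims[i])) for i in order if terms[i] != name]


# Hypothetical toy corpus: 4 documents about vehicles, 3 about movies.
terms = ["car", "automobile", "drove", "accident", "movie", "actor"]
A = np.array([
    [1, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 0],
])

ranked = rank_aliases(terms, A, "car", k=2)
for term, sim in ranked:
    print(f"{term:12s} {sim:.3f}")
```

The position of a known alias in `ranked` is the quantity of interest: a true alias should appear near the top of the list.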

  16. Experiment: Creating Aliases • name N and document collection D • Set P, a percentage • S1 and S2 are two strings not in D • Replace N with S1 in P% of the documents • Replace N with S2 in the other documents • Search for aliases for S1 • Observe rank of S2 in ordered list
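
A sketch of the substitution step follows. Several details are assumptions not stated on the slide: P is applied to the documents that contain N, the documents receiving S1 are chosen at random with a fixed seed, and the replacement is a plain substring replace. The function name `split_name` is hypothetical.

```python
import random


def split_name(docs, name, s1, s2, p, seed=0):
    """Replace `name` with `s1` in p% of the documents containing it, `s2` in the rest."""
    rng = random.Random(seed)
    holders = [i for i, doc in enumerate(docs) if name in doc]
    rng.shuffle(holders)
    n_s1 = round(len(holders) * p / 100)
    use_s1 = set(holders[:n_s1])
    # Documents without `name` pass through unchanged.
    return [
        doc.replace(name, s1 if i in use_s1 else s2)
        for i, doc in enumerate(docs)
    ]


docs = [
    "list of al Qaeda leaders",
    "most senior al Qaeda member captured",
    "alleged al Qaeda representative",
    "al Qaeda cells in Germany",
    "the team won the championship",
]
new_docs = split_name(docs, "al Qaeda", "alqaeda1", "alqaeda2", p=50)
for doc in new_docs:
    print(doc)
```

Because S1 and S2 never co-occur in a document, any similarity found between them has to come from shared context rather than direct co-occurrence.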

  17. Our Dataset • 77 documents from www.cnn.com • Shortest has 131 words, longest has 1923 • “al Qaeda” occurs in 49 documents • Others on politics, sports, entertainment • N = “al Qaeda” • S1 = “alqaeda1” • S2 = “alqaeda2” • P = 50

  18. Algorithm Parameters • k – dimensionality of compressed space • Small values result in spurious similarities • Large values closely approximate A • T – threshold on TF/IDF (Term Frequency / Inverse Document Frequency) value • More aggressive filtering with larger values • Want to avoid filtering aliases • Want to filter irrelevant words • We want a high retrieval rate (precision) and a low miss rate (infrequent in collection).
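
The thresholding idea can be sketched as below. The slides do not give the exact TF/IDF formula or say how a per-document score becomes a per-term decision, so both are assumptions here: the weight is count × ln(n / document frequency), and a term is kept when its largest weight in any document reaches the threshold.

```python
import numpy as np


def tfidf_filter(terms, A, threshold):
    """Drop terms whose maximum TF/IDF weight over all documents is below `threshold`."""
    A = np.asarray(A, dtype=float)
    n_docs = A.shape[1]
    df = np.count_nonzero(A, axis=1)           # documents containing each term
    idf = np.log(n_docs / np.maximum(df, 1))   # assumed IDF variant
    tfidf = A * idf[:, None]
    keep = tfidf.max(axis=1) >= threshold
    kept_terms = [t for t, flag in zip(terms, keep) if flag]
    return kept_terms, A[keep]


# Hypothetical counts: "the" appears everywhere, the names are rare.
terms = ["the", "alqaeda1", "lindh"]
A = np.array([
    [3, 2, 4],
    [2, 0, 0],
    [0, 1, 0],
])

kept_low, A_low = tfidf_filter(terms, A, threshold=1.0)
kept_high, A_high = tfidf_filter(terms, A, threshold=2.0)
print(kept_low)
print(kept_high)
```

A larger threshold removes more terms, which is the trade-off the slide describes: common words should go, but a threshold set too high can also remove an alias.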

  19. Results 1: LSA Stage 1 Figure 1: Plot of Rank as a function of t for values of k.

  20. Results: Ontologically Dissimilar • Top-ranked terms for k = 20, k = 5, k = 10 • arrested, government, ressam, lindh, zubaydah, raids, attacks, brahim, passengers, virginia • zubaydah, raids, ressam, pakistani, hamdi, soldier, trial, alqaeda2, pakistan, walker • zubaydah, ressam, raids, hamdi, alqaeda2, pakistani, trial, soldier, pakistan, lindh • Problem: LSA shows Organizations and Individuals as similar.

  21. Local Context to Ontology An Organization … list of al Qaeda leaders … … most senior al Qaeda member captured … … alleged al Qaeda representative … An Individual … photograph showing Lindh blindfolded … … with Lindh, the 21-year-old American … … Lindh pleaded guilty … Ontology: Hierarchical structuring of knowledge according to relevant or cognitive qualities.

  22. A Second Run of LSA • For each term T in the top 250 candidates • Create a document DT • DT contains the words just before and just after each occurrence of T in the original corpus • Run LSA on all of the DT (the new corpus) … most senior al Qaeda member captured … … photograph showing Lindh blindfolded and …
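
Building the second-stage corpus can be sketched as follows. The assumptions: candidates are single lowercase tokens (as with "alqaeda1" in the experiment), tokens are split on whitespace, and the context is one word on each side. The `window` parameter and the function name `context_documents` are hypothetical.

```python
def context_documents(docs, candidates, window=1):
    """Map each candidate term T to a document DT made of its neighbouring words."""
    wanted = set(candidates)
    contexts = {term: [] for term in candidates}
    for doc in docs:
        tokens = doc.lower().split()
        for i, tok in enumerate(tokens):
            if tok in wanted:
                contexts[tok].extend(tokens[max(0, i - window):i])   # words before
                contexts[tok].extend(tokens[i + 1:i + 1 + window])   # words after
    return {term: " ".join(words) for term, words in contexts.items()}


docs = [
    "most senior alqaeda1 member captured",
    "photograph showing lindh blindfolded and",
    "alleged alqaeda2 representative",
]
dt = context_documents(docs, ["alqaeda1", "alqaeda2", "lindh"])
for term, text in dt.items():
    print(term, "->", text)
```

The resulting documents form the new corpus for the second LSA run, so terms are compared by the words that surround them rather than by the documents they appear in.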

  23. Results 2: LSA Stage 2 Figure 2: Plot of Rank as a function of t for values of k.

  24. Results 2: Scaled to Figure 1 Figure 3: Plot of Rank as a function of t for values of k.

  25. Results 1 & 2: Comparison LSA-1 and LSA-2, Before and After.

  26. Results: Contextually Similar • Top-ranked terms for k = 20, k = 5, k = 10 • tenet, suspected, warned, alqaeda2, terrorism, terrorist, anaconda, potential, operation, operations • cells, alqaeda2, network, suspects, germany, laden, alleged, cell, terrorist, warned • cells, network, alqaeda2, cell, terrorist, alleged, suspects, laden, singapore, germany • Solution: LSA with context ranks terms by ontological similarity.

  27. Applications • Create your own corpus • Submit N as Google query • Create corpus from top M hits • Run two-stage LSA • Example alias in Movie Titles: • Query N = “Ocean’s 12” • Use Google to get top 100 hits • Run two-stage LSA algorithm You might retrieve: 1. GoldenEye 2. Ocean’s 11 3. Die Hard: Vengeance 4. The Italian Job

  28. Review • Find semantically related terms • Obvious solution – LSA • LSA is not so good • We ran LSA again! • LSA is great! • Create a Corpus with Google

  29. Your Questions? • Acknowledgements: Dr. Tim Oates, oates@cs.umbc.edu • References – the math… Berry, M., Fierro, R. 1996. Low-rank orthogonal decompositions for information retrieval applications. Numerical Linear Algebra with Applications 3(4):301-327.
