
Building Topic Models in a Federated Digital Library Through Selective Document Exclusion




Presentation Transcript


  1. Building Topic Models in a Federated Digital Library Through Selective Document Exclusion Miles Efron Peter Organisciak Katrina Fenlon Graduate School of Library & Information Science University of Illinois, Urbana-Champaign ASIST 2011 New Orleans, LA October 10, 2011 Supported by IMLS LG-06-07-0020.

  2. The Setting: IMLS DCC [architecture diagram: data providers (IMLS NLG & LSTA collections) expose metadata that is harvested over OAI-PMH by the DCC, the service provider, which builds the DCC services on top of the aggregated records]

  3. High-Level Research Interest • Improve “access” to data harvested for federated digital libraries by enhancing: • Representation of documents • Representation of document aggregations • Capitalizing on the relationship between aggregations and documents. • PS: By “document” I mean a single metadata (usually DC) record.

  4. Motivation for our Work • Most empirical approaches to this type of problem rely on some kind of analysis of term counts. • Unreliable for our data: • Vocabulary mismatch • Poor probability estimates

  5. The Setting: IMLS DCC

  6. The Problem: Supporting End-User Experience • Full-text search • Browse by “subject” • Desired: • Improved browsing • Support high-level aggregation understanding and resource discovery • Approach: Empirically induced “topics” using established methods, e.g. latent Dirichlet allocation (LDA).

  7. Research Question • Can we improve induced models by mitigating the influence of noisy data, common in federated digital library settings? • Hypothesis: Harvested records are not all useful for training a model of corpus-level topics. • Approach: Identify and remove “weakly topical” documents during model training.

  8. Latent Dirichlet Allocation • Given a corpus of documents, C, and an empirically chosen integer k • Assume that a generative process involving k latent topics generated word occurrences in C. • End result: for each topic Ti in T1 … Tk, and for a given word w and document D: • Pr(w | Ti) • Pr(D | Ti) • Pr(Ti)

  9. Latent Dirichlet Allocation • Given a corpus of documents, C, and an empirically chosen integer k • Assume that a generative process involving k latent topics generated word occurrences in C. • The generative process: • Choose doc length N ~ Poisson(mu). • Choose probability vector Theta ~ Dir(alpha). • For each word wi in 1:N: • Choose topic zi ~ Multinomial(Theta). • Choose word wi from Pr(wi | zi, Beta). • End result: for each topic Ti in T1 … Tk, and for a given word w and document D: Pr(w | Ti), Pr(D | Ti), Pr(Ti)
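As a concrete illustration of the generative story above, here is a minimal Python sketch that samples toy documents from an LDA-style process. The vocabulary, topic count, and hyperparameters are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

k = 3                                    # number of latent topics (illustrative)
vocab = ["river", "map", "quilt", "war", "photo", "letter"]
alpha = np.full(k, 0.1)                  # Dirichlet prior over per-document topic proportions
beta = rng.dirichlet(np.full(len(vocab), 0.01), size=k)  # one word distribution per topic
mu = 8                                   # mean document length

def generate_document():
    n = rng.poisson(mu)                  # choose doc length N ~ Poisson(mu)
    theta = rng.dirichlet(alpha)         # choose probability vector Theta ~ Dir(alpha)
    words = []
    for _ in range(n):
        z = rng.choice(k, p=theta)              # choose topic z_i ~ Multinomial(Theta)
        w = rng.choice(len(vocab), p=beta[z])   # choose word w_i from Pr(w_i | z_i, Beta)
        words.append(vocab[w])
    return words

print(generate_document())
```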

  10. Latent Dirichlet Allocation • Given a corpus of documents, C, and an empirically chosen integer k • Assume that a generative process involving k latent topics generated word occurrences in C. • Estimates of Pr(w | Ti), Pr(D | Ti), and Pr(Ti) for each topic Ti in T1 … Tk are calculated via iterative methods: MCMC / Gibbs sampling.
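For the estimation step, a small sketch using the `lda` Python package, which fits LDA by collapsed Gibbs sampling. The toy document-term matrix and parameter settings are assumptions for illustration only, not the paper's configuration.

```python
import numpy as np
import lda   # pip install lda; fits LDA with collapsed Gibbs sampling

# Toy document-term count matrix: rows = metadata records, columns = vocabulary terms.
# Real input would be term counts over the harvested records' text fields.
X = np.array([
    [2, 1, 0, 0],
    [0, 0, 3, 1],
    [2, 0, 1, 0],
    [0, 1, 0, 2],
])

model = lda.LDA(n_topics=2, n_iter=500, random_state=0)
model.fit(X)                 # iterative MCMC estimation

print(model.topic_word_)     # rows approximate Pr(w | T_i)
print(model.doc_topic_)      # rows give each document's topic mixture
```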

  11. Full Corpus [diagram: the complete harvested corpus]

  12. Full Corpus → Proposed algorithm [diagram: the proposed exclusion algorithm filters the full corpus]

  13. Reduced Corpus → Train the model: Pr(w | T), Pr(D | T), Pr(T)

  14. Full Corpus → Inference [diagram: the trained model's Pr(w | T), Pr(D | T), Pr(T) are used to assign topics to every document in the full corpus]
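A sketch of this train-then-infer pipeline using gensim's LdaModel (which estimates the model by online variational Bayes rather than Gibbs sampling). The toy records and the variable names `reduced_docs` and `full_docs` are illustrative assumptions.

```python
from gensim import corpora, models

# Hypothetical corpora: reduced_docs is the harvest with weakly topical records removed;
# full_docs is every harvested record, including the excluded "stop documents".
reduced_docs = [["civil", "war", "photograph", "regiment"],
                ["quilt", "textile", "pattern", "cotton"],
                ["river", "map", "survey", "township"]]
full_docs = reduced_docs + [["untitled", "image", "untitled", "image"]]

dictionary = corpora.Dictionary(full_docs)
train_bow = [dictionary.doc2bow(d) for d in reduced_docs]
full_bow = [dictionary.doc2bow(d) for d in full_docs]

# Train only on the reduced corpus ...
topic_model = models.LdaModel(train_bow, num_topics=2, id2word=dictionary,
                              passes=10, random_state=0)

# ... then use LDA's inference step to assign topics to every record,
# including the documents excluded from training.
for i, bow in enumerate(full_bow):
    print(i, topic_model.get_document_topics(bow, minimum_probability=0.0))
```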

  15. Sample Topics Induced from “Raw” Data

  16. Documents’ Topical Strength • Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.

  17. Documents’ Topical Strength • Hypothesis: Harvested records are not all useful for training a model of corpus-level topics. • Proposal: Improve the induced topic model by removing “weakly topical” documents during training. • After training, use the inferential apparatus of LDA to assign topics to these “stop documents.”

  18. Identifying “Stop Documents” • The time at which documents enter a repository is often informative (e.g. bulk uploads). • Score each document by log Pr(di | MC), where MC is the collection language model and di is the words comprising the ith document.
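A minimal sketch of this scoring, assuming a unigram collection language model with additive smoothing; the smoothing choice and the toy records are ours, not the paper's.

```python
import math
from collections import Counter

def collection_log_likelihood(doc_tokens, collection_counts, total_terms, mu=1.0):
    """log Pr(d_i | M_C): log-likelihood of a record's words under the collection
    language model M_C, with additive smoothing (mu) so unseen terms avoid -inf."""
    vocab_size = len(collection_counts)
    return sum(
        math.log((collection_counts.get(w, 0) + mu) / (total_terms + mu * vocab_size))
        for w in doc_tokens
    )

# Toy collection of tokenized metadata records (illustrative only).
docs = [["river", "map", "survey"], ["quilt", "quilt", "pattern"], ["river", "photo"]]
collection_counts = Counter(w for d in docs for w in d)
total_terms = sum(collection_counts.values())

for i, d in enumerate(docs):
    print(i, collection_log_likelihood(d, collection_counts, total_terms))
```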

  19. Identifying “Stop Documents” • Our paper outlines an algorithm for accomplishing this. • Intuition: • Given a document di, decide whether it is part of a “run” of near-identical records. • Remove all records that occur within a run. • The amount of homogeneity required to identify a run is governed by a parameter tol, a confidence level on the cumulative normal distribution (e.g. 95% or 99%).
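A loose sketch of the run-exclusion intuition on toy records. Here tol is applied directly as a similarity threshold on consecutive harvested records, whereas the paper derives the required homogeneity from the cumulative normal distribution, so this is an illustrative stand-in rather than the paper's algorithm.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def records_in_runs(records_in_harvest_order, tol=0.95):
    """Flag records that sit in a 'run' of near-identical neighbours.
    In this sketch, `tol` is used directly as a Jaccard-similarity threshold
    between consecutive records; the paper's threshold is derived differently."""
    flagged = set()
    for i in range(1, len(records_in_harvest_order)):
        if jaccard(records_in_harvest_order[i - 1], records_in_harvest_order[i]) >= tol:
            flagged.update({i - 1, i})
    return flagged

# Toy harvest: three near-identical bulk-uploaded records followed by two distinct ones.
records = [["photo", "negative", "wwii"], ["photo", "negative", "wwii"],
           ["photo", "negative", "wwii"], ["quilt", "cotton", "pattern"],
           ["river", "map", "survey"]]
print(records_in_runs(records))   # indices of records to exclude before training
```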

  20. Sample Topics Induced from Groomed Data

  21. Experimental Assessment • Question: Are topics built from “sampled” corpora more coherent than topics induced from raw corpora? • Intrusion detection: • Find the 10 most probable words for topic Ti • Replace one of these 10 with a word chosen from the corpus with uniform probability. • Ask human assessors to identify the “intruder” word.
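A small sketch of how one word-intrusion item might be constructed; the topic words and vocabulary are illustrative, not drawn from the evaluated models, and the helper name is hypothetical.

```python
import random

def make_intrusion_item(topic_top_words, vocabulary, rng=random.Random(0)):
    """Build one word-intrusion item: take a topic's 10 most probable words,
    replace one of them with a word drawn uniformly from the corpus vocabulary
    (excluding the top words themselves), and shuffle.  Assessors are then
    asked to spot the intruder."""
    top10 = list(topic_top_words[:10])
    intruder = rng.choice([w for w in vocabulary if w not in top10])
    top10[rng.randrange(len(top10))] = intruder
    rng.shuffle(top10)
    return top10, intruder

# Illustrative topic and vocabulary (not from the paper).
topic_words = ["war", "civil", "regiment", "soldier", "battle",
               "letter", "army", "union", "camp", "general"]
vocabulary = topic_words + ["quilt", "river", "survey", "photograph", "cotton"]
print(make_intrusion_item(topic_words, vocabulary))
```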

  22. Experimental Assessment • For each topic Ti, have 20 assessors try to find an intruder (20 different intruders). Repeat for both the “sampled” and “raw” models. • i.e. 20 * 2 * 100 = 4,000 assessments • A_si is the percent of workers who correctly found the intruder in the ith topic of the sampled model, and A_ri is analogous for the raw model • Testing whether A_si > A_ri yields p < 0.001

  23. Experimental Assessment • For each topic Ti, have 20 workers subjectively assess the topic’s “coherence,” reporting on a 4-point Likert scale.

  24. Current & Future Work • Testing breadth of coverage • Assessing the value of induced topics • Topic information for document priors in the language modeling IR framework [next slide] • Massive document expansion for improved language model estimation [under review]

  25. Weak Topicality and Document Priors

  26. Weak Topicality and Document Priors

  27. Thank You Miles Efron Peter Organisciak Katrina Fenlon Graduate School of Library & Information Science University of Illinois, Urbana-Champaign ASIST 2011 New Orleans, LA October 10, 2011
