
Building Topic Models in a Federated Digital Library Through Selective Document Exclusion

Miles Efron

Peter Organisciak

Katrina Fenlon

Graduate School of Library & Information Science

University of Illinois, Urbana-Champaign

ASIST 2011

New Orleans, LA

October 10, 2011

Supported by IMLS LG-06-07-0020.

The Setting: IMLS DCC

[Slide diagram: Data providers (IMLS NLG and LSTA projects) hold collections; their metadata is harvested via OAI-PMH by the service provider, the DCC, which builds services over the aggregated records.]
High-Level Research Interest
  • Improve “access” to data harvested for federated digital libraries by enhancing:
    • Representation of documents
    • Representation of document aggregations
    • Use of the relationship between aggregations and documents.
  • PS: By “document” I mean a single metadata record (usually Dublin Core).
Motivation for our Work
  • Most empirical approaches to this type of problem rely on some kind of analysis of term counts.
  • Unreliable for our data:
    • Vocabulary mismatch
    • Poor probability estimates
The Problem: Supporting End-User Experience
  • Full-text search
  • Browse by “subject”
  • Desired:
    • Improved browsing
    • Support for high-level understanding of aggregations and for resource discovery
  • Approach: Empirically induce “topics” using established methods, e.g. latent Dirichlet allocation (LDA).
Research Question
  • Can we improve induced models by mitigating the influence of noisy data, common in federated digital library settings?
  • Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.
  • Approach: Identify and remove “weakly topical” documents during model training.
Latent Dirichlet Allocation
  • Given a corpus of documents, C, and an empirically chosen integer k
  • Assume that a generative process involving k latent topics generated word occurrences in C.
  • End result: for a given word w, a given document D, and each topic Ti in T1 … Tk:
    • Pr(w | Ti)
    • Pr(D | Ti)
    • Pr(Ti)

Latent Dirichlet Allocation
  • Given a corpus of documents, C, and an empirically chosen integer k
  • Assume that a generative process involving k latent topics generated word occurrences in C.
  • End result: for a given word w, a given document D, and each topic Ti in T1 … Tk:
    • Pr(w | Ti)
    • Pr(D | Ti)
    • Pr(Ti)
  • Choose doc length N ~ Poisson(mu).
  • Choose probability vector Theta ~ Dir(alpha).
  • For each word position n in 1:N:
    • Choose topic zn ~ Multinomial(Theta).
    • Choose word wn from Pr(wn | zn, Beta) (this generative process is simulated in the sketch below).
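A toy simulation of this generative story, written in Python with NumPy; the hyperparameter values (k, V, mu, alpha, Beta) are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
k, V, mu = 5, 1000, 50                          # topics, vocabulary size, mean doc length
alpha = np.full(k, 0.1)                         # Dirichlet prior over topic proportions
beta = rng.dirichlet(np.full(V, 0.01), size=k)  # Pr(w | Ti): one row per topic

def generate_document():
    N = rng.poisson(mu)                         # choose doc length N ~ Poisson(mu)
    theta = rng.dirichlet(alpha)                # choose Theta ~ Dir(alpha)
    doc = []
    for _ in range(N):
        z = rng.choice(k, p=theta)              # choose topic zn ~ Multinomial(Theta)
        doc.append(rng.choice(V, p=beta[z]))    # choose word wn from Pr(wn | zn, Beta)
    return doc

corpus = [generate_document() for _ in range(200)]
```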

Latent Dirichlet Allocation
  • Given a corpus of documents, C, and an empirically chosen integer k.
  • Assume that a generative process involving k latent topics generated word occurrences in C.
  • End result: for a given word w, a given document D, and each topic Ti in T1 … Tk:
    • Pr(w | Ti)
    • Pr(D | Ti)
    • Pr(Ti)
  • Calculate estimates via iterative methods: MCMC / Gibbs sampling (one fitting option is sketched below).
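The slides do not name an implementation. Purely as one concrete option, the sketch below fits the model with the open-source Python `lda` package (a collapsed Gibbs sampler) on a toy document-term matrix built with scikit-learn; the records, k, and iteration count are illustrative only.

```python
import numpy as np
import lda                                             # pip install lda
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for harvested metadata records (title/description text).
records = [
    "civil war letters from an illinois regiment",
    "photographs of prairie farm life",
    "oral history interviews chicago neighborhoods",
    "civil war diary of an illinois soldier",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(records).toarray()        # document-term count matrix
vocab = np.array(vectorizer.get_feature_names_out())

model = lda.LDA(n_topics=2, n_iter=500, random_state=1)  # k chosen empirically
model.fit(X)                                           # collapsed Gibbs sampling

# topic_word_ approximates Pr(w | Ti); doc_topic_ holds per-document topic mixtures.
for i, dist in enumerate(model.topic_word_):
    top = vocab[np.argsort(dist)[-5:][::-1]]
    print(f"Topic {i}: {' '.join(top)}")
```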

[Slide diagram: Proposed algorithm, starting from the full corpus.]

[Slide diagram: Train the model on the reduced corpus, yielding Pr(w | T), Pr(D | T), and Pr(T).]

[Slide diagram: Apply the trained estimates to the full corpus via LDA inference, producing Pr(w | T), Pr(D | T), and Pr(T) for all documents; an end-to-end sketch follows below.]
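To make the three diagrams concrete, here is a minimal end-to-end sketch using gensim. It is not the authors' code: gensim's LdaModel uses online variational Bayes rather than the Gibbs sampler named earlier, and `is_stop_document` is a placeholder for the exclusion step described on the following slides.

```python
from gensim import corpora
from gensim.models import LdaModel

def is_stop_document(tokens):
    # Placeholder for the "weakly topical" / run-detection filter discussed
    # below; here it simply drops very short records.
    return len(tokens) < 3

# Toy token lists standing in for harvested metadata records.
tokenized = [
    ["prairie", "farm", "photograph", "illinois"],
    ["civil", "war", "letter", "regiment", "illinois"],
    ["photograph"],                      # a weak, near-empty record the filter removes
    ["oral", "history", "interview", "chicago", "neighborhood"],
]

reduced = [d for d in tokenized if not is_stop_document(d)]
excluded = [d for d in tokenized if is_stop_document(d)]

# Train only on the reduced corpus...
dictionary = corpora.Dictionary(reduced)
bows = [dictionary.doc2bow(d) for d in reduced]
model = LdaModel(bows, id2word=dictionary, num_topics=2, passes=10, random_state=1)

# ...then use LDA inference to assign topics to the excluded documents as well.
for d in excluded:
    print(d, model.get_document_topics(dictionary.doc2bow(d)))
```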

Documents’ Topical Strength
  • Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.
Documents’ Topical Strength
  • Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.
  • Proposal: Improve the induced topic model by removing “weakly topical” documents during training.
  • After training, use the inferential apparatus of LDA to assign topics to these “stop documents.”
Identifying “Stop Documents”
  • Time at which documents enter a repository is often informative (e.g. bulk uploads).

where MC is the collection language model

and di is the words comprising the ith document

log Pr(di | MC)
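A minimal sketch of this score, assuming a unigram collection language model with add-one smoothing; the paper's exact estimator and smoothing method are not given on the slides.

```python
import math
from collections import Counter

def collection_model(docs):
    """Unigram collection language model MC with add-one smoothing.
    `docs` is a list of token lists, one per harvested record."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    vocab_size = len(counts)
    return lambda tok: (counts[tok] + 1) / (total + vocab_size)

def log_likelihood(doc, prob):
    """log Pr(di | MC): sum of log word probabilities under the collection model."""
    return sum(math.log(prob(tok)) for tok in doc)

# Toy usage: near-duplicate bulk uploads produce nearly identical scores,
# which is what the run-detection step below looks for.
docs = [
    ["prairie", "farm", "photograph"],
    ["prairie", "farm", "photograph"],
    ["civil", "war", "letter", "illinois"],
]
p_c = collection_model(docs)
print([round(log_likelihood(d, p_c), 3) for d in docs])
```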

Identifying “Stop Documents”
  • Our paper outlines an algorithm for accomplishing this.
  • Intuition:
    • Given a document di, decide whether it is part of a “run” of near-identical records.
    • Remove all records that occur within a run.
    • The amount of homogeneity required to identify a run is governed by a parameter tol, expressed as a cumulative-normal probability (e.g. 95% or 99% confidence). One possible reading is sketched below.
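The full algorithm is in the paper, not on the slides; the sketch below is only one plausible reading. It walks records in repository-entry order and flags stretches whose collection-model scores (from the previous sketch) are unusually homogeneous, with the cutoff derived from a normal fit to the score gaps at confidence level tol. All names and details here are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def run_members(scores, tol=0.95, min_run=3):
    """Flag documents that fall inside a "run" of near-identical records.

    scores  : log Pr(di | MC) values in repository-entry order.
    tol     : cumulative-normal confidence level; gaps between consecutive
              scores below the (1 - tol) quantile of a normal fit to all gaps
              are treated as "near identical".
    min_run : minimum number of consecutive near-identical records in a run."""
    scores = np.asarray(scores, dtype=float)
    gaps = np.abs(np.diff(scores))
    cutoff = norm.ppf(1.0 - tol, loc=gaps.mean(), scale=gaps.std() or 1e-9)
    flagged, start = set(), 0
    for i, gap in enumerate(gaps):
        if gap <= cutoff:
            continue                          # still inside a homogeneous stretch
        if i + 1 - start >= min_run:          # the stretch [start, i] was a run
            flagged.update(range(start, i + 1))
        start = i + 1
    if len(scores) - start >= min_run:        # trailing run
        flagged.update(range(start, len(scores)))
    return sorted(flagged)
```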
Experimental Assessment
  • Question: Are topics built from “sampled” corpora more coherent than topics induced from raw corpora?
  • Intrusion detection:
    • Find the 10 most probable words for topic Ti
    • Replace one of these 10 with a word chosen from the corpus with uniform probability.
    • Ask human assessors to identify the “intruder” word (task construction is sketched below).
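A small sketch of how one intrusion item could be built from a fitted model, assuming the gensim LdaModel from the earlier pipeline sketch; the function name and sampling details are illustrative, not the authors' exact procedure.

```python
import random

def build_intrusion_item(lda_model, topic_id, dictionary, rng=random.Random(0)):
    """Top-10 topic words with one uniformly sampled corpus word swapped in."""
    top_words = [w for w, _ in lda_model.show_topic(topic_id, topn=10)]
    vocab = [dictionary[i] for i in range(len(dictionary))]
    intruder = rng.choice([w for w in vocab if w not in top_words])
    item = top_words[:]
    item[rng.randrange(len(item))] = intruder   # replace one of the 10 with the intruder
    rng.shuffle(item)                           # shuffle before showing to assessors
    return item, intruder
```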
Experimental Assessment
  • For each topic Ti, have 20 assessors try to find an intruder (20 different intruders). Repeat for both “sampled” and “raw” models.
    • i.e. 20 × 2 × 100 = 4,000 assessments
  • As,i is the percentage of workers who correctly found the intruder in the ith topic of the sampled model; Ar,i is analogous for the raw model.
  • Testing whether As,i > Ar,i (i.e. rejecting H0: As,i ≤ Ar,i) yields p < 0.001 (one possible test is sketched below).
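The slides do not say which statistical test produced p < 0.001. Purely as an illustration, the sketch below assumes per-topic intruder-detection accuracies for the two models and a one-sided paired t-test (SciPy ≥ 1.6); the accuracy values are synthetic, not the study's data.

```python
import numpy as np
from scipy.stats import ttest_rel

# a_sampled[i], a_raw[i]: fraction of 20 assessors who found the intruder for
# topic i under the sampled-corpus and raw-corpus models (synthetic values).
rng = np.random.default_rng(0)
a_raw = rng.uniform(0.3, 0.7, size=100)
a_sampled = np.clip(a_raw + rng.normal(0.1, 0.05, size=100), 0.0, 1.0)

# One-sided paired t-test of H0: mean(a_sampled - a_raw) <= 0.
t_stat, p_value = ttest_rel(a_sampled, a_raw, alternative="greater")
print(f"t = {t_stat:.2f}, one-sided p = {p_value:.4g}")
```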
Experimental Assessment
  • For each topic Ti, have 20 workers subjectively assess the topic’s “coherence,” reporting on a 4-point Likert scale.
Current & Future Work
  • Testing breadth of coverage
  • Assessing the value of induced topics
  • Topic information for document priors in the language modeling IR framework [next slide]
  • Massive document expansion for improved language model estimation [under review]
Thank You

Miles Efron

Peter Organisciak

Katrina Fenlon

Graduate School of Library & Information Science

University of Illinois, Urbana-Champaign

ASIST 2011

New Orleans, LA

October 10, 2011
