Building topic models in a federated digital library through selective document exclusion

Miles Efron

Peter Organisciak

Katrina Fenlon

Graduate School of Library & Information Science

University of Illinois, Urbana-Champaign

ASIST 2011

New Orleans, LA

October 10, 2011

Supported by IMLS LG-06-07-0020.


The Setting: IMLS DCC

[Diagram: data providers (IMLS NLG & LSTA collections) expose metadata via OAI-PMH; the DCC service provider harvests the metadata and builds services on it.]


High-Level Research Interest

  • Improve “access” to data harvested for federated digital libraries by enhancing:

    • Representation of documents

    • Representation of document aggregations

    • Exploitation of the relationship between aggregations and documents.

  • PS: By “document” I mean a single metadata record (usually Dublin Core).


Motivation for Our Work

  • Most empirical approaches to this type of problem rely on some kind of analysis of term counts.

  • Unreliable for our data:

    • Vocabulary mismatch

    • Poor probability estimates


The Setting: IMLS DCC


The Problem: Supporting End-User Experience

  • Full-text search

  • Browse by “subject”

  • Desired:

    • Improved browsing

    • Support high-level aggregation understanding and resource discovery

  • Approach: Empirically induced “topics” using established methods, e.g. latent Dirichlet allocation (LDA).


Research Question

  • Can we improve induced models by mitigating the influence of noisy data, common in federated digital library settings?

  • Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.

  • Approach: Identify and remove “weakly topical” documents during model training.


Latent Dirichlet Allocation

  • Given a corpus of documents, C, and an empirically chosen integer k.

  • Assume that a generative process involving k latent topics generated the word occurrences in C.

  • End result, for each topic T1 … Tk: given a word w and a document D, estimates of

    • Pr(w|Ti)

    • Pr(D|Ti)

    • Pr(Ti)


Latent Dirichlet Allocation (cont.)


  • The generative process, for each document:

    • Choose doc length N ~ Poisson(mu).

    • Choose topic-proportion vector Theta ~ Dir(alpha).

    • For each word position n in 1:N:

      • Choose topic z_n ~ Multinomial(Theta).

      • Choose word w_n from Pr(w_n | z_n, Beta).
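The generative story above can be sketched directly in code. The following is an illustrative Python sketch (function and variable names are ours, not the authors'), using Knuth's method for the Poisson draw and normalized gamma draws for the Dirichlet:

```python
import math
import random

def sample_poisson(mu):
    """Draw N ~ Poisson(mu) via Knuth's multiplication method."""
    L = math.exp(-mu)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def sample_dirichlet(alpha):
    """Draw Theta ~ Dir(alpha) by normalizing independent gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(beta, alpha, mu):
    """Sample one document from the LDA generative process.

    beta:  list of k word distributions, beta[i][v] = Pr(word v | topic i)
    alpha: length-k Dirichlet hyperparameter
    mu:    mean document length for the Poisson draw
    """
    n = max(1, sample_poisson(mu))       # choose doc length N ~ Poisson(mu)
    theta = sample_dirichlet(alpha)      # choose Theta ~ Dir(alpha)
    words = []
    for _ in range(n):
        # choose topic z_n ~ Multinomial(Theta), then word w_n ~ Pr(w | z_n, beta)
        z = random.choices(range(len(alpha)), weights=theta)[0]
        w = random.choices(range(len(beta[z])), weights=beta[z])[0]
        words.append(w)
    return words
```

Each generated document is a list of integer word ids; repeating the call yields a synthetic corpus drawn from the assumed model.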


Latent Dirichlet Allocation (cont.)


  • Calculate estimates via iterative methods: MCMC / Gibbs sampling.
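For illustration, the collapsed Gibbs sampler commonly used to estimate these quantities can be written compactly. This is a generic sketch, not the authors' implementation; the hyperparameters alpha and beta and all names are illustrative:

```python
import random

def gibbs_lda(docs, k, vocab_size, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs: list of documents, each a list of integer word ids.
    Returns topic assignments z plus the count tables from which
    Pr(w|T), Pr(D|T) and Pr(T) are estimated.
    """
    rng = random.Random(seed)
    n_tw = [[0] * vocab_size for _ in range(k)]  # topic-word counts
    n_dt = [[0] * k for _ in docs]               # document-topic counts
    n_t = [0] * k                                # words assigned to each topic
    z = []
    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            t = rng.randrange(k)
            zs.append(t)
            n_tw[t][w] += 1
            n_dt[d][t] += 1
            n_t[t] += 1
        z.append(zs)
    # Resample each token's topic from its full conditional.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                n_tw[t][w] -= 1; n_dt[d][t] -= 1; n_t[t] -= 1
                weights = [
                    (n_tw[t2][w] + beta) / (n_t[t2] + vocab_size * beta)
                    * (n_dt[d][t2] + alpha)
                    for t2 in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                n_tw[t][w] += 1; n_dt[d][t] += 1; n_t[t] += 1
    return z, n_tw, n_dt, n_t
```

After sampling, normalizing the rows of the count tables (with the usual smoothing) gives the per-topic word and document distributions.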


[Diagram: the proposed algorithm. Full corpus → exclude “stop documents” → reduced corpus → train the model, estimating Pr(w | T), Pr(D | T), and Pr(T) → use LDA inference to assign these topic estimates back over the full corpus.]


Sample Topics Induced from “Raw” Data


Documents’ Topical Strength

  • Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.


Documents’ Topical Strength (cont.)

  • Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.

  • Proposal: Improve the induced topic model by removing “weakly topical” documents during training.

  • After training, use the inferential apparatus of LDA to assign topics to these “stop documents.”


Identifying “Stop Documents”

  • Time at which documents enter a repository is often informative (e.g. bulk uploads).

  • Score each document by its log-likelihood under the collection language model:

    log Pr(di | MC)

    where MC is the collection language model and di is the words comprising the ith document.
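A minimal sketch of this scoring, assuming a maximum-likelihood unigram collection model (function names are ours):

```python
import math
from collections import Counter

def collection_model(docs):
    """Maximum-likelihood unigram collection language model M_C."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def doc_log_likelihood(doc, mc, eps=1e-12):
    """log Pr(d_i | M_C): sum of log collection probabilities of d_i's words.

    eps guards against unseen words; a real system would smooth properly.
    """
    return sum(math.log(mc.get(w, eps)) for w in doc)
```

Near-identical records receive near-identical scores, which is what the run-detection step below exploits.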


Identifying “Stop Documents” (cont.)

  • Our paper outlines an algorithm for accomplishing this.

  • Intuition:

    • Given a document di, decide whether it is part of a “run” of near-identical records.

    • Remove all records that occur within a run.

    • The amount of homogeneity required to identify a run is governed by a parameter tol, a cumulative-normal confidence level (e.g. 95% or 99%).
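The paper gives the full algorithm; as a rough reconstruction of the intuition only, the sketch below flags documents whose collection-model scores sit in a run of near-identical values. The explicit cutoff parameter stands in for the tol-derived cumulative-normal quantile and is an assumption of ours:

```python
def find_runs(scores, cutoff, min_run=3):
    """Return indices of documents that sit inside a 'run'.

    scores:  per-document log Pr(d_i | M_C), in repository (upload) order.
    cutoff:  maximum absolute score gap for two neighbours to belong to
             the same run (in the paper this threshold is derived from
             the tol confidence level; here it is supplied directly).
    min_run: minimum run length worth flagging.
    """
    flagged = set()
    start = 0
    for i in range(1, len(scores) + 1):
        # A run ends at the corpus boundary or at a large score gap.
        if i == len(scores) or abs(scores[i] - scores[i - 1]) > cutoff:
            if i - start >= min_run:
                flagged.update(range(start, i))
            start = i
    return flagged
```

Documents whose indices are flagged would be excluded from training and handled by post-hoc LDA inference instead.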


Sample Topics Induced from Groomed Data


Experimental Assessment

  • Question: Are topics built from “sampled” corpora more coherent than topics induced from raw corpora?

  • Intrusion detection:

    • Find the 10 most probable words for topic Ti

    • Replace one of these 10 with a word chosen from the corpus with uniform probability.

    • Ask human assessors to identify the “intruder” word.
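Constructing one word-intrusion item from the steps above can be sketched as follows (names are illustrative):

```python
import random

def make_intrusion_item(topic_top_words, vocab, rng=random):
    """Build one intrusion item: a topic's 10 most probable words with
    one replaced by an 'intruder' drawn uniformly from the rest of the
    vocabulary. Returns the shuffled word list and the intruder."""
    top10 = list(topic_top_words[:10])
    outside = [w for w in vocab if w not in top10]
    intruder = rng.choice(outside)          # uniform draw from the corpus vocabulary
    slot = rng.randrange(len(top10))        # which of the 10 words to replace
    top10[slot] = intruder
    rng.shuffle(top10)                      # hide the intruder's position
    return top10, intruder
```

An assessor sees the shuffled list; the item is scored correct if they pick the recorded intruder.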


Experimental Assessment (cont.)

  • For each topic Ti, have 20 assessors try to find an intruder (20 different intruders). Repeat for both “sampled” and “raw” models.

    • i.e. 20 × 2 × 100 = 4,000 assessments

  • Asi is the percent of workers who correctly found the intruder in the ith topic of the sampled model; Ari is analogous for the raw model.

  • Testing the hypothesis Asi > Ari yields p < 0.001.
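The slide does not name the significance test used; as one illustrative option, a one-sided exact sign test over the per-topic accuracy pairs looks like this:

```python
import math

def paired_sign_test(a_sampled, a_raw):
    """One-sided exact sign test: does the sampled model beat the raw
    model on more topics than chance predicts? Returns the p-value,
    Pr(X >= wins) for X ~ Binomial(n, 0.5), with ties dropped."""
    wins = sum(1 for s, r in zip(a_sampled, a_raw) if s > r)
    losses = sum(1 for s, r in zip(a_sampled, a_raw) if s < r)
    n = wins + losses
    if n == 0:
        return 1.0
    return sum(math.comb(n, k) for k in range(wins, n + 1)) / 2 ** n
```

With 100 topic pairs, a consistent advantage for the sampled model would drive this p-value well below 0.001.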


Experimental Assessment (cont.)

  • For each topic Ti, have 20 workers subjectively assess the topic’s “coherence,” reporting on a 4-point Likert scale.


Current & Future Work

  • Testing breadth of coverage

  • Assessing the value of induced topics

  • Topic information for document priors in the language modeling IR framework [next slide]

  • Massive document expansion for improved language model estimation [under review]


Weak Topicality and Document Priors


Thank You

Miles Efron

Peter Organisciak

Katrina Fenlon

Graduate School of Library & Information Science

University of Illinois, Urbana-Champaign

ASIST 2011

New Orleans, LA

October 10, 2011

