
Presentation Transcript


  1. Topic Models Nam Khanh Tran (ntran@L3S.de) L3S Research Center 27. May 2014 1

  2. Acknowledgements • The slides are in part based on the following slides • “Probabilistic Topic Models”, David M. Blei, 2012 • “Topic Models”, Claudia Wagner, 2010 • ...and the papers • David M. Blei, Andrew Y. Ng, Michael I. Jordan: Latent Dirichlet Allocation. Journal of Machine Learning Research, 2003 • Steyvers and Griffiths: Probabilistic Topic Models, 2006 • David M. Blei, John D. Lafferty: Dynamic Topic Models. Proceedings of the 23rd International Conference on Machine Learning Nam Khanh Tran 2

  3. Outline • Introduction • Latent Dirichlet Allocation • Overview • The posterior distribution for LDA • Gibbs sampling • Beyond latent Dirichlet Allocation • Demo Nam Khanh Tran 3

  4. The problem with information • As more information becomes available, it becomes more difficult to find and discover what we need • We need new tools to help us organize, search, and understand these vast amounts of information Nam Khanh Tran 4

  5. Topic modeling • Topic modeling provides methods for automatically organizing, understanding, searching, and summarizing large electronic archives • Discover the hidden themes that pervade the collection • Annotate the documents according to those themes • Use annotations to organize, summarize, search, form predictions Nam Khanh Tran 5

  6. Discover topics from a corpus Nam Khanh Tran 6

  7. Model the evolution of topics over time Nam Khanh Tran 7

  8. Model connections between topics Nam Khanh Tran 8

  9. Image annotation Nam Khanh Tran 9

  10. Latent Dirichlet Allocation

  11. Latent Dirichlet Allocation • Introduction to LDA • The posterior distribution for LDA • Gibbs sampling Nam Khanh Tran 11

  12. Probabilistic modeling • Treat data as observations that arise from a generative probabilistic process that includes hidden variables • For documents, the hidden variables reflect the thematic structure of the collection • Infer the hidden structure using posterior inference • What are the topics that describe this collection? • Situate new data into the estimated model • How does the query or new document fit into the estimated topic structure? Nam Khanh Tran 12

  13. Intuition behind LDA Nam Khanh Tran 13

  14. Generative model Nam Khanh Tran 14

  15. The posterior distribution Nam Khanh Tran 15

  16. Topic Models (Steyvers, 2006) • 3 latent variables: • Topic distribution per doc (topic-doc matrix) • Word distribution per topic (word-topic matrix) • Topic assignment of each word (illustrated on the slide with Topic 1 and Topic 2)

  17. Topic models • Observed variables: • Word distribution per document • 3 latent variables: • Topic distribution per document: P(z) = θ(d) • Word distribution per topic: P(w | z) = φ(z) • Word-topic assignment: P(z | w) • Training: learn the latent variables on a training collection of documents • Test: predict the topic distribution θ(d) of an unseen document d

  18. Latent Dirichlet Allocation (LDA) • Advantage: we learn the topic distributions of a corpus, so we can predict the topic distribution of an unseen document of this corpus by observing its words • The hyperparameters α and β are corpus-level parameters and are only sampled once • (The plate diagram on the slide shows φ(z) drawn from P(φ(z) | β), words drawn from P(w | z, φ(z)), and plates over the number of documents and the number of words; a generative sketch follows below)
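The generative view on this slide can be written as a short simulation. A minimal sketch, assuming toy sizes and concrete hyperparameter values chosen here for illustration (they are not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, vocab_size, n_docs, doc_len = 2, 10, 5, 20   # toy sizes (illustrative)
alpha, beta = 0.5, 0.1                                  # corpus-level hyperparameters

# Sampled once per topic: word distribution phi^(z) ~ Dirichlet(beta)
phi = rng.dirichlet([beta] * vocab_size, size=n_topics)

docs = []
for d in range(n_docs):
    theta = rng.dirichlet([alpha] * n_topics)   # topic distribution theta^(d) for doc d
    words = []
    for _ in range(doc_len):
        z = rng.choice(n_topics, p=theta)       # choose a topic for this word position
        w = rng.choice(vocab_size, p=phi[z])    # draw the word from P(w | z, phi^(z))
        words.append(w)
    docs.append(words)
```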

  19. Matrix Representation of LDA • The word-topic matrix φ(z) and the topic-document matrix θ(d) are latent; the word-document matrix is observed

  20. Statistical Inference and Parameter Estimation • Key problem: compute the posterior distribution of the hidden variables given a document • The posterior distribution is intractable for exact inference (Blei, 2003); the figure on the slide marks the latent variables and the observed variables and priors (the posterior is written out below)
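Written out, using the notation θ, φ, z, w, α, β from the slides (a standard formulation consistent with Blei et al., 2003):

```latex
p(\theta, \phi, z \mid w, \alpha, \beta)
  \;=\; \frac{p(\theta, \phi, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}
```

The denominator, the marginal probability of the observed words, requires summing over every possible topic assignment z, which is what makes exact inference intractable.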

  21. Statistical Inference and Parameter Estimation • How can we estimate the posterior distribution of the hidden variables given a corpus of training documents? • Directly (e.g., via expectation maximization, variational inference, or expectation propagation algorithms) • Indirectly, i.e., estimate the posterior distribution over z (i.e., P(z)) • Gibbs sampling, a form of Markov chain Monte Carlo, is often used to estimate the posterior probability over a high-dimensional random variable z

  22. Gibbs Sampling • Generates a sequence of samples from the joint probability distribution of two or more random variables • Aim: compute the posterior distribution over the latent variable z • Prerequisite: we must know the conditional probability of z • P(zi = j | z-i, wi, di, ·)

  23. Gibbs Sampling for LDA • Random start • Iterative: for each word we compute • How dominant is topic z in doc d? (How often was topic z already used in doc d?) • How likely is the word for topic z? (How often was the word w already assigned to topic z?) • (A runnable sketch of this procedure follows below)
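As referenced above, a minimal collapsed Gibbs sampler that follows this procedure: random start, then repeatedly resampling each word token's topic from the current counts. This is an illustrative sketch, not the implementation behind the slides; the function name gibbs_lda, the default hyperparameters, and the toy corpus are assumptions made here:

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    rng = np.random.default_rng(seed)
    # Count matrices from slide 24: C^WT (words per topic) and C^DT (topics per doc)
    cwt = np.zeros((vocab_size, n_topics))
    cdt = np.zeros((len(docs), n_topics))
    # Random start: assign every word token to a random topic
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            cwt[w, z[d][i]] += 1
            cdt[d, z[d][i]] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = z[d][i]
                # Decrement the counts for the current assignment
                cwt[w, j] -= 1
                cdt[d, j] -= 1
                # How likely is word w for each topic?  x  How dominant is each topic in doc d?
                p = ((cwt[w, :] + beta) / (cwt.sum(axis=0) + vocab_size * beta)
                     * (cdt[d, :] + alpha) / (cdt[d, :].sum() + n_topics * alpha))
                j = rng.choice(n_topics, p=p / p.sum())   # sample the new topic
                z[d][i] = j
                cwt[w, j] += 1
                cdt[d, j] += 1
    return z, cwt, cdt

# Toy corpus: word ids 0-3 loosely "money" words, 4-7 loosely "river" words
docs = [[0, 1, 2, 3, 0, 1], [4, 5, 6, 7, 4, 5], [0, 1, 4, 5, 2, 6]]
z, cwt, cdt = gibbs_lda(docs, n_topics=2, vocab_size=8)
print(cdt)   # topic counts per document
```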

  24. Run Gibbs Sampling Example (1) • Random topic assignments (on the slide, each word token is tagged with topic 1 or topic 2) • 2 count matrices: • CWT: words per topic • CDT: topics per document

  25. Gibbs Sampling for LDA • Probability that topic j is chosen for word wi, conditioned on all other assigned topics of the words in this doc and all other observed variables • CWT counts the number of times word token wi was assigned to topic j across all docs • CDT counts the number of times topic j was already assigned to some word token in doc di • The product is unnormalized: divide the probability of assigning topic j to word wi by the sum over all T topics (the full conditional is written out below)
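The update described here is the standard collapsed Gibbs update from Steyvers and Griffiths (2006), with C^WT and C^DT the count matrices from slide 24 (excluding the current assignment of word token i), W the vocabulary size, and T the number of topics:

```latex
P(z_i = j \mid z_{-i}, w_i, d_i, \cdot)
  \;\propto\;
  \frac{C^{WT}_{w_i j} + \beta}{\sum_{w=1}^{W} C^{WT}_{w j} + W\beta}
  \;\cdot\;
  \frac{C^{DT}_{d_i j} + \alpha}{\sum_{t=1}^{T} C^{DT}_{d_i t} + T\alpha}
```

Dividing this quantity by its sum over all T topics gives the normalized probability used to draw the new topic.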

  26. Run Gibbs Sampling Example (2) • First iteration: • Decrement CDT and CWT for the current topic j • Sample a new topic from the current topic distribution of the doc (the slide shows the example counts before the update)

  27. Run Gibbs Sampling Example (2, continued) • Decrement CDT and CWT for the current topic j • Sample a new topic from the current topic distribution of the doc (the slide shows the updated count matrices)

  28. Run Gibbs Sampling Example (3) • The two factors: how often was topic j used in doc di, and how often were all other topics used in doc di • Hyperparameters: α = 50/T = 25 (for T = 2 topics) and β = 0.01 • Result: “Bank” is assigned to Topic 2
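A worked instance of this update with hypothetical counts (the actual counts in the slide's figure are not recoverable from the transcript): suppose "bank" was assigned 2 times to topic 1 and 40 times to topic 2 across the corpus, the per-topic totals are 50 and 60 word tokens, topics 1 and 2 were used 3 and 6 times for the other words of this document, the vocabulary size is W = 1000, and T = 2, α = 25, β = 0.01 as on the slide:

```latex
P(z = 1) \propto \frac{2 + 0.01}{50 + 1000 \cdot 0.01} \cdot \frac{3 + 25}{9 + 2 \cdot 25} \approx 0.016
\qquad
P(z = 2) \propto \frac{40 + 0.01}{60 + 1000 \cdot 0.01} \cdot \frac{6 + 25}{9 + 2 \cdot 25} \approx 0.30
```

After normalizing (dividing each value by their sum), topic 2 receives a probability of about 0.95, consistent with the slide's conclusion that "bank" is assigned to Topic 2.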

  29. Example inference

  30. Topics vs. words

  31. Visualizing a document • Use the posterior topic probabilities of each document and the posterior topic assignments to each word
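The posterior topic probabilities and topic assignments mentioned here are usually turned into point estimates of θ and φ from the final count matrices, as in Steyvers and Griffiths (2006):

```latex
\hat{\theta}^{(d)}_j = \frac{C^{DT}_{dj} + \alpha}{\sum_{t=1}^{T} C^{DT}_{dt} + T\alpha},
\qquad
\hat{\phi}^{(j)}_w = \frac{C^{WT}_{wj} + \beta}{\sum_{w'=1}^{W} C^{WT}_{w'j} + W\beta}
```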

  32. Document similarity • Two documents are similar if they assign similar probabilities to topics
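A minimal sketch of this idea: compare the estimated topic distributions θ of two documents with a distributional distance. The talk does not prescribe a particular measure; Jensen-Shannon distance (available in SciPy) is one common choice, and the θ vectors below are hypothetical values:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

theta_d1 = np.array([0.70, 0.20, 0.10])   # hypothetical topic proportions per document
theta_d2 = np.array([0.65, 0.25, 0.10])
theta_d3 = np.array([0.05, 0.15, 0.80])

print(jensenshannon(theta_d1, theta_d2))  # small distance -> similar documents
print(jensenshannon(theta_d1, theta_d3))  # large distance -> dissimilar documents
```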

  33. Beyond Latent Dirichlet Allocation Nam Khanh Tran 38

  34. Extending LDA • LDA is a simple topic model • It can be used to find topics that describe a corpus • Each document exhibits multiple topics • How can we build on this simple model of text? Nam Khanh Tran 39

  35. Extending LDA • LDA can be embedded in more complicated models, embodying further intuitions about the structure of the texts (e.g., account for syntax, authorship, dynamics, correlation, and other structure) • The data generating distribution can be changed. We can apply mixed-membership assumptions to many kinds of data (e.g., models of images, social networks, music, computer code and other types) • The posterior can be used in many ways (e.g., use inferences in IR, recommendation, similarity, visualization and other applications) Nam Khanh Tran 40

  36. Dynamic topic models Nam Khanh Tran 41

  37. Dynamic topic models Nam Khanh Tran 42

  38. Dynamic topic models Nam Khanh Tran 43

  39. Dynamic topic models Nam Khanh Tran 44

  40. Long tail of data Nam Khanh Tran 45

  41. Topic cropping • Pipeline: corpus collection via search, topic modeling using LDA, term selection (finding characteristic terms), topic inference based on the learned model • Example topics: Topic 1: team, colleagues, ...; Topic 2: process, planning, ...; Topic 3: shift, rework, ...; Topic 4: qualification, learning

  42. Implementations of LDA • There are many available implementations of topic modeling • LDA-C : A C implementation of LDA • Online LDA: A python package for LDA on massive data • LDA in R: Package in R for many topic models • Mallet: Java toolkit for statistical NLP Nam Khanh Tran 47
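As a usage illustration (not one of the four toolkits listed above), the widely used gensim Python package fits an LDA model in a few lines; the toy corpus and parameter values here are assumptions:

```python
from gensim import corpora, models

texts = [["bank", "money", "loan"],
         ["river", "bank", "water"],
         ["money", "loan", "interest"]]
dictionary = corpora.Dictionary(texts)               # map words to integer ids
corpus = [dictionary.doc2bow(t) for t in texts]      # bag-of-words representation

lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=50)
print(lda.print_topics())
# Topic distribution theta for an unseen document
print(lda.get_document_topics(dictionary.doc2bow(["bank", "loan"])))
```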

  43. Demo Nam Khanh Tran 48

  44. Discussion Nam Khanh Tran 49
