
Probabilistic Topic Models

Presentation Transcript


  1. Probabilistic Topic Models ChengXiang Zhai Department of Computer Science Graduate School of Library & Information Science Institute for Genomic Biology Department of Statistics University of Illinois, Urbana-Champaign http://www.cs.illinois.edu/homes/czhai czhai@illinois.edu

  2. Outline • General Idea of Topic Models • Basic Topic Models • Probabilistic Latent Semantic Analysis (PLSA) • Latent Dirichlet Allocation (LDA) • Applications of Basic Topic Models to Text Mining • Advanced Topic Models • Capturing Topic Structures • Contextualized Topic Models • Supervised Topic Models • Summary We are here

  3. Document as a Sample of Mixed Topics [ Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response ] to the [ flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated ] … [ Over seventy countries pledged monetary donations or other assistance ]. … Topic 1: government 0.3, response 0.2, … Topic 2: city 0.2, new 0.1, orleans 0.05, … Topic k: donate 0.1, relief 0.05, help 0.02, … Background: is 0.05, the 0.04, a 0.03, … • How can we discover these topic word distributions? • Many applications would be enabled by discovering such topics • Summarize themes/aspects • Facilitate navigation/browsing • Retrieve documents • Segment documents • Many other text mining tasks
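
To make the idea of a document as a sample of mixed topics concrete, here is a minimal sketch (not part of the original slides) that samples a short document from a few hand-made topic word distributions; all topic names, words, probabilities, and mixing weights below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Hand-made topic word distributions (illustrative, not the slide's exact numbers).
topics = {
    "government_response": {"government": 0.5, "response": 0.3, "criticism": 0.2},
    "new_orleans":         {"city": 0.4, "new": 0.3, "orleans": 0.3},
    "donations":           {"donate": 0.5, "relief": 0.3, "help": 0.2},
    "background":          {"the": 0.5, "is": 0.3, "a": 0.2},
}
# Topic coverage for one document (must sum to 1).
coverage = {"government_response": 0.4, "new_orleans": 0.3,
            "donations": 0.2, "background": 0.1}

def sample_document(n_words):
    """Sample each word by first picking a topic, then a word from that topic."""
    doc = []
    topic_names = list(coverage)
    topic_probs = [coverage[t] for t in topic_names]
    for _ in range(n_words):
        t = topic_names[rng.choice(len(topic_names), p=topic_probs)]
        words = list(topics[t])
        probs = [topics[t][w] for w in words]
        doc.append(words[rng.choice(len(words), p=probs)])
    return doc

print(" ".join(sample_document(20)))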

  4. Simplest Case: 1 Topic + 1 “Background” Assume words in d are from two distributions: 1 topic + 1 background (rather than just one). General Background English Text (background LM p(w|θB)): the 0.03, a 0.02, is 0.015, we 0.01, …, food 0.003, computer 0.00001, …, text 0.000006, … Text mining paper d (document LM p(w|d)): the 0.031, a 0.018, …, text 0.04, mining 0.035, association 0.03, clustering 0.005, computer 0.0009, …, food 0.000001, … How can we “get rid of” the common words from the topic to make it more discriminative?

  5. The Simplest Case: One Topic + One Background Model. Each word w in document d is generated either as a background word, with probability λ, from p(w|θB), or as a topic word, with probability 1-λ, from p(w|θ). Assume p(w|θB) and λ are known; λ = assumed percentage of background words in d (the hidden topic choice for each word). The topic model p(w|θ) is estimated by Maximum Likelihood.
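
In symbols, the one-topic-plus-background model on this slide can be written as follows (a standard reconstruction; the original equation image is not preserved in the transcript). Here c(w,d) is the count of word w in d and V is the vocabulary:

p(w \mid d) = \lambda\, p(w \mid \theta_B) + (1-\lambda)\, p(w \mid \theta)

\log p(d \mid \theta) = \sum_{w \in V} c(w,d)\, \log\left[ \lambda\, p(w \mid \theta_B) + (1-\lambda)\, p(w \mid \theta) \right]

\hat{\theta} = \arg\max_{\theta} \log p(d \mid \theta)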

  6. Understanding a Mixture Model. Known Background p(w|θB): the 0.2, a 0.1, we 0.01, to 0.02, …, text 0.0001, mining 0.00005, … Unknown query topic p(w|θ)=? “Text mining”: text =? mining =? association =? word =? … Suppose each model would be selected with equal probability, λ=0.5. The probability of observing word “text”: λp(“text”|θB) + (1-λ)p(“text”|θ) = 0.5*0.0001 + 0.5*p(“text”|θ). The probability of observing word “the”: λp(“the”|θB) + (1-λ)p(“the”|θ) = 0.5*0.2 + 0.5*p(“the”|θ). The probability of observing “the” & “text” (likelihood): [0.5*0.0001 + 0.5*p(“text”|θ)] × [0.5*0.2 + 0.5*p(“the”|θ)]. How to set p(“the”|θ) and p(“text”|θ) so as to maximize this likelihood? Assume p(“the”|θ) + p(“text”|θ) = constant; then give p(“text”|θ) a higher probability than p(“the”|θ) (why?). θB and θ are competing to explain the words in document d!

  7. Simplest Case Continued: How to Estimate θ? Known Background p(w|θB): the 0.2, a 0.1, we 0.01, to 0.02, …, text 0.0001, mining 0.00005, … (selected with probability λ=0.7). Unknown query topic p(w|θ)=? “Text mining”: text =? mining =? association =? word =? … (selected with probability 1-λ=0.3). Observed words. Suppose we know the identity/label of each word; then θ can be obtained directly with the ML Estimator …

  8. Can We Guess the Identity? Identity (“hidden”) variable: zi ∈ {1 (background), 0 (topic)}. Example word sequence with guessed zi values: the (1) paper (1) presents (1) a (1) text (0) mining (0) algorithm (0) the (1) paper (0) … Suppose the parameters are all known; what’s a reasonable guess of zi? - depends on λ (why?) - depends on p(w|θB) and p(w|θ) (how?) Guessing the zi is the E-step; re-estimating the parameters given the guesses is the M-step. Initially, set p(w|θ) to some random values, then iterate …

  9. An Example of EM Computation. Assume λ=0.5. Expectation-Step: augment the data by guessing the hidden variables. Maximization-Step: with the “augmented data”, estimate the parameters using maximum likelihood.
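
Below is a minimal sketch of this EM computation for the one-topic-plus-background model, assuming λ = 0.5; the background probabilities and word counts are made-up toy values rather than the table on the original slide.

import numpy as np

lam = 0.5  # assumed proportion of background words
# Known background model p(w|θB) and observed word counts (illustrative values).
p_bg   = {"the": 0.5, "paper": 0.3, "text": 0.1, "mining": 0.1}
counts = {"the": 4, "paper": 2, "text": 3, "mining": 3}
words = list(counts)

# Initialize p(w|θ) randomly.
rng = np.random.default_rng(0)
p_topic = dict(zip(words, rng.dirichlet(np.ones(len(words)))))

for it in range(20):
    # E-step: probability that each word occurrence came from the topic (z = 0).
    p_z_topic = {
        w: (1 - lam) * p_topic[w] / ((1 - lam) * p_topic[w] + lam * p_bg[w])
        for w in words
    }
    # M-step: re-estimate p(w|θ) from the fractional counts assigned to the topic.
    frac = {w: counts[w] * p_z_topic[w] for w in words}
    total = sum(frac.values())
    p_topic = {w: frac[w] / total for w in words}

print({w: round(p, 3) for w, p in p_topic.items()})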

  10. Outline • General Idea of Topic Models • Basic Topic Models • Probabilistic Latent Semantic Analysis (PLSA) • Latent Dirichlet Allocation (LDA) • Applications of Basic Topic Models to Text Mining • Advanced Topic Models • Capturing Topic Structures • Contextualized Topic Models • Supervised Topic Models • Summary We are here

  11. Discover Multiple Topics in a Collection. “Generating” word w in doc d in the collection: with probability λB (percentage of background words) the word is drawn from the background model θB (is 0.05, the 0.04, a 0.03, ..); with probability 1-λB it is drawn from one of k topics, choosing topic θj with probability πd,j (the topic coverage in document d). Example topics: Topic 1 (θ1): warning 0.3, system 0.2, .. Topic 2 (θ2): aid 0.1, donation 0.05, support 0.02, .. Topic k (θk): statistics 0.2, loss 0.1, dead 0.05, .. Parameters: Λ = (λB, {πd,j}, {θj}), where πd,j = coverage of topic θj in doc d and p(w|θj) = prob. of word w in topic θj. These can be estimated using the ML Estimator.
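
Written out (a standard reconstruction in the slide's notation, since the equation image is not preserved in the transcript):

p_d(w) = \lambda_B\, p(w \mid \theta_B) + (1-\lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j)

\log p(C \mid \Lambda) = \sum_{d \in C} \sum_{w \in V} c(w,d)\, \log p_d(w)

subject to \sum_{j=1}^{k} \pi_{d,j} = 1 for every document d and \sum_{w \in V} p(w \mid \theta_j) = 1 for every topic j.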

  12. Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI) [Hofmann 99a, 99b] • Mix k multinomial distributions to generate a document • Each document has a potentially different set of mixing weights which captures the topic coverage • When generating words in a document, each word may be generated using a DIFFERENT multinomial distribution (this is in contrast with the document clustering model where, once a multinomial distribution is chosen, all the words in a document would be generated using the same multinomial distribution) • By fitting the model to text data, we can estimate (1) the topic coverage in each document, and (2) word distribution for each topic, thus achieving “topic mining”

  13. How to Estimate Multiple Topics? (Expectation-Maximization) Observed words; Known Background p(w|θB): the 0.2, a 0.1, we 0.01, to 0.02, … Unknown topic model p(w|θ1)=? “Text mining”: text =? mining =? association =? word =? … Unknown topic model p(w|θ2)=? “information retrieval”: information =? retrieval =? query =? document =? … E-Step: Predict topic labels using Bayes Rule. M-Step: Max. Likelihood Estimator based on “fractional counts”.

  14. Parameter Estimation. E-Step: Word w in doc d is generated either from cluster (topic) j or from the background; apply Bayes rule to compute the probability of each case. M-Step: Re-estimate the mixing weights (topic coverage) and the topic LMs from the fractional counts: the fraction of c(w,d) attributed to cluster j contributes both to re-estimating how much d uses cluster j and to re-estimating how likely cluster j generates w; for the topic LMs, sum over all docs in the collection.
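
The E-step and M-step equations referred to above are not preserved in this transcript; the standard formulas for PLSA with a background model (a hedged reconstruction in the slides' notation, with z_{d,w} the hidden variable indicating which distribution generated word w in doc d) are:

E-step:

p(z_{d,w} = j) = \frac{\pi_{d,j}\, p(w \mid \theta_j)}{\sum_{j'=1}^{k} \pi_{d,j'}\, p(w \mid \theta_{j'})}

p(z_{d,w} = B) = \frac{\lambda_B\, p(w \mid \theta_B)}{\lambda_B\, p(w \mid \theta_B) + (1-\lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j)}

M-step (re-estimation from fractional counts):

\pi_{d,j} \propto \sum_{w \in V} c(w,d)\,\bigl(1 - p(z_{d,w}=B)\bigr)\, p(z_{d,w}=j)

p(w \mid \theta_j) \propto \sum_{d \in C} c(w,d)\,\bigl(1 - p(z_{d,w}=B)\bigr)\, p(z_{d,w}=j)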

  15. How the Algorithm Works. Toy data c(w,d) for two documents and two topics: d1: aid 7, price 5, oil 6; d2: aid 8, price 7, oil 5. Initialize πd,j ( = P(θj|d) ) and P(w|θj) with random values. Iteration 1, E-Step: split the word counts between the topics (by computing the z’s). Iteration 1, M-Step: re-estimate πd,j and P(w|θj) by adding and normalizing the split counts c(w,d)(1 - p(zd,w = B))p(zd,w = j). Iteration 2: repeat the E-Step and M-Step with the new estimates. Iterations 3, 4, 5, … until convergence.
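
A minimal sketch of this iteration on the slide's toy counts (two documents, two topics, vocabulary {aid, price, oil}); the background component is dropped (λB = 0) to keep the sketch short, and the random initialization is an assumption.

import numpy as np

rng = np.random.default_rng(1)
vocab = ["aid", "price", "oil"]
# c(w, d) from the slide: d1: aid 7, price 5, oil 6; d2: aid 8, price 7, oil 5.
counts = np.array([[7, 5, 6],
                   [8, 7, 5]], dtype=float)               # shape (docs, words)
n_docs, n_words, n_topics = 2, 3, 2

# Random initialization of topic coverage πd,j and topic word distributions p(w|θj).
pi = rng.dirichlet(np.ones(n_topics), size=n_docs)         # (docs, topics)
theta = rng.dirichlet(np.ones(n_words), size=n_topics)     # (topics, words)

for it in range(50):
    # E-step: p(z_{d,w} = j) proportional to πd,j p(w|θj).
    z = pi[:, :, None] * theta[None, :, :]                 # (docs, topics, words)
    z /= z.sum(axis=1, keepdims=True)
    # M-step: split the counts by z, then add and normalize.
    frac = counts[:, None, :] * z                          # fractional counts
    pi = frac.sum(axis=2)
    pi /= pi.sum(axis=1, keepdims=True)
    theta = frac.sum(axis=0)
    theta /= theta.sum(axis=1, keepdims=True)

print("pi (topic coverage per doc):\n", np.round(pi, 3))
print("p(w|topic):\n", np.round(theta, 3))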

  16. PLSA with Prior Knowledge • Users have some domain knowledge in mind, e.g., • We expect to see “retrieval models” as a topic in IR literature • We want to see aspects such as “battery” and “memory” for opinions about a laptop • One topic should be fixed to model background words (infinitely strong prior!) • We can easily incorporate such knowledge as priors of PLSA model

  17. Adding Prior: Maximum a Posteriori (MAP) Estimation. The generative model is the same as before: “generating” word w in doc d in the collection from the background θB (is 0.05, the 0.04, a 0.03, ..) with probability λB, or from topic θj (Topic 1: warning 0.3, system 0.2, ..; Topic 2: aid 0.1, donation 0.05, support 0.02, ..; Topic k: statistics 0.2, loss 0.1, dead 0.05, ..) with probability (1-λB)πd,j. Now we seek the most likely θ under a prior; a prior can be placed on π as well (more about this later). Parameters: λB = noise level (manually set); the θ’s and π’s are estimated with Maximum A Posteriori (MAP).

  18. Adding Prior as Pseudo Counts. Observed doc(s) plus a pseudo doc of size μ containing the prior words (e.g., “text mining”). Known Background p(w|θB): the 0.2, a 0.1, we 0.01, to 0.02, … Unknown topic model p(w|θ1)=? “Text mining”: text =? mining =? association =? word =? … Unknown topic model p(w|θ2)=? “information retrieval”: information =? retrieval =? query =? document =? … Suppose we know the identity of each word; the MAP Estimator then combines the observed counts with the pseudo counts.

  19. Maximum A Posteriori (MAP) Estimation. The M-step formula gains a term +μp(w|θ’j) in the numerator (the pseudo counts of w from the prior θ’) and +μ in the denominator (the sum of all pseudo counts). What if μ=0? What if μ=+∞? A consequence of using a conjugate prior is that the prior can be converted into “pseudo data” which can then be “merged” with the actual data for parameter estimation.
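
With a conjugate (Dirichlet) prior whose mean is θ’j and whose strength is μ, the modified M-step reads (a reconstruction of the formula whose image is missing above):

p(w \mid \theta_j) = \frac{\sum_{d \in C} c(w,d)\,\bigl(1 - p(z_{d,w}=B)\bigr)\, p(z_{d,w}=j) + \mu\, p(w \mid \theta'_j)}{\sum_{w' \in V} \sum_{d \in C} c(w',d)\,\bigl(1 - p(z_{d,w'}=B)\bigr)\, p(z_{d,w'}=j) + \mu}

so μ = 0 recovers the ML estimate, while μ → +∞ forces p(w|θj) → p(w|θ’j).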

  20. A General Introduction to EM. Data: X (observed) + H (hidden); Parameter: θ. “Incomplete” likelihood: L(θ) = log p(X|θ). “Complete” likelihood: Lc(θ) = log p(X,H|θ). EM tries to iteratively maximize the incomplete likelihood: starting with an initial guess θ(0), 1. E-step: compute the expectation of the complete likelihood (the Q-function); 2. M-step: compute θ(n) by maximizing the Q-function.
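
Concretely, the two steps are (standard EM definitions, stated here because the equation images are not in the transcript):

Q(\theta;\ \theta^{(n-1)}) = E_{H \sim p(H \mid X, \theta^{(n-1)})}\bigl[ L_c(\theta) \bigr] = \sum_{H} p(H \mid X, \theta^{(n-1)})\, \log p(X, H \mid \theta)

\theta^{(n)} = \arg\max_{\theta}\ Q(\theta;\ \theta^{(n-1)})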

  21. Convergence Guarantee. Goal: maximize the “incomplete” likelihood L(θ) = log p(X|θ), i.e., choose θ(n) so that L(θ(n)) - L(θ(n-1)) ≥ 0. Note that, since p(X,H|θ) = p(H|X,θ) p(X|θ), we have L(θ) = Lc(θ) - log p(H|X,θ), so L(θ(n)) - L(θ(n-1)) = Lc(θ(n)) - Lc(θ(n-1)) + log [p(H|X,θ(n-1)) / p(H|X,θ(n))]. Taking the expectation w.r.t. p(H|X,θ(n-1)) (the left side doesn’t contain H): L(θ(n)) - L(θ(n-1)) = Q(θ(n); θ(n-1)) - Q(θ(n-1); θ(n-1)) + D(p(H|X,θ(n-1)) || p(H|X,θ(n))). EM chooses θ(n) to maximize Q, and the KL-divergence term is always non-negative; therefore, L(θ(n)) ≥ L(θ(n-1))!

  22. EM as Hill-Climbing: Converging to a Local Maximum. The likelihood p(X|θ) satisfies L(θ) = L(θ(n-1)) + Q(θ; θ(n-1)) - Q(θ(n-1); θ(n-1)) + D(p(H|X,θ(n-1)) || p(H|X,θ)) ≥ L(θ(n-1)) + Q(θ; θ(n-1)) - Q(θ(n-1); θ(n-1)), the lower bound (Q function). E-step = computing the lower bound at the current guess; M-step = maximizing the lower bound to obtain the next guess.

  23. Deficiency of PLSA • Not a generative model • Can’t compute the probability of a new document • A heuristic workaround is possible, though • Many parameters → high complexity of the model • Many local maxima • Prone to overfitting • Not necessarily a problem for text mining (we are only interested in fitting the “training” documents)

  24. Latent Dirichlet Allocation (LDA) [Blei et al. 02] • Make PLSA a generative model by imposing Dirichlet priors on the model parameters • LDA = Bayesian version of PLSA • Parameters are regularized • Can achieve the same goal as PLSA for text mining purposes • Topic coverage and topic word distributions can be inferred using Bayesian inference

  25. LDA = Imposing Prior on PLSA. PLSA: “Generating” word w in doc d in the collection via topic coverage πd,1, …, πd,k over topics θ1, …, θk; the topic coverage πd,j is specific to the “training documents” and thus can’t be used to generate a new document; {πd,j} are free parameters for tuning. LDA: the topic coverage distribution {πd,j} for any document is sampled from a Dirichlet distribution, allowing generation of a new doc; {πd,j} are regularized. In addition, the topic word distributions {θj} are also drawn from another Dirichlet prior. The magnitudes of α and β determine the variances of the priors, and thus the strength of the prior (larger α and β → stronger prior).

  26. Equations for PLSA vs. LDA. The core assumption in all topic models (each word is drawn from a mixture of topic word distributions) is the PLSA component; what LDA adds is the Dirichlet priors over the topic coverage and the topic word distributions, which are integrated out in the likelihood.
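
The equations themselves are not preserved in this transcript; a hedged reconstruction of the standard forms in the slides' notation is:

PLSA (core assumption plus document-specific free parameters):

p_d(w \mid \{\theta_j\}, \{\pi_{d,j}\}) = \lambda_B\, p(w \mid \theta_B) + (1-\lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j)

\log p(d \mid \{\theta_j\}, \{\pi_{d,j}\}) = \sum_{w \in V} c(w,d)\, \log p_d(w \mid \{\theta_j\}, \{\pi_{d,j}\})

LDA (Dirichlet priors added and integrated out):

\log p(d \mid \alpha, \{\theta_j\}) = \log \int \prod_{w \in V} p_d(w \mid \{\theta_j\}, \{\pi_{d,j}\})^{c(w,d)}\; p(\{\pi_{d,j}\} \mid \alpha)\; d\{\pi_{d,j}\}

\log p(C \mid \alpha, \beta) = \log \int \prod_{d \in C} p(d \mid \alpha, \{\theta_j\})\; \prod_{j=1}^{k} p(\theta_j \mid \beta)\; d\{\theta_j\}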

  27. Parameter Estimation & Inferences in LDA. Parameter estimation can be done in the same way as in PLSA: Maximum Likelihood Estimator. However, the topic coverage and topic assignments must now be computed using posterior inference, which is computationally intractable; we must resort to approximate inference!

  28. LDA as a Graphical Model [Blei et al. 03a]. θ(d) ~ Dirichlet(α): distribution over topics for each document (same as πd on the previous slides). φ(j) ~ Dirichlet(β), j = 1..T: distribution over words for each topic (same as θj on the previous slides). zi ~ Discrete(θ(d)): topic assignment for each word. wi ~ Discrete(φ(zi)): word generated from the assigned topic. There are Nd words per document and D documents; α and β are the Dirichlet priors. Most approximate inference algorithms aim to infer z, from which the other interesting variables can be easily computed.
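
A minimal sketch of this generative process (the corpus sizes and hyperparameters below are assumptions chosen only for illustration):

import numpy as np

rng = np.random.default_rng(0)
T, W, D, Nd = 3, 8, 5, 20          # topics, vocabulary size, docs, words per doc (assumed)
alpha, beta = 0.5, 0.1             # Dirichlet hyperparameters (assumed)

# phi[j] ~ Dirichlet(beta): distribution over words for topic j.
phi = rng.dirichlet(np.full(W, beta), size=T)            # (T, W)

docs = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(T, alpha))           # topic proportions for doc d
    z = rng.choice(T, size=Nd, p=theta_d)                # topic assignment for each word
    w = np.array([rng.choice(W, p=phi[zi]) for zi in z]) # word drawn from assigned topic
    docs.append((z, w))

print(docs[0])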

  29. Approximate Inferences for LDA • Many different ways; each has its pros & cons • Deterministic approximation • variational EM [Blei et al. 03a] • expectation propagation [Minka & Lafferty 02] • Markov chain Monte Carlo • full Gibbs sampler [Pritchard et al. 00] • collapsed Gibbs sampler [Griffiths & Steyvers 04] Most efficient, and quite popular, but can only work with conjugate prior

  30. The collapsed Gibbs sampler [Griffiths & Steyvers 04] • Using conjugacy of Dirichlet and multinomial distributions, integrate out continuous parameters • Defines a distribution on discrete ensembles z

  31. The collapsed Gibbs sampler [Griffiths & Steyvers 04] • Sample each zi conditioned on z-i • This is nicer than your average Gibbs sampler: • memory: counts can be cached in two sparse matrices • optimization: no special functions, simple arithmetic • the distributions on θ and φ are analytic given z and w, and can later be found for each sample
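
The conditional distribution being sampled (its formula image is not in the transcript; this is the standard collapsed-Gibbs update, consistent with the count annotations on slide 34) is:

P(z_i = j \mid z_{-i}, w) \propto \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}

where n^{(w_i)}_{-i,j} is the number of times word wi is assigned to topic j (excluding position i), n^{(\cdot)}_{-i,j} is the count of all words assigned to topic j, n^{(d_i)}_{-i,j} is the number of words in doc di assigned to topic j, n^{(d_i)}_{-i,\cdot} is the number of words in di assigned to any topic, W is the vocabulary size, and T is the number of topics.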

  32. Gibbs sampling in LDA iteration 1

  33. Gibbs sampling in LDA iteration 1 2

  34. Gibbs sampling in LDA (iterations 1, 2). The quantities used in the sampling formula: the count of instances where wi is assigned to topic j, the count of all words assigned to topic j, the number of words in di assigned to topic j, and the number of words in di assigned to any topic.

  35. Gibbs sampling in LDA iteration 1 2 What’s the most likely topic for wi in di? How likely would di choose topic j? How likely would topic j generate word wi?
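
The two questions above correspond to the two factors of the collapsed-Gibbs update. A minimal sampler sketch that implements them on a made-up toy corpus (alpha, beta, and all sizes are assumptions, not values from the slides):

import numpy as np

rng = np.random.default_rng(0)
T, alpha, beta = 2, 0.5, 0.1                     # topics and hyperparameters (assumed)
docs = [[0, 0, 1, 2, 0, 1], [3, 4, 4, 3, 2, 4]]  # toy corpus of word ids
W = 5                                            # vocabulary size

# Count matrices: topic-word counts, doc-topic counts, per-topic totals.
nwt = np.zeros((W, T)); ndt = np.zeros((len(docs), T)); nt = np.zeros(T)
z = []                                           # current topic assignment per position
for d, doc in enumerate(docs):
    zd = rng.choice(T, size=len(doc))
    z.append(zd)
    for w, j in zip(doc, zd):
        nwt[w, j] += 1; ndt[d, j] += 1; nt[j] += 1

for it in range(1000):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            j = z[d][i]                          # remove the current assignment
            nwt[w, j] -= 1; ndt[d, j] -= 1; nt[j] -= 1
            # How likely topic j generates w, times how likely doc d chooses topic j.
            p = (nwt[w] + beta) / (nt + W * beta) * (ndt[d] + alpha)
            p /= p.sum()
            j = rng.choice(T, p=p)               # resample and restore the counts
            z[d][i] = j
            nwt[w, j] += 1; ndt[d, j] += 1; nt[j] += 1

print("p(w|topic) estimate:\n", np.round(((nwt + beta) / (nt + W * beta)).T, 3))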

  36. Gibbs sampling in LDA iteration 1 2

  37. Gibbs sampling in LDA iteration 1 2

  38. Gibbs sampling in LDA iteration 1 2

  39. Gibbs sampling in LDA iteration 1 2

  40. Gibbs sampling in LDA iteration 1 2 … 1000

  41. Applications of Topic Models for Text Mining: Illustration with 2 Topics. Likelihood: each word is drawn from a two-topic mixture of p(w|θ1) and p(w|θ2) with coverage (mixing weight) π. Application scenarios, each with an example question: • p(w|θ1) & p(w|θ2) are known; estimate π: the doc is about text mining and food nutrition, what percentage is about text mining? • p(w|θ1) & π are known; estimate p(w|θ2): 30% of the doc is about text mining, what’s the rest about? • p(w|θ1) is known; estimate π & p(w|θ2): the doc is about text mining, is it also about some other topic, and if so to what extent? • π is known; estimate p(w|θ1) & p(w|θ2): 30% of the doc is about one topic and 70% is about another, what are these two topics? • Estimate π, p(w|θ1), p(w|θ2): the doc is about two subtopics, find out what these two subtopics are and to what extent the doc covers each.

  42. Use PLSA/LDA for Text Mining • Both PLSA and LDA would be able to generate • Topic coverage in each document: πd,j • Word distribution for each topic: p(w|θj) • Topic assignment at the word level for each document • The number of topics must be given in advance • These probabilities can be used in many different ways • θj naturally serves as a word cluster • πd,j can be used for document clustering • Contextual text mining: make these parameters conditioned on context, e.g., • p(θj|time), from which we can compute/plot p(time|θj) (see the sketch after this list) • p(θj|location), from which we can compute/plot p(loc|θj)
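
For instance, once the per-document coverages πd,j and document timestamps are available, a topic's temporal trend p(time|θj) can be computed by aggregating coverage mass per time period. A minimal sketch with made-up coverages and years (all values below are assumptions, not real output of a fitted model):

import numpy as np

# Assumed inputs: topic coverage πd,j for 6 docs x 2 topics, and each doc's year.
pi = np.array([[0.9, 0.1], [0.7, 0.3], [0.5, 0.5],
               [0.4, 0.6], [0.2, 0.8], [0.1, 0.9]])
years = np.array([2001, 2001, 2002, 2002, 2003, 2003])

# p(time | θj) is proportional to the sum of θj's coverage over docs from that time.
for j in range(pi.shape[1]):
    mass = {y: pi[years == y, j].sum() for y in np.unique(years)}
    total = sum(mass.values())
    trend = {int(y): round(m / total, 2) for y, m in mass.items()}
    print(f"topic {j}: p(time|topic) = {trend}")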

  43. Sample Topics from TDT Corpus [Hofmann 99b]

  44. How to Help Users Interpret a Topic Model? [Mei et al. 07b] Example topic: term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, frequent 0.02, probabilistic 0.02, document 0.02, … • Use top words (“term, relevance, weight, feedback”): automatic, but hard to make sense of (compare a topic whose top words are insulin, foraging, foragers, collected, grains, loads, collection, nectar, …) • Human-generated labels (e.g., “Retrieval Models”): make sense, but cannot scale up • Question: Can we automatically generate understandable labels for topics?

  45. Automatic Labeling of Topics [Mei et al. 07b]. Input: statistical (multinomial) topic models, e.g., term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, … Step 1: build a candidate label pool from the collection (context) using an NLP chunker and n-gram statistics, e.g., database system, clustering algorithm, r tree, functional dependency, iceberg cube, concurrency control, index structure, … Step 2: rank the candidates by a relevance score against the topic, then re-rank for coverage and discrimination, producing a ranked list of labels, e.g., clustering algorithm; distance measure; …

  46. Relevance: the Zero-Order Score • Intuition: prefer phrases that cover the top words of the topic well. Example: for a latent topic θ with p(“clustering”|θ) = 0.4, p(“dimensional”|θ) = 0.3, … (other top words: algorithm, birch, …), p(“shape”|θ) = 0.01, p(“body”|θ) = 0.001, the good label is l1 = “clustering algorithm” and the bad label is l2 = “body shape”, since the words of l1 have much higher p(w|θ) than the words of l2.

  47. Relevance: the First-Order Score • Intuition: prefer phrases whose context (word distribution) is similar to the topic’s distribution P(w|θ). Example: the topic θ puts mass on clustering, dimension, partition, algorithm, …; the context distribution p(w | “clustering algorithm”), estimated from the SIGMOD Proceedings, puts mass on similar words, while p(w | “hash join”) puts mass on hash, join, key, … Since D(θ || “clustering algorithm”) < D(θ || “hash join”), the good label is l1 = “clustering algorithm” and the bad label is l2 = “hash join”; the label score Score(l, θ) is based on this comparison of distributions.
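
A minimal sketch of this first-order idea: rank candidate labels by (negative) KL divergence between the topic's word distribution and each label's context distribution. All distributions below are made-up toy values, and the scoring is a simplification of the actual method in [Mei et al. 07b].

import math

def kl(p, q, eps=1e-6):
    """KL divergence D(p || q) over the union vocabulary, with crude smoothing."""
    vocab = set(p) | set(q)
    return sum(p.get(w, eps) * math.log(p.get(w, eps) / q.get(w, eps)) for w in vocab)

# Topic word distribution p(w|θ) (toy values).
topic = {"clustering": 0.4, "dimension": 0.3, "partition": 0.2, "algorithm": 0.1}
# Context distributions of candidate labels, estimated from the collection (toy values).
label_context = {
    "clustering algorithm": {"clustering": 0.35, "dimension": 0.25,
                             "partition": 0.2, "algorithm": 0.2},
    "hash join":            {"hash": 0.5, "join": 0.3, "key": 0.2},
}

# Prefer the label whose context distribution is closest to the topic.
scores = {label: -kl(topic, ctx) for label, ctx in label_context.items()}
for label, s in sorted(scores.items(), key=lambda x: -x[1]):
    print(f"{label}: score = {s:.2f}")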

  48. Results: Sample Topic Labels. Sample topics with automatically generated labels: • sampling 0.06, estimation 0.04, approximate 0.04, histograms 0.03, selectivity 0.03, histogram 0.02, answers 0.02, accurate 0.02 → “selectivity estimation” • north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh 0.009, reagan 0.009, charges 0.007 → “iran contra” • the, of, a, and, to, data (each > 0.02), … → (background words) • clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality 0.005 → “clustering algorithm; clustering structure” (other candidates: large data, data quality, high data, data application, …) • tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01 → “r tree; b tree; …” (indexing methods)

  49. Results: Context-Sensitive Labeling. The same topic receives different labels in different contexts. Topic: sampling, estimation, approximation, histogram, selectivity, histograms, …: in the Database context (SIGMOD Proceedings) it is labeled “selectivity estimation; random sampling; approximate answers”, while in the IR context (SIGIR Proceedings) it is labeled “distributed retrieval; parameter estimation; mixture models”. Topic: dependencies, functional, cube, multivalued, iceberg, buc, …: in the Database context it is labeled “multivalue dependency; functional dependency; iceberg cube”, while in the IR context it is labeled “term dependency; independence assumption”.

  50. Using PLSA to Discover Temporal Topic Trends [Mei & Zhai 05]. Example topics whose coverage is tracked over time: • gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, … • marketing 0.0087, customer 0.0086, model 0.0079, business 0.0048, … • rules 0.0142, association 0.0064, support 0.0053, …
