Statistical modeling of text

Statistical Modeling of Text

(Can be seen as an application of probabilistic graphical models)


Modeling Text Documents

  • Documents are sequences of words

    • D = [p1=w1 p2=w2 p3=w3 … pk=wk]

      • Where wi are drawn from some vocabulary V

    • So P(D) = P([p1=w1 p2=w2 p3=w3 … pk=wk])

    • Needs a very high-dimensional joint probability distribution..

  • Let us make assumptions


Unigram Model

  • Assume that all words occur independently (!)

    P(pi=wi, pk=wk) = P(pi=wi) · P(pk=wk)

    P(pi=wi, pk=wi) = P(pi=wi) · P(pk=wi) = P(wi)^2

    • P([p1=w1 p2=w2 p3=w3 … pk=wk]) = P(w1)^#(w1) · P(w2)^#(w2) · …

Note that this way the probability of occurrence of a word is the same in EVERY DOCUMENT…

-- Goes a little overboard..
-- Words in neighboring positions tend to be correlated (bigram models; trigram models)
-- Different documents tend to have different topics… (topic models)

(A small code sketch of the unigram model follows.)
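To make the unigram model concrete, here is a minimal sketch with a toy corpus and add-one smoothing of my own choosing (not from the slides): estimate P(w) by counting, then score a document as the product of its word probabilities, computed in log space.

```python
from collections import Counter
from math import log

# Toy corpus; in practice these would be real documents over a vocabulary V.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
]

# Estimate P(w) by relative frequency with add-one smoothing.
counts = Counter(w for doc in corpus for w in doc)
vocab = set(counts)
total = sum(counts.values())

def p_word(w):
    return (counts[w] + 1) / (total + len(vocab))

def log_p_doc(doc):
    # Unigram assumption: words are independent, so
    # log P(D) = sum over positions of log P(w_i)
    return sum(log(p_word(w)) for w in doc)

print(log_p_doc("the cat sat".split()))
```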


Single Topic Model

  • Assume each document has a topic z

  • The topic z determines the probabilities of the word occurrence

    • P([p1=w1 p2=w2 p3=w3 … pk=wk] | z) = P(w1|z)^#(w1) · P(w2|z)^#(w2) · …

Connection to candies? Lime and cherry are words; bag types (h1..h5) are topics: you see candies, you guess bag types..

..Still not quite right.. Each document is really a mixture of topics..

The “supervised” version of this model is the Naïve Bayes classifier. (A small sketch follows.)
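As a concrete illustration of the single-topic model (and its supervised twin, the Naïve Bayes classifier), here is a small sketch with made-up topics, priors, and word distributions; it computes P(D|z) as a product of per-word probabilities and inverts it with Bayes rule to get P(z|D).

```python
import math
from collections import Counter

# Assumed toy parameters: two topics with made-up word distributions and priors.
p_topic = {"sports": 0.5, "politics": 0.5}
p_word_given_topic = {
    "sports":   {"game": 0.4, "team": 0.3, "vote": 0.1, "law": 0.2},
    "politics": {"game": 0.1, "team": 0.1, "vote": 0.5, "law": 0.3},
}

def log_p_doc_given_topic(doc, z):
    # P(D | z) = prod_w P(w|z)^{#(w)}
    counts = Counter(doc)
    return sum(n * math.log(p_word_given_topic[z][w]) for w, n in counts.items())

def posterior_over_topics(doc):
    # Bayes rule: P(z | D) is proportional to P(z) P(D | z)  (the Naive Bayes computation)
    scores = {z: math.log(p_topic[z]) + log_p_doc_given_topic(doc, z) for z in p_topic}
    m = max(scores.values())
    unnorm = {z: math.exp(s - m) for z, s in scores.items()}
    total = sum(unnorm.values())
    return {z: v / total for z, v in unnorm.items()}

print(posterior_over_topics("vote law vote game".split()))
```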


Bayesian document categorization

[Plate diagram: category Cat drawn from prior P(Cat); each word w drawn from P(w|Cat), repeated nD times per document, for D documents.]



Overview of Latent Semantic Indexing: a topic space?

Singular Value Decomposition: convert the doc-term matrix d-t into three matrices D-F, F-F, and T-F, where DF * FF * TF' gives the original matrix back.

  • Doc-factor D-F (d x f): eigenvectors of d-t * d-t'
  • Factor-factor F-F (f x f): positive square roots of the eigenvalues of d-t * d-t' (or d-t' * d-t; both are the same)
  • (Term-factor)' i.e. (T-F)' (f x t): eigenvectors of d-t' * d-t

So the d x t matrix factors as (d x f)(f x f)(f x t).

Reduce dimensionality: throw out low-order rows and columns, keeping only the top k factors, (d x k)(k x k)(k x t).

Recreate matrix: multiply to produce an approximate term-document matrix; d-t_k is the rank-k matrix that is closest to d-t.
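A minimal numpy sketch of this decomposition and the rank-k truncation, on a made-up doc-term matrix (the data and the choice of k are illustrative, not from the slides):

```python
import numpy as np

# Made-up doc-term count matrix (rows = documents, columns = terms).
dt = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 2, 0, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 1, 1, 2, 0],
], dtype=float)

# SVD: dt = DF @ diag(ff) @ TFt, with ff the singular values
# (positive square roots of the eigenvalues of dt @ dt.T, equivalently dt.T @ dt).
DF, ff, TFt = np.linalg.svd(dt, full_matrices=False)

# Reduce dimensionality: keep only the top k factors.
k = 2
DF_k, ff_k, TFt_k = DF[:, :k], ff[:k], TFt[:k, :]

# Recreate matrix: the closest rank-k approximation to dt.
dt_k = DF_k @ np.diag(ff_k) @ TFt_k

# Document coordinates in the k-dimensional factor space: DF_k scaled by ff_k.
doc_coords = DF_k * ff_k
print(np.round(dt_k, 2))
print(np.round(doc_coords, 2))
```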


New document coordinates

New document coordinates: d-f * f-f.

Example terms: t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear.

  • F-F: the 6 singular values (positive square roots of the eigenvalues of dd = dt*dt', equivalently of tt = dt'*dt)
  • D-F: eigenvectors of dd (dt*dt'), the principal document directions
  • T-F: eigenvectors of tt (dt'*dt), the principal term directions


For the database/regression example (terms t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear): suppose D1 is a new doc containing “database” 50 times and D2 contains “SQL” 50 times.
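A standard way to place such new documents into the factor space is the LSA “fold-in”, d_hat = d · T_k · S_k^{-1}; the sketch below uses that recipe on a made-up training matrix over the six terms (the exact computation on the slide may differ, so treat this as an assumption):

```python
import numpy as np

# Terms (columns): database, SQL, index, regression, likelihood, linear.
# Made-up training doc-term matrix: two "database" docs, two "regression" docs.
dt = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 0, 3, 2, 1],
    [0, 0, 1, 2, 3, 1],
], dtype=float)
DF, ff, TFt = np.linalg.svd(dt, full_matrices=False)
k = 2
TFt_k, ff_k = TFt[:k, :], ff[:k]

def fold_in(term_counts):
    # Standard LSA fold-in: project the new doc's term vector through the
    # term-factor matrix and rescale by the inverse singular values.
    return term_counts @ TFt_k.T / ff_k

D1 = np.array([50, 0, 0, 0, 0, 0], dtype=float)   # "database" x 50
D2 = np.array([0, 50, 0, 0, 0, 0], dtype=float)   # "SQL" x 50
print(np.round(fold_in(D1), 3))
print(np.round(fold_in(D2), 3))
# D1 and D2 share no terms, yet should land close together in factor space
# because "database" and "SQL" co-occur in the training documents.
```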


The pLSI Model (attempts to give probabilistic semantics to LSA)

For each word of document d in the training set,

  • Choose a topic z according to a multinomial conditioned on the index d.

  • Generate the word by drawing from a multinomial conditioned on z.

    LSA factors are linear combinations of terms; LDA topics are multinomial distributions over terms

[Graphical model: the document index d selects topic assignments zd1 … zd4, which generate the words wd1 … wd4.]

Probabilistic Latent Semantic Indexing (pLSI) Model

Can also be written in a symmetric way

P(d)P(z|d)P(w|z) = P(d|z) P(z) P(w|z)

[Slides from Jonathan Huang]
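A small sketch of this generative story with made-up parameter tables (in real pLSI, P(z|d) and P(w|z) are estimated from the training set, typically by EM; everything below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["heart", "love", "soul", "scientific", "knowledge", "work"]
# Assumed (made-up) parameters; pLSI would fit these to a training corpus.
p_z_given_d = np.array([[0.9, 0.1],    # document 0: mostly topic 0
                        [0.2, 0.8]])   # document 1: mostly topic 1
p_w_given_z = np.array([[0.4, 0.3, 0.3, 0.0, 0.0, 0.0],   # topic 0
                        [0.0, 0.0, 0.0, 0.4, 0.3, 0.3]])  # topic 1

def generate(d, n_words=8):
    words = []
    for _ in range(n_words):
        z = rng.choice(2, p=p_z_given_d[d])           # topic for this word position
        w = rng.choice(len(vocab), p=p_w_given_z[z])  # word drawn from that topic
        words.append(vocab[w])
    return words

print(generate(0))
print(generate(1))
```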


pLSI to LDA is a small technical step

  • First-order view: LDA is just the “Bayesian learning” version of pLSI (which typically estimates its parameters with MLE/MAP)

  • Other differences:

    • In pLSI, the observed variable d is an index into some training set. There is no natural way for the model to handle previously unseen documents.

    • The number of parameters for pLSI grows linearly with M (the number of documents in the training set).

    • We would like to be Bayesian about our topic mixture proportions.


Intuition behind LDA

[LDA slides from Blei’s MLSS 09 lecture]


Generative model

Importance of “sparsity”: we want a document to have more than one topic, but not really all the topics..

You can ensure sparsity by starting with a Dirichlet prior whose hyperparameter sum is low..

(You get interesting colors by combining primary colors, but if you combine them all you always get white..)

Note that we are assuming that contiguous words may come from different topics!



Unrolled LDA Model

  • For each document,
  • Choose θ ~ Dirichlet(α)
  • For each of the N words wn:
    • Choose a topic zn ~ Multinomial(θ)
    • Choose a word wn from p(wn | zn, β), a multinomial probability conditioned on the topic zn.

[Unrolled graphical model: for each of three example documents, topic assignments z1 … z4 generate the words w1 … w4; all documents share the topic-word parameter β.]
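A small sketch of this generative process (the topic-word distributions β, the hyperparameter α, and the vocabulary are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = ["heart", "love", "soul", "scientific", "knowledge", "work"]
K = 2                       # number of topics
alpha = np.full(K, 0.5)     # sparse-ish Dirichlet hyperparameter (assumed)
beta = np.array([[0.4, 0.3, 0.3, 0.0, 0.0, 0.0],    # topic 0: word distribution
                 [0.0, 0.0, 0.0, 0.4, 0.3, 0.3]])   # topic 1: word distribution

def generate_document(n_words=10):
    theta = rng.dirichlet(alpha)                 # per-document topic proportions
    doc = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)               # topic for this word
        w = rng.choice(len(vocab), p=beta[z])    # word from the chosen topic
        doc.append(vocab[w])
    return theta.round(2), doc

for _ in range(3):
    print(generate_document())
```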


MCMC in LDA

[Animation frames repeating the generative process and the unrolled model above: MCMC (Gibbs sampling) repeatedly resamples the topic assignments z1 … z4 of each document given the observed words w1 … w4 and the shared parameter β.]


LDA model

LDA as a dimensionality reduction algorithm:
-- Documents can be seen as vectors in a k-dimensional topic space,
-- as against the V-dimensional vocabulary space.


A generative model for documents

Topic 1, P(w|Cat = 1): HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2, SCIENTIFIC 0.0, KNOWLEDGE 0.0, WORK 0.0, RESEARCH 0.0, MATHEMATICS 0.0

Topic 2, P(w|Cat = 2): HEART 0.0, LOVE 0.0, SOUL 0.0, TEARS 0.0, JOY 0.0, SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2



Choose mixture weights for each document, generate “bag of words”

Mixture weights {P(z = 1), P(z = 2)} for the example documents: {0, 1}, {0.25, 0.75}, {0.5, 0.5}, {0.75, 0.25}, {1, 0}.

Generated bags of words (ranging from all topic-2 words to all topic-1 words):

MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS
RESEARCH WORK SCIENTIFIC MATHEMATICS WORK
SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC
HEART LOVE TEARS KNOWLEDGE HEART
MATHEMATICS HEART RESEARCH LOVE MATHEMATICS
WORK TEARS SOUL KNOWLEDGE HEART
WORK JOY SOUL TEARS MATHEMATICS
TEARS LOVE LOVE LOVE SOUL
TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY


LDA is a mixture-of-topics model for unigram text whose parameters are set through Bayesian learning.

Civilization advances by extending the number of important operations that we can do without thinking about them.

--Alfred North Whitehead


Dirichlet distribution
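The formula on this slide is an image that did not survive extraction; for reference (a standard fact, not a reconstruction of the exact slide), the Dirichlet density over topic proportions θ = (θ1, …, θK) with hyperparameters α = (α1, …, αK) is

```latex
p(\theta \mid \alpha)
  = \frac{\Gamma\!\left(\sum_{k=1}^{K}\alpha_k\right)}{\prod_{k=1}^{K}\Gamma(\alpha_k)}
    \prod_{k=1}^{K} \theta_k^{\alpha_k - 1},
  \qquad \theta_k \ge 0,\quad \sum_{k=1}^{K}\theta_k = 1 .
```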


Dirichlet Examples

Darker implies lower magnitude

α < 1 leads to sparser topics


LDA


Inference in LDA


Example inference

Let’s look at the NIPS papers instead..


Example inference (continued)


Topics vs. words


Explore and browse document collections


LDA (like any generative model) is modular, general, useful


Event-Tweet Alignment..

Republican Primary Debate, 09/07/2011

Tweets tagged with #ReaganDebate


Which part of the event did a tweet refer to?

What were the topics of the event and tweets?

Applications: Event playback/Analysis, Sentiment Analysis, Advertisement, etc



Event-Tweet Alignment: The Problem

  • Given an event’s transcript S and its associated tweets T,
    • find the segment s (s ∈ S) that is topically referred to by tweet t (t ∈ T) [t could also be a general tweet]
  • Alignment requires:
    • Extracting topics in the tweets and the event
    • Segmenting the event into topically coherent chunks
    • Classifying the tweets as general vs. specific

Idea: represent tweets and segments in a topic space


Event-Tweet Alignment: A Model


Event-Tweet Alignment: Challenges

  • Both topics and segments are latent
  • Tweets are topically influenced by the content of the event. A tweet’s words’ topics can be
    • general (high-level and constant across the entire event), or
    • specific (concrete and related to specific segments of the event)

      • General tweet = weakly influenced by the event

      • Specific tweet = strongly influenced by the event

  • An event is formed by discrete sequentially-ordered segments, each of which discusses a particular set of topics


Event-Tweet Alignment: Approaches

    • Prior work

      • Event Segmentation

        • HMM-based, etc

      • Topics Modeling

        • LDA, PLSI

    • Possible Solution

      • Apply LDA to event and Tweets separately

      • Measure the closeness by JS-divergence of their topic distributions

      • Problem: the event and its Twitter feeds are modeled largely independently

    • Our Solution: Joint Modeling

    • ET-LDA (event-tweets LDA) considers an event and its Twitter feeds jointly and characterizes the topic influences between them in a fully Bayesian model

  • Potential advantages

    • Tweets provide a richer context about the topic evolution in the event

    • Can measure the influence of the event on the twitterati


ET-LDA


ET-LDA Model

[Model diagram: Tweets and Event plates. Determine the event segmentation; determine which segment a tweet (word) refers to; determine the tweet type; determine a word’s topic in the event; determine a tweet word’s topic.]


ET-LDA Model (continued)

    For more details of the inference, please refer to our paper: http://bit.ly/MBHjyZ


Learning ET-LDA: Gibbs sampling

Coupling between a and b makes the posterior computation of the latent variables intractable

    For more details of the inference, please refer to our paper: http://bit.ly/MBHjyZ


Analysis of the First Obama-Romney Debate


LDA is modular, general, useful


LDA is modular, general, useful (continued)


Inverting the generative model

    • Maximum likelihood estimation (EM)

      • e.g. Hofmann (1999)

    • Deterministic approximate algorithms

      • variational EM; Blei, Ng & Jordan (2001; 2003)

      • expectation propagation; Minka & Lafferty (2002)

    • Markov chain Monte Carlo

      • full Gibbs sampler; Pritchard et al. (2000)

      • collapsed Gibbs sampler; Griffiths & Steyvers (2004)


Generative vs. Discriminative Learning

    • Often, we are really more interested in predicting only a subset of the attributes given the rest.

      • E.g. we have data attributes split into subsets X and Y, and we are interested in predicting Y given the values of X

    • You can do this either by

      • learning the joint distribution P(X, Y) [Generative learning]

      • or learning just the conditional distribution P(Y|X) [Discriminative learning]

    • Often a given classification problem can be handled either generatively or discriminatively

      • E.g. Naïve Bayes and Logistic Regression

    • Which is better?


Generative vs. Discriminative

P(y)P(x|y) = P(y,x) = P(x)P(y|x)

Generative Learning
  • More general (after all, if you have P(Y,X) you can predict Y given X as well as do other inferences)
    • You can predict jokes as well as make them up (or predict spam mails as well as generate them)
  • In trying to learn P(Y,X), we are often forced to make many independence assumptions both in Y and X, and these may be wrong..
    • Interestingly, this type of high bias can help generative techniques when there is too little data

Discriminative Learning
  • More to the point: if what you want is P(Y|X), why bother with P(Y,X), which is after all P(Y|X)*P(X) and thus models the dependencies between the X’s also?
  • Since we don’t need to model dependencies among X, we don’t need to make any independence assumptions among them. So, we can merrily use highly correlated features..
    • Interestingly, this freedom can hurt discriminative learners when there is too little data (as overfitting is easy)

Bayes networks are not well suited for discriminative learning; Markov Networks are
  -- thus Conditional Random Fields are basically MNs doing discriminative learning
  -- Logistic regression can be seen as a simple CRF
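As a concrete, purely illustrative side-by-side, the sketch below fits a generative classifier (scikit-learn’s MultinomialNB, which models P(Y) and P(X|Y)) and a discriminative one (LogisticRegression, which models P(Y|X) directly) on made-up bag-of-words data; the data, split, and settings are my own assumptions, not from the slides.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy bag-of-words data: class 0 favors the first three words, class 1 the last three.
n, V = 200, 6
y = rng.integers(0, 2, size=n)
rates = np.where(y[:, None] == 0,
                 np.array([3, 3, 3, 1, 1, 1]),
                 np.array([1, 1, 1, 3, 3, 3]))
X = rng.poisson(rates)

# Generative learning: fit P(Y) and P(X|Y), classify via Bayes rule.
nb = MultinomialNB().fit(X[:150], y[:150])
# Discriminative learning: fit P(Y|X) directly.
lr = LogisticRegression(max_iter=1000).fit(X[:150], y[:150])

print("NB accuracy:", nb.score(X[150:], y[150:]))
print("LR accuracy:", lr.score(X[150:], y[150:]))
```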


Latent Dirichlet allocation (Blei, Ng, & Jordan, 2001; 2003)

θ(d) ~ Dirichlet(α)    -- distribution over topics for each document (Dirichlet prior α)
zi ~ Discrete(θ(d))    -- topic assignment for each word
φ(j) ~ Dirichlet(β)    -- distribution over words for each topic j = 1..T (Dirichlet prior β)
wi ~ Discrete(φ(zi))   -- word generated from its assigned topic

(Plate notation: Nd words per document, D documents, T topics.)


Note that the other parents of zj’s children are part of the Markov blanket.

P(rain | cl, sp, wg) ∝ P(rain | cl) · P(wg | sp, rain)


The collapsed Gibbs sampler

Γ(n+1) = n Γ(n)

    • Using conjugacy of Dirichlet and multinomial distributions, integrate out continuous parameters

    • Defines a distribution on discrete ensembles z


The LDA model is no longer sparse after marginalization…! But you don’t need to see it ;-)

(unroll, then marginalize)


The collapsed Gibbs sampler (continued)

(#times wi occurs in topic zi)

    • Sample each zi conditioned on z-i

    • This is nicer than your average Gibbs sampler:

      • memory: counts can be cached in two sparse matrices

      • optimization: no special functions, simple arithmetic

      • the distributions on θ and φ are analytic given z and w, and can later be found for each sample
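A compact sketch of such a collapsed sampler, following the standard Griffiths & Steyvers update, with a made-up toy corpus and assumed K, α, and β: it caches the counts in two matrices and resamples each zi conditioned on z-i.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus as lists of word ids over a vocabulary of size V (made up).
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 1, 5, 3, 2]]
V, K = 6, 2
alpha, beta = 0.5, 0.1

# Count caches: topic-word counts, document-topic counts, words per topic.
nzw = np.zeros((K, V))
ndz = np.zeros((len(docs), K))
nz = np.zeros(K)

# Random initialization of the topic assignments z.
z = [[rng.integers(K) for _ in doc] for doc in docs]
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        nzw[t, w] += 1; ndz[d, t] += 1; nz[t] += 1

for it in range(200):                       # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            # Remove the current assignment from the counts (condition on z_-i).
            nzw[t, w] -= 1; ndz[d, t] -= 1; nz[t] -= 1
            # Collapsed-Gibbs conditional:
            # p(z_i = k | z_-i, w) proportional to
            #   (nzw[k, w] + beta) / (nz[k] + V*beta) * (ndz[d, k] + alpha)
            p = (nzw[:, w] + beta) / (nz + V * beta) * (ndz[d] + alpha)
            t = rng.choice(K, p=p / p.sum())
            z[d][i] = t
            nzw[t, w] += 1; ndz[d, t] += 1; nz[t] += 1

# Point estimates of phi (topic-word) and theta (doc-topic) from the counts.
phi = (nzw + beta) / (nz[:, None] + V * beta)
theta = (ndz + alpha) / (ndz.sum(axis=1, keepdims=True) + K * alpha)
print(np.round(phi, 2)); print(np.round(theta, 2))
```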


Marginalization

    Gibbs Sampling


Gibbs sampling in LDA

[Animation frames showing the sampler sweeping through the corpus, reassigning each word’s topic in turn: iteration 1, 2, …, 1000. After enough iterations we can estimate P(Z|W).]


Effects of hyperparameters

    • α and β control the relative sparsity of θ and φ
      • smaller α, fewer topics per document
      • smaller β, fewer words per topic
    • Good assignments z compromise in sparsity

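A quick numerical illustration of the sparsity effect (the numbers are my own): samples from a Dirichlet with small α put most of their mass on one or two topics, while larger α spreads it out.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10
for alpha in (0.05, 0.5, 5.0):
    # Three sample topic-proportion vectors for each hyperparameter setting.
    theta = rng.dirichlet(np.full(K, alpha), size=3)
    print(f"alpha = {alpha}:")
    print(np.round(theta, 2))
```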


Markov Networks


We can have potentials on any cliques, not just the maximal ones. So, for example, we can have a potential on A in addition to the other four pairwise potentials on the square A-B, B-C, C-D, D-A.

Qn: What is the most likely configuration of A & B?

The A-B factor says a=b=0. But the marginal says a=0, b=1! Although A and B would like to agree, B & C need to agree, C & D need to disagree, and D & A need to agree, and the latter three have higher weights! (Mr. & Mrs. Smith example)

Moral: Factors are not marginals! (A small enumeration sketch below makes this concrete.)

Okay, you convinced me that given any potentials we will have a consistent joint. But given any joint, will there be potentials I can provide? Hammersley-Clifford theorem…
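A brute-force enumeration sketch of the point above, with potentials I made up to reproduce the phenomenon (they are not the slide’s actual numbers): the A-B factor most prefers (0,0), yet the marginal over (A,B) prefers (0,1).

```python
import itertools

# Illustrative potentials on the square A-B, B-C, C-D, D-A, plus a unary one on A.
def phi_A(a):      return 5.0 if a == 0 else 1.0             # extra potential on A
def phi_AB(a, b):  return 10.0 if (a, b) == (0, 0) else 1.0  # "factor says a=b=0"
def phi_BC(b, c):  return 100.0 if b == c else 1.0           # B and C need to agree
def phi_CD(c, d):  return 100.0 if c != d else 1.0           # C and D need to disagree
def phi_DA(d, a):  return 100.0 if d == a else 1.0           # D and A need to agree

# The joint is proportional to the product of all the potentials.
def unnorm(a, b, c, d):
    return phi_A(a) * phi_AB(a, b) * phi_BC(b, c) * phi_CD(c, d) * phi_DA(d, a)

Z = sum(unnorm(*x) for x in itertools.product([0, 1], repeat=4))

# Marginal over (A, B): sum out C and D.
marg = {}
for a, b in itertools.product([0, 1], repeat=2):
    marg[(a, b)] = sum(unnorm(a, b, c, d)
                       for c, d in itertools.product([0, 1], repeat=2)) / Z

print(marg)
print("AB factor prefers:", (0, 0), "  marginal prefers:", max(marg, key=marg.get))
```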


Markov Nets vs. Bayes Nets


Connection to MCMC: MCMC requires sampling a node given its Markov blanket.

Need to use P(x|MB(x)). For Bayes nets, MB(x) contains more nodes than are mentioned in the local distribution CPT(x). For Markov nets, P(x|MB(x)) depends only on the potentials that mention x (i.e., on x’s neighbors).


Markov Networks

  • Undirected graphical models (example variables: Smoking, Cancer, Asthma, Cough)
  • Potential functions defined over cliques
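The product-of-potentials formula on this slide is an image that did not survive extraction; the standard form it presumably shows (stated here for completeness, as an assumption) is

```latex
P(x) = \frac{1}{Z} \prod_{c} \Phi_c(x_c),
\qquad
Z = \sum_{x} \prod_{c} \Phi_c(x_c).
```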


Log-Linear models for Markov Nets

[Same A-B-C-D network as before.]

Factors are “functions” over their domains.

A log-linear model consists of
  • features fi(Di) (functions over domains), and
  • weights wi for the features, such that P(x) = (1/Z) exp( Σi wi fi(Di) ).
Without loss of generality!

Markov Networks

  • Undirected graphical models (example variables: Smoking, Cancer, Asthma, Cough)
  • Log-linear model: P(x) = (1/Z) exp( Σi wi fi(x) ), where wi is the weight of feature i and fi is feature i


Inference in Markov Networks

    • Goal: Compute marginals & conditionals of the joint distribution P(X)

    • Exact inference is #P-complete

    • Most BN inference approaches work for MNs too

      • Variable Elimination uses factor multiplication, and should work without change..

    • Conditioning on Markov blanket is easy:

    • Gibbs sampling exploits this


MCMC: Gibbs Sampling

state ← random truth assignment
for i ← 1 to num-samples do
  for each variable x
    sample x according to P(x | neighbors(x))
    state ← state with new value of x
P(F) ← fraction of states in which F is true
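A runnable sketch of that loop on a tiny, made-up Markov net (a three-variable chain with agreement potentials; the model, the query F, and the sample count are all assumptions for illustration):

```python
import random

random.seed(0)

# A tiny, made-up Markov net: agreement potentials on a chain X0 - X1 - X2.
factors = [((0, 1), lambda a, b: 5.0 if a == b else 1.0),
           ((1, 2), lambda b, c: 5.0 if b == c else 1.0)]

def prob_x_is_1(state, i):
    # P(x_i | neighbors(x_i)): only the factors that mention x_i matter.
    weights = []
    for value in (0, 1):
        s = dict(state)
        s[i] = value
        w = 1.0
        for vars_, f in factors:
            if i in vars_:
                w *= f(*(s[v] for v in vars_))
        weights.append(w)
    return weights[1] / (weights[0] + weights[1])

state = {i: random.randint(0, 1) for i in range(3)}   # random truth assignment
num_samples, count_F = 5000, 0
for _ in range(num_samples):                          # for i <- 1 to num-samples
    for i in range(3):                                # for each variable x
        state[i] = 1 if random.random() < prob_x_is_1(state, i) else 0
    count_F += (state[0] == state[2])                 # query F: "X0 equals X2"

print("P(F) ~=", count_F / num_samples)               # fraction of states where F is true
```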


Learning Markov Networks

    • Learning parameters (weights)

      • Generatively

      • Discriminatively

    • Learning structure (features)

    • Easy case: assume complete data (if not: EM versions of the algorithms)


Entanglement in log likelihood…

[Example network over variables a, b, c.]


Learning for log-linear formulation

Use gradient ascent. Unimodal, because the Hessian is the covariance matrix over features.

What is the expected value of the feature given the current parameterization of the network? Requires inference to answer (inference at every iteration, sort of like EM).


Why should we spend so much time computing the gradient?

    • Given that gradient is being used only in doing the gradient ascent iteration, it might look as if we should just be able to approximate it in any which way

      • After all, we are going to take a step with some arbitrary step size anyway..

    • ..But the thing to keep in mind is that the gradient is a vector. We are talking not just of magnitude but direction. A mistake in magnitude can change the direction of the vector and push the search into a completely wrong direction…


Generative Weight Learning

∂/∂wi log Pw(x) = ni(x) − Ew[ni(x)]
  (ni(x): no. of times feature i is true in the data; Ew[ni(x)]: expected no. of times feature i is true according to the model)

    • Maximize likelihood or posterior probability

    • Numerical optimization (gradient or 2nd order)

    • No local maxima

    • Requires inference at each step (slow!)


Alternative Objectives to maximize..

Given a single data instance x, the log-likelihood is

  log Pw(x) = Σi wi ni(x) − log Σx' exp( Σi wi ni(x') )

  (the first term is the log prob of the data; the second is the log prob of all other possible data instances, w.r.t. the current parameters θ)

    • Since the log-likelihood requires network inference to compute the derivative, we might want to focus on other objectives whose gradients are easier to compute (and which also, hopefully, have optima at the same parameter values).
    • Two options:
      • Pseudo Likelihood: compute the likelihood of each possible data instance just using its Markov blanket (approximate chain rule)
      • Contrastive Divergence: maximize the distance (“increase the divergence”) between the log prob of the data and that of a sample of typical other instances (need to sample from Pθ: run MCMC initializing with the data..)


Pseudo-Likelihood

    • Likelihood of each variable given its neighbors in the data

    • Does not require inference at each step

    • Consistent estimator

    • Widely used in vision, spatial statistics, etc.

    • But PL parameters may not work well for long inference chains
      [which can lead to disastrous results]


Discriminative Weight Learning

    • Maximize conditional likelihood of query (y) given evidence (x)

      ∂/∂wi log Pw(y|x) = ni(x, y) − Ew[ni(x, y)]
      (ni(x, y): no. of true groundings of clause i in the data; Ew[ni(x, y)]: expected no. of true groundings according to the model)

    • Approximate expected counts by counts in the MAP state of y given x