
Topic modeling




Presentation Transcript


  1. Topic modeling Mark Steyvers Department of Cognitive Sciences University of California, Irvine

  2. Some topics we can discuss
  • Introduction to LDA: basic topic model
  • Preliminary work on therapy transcripts
  • Extensions to LDA
  • Conditional topic models (for predicting behavioral codes)
  • Various topic models for word order
  • Topic models incorporating parse trees
  • Topic models for dialogue
  • Topic models incorporating speech information

  3. Most basic topic model: LDA(Latent Dirichlet Allocation)

  4. Automatic and unsupervised extraction of semantic themes from large text collections. Example collections:
  • Pennsylvania Gazette (1728-1800): 80,000 articles
  • Enron: 250,000 emails
  • NYT: 330,000 articles
  • NSF/NIH: 100,000 grants
  • AOL queries: 20,000,000 queries from 650,000 users
  • Medline: 16 million articles

  5. Model Input
  • Matrix of counts: number of times words (rows) occur in documents (columns)
  • Note:
  • word order is lost: “bag of words” approach
  • some function words are deleted: “the”, “a”, “in”

              Doc1   Doc2   Doc3   …
  PIZZA         34      0      3
  PASTA         12      0      2
  ITALIAN        0     19      6
  FOOD           0     16      1
  …
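
As an illustration (not from the slides), here is a minimal Python sketch of building such a count matrix; the toy documents and stop-word list are invented for the example.

```python
from collections import Counter

# Toy documents; a few function words ("the", "a", "in", ...) are removed before counting.
docs = [
    "the pizza and pasta in the italian restaurant",
    "italian food and italian wine",
    "pizza pasta italian food",
]
stop_words = {"the", "a", "in", "and"}

tokenized = [[w for w in d.split() if w not in stop_words] for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Rows = words, columns = documents, entries = counts (word order is discarded).
counts = [[Counter(doc)[w] for doc in tokenized] for w in vocab]

for w, row in zip(vocab, counts):
    print(f"{w:12s}", row)
```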

  6. Basic Assumptions
  • Each topic is a distribution over words
  • Each document is a mixture of topics
  • Each word in a document originates from a single topic

  7. Document = mixture of topics
  Example topics (top words):
  • auto car parts cars used ford honda truck toyota
  • party store wedding birthday jewelry ideas cards cake gifts
  • webmd cymbalta xanax gout vicodin effexor prednisone lexapro ambien
  • hannah montana zac efron disney high school musical miley cyrus hilary duff
  (figure: one example document mixes two topics at 80% and 20%; another uses a single topic at 100%)

  8. Generative Process
  • For each document, choose a mixture of topics: θ ~ Dirichlet(α)
  • For each word, sample a topic z ∈ {1..T} from the mixture: z ~ Multinomial(θ)
  • Sample a word from that topic: w ~ Multinomial(φ(z)), with φ ~ Dirichlet(β)
  (plate diagram: words Nd per document, documents D, topics T)
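
A minimal Python/NumPy sketch of this generative process; the number of topics, vocabulary size, and hyperparameter values below are illustrative, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

T, W = 2, 5                  # number of topics, vocabulary size (illustrative)
alpha, beta = 0.5, 0.1       # Dirichlet hyperparameters (illustrative)
n_docs, doc_len = 4, 20

phi = rng.dirichlet([beta] * W, size=T)      # topic-word distributions, phi ~ Dirichlet(beta)

for d in range(n_docs):
    theta = rng.dirichlet([alpha] * T)       # document's topic mixture, theta ~ Dirichlet(alpha)
    words = []
    for _ in range(doc_len):
        z = rng.choice(T, p=theta)           # sample a topic  z ~ Multinomial(theta)
        w = rng.choice(W, p=phi[z])          # sample a word   w ~ Multinomial(phi[z])
        words.append(w)
    print(f"doc {d}: theta={np.round(theta, 2)}, words={words}")
```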

  9. Prior Distributions
  • Dirichlet priors encourage sparsity on topic mixtures and topics
  θ ~ Dirichlet(α)    φ ~ Dirichlet(β)
  (figure: probability simplexes over Topic 1/2/3 and Word 1/2/3; darker colors indicate lower probability)
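
To see the sparsity effect, one can draw Dirichlet samples at different concentration values; a small sketch (the specific alpha values are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Smaller concentration parameters push samples toward the corners of the
# simplex, i.e. sparser topic mixtures / topics.
for alpha in (0.1, 1.0, 10.0):
    samples = rng.dirichlet([alpha] * 3, size=5)
    print(f"alpha = {alpha}:")
    print(np.round(samples, 2))
```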

  10. Statistical Inference
  • Three sets of latent variables:
  • document-topic distributions θ
  • topic-word distributions φ
  • topic assignments z
  • Estimate the posterior distribution over topic assignments P( z | w )
  • we “collapse” over topic mixtures and word mixtures
  • we can later infer θ and φ
  • Use approximate methods: Markov chain Monte Carlo (MCMC) with Gibbs sampling

  11. Toy Example: Artificial Dataset
  • Two topics, 16 documents (figure: word-document counts)
  • Can we recover the original topics and topic mixtures from these data?

  12. Initialization: assign word tokens randomly to topics: (●=topic 1; ○=topic 2 )

  13. Gibbs Sampling
  The probability that word token i is assigned to topic t combines how often topic t is already used in document d and how often word wi is already assigned to topic t:

  P( zi = t | z-i, w ) ∝ (n(d,t) + α) · (n(wi,t) + β) / (n(t) + Wβ)

  where n(d,t) = count of topic t assigned to doc d, n(wi,t) = count of word wi assigned to topic t, n(t) = total count of topic t, and W = vocabulary size (all counts excluding token i).
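
A sketch of one sweep of this collapsed Gibbs update in Python/NumPy, under the assumption that the counts above are stored in arrays; the toy corpus, hyperparameters, and variable names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy corpus as (doc_id, word_id) tokens; sizes and hyperparameters are illustrative.
tokens = [(0, 0), (0, 1), (0, 0), (1, 2), (1, 3), (1, 2)]
D, T, W = 2, 2, 4
alpha, beta = 0.5, 0.1

z = rng.integers(T, size=len(tokens))     # random initial topic assignments
n_dt = np.zeros((D, T))                   # count of topic t assigned to doc d
n_wt = np.zeros((W, T))                   # count of word w assigned to topic t
n_t = np.zeros(T)                         # total count of topic t
for (d, w), t in zip(tokens, z):
    n_dt[d, t] += 1
    n_wt[w, t] += 1
    n_t[t] += 1

for i, (d, w) in enumerate(tokens):       # one Gibbs sweep over all word tokens
    t = z[i]                              # remove token i from the counts
    n_dt[d, t] -= 1
    n_wt[w, t] -= 1
    n_t[t] -= 1

    # P(z_i = t | rest) ∝ (n_dt + alpha) * (n_wt + beta) / (n_t + W*beta)
    p = (n_dt[d] + alpha) * (n_wt[w] + beta) / (n_t + W * beta)
    t = rng.choice(T, p=p / p.sum())

    z[i] = t                              # add the token back under its new topic
    n_dt[d, t] += 1
    n_wt[w, t] += 1
    n_t[t] += 1

print("topic assignments after one sweep:", z)
```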

  14. After 1 iteration • Apply sampling equation to each word token: (●=topic 1; ○=topic 2 )

  15. After 4 iterations (●=topic 1; ○=topic 2 )

  16. After 8 iterations (●=topic 1; ○=topic 2 )

  17. After 32 iterations (●=topic 1; ○=topic 2 )

  18. Summary of Algorithm
  INPUT: word-document counts (word order is irrelevant)
  OUTPUT:
  • topic assignments to each word: P( zi )
  • likely words in each topic: P( w | z )
  • likely topics in each document (“gist”): P( z | d )
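
For an end-to-end run on real data, one option (not used in the talk) is gensim's LdaModel, which fits the same kind of model with online variational Bayes rather than the Gibbs sampler above; a minimal sketch with an invented toy corpus:

```python
from gensim import corpora
from gensim.models import LdaModel

# Tiny illustrative corpus (already tokenized, stop words removed).
texts = [
    ["pizza", "pasta", "italian", "food"],
    ["stock", "market", "investors", "shares"],
    ["pizza", "italian", "restaurant"],
    ["market", "shares", "trading"],
]

dictionary = corpora.Dictionary(texts)             # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]    # bag-of-words counts per document

lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=2, passes=20, random_state=0)

print(lda.print_topics(num_words=4))               # likely words per topic, P(w | z)
print(lda.get_document_topics(corpus[0]))          # likely topics per document, P(z | d)
```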

  19. Example topics from TASA: an educational corpus
  • 37K docs, 26K word vocabulary
  • 300 topics (example topics shown in figure)

  20. Three documents with the word “play” (numbers & colors indicate topic assignments)

  21. Comparison with LSA
  • LSA: the normalized word-document co-occurrence matrix C (words × documents) is factored by SVD as C = U D VT (words × dims, dims × dims, dims × documents)
  • Topic model: C = Φ Θ, where Φ (words × topics) holds the mixture components (topics) and Θ (topics × documents) holds the mixture weights
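
A small NumPy sketch contrasting the two factorizations, reusing the toy counts from slide 5; the matrix is left unnormalized here for simplicity, and the topic-model factorization is only described in comments because Φ and Θ are estimated by LDA rather than computed directly.

```python
import numpy as np

# Toy word-document count matrix C (rows = words, columns = documents), from slide 5.
C = np.array([[34,  0, 3],
              [12,  0, 2],
              [ 0, 19, 6],
              [ 0, 16, 1]], dtype=float)

# LSA: truncated SVD, C ≈ U D V^T with k latent dimensions.
U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
C_lsa = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(C_lsa, 1))

# Topic model: C is instead explained as Phi @ Theta, where each column of Phi
# is a topic (a probability distribution over words) and each column of Theta
# is a document's mixture weights; both are non-negative and sum to one.
```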

  22. Documents as Topic Mixtures: a Geometric Interpretation
  (figure: the probability simplex over three words, where P(word1) + P(word2) + P(word3) = 1; topic 1 and topic 2 are points on the simplex, and each observed document lies between them)
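
A tiny numerical illustration of this geometry with invented topic distributions: a document with 80%/20% mixture weights lies on the segment between the two topics and its word probabilities still sum to one.

```python
import numpy as np

# Two topics over a three-word vocabulary (points on the probability simplex).
topic1 = np.array([0.7, 0.2, 0.1])
topic2 = np.array([0.1, 0.2, 0.7])

# A document with mixture weights (0.8, 0.2) is a convex combination of the topics.
theta = np.array([0.8, 0.2])
doc = theta[0] * topic1 + theta[1] * topic2
print(doc, doc.sum())   # the document's word probabilities sum to 1
```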

  23. Some Preliminary Work on Therapy Transcripts

  24. Defining documents
  • Can define “document” in multiple ways:
  • all words within a therapy session
  • all words from a particular speaker within a session
  • Clearly we need to extend the topic model to dialogue…

  25. Positive/Negative Topic Usage by Group

  26. Positive/Negative Topic Usage by Changes in Satisfaction
  This graph shows that couples with a decrease in satisfaction over the course of therapy use relatively negative language. Those who leave therapy with increased satisfaction exhibit more positive language.

  27. Topics used by Satisfied/Unsatisfied Couples
  Topic 38: talk, divorce, problem, house, along, separate, separation, talking, agree, example
  Dissatisfied couples talk relatively more often about separation and divorce.

  28. Affect Dynamics
  • Analyze the short-term dynamics of affect usage: do unhappy couples follow up negative language with negative language more often than happy couples? In other words, are unhappy couples involved in a negative feedback loop?
  • Calculated: P( z2=+ | z1=+ ), P( z2=+ | z1=- ), P( z2=- | z1=+ ), P( z2=- | z1=- )
  • E.g., P( z2=- | z1=+ ) is the probability that, after a positive word, the next non-neutral word will be a negative word (see the sketch below)
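
A sketch of how these conditional probabilities could be estimated from a sequence of non-neutral affect labels; the sequence below is invented for illustration.

```python
from collections import Counter

# Non-neutral word labels for one couple, in order ('+' = positive, '-' = negative).
seq = ['+', '-', '-', '+', '-', '-', '-', '+', '+', '-']

pairs = Counter(zip(seq, seq[1:]))          # counts of consecutive (z1, z2) pairs
for z1 in ('+', '-'):
    total = sum(pairs[(z1, z2)] for z2 in ('+', '-'))
    for z2 in ('+', '-'):
        p = pairs[(z1, z2)] / total if total else float('nan')
        print(f"P(z2={z2} | z1={z1}) = {p:.2f}")
```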

  29. Markov Chain Illustration
  (figure: Markov chain diagrams of positive (+) and negative (-) affect for each group, showing base rates and transition probabilities)
  • Normal Controls: base rates + .27 / - .73; transition probabilities .72 / .28
  • Positive Change: base rates + .33 / - .67; transition probabilities .73 / .27
  • Little Change: base rates + .37 / - .63; transition probabilities .78 / .22
  • Negative Change: base rates + .41 / - .59; transition probabilities .78 / .22

  30. Modeling Extensions

  31. Extensions
  • Multi-label Document Classification
  • conditional topic models
  • Topic models and word order
  • ngrams/collocations
  • hidden Markov models
  • Some potential model developments:
  • topic models incorporating parse trees
  • topic models for dialogue
  • topic models incorporating speech information

  32. Conditional Topic Models
  • Assume there is a topic associated with each label/behavioral code.
  • The model is only allowed to assign words to topics whose labels are associated with the document.
  • This model can learn the distribution of words associated with each label/behavioral code (see the sketch below).
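
One way to read this constraint in code, as a hedged sketch in the spirit of labeled topic models rather than the exact model from the talk (the label names and scores below are invented): during sampling, a word's candidate topics are restricted to those licensed by the document's labels.

```python
import numpy as np

rng = np.random.default_rng(3)

# One topic per behavioral code (illustrative labels).
codes = ["Vulnerability", "Hard Expression", "Neutral"]
doc_labels = {0: ["Vulnerability", "Neutral"],        # doc 0 may only use these topics
              1: ["Hard Expression", "Neutral"]}

def sample_topic(doc_id, word_scores):
    """Sample a topic for a word, restricted to the document's labels.

    word_scores: unnormalized score of the word under each code's topic.
    """
    allowed = [codes.index(c) for c in doc_labels[doc_id]]
    p = np.zeros(len(codes))
    p[allowed] = word_scores[allowed]         # zero out topics not licensed by the labels
    return rng.choice(len(codes), p=p / p.sum())

print(codes[sample_topic(0, np.array([0.5, 0.3, 0.2]))])
```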

  33. Topics associated with Behavioral Codes
  (figure: example documents labeled with behavioral codes such as Vulnerability=yes/no and Hard Expression=yes/no; each word's topic assignment is initially unknown (“word?”) and is inferred from the topic weights, restricted to the “Vulnerability” and “Hard Expression” topics licensed by the document's labels)

  34. Preliminary Results

  35. Topic Models for short-range sequential dependencies

  36. Hidden Markov Topics Model (Griffiths, Steyvers, Blei, & Tenenbaum, 2004)
  • Syntactic dependencies → short-range dependencies
  • Semantic dependencies → long-range dependencies
  (figure: graphical model in which a semantic state generates words w1…w4 from the topic model via topic assignments z1…z4, while syntactic states s1…s4 generate words from an HMM)
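
A rough generative sketch of the composite model, assuming a single designated "semantic" HMM state that emits words from the topic model while the remaining states emit from their own word distributions; all sizes and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

S, T, W = 3, 2, 10                 # HMM states, topics, vocabulary size (illustrative)
SEM = 0                            # state 0 is the semantic state (emits via the topic model)

trans = rng.dirichlet([1.0] * S, size=S)          # HMM transition matrix
state_words = rng.dirichlet([0.1] * W, size=S)    # word distributions of syntactic states
topic_words = rng.dirichlet([0.1] * W, size=T)    # word distributions of topics

theta = rng.dirichlet([0.5] * T)   # this document's topic mixture
s, doc = 0, []
for _ in range(15):
    s = rng.choice(S, p=trans[s])                 # short-range syntactic dependency
    if s == SEM:
        z = rng.choice(T, p=theta)                # long-range semantic dependency
        w = rng.choice(W, p=topic_words[z])
    else:
        w = rng.choice(W, p=state_words[s])
    doc.append(w)
print(doc)
```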

  37. NIPS Semantics (example topics, top 10 words each):
  • KERNEL SUPPORT VECTOR SVM KERNELS # SPACE FUNCTION MACHINES SET
  • NETWORK NEURAL NETWORKS OUTPUT INPUT TRAINING INPUTS WEIGHTS # OUTPUTS
  • IMAGE IMAGES OBJECT OBJECTS FEATURE RECOGNITION VIEWS # PIXEL VISUAL
  • EXPERTS EXPERT GATING HME ARCHITECTURE MIXTURE LEARNING MIXTURES FUNCTION GATE
  • MEMBRANE SYNAPTIC CELL * CURRENT DENDRITIC POTENTIAL NEURON CONDUCTANCE CHANNELS
  • DATA GAUSSIAN MIXTURE LIKELIHOOD POSTERIOR PRIOR DISTRIBUTION EM BAYESIAN PARAMETERS
  • STATE POLICY VALUE FUNCTION ACTION REINFORCEMENT LEARNING CLASSES OPTIMAL *
  NIPS Syntax (example syntactic states, top 10 words each):
  • IN WITH FOR ON FROM AT USING INTO OVER WITHIN
  • # * I X T N - C F P
  • IS WAS HAS BECOMES DENOTES BEING REMAINS REPRESENTS EXISTS SEEMS
  • SEE SHOW NOTE CONSIDER ASSUME PRESENT NEED PROPOSE DESCRIBE SUGGEST
  • HOWEVER ALSO THEN THUS THEREFORE FIRST HERE NOW HENCE FINALLY
  • MODEL ALGORITHM SYSTEM CASE PROBLEM NETWORK METHOD APPROACH PAPER PROCESS
  • USED TRAINED OBTAINED DESCRIBED GIVEN FOUND PRESENTED DEFINED GENERATED SHOWN

  38. Random sentence generation
  LANGUAGE:
  [S] RESEARCHERS GIVE THE SPEECH
  [S] THE SOUND FEEL NO LISTENERS
  [S] WHICH WAS TO BE MEANING
  [S] HER VOCABULARIES STOPPED WORDS
  [S] HE EXPRESSLY WANTED THAT BETTER VOWEL

  39. Collocation Topic Model (example topics with collocations)
  • Terrorism: SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS
  • Wall Street Firms: WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS
  • Stock Market: WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN 10 500_STOCK_INDEX
  • Bankruptcy: BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP

  40. Potential Model Developments

  41. Using parse trees / POS taggers?
  (figure: two parse trees, S → NP VP, for the sentences “You complete me” and “I complete you”)

  42. Modeling Dialogue

  43. Topic Segmentation Model
  • Purver, Kording, Griffiths, & Tenenbaum (2006). Unsupervised topic modeling for multi-party spoken discourse. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics.
  • Automatically segments multi-party discourse into topically coherent segments
  • Outperforms standard HMMs
  • Model does not incorporate speaker information or speaker turns
  • goal is simply to segment a long stream of words into segments

  44. At each utterance, there is a probability of changing θ, the topic mixture. If no change is indicated, words are drawn from the same mixture of topics. If there is a change, the topic mixture is resampled from the Dirichlet prior (see the sketch below).
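
A sketch of that switching mechanism (each utterance either keeps the current topic mixture or resamples it from the Dirichlet prior); the switch probability and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

T, p_switch, alpha = 4, 0.2, 0.5        # topics, switch probability, prior (illustrative)
theta = rng.dirichlet([alpha] * T)      # topic mixture for the first segment

for u in range(10):                     # one step per utterance
    if u > 0 and rng.random() < p_switch:
        theta = rng.dirichlet([alpha] * T)   # new segment: resample the mixture
        print(f"utterance {u}: new segment, theta = {np.round(theta, 2)}")
    else:
        print(f"utterance {u}: same segment")
```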

  45. Latent Dialogue Structure model (Ding et al., NIPS workshop, 2009)
  • Designed for modeling sequences of messages on discussion forums
  • Models the relationships of messages within documents: a message might relate to any previous message within a dialogue
  • It does not incorporate speaker-specific variables

  46. Some details …

  47. Learning User Intentions in Spoken Dialogue Systems (Chinaei et al., ICAART, 2009)
  • Applies the HTMM model (Gruber et al., 2007) to dialogue
  • Assumes that within each talk-turn, words are drawn from the same topic z (not a mixture!). At the start of a new talk-turn, there is some probability ψ of sampling a new topic z from the mixture θ (see the sketch below).
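
A sketch of that talk-turn dynamic (the topic is held fixed within a turn and resampled with probability ψ at turn boundaries); all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

T, psi = 3, 0.4                         # number of topics, switch probability (illustrative)
theta = rng.dirichlet([0.5] * T)        # dialogue-level topic mixture

turns = [5, 3, 6, 4]                    # number of words per talk-turn
z = rng.choice(T, p=theta)              # topic for the first turn
for i, n_words in enumerate(turns):
    if i > 0 and rng.random() < psi:    # at a new talk-turn, maybe switch topic
        z = rng.choice(T, p=theta)
    print(f"turn {i}: topic {z} for all {n_words} words")
```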

  48. Other ideas
  • Can we enhance topic models with non-verbal speech information?
  • Each topic is a distribution over words as well as voicing information (f0, timing, etc.)
  (plate diagram: topics T, words Nd per document, documents D, with a non-verbal feature attached to each word)

  49. Other Extensions
