Presentation Transcript


  1. Empirical Study of Topic Modeling in Twitter. Liangjie Hong and Brian D. Davison, Computer Science and Engineering, Lehigh University, Bethlehem, PA, USA. SOMA 2010

  2. Why do we care about text modeling in Twitter?

  3. Why do we care about text modeling in Twitter? • Understanding users’ interests • Understanding the social network • Identifying emerging topics

  4. Problems • Tweets are too short (140 characters) • Hashtags • Abbreviations • Multiple languages

  5. Question: How can we train an “effective” standard topic model?

  6. We found • Topics learned by different aggregation strategies are substantially different • Training the model at the user level is faster • Learned topics can help classification tasks

  7. A quick review of topic models: LDA and the Author-Topic model
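
A minimal sketch of the “standard topic model” pass referred to here, assuming the gensim library and a toy tokenized corpus (neither is from the paper): fit plain LDA, then read off a topic mixture for an unseen document.

```python
# A minimal sketch, assuming gensim and a toy corpus; this illustrates
# standard LDA, not the training setup used in the paper.
from gensim import corpora, models

# Each "document" is a tokenized piece of text (for example, one tweet).
docs = [
    ["topic", "models", "for", "twitter"],
    ["users", "post", "short", "messages"],
    ["lda", "learns", "topics", "from", "text"],
]

dictionary = corpora.Dictionary(docs)            # word <-> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]   # bag-of-words vectors

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

# Topic mixture for a new, unseen document.
new_doc = dictionary.doc2bow(["twitter", "topics"])
print(lda[new_doc])   # e.g. [(0, 0.7), (1, 0.3)]
```

The Author-Topic model differs in that each document is tied to its author, so a topic distribution is learned per author rather than per document.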

  8. Our goal • Obtain topic mixtures for both tweets and users

  9. Training schemes • Train on individual tweets, then infer topic mixtures for users and tweets • Train on tweets aggregated by user, then infer topic mixtures for tweets • Train on tweets aggregated by term, then infer topic mixtures for users and tweets • Author-Topic model, then infer topic mixtures for tweets
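
The first two schemes differ only in how documents are assembled before training. The sketch below shows that step with hypothetical tweet records; the term-level aggregation and the Author-Topic setup are omitted.

```python
# A minimal sketch, with hypothetical tweet records, of two document-
# construction schemes: one document per tweet versus one document per
# user (all of that user's tweets pooled together).
from collections import defaultdict

tweets = [
    {"user": "alice", "tokens": ["coffee", "this", "morning"]},
    {"user": "alice", "tokens": ["reading", "about", "topic", "models"]},
    {"user": "bob",   "tokens": ["football", "tonight"]},
]

# Scheme 1: train on tweets (each tweet is its own document).
msg_docs = [t["tokens"] for t in tweets]

# Scheme 2: train on tweets aggregated by user.
user_docs = defaultdict(list)
for t in tweets:
    user_docs[t["user"]].extend(t["tokens"])

print(len(msg_docs), "tweet-level documents;", len(user_docs), "user-level documents")
```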

  10. Datasets • 1,992,758 tweets + 514,130 users • 3,697,498 terms • 274 verified users from Twitter Suggestion • 16 categories • 50,447 tweets (150 tweets per user)

  11. Tasks • Topic modeling • Retweet prediction • User and tweet topical classification, using logistic regression
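
For the classification tasks, the learned topic mixtures serve as feature vectors. The sketch below shows that pipeline, assuming scikit-learn and made-up topic mixtures and labels; it is an illustration of the setup, not the paper’s actual features or results.

```python
# A minimal sketch, assuming scikit-learn and toy data: topic mixtures as
# features, topical category labels as targets, logistic regression as
# the classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy topic mixtures (one row per tweet or user; rows sum to 1).
X = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
])
y = np.array([0, 1, 2, 0, 1, 2])   # category labels (e.g., news, sports, tech)

clf = LogisticRegression(max_iter=200).fit(X, y)
print("predicted categories:", clf.predict(X))
```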

  12. Topic Modeling

  13. Topic Modeling

  14. Topic Modeling

  15. Retweet Prediction • Original tweet: Hello World (2009-11-01 12:00pm) • Retweets: @Jon Hello World (13:15), @Kim @Jon Hello World (13:23), @Frank @Kim @Jon Hello World (17:49) • Positive examples: tweets that get retweeted • Negative examples: tweets that do not
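
One plausible way to turn this into training data, sketched below with hypothetical tweet records: a tweet that is later retweeted becomes a positive example, and a tweet that is never retweeted becomes a negative example.

```python
# A minimal sketch, with hypothetical records; the 'retweet_count' field
# is an assumption, not a field from the paper's dataset.
tweets = [
    {"text": "Hello World",           "posted": "2009-11-01 12:00", "retweet_count": 3},
    {"text": "Just had lunch",        "posted": "2009-11-01 12:05", "retweet_count": 0},
    {"text": "Topic models are neat", "posted": "2009-11-01 12:30", "retweet_count": 1},
]

positive = [t for t in tweets if t["retweet_count"] > 0]   # retweeted at least once
negative = [t for t in tweets if t["retweet_count"] == 0]  # never retweeted

print(len(positive), "positive and", len(negative), "negative examples")
```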

  16. Retweet Prediction

  17. Tweet Classification

  18. User Classification

  19. Conclusion • User-level aggregation is helpful: fast training and good results • The Author-Topic model does not apply directly • Topic modeling can help other tasks, e.g., tweet classification

  20. Thank you (and thanks for the IBM Travel Grant)! • Contact info: Liangjie Hong • hongliangjie@lehigh.edu • WUME Laboratory • Computer Science and Engineering • Lehigh University • Bethlehem, PA 18015 USA
