
I256: Applied Natural Language Processing

I256: Applied Natural Language Processing. Marti Hearst, Nov 8, 2006. Today: Comparing term clustering and category output; Clustering in Weka; Data mining from blogs. LDA: Latent Dirichlet Allocation (Blei, Ng, Jordan, JMLR 03). LDA is a hierarchical probabilistic model of documents.


Presentation Transcript


  1. I256: Applied Natural Language Processing Marti Hearst Nov 8, 2006

  2. Today • Comparing term clustering and category output • Clustering in Weka • Data mining from blogs

  3. LDA • Latent Dirichlet Allocation • Blei, Ng, Jordan, JMLR 03. • LDA is a hierarchical probabilistic model of documents. • “LDA allows you to analyze of corpus, and extract the topics that combined to form its documents.” • http://www.cs.princeton.edu/~blei/lda-c/ • Not really clustering, but in the “soft clustering” ballpark.
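
To make "hierarchical probabilistic model of documents" concrete, here is a minimal sketch of the LDA generative story: each document draws its own topic mixture from a Dirichlet, then each word picks a topic from that mixture and a word from that topic. The toy topics, the `alpha` values, and the function names are assumptions made up for illustration. This is not the lda-c code, and it only generates documents; it does not fit a model.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate one document under the LDA generative story.

    topics: list of dicts mapping word -> probability (one dict per topic).
    alpha:  Dirichlet parameters, one per topic.
    """
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    words = []
    for _ in range(n_words):
        # Pick a topic for this word position according to theta.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        vocab = list(topics[z])
        # Pick a word from the chosen topic's word distribution.
        w = rng.choices(vocab, weights=[topics[z][v] for v in vocab])[0]
        words.append(w)
    return theta, words

# Hypothetical toy topics (not learned from any corpus).
TOPICS = [
    {"flour": 0.5, "sugar": 0.3, "butter": 0.2},
    {"garlic": 0.4, "onion": 0.4, "pepper": 0.2},
]

rng = random.Random(0)
theta, doc = generate_document(TOPICS, alpha=[0.5, 0.5], n_words=8, rng=rng)
print(theta, doc)
```

Because every document mixes all topics in different proportions, a document is never assigned to exactly one group, which is why this sits in the "soft clustering" ballpark rather than being clustering proper.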

  4. LDA on Recipes http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/recipes-newblei/Flamenco

  5. LDA on Recipes http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/recipes-newblei/Flamenco

  6. CastaNet • (Semi)automated facet creation • Stoica & Hearst • Build up from WordNet • Algorithm is fully automatic but we think you can improve results manually afterwards.

  7. CastaNet on Recipes http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/recipes-automated/Flamenco

  8. CastaNet on Recipes http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/recipes-automated/Flamenco

  9. TopicSeek on Enron Email • Technique: pLSI (probabilistic LSI, Hofmann 99) • Hand-picked example for website • http://topicseek.com/enron.html

  10. TopicSeek on Medline • Technique: pLSI (probabilistic LSI, Hofmann 99) • Hand-picked example for website • http://topicseek.com/pubmed.html

  11. CastaNet on Medline Journal Titles http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/medicine-automated/Flamenco

  12. Clustering in Weka

  13. Looking at Clustering Results • Weka lets you save cluster results to an ARFF file • I wrote some Python code to process this file and pull out the Subject headings for each newsgroup posting in each cluster.
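
The script itself is not in the slides, so the following is only a sketch of what such processing could look like. It assumes a simple dense ARFF file in which one attribute holds the Subject line and another holds the cluster label; the attribute names `subject` and `Cluster`, the sample data, and the function names are all assumptions for illustration.

```python
import csv
from collections import defaultdict

def read_arff(lines):
    """Parse a simple dense ARFF file into a list of {attribute: value} dicts."""
    names, data_lines, in_data = [], [], False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        if in_data:
            data_lines.append(line)
        elif line.lower().startswith("@attribute"):
            # The attribute name is the second whitespace-separated token.
            names.append(line.split(None, 2)[1].strip("'\""))
        elif line.lower().startswith("@data"):
            in_data = True
    # String values are single-quoted and may contain commas.
    reader = csv.reader(data_lines, quotechar="'", escapechar="\\",
                        skipinitialspace=True)
    return [dict(zip(names, row)) for row in reader]

def subjects_by_cluster(rows, subject_attr="subject", cluster_attr="Cluster"):
    """Group Subject headings by cluster label (attribute names are assumed)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[cluster_attr]].append(row[subject_attr])
    return dict(groups)

# Hypothetical sample standing in for a saved Weka cluster-assignment file.
SAMPLE = """\
@relation newsgroups_clustered
@attribute Instance_number numeric
@attribute subject string
@attribute Cluster {cluster0,cluster1}
@data
0,'Re: Mac vs PC',cluster0
1,'Hockey playoffs',cluster1
2,'Apple, new models',cluster0
"""

groups = subjects_by_cluster(read_arff(SAMPLE.splitlines()))
for cluster, subjects in sorted(groups.items()):
    print(cluster, subjects)
```

Reading the subjects side by side per cluster is a quick way to judge by eye whether a cluster corresponds to a newsgroup or a topic.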

  14. 15-way clustering

  15. Cobweb clustering

  16. Blog Analysis • What’s special about blogs?

  17. Blog analysis sites • http://dijest.com/bc/ • Called blogcount; lots of stats and news about blogs • http://blogcensus.net/?page=tools • Language, location, marketshare • http://www.perseus.com/blogsurvey/ • Stats about biggest blogs, demographics • http://www.weblogs.com/ • Notify when new content posted • http://blogpulse.com/ • Trends and recent popular topics

  18. Blogs vs. Newsgroups • Posting about products … what can we tell? • Blog: • Newsgroup: Example from Glance, Hurst, and Tomokiyo ‘04

  19. Analyzing Blogs for Market Data • Idea: examine comments about a product (or a product’s competition or market) in an automated fashion. • Application area: handheld electronic devices. Figure from Glance, Hurst, Nigam, Siegler, Stockton, & Tomokiyo, KDD’05

  20. Analyzing Blogs for Market Data Figure from Glance, Hurst, Nigam, Siegler, Stockton, & Tomokiyo, KDD’05

  21. Technology used • Post segmentation • Important phrases • Foreground vs. background corpus • Background: text about product • Foreground: certain negative paragraphs about product • Sentiment classification • What do people talk about when saying negative things about product X? • Social network analysis (on discussion boards) • What does this group of people talk about when saying negative things about product X? • Author dispersion • Many people talking about it, or just a few?
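
A rough sketch of the foreground vs. background idea: score each term by how much more frequent it is in the foreground (negative paragraphs) than in the background (all text about the product). The slide does not say which statistic the authors used, so the smoothed log-ratio below is a swapped-in stand-in, and the tokenizer, function names, and toy sentences are all assumptions for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    """Whitespace tokenizer; a real system would do proper phrase extraction."""
    return text.lower().split()

def distinctive_terms(foreground_docs, background_docs, top_n=5, smoothing=1.0):
    """Rank foreground terms by smoothed log-ratio of relative frequencies."""
    fg = Counter(t for d in foreground_docs for t in tokenize(d))
    bg = Counter(t for d in background_docs for t in tokenize(d))
    vocab = set(fg) | set(bg)
    fg_total = sum(fg.values()) + smoothing * len(vocab)
    bg_total = sum(bg.values()) + smoothing * len(vocab)
    scores = {}
    for term in fg:
        p_fg = (fg[term] + smoothing) / fg_total
        p_bg = (bg[term] + smoothing) / bg_total
        scores[term] = math.log(p_fg / p_bg)
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]

# Hypothetical toy corpora: background is all text about the product,
# foreground is the subset of paragraphs judged negative.
background = [
    "the battery lasts all day",
    "the screen is bright",
    "the battery died and the screen cracked",
    "the keyboard is fine",
]
foreground = [
    "the battery died and the screen cracked",
    "the battery died again",
]

print(distinctive_terms(foreground, background, top_n=3))
```

Common words such as "the" score low because they are about equally frequent in both corpora, while terms concentrated in the negative paragraphs rise to the top.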

  22. Example • What common phrases do people use when saying negative things about product X?

  23. Example • What do people in this group say when saying negative things about product X?

  24. Example • What do people in this group say when saying negative things about product X?

  25. Predicting Film Sales • Idea: • Use discussion before a film to predict its opening weekend box office scores • Use discussion afterwards to predict longer-term sales • Separate out topic labels from sentiment labels • Outcome: • Good predictor for opening weekend, but not for longer term • Observation: the discussion becomes harder to analyze after the film has been out a while. Example from Mishne & Glance, 2006
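
One simple way to ask whether a discussion-based measure is a "good predictor" is to correlate it with sales across films. The slide does not name the statistic used, so Pearson's r below is an assumption, and the numbers are made-up toy values, not data from the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)

# Made-up toy values, one entry per film (NOT from Mishne & Glance).
pre_release_positive_posts = [120, 340, 90, 560, 210]
opening_weekend_millions = [8.0, 21.5, 5.2, 40.1, 12.3]

print(pearson_r(pre_release_positive_posts, opening_weekend_millions))
```

The same function could be run with post-release discussion against longer-term sales to compare the two settings.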

  26. Predicting Film Sales Example from Mishne & Glance, 2006

  27. Predicting Film Sales Example from Mishne & Glance, 2006

  28. Predicting Film Sales Example from Mishne & Glance, 2006

  29. Analyzing Political Blogs • Analyze: • Who links to whom • What the popularity profile looks like • A power law/Zipf/Pareto, of course • Look at structure of topic-specific blogs • By #inbound links Image from blogosphere ecosystem via Shirky

  30. Analyzing Political Blogs • Earlier work examined books bought together in pairs at major retailers • Krebs, Divided we Stand??? http://www.orgnet.com/leftright.html • In other domains the groupings are more distributed.

  31. http://www.orgnet.com/booknet.html

  32. http://www.orgnet.com/leftright.html from Jan 2003

  33. http://www.orgnet.com/divided.html from 2004 election

  34. Analyzing Political Blogs • Study by Adamic and Glance, 2005 • Analyzed 40 most popular political blogs • 2 months preceding 2004 US presidential election • Also studied 1000 political blogs in a one-day snapshot • Findings for the latter: • Liberal and conservative blogs had distinct lists of favorite news sources, people, and topics, with some overlap on current news • Use labels from aggregator sources • Linking patterns were indeed pretty internal (91% stayed within political leaning) • More and more frequent linking among conservatives • 82% of conservative blogs linked out vs. 74% of liberal blogs
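
A figure like "91% stayed within political leaning" is the fraction of blog-to-blog links whose source and target carry the same label. A minimal sketch of that calculation, with a hypothetical four-blog link graph and made-up labels:

```python
def within_leaning_fraction(links, leaning):
    """Fraction of links whose source and target share a political leaning.

    links:   iterable of (source_blog, target_blog) pairs.
    leaning: dict mapping blog -> label, e.g. "liberal" or "conservative".
    Links touching an unlabeled blog are ignored.
    """
    labeled = [(s, t) for s, t in links if s in leaning and t in leaning]
    if not labeled:
        raise ValueError("no links between labeled blogs")
    same = sum(1 for s, t in labeled if leaning[s] == leaning[t])
    return same / len(labeled)

# Hypothetical toy data, not taken from the study.
leaning = {"a": "liberal", "b": "liberal",
           "c": "conservative", "d": "conservative"}
links = [("a", "b"), ("b", "a"), ("c", "d"), ("d", "c"), ("a", "c")]

print(within_leaning_fraction(links, leaning))  # 4 of 5 links stay within a leaning
```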

  35. Analyzing Political Blogs • For the 40 most popular blogs: • Looked for “echo chamber” effect • The conservative blogs are more tightly interlinked. • Question: do they repeat the same concepts more? • Measured textual similarity among blog posts • Slightly stronger within a political leaning than between, but not one orientation more than the other. • Looked for interaction with “mainstream” media • Found strong distinctions between which sources cited
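
The slide does not say which similarity measure was used. As an illustrative sketch, the comparison of "within a leaning" vs. "between leanings" can be done with cosine similarity over bag-of-words counts, averaged over all pairs of posts. The measure, the function names, and the toy posts are assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def within_and_between(posts):
    """Mean pairwise similarity within a leaning and between leanings.

    posts: list of (leaning, text) pairs.
    """
    vectors = [(label, Counter(text.lower().split())) for label, text in posts]
    within, between = [], []
    for (l1, v1), (l2, v2) in combinations(vectors, 2):
        (within if l1 == l2 else between).append(cosine(v1, v2))

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(within), mean(between)

# Hypothetical toy posts, not taken from the study.
posts = [
    ("liberal", "tax cuts hurt workers"),
    ("liberal", "tax cuts hurt schools"),
    ("conservative", "tax cuts help growth"),
    ("conservative", "tax cuts help business"),
]

within, between = within_and_between(posts)
print(within, between)
```

Computing the within-leaning mean separately for each orientation would address the second half of the finding, whether one side repeats itself more than the other.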

  36. Image from Adamic & Glance 200

  37. Image from Adamic & Glance 200

  38. Image from Adamic & Glance 200

  39. Image from Adamic & Glance 200

  40. Image from Adamic & Glance 200

  41. Image from Adamic & Glance 200

  42. Next Time • Sentiment and Opinion Analysis
