I256: Applied Natural Language Processing


I256: Applied Natural Language Processing. Marti Hearst, Nov 8, 2006. Today: Comparing term clustering and category output; Clustering in Weka; Data mining from blogs. LDA: Latent Dirichlet Allocation (Blei, Ng, Jordan, JMLR 03). LDA is a hierarchical probabilistic model of documents.

    Presentation Transcript
    1. I256: Applied Natural Language Processing Marti Hearst Nov 8, 2006

    2. Today • Comparing term clustering and category output • Clustering in Weka • Data mining from blogs

    3. LDA • Latent Dirichlet Allocation • Blei, Ng, Jordan, JMLR 03. • LDA is a hierarchical probabilistic model of documents. • “LDA allows you to analyze of corpus, and extract the topics that combined to form its documents.” • http://www.cs.princeton.edu/~blei/lda-c/ • Not really clustering, but in the “soft clustering” ballpark.
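
To make "hierarchical probabilistic model of documents" concrete, here is a minimal sketch of the LDA generative story in plain Python. It is an illustration only, not the lda-c implementation: the two toy topics, their word probabilities, and the `alpha` values are made up, and the topics are held fixed rather than learned from a corpus.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng):
    """Pick an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(topics, alpha, length, rng):
    """Generate one document under the LDA generative story.

    topics: list of {word: probability} dicts (hypothetical toy topics).
    alpha:  Dirichlet prior over topic proportions, one value per topic.
    """
    theta = sample_dirichlet(alpha, rng)      # per-document topic mixture
    words = []
    for _ in range(length):
        z = sample_index(theta, rng)          # choose a topic for this word slot
        vocab = list(topics[z])
        probs = [topics[z][w] for w in vocab]
        words.append(vocab[sample_index(probs, rng)])  # choose a word from that topic
    return theta, words

# Hypothetical toy topics, loosely in the spirit of the recipes collection.
TOY_TOPICS = [
    {"flour": 0.5, "sugar": 0.3, "butter": 0.2},
    {"garlic": 0.4, "onion": 0.4, "pepper": 0.2},
]

rng = random.Random(0)
theta, doc = generate_document(TOY_TOPICS, alpha=[0.5, 0.5], length=10, rng=rng)
```

Because every document gets its own mixture `theta` over all topics, each document belongs to several topics at once, which is why this sits in the "soft clustering" ballpark rather than being a hard clustering.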

    4. LDA on Recipes http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/recipes-newblei/Flamenco

    5. LDA on Recipes http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/recipes-newblei/Flamenco

    6. CastaNet • (Semi)automated facet creation • Stoica & Hearst • Build up from WordNet • Algorithm is fully automatic but we think you can improve results manually afterwards.

    7. CastaNet on Recipes http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/recipes-automated/Flamenco

    8. CastaNet on Recipes http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/recipes-automated/Flamenco

    9. TopicSeek on Enron Email • Technique: pLSI (probabilistic LSI, Hofmann 99) • Hand-picked example for website • http://topicseek.com/enron.html

    10. TopicSeek on Medline • Technique: pLSI (probabilistic LSI, Hofmann 99) • Hand-picked example for website • http://topicseek.com/pubmed.html

    11. CastaNet on Medline Journal Titles http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/medicine-automated/Flamenco

    12. Clustering in Weka

    13. Looking at Clustering Results • Weka lets you save cluster results to an ARFF file • I wrote some python code to process this file and pull out the Subject headings for each newsgroup posting in each cluster.
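
The original script is not part of the slides. As a rough sketch of the kind of processing described, the following reads an ARFF file that has a cluster label per row and groups the subject lines by cluster. The attribute names `subject` and `Cluster`, the sample rows, and the single-quote string convention are assumptions for illustration; a real Weka export may differ.

```python
import csv
from collections import defaultdict

def subjects_by_cluster(arff_lines, subject_attr="subject", cluster_attr="Cluster"):
    """Group subject strings by cluster label from the lines of an ARFF file."""
    attributes = []
    data_rows = []
    in_data = False
    for raw in arff_lines:
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blanks and ARFF comments
            continue
        lowered = line.lower()
        if in_data:
            data_rows.append(line)
        elif lowered.startswith("@attribute"):
            # The second whitespace-separated token is the attribute name.
            attributes.append(line.split()[1].strip("'\""))
        elif lowered.startswith("@data"):
            in_data = True

    subj_idx = attributes.index(subject_attr)
    clus_idx = attributes.index(cluster_attr)

    clusters = defaultdict(list)
    # Assumes string values are wrapped in single quotes, with backslash escapes.
    reader = csv.reader(data_rows, quotechar="'", escapechar="\\", skipinitialspace=True)
    for row in reader:
        clusters[row[clus_idx]].append(row[subj_idx])
    return dict(clusters)

# Made-up sample standing in for a saved cluster-assignment file.
SAMPLE = """\
@relation newsgroups_clustered
@attribute Instance_number numeric
@attribute subject string
@attribute Cluster {cluster0,cluster1}
@data
0,'Re: space shuttle launch',cluster0
1,'Hockey playoffs, round 2',cluster1
2,'Re: moon base',cluster0
""".splitlines()

groups = subjects_by_cluster(SAMPLE)
```

For a file on disk, pass an open file handle in place of `SAMPLE`, since the function only iterates over lines.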

    14. 15-way clustering

    15. Cobweb clustering

    16. Blog Analysis • What’s special about blogs?

    17. Blog analysis sites • http://dijest.com/bc/ • Called blogcount; lots of stats and news about blogs • http://blogcensus.net/?page=tools • Language, location, marketshare • http://www.perseus.com/blogsurvey/ • Stats about biggest blogs, demographics • http://www.weblogs.com/ • Notify when new content posted • http://blogpulse.com/ • Trends and recent popular topics

    18. Blogs vs. Newsgroups • Posting about products … what can we tell? • Blog: • Newsgroup: Example from Glance, Hurst, and Tomokiyo ‘04

    19. Analyzing Blogs for Market Data • Idea: examine comments about a product (or a product’s competition or market) in an automated fashion. • Application area: handheld electronic devices. Figure from Glance, Hurst, Nigam, Siegler, Stockton, & Tomokiyo, KDD’05

    20. Analyzing Blogs for Market Data Figure from Glance, Hurst, Nigam, Siegler, Stockton, & Tomokiyo, KDD’05

    21. Technology used • Post segmentation • Important phrases • Foreground vs. background corpus • Background: text about product • Foreground: certain negative paragraphs about product • Sentiment classification • What do people talk about when saying negative things about product X? • Social network analysis (on discussion boards) • What does this group of people talk about when saying negative things about product X? • Author dispersion • Many people talking about it, or just a few?
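
The foreground vs. background comparison can be sketched as a term-scoring step. The version below ranks words by the log ratio of their smoothed relative frequencies in the two corpora. This scoring rule, the whitespace tokenization, and the toy texts are assumptions for illustration and are not necessarily what Glance et al. used.

```python
import math
from collections import Counter

def distinctive_terms(foreground_docs, background_docs, top_n=5, smoothing=1.0):
    """Rank foreground words by log(P_foreground / P_background), add-one smoothed."""
    fg = Counter(w for d in foreground_docs for w in d.lower().split())
    bg = Counter(w for d in background_docs for w in d.lower().split())
    vocab = set(fg) | set(bg)
    fg_total = sum(fg.values()) + smoothing * len(vocab)
    bg_total = sum(bg.values()) + smoothing * len(vocab)

    scores = {}
    for word in fg:
        p_fg = (fg[word] + smoothing) / fg_total
        p_bg = (bg[word] + smoothing) / bg_total
        scores[word] = math.log(p_fg / p_bg)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]

# Toy data: background is all text about the product,
# foreground is the subset judged negative.
background = [
    "the phone has a screen",
    "the phone has a battery",
    "the phone battery died",
]
foreground = ["the phone battery died"]

top = distinctive_terms(foreground, background, top_n=2)
```

Words that are common everywhere ("the", "phone") score near zero, while words concentrated in the negative text rise to the top.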

    22. Example • What common phrases do people use when saying negative things about product X?

    23. Example • What do people in this group say when saying negative things about product X?

    24. Example • What do people in this group say when saying negative things about product X?

    25. Predicting Film Sales • Idea: • Use discussion before a film to predict its opening weekend box office scores • Use discussion afterwards to predict longer-term sales • Separate out topic labels from sentiment labels • Outcome: • Good predictor for opening weekend, but not for longer term • Observation: the nature of discussion gets (and thus harder to analyze) after the film has been out a while. Example from Mishne & Glance, 2006
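
One simple way to check whether pre-release discussion tracks opening-weekend sales is a correlation coefficient. The sketch below computes Pearson's r by hand. The numbers are invented toy values, not data from Mishne & Glance, and the choice of "count of positive posts" as the predictor is an assumption.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up toy numbers, one entry per film (not data from the study).
positive_posts_before_release = [12, 40, 7, 55, 23]
opening_weekend_millions = [5.0, 21.0, 3.5, 30.0, 11.0]

r = pearson_r(positive_posts_before_release, opening_weekend_millions)
```

Running the same calculation on post-release discussion against longer-term sales would be the analogous check for the second half of the idea.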

    26. Predicting Film Sales Example from Mishne & Glance, 2006

    27. Predicting Film Sales Example from Mishne & Glance, 2006

    28. Predicting Film Sales Example from Mishne & Glance, 2006

    29. Analyzing Political Blogs • Analyze: • Who links to whom • What the popularity profile looks like • A power law/Zipf/Pareto, of course • Look at structure of topic-specific blogs • By #inbound links Image from blogosphere ecosystem via Shirky
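
A quick diagnostic for a power-law popularity profile is to sort blogs by inbound links and fit a line to log(count) against log(rank). The sketch below does this with ordinary least squares. The input is synthetic, and a straight line on a log-log plot is only a rough indication, not a rigorous test for a power law.

```python
import math

def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank), ranks starting at 1."""
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic inbound-link counts that follow count = 1000 / rank exactly.
zipf_like = [1000.0 / rank for rank in range(1, 51)]
slope = loglog_slope(zipf_like)
```

A slope close to -1 corresponds to the classic Zipf shape, where the second-ranked blog has about half the inbound links of the first.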

    30. Analyzing Political Blogs • Earlier work examined books bought together in pairs at major retailers • Krebs, Divided we Stand??? http://www.orgnet.com/leftright.html • In other domains the groupings are more distributed.

    31. http://www.orgnet.com/booknet.html

    32. http://www.orgnet.com/leftright.html from Jan 2003

    33. http://www.orgnet.com/divided.html from 2004 election

    34. Analyzing Political Blogs • Study by Adamic and Glance, 2005 • Analyzed 40 most popular political blogs • 2 months preceding 2004 US presidential election • Also studied 1000 political blogs in a one-day snapshot • Findings for the latter: • Liberal and conservative blogs had distinct lists of favorite news sources, people, and topics, with some overlap on current news • Use labels from aggregator sources • Linking patterns were indeed pretty internal (91% stayed within political leaning) • More and more frequent linking among conservatives • 82% conservative linked out vs. 74% of liberal
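
A statistic like "91% stayed within political leaning" can be computed from a labeled link graph. The following is a minimal sketch under assumed inputs: a dict of blog labels and a list of (source, target) links, with made-up blog names. It is not the study's actual pipeline.

```python
def within_leaning_fraction(leaning, links):
    """Fraction of links whose source and target blogs share a label.

    leaning: {blog_name: label}; links: list of (source, target) pairs.
    Links touching an unlabeled blog are ignored.
    """
    labeled = [(s, t) for s, t in links if s in leaning and t in leaning]
    if not labeled:
        return 0.0
    same = sum(1 for s, t in labeled if leaning[s] == leaning[t])
    return same / len(labeled)

# Hypothetical blogs and links for illustration.
LEANING = {
    "blogA": "liberal",
    "blogB": "liberal",
    "blogC": "conservative",
    "blogD": "conservative",
}
LINKS = [
    ("blogA", "blogB"),
    ("blogB", "blogA"),
    ("blogC", "blogD"),
    ("blogA", "blogC"),   # the only cross-leaning link
]

fraction = within_leaning_fraction(LEANING, LINKS)
```

Here three of the four links stay within a leaning, so the fraction is 0.75.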

    35. Analyzing Political Blogs • For the 40 most popular blogs: • Looked for “echo chamber” effect • The conservative blogs are more tightly interlinked. • Question: do they repeat the same concepts more? • Measured textual similarity among blog posts • Slightly stronger within a political leaning than between, but not one orientation more than the other. • Looked for interaction with “mainstream” media • Found strong distinctions between which sources cited
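
The "measured textual similarity" step can be sketched as average pairwise cosine similarity, computed separately for pairs within the same leaning and pairs across leanings. Bag-of-words cosine, whitespace tokenization, and the toy posts are assumptions here; the slides do not say which similarity measure the study used.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of two texts."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def mean(values):
    return sum(values) / len(values) if values else 0.0

def within_and_between(posts):
    """posts: list of (leaning, text). Returns (mean within, mean between)."""
    within, between = [], []
    for (lean_a, text_a), (lean_b, text_b) in combinations(posts, 2):
        bucket = within if lean_a == lean_b else between
        bucket.append(cosine(text_a, text_b))
    return mean(within), mean(between)

# Toy posts for illustration only.
POSTS = [
    ("liberal", "tax cuts hurt workers"),
    ("liberal", "tax cuts hurt schools"),
    ("conservative", "tax cuts help growth"),
    ("conservative", "tax cuts help business"),
]

within_sim, between_sim = within_and_between(POSTS)
```

To test the echo-chamber question on the slide, the within-leaning average would be computed per orientation and the two values compared.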

    36. Image from Adamic & Glance 2005

    37. Image from Adamic & Glance 2005

    38. Image from Adamic & Glance 2005

    39. Image from Adamic & Glance 2005

    40. Image from Adamic & Glance 2005

    41. Image from Adamic & Glance 2005

    42. Next Time • Sentiment and Opinion Analysis