
Ranking for Sentiment


Presentation Transcript


  1. Ranking for Sentiment DCU at TREC 2008: The Blog Track Adam Bermingham abermingham@computing.dcu.ie

  2. DCU: Team Sentiment! CLARITY: Centre for Sensor Web Technologies Centre for Digital Video Processing National Centre For Language Technology Prof. Alan Smeaton Dr. Jennifer Foster Adam Bermingham Dr. Deirdre Hogan

  3. Sentiment Analysis I Who is favourite to win the match, Ireland or New Zealand? What is the sentiment towards Barack Obama / the new iPod / Lehman Brothers Holdings Inc? How opinionated is the discussion around Mary Harney, Enda Kenny, Zig & Zag?

  4. Sentiment Analysis II Identification of subjectivity and polarity of opinion in textual information. Crossover of Information Retrieval, NLP and Text Mining. The challenges: • Document classification, scoring • Opinion extraction • Opinion summarization, visualization • Real-world correlation

  5. SA: Document Scoring Permits re-ranking and fusion with other document information, e.g. relevance, authority, PageRank, etc. • Machine learning approaches: bag-of-words + variants • Lexicon approaches: dictionaries for sentiment, polarity and subjectivity • Alternative text features: out-of-vocabulary words, punctuation, etc.
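The bag-of-words representation mentioned on this slide can be sketched in a few lines; a minimal illustration (function and vocabulary names are mine, not from the system):

```python
from collections import Counter
import re

def bag_of_words(text, vocabulary):
    """Toy bag-of-words: count each vocabulary term's occurrences in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

vocab = ["great", "terrible", "movie"]
vec = bag_of_words("A great, great movie. Not terrible at all.", vocab)
# vec == [2, 1, 1]
```

Real systems add the "variants" the slide alludes to (TF-IDF weighting, n-grams, negation handling), but the core feature vector is just these counts.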

  6. The Blog Track • Run at TREC since 2006 • Three tasks (2008): • Find relevant blog posts • Find opinionated blog posts • Find positive & negative blog posts • Results: 1000 ranked documents per topic (query), per task • 50 topics per year • Primary evaluation metric: MAP – Mean Average Precision
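MAP, the track's primary metric, has a standard TREC definition; a minimal sketch (function names are mine):

```python
def average_precision(ranked, relevant):
    """Average precision for one topic: mean of precision@k at each relevant hit."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: mean of per-topic average precision. runs = [(ranked_list, relevant_set), ...]"""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For example, the ranking ["d1", "d2", "d3"] with relevant set {"d1", "d3"} scores (1/1 + 2/3) / 2 ≈ 0.833.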

  7. Topic Example <num> Number: 1049 </num> <title> YouTube </title> <desc> Description: Find views about the YouTube video-sharing website. </desc> <narr> Narrative: The YouTube video-sharing website provides internet users with a relatively new way to share videos. Documents which express views about how well it succeeds in meeting the needs of users are relevant. </narr>

  8. Assessments Relevance judgements: pooling, human assessors. QRELS: • Not relevant • Relevant, non-opinionated • Relevant, positively opinionated • Relevant, negatively opinionated • Relevant, mixed opinionated • Not judged. 32,021 QRELS from 2006 and 2007 available.

  9. Corpus – Blog06 • >3 million blog posts • Crawled over a few weeks in 2006 • Permalink HTML • Also available: homepage HTML, RSS • Real-world • Includes: spam blogs (“splogs”), multilingual blogs, inappropriate content

  10. Blogs “Weblog” coined 1997. A website containing regular timestamped posts in chronological order. Universal McCann (March 2008): 184 million worldwide have started a blog | 26.4 million in the US. 346 million worldwide read blogs | 60.3 million in the US. 77% of active Internet users read blogs.

  11. Blog – an example (annotated screenshot: blog, date, post, links, tags, comments)

  12. Approach • Get relevant documents • Assess results for sentiment using three feature sets • Re-rank relevant results using late fusion of feature sets

  13. Approach – feature set Lexicon Features Aggregate sentiment scores for a document’s constituent words in a sentiment lexicon. Surface Features Textual features which do not require parsing or syntactic understanding of the sentence structure. Syntactic Features Textual features derived from parsing and part-of-speech tagging documents.

  14. System Architecture

  15. Retrieval Terrier (University of Glasgow; open source, Java). Retrieval: Okapi BM25. Query expansion: Bo1 (Bose–Einstein) Divergence From Randomness.
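The Okapi BM25 model named here has a well-known per-term scoring formula; a minimal sketch (the k1 and b defaults are common textbook values, not necessarily Terrier's configuration):

```python
import math

def bm25_term(tf, df, N, doc_len, avg_len, k1=1.2, b=0.75):
    """Okapi BM25 contribution of one query term to one document's score.

    tf: term frequency in the document; df: document frequency of the term;
    N: collection size; doc_len / avg_len: document and average lengths.
    """
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return idf * norm
```

The full document score is the sum of these contributions over the (expanded) query terms.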

  16. Preprocessing Parse HTML (HTMLParser tool); divide into text sections according to breaking HTML elements. Noise removal – discard sections with: • A high anchor-text to non-anchor-text ratio (e.g. ads, blogroll) • A high non-alphabetic to alphabetic character ratio (e.g. dates, code, gobbledegook)
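The two ratio heuristics on this slide can be sketched directly; a toy version (the thresholds are illustrative assumptions, not the values the system used):

```python
def is_noise(section_text, anchor_text, anchor_max=0.7, nonalpha_max=0.5):
    """Flag a section as noise using the slide's two ratio heuristics.

    anchor_text: the portion of section_text that came from anchor (<a>) tags.
    Thresholds are illustrative, not the paper's actual values.
    """
    if not section_text:
        return True
    anchor_ratio = len(anchor_text) / len(section_text)        # link-heavy: ads, blogroll
    alpha = sum(c.isalpha() for c in section_text)
    nonalpha_ratio = 1 - alpha / len(section_text)             # dates, code, gobbledegook
    return anchor_ratio > anchor_max or nonalpha_ratio > nonalpha_max
```

A timestamp like "2008-11-12 :: 17:45" is dropped (almost no alphabetic characters), while an ordinary sentence passes.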

  17. Machine Learning WEKA – Waikato Environment for Knowledge Analysis. Java; good entry point; performance issues (?). Three-way binary logistic regression classification. Scores are obtained from the class distributions for classified documents.
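The slide says ranking scores come from the classifier's class distributions; one plausible reading (the class labels and the exact score formulas here are my assumptions, not stated on the slide):

```python
def sentiment_scores(dist):
    """Derive ranking scores from a classifier's class-probability distribution.

    dist maps the three assumed classes ("positive", "negative",
    "non-opinionated") to probabilities summing to 1.
    """
    opinion = dist["positive"] + dist["negative"]    # opinionatedness for task 2
    polarity = dist["positive"] - dist["negative"]   # signed polarity for task 3
    return opinion, polarity
```

A distribution of 0.5 / 0.3 / 0.2 over positive / negative / non-opinionated would give an opinion score of 0.8 and a polarity of +0.2.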

  18. Syntactic Features Parsed using the Charniak and Johnson re-ranking parser on ICHEC – Irish Centre for High End Computing (thanks to Joachim Wagner!). The 50 most discriminative part-of-speech unigrams, bigrams and trigrams. Penn Treebank phrasal types: • Normalised counts of types • Normalised counts of types as root of tree • Normalised counts of parse tree structures likely to reflect subjectivity
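The normalised POS n-gram counts can be sketched given an already-tagged sentence (function name and the tiny example tags are mine; tag selection to the 50 most discriminative n-grams is omitted):

```python
from collections import Counter

def pos_ngram_features(tags, n_values=(1, 2, 3)):
    """Normalised counts of part-of-speech unigrams, bigrams and trigrams.

    tags: a sequence of POS tags for one document, e.g. Penn Treebank tags.
    Returns {ngram_tuple: count / total_ngrams_of_that_length}.
    """
    feats = {}
    for n in n_values:
        grams = [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]
        counts = Counter(grams)
        total = len(grams) or 1
        for gram, c in counts.items():
            feats[gram] = c / total
    return feats
```

For the tag sequence DT NN VBZ JJ, the unigram (DT,) gets weight 1/4 and the bigram (DT, NN) gets 1/3.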

  19. Surface Features Normalized word counts for a manually created lexicon of obscenities and emotive and polarised words. Non-word characters and character sequences such as punctuation and emoticons. Regex patterns to detect unusual word and punctuation structures, e.g. “arrrrgh”, “?!?!”, “....”, “b****”. Document measurements.
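Regexes covering the slide's four examples might look like this (the patterns and names are illustrative, not the system's actual feature set):

```python
import re

# Illustrative patterns for the unusual-structure examples on the slide.
PATTERNS = {
    "stretched": re.compile(r"\b\w*(\w)\1{2,}\w*\b"),  # "arrrrgh": a letter repeated 3+ times
    "mixed_punct": re.compile(r"[!?][!?]+"),           # "?!?!": runs of ! and ?
    "ellipsis": re.compile(r"\.{3,}"),                 # "....": 3+ dots
    "masked": re.compile(r"\b\w\*{2,}"),               # "b****": censored obscenity
}

def surface_counts(text):
    """Count matches of each surface pattern in the text."""
    return {name: len(p.findall(text)) for name, p in PATTERNS.items()}
```

These counts would then sit alongside lexicon word counts and document measurements in the surface feature vector.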

  20. Lexicon Features SentiWordNet: positivity and negativity scores for each synset in WordNet. Scoring: weighted sum of mean positivity and negativity scores per document.
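The "weighted sum of mean positivity and negativity scores" can be sketched with a toy lexicon (real SentiWordNet scores synsets, not surface words, and requires sense disambiguation; the entries and weights below are invented for illustration):

```python
# Toy stand-in for SentiWordNet: word -> (positivity, negativity).
LEXICON = {
    "good": (0.75, 0.0),
    "bad": (0.0, 0.625),
    "awful": (0.0, 0.875),
}

def lexicon_score(tokens, w_pos=1.0, w_neg=1.0):
    """Weighted combination of mean positivity and negativity over matched tokens."""
    hits = [LEXICON[t] for t in tokens if t in LEXICON]
    if not hits:
        return 0.0
    mean_pos = sum(p for p, _ in hits) / len(hits)
    mean_neg = sum(n for _, n in hits) / len(hits)
    return w_pos * mean_pos - w_neg * mean_neg
```

So ["good", "bad", "the"] matches two entries, giving mean positivity 0.375 and mean negativity 0.3125, hence a mildly positive score of 0.0625 with unit weights.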

  21. Weighting • Weighted CombSUM • Weights learned from 2006, 2007 MAP rather than by cross-validation • Scores from the 3 classifiers fused before merging with the relevance score
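Weighted CombSUM itself is a simple per-document weighted sum of system scores; a minimal sketch (the example weights are arbitrary, not the learned ones):

```python
def comb_sum(score_lists, weights):
    """Weighted CombSUM: fuse per-document scores from several systems.

    score_lists: one {doc_id: score} dict per system; weights: one weight per system.
    Documents missing from a system simply contribute nothing for it.
    """
    fused = {}
    for scores, w in zip(score_lists, weights):
        for doc, s in scores.items():
            fused[doc] = fused.get(doc, 0.0) + w * s
    return fused
```

In the approach described here this is applied twice: once to fuse the three classifiers' sentiment scores, and again to merge the result with the relevance score.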

  22. Weighting (figures: learned fusion weights for opinion finding and for polarised opinion finding)

  23. Results (charts: Baseline (Opinion), Opinion Finding, Polarised Opinion Finding)

  24. Results – per topic

  25. Preliminary Conclusions • Syntactic features appear to subsume surface features • Observed during weight training • Significant gains can be had through an efficient, uniform baseline • Subjectivity important in polarity detection • Bigger difference in writing style between objective and subjective texts than between negative and positive texts.

  26. Future Work • Further work on parse trees for sentiment classification • Movie review classification – Wolfgang Seeker • Sub-document relevance and sentiment modelling • Unstructured text • Logical levels – Sentence? Phrase? Paragraph? Passage? N.O.T.A?

  27. Thanks! • TREC – Blog Track Wiki • http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG/ • Opinion Mining and Sentiment Analysis Survey • http://www.cs.cornell.edu/home/llee/opinion-mining-sentiment-analysis-survey.html • TREC Blog Track 2007 overview • http://trec.nist.gov/pubs/trec16/papers/BLOG.OVERVIEW08.pdf • Tools: • Weka: http://www.cs.waikato.ac.nz/ml/weka/ • Terrier: http://ir.dcs.gla.ac.uk/terrier/ • HTMLParser: http://htmlparser.sourceforge.net/ • SentiWordNet: http://sentiwordnet.isti.cnr.it/
