Ranking for Sentiment • DCU at TREC 2008: The Blog Track • Adam Bermingham • abermingham@computing.dcu.ie
DCU: Team Sentiment! • CLARITY: Centre for Sensor Web Technologies • Centre for Digital Video Processing • National Centre for Language Technology • Prof. Alan Smeaton • Dr. Jennifer Foster • Adam Bermingham • Dr. Deirdre Hogan
Sentiment Analysis I • Who is favourite to win the match, Ireland or New Zealand? • What is the sentiment towards Barack Obama / the new iPod / Lehman Brothers Holdings Inc? • How opinionated is the discussion around Mary Hearney, Enda Kenny, Zig & Zag?
Sentiment Analysis II • Identification of subjectivity and polarity of opinion in textual information • Crossover of Information Retrieval, NLP, Text Mining • The Challenges: document classification and scoring; opinion extraction; opinion summarization and visualization; real-world correlation
SA: Document Scoring • Permits reranking and fusion with other document information (e.g. relevance, authority, PageRank) • Machine Learning approaches: bag-of-words + variants • Lexicon approaches: dictionaries for sentiment, polarity and subjectivity • Alternative text features: out-of-vocabulary words, punctuation, etc.
The Blog Track • Run at TREC since 2006 • Three tasks (2008): • Find relevant blog posts • Find opinionated blog posts • Find positive & negative blog posts • Results: 1000 ranked documents per topic (query), per task • 50 topics per year • Primary evaluation metric: MAP – Mean Average Precision
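As a rough illustration of the evaluation metric named on this slide, the sketch below computes average precision per topic and averages it over topics. The function names and the dictionary layout of `runs` and `qrels` are assumptions made for this example; official scores come from the TREC evaluation tooling, not from this code.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one topic.

    Precision is taken at every rank that holds a relevant document,
    and the sum is divided by the total number of relevant documents.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """MAP over all topics in qrels.

    runs:  topic id -> ranked list of document ids (hypothetical layout)
    qrels: topic id -> set of relevant document ids (hypothetical layout)
    """
    topics = list(qrels)
    total = sum(average_precision(runs.get(t, []), qrels[t]) for t in topics)
    return total / len(topics)
```

A topic with no returned documents contributes an average precision of zero, so missing topics lower MAP rather than being skipped.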
Topic Example <num> Number: 1049 </num> <title> YouTube </title> <desc> Description: Find views about the YouTube video-sharing website. </desc> <narr> Narrative: The YouTube video-sharing website provides internet users with a relatively new way to share videos. Documents which express views about how well it succeeds in meeting the needs of users are relevant. </narr>
Assessments • Relevance judgements • Pooling • Human assessors • QRELS: not relevant; relevant, non-opinionated; relevant, positively opinionated; relevant, negatively opinionated; relevant, mixed opinionated; not judged • 32,021 QRELS from 2006, 2007 available
Corpus – Blog06 • >3 million blog posts • Crawled over a few weeks in 2006 • Permalink HTML • Also available: homepage HTML, RSS • Real-world • Includes: spam blogs (“splogs”), multilingual blogs, inappropriate content
Blogs • “Weblog” coined 1997 • A website containing regular timestamped posts in chronological order • Universal McCann (March 2008): 184 million worldwide have started a blog (26.4 million in the US) • 346 million worldwide read blogs (60.3 million in the US) • 77% of active Internet users read blogs
Blog – an example • Labelled parts of a blog page: Blog, Date, Post, Links, Tags, Comments
Approach • Get relevant documents • Assess results for sentiment using three feature sets • Re-rank relevant results using late fusion of feature sets
Approach – feature set • Lexicon Features: aggregate sentiment scores for a document’s constituent words in a sentiment lexicon • Surface Features: textual features which do not require parsing or syntactic understanding of the sentence structure • Syntactic Features: textual features derived from parsing and part-of-speech tagging documents
Retrieval • Terrier (University of Glasgow): open source, Java • Retrieval: Okapi BM25 • Query Expansion: Bo1 (Bose-Einstein), Divergence From Randomness
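To make the retrieval model concrete, here is a minimal sketch of Okapi BM25 scoring over tokenised documents. It is not Terrier's implementation: the IDF variant (with the `1 +` inside the logarithm, which keeps weights non-negative), the defaults `k1=1.2` and `b=0.75`, and the function name are all assumptions chosen for illustration, and query expansion is not shown.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document in `docs` against `query_terms`.

    docs is a non-empty list of non-empty token lists.
    k1 controls term-frequency saturation, b controls length normalisation.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores
```

For example, `bm25_scores(["youtube"], docs)` returns one score per document, with zero for documents that do not contain the query term.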
Preprocessing • Parse HTML: HTMLParser tool; divide into text sections according to breaking HTML elements • Noise Removal: discard sections with a high anchor text to non-anchor text ratio (e.g. ad, blogroll) or a high non-alphabetic character to alphabetic character ratio (e.g. date, code, gobbledegook)
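The two noise-removal rules can be sketched as a simple filter over text sections. The slides do not give the thresholds, so the cut-off values of 1.0, the function names, and the assumption that each section arrives with its anchor text already extracted are all hypothetical.

```python
def _ratio(numerator, denominator):
    """Ratio that treats x/0 as infinite (for x > 0) instead of failing."""
    if denominator == 0:
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator


def is_noise(text, anchor_text="", max_anchor_ratio=1.0, max_nonalpha_ratio=1.0):
    """Return True if a text section looks like navigation, dates or code.

    text:        all visible text of the section
    anchor_text: the part of `text` that sits inside links (assumed input)
    Thresholds are illustrative, not the values used in the DCU runs.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return True  # empty sections carry no content

    anchor_len = sum(1 for c in anchor_text if not c.isspace())
    non_anchor_len = max(len(chars) - anchor_len, 0)

    alpha = sum(1 for c in chars if c.isalpha())
    non_alpha = len(chars) - alpha

    return (_ratio(anchor_len, non_anchor_len) > max_anchor_ratio
            or _ratio(non_alpha, alpha) > max_nonalpha_ratio)
```

A blogroll made up entirely of links and a bare timestamp are both discarded under these settings, while an ordinary sentence is kept.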
Machine Learning • WEKA - Waikato Environment for Knowledge Analysis • Java • Good entry point • Performance issues (?) • Three-way Binary Logistic Regression Classification • Scores are obtained from distributions for classified documents
Syntactic Features • Parsed using the Charniak and Johnson re-ranking parser* • ICHEC – Irish Centre for High-End Computing • The 50 most discriminative part-of-speech unigrams, bigrams and trigrams • Penn Treebank phrasal types: normalised counts of types; normalised counts of types as root of tree; normalised counts of parse tree structures likely to reflect subjectivity • *Thanks to Joachim Wagner!
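As a sketch of the part-of-speech n-gram features, the function below turns a sequence of Penn Treebank tags into normalised unigram, bigram and trigram counts. It assumes the document has already been tagged or parsed elsewhere; the function name and the choice to normalise by the number of n-grams of the same order are assumptions, and selecting the 50 most discriminative n-grams is not shown.

```python
from collections import Counter


def pos_ngram_features(tags, max_n=3):
    """Normalised counts of POS n-grams for n = 1..max_n.

    tags: list of part-of-speech tags for one document, in order.
    Each count is divided by the number of n-grams of that order,
    so feature values are comparable across documents of different length.
    """
    features = {}
    for n in range(1, max_n + 1):
        grams = [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]
        total = len(grams)
        for gram, count in Counter(grams).items():
            features[" ".join(gram)] = count / total
    return features
```

For the tag sequence `["PRP", "VBP", "DT", "NN"]` this yields four unigram features, three bigram features and two trigram features.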
Surface Features • Normalised word counts for a manually created lexicon of obscenities and emotive and polarised words • Non-word characters and character sequences such as punctuation and emoticons • Regex patterns to detect unusual word and punctuation structures, e.g. “arrrrgh”, “?!?!”, “....”, “b****” • Document measurements
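The regex-based surface features could look something like the following. The patterns are guesses written to match the examples on the slide ("arrrrgh", "?!?!", "....", "b****"); the actual expressions used in the DCU system are not given, and the feature names and per-token normalisation are assumptions.

```python
import re

# Illustrative patterns only; not the expressions used in the original system.
PATTERNS = {
    # A word containing the same letter three or more times in a row.
    "elongated_word": re.compile(r"\b\w*?([a-zA-Z])\1{2,}\w*\b"),
    # Runs of question and exclamation marks.
    "repeated_punct": re.compile(r"[?!]{2,}"),
    # Three or more consecutive full stops.
    "ellipsis": re.compile(r"\.{3,}"),
    # Letters followed by two or more asterisks (censored words).
    "censored_word": re.compile(r"\b[A-Za-z]+\*{2,}"),
    # A few common western emoticons such as :-) ;) =D
    "emoticon": re.compile(r"[:;=][-']?[)(DPp]"),
}


def surface_features(text):
    """Count pattern matches, normalised by the number of whitespace tokens."""
    n_tokens = max(len(text.split()), 1)
    return {
        name: sum(1 for _ in pattern.finditer(text)) / n_tokens
        for name, pattern in PATTERNS.items()
    }
```

Normalising by token count keeps long posts from dominating simply because they contain more text.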
Lexicon Features • SentiWordNet: positivity and negativity score for each synset in WordNet • Scoring: weighted sum of mean positivity and negativity scores per document
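A minimal sketch of the lexicon scoring step follows. It uses a tiny hand-made dictionary in place of SentiWordNet, so the words, the score values, the weights, and the decision to average only over words found in the lexicon are all assumptions for illustration.

```python
# Hypothetical stand-in for SentiWordNet: word -> (positivity, negativity).
TOY_LEXICON = {
    "good": (0.75, 0.0),
    "great": (0.75, 0.0),
    "bad": (0.0, 0.625),
}


def lexicon_score(tokens, lexicon, w_pos=1.0, w_neg=1.0):
    """Weighted sum of the mean positivity and mean negativity of a document.

    Means are taken over tokens present in the lexicon (an assumption).
    With w_pos = w_neg = 1 the score reflects overall subjectivity;
    with w_neg = -1 it reflects polarity.
    """
    scored = [lexicon[t] for t in tokens if t in lexicon]
    if not scored:
        return 0.0
    mean_pos = sum(pos for pos, _ in scored) / len(scored)
    mean_neg = sum(neg for _, neg in scored) / len(scored)
    return w_pos * mean_pos + w_neg * mean_neg
```

Calling `lexicon_score(tokens, TOY_LEXICON, w_neg=-1.0)` gives a signed polarity score, while the default weights give an opinionatedness score.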
Weighting • Weighted Comb Sum • Learning weights from 2006, 2007 MAP • Rather than cross validation • Scores from 3 classifiers fused before merging with relevance score
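The late-fusion step can be sketched as a weighted CombSUM applied twice: once across the three classifier scores, then once more to merge the fused sentiment score with the relevance score. The min-max normalisation, the function names and the weight values in any call are assumptions; the slides state only that weights were learned from 2006 and 2007 MAP.

```python
def min_max(scores):
    """Rescale a doc -> score mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_comb_sum(score_sets, weights):
    """Weighted CombSUM: sum of normalised scores, each scaled by its weight."""
    fused = {}
    for scores, weight in zip(score_sets, weights):
        for doc, s in min_max(scores).items():
            fused[doc] = fused.get(doc, 0.0) + weight * s
    return fused


def rerank(relevance, classifier_scores, classifier_weights, sentiment_weight):
    """Fuse the classifier scores first, then merge with relevance.

    relevance:         doc -> retrieval score
    classifier_scores: list of doc -> score mappings (lexicon, surface, syntactic)
    sentiment_weight:  share of the final score given to sentiment (hypothetical)
    """
    sentiment = weighted_comb_sum(classifier_scores, classifier_weights)
    final = weighted_comb_sum(
        [relevance, sentiment],
        [1.0 - sentiment_weight, sentiment_weight],
    )
    return sorted(final, key=final.get, reverse=True)
```

Setting `sentiment_weight` to zero reproduces the relevance-only baseline ranking, which makes it easy to check how much the sentiment evidence moves documents.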
Weighting Opinion finding: Polarised Opinion finding:
Results Baseline (Opinion) Opinion Finding Polarised Opinion Finding
Preliminary Conclusions • Syntactic features appear to subsume surface features • Observed while training weights • Significant gains can be had through an efficient, uniform baseline • Subjectivity is important in polarity detection • There is a bigger difference in writing style between objective and subjective texts than between negative and positive texts
Future Work • Further work on parse trees for sentiment classification • Movie review classification – Wolfgang Seeker • Sub-document relevance and sentiment modelling • Unstructured text • Logical levels – Sentence? Phrase? Paragraph? Passage? N.O.T.A?
Thanks! • TREC – Blog Track Wiki • http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG/ • Opinion Mining and Sentiment Analysis Survey • http://www.cs.cornell.edu/home/llee/opinion-mining-sentiment-analysis-survey.html • TREC Blog Track 2007 overview • http://trec.nist.gov/pubs/trec16/papers/BLOG.OVERVIEW08.pdf • Tools: • Weka: http://www.cs.waikato.ac.nz/ml/weka/ • Terrier: http://ir.dcs.gla.ac.uk/terrier/ • HTMLParser: http://htmlparser.sourceforge.net/ • SentiWordNet: http://sentiwordnet.isti.cnr.it/