Sentiment analysis

Sentiment analysis: or, how to find happiness

Why do we want sentiment info?
• Useful input for detection
  • Brand sentiment
• Useful input for prediction
  • Stock market, box office revenues, political outcomes
  • Potentially for social uprisings, terrorist incidents

What do you really want to know?

Brand satisfaction

Quality of life

Abstract predictor

Three considerations for a sentiment analysis system
• Data cleaning
• One piece of the puzzle
• Simple works best

Data cleaning (Because it’s a dirty world)

Data cleaning: on Twitter…
• Spam accounts
• Bots (weather, sport, etc.)
Answer:
a) http://trst.me/ (from infochimps)
b) Make your own system

Data cleaning: from sentences to words
• Tokenize the sentence(s) into words. (This may not be as easy as it seems.)
• Maybe remove stop words or apply stemming, depending on the application.
• Pick a minimum number of times a word must appear in the training set; ignore words seen fewer times than that.
• Build a dictionary of the remaining words.
Answer:
a) Twokenize.py
b) Write your own

One piece of the puzzle

Always make it part of a system
• When it’s wrong (and this is quite often) it will be very obviously wrong
• People don’t need to see this
• This doesn’t actually detract from the utility of the system

Success:
• Tracking political polls.
• Predicting box office revenues.
• Predicting the stock market.

Simple works best (for now)

The quick version
• Use a supervised/semi-supervised learning method.
• For most cases I would recommend Naïve Bayes on the bag-of-words representation. It is very simple to implement and gives near-best performance.
• If you don’t have any examples of happy/sad tweets (for your purpose), use known keywords, such as emoticons, to label them.
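To show how little code the recommended approach needs, here is a minimal sketch of a multinomial Naïve Bayes classifier over bag-of-words counts. The class name `NaiveBayes`, the add-one (Laplace) smoothing parameter `alpha`, and the toy training tweets are assumptions for illustration only; the slides do not prescribe a particular variant.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes on token lists, with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing strength (assumed default)
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, docs, labels):
        for tokens, label in zip(docs, labels):
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, tokens):
        """Return log P(label) + sum of log P(word | label) for each label."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                if tok in self.vocab:  # skip words never seen in training
                    score += math.log((counts[tok] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)

# Toy data, invented for the example.
train = [
    ("love this great phone", "happy"),
    ("great day", "happy"),
    ("hate this awful phone", "sad"),
    ("awful day", "sad"),
]
model = NaiveBayes().fit([t.split() for t, _ in train], [y for _, y in train])
guess = model.predict("what a great phone , love it".split())
```

Working in log space avoids numeric underflow on longer texts, and the smoothing keeps a word seen under only one label from forcing the other label's probability to zero.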

^_^
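One way to turn emoticons into training labels is sketched below. The two emoticon sets and the function name `emoticon_label` are assumptions for illustration, not a list taken from the talk. Dropping the emoticon from the returned words is a design choice made here, so that a classifier trained on the result has to rely on the other words.

```python
# Hypothetical emoticon sets; extend them for your own data.
HAPPY = {":)", ":-)", ":D", "=)", "^_^"}
SAD = {":(", ":-(", "=(", ":'("}

def emoticon_label(text):
    """Return (label, words) for a tweet, or None if it gives no clear signal."""
    tokens = text.split()
    has_happy = any(t in HAPPY for t in tokens)
    has_sad = any(t in SAD for t in tokens)
    if has_happy == has_sad:
        return None  # no emoticon at all, or mixed happy and sad
    label = "happy" if has_happy else "sad"
    words = [t for t in tokens if t not in HAPPY and t not in SAD]
    return label, words

raw = ["great game today ^_^", "missed the bus :(", "just a tweet", "so so :) :("]
labelled = [r for r in map(emoticon_label, raw) if r is not None]
```

Only the first two tweets end up in `labelled`; the third has no emoticon and the fourth mixes both kinds, so both are discarded rather than guessed at.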

Things that don’t really help (generally less than 2% improvement)
• More advanced classifiers (e.g. SVMs)
• Part-of-speech tagging
• Parse trees
• Semi-supervised methods, if you have very large amounts of data

The formula for happiness

Basic positive/negative Twitter sentiment word list
• http://alexdavies.net/projects/twitter-sentiment-word-lists/
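A positive/negative word list can be used directly as a crude scorer. The sketch below assumes the two lists have already been loaded into Python sets; it makes no assumption about the file format of the lists at the URL above, and the example words are invented for illustration rather than taken from them. The function name `lexicon_score` is likewise hypothetical.

```python
def lexicon_score(tokens, positive, negative):
    """Score in [-1, 1]: +1 if every matched word is positive, -1 if every one is negative."""
    pos = sum(1 for t in tokens if t in positive)
    neg = sum(1 for t in tokens if t in negative)
    matched = pos + neg
    if matched == 0:
        return 0.0  # no listed words found: treat as neutral
    return (pos - neg) / matched

# Illustrative word sets, not the contents of the linked lists.
positive = {"love", "great"}
negative = {"hate", "awful"}

score = lexicon_score("i love this great phone".split(), positive, negative)
```

Dividing by the number of matched words keeps long and short tweets on the same scale.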

Thanks.
