Who Needs Polls? Gauging Public Opinion from Twitter Data
David Cummings, Haruki Oh, Ningxuan (Jason) Wang
From Tweets to Poll Numbers
• Motivation: People spend millions of dollars on polling every year (politics, economy, entertainment)
• Millions of posts appear on Twitter every day
• Can we model public opinion using tweets?
• Data: 476 million tweets from June to December 2009, courtesy of Jure Leskovec
• Public polls from The Gallup Organization (presidential approval, economic confidence) and Rasmussen Reports (generic Congressional ballot)
• Goal: high correlation with public opinion polls
• All correlation figures use a 6-day smoothing window unless stated otherwise
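The evaluation described above (smooth a daily tweet signal, then correlate it with a poll series) can be sketched as follows. This is a minimal illustration, not the authors' code: the slides do not say whether the smoothing window is trailing or centered, or whether the poll series is smoothed as well, so the trailing average and the function names `moving_average`, `pearson`, and `smoothed_correlation` are assumptions.

```python
from statistics import mean


def moving_average(values, window):
    """Trailing moving average (assumed form of the smoothing window).

    Early positions average over however many points are available.
    """
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        smoothed.append(mean(values[start:i + 1]))
    return smoothed


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def smoothed_correlation(tweet_signal, poll_series, window=6):
    """Correlate a smoothed daily tweet signal with a daily poll series."""
    return pearson(moving_average(tweet_signal, window), poll_series)
```

A result of 0.604 from `smoothed_correlation` would correspond to the "60.4% correlation" style of reporting used in these slides.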
Approach 1: Volume
• The simplest metric: percentage of tweets that mention a given topic in a certain time window
• Moderate negative correlation (-36.3%, -35.7%) for economy and Congressional ballot: people mention things they want to complain about more often
• Higher correlation (52.4%) for Obama
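The volume metric can be sketched as a per-day fraction of tweets that mention a topic. The input shape (pairs of date and text), the keyword list, and the naive case-insensitive substring match are all assumptions for illustration; the slides do not describe how topic mentions were detected.

```python
from collections import defaultdict
from datetime import date


def daily_volume(tweets, keywords):
    """Return {day: fraction of that day's tweets mentioning any keyword}.

    tweets: iterable of (date, text) pairs (hypothetical input format).
    keywords: topic terms, matched as case-insensitive substrings.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    lowered = [k.lower() for k in keywords]
    for day, text in tweets:
        totals[day] += 1
        text_lower = text.lower()
        if any(k in text_lower for k in lowered):
            hits[day] += 1
    return {day: hits[day] / totals[day] for day in totals}


sample = [
    (date(2009, 6, 1), "Obama speaks tonight"),
    (date(2009, 6, 1), "time for lunch"),
    (date(2009, 6, 2), "nothing to report"),
]
volume = daily_volume(sample, ["obama"])
```

The resulting daily series would then be smoothed over the window and compared against the poll numbers.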
Approach 2: Generic Sentiment
• Can we distinguish between positive and negative sentiment of tweets?
• University of Pittsburgh OpinionFinder subjective polarity lexicon
• “conceited” strong negative -10
• “ironic” weak negative -5
• “trendy” weak positive +5
• “illuminating” strong positive +10
• Sum the word scores for a tweet to classify it as positive, negative, or neutral; then subtract the negative count from the positive count and normalize over the window
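The scoring procedure above can be sketched as follows. The four lexicon entries and their weights come from the slide; the tokenizer, the zero threshold for "neutral", and normalizing by the total number of tweets in the window are assumptions, since the slides do not specify them.

```python
import re

# Only the four example entries shown on the slide; the real lexicon is much larger.
LEXICON = {"conceited": -10, "ironic": -5, "trendy": 5, "illuminating": 10}


def classify(text, lexicon=LEXICON):
    """Sum word scores and label the tweet by the sign of the total."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(lexicon.get(word, 0) for word in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def window_sentiment(texts, lexicon=LEXICON):
    """(positive count - negative count) / total tweets in the window.

    Dividing by the total tweet count is an assumed normalization.
    """
    labels = [classify(text, lexicon) for text in texts]
    if not labels:
        return 0.0
    return (labels.count("positive") - labels.count("negative")) / len(labels)
```

For example, a window with two positive tweets, one negative tweet, and one neutral tweet yields a sentiment value of 0.25 under this normalization.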
Approach 2: Generic Sentiment
• Good results on economic confidence: 60.4% correlation (70.1% on a 15-day window)
• Poor performance on presidential approval and Congressional ballot: -24.5% and 21.5% correlation, respectively
• Is sentiment about politics expressed differently?
Approach 3: LM-based Classification
• Train three language models (positive, negative, and neutral) on hand-classified data
• Classify each tweet according to the language model that assigns it the highest probability
• Applied to the case of Obama: manually classified 3,633 tweets
• “can we all talk about how awesome Obama is?”
• “that Obama sticker on your car might as well say ‘Yes I’m stupid’ #tcot #iamthemob #teaparty #glennbeck”
• Then we tested the language models: the best performer was a linearly interpolated bigram model
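A minimal sketch of this classifier, assuming one linearly interpolated bigram model per class and picking the class whose model gives the tweet the highest log-probability. The interpolation weight `lam=0.7`, the add-one smoothing of the unigram term, whitespace tokenization, and the absence of class priors are all illustrative assumptions; the slides do not give these details.

```python
import math
from collections import Counter


class InterpolatedBigramLM:
    """P(w | prev) = lam * P_bigram(w | prev) + (1 - lam) * P_unigram(w)."""

    def __init__(self, sentences, vocab_size, lam=0.7):
        self.lam = lam
        self.vocab_size = vocab_size
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.context = Counter()
        for tokens in sentences:
            padded = ["<s>"] + tokens + ["</s>"]
            self.unigrams.update(padded[1:])
            for prev, word in zip(padded, padded[1:]):
                self.bigrams[(prev, word)] += 1
                self.context[prev] += 1
        self.total = sum(self.unigrams.values())

    def log_prob(self, tokens):
        padded = ["<s>"] + tokens + ["</s>"]
        total_log_prob = 0.0
        for prev, word in zip(padded, padded[1:]):
            # Add-one smoothed unigram keeps every probability above zero.
            p_uni = (self.unigrams[word] + 1) / (self.total + self.vocab_size)
            seen = self.context[prev]
            p_bi = self.bigrams[(prev, word)] / seen if seen else 0.0
            total_log_prob += math.log(self.lam * p_bi + (1 - self.lam) * p_uni)
        return total_log_prob


def train(labeled):
    """labeled: list of (label, text) pairs; returns {label: model}."""
    by_label, vocab = {}, set()
    for label, text in labeled:
        tokens = text.lower().split()
        by_label.setdefault(label, []).append(tokens)
        vocab.update(tokens)
    vocab_size = len(vocab) + 2  # plus "</s>" and one unknown-word slot
    return {lab: InterpolatedBigramLM(s, vocab_size) for lab, s in by_label.items()}


def classify(models, text):
    """Return the label whose language model scores the text highest."""
    tokens = text.lower().split()
    return max(models, key=lambda label: models[label].log_prob(tokens))


models = train([
    ("positive", "obama is awesome"),
    ("positive", "love obama"),
    ("negative", "obama is stupid"),
    ("negative", "hate obama"),
    ("neutral", "obama gave a speech today"),
])
```

All three models share one vocabulary size so that their probabilities are comparable; with only a few training tweets per class, the unigram term does most of the work for unseen word pairs.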
Approach 3: LM-based Classification
• Much-improved results on presidential approval: 49.4% correlation
• Throwing out retweets and duplicate tweets helps a little more: 55.9% correlation
• Finally, combining volume and LM-based sentiment gives the best results: 63.3% correlation, or 69.6% on a 15-day window
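The retweet and duplicate filtering step might look like the sketch below. The slides do not say how retweets or duplicates were identified, so detecting the "RT @user" convention and treating tweets as duplicates when they match after lowercasing and whitespace normalization are both assumptions.

```python
import re

# Assumed retweet marker: the "RT @username" convention, anywhere in the tweet.
RETWEET_PATTERN = re.compile(r"\brt\s+@\w+", re.IGNORECASE)


def drop_retweets_and_duplicates(texts):
    """Keep the first copy of each tweet and discard retweets."""
    seen = set()
    kept = []
    for text in texts:
        if RETWEET_PATTERN.search(text):
            continue
        key = " ".join(text.lower().split())
        if key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept


filtered = drop_retweets_and_duplicates([
    "Obama rocks",
    "RT @bob Obama rocks",
    "obama  rocks",
    "Economy is bad",
])
```

Here `filtered` keeps only the first "Obama rocks" tweet and the economy tweet; the filtered set would then be passed to the sentiment classifier.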