Event Analytics on Social Media: Challenges and Solutions

Yuheng Hu
Committee Members: Dr. Subbarao Kambhampati (Chair), Dr. Eric Horvitz, Dr. John Krumm, Dr. Huan Liu, Dr. Hari Sundaram

Since the dawn of civilization, people have congregated in town squares to discuss events. The emergence of social media has now created a sprawling virtual town square whose scope is vast and whose chatter can be captured, opening exciting possibilities for analyzing what people are actually saying.

Example events: debate, Super Bowl, I-5 bridge collapse, Obama’s selfie

What is the relation between an event and its tweets? Which part of the event did a tweet refer to?

What were the topics of the event and tweets? What sentiments did the event elicit in tweets?

How to characterize the crowds’ tweeting behavior?

How to detect an event from social media responses?

How to predict the crowds’ engagement in future events?

How to address these challenges?
• How to find social media responses about the events
• How to model relations between an event and its responses
• How to link social media responses to events
• How to infer topics and sentiments of social media responses
• How to characterize the crowds’ behavior in response to events
• How to distill insights about an event based on social media responses
• How to predict the future development of an event
• How to predict the crowds’ engagement in future events

Potential applications: computational journalism, political campaigns

The event master: “Fox News Unveils New State-Of-The-Art Newsroom” – The Verge, Oct 17, 2013

Tweet volume on Egypt & Morsi: ~12k per hour. The fact: vast amounts of social media responses. We need automated solutions!

Event Analytics on Social Media
• Most existing event analytics solutions are primitive.
• Simply combining other solutions ignores the connections between events and responses.
• Given the vast amounts of social media responses and the complex nature of events, we need automated tools to conduct in-depth analysis.

In this proposal, we present Eventics
• Task 1: Event sensemaking – event topics, segments, event-tweet alignment, event sentiments
• Task 2: Event recognition – trending events with associated Twitter responses
• Task 3: Event engagement prediction – predict users’ engagement in future events

[Figure: tweets labeled Specific or General with respect to the event] ET-LDA [AAAI’12, ICWSM’12, MMW’12]

ET-LDA [AAAI’12, ICWSM’12, MMW’12]: event-tweet alignment, frequency of specific tweets, evolution of specific tweets

[Figure: tweets labeled Specific or General with respect to the event] SocSent [IJCAI’13]

Example tweet: “Fire happened at 5St and Pike, heard sirens, lots smoke” DeMA [CHI’13]

Example notification: “Hey Mike: we found this event may be of interest to you based on our prediction of your potential engagement! Our predictions were made based on your Twitter engagement history. Regards, Alice” Alice [under review]

Summary of Contributions
Eventics, an automated toolbox to conduct in-depth analysis of 3 core tasks in event analytics
• ET-LDA & SocSent for event sensemaking
• DeMA for event recognition
• Alice for event engagement prediction
Our toolbox enables a richer perspective about
• How people respond to events on Twitter
• What factors affect the crowd’s engagement in events

Event Sensemaking: Motivation
Republican Primary Debate, 09/07/2011; tweets tagged with #ReaganDebate
• What’s the relation between an event and tweets?
• Which part of the event did a tweet refer to?
• What were the topics of the event and tweets?
• How to characterize the crowds’ tweeting behavior?

Event Sensemaking: the Problem
Given an event’s transcript S and its associated tweets T, characterize the event in terms of its topics and segments, and its influence (w.r.t. nature and magnitude) on the crowds’ Twitter responding behavior.
Requirements:
• Extract topics in the event and tweets
• Segment the event into topically coherent chunks
• Establish the alignment between the event and tweets
• Measure the influence of the event on its associated tweets

Event Sensemaking: the Challenges
• Both topics and segments are latent
• Tweets are topically influenced by the content of the event. A tweet’s topics can be general (high-level and constant across the entire event) or specific (concrete and related to specific segments of the event)
• An event is formed by discrete, sequentially ordered segments, each of which discusses a particular set of topics

Event Sensemaking: Possible Approaches
• Applying existing event segmentation tools, e.g., time windows
• For each <tweet, segment> pair, measuring similarity, e.g., TF-IDF
• Counting related tweets for each segment
Unfortunately, these approaches cannot discover latent topics or segments, and they model the event and its Twitter responses independently.
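To make the similarity baseline concrete, here is a minimal sketch of pairing each tweet with its most similar segment by TF-IDF cosine similarity. The whitespace tokenizer, the smoothed IDF formula, and the function names (`tfidf_vectors`, `cosine`, `most_similar_segment`) are assumptions chosen for illustration, not details given in the slides.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]  # assumed tokenizer
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_segment(segments, tweets):
    """For each tweet, return the index of the segment with highest similarity."""
    vectors = tfidf_vectors(segments + tweets)
    seg_vecs = vectors[:len(segments)]
    tweet_vecs = vectors[len(segments):]
    return [
        max(range(len(seg_vecs)), key=lambda i: cosine(tv, seg_vecs[i]))
        for tv in tweet_vecs
    ]
```

Note that this baseline treats every tweet as referring to exactly one segment and matches only on surface terms, which is the limitation the slide points out.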

Our Contribution: ET-LDA
ET-LDA (joint Event and Tweets LDA) is a hierarchical, fully Bayesian model that jointly models an event and its Twitter responses via their inter-dependency, i.e., topical influences.
• Yuheng Hu, Ajita John, Fei Wang, Subbarao Kambhampati. “ET-LDA: Joint Topic Modeling for Aligning Events and their Twitter Feedback.” In AAAI Conference on Artificial Intelligence (AAAI), 2012
• Yuheng Hu, Ajita John, Doree Duncan Seligmann, Fei Wang. “What were the Tweets about? Topical Associations between Public Events and Twitter Feeds.” ICWSM’12
• Yuheng Hu, Ajita John, Doree Duncan Seligmann. “Event Analytics via Social Media.” In Proc. ACM Multimedia 2011 Workshop on Social and Behavioral Networked Media Access (SBNMA), 2011

ET-LDA: Generative Process
For each paragraph s in S:
    draw a segment choice indicator Cs
    if Cs = 1: draw a new topic mixture for s
    else: the topic mixture of s is the same as that of the previous paragraph s-1
For each tweet t in T:
    draw a topic changing indicator Ct
    if Ct = 1: draw a new topic for t
    else: draw a paragraph s and assign the topic mixture of s to t
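The generative process above can be sketched as a small forward sampler. This is a simplified illustration, not the authors' implementation: it works at the tweet level as the slide does, draws the referenced paragraph uniformly instead of from a learned categorical distribution, and forces the first paragraph to start a new segment. The parameter names `delta`, `lam`, and `alpha` and their default values are assumptions.

```python
import random


def dirichlet(alpha, k, rng):
    """Sample a k-dimensional symmetric Dirichlet(alpha) via normalized gammas."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def generate(num_paragraphs, num_tweets, num_topics,
             delta=0.3, lam=0.5, alpha=0.5, seed=0):
    """Sample segment indicators, paragraph topic mixtures, and tweet assignments."""
    rng = random.Random(seed)

    seg_flags, paragraph_mix = [], []
    for s in range(num_paragraphs):
        # Cs = 1 starts a new segment; the first paragraph always does (assumption).
        c_s = 1 if s == 0 or rng.random() < delta else 0
        if c_s == 1:
            theta = dirichlet(alpha, num_topics, rng)
        else:
            theta = paragraph_mix[-1]  # share the previous paragraph's mixture
        seg_flags.append(c_s)
        paragraph_mix.append(theta)

    tweets = []
    for _ in range(num_tweets):
        c_t = 1 if rng.random() < lam else 0
        if c_t == 1:
            # Ct = 1: draw a new topic mixture for the tweet (slide's convention).
            tweets.append((c_t, None, dirichlet(alpha, num_topics, rng)))
        else:
            # Ct = 0: pick a paragraph (uniformly here, an assumption) and
            # reuse its topic mixture.
            s = rng.randrange(num_paragraphs)
            tweets.append((c_t, s, paragraph_mix[s]))
    return seg_flags, paragraph_mix, tweets
```

Consecutive paragraphs that share a mixture form one segment, so the segmentation falls out of the sampled indicators rather than being fixed in advance.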

ET-LDA: Graphical Model
Event:
• Determine event segmentation: C(s) ~ Bernoulli()
• Determine segment topics: θ(s) ~ Dirichlet(α), or θ(s) ~ (θ(s-1), θ(s))
• Determine a word’s topic in the event: Zs ~ Multinomial(θ)
Tweets:
• Determine tweet type: C(t) ~ Bernoulli(λ)
• Determine which segment a tweet (word) refers to: S(t) ~ Categorical(γ)
• General topics: ψ(t) ~ Dirichlet(α)
• Determine a tweet word’s topic: Zt ~ Multinomial(ψ) or Zt ~ Multinomial(θ)

Inference in ET-LDA is HARD
• Unfortunately, exact model inference is intractable due to the coupling of hyperparameters, so we need approximate inference algorithms. Here we use collapsed Gibbs sampling.
• We need to infer P(Zs, Zt, Cs, Ct, St | Ws, Wt)
• What the joint distribution looks like: [equation]
• Gibbs sampling approximates the posterior distribution by iteratively updating each latent variable given the remaining variables

Evaluation of ET-LDA: Experimental Setup
• Tweets for President Obama’s speech on the Middle East on May 19, 2011 (#MESpeech) and the Republican Primary debate in the US on Sept 7, 2011 (#ReaganDebate)
• Event transcripts from the New York Times
• Model settings: Gibbs sampling; the number of topics is picked by maximizing log-likelihood
• Baselines
    • LDA – Latent Dirichlet Allocation
    • LCSeg – HMM-based event segmentation tool
• Tasks
    • Event segmentation
    • Topic extraction
    • Alignment

Results: Event Segmentation
Pk = probability that a randomly chosen pair of words from the event will be incorrectly separated by a hypothesized segment boundary
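As an illustration of how a Pk-style score can be computed, the sketch below uses the common sliding-window formulation: it looks at all pairs of units that are k positions apart and counts the pairs on which the reference and the hypothesis disagree about whether the two units lie in the same segment. The input encoding (one segment id per unit) and the default choice of k as half the average reference segment length are assumptions; the slides do not say which variant was used.

```python
def pk(reference, hypothesis, k=None):
    """Pk segmentation error; lower is better.

    reference, hypothesis: equal-length lists with one segment id per unit
    (word, sentence, or paragraph). Each segment must use a distinct id.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must cover the same units")
    n = len(reference)
    if k is None:
        # Conventional default: half the average reference segment length.
        num_segments = len(set(reference))
        k = max(1, round(n / (2 * num_segments)))
    if n <= k:
        raise ValueError("window k must be smaller than the number of units")

    disagreements = 0
    for i in range(n - k):
        same_in_reference = reference[i] == reference[i + k]
        same_in_hypothesis = hypothesis[i] == hypothesis[i + k]
        if same_in_reference != same_in_hypothesis:
            disagreements += 1
    return disagreements / (n - k)
```

A hypothesis identical to the reference scores 0, and both missed and spurious boundaries raise the score.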

Results: Topic Extraction
Performance based on Likert scale

Results: Alignment
Goal: determine whether the specific tweets (i.e., tweets that are strongly influenced by the event) are correctly identified for each segment.
Procedure:
• ET-LDA: sample tweets for which P(C(t)) > 0.5
• LDA: run LDA on the tweet corpus and the event transcripts; calculate the distance between topic mixtures using JS-divergence
Performance based on Likert scale
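For the LDA baseline, the distance between two topic mixtures can be computed with the Jensen-Shannon divergence. The sketch below assumes each mixture is a list of probabilities over the same topics and uses base-2 logarithms so that the value lies in [0, 1]; the helper `closest_segment` is a hypothetical name for the step that picks the segment whose mixture is nearest to a tweet's mixture.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)


def closest_segment(tweet_mixture, segment_mixtures):
    """Index of the segment whose topic mixture is closest to the tweet's."""
    return min(
        range(len(segment_mixtures)),
        key=lambda i: js_divergence(tweet_mixture, segment_mixtures[i]),
    )
```

Unlike plain KL divergence, the JS divergence is symmetric and stays finite when one mixture assigns zero probability to a topic.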

Evolution of Specific Tweets
• Rapid increase from 33% to 54%
• When a controversial topic was mentioned, the responses were pronounced
• Most responses were either tangential or about the high-level themes
Observation 1: crowds’ responses tended to be general and steady before and after the event, while during the event they were more specific and episodic.

Distribution of Segments Referred to by Specific Tweets (ET-LDA alignment)
• People can also talk about things that are expected to be discussed later
• People can talk about things that have been discussed before or are being discussed currently
Observation 2: the topical context of the tweets did not always correlate with the timeline of the event – an event segment can be referred to by specific tweets at any time, irrespective of whether it has already occurred, is occurring currently, or will occur later on.

Examples of Specific/General tweets
Specific
• Something the #GOP candidates won't mention about Reagan - Reagan grew the size of the federal government tremendously. #reagandebate
• Yes, we need to talk about jobs and teachers needing jobs! #Reagandebate
General
• Boring #GOPDebate#tcot #ReaganDebate
• Ron Paul. Gogogog :) . #reagandebate

Summary of ET-LDA
• Motivated joint event-tweet modeling for event sensemaking
• ET-LDA can concurrently segment an event and classify two types of tweets: general and specific
• Demonstrated that ET-LDA significantly outperformed the traditional models
• ET-LDA enables many insights that were never studied before

Proposed Work
ET-LDA is powerful, but there remain open questions
Robustness of ET-LDA
• How does data incompleteness in the event’s transcript affect the performance of ET-LDA in classifying the types of tweets?
• How does the volume of tweets affect the performance of ET-LDA in segmenting the event?
Predictive power of ET-LDA
• How well does ET-LDA predict future tweeting behavior given the topics covered in the event?
• How well does ET-LDA predict the future development of the event given the tweets seen so far?

Proposed Work: Extension to ET-LDA
Possible solutions:
• Investigate the performance of different inference algorithms (e.g., the EM algorithm) in estimating ET-LDA’s parameters when data is incomplete
• Investigate a training-testing scheme for the ET-LDA model
Outcome:
• Analyze what is currently happening rather than performing “after-the-fact” analysis
• Users can interact with the system and evaluate its effectiveness in predicting the future development of the event as well as the tweeting behavior

What other tasks can we do based on this alignment?

[Figure: tweets labeled Specific or General with respect to the event]
What were the sentiments elicited by the segments and topics of the event on Twitter?
Applications: event analysis, stock market, advertisement

Events Sensemaking via Aggregated Twitter Sentiment: the Problem
Given an event’s transcript S and its associated tweets T, find the aggregated sentiments (positive or negative) about the segments (s ∈ S) and topics (k ∈ K) of the event elicited on Twitter.

Events Sensemaking via Aggregated Twitter Sentiment: Possible Solution
Main steps
• Manually label tweets with their sentiment orientation as training data
• Apply off-the-shelf sentiment classifiers, e.g., MinCut [Pang et al. 2002]
• Relate aggregated Twitter sentiment to the segments and topics of the event that occur within fixed time windows around the tweets’ timestamps
Is this sufficient? Unfortunately, no.

Events Sensemaking via Aggregated Twitter Sentiment: Challenges
C1. It is difficult to relate Twitter sentiment to the segments and topics of the event
• The fixed time-window approach is often not valid, as shown by ET-LDA
C2. Manually annotating the sentiment of a vast number of tweets is error-prone
• This presents a bottleneck in learning high-quality models
C3. Twitter sentiment is conveyed with highly domain-specific contextual cues
• This can cause models to lose performance and become stale
How to overcome these challenges?

Our Contribution: SocSent
• Leverage prior knowledge to overcome the challenges
    • ET-LDA to align tweets to the event → C1
    • Sentiment lexicon → C3
    • Labels for small sets of tweets → C2
• SocSent incorporates prior knowledge into a matrix factorization framework that learns factors in latent dimensions – segments, topics, and sentiments (positive or negative) – of the event, as elicited on Twitter
Yuheng Hu, Fei Wang, Subbarao Kambhampati. “Listen to the Crowd: Automated Analysis of Events via Aggregated Twitter Sentiment.” In International Joint Conference on Artificial Intelligence (IJCAI), 2013

SocSent: Framework
[Figure: factorization of the tweet-term matrix into tweet-segment, segment-topic, topic-sentiment, and term-sentiment factors, each regularized by a prior: tweet-event alignment from ET-LDA, sentiment labels for a small set of tweets, and a sentiment lexicon]
We require that the factors respect the prior knowledge to the extent possible.

SocSent: Formal Formulation
[Equation: objective over the factors G, T, S, F with priors G0, F0, R0]
• R0 regularizes G, T and S together
• T × S represents the segment-sentiment matrix
• G × T × S represents the tweet-sentiment matrix

Prior Knowledge in SocSent
• G0 (tweet × segment): obtained from ET-LDA inference. Its nt rows represent tweets and its ns columns represent segments of the event. Each entry is the posterior probability of a tweet referring to a segment.
• F0 (term × sentiment): sentiment lexicon obtained from the MPQA corpus. F0(i, 1) = 1 if word i is positive, and F0(i, 2) = 1 if it is negative.
• R0 (tweet × sentiment): ask people to label the sentiment of a few tweets (e.g., fewer than 1000) in order to capture some domain-specific connotations.
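As a small illustration of the lexicon prior, the sketch below builds an F0-style term-by-sentiment indicator matrix from a vocabulary and two word sets. The word lists are made-up placeholders, not entries taken from the MPQA lexicon, and the columns are 0-indexed here (column 0 positive, column 1 negative) whereas the slide uses 1-based indices.

```python
def build_lexicon_prior(vocabulary, positive_words, negative_words):
    """Return F0 as a list of rows [is_positive, is_negative], one per term."""
    prior = []
    for term in vocabulary:
        prior.append([
            1.0 if term in positive_words else 0.0,
            1.0 if term in negative_words else 0.0,
        ])
    return prior


# Hypothetical toy inputs for illustration only.
vocabulary = ["great", "jobs", "terrible", "debate"]
positive_words = {"great"}
negative_words = {"terrible"}
f0 = build_lexicon_prior(vocabulary, positive_words, negative_words)
```

Terms absent from the lexicon get an all-zero row, so the prior constrains only words with a known polarity.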

SocSent: Model Inference
• Multiplicative update rules
• The coupling between G, T, S, and F makes it difficult to find optimal solutions for all factors simultaneously. We adopt an alternating optimization scheme [Ding et al., 2006]

Inference in SocSent Ψ is the Lagrangian multipliers which enforce non-negativity constraints on F, C represents terms irrelevant to F

Evaluation Plan for SocSent
Experimental Setup
• Tweets for President Obama’s speech on the Middle East (#MESpeech) and the 2012 Presidential Debate in the US (#DenverDebate)
• Event transcripts from the New York Times
• Ground truth:
    • Graduate students manually label the sentiment of tweets. ET-LDA is then applied to establish the alignment between the labeled tweets and the event segments
    • Label the sentiment of each segment according to the majority aggregated Twitter sentiment that correlated with it
Evaluation of SocSent
• Classification performance for the sentiment of event segments
• Classification performance for the sentiment of event topics
• Effectiveness of prior knowledge
