Classifiers for Event Detection & Future Work

Kleisarchaki Sofia

Contents
• Presentation of papers [1] – [6]
• Events VS Non-Events
  • Definitions
  • Preconditions
  • Examples

[1]: “On-line New Event Detection and Tracking”
• Feature extraction and query representation (Inquery).
  • The query is built from the n most frequent single-word features.
• Determine the query’s initial threshold by evaluating the new story with the query:
  $\mathrm{eval}(q, d) = \sum_i w_i d_i / \sum_i w_i$, where $w_i$ is the relative weight of query feature $q_i$.
• $d_i = \mathrm{belief}(q_i, d, c) = 0.4 + 0.6 \cdot tf \cdot idf$, where tf and idf are computed from:
  • t: number of times feature qi occurs in the document
  • df: number of documents containing feature qi
  • dl: the document’s length
  • avg_dl: average document length in the collection
  • |c|: number of documents in the collection
[Pipeline diagram: Documents/Streams → Classifier → Ranker → Presentation]
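A minimal Python sketch of the evaluation step on this slide. The slide gives only `0.4 + 0.6 * tf * idf`; the specific `tf` and `idf` normalisations below are an assumption (the form commonly quoted for Inquery-style belief functions), and the function names are hypothetical.

```python
import math


def belief(t, df, dl, avg_dl, n_docs):
    """Belief d_i of one query feature in a document.

    The tf and idf normalisations are assumed; the slide only fixes
    the outer form 0.4 + 0.6 * tf * idf.
    """
    tf = t / (t + 0.5 + 1.5 * dl / avg_dl)                      # assumed tf form
    idf = math.log((n_docs + 0.5) / df) / math.log(n_docs + 1)  # assumed idf form
    return 0.4 + 0.6 * tf * idf


def evaluate(weights, beliefs):
    """eval(q, d): weighted average of the feature beliefs d_i."""
    return sum(w * b for w, b in zip(weights, beliefs)) / sum(weights)
```

A feature that never occurs in the document (`t = 0`) gets the default belief 0.4, so `eval(q, d)` never drops below 0.4 under these assumptions.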

[1]: “On-line New Event Detection and Tracking”
• If eval(q, d) > thresh for some earlier query q, the document discusses an existing event; if no earlier query is triggered, the document is flagged as a new event.
• The threshold depends on:
  • p: constant percentage of the initial threshold
  • tp: time penalty
  • i-j: distance between documents i and j on the stream (documents closer together on the stream are more likely to discuss related events)
• Limitation: unable to detect events that are discussed in the news at different levels of granularity, e.g. the “O.J. Simpson trial” vs. other court cases.
• Solution: a different weighting strategy for query features.
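The decision step can be sketched as follows. The threshold formula did not survive on the slide, so the form used here (a fraction `p` of the initial score above the 0.4 baseline, plus a penalty `tp` per step of stream distance) is an assumption, as are the function names. The sketch also assumes that a story is flagged as new only when it triggers no earlier query.

```python
def threshold(initial_eval, p, tp, i, j):
    """Assumed threshold of the query built at position i, seen from position j."""
    return 0.4 + p * (initial_eval - 0.4) + tp * (j - i)


def is_new_event(doc_index, scores, queries, p, tp):
    """Flag a new event when the document triggers no earlier query.

    scores[k]  : eval(q_k, d) of the new document against earlier query k
    queries[k] : (stream position i, initial eval of query k on its own story)
    """
    for score, (i, initial_eval) in zip(scores, queries):
        if score > threshold(initial_eval, p, tp, i, doc_index):
            return False  # matches an existing event
    return True
```

With `tp > 0` the threshold grows with stream distance, so an old query is harder to trigger than a recent one.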

[1]: “On-line New Event Detection and Tracking”
• Increasing the number of features improves performance, but at the cost of an unacceptable increase in the system’s running time.
• Performance = 100 - distance from origin

[1]: “On-line New Event Detection and Tracking”
• Effects of varying the threshold parameters p and tp.
• On average, for any value of p, performance is better when tp > 0.

[2]: “A system for New Event Detection”
• Incremental model (df is not static)
  • Nt: total number of documents at time t.
  • dfCt: the document frequencies in the newly added set of documents Ct.
• New events introduce new vocabulary.
• Low-frequency terms w tend to be uninformative.
  • Keep only terms with dft(w) >= θd (θd = 2).
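A minimal sketch of the incremental model, assuming the update simply adds the document frequencies of each new batch Ct to the running counts. The class and method names are hypothetical.

```python
from collections import Counter


class IncrementalDF:
    """Running document frequencies, updated batch by batch."""

    def __init__(self, min_df=2):
        self.df = Counter()   # df_t(w)
        self.n_docs = 0       # N_t
        self.min_df = min_df  # theta_d

    def update(self, new_docs):
        """Add a batch C_t of tokenised documents to the counts."""
        for tokens in new_docs:
            self.df.update(set(tokens))  # each document counts a term once
            self.n_docs += 1

    def vocabulary(self):
        """Terms frequent enough to be used: df_t(w) >= theta_d."""
        return {w for w, count in self.df.items() if count >= self.min_df}
```

A term that first appears with a new event enters the vocabulary only once a second document mentions it.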

[2]: “A system for New Event Detection”
• Similarity calculation between documents d and q
• Making a decision: identify the earlier document d* most similar to q
• Score(q) > θs → new event
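The decision rule can be sketched as below. Two assumptions are made because the slide's formulas are missing: the score is taken as one minus the similarity to the closest earlier document d*, and cosine similarity over term-weight dictionaries stands in for the paper's similarity measure. All names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse term-weight dictionaries (stand-in measure)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def novelty_score(q, previous_docs):
    """Assumed score(q) = 1 - sim(q, d*), with d* the most similar earlier document."""
    if not previous_docs:
        return 1.0
    return 1.0 - max(cosine(q, d) for d in previous_docs)


def is_new_event(q, previous_docs, theta_s):
    return novelty_score(q, previous_docs) > theta_s
```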

[2]: “A system for New Event Detection”
• Improvements
  • Source-specific TF-IDF model: dfs,t(w)
  • Document similarity normalization
  • Source-pair-specific on-topic similarity normalization
  • Inverse event frequencies of terms: ef(w)

[3]: “Text Classification and Named Entities for New Event Detection”
• Basic model
  • weight(w, d) = tf * idf
  • tf = log(termfrequency + 1.0)
  • idf = log((docCount + 1) / (documentfreq + 0.5))
• The basic model can make mistakes → look into other parameters (category, overlap of named entities, etc.)
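The basic model's term weight translates directly into code. The function and parameter names below are illustrative, and natural logarithms are assumed since the slide does not give a base.

```python
import math


def term_weight(term_frequency, document_freq, doc_count):
    """weight(w, d) = tf * idf as defined in the basic model."""
    tf = math.log(term_frequency + 1.0)
    idf = math.log((doc_count + 1) / (document_freq + 0.5))
    return tf * idf
```

A term that is absent from the document gets weight 0, and for a fixed term frequency the weight grows as the term becomes rarer in the collection.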

[3]: “Text Classification and Named Entities for New Event Detection”
• Some categories:
  • Elections
  • Scandals/Hearings
  • Legal/Criminal Cases
  • Natural Disasters
  • Accidents
  • Acts of Violence or War
• Three vector representations
  • α: all terms in the document
  • β: named entities (language, location, nationality, organization, etc.)
  • γ: the non-named-entity terms

[3]: “Text Classification and Named Entities for New Event Detection”
• Named entities are a double-edged sword, and deciding when to use them can be tricky.
• Whether to consider named entities cannot be decided once for all categories.
[Figure: categories where named entities do not matter]

[3]: “Text Classification and Named Entities for New Event Detection”
[Figure: categories where named entities win]
[Figure: categories where it cannot be decided]

[4]: “Streaming First Story Detection with application to Twitter”
• Algorithm based on locality-sensitive hashing (LSH), with constant time and space per document
• Constant number of documents inside the buckets
  • When a bucket is full, the oldest document is removed.
• Constant number of comparisons
  • Compare each document with at most 3L documents it collided with (L: number of hash tables).
  • The 3L documents are the most popular ones, ranked by the number of hash tables in which the collision occurred.
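The two bounds on this slide can be illustrated with a small sketch. It assumes random-hyperplane hashing for dense vectors, and the class name, parameters and defaults are hypothetical rather than taken from the paper.

```python
import random
from collections import Counter, defaultdict, deque


class BoundedLSH:
    """Random-hyperplane LSH with size-capped buckets (illustrative sketch)."""

    def __init__(self, dim, n_tables=4, n_bits=8, bucket_size=16, seed=0):
        rng = random.Random(seed)
        self.n_tables = n_tables  # L
        # One independent set of random hyperplanes per hash table.
        self.planes = [
            [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
            for _ in range(n_tables)
        ]
        # A deque with maxlen drops its oldest entry when a bucket is full.
        self.tables = [
            defaultdict(lambda: deque(maxlen=bucket_size)) for _ in range(n_tables)
        ]

    def _key(self, table, vec):
        # One bit per hyperplane: which side of the plane the vector lies on.
        return tuple(
            sum(p * x for p, x in zip(plane, vec)) >= 0
            for plane in self.planes[table]
        )

    def candidates(self, vec):
        """At most 3L earlier ids, those colliding in the most tables first."""
        hits = Counter()
        for t in range(self.n_tables):
            key = self._key(t, vec)
            if key in self.tables[t]:
                hits.update(self.tables[t][key])
        return [doc_id for doc_id, _ in hits.most_common(3 * self.n_tables)]

    def add(self, doc_id, vec):
        for t in range(self.n_tables):
            self.tables[t][self._key(t, vec)].append(doc_id)
```

Each new document is compared only against the ids returned by `candidates`, so the work per document does not grow with the length of the stream.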

[4]: “Streaming First Story Detection with application to Twitter”
• Minimal normalized scores:
  • UMass: 0.69 (28 hours)
  • LSH: 0.71 (2 hours)

[4]: “Streaming First Story Detection with application to Twitter”
[Figure: comparison of processing time per 100 documents for the LSH system and the UMass system]

[4]: “Streaming First Story Detection with application to Twitter”
[Figure: average precision for Events vs. Rest (Neutral, Spam) and for Events and Neutral vs. Spam]
[Figure: average precision as a function of the entropy threshold on the Events vs. Rest task]

[5]: “Learning Similarity Metrics for Event Identification in Social Media”
• Similarity metrics for:
  • Textual features: cosine similarity [3]
  • Time/date: 1 - |t1 - t2| / y, where y is the number of minutes in a year
  • Location: 1 - H(L1, L2), where L1 and L2 are latitude-longitude pairs and H is the Haversine distance
• The haversine formula, important in navigation, gives the great-circle distance between two points on a sphere from their longitudes and latitudes.
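A sketch of the time and location similarities. Several details are assumptions, since the slide does not state them: timestamps are in minutes, the time similarity is clamped at 0 for items more than a year apart, and the Haversine distance is normalised by half the Earth's circumference so that the location similarity stays in [0, 1].

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60
EARTH_RADIUS_KM = 6371.0
MAX_DISTANCE_KM = math.pi * EARTH_RADIUS_KM  # assumed normaliser (antipodal points)


def time_similarity(t1, t2):
    """1 - |t1 - t2| / y with t in minutes, clamped at 0 (assumption)."""
    return max(0.0, 1.0 - abs(t1 - t2) / MINUTES_PER_YEAR)


def haversine_km(p1, p2):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, p1)
    lat2, lon2 = map(math.radians, p2)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def location_similarity(p1, p2):
    """1 - H(L1, L2) with H normalised to [0, 1] (assumed normalisation)."""
    return 1.0 - haversine_km(p1, p2) / MAX_DISTANCE_KM
```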

[5]: “Learning Similarity Metrics for Event Identification in Social Media”
• Clustering framework
  • Single-pass incremental clustering algorithm with a threshold parameter.
• Threshold selection
  • Select the threshold with the highest combined NMI and B-Cubed value.
  • Notation: C = {c1, .., cn}: set of clusters; E = {e1, .., en}: set of events; Pb: average precision; Rb: average recall.
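A minimal sketch of single-pass incremental clustering with a threshold. The similarity of a document to a cluster is taken here as the mean similarity to the cluster's members, which is an assumption; the function name and signature are hypothetical.

```python
def single_pass_cluster(docs, similarity, threshold):
    """Assign each document to its most similar cluster, or start a new one."""
    clusters = []
    for doc in docs:
        best_cluster, best_sim = None, float("-inf")
        for cluster in clusters:
            # Assumed document-to-cluster similarity: mean over members.
            sim = sum(similarity(doc, member) for member in cluster) / len(cluster)
            if sim > best_sim:
                best_cluster, best_sim = cluster, sim
        if best_cluster is not None and best_sim >= threshold:
            best_cluster.append(doc)
        else:
            clusters.append([doc])
    return clusters
```

Each document is seen once and never reassigned, which is why the threshold has to be tuned on labelled data.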

[5]: “Learning Similarity Metrics for Event Identification in Social Media”
• Clusterer weight selection
  • Assign each clusterer a weight during the supervised training phase, indicating our confidence in its predictions.
  • wc = combined(NMI, B-Cubed) / Σ wi
• Consensus score: $\sum_c w_c \cdot P_c(d_i, d_j)$
  • Pc: prediction function of clusterer c. Returns 1 if the documents are in the same cluster, 0 otherwise.
• Simple ensemble-based technique
  • Compute the similarity of a document to a cluster by comparing the document against all documents in the cluster using the ensemble consensus function.

[5]: “Learning Similarity Metrics for Event Identification in Social Media”
• Improved ensemble-based technique (centroid-based)
  • If σc(di, cj) > μc then Pc(di, cj) = 1, else Pc(di, cj) = 0.
  • Compute consensus-score(di, cj) = $\sum_c w_c \cdot P_c(d_i, c_j)$, where wc is the weight of clusterer c.
• Centroids
  • Textual centroid: average tf*idf per term
  • Time centroid: average time in minutes
  • Location centroid: geographic midpoint
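The centroid-based consensus can be sketched as a weighted vote, assuming the score is the sum of the weights of the clusterers whose similarity exceeds their own threshold. The function name and the dictionary layout are hypothetical.

```python
def consensus_score(similarities, thresholds, weights):
    """Weighted vote of the clusterers for one (document, cluster) pair.

    similarities[c] : sigma_c(d_i, c_j), similarity under clusterer c
    thresholds[c]   : mu_c, the threshold of clusterer c
    weights[c]      : w_c, the normalised weight of clusterer c
    """
    return sum(
        weights[c]
        for c, sim in similarities.items()
        if sim > thresholds[c]  # P_c(d_i, c_j) = 1
    )
```

For example, with weights 0.5 (text), 0.3 (time) and 0.2 (location), a document that passes the text and location thresholds but not the time threshold scores 0.7.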

[6]: “Temporal and Information Flow Based Event Detection From Social Text Streams”
• Collection of social text stream data:
  • D = <(p1, t1, s1), .., (pn, tn, sn)>
  • pi ∈ P = {p1, .., p|P|}: piece of text content
  • ti: timestamp
  • si = <ai, ri>: social actors (initiating actor ai, receiver ri)
• Modelled as a graph, where each node is a text piece and each edge carries the similarity between the two text pieces it connects.
[Method outline: Content-based clustering → Temporal intensity-based segmentation → Information flow pattern → Event definition & detection algorithm]

[6]: “Temporal and Information Flow Based Event Detection From Social Text Streams”
• Text pieces are clustered into different topics using the graph cut algorithm.
• Minimize the normalized cut function (Shi & Malik, “Normalized Cuts and Image Segmentation”).
• As a result, each piece of text belongs to a topic cluster in the graph cut-based result.
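The function being minimized is missing from the slide. The sketch below assumes it is the standard two-way normalized cut of Shi & Malik, Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V), and evaluates it for a given partition; the function name and edge representation are hypothetical.

```python
def normalized_cut(edges, part_a):
    """Two-way normalized cut of an undirected weighted graph.

    edges  : dict mapping (u, v) to a similarity weight, each edge listed once
    part_a : set of nodes on one side; all other nodes form side B
    """
    cut = assoc_a = assoc_b = 0.0
    for (u, v), w in edges.items():
        if (u in part_a) != (v in part_a):
            cut += w
        # Each endpoint contributes the edge weight to its own side's association.
        for node in (u, v):
            if node in part_a:
                assoc_a += w
            else:
                assoc_b += w
    if assoc_a == 0.0 or assoc_b == 0.0:
        raise ValueError("both sides of the partition need at least one edge")
    return cut / assoc_a + cut / assoc_b
```

A partition that separates two tightly connected groups across a weak edge yields a small value, which is what the clustering step looks for.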

[6]: “Temporal and Information Flow Based Event Detection From Social Text Streams”
• The intensity of a topic in a time window is the total number of text pieces created within that window under the topic.
• Segment the sequence of intensities of a topic <i1, .., in> into a sequence of k intervals <I1, .., Ik> [9].
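Computing the intensity sequence of one topic is a windowed count. The sketch assumes timestamps are numbers measured from the start of the stream and that windows have a fixed width; both are illustrative choices, as is the function name.

```python
from collections import Counter


def topic_intensities(timestamps, window, n_windows):
    """Number of text pieces of one topic in each fixed-width time window."""
    counts = Counter(int(t // window) for t in timestamps)
    return [counts.get(k, 0) for k in range(n_windows)]
```

The resulting list is the sequence <i1, .., in> that the segmentation step then splits into intervals.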

[6]: “Temporal and Information Flow Based Event Detection From Social Text Streams”
• As a result of the temporal segmentation, each topic is represented as a sequence of social network graphs over the temporal dimension.
  • Nodes: social actors
  • Edges: communication intensity between the corresponding social actors
  • Communication intensity: number of text pieces exchanged between two social actors bi and bj under topic m within the nth time window.

[6]: “Temporal and Information Flow Based Event Detection From Social Text Streams”
• Definition (Information Flow Pattern): given two social actors bi and bj and a topic m, the information flow pattern between them, denoted Fm(bi, bj), is defined as a vector of communication intensities.
• Compute the similarity between flow patterns using dynamic time warping [10].
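Dynamic time warping between two flow patterns can be sketched with the classic dynamic program. This is the textbook unconstrained version with absolute difference as the local cost, not necessarily the indexed variant of [10]; a smaller distance means more similar patterns.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two intensity sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best alignment cost of the first i items of a and first j of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

Two patterns with the same shape but different pacing, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, get distance 0.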

[6]: “Temporal and Information Flow Based Event Detection From Social Text Streams”
• Definition (Event): given a social text stream corpus D = <(p1, t1, s1), (p2, t2, s2), .., (pn, tn, sn)>, an event is a subset of triples M = {(pi, ti, si), (pi+1, ti+1, si+1), .., (pl, tl, sl)} such that:
  (1) every text piece in PM = {pi, pi+1, .., pl} belongs to the same topic cluster in the content-based text clustering results;
  (2) every timestamp in <ti, ti+1, .., tl> falls within the same time interval In, which is one of the time segments in the temporal intensity-based segmentation results; and
  (3) every pair of social actors in SM = {si, si+1, .., sl} belongs to the same cluster in the graph cut results on the dual graph of the information flow pattern-based graph.

[6]: “Temporal and Information Flow Based Event Detection From Social Text Streams”
• C: content-based event detection
• CT: content- and temporal-based event detection
• CS: content- and social-based event detection
• CTS: content-, temporal- and social-based event detection
• TIF: temporal and information flow pattern-based event detection

Events VS non-Events
• Current papers focus on event documents.
• Learn to distinguish documents that contain an event from non-event documents.

Events VS non-Events
• Event definitions:
  • An event is something that occurs in a certain place at a certain time.
  • A tweet can be labelled as an event if it is clear from the tweet alone what exactly has happened, without any prior knowledge of the event, and the event referenced in the tweet is sufficiently important. [4]
    • (1) Informative
    • (2) Important

Events VS non-Events
• Event preconditions:
  • (1) Informative: a tweet is informative when it contains information (directly or indirectly) about what happened, when and where it happened, and who the actors of the event were.
    • Subject, time, place, actors
  • (2) Important (celebrity deaths, natural disasters, major sports, political and entertainment events, plane crashes and other disasters). Some indicators of importance are:
    • The growth rate of unique users talking about the event.
    • The influence of the users.
    • The dissemination of the information.

Events VS non-Events
• Indicators of importance:
  • Growth rate of users

Events VS non-Events
• Indicators of importance:
  • Influence of the user
    • A user with many followers is a strongly authoritative Twitter user who can influence the text stream activity of many other users.
    • The influence of a user can be calculated using the PageRank algorithm [7].
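As an illustration of the PageRank idea, here is a plain power-iteration sketch over a follower graph. The graph layout (each user mapped to the users they follow), the damping factor and the iteration count are assumptions for the example, not values from [7].

```python
def pagerank(follows, damping=0.85, iterations=50):
    """PageRank over a follower graph.

    follows maps each user to the list of users they follow, so influence
    flows from a follower to the accounts they follow.
    """
    users = set(follows) | {v for targets in follows.values() for v in targets}
    n = len(users)
    rank = {u: 1.0 / n for u in users}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in users}
        for u in users:
            targets = follows.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # A user who follows nobody spreads their rank uniformly.
                for v in users:
                    new_rank[v] += damping * rank[u] / n
        rank = new_rank
    return rank
```

In a graph where `a` and `b` both follow `c`, and `c` follows `a`, user `c` ends up with the highest score.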

Events VS non-Events
• Indicators of importance:
  • The dissemination of the information
    • Events that influence many people tend to be important.
    • On the other hand, locality proximity is an indication of document dissimilarity in the presence of all other features (text, time, etc.) [5].

Events VS non-Events
• Non-event definitions:
  • A non-event is the non-occurrence of an event. [8]
  • A non-event is an anticipated or highly publicized event that either does not occur or turns out to be anticlimactic, boring, or a hoax. Non-events are disappointing because they are often hyped prior to their occurrence. [Wikipedia]
  • A tweet can be characterized as a non-event tweet if it does not satisfy both precondition 1 (informative) and precondition 2 (important).

Events VS non-Events
• Consider the examples below:
  • The growth rate of users talking about Christmas is increasing: many tweets containing Christmas wishes arrive during December.
    • Precondition 1 is not valid, precondition 2 is valid → non-event
  • A local festival (Heraklion city) is taking place on the 11th of December.
    • Precondition 1 is valid, precondition 2 is not valid → non-event

Events VS non-Events
• Non-event tweets contain:
  • Spam tweets
    • Advertisements, automatic weather updates, automatic radio station updates, etc.
    • Entropy is a good metric for detecting spam tweets, as they contain very little information. [4]
  • Neutral tweets
    • Any tweet that is neither an event tweet nor a spam tweet.
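One way to make the entropy criterion concrete is the Shannon entropy of the token distribution. Computing it over the tokens of a single tweet, and using base-2 logarithms, are assumptions made for this sketch; the function name is hypothetical.

```python
import math
from collections import Counter


def token_entropy(tokens):
    """Shannon entropy (in bits) of the token distribution of a text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Repetitive text, typical of automated updates, scores low, so tweets below some entropy threshold can be treated as spam candidates.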

Events VS non-Events
• Davidson’s criterion of identity: two events are identical when they have the same causes and effects.
• For non-events this criterion fails to give satisfactory results: even though two non-events may have exactly the same set of causes and effects, they do not always seem to be identical to one another. [8]

References
• [1]: On-line New Event Detection and Tracking, 1998
• [2]: A system for New Event Detection, 2003
• [3]: Text Classification and Named Entities for New Event Detection, 2004
• [4]: Streaming First Story Detection with application to Twitter, 2010
• [5]: Learning Similarity Metrics for Event Identification in Social Media, 2010
• [6]: Temporal and Information Flow Based Event Detection From Social Text Streams, 2007
• [7]: Emerging Topic Detection on Twitter based on Temporal and Social Terms Evaluation
• [8]: Non-Events
• [9]: A better Alternative to piecewise linear time series segmentation, 2007
• [10]: Exact indexing of dynamic time warping, 2002
