
Parameter Free Bursty Events Detection in Text Streams


Presentation Transcript


  1. Parameter Free Bursty Events Detection in Text Streams Authors: Gabriel Pui Cheong Fung, Jeffrey Xu Yu, Hongjun Lu, Philip S Yu VLDB 2005

  2. Outline • Introduction • Bursty events? Text streams? • A Possible Method • Document pivot clustering • Proposed Work • Feature pivot clustering • Results Highlight • Related Works • Summary & Future Work

  3. Outline • Introduction • Bursty events? Text streams? • A Possible Method • Document pivot clustering • Proposed Work • Feature pivot clustering • Results Highlight • Related Works • Summary & Future Work

  4. Introduction (1 of 5) • Parameter Free Bursty Events Detection in Text Streams

  5. Introduction (2 of 5) • Parameter Free Bursty Events Detection in Text Streams • A sequence of documents organized temporally • E.g. news stories and e-mails • Two kinds of streams: Online vs. Offline • Online stream: open-ended • Offline stream: has boundaries

  6. Introduction (3 of 5) • Parameter Free Bursty Events Detection in Text Streams • An event consists of a set of features that are useful for identifying (understanding) the event. • A Bursty Event is an event that is hot in a specific period of time • We call the features that are used to identify the Bursty Event its Bursty Features • E.g. the event “SARS” consists of the features “Outbreak, Atypic, Respire, …” [Chart: no. of news stories over time, with a peak marking an event, e.g. SARS]

  7. Introduction (4 of 5) • Parameter Free Bursty Events Detection in Text Streams • Given a text stream, try to figure out all of the bursty events • In other words, try to figure out all of the bursty features (features that are “hot” in a specific period) and group the bursty features together logically, such that the bursty features grouped together are useful for identifying an event.

  8. Introduction (5 of 5) • Parameter Free Bursty Events Detection in Text Streams • Parameter Free – you do not need to tune the parameters yourself • The framework is applicable to any corpus • No fine tuning is necessary • No parameter needs to be estimated • Why is being parameter free useful? • Without any prior knowledge about the information in a database, it is rather difficult to make any initial estimation • In our problem, we are trying to identify the bursty events in a text stream. Here we do not have any prior knowledge about the information in the database. We do not know what it contains. We do not even know whether there is any burst. We do not know…

  9. Problem Setting • Data archived • Source: Local news stories (South China Morning Post) • Period: 2003-01-01 to 2004-12-31 • Some major settings • Offline detection • News stories that are released on the same day (i.e. news stories that appear in the same issue of the newspaper) are grouped together as a batch

  10. Outline • Introduction • Bursty events? Text streams? • A Possible Method • Document pivot clustering • Proposed Work • Feature pivot clustering • Results Highlight • Related Works • Summary & Future Work

  11. Document Pivot Clustering Approach (1 of 3) • A possible method (not our approach) • Step 1: • Objective: Group similar events together • Method: Use clustering to group similar documents together (e.g. K-Means) • Step 2: • Objective: Extract the keywords of each event • Method: Use feature selection (e.g. information gain) [Diagram: All News Stories → Step 1: form Group 1, Group 2, … via clustering → Step 2: extract the key features of each group]
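A minimal sketch of this document-pivot baseline in Python, assuming scikit-learn is available. The tf-idf + K-Means pipeline and the mean-tf-idf term ranking are illustrative stand-ins for the clustering and feature-selection choices named on the slide (e.g. information gain), not the exact setup evaluated in the paper.

```python
# Document-pivot baseline (not the proposed approach):
# Step 1: cluster documents, Step 2: extract key features per cluster.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def document_pivot(documents, n_clusters=10, top_k=10):
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(documents)                 # tf-idf weights
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)

    terms = np.array(vectorizer.get_feature_names_out())
    events = {}
    for c in range(n_clusters):
        members = X[labels == c]
        if members.shape[0] == 0:
            continue
        # Step 2: rank features by mean tf-idf inside the cluster
        # (a simple stand-in for feature selection such as information gain)
        scores = np.asarray(members.mean(axis=0)).ravel()
        events[c] = terms[scores.argsort()[::-1][:top_k]].tolist()
    return events
```

The next two slides list the difficulties that make this baseline unattractive for bursty event detection.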

  12. Document Pivot Clustering Approach (2 of 3) • Some difficulties • The most similar documents may not report the same event • From our experiments, we found that two documents that are the most similar in terms of their features may not necessarily report the same event • Clustering requires feature weightings (e.g. tf-idf) • Feature weighting originates from IR. Its idea is: features that appear in fewer documents in the domain are more useful (obtain higher weights). • For our clustering: features that appear in many documents within a certain period should obtain higher weights.

  13. Document Pivot Clustering Approach (3 of 3) • Some difficulties (cont’d) • A long-running event may be broken down into several small pieces • This phenomenon appears in many reported studies (esp. in TDT) • Difficult to figure out the bursty features • Assume clustering can determine bursty events. However, there can be many clusters that are not “hot” (important). Determining which clusters are “hot” is difficult (it may require a ranking function, which is difficult to derive).

  14. Outline • Introduction • Bursty events? Text streams? • A Possible Method • Document pivot clustering • Proposed Work • Feature pivot clustering • Results Highlight • Related Works • Summary & Future Work

  15. Feature Pivot Clustering Approach • Overview of the framework • Step 1: Identify the bursty features • Step 2: Group the bursty features into bursty events • Step 3: Determine the hot periods of the bursty events [Diagram: All News Stories → identify the bursty features among all features → cluster them into Event 1, Event 2, … → determine the hot period of each event]

  16. Feature Pivot Clustering Approach • Overview of the framework – Step 1 • Step 1: Identify the bursty features • (Step 2: Group the bursty features into bursty events; Step 3: Determine the hot periods of the bursty events) [Diagram: as on the previous slide, focusing on Step 1 – extracting the bursty features from all features in the news stories]

  17. Identify the Bursty Features (1 of 7) • General Idea • Given a single feature, f, try to figure out whether it contains any bursty period. • If so, then it is a bursty feature (in some specific periods) [Chart: the distribution of a feature f among documents – no. of docs containing f over time, with the bursty period marked]

  18. Identify the Bursty Features (2 of 7) • Some more examples [Charts: four distributions of the no. of docs containing a feature f over time – no burst; not a burst (a stopword); two bursts; a burst without fading away]

  19. Identify the Bursty Features (3 of 7) • An obvious approach to discover whether a feature is a bursty feature is to use a “threshold cut” [Chart: the distribution of a feature f among documents with a horizontal threshold line; the days above the threshold form the bursty period]
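To make the idea concrete, a tiny sketch of this threshold cut in Python; the function name and the list-of-daily-counts input are my own framing, and the next two slides explain why a single fixed threshold cannot work in practice.

```python
def threshold_cut(daily_counts, threshold):
    """Naive burst detector from the slide: a day is 'bursty' for a feature f
    whenever the number of documents containing f on that day exceeds a fixed
    threshold.  daily_counts[t] is the no. of docs containing f on day t."""
    return [t for t, count in enumerate(daily_counts) if count > threshold]

# e.g. threshold_cut([2, 3, 40, 55, 4, 1], threshold=20) -> [2, 3]
```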

  20. Identify the Bursty Features (4 of 7) • Challenges • Setting one single threshold for all features is impossible [Charts: a normal non-bursty feature vs. a stop-word – the same threshold line sits above one curve and below the other] • Another attempt – set a “percentage cut” • Figure out the relative difference between the max and min of the no. of docs containing the feature

  21. Identify the Bursty Features (5 of 7) • Challenges • Setting a percentage cut is also impossible • Different features have different distributions [Charts: two features whose daily document counts peak at different scales, e.g. around 300 vs. 500]

  22. Identify the Bursty Features (6 of 7) • Our solution • Treat each feature in the text stream as a probabilistic distribution • For each day, we compute the probability of the number of documents containing a particular feature, fj • What we have are: • N’ – no. of news stories in the stream • n’ – no. of news stories in a time window (one day) • K’ – no. of news stories containing the specific feature • n’ – K’ – no. of news stories not containing the specific feature • We can model the distribution of a feature in a time window (i.e. in a day) by a binomial distribution (the above four elements are enough for computing the binomial distribution) (Continued on next page)

  23. Identify the Bursty Features (7 of 7) • If in any time window (day) the value of the binomial distribution (the probability of the number of documents containing the feature) changes significantly, then it implies that the feature exhibits “abnormal” behavior • The reason is that if the features are generated from an unknown probability distribution, then the value of the binomial distribution in each time window (each day) should be more or less constant • Two reasons that it can drop significantly: • Suddenly very few documents contain the specific feature • We are not interested in this kind of observation, as it only tells us that the specific feature is NOT a bursty feature in the corresponding time window (day). It gives no insight about whether it is a bursty feature NOW. • Suddenly many documents contain the specific feature • We are interested in this kind of feature
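A hedged sketch of this step in Python. The slides list the four quantities (N’, n’, K’, n’ – K’) but not the exact estimator, so the baseline probability used below (the feature’s overall frequency across the whole stream) and the use of a binomial tail probability as the per-day score are assumptions of this sketch rather than the paper’s precise formulation.

```python
# Sketch of Step 1: score each day of a feature by how surprising its
# document count is under a binomial model.  Assumes scipy is available.
from scipy.stats import binom

def daily_burst_scores(daily_counts, daily_totals):
    """daily_counts[i] = K' : no. of news stories containing feature f on day i
       daily_totals[i] = n' : no. of news stories on day i
       sum(daily_totals) = N' : no. of news stories in the whole stream"""
    N = sum(daily_totals)
    p_f = sum(daily_counts) / N     # assumed baseline probability of f per story
    scores = []
    for k, n in zip(daily_counts, daily_totals):
        # P(at least k of the n stories contain f) under the baseline probability;
        # a very small tail probability marks a day where "suddenly many documents
        # contain the specific feature" -- the case the slide is interested in
        scores.append(binom.sf(k - 1, n, p_f))
    return scores
```

Days with unusually few occurrences get tail probabilities near one and are simply ignored, matching the slide’s remark that such drops say nothing about a burst happening now.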

  24. Feature Pivot Clustering Approach • Overview of the framework (recap) – Step 1 • Step 1: Identify the bursty features • (Step 2: Group the bursty features into bursty events; Step 3: Determine the hot periods of the bursty events) [Diagram: as before, Step 1 – extracting the bursty features from all features in the news stories]

  25. Feature Pivot Clustering Approach • Overview of the framework – Step 2 • Step 1: Identify the bursty features • Step 2: Group the bursty features into bursty events • (Step 3: Determine the hot periods of the bursty events) [Diagram: as before, Step 2 – clustering the bursty features into Event 1, Event 2, …]

  26. Group the Bursty Features (1 of 2) • General idea • Group together the features that always appear together • If the features always appear together, they should be discussing the same event • Cluster the features • Challenge • Should we group these two features together? • Situation: whenever Feature B appears, Feature A always appears as well (but not vice versa). Feature A appears in 1,000 stories; Feature B appears in 200 stories. • We claim that they should not be grouped together, as Feature B is only a subset of Feature A. • We want to group the features at the “same level”
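To make the “same level” point concrete, a small illustration in Python using the hypothetical counts on this slide. The containment ratios below are purely illustrative; this is not the paper’s grouping criterion, which is the probabilistic cost sketched after the worked examples later in the deck.

```python
# Illustration (not the paper's criterion): when one feature's stories are a
# subset of another's, the co-occurrence is one-sided.  Slide's hypothetical
# numbers: A in 1,000 stories, B in 200, and B always appearing with A.
def containment(co_occurrences, df_a, df_b):
    """Fraction of each feature's stories in which the other feature also appears."""
    return co_occurrences / df_a, co_occurrences / df_b

p_b_given_a, p_a_given_b = containment(co_occurrences=200, df_a=1000, df_b=200)
print(p_b_given_a, p_a_given_b)   # 0.2 vs 1.0 -- a strongly asymmetric pair,
                                  # i.e. B sits one "level" below A
```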

  27. Group the Bursty Features (2 of 2) • Our solution • We try to figure out the probability that the features should be grouped together, given the observed document distribution of the text stream • Find the grouping with the maximum probability that the features belong together (via Expectation-Maximization, EM) • Mathematically, the grouping is posed as a cost to be minimized (worked examples appear later under Results Highlight)

  28. Feature Pivot Clustering Approach • Overview of the framework (recap) – Step 2 • Step 1: Identify the bursty features • Step 2: Group the bursty features into bursty events • (Step 3: Determine the hot periods of the bursty events) [Diagram: as before, Step 2 – clustering the bursty features into Event 1, Event 2, …]

  29. Feature Pivot Clustering Approach • Overview of the framework – Step 3 • Step 1: Identify the bursty features • Step 2: Group the bursty features into bursty events • Step 3: Determine the hot periods of the bursty events [Diagram: as before, Step 3 – determining the hot period of each bursty event]

  30. Determine the Hot Periods • General idea • The hot period is the period with the highest average probability that the bursty features appear together • Graphically: [Chart: document distribution over time, with the hot period at the peak]
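A sketch of one way to read this step in Python. The slide only says “highest average probability”, so the concrete scoring below — picking the contiguous run of days whose probabilities sit most consistently above the group’s own average (a maximum-subarray scan over the excess) — is an assumption of this sketch, chosen so the window does not collapse to a single best day; the paper’s actual procedure may differ.

```python
import numpy as np

def hot_period(prob_matrix):
    """prob_matrix[f][t]: probability that bursty feature f of the event appears
    on day t.  Returns (start_day, end_day) of the assumed hot period."""
    avg = np.asarray(prob_matrix, dtype=float).mean(axis=0)  # average over features
    excess = avg - avg.mean()        # how far each day sits above the typical level
    best_sum, best_span = float("-inf"), (0, 0)
    cur_sum, cur_start = 0.0, 0
    for t, x in enumerate(excess):
        if cur_sum <= 0:             # start a new candidate window at day t
            cur_sum, cur_start = x, t
        else:
            cur_sum += x
        if cur_sum > best_sum:
            best_sum, best_span = cur_sum, (cur_start, t)
    return best_span
```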

  31. Outline • Introduction • Bursty events? Text streams? • A Possible Method • Document pivot clustering • Proposed Work • Feature pivot clustering • Results Highlight • Related Works • Summary & Future Work

  32. Problem Setting • Data archived • Source: Local news stories (South China Morning Post) • Period: 2003-01-01 to 2004-12-31 • Major settings • Offline detection • News stories that are released on the same day (i.e. news stories that appear in the same issue of the newspaper) are grouped together as a batch

  33. Results Highlight

  34. Key Problem • Will two bursty features be wrongly grouped together into the same bursty event if the two features have very similar feature distributions? • Sars, Outbreak and Iraq have similar feature distributions, but Sars and Outbreak should be grouped together as the bursty event SARS, while Sars and Iraq should not be grouped together into any bursty event.

  35. Explanation • Grouping bursty features Sars and Iraq: |DSars| = 3240, |DIraq| = 2404, |DSars ∩ DIraq| = 153, |DSars ∪ DIraq| = |M| = 5491 • If Sars and Iraq are grouped: P(D|Ek) = (3240/5491)(2404/5491) = 0.258, and the cost c = 0.190 + 0.588 = 0.778, where P(Ek) = 0.646 • If they are not grouped: P(D|Ek) = (3240/5491)(1 − 2404/5491) = 0.332, and the cost c = 0 + 0.479 = 0.479, where P(Ek) = 1

  36. Explanation • Grouping bursty features Sars and Outbreak: |DSars| = 3240, |DOutbreak| = 2254, |DSars ∩ DOutbreak| = 1854, |DSars ∪ DOutbreak| = |M| = 3640 • If Sars and Outbreak are grouped: P(D|Ek) = (3240/3640)(2254/3640) = 0.551, and the cost c = 0.043 + 0.259 = 0.302, where P(Ek) = 0.906 • If they are not grouped: P(D|Ek) = (3240/3640)(1 − 2254/3640) = 0.338, and the cost c = 0 + 0.471 = 0.471, where P(Ek) = 1
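The numbers on these two slides can be reproduced directly. The cost form c = −log10 P(Ek) − log10 P(D|Ek) is inferred from the figures shown here, and the P(Ek) values (0.646 and 0.906) are taken from the slides as given rather than recomputed (their derivation is in the paper).

```python
# Reproducing the two worked examples on the previous slides.
import math

def grouping_cost(df_a, df_b, m, p_ek_grouped):
    """df_a, df_b: document frequencies of the two features,
       m: |D_a ∪ D_b|, p_ek_grouped: P(Ek) if the features are grouped (as given)."""
    grouped     = (df_a / m) * (df_b / m)          # P(D|Ek) when grouped
    not_grouped = (df_a / m) * (1 - df_b / m)      # P(D|Ek) when the second feature is left out
    cost_grouped     = -math.log10(p_ek_grouped) - math.log10(grouped)
    cost_not_grouped = -math.log10(1.0)           - math.log10(not_grouped)
    return cost_grouped, cost_not_grouped

print(grouping_cost(3240, 2404, 5491, 0.646))  # ~ (0.778, 0.479): grouping costs more -> keep Iraq out
print(grouping_cost(3240, 2254, 3640, 0.906))  # ~ (0.302, 0.47):  grouping costs less -> group Outbreak with Sars
```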

  37. Key Problem • Will two bursty features be wrongly grouped together into the same bursty event if the two features have very similar feature distributions? • Sars, Outbreak and Iraq have similar feature distributions, but Sars and Outbreak should be grouped together as the bursty event SARS, while Sars and Iraq should not be grouped together into any bursty event. • The feature-pivot clustering approach can correctly group such features together.

  38. Results Highlight • Some events

  39. Results Highlight Bursty Events

  40. Results Highlight Hot Periods

  41. Outline • Introduction • Bursty events? Text streams? • A Possible Method • Document pivot clustering • Proposed Work • Feature pivot clustering • Results Highlight • Related Works • Conclusion

  42. Related Works (1 of 2) • TDT – Automatic techniques for locating topically related material in stream data (Wayne 2000, p. 1487) • Five major tasks: segmentation, tracking, detection, first story detection, linking • Works well with the “document-pivot clustering” approach • Tries to group similar documents to form an event (the event is not named, i.e. there is no need to extract or identify the main features of the event) • No need to figure out the “bursty features” • Other interesting issue • Our approach naturally combines the detection task and the linking task

  43. Related Works (2 of 2) • Many other related works • Vlachos et al. SIGMOD’04 • Bursts for online search queries • Smith SIGIR’02 • Event detection • Kleinberg KDD’02 • Bursty and hierarchical structure • Swan & Allan SIGIR’00 • Time-varying features • …

  44. Outline • Introduction • Bursty events? Text streams? • A Possible Method • Document pivot clustering • Proposed Work • Feature pivot clustering • Results Highlight • Related Works • Summary & Future Work

  45. Summary & Future Work • Document Pivot Clustering vs. Feature Pivot Clustering • Document Pivot Clustering – clustering is based on the content of the documents • Feature Pivot Clustering – clustering is based on the distribution of features • Future Work • Try to apply the framework to the TDT dataset • However, TDT contains selected news stories from multiple sources. The distribution of features may be affected. • Moreover, the time period of TDT is relatively short. We do not know whether the change in the distribution of features is significant enough for analysis. • Try to assign the same features to multiple events (more realistic) • However, this may lead to many new issues, such as “cycles” appearing, or some parameters needing to be introduced

  46. Thank you very much – The End –
