
Parameter Free Bursty Events Detection in Text Streams


Presentation Transcript


  1. Parameter Free Bursty Events Detection in Text Streams Authors: Gabriel Pui Cheong Fung, Jeffrey Xu Yu, Hongjun Lu, Philip S Yu VLDB 2005

  2. Outline • Introduction • Bursty events? Text streams? • A Possible Method • Document pivot clustering • Proposed Work • Feature pivot clustering • Results Highlight • Related Works • Summary & Future Work

  3. Outline • Introduction • Bursty events? Text streams? • A Possible Method • Document pivot clustering • Proposed Work • Feature pivot clustering • Results Highlight • Related Works • Summary & Future Work

  4. Introduction (1 of 5) • Parameter Free Bursty Events Detection in Text Streams

  5. Introduction (2 of 5) • Parameter Free Bursty Events Detection in Text Streams • A sequence of documents organized temporally • E.g. news stories and e-mails • Two kinds of streams: Online vs. Offline • Online stream: open-ended • Offline stream: has boundaries

  6. Introduction (3 of 5) • Parameter Free Bursty Events Detection in Text Streams • An event consists of a set of features that are useful for identifying (understanding) the event. • A Bursty Event is an event that is hot in a specific period of time • We call the features that are used to identify the Bursty Event its Bursty Features • E.g. the event “SARS” consists of the features “Outbreak, Atypic, Respire, …” [Chart: no. of news stories over time, with a peak marking an event, e.g. SARS]

  7. Introduction (4 of 5) • Parameter Free Bursty Events Detection in Text Streams • Given a text stream, try to figure out all of the bursty events • In other words, try to figure out all of the bursty features (features that are “hot” in a specific period) and group the bursty features together logically, such that the bursty features grouped together are useful for identifying an event.

  8. Introduction (5 of 5) • Parameter Free Bursty Events Detection in Text Streams • Parameter Free – you do not need to tune the parameters yourself • The framework is applicable to any corpus • No fine tuning is necessary • No parameter needs to be estimated • Why is being parameter free useful? • Without any prior knowledge about the information in a database, it is rather difficult to make any initial estimation • In our problem, we are trying to identify the bursty events in a text stream. Here we do not have any prior knowledge about the information in the database. We do not know what it contains. We do not even know whether there is any burst. We do not know…

  9. Problem Setting • Data archived • Source: Local news stories (South China Morning Post) • Period: 2003-01-01 to 2004-12-31 • Some major settings • Offline detection • News stories that are released on the same day (i.e. news stories that appear in the same issue of the newspaper) are grouped together as a batch

  10. Outline • Introduction • Bursty events? Text streams? • A Possible Method • Document pivot clustering • Proposed Work • Feature pivot clustering • Results Highlight • Related Works • Summary & Future Work

  11. Document Pivot Clustering Approach (1 of 3) • A possible method (not our approach) • Step 1: • Objective: Group similar events together • Method: Use clustering to group similar documents together (e.g. K-Means) • Step 2: • Objective: Extract the keywords of each event • Method: Use feature selection (e.g. information gain) [Diagram: All News Stories → Step 1: form Group 1, Group 2, … via clustering → Step 2: extract the key features of each group]
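A minimal sketch of this document-pivot baseline in Python, assuming scikit-learn is available. The tf-idf + K-Means pipeline and the mean-tf-idf term ranking are illustrative stand-ins for the clustering and feature-selection choices named on the slide (e.g. information gain), not the exact setup evaluated in the paper.

```python
# Document-pivot baseline (not the proposed approach):
# Step 1: cluster documents, Step 2: extract key features per cluster.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def document_pivot(documents, n_clusters=10, top_k=10):
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(documents)                 # tf-idf weights
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)

    terms = np.array(vectorizer.get_feature_names_out())
    events = {}
    for c in range(n_clusters):
        members = X[labels == c]
        if members.shape[0] == 0:
            continue
        # Step 2: rank features by mean tf-idf inside the cluster
        # (a simple stand-in for feature selection such as information gain)
        scores = np.asarray(members.mean(axis=0)).ravel()
        events[c] = terms[scores.argsort()[::-1][:top_k]].tolist()
    return events
```

The next two slides list the difficulties that make this baseline unattractive for bursty event detection.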

  12. Document Pivot Clustering Approach (2 of 3) • Some difficulties • The most similar documents may not report the same event • From our experiments, we found that two documents that are the most similar in terms of their features may not necessarily report the same event • Clustering requires feature weightings (e.g. tf-idf) • Feature weighting originates from IR. Its idea is: features that appear in fewer documents in the domain are more useful (obtain higher weights). • For our clustering: features that appear in many documents within a certain period should obtain higher weights.

  13. Document Pivot Clustering Approach (3 of 3) • Some difficulties (cont’d) • A long-running event may be broken down into several small pieces • This phenomenon appears in many reported studies (esp. in TDT) • Difficult to figure out the bursty features • Assume clustering can determine bursty events. However, there can be many clusters that are not “hot” (important). Determining which clusters are “hot” is difficult (it may require a ranking function, which is difficult to derive).

  14. Outline • Introduction • Bursty events? Text streams? • A Possible Method • Document pivot clustering • Proposed Work • Feature pivot clustering • Results Highlight • Related Works • Summary & Future Work

  15. Feature Pivot Clustering Approach • Overview of the framework • Step 1: Identify the bursty features • Step 2: Group the bursty features into bursty events • Step 3: Determine the hot periods of the bursty events [Diagram: All News Stories → identify the bursty features among all features → cluster them into Event 1, Event 2, … → determine the hot period of each event]

  16. Feature Pivot Clustering Approach • Overview of the framework – Step 1 • Step 1: Identify the bursty features • (Step 2: Group the bursty features into bursty events; Step 3: Determine the hot periods of the bursty events) [Diagram: as on the previous slide, focusing on Step 1 – extracting the bursty features from all features in the news stories]

  17. Identify the Bursty Features (1 of 7) • General Idea • Given a single feature, f, try to figure out whether it contains any bursty period. • If so, then it is a bursty feature (in some specific periods) [Chart: the distribution of a feature f among documents – no. of docs containing f over time, with the bursty period marked]

  18. Identify the Bursty Features (2 of 7) • Some more examples [Charts: four distributions of the no. of docs containing a feature f over time – no burst; not a burst (a stopword); two bursts; a burst without fading away]

  19. Identify the Bursty Features (3 of 7) • An obvious approach to discover whether a feature is a bursty feature is to use a “threshold cut” [Chart: the distribution of a feature f among documents with a horizontal threshold line; the days above the threshold form the bursty period]
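To make the idea concrete, a tiny sketch of this threshold cut in Python; the function name and the list-of-daily-counts input are my own framing, and the next two slides explain why a single fixed threshold cannot work in practice.

```python
def threshold_cut(daily_counts, threshold):
    """Naive burst detector from the slide: a day is 'bursty' for a feature f
    whenever the number of documents containing f on that day exceeds a fixed
    threshold.  daily_counts[t] is the no. of docs containing f on day t."""
    return [t for t, count in enumerate(daily_counts) if count > threshold]

# e.g. threshold_cut([2, 3, 40, 55, 4, 1], threshold=20) -> [2, 3]
```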

  20. Identify the Bursty Features (4 of 7) • Challenges • Setting one single threshold for all features is impossible [Charts: a normal non-bursty feature vs. a stop-word – the same threshold line sits above one curve and below the other] • Another attempt – set a “percentage cut” • Figure out the relative difference between the max and min of the no. of docs containing the feature

  21. Identify the Bursty Features (5 of 7) • Challenges • Setting a percentage cut is also impossible • Different features have different distributions [Charts: two features whose daily document counts peak at different scales, e.g. around 300 vs. 500]

  22. Identify the Bursty Features (6 of 7) • Our solution • Treat each feature in the text stream as a probabilistic distribution • For each day, we compute the probability of the number of documents containing a particular feature, fj • What we have are: • N’ – no. of news stories in the stream • n’ – no. of news stories in a time window (one day) • K’ – no. of news stories containing the specific feature • n’ – K’ – no. of news stories not containing the specific feature • We can model the distribution of a feature in a time window (i.e. in a day) by a binomial distribution (the above four elements are enough for computing the binomial distribution) (Continued on next page)

  23. Identify the Bursty Features (7 of 7) • If in any time window (day) the value of the binomial distribution (the probability of the number of documents containing the feature) changes significantly, then it implies that the feature exhibits “abnormal” behavior • The reason is that if the features are generated from an unknown probability distribution, then the value of the binomial distribution in each time window (each day) should be more or less constant • Two reasons that it can drop significantly: • Suddenly very few documents contain the specific feature • We are not interested in this kind of observation, as it only tells us that the specific feature is NOT a bursty feature in the corresponding time window (day). It gives no insight about whether it is a bursty feature NOW. • Suddenly many documents contain the specific feature • We are interested in this kind of feature
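A hedged sketch of this step in Python. The slides list the four quantities (N’, n’, K’, n’ – K’) but not the exact estimator, so the baseline probability used below (the feature’s overall frequency across the whole stream) and the use of a binomial tail probability as the per-day score are assumptions of this sketch rather than the paper’s precise formulation.

```python
# Sketch of Step 1: score each day of a feature by how surprising its
# document count is under a binomial model.  Assumes scipy is available.
from scipy.stats import binom

def daily_burst_scores(daily_counts, daily_totals):
    """daily_counts[i] = K' : no. of news stories containing feature f on day i
       daily_totals[i] = n' : no. of news stories on day i
       sum(daily_totals) = N' : no. of news stories in the whole stream"""
    N = sum(daily_totals)
    p_f = sum(daily_counts) / N     # assumed baseline probability of f per story
    scores = []
    for k, n in zip(daily_counts, daily_totals):
        # P(at least k of the n stories contain f) under the baseline probability;
        # a very small tail probability marks a day where "suddenly many documents
        # contain the specific feature" -- the case the slide is interested in
        scores.append(binom.sf(k - 1, n, p_f))
    return scores
```

Days with unusually few occurrences get tail probabilities near one and are simply ignored, matching the slide’s remark that such drops say nothing about a burst happening now.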

  24. Feature Pivot Clustering Approach • Overview of the framework (recap) – Step 1 • Step 1: Identify the bursty features • (Step 2: Group the bursty features into bursty events; Step 3: Determine the hot periods of the bursty events) [Diagram: as before, Step 1 – extracting the bursty features from all features in the news stories]

  25. Feature Pivot Clustering Approach • Overview of the framework – Step 2 • Step 1: Identify the bursty features • Step 2: Group the bursty features into bursty events • (Step 3: Determine the hot periods of the bursty events) [Diagram: as before, Step 2 – clustering the bursty features into Event 1, Event 2, …]

  26. Group the Bursty Features (1 of 2) • General idea • Group together the features that always appear together • If the features always appear together, they should be discussing the same event • Cluster the features • Challenge • Should we group these two features together? • Situation: whenever Feature B appears, Feature A always appears as well (but not vice versa). Feature A appears in 1,000 stories; Feature B appears in 200 stories. • We claim that they should not be grouped together, as Feature B is only a subset of Feature A. • We want to group the features at the “same level”
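To make the “same level” point concrete, a small illustration in Python using the hypothetical counts on this slide. The containment ratios below are purely illustrative; this is not the paper’s grouping criterion, which is the probabilistic cost sketched after the worked examples later in the deck.

```python
# Illustration (not the paper's criterion): when one feature's stories are a
# subset of another's, the co-occurrence is one-sided.  Slide's hypothetical
# numbers: A in 1,000 stories, B in 200, and B always appearing with A.
def containment(co_occurrences, df_a, df_b):
    """Fraction of each feature's stories in which the other feature also appears."""
    return co_occurrences / df_a, co_occurrences / df_b

p_b_given_a, p_a_given_b = containment(co_occurrences=200, df_a=1000, df_b=200)
print(p_b_given_a, p_a_given_b)   # 0.2 vs 1.0 -- a strongly asymmetric pair,
                                  # i.e. B sits one "level" below A
```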

  27. Group the Bursty Features (2 of 2) • Our solution • We try to figure out the probability that the features should be grouped together, given the observed document distribution of the text stream • Find the grouping with the maximum probability that the features belong together (via Expectation-Maximization, EM) • Mathematically, the grouping is posed as a cost to be minimized (worked examples appear later under Results Highlight)

  28. Feature Pivot Clustering Approach • Overview of the framework (recap) – Step 2 • Step 1: Identify the bursty features • Step 2: Group the bursty features into bursty events • (Step 3: Determine the hot periods of the bursty events) [Diagram: as before, Step 2 – clustering the bursty features into Event 1, Event 2, …]

  29. Feature Pivot Clustering Approach • Overview of the framework – Step 3 • Step 1: Identify the bursty features • Step 2: Group the bursty features into bursty events • Step 3: Determine the hot periods of the bursty events [Diagram: as before, Step 3 – determining the hot period of each bursty event]

  30. Determine the Hot Periods • General idea • The hot period is the period with the highest average probability that the bursty features appear together • Graphically: [Chart: document distribution over time, with the hot period at the peak]
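A sketch of one way to read this step in Python. The slide only says “highest average probability”, so the concrete scoring below — picking the contiguous run of days whose probabilities sit most consistently above the group’s own average (a maximum-subarray scan over the excess) — is an assumption of this sketch, chosen so the window does not collapse to a single best day; the paper’s actual procedure may differ.

```python
import numpy as np

def hot_period(prob_matrix):
    """prob_matrix[f][t]: probability that bursty feature f of the event appears
    on day t.  Returns (start_day, end_day) of the assumed hot period."""
    avg = np.asarray(prob_matrix, dtype=float).mean(axis=0)  # average over features
    excess = avg - avg.mean()        # how far each day sits above the typical level
    best_sum, best_span = float("-inf"), (0, 0)
    cur_sum, cur_start = 0.0, 0
    for t, x in enumerate(excess):
        if cur_sum <= 0:             # start a new candidate window at day t
            cur_sum, cur_start = x, t
        else:
            cur_sum += x
        if cur_sum > best_sum:
            best_sum, best_span = cur_sum, (cur_start, t)
    return best_span
```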

  31. Outline • Introduction • Bursty events? Text streams? • A Possible Method • Document pivot clustering • Proposed Work • Feature pivot clustering • Results Highlight • Related Works • Summary & Future Work

  32. Problem Setting • Data archived • Source: Local news stories (South China Morning Post) • Period: 2003-01-01 to 2004-12-31 • Major settings • Offline detection • News stories that are released on the same day (i.e. news stories that appear in the same issue of the newspaper) are grouped together as a batch

  33. Results Highlight

  34. Key Problem • Will two bursty features be wrongly grouped together into the same bursty event if the two features have very similar feature distributions? • Sars, Outbreak and Iraq have similar feature distributions, but Sars and Outbreak should be grouped together as the bursty event SARS, while Sars and Iraq should not be grouped together into any bursty event.

  35. Explanation • Grouping bursty features Sars and Iraq: |DSars| = 3240, |DIraq| = 2404, |DSars ∩ DIraq| = 153, |DSars ∪ DIraq| = |M| = 5491 • If Sars and Iraq are grouped: P(D|Ek) = (3240/5491)(2404/5491) = 0.258, and the cost c = 0.190 + 0.588 = 0.778, where P(Ek) = 0.646 • If they are not grouped: P(D|Ek) = (3240/5491)(1 − 2404/5491) = 0.332, and the cost c = 0 + 0.479 = 0.479, where P(Ek) = 1

  36. Explanation • Grouping bursty features Sars and Outbreak: |DSars| = 3240, |DOutbreak| = 2254, |DSars ∩ DOutbreak| = 1854, |DSars ∪ DOutbreak| = |M| = 3640 • If Sars and Outbreak are grouped: P(D|Ek) = (3240/3640)(2254/3640) = 0.551, and the cost c = 0.043 + 0.259 = 0.302, where P(Ek) = 0.906 • If they are not grouped: P(D|Ek) = (3240/3640)(1 − 2254/3640) = 0.338, and the cost c = 0 + 0.471 = 0.471, where P(Ek) = 1
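The numbers on these two slides can be reproduced directly. The cost form c = −log10 P(Ek) − log10 P(D|Ek) is inferred from the figures shown here, and the P(Ek) values (0.646 and 0.906) are taken from the slides as given rather than recomputed (their derivation is in the paper).

```python
# Reproducing the two worked examples on the previous slides.
import math

def grouping_cost(df_a, df_b, m, p_ek_grouped):
    """df_a, df_b: document frequencies of the two features,
       m: |D_a ∪ D_b|, p_ek_grouped: P(Ek) if the features are grouped (as given)."""
    grouped     = (df_a / m) * (df_b / m)          # P(D|Ek) when grouped
    not_grouped = (df_a / m) * (1 - df_b / m)      # P(D|Ek) when the second feature is left out
    cost_grouped     = -math.log10(p_ek_grouped) - math.log10(grouped)
    cost_not_grouped = -math.log10(1.0)           - math.log10(not_grouped)
    return cost_grouped, cost_not_grouped

print(grouping_cost(3240, 2404, 5491, 0.646))  # ~ (0.778, 0.479): grouping costs more -> keep Iraq out
print(grouping_cost(3240, 2254, 3640, 0.906))  # ~ (0.302, 0.47):  grouping costs less -> group Outbreak with Sars
```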

  37. Key Problem • Will two bursty features be wrongly grouped together into the same bursty event if the two features have very similar feature distributions? • Sars, Outbreak and Iraq have similar feature distributions, but Sars and Outbreak should be grouped together as the bursty event SARS, while Sars and Iraq should not be grouped together into any bursty event. • The feature-pivot clustering approach can correctly group such features together.

  38. Results Highlight • Some events

  39. Results Highlight Bursty Events

  40. Results Highlight Hot Periods

  41. Outline • Introduction • Bursty events? Text streams? • A Possible Method • Document pivot clustering • Proposed Work • Feature pivot clustering • Results Highlight • Related Works • Conclusion

  42. Related Works (1 of 2) • TDT – Automatic techniques for locating topically related material in stream data (Wayne 2000, p. 1487) • Five major tasks: segmentation, tracking, detection, first story detection, linking • Works well with the “document-pivot clustering” approach • Tries to group similar documents to form an event (the event is not named, i.e. there is no need to extract or identify the main features of the event) • No need to figure out the “bursty features” • Other interesting issue • Our approach naturally combines the detection task and the linking task

  43. Related Works (2 of 2) • Many other related works • Vlachos et al. SIGMOD’04 • Bursts for online search queries • Smith SIGIR’02 • Event detection • Kleinberg KDD’02 • Bursty and hierarchical structure • Swan & Allan SIGIR’00 • Time-varying features • …

  44. Outline • Introduction • Bursty events? Text streams? • A Possible Method • Document pivot clustering • Proposed Work • Feature pivot clustering • Results Highlight • Related Works • Summary & Future Work

  45. Summary & Future Work • Document Pivot Clustering vs. Feature Pivot Clustering • Document Pivot Clustering – clustering is based on the content of the documents • Feature Pivot Clustering – clustering is based on the distribution of features • Future Work • Try to apply the framework to the TDT dataset • However, TDT contains selected news stories from multiple sources. The distribution of features may be affected. • Moreover, the time period of TDT is relatively short. We do not know whether the change in the distribution of features is significant enough for analysis. • Try to assign the same features to multiple events (more realistic) • However, this may lead to many new issues, such as “cycles” appearing, or some parameters needing to be introduced

  46. Thank you very much – The End –
