An Introduction to Detection of News Events
Mouli Venkataramani

References

  • James Allan et al., Topic Detection and Tracking Pilot Study: Final Report, Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, Feb 1998.

  • Yiming Yang et al., Learning Approaches for Detecting and Tracking News Events.


Outline

  • Importance of news

  • Terminology

  • Event Evolution

  • Patterns in Event Distribution

  • TDT

  • Major Tasks

  • New Event Detection

  • Clustering

  • On-Line New Event Detection


Importance of News

Examples

  • A person returns from an extended vacation and needs to find out quickly what happened in the world

  • A foreign policy specialist who wants to study the Asian economic crisis

  • Query based retrieval is useful only when one knows precisely the nature of the events or facts one is seeking

  • Retrieval based on immediate-content-focussed queries is often insufficient for tracking the gradual evolution of events through time


News in Financial World

  • The impact of news on stock prices has been widely studied in the financial world. Examples of such news include

    • Earnings reports

    • Splits

    • Merger talks

    • Good news / bad news


Some Terminology

  • Topic

    • A seminal event or activity along with all directly related events and activities

  • Event

    • Something that happens at some specific time and place

  • Event vs. topic

    • The property of time is what distinguishes an event from the more general topic

  • Example event

    • Computer virus detected at British Telecom, March 3, 1993

  • Example topic

    • Computer virus outbreaks


Event Evolution

  • As an event evolves, new lexical features appear

  • Example

    • Oklahoma City bombing


Patterns in Event Distributions

  • News stories discussing the same event tend to be temporally proximate

  • A time gap between bursts of topically similar stories is often an indication of different events

    • Different earthquakes

    • Airplane accidents

  • A significant vocabulary shift and rapid changes in term frequency are typical of stories reporting a new event, including previously unseen proper nouns

  • Events are typically reported within a relatively brief time window (1 to 4 weeks)
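
The time-gap pattern above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_into_bursts`, the 28-day default gap (chosen to echo the 1 to 4 week window), and the sample dates are all hypothetical and do not come from the presentation.

```python
from datetime import date


def split_into_bursts(story_dates, max_gap_days=28):
    """Group dates of topically similar stories into bursts.

    A new burst starts whenever the gap to the previous story exceeds
    max_gap_days (an assumed, tunable cut-off).
    """
    bursts = []
    for d in sorted(story_dates):
        if bursts and (d - bursts[-1][-1]).days <= max_gap_days:
            bursts[-1].append(d)   # close in time: same burst, likely same event
        else:
            bursts.append([d])     # large gap: start a new burst
    return bursts


# Made-up publication dates for stories that all mention earthquakes.
dates = [date(1995, 1, 17), date(1995, 1, 18), date(1995, 1, 20),
         date(1995, 5, 28), date(1995, 5, 29)]
bursts = split_into_bursts(dates)
```

Each resulting burst is a candidate for a separate event; a real system would combine this temporal cue with lexical similarity rather than rely on dates alone.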


TDT & The Corpus

  • TDT

    • Topic detection and tracking

  • A corpus of text and transcribed news has been developed to support the TDT study effort

  • This study corpus spans the period from July 1, 1994 to June 30, 1995

  • Includes 16,000 stories, half from Reuters newswire and half from CNN broadcast news

  • Stories are arranged in chronological order

  • A set of 25 target events has been identified to support the TDT effort


Tasks in News Detection

  • News feeds are the input to three tasks

    • Segmentation

    • Detection

      • Retro

      • On-Line

    • Tracking


Task Explained

  • Segmentation

    • Defined as the task of segmenting a continuous stream of text into its constituent stories, i.e. locating the boundaries between adjacent stories

  • Detection

    • Characterized by a lack of prior knowledge of the events to be detected. It takes one of two forms

      • Retrospective detection, where the task is to identify all the events in a corpus of stories

      • On-line new event detection, where the task is to identify new events in a stream of stories

  • Tracking

    • Defined as the task of associating incoming stories with events known to the system


New Event Detection

  • New event detection is an unsupervised learning task

  • Detection may consist of discovering previously unidentified events in an accumulated collection (retrospective detection)

  • Or it may consist of flagging the onset of new events from live news feeds (on-line detection)

  • The system has no advance knowledge of new events, but it has access to unlabeled historical data as a contrast set

  • The input to on-line detection is the stream of TDT stories in chronological order simulating real-time incoming documents

  • The output of on-line detection is a YES/NO decision per document


Clustering in Information Retrieval

  • Document clustering is an unsupervised process that groups documents with similar content

  • Clustering methods cluster documents in groups containing overlapping sets of words

  • Used effectively in query-based retrieval systems such as web search engines

  • Improves speed and effectiveness, because the query is matched against the clusters instead of all documents and the best-matching cluster is returned

  • Agglomerative clustering and single pass clustering are most commonly used


Clustering Algorithms

  • Agglomerative clustering – reviewed in class

  • Single pass clustering or incremental clustering

    • Documents are processed serially

    • The representation for the first document becomes the cluster representative for the first cluster

    • Each subsequent document is matched against all cluster representatives existing at processing time

    • A given document is assigned to one cluster according to some similarity measure

    • When a document is assigned to a cluster the representative for that cluster is recomputed

    • If a document fails a certain similarity test it becomes the cluster representative of a new cluster
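
The single pass steps above can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the bag-of-words representation, the summed term vector used as the cluster representative, the cosine measure, and the threshold value 0.3 are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector (an assumed, simplistic representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_cluster(docs, threshold=0.3):
    """Return one cluster label per document, processing documents serially."""
    representatives = []  # one summed term vector per cluster (assumed representative)
    labels = []
    for text in docs:
        vec = vectorize(text)
        scores = [cosine(vec, rep) for rep in representatives]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            # Assign to the best cluster and recompute its representative.
            representatives[best] = representatives[best] + vec
            labels.append(best)
        else:
            # Similarity test failed: the document seeds a new cluster.
            representatives.append(vec)
            labels.append(len(representatives) - 1)
    return labels


docs = [
    "earthquake hits japan coast",
    "japan earthquake damage coast",
    "stock market rally continues",
    "market rally lifts stock prices",
]
labels = single_pass_cluster(docs)
```

With these made-up headlines the two earthquake stories share one cluster and the two market stories share another. The result depends on document order, which is inherent to single pass clustering.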


Modified Single Pass Clustering

  • A slightly different version of single pass clustering is to use all the documents for comparison instead of just the cluster representative

  • Example
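
As a hedged illustration (not the presenter's original example), the variant that compares against every earlier document can be sketched like this. It assumes that a document joins the cluster of its most similar earlier document, which is one possible reading of "use all the documents for comparison"; the helper names, the bag-of-words vectors, and the threshold 0.4 are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_all_docs(docs, threshold=0.4):
    """Single pass clustering that compares each document with all earlier documents."""
    vectors, labels = [], []
    next_label = 0
    for text in docs:
        vec = vectorize(text)
        best_score, best_label = 0.0, None
        for prev_vec, prev_label in zip(vectors, labels):
            score = cosine(vec, prev_vec)
            if score > best_score:
                best_score, best_label = score, prev_label
        if best_label is not None and best_score >= threshold:
            labels.append(best_label)   # join the nearest earlier document's cluster
        else:
            labels.append(next_label)   # no earlier document is similar enough
            next_label += 1
        vectors.append(vec)
    return labels


docs = [
    "flood river town evacuated",
    "town evacuated rescue teams",
    "rescue teams find survivors",
    "stock market rally continues",
]
labels = single_pass_all_docs(docs)
```

In this made-up stream the third story shares no words with the first, yet it lands in the same cluster because it is close to the second. Comparing against individual documents lets a cluster follow an event as its vocabulary drifts, at the cost of more comparisons per document.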


On-line New Event Detection

  • A new document is absorbed by the most similar cluster in the past if the similarity between the document and the cluster is above a pre-selected clustering threshold

  • For on-line new event detection we need a second threshold, called the novelty threshold. If the maximal similarity score between the current document and any cluster in the past is below the novelty threshold, the document is labeled “new”, meaning that it is the first story of a new event; otherwise it is labeled “old”

  • Both thresholds are user specified and require tuning

  • The most important additional feature is a time penalty. There are two approaches

    • Uniformly weighted time window

    • Linear decaying-weight function
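
The two-threshold decision described above can be sketched as below, leaving the time penalty aside. This is a sketch under assumptions, not the presenter's system: the summed-vector cluster representation, the helper names, and the threshold values 0.5 and 0.2 are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_new_events(stream, cluster_threshold=0.5, novelty_threshold=0.2):
    """Label each incoming story "new" or "old" while clustering on the fly."""
    clusters = []  # summed term vector per past cluster
    flags = []
    for text in stream:
        vec = vectorize(text)
        scores = [cosine(vec, c) for c in clusters]
        best_score = max(scores, default=0.0)
        # Novelty decision: below the novelty threshold means first story of a new event.
        flags.append("new" if best_score < novelty_threshold else "old")
        # Clustering decision: absorb into the most similar cluster or start a new one.
        if best_score >= cluster_threshold:
            best = scores.index(best_score)
            clusters[best] = clusters[best] + vec
        else:
            clusters.append(vec)
    return flags


stream = [
    "earthquake hits japan coast",
    "japan earthquake damage coast",
    "stock market rally continues",
]
flags = detect_new_events(stream)
```

Keeping the two thresholds separate means a story can be judged "old" without being similar enough to merge into an existing cluster; in that case it starts its own cluster here.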


New Event Detection (Contd.)

  • Given the current document (x) in the input stream, we impose a time window of the (m) documents prior to (x) and define the time-adjusted similarity sim_w between (x) and any cluster (c) in the past, according to the approach used

    • Uniformly weighted time window: sim_w (x,c) = sim (x,c) if cluster (c) has any member document in the time window

    • Linear decaying weight: sim_w (x,c) = (1 - i/m) * sim (x,c) if cluster (c) has any member document in the time window

    • Where (i) is the number of documents between document (x) and the most recent member document in cluster (c)

    • sim (x,c) is the usual cosine similarity
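
A small sketch of the time-window adjustment, covering both the uniform and the linear decaying variants. Two points are assumptions not stated on the slide: a cluster counts as inside the window when i < m, and a cluster outside the window is given a score of 0. The function name `windowed_similarity` is hypothetical.

```python
def windowed_similarity(sim, i, m, linear_decay=True):
    """Apply the time window to a cosine similarity score.

    sim: cosine similarity between document x and cluster c
    i:   number of documents between x and the most recent member of c
    m:   window size in documents
    """
    if i >= m:
        # Assumption: no member of c falls in the window, so c is ignored.
        return 0.0
    if linear_decay:
        return (1 - i / m) * sim  # weight shrinks linearly with distance
    return sim                    # uniform window: score is unchanged


recent = windowed_similarity(0.8, 0, 10)        # most recent member is adjacent
halfway = windowed_similarity(0.8, 5, 10)       # member sits halfway back in the window
uniform = windowed_similarity(0.8, 5, 10, linear_decay=False)
outside = windowed_similarity(0.8, 12, 10)
```

With linear decay, a cluster whose latest story is halfway back in the window keeps half of its similarity score, so an otherwise similar story is more likely to fall below the novelty threshold and be flagged as a new event.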


Take Home Message

  • Event detection, tracking and clustering form an integral part of news detection

  • The field is relatively young and is very “hot” due to rapid advances in the internet domain

  • As we saw at the beginning, timely news detection and the handling of generic queries are important

  • Methodologies from multivariate statistics form the backbone for all applications

