
First Story Detection: Combining Similarity and Novelty Based Approaches


Presentation Transcript


  1. First Story Detection: Combining Similarity and Novelty Based Approaches Martin Franz, Abraham Ittycheriah, J. Scott McCarley, Todd Ward, IBM T. J. Watson Research Center

  2. What is First Story Detection? • Have we seen this before? • If it is not old, it must be new. • Novelty measured at three levels: • word: “Bloomberg” • story: Bloomberg wins! … (NYT 11/7/2001, page 1) • story cluster: NYC mayoral elections in 2001 [timeline: past → future]

  3. Outline • our first participation in FSD • combined approach: • story similarity (unsupervised clustering) • term novelty

  4. FSD with Unsupervised Clustering [flowchart] For each incoming story, compute the story/cluster similarity score against every existing cluster; if the best score > threshold, merge the story into that cluster, otherwise start a new cluster. FSD confidence = 1 / best similarity score.
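A minimal Python sketch of this single-pass clustering loop. The `similarity(story, cluster)` function stands in for the symmetrized Okapi score of the next slide, and the threshold value is an illustrative assumption, not the system's tuned setting.

```python
THRESHOLD = 0.1  # illustrative value, not the tuned threshold from the paper

def first_story_detection(stories, similarity):
    """Single-pass clustering; returns one FSD confidence per story."""
    clusters = []      # each cluster is a list of stories
    confidences = []
    for story in stories:
        scores = [similarity(story, cluster) for cluster in clusters]
        best = max(scores, default=0.0)
        if best > THRESHOLD:
            # old topic: merge the story into its closest cluster
            clusters[scores.index(best)].append(story)
        else:
            # likely a first story: start a new cluster
            clusters.append([story])
        # FSD confidence = 1 / best similarity score (slide 4);
        # a low best score means the story looks new
        confidences.append(1.0 / best if best > 0 else float("inf"))
    return confidences
```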

  5. Story/Cluster Similarity • cluster representation: “mean story” • symmetrized Okapi formula: Ok(s,c) = Σ_t cnt_s(t) * cnt_c(t) * idf(t) • cnt is a warped, length-scaled term count
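A hedged sketch of the symmetrized Okapi similarity against a cluster's "mean story". The slides do not spell out the warping or length scaling, so the BM25-style saturation below (constants K1, B) is an assumption about the functional form.

```python
K1, B = 1.2, 0.75  # assumed warping constants (BM25-style), not from the paper

def warped_count(raw, doc_len, avg_len):
    # warped, length-scaled term count (assumed functional form)
    return raw * (K1 + 1) / (raw + K1 * (1 - B + B * doc_len / avg_len))

def okapi_similarity(story_counts, cluster_counts, idf, avg_len):
    # cluster_counts is the "mean story": term counts averaged over
    # the stories already merged into the cluster
    s_len = sum(story_counts.values())
    c_len = sum(cluster_counts.values())
    return sum(
        warped_count(story_counts[t], s_len, avg_len)
        * warped_count(cluster_counts[t], c_len, avg_len)
        * idf.get(t, 0.0)
        for t in story_counts.keys() & cluster_counts.keys()
    )
```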

  6. Text Pre-Processing • tokenizing • part-of-speech tagging • morphing • word_tag -> morph • computers_NNS -> computer • computed_VBD -> compute • unigrams and noun bigrams
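A small sketch of the term extraction described on this slide, assuming the upstream tokenizer, part-of-speech tagger, and morphological analyzer already produced (morph, tag) pairs; the tooling itself is not specified in the slides.

```python
def extract_terms(tagged_tokens):
    """Build the term list of slide 6: morphed unigrams plus noun bigrams.

    tagged_tokens: list of (morph, tag) pairs, e.g. ("computer", "NNS"),
    produced by upstream tokenizing, tagging, and morphing.
    """
    # unigrams: the morphed (base) form of every token,
    # e.g. computers_NNS -> computer, computed_VBD -> compute
    terms = [morph for morph, _tag in tagged_tokens]
    # noun bigrams: adjacent pairs of noun-tagged tokens
    for (m1, t1), (m2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1.startswith("NN") and t2.startswith("NN"):
            terms.append(f"{m1} {m2}")
    return terms
```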

  7. Refinement: Cluster Recency [plot: scores vs. distance from the first story (TDT2, January-March); correct rejects: flat; false alarms: decreasing with the distance from the seed story]

  8. Clusters are more “attractive” shortly after they are created: score’ = score * (1 + 2^(-age/half-time)), half-time ≈ 2 days ≈ 860 stories
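A one-function sketch of this recency boost, using the half-time of roughly 860 stories quoted on the slide; measuring cluster age in stories is an assumption about units.

```python
HALF_TIME = 860.0  # ~2 days of the stream, expressed in stories (slide 8)

def recency_adjusted(score, cluster_age):
    # score' = score * (1 + 2^(-age / half-time)): a freshly created
    # cluster's score is roughly doubled, and the boost decays toward zero
    return score * (1 + 2 ** (-cluster_age / HALF_TIME))
```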

  9. After Incorporating Cluster Recency

  10. Effect of Cluster Recency [results before (baseline) vs. after (cluster recency) for several half-time settings; TDT2, first 10,000 stories]

  11. Baseline vs. Cluster Recency (TDT3, ASR, reference boundaries)

  12. Effect of Cluster Recency (TDT3, ASR, reference boundaries)

  13. Processing Very Short Stories, Automatic Boundaries • Problem: numerous segmentation false alarms, resulting in short “stories” that cause FSD false alarms. • Solution: finding and connecting similar neighboring stories; a “catch all” cluster.

  14. Processing Very Short Stories • Problem: short “stories” causing FSD false alarms. • Solution: if best similarity score = 0 or story vocabulary size < 20, then story -> “catch all” cluster.
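A sketch of this routing rule; the function name and return labels are illustrative, and only the two conditions and the cutoff of 20 come from the slide.

```python
MIN_VOCAB = 20  # vocabulary-size cutoff from slide 14

def route_short_story(story_vocab, best_similarity):
    # segmentation false alarms tend to be very short and to match nothing;
    # park them in a single "catch all" cluster instead of letting each one
    # start a new cluster of its own
    if best_similarity == 0 or len(story_vocab) < MIN_VOCAB:
        return "catch_all"
    return "regular_clustering"
```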

  15. Term Novelty Feature • new story ~ new words and phrases • score(t) = (1 - 2^(-distance/half-time)) * tf * idf • half-time = (dev_corpus_size / df) * c, with c tuned on TDT2 (January-March, clean)
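A sketch of the term-novelty score. The constant c was tuned on TDT2 (January-March) and the development-corpus size is not given in the slides, so both numbers below are placeholders.

```python
C = 1.0                   # placeholder for the constant tuned on TDT2
DEV_CORPUS_SIZE = 10_000  # placeholder development-corpus size

def term_novelty(distance, tf, idf, df):
    # score(t) = (1 - 2^(-distance / half-time)) * tf * idf,
    # with half-time = (dev_corpus_size / df) * c, i.e. the half-time
    # scales inversely with the term's document frequency (slide 15)
    half_time = (DEV_CORPUS_SIZE / df) * C
    return (1 - 2 ** (-distance / half_time)) * tf * idf
```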

  16. Combining Similarity and Novelty Scores score_FSD = 0.8 * score_Sim + 0.2 * score_Nov

  17. Combining Similarity and Novelty Scores [results for TDT3, manual; TDT3, ASR; and TDT3, ASR, automatic boundaries]

  18. FSD on Mandarin (Systran) Data [results for reference and automatic boundaries; det_SR=nwt+bnasr_TE=mul,eng.ndx; October-December, Mandarin only, 99 topics]

  19. FSD on Mandarin (Systran) and English Data [results for reference and automatic boundaries; det_SR=nwt+bnasr_TE=mul,eng.ndx; October-December, Mandarin (Systran) + English, 240 topics, 39 with a Mandarin first story]

  20. Conclusion • The cluster recency feature brings a moderate performance gain. • The term novelty approach shows acceptable performance and is more robust to noise. • Combining the two algorithms improves performance under most conditions. • As the noise level grows, the performance gain obtained by combining the novelty and similarity systems increases.

  21. Lessons Learned • Automatic FSD is a hard problem • Solution: deeper story understanding?
