Story Segmentation in English Mandarin and Arabic Broadcast News

Andrew Rosenberg, Julia Hirschberg

Columbia University

5/31/06


Outline

  • Introduction

  • Approach

  • Motivating Example

  • Results

HLT/NAACL'06 - Practice Talk


Why do we need story segmentation?

  • News shows commonly contain many distinct stories.

  • Identifying story (topic) boundaries improves:

    • Summarization [Hearst and Plaunt 93]

    • Information Retrieval [Hearst and Plaunt 93]

    • Anaphora Resolution [Kozima 93]

Our Approach

  • Input: Speech signal, ASR transcript, automatic speaker diarization and automatic sentence boundary hypotheses

  • Assume: story boundaries occur at sentence boundaries

  • JRip: a Java implementation of Cohen’s RIPPER, a rule induction algorithm.

    • build rulesets for each show individually

    • using lexical, acoustic and speaker-dependent features.
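
As a rough illustration of the assumption above, each hypothesized sentence boundary can be treated as one classification instance that is labeled as a story boundary or not. The sketch below is not the authors' code: the `Candidate` class, the `label_candidates` helper, the times, and the choice to snap each reference story boundary to the nearest sentence boundary are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """One hypothesized sentence boundary (hypothetical container)."""
    time: float                                   # sentence end time, in seconds
    features: dict = field(default_factory=dict)  # lexical/acoustic/speaker features
    is_story_boundary: bool = False               # training label


def label_candidates(sentence_ends, story_boundaries):
    """Mark the sentence boundary nearest to each reference story boundary."""
    candidates = [Candidate(time=t) for t in sentence_ends]
    for boundary in story_boundaries:
        nearest = min(candidates, key=lambda c: abs(c.time - boundary))
        nearest.is_story_boundary = True
    return candidates


# Invented times, for illustration only.
sentence_ends = [4.2, 9.8, 15.1, 21.7, 27.3]
story_boundaries = [15.4, 27.3]
candidates = label_candidates(sentence_ends, story_boundaries)
labels = [c.is_story_boundary for c in candidates]
print(labels)  # [False, False, True, False, True]
```

The labeled instances, once their feature dictionaries are filled in, would be what a rule learner such as JRip is trained on, one model per show.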

Corpus Description

  • Broadcast News material from TDT4 Corpus distributed by LDC

    • English: 312.5 hrs, 450 broadcasts, 6 shows

    • Mandarin: 134 hrs, 205 broadcasts, 3 shows

    • Arabic: 88.5 hrs, 109 broadcasts, 2 shows

  • Each broadcast contains between 10 and 20 stories
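
The per-language figures above can be combined into totals and a mean broadcast length with a few lines of arithmetic. This is only a convenience sketch over the numbers on the slide; the `corpus` dictionary and `summarize` helper are names introduced here.

```python
# Figures copied from the corpus slide: (hours, broadcasts, shows) per language.
corpus = {
    "English": (312.5, 450, 6),
    "Mandarin": (134.0, 205, 3),
    "Arabic": (88.5, 109, 2),
}


def summarize(corpus):
    """Return total hours, total broadcasts, and mean minutes per broadcast."""
    total_hours = sum(hours for hours, _, _ in corpus.values())
    total_broadcasts = sum(count for _, count, _ in corpus.values())
    mean_minutes = {
        lang: 60.0 * hours / count for lang, (hours, count, _) in corpus.items()
    }
    return total_hours, total_broadcasts, mean_minutes


total_hours, total_broadcasts, mean_minutes = summarize(corpus)
print(f"{total_hours} hours across {total_broadcasts} broadcasts")
for lang, minutes in mean_minutes.items():
    print(f"{lang}: about {minutes:.1f} minutes per broadcast")
```

This gives 535 hours over 764 broadcasts, with mean broadcast lengths of roughly 42, 39 and 49 minutes for English, Mandarin and Arabic respectively.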

Story segmentation example

the united states finished at the top with a total of ninety seven medals thirty nine of them gold russian china and australia rounded up its four andrea run economy seemed to champion style welcome home even though she was stripped of her individual gold medal at the sydney olympics in armenian gymnast tested positive for a banned stimulant that was in a nonprescription cold medicine she took from any as government is honoring her with its own gold medal inscribed everlasting olympic champion the international olympic committee did allow run a con to keep her team gold medal and the silver medal she won in the vote compass a spokeswoman says republican senator strom thurmond is going very well after falling ill saturday he spent the night it will to read army medical center in washington

Story segmentation example

  • the united states finished at the top with a total of ninety seven medals thirty nine of them gold russian china and australia rounded up its four

  • andrea run economy seemed to champion style welcome home even though she was stripped of her individual gold medal at the sydney olympics in armenian gymnast tested positive for a banned stimulant

  • that was in a nonprescription cold medicine she took from any as government is honoring her with its own gold medal inscribed everlasting olympic champion

  • the international olympic committee did allow run a con to keep her team gold medal and the silver medal she won in the vote compass a spokeswoman says republican senator strom thurmond is doing very well after falling ill saturday

  • he spent the night it will to read army medical center in washington

Story segmentation example

  • the united states finished at the top with a total of ninety seven medals thirty nine of them gold russian china and australia rounded up its four

  • andrea run economy seemed to champion style welcome home even though she was stripped of her individual gold medal at the sydney olympics ***in armenian gymnast tested positive for a banned stimulant

  • --- that was in a nonprescription cold medicine she took from any as government is honoring her with its own gold medal inscribed everlasting olympic champion

  • the international olympic committee did allow run a con to keep her team gold medal and the silver medal she won in the vote compass*** a spokeswoman says republican senator strom thurmond is doing very well after falling ill saturday

  • he spent the night it will to read army medical center in washington

Lexical Features

  • TextTiling

    • Identify boundaries with locally minimal cosine similarity of the preceding and following regions.

  • LCSeg

    • Augments the above process by weighting cosine similarity by a lexical chain score: a measure of lexical repetition.

  • ‘Cue’ Unigrams

    • Those (stemmed, when available) unigrams that are significantly more likely to appear near the start or end of a story.
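
The TextTiling idea above, placing boundaries where the cosine similarity between the preceding and following regions is locally minimal, can be sketched as follows. This is a simplified illustration, not the TextTiling or LCSeg implementation used in the talk: it omits smoothing, depth scoring, stemming and lexical chain weighting, and the window size and toy sentences are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def gap_similarities(sentences, window=2):
    """sims[i] compares the `window` sentences before and after the gap
    between sentence i and sentence i + 1."""
    sims = []
    for i in range(len(sentences) - 1):
        before = Counter(
            w for s in sentences[max(0, i + 1 - window):i + 1] for w in s.split()
        )
        after = Counter(w for s in sentences[i + 1:i + 1 + window] for w in s.split())
        sims.append(cosine(before, after))
    return sims


def local_minima(sims):
    """Gap indices whose similarity is strictly lower than both neighbours."""
    minima = []
    for i, value in enumerate(sims):
        left = sims[i - 1] if i > 0 else math.inf
        right = sims[i + 1] if i + 1 < len(sims) else math.inf
        if value < left and value < right:
            minima.append(i)
    return minima


# Toy content-word "sentences", invented for illustration.
sentences = [
    "olympics gold medal team",
    "gymnast gold medal stripped",
    "olympic committee gymnast medal",
    "senator ill saturday",
    "senator hospital night",
    "senator doing well hospital",
]
sims = gap_similarities(sentences)
boundaries = local_minima(sims)
print(boundaries)  # [2]: a boundary between sentence 2 and sentence 3
```

On this toy input the only local minimum falls at the gap between the gymnast story and the senator story, where the two windows share no words.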

Acoustic Features

  • Pitch and Intensity

    • Min, max, median, mean, std.dev., mean absolute slope

    • Calculated over the previous sentence, and as the first-order difference between the previous and following sentences

  • Vowel Duration

    • Mean vowel length, sentence final vowel length, sentence final rhyme length

  • Ratio of voiced to total 10 ms frames, as an approximation of speaking rate
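
A minimal sketch of how the pitch and intensity statistics listed above might be computed from per-frame contours. The function names, the use of population standard deviation, the definition of slope as the absolute frame-to-frame change, and the sign convention of the difference (following minus previous) are assumptions, not details given in the talk.

```python
import statistics


def contour_stats(values):
    """Summary statistics over a per-frame contour (pitch or intensity)."""
    slopes = [abs(b - a) for a, b in zip(values, values[1:])]
    return {
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "stdev": statistics.pstdev(values),
        "mean_abs_slope": statistics.mean(slopes) if slopes else 0.0,
    }


def boundary_features(prev_contour, next_contour):
    """Statistics of the previous sentence plus first-order differences
    (following sentence minus previous sentence) for each statistic."""
    prev_stats = contour_stats(prev_contour)
    next_stats = contour_stats(next_contour)
    features = dict(prev_stats)
    for name, value in prev_stats.items():
        features["delta_" + name] = next_stats[name] - value
    return features


def voiced_ratio(voiced_flags):
    """Fraction of 10 ms frames that are voiced, a proxy for speaking rate."""
    return sum(voiced_flags) / len(voiced_flags)


# Invented pitch values in Hz, for illustration only.
prev_pitch = [200.0, 180.0, 160.0, 140.0]
next_pitch = [220.0, 210.0, 190.0]
features = boundary_features(prev_pitch, next_pitch)
ratio = voiced_ratio([1, 1, 0, 1])
print(features["max"], features["delta_max"], ratio)
```

In this toy case the previous sentence falls steadily in pitch and the following sentence starts higher, so the difference features are positive.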

Speaker-dependent Features

  • Based on automatic speaker diarization

    • Performed by our collaborators at the University of Washington

  • Normalization of acoustic features.

  • Speaker participation features as a rough approximation of speaker “role”.

    • What percent of the show’s sentences does the speaker of the previous sentence deliver?

    • First turn? Last turn?

    • Is this the first speaker in the show?
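
The speaker participation questions above can be answered from a per-sentence list of diarization labels. The sketch below is one possible reading, not the authors' implementation: the feature names are borrowed loosely from the ruleset excerpts, and the interpretation of "first turn" and "last turn" as the speaker's first and last contiguous run of sentences in the show is an assumption.

```python
from collections import Counter


def speaker_features(speakers, i):
    """Features for the candidate boundary after sentence i.

    `speakers` holds one diarization label per sentence in the show.
    """
    speaker = speakers[i]
    # Find the contiguous run of sentences (the "turn") that contains sentence i.
    start = i
    while start > 0 and speakers[start - 1] == speaker:
        start -= 1
    end = i
    while end + 1 < len(speakers) and speakers[end + 1] == speaker:
        end += 1
    return {
        # Share of the show's sentences delivered by this speaker.
        "speaker_distribution": Counter(speakers)[speaker] / len(speakers),
        "first_turn": speaker not in speakers[:start],
        "last_turn": speaker not in speakers[end + 1:],
        "first_speaker": speaker == speakers[0],
        "speaker_boundary": i + 1 < len(speakers) and speakers[i + 1] != speaker,
    }


# Invented diarization labels, for illustration only.
speakers = ["A", "A", "B", "B", "A", "C", "A"]
after_reporter = speaker_features(speakers, 3)
after_anchor = speaker_features(speakers, 0)
print(after_reporter)
print(after_anchor)
```

Here speaker "A" delivers most of the show and opens it, which is the kind of pattern that might be expected of an anchor, while "B" has a single turn.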

Ruleset Excerpts

  • (ENG) If (speaker_distribution > 0.16) and (length > 15.85) and (maxI > 80.5) and (minI < 43.6) and (vowels_per_sec < 3) then SEGMENT

  • (MAN) If (speaker_boundary) and (last_speaker_turn) and (speaker_distribution > 0.036) and (vowels_per_sec_norm > 1.03) then SEGMENT

  • (ARB) If (following_cue_words > 1) and (preceding_cue_words > 1) and (meanI < 67.8) and (stdev_I > 8.0) then SEGMENT
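
Each excerpt is a conjunction of threshold tests, so it can be written directly as a predicate over a feature dictionary. The sketch below transcribes the three rules shown; the `classify` helper, the `NO_SEGMENT` default label, and the example feature values are assumptions added for illustration, and the full learned rulesets contain more rules than these excerpts.

```python
def english_rule(f):
    return (
        f["speaker_distribution"] > 0.16
        and f["length"] > 15.85
        and f["maxI"] > 80.5
        and f["minI"] < 43.6
        and f["vowels_per_sec"] < 3
    )


def mandarin_rule(f):
    return bool(
        f["speaker_boundary"]
        and f["last_speaker_turn"]
        and f["speaker_distribution"] > 0.036
        and f["vowels_per_sec_norm"] > 1.03
    )


def arabic_rule(f):
    return (
        f["following_cue_words"] > 1
        and f["preceding_cue_words"] > 1
        and f["meanI"] < 67.8
        and f["stdev_I"] > 8.0
    )


def classify(features, rules):
    """Predict SEGMENT if any rule fires, else a default label (assumed)."""
    return "SEGMENT" if any(rule(features) for rule in rules) else "NO_SEGMENT"


# Invented feature values for one candidate boundary in an English show.
candidate = {
    "speaker_distribution": 0.25,
    "length": 18.0,
    "maxI": 82.0,
    "minI": 40.0,
    "vowels_per_sec": 2.5,
}
fired = classify(candidate, [english_rule])
not_fired = classify({**candidate, "length": 12.0}, [english_rule])
print(fired, not_fired)  # SEGMENT NO_SEGMENT
```

Because rulesets are built per show, the list passed to `classify` would hold only the rules learned for the show the candidate comes from.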

Results - English

Results - Mandarin

Results - Arabic

Conclusions

  • The described approach is successful at detecting story boundaries in English and Mandarin BN, though recall could be improved.

  • The acoustic features shown here and elsewhere to be indicative of topic shifts are not discriminative on Arabic BN.

    • Arabic has a different intonation strategy

    • MSA is not any speaker’s native language

    • Errors from previous processing (ASR, sentence segmentation) hinder the effectiveness of acoustic analysis.

Thank You

{amaxwell,[email protected]
