Speech Summarization

Sameer R. Maskey


Summarization

  • ‘the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks)’ [Mani and Maybury, 1999]

Indicative or Informative

  • Indicative

    • Suggests contents of the document

    • Better suited for searchers

  • Informative

    • Meant to represent the document

    • Better suits users who want the overview

Speech Summarization

  • Speech summarization entails ‘summarizing’ speech

    • Identify important information relevant to users and the story

    • Represent the important information

    • Present the extracted/inferred information as an addition or substitute to the story

Are Speech and Text Summarization similar?

  • NO!

    • Speech Signal

    • Prosodic features

    • NLP tools?

    • Segments – sentences?

    • Generation?

    • ASR transcripts

    • Data size

  • Yes

    • Identifying important information

    • Some lexical, discourse features

    • Extraction

Text vs. Speech Summarization (NEWS)

[Diagram contrasting text and speech news summarization. Labels: speech signal; speech channels (phone, remote satellite, station); transcripts (manual, ASR, closed captioned) versus error-free text; many speakers and speaking styles; lexical features versus some lexical features; anchor and reporter interaction; story presentation; prosodic features (pitch, energy, duration); NLP tools; commercials, weather report]


Why speech summarization?

  • Multimedia production and size are increasing: need less time-consuming ways to archive, extract, use and browse speech data - speech summarization, a possible solution

  • Due to temporal nature of speech, difficult to scan like text

  • User-specific summaries of broadcast news are useful

  • Summarizing voicemails can help us better organize voicemails

  • Sentence extraction with similarity measures [Salton, et al., 1995]

  • Extraction trained with manual summaries [McKeown, et al., 2001]

  • Concept level: extract concept units [Hovy & Lin, 1999]

  • Generate words/phrases [Witbrock & Mittal, 1999]

  • Use of structured data [Maybury, 1995]

Summarization by sentence extraction with similarity measures [Salton, et al., 1995]

  • Many present day techniques involve sentence extraction

  • Extract sentences by finding those similar to the topic sentence, or dissimilar to the summary built so far (Maximal Marginal Relevance)

  • Find sentences similar to the topic sentence

  • Various similarity measures [Salton, et al., 1995]

    • Cosine Measure

    • Vocabulary Overlap

    • Topic words overlap

    • Content Signatures Overlap
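The extraction idea above can be sketched in a few lines. This is a minimal illustration, not the method of any cited paper: it assumes whitespace tokenization, raw term counts as sentence vectors, and a hypothetical trade-off weight `lam` between similarity to the topic sentence and redundancy with sentences already selected.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def mmr_extract(sentences, topic, k=2, lam=0.7):
    """Pick k sentence indices: similar to the topic, dissimilar to what is already picked."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    topic_vec = Counter(topic.lower().split())
    selected = []
    while len(selected) < min(k, len(sentences)):
        best, best_score = None, -math.inf
        for i, vec in enumerate(vecs):
            if i in selected:
                continue
            # Redundancy: highest similarity to any sentence already in the summary.
            redundancy = max((cosine(vec, vecs[j]) for j in selected), default=0.0)
            score = lam * cosine(vec, topic_vec) - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return sorted(selected)  # present extracted sentences in document order
```

With a low `lam` the redundancy penalty dominates, so a near-duplicate of an already selected sentence is skipped in favor of a different one.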

“Automatic text structuring and summarization” [Salton, et al., 1995]

  • Uses hypertext link generation to summarize documents

  • Builds intra-document hypertext links

  • Coherent topic distinguished by separate chunk of links

  • Remove the links that are not in close proximity

  • Traverse along the nodes to select a path that defines a summary

  • Traverse order can be

    • Bushy Path: constructed out of the n most bushy nodes

    • Depth first Path: Traverse the most bushy path after each node

    • Segmented bushy path: construct bushy paths individually and connect them on text level
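A rough sketch of the simplest traversal, the bushy path, follows. It assumes the text units are paragraphs, that a link exists wherever pairwise similarity reaches a hypothetical `threshold`, and that "bushiness" is simply the number of links at a node; the similarity matrix itself is taken as given.

```python
def bushy_path(sim, threshold=0.2, n=3):
    """Return the n most linked ("bushy") nodes, in text order.

    sim is a symmetric matrix of pairwise similarities between text units.
    """
    size = len(sim)
    # Bushiness of a node: number of other nodes it is linked to.
    degree = [
        sum(1 for j in range(size) if j != i and sim[i][j] >= threshold)
        for i in range(size)
    ]
    # Rank by bushiness, breaking ties by earlier position in the text.
    ranked = sorted(range(size), key=lambda i: (-degree[i], i))
    return sorted(ranked[:n])
```

Returning the nodes in text order keeps the extracted units readable as a summary.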

Text relationship map [Salton, et al., 1995]

Summarization by feature-based statistical models [Kupiec, et al., 1995]

  • Build manual summaries using available number of annotators

  • Extract set of features from the manual summaries

  • Train the statistical model with the given set of values for manual summaries

  • Use the trained model to score each sentence in the test data

  • Extract ‘n’ highest scoring sentences

  • Various statistical models/machine learning

    • Regression Models

    • Various classifiers

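    • Bayes rule for computing the probability of inclusion by counting [Kupiec, et al., 1995]

      $P(s \in S \mid F_1, \ldots, F_k) = \dfrac{P(s \in S) \prod_{j=1}^{k} P(F_j \mid s \in S)}{\prod_{j=1}^{k} P(F_j)}$

      where S is the summary, F_1, ..., F_k are the k features, and P(F_j) and P(F_j | s ∈ S) can be computed by counting occurrences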
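The counting-based Bayes scoring can be sketched as below. This is an illustrative version under stated assumptions: features are discrete values, they are treated as independent, and no smoothing is applied. The function names `train` and `score` are hypothetical.

```python
from collections import Counter


def train(sentences):
    """sentences: list of (feature_tuple, in_summary) pairs with discrete feature values."""
    total = len(sentences)
    in_summary = sum(1 for _, label in sentences if label)
    feat_counts, feat_counts_in = Counter(), Counter()
    for feats, label in sentences:
        for j, value in enumerate(feats):
            feat_counts[(j, value)] += 1
            if label:
                feat_counts_in[(j, value)] += 1
    return total, in_summary, feat_counts, feat_counts_in


def score(feats, model):
    """Estimate P(s in S | F_1..F_k) from counts, assuming independent features."""
    total, in_summary, feat_counts, feat_counts_in = model
    if in_summary == 0:
        return 0.0
    p = in_summary / total  # prior P(s in S)
    for j, value in enumerate(feats):
        p_f = feat_counts[(j, value)] / total
        if p_f == 0:
            return 0.0  # feature value never seen in training
        p *= (feat_counts_in[(j, value)] / in_summary) / p_f
    return p
```

Sentences in the test data would then be ranked by `score`, and the n highest scoring ones extracted. Because of the independence assumption, the value is best read as a ranking score rather than a calibrated probability.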

Summarization by concept/content level extraction and generation [Hovy & Lin, 1999], [Witbrock & Mittal, 1999]

  • Quite a few text summarizers are based on extracting concepts/content and presenting them as a summary

    • Concept Words/Themes

    • Content Units [Hovy & Lin, 1999]

    • Topic Identification

  • [Hovy & Lin, 1999] uses Concept Wavefront to build concept taxonomy

    • Builds concept signatures by finding relevant words in 30000 WSJ documents each categorized into different topics

  • Phrase concatenation of relevant concepts/content

  • Sentence planning for generation

Summarization of structured text databases [Maybury, 1995]

  • Summarization of text represented in a structured form: database, templates

    • Report generation of a medical history from a database is such an example

  • Link analysis (semantic relations within the structure)

  • Domain dependent importance of events

Speech summarization: present

  • Speech Summarization seems to be mostly based on extractive summarization

  • Extraction of words, sentences, content units

  • Some compression methods have also been proposed

  • Generation as in some text-summarization techniques is not available/feasible

    • Mainly due to the nature of the content

  • Sentence extraction with similarity measures [Christensen et al., 2004]

  • Word scoring with dependency structure [Hori C. et al., 1999, 2002], [Hori T. et al., 2003]

  • [Koumpis & Renals, 2004]

  • User access information [He et al., 1999]

  • Removing disfluencies [Zechner, 2001]

  • Weighted finite state transducers [Hori T. et al., 2003]

Content/Context sentence level extraction for speech summary [Christensen et al., 2004]

  • These are commonly used speech summarization techniques:

    • finding sentences similar to the lead topic sentences

    • Using position features to find the relevant nearby sentences after detecting the topic sentence

      where Sim is a similarity measure between two sentences

Weighted finite state transducers for speech summarization [Hori T. et al., 2003]

  • Speech summarization includes speech recognition, paraphrasing and sentence compaction, integrated into a single Weighted Finite State Transducer (WFST)

  • Enables the decoder to employ all the knowledge sources in a one-pass strategy

  • Speech recognition using WFST

    $H \circ C \circ L \circ G$

    where H is the state network of triphone HMMs, C is the triphone connection rules, L is the pronunciation lexicon and G is the trigram language model

  • Paraphrasing can be looked at as a kind of machine translation with translation probability P(W|T) where W is source language and T is the target language

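  • If S is the WFST representing translation rules and D is the language model of the target language, speech summarization can be looked at as the following composition

    $H \circ C \circ L \circ G \circ S \circ D$

    where $H \circ C \circ L \circ G$ is the speech recognizer and the full composition is the speech translator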

User access information for finding salient parts [He et al., 1999]

  • Idea is to summarize lectures or shows by extracting the parts that have been viewed the longest

  • Needs multiple users of the same show, meeting or lecture to gather statistically significant training data

  • For summarizing lectures compute the time spent on each slide

  • Summarizer based on user access logs did as well as summarizers that used linguistic and acoustic features

    • Average score of 4.5 on a scale of 1 to 8 for the summarizer (subjective evaluation)

Word level extraction by scoring/classifying words [Hori C. et al., 1999, 2002]

  • Score each word in the sentence and extract a set of words to form a sentence whose total score is the product/sum of the scores of each word

  • Example:

    • Word Significance score (topic words)

    • Linguistic Score (bigram probability)

    • Confidence Score (from ASR)

    • Word Concatenation Score (dependency structure grammar)

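      $S(V) = \sum_{m=1}^{M} \{ L(v_m \mid v_{m-1}) + \lambda_I I(v_m) + \lambda_C C(v_m) + \lambda_T T_r(v_{m-1}, v_m) \}$

      where M is the number of words to be extracted, and λ_I, λ_C, λ_T are weighting factors for balancing among L, I, C, and T_r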
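One way to pick the best-scoring set of words is dynamic programming over word positions. The sketch below is a simplified illustration, not the cited authors' implementation: `word_scores` is assumed to already combine the per-word terms (linguistic, significance and confidence scores), and `concat_score` is a hypothetical stand-in for the word concatenation score between two adjacent extracted words.

```python
def extract_words(words, word_scores, concat_score, m, lam_t=1.0):
    """Pick m words, kept in their original order, that maximize the summed score."""
    n = len(words)
    if not 1 <= m <= n:
        raise ValueError("m must be between 1 and the number of words")
    neg_inf = float("-inf")
    # best[k][i]: best score of a k-word extract whose last word is words[i]
    best = [[neg_inf] * n for _ in range(m + 1)]
    back = [[None] * n for _ in range(m + 1)]
    for i in range(n):
        best[1][i] = word_scores[i]
    for k in range(2, m + 1):
        for i in range(k - 1, n):
            for j in range(k - 2, i):
                cand = (best[k - 1][j] + word_scores[i]
                        + lam_t * concat_score(words[j], words[i]))
                if cand > best[k][i]:
                    best[k][i], back[k][i] = cand, j
    # Trace back from the best final word.
    i = max(range(n), key=lambda idx: best[m][idx])
    chosen = []
    for k in range(m, 0, -1):
        chosen.append(words[i])
        i = back[k][i]
    return chosen[::-1]
```

For example, giving a filler such as "uh" a negative score makes the extract skip it, which is also how low ASR confidence would push a word out of the summary.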


  • There are a few assumptions made in the previously mentioned methods

    • Segmentation

    • Information Extraction

    • Automatic Speech Recognition

    • Manual Transcripts

    • Annotation

Speech Segmentation?

  • Segmentation

    • Sentences

    • Stories

    • Topic

    • Speaker

  • Sentences

  • Topics

  • Features

  • Techniques

  • Evaluation





  • Text Retrieval Methods on ASR Transcripts

Information Extraction from Speech Data?

  • Information Extraction

    • Named Entities

    • Relevant Sentences and Topics

    • Weather/Sports Information

  • Sentences

  • Topics

  • Features

  • Techniques

  • Evaluation





  • Text Retrieval Methods on ASR Transcripts

Audio segmentation











Audio segmentation methods

  • Can be roughly divided into two categories

    • Language Models [Dharanipragada, et al., 1999] , [Gotoh & Renals, 2000], [Maybury, 1998], [Shriberg, et al., 2000]

    • Prosody Models [Gotoh & Renals, 2000], [Meinedo & Neto, 2003] , [Shriberg, et al., 2000]

  • Different methods work better for different purposes and different styles of data [Shriberg, et al., 2000]

  • Discourse cues based method highly effective in broadcast news segmentation [Maybury, 1998]

  • Prosodic model outperforms most of the pure language modeling methods [Shriberg, et al., 2000], [Gotoh & Renals, 2000]

  • A combined model using NLP techniques on ASR transcripts together with prosodic features seems to work best

Overview of a few algorithms: statistical model [Gotoh & Renals, 2000]

  • Sentence Boundary Detection: Finite State Model that extracts boundary information from text and audio sources

  • Uses Language and Pause Duration Model

  • Language Model: Represent boundary as two classes with “last word” or “not last word”

  • Pause Duration Model:

    • Prosodic features strongly affected by word

  • Two models can be combined

  • Prosody Model outperforms language model

  • Combined model outperforms both

Segmentation using discourse cues [Maybury, 1998]

  • Discourse Cues Based Story Segmentation

  • Sentence segmentation is not possible with this method

  • Discourse Cues in CNN

    • Start of Broadcast

    • Anchor to Reporter Handoff, Reporter to Anchor Handoff

    • Cataphoric Segment (still ahead of this news)

    • Broadcast End

  • Time Enhanced Finite State Machine to represent discourse states such as anchor, reporter, advertisement, etc

  • Other features used are named entities, part of speech, discourse shifts “>>” speaker change, “>>>” subject change
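The caption markers mentioned above are the easiest cue to illustrate. The sketch below only splits closed-caption text on ">>>" (subject change) and ">>" (speaker change); it does not implement the finite state machine or the other cues, and the function name is hypothetical.

```python
import re


def split_on_caption_markers(caption_text):
    """Split closed-caption text into stories (">>>") and speaker turns (">>")."""
    stories = []
    # Split on the subject-change marker first so ">>" does not match inside ">>>".
    for story in re.split(r">>>", caption_text):
        turns = [turn.strip() for turn in re.split(r">>", story) if turn.strip()]
        if turns:
            stories.append(turns)
    return stories
```

Each returned story is a list of speaker turns, which could then feed handoff detection between anchor and reporter.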

Speech Segmentation

  • Segmentation methods essential for any kind of extractive speech summarization

  • Sentence Segmentation in speech data is hard

  • Prosody Model usually works better than Language Model

  • Different prosody features useful for different kinds of speech data

  • Pause features essential in broadcast news segmentation

  • Phone duration essential in telephone speech segmentation

  • Combined linguistic and prosody model works the best

Information Extraction from Speech

  • Different types of information need to be extracted depending on the type of speech data

  • Broadcast News:

    • Stories [Merlino, et al., 1997]

    • Named Entities [Miller, et al., 1999] , [Gotoh & Renals, 2000]

    • Weather information

  • Meetings

    • Main points by a particular speaker

    • Address

    • Dates

  • Voicemail

    • Phone Numbers [Whittaker, et al., 2002]

    • Caller Names [Whittaker, et al., 2002]

Statistical model for extracting named entities [Miller, et al., 1999], [Gotoh & Renals, 2000]

  • Statistical framework: let V denote the vocabulary and C the set of name classes

    • Modeling class information as a word attribute: denote e=<c, w> and model using

    • In the above equation, ‘e’ values for a word under two different classes are considered different. This brings a data sparsity problem

    • Maximum likelihood estimates by frequency counts

    • Most probable sequence of class names by Viterbi algorithm

    • Precision and recall of 89% for manual transcript with explicit modeling
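To make the counting and Viterbi steps concrete, here is a small sketch of a class tagger. It is a simplified stand-in, not the model from the cited papers: it assumes a first-order model with class-to-class transitions and class-to-word emissions, and it adds add-one smoothing (an assumption made here so unseen events do not get zero probability).

```python
import math
from collections import Counter


def train(tagged_sentences):
    """Estimate transition and emission counts from (word, class) sequences."""
    trans, emit, class_count = Counter(), Counter(), Counter()
    vocab = set()
    for sent in tagged_sentences:
        prev = "<s>"
        class_count[prev] += 1
        for word, cls in sent:
            trans[(prev, cls)] += 1
            emit[(cls, word)] += 1
            class_count[cls] += 1
            vocab.add(word)
            prev = cls
    return trans, emit, class_count, len(vocab)


def viterbi(words, model, classes):
    """Most probable class sequence for words under the smoothed count model."""
    trans, emit, class_count, vocab_size = model

    def log_t(prev, cls):
        return math.log((trans[(prev, cls)] + 1) / (class_count[prev] + len(classes)))

    def log_e(cls, word):
        return math.log((emit[(cls, word)] + 1) / (class_count[cls] + vocab_size + 1))

    # delta[c]: best log-probability of any path ending in class c
    delta = {c: log_t("<s>", c) + log_e(c, words[0]) for c in classes}
    backptrs = []
    for word in words[1:]:
        new_delta, ptr = {}, {}
        for c in classes:
            prev = max(classes, key=lambda q: delta[q] + log_t(q, c))
            new_delta[c] = delta[prev] + log_t(prev, c) + log_e(c, word)
            ptr[c] = prev
        delta = new_delta
        backptrs.append(ptr)
    path = [max(classes, key=lambda c: delta[c])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return path[::-1]
```

Working in log space avoids numerical underflow on long word sequences.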

Named entity extraction results [Miller, et al., 1999]

BBN Named Entity Performance as a function of WER [Miller, et al., 1999]

Information Extraction from Speech

  • Information Extraction from speech data essential tool for speech summarization

  • Named Entities, phone number, speaker types are some frequently extracted entities

  • Named Entity tagging in speech is harder than in text because ASR transcript lacks punctuation, sentence boundaries, capitalization, etc

  • Statistical models perform reasonably well on named entity tagging

Speech Summarization at Columbia

  • We make a few assumptions in segmentation and extraction

  • Some new techniques proposed

  • 2-level summary

    • Headlines for each story

    • Summary for each story

  • Summarization Client and Server model

Speech Summarization (NEWS)

[Diagram of speech news summarization. Labels: speech signal; speech channels (phone, remote satellite, station); transcripts (manual, ASR, closed captioned) versus error-free text; many speakers and speaking styles; lexical features versus some lexical features; anchor and reporter interaction; story presentation; prosodic features (pitch, energy, duration); many NLP tools; commercials, weather report]

Speech Summarization

[Diagram of the processing pipeline. Labels: sentence segmentation, speaker identification, speaker clustering, manual annotation, named entity detection, POS tagging; output: 2-level summary (headlines, summary)]


  • Topic Detection and Tracking Corpus (TDT-2)

  • We are using 20 “CNN Headline shows” for summarization

  • 216 stories in total

  • 10 hours of speech data

  • Using Manual transcripts, Dragon and BBN ASR transcripts

Annotations - Entities

  • We want to detect –

    • Headlines

    • Greetings

    • Signoff

    • SoundByte

    • SoundByte-Speaker

    • Interviews

  • We annotated all of the above entities and the named entities (person, place, organization)

Annotations – by Whom and How?

  • We created a labeling manual following ACE standards

  • Annotated by 2 annotators over the course of a year

  • 48 hours of CNN headlines news in total

  • We built a labeling interface dLabel v2.5 that went through 3 revisions for this purpose

Annotations - dLabel v2.5

Annotations – ‘Building Summaries’

  • 20 CNN shows annotated for extractive summary

  • A Brief Labeling Manual

  • No detailed instruction on what to choose and what not to?

  • We built a web-interface for this purpose, where annotator can click on sentences to be included in the summary

  • Summaries stored in a MySQL database

Annotations – Web Interface


Acoustic Features

  • F0 features

    • max, min, mean, median, slope

      • Change in pitch may be a topic shift

  • RMS energy feature

    • max, min, mean

      • Higher amplitude probably means a stress on the phrases

  • Duration

    • Length of sentence in seconds (endtime – starttime)

      • Very short or a long sentence might not be important for summary

  • Speaker Rate

    • how fast the speaker is speaking

      • Slower rate may mean more emphasis in a particular sentence
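The per-sentence statistics listed above are straightforward to compute once pitch and energy tracks are available. The sketch below assumes the F0 and RMS values for one sentence have already been extracted as lists of frame values, and it approximates speaking rate as words per second, which is an assumption made here rather than the definition used in the slides.

```python
from statistics import mean, median


def slope(values):
    """Least-squares slope of values against their frame index."""
    n = len(values)
    x_mean, y_mean = (n - 1) / 2, mean(values)
    denom = sum((x - x_mean) ** 2 for x in range(n))
    if denom == 0:
        return 0.0
    return sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values)) / denom


def acoustic_features(f0, rms, start_time, end_time, n_words):
    """Sentence-level acoustic features from frame-level F0 and RMS energy values."""
    duration = end_time - start_time
    return {
        "f0_max": max(f0), "f0_min": min(f0),
        "f0_mean": mean(f0), "f0_median": median(f0), "f0_slope": slope(f0),
        "rms_max": max(rms), "rms_min": min(rms), "rms_mean": mean(rms),
        "duration": duration,
        "speaking_rate": n_words / duration,  # assumed proxy: words per second
    }
```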

Acoustic Features – Problems in Extraction

  • What should be the segment to extract these features – sentences, turn, stories?

  • We do not have sentence boundaries.

  • A dynamic programming aligner to align manual sentence boundary with ASR transcripts

  • Feature values need to be normalized by speaker: used the Speaker Cluster ID available from BBN ASR
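Speaker normalization can be illustrated with a per-speaker z-score. The slides do not say which normalization was used, so the z-score here is an assumption; the speaker IDs stand in for the speaker cluster IDs.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_by_speaker(values, speaker_ids):
    """Z-score each value against the mean and std of its own speaker cluster."""
    groups = defaultdict(list)
    for value, speaker in zip(values, speaker_ids):
        groups[speaker].append(value)
    stats = {s: (mean(vs), pstdev(vs)) for s, vs in groups.items()}
    normalized = []
    for value, speaker in zip(values, speaker_ids):
        mu, sigma = stats[speaker]
        # A speaker with constant (or a single) value has no spread to scale by.
        normalized.append((value - mu) / sigma if sigma else 0.0)
    return normalized
```

After this step a pitch of 200 Hz means the same thing ("typical for this speaker") regardless of whose voice it came from.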

Acoustic Features – Praat: Extraction Tool

Lexical Features

  • Named Entities in a sentence

    • Person

    • People

    • Organization

    • Total count of named entities

  • Num. of words in a sentence

  • Num. of words in previous and next sentence

Lexical Features - Issues

  • Using Manual Transcript

  • Sentence boundary detection using Ratnaparkhi’s MXTERMINATOR

  • Named Entities annotated

  • For ASR transcript:

    • Sentence boundaries aligned

    • Automatic Named Entities detected using BBN’s IdentiFinder

    • Many NLP tools fail when used with ASR transcript

Structural Features

  • Position

    • Position of the sentence in the story and the turn

    • Turn position in the show

  • Speaker Type

    • Reporter or Not

  • Previous and Next Speaker Type

  • Change in Speaker Type

Discourse Feature

  • Given-New Feature Value

  • Computed using the following equation

    $S(i) = \dfrac{n_i}{d} - \dfrac{s_i}{t - d}$

    where n_i is the number of ‘new’ noun stems in sentence i, d is the total number of unique nouns, s_i is the number of noun stems that have already been seen, and t is the total number of nouns

  • Intuition:

    • ‘newness’ ~ more new unique nouns in the sentence (n_i/d)

    • If many nouns in the sentence have already been seen ~ higher ‘givenness’ s_i/(t-d)
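A small sketch of the given-new computation follows. It assumes the score is newness minus givenness, n_i/d - s_i/(t-d), and that noun stems per sentence have already been produced by a tagger and stemmer (not shown). Counting repeated nouns within one sentence is handled in a simple way that is an assumption of this sketch.

```python
def given_new_scores(noun_stems_per_sentence):
    """Score each sentence as newness (n_i / d) minus givenness (s_i / (t - d))."""
    all_nouns = [stem for sent in noun_stems_per_sentence for stem in sent]
    t = len(all_nouns)       # total number of nouns in the story
    d = len(set(all_nouns))  # number of unique nouns in the story
    seen, scores = set(), []
    for sent in noun_stems_per_sentence:
        n_i = len(set(sent) - seen)                   # new unique noun stems
        s_i = sum(1 for stem in sent if stem in seen)  # nouns already seen earlier
        newness = n_i / d if d else 0.0
        givenness = s_i / (t - d) if t > d else 0.0
        scores.append(newness - givenness)
        seen.update(sent)
    return scores
```

Sentences early in a story tend to score high because everything they mention is new, while later sentences that repeat earlier nouns score low or negative.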


  • Sentence Extraction as a summary

  • Binary Classification problem

    • ‘0’ not in the summary

    • ‘1’ in the summary

  • 10 hours of CNN news shows

  • 4 different sets of features – acoustic, lexical, structural, discourse

  • 10-fold cross-validation

  • 90/10 train and test

  • 4 different classifiers

  • WEKA and YALE learning tool

  • Feature Selection

  • Evaluation using F-Measure and ROUGE metrics

Feature Sets

  • We want to compare the various combinations of our 4 feature sets

    • Acoustic/Prosodic (A)

    • Lexical (L)

    • Structural (S)

    • Discourse (D)

  • Combinations of feature sets, 15 in total

    • L, A, …, L+A, L+S, … , L+A+S, … , L+S+D, … , L+A+S+D
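The count of 15 is the number of non-empty subsets of 4 feature sets (2^4 - 1). A short sketch that enumerates them, using the single-letter labels from the list above:

```python
from itertools import combinations

FEATURE_SETS = ["L", "A", "S", "D"]  # lexical, acoustic/prosodic, structural, discourse


def feature_set_combinations(sets=FEATURE_SETS):
    """All non-empty combinations of the feature sets, smallest first."""
    return [
        "+".join(combo)
        for size in range(1, len(sets) + 1)
        for combo in combinations(sets, size)
    ]
```

Each returned label names one experimental condition, so looping over the list runs one classifier comparison per combination.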


  • Choice of available classifier may affect the comparison of feature sets

  • Compared 4 different classifiers by plotting the threshold (ROC) curve and computing the Area Under the Curve (AUC)

  • The best possible classifier has an AUC of 1
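The area under the ROC curve can be computed without plotting anything: it equals the probability that a randomly chosen summary sentence gets a higher classifier score than a randomly chosen non-summary sentence. The sketch below uses that pairwise definition; it is quadratic in the number of sentences, which is fine for illustration.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A classifier that ranks every summary sentence above every non-summary sentence scores 1, and random scoring gives about 0.5.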

ROC Curves

Results – Best Combined Feature Set

  • We obtained the best F-measure for 10-fold cross-validation using all acoustic (A), lexical (L), discourse (D) and structural (S) features.

  • F-Measure is 11.5% higher than the baseline.

What is the Baseline?

  • Baseline is the first 23% of sentences in each story.

    • On average, model summaries were 23% of the story length

  • In summarization, selecting the first n% of sentences is a fairly standard baseline

  • For our purpose this is a very strict baseline, why?

    • Because stories are short: on average, 18.2 sentences per story

    • In broadcast news it is standard to summarize the story in the introduction

    • These sentences are likely to be in the summary
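The lead baseline and its F-measure against a reference summary can be sketched as follows. Sentences are represented by their index within the story; rounding the 23% cutoff to the nearest whole sentence, with a minimum of one, is an assumption of this sketch.

```python
def lead_baseline(n_sentences, fraction=0.23):
    """Indices of the first `fraction` of sentences in a story (at least one)."""
    k = max(1, round(n_sentences * fraction))
    return set(range(k))


def f_measure(predicted, reference):
    """Balanced F-measure between predicted and reference sets of sentence indices."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)
```

For an 18-sentence story the baseline picks the first 4 sentences, so if the reference summary draws mostly from the introduction the baseline is already hard to beat.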

Baseline and the Best F-measure

F-Measure for All 15 Feature Sets

Evaluation using ROUGE

  • F-measure is too strict a measure

  • Predicted summary sentences have to match the reference summary sentences exactly

  • What if a predicted sentence is not an exact match but has similar content?

  • ROUGE takes account of this

ROUGE metric

  • Recall-Oriented Understudy for Gisting Evaluation (ROUGE)

  • ROUGE-N (where N=1,2,3,4 grams)

  • ROUGE-L (longest common subsequence)

  • ROUGE-S (skip bigram)

  • ROUGE-SU (skip bigram counting unigrams as well)
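To show why ROUGE is more forgiving than exact sentence matching, here is a minimal ROUGE-N recall sketch. It assumes a single reference summary and whitespace tokenization, and it is not a replacement for the official ROUGE toolkit.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall: clipped n-gram matches divided by n-grams in the reference."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total
```

A candidate sentence that differs from the reference by one word would score zero under exact matching, but still gets most of the credit here.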

Evaluation using ROUGE metric

  • On average, L+S+A+D is 30.3% higher than the baseline

Results - ROUGE

Does the importance of ‘what’ is said correlate with ‘how’ it is said?

  • Hypothesis: “Speakers change their amplitude, pitch, speaking rate to signify importance of words, phrases, sentences.”

  • If this is the case then the prediction labels for sentences predicted using acoustic features (A) should correlate with labels predicted using lexical features (L)

  • We found correlation of 0.74

  • This correlation strongly supports our hypothesis

Is It Possible to Build ‘good’ Automatic Speech Summarization Without Any Transcripts?

  • Just using A+S without any lexical features we get 6% higher F-measure and 18% higher ROUGE-avg than the baseline

Feature selection

  • We used feature selection to find the best feature set among all the features in the combined set

  • 5 best features are shown in the table

  • These 5 features span all 4 feature sets

  • Feature Selection also selected these 5 features as the optimal feature set

  • F-measure using just these 5 features is 0.53, which is only 1% lower than using all features

Problems and Future Work

  • We assume we have a good

    • Sentence boundary detection

    • Speaker IDs

    • Named Entities

  • We obtain very good speaker IDs and named entities from BBN but no sentence boundaries

  • We have to address the sentence boundary detection as a problem on its own.

    • Alternative solution: We can do a ‘breath group’ level segmentation and build a model based on such segmentation

More Current and Future Work

  • We annotated headlines, greetings, signoffs, interviews, soundbytes, soundbyte speakers, interviewees

  • We want to detect these entities

    • (students involved for detecting some of these entities – Aaron Roth, Irina Likhtina)

  • We want to present the summary and these entities in a unified browsable framework

    • (student involved – Lauren Wilcox)

  • The browser is implemented in client/server framework

Summarization Architecture

[Diagram of the system architecture. Recoverable component labels: SGML Parser, XML Parser, Named-Entity Tagger, Weather forecast]








Generation or Extraction?

  • SENT27 a trial that pits the cattle industry against tv talk show host oprah winfrey is under way in amarillo , texas.

  • SENT28 jury selection began in the defamation lawsuit began this morning .

  • SENT29 winfrey and a vegetarian activist are being sued over an exchange on her April 16, 1996 show .

  • SENT30 texas cattle producers claim the activists suggested americans could get mad cow disease from eating beef .

  • SENT31 and winfrey quipped , this has stopped me cold from eating another burger

  • SENT32 the plaintiffs say that hurt beef prices and they sued under a law banning false and disparaging statements about agricultural products

  • SENT33 what oprah has done is extremely smart and there's nothing wrong with it she has moved her show to amarillo texas , for a while

  • SENT34 people are lined up , trying to get tickets to her show so i'm not sure this hurts oprah .

  • SENT35 incidentally oprah tried to move it out of amarillo . she's failed and now she has brought her show to amarillo .

  • SENT36 the key is , can the jurors be fair

  • SENT37 when they're questioned by both sides, by the judge , they will be asked, can you be fair to both sides

  • SENT38 if they say , there's your jury panel

  • SENT39 oprah winfrey's lawyers had tried to move the case from amarillo , saying they couldn't get an impartial jury

  • SENT40 however, the judge moved against them in that matter …




  • We talked about different techniques to build summarization systems

  • We described some speech-specific summarization algorithms

  • We showed feature comparison techniques for speech summarization

    • A model using a combination of lexical, acoustic, discourse and structural features is one of the best models so far.

    • Acoustic features correlate with the content of the sentences

  • We discussed possibilities of summarizing speech without any transcribed text
