SCANMail: Audio Browsing and Retrieval of Voicemail

SCANMail: Audio Browsing and Retrieval of Voicemail • Julia Hirschberg, Michiel Bacchiani, Phil Isenhour, Aaron Rosenberg, Larry Stead, Steve Whittaker, Jon Wright, and Gary Zamchick (with Martin Jansche, Meredith Ringel, and Litza Stark)

The Problem: Navigating Audio Data • Increasing amounts of audio data available in corporate, public and private collections (recorded meetings, broadcast news and entertainment, voicemail) – but useless without tools for searching • SCANMail prototype: tool for searching speech data in voicemail domain

SCANMail • Inspired by interviews, surveys and usage logs identifying problems of heavy voicemail users: • It’s hard to quickly scan through new messages to find the ones you need to deal with (e.g. during a meeting break) • It’s hard to find the message you want in your archive • It’s hard to locate the information you want in any message (e.g. the telephone number) • SCANMail provides technology to help solve these problems, supporting content-based audio navigation

Related Research • Cambridge video mail retrieval by voice • NIST TREC Spoken Document Retrieval track • IBM voicemail transcription and information extraction • AT&T voicemail user studies • AT&T automatic speaker identification, browsing/search for voicemail, and information extraction

SCANMail Architecture

Training Corpus • Messages collected from 138 AT&T Labs voicemail boxes • 100 hr corpus includes ~10K messages from 2500 speakers • Hand-labeled for caller id, gender, age, recording condition, entities (names, dates, telephone numbers) • Gender balanced, ~12% non-native speakers • ~10% of calls not from ordinary handsets • Mean message duration 36.4 secs, median 30.0 secs

Sample Human Transcription [greeting: hi alex] [callerid: this is aaron rosenberg] uh i'm calling at uh about [time: ten twenty four] on [date: thursday morning the twenty second] uh externally from [telno: nine seven three three six o0 eight five four three] uh this time I have uh by hand registered uh uh you as having me as a caller so uh uh this message should be scored properly so let's see what happens and then I’ll send email to phil to check it [closing: bye bye]

Sample ASR Transcription i i'll accesses [callerid: aaron rosenberg] uh i'm calling you at uh about ten twenty four on thursday morning the twenty second uh externally from [telno: nine seven three three six o0 eight five four three] uh this time uh have uh back and registered uh uh you have have been me has a call on so uh uh this message at the school what properly so let's see what happens in the know uh send me mail to check it bye bye

ASR Server: baseline system • Trained on 60 hour training set • Gender independent, 8k tied states, emission probabilities modeled by 12 component Gaussian mixtures • LM uses 14k vocabulary and Katz-style backoff trigram trained on 700k words • Lexicon automatically generated by AT&T Labs TTS system • Decoder uses finite state transducers to construct recognition network • Initial search pass produces lattices used as grammars in later search passes

Accuracy • 29% WER → 24.4% now → ~21% with adaptation (future) • Speed • 2x real time for first pass • 12x real time for final transcription • Details: • Bacchiani (HLT ‘00, ICASSP ‘00); Hirschberg et al (Eurospeech ‘01)
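The WER figures above are word error rates. As a minimal sketch of how such a figure is commonly computed (word-level edit distance divided by the number of reference words), the function below may help; the name `word_error_rate` and the whitespace tokenization are assumptions for illustration, not the scoring tool used for SCANMail.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions, deletions, insertions)
    divided by the number of words in the reference transcript."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Fragment taken from the sample human and ASR transcriptions.
print(word_error_rate("send email to phil to check it",
                      "send me mail to check it"))
```

Because insertions count as errors, this ratio can exceed 1.0 for very poor hypotheses.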

Information Retrieval • Uses SMART IR engine (Salton 1971, Buckley 1985) → Lucene • Generates weighted term vectors for ASR transcripts and queries and computes similarity based on vector inner products • Both ASR transcripts and queries are preprocessed into tokens by removing common words (stop-listing) and stemming
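The retrieval steps on this slide (stop-listing, stemming, weighted term vectors, inner-product similarity) can be sketched as follows. This is an illustration only: the stop list, the crude suffix-stripping stemmer, and the smoothed tf-idf weighting are all assumptions, and neither SMART nor Lucene is used here.

```python
import math
from collections import Counter

# Tiny illustrative stop list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "to", "and", "uh", "i", "you", "at", "on", "is", "this"}


def stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    if word.endswith("s") and len(word) > 3:
        word = word[:-1]
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, strip punctuation, drop stop words, and stem."""
    tokens = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word and word not in STOP_WORDS:
            tokens.append(stem(word))
    return tokens


def build_index(docs):
    """Return one weighted term vector per document plus the idf table."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    idf = {t: math.log((n + 1) / (c + 1)) + 1.0 for t, c in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(toks).items()}
               for toks in tokenized]
    return vectors, idf


def inner_product(u, v):
    return sum(weight * v[term] for term, weight in u.items() if term in v)


def search(query, docs):
    """Return indices of matching documents, best match first."""
    vectors, idf = build_index(docs)
    q = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    scored = [(inner_product(q, v), i) for i, v in enumerate(vectors)]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in ranked if score > 0]


transcripts = [
    "calling about the meeting on thursday morning",
    "please send email to phil",
    "the budget meeting moved to friday",
]
print(search("meetings thursday", transcripts))
```

Applying the same preprocessing to queries and transcripts is what lets "meetings" in a query match "meeting" in an ASR transcript.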

Information Extraction • Extracts entities from the ASR transcripts • Old implementation used finite state transducers with hand designed costs • New statistical (trainable) system extracts phone numbers and caller names
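To make the extraction task concrete, here is a deliberately simple sketch that finds telephone numbers as long runs of spoken digit words in a transcript. It is a rule-based stand-in, not the finite state transducer or the statistical system described above; the word table, the `min_digits` cutoff, and the function name are assumptions.

```python
# Spoken forms mapped to digits; "oh" and "o" are treated as zero (an assumption).
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "o": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def extract_phone_numbers(transcript, min_digits=7):
    """Return digit strings for every run of at least min_digits digit words."""
    numbers = []
    run = []
    # The trailing empty token acts as a sentinel that flushes the last run.
    for token in transcript.lower().split() + [""]:
        digit = DIGIT_WORDS.get(token)
        if digit is not None:
            run.append(digit)
            continue
        if len(run) >= min_digits:
            numbers.append("".join(run))
        run = []
    return numbers


text = ("calling at about ten twenty four externally from "
        "nine seven three three six oh eight five four three uh this time")
print(extract_phone_numbers(text))
```

Short digit runs such as the "four" in "ten twenty four" fall below the cutoff and are ignored, which is the main reason for requiring a minimum run length.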

Caller Identification • Proposes caller names by matching new incoming messages against existing Text Independent Gaussian Mixture Models (TIGMMs) • If no PBX-supplied caller identification, caller ID hypothesis presented to user • Caller models trained/adapted based on user feedback • Initial model trained after 1 minute of speech collected from single caller • Model updates with each 20sec increment up to 180sec (mature model)

• With thresholds set to keep outgroup acceptance low (2.7%), the system had 11.5% ingroup rejection and 1.2% ingroup confusion for a 20-caller ingroup • For more detailed experimental results see Rosenberg (ICSLP ‘00, Eurospeech ‘01)
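The threshold trade-off and the model update schedule described for caller identification can be sketched as below. The scores are assumed to be per-caller match scores already produced by each caller's model (higher means a better match); how those scores are computed from the Gaussian mixture models is not shown, and all function names are hypothetical.

```python
def model_update_points(initial_sec=60, step_sec=20, mature_sec=180):
    """Amounts of collected speech (seconds) at which a caller model is
    trained or updated: first at 1 minute, then every 20 s up to 180 s."""
    return list(range(initial_sec, mature_sec + 1, step_sec))


def propose_caller(scores, threshold):
    """Return the best-scoring known caller, or None if the best score
    falls below the threshold (message treated as from an unknown caller)."""
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


def error_rates(trials, threshold):
    """trials: list of (true_caller, scores); true_caller is None for outgroup."""
    out_total = out_accepted = in_total = in_rejected = in_confused = 0
    for true_caller, scores in trials:
        guess = propose_caller(scores, threshold)
        if true_caller is None:
            out_total += 1
            out_accepted += guess is not None   # unknown caller given a name
        else:
            in_total += 1
            if guess is None:
                in_rejected += 1                # known caller not recognized
            elif guess != true_caller:
                in_confused += 1                # known caller given wrong name
    return {
        "outgroup_acceptance": out_accepted / out_total if out_total else 0.0,
        "ingroup_rejection": in_rejected / in_total if in_total else 0.0,
        "ingroup_confusion": in_confused / in_total if in_total else 0.0,
    }


print(model_update_points())
```

Raising the threshold lowers outgroup acceptance at the cost of more ingroup rejections, which is the trade-off behind the reported figures.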

Email Server • Composes multi-part email message and sends to address specified in user profile • ASR transcript • Speech file • Entity transcriptions and speech segments • Uses time aligned ASR transcript and IE information to include audio excerpts corresponding to entities • Newest version captures much of functionality of client in message (Netscape, MS, Eudora)
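A multi-part message of the kind described above could be assembled with Python's standard `email` package roughly as follows. This is a sketch under assumptions: the WAV audio type, the file names, the subject line, and the `(label, text, clip_bytes)` entity tuples are all illustrative, and the actual SCANMail email format is not reproduced here.

```python
from email.message import EmailMessage


def compose_voicemail_email(to_addr, from_addr, transcript, audio_bytes, entities):
    """Build an email with the ASR transcript as the body, the full speech
    file attached, and one audio excerpt attached per extracted entity.

    entities: list of (label, text, clip_bytes) tuples, where clip_bytes is
    the audio excerpt already cut out using the time-aligned transcript.
    """
    msg = EmailMessage()
    msg["To"] = to_addr
    msg["From"] = from_addr
    msg["Subject"] = "New voicemail"

    # Plain-text body: transcript followed by the entity transcriptions.
    body_lines = [transcript, ""]
    body_lines += [f"{label}: {text}" for label, text, _clip in entities]
    msg.set_content("\n".join(body_lines))

    # Full message audio, then one excerpt per entity.
    msg.add_attachment(audio_bytes, maintype="audio", subtype="wav",
                       filename="message.wav")
    for index, (label, _text, clip) in enumerate(entities, start=1):
        msg.add_attachment(clip, maintype="audio", subtype="wav",
                           filename=f"entity-{index}-{label}.wav")
    return msg


message = compose_voicemail_email(
    "user@example.com", "scanmail@example.com",
    "this is aaron rosenberg calling",
    b"placeholder audio bytes",
    [("telno", "9733608543", b"placeholder clip bytes")],
)
print(message.get_content_type())
```

The resulting object can be handed to `smtplib.SMTP.send_message` for delivery to the address in the user profile.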

Evaluation: User Studies • Compared SCANMail with standard over-the-phone interface (Audix) • 8 subjects performed fact-finding, relevance ranking and summarization tasks • SCANMail • Better for fact-finding and ranking tasks in quality/time measures (p<0.05) • Faster solutions for fact-finding task (p<0.01) • Rated higher on all subjective measures • Normalized performance scores higher when subjects employed successful IR searches (p<0.05)

Trials • 18 subjects in 2 month field trial • Usage: • 52% of messages weren’t played completely through • Only ~1% of messages deleted • After using SCANMail people thought (Strongly agree (1) → Strongly disagree (5)) • “Scanning messages is difficult” (2.8 → 4.7) • “I frequently replay messages” (1.9 → 3.5) • “I frequently take notes” (2.6 → 4.3) • “It’s hard to locate old messages” (2.7 → 5.0) • “It’s hard to extract info from messages” (2.5 → 5.0)

Recent Enhancements • Faster and more accurate ASR • Presentation of information as it becomes available (e.g. audio only, rough transcript of message, accurate transcript) • New SCANMail email format • New phone, Java phone, and iPAQ interfaces • “Urgent” and “personal” rankings of messages • Cut-and-paste of audio via ASR transcript

Further Research Issues • Additional information extracted from messages • Dates, times • Message gisting • Message threading • Faster/more accurate ASR • Interface issues

Demo • SCANMail client demo • External demo at http://www.fancentral.org/~isenhour/scanmail/demo.html
