
SCANMail: Audio Browsing and Retrieval of Voicemail

SCANMail is a tool for searching speech data in the voicemail domain. It provides content-based audio navigation to help users quickly scan new messages, find messages in an archive, and locate information within a message. The work is motivated by the growing amount of audio data in corporate, public, and private collections. The system was trained on a corpus of voicemail messages and combines ASR transcription, information retrieval, information extraction, and caller identification.



Presentation Transcript


  1. SCANMAIL: Audio Browsing and Retrieval of Voicemail Julia Hirschberg, Michiel Bacchiani, Phil Isenhour, Aaron Rosenberg, Larry Stead, Steve Whittaker, Jon Wright, and Gary Zamchick (with Martin Jansche, Meredith Ringel, and Litza Stark)

  2. The Problem: Navigating Audio Data • Increasing amounts of audio data available in corporate, public and private collections (recorded meetings, broadcast news and entertainment, voicemail) – but useless without tools for searching • SCANMail prototype: tool for searching speech data in voicemail domain

  3. SCANMail • Inspired by interviews, surveys and usage logs identifying problems of heavy voicemail users: • It’s hard to quickly scan through new messages to find the ones you need to deal with (e.g. during a meeting break) • It’s hard to find the message you want in your archive • It’s hard to locate the information you want in any message (e.g. the telephone number) • SCANMail provides technology to help solve these problems, supporting content-based audio navigation

  4. Related Research • Cambridge video mail retrieval by voice • NIST TREC Spoken Document Retrieval track • IBM voicemail transcription and information extraction • AT&T voicemail user studies • AT&T automatic speaker identification, browsing/search for voicemail, and information extraction

  5. SCANMail Architecture

  6. Training Corpus • Messages collected from 138 AT&T Labs voicemail boxes • 100 hr corpus includes ~10K messages from 2500 speakers • Hand-labeled for caller id, gender, age, recording condition, entities (names, dates, telephone numbers) • Gender balanced, ~12% non-native speakers • ~10% of calls not from ordinary handsets • Mean message duration 36.4 secs, median 30.0 secs

  7. Sample Human Transcription [greeting: hi alex] [callerid: this is aaron rosenberg] uh i'm calling at uh about [time: ten twenty four] on [date: thursday morning the twenty second] uh externally from [telno: nine seven three three six o0 eight five four three] uh this time I have uh by hand registered uh uh you as having me as a caller so uh uh this message should be scored properly so let's see what happens and then I’ll send email to phil to check it [closing: bye bye]

  8. Sample ASR Transcription i i'll accesses [callerid: aaron rosenberg] uh i'm calling you at uh about ten twenty four on thursday morning the twenty second uh externally from [telno: nine seven three three six o0 eight five four three] uh this time uh have uh back and registered uh uh you have have been me has a call on so uh uh this message at the school what properly so let's see what happens in the know uh send me mail to check it bye bye

  9. ASR Server: baseline system • Trained on 60 hour training set • Gender independent, 8k tied states, emission probabilities modeled by 12 component Gaussian mixtures • LM uses 14k vocabulary and Katz-style backoff trigram trained on 700k words • Lexicon automatically generated by AT&T Labs TTS system • Decoder uses finite state transducers to construct recognition network • Initial search pass produces lattices used as grammars in later search passes

  10. Accuracy • 29% WER → 24.4% now → ~21% with adaptation (future) • Speed • 2x real time for first pass • 12x real time for final transcription • Details: • Bacchiani (HLT ’00, ICASSP ’00); Hirschberg et al. (Eurospeech ’01)
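
The accuracy and speed figures on this slide are word error rates (WER) and real-time factors. The following is a minimal sketch of how such numbers are commonly computed, assuming the usual definition of WER as word-level edit distance divided by reference length. The slides do not say which scoring tool was used, and the function names here are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time relative to audio duration (2.0 means 2x real time)."""
    return processing_seconds / audio_seconds
```

Under this definition, a 30-second message decoded at 2x real time takes about 60 seconds in the first pass.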

  11. Information Retrieval • Uses SMART IR engine (Salton 1971, Buckley 1985) → Lucene • Generates weighted term vectors for ASR transcripts and queries and computes similarity based on vector inner products • Both ASR transcripts and queries are preprocessed into tokens by removing common words (stop-listing) and stemming
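
The retrieval steps on this slide (stop-listing, stemming, weighted term vectors, inner-product similarity) can be sketched as follows. This is an illustration only, not the SMART or Lucene implementation: the stop list, the crude suffix-stripping stemmer, and the smoothed TF-IDF weighting are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter

# Illustrative stop list; a real engine ships a much longer one.
STOP_WORDS = {"a", "the", "to", "and", "of", "uh", "i", "is", "at", "on", "this"}


def stem(token):
    """Crude suffix stripping, a stand-in for a real stemmer."""
    if token.endswith("s") and len(token) > 3:
        token = token[:-1]
    for suffix in ("ing", "ed"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, drop stop words, and stem the remaining tokens."""
    return [stem(t) for t in text.lower().split() if t not in STOP_WORDS]


def weight(tokens, idf):
    """Build a sparse TF-IDF vector; terms unseen at indexing time are dropped."""
    counts = Counter(tokens)
    return {term: n * idf[term] for term, n in counts.items() if term in idf}


def build_index(transcripts):
    """Return (idf table, one weighted term vector per transcript)."""
    token_lists = [preprocess(t) for t in transcripts]
    total = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    idf = {term: math.log((total + 1) / (df + 1)) + 1.0
           for term, df in doc_freq.items()}
    return idf, [weight(tokens, idf) for tokens in token_lists]


def inner_product(a, b):
    """Similarity between two sparse vectors."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def search(query, idf, vectors):
    """Indices of matching transcripts, best match first."""
    q = weight(preprocess(query), idf)
    scored = [(inner_product(q, v), i) for i, v in enumerate(vectors)]
    return [i for score, i in sorted(scored, key=lambda s: (-s[0], s[1]))
            if score > 0]
```

Because queries and transcripts pass through the same `preprocess` step, a query for "meetings" matches a transcript containing "meeting".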

  12. Information Extraction • Extracts entities from the ASR transcripts • Old implementation used finite state transducers with hand designed costs • New statistical (trainable) system extracts phone numbers and caller names
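
To make the entity-extraction task concrete, here is a minimal rule-based sketch that finds spoken telephone numbers in a transcript by collecting runs of digit words. It is a hand-written illustration, not the statistical system mentioned on the slide; the word list, the `min_digits` cutoff, and the function name are assumptions.

```python
# Spoken forms of digits; "oh" and "o" are treated as zero (an assumption).
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "o": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}


def extract_phone_numbers(transcript, min_digits=7):
    """Return digit strings for every run of at least min_digits digit words."""
    numbers = []
    run = []
    # The empty-string sentinel flushes a run that ends the transcript.
    for token in transcript.lower().split() + [""]:
        digit = DIGIT_WORDS.get(token)
        if digit is not None:
            run.append(digit)
            continue
        if len(run) >= min_digits:
            numbers.append("".join(run))
        run = []
    return numbers
```

Times such as "ten twenty four" are skipped because they contain no run of single-digit words, which is one reason a fixed rule like this is brittle on real ASR output and a trainable extractor is attractive.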

  13. Caller Identification • Proposes caller names by matching new incoming messages against existing Text Independent Gaussian Mixture Models (TIGMMs) • If no PBX-supplied caller identification, caller ID hypothesis presented to user • Caller models trained/adapted based on user feedback • Initial model trained after 1 minute of speech collected from single caller • Model updates with each 20sec increment up to 180sec (mature model)
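
The training schedule on this slide can be written down as a small helper. This sketch encodes one reading of the slide (first model at 60 seconds of speech, an update at every further 20 seconds, mature at 180 seconds); the constant and function names are hypothetical, and the actual GMM training is not shown.

```python
INITIAL_SECONDS = 60      # speech needed before the first model is trained
UPDATE_STEP_SECONDS = 20  # additional speech between model updates
MATURE_SECONDS = 180      # no further updates past this point


def training_checkpoints():
    """Accumulated-speech totals at which a caller model is trained or updated."""
    return list(range(INITIAL_SECONDS, MATURE_SECONDS + 1, UPDATE_STEP_SECONDS))


def model_state(speech_seconds):
    """Describe a caller model given the speech collected so far."""
    if speech_seconds < INITIAL_SECONDS:
        return "no model"
    if speech_seconds >= MATURE_SECONDS:
        return "mature"
    return "adapting"


def updates_due(previous_seconds, new_seconds):
    """Checkpoints crossed when a caller's accumulated speech grows."""
    return [c for c in training_checkpoints()
            if previous_seconds < c <= new_seconds]
```

With a mean message duration of 36.4 seconds, a caller would need roughly two average messages before a first model exists under this schedule.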

  14. Caller Identification (results) • With thresholds set to keep outgroup acceptance low (2.7%), the system had 11.5% ingroup rejection and 1.2% ingroup confusion for a 20-caller ingroup • For more detailed experimental results see Rosenberg (ICSLP ’00, Eurospeech ’01)
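
The three error types reported here come from an open-set decision: the best-scoring caller model is proposed only if its score clears a threshold. The sketch below assumes these definitions: outgroup acceptance is an unknown caller being given a name, ingroup rejection is a known caller being given no name, and ingroup confusion is a known caller being given the wrong name. The data layout and function names are hypothetical.

```python
def identify(scores, threshold):
    """Return the best-scoring caller name, or None if below the threshold."""
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


def error_rates(trials, threshold):
    """trials: list of (true_caller or None for outgroup, {name: score})."""
    in_total = out_total = rejected = confused = accepted_out = 0
    for true_caller, scores in trials:
        decision = identify(scores, threshold)
        if true_caller is None:
            out_total += 1
            if decision is not None:
                accepted_out += 1
        else:
            in_total += 1
            if decision is None:
                rejected += 1
            elif decision != true_caller:
                confused += 1
    return {
        "outgroup_acceptance": accepted_out / out_total if out_total else 0.0,
        "ingroup_rejection": rejected / in_total if in_total else 0.0,
        "ingroup_confusion": confused / in_total if in_total else 0.0,
    }
```

Raising the threshold lowers outgroup acceptance at the cost of more ingroup rejection, which is the trade-off the reported operating point reflects.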

  15. Email Server • Composes multi-part email message and sends to address specified in user profile • ASR transcript • Speech file • Entity transcriptions and speech segments • Uses time aligned ASR transcript and IE information to include audio excerpts corresponding to entities • Newest version captures much of functionality of client in message (Netscape, MS, Eudora)
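
A multi-part message of the kind described on this slide can be assembled with Python's standard `email` package. This is a sketch under assumptions, not the SCANMail email server: the WAV format, the subject line, the file names, and the `(label, text, clip_bytes)` entity layout are all illustrative.

```python
from email.message import EmailMessage


def compose_scanmail_email(to_addr, from_addr, transcript, audio_bytes, entities):
    """Build a multipart email with transcript, full audio, and entity clips.

    entities: list of (label, text, clip_bytes) tuples, e.g. a phone number
    and the audio excerpt in which it was spoken.
    """
    msg = EmailMessage()
    msg["To"] = to_addr
    msg["From"] = from_addr
    msg["Subject"] = "New voicemail"

    # Plain-text body: ASR transcript followed by the extracted entities.
    lines = [transcript, ""]
    lines += [f"{label}: {text}" for label, text, _ in entities]
    msg.set_content("\n".join(lines))

    # Whole message audio, then one short clip per entity.
    msg.add_attachment(audio_bytes, maintype="audio", subtype="wav",
                       filename="message.wav")
    for index, (label, _, clip) in enumerate(entities):
        msg.add_attachment(clip, maintype="audio", subtype="wav",
                           filename=f"{label}-{index}.wav")
    return msg
```

The returned object can be handed to `smtplib.SMTP.send_message` for delivery to the address in the user profile.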

  16. Evaluation: User Studies • Compared SCANMail with standard over-the-phone interface (Audix) • 8 subjects performed fact-finding, relevance ranking and summarization tasks • SCANMail • Better for fact-finding and ranking tasks in quality/time measures (p<0.05) • Faster solutions for fact-finding task (p<0.01) • Rated higher on all subjective measures • Normalized performance scores higher when subjects employed successful IR searches (p<0.05)

  17. Trials • 18 subjects in 2 month field trial • Usage: • 52% of messages weren’t played completely through • Only ~1% of messages deleted • After using SCANMail people thought (Strongly agree (1) → Strongly disagree (5)) • “Scanning messages is difficult” (2.8 → 4.7) • “I frequently replay messages” (1.9 → 3.5) • “I frequently take notes” (2.6 → 4.3) • “It’s hard to locate old messages” (2.7 → 5.0) • “It’s hard to extract info from messages” (2.5 → 5.0)

  18. Recent Enhancements • Faster and more accurate ASR • Presentation of information as it becomes available (e.g. audio only, rough transcript of message, accurate transcript) • New SCANMail email format • New phone, Java phone, and iPAQ interfaces • “Urgent” and “personal” rankings of messages • Cut-and-paste of audio via ASR transcript

  19. Further Research Issues • Additional information extracted from messages • Dates, times • Message gisting • Message threading • Faster/more accurate ASR • Interface issues

  20. Demo • SCANMail client demo • External demo at http://www.fancentral.org/~isenhour/scanmail/demo.html
