Next Generation Speech and Video: Support for Research in Advanced Speech Recognition Technologies
John Garofolo, IAD Speech Group

Overview
  • Directions in automatic speech recognition
  • DARPA EARS Program
  • NIST RT-02 Evaluation
  • NIST Meeting Data Collection Project
Our Vision of the Future
  • Tightly couple ASR with higher-level language technologies
    • non-lexical information from the source signal
      • speaker ID, speaking rate, prosody, emotion, non-speech sounds, etc.
    • real-time integration of language processing technologies
      • Semantic, syntactic, contextual, world knowledge, and ASR
    • integration with video input when available
      • face recognition, lip movement, gestures, people movement, object manipulation
  • Improved resources for readability and automatic content-processing technologies
    • Translation, Detection, Search, Extraction, Summarization, etc.
Enriched Transcription (Broadcast News Example)

Traditional ASR Output:

tonight this thursday big pressure on the clinton administration to do something about the latest killing in yugoslavia airline passengers and outrageous behavior at thirty thousand feet what can an airline do and now that el nino is virtually gone there is la nina to worry about from a. b. c. news world headquarters in new york this is world news tonight with peter jennings good evening

Possible Enriched ASR Output (annotated word stream plus other language processing):

<speaker name="Peter Jennings"> <sent> tonight this <proper_noun> thursday </proper_noun> big pressure on the <proper_noun> clinton </proper_noun> administration to do something about the latest killing in <proper_noun> yugoslavia </proper_noun> </sent> <sent> airline passengers and outrageous behavior at thirty thousand feet </sent> <sent type=interrogative> what can an airline do </sent> <sent type=interrogative> and now that <proper_noun> el nino </proper_noun> …

Derived Human-Readable Transcript:

Peter Jennings: Tonight this Thursday, big pressure on the Clinton administration to do something about the latest killing in Yugoslavia. Airline passengers and outrageous behavior at thirty thousand feet. What can an airline do? And now that El Nino is virtually gone, there is La Nina to worry about.

Announcer: From ABC News World Headquarters in New York, this is World News Tonight with Peter Jennings.

Peter Jennings: Good evening.
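Converting the enriched markup into the readable form can be largely mechanical. A minimal sketch, assuming only the tag conventions shown in the example above (the real EARS format was XML or DAML, and a production converter would use a proper parser rather than regular expressions):

```python
import re

# Sketch: flatten enriched ASR markup into readable text.
# Tag names mirror the slide's example only; purely illustrative.

def enriched_to_readable(markup: str) -> str:
    # Capitalize proper nouns, then drop the tags around them
    def cap_proper(m):
        return " ".join(w.capitalize() for w in m.group(1).split())
    text = re.sub(r"<proper_noun>\s*(.*?)\s*</proper_noun>", cap_proper, markup)

    out = []
    # Each <sent> becomes a sentence; interrogatives get a question mark
    for m in re.finditer(r"<sent(?:\s+type=(\w+))?>\s*(.*?)\s*</sent>", text):
        kind, body = m.group(1), m.group(2)
        body = body[0].upper() + body[1:]
        out.append(body + ("?" if kind == "interrogative" else "."))
    return " ".join(out)

sample = ("<sent>tonight this <proper_noun> thursday </proper_noun> big pressure "
          "on the <proper_noun>clinton </proper_noun> administration</sent>"
          "<sent type=interrogative>what can an airline do</sent>")
print(enriched_to_readable(sample))
# Tonight this Thursday big pressure on the Clinton administration. What can an airline do?
```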

DARPA EARS Program: Effective, Affordable, Reusable Speech-to-Text
  • Multi-faceted program to improve the state of the art
    • Accuracy
      • novel approaches: perceptual/articulatory and prosodic features, more sophisticated search networks and language models, metadata feedback
    • Utility
      • usable interfaces
      • transcription enhanced with metadata (rich transcription)
    • Portability
      • new domains/training data, new languages, flexible language models
    • Speed
      • faster, more efficient processing algorithms
  • NIST will provide evaluation infrastructure for EARS
    • Accuracy measurement of core STT and metadata recognition
    • Usability measurement within context of integrated systems
EARS Objective

  • Input: human-human speech (broadcasts, conversations)
  • Powerful speech-to-text technology serving multiple applications
  • Output: rich transcript (words + metadata)
  • Accurate enough for
    • Humans to read & understand easily
    • Machines to detect, extract, summarize, translate
EARS Structure

(Diagram: human-human speech feeds the core speech-to-text engine, which is built on linguistic data and novel approaches; metadata extraction turns its output into a rich transcription of words + metadata in an EARS standard format (XML or DAML); that output feeds TIDES algorithms for detection, extraction, translation, and summarization through a prototype system with interfaces; the pipeline is adaptable to different languages & media.)

Rich Transcription 2002 (RT-02)
  • Evaluation effort and workshop pushing the envelope of existing automatic transcription technology
    • Will also provide accuracy evaluation for DARPA EARS Program
  • Challenge test set to baseline current capabilities
    • ~3+ hours of news broadcasts, telephone conversations, and meeting excerpts
    • evaluation of automatic transcription of orthography AND generation of metadata annotations
      • metadata annotation will require new evaluation infrastructure
  • Dry run test April 2002
  • Workshop May 2002
  • http://www.nist.gov/speech/tests/rt/rt2002/
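Core STT accuracy in evaluations like RT-02 is reported as word error rate over an alignment of reference and hypothesis words. NIST's own scoring tools handle the alignment and reporting details; purely as an illustration of the metric itself, a minimal dynamic-programming sketch:

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """Word-level Levenshtein distance: (subs + dels + ins) / ref words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = min edits turning the first i ref words into the first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i          # deletions
    for j in range(len(h) + 1):
        d[0][j] = j          # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# One substitution ("an" -> "the") over five reference words
print(round(word_error_rate("what can an airline do",
                            "what can the airline do"), 2))  # 0.2
```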
RT-02 Metadata
  • Currently considered types:
    • Speaker change detection/identification
    • Acronyms
      • NIST is administering the EARS evaluations
    • Verbal edits
      • General Dynamics’, uh no General Electric’s stock soared yesterday...
    • Named entities
      • George Bush addressed the country...
    • Numeric expressions
      • The U S won thirty four medals in the Olympics
    • Temporal expressions
      • The U S was attacked on September eleventh
  • This list will most certainly expand/change in the future.
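To make the numeric-expression category above concrete, here is a toy normalizer for spoken two-digit numbers; it is purely illustrative and not any RT-02 annotation tool:

```python
# Toy spoken-number normalizer, illustrating numeric-expression metadata.
UNITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9}
TEENS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
         "fourteen": 14, "fifteen": 15, "sixteen": 16,
         "seventeen": 17, "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

def normalize(words: str) -> str:
    out, pending = [], None          # pending holds a tens value awaiting a unit
    for w in words.split():
        if w in TENS:
            if pending is not None:
                out.append(str(pending))
            pending = TENS[w]
        elif w in UNITS or w in TEENS:
            val = UNITS.get(w) or TEENS.get(w)
            if pending is not None and w in UNITS:
                out.append(str(pending + val))   # e.g. thirty + four -> 34
                pending = None
            else:
                if pending is not None:
                    out.append(str(pending))
                    pending = None
                out.append(str(val))
        else:
            if pending is not None:
                out.append(str(pending))
                pending = None
            out.append(w)
    if pending is not None:
        out.append(str(pending))
    return " ".join(out)

print(normalize("the u s won thirty four medals in the olympics"))
# the u s won 34 medals in the olympics
```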
RT-02 Meeting Transcription
  • Not part of EARS, but included as a look toward the future; of interest to much of the community
    • more challenging than broadcast news or telephone conversations
  • Will consist of eight 10-minute meeting excerpts collected at:
    • CMU, ICSI, LDC, and NIST
    • Very broad test set with respect to microphones, speakers, forums, noise
    • Will provide baseline for future meeting transcription research
  • Focus on personal mics (head boom or lapel) and a central omnidirectional mic
NIST Meeting Data Collection Project
  • Goals:
    • Provide rich/diverse pool of audio and video corpora for advanced recognition research
      • multiple sensor types – will add more over time
      • varied meeting forums and vocabularies
      • varying number and types of participants
    • Explore research and integration issues
    • Help provide infrastructure for integration and evaluation

www.nist.gov/speech/test_beds/mr_proj

Data Collection Infrastructure
  • Typical meeting space and noise environment
    • Standard meeting equipment
  • Instrumented with
    • 200 mics, 5 cameras, synchronized with SmartFlow across 13 processors
    • processors under floor and in adjacent room
    • several disk arrays
  • Monitor workstation
    • operator can start/stop data streams, select video views, audio channels, and manipulate cameras
  • Review workstation
    • participants can review meeting recordings and de-select excerpts from public distribution
NIST Pilot Corpus Design
  • 20 hours of meetings will be collected
    • ~60GB per hour uncompressed data rate
    • data distributed on large hard disks
    • distributed via the Linguistic Data Consortium
  • Varied forums
    • focus groups, game playing, interacting with experts, real working group meetings, event planning
  • Varied meeting lengths
    • 15 minutes to 1 hour
  • Varied number of participants
    • 3 to 8 participants
  • Subset to be annotated for RT-02
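The figures above imply a sizable corpus; a quick back-of-the-envelope check, using only the rates quoted on the slide:

```python
# Corpus size check from the quoted figures (20 hours at ~60 GB/hour).
GB_PER_HOUR = 60   # uncompressed data rate quoted above
HOURS = 20         # pilot corpus target

total_gb = GB_PER_HOUR * HOURS
print(f"{total_gb} GB (~{total_gb / 1000:.1f} TB)")  # 1200 GB (~1.2 TB)
```

At roughly 1.2 TB uncompressed, distribution on large hard disks (as the slide notes) rather than optical media was the practical choice for the era.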
NIST Smart Flow Distributed Processing

  • Multi-modal sensor arrays
  • Multi-channel data collection

  • Large-grain data flow for distributed processing of sensor data
  • Components and flows used by name:
    • network transparent
    • component transparent
  • The system handles detail work
    • resource searching
    • data pushing
    • flow visualization interface
    • time-tags flow buffers
  • Data types: Video, Audio, Vectors, Matrices, Opaque data
  • Promotes a well-defined, public interface standard for component technologies
  • Open source, documented, currently downloadable

http://www.nist.gov/smartspace/toolChest/nsfs/
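The real Smart Flow system is a C library (see the URL above), and its API is not reproduced here. Purely to illustrate the name-based, plumbing-hidden flow model described in the bullets, here is a toy Python analogue in which a shared dictionary of queues stands in for the network transport; all component and flow names are hypothetical:

```python
import queue
import threading

# Toy sketch of the data-flow idea (NOT the real NIST Smart Flow API):
# producers and consumers attach to flows by name; the "system" (here a
# dict of queues) handles the plumbing between them.

_flows: dict[str, queue.Queue] = {}

def open_flow(name: str) -> queue.Queue:
    # Components never exchange endpoints directly, only flow names
    return _flows.setdefault(name, queue.Queue())

def mic_capture(flow_name: str, n_frames: int) -> None:
    out = open_flow(flow_name)
    for i in range(n_frames):
        out.put({"t": i, "samples": [0] * 4})  # time-tagged buffer

def archiver(flow_name: str, n_frames: int, log: list) -> None:
    inp = open_flow(flow_name)
    for _ in range(n_frames):
        log.append(inp.get())

log: list = []
t1 = threading.Thread(target=mic_capture, args=("audio.ch0", 3))
t2 = threading.Thread(target=archiver, args=("audio.ch0", 3, log))
t1.start(); t2.start(); t1.join(); t2.join()
print(len(log), log[0]["t"])  # 3 0
```

Because components see only flow names, a capture component and an archiver can run on different processors without either knowing the other's location, which is the property the "network transparent" bullet describes.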

Smart Flow in the Meeting Room
  • Creates and manages data flows
  • Captures multi-channel multi-modal meeting room sensors
    • Five camera views
    • Twenty-three COTS close-talk and omni microphones
    • Three 59-element microphone arrays
  • Archives and time stamps high bandwidth sensor data flows in real time
  • … about sixty gigabytes per hour
What’s Next?
  • Addition of teleconference microphones and phone-channel recordings
    • collection of multi-site video/teleconferenced meetings
  • Replacement of array microphones with next-generation models with onboard A/D
  • Addition of interactive electronic whiteboard
    • will log and timestamp interactions
    • synchronize with audio and video
  • Exploration of other room sensors/interactive devices
    • e.g., location badges, handheld devices/wireless networks, collect data streams to screen
  • Development of multi-modal/multi-channel annotation tools
Review Workstation Demo
  • Multi-view/multi-audio channel
  • Permits subjects to review their meetings and request excerpts to be excluded from publication
  • LINUX-based
  • Uses SmartFlow architecture
  • Sample meeting excerpt
    • Group interaction with domain expert on office furnishing