
SPEECH DESCRIPTORS GENERATION SOFTWARE UTILIZED FOR CLASSIFICATION AND RECOGNITION PURPOSES



  1. SPEECH DESCRIPTORS GENERATION SOFTWARE UTILIZED FOR CLASSIFICATION AND RECOGNITION PURPOSES Lukasz Laszko (lukaszlaszko@gmail.com) Department of Biomedical Engineering, Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology

  2. Goals
  1. Create software components for speech signal descriptors extraction
     - low-level descriptors utilized in ASR
     - high-level descriptors utilized in SDR
  2. Compare different speech recognition engines
     - define and describe comparison criteria
     - analyze different ASR methodologies
  3. Create software components for spoken content descriptors indexing and retrieval
     - MPEG-7 SCD extraction
     - SCD comparison methods
  4. Provide a sample SDR-based medical application (suggestion)

  3. Definitions
  Spoken content - the actual words spoken in the speech segments of an audio stream. As speech represents the primary means of human communication, a significant amount of the usable information enclosed in audiovisual documents may reside in the spoken content.
  Speech recognition (SR) - converts spoken words to machine-readable input (for example, to the binary code for a string of character codes). The term voice recognition may also be used to refer to speech recognition, but more precisely refers to speaker recognition, which attempts to identify the person speaking, as opposed to what is being said.
  Automatic speech recognition (ASR) - an automated computer solution for SR routines.
  Spoken document / content retrieval (SDR) - an application of the SpokenContent tool which aims at retrieving information in speech signals based on their extracted contents.

  4. Definitions
  Acoustic model - a file that contains statistical representations of each of the distinct sounds that make up a spoken word (called phonemes). It must contain the sounds for each word used in the grammar (or language model).
  Phoneme - in human language, the smallest structural unit that distinguishes meaning.

  5. Speech Recognition Engine (SRE)
  Main components:
  - Speech Decoder
  - Acoustic Model: a statistical representation of the distinct sounds that make up each word in the language model or grammar
  - Language Model: a very large list of words and their probability of occurrence in a given sequence (LVCSR)
  - Grammar: contains sets of predefined combinations of words
  Types of recognition engines:
  - Connected Word Recognition (CWR)
  - Large-Vocabulary Continuous Speech Recognition (LVCSR)
  - Automatic Phonetic Transcription (APT)
  - Keyword Spotting (KS)

  6. Speech Recognition Engine (SRE) A parametric representation X (called acoustic observation) of speech acoustic properties is extracted from the input signal A. The acoustic observation X is matched against a set of predefined acoustic models. Each model represents one of the symbols used by the system for describing the spoken language of the application (e.g. words, syllables or phonemes). The best scoring models determine the output sequence of symbols.
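A minimal sketch of this matching step, assuming a hypothetical `AcousticModel` interface whose `logLikelihood` method scores an observation sequence; a real decoder searches over sequences of models jointly (e.g. Viterbi decoding over HMM states) rather than scoring candidates independently as done here.

```java
import java.util.List;

/** Hypothetical interface: one model per symbol (word, syllable or phoneme). */
interface AcousticModel {
    String symbol();
    double logLikelihood(double[][] observation); // observation: frames x coefficients
}

class SymbolMatcher {
    /** Returns the symbol of the best-scoring model for one acoustic observation X. */
    static String bestSymbol(double[][] observation, List<AcousticModel> models) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (AcousticModel m : models) {
            double score = m.logLikelihood(observation);
            if (score > bestScore) {
                bestScore = score;
                best = m.symbol();
            }
        }
        return best;
    }
}
```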

  7. Automatic speech recognition – acoustic analysis
  1. The analogue signal is first digitized. The sampling rate depends on the particular application requirements. The most common sampling rate is 16 kHz (one sample every 62.5 µs).
  2. A high-pass filter, also called a pre-emphasis filter, is often used to emphasize the high frequencies.
  3. The digital signal is segmented into successive, regularly spaced time intervals called acoustic frames. Time frames overlap each other. Typically, a frame duration is between 20 and 40 ms, with an overlap of 50%.
  4. Each frame is multiplied by a windowing function (e.g. Hanning).
  5. The frequency spectrum of each single frame is obtained through a Fourier transform.
  6. A vector of coefficients x, called an observation vector, is extracted from the spectrum. It is a compact representation of the spectral properties of the frame.
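A minimal sketch of steps 2-4 above (pre-emphasis, framing, Hanning window); the 25 ms frame length and the 0.97 pre-emphasis coefficient are common choices, not values taken from the slides.

```java
/** Front-end sketch: pre-emphasis, framing with 50% overlap, Hanning window. */
class FrontEnd {
    static final int SAMPLE_RATE = 16000;         // 16 kHz, one sample every 62.5 us
    static final int FRAME_LEN = 400;             // 25 ms at 16 kHz (assumed value)
    static final int FRAME_SHIFT = FRAME_LEN / 2; // 50% overlap, as on the slide

    /** Step 2: high-pass pre-emphasis filter y[n] = x[n] - a*x[n-1], typically a = 0.97. */
    static double[] preEmphasize(double[] x, double a) {
        double[] y = new double[x.length];
        y[0] = x[0];
        for (int n = 1; n < x.length; n++) y[n] = x[n] - a * x[n - 1];
        return y;
    }

    /** Steps 3-4: split into overlapping frames and apply a Hanning window. */
    static double[][] frames(double[] x) {
        int count = Math.max(0, 1 + (x.length - FRAME_LEN) / FRAME_SHIFT);
        double[][] out = new double[count][FRAME_LEN];
        for (int f = 0; f < count; f++) {
            for (int n = 0; n < FRAME_LEN; n++) {
                double hann = 0.5 - 0.5 * Math.cos(2 * Math.PI * n / (FRAME_LEN - 1));
                out[f][n] = x[f * FRAME_SHIFT + n] * hann;
            }
        }
        return out;
    }
}
```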

  8. Automatic speech recognition – acoustic analysis
  Coefficient vector types:
  - linear prediction cepstrum coefficients (LPCCs)
  - mel-frequency cepstral coefficients (MFCCs)
  Cepstrum is the result of taking the Fourier transform (FT) of the decibel spectrum as if it were a signal. Its name was derived by reversing the first four letters of "spectrum". There is a complex cepstrum and a real cepstrum.
  Definitions:
  - mathematically: cepstrum of signal = FT(log(|FT(signal)|) + j2πm), where m is the integer required to properly unwrap the angle or imaginary part of the complex log function
  - algorithmically: signal → FT → abs() → log → phase unwrapping → FT → cepstrum
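A sketch of the algorithmic pipeline above for the real cepstrum (no phase unwrapping, since abs() discards phase), using a naive O(n²) DFT to stay self-contained; a real front-end would use an FFT, and MFCCs additionally apply a mel filterbank and a DCT.

```java
/** Real cepstrum of one frame: signal -> FT -> abs() -> log -> FT -> cepstrum. */
class Cepstrum {
    static double[] realCepstrum(double[] frame) {
        int n = frame.length;
        // First transform: log magnitude spectrum of the frame.
        double[] logMag = new double[n];
        for (int k = 0; k < n; k++) {
            double re = 0, im = 0;
            for (int t = 0; t < n; t++) {
                double ang = -2 * Math.PI * k * t / n;
                re += frame[t] * Math.cos(ang);
                im += frame[t] * Math.sin(ang);
            }
            logMag[k] = Math.log(Math.hypot(re, im) + 1e-12); // epsilon avoids log(0)
        }
        // Second transform of the log spectrum; its real part is the cepstrum.
        double[] cep = new double[n];
        for (int k = 0; k < n; k++) {
            double re = 0;
            for (int t = 0; t < n; t++) re += logMag[t] * Math.cos(2 * Math.PI * k * t / n);
            cep[k] = re / n;
        }
        return cep;
    }
}
```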

  9. Automatic speech recognition – decoding
  Hidden Markov model (HMM)
  + statistical model
  + speech signal modeled as a piecewise stationary or short-time stationary signal
  + can be trained automatically
  Dynamic time warping (DTW)
  + historically used for speech recognition, now largely displaced by HMMs
  + measures similarity between two sequences which may vary in time or speed
  + independent of speech rate
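A minimal DTW sketch computing the alignment cost between two feature-vector sequences of possibly different lengths; the Euclidean local distance is an assumption, and production systems usually add slope and band constraints.

```java
/** DTW: alignment cost between two sequences that may vary in time or speed. */
class Dtw {
    static double distance(double[][] a, double[][] b) {
        int n = a.length, m = b.length;
        double[][] d = new double[n + 1][m + 1];
        for (double[] row : d) java.util.Arrays.fill(row, Double.POSITIVE_INFINITY);
        d[0][0] = 0;
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++) {
                double cost = euclidean(a[i - 1], b[j - 1]);
                // Classic recurrence: best of insertion, deletion and match steps.
                d[i][j] = cost + Math.min(d[i - 1][j], Math.min(d[i][j - 1], d[i - 1][j - 1]));
            }
        return d[n][m];
    }

    private static double euclidean(double[] x, double[] y) {
        double s = 0;
        for (int k = 0; k < x.length; k++) s += (x[k] - y[k]) * (x[k] - y[k]);
        return Math.sqrt(s);
    }
}
```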

  10. Automatic speech recognition – existing software
  Commercial software:
  - IBM ViaVoice
  - Microsoft SAPI
  - Vocalis Speechware
  - Babel Technologies
  - SpeechWorks
  - Nuance
  - Abbot
  - Entropic
  Free software:
  - XVoice
  - CVoiceControl/kVoiceControl
  - Open Mind Speech
  - GVoice
  - ISIP
  - CMU Sphinx
  - Julius
  - Ears
  - NICO ANN Toolkit
  - Myers' Hidden Markov Model Software

  11. Automatic speech recognition – performance measurements
  Types of errors:
  - Substitution errors - a symbol in the reference transcription was substituted with a different one in the recognized transcription.
  - Deletion errors - a reference symbol has been omitted in the recognized transcription.
  - Insertion errors - the system recognized a symbol not contained in the reference transcription.
  Measurements:
  - Error rate
  - Accuracy
  LVCSR => Accuracy > 90%
  IVR => Error rate ~= 40%
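A sketch of how the error rate can be computed: a minimum-edit-distance alignment between the reference and recognized word sequences counts substitutions, deletions and insertions, and error rate = (S + D + I) / N, where N is the reference length.

```java
/** Word error rate via Levenshtein alignment of reference vs. recognized symbols. */
class ErrorRate {
    static double wordErrorRate(String[] ref, String[] hyp) {
        int n = ref.length, m = hyp.length;
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i; // all-deletions baseline
        for (int j = 0; j <= m; j++) d[0][j] = j; // all-insertions baseline
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++) {
                int sub = d[i - 1][j - 1] + (ref[i - 1].equals(hyp[j - 1]) ? 0 : 1);
                d[i][j] = Math.min(sub, Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        return (double) d[n][m] / n; // (S + D + I) / N
    }
}
```

For example, the reference "call the doctor" against the hypothesis "call a doctor" yields 1/3, i.e. roughly a 33% error rate and 67% accuracy.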

  12. MPEG-7
  MPEG-7, formally called Multimedia Content Description Interface, is a multimedia content description standard. The description is associated with the content itself, to allow fast and efficient searching for material that is of interest to the user. It is thus not a standard which deals with the actual encoding of moving pictures and audio, like MPEG-1, MPEG-2 and MPEG-4. It uses XML to store metadata, and can be attached to timecode in order to tag particular events, or synchronise lyrics to a song, for example.
  MPEG-7 uses the following tools:
  - Descriptor (D): a representation of a feature defined syntactically and semantically. A single object may be described by several descriptors.
  - Description Scheme (DS): specifies the structure and semantics of the relations between its components; these components can be descriptors (D) or description schemes (DS).
  - Description Definition Language (DDL): an XML-based language used to define the structural relations between descriptors. It allows the creation and modification of description schemes and also the creation of new descriptors (D).
  - System tools: deal with binarization, synchronization, transport and storage of descriptors, as well as intellectual property protection.

  13. MPEG-7 – Spoken content description
  MPEG-7 Spoken Content Document (SCD):
  * SpokenContentHeader
    - WordLexicon
    - PhoneLexicon
    - DescriptionMetadata
    - SpeakerInfo
  * SpokenContentLattice
    - Blocks
    - Nodes
    - Links (reference, probability, nodeOffset, acousticScore)
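A minimal data-model sketch mirroring the lattice outline above; the field names follow the slide's link attributes, but the Java class shapes are assumptions, not the normative MPEG-7 schema types.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of a SpokenContentLattice: blocks of nodes connected by scored links. */
class SpokenContentLattice {
    static class Link {
        int reference;        // index into the word or phone lexicon (assumed meaning)
        double probability;   // posterior probability of taking this link
        int nodeOffset;       // offset to the target node within the block
        double acousticScore; // acoustic model score for the link's symbol
    }
    static class Node {
        long timeOffset;      // position of the node on the time axis (assumed field)
        List<Link> links = new ArrayList<>();
    }
    static class Block {
        List<Node> nodes = new ArrayList<>();
    }
    List<Block> blocks = new ArrayList<>();
}
```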

  14. Spoken content retrieval – general system structure

  15. Spoken content retrieval – implementation
  Service Oriented Architecture components:
  - External systems
  - ASR server
  - SDR server
  - SDR client
  - SDR DB

  16. Spoken content retrieval – implementation
  ASR Server architecture (status: implemented):
  - ASR engine: CMU Sphinx-4
  - Web service: SOAP + MTOM over HTTPS, JAX-WS 2.1 with WSIT on Apache Jetty 6
  - Multithreaded execution pool: Java Concurrency Framework
  - Metadata database, accessed through an ORM mapper
  - Exposed to clients over the network
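A minimal JAX-WS sketch of the web-service layer described above; the service and method names are hypothetical, and the real server ran JAX-WS 2.1 with WSIT on Apache Jetty 6 rather than the self-publishing endpoint used here for brevity.

```java
import javax.jws.WebMethod;
import javax.jws.WebService;
import javax.xml.ws.Endpoint;
import javax.xml.ws.soap.MTOM;

/** Hypothetical ASR web service; CMU Sphinx-4 would sit behind transcribe(). */
@MTOM // binary audio travels as an MTOM attachment instead of base64 text
@WebService(serviceName = "AsrService")
public class AsrService {
    @WebMethod
    public String transcribe(byte[] audio) {
        // Placeholder: hand the audio to the Sphinx-4 recognizer pool here.
        return "";
    }

    public static void main(String[] args) {
        // JAX-WS can self-publish for testing; the production server used Jetty.
        Endpoint.publish("http://localhost:8080/asr", new AsrService());
    }
}
```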

  17. Spoken content retrieval – implementation
  SDR Server architecture (status: under development):
  - Services: Indexing Service, Search Service
  - FrontEnd and diagnostic portal
  - Workflow Runtime: Indexing Workflow, Search Workflow
  - External services agents: ASR Connector
  - Data Access Logic: Data Connector
  - SCD Database
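Since this component is listed as under development, here is only a hypothetical sketch of the two services named above; the signatures are assumptions inferred from the slide, not the project's actual API.

```java
/** Hypothetical interfaces for the SDR server's two services. */
interface IndexingService {
    /** Sends recorded audio through the ASR connector and stores the
        resulting MPEG-7 SCD document in the SCD database; returns its id. */
    String index(byte[] audio, String mediaId);
}

interface SearchService {
    /** Matches a spoken or textual query against stored SCD documents
        and returns identifiers of the best-matching media. */
    java.util.List<String> search(String query, int maxResults);
}
```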

  18. Spoken content retrieval – implementation
  SDR Client architecture:
  - Spoken Content Recording
  - Media Converter
  - SDR Client

  19. DICOM voice Q&R – indexing
  Components and data in the indexing path (from the slide diagram):
  - DICOM client with voice annotation/query plug-in; requests an undescribed new image from PACS via the Clinical Image and Object Management (CI&OM) server
  - Spoken description recorded as an audio file (wav or mp3)
  - Spoken content descriptors extracted into an MPEG-7 SCD document
  - Spoken content query converter to DICOM Query/Retrieve

  20. DICOM voice Q&R – query
  Components and data in the query path (from the slide diagram):
  - Spoken or text query entered in the DICOM client with voice annotation/query plug-in, as an audio file (wav or mp3) or query text
  - Spoken content descriptors extracted from the query and matched against stored descriptors to find the best-matching SCD document
  - Spoken content query converter to DICOM Query/Retrieve; the search result is the annotated image location on the Clinical Image and Object Management server

  21. Execution steps – table of contents
  0. Abstract
  1. Introduction
  2. Automatic Speech Recognition and Spoken Document Retrieval
  3. System structure
  4. Voice and speech sources
  5. Speech signal features extraction and recognition details
  6. Spoken content description
  7. Document query
  8. Bibliography
  9. Appendixes

  22. Bibliography
  - Hyoung-Gook Kim, Nicolas Moreau, Thomas Sikora: MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval. John Wiley & Sons Ltd, 2005
  - Gopala Krishna A: Building ASR and TTS Systems: Building ASR Systems using Sphinx. Carnegie Mellon University, 2007
  - Arthur Chan, Evandro Gouvêa, Rita Singh: The Hieroglyphs: Building Speech Applications Using CMU Sphinx. Carnegie Mellon University, 2007
  - Lee Begeja, Bernard Renger, Murat Saraclar, David Gibbon, Zhu Liu, Behzad Shahraray: A System for Searching and Browsing Spoken Communications. AT&T Labs – Research, 2004

  23. Bibliography
  - Frank Seide, Peng Yu, Chengyuan Ma, Eric Chang: Vocabulary-Independent Search in Spontaneous Speech. Microsoft Research Asia, 2004
  - Ciprian Chelba: Spoken Document Retrieval and Browsing. Google, 2007
  - Jason Price: Oracle Database 11g SQL: Master SQL and PL/SQL in the Oracle Database. Oracle Press, 2008
  - http://en.wikipedia.org/wiki/Speech_recognition
  - http://tldp.org/HOWTO/Speech-Recognition-HOWTO/software.html
