
Audient: An Acoustic Search Engine

Student: Ted Leath. Supervisor: Prof. Paul Mc Kevitt. School of Computing and Intelligent Systems, Faculty of Engineering, University of Ulster, Magee.


Presentation Transcript


  1. Audient: An Acoustic Search Engine • Student: Ted Leath • Supervisor: Prof. Paul Mc Kevitt • School of Computing and Intelligent Systems, Faculty of Engineering, University of Ulster, Magee

  2. Aims and Objectives • Development of Audient as a speech-centric, non-lexical search engine capable of handling multimodal queries for retrieving spoken audio information • Explore the efficacy of using standards-based phonogrammic streams as an internal data representation for storing, indexing, searching and retrieving spoken audio information • Compare the performance of optional compound strategies for the abstraction and refinement of standards-based phonogrammic streams • Design, implement, refine and test Audient • Demonstrate research results, comparing Audient with other existing system architectures

  3. Literature Review • Information Retrieval • Automatic Speech Recognition and Spoken Document Retrieval • Current and previous research in SDR systems • Public access SDR systems • Commercial ASR and audio mining products • Sub-word based approaches to SDR • Transcripts, annotation and phonogrammic streams • Speech and non-speech audio

  4. Information Retrieval • Typical Information Retrieval (IR) tasks involve retrieving relevant information items from various types of documents by matching them against a user request or query. • IR covers documents in many media, including images, video and audio as well as text. Audio recordings of speech can be referred to as spoken documents.
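
The query-matching idea on this slide can be sketched with a small inverted index. This is a minimal illustration only, not part of Audient: the document ids and texts are hypothetical transcripts standing in for spoken documents, and ranking by the count of matching query terms is an assumed, deliberately simple scoring rule.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lower-cased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Rank documents by how many distinct query terms they contain."""
    scores = defaultdict(int)
    for term in set(query.lower().split()):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken alphabetically by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical transcripts standing in for spoken documents.
DOCS = {
    "news01": "storm warning issued for the north coast",
    "news02": "council votes on coast road funding",
    "talk01": "lecture on speech recognition",
}
INDEX = build_index(DOCS)
```

Calling `search(INDEX, "coast storm")` ranks `news01` ahead of `news02`, because the first document contains both query terms and the second only one.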

  5. Automatic Speech Recognition • ASR attempts to mimic the human capacity for recognising speech by enabling a computer to identify spoken words and/or sub-word units. • Most current ASR systems are lexical in nature and conceptually follow a process of encoding and decoding. [Figure: ASR encoding and decoding, adapted from Young et al., 2002]

  6. Spoken Document Retrieval • A significant amount of research has been conducted in SDR, and performance evaluations like the Text REtrieval Conference (TREC) have encouraged development and the sharing of information. [Figure: a typical TREC SDR process (Garofolo et al., 2000)]

  7. SDR Systems • CMU Informedia I, Informedia II and Sphinx Projects (Hauptmann and Witbrock, 1997) • Video Mail Retrieval and Multimedia Document Retrieval projects (Jones et al., 1997, Spärck Jones et al., 2001) • SCAN (Choi et al., 1998 and Choi et al., 1999) • THISL and Abbot (Abberley et al., 1998, Abbot, 1999) • Taiscéalaí (Smeaton et al., 1998)

  8. Public Access SDR Systems • SpeechBot (Quinn, 2000, Van Thong et al., 2001) • National Public Radio (NPR) Online (NPR, 2000, NPR Archives, 2004) • SpeechFind and The National Gallery of the Spoken Word (Hansen et al., 2004, Zhou and Hansen, 2002)

  9. Commercial ASR and Audio Mining Products • BBN Rough ‘n’ Ready (Kubala et al., 1999) • Nexidia Fast-Talk and Convera RetrievalWare (Clements et al., 2001a, Clements et al., 2001b) • ScanSoft (Network Speech, 2004, Embedded Speech, 2004, MediaIndexer, 2004, NaturallySpeaking, 2005, AudioMining, 2005, Xmode, 2004) • Virage AudioLogger (Virage, 2004) • Nuance (Nuance, 2005) • AT&T SCANMail (Hirschberg et al., 2001 and SCANMail, 2003) • Microsoft Speech Server (MSS, 2005)

  10. Sub-word Based Approaches to SDR • Wechsler (Wechsler, 1998) • Ng, K. (Ng, 2000) • Glavitsch and Schäuble (Glavitsch and Schäuble, 1992) • Ng, C. (Ng, 2001) • Other sub-word research efforts, including Larson (2001) and Moreau et al. (2004)

  11. Phonogrammic Streams • Orthographical representations of phonemic streams. • This abstraction is ancient, and partially inherent in the English alphabet. • Example: Egyptian hieroglyphs with semantic and phonetic value (ref. http://www.omniglot.com/writing/egyptian.htm)
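
As a rough illustration of searching a phonemic stream rather than a word sequence, the sketch below flattens text into phone symbols and looks for a query as a contiguous run of phones. The transcript does not specify Audient's internal representation, so everything here is an assumption: the mini-lexicon is hand-written in an ARPAbet-like notation with stress marks omitted, and `to_stream` and `find` are hypothetical helper names.

```python
# Hypothetical mini-lexicon: ARPAbet-style phones, stress marks omitted.
LEXICON = {
    "she": ["SH", "IY"],
    "sells": ["S", "EH", "L", "Z"],
    "cells": ["S", "EH", "L", "Z"],
    "sea": ["S", "IY"],
    "c": ["S", "IY"],
    "shells": ["SH", "EH", "L", "Z"],
    "shore": ["SH", "AO", "R"],
    "seashore": ["S", "IY", "SH", "AO", "R"],
}


def to_stream(text):
    """Flatten a word sequence into one phone stream, discarding word boundaries."""
    stream = []
    for word in text.lower().split():
        stream.extend(LEXICON[word])
    return stream


def find(stream, query):
    """Return every start offset where the query phones occur contiguously."""
    n = len(query)
    return [i for i in range(len(stream) - n + 1) if stream[i:i + n] == query]
```

Because word boundaries and spellings are discarded, `to_stream("sea shore")` and `to_stream("seashore")` produce the same stream, and a query for "cells" matches a stream built from "she sells sea shells" at the position of "sells". A real system would obtain phones from a recogniser rather than a lookup table.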

  12. Transcription • 1-best transcriptions (Fundamentals, 2005) • N-best transcriptions (Fundamentals, 2005) • Lattices or graphs • Example output from the slide figure: SILENCE HARD ROCK SILENCE

  13. Annotation - Markup Languages and MPEG-7 • SSML • VoiceXML • SALT • XHTML+Voice profile • All of the above markup languages contain SSML as a subset • MPEG-7 and spoken content
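
To show how a standard markup language can carry phonemic information, the sketch below builds a small SSML fragment using the `phoneme` element defined in SSML 1.0. It is an illustration under assumptions, not Audient's annotation format: the function name `phoneme_ssml`, the `en-GB` language tag and the sample IPA string are all chosen here for the example.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialise SSML elements without a prefix (default namespace).
ET.register_namespace("", SSML_NS)


def phoneme_ssml(word, ipa, lang="en-GB"):
    """Wrap one word in an SSML <phoneme> element carrying an IPA pronunciation."""
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": lang},
    )
    phoneme = ET.SubElement(
        speak,
        f"{{{SSML_NS}}}phoneme",
        {"alphabet": "ipa", "ph": ipa},
    )
    phoneme.text = word  # The orthographic form stays alongside the phonemic one.
    return ET.tostring(speak, encoding="unicode")
```

For example, `phoneme_ssml("sea", "siː")` returns a `speak` document whose single `phoneme` child keeps the written word as text and the pronunciation in its `ph` attribute.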

  14. Non-Speech Audio Retrieval • Humans process speech differently from non-speech acoustic information. • Systems: MELDEX, Musipedia (Melodyhound/Tuneserver), Sonoda, Super MBox, MIRACLE, SMILE, Shazam, Name That Clip, The Humdrum Toolkit, Themefinder, Boogeebot, Muscle Fish

  15. Project Proposal: Audient Architecture

  16. Audient Core Modules

  17. Audient Parrots • Functional diagram for an Audient Parrot • Determining recognition differences • Reference: "She sells sea shells by the seashore." • Recognised: "She cells C shels bye the sea shore"
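
One conventional way to quantify the difference between the two sentences on this slide is a word-level edit distance, normalised by the reference length to give a word error rate. The transcript does not say how Audient Parrots measure recognition differences, so this is only a sketch of a standard technique; the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a reference token
                cur[j - 1] + 1,          # insert a hypothesis token
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def tokens(sentence):
    """Lower-case, drop a trailing full stop, and split on whitespace."""
    return sentence.lower().strip(".").split()


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = tokens(reference)
    return edit_distance(ref, tokens(hypothesis)) / len(ref)


REFERENCE = "She sells sea shells by the seashore."
RECOGNISED = "She cells C shels bye the sea shore"
```

At the word level the two sentences differ by six edits against seven reference words, even though they sound almost identical when spoken. That gap between lexical and acoustic similarity is the kind of difference a speech-centric comparison would aim to avoid penalising.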

  18. Comparison with Previous Work

  19. Software Analysis • Hidden Markov Model Toolkit (HTK) • LVCSR and CSLU Toolkit • Sphinx-2, Sphinx-3, Sphinx-4 • TIMIT • Linux and C++ • Perl and PHP • Festival • The CMU Pronouncing Dictionary • SSML, VoiceXML, SALT and X+V • The Apache Web Server

  20. Possible IR and Monitoring Applications • Indexing, search and retrieval of Internet audio files • Indexing, search and retrieval of broadcast media • Services for the blind • Library services • Surveillance and intelligence gathering • Voice mail • Audio mining and trend analysis (topic detection and tracking)

  21. Possible Philosophical and Cognitive Research Applications • Artificial self-learning systems • Philosophical investigations of speech-centric versus text-centric methods • Research models for cognitive science and consciousness theories • Examination of behaviourist versus cognitive semantic recognition of speech

  22. Project Schedule

  23. Conclusion • The introduction of standards-based phonogrammic streams as a fundamental internal data structure • Support for unconstrained multimodal queries • The development of new mimetic means for comparative evaluation and demonstration • The provision of contextual strategies for the refinement of phonogrammic streams • Movement of the man-machine boundary to allow more effective partitioning of tasks between the human and the machine portions of the system • Design, implementation and testing of the Audient acoustic search engine
