
On Organic Interfaces



Presentation Transcript


  1. On Organic Interfaces Victor Zue (zue@csail.mit.edu) MIT Computer Science and Artificial Intelligence Laboratory

  2. Graduate Students Anderson, M. Aull, A. Brown, R. Chan, W. Chang, J. Chang, S. Chen, C. Cyphers, S. Daly, N. Doiron, R. Flammia, G. Glass, J. Goddeau, D. Hazen, T.J. Hetherington, L. Huttenlocher, D. Jaffe, O. Kassel, R. Kasten,P. Kuo, J. Kuo, S. Lauritzen, N. Lamel, L. Lau, R. Leung, H. Lim, A. Manos, A. Marcus, J. Neben, N. Niyogi, P. Mou, X. Ng, K. Pan, K. Pitrelli, J. Randolph, M. Rtischev, D. Sainath, T. Sarma, S. Seward, D. Soclof, M. Spina, M. Tang, M. Wichiencharoen, A. Zeiger, K. Research Staff Eric Brill Scott Cyphers Jim Glass Dave Goddeau T J Hazen Lee Hetherington Lynette Hirschman Raymond Lau Hong Leung Helen Meng Mike Phillips Joe Polifroni Shinsuke Sakai Stephanie Seneff Dave Shipman Michelle Spina Nikko Ström Chao Wang Acknowledgements MIT Computer Science and Artificial Intelligence Laboratory

  3. Introduction MIT Computer Science and Artificial Intelligence Laboratory

  4. Virtues of Spoken Language Natural: Requires no special training Flexible: Leaves hands and eyes free Efficient: Has high data rate Economical: Can be transmitted/received inexpensively • Speech interfaces are ideal for information access and management when: • The information space is broad and complex, • The users are technically naive, • The information device is small, or • Only telephones are available. MIT Computer Science and Artificial Intelligence Laboratory

  5. Communication via Spoken Language [Diagram: a human and a computer exchange meaning through speech; on the input side, speech recognition and language understanding map speech (via text) to meaning; on the output side, language generation and speech synthesis map meaning (via text) back to speech] MIT Computer Science and Artificial Intelligence Laboratory

  6. Components of a Spoken Dialogue System [Diagram: speech is converted to words by SPEECH RECOGNITION; LANGUAGE UNDERSTANDING maps words to a meaning representation; DIALOGUE MANAGEMENT, informed by DISCOURSE CONTEXT and a DATABASE, produces a response meaning; LANGUAGE GENERATION renders it as a sentence, graphs & tables; SPEECH SYNTHESIS produces output speech] MIT Computer Science and Artificial Intelligence Laboratory
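The component flow on this slide can be sketched as a minimal pipeline. All function names, the frame layout, and the toy database below are illustrative placeholders, not the interfaces of any actual system described in the talk.

```python
# Hypothetical sketch of the slide's component flow: recognition ->
# understanding -> dialogue management (with a database) -> generation.

def speech_recognition(audio: str) -> str:
    # Stand-in: pretend the "audio" has already been transcribed;
    # we only normalize case here.
    return audio.lower()

def language_understanding(words: str) -> dict:
    # Map the word string to a toy meaning representation (a frame).
    return {"clause": "wh_question" if words.startswith("what") else "statement",
            "words": words}

def dialogue_management(frame: dict, database: dict) -> dict:
    # Consult the database to fill in the requested value.
    frame["answer"] = database.get(frame["words"], "unknown")
    return frame

def language_generation(frame: dict) -> str:
    return f"The answer is {frame['answer']}."

def respond(audio: str, database: dict) -> str:
    words = speech_recognition(audio)
    frame = language_understanding(words)
    frame = dialogue_management(frame, database)
    return language_generation(frame)
```

A real system would of course carry probability distributions, not single strings, between these stages.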

  7. Tremendous Progress to Date • Data-intensive training • Technological advances • Increased task complexity • Inexpensive computing MIT Computer Science and Artificial Intelligence Laboratory

  8. Some Example Systems (BBN, 2007; MIT, 2007; KTH, 2007) MIT Computer Science and Artificial Intelligence Laboratory

  9. Speech Synthesis • Recent trend moves toward corpus-based approaches • Increased storage and compute capacity • Availability of large text and speech corpora • Modeled after successful utilization for speech recognition • Many successful implementations, e.g., AT&T, Cepstral, Microsoft [Audio examples of synthesized words and phrases omitted] MIT Computer Science and Artificial Intelligence Laboratory

  10. But we are far from done … • Machine performance typically lags far behind human performance (Lippmann, 1997) • How can interfaces be truly anthropomorphic? MIT Computer Science and Artificial Intelligence Laboratory

  11. Premise of the Talk • Propose a different perspective on the development of speech-based interfaces • Draw on insights from the evolution of computer science • Computer systems are increasingly complex • There is a move towards treating these complex systems like organisms that can observe, grow, and learn • Will focus on spoken dialogue systems MIT Computer Science and Artificial Intelligence Laboratory

  12. Organic Interfaces MIT Computer Science and Artificial Intelligence Laboratory

  13. Computer: Yesterday and Today • Yesterday: computation of static functions in a static environment, with a well-understood specification → Today: adaptive systems operating in environments that are dynamic and uncertain • Yesterday: computation is the main goal → Today: communication, sensing, and control are just as important • Yesterday: a single agent → Today: multiple agents that may be cooperative, neutral, or adversarial • Yesterday: batch processing of text and homogeneous data → Today: stream processing of massive, heterogeneous data • Yesterday: stand-alone applications → Today: interaction with humans is key • Yesterday: a binary notion of correctness → Today: trading off multiple criteria • Increasingly, we rely on probabilistic representations, machine learning techniques, and optimization principles to build complex systems MIT Computer Science and Artificial Intelligence Laboratory

  14. Properties of Organic Systems • Robust to changes in environment and operating conditions • Learning through experiences • Observe their own behavior • Context aware • Self healing • … MIT Computer Science and Artificial Intelligence Laboratory

  15. Research Challenges MIT Computer Science and Artificial Intelligence Laboratory

  16. Some Research Challenges • Robustness • Signal Representation • Acoustic Modeling • Lexical Modeling • Multimodal Interactions • Establishing Context • Adaptation • Learning • Statistical Dialogue Management • Interactive Learning • Learning by Imitation * Please refer to the written paper for topics not covered in the talk MIT Computer Science and Artificial Intelligence Laboratory

  17. Robustness: Acoustic Modeling [Diagram: the linguistic hierarchy from acoustics through phonetics, phonemics, phonotactics, morphology (sub-word units), and word (syllable) up to sentence syntax and semantics, with a speech recognition kernel built on acoustic models and LM units] • Statistical n-grams have masked the inadequacies in acoustic modeling, but at a cost • Size of training corpus • Application-dependent performance • To promote acoustic modeling research, we may want to develop a sub-word based recognition kernel • Application independent • Stronger constraints than phonemes • Closed vocabulary for a given language • Some success has been demonstrated (e.g., Chung & Seneff, 1998) MIT Computer Science and Artificial Intelligence Laboratory
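The slide notes that statistical n-grams strongly constrain recognition, masking acoustic weaknesses. A minimal bigram model makes the nature of that constraint concrete; the corpus below is an invented toy example, not data from the talk.

```python
# Maximum-likelihood bigram language model over a toy corpus.
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Estimate P(w2 | w1) by counting adjacent word pairs."""
    counts = defaultdict(Counter)
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        for w1, w2 in zip(words, words[1:]):
            counts[w1][w2] += 1
    return {w1: {w2: c / sum(ctr.values()) for w2, c in ctr.items()}
            for w1, ctr in counts.items()}

corpus = ["show me flights to boston",
          "show me flights to denver",
          "list flights to boston"]
model = train_bigram(corpus)
# In this corpus, "me" is always followed by "flights", so the prior
# alone nearly decides the recognizer's choice at that position.
```

With such a sharp prior, even a weak acoustic score for "flights" after "me" wins, which is exactly how n-grams can paper over acoustic-model errors.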

  18. Robustness: Lexical Access [Figure: pronunciation of “temperature”] • Current approaches represent words as phoneme strings • Phonological rules are sometimes used to derive alternate pronunciations • Lexical representation based on features offers much appeal (Stevens, 1995) • Fewer models, less training data, greater parsimony • Alternative lexical access models (e.g., Zue, 1983) • Lexical access based on islands of reliability might be better able to deal with variability MIT Computer Science and Artificial Intelligence Laboratory
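The phoneme-string representation plus phonological rules described here can be sketched in a few lines. The lexicon entries, phone symbols, and the single rule (intervocalic flapping of /t/) are illustrative simplifications, not the talk's actual models.

```python
# Toy lexicon mapping words to phoneme strings, plus one phonological
# rule (flapping: /t/ between two vowels becomes the flap "dx") used to
# derive an alternate pronunciation, as the slide describes.
VOWELS = {"aa", "ae", "ah", "ax", "eh", "er", "ey", "ih", "iy", "ow", "uw"}

LEXICON = {
    "temperature": ["t", "eh", "m", "p", "ax", "r", "ax", "ch", "er"],
    "water": ["w", "aa", "t", "er"],
}

def flap_rule(phones):
    """Return a copy with every intervocalic /t/ replaced by a flap."""
    out = list(phones)
    for i in range(1, len(out) - 1):
        if out[i] == "t" and out[i - 1] in VOWELS and out[i + 1] in VOWELS:
            out[i] = "dx"
    return out

# "water" has /t/ between two vowels, so an alternate pronunciation
# arises; "temperature" has no intervocalic /t/ and is unchanged.
```

A feature-based lexicon, as the slide argues, would replace these enumerated strings with bundles of distinctive features, letting one rule cover many words without listing each variant.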

  19. Robustness: Multimodal Interactions • Other modalities can augment/complement speech [Diagram: SPEECH RECOGNITION, HANDWRITING RECOGNITION, GESTURE RECOGNITION, and MOUTH & EYES TRACKING all feed into LANGUAGE UNDERSTANDING to produce meaning] MIT Computer Science and Artificial Intelligence Laboratory

  20. Challenges for Multimodal Interfaces [Example timeline: the speech “Move this one over here” aligned in time with pointing gestures at an object and a location] • Input needs to be understood in the proper context • “What about that one” • Timing information is a useful way to relate inputs • Handling uncertainties and errors (Cohen, 2003) • Need to develop a unifying linguistic framework MIT Computer Science and Artificial Intelligence Laboratory
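The point about timing can be illustrated with a small alignment routine: deictic words in the speech stream are resolved to whichever gesture is closest in time. The word and gesture labels, timestamps, and the 0.5-second window are all invented for illustration.

```python
# Resolve deictic words ("this", "here", ...) to pointing gestures by
# nearest timestamp, within a tolerance window (seconds).
def align(words, gestures, window=0.5):
    """words: [(word, time)], gestures: [(target, time)] -> {word_index: target}"""
    resolved = {}
    for i, (word, wt) in enumerate(words):
        if word not in {"this", "that", "here", "there"}:
            continue  # only deictic words need a gesture referent
        best = min(gestures, key=lambda g: abs(g[1] - wt), default=None)
        if best is not None and abs(best[1] - wt) <= window:
            resolved[i] = best[0]
    return resolved

# The slide's example: "Move this one over here" with two pointing events.
words = [("move", 0.0), ("this", 0.4), ("one", 0.6), ("over", 1.0), ("here", 1.3)]
gestures = [("object_7", 0.45), ("location_B", 1.35)]
```

Real systems must additionally handle the uncertainty the slide mentions: both the word times and the gesture hypotheses come with confidence scores, not hard values.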

  21. Benoit, 2000 Audio Visual Symbiosis • The audio and visual signals both contain information about: • Identity/location of the person • Linguistic message • Emotion, mood, stress, etc. • Integration of these sources of information has been known to help humans MIT Computer Science and Artificial Intelligence Laboratory

  22. Hazen et al., 2003 Audio Visual Symbiosis • The audio and visual signals both contain information about: • Identity/location of the person • Linguistic message • Emotion, mood, stress, etc. • Integration of these sources of information has been known to help humans • Exploiting this symbiosis can lead to robustness, e.g., • Locating and identifying the speaker MIT Computer Science and Artificial Intelligence Laboratory

  23. Huang et al., 2004 Audio Visual Symbiosis • The audio and visual signals both contain information about: • Identity/location of the person • Linguistic message • Emotion, mood, stress, etc. • Integration of these sources of information has been known to help humans • Exploiting this symbiosis can lead to robustness, e.g., • Locating and identifying the speaker • Speech recognition/understanding augmented with facial features MIT Computer Science and Artificial Intelligence Laboratory

  24. Cohen, 2005 Gruenstein et al., 2006 Audio Visual Symbiosis • The audio and visual signals both contain information about: • Identity/location of the person • Linguistic message • Emotion, mood, stress, etc. • Integration of these sources of information has been known to help humans • Exploiting this symbiosis can lead to robustness, e.g., • Locating and identifying the speaker • Speech recognition/understanding augmented with facial features • Speech and gesture integration MIT Computer Science and Artificial Intelligence Laboratory

  25. Ezzat, 2003 Audio Visual Symbiosis • The audio and visual signals both contain information about: • Identity/location of the person • Linguistic message • Emotion, mood, stress, etc. • Integration of these sources of information has been known to help humans • Exploiting this symbiosis can lead to robustness, e.g., • Locating and identifying the speaker • Speech recognition/understanding augmented with facial features • Speech and gesture integration • Audio/visual information delivery MIT Computer Science and Artificial Intelligence Laboratory
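One common way to exploit the audio-visual symbiosis these slides describe is log-linear fusion of per-hypothesis scores from the two streams. The hypotheses, scores, and weight below are invented for illustration; they stand in for, e.g., lip-reading evidence disambiguating acoustically confusable words.

```python
# Weighted log-linear fusion of audio and visual log-probability scores.
def fuse(audio_scores, visual_scores, audio_weight=0.7):
    """Pick the hypothesis maximizing a weighted sum of the two streams'
    log-probabilities; missing visual scores get a low floor."""
    fused = {}
    for hyp in audio_scores:
        fused[hyp] = (audio_weight * audio_scores[hyp]
                      + (1 - audio_weight) * visual_scores.get(hyp, -10.0))
    return max(fused, key=fused.get)

# Audio alone slightly prefers "bat", but the lip shape for the initial
# consonant strongly favors "pat"; fusion recovers the right word.
audio = {"bat": -1.0, "pat": -1.2}
visual = {"bat": -5.0, "pat": -0.5}
```

Tuning the stream weight (here 0.7) by channel quality, e.g. lowering it in acoustic noise, is one standard refinement of this scheme.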

  26. Establishing Context [Figure: a tapestry of applications: photos, calendar, weather, address, stocks, phonebook, music] • Context setting is important for dialogue interaction • Environment • Linguistic constructs • Discourse • Much work has been done, e.g., • Context-dependent acoustic and language models • Sound segmentation • Discourse modeling • Some interesting new directions • Tapestry of applications • Acoustic scene analysis (Ellis, 2006) MIT Computer Science and Artificial Intelligence Laboratory

  27. Acoustic Scene Analysis • Acoustic signals contain a wealth of information (linguistic message, environment, speaker, emotion, …) • We need to find ways to adequately describe the signals • Some time in the future, a description along the timeline of a signal might read: • signal type: speech; transcript: “although both of the, both sides of the Central Artery …”; topic: traffic report; speaker: female • signal type: speech; transcript: “Forecast calls for at least partly sunny weather …”; topic: weather, sponsor acknowledgement, time; speaker: male • signal type: speech; transcript: “This is Morning Edition, I’m Bob Edwards …”; topic: NPR news; speaker: male, Bob Edwards • signal type: music; genre: instrumental; artist: unknown MIT Computer Science and Artificial Intelligence Laboratory
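The segment descriptions on this slide suggest a simple annotation schema: time-stamped segments, each with a signal type and an open set of attributes. The class and field names below are a hypothetical sketch of such a schema, not a format from the talk.

```python
# Hypothetical schema for time-aligned acoustic scene descriptions.
from dataclasses import dataclass, field

@dataclass
class Segment:
    start: float              # seconds
    end: float                # seconds
    signal_type: str          # "speech", "music", ...
    attributes: dict = field(default_factory=dict)

# A toy stream mirroring the slide's mix of speech and music segments.
stream = [
    Segment(0.0, 4.2, "speech", {"topic": "traffic report", "speaker": "female"}),
    Segment(4.2, 9.0, "music", {"genre": "instrumental", "artist": "unknown"}),
]

def segments_of_type(stream, signal_type):
    """Filter the scene description by signal type."""
    return [s for s in stream if s.signal_type == signal_type]
```

The open `attributes` dict reflects the slide's point that different signal types carry different kinds of information (transcript and speaker for speech, genre and artist for music).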

  28. Learning • Perhaps the most important aspect of organic interfaces • Use of stochastic modeling techniques for speech recognition, language understanding, machine translation, and dialogue modeling • Many different ways to learn • Passive learning • Interactive learning • Learning by imitation MIT Computer Science and Artificial Intelligence Laboratory

  29. Hetherington, 1991 Interactive Learning: An Example • New words are inevitable, and they cannot be ignored • Acoustic and linguistic knowledge is needed to • Detect • Learn, and • Utilize new words • Fundamental changes in problem formulation and search strategy may be necessary MIT Computer Science and Artificial Intelligence Laboratory

  30. Chung & Seneff, 2004 Interactive Learning: An Example • New words are inevitable, and they cannot be ignored • Acoustic and linguistic knowledge is needed to • Detect • Learn, and • Utilize new words • Fundamental changes in problem formulation and search strategy may be necessary • New words can be detected and incorporated through • Dynamic update of vocabulary MIT Computer Science and Artificial Intelligence Laboratory

  31. Filisko & Seneff, 2006 Interactive Learning: An Example • New words are inevitable, and they cannot be ignored • Acoustic and linguistic knowledge is needed to • Detect • Learn, and • Utilize new words • Fundamental changes in problem formulation and search strategy may be necessary • New words can be detected and incorporated through • Dynamic update of vocabulary • Speak and Spell MIT Computer Science and Artificial Intelligence Laboratory

  32. Allen et al., 2007 Learning by Imitation • Many tasks can be learned through interaction • “This is how you enable Bluetooth.” → “Enable Bluetooth.” • “These are my glasses.” → “Where are my glasses?” • Promising research by James Allen (2007) • Learning phase: • User shows the system how to perform tasks (perhaps through some spoken commentary) • System learns the task through learning algorithms and updates its knowledge base • Application phase: • Looks up tasks in its knowledge base and executes the procedure MIT Computer Science and Artificial Intelligence Laboratory
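The learn-then-apply loop on this slide can be sketched as a task knowledge base: the learning phase stores a demonstrated procedure under a name, and the application phase looks it up and runs it. The class, method names, and task steps are invented for illustration and do not represent Allen's actual system.

```python
# Sketch of the slide's two phases: learning stores a demonstrated task;
# application looks it up and "executes" its steps.
class TaskKnowledgeBase:
    def __init__(self):
        self.tasks = {}

    def learn(self, name, steps):
        # Learning phase: the user shows the system a task as a sequence
        # of steps (here, plain strings standing in for observed actions).
        self.tasks[name] = list(steps)

    def execute(self, name):
        # Application phase: look up the task and run its procedure;
        # return None for tasks the system has never been shown.
        if name not in self.tasks:
            return None
        return [f"do: {step}" for step in self.tasks[name]]

kb = TaskKnowledgeBase()
kb.learn("enable bluetooth", ["open settings", "tap bluetooth", "toggle on"])
```

The real research problem, of course, is generalizing from one demonstration ("these are my glasses") to related requests ("where are my glasses?"), which this lookup table deliberately does not attempt.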

  33. In Summary • Great strides have been made in speech technologies • Truly anthropomorphic spoken dialogue interfaces can only be realized if they can behave like organisms • Observe, learn, grow, and heal • Many challenges remain … MIT Computer Science and Artificial Intelligence Laboratory

  34. Thank You MIT Computer Science and Artificial Intelligence Laboratory

  35. Dynamic Vocabulary Understanding • Dynamically alter vocabulary within a single utterance [Example: the user asks “What’s the phone number for Flora in Arlington.” The first parse yields Clause: wh_question; Property: phone; Topic: restaurant; Name: ????; City: Arlington. The system loads the Arlington restaurant names (Arlington Diner, Blue Plate Express, Tea Tray in the Sky, Asiana Grille, Bagels etc, Flora, …), re-resolves the slot to Name: Flora, and responds “The telephone number for Flora is …” Diagram shows the hub architecture: Audio, ASR, NLU, Context, Dialog, DB, NLG, TTS around a central Hub] MIT Computer Science and Artificial Intelligence Laboratory
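The two-pass idea on this slide can be sketched as follows: a first pass fills what the static vocabulary covers and leaves the unknown name slot as "????"; once the city is known, the city's restaurant names are added dynamically and the slot is re-resolved. The frame layout, city table, and matching logic are toy stand-ins for the real recognizer and parser.

```python
# Sketch of dynamic vocabulary understanding within a single utterance.
CITY_RESTAURANTS = {
    "arlington": ["arlington diner", "blue plate express", "asiana grille", "flora"],
}

def understand(utterance):
    text = utterance.lower()
    # First pass: fill what the static vocabulary covers; the unknown
    # restaurant name stays as the placeholder "????".
    frame = {"clause": "wh_question", "property": "phone",
             "topic": "restaurant", "name": "????", "city": None}
    for city in CITY_RESTAURANTS:
        if city in text.split():
            frame["city"] = city
    # Second pass: dynamically add the recognized city's restaurant
    # names to the vocabulary and try to resolve the name slot.
    if frame["city"]:
        for name in CITY_RESTAURANTS[frame["city"]]:
            if name in text:
                frame["name"] = name
    return frame
```

In the actual system the second pass re-runs recognition against the enlarged vocabulary rather than string-matching a transcript, but the control flow (parse, load, re-resolve) is the same.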
