
Speech, Perception, & AI



  1. Speech, Perception, & AI Artificial Intelligence CMSC 25000 March 5, 2002

  2. Agenda • Speech • AI & Perception • Speech applications • Speech Recognition • Coping with uncertainty • Conclusions • Exam admin • Applied AI talk • Evaluations

  3. AI & Perception • Classic AI • Disembodied Reasoning: Think of an action • Expert systems • Theorem Provers • Full knowledge planners • Contemporary AI • Situated reasoning: Thinking & Acting • Spoken language systems • Image registration • Behavior-based robots

  4. Speech Applications • Speech-to-speech translation • Dictation systems • Spoken language systems • Adaptive & Augmentative technologies • E.g. talking books • Speaker identification & verification • Ubiquitous computing

  5. Speech Recognition • Goal: • Given an acoustic signal, identify the sequence of words that produced it • Speech understanding goal: • Given an acoustic signal, identify the meaning intended by the speaker • Issues: • Ambiguity: many possible pronunciations • Uncertainty: which signal, word, or sense produced this sound sequence

  6. Decomposing Speech Recognition • Q1: What speech sounds were uttered? • Human languages: 40-50 phones • Basic sound units: b, m, k, ax, ey, …(arpabet) • Distinctions categorical to speakers • Acoustically continuous • Part of knowledge of language • Build per-language inventory • Could we learn these?

  7. Decomposing Speech Recognition • Q2: What words produced these sounds? • Look up sound sequences in dictionary • Problem 1: Homophones • Two words, same sounds: too, two • Problem 2: Segmentation • No “space” between words in continuous speech • “I scream”/”ice cream”, “Wreck a nice beach”/”Recognize speech” • Q3: What meaning produced these words? • NLP (But that’s not all!)

  8. Signal Processing • Goal: Convert impulses from the microphone into a representation that • is compact • encodes features relevant for speech recognition • Compactness: Step 1 • Sampling rate: how often the signal is sampled • 8KHz, 16KHz (44.1KHz = CD quality) • Quantization factor: how many bits of precision per sample • 8-bit, 16-bit (encoding: u-law, linear, …)
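
The sampling-rate and quantization choices above translate directly into data rates. A minimal sketch of that arithmetic (the helper name `bytes_per_second` is my own, not from the slides):

```python
# Uncompressed mono data rate implied by a sampling rate and sample width
# (illustrative arithmetic only).

def bytes_per_second(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Bytes per second of raw audio at the given rate and precision."""
    return sample_rate_hz * bits_per_sample // 8

telephone = bytes_per_second(8_000, 8)     # 8 KHz, 8-bit: 8,000 B/s
wideband  = bytes_per_second(16_000, 16)   # 16 KHz, 16-bit: 32,000 B/s
cd        = bytes_per_second(44_100, 16)   # 44.1 KHz, 16-bit: 88,200 B/s
```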

  9. (A Little More) Signal Processing • Compactness & Feature identification • Capture mid-length speech phenomena • Typically “frames” of 10ms (80 samples at 8KHz) • Overlapping • Vector of features: e.g. energy at some frequency • Vector quantization: • n-feature vectors: n-dimensional space • Divide into m regions (e.g. 256) • All vectors in a region get the same label, e.g. C256
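
The labeling step of vector quantization can be sketched as a nearest-centroid lookup. The codebook values here are made up for illustration; in practice the codebook would be trained (e.g. with k-means, as slide 17 notes):

```python
# Vector-quantization sketch: map each feature vector to the index of its
# nearest codebook centroid (Euclidean distance). Codebook values are made up.

def quantize(vector, codebook):
    """Return the index of the codebook entry closest to `vector`."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: dist2(vector, codebook[i]))

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # m = 3 regions for illustration
frames = [(0.1, 0.1), (0.9, 0.2), (0.2, 0.8)]     # toy 2-feature frames
labels = [quantize(v, codebook) for v in frames]  # one VQ label per frame
```

With a real 256-entry codebook, each frame would reduce to a single byte-sized label (C1–C256).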

  10. Speech Recognition Model • Question: Given signal, what words? • Problem: uncertainty • Capture of sound by microphone, how phones produce sounds, which words make phones, etc. • Solution: Probabilistic model • P(words|signal) = P(signal|words)P(words)/P(signal) • Idea: Maximize P(signal|words) * P(words) • P(signal|words): acoustic model; P(words): language model
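
A minimal sketch of this noisy-channel decision rule, scoring candidate word sequences by P(signal|words) * P(words) in log space (all probabilities below are made up to illustrate the slide 7 "wreck a nice beach" example):

```python
import math

# Noisy-channel decoding sketch: pick the candidate word sequence that
# maximizes P(signal|words) * P(words). Probabilities are invented.

acoustic = {"recognize speech": 1e-5,      # P(signal|words): acoustic model
            "wreck a nice beach": 2e-5}
language = {"recognize speech": 1e-3,      # P(words): language model
            "wreck a nice beach": 1e-6}

def score(words):
    # Sum of logs avoids floating-point underflow on long utterances.
    return math.log(acoustic[words]) + math.log(language[words])

best = max(acoustic, key=score)
```

Even though the acoustic model slightly prefers "wreck a nice beach", the language model overrules it, which is exactly the role P(words) plays in the product.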

  11. Language Model • Idea: some utterances more probable • Possible solution: PCFGs??? • + Probabilities, - Little context effect • “Dog bites man” as probable as “Man bites dog” • Standard solution: “n-gram” model • Typically tri-gram: P(w_i | w_{i-1}, w_{i-2}) • Collect training data • Smooth with bi- & uni-grams to handle sparseness • Product over words in utterance
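
The "smooth with bi- & uni-grams" idea can be sketched as linear interpolation of the three estimates. The toy corpus and interpolation weights below are my own illustration (real systems tune the weights on held-out data):

```python
from collections import Counter

# Interpolated trigram sketch: mix tri-, bi-, and unigram estimates so an
# unseen trigram does not zero out the whole product. Corpus and weights
# are invented for illustration.

corpus = "dog bites man man bites dog dog bites man".split()
uni = Counter(corpus)
bi  = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
N = len(corpus)

def p(w, w1, w2, lambdas=(0.6, 0.3, 0.1)):
    """P(w | w2 w1), interpolating trigram, bigram, and unigram estimates."""
    l3, l2, l1 = lambdas
    p3 = tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
    p2 = bi[(w1, w)] / uni[w1] if uni[w1] else 0.0
    p1 = uni[w] / N
    return l3 * p3 + l2 * p2 + l1 * p1
```

Note that `p("dog", "dog", "dog")` is nonzero despite the trigram never occurring, because the bigram and unigram terms back it up.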

  12. Acoustic Model • P(signal|words) • words -> phones + phones -> vector quantiz’n • Words -> phones • Pronunciation dictionary lookup • Multiple pronunciations? • Probability distribution • Dialect variation: tomato • + Coarticulation • Product along path • [Figure: pronunciation networks for “tomato”: phone paths through t, ow/ax, m, ey/aa, t, ow, with branch probabilities 0.5/0.5 and 0.2/0.8]

  13. Acoustic Model • P(signal|phones) • Use Hidden Markov Model (HMM) • Transition probabilities, Output probabilities • Probability of sequence: sum of probabilities of paths • Combine into big HMM • [Figure: phone HMM with states Onset, Mid, End (to Final); self-loop/forward transition probabilities 0.3/0.7, 0.9/0.1, 0.4/0.6; per-state output probabilities over VQ labels C1–C6]
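
The "sum of probabilities of paths" is computed efficiently by the forward algorithm. A minimal sketch for a three-state phone HMM; the transition and output tables loosely mirror the slide's figure but are my own illustrative values:

```python
# Forward-algorithm sketch for P(VQ-label sequence | phone HMM):
# sum probability over all state paths, one observation at a time.
# Transition/output values are illustrative, not from the slides.

states = ["Onset", "Mid", "End"]
start  = {"Onset": 1.0, "Mid": 0.0, "End": 0.0}
trans  = {"Onset": {"Onset": 0.3, "Mid": 0.7},
          "Mid":   {"Mid": 0.9, "End": 0.1},
          "End":   {"End": 0.4}}            # remaining mass exits to Final
out    = {"Onset": {"C1": 0.5, "C2": 0.2, "C3": 0.3},
          "Mid":   {"C3": 0.2, "C4": 0.7, "C5": 0.1},
          "End":   {"C4": 0.1, "C6": 0.9}}

def forward(obs):
    """Total probability of emitting `obs`, summed over all state paths."""
    alpha = {s: start[s] * out[s].get(obs[0], 0.0) for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * trans[r].get(s, 0.0) for r in states)
                    * out[s].get(o, 0.0)
                 for s in states}
    return sum(alpha.values())
```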

  14. Viterbi Algorithm • Find BEST word sequence given signal • Best P(words|signal) • Input: HMM & VQ label sequence • Output: word sequence (with probability) • Dynamic programming solution • Record most probable path ending at each state i • Then most probable path from i to end • O(bMn)
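
Where the forward algorithm sums over paths, Viterbi keeps only the single best path into each state. A minimal sketch, reusing the same illustrative (invented) HMM tables as the acoustic-model example:

```python
# Viterbi sketch: dynamic programming for the single most probable state
# path given an observation (VQ-label) sequence. Model values are invented.

def viterbi(obs, states, start, trans, out):
    """Return (probability, state path) of the best path emitting `obs`."""
    # best[s] = (prob of the best path ending in state s, that path)
    best = {s: (start.get(s, 0.0) * out[s].get(obs[0], 0.0), [s])
            for s in states}
    for o in obs[1:]:
        best = {s: max(((p * trans[r].get(s, 0.0) * out[s].get(o, 0.0),
                         path + [s])
                        for r, (p, path) in best.items()),
                       key=lambda t: t[0])
                for s in states}
    return max(best.values(), key=lambda t: t[0])

states = ["Onset", "Mid", "End"]
start  = {"Onset": 1.0}
trans  = {"Onset": {"Onset": 0.3, "Mid": 0.7},
          "Mid":   {"Mid": 0.9, "End": 0.1},
          "End":   {"End": 0.4}}
out    = {"Onset": {"C1": 0.5, "C3": 0.3},
          "Mid":   {"C4": 0.7, "C3": 0.2},
          "End":   {"C6": 0.9}}

prob, path = viterbi(["C1", "C4", "C6"], states, start, trans, out)
```

Replacing `max` with `sum` over incoming states recovers the forward algorithm; keeping the argmax is what lets the backpointer path be read off directly.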

  15. Learning HMMs • Issue: Where do the probabilities come from? • Solution: Learn from data • Signal + word pairs • Baum-Welch algorithm learns efficiently • See CMSC 35100

  16. Does it work? • Yes: • 99% on isolated single digits • 95% on restricted short utterances (air travel) • 80+% on professional news broadcasts • No: • 55% on conversational English • 35% on conversational Mandarin • ?? Noisy cocktail parties

  17. Speech Recognition as Modern AI • Draws on a wide range of AI techniques • Knowledge representation & manipulation • Optimal search: Viterbi decoding • Machine Learning • Baum-Welch for HMMs • Nearest neighbor & k-means clustering for signal identification • Probabilistic reasoning/Bayes rule • Manage uncertainty in signal, phone, word mapping • Enables real-world application

  18. Final Exam • Thursday March 14: 10:30-12:30 • One “cheat sheet” • Cumulative • Higher proportion on 2nd half of course • Style: Like midterm: apply concepts • No coding necessary • Review session: ??? • Review materials: will be available

  19. Seminar • Tomorrow: Wednesday March 6, 2:30pm • Location: Ry 251 • Title: • Computational Complexity of Air Travel Planning • Speaker: Carl de Marcken, ITA Software • Very abstract abstract: • Application of AI techniques to real world problems • aka: how to make money doing AI
