
Snack for Ruby





  1. Snack for Ruby S Legrand

  2. Talk Objectives • Tour of API • Learn the walk and talk • Have Fun

  3. Snack • The Snack library is a tool to aid in learning about sound, voice, and ASR, and is hopefully a fun way to experiment • Snack is a Tcl-based API • Snack has been adapted to Python and included in the standard Python distribution

  4. Snack • Snack is Swedish for “talk” or “chat” • Kåre Sjölander is the principal investigator for Tcl-based Snack • Tcl Snack is available at http://www.speech.kth.se/snack/

  5. Snack for Ruby • rbSnack is a Ruby wrapper around Tcl Snack • rbSnack has additional Ruby-based utilities • rbSnack has HTML-based help (rdoc+rbTeX) • rbSnack can be found at http://rbsnack.sourceforge.net/

  6. Snack Toolkit Includes • Recording, Playback • Waveform display • Spectrogram: Fourier, LPC • Formant analysis • Power analysis • Filters (will demo)

  7. The Speech Signal • Continuous speech is discretely sampled • The signal consists of rapidly changing data points • The display of the sampled signal is called the waveform • Snack can display the waveform in real time

  8. Analysis uses frames • The signal is broken into frames • Frames may overlap • Characteristics of the signal are analyzed using Fourier and LPC analysis on a per-frame basis
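The framing step can be sketched in plain Ruby. This is only an illustration and does not use rbSnack; the method name `frames` and the parameters `frame_size` and `hop` are assumptions made for this example.

```ruby
# Split a sampled signal into fixed-length frames.
# hop is the distance between frame starts; hop < frame_size gives overlap.
# (Hypothetical helper for illustration, not part of rbSnack.)
def frames(signal, frame_size, hop)
  raise ArgumentError, "hop must be positive" unless hop.positive?

  last_start = signal.length - frame_size
  return [] if last_start.negative? # signal shorter than one frame

  (0..last_start).step(hop).map { |start| signal[start, frame_size] }
end

signal = (0...10).to_a
frames(signal, 4, 2).each { |frame| p frame }
```

With a frame size of 4 and a hop of 2, each frame shares half of its samples with the next one. Samples left over at the end that do not fill a whole frame are dropped in this sketch.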

  9. Going in Circles • Complex numbers are just a funny way of multiplying: multiply the lengths, add the angles. • Euler’s formula: e^(iθ) = cos θ + i sin θ
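A small check of the "add angles" idea, using Ruby's built-in `Complex` class. The helper name `euler` is an assumption for this example.

```ruby
# A point on the unit circle at angle theta: cos(theta) + i sin(theta).
# (Hypothetical helper name, for illustration only.)
def euler(theta)
  Complex(Math.cos(theta), Math.sin(theta))
end

a = Complex.polar(2.0, Math::PI / 6) # length 2, angle 30 degrees
b = Complex.polar(3.0, Math::PI / 3) # length 3, angle 60 degrees
product = a * b

puts product.abs # lengths multiply: close to 6
puts product.arg # angles add: close to PI / 2

# Multiplying two points on the unit circle adds their angles.
puts (euler(0.4) * euler(0.7) - euler(1.1)).abs # close to 0
```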

  10. Fourier Analysis • The Fourier matrix is a unitary matrix • Multiplication by the Fourier matrix returns the frequency components of the signal, called the Fourier coefficients • The inverse is easy to compute: it is called the inverse Fourier transform

  11. The Fourier Matrix Looks Like • Spinning disks • Multiplying the matrix by the signal produces the Fourier coefficients (frequency components)
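The matrix view of Fourier analysis can be sketched directly in Ruby. This is a naive O(n²) illustration rather than how Snack computes spectra; the helper names are assumptions, and the 1/√n scaling is chosen so that the matrix is unitary and its inverse is simply the conjugate transpose.

```ruby
# n x n unitary Fourier matrix: entry (j, k) is exp(-2*pi*i*j*k/n) / sqrt(n).
# Each row is a "spinning disk" turning at a different rate.
def fourier_matrix(n)
  scale = 1.0 / Math.sqrt(n)
  Array.new(n) do |j|
    Array.new(n) { |k| Complex.polar(scale, -2.0 * Math::PI * j * k / n) }
  end
end

# Plain matrix-times-vector product.
def mat_vec(matrix, vector)
  matrix.map { |row| row.zip(vector).sum { |m, v| m * v } }
end

# For a unitary matrix the conjugate transpose is the inverse.
def conjugate_transpose(matrix)
  matrix.transpose.map { |row| row.map(&:conj) }
end

n = 8
signal = Array.new(n) { |t| Math.cos(2.0 * Math::PI * t / n) } # one cycle per frame
f = fourier_matrix(n)
coefficients = mat_vec(f, signal)
recovered = mat_vec(conjugate_transpose(f), coefficients)

coefficients.each_with_index { |c, k| puts "bin #{k}: #{c.abs.round(6)}" }
puts recovered.map { |c| c.real.round(6) }.inspect
```

For this one-cycle cosine, the energy lands in bins 1 and n-1 and the other bins are close to zero. Applying the conjugate transpose returns the original samples up to rounding error.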

  12. Examining Fourier components • A spectrogram gives a picture of the Fourier components (coefficients) as they evolve over time. Snack can display it in real time. • Looks like an X-ray • Bands of high activity correspond to formants

  13. Linear Filters • Useful for understanding the nature of speech signals • Generators: generate square waves, sine waves, sawtooth waves, etc. • Composers: compose several filters • FIR: finite impulse response • IIR: infinite impulse response

  14. FIR Filter • Determined completely by its response to a unit impulse. • Response is finite in duration. • y(t) = b0 x(t) + b1 x(t-1) + b2 x(t-2) + … + bn x(t-n) • (We will demo FIR using rbSnack)
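The FIR difference equation translates almost line for line into Ruby. This sketch does not use rbSnack's filter classes; `fir_filter` is a hypothetical helper, and samples before the start of the signal are assumed to be zero.

```ruby
# y(t) = b0 x(t) + b1 x(t-1) + ... + bn x(t-n)
# b is the coefficient list [b0, b1, ..., bn]; x is the input signal.
# Samples before t = 0 are treated as zero (an assumption of this sketch).
def fir_filter(b, x)
  x.each_index.map do |t|
    b.each_with_index.sum do |coeff, k|
      t - k >= 0 ? coeff * x[t - k] : 0.0
    end
  end
end

impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
p fir_filter([0.5, 0.3, 0.2], impulse)
```

Feeding in a unit impulse reads the coefficients back out and then the output falls to zero, which is why the filter is completely determined by its impulse response and why that response is finite.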

  15. IIR Filter • Also called a recursive filter • Response is infinite in duration. • y(t) = b0 x(t) + b1 x(t-1) + b2 x(t-2) + … + bn x(t-n) + a1 y(t-1) + a2 y(t-2) + … + an y(t-n) • (We will demo IIR using rbSnack)
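A matching sketch for the recursive case, again in plain Ruby rather than rbSnack. `iir_filter` is a hypothetical helper; it follows the sign convention on the slide (feedback terms are added), and values before the start of the signal are assumed to be zero.

```ruby
# y(t) = b0 x(t) + ... + bn x(t-n) + a1 y(t-1) + ... + an y(t-n)
# b = [b0, b1, ...] weights the inputs; a = [a1, a2, ...] weights past outputs.
# Inputs and outputs before t = 0 are treated as zero (an assumption).
def iir_filter(b, a, x)
  y = []
  x.each_index do |t|
    feedforward = b.each_with_index.sum do |coeff, k|
      t - k >= 0 ? coeff * x[t - k] : 0.0
    end
    # a[k] multiplies y(t - k - 1), because a starts at a1.
    feedback = a.each_with_index.sum do |coeff, k|
      t - k - 1 >= 0 ? coeff * y[t - k - 1] : 0.0
    end
    y << feedforward + feedback
  end
  y
end

impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
p iir_filter([1.0], [0.5], impulse)
```

With a single feedback coefficient of 0.5, the impulse response halves at every step and never reaches exactly zero, which is the "infinite in duration" property.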

  16. Linear Predictive Analysis • Analogous to Fourier analysis • Assumption: for each frame, the signal is predicted by y(t) = a1 y(t-1) + a2 y(t-2) + … + ap y(t-p) • The LPC coefficients are the best least-squares approximation. • Can also be used to estimate formants
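One common way to find least-squares predictor coefficients for a frame is the autocorrelation method solved with the Levinson-Durbin recursion. The sketch below assumes that method; the slides do not say which solver Snack or rbSnack uses, and the helper names `autocorrelation` and `lpc` are made up for this example.

```ruby
# Autocorrelation of a frame for lags 0..max_lag.
def autocorrelation(frame, max_lag)
  (0..max_lag).map do |lag|
    (0...(frame.length - lag)).sum { |t| frame[t] * frame[t + lag] }
  end
end

# Returns [a1, ..., ap] such that y(t) is approximated by
# a1 y(t-1) + ... + ap y(t-p), using the Levinson-Durbin recursion.
def lpc(frame, order)
  r = autocorrelation(frame, order)
  return Array.new(order, 0.0) if r[0].zero? # silent frame: nothing to predict

  a = []       # a[j - 1] holds coefficient a_j
  error = r[0] # prediction error energy so far
  (1..order).each do |i|
    k = (r[i] - (1...i).sum { |j| a[j - 1] * r[i - j] }) / error
    a = (1...i).map { |j| a[j - 1] - k * a[i - j - 1] } << k
    error *= (1.0 - k * k)
  end
  a
end

frame = Array.new(200) { |t| 0.9**t } # each sample is 0.9 times the previous one
p lpc(frame, 2).map { |c| c.round(4) }
```

For this decaying frame the first coefficient comes out close to 0.9 and the second close to 0, matching the way the frame was generated.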

  17. What is Sound? What is Speech? • Sound is the signal created by longitudinal waves in some medium such as air. • Sound waves are continuous • They can be decomposed into a linear combination of sine waves. • Speech is a special noise made by humans

  18. It’s Just Tubing… • The simplest model of speech is to consider the lungs and trachea as one long tube. • Resonance frequencies are called formants (F1 and F2 in the figure).

  19. Some Speech Recognition Features • Formants • Pitch • Voiced/Unvoiced • Nasality • Frication • Energy • Our current work only uses formants and energy

  20. Basic Utterances • A basic unit of speech is called a phone • Vowels are utterances with constant formants • A diphthong is the transition from one vowel to another • Vowels and diphthongs are essentially characterized by the first and second formants.

  21. Other Phones: The Consonants • Plosive: closure in the oral cavity /p/ • Nasal: oral cavity closed, air flows through the nasal cavity /m/ • Fricative: turbulent airstream noise /s/ • Retroflex liquid: vowel-like, tongue high and curled back /r/ • Lateral liquid: vowel-like, tongue central, air stream along the sides /l/ • Glide: vowel-like /y/

  22. Some Problems with Speech Signals • Segmentation: when does a word begin and end? (Noise?) • Wetware: the speaker’s internal configuration, plus lip smacks, breathing, etc. • SegmentationWorkshop demos one approach.

  23. Code Books • A code book consists of code words. • The idea is to search through the code book to find the code word that best matches the feature sequence. • rbSnack uses the code book approach in word recognition.

  24. Code Book Approach • ++ Easy to implement • + Good for isolated words • +- Works best on small vocabularies • -- Is insensitive to context, prone to errors

  25. Code Book Approach • WhichWay is a simple demo of this approach
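The code book search can be sketched as a nearest-neighbour lookup. This is not rbSnack's or WhichWay's implementation: the labels and the two-number feature vectors (loosely standing in for a first and second formant) are made-up values, and each code word is simplified to a single vector instead of a feature sequence.

```ruby
# Euclidean distance between two feature vectors of equal length.
def distance(u, v)
  Math.sqrt(u.zip(v).sum { |a, b| (a - b)**2 })
end

# Return the label of the code word closest to the observed features.
# codebook maps a label to its code word (hypothetical structure).
def best_match(codebook, features)
  codebook.min_by { |_label, code_word| distance(code_word, features) }.first
end

# Made-up code words for illustration only.
codebook = {
  "word_a" => [300.0, 2300.0],
  "word_b" => [700.0, 1200.0],
  "word_c" => [400.0, 800.0]
}

puts best_match(codebook, [680.0, 1250.0])
```

The search always returns some code word, even for input that resembles none of them, which is one reason the approach works best on small vocabularies of isolated words.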

  26. More Problems with Speech Signals • Accent: Southern vs. New England vs. California Valley vs. Other. • Variation in rate of speech makes it hard to compare words

  27. Dynamic Time Warping • A pattern comparison technique • A way of stretching or compressing one sequence to match another. • Evaluated using dynamic programming

  28. Dynamic Programming • Form a grid, with start at lower left, end at upper right. • Label each node with difference (error) between pattern 1 at time i and pattern 2 at time j. • Find minimal distance from start to end using

  29. Dynamic Programming • Basic assumption: if the best path P(S,E) passes through node N, then P(S,E) is the concatenation of P(S,N) (best from S to N) and P(N,E) (best from N to E) • A possible path
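The grid search can be sketched as a short dynamic time warping routine. The step pattern used here (arrive from the left, from below, or diagonally, all with equal weight) and the absolute difference as the node error are assumptions for this example; rbSnack's own alignment examples may use other local constraints.

```ruby
# Dynamic time warping distance between two sequences of numbers.
# cost[i][j] is the cheapest total error of any path from the start node
# (0, 0) to node (i, j); by the basic assumption, it can be built from the
# best paths into the neighbouring nodes.
def dtw_distance(p1, p2)
  raise ArgumentError, "patterns must not be empty" if p1.empty? || p2.empty?

  inf = Float::INFINITY
  cost = Array.new(p1.length) { Array.new(p2.length, inf) }

  p1.each_index do |i|
    p2.each_index do |j|
      local = (p1[i] - p2[j]).abs # node label: error between the patterns
      best_previous =
        if i.zero? && j.zero?
          0.0
        else
          [
            i.positive? ? cost[i - 1][j] : inf,
            j.positive? ? cost[i][j - 1] : inf,
            i.positive? && j.positive? ? cost[i - 1][j - 1] : inf
          ].min
        end
      cost[i][j] = local + best_previous
    end
  end

  cost[p1.length - 1][p2.length - 1]
end

puts dtw_distance([1, 2, 3], [1, 2, 2, 3]) # same shape, spoken more slowly
puts dtw_distance([1, 2, 3], [2, 3, 4])
```

The first pair differs only in timing, so the warped distance is zero even though the sequences have different lengths.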

  30. Dynamic Programming • rbSnack includes examples for various time alignment approaches • [Figure: weighted local path constraints, Type I and Type III]

  31. Dynamic Programming • [Figure: weighted local path constraints, Itakura and Type IV]

  32. Hidden Markov Models • Sometimes the second (or third) best match is the right word. Use HMMs to ascertain the correct word in the context of the sentence. (Ditto for phones within a word.) • HMMs are similar to non-deterministic finite state machines, except that they also have non-deterministic output.

  33. Hidden Markov Models • Dynamic programming is used to compute weights. • [Figure: an example HMM with states 1 to 4, transition probabilities such as .4 and .2, and output probabilities P(/i/)=.5, P(/a/)=.2, P(/o/)=.3 on one state]
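One standard dynamic programming routine for HMMs is the Viterbi algorithm, which finds the most likely state path for a sequence of outputs. The sketch below assumes that algorithm and a made-up two-state model; only the output probabilities of state 1 are borrowed from the slide, and every other number is hypothetical.

```ruby
# Viterbi: most likely state path for the observed outputs.
# Returns [probability, path].
def viterbi(observations, states, start_p, trans_p, emit_p)
  raise ArgumentError, "need at least one observation" if observations.empty?

  # best[s] = [probability of the best path ending in state s, that path]
  best = states.to_h do |s|
    [s, [start_p[s] * emit_p[s][observations.first], [s]]]
  end

  observations.drop(1).each do |obs|
    best = states.to_h do |s|
      prob, path = states.map do |prev|
        [best[prev][0] * trans_p[prev][s] * emit_p[s][obs], best[prev][1]]
      end.max_by(&:first)
      [s, [prob, path + [s]]]
    end
  end

  best.values.max_by(&:first)
end

# Made-up model for illustration.
states  = [1, 2]
start_p = { 1 => 1.0, 2 => 0.0 }
trans_p = { 1 => { 1 => 0.6, 2 => 0.4 }, 2 => { 1 => 0.0, 2 => 1.0 } }
emit_p  = {
  1 => { "/i/" => 0.5, "/a/" => 0.2, "/o/" => 0.3 },
  2 => { "/i/" => 0.1, "/a/" => 0.7, "/o/" => 0.2 }
}

probability, path = viterbi(["/i/", "/a/", "/a/"], states, start_p, trans_p, emit_p)
puts "path #{path.inspect}, probability #{probability.round(4)}"
```

Because the output of each state is itself random, the state path cannot be read off the outputs directly; the routine keeps the best partial path into every state at each step, which is the same basic assumption used for time alignment.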

  34. Possible Future Directions • Examine other features (pitch?) • Incorporate other libraries (do the computationally hard work in C) • Add more signal processing routines • Add more examples • Use Hidden Markov Models

  35. Lessons Learned/to be learned • Document everything • Nothing’s perfect • Automate everything • The project is never done

  36. What’s next? • Try it out.
