
Presentation Transcript


  1. My Slides • Support vector machines (brief intro) • WS04: What we accomplished • WS04: Organizational lessons • AVICAR video corpus: current status

  2. Support Vector Machines as (sort of) compared to Neural Networks* • *Difficult to do because they have never been compared head-to-head on any speech task!

  3. SVM = Regularized Nonlinear Discriminant • Kernel: transform to infinite-dimensional Hilbert space • SVM extracts a discriminant dimension • The only way in which SVM differs from RBF-NN: THE TRAINING CRITERION: c = argmin( training_error(c) + λ/width(margin(c)) ) • (Bourlard/Morgan hybrid) Niyogi & Burges, 2002: posterior PDF = sigmoid model in the discriminant dimension • OR (BDFK tandem) Borys & Hasegawa-Johnson, 2005: likelihood = mixture Gaussian in the discriminant dimension
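The training criterion on the slide can be written out concretely. A minimal NumPy sketch (my addition, not from the slides) of the usual linear-SVM form: since margin width = 2/||w||, penalizing λ·||w||² is the same as penalizing λ/width(margin)², and the hinge loss stands in for training_error(c):

```python
import numpy as np

def svm_objective(w, b, X, y, lam):
    """SVM objective: mean hinge loss + lam * ||w||^2.

    y must be in {-1, +1}. Because margin width = 2/||w||, the
    lam * ||w||^2 term plays the role of lam / width(margin)^2.
    """
    margins = y * (X @ w + b)                       # functional margins
    hinge = np.maximum(0.0, 1.0 - margins).mean()   # surrogate for training error
    return hinge + lam * np.dot(w, w)               # error + margin penalty

# Tiny separable example: w=[1], b=0 classifies both points with margin 2,
# so the hinge term vanishes and only the margin penalty remains.
X = np.array([[2.0], [-2.0]])
y = np.array([1.0, -1.0])
obj = svm_objective(np.array([1.0]), 0.0, X, y, lam=0.5)  # -> 0.5
```

Training searches over (w, b) to minimize this; the kernel trick replaces the inner products with kernel evaluations so the same criterion applies in the (possibly infinite-dimensional) Hilbert space.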

  4. Binary Classifier = sign( Nonlinear Discriminant )
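Slides 3-4 together describe the full pipeline: an RBF-kernel SVM yields a discriminant dimension g(x), the binary classifier is sign(g(x)), and a sigmoid in that dimension gives a posterior (the Niyogi & Burges construction). A minimal scikit-learn sketch, not from the slides; the sigmoid slope/bias values below are illustrative, not fitted:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data with a nonlinear (circular) class boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # C trades training error vs. margin
clf.fit(X, y)

g = clf.decision_function(X)     # the discriminant dimension g(x)
pred = (g > 0).astype(int)       # binary classifier = sign( nonlinear discriminant )

# Posterior via a sigmoid model in the discriminant dimension.
# Slope a and bias b would normally be fit on held-out data (Platt scaling);
# the values here are placeholders.
a, b = -1.0, 0.0
posterior = 1.0 / (1.0 + np.exp(a * g + b))
```

In scikit-learn the fitted sigmoid is available directly via `SVC(probability=True)`, which runs Platt scaling internally.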

  5. Advantages of SVMs w.r.t. NNs • Accuracy: • SVM generalizes much better from small training data sets ( training tokens > 6X observation vector size ) • As training data size increases, accuracy of NN and SVM converge • Theoretically, and in some practical experiments too • Like 3-layer-MLP, RBF-SVM is a universal approximator • Fast training: nearly quadratic optimality criterion

  6. Disadvantages of SVMs w.r.t. NNs • No way to train with a very large training set • Complexity = O(N^2): either fast or impossible • Computational complexity during test • Solution: Burges’ reduced set method (extra training step; currently only available in Matlab) • Accuracy: unless you optimize the hyper-parameters, accuracy is good but not great • Exhaustive hyper-parameter training is very slow • You get good but not great accuracy with the “theoretically correct” hyper-parameters
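The "exhaustive hyper-training" cost is easy to see in code: each cell of a (C, γ) grid requires a full SVM fit per cross-validation fold, and each fit is itself roughly O(N²) in the number of training tokens. A small scikit-learn sketch (my addition; grid values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for a small classification task.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 3 C values x 3 gamma values x 3 CV folds = 27 separate SVM trainings,
# each roughly quadratic in the training-set size.
grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=3)
search.fit(X, y)

best = search.best_params_
```

This is why the grid is usually kept coarse (often logarithmic in C and γ) and why the "theoretically correct" defaults are tempting despite giving good-but-not-great accuracy.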

  7. Disadvantages of SVMs w.r.t. NNs • The real problem: we need phonetically labeled training data • “Embedded re-estimation experiment:” • Pre-trained SVMs used as HMM input (tandem system) • RBF weights re-estimated, together with HMM parameters, to maximize likelihood of the training data • Result: training-data likelihood and WRA (word recognition accuracy) increased • Result: test-data WRA decreased

  8. WS04

  9. WS04: SVM/DBN hybrid recognizer • Word: LIKE A • Canonical form: … tongue closed, tongue mid, tongue front, tongue open … • Surface form: … tongue front, semi-closed, tongue front, tongue open … • Manner: glide, front vowel; Place: palatal … • SVM outputs: p( gPGR(x) | palatal glide release ), p( gGR(x) | glide release ) • x: multi-frame observation including spectrum, formants, & auditory model

  10. WS04 Organizational Lessons: What Worked • Innovative experiments, made possible by people who really wanted to be doing what they were doing • Result: published ideas were interesting to many people • Parallel SVM classification experiments allowed us to test many different SVM definitions • Result: classification errors mostly below 20% by end of WS • Parallel recognizer test experiments (DBN/SVM was one, MaxEnt-based lattice rescoring was another) • Result: both achieved small (nonsignificant) WER reductions over baseline

  11. WS04 Organizational Lessons: What Didn’t Work • Software bottleneck between the SVMs and the recognizers: only one tool was available to apply an SVM to every frame in a speech file, and only one person knew how to use it. • Too many experimental variables: should SVMs be trained using (1) all frames, or (2) only landmark frames? DBN expects #1. HMM works best if manner features use #1, place features use #2. DBN? Impossible to test in six weeks. • Apples & oranges: SVM-only classifier outputs in cases #1 and #2 were incomparable => no test short of full DBN integration is meaningful.

  12. WS04 Organizational Lessons: What Didn’t Work • Unbeatable baseline: the goal was to rescore the output of the SRI recognizer in order to reduce WER => to find acoustic information not already used by the baseline recognizer. • What information is “not already used”? For a phone-based ANN/HMM hybrid system: hard to say. • When an experiment fails: why? • Better: use an open-source baseline (= not state of the art, but that’s OK), and construct test systems in a continuum between baseline and target.

  13. AVICAR

  14. AVICAR: Recording Hardware • 8 mics, pre-amps, wooden baffle; best place = sunvisor • 4 cameras, glare shields, adjustable mounting; best place = dashboard • System is not permanently installed; mounting requires 10 minutes

  15. AVICAR: Data Summary • 100 Talkers • 5 noise conditions: • Engine idling, • 35mph, windows closed / windows open • 55mph, windows closed / windows open • 4 types of utterances: • Isolated Digits • Phone numbers • Isolated Letters (e-set = articulation test) • TIMIT sentences • Public release: 16 schools & companies (but I don’t know how many are using it)

  16. AVICAR: Labeling & Recognition • Manual lip segmentation: 36 images • Automatic face tracking: nearly perfect • Automatic lip tracking: not so good • Manual audio segmentation: sentence boundaries • Audio enhancement: • Audio digit WRA = 97, 89, 87, 84, 78% (across the five noise conditions)

  17. AVICAR: Data Problems • DIVX encoding => database < 300 GB, but… • DIVX => poor edge quality in some images • Amelioration plan: re-transfer from tapes at high quality (huge data size) for those who want it.
