
Improvements to fMPE


Presentation Transcript


  1. Improvements to fMPE Dan Povey

  2. Overview • Review of fMPE • Mean offsets as features • Multiple layer framework • Context expansion in multiple layer framework • Improved way of setting learning rate • Improved way of setting per-dimension scales on learning rate • “Smooth update” – more stable update rule • “Out of the box training” • Diagnostics • Other issues • What is most important?

  3. Review of fMPE (1 of 3, overview) • In fMPE, we train a nonlinear offset to the features: • y_t = x_t + M h_t • h_t is a high-dimensional vector and a function of x_t (and possibly of context frames x_t-1, x_t+1, etc.). • The transformation parameters M are trained using the MPE objective function, with a modified form of gradient descent.
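A minimal sketch of the transform on this slide, y_t = x_t + M h_t, assuming h_t has already been computed and is stored sparsely. Function and variable names (apply_fmpe, M, h_idx, h_val) are illustrative, not from the original code.

```python
import numpy as np

def apply_fmpe(x_t, M, h_t_indices, h_t_values):
    """Add the learned offset M h_t to the observed feature x_t.

    h_t is high-dimensional but sparse, so it is passed as (indices, values)
    and only the corresponding columns of M enter the product.
    """
    offset = M[:, h_t_indices] @ h_t_values   # d-dimensional offset
    return x_t + offset

# Example with toy sizes: d = 39 features, high-dimensional vector of size 10000.
d, H = 39, 10000
M = np.zeros((d, H))                  # trained with MPE gradient descent
x_t = np.random.randn(d)
h_idx = np.array([12, 503, 6021])     # the few nonzero entries of h_t
h_val = np.array([0.7, 0.2, 0.1])
y_t = apply_fmpe(x_t, M, h_idx, h_val)
```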

  4. Review of fMPE (2 of 3, features) • The high-dimensional features h_t are (in the original implementation) a vector of Gaussian posteriors with frame splicing. Obtain 100,000 Gaussians by clustering the HMM set. Calculate Gaussian posteriors (model-free) on each frame. Splice vectors from adjacent frames together to create a larger vector (actually, splice together frames and averages of frames for a larger context window). The vector h_t is very sparse (even though M is not), so calculations are fast.
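A hedged sketch of the "model-free" posterior computation on this slide, assuming the clustered Gaussians form a diagonal-covariance GMM; in practice small posteriors would be pruned to keep h_t sparse. Names and the pruning detail are assumptions.

```python
import numpy as np

def gaussian_posteriors(x_t, means, variances, weights):
    """Posterior of each clustered Gaussian given frame x_t, ignoring the
    HMM state sequence (hence "model-free")."""
    # log N(x; mu, diag(var)) for every Gaussian, vectorised over Gaussians
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
    log_lik = log_norm - 0.5 * np.sum((x_t - means) ** 2 / variances, axis=1)
    log_post = np.log(weights) + log_lik
    log_post -= np.max(log_post)          # subtract max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()
```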

  5. Review of fMPE (3 of 3, training) • Specific learning rates for each parameter M_ij are obtained by accumulating the positive and negative contributions to ∂F/∂M_ij and dividing by the sum of the absolute values of both. • Compensate for different dimensions of the feature vector having different average variance. • The differential w.r.t. matrix element M_ij contains an “indirect” term reflecting changes that will happen in the means and variances when we re-train the system. This is necessary because the HMM parameters are trained with ML while the matrix is trained with MPE: features affect means & vars, means & vars affect the objective function, so we differentiate back through the process.
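A sketch of the per-parameter update this slide describes, assuming p[i,j] and n[i,j] hold the accumulated positive and negative contributions to ∂F/∂M_ij (so the gradient is p - n, with p, n >= 0), sigma[i] is an average standard deviation for output dimension i, and E is the global learning-rate constant. Names and the exact form are illustrative.

```python
import numpy as np

def fmpe_update(M, p, n, sigma, E, eps=1e-20):
    """One step of the modified gradient descent: per-element learning rate
    proportional to sigma_i and inversely proportional to (p_ij + n_ij)."""
    lr = sigma[:, None] / (E * (p + n) + eps)   # per-element learning rate
    return M + lr * (p - n)                     # gradient is p - n
```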

  6. Mean offsets as features • Probably the most important change (results already given at the last EARS meeting): • Using far fewer Gaussians (e.g. 1000 instead of 100,000) and adding the offsets of the observed features from the means. • If the posteriors were [γ_1, γ_2, …], we are now using: [ 5.0 γ_1, γ_1 (x_t(1)-μ_1(1))/σ_1(1), γ_1 (x_t(2)-μ_1(2))/σ_1(2), … , 5.0 γ_2, γ_2 (x_t(1)-μ_2(1))/σ_2(1), γ_2 (x_t(2)-μ_2(2))/σ_2(2), … ] • Each posterior is followed by the offsets of the feature from that Gaussian’s mean. • Divide by the standard deviation σ to ensure equal scales on all offsets. • 5.0 is a scale to put more weight on the posterior itself. • For 1000 Gaussians, the final dimension of the feature h_t would be 1000 (d+1) for d-dimensional features (ignoring frame splicing). • Improves both accuracy and speed.
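A hedged sketch of building the "posterior + mean offset" vector on this slide: for each of the (e.g. 1000) Gaussians, emit 5.0 times the posterior followed by the posterior-weighted, variance-normalised offsets, filling in only Gaussians with non-negligible posteriors so the vector stays sparse. The pruning threshold and names are assumptions.

```python
import numpy as np

def offset_features(x_t, post, means, stddevs, post_scale=5.0, prune=1e-2):
    """Build h_t of size n_gauss * (d + 1) from posteriors and mean offsets."""
    n_gauss, d = means.shape
    h_t = np.zeros(n_gauss * (d + 1))
    for g in np.nonzero(post > prune)[0]:
        block = g * (d + 1)
        h_t[block] = post_scale * post[g]                      # 5.0 * posterior
        h_t[block + 1:block + 1 + d] = post[g] * (x_t - means[g]) / stddevs[g]
    return h_t
```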

  7. Multiple layer framework • Motivation: using mean offsets combined with frame averaging and splicing reduces the sparsity of h_t to the point where training takes much longer. • Need to reorganize the calculation into multiple stages. • Developed a code framework where features can undergo multiple layers of processing and propagate differentials back to previous layers. • Uses multiple modules with a normalized interface (e.g. a layer doing a linear transformation is called in the same way as a layer calculating Gaussian posteriors)*. • Makes it very easy to add new kinds of processing (just copy, rename and modify an existing module). • Setup controlled by a config file. *except that some features need to be stored sparsely
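An illustrative sketch of the kind of normalized layer interface this slide describes: each module exposes a forward pass and a backward pass that propagates the differential of the objective function to its input. This mirrors the idea only; the class and method names are assumptions, not the original framework.

```python
import numpy as np

class LinearLayer:
    """One processing module: forward() maps input to output, backward()
    accumulates a parameter gradient and returns the input differential."""

    def __init__(self, in_dim, out_dim):
        self.M = np.zeros((out_dim, in_dim))

    def forward(self, h):
        self.h = h                          # cache input for the backward pass
        return self.M @ h

    def backward(self, d_out):
        self.dM = np.outer(d_out, self.h)   # differential w.r.t. parameters
        return self.M.T @ d_out             # differential w.r.t. input (to previous layer)
```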

  8. Context expansion in multiple layer framework (1 of 2) • Previously, we would calculate h_t explicitly (including splicing) and then project. But with mean offsets & splicing it is not sparse enough. • Now, calculate the “single-frame” h_t with no splicing (e.g. of size 1000 (d+1)), project it to a multiple of d, e.g. 9d, then splice and project to d: h_t → M1 h_t → M2 (M1 h_t, M1 h_t+1, M1 h_t-1, …); dimensions: 1000(d+1) → 9d → d. • Splice the 9d-dimensional feature across e.g. 80 frames and project down to d with a projection such that each output dimension only “sees” 1/d of the input dimensions. #parameters = 9 * 80 * d. • Initialize the projection to be equivalent to the original context expansion, so the first of the 9 contexts gets projected from the central frame, the second gets projected only from one frame to the right, etc.
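A hedged sketch of the two-stage computation on this slide: M1 projects each sparse single-frame vector to 9 contexts of size d; M2 then combines those contexts over a window of frames, with output dimension j looking only at dimension j of each context, giving 9 * frames * d parameters. Shapes and names are illustrative, not the original code.

```python
import numpy as np

def context_expand(h_frames, M1, M2, d, n_ctx=9):
    """h_frames: list of T single-frame high-dimensional vectors around time t.
    M1: (n_ctx * d, H) first projection.
    M2: (d, n_ctx, T) second projection; each output dim sees 1/d of the input dims."""
    # First layer: project every frame to n_ctx blocks of size d.
    proj = np.stack([(M1 @ h).reshape(n_ctx, d) for h in h_frames])   # (T, n_ctx, d)
    # Second layer: per output dimension j, weighted sum over contexts and
    # frames of dimension j only.
    offset = np.einsum('jct,tcj->j', M2, proj)
    return offset
```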

  9. Context expansion in multiple layer framework (2 of 2) • I neglected to mention in the paper that… • the context expansion layer is trained with held-out data (one out of every 10 files). • Otherwise, it tries to scale up the fMPE contribution as much as it can to maximize overtraining. • This is a problem with all setups that involve multiplying two fMPE-trained things. • [ Note – I do not bother making sure that the source of the “indirect” contributions to the differential is also held out. ]

  10. Improved method of setting learning rates • When changing setups, the appropriate learning rate (controlled by E) can change. • Set a “target” criterion improvement for the first iteration and set E on the first iteration based on that. • Use the same value of E in subsequent iterations. • Using 0.06 for the main (first) matrix and 0.007 for the second (context expansion) layer. • Reduce these values for low-WER domains. • Note that the context expansion layer is trained only from the second iteration, since its differential would be zero on the first iteration.
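One plausible way to realise "set E from a target improvement" (an assumption, not necessarily the original recipe): under the update rule sketched earlier, the first-order predicted improvement is sum_ij sigma_i (p_ij - n_ij)^2 / (E (p_ij + n_ij)), which scales as 1/E, so E can be solved for directly.

```python
import numpy as np

def set_E_from_target(p, n, sigma, target_improvement, eps=1e-20):
    """Choose E so the first-order predicted criterion improvement on the
    first iteration equals the requested target (e.g. 0.06)."""
    predicted_at_E1 = np.sum(sigma[:, None] * (p - n) ** 2 / (p + n + eps))
    return predicted_at_E1 / target_improvement
```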

  11. Improved method of setting per-dimension learning rates • The original per-dimension learning rates included a factor σ_i (an average standard deviation) so that matrix element M_ij would have the appropriate scale for the target dimension being added to. • This did not seem to work well for MFCC parameters: got wide variation in the contribution to criterion improvement between dimensions (perhaps broken by extreme values). • Replace σ_i with 1/sqrt(S_i), where S_i is the average squared value of the summed positive and negative contributions to each ∂F/∂M_ij. • Gives a better ratio between learning rates for different dimensions. • If E is set automatically as described above, the overall learning rate will be appropriate.
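A sketch of the replacement scale on this slide: S_i is the average over j of the squared summed contributions (p_ij + n_ij), and the per-dimension factor becomes 1/sqrt(S_i). Names are illustrative.

```python
import numpy as np

def per_dim_scales(p, n, eps=1e-20):
    """Return the per-dimension factor 1/sqrt(S_i) that replaces sigma_i."""
    S = np.mean((p + n) ** 2, axis=1)     # one value per feature dimension i
    return 1.0 / np.sqrt(S + eps)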

  12. “Smooth update” • When training the context expansion, an instability sometimes appeared for certain dimensions. • Developed a method to detect and stop instabilities. • Intuition – if too many parameters are changing direction and moving farther than last time, the learning rate is too high. (1) Define a set of meaningful subsets of the matrix parameters (e.g. matrix rows, columns). (2) For each subset, in decreasing order of size: if for more than 10% of the parameters p in the subset the value p_n on iteration n is on the opposite side of p_n-2 from p_n-1, reduce the learning rate for that subset until this no longer holds (i.e. move the parameters p_n towards p_n-1).
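A hedged sketch of the instability check on this slide, for one subset of parameters (e.g. one matrix row): shrink the proposed values toward the previous iteration until at most 10% of the parameters lie on the opposite side of p_n-2 from p_n-1. The exact back-off schedule (halving the step) is an assumption.

```python
import numpy as np

def smooth_update(p_n, p_nm1, p_nm2, max_frac=0.10, shrink=0.5, max_iters=20):
    """Shrink p_n toward p_nm1 until at most max_frac of the parameters have
    reversed direction and moved farther than on the previous iteration."""
    p_n = p_n.copy()
    for _ in range(max_iters):
        # p_n is on the opposite side of p_nm2 from p_nm1
        oscillating = (p_n - p_nm2) * (p_nm1 - p_nm2) < 0
        if np.mean(oscillating) <= max_frac:
            break
        p_n = p_nm1 + shrink * (p_n - p_nm1)   # move p_n toward p_nm1
    return p_n
```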

  13. “Out of the box” training • The reason for many of the changes described above is to obtain a setup that will work on different domains without tuning. • E.g. the new methods of setting learning rates, • and “smooth update”, which can neutralize the effect of a learning rate that has been set too fast. • fMPE reliably gives improvements without tuning. • E.g. recently trained some acoustic models for fast transcription of call-center data (no adaptation); fMPE+MPE improved results by 8.5% absolute from a 45% baseline. • For a small-vocabulary task, fMPE+MPE improved results by 30% relative from a 1.20% baseline. • Note – I now always use the same acoustic scales as normal MPE (e.g. 0.1 or 0.05, or the inverse of the normal LM scale if preferred).

  14. Diagnostics • Always use plenty of diagnostics, e.g.: • per-dimension measures of predicted criterion improvement and sign changes; • the overall predicted and observed criterion improvement; • check whether indirect & direct differentials cancel overall (see paper); • look at the average size of the fMPE contributions to the features; • check the distribution of data among the Gaussians used to calculate posteriors; • use measures of the difference between HMM sets. • Print out plenty of graphs and histograms where appropriate. “It doesn’t work” is not enough information to fix it if it’s broken.
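An illustrative example of the first two kinds of diagnostic listed above: per-dimension predicted criterion improvement and the fraction of parameter updates that changed sign since the previous iteration. The function and argument names are assumptions.

```python
import numpy as np

def fmpe_diagnostics(delta_now, delta_prev, grad):
    """delta_now, delta_prev: parameter changes on this and the previous
    iteration; grad: current differential. All are (d, H) arrays."""
    pred_improvement_per_dim = np.sum(grad * delta_now, axis=1)
    sign_change_frac = np.mean(np.sign(delta_now) != np.sign(delta_prev), axis=1)
    for i, (imp, sc) in enumerate(zip(pred_improvement_per_dim, sign_change_frac)):
        print(f"dim {i:2d}: predicted improvement {imp:+.4f}, sign changes {sc:.1%}")
```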

  15. Other issues investigated • Sigmoid layers – no improvement. • Momentum update rule – no improvement. • Training “variances” on the features (a quantity added to the (x-μ)^2 quantities in training and test) – this gave some improvement, ~1-2% relative. • Training multiple systems on the same data in parallel, sharing only the fMPE transform – should multiply the effective amount of data (seems to help ~1-2% relative). • Note – I don’t know whether the way the Gaussians are obtained is critical. Jasha Droppo (Microsoft) suggests training a GMM on the features with a globally tied variance.

  16. What is most important? • Use an appropriate learning rate for your features (e.g. set a target improvement). • Setting the learning rate too fast can cause dramatic instability. • Setting it too slow can cause very slow convergence. • Use the indirect differential if you want to train on fMPE features. • Use frame splicing for acoustic context. • Need a baseline discriminative training setup that works (e.g. lattice generation).
