
A Low-Power Low-Memory Real-Time ASR System


  1. A Low-Power Low-Memory Real-Time ASR System

  2. Outline • Overview of Automatic Speech Recognition (ASR) systems • Sub-vector clustering and parameter quantization • Custom arithmetic back-end • Power simulation

  3. ASR System Organization Goal: Given a speech signal, find the most likely corresponding sequence of words • Front-end • Transform the signal into a set of feature vectors • Back-end • Given the feature vectors, find the most likely word sequence • Accounts for 90% of the computation • Model parameters • Learned offline • Dictionary • Customizable • Requires embedded HMMs

  4. Embedded HMM Decoding Mechanism [Figure: word HMMs (e.g. "open", "the", "window") decoded over an utterance spanning frames 1 … T]

  5. ASR on Portable Devices • Problem • Energy consumption is a major problem for mobile speech recognition applications • Memory usage is a main component of energy consumption • Goal • Minimize power consumption and memory requirement while maintaining high recognition rate • Approach • Sub-vector clustering and parameter quantization • Customized architecture

  6. Outline • Overview of speech recognition • Sub-vector clustering and parameter quantization • Custom arithmetic back-end • Power simulation

  7. Sub-vector Clustering • Given a set of input vectors, sub-vector clustering involves two steps: 1) Sub-vector selection: find the best disjoint partition of each vector into M sub-vectors 2) Quantization: find the best representative sub-vectors (stored in codebooks) • Special cases • Vector quantization: no partition of the vectors (M=1) • Scalar quantization: size of each sub-vector is 1 • Two methods of quantization • Disjoint: a separate codebook for each partition • Joint: shared codebooks for same size sub-vectors
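
  As a concrete illustration of the two steps above, here is a minimal Python sketch of disjoint sub-vector quantization. It assumes the sub-vector partition has already been chosen (e.g. by the selection heuristics described later) and uses scikit-learn's KMeans as a stand-in codebook trainer; the array names and bit-widths are illustrative, not values from this system.

  # A minimal sketch of disjoint sub-vector quantization (not the authors' code).
  # `features` is an (N, D) array of acoustic feature vectors; `partition` is a
  # list of index lists covering 0..D-1 produced by the sub-vector selection step.
  import numpy as np
  from sklearn.cluster import KMeans

  def quantize_disjoint(features, partition, codebook_bits=4):
      """Train one codebook per sub-vector and return per-frame codes + codebooks."""
      n_codes = 2 ** codebook_bits          # codebook size per sub-vector
      codebooks, codes = [], []
      for idx in partition:
          sub = features[:, idx]            # (N, |sub-vector|) slice
          km = KMeans(n_clusters=n_codes, n_init=10).fit(sub)
          codebooks.append(km.cluster_centers_)
          codes.append(km.labels_)          # one small integer index per frame
      # Each frame is now stored as len(partition) indices of `codebook_bits` bits.
      return np.stack(codes, axis=1), codebooks

  # Scalar quantization is the special case where every sub-vector has size 1;
  # vector quantization is the case partition = [list(range(D))].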

  8. Why Sub-vector Clustering? • Vector quantization • Theoretically best • In practice requires a large amount of data • Scalar quantization • Requires less data • Ignores correlation between vector elements • Sub-vector quantization • Exploits dependencies and avoids data scarcity problems

  9. Algorithms for Sub-vector Selection • An exhaustive search over partitions is exponential, so we use several heuristics • Common feature of these algorithms: the use of entropy or mutual information as a measure of correlation • Key idea: choose clusters that maximize intra-cluster dependencies while minimizing inter-cluster dependencies

  10. Algorithms • Pairwise MI-based greedy clustering • Rank vector component pairs by MI and choose the combination of pairs that maximizes overall MI • Linear entropy minimization • Choose the clusters whose linear entropy, normalized by cluster size, is lowest • Maximum clique quantization • Based on MI graph connectivity
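
  A rough Python sketch of the pairwise MI-based greedy idea, assuming a simple histogram estimate of mutual information; the function names and the restriction to size-2 clusters are simplifications for illustration, not the algorithm's exact formulation.

  # Greedy pairing of feature components by mutual information (illustrative sketch).
  import numpy as np
  from itertools import combinations

  def mutual_information(x, y, bins=32):
      """Histogram estimate of I(X;Y) in nats for two 1-D samples."""
      joint, _, _ = np.histogram2d(x, y, bins=bins)
      pxy = joint / joint.sum()
      px = pxy.sum(axis=1, keepdims=True)
      py = pxy.sum(axis=0, keepdims=True)
      nz = pxy > 0
      return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

  def greedy_pairing(features):
      """Pair components by descending MI; leftover components stay scalar."""
      d = features.shape[1]
      scored = sorted(
          ((mutual_information(features[:, i], features[:, j]), i, j)
           for i, j in combinations(range(d), 2)),
          reverse=True)
      used, clusters = set(), []
      for _, i, j in scored:
          if i not in used and j not in used:
              clusters.append([i, j])
              used.update((i, j))
      clusters += [[i] for i in range(d) if i not in used]
      return clusters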

  11. Experiments and Results • Quantized parameters: means and variances of the Gaussian distributions • Database: PHONEBOOK, a collection of words spoken over the telephone • Baseline word error rate (WER): 2.42% • Memory savings: ~85% reduction (from 400KB to 50KB) • Best schemes: normalized joint scalar quantization and disjoint scalar quantization • Schemes such as entropy minimization and the greedy algorithm did well in terms of error rate, but at the cost of higher memory usage

  12. Outline • Overview of speech recognition • Sub-vector clustering and parameter quantization • Custom arithmetic back-end • Power simulation

  13. Custom Arithmetic • IEEE floating point • Pros: precise data representation and arithmetic operations • Cons: expensive computation and high bandwidth • Fixed-point DSP • Pros: relatively efficient computation and low bandwidth • Cons: loss of information, potential overflows; still not efficient in operation and bandwidth use • Custom arithmetic via table look-ups • Pros: compact representation with varied bit-widths, fast computation • Cons: loss of information due to quantization, overhead storage for tables, complex design procedure

  14. General Structure • Idea: replace all two-operand floating-point operations with customized arithmetic via ROM look-ups • Procedure: • Codebook design • Each codebook corresponds to a variable in the system • The bit-width depends on how precisely the variable must be represented • Table design • Each table corresponds to a two-operand function • The table size depends on the bit-widths of the indices and the entries
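
  A minimal Python sketch of the table-design step, assuming each variable already has a scalar codebook (a 1-D array of codewords); the function names and the nearest-codeword re-quantization rule are assumptions for illustration, not the paper's exact procedure.

  # Precompute a ROM-style look-up table for a two-operand operation.
  import numpy as np

  def build_table(cb_x, cb_y, cb_z, op):
      """Evaluate op(x, y) for every pair of codewords and re-quantize onto cb_z."""
      table = np.empty((len(cb_x), len(cb_y)), dtype=np.uint16)
      for i, x in enumerate(cb_x):
          for j, y in enumerate(cb_y):
              z = op(x, y)                                 # exact result in floating point
              table[i, j] = np.argmin(np.abs(cb_z - z))    # nearest output codeword
      return table

  # At run time a floating-point multiply z = x * y becomes a single ROM access:
  # code_z = table[code_x, code_y]. The table holds 2^(bw_x + bw_y) entries of
  # bw_z bits each, so the bit-widths directly control the storage overhead.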

  15. Custom Arithmetic Design for the Likelihood Evaluation • Issue: bounded accumulative variables • Accumulated iteratively over a fixed number of iterations: Y_{t+1} = Y_t + X_{t+1}, t = 0, 1, …, D • Large dynamic range, possibly too large for a single codebook • Solution: a binary tree with one codebook per level
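
  A small Python sketch of the per-level codebook idea, assuming the number of terms is a power of two and that one codebook is supplied per tree level; it models the numerical effect of re-quantizing each level's partial sums (in the actual back-end these additions would themselves be table look-ups).

  # Tree-structured accumulation with one codebook per level (illustrative sketch).
  import numpy as np

  def requantize(values, codebook):
      """Snap each partial sum to its nearest codeword for the current level."""
      return codebook[np.argmin(np.abs(values[:, None] - codebook[None, :]), axis=1)]

  def tree_accumulate(x, level_codebooks):
      """Add terms pairwise; each level's partial sums share one codebook, so the
      dynamic range any single codebook must cover stays narrow."""
      values = np.asarray(x, dtype=float)
      for codebook in level_codebooks:
          values = values[0::2] + values[1::2]            # one level of pairwise adds
          values = requantize(values, np.asarray(codebook))
      return values[0]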

  16. Custom Arithmetic Design for the Viterbi Search • Issue: unbounded accumulative variables • Arbitrarily long utterances; unbounded number of recursions • Unpredictable dynamic range, bad for codebook design • Solution: normalized forward probability • Dynamic programming still applies • No degradation in performance • A bounded dynamic range makes quantization possible
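
  A short Python sketch of a Viterbi-style recursion with per-frame normalization; the model interface (frame log-likelihoods, transition matrix) is an assumption for illustration. Subtracting the same constant from every state score at a frame leaves the best path unchanged while keeping the scores in a bounded range, which is what makes quantization of the recursion feasible.

  # Viterbi search with per-frame score normalization (illustrative sketch).
  import numpy as np

  def viterbi_normalized(log_obs, log_trans, log_init):
      """log_obs: (T, S) frame log-likelihoods; log_trans: (S, S); log_init: (S,)."""
      T, S = log_obs.shape
      delta = log_init + log_obs[0]
      delta -= delta.max()                       # normalize: best score is 0 each frame
      back = np.zeros((T, S), dtype=int)
      for t in range(1, T):
          scores = delta[:, None] + log_trans    # (from_state, to_state)
          back[t] = scores.argmax(axis=0)
          delta = scores.max(axis=0) + log_obs[t]
          delta -= delta.max()                   # renormalize; the argmax path is unchanged
      # Backtrack the best state sequence
      path = [int(delta.argmax())]
      for t in range(T - 1, 0, -1):
          path.append(int(back[t, path[-1]]))
      return path[::-1]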

  17. Optimization on Bit-Width Allocation • Goal: • Find the bit-width allocation scheme (bw1, bw2, bw3, …, bwL) that minimizes the cost of resources while maintaining the baseline performance • Approach: greedy algorithms • Optimal allocation: intractable • Heuristics: • Initialize (bw1, bw2, bw3, …, bwL) according to the single-variable quantization results • Increase the bit-width of the variable that gives the best improvement with respect to both performance and cost, until the performance matches the baseline
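
  A sketch of the single-dimensional greedy heuristic in Python; evaluate_wer and table_cost are hypothetical hooks into the recognizer and the storage cost model, not functions defined in this work.

  # Greedy bit-width allocation (illustrative sketch, dynamic-gradient flavor).
  def greedy_bitwidths(initial_bw, baseline_wer, evaluate_wer, table_cost, max_bw=16):
      bw = dict(initial_bw)                       # variable name -> bit-width
      while evaluate_wer(bw) > baseline_wer:
          best_var, best_gain = None, 0.0
          for var in bw:
              if bw[var] >= max_bw:
                  continue
              trial = dict(bw, **{var: bw[var] + 1})
              # Gain = WER improvement per unit of extra table storage
              gain = (evaluate_wer(bw) - evaluate_wer(trial)) / max(
                  table_cost(trial) - table_cost(bw), 1)
              if gain > best_gain:
                  best_var, best_gain = var, gain
          if best_var is None:                    # no single increment helps; stop
              break
          bw[best_var] += 1
      return bw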

  18. Three Greedy Algorithms • Evaluation method: gradient • Algorithms • Single-dimensional increment based on static gradient • Single-dimensional increment based on dynamic gradient • Pair-wise increment based on dynamic gradient

  19. Results on Single-Variable Quantization

  20. Results • Likelihood evaluation: • Replaces the floating-point processor with only 30KB of table storage while maintaining the baseline recognition rate • Reduces the offline storage for model parameters from 400KB to 90KB • Reduces the memory requirement for online recognition by 80% • Viterbi search: the forward probability can currently be quantized with 9 bits; can we reach 8 bits?

  21. Outline • Overview of speech recognition • Sub-vector clustering and parameter quantization • Custom arithmetic back-end • Power simulation

  22. Simulation Environment • SimpleScalar: cycle-level performance simulator • 5-stage pipeline • PISA instruction set (superset of MIPS-IV) • Execution-driven • Detailed statistics • Wattch: parameterizable power model • Scalable across processes • Accurate to within 10% of lower-level models • Flow: a binary program and hardware configuration drive SimpleScalar, whose hardware access statistics feed Wattch to produce the performance and power estimates

  23. Our New Simulator • ISA extended to support table look-ups • Instructions take three operands, but a quantized operation needs four values: • Two inputs • Output • Table to use • Two options proposed: • One-step look-up: a different instruction for each table • Two-step look-up: • Set the active table, used by all subsequent look-ups until reset • Perform the look-up
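
  A toy Python model of the two-step look-up semantics; the class and method names are illustrative assumptions, not the proposed ISA encoding.

  # Two-step table look-up: set the active table once, then issue ordinary
  # three-operand look-up instructions against it (illustrative sketch).
  class LookupUnit:
      def __init__(self, tables):
          self.tables = tables        # table id -> 2-D array of output codes
          self.active = None

      def set_table(self, table_id):
          # Step 1: "set active table" -- stays in effect until the next set_table
          self.active = self.tables[table_id]

      def lookup(self, code_a, code_b):
          # Step 2: two input codes in, one output code out
          return self.active[code_a][code_b]

  # A one-step look-up would instead encode the table id in the opcode itself,
  # at the cost of one distinct instruction per table.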

  24. Future Work • Immediate future • Meet with architecture groups to discuss relevant implementation details • Determine power parameters for look-up tables • Next steps • Generate power consumption data • Work with other groups for final implementation
