
Hardware Architectures to Support Low Power Natural I/O Applications



Presentation Transcript


  1. Hardware Architectures to Support Low Power Natural I/O Applications. Rajeev Krishna, Advanced Computer Architecture Lab, University of Michigan

  2. Why Look at Natural I/O? • Wave of the present! • Example: Basic speech recognition everywhere • Representative of Natural I/O applications • Why? • Ubiquitous computing • More versatile • What is the problem? • Computational complexity vs. available performance • Constraints of mobile computing platforms

  3. Computation and Energy: Supply vs. Demand • Continuous, speaker-independent, large-vocabulary recognition • Embedded processor performance: SA1110 (200MHz) – 20 wpm, ~6 hours of battery; XScale (400MHz) – 50 wpm, ~2 hours

  4. Computation and Energy: Supply vs. Demand • Mobile (laptop) processor performance: PIII (1GHz) – 200 wpm, ~6 minutes of battery • The tradeoff: performance vs. accuracy vs. energy

  5. Outline of Presentation • Speech Recognition Theory • Architectural and Programming Model • Architectural Evaluation • Memory System Design • Power Management • Conclusions

  6. Speech Recognition

  7. Algorithmic Challenges • What is so hard about speech recognition? • Time Warping • Co-articulations • Boundary Identification • Word Selection • Imagine listening in a noisy room • Estimate likely sounds • Apply context clues • Guess
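Time warping is typically handled with dynamic programming over hidden Markov models: a state can absorb a variable number of input frames through its self-loop, so fast and slow utterances align to the same model. A minimal Viterbi sketch for one left-to-right phone model; all numbers are toy values, not from the talk:

```c
/* Minimal Viterbi sketch for a left-to-right phone HMM.
 * Handles "time warping": each state may absorb a variable
 * number of frames via its self-loop. All probabilities are
 * logs, so products become sums. Toy numbers throughout. */
#include <stdio.h>
#include <float.h>

#define T 6            /* frames of input          */
#define S 3            /* HMM states, left to right */

int main(void) {
    /* log P(frame t | state s): stand-in for acoustic scores */
    double emit[T][S] = {
        {-1.0, -4.0, -6.0}, {-1.2, -3.5, -6.0},
        {-3.0, -1.1, -5.0}, {-4.0, -1.0, -4.0},
        {-5.0, -3.0, -1.2}, {-6.0, -4.0, -1.0},
    };
    double self_lp = -0.3, next_lp = -1.4;  /* log transition probs */

    double v[S];
    for (int s = 0; s < S; s++) v[s] = (s == 0) ? emit[0][0] : -DBL_MAX;

    for (int t = 1; t < T; t++) {
        double nv[S];
        for (int s = 0; s < S; s++) {
            double stay = v[s] + self_lp;   /* state absorbs this frame */
            double move = (s > 0) ? v[s-1] + next_lp : -DBL_MAX;
            nv[s] = (stay > move ? stay : move) + emit[t][s];
        }
        for (int s = 0; s < S; s++) v[s] = nv[s];
    }
    printf("best log-score ending in final state: %f\n", v[S-1]);
    return 0;
}
```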

  8. The Process • DSP signal processing (straightforward) • Pattern mapping to knowledge base (the key behaviors): acoustic scoring, linguistic scoring
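Acoustic scoring typically means evaluating each feature frame against Gaussian mixture models, a kernel that Sphinx-family recognizers execute thousands of times per frame. A hedged sketch with toy dimensions (real systems use roughly 39-dimensional features; normalization constants are folded into the log weights):

```c
/* Acoustic scoring sketch: log-likelihood of one feature frame
 * under a diagonal-covariance Gaussian mixture. Toy sizes and
 * parameters; the Gaussian normalizers are assumed to be folded
 * into log_wk[] ahead of time. */
#include <math.h>
#include <stdio.h>

#define DIM 4
#define MIX 2

double log_add(double a, double b) {        /* log(e^a + e^b), stably */
    if (a < b) { double t = a; a = b; b = t; }
    return a + log1p(exp(b - a));
}

double gmm_logscore(const double x[DIM],
                    const double mean[MIX][DIM],
                    const double inv_var[MIX][DIM],
                    const double log_wk[MIX]) {
    double total = -INFINITY;
    for (int k = 0; k < MIX; k++) {
        double s = log_wk[k];
        for (int d = 0; d < DIM; d++) {
            double diff = x[d] - mean[k][d];
            s -= 0.5 * diff * diff * inv_var[k][d];
        }
        total = log_add(total, s);
    }
    return total;
}

int main(void) {
    double x[DIM] = {0.1, -0.2, 0.4, 0.0};
    double mean[MIX][DIM] = {{0,0,0,0}, {1,1,1,1}};
    double inv_var[MIX][DIM] = {{1,1,1,1}, {2,2,2,2}};
    double log_wk[MIX] = {-0.7, -0.7};      /* toy weights+normalizers */
    printf("frame log-score: %f\n",
           gmm_logscore(x, mean, inv_var, log_wk));
    return 0;
}
```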

  9–19. Linguistic Search (animation sequence) • The recognizer matches a phoneme lattice for “Their Car” (DH EH R [word] K AA R) frame by frame, scoring each phoneme hypothesis with a probability such as P(“DH”) • Alternative paths branch from each node, yielding competing word hypotheses (“Their”, “The”, “Ear”, “Cap”, “Cat”, “Car”, …) • Within a few frames the search expands into a dense tree of thousands of candidate phoneme paths
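The lattice growth shown above is kept tractable by beam search: every frame, all active nodes are rescored and any path falling too far below the current best is pruned. A sketch under assumed data structures; the Node layout is hypothetical and acoustic_score() is an assumed hook (it could be the GMM scorer sketched earlier):

```c
/* Beam-search sketch over a lexical (phoneme) tree: each frame,
 * score every active node, find the best, and prune anything
 * more than BEAM below it. Node layout is hypothetical;
 * acoustic_score() is an assumed external scoring hook. */
#define BEAM 50.0

typedef struct Node {
    double score;            /* accumulated log path score   */
    int phone;               /* phoneme id at this tree node */
    struct Node *child;      /* first successor              */
    struct Node *sibling;    /* next child of same parent    */
} Node;

extern double acoustic_score(int phone, const double *frame);

/* Advance all active nodes by one input frame, pruning in place;
 * returns the new active-list length. */
int advance_frame(Node **active, int n, const double *frame) {
    double best = -1e300;
    for (int i = 0; i < n; i++) {
        active[i]->score += acoustic_score(active[i]->phone, frame);
        if (active[i]->score > best) best = active[i]->score;
    }
    int kept = 0;
    for (int i = 0; i < n; i++) {
        if (active[i]->score >= best - BEAM) {
            active[kept++] = active[i];   /* survives the beam */
            /* successors would be activated here, seeded with the
             * parent's score, making the tree grow as in the slides */
        }
    }
    return kept;
}
```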

  20. General Characteristics • Poor memory performance • Large memory footprint • Little locality in the reference stream • Little low-level predictability • Thread-level concurrency • 1,000s to 10,000s of active nodes per iteration • Relatively little interdependence

  21. Architecture and Programming Model

  22. Target Model • Exploit Concurrency • Fine grain thread management • Minimal communication • Parallel execution • Tolerate Latency • Maximize processor utilization • Hardware Multithreading • Runtime Adaptation • Unknown, Input-driven behaviour • Dynamic Programming Model

  23. Architectural Model - Overview • Base XScale 400MHz embedded processor • Speech processing unit • Memory system interface

  24. Architectural Model – Processing Element • Execution model based on simple integer pipeline • Per-thread register contexts • Control logic / Work Queue • Small cache

  25. Programming Model • Maximum concurrency, minimum communication, dynamic • Expose all reasonable concurrency to hardware • Initial static workload distribution + dynamic balancing • Key-based lock-less fine-grain mutual exclusion: spawn([PC], [arguments], [exclusion ID]); in the search, the node address serves as the exclusion ID: spawn([PC], [arguments], [node address]) • Cf. fork/join vector model on the XScale (a usage sketch follows below)
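A hedged sketch of how that spawn primitive might be used in the search: each lattice-node update becomes a lightweight thread, and the node's address doubles as the exclusion key, so two threads can never update the same node concurrently. Only the spawn signature comes from the slide; everything else is illustrative:

```c
/* Sketch of the slide's fine-grain spawn model: spawn() takes a
 * PC (here a function pointer), arguments, and an exclusion ID;
 * the hardware serializes threads sharing an exclusion ID, giving
 * lock-less mutual exclusion. spawn() is the proposed hardware
 * primitive, declared here as an assumption, not a real API. */
typedef struct Node Node;

extern void spawn(void (*pc)(void *), void *args, void *exclusion_id);

static void update_node(void *arg) {
    Node *n = (Node *)arg;
    /* ...score this lattice node, then spawn its successors... */
    (void)n;
}

/* Expand successor nodes, using each node's address as its
 * exclusion key: two threads targeting the same node can never
 * run concurrently, so no locks are needed on node state. */
static void expand(Node **succ, int count) {
    for (int i = 0; i < count; i++)
        spawn(update_node, succ[i], succ[i]);
}
```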

  26. Programming Model [diagram: numbered work items distributed across Memory Partition 1, Memory Partition 2, and Memory Partition 3]

  27. Architectural Evaluation

  28. Analysis Framework • Multi-pipeline simulator based on SimpleScalar/ARM • Hand-parallelized copy of the CMU Sphinx library • 11,447-word vocabulary, ~17 MB • Static load balancing via hMetis (profiled graph) • Ideal memory system: fixed memory latency, unlimited bandwidth • Power model: activity-based, component-level energy estimation • Extensive details in Appendix B

  29–30. Performance • Near ideal performance • Loss mitigated by added contexts • 40% overhead

  31. Idealized Energy Consumption • Energy for the ideal system • Energy drops because less time is spent dissipating static power • Demonstrates the potential to offset the added hardware's energy cost

  32. Latency Tolerance • Relative performance at 100-cycle memory latency compared to 50-cycle memory latency • Still unlimited bandwidth • Added contexts tolerate much of the added delay (see the sketch below)
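A back-of-envelope model of why extra contexts absorb the longer latency: if each thread runs R cycles between misses and then stalls L cycles, utilization with N contexts is roughly min(1, N·R/(R+L)). This is a standard multithreading approximation, not a result from the talk:

```c
/* Rough utilization model for hardware multithreading:
 * each thread runs R cycles, then stalls L cycles on a miss.
 * With N contexts, utilization ~= min(1, N*R / (R + L)).
 * Standard approximation; run length is a toy value. */
#include <stdio.h>

double util(int n, double run, double lat) {
    double u = n * run / (run + lat);
    return u > 1.0 ? 1.0 : u;
}

int main(void) {
    double run = 25.0;                 /* cycles between misses */
    for (int n = 1; n <= 8; n *= 2)
        printf("N=%d  util@50cyc=%.2f  util@100cyc=%.2f\n",
               n, util(n, run, 50.0), util(n, run, 100.0));
    return 0;
}
```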

  33–34. Meet the Memory Wall • High-detail 100MHz SDRAM latency simulator

  35. Memory System Design

  36. Memory System Design • Decrease memory demand: caching, compression • Increase memory bandwidth: wider channels / higher clock rate / more banking • Flash / ROM subsystem for immutable data • Embedded DRAM for mutable data • Focus on the data stream

  37. Caching • Per-pipeline L1 data cache [diagram: each speech-processor pipeline paired with its own small cache and cache control]

  38–39. Caching • Global L2 data cache [diagram: the XScale processor and the speech-processor pipelines sharing a global L2 cache in front of the DRAM controller]

  40. Caching • Miss ratios in the L1 data cache stream (2K, 4-way)

  41. Caching • Miss ratios in the L2 data cache stream (128K, 4-way)

  42. Caching • Performance and EDP with 128K L2

  43. Caching • Where is this locality?

  44. Data Compression • Ineffective at the L2: data elements span multiple cache lines either way • Somewhat algorithm-dependent • Great potential in the memory system: off-chip decompression = no performance impact

  45. DDR Memory • Performance and EDP with a 200MHz DDR memory system

  46. DDR Memory • L2 over DDR • L2+DDR over L2+SDRAM

  47. Bandwidth Optimizations • Stream partitioning of immutable data • Requires dual-banked Flash / ROM • Added latency is not an issue • Significant potential energy savings • Mutable data in partitioned, on-chip embedded DRAM • Still requires a small L2 for shared metadata • 25%+ higher performance • 15–30% higher energy consumption (a routing sketch follows below)
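One way to realize the stream partitioning is to steer each access by data class: immutable model data (acoustic and language models) to interleaved Flash/ROM banks, mutable search state to a partitioned eDRAM, and shared metadata behind the small L2. The address map below is invented for illustration:

```c
/* Sketch of steering memory streams by data class: immutable
 * model data lives in dual-banked Flash/ROM, mutable search
 * state in partitioned on-chip eDRAM, shared metadata behind a
 * small L2. The address map is invented, not from the talk. */
#include <stdint.h>

enum stream { ROM_BANK0, ROM_BANK1, EDRAM_PART, L2_SHARED };

enum stream route(uint32_t addr) {
    if (addr < 0x01000000)                   /* immutable models   */
        return (addr >> 6) & 1 ? ROM_BANK1   /* interleave cache   */
                               : ROM_BANK0;  /* lines across banks */
    if (addr < 0x01800000)                   /* per-partition state */
        return EDRAM_PART;
    return L2_SHARED;                        /* shared metadata     */
}
```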

  48. Power Management

  49. Power Management • What to do with extra time? • Enter low-power standby • 10% energy savings in ideal case • 2% with no frame buffering • Scale frequency / voltage • 25-30% energy savings in ideal case • 20-25% with per-frame modulation
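A hedged sketch of the per-frame modulation idea: after each 10 ms frame, pick the lowest frequency step that still finishes the predicted work for the next frame on time, and scale voltage with it. Since dynamic energy scales with V²f, finishing just in time beats racing ahead and idling. The frequency steps and workload predictor are invented:

```c
/* Per-frame DVFS sketch: choose the lowest frequency step that
 * finishes the predicted work within the 10 ms frame budget.
 * Frequency table and the cycle predictor are assumptions. */
#include <stdio.h>

static const double freq_mhz[] = {100, 200, 300, 400};
#define NSTEPS 4

int pick_step(double cycles_needed, double frame_ms) {
    for (int i = 0; i < NSTEPS; i++) {
        /* MHz * 1e3 = cycles per millisecond */
        double budget = freq_mhz[i] * 1e3 * frame_ms;
        if (budget >= cycles_needed) return i;   /* slowest fit */
    }
    return NSTEPS - 1;                           /* run flat out */
}

int main(void) {
    /* e.g. the predictor says the next frame needs 2.2M cycles */
    int s = pick_step(2.2e6, 10.0);
    printf("run at %.0f MHz\n", freq_mhz[s]);
    return 0;
}
```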

  50. Technology Trends
