
MacroSS: Macro-SIMDization of Streaming Applications

Amir Hormati*, Yoonseo Choi‡, Mark Woh*, Manjunath Kudlur†, Rodric Rabbah‡, Trevor Mudge*, Scott Mahlke*. * Advanced Computer Arch. Lab., University of Michigan. ‡ IBM T.J. Watson Research Center. † Nvidia Corp.

Presentation Transcript


  1. MacroSS: Macro-SIMDization of Streaming Applications. Amir Hormati*, Yoonseo Choi‡, Mark Woh*, Manjunath Kudlur†, Rodric Rabbah‡, Trevor Mudge*, Scott Mahlke* • * Advanced Computer Arch. Lab., University of Michigan • ‡ IBM T.J. Watson Research Center • † Nvidia Corp.

  2. Importance of SIMD • Energy- and area-efficient way to exploit data-level parallelism • Performance in multimedia and communication apps • Ubiquitous in modern processors • Intel: SSE, Larrabee • IBM: Altivec, Cell SPE • ARM: Neon (Figure labels: Control Unit, Functional Units, Cache.)

  3. Stream Computing • Prevalent in embedded, desktop, and server systems • Many optimizations for mapping and scheduling applications to parallel architectures • Retargetability is a big plus in streaming languages • Task, pipeline, and data-level parallelism are mapped into core-level parallelism • Data-level parallelism on SIMD engines is not utilized

  4. Traditional Vectorization on Streaming Applications

  5. Why are SIMD engines under-utilized? • Finding data-level parallelism suitable for SIMD engines • Proper data alignment • Complicated compiler optimizations and transformations • Wide variety of SIMD standards

  6. In this work… • Macro-level SIMDization techniques for streaming languages • MacroSS compiler for the StreamIt language • Hardware-based buffer optimizations for packing/unpacking operations • Evaluation of MacroSS on Intel Core i7

  7. StreamIt • Main constructs: • Filter: encapsulates computation; can be stateful or stateless • Pipeline: expresses pipeline parallelism • Splitjoin: expresses task/data-level parallelism • Exposes different types of parallelism • Scheduling and rate-matching are needed (Figure labels: filter, pipeline, splitjoin.)
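
The rate-matching requirement on slide 7 can be illustrated for a single producer/consumer channel. This is a minimal sketch, assuming the usual synchronous-dataflow balance condition (producer firings × push rate = consumer firings × pop rate); the function name `repetitions` and the example rates are hypothetical and do not come from the slides.

```python
from math import gcd


def repetitions(push_rate, pop_rate):
    """Return the smallest (producer_firings, consumer_firings) pair
    for which items produced equal items consumed on one channel."""
    if push_rate <= 0 or pop_rate <= 0:
        raise ValueError("rates must be positive")
    g = gcd(push_rate, pop_rate)
    # producer_firings * push_rate == consumer_firings * pop_rate
    return pop_rate // g, push_rate // g


# Hypothetical example: producer pushes 2 items per firing, consumer pops 3.
producer_firings, consumer_firings = repetitions(push_rate=2, pop_rate=3)
print(producer_firings, consumer_firings)  # 3 2
```

With these rates, three producer firings and two consumer firings both move six items, so the channel buffer returns to its starting occupancy after one steady-state iteration.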

  8. Macro SIMDization • SIMDization at graph level • Tunes the graph based on the target system • SIMD standards • Wide/Narrow SIMD • Actor SIMDization: • Single-Actor • Vertical • Horizontal

  9. Single-Actor SIMDization Overview (Figure panels: Serial Execution, Execution Reordering, Realistic Vectorization, Ideal Vectorization.)

  10. Single-Actor SIMDization • Only stateless actors • Scalar buffer accesses • Strided pushes and pops
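
The strided pushes and pops on slide 10 can be sketched in software. The following is an illustrative model only, not MacroSS output: it assumes a SIMD width of 4, a hypothetical stateless actor that pops two items and pushes their sum and difference, and Python lists standing in for vector registers.

```python
SIMD_WIDTH = 4   # assumed lane count, e.g. four 32-bit lanes in 128 bits
POP, PUSH = 2, 2  # hypothetical actor rates


def run_scalar(buf, firings):
    """Fire the actor serially: pop a and b, push a+b and a-b."""
    out = []
    for f in range(firings):
        a = buf[f * POP]
        b = buf[f * POP + 1]
        out += [a + b, a - b]
    return out


def run_simdized(buf, firings):
    """Execute SIMD_WIDTH consecutive firings together, one per lane."""
    if firings % SIMD_WIDTH:
        raise ValueError("firings must be a multiple of SIMD_WIDTH")
    out = [0] * (firings * PUSH)
    for base in range(0, firings, SIMD_WIDTH):
        # Strided pops: lane j reads the k-th pop of firing (base + j),
        # so neighbouring lanes are POP elements apart in the scalar buffer.
        va = [buf[(base + j) * POP + 0] for j in range(SIMD_WIDTH)]
        vb = [buf[(base + j) * POP + 1] for j in range(SIMD_WIDTH)]
        # One "vector" operation per original scalar operation.
        vsum = [x + y for x, y in zip(va, vb)]
        vdiff = [x - y for x, y in zip(va, vb)]
        # Strided pushes: scatter each lane's results back in firing order.
        for j in range(SIMD_WIDTH):
            out[(base + j) * PUSH + 0] = vsum[j]
            out[(base + j) * PUSH + 1] = vdiff[j]
    return out


data = list(range(16))
print(run_simdized(data, 8) == run_scalar(data, 8))  # True
```

The gather and scatter loops are the packing/unpacking cost that comes with keeping the buffers scalar; the arithmetic itself is done once per group of four firings.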

  11. Why Scalar Buffers? (Figure: buffer elements 0 to 23 laid out four per 128-bit row.)

  12. Vertical SIMDization

  13. Horizontal SIMDization • Find isomorphic actors in split/join structures • The isomorphic actors are merged into one vectorized actor • Actors can be either stateful or stateless (Figure: Source, Splitter, parallel branches A1…An, B1…Bn, C1…Cn, Joiner, Sink.)
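
The merging of isomorphic actors on slide 13 can be sketched as follows. This is a simplified model under stated assumptions: four branches, a round-robin splitter and joiner that move one item per branch (the slides do not specify the splitter type), and a hypothetical stateful actor that keeps a running sum and scales it by a per-branch constant `GAINS`.

```python
GAINS = [1, 2, 3, 4]  # hypothetical per-branch constants; one branch per lane


def splitjoin_scalar(stream):
    """Reference: each branch is a separate stateful actor."""
    state = [0] * len(GAINS)
    out = []
    for i, x in enumerate(stream):
        b = i % len(GAINS)           # round-robin splitter picks the branch
        state[b] += x                # per-branch running sum (actor state)
        out.append(GAINS[b] * state[b])  # round-robin joiner keeps item order
    return out


def splitjoin_horizontal(stream):
    """Merged actor: one firing processes one item for every branch."""
    width = len(GAINS)
    if len(stream) % width:
        raise ValueError("stream length must be a multiple of the branch count")
    vstate = [0] * width             # branch states packed into one vector
    out = []
    for base in range(0, len(stream), width):
        vx = stream[base:base + width]                # lane b holds branch b's input
        vstate = [s + x for s, x in zip(vstate, vx)]  # vector state update
        out.extend(g * s for g, s in zip(GAINS, vstate))
    return out


data = list(range(12))
print(splitjoin_horizontal(data) == splitjoin_scalar(data))  # True
```

Because each lane owns the state of exactly one branch, state does not create a dependence between lanes, which is consistent with the slide's note that horizontally merged actors may be stateful.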

  15. Streaming Address Generation • Area overhead less than 1% on Core i7. • Critical path: two 16-bit adds and one 64-bit add. (Figure, rows in ascending address order. Scalar Buffer: 0 1 2 3 / 4 5 6 7 / 8 9 10 11 / 12 13 14 15 / 16 17 18 19 / 20 21 22 23. Vector Buffer: 0 3 6 9 / 1 4 7 10 / 2 5 8 11 / 12 15 18 21 / 13 16 19 22 / 14 17 20 23.)
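
The scalar-to-vector buffer layout in the slide 15 figure can be reproduced with simple index arithmetic. The sketch below is a software model only and says nothing about how the hardware unit is built; the rate of 3 and width of 4 are inferred from the numbers in the figure, and the names `vector_index` and `to_vector_layout` are hypothetical.

```python
def vector_index(i, rate=3, width=4):
    """Map scalar-buffer index i to its position in the vector buffer.

    Assumes `width` consecutive firings are executed in lockstep, each
    consuming `rate` items, so lane j's k-th item sits in row k, column j.
    """
    block_size = rate * width
    block, within = divmod(i, block_size)  # which group of `width` firings
    lane, k = divmod(within, rate)         # firing within the group, item within firing
    return block * block_size + k * width + lane


def to_vector_layout(scalar_buf, rate=3, width=4):
    """Reorder a scalar buffer so each row of `width` items feeds one vector op."""
    if len(scalar_buf) % (rate * width):
        raise ValueError("buffer length must be a multiple of rate * width")
    out = [None] * len(scalar_buf)
    for i, value in enumerate(scalar_buf):
        out[vector_index(i, rate, width)] = value
    return out


layout = to_vector_layout(list(range(24)))
for row in range(0, 24, 4):
    print(layout[row:row + 4])
# [0, 3, 6, 9]
# [1, 4, 7, 10]
# [2, 5, 8, 11]
# [12, 15, 18, 21]
# [13, 16, 19, 22]
# [14, 17, 20, 23]
```

Each printed row matches a row of the Vector Buffer in the figure, which suggests the figure depicts a producer writing in scalar order while a SIMDized consumer reads whole rows.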

  16. Traditional vs. Macro SIMDization

  17. Experimental Setup • Frontend: MIT StreamIt compiler • Backend: MacroSS • ICC 11.1 compiles the generated C/C++ code • Core i7 with SSE4 (Figure: Streaming Program → Frontend Compiler → Backend Compiler → C Code → Host Compiler → Intel Core i7.)

  18. Macro-SIMDization vs. Traditional

  19. Benefits of SAGU

  20. Conclusion • Streaming is prevalent in all computing domains. • Applying traditional SIMDization to streaming applications fails to utilize SIMD engines. • Macro-SIMDization is done at a higher level. • MacroSS outperforms traditional SIMDization techniques by 54%.

  21. Questions and Comments

  22. Macro-SIMDization vs. Traditional

  23. SAGU Implementation • Area overhead less than 1% on Core i7. • Critical path: two 16-bit adds and one 64-bit add. • Minor ISA modifications are needed.

  24. SIMD + Multi-core Scheduling • How to schedule for a heterogeneous SIMD system? • SIMDization reduces memory/bus traffic. • Exploit SIMD parallelism before core-level parallelism. • Is this the best we can do?

  25. Multicore + Macro-SIMDization
