Bottlenecks of SIMD

Bottlenecks of SIMD. Haibin Wang, Wei Tong. Paper: Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements, IEEE TRANSACTIONS ON COMPUTERS, VOL. 52, NO. 8, AUGUST 2003.

Presentation Transcript


  1. Bottlenecks of SIMD • Haibin Wang, Wei Tong

  2. Paper • Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements • IEEE TRANSACTIONS ON COMPUTERS, VOL. 52, NO. 8, AUGUST 2003 • Deepu Talla, Member, IEEE, Lizy Kurian John, Senior Member, IEEE, and Doug Burger, Member, IEEE

  3. Outline • Introduction • Bottleneck Analysis • MediaBreeze Architecture • Summary

  4. Introduction • Multimedia SIMD extensions are widely used to speed up media processing, but their efficiency is low. • 75 to 85 percent of the dynamic instructions in the processor instruction stream are supporting instructions rather than core computation.

  5. Introduction • The bottlenecks are caused by the loop structure and the access patterns of the media program. • So instead of exploiting more data-level parallelism, the paper focuses on improving the efficiency of the instructions supporting the core computation.
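A rough way to see why exploiting more data-level parallelism alone does not help much is an Amdahl-style bound. The sketch below is an illustration, not a calculation from the paper: it assumes every dynamic instruction costs the same and nothing overlaps, and the function name `speedup` is hypothetical.

```python
def speedup(overhead_fraction, core_speedup, overhead_speedup=1.0):
    """Overall speedup when core and supporting instructions are
    accelerated by different factors (assumption: equal cost per
    instruction, no overlap between the two kinds of work)."""
    if not 0.0 < overhead_fraction < 1.0:
        raise ValueError("overhead_fraction must be between 0 and 1")
    core_fraction = 1.0 - overhead_fraction
    new_time = (core_fraction / core_speedup
                + overhead_fraction / overhead_speedup)
    return 1.0 / new_time


# Even an infinitely wide SIMD unit is capped by the untouched overhead.
for f in (0.75, 0.85):
    bound = speedup(f, core_speedup=float("inf"))
    print(f"overhead {f:.0%}: at most {bound:.2f}x from wider SIMD alone")
```

Under these assumptions, with 75 to 85 percent overhead the core computation could be made arbitrarily fast and the program would still run less than 1.4 times faster, which is why the supporting instructions are the more rewarding target.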

  6. Introduction • This paper makes two major contributions: • First, it targets the supporting instructions, rather than the core computation, to improve SIMD performance, which is a novel approach. • Second, it proposes the MediaBreeze architecture as a way to reduce or eliminate supporting instructions.

  7. Nested Loop

  8. Analysis of the loop structure • The sub-blocks are very small, which limits DLP and requires many supporting instructions. • Each block is processed by five nested loops, so much time is wasted on branches. • The data must be reorganized before SIMD can be used.
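The loop structure described in slide 8 can be sketched as a five-level nest over small sub-blocks. This is a hypothetical scalar example, not the kernel from the paper: the function `blocked_row_transform`, the 4x4 block size, and the one-branch-per-iteration count are all assumptions made for illustration, and the image dimensions are assumed to be multiples of the block size.

```python
def blocked_row_transform(image, coeff, block=4):
    """Multiply every row of every block x block sub-block by `coeff`,
    counting loop-back branches and multiply-accumulates (MACs)."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    counts = {"branches": 0, "macs": 0}
    for by in range(0, height, block):            # loop 1: block rows
        counts["branches"] += 1
        for bx in range(0, width, block):         # loop 2: block columns
            counts["branches"] += 1
            for y in range(block):                # loop 3: rows in a block
                counts["branches"] += 1
                for u in range(block):            # loop 4: output elements
                    counts["branches"] += 1
                    acc = 0
                    for k in range(block):        # loop 5: inner product
                        counts["branches"] += 1
                        acc += coeff[u][k] * image[by + y][bx + k]
                        counts["macs"] += 1
                    out[by + y][bx + u] = acc
    return out, counts


identity = [[1 if r == c else 0 for c in range(4)] for r in range(4)]
image = [[8 * r + c for c in range(8)] for r in range(8)]
result, counts = blocked_row_transform(image, identity)
print(counts)
```

With such short inner loops, the count of loop-control branches exceeds the count of useful multiply-accumulates, even before loads, stores, and address arithmetic are considered.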

  9. Access patterns

  10. Access patterns • The addressing sequences are complex, and generating them takes a large share of the supporting instructions. • Using general-purpose instruction sets to generate multiple addressing sequences is not very efficient.

  11. The overhead instructions • Address generation: address calculation • Address transformation: data movement, data reorganization • Loads and stores: memory access • Branches: control transfer, for-loops
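To make the address-generation cost concrete, here is a minimal sketch of the addressing sequence for one sub-block of a row-major image. The layout, the parameter names, and the function `block_addresses` are assumptions for illustration; the slides do not specify them.

```python
def block_addresses(base, width, block, by, bx, elem_size=1):
    """Yield the byte address of every element of sub-block (by, bx)
    in a row-major image that is `width` elements wide."""
    for y in range(block):
        # Several multiplies and adds are needed before the first load
        # of each row can be issued.
        row_start = base + ((by * block + y) * width + bx * block) * elem_size
        for x in range(block):
            yield row_start + x * elem_size


# Second 4x4 block in the top row of an 8-element-wide image.
print(list(block_addresses(base=0, width=8, block=4, by=0, bx=1)))
```

The sequence is not a single constant stride: it advances by one element within a row and then jumps by the image width, and a media kernel typically needs several such sequences (inputs, coefficients, outputs) at once.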

  12. Architecture

  13. Instruction Structure

  14. Breeze Instruction Mapping of 1D-DCT

  15. Full Map • five branches, • three loads and one store, • four address value generations (one on each stream, with each address generation representing multiple RISC instructions), • one SIMD operation (2-way to 16-way parallelism depending on each data element size), • one accumulation of SIMD result and one SIMD reduction operation, • four SIMD data reorganization (pack/unpack, permute, etc.) operations, and • shifting and saturation of SIMD results.

  16. Performance Evaluation • cfa, dct, motest, scale • G711, decrypt • Aud, jpeg, ijpeg
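The list in slide 15 can be tallied to get a feel for how much work a single Breeze instruction may stand for. The record below is a hypothetical bookkeeping sketch, not the actual instruction encoding: the class and field names are invented, shifting and saturation are counted as one operation each, and each address generation is counted as a single operation even though the slide says it represents multiple RISC instructions, so the total is only a lower bound.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class BreezeOpCounts:
    """Operation counts taken from the 'Full Map' slide (names are hypothetical)."""
    branches: int = 5
    loads: int = 3
    stores: int = 1
    address_generations: int = 4   # each stands for several RISC instructions
    simd_operations: int = 1       # 2-way to 16-way, by element size
    accumulations: int = 1
    reductions: int = 1
    reorganizations: int = 4       # pack/unpack, permute, etc.
    shifts: int = 1                # assumption: counted as one operation
    saturations: int = 1           # assumption: counted as one operation

    def operation_lower_bound(self) -> int:
        """Sum of all listed operations, one per item."""
        return sum(asdict(self).values())


print(BreezeOpCounts().operation_lower_bound())
```

Only one of the counted operations is the SIMD computation itself; the rest fall into the overhead categories of address generation, data reorganization, loads and stores, and branches.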

  17. Any improvement? • Why not higher efficiency in cfa? Memory latency! • Solution? Prefetch!

  18. Evaluation • Advantages: Eliminates or reduces overhead instructions. Much better than a normal SIMD extension. Costs 0.3% of processor area and less than 1% of total power consumption. • Drawback: Complicated instructions. Who will design a compiler for this?
