Scalable Vector Processors for Embedded Systems

Scalable Vector Processors for Embedded Systems. Kozyrakis, Patterson. Presentation by: Mohamed Abuobaida Mohamed, for COE502: Parallel Processing Architectures.


Presentation Transcript


  1. Scalable Vector Processors for Embedded Systems Kozyrakis, Patterson Presentation by: Mohamed Abuobaida Mohamed For COE502 : Parallel Processing Architectures

  2. Outline • Introduction • Instruction Set • Compiler • The Design • Evaluation • Clustered Processor • Conclusion

  3. Introduction • Embedded processors require low power consumption and low design complexity • Performance and scalability are the primary goals • Superscalar and VLIW processors exploit instruction-level parallelism (ILP) • Superscalar requires complex hardware to detect dependences • VLIW requires a very thorough compiler • Scaling either approach is difficult

  4. Introduction • Multimedia and telecommunications workloads have data-level parallelism (DLP) • Revisit the vector architectures used in supercomputers • Introduce Vector IRAM (VIRAM)

  5. Instruction Set • Coprocessor extension to MIPS • Vector Register File (VRF) • 32 Registers • Integer and floating point • Flag register • Vector operations • Arithmetic: integer and floating point • Logical operations • Other functions e.g. population count
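The role of the flag register can be illustrated with a small sketch. It assumes the flags act as a per-element mask on a vector operation; `masked_add` and the list-based "registers" are hypothetical names for illustration, not VIRAM instructions.

```python
# Illustrative model only: Python lists stand in for vector registers.
# Assumption: a flag register holds one bit per element and masks the operation.

def masked_add(dst, a, b, flags):
    """Element-wise a + b where the flag bit is set; other elements keep dst's value."""
    return [x + y if f else d for d, x, y, f in zip(dst, a, b, flags)]

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
flags = [1, 0, 1, 0]          # only elements 0 and 2 are enabled
result = masked_add([0, 0, 0, 0], a, b, flags)
```

Here only the elements whose flag bit is set are updated, which is how a vectorized loop can handle an `if` inside its body.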

  6. Instruction Set • Vector loads and stores support the three common access patterns (unit stride, strided, and indexed) and virtual addressing • Elements can be 64, 32, or 16 bits wide • The 64-bit datapath can execute multiple narrow elements in parallel • Element permutation is limited to the patterns needed for dot products and fast Fourier transforms • Supports speculative execution using the flag register
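The three access patterns are usually understood as unit-stride, strided, and indexed (gather) accesses. A minimal sketch, assuming a flat list as memory and element-granular addresses; the function names are hypothetical and do not correspond to VIRAM mnemonics.

```python
# Illustrative model only: a Python list stands in for memory,
# and addresses count elements rather than bytes.

def load_unit_stride(mem, base, vl):
    """Consecutive elements: mem[base], mem[base + 1], ..."""
    return [mem[base + i] for i in range(vl)]

def load_strided(mem, base, stride, vl):
    """Elements separated by a fixed stride (e.g. one column of a row-major matrix)."""
    return [mem[base + i * stride] for i in range(vl)]

def load_indexed(mem, base, indices):
    """Gather: each element's offset comes from an index vector."""
    return [mem[base + idx] for idx in indices]

mem = list(range(100, 132))                  # 32 elements: 100, 101, ..., 131
unit = load_unit_stride(mem, 4, 4)
strided = load_strided(mem, 0, 8, 4)
indexed = load_indexed(mem, 0, [3, 0, 7, 1])
```

The matching stores (scatter, for the indexed case) write through the same address calculations.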

  7. The Compiler • Based on the PDGCS compilation system for Cray supercomputers • Extensive vectorization techniques: • Outer-loop vectorization • Handling of partially vectorizable constructs • Does not require special functions or custom libraries • Requires pragmas for irregular scatter/gather patterns

  8. The Compiler • Selects the operation and element width • Recognizes reductions
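To show what recognizing a reduction means, the sketch below contrasts a scalar dot-product loop with a vector-style version: one element-wise multiply followed by repeated halving. This is a generic tree reduction written for illustration, not the compiler's actual output; both function names are hypothetical.

```python
def dot_scalar(a, b):
    """The loop as a programmer writes it: a serial accumulation."""
    s = 0
    for x, y in zip(a, b):
        s += x * y
    return s

def dot_vector_style(a, b):
    """Vector-style form: element-wise multiply, then add the upper half onto the lower half."""
    prod = [x * y for x, y in zip(a, b)]     # one vector multiply
    n = 1
    while n < len(prod):                     # pad with zeros to a power of two
        n *= 2
    prod += [0] * (n - len(prod))
    while len(prod) > 1:                     # log2(n) vector adds
        half = len(prod) // 2
        prod = [prod[i] + prod[i + half] for i in range(half)]
    return prod[0]

a = [1, 2, 3, 4, 5]
b = [6, 7, 8, 9, 10]
total = dot_vector_style(a, b)
```

The halving step is the kind of limited element permutation that a dot product needs.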

  9. The Design • Coprocessor to a 64-bit MIPS core • VRF capacity is 8 KB • Each vector register can hold 32 64-bit, 64 32-bit, or 128 16-bit elements • A lane has two 64-bit ALUs and a vector load/store unit • 13 MB of on-chip DRAM organized as 8 banks • The scalar core is a single-issue, in-order MIPS
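The element counts follow directly from the VRF capacity and the 32 registers listed on the instruction-set slide. A short worked calculation (the variable names are only for illustration):

```python
VRF_BYTES = 8 * 1024        # 8 KB vector register file
NUM_REGISTERS = 32          # vector registers in the instruction set

bits_per_register = VRF_BYTES * 8 // NUM_REGISTERS

# Number of elements one register holds at each supported element width.
elements_per_register = {width: bits_per_register // width for width in (64, 32, 16)}

for width, count in elements_per_register.items():
    print(f"{width}-bit elements per register: {count}")
```

Each register is 2048 bits wide, so halving the element width doubles the number of elements per register.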

  10. The Design • Operates at 200 MHz with 2 W power consumption

  11. Evaluation

  12. Clustered Processor • VIRAM has a complex VRF • Approx. 3 ports per FU • Proposed: replace the centralized VRF with a clustered VRF • A cluster has a datapath for one FU and a few vector registers • Each cluster has access to the intercluster network • Area, power, and latency per cluster are constant

  13. Clustered Processor • Renaming is used to take advantage of the clustered configuration • It is done using a renaming table that identifies the source and destination • It can be used to implement more than 32 registers • Clustering improves scaling
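A rough sketch of how a renaming table might map architectural vector registers onto registers spread across clusters. Everything here is an assumption for illustration: the `RenameTable` class, its methods, and the immediate reclaiming of old mappings are hypothetical and not taken from the paper.

```python
class RenameTable:
    """Maps an architectural register number to a (cluster, physical register) pair."""

    def __init__(self, num_clusters, regs_per_cluster):
        self.free = [(c, r) for c in range(num_clusters) for r in range(regs_per_cluster)]
        self.table = {}

    def rename_dest(self, arch_reg, cluster):
        """Allocate a free physical register in the cluster that will produce the value."""
        for i, (c, _) in enumerate(self.free):
            if c == cluster:
                loc = self.free.pop(i)
                old = self.table.get(arch_reg)
                if old is not None:
                    # Simplification: real hardware would wait for pending readers.
                    self.free.append(old)
                self.table[arch_reg] = loc
                return loc
        raise RuntimeError("no free register in cluster")

    def lookup_source(self, arch_reg):
        """Find where a source operand currently lives."""
        return self.table[arch_reg]

rt = RenameTable(num_clusters=2, regs_per_cluster=2)
loc_a = rt.rename_dest(3, cluster=1)         # architectural register 3 written in cluster 1
loc_b = rt.rename_dest(5, cluster=0)         # architectural register 5 written in cluster 0
# A consumer in cluster 0 reading register 3 must use the intercluster network.
needs_transfer = rt.lookup_source(3)[0] != 0
```

Because the table, not the instruction encoding, decides where a value lives, the total number of physical registers across clusters can exceed the 32 architectural ones.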

  14. Clustered Processor: Evaluation

  15. Conclusion • Designed for embedded systems • Area, power and performance • Exploits DLP • Instruction set VRF • Vectorizing compiler • Evaluation • Clustered configuration
