
A Comparison of the VIRAM-1 and Embedded VLIW architectures for use on SVD

CS 252, Spring 2000. Jeff Herman, John Loo, Xiaoyi Tang.


Presentation Transcript


  1. A Comparison of the VIRAM-1 and Embedded VLIW architectures for use on SVD
     CS 252, Spring 2000
     Jeff Herman, John Loo, Xiaoyi Tang

  2. Motivation
     • SVD applications
       • Smart antennas
       • Image processing
       • Medical imaging
     • VLIW
       • Trend in high-performance embedded computing
     • Vector
       • Out of favor
       • Flynn bottleneck is a limiting factor in parallelism
       • Known for linear algebra performance

  3. C67 Architecture (mapped)
     [Block diagram: instruction RAM (cache optional); 8-way decode logic; A and B register files; functional units L1, S1, M1, D1 and D2, M2, S2, L2; data RAM (>4 banks)]

  4. C67 Architecture
     • Split register files
       • 16 registers per register file
       • One cross path per register file
     • Instruction latencies
       • Branches: 6 cycles
       • Load: 5 cycles
       • FP add/multiply: 4 cycles

  5. TM 1100 VLIW Processor Core Architecture
     • 5-issue VLIW
     • 2 FP adders/multipliers
     • 2 load/store units
     • 128 general-purpose 32-bit registers
     • 16 KB data cache, 32 KB instruction cache
     • Instruction latencies
       • 3 cycles for branches, loads, and FP add/multiply
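To put the cycle counts on the two VLIW slides on a common footing, here is a minimal sketch that converts them to wall-clock time. It assumes the 200 MHz clock rate from the Testing Conditions slide applies to both processors; the names `CLOCK_HZ`, `LATENCY_CYCLES`, and `cycles_to_ns` are illustrative and do not come from the presentation.

```python
CLOCK_HZ = 200e6  # assumption: the 200 MHz rate from the Testing Conditions slide

# Latencies in cycles, as listed on the C67 and TM 1100 slides.
LATENCY_CYCLES = {
    "C67": {"branch": 6, "load": 5, "fp_add_mul": 4},
    "TM 1100": {"branch": 3, "load": 3, "fp_add_mul": 3},
}


def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to nanoseconds at the given clock rate."""
    return cycles / clock_hz * 1e9


for cpu, table in LATENCY_CYCLES.items():
    for op, cycles in table.items():
        print(f"{cpu:8s} {op:11s} {cycles} cycles = {cycles_to_ns(cycles):.0f} ns")
```

At equal clock rates, each cycle is 5 ns, so the longer C67 latencies translate directly into longer dependent-operation chains unless the scheduler can fill the delay slots.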

  6. VIRAM-1 Microarchitecture
     • 2-way-issue superscalar MIPS IV core
     • Asynchronous vector unit
       • Communication to scalar core through queue
     • 32 general-purpose vector and flag registers
     • 32 scalar and control registers
     • 2 VAFU, 2 FFU, 1 VMFU
     • 4-lane standard configuration

  7. VIRAM-1 Microarchitecture

  8. Testing Conditions
     • SVD routine from CLAPACK
     • Random test matrices with a rank of 10
       • Matrix dimension ratio of 10
       • Sizes range from 100x10 to 300x30
     • Suboptimal parameters used
       • Trends should still hold
     • Assumed 200 MHz clock rate
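The test inputs described above can be sketched as follows. This is not the authors' generator, which the slides do not show: it assumes a rank-limited matrix is built as the product of two Gaussian factors, and the step of 10 columns between sizes is a guess, since only the endpoints 100x10 and 300x30 are given. The function names are hypothetical, and only the Python standard library is used.

```python
import random


def random_low_rank_matrix(rows, cols, rank, rng):
    """Return a rows x cols matrix (list of lists) of rank at most `rank`.

    Built as the product of a rows x rank and a rank x cols Gaussian factor.
    """
    left = [[rng.gauss(0.0, 1.0) for _ in range(rank)] for _ in range(rows)]
    right = [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rank)]
    return [
        [sum(left[i][k] * right[k][j] for k in range(rank)) for j in range(cols)]
        for i in range(rows)
    ]


def test_sizes(min_cols=10, max_cols=30, step=10, ratio=10):
    """Yield (rows, cols) pairs with rows = ratio * cols (step is assumed)."""
    for cols in range(min_cols, max_cols + 1, step):
        yield ratio * cols, cols


rng = random.Random(0)  # fixed seed so runs are repeatable (an assumption)
matrices = [random_low_rank_matrix(m, n, 10, rng) for m, n in test_sizes()]
for a in matrices:
    print(len(a), "x", len(a[0]))
```

Each matrix would then be passed to the SVD routine under test; with a rank of 10, only the 100x10 case is full rank, while the larger cases are rank-deficient.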

  9. Ideal 'C67 and TM 1100 Performance Gap
     • Same memory bottlenecks in both processors
     • Programming model
       • C67
         • Assembly-coded kernels
         • 1700 lines
       • TM 1100
         • Only C-level optimizations

  10. VIRAM Performance Summary
      • Gains from vector unit limited by Amdahl's law
        • Vector instructions comprise only ~15% of total code
        • Not much else of SVD can be vectorized
        • Gains limited by what cannot be vectorized
        • Perhaps streamline LAPACK or hand-code assembly?
      • Sub-linear scalability
        • Scaling IRAM is cheap but gains diminish
        • Efficiency and scalability increase with size of data set
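The Amdahl's law limit named on this slide can be illustrated with a short calculation. This is a sketch under two loud assumptions: the slide's ~15% figure is a share of code, and it is treated here as a share of run time; and the vectorized part is assumed to speed up linearly with the number of lanes. Neither is claimed by the presentation, and the lane counts other than 4 are hypothetical.

```python
def amdahl_speedup(vector_fraction: float, vector_speedup: float) -> float:
    """Overall speedup when only `vector_fraction` of run time is accelerated."""
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be in [0, 1]")
    if vector_speedup <= 0:
        raise ValueError("vector_speedup must be positive")
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_speedup)


# Assumption: 15% of run time is vectorizable (the slide's ~15% is a share of
# code, not of time), and the vector part speeds up linearly with lane count.
VECTOR_FRACTION = 0.15

for lanes in (1, 2, 4, 8):
    print(f"{lanes} lanes: {amdahl_speedup(VECTOR_FRACTION, lanes):.3f}x")

# Upper bound as the vector unit becomes infinitely fast.
print(f"limit: {1.0 / (1.0 - VECTOR_FRACTION):.3f}x")
```

Under these assumptions, doubling the lanes yields progressively smaller gains and the overall speedup can never exceed 1 / 0.85, which is consistent with the slide's point that the scalar portion dominates.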

  11. Concluding Remarks
      • The two architectures have different limitations
        • VIRAM: scalar core
        • VLIW: memory bandwidth
      • VLIW cannot match the performance of VIRAM when computing SVD
      • VLIW with vector coprocessor?
