Tarantula: A Vector Extension to the Alpha Architecture

Tarantula: A Vector Extension to the Alpha Architecture. Roger Espasa, Federico Ardanaz, Joel Emer, Stephen Felix, Julio Gago, Roger Gramunt, Isaac Hernandez, Toni Juan, Geoff Lowney, Matthew Mattina, André Seznec. Universitat Politècnica de Catalunya, Barcelona, Spain



Presentation Transcript


  1. Tarantula: A Vector Extension to the Alpha Architecture • Roger Espasa, Federico Ardanaz, Joel Emer, Stephen Felix, Julio Gago, Roger Gramunt, Isaac Hernandez, Toni Juan, Geoff Lowney, Matthew Mattina, André Seznec • Universitat Politècnica de Catalunya, Barcelona, Spain • Compaq Computer Corporation, Shrewsbury, MA

  2. State of the World • CMOS Technology progresses • More transistors, more functional units, more control overhead • VLIW and Wide Superscalar • More individually controlled units • Amount of real estate for control logic grows non-linearly • Vector ISA • Localization of parallelism, aggregation of control • Regular structures, simple control

  3. Tarantula • EV8 core + tightly integrated Vector Unit • Out of Order execution, Register Renaming • Integrated in VM and cache coherence system • SMT support • Targeted at scientific computing applications • Requires compiler support and recompilation

  4. Vector ISA • New Architectural State • 32 vector registers (v0-v31) • v31 wired to 0. Used for prefetch • Vector length (vl), Vector stride (vs), Vector Mask (vm) • 45 New Instructions • 5 Groups • Vector-Vector, Vector-Scalar, Strided Memory Access, Random Memory Access, Vector Control

  5. Vector Mask • Allows conditional execution without EV8 scalar registers • VM can be renamed • Example for the condition A(i).ne.0.and.B(i).gt.2: vloadq A(i) --> v0 ; vloadq B(i) --> v1 ; vcmpne v0, #0 --> v6 ; vcmpgt v1, #2 --> v7 ; vand v6, v7 --> v8 ; setvm v8 --> vm
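The mask computation on the slide above can be modelled in a few lines of Python. This is only a sketch of the semantics, not of the hardware: the helper names `vector_mask` and `masked_add` are hypothetical, and the rule that masked-off elements keep their old destination value is an assumption that the slides do not state.

```python
def vector_mask(a, b):
    """Element-wise mask for the condition A(i) != 0 and B(i) > 2."""
    m_ne = [x != 0 for x in a]                     # vcmpne v0, #0 --> v6
    m_gt = [y > 2 for y in b]                      # vcmpgt v1, #2 --> v7
    return [p and q for p, q in zip(m_ne, m_gt)]   # vand v6, v7 --> v8 ; setvm v8 --> vm


def masked_add(dst, x, y, vm):
    """Hypothetical masked add: only elements whose mask bit is set are written.

    Assumption: elements with a clear mask bit keep their previous value.
    """
    return [xi + yi if m else d for d, xi, yi, m in zip(dst, x, y, vm)]


a = [0, 5, 7, 1]
b = [9, 1, 3, 4]
vm = vector_mask(a, b)
result = masked_add([0, 0, 0, 0], a, b, vm)
print(vm)
print(result)
```

The point of the slide is that the whole condition is evaluated in vector registers and ends up in vm, so no round trip through EV8 scalar registers is needed.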

  6. Tarantula Block Diagram

  7. Vector Execution Unit • 16 independent lanes • No communication between lanes, except for gather/scatter • Each lane has • 2 functional units • Slice of Register File and Mask • Allows high bandwidth • Address generator and private TLB • The 32 functional units appear as only 2 issue ports • Simple scheduling

  8. Vector Unit – Core Interface • Vector Unit physically separate from core • Little modification to core • Large bus prevented by routing space • Core to VBox • 3 Instruction Bus • 2 Data Buses for Scalars from EV8 register file • 3 Instruction Kill Signal Bus for misspeculation • VBox to Core • 3 Instruction Completion Bus

  9. Power Consumption

  10. Vector Memory System • Bound to EV8 VM and Cache Coherence architecture • High Load/Store Bandwidth required • Goal: one 64-bit datum per flop • Memory Bus too slow • L1 Cache too small for vector data • Direct Connection to L2 Cache • Non-Unit Stride is the central problem • 20% of all accesses • Strided accesses don't match cache lines

  11. Non-Unit Strides • EV8 4MByte L2 Cache in 128 banks • 8 ways, 16 banks per way • Read 8 ways, select correct one • Non-unit stride accesses • Read 16 independent cache lines • Select one qword per line • Requires • Conflict free addresses • Conflict free writes to 16 lanes • One qword per lane per cycle

  12. Conflict Free Addresses • Possible for any 128 consecutive elements • For stride S = σ × 2^s with σ odd and s ≤ 4 • Order stored in ROM table • Elements accessed out of order • Even for vector length < 128, the full eight cycles are needed for address generation • Slice • Group of 16 conflict free addresses

  13. PUMP • Stride 1 accesses • 80% of all accesses • 128 Qwords in 16 (aligned) or 17 (misaligned) cache lines • Full cache lines read into PUMP latches • Two qwords per cycle sent to VBox • Similar for writes • Allows double bandwidth
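The "16 (aligned) or 17 (misaligned)" figure follows from simple arithmetic, sketched below. The 64-byte line size is an assumption inferred from the slide itself (128 qwords in 16 lines gives 8 qwords per line); `cache_lines_touched` is a hypothetical helper name.

```python
QWORD_BYTES = 8
LINE_BYTES = 64        # assumption: 8 qwords per cache line, inferred from 128 qwords / 16 lines
VECTOR_LENGTH = 128    # qwords in a full stride-1 vector access


def cache_lines_touched(start_byte, n_qwords=VECTOR_LENGTH):
    """Number of cache lines covered by n_qwords consecutive qwords starting at start_byte."""
    first_line = start_byte // LINE_BYTES
    last_line = (start_byte + n_qwords * QWORD_BYTES - 1) // LINE_BYTES
    return last_line - first_line + 1


print(cache_lines_touched(0))    # line-aligned start
print(cache_lines_touched(8))    # start one qword past a line boundary
```

Any start address that is not a multiple of the line size spills into one extra line, which is why the misaligned case needs 17 line reads.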

  14. Gathers and Scatters • Arbitrary Address for every vector element • Reordering algorithm doesn't work • Conflict Resolution Box (CR) • Find biggest subset of non-conflicting addresses, pack into slice • Add new addresses to remaining ones and repeat • Worst case: 128 slices generated • Same algorithm used for self-conflicting strides • Stride S = σ × 2^s with σ odd and s > 4
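The slice-packing idea can be illustrated with a simple greedy loop. This is a rough sketch under several assumptions, not the CR box's actual algorithm: the bank mapping in `bank_of`, the element-to-lane mapping in `lane_of`, and the 64-byte line size are all hypothetical, and a greedy scan does not necessarily find the biggest conflict-free subset.

```python
SLICE_SIZE = 16
NUM_BANKS = 16     # banks read in parallel per access, per the slides
NUM_LANES = 16
LINE_BYTES = 64    # assumption: 64-byte cache lines


def bank_of(addr):
    """Hypothetical bank mapping: low bits of the cache-line index."""
    return (addr // LINE_BYTES) % NUM_BANKS


def lane_of(element_index):
    """Hypothetical mapping of vector element to lane."""
    return element_index % NUM_LANES


def pack_slices(addresses):
    """Greedily pack (element, address) pairs into slices with no bank or lane conflicts."""
    pending = list(enumerate(addresses))
    slices = []
    while pending:
        used_banks, used_lanes = set(), set()
        current, leftover = [], []
        for idx, addr in pending:
            b, l = bank_of(addr), lane_of(idx)
            if len(current) < SLICE_SIZE and b not in used_banks and l not in used_lanes:
                used_banks.add(b)
                used_lanes.add(l)
                current.append((idx, addr))
            else:
                leftover.append((idx, addr))   # conflicts wait for a later slice
        slices.append(current)
        pending = leftover
    return slices


# All 128 elements hit the same bank: one element per slice.
print(len(pack_slices([0] * 128)))
# One element per consecutive cache line: full slices of 16.
print(len(pack_slices([i * LINE_BYTES for i in range(128)])))
```

Under these assumptions the first call reproduces the worst case from the slide (128 slices), while the second packs the vector into 8 full slices.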

  15. Vector Misses • To handle L2 misses, slices are treated as atomic • On a miss, the slice is moved to the Miss Address File (MAF) • Wait for missing data • Go to retry queue • Too many retries cause Panic Mode • MAF NACKs all other L2 requests that might prevent progress

  16. Scalar-Vector Coherency • VBox bypasses L1 cache • Presence bit P indicates L2 cache line loaded by VCore • If P is set, VBox invalidates L1 • Scalar Write followed by Vector Read is not covered • Barrier command required • DrainM purges the write buffer and causes a replay trap

  17. Evaluation • No Compiler support available • Hand-coded assembler cores • Scientific Benchmarks • ASIM Simulator • Cycle-accurate EV8 simulator • Tarantula compared to • EV8 • EV8 + Tarantula's memory system • Tarantula4: 1:4 ratio to RAMBUS frequency

  18. Operations per Cycle

  19. Speed Up over EV8

  20. Conclusions • Vector Processor most efficient solution for many applications • Vector Unit can be added to standard microprocessor core • Big Bandwidth requirement can only be satisfied by L2 cache • Potentially big performance gains • 2 to 20 over EV8 • Performance depends on good code • Tiling + aggressive prefetching • Very good power/performance ratio

  21. Questions • Can only scientific applications exploit vector processors? • Radix sort worked • Powerful memory access instructions • Masks allow logic execution • Does anyone know more about PRAM algorithms? • EV8/VBox coherency seems quirky. Does anyone see a better solution?
