
The CRAY-1 Computer System

Richard M. Russell. Presented by Andrew Waterman, ECE259, Spring 2008.




  1. The CRAY-1 Computer System. Richard M. Russell. Presented by Andrew Waterman, ECE259, Spring 2008.

  2. Background
     • CRAY-1 by no means the first vector machine
       • 1960s: Westinghouse Solomon/ILLIAC IV
       • 1974: CDC STAR 100
     • “I never, ever want to be a pioneer” --Cray
     • STAR 100, ILLIAC IV: who's this Amdahl dude?
     • 1972: Cray Research formed after spat with CDC
       • Seymour Cray wanted to start from scratch on 8600; CDC brass, not so much
     • 1976: first CRAY-1 deployed at Livermore

  3. CRAY-1 Hardware

  4. Look Ma, No ASICs!

  5. CRAY-1 Architecture
     • 5-ton vector uniprocessor
     • Word size = 64 bits
     • 80 MHz clock
     • 8 MB RAM in 16 banks @ 20 MHz
       • fcpu/fmem = 4 (!!)
     • Fairly RISCy: 16- or 32-bit instructions
     • Load/store; register-register operations

  6. Scalar Operation and Octal Annoyance
     • 10₈ (8) A-registers for 24-bit address calculations
     • 100₈ (64) B-registers serve as backing store for A-registers
     • 10₈ (8) S-registers for source/dest of scalar integer/FP insns
     • T is to S as B is to A
     • 11₈ (9) pipelined scalar FUs
       • Address add, mult
       • Integer add, shift, logic, pop count
       • FP add, mult, reciprocal
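
The counts on this slide are octal (hence the title). A minimal sketch of the conversion to decimal follows; the dictionary keys are just labels chosen for this example.

```python
# Counts as given on the slide, written as octal digit strings.
octal_counts = {
    "A-registers": "10",
    "B-registers": "100",
    "S-registers": "10",
    "scalar FUs": "11",
}

# int(text, 8) parses a string as a base-8 number.
decimal_counts = {name: int(digits, 8) for name, digits in octal_counts.items()}

for name, count in decimal_counts.items():
    print(f"{name}: {octal_counts[name]} (octal) = {count} (decimal)")
```

The nine scalar functional units match the nine listed in the sub-bullets: two address units, four integer units, and three floating-point units.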

  7. Scalar Operation
     • Protection without virtual memory
       • Base & limit address regs
       • Ld $dest,$addr actually loads from $base+$addr
       • Program killed if $base+$addr >= $limit
     • A handful of registers for interrupts, exceptions, etc.
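
The base-and-limit check above can be sketched in a few lines. This is an illustrative model only: `ProtectionFault`, `translate`, and `load` are hypothetical names, memory is a plain Python list, and "program killed" is modeled as an exception.

```python
class ProtectionFault(Exception):
    """Raised when a relocated address falls at or beyond the limit."""


def translate(addr, base, limit):
    """Relocate a program address by the base register and check the limit."""
    physical = base + addr
    if physical >= limit:
        # The slide's rule: program killed if $base+$addr >= $limit.
        raise ProtectionFault(f"address {physical} is not below limit {limit}")
    return physical


def load(memory, addr, base, limit):
    """Model of `Ld $dest,$addr`: the load actually reads memory[$base+$addr]."""
    return memory[translate(addr, base, limit)]


if __name__ == "__main__":
    memory = list(range(100))          # toy memory: word i holds the value i
    print(load(memory, 5, base=10, limit=20))
```

Every program sees addresses starting at zero, so the same binary can be placed anywhere in memory by changing only the base and limit registers.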

  8. OS and Front End
     • COS (CRAY OS) handles job scheduling, storage management (tapes!), other I/O, checkpointing
     • Packaged with CAL (assembler)
     • ...and CFT (Fortran compiler), more later
     • Command-line interface and job submission via separate front-end computer, e.g. VAX

  9. Vector Operation (Finally!)
     • 8 x 64-word V-registers
     • Vector Length Register
       • Indicates # ops performed by vector insns
       • Set from contents of an A-register
     • Vector Mask Register
       • Indicates which elements in vector to operate on
       • Set by vector test insns (e.g. VM[i] := ($Vk[i] == 0))
     • 6 Vector FUs
       • integer add, shift, bitwise logic
       • FP via scalar FPU: add, mult, reciprocal
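
A simplified model of how the Vector Length and Vector Mask registers steer a vector instruction might look like the sketch below. The function names are hypothetical, V-registers are modeled as 64-element Python lists, and the masked add is a generic illustration of "operate only where the mask is set", not a transcription of any particular CRAY-1 instruction.

```python
V_REGISTER_WORDS = 64  # each V-register holds 64 words


def vector_test_zero(vk, vl):
    """Build a mask as in the slide's example: VM[i] := (Vk[i] == 0)."""
    return [vk[i] == 0 for i in range(vl)]


def masked_vector_add(vdest, vj, vk, vl, vm):
    """Add the first `vl` elements of vj and vk where the mask is set.

    Elements whose mask bit is clear, and elements at index >= vl,
    keep the value already in vdest.
    """
    result = list(vdest)
    for i in range(vl):
        if vm[i]:
            result[i] = vj[i] + vk[i]
    return result


if __name__ == "__main__":
    vj = list(range(V_REGISTER_WORDS))
    vk = [0, 5, 0, 5] * (V_REGISTER_WORDS // 4)
    vdest = [-1] * V_REGISTER_WORDS
    vl = 4                                   # would be set from an A-register
    vm = vector_test_zero(vk, vl)
    print(masked_vector_add(vdest, vj, vk, vl, vm)[:6])
```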

  10. Vector Load/Store Architecture
     • Big departure from STAR 100: register-register ops
     • CRAY-1 memory bandwidth == 80 Mword/s == 1 word/cycle
     • If all 2-source insns are memory-memory, then IPC=1/3! (and that assumes no bank conflicts!)
     • Solution: the RISC approach
     • Combined with chaining (next), can sustain >> 1 flop/cycle
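
The 1/3 figure follows from counting memory words per operation: a two-source memory-memory operation moves three words (two reads, one write) through a memory system that delivers one word per cycle. A small sketch of that arithmetic, with a hypothetical helper name:

```python
from fractions import Fraction

CLOCK_HZ = 80_000_000          # 80 MHz clock, from the architecture slide
WORDS_PER_CYCLE = 1            # memory bandwidth: 1 word/cycle


def memory_bound_ops_per_cycle(sources, dests=1, words_per_cycle=WORDS_PER_CYCLE):
    """Upper bound on element operations per cycle when every operand
    comes from memory and every result goes back to memory."""
    return Fraction(words_per_cycle, sources + dests)


if __name__ == "__main__":
    print("bandwidth (words/s):", CLOCK_HZ * WORDS_PER_CYCLE)
    print("2-source memory-memory ops/cycle:", memory_bound_ops_per_cycle(2))
```

Register-register vector operations avoid this ceiling because operands already in V-registers do not consume memory bandwidth.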

  11. Chaining
     • Pipeline bypass meets vectors
     • Consider SAXPY vector expression a*X+Y
     • Slow approach: compute a*X (64 mults), then compute a*X+Y (64 adds)
       • Total latency: 128+mult latency+add latency
       • since, in CRAY-1, all FUs are pipelined
     • But... no fundamental serialization requirement
       • As soon as a*X[0] is computed, can compute a*X[0]+Y[0]
       • Total latency: 64+mult latency+add latency (speedup of almost 2)
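
The two latency formulas on this slide can be compared directly. In the sketch below the function names are hypothetical, and the pipeline latencies passed in the usage example (7 cycles for multiply, 6 for add) are illustrative inputs rather than figures taken from the slides.

```python
def unchained_latency(n, mult_latency, add_latency):
    """All n multiplies drain through the pipeline before the adds start."""
    return (n + mult_latency) + (n + add_latency)


def chained_latency(n, mult_latency, add_latency):
    """Each add starts as soon as its multiply result emerges."""
    return n + mult_latency + add_latency


if __name__ == "__main__":
    n, mult_lat, add_lat = 64, 7, 6   # latencies are assumed example values
    slow = unchained_latency(n, mult_lat, add_lat)
    fast = chained_latency(n, mult_lat, add_lat)
    print(f"unchained: {slow} cycles, chained: {fast} cycles, "
          f"speedup: {slow / fast:.2f}")
```

The speedup approaches 2 as the vector length grows relative to the pipeline latencies, which is why the slide says "almost 2".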

  12. Chaining Example
     • Assume: 8-element vectors, single-cycle ops
         mul.ds $v2,$v3,$s1
         add.d $v1,$v2,$v1
     • Without chaining (one column per cycle; adds wait for all multiplies):
         m m m m m m m m a a a a a a a a
     • With chaining (each add starts one cycle after its multiply):
         m m m m m m m m
           a a a a a a a a
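
The timing diagram for this example can be regenerated with a short script. This is a sketch under the slide's stated assumptions (8-element vectors, single-cycle ops); `timeline` is a hypothetical helper, and "." marks a cycle in which that functional unit is idle.

```python
def timeline(n, chained):
    """Return (mul_row, add_row, total_cycles) for an n-element mul then add.

    With single-cycle ops, a chained add for element i can issue one cycle
    after multiply i; without chaining the adds wait for all n multiplies.
    """
    offset = 1 if chained else n
    mul_row = " ".join(["m"] * n + ["."] * offset)
    add_row = " ".join(["."] * offset + ["a"] * n)
    return mul_row, add_row, n + offset


if __name__ == "__main__":
    for chained in (False, True):
        mul_row, add_row, cycles = timeline(8, chained)
        label = "with chaining" if chained else "without chaining"
        print(f"{label}: {cycles} cycles")
        print("  mul:", mul_row)
        print("  add:", add_row)
```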

  13. Vector Startup Times
     • For vector ops to be efficient enough to justify, startup overhead must be small
     • CRAY-1 can issue a vector insn every cycle, assuming no structural hazards on FUs
     • Result: vector performance > scalar performance for as few as four elements/vector

  14. Cray Fortran Compiler (CFT)
     • Important insight: hand-coding assembly sucks
     • The actual important insight: most vectorizable code is of the embarrassingly-parallel variety
     • Even with 1970s compiler technology, innermost-loop parallelism is low-hanging fruit
     • Exploit this: make the compiler do the heavy lifting
     • CFT is pretty good for branchless inner loops
     • ...but doesn't even attempt to vectorize code with IFs
     • So any use of the Vector Mask register must be hand-coded
     • Upshot: a good start, but not quite there

  15. Analysis
     • Extremely fast computer for 1976
     • Thought experiment: what if CRAY-1's parameters scaled with Moore's Law? (32 years == 21 doublings)
       • 200,000 transistors => 400 billion transistors
       • 8 MB main memory => 16 TB main memory
       • 80 MHz clock => petahertz? (if only)
     • For a (merely) 2nd-generation vector processor, the CRAY-1 was ahead of its time (I think)
     • I'm not the only one: it was commercially phenomenal
     • However, design techniques (discrete logic) are totally unscalable
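
The thought experiment's arithmetic can be checked with a few lines. The sketch takes the slide's 21 doublings as given (which implies roughly 18 months per doubling over 32 years) and treats "8 MB" as 8 × 2^20 bytes; the variable names are this example's own.

```python
DOUBLINGS = 21                      # from the slide: 32 years == 21 doublings
FACTOR = 2 ** DOUBLINGS             # overall growth factor

cray1 = {
    "transistors": 200_000,
    "memory_bytes": 8 * 2**20,      # 8 MB, taken as binary megabytes
    "clock_hz": 80_000_000,         # 80 MHz
}

scaled = {name: value * FACTOR for name, value in cray1.items()}

if __name__ == "__main__":
    print(f"growth factor: {FACTOR:,}")
    print(f"transistors:   {scaled['transistors']:,}")
    print(f"memory:        {scaled['memory_bytes'] // 2**40} TiB")
    print(f"clock:         {scaled['clock_hz'] / 1e12:.0f} THz")
```

Under these assumptions the transistor and memory figures land on the slide's numbers, while the scaled clock works out to roughly 168 THz, so "petahertz" is a generous rounding up.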

  16. Questions? Richard M. Russell Presented by Andrew Waterman ECE259 Spring 2008

