
CS718 : Data Parallel Processors




  1. CS718: Data Parallel Processors. 27th April, 2006. Anshul Kumar, CSE IITD

  2. Data Parallel Architectures
     • SIMD Processors: multiple processing elements driven by a single instruction stream
     • Associative Processors: SIMD-like processors with associative memory
     • Vector Processors: uniprocessors with vector instructions
     • Systolic Arrays: application-specific VLSI structures

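  3. SIMD: one of the earliest models of a parallel computer. [Figure: a control unit (C) issues a single instruction stream (IS) to the processors (P), each of which exchanges its own data stream (DS) with memory (M).]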

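  4. ILLIAC IV SIMD Model. [Figure: a control unit (CU) with I/O drives PE1, PE2, ..., PEn over a bus; each PE pairs a processor (P) with its own memory (M), and the PEs are linked by an interconnection network.] Planned for 64 x 4 PEs; only 64 were built.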

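  5. Burroughs Scientific Processor (BSP) Model. [Figure: a control unit (CU) with I/O drives processors P1, P2, ..., Pn over a bus; an interconnection network connects the processors (P) to memory modules (M) M1, M2, ..., Mk.]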

  6. SIMD algorithms: sum of vector elements a_0, a_1, ..., a_7
     step 1: S_i = a_i + a_{i+1}, i = 0, 2, 4, 6   (a_0+a_1, a_2+a_3, a_4+a_5, a_6+a_7)
     step 2: S_i = S_i + S_{i+2}, i = 0, 4         (a_0+a_1+a_2+a_3, a_4+a_5+a_6+a_7)
     step 3: S_i = S_i + S_{i+4}, i = 0            (a_0+a_1+a_2+a_3+a_4+a_5+a_6+a_7)
     OR
     step 1: S_i = a_i + a_{i+4}, i = 0, 1, 2, 3
     step 2: S_i = S_i + S_{i+2}, i = 0, 1
     step 3: S_i = S_i + S_{i+1}, i = 0
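The two schedules on this slide can be sketched in Python as follows. This is only an illustrative serial simulation: the function names are made up here, the inner `for` loop stands in for what the PEs would do simultaneously in one SIMD step, and the vector length is assumed to be a power of two, as in the slide's 8-element example.

```python
def sum_adjacent(a):
    """First schedule: add neighbours, doubling the stride each step."""
    s = list(a)
    n = len(s)
    stride, steps = 1, 0
    while stride < n:
        # All values of i in this loop would execute in one SIMD step.
        for i in range(0, n - stride, 2 * stride):
            s[i] += s[i + stride]
        stride *= 2
        steps += 1
    return s[0], steps


def sum_halving(a):
    """Second schedule: add the upper half onto the lower half each step."""
    s = list(a)
    half, steps = len(s) // 2, 0
    while half >= 1:
        # All values of i in this loop would execute in one SIMD step.
        for i in range(half):
            s[i] += s[i + half]
        half //= 2
        steps += 1
    return s[0], steps


if __name__ == "__main__":
    data = list(range(8))          # a_0 .. a_7
    print(sum_adjacent(data))      # (28, 3)
    print(sum_halving(data))       # (28, 3)
```

Both variants finish in log2(8) = 3 steps; they differ only in which PEs stay active and how far apart the operands are, which matters for the interconnection network.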

  7. No. of processors vs. time
     Adding vector elements:
     • n processors – log n steps
     • n/log n processors – log n steps
     Matrix multiplication:
     • n processors – n^2 steps
     • n^2 processors – n steps
     • n^3 processors – log n steps
     • n^3/log n processors – log n steps
     Important factors: data distribution, network
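A rough step-count model shows why n/log n processors still reach the order of log n steps for vector addition. The sketch below assumes a two-phase schedule that the slide does not spell out: each processor first sums its own block serially, then the partial sums are combined by a binary tree. The function name is hypothetical, and the counts ignore communication cost.

```python
import math


def tree_sum_steps(n, p):
    """Parallel addition steps for n numbers on p processors.

    Assumed schedule: each processor sums ceil(n/p) local elements
    serially, then the p partial sums are combined in a binary tree.
    """
    local = math.ceil(n / p) - 1                      # serial additions per processor
    tree = math.ceil(math.log2(p)) if p > 1 else 0    # tree combining steps
    return local + tree


if __name__ == "__main__":
    n = 1024
    for p in (1, n // int(math.log2(n)), n):
        print(p, tree_sum_steps(n, p))
```

For n = 1024 the model gives 10 steps on 1024 processors and 17 steps on 102 (about n/log n) processors, so the step count stays within a small constant factor of log n while using far fewer processors.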

  8. Rise and fall of SIMDs
     • Introduced in the 1960s (e.g. ILLIAC IV, BSP)
     • Problems:
       • not cost-effective
       • serial fraction and Amdahl's law
       • I/O bottleneck
     • Overshadowed by vector processors
     • Resurrected in the 1980s (MPP from Goodyear, Connection Machine from Thinking Machines Corp., MP-1 from MasPar)
     • Did not survive because of high cost

  9. Related ideas
     • Coarse-grain SIMD with off-the-shelf processors (synchronized MIMD), e.g. the CM-5 from Thinking Machines
     • This gave rise to SPMD (single program, multiple data)
     • MMX and SIMD instructions in the Pentium

  10. Vector Processors. [Figure: block diagram with I-cache, I-unit and control, D-cache, memory, memory control, address unit, GPRs, vector registers (V-reg), a functional unit (FU), and vector functional units (VFU), connected by buses.]

  11. Four Generations of CRAY systems (vector processors)

      System   CPUs   Clock (MHz)   Flops/clock/CPU   Words moved/clk/CPU   Mflops   Gates/chip
      CRAY-1      1            80                 2                     1       80            2
      X-MP        4           105                 2                     3      840           16
      Y-MP        8           166                 2                     3     2667         2500
      C90        16           240                 4                     6    15360        10000
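The Mflops column appears to be the peak rate, that is, CPUs x clock (MHz) x flops per clock per CPU. That relation is an assumption read off the numbers, not something the slide states, and the helper name below is made up for illustration.

```python
# (system, CPUs, clock in MHz, flops per clock per CPU, Mflops as listed)
ROWS = [
    ("CRAY-1", 1, 80, 2, 80),
    ("X-MP", 4, 105, 2, 840),
    ("Y-MP", 8, 166, 2, 2667),
    ("C90", 16, 240, 4, 15360),
]


def peak_mflops(cpus, clock_mhz, flops_per_clock):
    """Assumed relation: peak Mflops = CPUs * clock (MHz) * flops/clock/CPU."""
    return cpus * clock_mhz * flops_per_clock


if __name__ == "__main__":
    for name, cpus, mhz, fpc, listed in ROWS:
        print(f"{name}: computed {peak_mflops(cpus, mhz, fpc)}, listed {listed}")
```

The product matches the listed figure exactly for the X-MP and C90 and to within rounding of the clock for the Y-MP (2656 against 2667). The CRAY-1 row does not follow the relation as transcribed: the product is 160, while the table lists 80.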

  12. Cray History • http://www.cray.com/company/history.html

  13. CRAY C90
      • 8 GB central memory shared by 16 CPUs
      • 128 CPU-memory paths
      • word = 64 bits + 16 ECC
      • Dual vector pipes
      • 128-element segments
      • Memory: 8 sections, 8x8 subsections, 8x8x2 bank groups, 8x8x2x8 banks
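The memory hierarchy on this slide multiplies out to 8 x 8 x 2 x 8 = 1024 banks. The sketch below computes that count and shows one way consecutive word addresses could be spread over the hierarchy. The mapping is purely hypothetical (low-order interleaving, with the section index varying fastest); the slide does not say how the C90 actually assigns addresses to banks, and the function name is invented for this example.

```python
SECTIONS, SUBSECTIONS, GROUPS, BANKS = 8, 8, 2, 8
TOTAL_BANKS = SECTIONS * SUBSECTIONS * GROUPS * BANKS  # 8 * 8 * 2 * 8 = 1024


def locate(word_addr):
    """Hypothetical low-order interleaving of word addresses.

    Consecutive words go to consecutive sections first, then
    subsections, then bank groups, then banks within a group.
    """
    x = word_addr % TOTAL_BANKS
    section, x = x % SECTIONS, x // SECTIONS
    subsection, x = x % SUBSECTIONS, x // SUBSECTIONS
    group, bank = x % GROUPS, x // GROUPS
    return section, subsection, group, bank


if __name__ == "__main__":
    print(TOTAL_BANKS)
    for addr in (0, 1, 8, 64, 128):
        print(addr, locate(addr))
```

Under this assumed mapping, a stride-1 vector access touches 1024 distinct banks before revisiting any of them, which is the point of having so many banks behind a fast vector pipe.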

  14. Convex C4/XA system
      • CPU: 7.5 ns clock, 1620 MFLOPS
      • Memory: 32 MB x 32 banks, 64-bit word, 50 ns access time
      • 3 FP pipes, 2 results each
      • Crossbar between vector registers and FPUs
      • 1.1 GB/s per I/O port
      [Figure: 5 x 5 crossbar connecting CPUs, memories, I/O, and utilities.]

  15. Other examples
      • NEC SX-X: 4 CPUs, 4 x 2 pipes each
      • Fujitsu VP2000: 1-2 CPUs
      • Fujitsu VP5000: 7-222 CPUs, 2 load/store pipes, 3 functional pipes, 2 mask pipes

  16. Systolic Arrays (H.T. Kung 1978)
      • Simplicity, regularity, concurrency, communication
      • Example: band matrix multiplication

  17-23. [Figure sequence: snapshots of the systolic array at T=0 through T=6 during band matrix multiplication. Elements of A (A11 ... A53) and B (B11 ... B43) stream through the array; the first result, C11, appears in the T=5 snapshot and C12 in the T=6 snapshot.]
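The data flow in the snapshots above can be imitated in software. The sketch below is not Kung's hexagonal band-matrix array from the slides; it swaps in a simpler, commonly used variant, a square mesh in which each cell keeps one result element (output-stationary), rows of A enter from the left, and columns of B enter from the top, each skewed by one step per row or column. The function name and this cell layout are assumptions made for illustration, and the matrices are assumed square.

```python
def systolic_matmul(A, B):
    """Simulate an n x n output-stationary systolic mesh computing C = A * B.

    Cell (i, j) accumulates C[i][j]. Each step, every cell takes the A value
    from its left neighbour and the B value from the cell above, multiplies
    them, and adds the product to its accumulator.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    a_reg = [[0] * n for _ in range(n)]   # A value currently held by each cell
    b_reg = [[0] * n for _ in range(n)]   # B value currently held by each cell
    steps = 3 * n - 2                     # last product reaches cell (n-1, n-1)
    for t in range(steps):
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:
                    k = t - i             # row i of A is delayed by i steps
                    new_a[i][j] = A[i][k] if 0 <= k < n else 0
                else:
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:
                    k = t - j             # column j of B is delayed by j steps
                    new_b[i][j] = B[k][j] if 0 <= k < n else 0
                else:
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b       # all cells update synchronously
        for i in range(n):
            for j in range(n):
                C[i][j] += a_reg[i][j] * b_reg[i][j]
    return C, steps


if __name__ == "__main__":
    # Tridiagonal (band) matrices, chosen arbitrarily for the demonstration.
    A = [[1, 2, 0], [3, 4, 5], [0, 6, 7]]
    B = [[1, 2, 0], [3, 4, 5], [0, 6, 7]]
    C, steps = systolic_matmul(A, B)
    print(C, steps)
```

Each cell only ever talks to its immediate neighbours and performs the same multiply-accumulate every step, which reflects the simplicity and regularity the slides emphasize. A band-matrix array such as the one in the figures additionally avoids spending cells on the zero entries outside the band.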

  24. WARP: Programmable Systolic Processor [Kung, CMU 1987]
      Complete contrast to the original idea:
      • not application-specific
      • not a single VLSI chip
      • complex cell (pipelined FP adder, multiplier, FIFOs, RAM, crossbar)
      • linear
      • asynchronous

  25. References
      • D. Sima, T. Fountain, P. Kacsuk, "Advanced Computer Architectures: A Design Space Approach", Addison Wesley, 1997.
      • K. Hwang, "Advanced Computer Architecture: Parallelism, Scalability, Programmability", McGraw Hill, 1993.
