
Presentation Transcript


  1. “The Architecture of Massively Parallel Processor CP-PACS”, Taisuke Boku, Hiroshi Nakamura, et al., University of Tsukuba, Japan. Presented by Emre Tapcı

  2. Outline • Introduction • Specification of CP-PACS • Pseudo Vector Processor PVP-SW • Interconnection Network of CP-PACS • Hyper-crossbar Network • Remote DMA message transfer • Message broadcasting • Barrier synchronization • Performance Evaluation • Conclusion, References, Questions & Comments

  3. Introduction • CP-PACS: Computational Physics by Parallel Array Computer Systems • Goal: construct a dedicated MPP for computational physics, in particular the study of Quantum Chromodynamics (QCD) • Center for Computational Physics, University of Tsukuba, Japan

  4. Specification of CP-PACS • MIMD parallel processing system with distributed memory. • Each Processing Unit (PU) has a RISC processor and a local memory. • 2048 such PUs, connected by an interconnection network. • 128 I/O units that provide a distributed disk space.

  5. Specification of CP-PACS

  6. Specification of CP-PACS • Theoretical performance • To solve problems such as QCD and astro-fluid dynamics, a great number of PUs is required. • For budget and reliability reasons, the number of PUs is limited to 2048.

  7. Specification of CP-PACS • Node processor • Improve the performance of the node processor first. • Caches do not work efficiently on ordinary RISC processors for these applications. • A new technique for the cache function is introduced: PVP-SW

  8. Specification of CP-PACS • Interconnection Network • 3-dimensional Hyper-Crossbar (3-D HXB) • Peak throughput of a single link: 300 MB/sec • Provides • Hardware message broadcasting • Block-stride message transfer • Barrier synchronization
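
  The sketch below (plain C) is only meant to illustrate what a block-stride message transfer describes: count blocks of block_len bytes, spaced stride bytes apart in local memory, gathered into one contiguous message. On CP-PACS this gather is performed by the network hardware; the pack_block_stride helper here is hypothetical and not part of any CP-PACS interface.

#include <stddef.h>
#include <string.h>

/*
 * Block-stride transfer descriptor: "count" blocks of "block_len"
 * bytes, each separated by "stride" bytes in local memory, gathered
 * into one contiguous message.  On CP-PACS the hardware does this
 * gather; here it is sketched in software only to show the pattern.
 */
struct block_stride {
    const char *base;   /* start of the first block          */
    size_t block_len;   /* bytes in each block               */
    size_t stride;      /* distance between block starts     */
    size_t count;       /* number of blocks                  */
};

static void pack_block_stride(const struct block_stride *d, char *msg)
{
    for (size_t i = 0; i < d->count; i++)
        memcpy(msg + i * d->block_len,   /* contiguous message */
               d->base + i * d->stride,  /* strided source     */
               d->block_len);
}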

  9. Specification of CP-PACS • I/O system • 128 I/O units, each equipped with a RAID-5 hard disk system. • 528 GB of total system disk space. • The RAID-5 system increases fault tolerance.

  10. Pseudo Vector Processor PVP-SW • MPPs require high-performance node processors. • A node processor cannot achieve high performance unless the cache system works efficiently. • Little temporal locality exists in these applications. • The data space of the application is much larger than the cache size.

  11. Pseudo Vector Processor PVP-SW • Vector processors: • Main memory access is pipelined. • The vector length of load/store is long. • Load/store is executed in parallel with arithmetic execution. • We require these properties in our node processor. • PVP-SW is introduced. • It is pseudo-vector: it provides vector-style processing on a scalar RISC node processor.

  12. Pseudo Vector Processor PVP-SW • The number of registers cannot simply be increased, because the register field in the instruction format is limited. • So a new technique, Slide-Windowed Registers, is introduced.

  13. Pseudo Vector Processor PVP-SW • Slide-Windowed Registers • The physical registers are organized as logical windows; a window consists of 32 registers. • The total number of physical registers is 128. • Two kinds of registers: global registers & window registers. • Global registers are static and shared by all windows. • Window (local) registers are not shared. • Only one window is active at any given time.
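
  A minimal C model of the register organization described on this slide: 128 physical registers, a logical window of 32 registers, and a window pointer that selects where the window maps. The slide does not give the split between global and window registers or the exact mapping rule, so NGLOBAL and the modular mapping below are illustrative assumptions, not the CP-PACS design.

#include <stdio.h>

#define NPHYS   128   /* total physical registers              */
#define NWINDOW 32    /* registers visible in one window       */
#define NGLOBAL 8     /* assumed number of shared registers    */

static double phys[NPHYS];  /* the physical register file      */
static int fw_stp = 0;      /* window pointer, slid by software */

/* Map a logical register number (0..31) to a physical register. */
static double *reg(int logical)
{
    if (logical < NGLOBAL)          /* global: same for every window */
        return &phys[logical];
    /* window registers wrap around the non-global part of the file */
    int idx = NGLOBAL +
              (fw_stp + logical - NGLOBAL) % (NPHYS - NGLOBAL);
    return &phys[idx];
}

int main(void)
{
    *reg(10) = 3.14;               /* write through the current window   */
    fw_stp += NWINDOW - NGLOBAL;   /* "slide" the window forward         */
    printf("%f\n", *reg(10));      /* logical reg 10 now maps elsewhere  */
    return 0;
}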

  14. Pseudo Vector Processor PVP-SW • Slide-Windowed Registers • The active window is identified by a pointer, FW-STP. • New instructions are introduced to deal with FW-STP: • FWSTPSet: sets a new location for FW-STP. • FRPreload: loads data from memory into a window. • FRPoststore: stores data from a window into memory.
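
  A conceptual sketch, in C, of how a loop could be pseudo-vectorized with these instructions: while arithmetic runs on the registers of the current window, preloads for the next strip and poststores for the previous one proceed in parallel, hiding main-memory latency without a cache. The FWSTPSet/FRPreload/FRPoststore calls appear only as comments; they stand in for machine instructions, are not real compiler intrinsics, and the strip length is illustrative.

/*
 * Sketch of PVP-SW style software pipelining for y = a*x + y.
 * The commented lines mark where the window-pointer update,
 * preloads and poststores would be issued by the compiler.
 */
void daxpy_pvp(int n, double a, const double *x, double *y)
{
    enum { STRIP = 16 };               /* elements per register window */

    for (int i = 0; i < n; i += STRIP) {
        /* FWSTPSet   -> advance FW-STP to the next window            */
        /* FRPreload  -> load x[i+STRIP .. ] into the next window     */
        /* FRPreload  -> load y[i+STRIP .. ] into the next window     */

        /* arithmetic on the current window overlaps those loads */
        int m = (n - i < STRIP) ? n - i : STRIP;
        for (int j = 0; j < m; j++)
            y[i + j] = a * x[i + j] + y[i + j];

        /* FRPoststore -> store y[i .. ] behind the arithmetic unit    */
    }
}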

  15. Pseudo Vector Processor PVP-SW

  16. Interconnection Network of CP-PACS • The topology is a Hyper-Crossbar Network (HXB). • 8 x 17 x 16 = 2176 nodes: 2048 PUs and 128 I/O units. • Along each dimension, the nodes are interconnected by a crossbar. • For example, on the Y dimension a Y x Y crossbar is used. • Routing is simple: route along the three dimensions consecutively. • Wormhole routing is employed.
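
  A small C sketch of the routing rule stated above: each dimension is resolved by one crossbar hop, X first, then Y, then Z, so any destination is reached in at most three hops. The node/coordinate representation is illustrative and does not model the actual router hardware.

#include <stdio.h>

/* A node address in the 3-D hyper-crossbar: one coordinate per axis. */
struct node { int x, y, z; };

/* Dimension-order routing: resolve X, then Y, then Z, one crossbar
 * hop per differing coordinate. */
static void route(struct node src, struct node dst)
{
    struct node cur = src;

    if (cur.x != dst.x) {          /* one hop through the X crossbar */
        cur.x = dst.x;
        printf("X hop -> (%d,%d,%d)\n", cur.x, cur.y, cur.z);
    }
    if (cur.y != dst.y) {          /* one hop through the Y crossbar */
        cur.y = dst.y;
        printf("Y hop -> (%d,%d,%d)\n", cur.x, cur.y, cur.z);
    }
    if (cur.z != dst.z) {          /* one hop through the Z crossbar */
        cur.z = dst.z;
        printf("Z hop -> (%d,%d,%d)\n", cur.x, cur.y, cur.z);
    }
}

int main(void)
{
    route((struct node){1, 2, 3}, (struct node){7, 2, 15});
    return 0;
}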

  17. Interconnection Network of CP-PACS • Wormhole routing & the HXB together have these properties: • Small network diameter. • A torus of the same size can be simulated. • Message broadcasting by hardware. • A binary hypercube can be emulated. • Throughput even under random transfer is high.

  18. Interconnection Network of CP-PACS • Remote DMA transfer • Making a system call to the OS and copying data into the OS area is costly. • Instead, access the remote node's memory directly. • Remote DMA is good because: • Mode switching (kernel/user mode) is avoided. • Redundant data copying (user ↔ kernel space) is not done.
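
  A toy C model of the idea: with remote DMA, a sender writes straight into the destination node's user memory, with no system call and no intermediate kernel buffer on the fast path. Node memories are modelled as plain arrays, and rdma_put is a hypothetical primitive for illustration, not the real CP-PACS interface.

#include <string.h>
#include <stdio.h>

enum { NODES = 4, MEM_WORDS = 1024 };

static double node_mem[NODES][MEM_WORDS];   /* one "local memory" per PU */

/* Copy len words from local user memory straight into dst_node's
 * memory at dst_off: one copy, no kernel buffer in this model. */
static void rdma_put(int dst_node, size_t dst_off,
                     const double *src, size_t len)
{
    memcpy(&node_mem[dst_node][dst_off], src, len * sizeof(double));
}

int main(void)
{
    double local[4] = {1.0, 2.0, 3.0, 4.0};
    rdma_put(2, 100, local, 4);             /* node 0 writes into node 2 */
    printf("%f\n", node_mem[2][101]);       /* prints 2.000000           */
    return 0;
}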

  19. Interconnection Network of CP-PACS • Message Broadcasting • Supported by hardware. • First, broadcast along one dimension. • Then broadcast along the other dimensions. • Hardware mechanisms are present to prevent deadlock when two nodes broadcast at the same time. • Hardware partitioning is possible: • a broadcast message is sent only to the nodes in the sender's partition.
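
  The simulation below (plain C, toy 4 x 4 x 4 machine) shows the dimension-by-dimension scheme described above: one crossbar broadcast spreads the message along X, a second phase along Y, a third along Z, after which every node holds the message. It models only the spreading pattern, not the actual broadcast hardware or its deadlock-avoidance mechanism.

#include <stdbool.h>
#include <stdio.h>

enum { NX = 4, NY = 4, NZ = 4 };
static bool has_msg[NX][NY][NZ];   /* which nodes hold the message */

static void broadcast_from(int sx, int sy, int sz)
{
    has_msg[sx][sy][sz] = true;

    for (int x = 0; x < NX; x++)                 /* phase 1: X crossbar  */
        has_msg[x][sy][sz] = true;

    for (int x = 0; x < NX; x++)                 /* phase 2: Y crossbars */
        for (int y = 0; y < NY; y++)
            if (has_msg[x][sy][sz])
                has_msg[x][y][sz] = true;

    for (int x = 0; x < NX; x++)                 /* phase 3: Z crossbars */
        for (int y = 0; y < NY; y++)
            for (int z = 0; z < NZ; z++)
                has_msg[x][y][z] = has_msg[x][y][sz];
}

int main(void)
{
    broadcast_from(1, 2, 3);
    printf("%d\n", has_msg[3][0][0]);   /* 1: every node got the message */
    return 0;
}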

  20. Interconnection Network of CP-PACS • Barrier Synchronization • A synchronization mechanism is required for inter-processor communication. • CP-PACS provides a hardware barrier synchronization facility. • It makes use of special synchronization packets, distinct from the usual data packets. • Barrier synchronization is also available within hardware partitions of the network.
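
  For comparison, the sketch below shows the semantics a barrier provides, using an ordinary software barrier (POSIX threads): no participant proceeds past the wait until all of them have reached it. On CP-PACS the same semantics are implemented in the network hardware with dedicated synchronization packets; this code illustrates only the semantics, not that mechanism.

#include <pthread.h>
#include <stdio.h>

enum { N = 4 };                       /* number of participants */
static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    long id = (long)arg;
    printf("PU %ld: computation phase done\n", id);
    pthread_barrier_wait(&barrier);   /* block until all N arrive */
    printf("PU %ld: all PUs synchronized\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t[N];
    pthread_barrier_init(&barrier, NULL, N);
    for (long i = 0; i < N; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}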

  21. Performance Evaluation • Based on the LINPACK benchmark. • LU decomposition of a matrix. • The outer-product method is used, based on a 2-dimensional block-cyclic distribution. • All floating-point and load/store operations are performed in the PVP-SW manner.
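
  As a sketch of the data layout mentioned above, the C snippet below computes which process in a P x Q grid owns a given matrix element under a 2-D block-cyclic distribution: the matrix is cut into NB x NB blocks and block (bi, bj) is owned by process (bi mod P, bj mod Q). The grid and block sizes are illustrative, not the values used in the CP-PACS LINPACK run.

#include <stdio.h>

enum { P = 32, Q = 64, NB = 64 };   /* process grid and block size */

/* Owner of matrix element (i, j) under 2-D block-cyclic distribution. */
static void owner_of(int i, int j, int *prow, int *pcol)
{
    int bi = i / NB, bj = j / NB;   /* block indices of element (i,j) */
    *prow = bi % P;                 /* cyclic over process rows       */
    *pcol = bj % Q;                 /* cyclic over process columns    */
}

int main(void)
{
    int pr, pc;
    owner_of(10000, 20000, &pr, &pc);
    printf("element (10000,20000) lives on process (%d,%d)\n", pr, pc);
    return 0;
}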

  22. Performance Evaluation

  23. Performance Evaluation

  24. Performance Evaluation

  25. Conclusion • CP-PACS is operational at the University of Tsukuba. • It is being used for large-scale QCD calculations. • Sponsored by Hitachi Ltd. and a Grant-in-Aid of the Ministry of Education, Science and Culture, Japan.

  26. References • T. Boku, H. Nakamura, K. Nakazawa, Y. Iwasaki, “The Architecture of Massively Parallel Processor CP-PACS”, Institute of Information Sciences and Electronics, University of Tsukuba.

  27. Questions & Comments
