From EARTH to HTMT: The Evolution of a Multithreaded Architecture Model

Guang R. Gao
Computer Architecture & Parallel Systems Laboratory (CAPSL)
University of Delaware
\Seminar\Spain-00-01

Outline
• Introduction
• The EARTH Execution and Architecture Model
• The EARTH Programming Model and Threaded-C
• Application Studies and Performance Evaluation
• Related Work and Conclusions

Main Challenges: High-Performance Parallel Systems
• Scalable for both Class A and Class B applications

Challenges: The “Killer Latency Problem”
[Figure: nodes, each with P, C, M and NI components, connected through a network]
Latency due to:
- Communication
- Synchronization
- Task spawning
- Load balancing
The SP2 is hard enough; PC clusters are much worse!

Meeting High-End Application Challenges:
• Observation I: Many such applications have “bad latencies”, demanding good support for adaptive fine-grain parallelism [Petaflop-2 Conference, 99-2]

Here Comes the Surprise! [Theobald’s Ph.D. thesis, May 1999]
• Observation II: It is not necessarily too hard to “generate” and “program” fine-grain threads! However, it may be hard to statically group them into coarse-grain threads!

A Base Adaptive Fine-Grain Multithreaded Execution Model
• C1 (Abundance): a very large pool of threads
• C2 (Ultra-lightweight): threads can be spawned as easily and as quickly as possible
• C3 (Mobility): threads can be migrated adaptively, as easily and as quickly as possible

Motivation of the EARTH Project
• How to exploit fine-grain multithreading on a parallel system, given off-the-shelf microprocessors

Two Types of Fine-Grain Threads
• A parallel function invocation
• Strand/Fiber: a function body can be divided into several “strands/fibers”

Threads and Fibers
[Figure labels: fiber within a frame; parallel function invocation; call a procedure; SYNC ops]
• A fiber becomes enabled once it has received all of its input signals
• An enabled fiber may be selected for execution when the required hardware resource has been allocated
• After a fiber finishes execution, a signal is sent to each destination fiber to update the corresponding sync slot
Note: the role of the strand!

The Execution Model of Fibers
[Figure: fibers annotated with numeric counts, exchanging signal tokens]
• Dependence-driven firing rule for fibers
• A fiber is atomic and ultra-lightweight
• Relation to the dataflow model (Dennis72)
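The dependence-driven firing rule can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the EARTH runtime: the `Fiber` class, the `run` scheduler and the FIFO ready queue are hypothetical names and choices introduced for illustration. Each fiber carries a sync slot counter; a signal decrements it, the fiber becomes enabled when the counter reaches zero, and a finished fiber signals its destination fibers.

```python
from collections import deque


class Fiber:
    """Hypothetical stand-in for a fiber: a name plus a sync slot counter."""

    def __init__(self, name, sync_count, targets=()):
        self.name = name
        self.sync_count = sync_count   # signals still missing before the fiber is enabled
        self.targets = tuple(targets)  # fibers signalled when this one finishes


def run(fibers, initial_signals):
    """Deliver signals and fire enabled fibers until none remain; return the firing order."""
    ready = deque()  # enabled fibers waiting for an execution resource (FIFO is an assumption)
    fired = []

    def signal(name):
        fiber = fibers[name]
        fiber.sync_count -= 1
        if fiber.sync_count == 0:      # all input signals received: the fiber is enabled
            ready.append(fiber)

    for name in initial_signals:
        signal(name)
    while ready:
        fiber = ready.popleft()        # select one enabled fiber
        fired.append(fiber.name)       # "execute" it to completion, without preemption
        for target in fiber.targets:   # on completion, update the destination sync slots
            signal(target)
    return fired


def demo():
    fibers = {
        "a": Fiber("a", 1, targets=["c"]),
        "b": Fiber("b", 1, targets=["c"]),
        "c": Fiber("c", 2, targets=["d"]),  # needs signals from both a and b
        "d": Fiber("d", 1),
    }
    return run(fibers, ["a", "b"])


if __name__ == "__main__":
    print(demo())
```

In this toy graph, fiber `c` fires only after both `a` and `b` have signalled it, so the firing order is `['a', 'b', 'c', 'd']`. A fiber that has received only some of its signals never enters the ready queue.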

The Threaded C Language
• Threaded C = ANSI C + extensions for multithreading
• Extensions include:
- Threaded functions
- Threaded synchronization
- Support for global addresses
- Data transfer primitives
• Threaded C is:
- The “instruction set” of the EARTH processor
- A target language for high-level compilers
[Figure labels: C; FORTRAN; high-level language translation; users; Threaded C; Threaded C compiler; EARTH platforms]


An Evolutionary Path for EARTH
[Figure labels: CPU, SU, LINK; SU-int; SU-ext; CPU / SU; MANNA-dual/spn; parallel machines, PC clusters, ...; SEMi simulation platform (Theobald99)]

Platforms for EARTH
• MANNA:
- MANNA is an architecture testbed from GMD
- A benchmarking platform for fine-grain multithreading
• EARTH-SP2
• EARTH-Beowulf (Linux based)
• EARTH-SUN/SMP/Cluster

Unique Advantages of the EARTH-MANNA Platform
• We can push the OS completely out of the way!
• We can design the EARTH runtime system from a very low level up
• The invaluable experience and lessons learned from EARTH-MANNA are essential for the successful migration of the EARTH model to other platforms (e.g. the IBM SP-2 story)

Summary of Recent Experimental Results (Kevin99)
• Impressive speedup and scalability (scalable even for fine-grain parallel programs with high overhead, e.g. fib)
• Enhanced programmability (N-queen-p example)
• Broad applicability

Experiments
• Example 1 (assorted benchmarks): fib, nqueen, paraffin, tomcatv, matrix-multiply, etc.
• Example 2: Adaptive unstructured grids
• Example 3: Wavelet computation

Performance of N-Queens(12) [Theobald99]
• 117.8-fold speedup on a 120-node simulation!
• 1,637,099 tokens are generated!
• On average, 30+ tokens are maintained per processor
• N-Queens is a useful HTMT benchmark after all! (Phil Murkey)

Coarse-Grain Applications
• A 116-fold speedup on a 120-node machine is achieved for Cannon’s matrix multiply algorithm!
• Deep software systolic-style implementation to exploit parallelism
• Fine-grain mechanisms
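To put the quoted speedups in perspective, they can be converted to parallel efficiency, that is, speedup divided by node count. The sketch below uses only the figures stated on the slides for N-Queens(12) and Cannon's matrix multiply; the function and variable names are illustrative, not part of any EARTH tool.

```python
def parallel_efficiency(speedup, nodes):
    """Fraction of ideal linear speedup achieved on `nodes` processors."""
    return speedup / nodes


# Figures quoted on the slides: (speedup, node count)
QUOTED = {
    "N-Queens(12)": (117.8, 120),
    "Cannon matrix multiply": (116.0, 120),
}


def report():
    """Return efficiency in percent, rounded to one decimal place."""
    return {
        name: round(100 * parallel_efficiency(speedup, nodes), 1)
        for name, (speedup, nodes) in QUOTED.items()
    }


if __name__ == "__main__":
    for name, percent in report().items():
        print(f"{name}: {percent}% of linear speedup")
```

This works out to roughly 98.2% efficiency for N-Queens(12) and 96.7% for Cannon's matrix multiply on 120 nodes.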

Example 2 --- Adaptive Unstructured Mesh Computation
[Figure: flowchart with the stages Initialization, Partitioning, Mapping, Execution, Solution, Repartitioning, Remapping and Finalization, and the decision points “Adapt?”, “Balanced?” and “Expensive?”]
Observation
• The critical part of the framework is mesh adaptation and load balancing
• The partitioning problem is in better shape; the remapping problem remains open

[Figure: two panels, “The Initial Picture” and “The Mapping After a Few Iterations”, each showing Node 0, Node 1, ..., Node N]

Initial Results
• About 3000 lines of Threaded-C code
• Migration >= 70% (good)
• Unbiased variance = 3 - 5% (very good)
• A good speedup on EARTH-MANNA has been observed

Example 3 --- Adaptive Wavelet Transformation
• The load evolution pattern changes dynamically but is statically predictable
• Adaptive load redistribution/grouping is needed
• Mapping onto EARTH [IPPS99]

HTMT Facility (Perspective)

HTMT Architecture
[Figure: block diagram with SPELLs, SPIM and DPIM components]

Extensions to the Current EARTH Model
• Percolation Model
• Memory Model: Location Consistency
• Load balancing and percolation

HTMT Percolation Model
[Figure labels: cryogenic area; SCP; execution unit; split-phase synchronization to SRAM; CRAM; DMA (start, done); A-Pool; I-Pool; T-Pool; D-Pool; parcel dispatcher & dispenser; parcel assembly & disassembly; parcel invocation & termination; run time system; SRAM-PIM]

The System Software Architecture
[Figure labels: applications; high-level languages (e.g. parallel C); high-level language compiler; HTMT-C/Threaded-C; Threaded-C compiler and tool set; performance models; PXM interface; RTS; OS; hardware architectures; Threaded-C compiler - RTS interface; RTS - hardware architecture interface; RTS - OS interface]
Note:
• The Threaded-C compiler has part of its functions embedded in the RTS
• The RTS will work with the architecture and OS layers to provide the PXM interface
• The performance models are defined across all layers

Evolution of Multithreaded Architecture Models
Non-dataflow based:
• CHoPP’77, CHoPP’87
• MASA (Halstead, 1986)
• Alewife (Agarwal, 1989-96)
• XMT (Vishkin)
• HEP (B. Smith, 1978)
• Tera (B. Smith, 1990-)
• CDC 6600 (1964)
• J-Machine (Dally, 1988-93)
• M-Machine (Dally, 1994-98)
• Flynn’s Processor (1969)
• Cosmic Cube (Seitz, 1985)
• Others: Multiscalar (1994), SMT (1995), etc.
Dataflow model inspired:
• Monsoon (Papadopoulos & Culler, 1988)
• P-RISC (Nikhil & Arvind, 1989)
• *T/Start-NG (MIT/Motorola, 1991-)
• MIT TTDA (Arvind, 1980)
• Cilk (Leiserson)
• TAM (Culler, 1990)
• Iannucci’s (1988-92)
• Static Dataflow (Dennis, 1972, MIT)
• EM-5/4/X, RWC-1 (1992-97)
• Manchester (Gurd & Watson, 1982)
• SIGMA-I (Shimada, 1988)
• Arg-Fetching Dataflow (Dennis & Gao, 1987-88)
• MTA (Hum, Theobald & Gao, 94)
• EARTH (PACT’95, ISCA’96, Theobald99)
• MDFA (Gao, 1989-93)

Acknowledgement (Incomplete List)
• J. Nelson Amaral
• Erik Altman
• Haiying Cai
• Nasser Elmasri
• Gerd Heber
• Laurie J. Hendren
• Herbert Hum
• Alberto Jimenez
• Prasad Kakulavarapu
• Cheng Li
• Olivier Maquelin
• Andres Marquez
• Shashank Nemawarkar
• Zach Ruiz
• Sean Ryan
• V.C. Sreedhar
• Xinan Tang
• Kevin Theobald
• Ruppa Thulasiram
• Parimala Thulasiraman
• Xinmin Tian
• Yingchun Zhu
NSERC, FCAR, DARPA, NSA, NSF, NASA
