
The Tera Computer System

The Tera Computer System. Robert Alverson, David Callahan, Daniel Cummings, Brian Koblenz, Allan Porterfield, Burton Smith. Tera Computer Company, Seattle, Washington, USA. ICS '90: Proceedings of the 4th International Conference on Supercomputing.



  1. The Tera Computer System. Robert Alverson, David Callahan, Daniel Cummings, Brian Koblenz, Allan Porterfield, Burton Smith. Tera Computer Company, Seattle, Washington, USA. ICS '90: Proceedings of the 4th International Conference on Supercomputing. Presented by Ran Manevich, Computer Architecture and Parallel Systems seminar (236604), Spring 2012.

  2. Tera Computer Company • 1972 – Seymour Cray founds Cray Research, Inc. • 1976 – Cray-1 – 250 MFlops, 1 MB memory • 1987 – James Rottsolk and Burton Smith found Tera Computer Company • 2000 – Tera acquires Cray Research's assets and becomes Cray, Inc. Seymour Cray standing next to the core unit of the Cray-1 computer, circa 1974

  3. Tera Computer Company (Cray)

  4. Tera Computer Company (Cray) • Jaguar – world #3 – 224,162 cores, 1,759 TFlops, 6,950 kW. Oak Ridge National Lab, US.

  5. Tera Computer System • A shared-memory MIMD supercomputer introduced around 1990. • Resources: • 256 processors • 512 memory units • 256 I/O cache units • 256 I/O processors

  6. Interconnection Network • Pipelined, packet-switched nodes (routers). • A packet consists of source and destination addresses, an opcode, and 64 bits of data (164 bits total*). • Each link can transport a packet in both directions on a single clock cycle (i.e., single-flit packets). • *George Davison, Constantine Pavlakos, Claudio Silva. Final Report for the Tera Computer TTI CRADA. Sandia National Labs Report SAND97-0134, January 1997.
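As a rough illustration, a single-flit packet can be modeled as one 164-bit integer. The slide fixes only the total width and the 64-bit data field; the 44/44/12 split of the remaining bits between source address, destination address, and opcode below is purely an assumption for the sketch.

```python
# Sketch of packing a Tera network packet into a single 164-bit flit.
# Only the 164-bit total and 64-bit data field come from the slide;
# the 44/44/12 field split is an illustrative assumption.
SRC_BITS, DST_BITS, OP_BITS, DATA_BITS = 44, 44, 12, 64

def pack_packet(src: int, dst: int, opcode: int, data: int) -> int:
    """Concatenate the fields into one integer, source in the high bits."""
    assert src < 1 << SRC_BITS and dst < 1 << DST_BITS
    assert opcode < 1 << OP_BITS and data < 1 << DATA_BITS
    word = src
    word = (word << DST_BITS) | dst
    word = (word << OP_BITS) | opcode
    word = (word << DATA_BITS) | data
    return word

def unpack_packet(word: int):
    """Recover (src, dst, opcode, data) from a packed flit."""
    data = word & ((1 << DATA_BITS) - 1); word >>= DATA_BITS
    op = word & ((1 << OP_BITS) - 1); word >>= OP_BITS
    dst = word & ((1 << DST_BITS) - 1); word >>= DST_BITS
    return word, dst, op, data

p = pack_packet(7, 1234, 3, 0xDEADBEEF)
assert unpack_packet(p) == (7, 1234, 3, 0xDEADBEEF)
assert SRC_BITS + DST_BITS + OP_BITS + DATA_BITS == 164
```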

  7. Interconnection Network • 3D 16×16×16 torus:

  8. Interconnection Network • 1,280 of the 4,096 routers are attached to resources (256 processors + 512 memory units + 256 I/O caches + 256 I/O processors). • X links and Y links are missing on alternate Z layers in order to speed up router performance. • This reduces the router crossbar degree from 6 to 4 in routers without an attached resource, and from 7 to 5 in routers with one. • Resources are distributed homogeneously across the layers, reducing the average communication distance.
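The degree reduction above can be sketched directly. In this toy model every router keeps its ±Z links, while X links exist only on alternate Z layers and Y links only on the others; which parity carries which links is my assumption, since the slide only says they alternate.

```python
# Sketch of the link pattern in the 16x16x16 Tera torus: +/-Z links are
# always present, X links only on even Z layers, Y links only on odd
# ones (the parity choice is an assumption).
N = 16

def network_links(z: int):
    links = ["+Z", "-Z"]                           # always present
    links += ["+X", "-X"] if z % 2 == 0 else ["+Y", "-Y"]
    return links

# Every router drives a 4-way network crossbar (5-way with a resource port):
assert all(len(network_links(z)) == 4 for z in range(N))
assert len(network_links(0)) + 1 == 5              # router hosting a resource

# 1,280 of the 4,096 routers host a resource:
assert N ** 3 == 4096
assert 256 + 512 + 256 + 256 == 1280
```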

  9. Interconnection Network • Odd Z layers:

  10. Interconnection Network • Even Z layers:

  11. Data Memory • 512 data memory units of 128 MB each. • Total: 64 GB. • Memory is byte-addressable and organized in 64-bit words. • Four additional access state bits per word: • 2 trap bits • 1 invisible indirect-addressing bit • 1 full/empty bit for synchronization • Additional code bits provide single-error correction and double-error detection, separately for data and access state.
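A tagged memory word like this can be modeled as data plus four state bits. The class and the load helper below are illustrative (names are mine, ECC bits omitted); the point is that traps and the invisible indirection are checked on every access, not by the program.

```python
# Sketch of a Tera data-memory word: 64 bits of data plus the four
# access state bits named on the slide (2 trap bits, 1 invisible
# indirect-addressing bit, 1 full/empty bit). ECC code bits are
# omitted and all names are illustrative.
from dataclasses import dataclass

@dataclass
class MemoryWord:
    data: int = 0          # the 64-bit word itself
    trap0: bool = False    # trap bit 0: an access can raise a trap
    trap1: bool = False    # trap bit 1
    forward: bool = False  # data holds the address of the real word
    full: bool = False     # full/empty bit used for synchronization

def load(mem: dict, addr: int) -> int:
    """Illustrative load that honors the trap and indirection bits."""
    w = mem[addr]
    if w.trap0 or w.trap1:
        raise RuntimeError("access-state trap")
    if w.forward:          # invisibly follow the indirection
        return load(mem, w.data)
    return w.data

mem = {0: MemoryWord(data=8, forward=True), 8: MemoryWord(data=42)}
assert load(mem, 0) == 42  # the indirection is transparent to the program
```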

  12. Data Memory • Virtual addresses are randomized to avoid memory hotspots. • Randomization for each processor can be limited to a subset of the 512 memory units to exploit physical locality.
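The idea can be sketched with a toy scrambling function: hash the virtual address and use the result to pick a memory unit, optionally from a restricted subset. The multiplicative hash below is my stand-in, not Tera's actual (hidden) randomization function.

```python
# Sketch of virtual-address randomization across the 512 memory units.
# The hash is a stand-in (Fibonacci hashing), not Tera's real function;
# it only demonstrates the idea of scattering addresses.
UNITS = 512

def memory_unit(vaddr: int, allowed_units: int = UNITS) -> int:
    h = (vaddr * 0x9E3779B97F4A7C15) & (2**64 - 1)  # 64-bit multiply-hash
    return (h >> 55) % allowed_units                # top 9 bits -> 0..511

# Sequential word addresses scatter across units instead of hammering one:
units = {memory_unit(8 * i) for i in range(1000)}
assert len(units) > 100

# Restricting a processor to a 64-unit subset keeps references local:
assert all(memory_unit(8 * i, 64) < 64 for i in range(1000))
```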

  13. Data Memory - Synchronization • 4 Types of load/store access control for hardware based synchronization:
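The full/empty bit enables word-level producer/consumer synchronization in hardware. The sketch below assumes the usual MTA-style semantics: a synchronizing read waits until the word is full, returns the value, and marks it empty; a synchronizing write waits until it is empty, stores, and marks it full. The retry loop stands in for the hardware's repeated attempts before the protection domain's retry limit forces a trap.

```python
# Software sketch of full/empty-bit synchronization. The blocking
# semantics assumed here (read-when-full/set-empty, write-when-empty/
# set-full) are the common MTA-style interpretation, and the retry
# limit mirrors the protection-domain trap described on a later slide.
class SyncWord:
    def __init__(self):
        self.value, self.full = 0, False

    def sync_write(self, v: int, retry_limit: int = 100):
        for _ in range(retry_limit):
            if not self.full:              # wait for the word to be empty
                self.value, self.full = v, True
                return
        raise RuntimeError("retry limit exceeded")  # would trap in hardware

    def sync_read(self, retry_limit: int = 100) -> int:
        for _ in range(retry_limit):
            if self.full:                  # wait for the word to be full
                self.full = False
                return self.value
        raise RuntimeError("retry limit exceeded")  # would trap in hardware

w = SyncWord()
w.sync_write(7)
assert w.sync_read() == 7
assert not w.full          # the read consumed the value
```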

  14. I/O Caches • “Disk speeds have not kept pace with advances in processor and memory performance in recent years.” • The Tera system needs up to 70 GB/s of sustained bandwidth between data memory and secondary storage (e.g., magnetic disks). • This bandwidth is supplied by 256 directly addressable I/O caches of 1 GB each (256 GB total). • I/O cache units are functionally identical to data memory units, but slower. • Each processor fetches its instructions from a neighboring I/O cache unit.

  15. Processors • 256 processors. • Each processor can execute up to 128 instruction streams (i.e., threads) simultaneously. • Every clock tick, one of the streams in the “ready” state is allowed to issue an instruction.

  16. Processors • If there are enough streams, execution latency (70 ticks on average) can be hidden by parallelism. [Figure: performance vs. number of threads, rising through a memory-access-limited region toward a maximum set by bandwidth limitations; after Zvika Guz et al.]
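A back-of-the-envelope model makes the latency-hiding argument concrete. This is my simplification, not Tera's exact pipeline: assume each stream has at most one outstanding instruction, the average latency is 70 ticks, and the processor issues one instruction per tick from any ready stream. Utilization then grows linearly with the stream count until the latency is fully covered.

```python
# Toy model of latency hiding via multithreading (a simplification:
# one outstanding instruction per stream, one issue slot per tick).
LATENCY = 70   # average instruction latency in ticks, per the slide

def utilization(streams: int) -> float:
    """Fraction of issue slots filled with `streams` active streams."""
    return min(1.0, streams / LATENCY)

assert utilization(35) == 0.5    # half the needed parallelism
assert utilization(70) == 1.0    # ~70 ready streams hide the latency
assert utilization(128) == 1.0   # the 128-stream limit leaves headroom
```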

  17. Processors – Stream State • Stream state is defined by the following registers: • 1 64-bit Stream Status Word (SSW) – holds the program counter and additional mode flags. • 32 64-bit general registers (R0–R31). • 8 64-bit target registers (T0–T7) – for trap handler and branch targets. • To enable a rapid context switch (on every tick), there are 128 sets of context registers: each processor has 128 SSWs, 4,096 general registers, and 1,024 target registers. • With target registers, branch target addresses are prefetched in parallel with the branch decision calculation.

  18. Instructions • To enable issuing multiple operations per tick, “mildly horizontal” VLIW (Very Long Instruction Word) instructions are used. Each instruction typically specifies three operations: • A memory reference operation (e.g., UNS_LOADB). • An arithmetic operation (e.g., FLOAT_ADD_MUL). • A control operation (e.g., JUMP) or a second arithmetic operation.

  19. Explicit-Dependence Lookahead • Each instruction contains a 3-bit lookahead field that specifies how many instructions from this stream will issue before encountering an instruction that depends on the current one. • Example (instruction — lookahead): • R0 = R0 + 1 — 1 • R1 = R1 + 1 — 4 • R0 = R0 + 1 — 2 • R3 = R3 + 1 — 4 • R4 = R4 + 1 — 4 • R0 = R0 + 1 — 4 • R1 = R1 + 1 — 4 • R2 = R2 + 1 — 4 • … • A new instruction is issued only when the instructions whose lookahead values refer to it have completed. • If instructions are independent (lookahead value 7), 9 streams are enough to hide an instruction latency of 72 ticks.
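The lookahead values in the example can be derived mechanically. In the sketch below, each stream entry is the register an instruction updates, and the lookahead is the number of later instructions that issue before the next one touching the same register, capped at 7 because the field is 3 bits wide.

```python
# Sketch: deriving 3-bit lookahead values for the slide's example
# stream. Each entry is the register an instruction updates.
def lookaheads(regs):
    la = []
    for i in range(len(regs)):
        nxt = next((j for j in range(i + 1, len(regs))
                    if regs[j] == regs[i]), None)  # next use of same reg
        la.append(7 if nxt is None else min(7, nxt - i - 1))
    return la

stream = ["R0", "R1", "R0", "R3", "R4", "R0", "R1", "R2"]
# The first three values match the slide (1, 4, 2); the tail differs
# here only because the slide's stream continues past the "..." shown.
assert lookaheads(stream)[:3] == [1, 4, 2]
```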

  20. Protection Domains (Processes) • Each processor supports as many as 16 active protection domains (processes/address spaces). • A protection domain defines program memory, data memory and the mapping between physical and virtual addresses. • Each instruction stream (thread) is assigned to a protection domain. The exact domain is not known to the user program. • A protection domain can be seen as a virtual processor and can be moved from one physical processor to another.

  21. Protection Domains (Processes) • Retry limit – defines, per protection domain, how many times a memory reference can fail (e.g., while testing a full/empty bit) before it traps (raises an exception).

  22. Privilege Levels • Privilege levels are defined independently for each stream. • 4 levels of privilege: user, supervisor, kernel, and IPL. • IPL is the highest and is the only level that operates in absolute addressing mode.

  23. Arithmetic • Operations supported directly by hardware: addition, subtraction, multiplication, conversion, and comparison. • Types that are directly supported: • 64-bit 2’s-complement and unsigned integers. • 64-bit floating-point numbers. • 64-bit complex numbers. • Types that are indirectly supported: • 8-, 16-, and 32-bit 2’s-complement and unsigned integers. • Arbitrary-length integers. • 32-bit floating-point numbers. • 128-bit “double precision” numbers.

  24. Software* • Operating system – custom, fully symmetric, distributed parallel version of UNIX. • Programming model: • Thread-based programming model that permits a mixture of implicit and explicit parallelism. • The virtual machine has an unbounded number of processors with uniform access to all memory locations. • Tera’s compilers perform automatic parallelization of Fortran, C, and C++ (loop unrolling, operations on vectors, etc.). • *George Davison, Constantine Pavlakos, Claudio Silva. Final Report for the Tera Computer TTI CRADA. Sandia National Labs Report SAND97-0134, January 1997.

  25. Performance* • Nominal clock frequency: 333 MHz. • Data bandwidth per node: 2.67 GB/s. • Peak performance: 1 GFlops per processor, 256 GFlops total. • Power dissipation: 6 kW per processor, 1.536 MW total. • 167 kFlops/Watt. • *George Davison, Constantine Pavlakos, Claudio Silva. Final Report for the Tera Computer TTI CRADA. Sandia National Labs Report SAND97-0134, January 1997.
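The efficiency figure follows directly from the slide's own peak numbers, as a quick check confirms:

```python
# Checking the slide's arithmetic using only numbers from the slide.
processors = 256
peak_flops = processors * 1e9       # 1 GFlops per processor
power_watts = processors * 6e3      # 6 kW per processor
assert power_watts == 1.536e6       # 1.536 MW total
assert round(peak_flops / power_watts / 1e3) == 167   # ~167 kFlops/Watt
```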

  26. Thank You!!!
