Origin System Architecture

Hardware and Software Environment


Scalar Architecture

[Diagram: processor with register file, functional unit (mult, add), and cache, connected to memory]

  • Reduced Instruction Set Computer (RISC) architecture:
    • load/store instructions refer to memory
    • functional units operate on items in the register file
    • the scalar architecture uses a memory hierarchy:
      • the most recently used items are captured in the cache
      • access to cache is much faster than access to memory

Typical figures: cache ~2 GB/s, ~10 cycles; memory ~500 MB/s, ~100 cycles.
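To make that cache/memory gap concrete, here is a minimal C sketch (my own illustration, not from the slides): unit-stride traversal reuses each 128-byte cache line, while a 128-byte stride touches a new line on every access and therefore runs at memory speed.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const long n = 1L << 22;                /* 32 MB of doubles */
        double *a = malloc(n * sizeof *a);
        double sum = 0.0;
        for (long i = 0; i < n; i++) a[i] = 1.0;

        for (long i = 0; i < n; i++)            /* unit stride: cache-line reuse */
            sum += a[i];
        for (long i = 0; i < n; i += 16)        /* 16 doubles = 128 B: one line per access */
            sum += a[i];

        printf("%f\n", sum);
        free(a);
        return 0;
    }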


Vector Architecture

[Diagram: processor with vector registers and functional unit (mult, add); memory holds matrices A, B, C with C = A x B]

  • Vectors are loaded from memory with a loadv instruction
  • Performance is determined by memory bandwidth
  • Optimization takes the vector length (64 words) into account

Vector Operation

    DO i=1,n
      DO k=1,n
        C(i,1:n) = C(i,1:n) + A(i,k)*B(k,1:n)
      ENDDO
    ENDDO

    loadf  f2,(r3)      ! load scalar A(i,k)
    loadv  v3,(r3)      ! load vector B(k,1:n)
    mpyvs  v3,v3,f2     ! calculate A(i,k)*B(k,1:n)
    addvv  v4,v4,v3     ! update C(i,1:n)

C(i,1:n) is accumulated in a vector register.


Multiprocessor Architecture

[Diagram: two processors, each with a register file, functional unit (mult, add), cache, and cache coherency unit, sharing one memory]

  • The cache coherency unit intervenes if two or more processors attempt to update the same cache line
    • All memory (and I/O) is shared by all processors
    • Read/write conflicts between processors on the same memory location are resolved by the cache coherency unit
    • The programming model is an extension of the single-processor programming model (see the sketch below)
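A minimal sketch of that shared-memory model (assuming an OpenMP-capable C compiler; OpenMP is not named on this slide): all data lives in one address space, and a single directive splits the loop across processors.

    #include <stdio.h>

    int main(void) {
        enum { N = 1000000 };
        static double x[N], y[N];
        const double a = 2.0;

        /* each processor updates a disjoint slice of the shared arrays */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            y[i] = y[i] + a * x[i];

        printf("%f\n", y[0]);
        return 0;
    }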


Multicomputer Architecture

[Diagram: two independent nodes, each with a processor (register file, functional unit, cache) and its own main memory, joined only by an interconnect]

  • All memory and I/O paths are independent
  • Data movement across the interconnect is "slow"
  • The programming model is based on message passing (see the sketch below)
    • Processors explicitly engage in communication by sending and receiving data
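A minimal message-passing sketch (assuming an MPI library is available; MPI itself is not named on the slide, it is simply the most common embodiment of this model): data moves only when one process explicitly sends and another explicitly receives. Run with at least two ranks.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        double buf[4] = {1.0, 2.0, 3.0, 4.0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* explicit send toward the other node's memory */
            MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* explicit receive into local memory */
            MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %g\n", buf[0]);
        }
        MPI_Finalize();
        return 0;
    }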


Origin 2000 Node Board

[Diagram: node board with two R1*K processors and caches, a Hub, main memory with directory (extended directory for >32P), an XIO link, and a CrayLink port]

Basic building block:

  • 2 x R12000 processors
  • 64 MB to 4 GB main memory
  • Hub bandwidth peaks:
    • 780 MB/s [625] --- CPUs
    • 780 MB/s [683] --- memory
    • 1.56 GB/s [1.25] -- XIO link
    • 1.56 GB/s [1.25] -- CrayLink


O2000 Node Board

HUB crossbar ASIC:

  • A single chip integrates all 4 interfaces:
    • Processor interface; the two R1x000 processors multiplex on the same bus
    • Memory interface, integrating the memory controller and (directory) cache coherency
    • Interface to the CrayLink interconnect to other nodes in the system
    • Interface to the I/O devices with XIO-to-PCI bridges
  • Memory access characteristics:
    • Read bandwidth, single processor: 460 MB/s sustained
    • Average access latency: 315 ns to restart the processor pipeline

[Diagram: HUB ASIC (950K gates, 100 MHz, 64-bit) with processor, memory, link, and I/O interfaces, plus a block transfer engine (BTE) and 64 counters per 4 KB page]

  • Main memory: up to 4 GB/node of SDRAM (64 bit @ 100 MHz = 800 MB/s), plus directory SDRAM
  • Two R1x000 processors, each with 1, 4, or 8 MB of L2 cache
  • CrayLink: duplex connection to other nodes (16 bit @ 400 MHz, 2x800 MB/s)
  • Input/output on every node: 2x800 MB/s


Origin 2000 Switch Technology

[Diagram: node board with two processors and caches, a Hub, main memory with directory (extended for >32P), and an XBOW crossbar giving 6 ports to XIO; routers (R) connect node boards (N) into a ccNUMA hypercube (16 nodes, 8 routers shown)]


O2000 Scalability Principle

  • The distributed switch does scale:
    • A network of crossbars allows for full remote bandwidth
    • The switch components are distributed and modular

[Diagram: two node boards, each with two R1x000 processors (1-4-8 MB L2 caches), main memory with directory SDRAM, and a HUB (processor, memory, link, and I/O interfaces), joined by a crossbar router network]


Origin 2000 Module

System building block. Module features:

  • Up to 8 R12000 CPUs (1-4 nodes)
  • Up to 16 GB physical memory
  • Up to 12 XIO slots
  • 2 XBOW switches
  • 2 router switches
  • 64-bit internal PCI bus (optional)
  • Up to 2.5 [3.1] GB/s system bandwidth
  • Up to 5.0 [6.2] GB/s I/O bandwidth


Origin 2000 Module

[Diagram: 4 nodes, 2 routers]

  • Deskside system
    • 2-8 CPUs
    • 16 GB memory
    • 12 XIO slots
  • SGI 2100 / 2200


Origin 2000 Single Rack

[Diagram: 8 nodes, 4 routers]

  • Single-rack system
    • 2-16 CPUs
    • 32 GB memory
    • 24 XIO slots
  • SGI 2400


Origin 2000 Multi-Rack

[Diagram: 16 nodes, 8 routers]

  • Multi-rack system
    • 17-32 CPUs
    • 64 GB memory
    • 48 XIO slots
    • 32-processor hypercube building block


Origin 2000 Large Systems

  • Large multi-rack systems
    • up to 512 CPUs
    • up to 1 TB memory
    • 384+ XIO slots
  • SGI 2800

[Diagram: multiple racks combined into one large system]


Scalable Node Product Concept

  • Address diverse customer requirements
  • Independent scaling of CPU, I/O, and storage: tailor the ratios to suit the application
  • Large dynamic range of product configurations
  • RAS via component isolation
  • Independent evolution and upgrade of system components
  • Maximize leverage of engineering and technology development efforts

[Diagram: modular architecture built from processor, interconnect, and I/O subsystems, tied together by interface and form factor standards]


Origin 3000 Hardware Modules (Bricks)

  • C-brick: CPU module
  • R-brick: router interconnect
  • I-brick: base I/O module
  • P-brick: PCI expansion
  • X-brick: XIO expansion
  • D-brick: disk storage
  • G-brick: graphics expansion


Origin 3000 MIPS Node

[Diagram: four R1*000 processors with L2 caches attached to the Bedrock ASIC, which connects the memory/directory, the network, and I/O]

  • Two independent SysAD interfaces, each 2x O2K bandwidth (200 MHz, 1600 MB/s each)
  • Memory interface: 4x O2K bandwidth (200 MHz, 3200 MB/s); 60% of O2K latency (180 ns local); up to 8 GB of DDR SDRAM per node
  • NUMALink3 network port: 2x O2K bandwidth (800 MHz, 1600 MB/s, bi-directional)
  • XIO+ port: 1.5x O2K bandwidth (600 MHz, 1200 MB/s, bi-directional)
  • Up to 128 nodes / 512 CPUs per system


Origin 3000 CPU Brick (C-brick)

  • 3U high x 28" deep
  • Four MIPS or IA64 CPUs
  • 1-4 DIMM pairs: 256 MB, 512 MB, 1024 MB (premium)
  • 48 V DC power input
  • N+1 redundant, hot-plug cooling
  • Independent power on/off
  • Each CPU module can support one I/O brick


Origin 3000 BEDROCK Chip


SGI Origin 3000 Bandwidth: Theoretical vs. Measured (MB/s)

[Diagram: per node, each CPU link is 1600 MB/s theoretical / 900 MB/s measured; each Hub link is 1600 theoretical / 1150 measured; Hub-to-Hub is 2x1600 theoretical / 2x1250 measured; Hub-to-memory is 3200 theoretical / 2100 measured]


STREAMS Copy Benchmark

[Chart: STREAMS copy benchmark results]


Origin 3000 Router Brick (r/R-brick)

  • 2U high x 25" deep
  • Replaces the system mid-plane
  • Multiple implementations:
    • r-brick: 6-port (up to 32 CPUs)
    • R-brick: 8-port (up to 128 CPUs)
    • metarouter: 128 to 512 CPUs
  • 48 V DC power input
  • N+1 redundant, hot-plug cooling
  • Independent power on/off
  • Latency 50% of Origin 2000: 45 ns

[Diagram: NUMAlink(TM) 3 router with 8 NUMAlink 3 network ports, each 3.2 GB/s (2x O2K bandwidth); 45 ns roundtrip latency (50% of the O2K router latency)]


SGI Origin 3000 Measured Bandwidth

[Diagram: router measured at 5000 MB/s aggregate, 2500 MB/s per link]


SGI NUMA 3 Scalable Architecture (16p, 1 hop)

[Diagram: four nodes, each with four R1*000 processors and a Bedrock ASIC, attached to a single 8-port router; the remaining ports connect to other routers]


Origin 3000 I/O Bricks

  • I-brick: base I/O module
    • base system I/O: system disk, CD-ROM, 5 PCI slots
    • no need to duplicate the starting I/O infrastructure
  • P-brick: PCI expansion
    • 12 industry-standard, 64-bit, 66 MHz slots
    • supports almost all system peripherals
    • all slots are hot-swap
  • X-brick: XIO expansion
    • highest-performance I/O expansion
    • supports HIPPI, GSN, VME, HDTV
    • 4 XIO slots per brick

New I/O bricks (e.g., PCI-X) can be attached via the same XIO+ port.


Types of Computer Architecture, Characterised by Memory Access

  • MIMD
    • Multiprocessors: single address space, shared memory
      • UMA: central memory
        • PVP (SGI/Cray T90)
        • SMP (Intel SHV, Sun E10000, DEC 8400, SGI Power Challenge, IBM R60, etc.)
      • COMA (KSR-1, DDM)
      • NUMA: distributed memory
        • CC-NUMA (SGI Origin2000, Origin3000, Cray T3E, HP Exemplar, Sequent NUMA-Q, Data General)
        • NCC-NUMA (Cray T3D, IBM SP3)
    • Multicomputers: multiple address spaces
      • NORMA: no-remote memory access
        • Cluster (IBM SP2, DEC TruCluster, Microsoft Wolfpack, "Beowulf", etc.): loosely coupled, multiple OS
        • "MPP" (Intel TFLOPS, CM-5): tightly coupled, single OS

Glossary: MIMD = Multiple Instructions, Multiple Data; PVP = Parallel Vector Processor; UMA = Uniform Memory Access; SMP = Symmetric Multi-Processor; NUMA = Non-Uniform Memory Access; COMA = Cache-Only Memory Architecture; NORMA = No-Remote Memory Access; CC-NUMA = Cache-Coherent NUMA; MPP = Massively Parallel Processor; NCC-NUMA = Non-Cache-Coherent NUMA


Origin DSM-ccNUMA Architecture

[Diagram: two nodes, each with four processors and caches sharing a main memory with directory through a Bedrock ASIC with XIO+; nodes are joined by NUMALink3 and R-bricks into one distributed shared memory]


Distributed Shared Memory Architecture (DSM)

[Diagram: two nodes, each with a processor (register file, functional unit, cache), a cache coherency unit, and local main memory, joined by an interconnect]

  • Local memory and an independent path to memory, as in the multicomputer architecture
  • The memory of all nodes is organized as one logical "shared memory"
  • Non-uniform memory access (NUMA):
    • "local memory" access is faster than "remote memory" access
  • The programming model is (almost) the same as for the shared memory architecture
    • data distribution is available for optimization (see the sketch below)
  • Scalability properties are similar to the multicomputer architecture
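One common placement technique on such systems is first-touch: a page is placed in the memory of the node whose processor first writes it. A hedged C sketch of the idea (OpenMP assumed; my own illustration, not from the slides): initialize data in parallel with the same iteration split as the compute loops, so each processor's pages end up local.

    #include <stdio.h>

    enum { N = 1000000 };
    static double x[N], y[N];

    int main(void) {
        /* parallel first touch: each thread faults in its own pages,
           so they are placed in that thread's local memory */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 0.0; }

        /* a compute loop with the same iteration split then
           accesses mostly local memory */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) y[i] += 2.0 * x[i];

        printf("%f\n", y[0]);
        return 0;
    }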


Origin DSM-ccNUMA Architecture

[Diagram: the same two-node system, emphasizing directory-based scalable cache coherence: each main memory carries a directory (Dir); nodes connect via Bedrock, XIO+, NUMALink3, and R-bricks]


Origin Cache Coherency

[Diagram: per memory page, a directory entry for each 128-byte (32-word) data block / cache line: 64 presence bits and 8 state bits]

  • A memory page is divided into data blocks of 32 words (128 bytes) each, the L2 cache line size
  • Each data request transfers one data block (128 bytes)
  • Each data block has associated presence and state information
  • If a node (HUB) requests a data block, the corresponding presence bit is set and the state of that cache line is recorded
  • The HUB runs the cache coherency protocol, updating the state of the data block and notifying the nodes whose presence bits are set (see the sketch below).

Cache line states:

  • Unowned: no copies
  • Shared: read-only copies
  • Exclusive: one read-write copy
  • Busy: state in transition

Each 128-byte L2 cache line contains 4 data blocks of 8 words (32 bytes) each, the L1 data cache line size.
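An illustrative C sketch of such a directory entry (the names and layout are my own, not SGI's hardware format): 64 presence bits, one per node, plus an 8-bit state, kept per 128-byte block.

    #include <stdint.h>

    enum state { UNOWNED, SHARED, EXCLUSIVE, BUSY };

    struct dir_entry {
        uint64_t presence;   /* bit n set: node n holds a copy of the block */
        uint8_t  state;      /* one of enum state */
    };

    /* record that `node` now caches this block read-only */
    static void note_sharer(struct dir_entry *e, unsigned node) {
        e->presence |= (uint64_t)1 << node;
        e->state = SHARED;
    }

    /* grant exclusive (read-write) ownership to a single node */
    static void make_exclusive(struct dir_entry *e, unsigned node) {
        e->presence = (uint64_t)1 << node;
        e->state = EXCLUSIVE;
    }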


CC-NUMA Architecture: Programming

[Diagram: matrix multiply C = A x B, with the columns of each matrix distributed across processors 1-3]

  • All data is shared
  • An additional optimization places data close to the processor that will do most of the computation on that data
  • Automatic (compiler) optimizations apply for both single-processor and parallel performance
  • The data access (data exchange) is implicit in the algorithm
  • Except for the additional data placement directives, the source is the same as for single-processor programming (the SMP principle)

    C every processor holds a column of each matrix:
    C$distribute A(*,block), B(*,block), C(*,block)
    C$omp parallel do
          DO i=1,n
            DO j=1,n
              DO k=1,n
                C(i,j) = C(i,j) + A(i,k)*B(k,j)
              ENDDO
            ENDDO
          ENDDO


Problems of CC-NUMA Architecture

  • SMP programming style + data placement techniques (directives)
  • The "SMP programming cliff": remote memory latency jumps by a factor of ~3-5, which requires correct data placement
  • Based on a 1 GB/s SCI link with ~500 ns latency per hop, a 64-128 processor O2000 sees ta(remote)/ta(local) ~3-5 -> correct data placement is essential


DSM-ccNUMA Memory

  • Shared-memory systems (SMP): easy to program, hard to scale
  • Massively parallel systems (MPP): easy to scale, hard to program
  • Distributed shared memory systems (ccNUMA): easy to program, easy to scale


SGI 3200 (2-8p)

Router-less configurations in a deskside form factor or a short rack (17U configuration space).

[Diagram: minimum (2p) and maximum (8p) system topologies: one or two C-bricks (Bedrock hubs with four CPUs each) connected directly through their network ports, each driving an I-, P-, or X-brick through its XIO+ ports, plus power bays]


SGI 3400 (4-32p)

[Diagram: minimum (4p) and maximum (32p) system topologies in a full-size rack (39U configuration space): up to eight C-bricks (four CPUs each) joined by 6-port r-bricks; each C-brick pairs with a P-, I-, or X-brick on its XIO+ ports, plus power bays]


SGI 3800 (16-128p)

[Diagram: minimum (16p) and maximum (128p) system topologies; the 128p system spans four racks of C-bricks, 8-port R-bricks, P-, I-, or X-bricks, and power bays]


SGI 3800 System: 128 Processors

[Diagram: eight 16-processor building blocks combined into a 128-processor system]


SGI 3800 (32-512p)

[Diagram: one quadrant of a 512p system: racks of C-bricks, R-bricks, P-, I-, or X-bricks, and power bays]

512p power estimates:

  • MIPS = 77 kW
  • Itanium(TM) = 150 kW
  • McKinley = 231 kW

No I/O or storage is included in the power estimates. Premium memory is required.


Router-to-Router Connections for 256 Processor Systems


512 Processor Systems


R1xK Family of Processors

MIPS R1x000 is an out-of-order, dynamically scheduled superscalar processor with non-blocking caches.

  • Supports the 64-bit MIPS IV ISA
  • 4-way superscalar
  • Five separate execution units
  • 2 floating-point results / cycle
  • 4-way deep speculative execution of branches
  • Out-of-order execution (48-instruction window)
  • Register renaming
  • Two-way set-associative, non-blocking caches
    • Up to 4 outstanding memory read requests
    • Prefetching of data
    • 1 MB to 8 MB secondary data cache
  • Four user-accessible event counters


Origin 3000 MIPS Processor Roadmap

[Timeline 1999-2003, from Origin 2000 to O3K-MIPS:]

  • R10000: 250 MHz, 500 MFlops; 4 MB L2 @ 250 MHz
  • R12000: 300 MHz, 600 MFlops; 8 MB L2 @ 200 MHz
  • R12000A: 400 MHz, 800 MFlops; 8 MB L2 @ 266 MHz
  • R14000(A): 500+ MHz, 1000+ MFlops; 8 MB DDR SRAM L2 @ 250+ MHz
  • R16000: xxx MHz, xxx GFlops
  • R18000: xxx MHz, xxx GFlops


R14000 Cache Interfaces


Memory Hierarchy

  • 64 registers: 1 cycle
  • L1 cache, 32 KB: ~2-3 cycles
  • L2 cache, 8 MB: ~10 cycles
  • Memory (NUMA): ~100-300 cycles
  • Disk (~1-100s GB): ~4000 cycles

Speed of access falls, and device capacity grows, at each level down the hierarchy.

[Chart: remote memory latency (ns) vs. system size (2p-512p) for Origin 2000 and Origin 3000; Origin 3000 is markedly lower at every size, with values ranging from 175 ns up to ~1169 ns]


Effects of Memory Hierarchy

[Chart: performance vs. working-set size, with steps at the 32 KB L1 cache boundary and at L2 cache sizes of 1, 2, and 4 MB]


Instruction Latencies (R12K)

Integer units (latency / repeat rate):

  • ALU 1: add, sub, logic ops, shift, br: 1 / 1
  • ALU 2: add, sub, logic ops: 1 / 1
    • signed multiply (32/64 bit): 6/10 / 6/10 (unsigned multiply: +1 cycle)
    • divide (32/64 bit): 35/67 / 35/67
  • Address unit:
    • load integer: 2 / 1
    • load floating point: 3 / 1
    • store: - / 1
    • atomic LL,ADD,SC sequence: 6 / 6

Floating-point units (latency / repeat rate):

  • FPU 1: add, sub, compare, convert: 2 / 1
  • FPU 2: multiply: 2 / 1; multiply-add (madd): 4 / 1
  • FPU 3: divide, reciprocal (32/64 bit): 12/19 / 14/21; sqrt (32/64 bit): 18/33 / 20/35; rsqrt (32/64 bit): 30/52 / 34/56

A repeat rate of 1 means that, once the pipeline is full, the processor can complete one operation per cycle. The peak rates are therefore 2 integer operations/cycle and 2 fp operations/cycle; for a 500 MHz processor, 4 x 500 MHz = 2000 MIPS and 2 x 500 MHz = 1000 Mflop/s.

The compiler has this table built in. The goal of compiler scheduling is to find instructions that can be executed in parallel, filling all slots: ILP (Instruction-Level Parallelism).


Instruction Latencies: DAXPY Example

Loop parallelism (per single loop iteration):

  • 2 loads, 1 store
  • 1 multiply-add (madd)
  • 2 address increments
  • 1 loop-end test
  • 1 branch

Processor parallelism (per processor cycle):

  • 1 load or store
  • 1 ALU1 instruction
  • 1 ALU2 instruction
  • 1 FP add
  • 1 FP multiply

  • There are 2 loads (x, y) and 1 store (y) = 3 memory ops.
  • There are 2 fp operations (+, *), which can be done with 1 madd.
  • 3 memory ops require at least 3 cycles (the processor can do 1 memory op/cycle).
  • In those 3 cycles the processor could theoretically do 6 fp operations,
  • but only 2 fp operations are available in the code,
  • so the maximum speed is 2fp/6fp = 1/3 of peak on this code;
  • i.e., for a 300 MHz R12K, 600/3 = 200 Mflop/s.

    DO I=1,n
      Y(I) = Y(I) + A*X(I)
    ENDDO


DAXPY Example: Schedules

Simple schedule (x load delay 3 cycles, madd delay 4 cycles):

    cycle  instructions
    0      ld x          x++
    1      ld y
    2
    3      madd
    4
    5
    6
    7      st y   br     y++

    DO I=1,n
      Y(I) = Y(I) + A*X(I)
    ENDDO

Unrolled by 2 (x load delay 3 cycles, madd delay 4 cycles):

    cycle  instructions
    0      ld x0
    1      ld x1
    2      ld y0  x+=4
    3      ld y1  madd0
    4      madd1
    5
    6
    7      st y0
    8      st y1  y+=4   br

    DO I=1,n-1,2
      Y(I+0) = Y(I+0) + A*X(I+0)
      Y(I+1) = Y(I+1) + A*X(I+1)
    ENDDO

  • Simple schedule: 2fp/(8 cycles * 2fp/cy) = 1/8 of peak; at 300 MHz, ~75 Mflop/s
  • Unrolled by 2: 4fp/(9 cycles * 2fp/cy) = 2/9 of peak; at 300 MHz, ~133 Mflop/s


DAXPY Example: Software Pipelining

    #<swp> replication 0                   # cycle
    ld x0   ldc1   $f0,0($1)               # [0]
    ld x1   ldc1   $f1,-8($1)              # [1]
    st y2   sdc1   $f3,-8($3)              # [2]
    st y3   sdc1   $f5,0($3)               # [3]
    y+=2    addiu  $3,$2,16                # [3]
            madd.d $f5,$f2,$f0,$f4         # [4]
    ld y0   ldc1   $f0,-8($2)              # [4]
            madd.d $f3,$f0,$f1,$f4         # [5]
    x+=2    addiu  $1,$1,16                # [5]
            beq    $2,$4,.BB21.daxpy       # [5]
    ld y3   ldc1   $f2,0($3)               # [5]

    #<swp> replication 1                   # cycle
    ld x3   ldc1   $f1,0($1)               # [0]
    ld x2   ldc1   $f0,-8($1)              # [1]
    st y1   sdc1   $f3,-8($2)              # [2]
    st y0   sdc1   $f5,0($2)               # [3]
    y+=2    addiu  $2,$3,16                # [3]
            madd.d $f5,$f2,$f1,$f4         # [4]
    ld y3   ldc1   $f1,-8($3)              # [4]
            madd.d $f3,$f1,$f0,$f4         # [5]
    x+=2    addiu  $1,$1,16                # [5]
    ld y0   ldc1   $f2,0($2)               # [5]

  • Software pipelining is the way to fill all processor slots, by mixing iterations
  • "Replication" gives how many iterations are mixed
  • The number of replications depends on the distance (in cycles) between a load and the calculation that uses it
  • DAXPY: a 6-cycle schedule with 4 fp ops: 4fp/(6cy * 2fp/cy) = 1/3 of peak


DAXPY SWP: Compiler Messages

  • f77 -mips4 -O3 -LNO:prefetch=0 -S daxpy.f
    • With the -S switch the compiler produces a file daxpy.s with assembler instructions and comments about the software pipelining schedules:

    #<swps>  Pipelined loop line 6 steady state
    #<swps>  50 estimated iterations before pipelining
    #<swps>   2 unrolling before pipelining
    #<swps>   6 cycles per 2 iterations
    #<swps>   4 flops         ( 33% of peak) (madds count 2 fp)
    #<swps>   2 flops         ( 16% of peak) (madds count 1 fp)
    #<swps>   2 madds         ( 33% of peak)
    #<swps>   6 mem refs      (100% of peak)
    #<swps>   3 integer ops   ( 25% of peak)
    #<swps>  11 instructions  ( 45% of peak)
    #<swps>   2 short trip threshold
    #<swps>   7 ireg registers used.
    #<swps>   6 fgr registers used.

  • The schedule reaches the maximum 1/3 of peak processor performance, as expected
  • Note: it is necessary to switch off prefetch to attain the maximal schedule


Multiple Outstanding Mem Refs

[Diagram: with "sequential" cache misses, each wait for data completes before the next begins; with "parallel" cache misses, execution of independent instructions overlaps several outstanding waits, hiding latency]

  • The processor can support 4 outstanding memory requests
  • Timing linked-list references: while(x) x = x->p;

    # outstanding refs    time per pointer fetch
    1                     230 ns (480 ns)
    2                     160 ns (250 ns)
    4                     110 ns (240 ns)
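A hedged C sketch of that measurement idea (my own illustration, not the original benchmark code): one dependent chain serializes its cache misses, while chasing two independent chains lets the processor keep two misses outstanding at once.

    #include <stddef.h>

    /* pad each node to roughly one 128-byte cache line */
    struct node { struct node *p; char pad[120]; };

    /* one chain: every load depends on the previous one ("sequential" misses) */
    void chase1(struct node *x) {
        while (x) x = x->p;
    }

    /* two independent chains: the misses can overlap ("parallel" misses) */
    void chase2(struct node *a, struct node *b) {
        while (a && b) { a = a->p; b = b->p; }
    }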


Origin 3000 Memory Latency

                  Origin 2000             O3K
    Local         320 ns                  180 ns
    NI to NI      165 ns                  50 ns
    Per router    105 ns                  45 ns
    Total         485 ns + #hops*105 ns   230 ns + #hops*45 ns

32-CPU O3K maximum latency: 315 ns
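As a worked instance of the formulas in the table (a small sketch using only the numbers above):

    #include <stdio.h>

    /* latency models from the table above, in nanoseconds */
    static double o2k_latency(int hops) { return 485.0 + hops * 105.0; }
    static double o3k_latency(int hops) { return 230.0 + hops *  45.0; }

    int main(void) {
        /* e.g. a remote access crossing 2 routers */
        printf("O2K, 2 hops: %.0f ns\n", o2k_latency(2)); /* 695 ns */
        printf("O3K, 2 hops: %.0f ns\n", o3k_latency(2)); /* 320 ns */
        return 0;
    }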


Remote Memory Latency: SGI(TM) 3000 Family vs. SGI(TM) 2000 Series

[Chart: remote memory latency comparison, Origin 3000 series vs. Origin 2000 series]

R1x000 Event Counters

  • The R1x000 processor family allows extensive performance monitoring with counters that can be triggered by 32 events:
    • the R10000 has 2 event counters
    • the R12000 has 4 event counters
  • A counter is incremented when the event selected by the user (e.g. a cache miss) happens in the processor.
  • The first counter can be triggered by events 0-15; the second counter is incremented in response to events 16-31.
  • The R12000 has 2 additional counters that allow monitoring of conditional events (i.e. events based on previous events).
  • User access to the counters is through a software library or shell-level tools provided by the IRIX OS.


Origin Address Space

  • Physical address: 40 bits (1 TB max) = node ID (8 bits) + node offset (32 bits, 4 GB), as sketched in code below
    • maximum for a single node: 4 GB of memory
    • node IDs are assigned at boot time; address ranges of empty slots simply have no memory present
  • Physically, the memory is distributed and is not contiguous.
  • Logically, memory is a single shared contiguous address space; the virtual address space is 44 bits (16 TB).
  • The program (compiler) uses the virtual address space; translation from the virtual to the physical address space is done by the CPU.
  • The page size is configurable: 16 KB (default), 64 KB, 256 KB, 1 MB, 4 MB, or 16 MB.
  • TLB = Translation Look-aside Buffer

[Diagram: the TLB maps contiguous virtual pages onto physical pages scattered across the nodes]
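A small illustrative decomposition of such a 40-bit physical address (my own sketch of the bit layout described above):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* bits 32-39: node id; bits 0-31: offset within the node */
        uint64_t paddr  = ((uint64_t)3 << 32) | 0x12345678; /* node 3 */
        unsigned node   = (unsigned)((paddr >> 32) & 0xff);
        unsigned long offset = (unsigned long)(paddr & 0xffffffffu);
        printf("node %u, offset 0x%lx\n", node, offset);
        return 0;
    }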


Process Scheduling

  • IRIX is a symmetric multiprocessing operating system
    • processes and processors are independent
    • parallel programs are executed as jobs with multiple processes
    • the scheduler allocates processes to processors
  • Priorities range from 0 to 255:
    • 0: weightless (batch)
    • 1-40: time share, interactive (TS)
    • 90-239: system (daemons and interrupts)
    • 1-255: real-time processes (FIFO & RR)


[Three further Process Scheduling slides: charts only]


System Monitoring Commands

  • uptime(1) - information about system usage and user load
  • w(1) - who is on the system and what they are doing
  • sysmon - system log viewer
  • ps(1) - a "snapshot" of the process table
  • top, gr_top - dynamic process table display
  • osview - system usage statistics
  • sar - system activity reporter
  • gr_osview - system usage statistics in graphical form
  • gmemusage - graphical memory usage monitor
  • sysconf - system limits, options, and parameters


System Monitoring Commands

  • ecstats -C - R10K counter monitor
  • ja - job accounting statistics
  • oview - Performance Co-Pilot (bundled with IRIX)
  • pmchart - Performance Co-Pilot (licensed software)
  • nstats, linkstat - CrayLink connection statistics (man refcnt(5))
  • bufview - system buffer statistics
  • par - process activity report
  • numa_view, dlook - process memory placement information
  • limit [-h] - displays system soft [hard] limits


System Monitoring Commands

  • hinv - hardware inventory
  • topology - system interconnect description


Summary: Origin Properties

  • Single machine image
    • it behaves like one fat workstation
      • same compilers
      • time sharing
    • all your old code will run
    • the OS schedules all the hardware resources of the machine
  • Processor scalability: 2-512 CPUs
  • I/O scalability: 2-300 GB/s
  • All memory and I/O devices are directly addressable
    • no limitation on the size of a single program; it can use all the available memory
    • no limitation on the location of data; all disks can be used in a single file system
  • 64-bit operating system and file system
    • HPC features: Checkpoint/Restart, DMF, NQE/LSF, TMF, Miser, job limits, cpusets, enhanced accounting
  • Machine stability
