OGO 2.1: SGI Origin 2000 • Robert van Liere • CWI, Amsterdam • TU/e, Eindhoven • 11 September 2001
unite.sara.nl • SGI Origin 2000 • Located at SARA in Amsterdam • Hardware configuration: • 128 MIPS R10000 CPUs @ 250 MHz • 64 Gbyte main memory • 1 Tbyte disk storage • 11 Ethernet interfaces @ 100 Mbit/s • 1 Ethernet interface @ 1 Gbit/s
Contents • Architecture • Overview • Module interconnect • Memory hierarchies • Programming • Parallel models • Data placement • Pros and cons
Overview - Features • 64 bit RISC microprocessors • Large main memory • “Scalable” in CPU, memory and I/O • Shared memory programming model
Overview - Applications • Worldwide: approximately 30,000 systems • ~ 50 with >128 CPUs • ~ 100 with 64-128 CPUs • ~ 500 with 32-64 CPUs • Compute serving: many CPUs and much memory • Database serving: many disks • Web serving: a lot of I/O
System architecture – 1 CPU • CPU + cache • One system bus • Memory • I/O (network + disk) • Cached data
System architecture – N CPU • Symmetric multi-processing (SMP) • Multi-CPU + caches • One shared bus • Memory • I/O
N CPU – cache coherency • Problem: • Inconsistent cached data • Solution: • Snooping • Broadcasting • Drawback: not scalable
Architecture – Origin 2000 • Node board • 2 CPU + cache • Memory • Directory • HUB • I/O
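The directory on each node board is what lets the machine avoid the broadcasts of the previous slide. The following is a minimal sketch of the idea, not the actual Origin 2000 protocol: the type `dir_entry` and the functions `dir_read`, `dir_write` and `broadcast_write` are hypothetical names, and the 64-node limit assumes 128 CPUs at 2 CPUs per node board.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 64  /* assumption: 128 CPUs / 2 CPUs per node board */

/* Hypothetical, simplified directory entry for one memory block.
   Bit i of `sharers` is set when node i holds a cached copy. */
typedef struct {
    uint64_t sharers;
} dir_entry;

/* Count the set bits in a sharer mask. */
static int count_sharers(uint64_t mask)
{
    int n = 0;
    while (mask) {
        mask &= mask - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* A node reads the block: the directory records it as a sharer. */
static void dir_read(dir_entry *e, int node)
{
    assert(node >= 0 && node < MAX_NODES);
    e->sharers |= (uint64_t)1 << node;
}

/* A node writes the block: only the other recorded sharers are
   invalidated. Returns the number of invalidation messages. */
static int dir_write(dir_entry *e, int node)
{
    assert(node >= 0 && node < MAX_NODES);
    uint64_t self = (uint64_t)1 << node;
    int invalidations = count_sharers(e->sharers & ~self);
    e->sharers = self;      /* the writer is now the only holder */
    return invalidations;
}

/* With broadcast, a write has to notify every other node. */
static int broadcast_write(int total_nodes)
{
    return total_nodes - 1;
}
```

In this model, a write to a block cached by three nodes costs two invalidations, whereas a broadcast scheme on 64 nodes would send 63 messages regardless of who holds a copy.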
Origin 2000 Interconnect • Node boards • Routers • Six ports
Virtual Memory • One CPU, multi programs • Page • Paging disk • Page replacement
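Page replacement can be illustrated with a small simulation. The slide does not say which policy is used; the sketch below assumes least-recently-used (LRU) purely as an example, and the function name `lru_page_faults` and the `MAX_FRAMES` limit are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FRAMES 16   /* hypothetical upper bound on resident pages */

/* Simulate LRU page replacement and return the number of page faults.
   refs:   sequence of referenced page numbers
   n:      length of that sequence
   frames: number of pages that fit in main memory */
static int lru_page_faults(const int *refs, size_t n, int frames)
{
    int page[MAX_FRAMES];
    size_t last_used[MAX_FRAMES];
    int used = 0, faults = 0;

    assert(frames > 0 && frames <= MAX_FRAMES);

    for (size_t t = 0; t < n; t++) {
        int slot = -1;

        /* Is the page already resident? */
        for (int f = 0; f < used; f++) {
            if (page[f] == refs[t]) {
                slot = f;
                break;
            }
        }

        if (slot < 0) {
            faults++;                   /* page must come from the paging disk */
            if (used < frames) {
                slot = used++;          /* a free frame is still available */
            } else {
                slot = 0;               /* evict the least recently used page */
                for (int f = 1; f < used; f++)
                    if (last_used[f] < last_used[slot])
                        slot = f;
            }
            page[slot] = refs[t];
        }
        last_used[slot] = t;            /* record the time of this access */
    }
    return faults;
}
```

Feeding the function a reference string with more distinct pages than frames shows how the fault count grows once the working set no longer fits in memory.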
O2000 Virtual Memory • Multi CPU, multi programs • Non-Uniform Memory Access (NUMA) • Efficient programs: • Minimize data movement • Keep data “close” to the CPU that uses it
Application performance • Scientific computing • LU, ocean, barnes, radiosity • Linear speedup • More CPUs -> higher performance
Programming support • IRIX operating system • Parallel programming • C source level with compiler pragmas • POSIX threads • UNIX processes • Data placement • dplace, dlook, dprof • Profiling • timex, ssrun
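As an illustration of the POSIX threads option, here is a minimal sketch that sums an array with a fixed number of threads. It is not taken from the slides: `NUM_THREADS`, `sum_task`, `sum_worker` and `parallel_sum` are hypothetical names, and on many systems the program must be compiled with a threads option such as `-pthread`.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4   /* hypothetical thread count */

/* Work description handed to each thread. */
typedef struct {
    const double *data;
    size_t begin, end;  /* half-open index range [begin, end) */
    double partial;     /* result written by the worker */
} sum_task;

/* Thread body: sum one contiguous slice of the array. */
static void *sum_worker(void *arg)
{
    sum_task *t = arg;
    double s = 0.0;
    for (size_t i = t->begin; i < t->end; i++)
        s += t->data[i];
    t->partial = s;
    return NULL;
}

/* Split the array over NUM_THREADS threads and combine the results. */
static double parallel_sum(const double *data, size_t n)
{
    pthread_t tid[NUM_THREADS];
    sum_task task[NUM_THREADS];
    double total = 0.0;

    for (int k = 0; k < NUM_THREADS; k++) {
        task[k].data = data;
        task[k].begin = n * (size_t)k / NUM_THREADS;
        task[k].end = n * (size_t)(k + 1) / NUM_THREADS;
        task[k].partial = 0.0;
        int rc = pthread_create(&tid[k], NULL, sum_worker, &task[k]);
        assert(rc == 0);
        (void)rc;
    }
    for (int k = 0; k < NUM_THREADS; k++) {
        pthread_join(tid[k], NULL);     /* wait, then collect the partial sum */
        total += task[k].partial;
    }
    return total;
}
```

Each thread writes only to its own `partial` field, so no locking is needed; the only coordination point is the join.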
Parallel Programs • Functional Decomposition • Decompose the problem into different tasks • Domain Decomposition • Partition the problem’s data structure • Consider • Mapping tasks/parts onto CPUs • Coordinate work and communication of CPUs
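Domain decomposition often starts with a rule that maps parts of the data structure onto CPUs. A minimal sketch of a balanced block partition of a one-dimensional array follows; the names `range` and `block_range` are hypothetical and the slides do not prescribe any particular partitioning.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open index range [begin, end) owned by one part. */
typedef struct {
    size_t begin;
    size_t end;
} range;

/* Divide n elements over `parts` parts as evenly as possible.
   The first (n % parts) parts receive one extra element. */
static range block_range(size_t n, int parts, int id)
{
    assert(parts > 0 && id >= 0 && id < parts);

    size_t base  = n / (size_t)parts;   /* minimum elements per part */
    size_t extra = n % (size_t)parts;   /* leftover elements to spread */
    size_t uid   = (size_t)id;

    range r;
    r.begin = uid * base + (uid < extra ? uid : extra);
    r.end   = r.begin + base + (uid < extra ? 1 : 0);
    return r;
}
```

With 10 elements and 4 parts this yields blocks of sizes 3, 3, 2 and 2, so no CPU gets more than one element beyond its fair share.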
Task Decomposition • Decompose problem • Determine dependencies
Task Decomposition • Map tasks on threads • Compare: • Sequential case • Parallel case
Efficient programs • Use many CPUs • Measure speedups • Avoid: • Excessive data dependencies • Excessive cache misses • Excessive inter-node communication
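Measuring speedup comes down to comparing run times. The sketch below uses hypothetical helper names (`speedup`, `efficiency`, `amdahl_limit`); Amdahl's law is not mentioned on the slides and is included only as a common way to reason about the effect of a serial fraction, such as work held up by data dependencies.

```c
#include <assert.h>

/* Speedup: serial run time divided by parallel run time. */
static double speedup(double t_serial, double t_parallel)
{
    assert(t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Efficiency: speedup per CPU; 1.0 corresponds to linear speedup. */
static double efficiency(double t_serial, double t_parallel, int cpus)
{
    assert(cpus > 0);
    return speedup(t_serial, t_parallel) / cpus;
}

/* Amdahl's law: upper bound on speedup when a fraction of the
   work (serial_fraction, between 0 and 1) cannot be parallelized. */
static double amdahl_limit(double serial_fraction, int cpus)
{
    assert(cpus > 0);
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cpus);
}
```

For example, a run that takes 100 s on one CPU and 25 s on 8 CPUs has a speedup of 4 and an efficiency of 0.5, which suggests that half of the machine is lost to overheads such as cache misses or inter-node communication.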
Pros vs Cons • Pros: • Multi-processor (128 CPUs) • Large memory (64 Gbyte) • Shared memory programming • Cons: • Slow integer CPU • Performance penalty: • Data dependencies • Off-board memory