
High Performance Computing: Concepts, Methods, & Means Enabling Technologies


Presentation Transcript


  1. High Performance Computing: Concepts, Methods, & Means — Enabling Technologies. Prof. Thomas Sterling, Department of Computer Science, Louisiana State University, March 13th, 2007

  2. Topics Introduction Taxonomy of Technologies State of the Art Technology Trends Summary – Material for Test

  3. Topics Introduction Taxonomy of Technologies State of the Art Technology Trends Summary – Material for Test

  4. Space of Consideration

  5. Why do WE care? • Speed • Density • Balance • Power, size, cost • Architecture • Operations • Configuration, total system size

  6. A Growth-Factor of a Billion in Performance in a Single Lifetime. (Chart: performance from 1 OPS through KiloOPS, MegaOPS, GigaOPS, TeraOPS to PetaOPS, i.e. 1 to 10^15 operations per second. Milestones: 1823 Babbage Difference Engine, 1943 Harvard Mark 1, 1949 EDSAC, 1951 Univac 1, 1959 IBM 7094, 1964 CDC 6600, 1976 Cray 1, 1982 Cray XMP, 1988 Cray YMP, 1991 Intel Delta, 1996 T3E, 1997 ASCI Red, 2001 Earth Simulator, 2003 Cray X1.)

  7. Topics Introduction Taxonomy of Technologies State of the Art Technology Trends Summary – Material for Test

  8. Current Technologies & Metrics • Memory – DRAMs • Access Times • Bandwidth • Capacity, Size • Microprocessors • Clock rate • Instructions per Cycle (ILP) • Power • I/O Channels • Bandwidth • Latency • Disks • Access Times • Bandwidth • Capacity

  9. SMP Node Diagram. Legend: MPU: microprocessor unit; L1, L2, L3: caches; M1..Mn: memory banks; S: storage; NIC: network interface card. (Diagram: four MPUs, each with L1 and L2 caches, sharing L3 caches; memory banks M1..Mn-1; storage; PCI-e controller; JTAG; Ethernet; USB peripherals; two NICs.)

  10. Memory - Overview (DDR2 PC2-6400) • Temporary storage location used to hold instructions and data. • Instructions are the actual operations executed by the processor. • Data is used and produced by peripherals such as hard disk or network controllers, along with intermediate results from program execution. • Both instructions and data are required by the processor to compute meaningful results. • The processor is constantly issuing commands to load and store data from memory across the memory bus. • Because of these constant memory accesses, the large gap between processor clock rate and memory bus speed is one of the largest impediments to achieving theoretical peak performance.

  11. Another View of the Memory Hierarchy. (Diagram, from upper level to lower level: registers (instructions, operands), cache (blocks), L2 cache (blocks), memory (pages), disk (files), tape; levels become faster toward the top and larger toward the bottom.)

  12. Memory - Overview • Memory bus performance is characterized by: • Memory bandwidth: the burst rate at which data can be copied between the DRAM memory chips and the CPU (total number of accesses per unit time), e.g. current rates range up to 6.4 GB/s for DDR2 PC2-6400. • Memory latency: the amount of time it takes to move data between RAM and the CPU, e.g. current latencies are around 80.5 ns for DDR2 PC2-6400. • Many applications depend on the availability of entire datasets in RAM. • Alternatively, disk storage could be used; however, this usually entails performance penalties due to higher access and retrieval times. • Thus memory becomes a crucial factor in system design and determines the size of the problem that can be run on the system. • Usual rule of thumb: 1 byte of RAM for every floating-point operation per second (actual requirements vary on a case-by-case basis). A small measurement sketch follows below.
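The following is a minimal sketch, not part of the original slides, of how sustained memory bandwidth can be estimated in the spirit of the STREAM benchmark: time a large array copy and divide the bytes moved by the elapsed time. The array size is an assumption; it must be much larger than the caches so that DRAM, not cache, is measured.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (64 * 1024 * 1024)   /* 64 M doubles = 512 MB per array (assumed size) */

    int main(void)
    {
        double *a = malloc((size_t)N * sizeof(double));
        double *b = malloc((size_t)N * sizeof(double));
        if (!a || !b) return 1;

        for (size_t i = 0; i < N; i++) a[i] = 1.0;      /* touch pages first */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < N; i++) b[i] = a[i];     /* copy: one read + one write per element */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double gb  = 2.0 * N * sizeof(double) / 1e9;    /* bytes moved: read a, write b */
        printf("copied %.2f GB in %.3f s -> %.2f GB/s (b[0] = %.0f)\n", gb, sec, gb / sec, b[0]);

        free(a);
        free(b);
        return 0;
    }

Compile with optimization (e.g. gcc -O2); the measured figure will normally fall well below the 6.4 GB/s theoretical peak quoted above.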

  13. Magnetic Core Memory

  14. 2nd Generation: Transistors • Replaced vacuum tubes • Smaller & cheaper • Less heat dissipation • Solid-state device (silicon) • Invented 1947 at Bell Labs. (Photo: the first transistor.)

  15. Integrated Circuit Costs. Cost of die = Wafer cost / (Dies per wafer × Die yield), where die yield is the percentage of good dies on the wafer.
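As a quick illustration of the formula, here is a small sketch; the wafer cost, die count, and yield plugged in are made-up values, not figures from the lecture.

    #include <stdio.h>

    /* Cost of a good die = wafer cost / (dies per wafer * die yield) */
    double die_cost(double wafer_cost, int dies_per_wafer, double die_yield)
    {
        return wafer_cost / (dies_per_wafer * die_yield);
    }

    int main(void)
    {
        /* hypothetical example: a $5000 wafer with 200 die sites at 80% yield */
        printf("cost per good die: $%.2f\n", die_cost(5000.0, 200, 0.80));
        return 0;
    }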

  16. 1-Transistor Memory Cell (DRAM). Write: 1. Drive bit line. 2. Select row. Read: 1. Precharge bit line to Vdd. 2. Select row. 3. Cell and bit line share charge; only a very small voltage change appears on the bit line. 4. Sense (fancy sense amp); can detect changes of ~1 million electrons. 5. Write: restore the value (the read is destructive). Refresh: just do a dummy read of every cell. (Diagram: one cell with a row-select line and a bit line.)

  17. Classical DRAM Organization (square). Row and column address together select 1 bit at a time. (Diagram: a square RAM cell array in which each intersection of a word (row) select line and a bit (data) line is a 1-T DRAM cell; a row decoder driven by the row address selects the word line, and a column selector & I/O circuits driven by the column address route the data in and out.)
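Not from the slides, but a minimal sketch of the addressing idea: for a square array, the linear bit address is split into a row half and a column half, which is also why DRAMs can multiplex the same pins for the row and column addresses. The 1024 × 1024 array size is an assumption for illustration.

    #include <stdio.h>

    #define ROW_BITS 10   /* 1024 rows (assumed array size) */
    #define COL_BITS 10   /* 1024 columns */

    int main(void)
    {
        unsigned addr = 345678;                          /* arbitrary bit address in the 1 Mbit array */
        unsigned row  = addr >> COL_BITS;                /* upper bits drive the row decoder */
        unsigned col  = addr & ((1u << COL_BITS) - 1);   /* lower bits drive the column selector */
        printf("bit %u -> row %u, column %u\n", addr, row, col);
        return 0;
    }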

  18. DRAM Read Timing. Every DRAM access begins with the assertion of RAS_L; there are two ways to read, early or late relative to CAS. (Timing diagram for a 256K x 8 DRAM with control signals RAS_L, CAS_L, WE_L, OE_L, a 9-bit address bus A and an 8-bit data bus D: the row address and then the column address are multiplexed onto A; D remains high-Z until the data comes out; the DRAM read cycle time, read access time and output enable delay are marked. Early read cycle: OE_L asserted before CAS_L. Late read cycle: OE_L asserted after CAS_L.)

  19. Static RAM Cell (6-transistor SRAM cell). Write: 1. Drive the bit lines (bit = 1, its complement = 0). 2. Select row. Read: 1. Precharge bit and its complement to Vdd or Vdd/2 (make sure they are equal!). 2. Select row. 3. Cell pulls one line low. 4. Sense amp on the column detects the difference between the two bit lines. (Diagram: cross-coupled cell with a word (row select) line and a complementary pair of bit lines.)

  20. Typical SRAM Organization: 16-word x 4-bit. (Diagram: a 16 x 4 array of SRAM cells; address bits A0-A3 feed an address decoder that drives word lines Word 0 through Word 15; each of the four bit columns has a write driver & precharger on inputs Din 3..Din 0 and a sense amp producing outputs Dout 3..Dout 0; WrEn and Precharge are the control signals.) Q: Which is longer: the word line or the bit line?

  21. Microprocessor - Overview (Opteron 246) • The single component that implements instruction execution. • The lowest-level binary encoding of instructions and the actions they perform are dictated by the microprocessor instruction set architecture (ISA). • The most common ISA used for a cluster node is the IA32 or x86_64 family, which includes all generations of the Pentium and Athlon processor families. • A processor runs at a particular clock rate, i.e. it can execute instructions at a particular frequency, usually measured in megahertz or gigahertz. • Note: a processor's clock rate is not a direct measure of its performance; two processors with the same clock rate can perform differently on some tasks, as the sketch below illustrates.
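A minimal sketch (with made-up numbers, not lecture data) of why clock rate alone does not determine performance: execution time is instruction count × cycles per instruction (CPI) / clock rate, so two processors at the same clock can differ when their average CPI differs.

    #include <stdio.h>

    int main(void)
    {
        double instructions = 1e9;    /* same program on both machines */
        double clock_hz     = 2e9;    /* both processors run at 2 GHz */
        double cpi_a        = 1.0;    /* assumed average CPI of processor A */
        double cpi_b        = 1.5;    /* assumed average CPI of processor B */

        printf("processor A: %.2f s\n", instructions * cpi_a / clock_hz);   /* 0.50 s */
        printf("processor B: %.2f s\n", instructions * cpi_b / clock_hz);   /* 0.75 s */
        return 0;
    }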

  22. Computer Generations

  23. IBM 360 series (1964) • Replaced (& not compatible with) the 7000 series • First planned "family" of computers • Similar or identical instruction sets • Similar or identical O/S • Increasing speed • Increasing number of I/O ports (i.e. more terminals) • Increased memory size • Increased cost • Multiplexed switch structure

  24. DEC PDP-8 (1964) • First minicomputer (after the miniskirt!) • Did not need an air-conditioned room • Small enough to sit on a lab bench • $16,000 (vs. $100k+ for an IBM 360) • Embedded applications & OEM • Bus structure (shown on the next slide)

  25. DEC PDP-8 Bus Structure. (Diagram: console controller, CPU, main memory, and I/O modules all attached to a single shared bus, the OMNIBUS.)

  26. Microprocessor - Overview • Every processor has a theoretical peak speed, i.e. the maximum rate of instruction execution the processor can achieve. • The theoretical peak performance (TPP) of a processor is determined by its clock rate, ISA and the components included in the processor. • TPP is measured in floating-point operations per second, or flops. The current fastest supercomputer, BlueGene/L, has two commercial IBM PowerPC 440 microprocessors on each compute node, with a TPP of 2.8 GF/s each (5.6 GF/s combined); see the sketch below. • Both the instructions and the data used by the processor are stored in memory. • Memory usually runs at a much slower clock rate than the processor, so the processor often waits for memory. • Hence the overall rate at which programs run is usually a combination of three factors: the memory system performance, the processor's clock speed, and the number of operations issued per instruction.
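A minimal sketch of the peak-performance arithmetic: peak flops = clock rate × floating-point operations per cycle. The inputs below assume the commonly quoted BlueGene/L figures (700 MHz PowerPC 440 cores, up to 4 flops per cycle from the dual FPU), which reproduce the 2.8 / 5.6 GF/s numbers above.

    #include <stdio.h>

    int main(void)
    {
        double clock_hz        = 700e6;  /* PowerPC 440 clock (assumed from public specs) */
        double flops_per_cycle = 4.0;    /* two fused multiply-adds per cycle */
        int    cores_per_node  = 2;

        double per_core = clock_hz * flops_per_cycle;      /* 2.8 GF/s */
        double per_node = per_core * cores_per_node;        /* 5.6 GF/s */
        printf("per processor: %.1f GF/s, per node: %.1f GF/s\n",
               per_core / 1e9, per_node / 1e9);
        return 0;
    }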

  27. Microprocessors - Overview (memory hierarchy: fast and small near the CPU, slow and large farther away) • Delays introduced by the processor's constant memory accesses can be mitigated by using a cache. • The cache is a small amount of fast memory usually co-located with the CPU. • When data is accessed from memory, it is also stored in the cache, so future accesses to the same data can be satisfied by the pre-existing cache copies. • Applications optimized to exploit these access patterns improve processor utilization, as the processor spends less time waiting for data and more time processing it; the sketch below shows the effect of access order.
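A minimal, illustrative sketch (not from the lecture): summing a matrix row by row walks memory contiguously and reuses each cache line, while the column-by-column traversal of the same data strides through memory and misses the cache far more often. Timing the two loops (compile with optimization) shows the gap.

    #include <stdio.h>

    #define N 2048

    static double m[N][N];   /* 32 MB, larger than typical caches */

    int main(void)
    {
        double sum = 0.0;

        /* cache-friendly: innermost loop touches consecutive addresses */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += m[i][j];

        /* cache-unfriendly: innermost loop strides by a whole row each step */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += m[i][j];

        printf("sum = %f\n", sum);
        return 0;
    }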

  28. Intel Microprocessor Performance

  29. DRAM and Processor Characteristics

  30. Processor-DRAM Memory Gap (latency). (Chart, 1980-2000, performance on a log scale from 1 to 1000: CPU performance ("Moore's Law") grows at about 60% per year, i.e. 2x every 1.5 years, while DRAM performance grows at about 9% per year, i.e. 2x every 10 years; the processor-memory performance gap grows about 50% per year.)

  31. I/O Channels (Photos: a PCI-X based Intel PRO/1000 Gigabit Ethernet adapter; an M3F-PCIXD-2 Myrinet-Fiber/PCI-X interface; a PCI-X 133 MHz card with two 4X InfiniBand ports (10 Gb/s each) and 256 MB memory; PCI-X slots on a motherboard.) • I/O channels are buses that connect peripherals with main memory. • Peripherals include: disk and network controllers, USB and FireWire devices, etc. • Each of these devices is connected to main memory via a bridge (usually referred to as the PCI chipset). • Since I/O tasks are so common, this subsystem is an integral part of any system. • The most common I/O channel in commodity hardware is the PCI bus; several flavors of PCI exist: PCI, PCI-X, PCIe.

  32. I/O Channels – Motherboard • The motherboard provides the logical and physical infrastructure for integrating the subsystems of a cluster node and determines the set of components that may be used. • Sockets and connectors on the motherboard include the following: • Microprocessor(s), memory • Peripheral controllers (PCI-X), AGP port (graphics) • Power, external I/O for USB, keyboard, mouse, etc. • Other chips on the motherboard provide: • The system bus that links the processor(s) to memory • The interface between the peripheral buses and the system bus • Programmable read-only memory (PROM) containing the BIOS software.

  33. I/O Channels – Chipsets & BIOS • Chipsets are the combination of all logic on the motherboard; this includes the memory bus, PCI, PCI-X and AGP bridges, disk controllers, USB controllers, etc. • Chipsets can be split into two logical portions: • North bridge: connects the front-side bus from the processor, the memory bus, and AGP. AGP is located on the north bridge so as to have special access to main memory. • South bridge: contains the I/O bus bridges and any integrated peripherals that may be included, such as disk and USB controllers. • The BIOS is the software that initializes all system hardware into a state from which the OS can boot. • PXE (Preboot Execution Environment) is a mechanism by which nodes can boot from a network-provided configuration and boot image. Many new machines support this feature, and cluster management systems use it for installations. • LinuxBIOS: a BIOS based on the Linux kernel that can perform all the important tasks needed for an OS to boot. Since the source code for this BIOS is available, firmware upgrades can be carried out more easily; these BIOSes also boot faster than conventional BIOSes.

  34. Storage – Local Hard Disks • A hard drive contains several platters; data is read off these platters as they rotate. • Logic in the drive optimizes the read & write requests based on the geometry of the disks to provide better collective performance. • This logic also contains a memory cache, which helps prevent the need for multiple reads of the same data. • Hard disks are magnetic storage media that interface with some sort of storage bus. • The three most commonly used storage buses are IDE (EIDE or ATA), SCSI, and Serial ATA. • Controllers to manage these buses are integrated into most motherboards and can support up to 4 devices. • UDMA133 is one such bus, running at a rate of 133 MB/s.

  35. Storage - Locality • Often an application reads consecutive sectors • Most hard drives do read-ahead • The disk has a buffer that stores sectors after the one just read • It can be as large as 4 MB • It's just a cache of sectors • The "smarts" in there are not well known due to proprietary technology • The buffer can also store sectors that need to be written to disk • Transfers to/from the buffer run at the speed of the I/O bus, not of the magnetic device • Can be > 300 MB/sec • More on the I/O bus later (a small read sketch follows below)
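A minimal sketch (the file name is hypothetical, and this is not lecture code) of reading a file the way the drive likes it: sequentially, in large chunks, so consecutive sectors can be served from the drive's read-ahead buffer and the OS page cache rather than waiting on the platters each time.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        FILE *f = fopen("bigfile.dat", "rb");   /* hypothetical input file */
        if (!f) { perror("fopen"); return 1; }

        enum { CHUNK = 1 << 20 };               /* 1 MB per request */
        char *buf = malloc(CHUNK);
        if (!buf) { fclose(f); return 1; }

        size_t n, total = 0;
        /* sequential, large requests: consecutive sectors, read-ahead friendly */
        while ((n = fread(buf, 1, CHUNK, f)) > 0)
            total += n;

        printf("read %zu bytes sequentially\n", total);
        free(buf);
        fclose(f);
        return 0;
    }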

  36. Topics Introduction Taxonomy of Technologies State of the Art Technology Trends Summary – Material for Test

  37. Memory Speeds and Trends (source: www.crucial.com)

  38. Memory

  39. DRAM Implementations: DDR • DDR (Double Data Rate) memory • 2x64 bits transferred in a single bus cycle (at both clock edges) • DDR-400 operates at a 200 MHz clock • The corresponding memory module is PC3200, delivering a peak bandwidth of 3.2 GB/s • Cycle time 5 ns, CAS latency 3 • Module capacity: up to 4 GB • Features 2-bit wide prefetch buffers • DDR2 memory • Operates at twice the bus speed of DDR • DDR2-800 achieves 800 million transfers per second using a 400 MHz bus clock • The corresponding module, PC2-6400, has a peak bandwidth of 6.4 GB/s • Cycle time 2.5 ns, CAS latency 5 • Module capacity: up to 4 GB • Features 4-bit wide prefetch buffers • DDR3 (successor to DDR2) is currently sampling • Expected to achieve up to 1600 million transfers per second (12.8 GB/s per module) with an 800 MHz clock • Features 8-bit wide prefetch buffers
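A minimal sketch of the arithmetic behind these module names: peak bytes per second = bus clock × transfers per clock × bus width in bytes, with the standard 64-bit (8-byte) DIMM data bus assumed.

    #include <stdio.h>

    /* peak bandwidth in GB/s = bus clock (Hz) x transfers per clock x bus width (bytes) / 1e9 */
    static double peak_gbs(double bus_clock_hz, int transfers_per_clock, int bus_bytes)
    {
        return bus_clock_hz * transfers_per_clock * bus_bytes / 1e9;
    }

    int main(void)
    {
        printf("DDR-400  (PC3200):   %.1f GB/s\n", peak_gbs(200e6, 2, 8));  /* 3.2 GB/s */
        printf("DDR2-800 (PC2-6400): %.1f GB/s\n", peak_gbs(400e6, 2, 8));  /* 6.4 GB/s */
        return 0;
    }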

  40. Other DRAM Implementations • XDR (eXtreme Data Rate) memory • Based on Rambus DRAM technology • Eight bits per clock per lane (“Octal Data Rate”) • One chip provides either 8 or 16 lanes • At typical 400 MHz clock, the peak bandwidth is 6.4 GB/s per chip • Planned clock speeds up to 1 GHz, currently the fastest parts run at 500 MHz • Current capacity: 512 Mbit per chip • GDDR4 (Graphics Double Data Rate version 4) memory • 2.8 Gbit/s data rate at 1.4 GHz clock per pin • 11.2 GB/s per chip with 32-bit data bus • CAS latency of 18 clock cycles • Current capacity: 512 Mbit per chip • 8 bit prefetch buffer width

  41. Modern Processor Parameters

  42. I/O Channel www.dell.com/content/topics/global.aspx/vectors/en/2004_pciexpress?c=us&l=en&s=corp • PCI Express (3GIO): 16Gb/sec • HyperTransport (LDT): 41.6GB/sec @ 2.6GHz • PCI Bus, PCI-X Bus: 1GB/sec • AGP (Accelerated Graphics Port): 2.134GB/sec • Others: • PCMCIA (Personal Computer Memory Card International Association) • ISA Bus (Industry Standard Architecture) • USB (Universal Serial Bus) • RapidIO

  43. I/O Channels • PCI Express 1.1: • 250 MB/s per lane • Card slots may include up to 32 lanes for a peak rate of 8 GB/s • PCI-X 2.0: • 64 bits wide at 533 MHz • 4.3 GB/s throughput • AGP 8x (Accelerated Graphics Port): • 32-bit channel operating at 66 MHz (strobing 8 times per clock) • Peak bandwidth of 2133 MB/s • HyperTransport 3.0: • Up to 32 bits at 2.6 GHz, transmitted at both clock edges • Peak bandwidth 20,800 MB/s per direction. These peaks are re-derived in the sketch below.
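A minimal sketch that re-derives the quoted peaks from width × clock × transfers per clock (or lanes × per-lane rate); the 66.67 MHz AGP base clock is assumed.

    #include <stdio.h>

    int main(void)
    {
        printf("PCIe 1.1, 32 lanes: %.1f GB/s\n", 32 * 250e6 / 1e9);        /* 8.0 GB/s  */
        printf("PCI-X 2.0:          %.1f GB/s\n", 8 * 533e6 / 1e9);         /* ~4.3 GB/s */
        printf("AGP 8x:             %.0f MB/s\n", 4 * 66.67e6 * 8 / 1e6);   /* ~2133 MB/s */
        printf("HyperTransport 3.0: %.0f MB/s\n", 4 * 2.6e9 * 2 / 1e6);     /* 20800 MB/s */
        return 0;
    }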

  44. Chipset I/O Capabilities • PCI Express: • 56 lanes, 250 MB/s each • 12 links • 5 slots • SATA: 12 channels, 3 Gbps each • HyperTransport: • 8 GB/s throughput to the CPU • Supports up to 8 processors • Gigabit Ethernet: 4 MAC units • USB 2.0 ports: 10 at 480 Mbps each • Support of RAID 0, 1, 0+1 and 5 • High Definition Audio (HDA) • 8 channels • 192 kHz/32-bit quality

  45. HyperTransport (AMD), LDT: Lightning Data Transport. Aggregate bandwidth: 41.6 GB/s (HyperTransport 3.0) • Point-to-point bus with [at least] two unidirectional links; an identical unidirectional link comes back from the far end. • Uses 2, 4, 8, 16 or 32 bits [in each direction]. • Data rate is 800 MB/s per 8-bit pair with a 400 MHz clock. • Bandwidth in both directions is 1.6 GB/s for 8 bi-directional pairs. • 16 bi-directional pairs bring the aggregate data rate up to 3.2 GB/s. • HT has a packet-based I/O link protocol specification. • AMD motherboards use a bridge to communicate with PCI-X [high-end PCs] / PCI [desktop] buses. http://www.interfacebus.com/Design_Connector_HyperTransport.html

  46. Permanent Storage: Hard Disks • Storage capacity: 1 TB per drive • Areal density: 132 Gbit/in^2 (perpendicular recording) • Rotational speed: 15,000 RPM • Average latency: 2 ms • Seek time: • Track-to-track: 0.2 ms • Average: 3.5 ms • Full stroke: 6.7 ms • Sustained transfer rate: up to 125 MB/s • Non-recoverable error rate: 1 in 10^17 • Interface bandwidth: • Fibre Channel: 400 MB/s • Serial Attached SCSI (SAS): 300 MB/s • Ultra320 SCSI: 320 MB/s • Serial ATA (SATA): 300 MB/s
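A minimal sketch relating two of the figures above: average rotational latency is half a revolution, so 15,000 RPM gives the 2 ms quoted, and a small random access costs roughly the average seek plus that latency.

    #include <stdio.h>

    int main(void)
    {
        double rpm         = 15000.0;
        double avg_seek_ms = 3.5;                           /* average seek from the slide */
        double rot_lat_ms  = 0.5 * 60.0 / rpm * 1000.0;     /* half a revolution, in ms */

        printf("average rotational latency: %.1f ms\n", rot_lat_ms);             /* 2.0 ms */
        printf("typical random access:      %.1f ms\n", avg_seek_ms + rot_lat_ms);
        return 0;
    }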

  47. Storage – SATA & Overview (PATA vs SATA) • Serial ATA is the newest commodity hard disk standard. • SATA uses serial buses as opposed to the parallel buses used by ATA and SCSI. • The cables attached to SATA drives are smaller and run faster (around 150 MB/s). • The basic disk technologies remain the same across the three buses. • The platters in a disk spin at a variety of speeds; the faster the platters spin, the faster data can be read off the disk, and data on the far side of the platter becomes available sooner. • Rotational speeds range from 5400 RPM to 15,000 RPM. • The faster the platters rotate, the lower the latency and the higher the bandwidth.

  48. Storage - RAID • RAID (Redundant Array of Inexpensive Disks) provides a mechanism by which the performance and storage properties of individual disks can be aggregated. • A group of disks appears as a single large disk, and the performance of multiple disks is better than that of single disks. • Using multiple disks also allows data to be stored in multiple places, letting the system continue functioning when a disk fails. • Both software and hardware RAID solutions are available. • Hardware solutions are more expensive, but provide better performance without CPU overhead. • Software solutions provide various levels of flexibility but have associated computational overhead.

  49. Storage - RAID Allocation • A variety of RAID allocation schemes exist: • RAID 0: • Data is striped across multiple disks. • The result of striping is a logical storage device whose capacity is the capacity of each disk times the number of disks in the array. • Both read and write performance are accelerated. • RAID 1: • Complete copies of the data are stored in multiple locations. • The capacity of such a RAID set is half of its raw capacity. • Read performance is accelerated and is comparable to RAID 0, since each byte of data can be read from multiple locations and reads can be interleaved between disks. • Writes are slowed down, as new data needs to be transmitted multiple times. • RAID 5: • Like RAID 0, data is striped across multiple disks, with the equivalent of one disk devoted to parity. • For any block of data striped across the N-1 drives, a parity checksum is computed and stored on the remaining drive. • Read performance of RAID 5 tends to be good, but write performance lags behind mirrors because of the checksum computation (see the parity sketch below). http://www.drivesolutions.com/datarecovery/raid.shtml
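A minimal sketch (not lecture code) of the parity idea behind RAID 5: the parity block is the XOR of the data blocks in a stripe, so any one lost block can be rebuilt by XOR-ing the parity with the surviving blocks, which is also why every write requires the extra checksum work.

    #include <stdio.h>
    #include <string.h>

    #define BLOCK 8   /* toy block size in bytes */

    int main(void)
    {
        unsigned char d0[BLOCK] = "block_A";   /* three data blocks of one stripe */
        unsigned char d1[BLOCK] = "block_B";
        unsigned char d2[BLOCK] = "block_C";
        unsigned char parity[BLOCK], rebuilt[BLOCK];

        for (int i = 0; i < BLOCK; i++)
            parity[i] = d0[i] ^ d1[i] ^ d2[i];          /* write path: compute parity */

        for (int i = 0; i < BLOCK; i++)
            rebuilt[i] = parity[i] ^ d0[i] ^ d2[i];     /* pretend d1's disk failed: rebuild it */

        printf("rebuilt block matches original: %s\n",
               memcmp(rebuilt, d1, BLOCK) == 0 ? "yes" : "no");
        return 0;
    }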
