Supercomputing in Plain English Overview: What the Heck is Supercomputing?

Supercomputingin Plain EnglishOverview:What the Heck is Supercomputing?

Outline • What is a supercomputer/supercomputing? • Van Neumann architecture vs. Flynn’s taxonomy • What is memory hierarchy all about? • What is parallel programming/computing? • Serial vs. parallel • Real world examples • Why parallel computing

What is Supercomputing? Supercomputing is the biggest, fastest computingright this minute. Likewise, a supercomputeris one of the biggest, fastest computers right this minute. So, the definition of supercomputing is constantly changing. Rule of Thumb: A supercomputer is typically at least 100 times as powerful as a PC. Jargon: Supercomputing is also known as High Performance Computing(HPC) orHigh End Computing (HEC) or Cyberinfrastructure(CI).

What is Supercomputing About? Size Speed Laptop

What is Supercomputing About? Size: Many problems that are interesting to scientists and engineers can’t fit on a PC – usually because they need more than a few GB of RAM, or more than a few 100 GB of disk. Speed: Many problems that are interesting to scientists and engineers would take a very very long time to run on a PC: months or even years. But a problem that would take a month on a PC might take only a few hours on a supercomputer.

Moore, OK Tornadic Storm What Is HPC Used For? [1] May 3 1999[2] [3] • Simulation of physical phenomena, such as • Weather forecasting • Galaxy formation • Oil reservoir management • Data mining: finding needles of information in a haystack of data, such as • Gene sequencing • Signal processing • Detecting storms that might produce tornados • Visualization: turning a vast sea of data into pictures that a scientist can understand

Parallelism

Parallelism Parallelism means doing multiple things at the same time: you can get more work done in the same time. Less fish … More fish!

What Is Parallelism? Parallelism is the use of multiple processing units – either processors or parts of an individual processor – to solve a problem, and in particular the use of multiple processing units operating concurrently on different parts of a problem. The different parts could be different tasks, or the same task on different pieces of the problem’s data.

Common Kinds of Parallelism • Instruction Level Parallelism • Shared Memory Multithreading (for example, OpenMP) • Distributed Multiprocessing (for example, MPI) • GPU Parallelism (for example, CUDA) • Hybrid Parallelism • Distributed + Shared (for example, MPI + OpenMP) • Shared + GPU (for example, OpenMP + CUDA) • Distributed + GPU (for example, MPI + CUDA)

Why Parallelism Is Good The Trees: We like parallelism because, as the number of processing units working on a problem grows, we can solve the same problem in less time. The Forest: We like parallelism because, as the number of processing units working on a problem grows, we can solve bigger problems.

Parallelism Jargon Threads are execution sequences that share a single memory area (“address space”) Processes are execution sequences with their own independent, private memory areas … and thus: Multithreading: parallelism via multiple threads Multiprocessing: parallelism via multiple processes Generally: Shared Memory Parallelism is concerned with threads, and Distributed Parallelism is concerned with processes.

Jargon Alert! In principle: “shared memory parallelism”  “multithreading” “distributed parallelism”  “multiprocessing” In practice, sadly, the following terms are often used interchangeably: Parallelism Concurrency (not as popular these days) Multithreading Multiprocessing Typically, you have to figure out what is meant based on the context.

Supercomputing Issues The tyranny of the storage hierarchy Parallelism: doing multiple things at the same time

What is a Cluster? “… [W]hat a ship is … It's not just a keel and hull and a deck and sails. That's what a ship needs. But what a ship is ... is freedom.” – Captain Jack Sparrow “Pirates of the Caribbean”

What a Cluster is …. A cluster needs of a collection of small computers, called nodes, hooked together by an interconnection network (or interconnect for short). It also needs software that allows the nodes to communicate over the interconnect. But what a cluster is … is all of these components working together as if they’re one big computer ... a super computer.

An Actual Cluster Interconnect Nodes

A Quick Primeron Hardware

Typical Computer Hardware Central Processing Unit Primary storage Secondary storage Input devices Output devices

Central Processing Unit Also called CPU or processor: the “brain” Components Control Unit: figures out what to do next – for example, whether to load data from memory, or to add two values together, or to store data into memory, or to decide which of two possible actions to perform (branching) Arithmetic/Logic Unit: performs calculations – for example, adding, multiplying, checking whether two values are equal Registers: where data reside that are being used right now

Memory HierarchyMobeen LudinHenry Neeman

Types of Memory • Volatile/Power-On Memory • Random-Access Memory (RAM) • Static RAM (SRAM) • Dynamic RAM (DRAM) • Nonvolatile/Power-Off Memory • Variety of Nonvolatile Memory exits • Read only Memory (ROM) • Programmable ROM (PROM) • Erasable Programmable ROM(EPROM) • Electrically EPROM • Cell phones, SSD, Flash Memory, PS3, XBOX, etc…

Random-Access Memory (RAM) • Key features • RAM is packaged as a chip • Basic storage unit is a cell (one bit per cell) • Multiple RAM chips form a memory • Static RAM (SRAM) • Each cell stores bit with a six-transistor circuit • Retains value indefinitely, as long as it is kept powered • Relatively insensitive to disturbances such as electrical noise • Faster and more expensive than DRAM • Dynamic RAM (DRAM) • Each cell stores bit with a capacitor and transistor • Value must be refreshed every 10-100 ms • Sensitive to disturbances • Slower and cheaper than SRAM

Fundamentals of Hardware & Software • Even a sophisticated processor may perform well bellow the ordinary processor: • Unless supported by matching performance by the memory system • Programmers demand an infinite amount of fast memory • Fast storage technologies cost more per byte and have less capacity • Gap between CPU and main memory speed is widening • Well-written programs tend to exhibit good locality • These fundamental properties complement each other beautifully • They suggest an approach for organizing memory and storage systems known as a memory hierarchy

Defining Performance • What does it mean to say x is faster than y? • Supersonic Jet, fastest in the world as of 2013 • 2,485 miles on hour • 1-2 passengers • Boeing 777 • 440 max passengers • 559 miles per hour • Latency vs. Bandwidth Fast, fewer capacity, expensive Slow, large capacity, cheaper

L1 cache holds cache lines retrieved from the L2 cache memory L2 cache holds cache lines retrieved from main memory Main memory holds disk blocks retrieved from local disks Local disks hold files retrieved from disks on remote network servers An Example of Memory Hierarchy Smaller, faster, and costlier (per byte) storage devices L0: registers CPU registers hold words retrieved from L1 cache on-chip L1 cache (SRAM) L1: On/off-chip L2 cache (SRAM) L2: main memory (DRAM) L3: Larger, slower, and cheaper (per byte) storage devices local secondary storage (local disks) L4: remote secondary storage (distributed file systems, Web servers) L5:

Registers [25]

What Are Registers? CPU Registers Control Unit Arithmetic/Logic Unit Fetch Next Instruction Add Sub Integer Fetch Data Store Data Mult Div Increment Instruction Ptr … Floating Point And Or Execute Instruction … Not … … Registers are memory-like locations inside the Central Processing Unit that hold data that are being used right nowin operations.

How Registers Are Used operand Register Ri result Register Rk operand Register Rj Operation circuitry addend in R0 5 ADD Example: 12 sum in R2 7 augend in R1 Every arithmetic or logical operation has one or more operands and one result. Operands are contained in source registers. A “black box” of circuits performs the operation. The result goes into a destination register.

Cache [4]

What is Cache? A special kind of memory where data reside that areabout to be used or havejust been used. Very fast => very expensive => very small (typically 100 to 10,000 times as expensive as RAM per byte) Data in cache can be loaded into or stored from registers at speeds comparable to the speed of performing computations. Acts as staging area for subset of data in a larger, slower device Data that are not in cache (but that are in Main Memory) take much longer to load or store. Cache is near the CPU: either inside the CPU or on the motherboard that the CPU sits on.

Smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1 8 Level k: 9 14 3 Data is copied between levels in block-sized transfer units Caching in a Memory Hierarchy 4 10 10 4 0 1 2 3 Larger, slower, cheaper storage device at level k+1 is partitioned into blocks. 4 4 5 6 7 Level k+1: 8 9 10 10 11 12 13 14 15

Main Memory [13]

What is Main Memory? Where data reside for a program that is currently running Sometimes called RAM (Random Access Memory): you can load from or store into any main memory location at any time Sometimes called core (from magnetic “cores” that some memories used, many years ago) Much slower => much cheaper => much bigger

What Main Memory Looks Like … 0 3 4 8 1 2 5 6 7 9 10 536,870,911 You can think of main memory as a big long 1D array of bytes.

The Relationship BetweenMain Memory & Cache

RAM is Slow CPU 307 GB/sec[6] The speed of data transfer between Main Memory and the CPU is much slower than the speed of calculating, so the CPU spends most of its time waiting for data to come in or go out. Bottleneck 4.4 GB/sec[7] (1.4%)

Why Have Cache? CPU Cache is much closer to the speed of the CPU, so the CPU doesn’t have to wait nearly as long for stuff that’s already in cache: it can do more operations per second! 27 GB/sec (9%)[7] 4.4 GB/sec[7](1%)

How Cache Works When you request data from a particular address in Main Memory, here’s what happens: The hardware checks whether the data for that address is already in cache. If so, it uses it. Otherwise, it loads from Main Memory the entire cache line that contains the address. For example, on a 1.83 GHz Pentium4 Core Duo (Yonah), a cache miss makes the program stall (wait) at least 48 cycles (26.2 nanoseconds) for the next cache line to load – time that could have been spent performing up to 192 calculations! [26]

Memory Read Transaction (1) • CPU places address A on memory bus Registers CPU chip ALU Load operation:movl A, R2 R2 L1 L2 memory bus L3 system bus main memory I/O bridge 0 A bus interface A x

Memory Read Transaction (2) • Main memory reads A from memory bus, retrieves word x, and places it on bus Registers ALU Load operation:movl A, R2 R2 L1 L2 L3 main memory I/O bridge 0 x bus interface A x

Memory Read Transaction (3) • CPU reads word x from bus and copies it into register R2 Registers ALU Load operation:movl A, R2 R2 x L1 L2 L3 main memory I/O bridge 0 bus interface A x

Hard Disk

Why Is Hard Disk Slow? • Your hard disk is much much slower than main memory (factor of 10-1000). Why? • Well, accessing data on the hard disk involves physically moving: • the disk platter • the read/write head • In other words, hard disk is slow because objects move much slower than electrons: Newtonian speeds are much slower than Einsteinian speeds.

Disk Geometry • Disks consist of platters, each with two surfaces • Each surface consists of concentric rings called tracks • Each track consists of sectors separated by gaps tracks surface track k gaps spindle sectors

Disk Access Time • Average time to access some target sector approximated by : • Taccess = Tavg seek + Tavg rotation + Tavg transfer • Seek time (Tavg seek) • Time to position heads over cylinder containing target sector • Typical Tavg seek = 9 ms • Rotational latency (Tavg rotation) • Time waiting for first bit of target sector to pass under r/w head • Tavg rotation = 1/2 x 1/RPMs x 60 sec/1 min • Transfer time (Tavg transfer) • Time to read the bits in the target sector. • Tavg transfer = 1/RPM x 1/(avg # sectors/track) x 60 secs/1 min

Storage Use Strategies Register reuse: do a lot of work on the same data before working on new data. Cache reuse: the program is much more efficient if all of the data and instructions fit in cache; if not, try to use what’s in cache a lot before using anything that isn’t in cache (e.g., tiling). Data locality:try to access data that are near each other in memory before data that are far. I/O efficiency:do a bunch of I/O all at once rather than a little bit at a time; don’t mix calculations and I/O.

Parallelism

Parallelism Parallelism means doing multiple things at the same time: you can get more work done in the same time. Less fish … More fish!

What Is Parallelism? Parallelism is the use of multiple processing units – either processors or parts of an individual processor – to solve a problem, and in particular the use of multiple processing units operating concurrently on different parts of a problem. The different parts could be different tasks, or the same task on different pieces of the problem’s data.

Supercomputing in Plain English Overview: What the Heck is Supercomputing?