Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More
Dr. Jay Boisseau, Director, Texas Advanced Computing Center
[email protected]
December 3, 2001
Texas Advanced Computing Center, The University of Texas at Austin
Outline
16 CPUs, 16 GB
20 2-way nodes
Itanium (800 MHz) processors
2 GB memory/node
72 GB disk/node
Myrinet 2000 switch
180 GB shared disk
32 2-way nodes
Pentium III (1 GHz) processors
1 GB memory/node
18.2 GB disk/node
Myrinet 2000 switch
New HPC Systems
750 GB IBM GPFS parallel file system for both clusters
Although Moore’s Law ‘predicts’ that single-processor performance doubles every 18 months, eventually physical limits on manufacturing technology will be reached.
Shared memory - single address space. All processors have access to a pool of shared memory. (Ex: SGI Origin, Sun E10000)
Distributed memory - each processor has its own local memory. Must do message passing to exchange data between processors. (Ex: CRAY T3E, IBM SP, clusters)
Uniform memory access (UMA): Each processor has uniform access to memory. Also known as symmetric multiprocessors, or SMPs (Sun E10000)
Non-uniform memory access (NUMA): Time for memory access depends on location of data. Local access is faster than non-local access. Easier to scale than SMPs (SGI Origin)
Clock period (cp): the minimum time interval between successive actions in the processor. Fixed: depends on design of processor. Measured in nanoseconds (~1-5 for fastest processors). Inverse of frequency (MHz).
Instruction: an action executed by a processor, such as a mathematical operation or a memory operation.
Register: a small, extremely fast location for storing data or instructions in the processor.
Functional Unit (FU): a hardware element that performs an operation on an operand or pair of operands. Common FUs are ADD, MULT, INV, SQRT, etc.
Pipeline: technique enabling multiple instructions to be overlapped in execution.
Superscalar: multiple instructions are possible per clock period.
Flops: floating point operations per second (see the worked example following these definitions).
Cache: fast memory (SRAM) near the processor. Helps keep instructions and data close to functional units so processor can execute more instructions more rapidly.
Translation-Lookaside Buffer (TLB): keeps addresses of pages (blocks of memory) in main memory that have recently been accessed (a cache for memory addresses).
SRAM: Static Random Access Memory (RAM). Very fast (~10 nanoseconds), made using the same kind of circuitry as the processors, so speed is comparable.
DRAM: Dynamic RAM. Longer access times (~100 nanoseconds), but hold more bits and are much less expensive (10x cheaper).
Memory hierarchy: the hierarchy of memory in a parallel system, from registers to cache to local memory to remote memory. More later.
Topology: the manner in which the nodes are connected.
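As a quick worked example combining the definitions above (numbers assumed for illustration, not from the slides): a 1 GHz processor has cp = 1 ns, since the clock period is the inverse of the frequency; if it is superscalar and completes two floating point operations per clock period, its theoretical peak performance is 2 x 10^9 flops = 2 Gflops.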
Processor-memory performance gap: grows 50% / year. (Graph from D. Patterson, CS252, Spring 1998 ©UCB)
2-way set-associative cache
Least Recently Used (LRU): Cache replacement strategy for set associative caches. The cache block that is least recently used is replaced with a new block.
Random Replace: Cache replacement strategy for set associative caches. A cache block is randomly replaced.
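For example (an illustrative access pattern, not from the slides): suppose blocks A and B occupy the two slots of a set in a 2-way set-associative cache, and A was referenced more recently than B. On the next miss that maps to that set, LRU evicts B, whereas random replacement evicts A or B with equal probability.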
Clustered SMP Diagram
Fortran 77 + OpenMP
! Serial loop:
      do i = 1, n
         x(i) = y(i) + z(i)
      end do

! The same loop with the OpenMP directive:
!$OMP PARALLEL DO
      do i = 1, n
         x(i) = y(i) + z(i)
      end do
The !$OMP PARALLEL DO directive specifies that the loop is executed in parallel. Each processor executes a subset of the loop iterations.
MPI messages are two-way: they require a send and a matching receive:
PE 0 calls MPI_SEND to pass the real variable x to PE 1.
PE 1 calls MPI_RECV to receive the real variable y from PE 0.
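A minimal Fortran 77 sketch of this exchange (the tag value, the initialization of x, and the surrounding setup are illustrative assumptions; only the MPI_SEND/MPI_RECV pair comes from the slide):

      include 'mpif.h'
      real x, y
      integer rank, ierr, status(MPI_STATUS_SIZE)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      if (rank .eq. 0) then
!        PE 0 sends the real variable x to PE 1 (tag 17 is arbitrary)
         x = 1.0
         call MPI_SEND(x, 1, MPI_REAL, 1, 17, MPI_COMM_WORLD, ierr)
      else if (rank .eq. 1) then
!        PE 1 receives the value into the real variable y from PE 0
         call MPI_RECV(y, 1, MPI_REAL, 0, 17, MPI_COMM_WORLD,
     &                 status, ierr)
      end if
      call MPI_FINALIZE(ierr)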
MPI also has global operations to broadcast and reduce (collect) information
PE 5 broadcasts the single (1) integer value n to all other processors
PE 6 collects the single (1) integer value n from all processors and puts the sum (MPI_SUM) into sum
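A matching Fortran sketch of these collectives (setup assumed; note that every PE must make both calls, and that n is assumed to have been set on PE 5 before the broadcast):

      include 'mpif.h'
      integer n, sum, ierr
!     Root PE 5 broadcasts the single integer n to all other PEs
      call MPI_BCAST(n, 1, MPI_INTEGER, 5, MPI_COMM_WORLD, ierr)
!     Every PE contributes its n; the total arrives in sum on PE 6
      call MPI_REDUCE(n, sum, 1, MPI_INTEGER, MPI_SUM, 6,
     &                MPI_COMM_WORLD, ierr)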
There are some hurdles in parallel computing:
In this case, the parallel code achieves perfect scaling, but does not match the performance of the serial code until 32 processors are used.
(Diagram: a simplified memory model)
Load Balancing
The figures below show the timeline for parallel codes run on two processors. In both cases, the total amount of work done is the same, but in the second case the work is distributed more evenly between the two processors resulting in a shorter time to solution.
Sequential: t = t(comp) + t(comm)
Overlapped: t = t(comp) + t(comm) - t(overlap), where t(overlap) is the time during which computation and communication proceed simultaneously
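One standard way to realize the overlapped case is with nonblocking MPI calls. A minimal Fortran sketch (the buffer size, tag, source rank, and the compute_interior routine are assumptions for illustration, not from the slides):

      include 'mpif.h'
      real halo(100)
      integer req, ierr, status(MPI_STATUS_SIZE)
!     Post the receive, then keep computing while the message is in flight
      call MPI_IRECV(halo, 100, MPI_REAL, 1, 99, MPI_COMM_WORLD,
     &               req, ierr)
!     compute_interior is an assumed user routine needing no halo data
      call compute_interior
!     Block only when the communicated data is actually needed
      call MPI_WAIT(req, status, ierr)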
The following examples of “phoning home” illustrate the value of combining many small messages into a single larger one.
By transmitting a single large message, I only have to pay the price for the dialing latency once. I transmit more information in less time.
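In the usual linear cost model, sending a message of m bytes takes t = L + m/B, where L is the latency and B is the bandwidth. With illustrative values L = 20 microseconds and B = 100 MB/s: one hundred 80-byte messages cost 100 x (20 + 0.8) = 2080 microseconds, while a single 8000-byte message costs 20 + 80 = 100 microseconds. The information transmitted is the same, but the latency is paid once instead of one hundred times.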
In the following example, a stencil operation is performed on a 10 x 10 array that has been distributed over two processors. Assume periodic boundary conditions.
Boundary elements - require data from the neighboring processor
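A hedged Fortran sketch of the halo exchange this stencil requires (the local array layout, tags, and neighbor ranks left and right are assumptions; with two PEs and periodic boundaries, both neighbors are simply the other PE):

      include 'mpif.h'
      real a(10, 0:6)
      integer left, right, ierr, status(MPI_STATUS_SIZE)
!     Each PE owns a 10x5 slab (columns 1-5) plus two ghost columns
!     Send my rightmost owned column; receive into my left ghost column
      call MPI_SENDRECV(a(1,5), 10, MPI_REAL, right, 1,
     &                  a(1,0), 10, MPI_REAL, left,  1,
     &                  MPI_COMM_WORLD, status, ierr)
!     Send my leftmost owned column; receive into my right ghost column
      call MPI_SENDRECV(a(1,1), 10, MPI_REAL, left,  2,
     &                  a(1,6), 10, MPI_REAL, right, 2,
     &                  MPI_COMM_WORLD, status, ierr)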
Amdahl’s Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl’s Law are given below:
t(N) = (fp/N + fs) t(1)   Effect of multiple processors on run time
S = 1/(fs + fp/N)   Effect of multiple processors on speedup
fs = serial fraction of code
fp = parallel fraction of code = 1 - fs
N = number of processors
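For example (numbers chosen purely for illustration): a code with fs = 0.05 run on N = 64 processors yields S = 1/(0.05 + 0.95/64) ≈ 15, and no matter how large N becomes, S can never exceed 1/fs = 20.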
It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.
Amdahl’s Law provides a theoretical upper limit on parallel speedup assuming that there are no costs for communications. In reality, communications (and I/O) will result in a further degradation of performance.
A good distribution if the physics of the problem is the same in both directions. Minimizes the amount of data that must be communicated between processors.
If expensive global operations need to be carried out in the x-direction (ex. FFTs), this is probably a better choice.
Imagine that we are doing a simulation in which more work is required for the grid points covering the shaded object. Neither data distribution from the previous example will result in good load balance. May need to consider an irregular grid or a different data structure.
PE 0 Code
High-granularity application
Choosing a Resource: Granularity
Granularity is a measure of the amount of work done by each processor between synchronization events.
Generally, latency is the critical parameter for low-granularity codes, while processor performance is the key factor for high-granularity applications.
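For example (illustrative numbers): a code that computes for whole seconds between synchronization points barely notices a 20-microsecond message latency, whereas a code that synchronizes every few hundred microseconds spends a large share of its time waiting on latency no matter how fast each processor is.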
Moore’s Law vs. storage improvements vs. optical improvements. Graph from Scientific American (Jan. 2001) by Cleo Vilett; source: Vinod Khosla, Kleiner Perkins Caufield & Byers.
Selected Major Grid Projects
There are also many technology R&D projects: e.g., Globus, Condor, NetSolve, Ninf, NWS, etc.
Authorization & policy
Remote data access
High-speed data transfer
Accounting & payment
Etc.
Some Grid Requirements – Systems/Deployment Perspective
“The thing about change is that things will be different afterwards.”
Alan McMahon (Cornell University)