Overview of Parallel Architecture and Programming Models
What is a Parallel Computer? • A collection of processing elements that cooperate to solve large problems fast • Some broad issues that distinguish parallel computers: • Resource Allocation: • how large a collection? • how powerful are the elements? • how much memory? • Data access, Communication and Synchronization • how do the elements cooperate and communicate? • how are data transmitted between processors? • what are the abstractions and primitives for cooperation? • Performance and Scalability • how does it all translate into performance? • how does it scale?
Why Parallelism? • Provides an alternative to a faster clock for performance • Assuming a doubling of effective per-node performance every 2 years, a 1024-CPU system can deliver today the performance it would take a single-CPU system 20 years to reach (checked in the sketch below) • Applies at all levels of system design • Is increasingly central in information processing • Scientific computing: simulation, data analysis, data storage and management, etc. • Commercial computing: transaction processing, databases • Internet applications: search; Google operates at least 50,000 CPUs, many as part of large parallel systems
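The 1024-CPU claim is just exponent arithmetic: 1024 = 2^10, and ten 2-year doublings take 20 years. A minimal C sketch of that check (the doubling period and CPU count come straight from the bullet above):

```c
#include <stdio.h>

/* If effective per-node performance doubles every 2 years, a machine
 * with p identical CPUs delivers today what a single CPU would reach
 * only after log2(p) doublings, i.e. 2*log2(p) years. */
int main(void) {
    int p = 1024;
    int doublings = 0;
    while ((1 << doublings) < p)
        doublings++;                          /* log2(1024) = 10 */
    printf("%d CPUs ~ %d years of single-CPU progress\n",
           p, 2 * doublings);                 /* 1024 CPUs ~ 20 years */
    return 0;
}
```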
How to Study Parallel Systems • History: diverse and innovative organizational structures, often tied to novel programming models • Rapidly matured under strong technological constraints • The microprocessor is ubiquitous • Laptops and supercomputers are fundamentally similar! • Technological trends cause diverse approaches to converge • Technological trends make parallel computing inevitable • In the mainstream • Need to understand fundamental principles and design tradeoffs, not just taxonomies • Naming, Ordering, Replication, Communication performance
Outline • Drivers of Parallel Computing • Trends in “Supercomputers” for Scientific Computing • Evolution and Convergence of Parallel Architectures • Fundamental Issues in Programming Models and Architecture
Drivers of Parallel Computing • Application Needs: our insatiable need for computing cycles • Scientific computing: CFD, Biology, Chemistry, Physics, ... • General-purpose computing: Video, Graphics, CAD, Databases, TP... • Internet applications: Search, e-Commerce, Clustering ... • Technology Trends • Architecture Trends • Economics • Current trends: • All microprocessors have support for external multiprocessing • Servers and workstations are MP: Sun, SGI, Dell, COMPAQ... • Microprocessors are multiprocessors. Multicore: SMP on a chip
Application Trends • Demand for cycles fuels advances in hardware, and vice versa • This cycle drives exponential increase in microprocessor performance • Drives parallel architecture harder: most demanding applications • Range of performance demands • Need range of system performance with progressively increasing cost • Platform pyramid • Goal of applications in using parallel machines: Speedup • Speedup(p processors) = Performance(p processors) / Performance(1 processor) • For a fixed problem size (input data set), performance = 1/time • Speedup_fixed-problem(p processors) = Time(1 processor) / Time(p processors)
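A minimal C sketch of the fixed-problem speedup definition above; the run times are hypothetical, chosen only to show the calculation:

```c
#include <stdio.h>

/* Fixed-problem-size speedup as defined on this slide:
 * speedup(p) = Time(1 processor) / Time(p processors). */
double speedup(double t1, double tp) {
    return t1 / tp;
}

int main(void) {
    double t1 = 120.0;  /* hypothetical 1-processor run time, seconds */
    double t8 = 18.0;   /* hypothetical 8-processor run time, seconds */
    printf("speedup on 8 processors: %.2fx (efficiency %.0f%%)\n",
           speedup(t1, t8), 100.0 * speedup(t1, t8) / 8.0);
    return 0;
}
```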
Scientific Computing Demand • Ever-increasing demand due to the need for more accuracy, higher-level modeling and knowledge, and analysis of exploding amounts of data • Example area 1: Climate and Ecological Modeling goals • By 2010 or so: • Simply improving resolution, simulated time, and physics increases requirements by factors of 10^4 to 10^7. Then … • Reliable prediction of global warming, natural disasters, and weather • By 2015 or so: • Predictive models of rainforest destruction, forest sustainability, effects of climate change on ecosystems and on food webs, global health trends • By 2020 or so: • Verifiable global ecosystem and epidemic models • Integration of macro-effects with localized and then micro-effects • Predictive effects of human activities on earth’s life support systems • Understanding earth’s life support systems
Scientific Computing Demand • Example area 2: Biology goals • By 2010 or so: • Ex vivo and then in vivo molecular-computer diagnosis • By 2015 or so: • Modeling-based vaccines • Individualized medicine • Comprehensive biological data integration (most data co-analyzable) • Full model of a single cell • By 2020 or so: • Full model of a multi-cellular tissue/organism • Drugs developed purely in silico; personalized smart drugs • Understanding complex biological systems: cells and organisms to ecosystems • Verifiable predictive models of biological systems
Engineering Computing Demand • Large parallel machines a mainstay in many industries • Petroleum (reservoir analysis) • Automotive (crash simulation, drag analysis, combustion efficiency) • Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism) • Computer-aided design • Pharmaceuticals (molecular modeling) • Visualization • in all of the above • entertainment (movies), architecture (walk-throughs, rendering) • Financial modeling (yield and derivative analysis) • etc.
Learning Curve for Parallel Applications • AMBER molecular dynamics simulation program • Starting point was vector code for the Cray-1 • 145 MFLOPS on a Cray C90; 406 MFLOPS for the final version on a 128-processor Paragon; 891 MFLOPS on a 128-processor Cray T3D
Commercial Computing • Also relies on parallelism for the high end • Scale not so large, but use much more widespread • Computational power determines the scale of business that can be handled • Databases, online transaction processing, decision support, data mining, data warehousing ... • TPC benchmarks (TPC-C order entry, TPC-D decision support) • Explicit scaling criteria provided: size of enterprise scales with system • Problem size no longer fixed as p increases, so throughput is used as a performance measure (transactions per minute, or tpm) • E-commerce, search and other scalable internet services • Parallel applications running on clusters • Developing new parallel software models and primitives • Insight from automated analysis of large disparate data
TPC-C Results for Wintel Systems • Parallelism is pervasive • Small- to moderate-scale parallelism very important • Difficult to obtain a snapshot to compare across vendor platforms

[Chart: TPC-C performance (tpmC, left axis) and price-performance ($/tpmC, right axis) for Wintel systems, 1996-2002.] Data points:
1996: 4-way Compaq ProLiant 5000, Pentium Pro 200 MHz: 6,751 tpmC, $89.62/tpmC, avail. 12-1-96, TPC-C v3.2 (withdrawn)
1997: 6-way Unisys AQ HS6, Pentium Pro 200 MHz: 12,026 tpmC, $39.38/tpmC, avail. 11-30-97, TPC-C v3.3 (withdrawn)
1998: 4-way IBM NF 7000, PII Xeon 400 MHz: 18,893 tpmC, $29.09/tpmC, avail. 12-29-98, TPC-C v3.3 (withdrawn)
1999: 8-way Compaq ProLiant 8500, PIII Xeon 550 MHz: 40,369 tpmC, $18.46/tpmC, avail. 12-31-99, TPC-C v3.5 (withdrawn)
2000: 8-way Dell PE 8450, PIII Xeon 700 MHz: 57,015 tpmC, $14.99/tpmC, avail. 1-15-01, TPC-C v3.5 (withdrawn)
2001: 32-way Unisys ES7000, PIII Xeon 900 MHz: 165,218 tpmC, $21.33/tpmC, avail. 3-10-02, TPC-C v5.0
2002: 32-way Unisys ES7000, Xeon MP 2 GHz: 234,325 tpmC, $11.59/tpmC, avail. 3-31-03, TPC-C v5.0
2002: 32-way NEC Express5800, Itanium 2 1 GHz: 342,746 tpmC, $12.86/tpmC, avail. 3-31-03, TPC-C v5.0
Summary of Application Trends • Transition to parallel computing has occurred for scientific and engineering computing • Also occurred in commercial computing • Database and transaction processing as well as financial • Scalable internet services (at least coarse-grained parallelism) • Desktop also uses multithreaded programs, which are a lot like parallel programs • Demand for improving throughput on sequential workloads • Greatest use of small-scale multiprocessors • Solid application demand that keeps increasing with time • Key challenge throughout is making parallel programming easier • Taking advantage of pervasive parallelism with multi-core systems
Drivers of Parallel Computing • Application Needs • Technology Trends • Architecture Trends • Economics
Technology Trends: Rise of the Micro The natural building block for multiprocessors is now also about the fastest!
General Technology Trends • Microprocessor performance increases 50%-100% per year • Clock frequency doubles every 3 years • Transistor count quadruples every 3 years • Moore’s law: transistors per chip = 1.59^(year-1959) (originally 2^(year-1959)); see the sketch below • Huge investment per generation is carried by huge commodity market • With every feature-size scaling by a factor of n: • we get O(n^2) transistors • we get O(n) increase in possible clock frequency • we should get O(n^3) increase in processor performance • Do we? • See architecture trends
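A small C sketch that just evaluates the two exponential fits quoted above (1.59^(year-1959) versus the original 2^(year-1959)); it makes no claim beyond the slide's own formula:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* Moore's-law fits from this slide: transistors per chip. */
    for (int year = 1971; year <= 2003; year += 8) {
        double fit      = pow(1.59, year - 1959);  /* empirical fit */
        double original = pow(2.0,  year - 1959);  /* 1965 prediction */
        printf("%d: fit ~%.2e transistors, original ~%.2e\n",
               year, fit, original);
    }
    return 0;  /* link with -lm on most Unix toolchains */
}
```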
Die and Feature Size Scaling • Die Size growing at 7% per year; feature size shrinking 25-30%
Clock Frequency Growth Rate (Intel family) • 30% per year
Transistor Count Growth Rate (Intel family) • Transistor count grows much faster than clock rate • 40% per year, an order of magnitude more contribution over 2 decades • Width/space has greater potential than per-unit speed
How to Use More Transistors • Improve single-threaded performance via architecture: • Not keeping up with the potential given by technology (next) • Use transistors for memory structures to improve data locality • Doesn’t give as high returns (2x for 4x cache size, up to a point) • Use parallelism • Instruction-level • Thread-level (see the sketch below) • Bottom line: not that single-threaded performance has plateaued, but parallelism is the natural way to stay on a better curve
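As a concrete picture of thread-level parallelism, here is a minimal C/OpenMP sketch (mine, not the slides'; assumes a compiler with OpenMP support, e.g. cc -fopenmp): each thread sums a chunk of an array and the reduction combines the partial sums.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        a[i] = 1.0 / (i + 1);

    /* Thread-level parallelism: iterations are divided among cores;
     * reduction(+:sum) merges each thread's private partial sum. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```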
Similar Story for Storage • Divergence between memory capacity and speed more pronounced • Capacity increased by 1000x from 1980-95, and increases 50% per yr • Latency reduces only 3% per year (only 2x from 1980-95) • Bandwidth per memory chip increases 2x as fast as latency reduces • Larger memories are slower, while processors get faster • Need to transfer more data in parallel • Need deeper cache hierarchies • How to organize caches?
Similar Story for Storage • Parallelism increases effective size of each level of hierarchy, without increasing access time • Parallelism and locality within memory systems too • New designs fetch many bits within memory chip; follow with fast pipelined transfer across narrower interface • Buffer caches most recently accessed data • Disks too: Parallel disks plus caching • Overall, dramatic growth of processor speed, storage capacity and bandwidths relative to latency (especially) and clock speed point toward parallelism as the desirable architectural direction
Drivers of Parallel Computing • Application Needs • Technology Trends • Architecture Trends • Economics
Architectural Trends • Architecture translates technology’s gifts to performance and capability • Resolves the tradeoff between parallelism and locality • Recent microprocessors: 1/3 compute, 1/3 cache, 1/3 off-chip connect • Tradeoffs may change with scale and technology advances • Four generations of architectural history: tube, transistor, IC, VLSI • Here focus only on VLSI generation • Greatest delineation in VLSI has been in scale and type of parallelism exploited
Architectural Trends in Parallelism • Up to 1985: bit-level parallelism (illustrated in the sketch below): 4-bit -> 8-bit -> 16-bit • slows after 32-bit • adoption of 64-bit well under way, 128-bit is far off (not a performance issue) • great inflection point when 32-bit micro and cache fit on a chip • Basic pipelining and hardware support for complex operations like FP multiply led to O(n^3) growth in performance • Intel: 4004 to 386
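A rough C illustration of bit-level parallelism (not from the slides): counting set bits 8 bits at a time versus 64 bits at a time. The wide version covers the same data in one-eighth the operations; __builtin_popcountll is a GCC/Clang builtin.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define NWORDS 512                      /* 4 KB buffer */

/* Narrow datapath: walk the buffer one byte at a time. */
static int popcount_bytes(const uint8_t *p, size_t nbytes) {
    int c = 0;
    for (size_t i = 0; i < nbytes; i++)
        for (uint8_t b = p[i]; b; b >>= 1)
            c += b & 1;
    return c;
}

/* Wide datapath: one 64-bit operation covers 8 bytes at once. */
static int popcount_words(const uint64_t *p, size_t nwords) {
    int c = 0;
    for (size_t i = 0; i < nwords; i++)
        c += __builtin_popcountll(p[i]);
    return c;
}

int main(void) {
    static uint64_t buf[NWORDS];
    uint8_t *bytes = (uint8_t *)buf;    /* byte view of the same data */
    for (size_t i = 0; i < NWORDS * 8; i++)
        bytes[i] = (uint8_t)i;
    printf("%d == %d\n",
           popcount_bytes(bytes, NWORDS * 8),
           popcount_words(buf, NWORDS));
    return 0;
}
```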
Architectural Trends in Parallelism • Mid 80s to mid 90s: instruction-level parallelism • Pipelining and simple instruction sets, plus compiler advances (RISC) • Larger on-chip caches • But quadrupling cache size only halves the miss rate • More functional units => superscalar execution • But limited performance scaling • N^2 growth in performance • Intel: 486 to Pentium III/IV
Architectural Trends in Parallelism • After mid-90s • Greater sophistication: out of order execution, speculation, prediction • to deal with control transfer and latency problems • Very wide issue processors • Don’t help many applications very much • Need multiple threads (SMT) to exploit • Increased complexity and size leads to slowdown • Long global wires • Increased access times to data • Time to market • Next step: thread level parallelism
Can Instruction-Level Parallelism Get Us There? • Reported speedups for superscalar processors: • Horst, Harris, and Jardine [1990]: 1.37 • Wang and Wu [1988]: 1.70 • Smith, Johnson, and Horowitz [1989]: 2.30 • Murakami et al. [1989]: 2.55 • Chang et al. [1991]: 2.90 • Jouppi and Wall [1989]: 3.20 • Lee, Kwok, and Briggs [1991]: 3.50 • Wall [1991]: 5 • Melvin and Patt [1991]: 8 • Butler et al. [1991]: 17+ • Large variance due to differences in • application domain investigated (numerical versus non-numerical) • capabilities of the processor modeled
ILP Ideal Potential • Infinite resources and fetch bandwidth, perfect branch prediction and renaming • real caches and non-zero miss latencies
Results of ILP Studies • Concentrate on parallelism for 4-issue machines • Realistic studies show only 2-fold speedup • More recent work examines ILP that looks across threads for parallelism
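One way to see why realistic ILP tops out near the 2x figure above: a serial dependence chain gives the hardware nothing to overlap, while independent operations do. A minimal C sketch (illustrative, not from the studies cited):

```c
#include <stdio.h>

#define N 1000000

/* Dependence chain: each add needs the previous result, so a
 * superscalar core cannot issue these additions in parallel. */
static double chained(const double *x) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += x[i];
    return s;
}

/* Four independent accumulators expose ILP: several adds can be
 * in flight at once on a multi-issue core. */
static double independent(const double *x) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < N; i += 4) {
        s0 += x[i];      s1 += x[i + 1];
        s2 += x[i + 2];  s3 += x[i + 3];
    }
    return s0 + s1 + s2 + s3;
}

int main(void) {
    static double x[N];
    for (int i = 0; i < N; i++)
        x[i] = 1.0;
    printf("%f %f\n", chained(x), independent(x));  /* same sum */
    return 0;
}
```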
Architectural Trends: Bus-based MPs • Micro on a chip makes it natural to connect many to shared memory • dominates server and enterprise market, moving down to desktop • Faster processors began to saturate bus, then bus technology advanced • today, range of sizes for bus-based systems, desktop to large servers No. of processors in fully configured commercial shared-memory systems
Do Buses Scale? • Buses are a convenient way to extend architecture to parallelism, but they do not scale • bandwidth doesn’t grow as CPUs are added • Scalable systems use physically distributed memory
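The scaling problem is simple division: total bus bandwidth is fixed, so each added CPU shrinks everyone's share. A tiny C sketch with an illustrative (made-up) bandwidth figure:

```c
#include <stdio.h>

int main(void) {
    double bus_bw = 3200.0;  /* hypothetical shared-bus bandwidth, MB/s */
    /* Per-CPU share falls as 1/p as processors are added. */
    for (int p = 1; p <= 64; p *= 2)
        printf("%2d CPUs: %7.1f MB/s each\n", p, bus_bw / p);
    return 0;
}
```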
Drivers of Parallel Computing • Application Needs • Technology Trends • Architecture Trends • Economics
Finally, Economics • Fabrication cost roughly O(1/feature-size) • 90nm fabs cost about 1-2 billion dollars • So fabrication of processors is expensive • Number of designers also O(1/feature-size) • 10 micron 4004 processor had 3 designers • Recent 90 nm processors had 300 • New designs very expensive • Push toward consolidation of processor types • Processor complexity increasingly expensive • Cores reused, but tweaks expensive too
Design Complexity and Productivity • Design complexity outstrips human productivity
Economics • Commodity microprocessors are not only fast but CHEAP • Development cost is tens of millions of dollars • BUT, many more are sold compared to supercomputers • Crucial to take advantage of the investment, and use the commodity building block • Exotic parallel architectures fare no better than other special-purpose hardware • Multiprocessors being pushed by software vendors (e.g. database) as well as hardware vendors • Standardization by Intel makes small, bus-based SMPs a commodity • What about on-chip processor design?
What’s on a processing chip? • Recap: • Number of transistors growing fast • Methods to use for single-thread performance running out of steam • Memory issues argue for parallelism too • Instruction-level parallelism limited, need thread-level • Consolidation is a powerful force • All seems to point to many simpler cores rather than single bigger complex core • Additional key arguments: wires, power, cost
Wire Delay • Gate delay shrinks, global interconnect delay grows: short local wires
Power • Power dissipation in Intel processors over time
Power • Power grows with the number of transistors and with clock frequency • Power grows with the square of supply voltage: P = C·V^2·f • Going from 12 V to 1.1 V reduced power consumption by 120x over 20 years • Voltage is projected to drop only to 0.7 V by 2018, so only another ~2.5x (both ratios are checked in the sketch below) • Power per chip is peaking in designs: • Itanium II was 130 W, Montecito 100 W • Power is a first-class design constraint • Circuit-level power techniques are quite far along: • clock gating, multiple thresholds, sleep transistors
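A quick C check of the two voltage-scaling ratios quoted above, using P = C·V^2·f with C and f held fixed:

```c
#include <stdio.h>

int main(void) {
    /* Power scales with V^2 at fixed capacitance and frequency. */
    double past = (12.0 * 12.0) / (1.1 * 1.1);  /* 12 V -> 1.1 V */
    double next = (1.1 * 1.1) / (0.7 * 0.7);    /* 1.1 V -> 0.7 V */
    printf("12 V -> 1.1 V: %.0fx less power\n", past);  /* ~119x */
    printf("1.1 V -> 0.7 V: only %.1fx more\n", next);  /* ~2.5x */
    return 0;
}
```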
Power versus Clock Frequency • Two processor generations; two feature sizes
Architectural Implication of Power • Fewer transistors per core is a lot more power-efficient • Narrower issue, shorter pipelines, smaller OOO window • Gets per-processor performance back on the O(n^3) curve • But lower single-thread performance • What complexity to eliminate? • Speculation, multithreading, … ? • All good for some things, but need to be careful about power/benefit
ITRS Projections (contd.) • The number of processors per chip will grow faster than individual processor performance