
  1. High-Level Programming Models for Clusters: Issues and Challenges
Hans P. Zima
Institute of Scientific Computing, University of Vienna, Austria
and Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA
CCGSC 2006, Flat Rock, NC, September 13, 2006

  2. Contents • Introduction • Issues and Challenges for HPC Languages • From HPF to High Productivity Languages • A Short Overview of Chapel • Future Research • Conclusion

  3. Abstraction in Programming Programming models and languages bridge the gap between “reality” and hardware – at different levels of abstraction - e.g., • assembly languages • general-purpose procedural languages • functional languages • very high-level domain-specific languages • libraries Abstraction implies loss of information – gain in simplicity, clarity, verifiability, portability versus potential performance degradation

  4. The Emergence of High-Level Sequential Languages The designers of the very first high-level programming language were aware that their success depended on acceptable performance of the generated target programs: John Backus (1957): “… It was our belief that if FORTRAN … were to translate any reasonable scientific source program into an object program only half as fast as its hand-coded counterpart, then acceptance of our system would be in serious danger …” High-level algorithmic languages became generally accepted standards for sequential programming since their advantages outweighed any performance drawbacks. For parallel programming, no similar development took place

  5. HPC Programming Paradigm: State of the Art • Current HPC programming is dominated by the use of a standard language (Fortran, C/C++), combined with message-passing (MPI) • MPI has made a tremendous contribution to the field, providing a portable standard accessible everywhere BUT: • there is a wide gap between the domain of the scientist and this programming model • conceptually simple problems (e.g., stencil computations) can result in very complex programs • conceptually simple changes (like replacing a block data distribution with a cyclic distribution) are not easy to handle • exploiting performance may require “heroic” programmer effort
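The "conceptually simple change" mentioned above, swapping a block distribution for a cyclic one, can be sketched in plain Python. These helper names are ours, not part of MPI or any HPF runtime; the point is that the index-to-processor mapping is a one-line change at this level of abstraction, while in hand-written message-passing code it forces a rewrite of all packing and addressing logic.

```python
def block_owner(i: int, n: int, p: int) -> int:
    """Owner of global index i (0-based) under a block distribution of
    n elements over p processors; the first processors get the larger blocks."""
    base, extra = divmod(n, p)
    cut = extra * (base + 1)          # first `extra` procs own (base+1) elements
    if i < cut:
        return i // (base + 1)
    return extra + (i - cut) // base

def cyclic_owner(i: int, p: int) -> int:
    """Owner of global index i under a cyclic (round-robin) distribution."""
    return i % p

# 8 elements over 4 processors:
block_map  = [block_owner(i, 8, 4) for i in range(8)]
cyclic_map = [cyclic_owner(i, 4) for i in range(8)]
```

Switching an array from block to cyclic is just a change of `owner` function here; a compiler working from a distribution declaration can regenerate the communication accordingly.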

  6. Fortran+MPI Communication for 3D Stencil in NAS MG (Source: Brad Chamberlain, Cray Inc.)

      subroutine comm3(u,n1,n2,n3,kk)
      use caf_intrinsics
      implicit none
      include 'cafnpb.h'
      include 'globals.h'
      integer n1, n2, n3, kk
      double precision u(n1,n2,n3)
      integer axis
      if( .not. dead(kk) )then
        do axis = 1, 3
          if( nprocs .ne. 1) then
            call sync_all()
            call give3( axis, +1, u, n1, n2, n3, kk )
            call give3( axis, -1, u, n1, n2, n3, kk )
            call sync_all()
            call take3( axis, -1, u, n1, n2, n3 )
            call take3( axis, +1, u, n1, n2, n3 )
          else
            call comm1p( axis, u, n1, n2, n3, kk )
          endif
        enddo
      else
        do axis = 1, 3
          call sync_all()
          call sync_all()
        enddo
        call zero3(u,n1,n2,n3)
      endif
      return
      end

      subroutine give3( axis, dir, u, n1, n2, n3, k )
      use caf_intrinsics
      implicit none
      include 'cafnpb.h'
      include 'globals.h'
      integer axis, dir, n1, n2, n3, k, ierr
      double precision u( n1, n2, n3 )
      integer i3, i2, i1, buff_len, buff_id
      buff_id = 2 + dir
      buff_len = 0
      if( axis .eq. 1 )then
        if( dir .eq. -1 )then
          do i3=2,n3-1
            do i2=2,n2-1
              buff_len = buff_len + 1
              buff(buff_len,buff_id) = u( 2, i2, i3)
            enddo
          enddo
          buff(1:buff_len,buff_id+1)[nbr(axis,dir,k)] =
     >      buff(1:buff_len,buff_id)
        else if( dir .eq. +1 ) then
          do i3=2,n3-1
            do i2=2,n2-1
              buff_len = buff_len + 1
              buff(buff_len,buff_id) = u( n1-1, i2, i3)
            enddo
          enddo
          buff(1:buff_len,buff_id+1)[nbr(axis,dir,k)] =
     >      buff(1:buff_len,buff_id)
        endif
      endif
      if( axis .eq. 2 )then
        if( dir .eq. -1 )then
          do i3=2,n3-1
            do i1=1,n1
              buff_len = buff_len + 1
              buff(buff_len,buff_id) = u( i1, 2, i3)
            enddo
          enddo
          buff(1:buff_len,buff_id+1)[nbr(axis,dir,k)] =
     >      buff(1:buff_len,buff_id)
        else if( dir .eq. +1 ) then
          do i3=2,n3-1
            do i1=1,n1
              buff_len = buff_len + 1
              buff(buff_len,buff_id) = u( i1, n2-1, i3)
            enddo
          enddo
          buff(1:buff_len,buff_id+1)[nbr(axis,dir,k)] =
     >      buff(1:buff_len,buff_id)
        endif
      endif
      if( axis .eq. 3 )then
        if( dir .eq. -1 )then
          do i2=1,n2
            do i1=1,n1
              buff_len = buff_len + 1
              buff(buff_len,buff_id) = u( i1, i2, 2)
            enddo
          enddo
          buff(1:buff_len,buff_id+1)[nbr(axis,dir,k)] =
     >      buff(1:buff_len,buff_id)
        else if( dir .eq. +1 ) then
          do i2=1,n2
            do i1=1,n1
              buff_len = buff_len + 1
              buff(buff_len,buff_id) = u( i1, i2, n3-1)
            enddo
          enddo
          buff(1:buff_len,buff_id+1)[nbr(axis,dir,k)] =
     >      buff(1:buff_len,buff_id)
        endif
      endif
      return
      end

      subroutine take3( axis, dir, u, n1, n2, n3 )
      use caf_intrinsics
      implicit none
      include 'cafnpb.h'
      include 'globals.h'
      integer axis, dir, n1, n2, n3
      double precision u( n1, n2, n3 )
      integer buff_id, indx
      integer i3, i2, i1
      buff_id = 3 + dir
      indx = 0
      if( axis .eq. 1 )then
        if( dir .eq. -1 )then
          do i3=2,n3-1
            do i2=2,n2-1
              indx = indx + 1
              u(n1,i2,i3) = buff(indx, buff_id)
            enddo
          enddo
        else if( dir .eq. +1 ) then
          do i3=2,n3-1
            do i2=2,n2-1
              indx = indx + 1
              u(1,i2,i3) = buff(indx, buff_id)
            enddo
          enddo
        endif
      endif
      if( axis .eq. 2 )then
        if( dir .eq. -1 )then
          do i3=2,n3-1
            do i1=1,n1
              indx = indx + 1
              u(i1,n2,i3) = buff(indx, buff_id)
            enddo
          enddo
        else if( dir .eq. +1 ) then
          do i3=2,n3-1
            do i1=1,n1
              indx = indx + 1
              u(i1,1,i3) = buff(indx, buff_id)
            enddo
          enddo
        endif
      endif
      if( axis .eq. 3 )then
        if( dir .eq. -1 )then
          do i2=1,n2
            do i1=1,n1
              indx = indx + 1
              u(i1,i2,n3) = buff(indx, buff_id)
            enddo
          enddo
        else if( dir .eq. +1 ) then
          do i2=1,n2
            do i1=1,n1
              indx = indx + 1
              u(i1,i2,1) = buff(indx, buff_id)
            enddo
          enddo
        endif
      endif
      return
      end

      subroutine comm1p( axis, u, n1, n2, n3, kk )
      use caf_intrinsics
      implicit none
      include 'cafnpb.h'
      include 'globals.h'
      integer axis, dir, n1, n2, n3
      double precision u( n1, n2, n3 )
      integer i3, i2, i1, buff_len, buff_id
      integer i, kk, indx
      dir = -1
      buff_id = 3 + dir
      buff_len = nm2
      do i=1,nm2
        buff(i,buff_id) = 0.0D0
      enddo
      dir = +1
      buff_id = 3 + dir
      buff_len = nm2
      do i=1,nm2
        buff(i,buff_id) = 0.0D0
      enddo
      dir = +1
      buff_id = 2 + dir
      buff_len = 0
      if( axis .eq. 1 )then
        do i3=2,n3-1
          do i2=2,n2-1
            buff_len = buff_len + 1
            buff(buff_len, buff_id) = u( n1-1, i2, i3)
          enddo
        enddo
      endif
      if( axis .eq. 2 )then
        do i3=2,n3-1
          do i1=1,n1
            buff_len = buff_len + 1
            buff(buff_len, buff_id) = u( i1, n2-1, i3)
          enddo
        enddo
      endif
      if( axis .eq. 3 )then
        do i2=1,n2
          do i1=1,n1
            buff_len = buff_len + 1
            buff(buff_len, buff_id) = u( i1, i2, n3-1)
          enddo
        enddo
      endif
      dir = -1
      buff_id = 2 + dir
      buff_len = 0
      if( axis .eq. 1 )then
        do i3=2,n3-1
          do i2=2,n2-1
            buff_len = buff_len + 1
            buff(buff_len, buff_id) = u( 2, i2, i3)
          enddo
        enddo
      endif
      if( axis .eq. 2 )then
        do i3=2,n3-1
          do i1=1,n1
            buff_len = buff_len + 1
            buff(buff_len, buff_id) = u( i1, 2, i3)
          enddo
        enddo
      endif
      if( axis .eq. 3 )then
        do i2=1,n2
          do i1=1,n1
            buff_len = buff_len + 1
            buff(buff_len, buff_id) = u( i1, i2, 2)
          enddo
        enddo
      endif
      do i=1,nm2
        buff(i,4) = buff(i,3)
        buff(i,2) = buff(i,1)
      enddo
      dir = -1
      buff_id = 3 + dir
      indx = 0
      if( axis .eq. 1 )then
        do i3=2,n3-1
          do i2=2,n2-1
            indx = indx + 1
            u(n1,i2,i3) = buff(indx, buff_id)
          enddo
        enddo
      endif
      if( axis .eq. 2 )then
        do i3=2,n3-1
          do i1=1,n1
            indx = indx + 1
            u(i1,n2,i3) = buff(indx, buff_id)
          enddo
        enddo
      endif
      if( axis .eq. 3 )then
        do i2=1,n2
          do i1=1,n1
            indx = indx + 1
            u(i1,i2,n3) = buff(indx, buff_id)
          enddo
        enddo
      endif
      dir = +1
      buff_id = 3 + dir
      indx = 0
      if( axis .eq. 1 )then
        do i3=2,n3-1
          do i2=2,n2-1
            indx = indx + 1
            u(1,i2,i3) = buff(indx, buff_id)
          enddo
        enddo
      endif
      if( axis .eq. 2 )then
        do i3=2,n3-1
          do i1=1,n1
            indx = indx + 1
            u(i1,1,i3) = buff(indx, buff_id)
          enddo
        enddo
      endif
      if( axis .eq. 3 )then
        do i2=1,n2
          do i1=1,n1
            indx = indx + 1
            u(i1,i2,1) = buff(indx, buff_id)
          enddo
        enddo
      endif
      return
      end

  7. Contents • Introduction • Issues and Challenges for HPC Languages • From HPF to High-Productivity Languages • A Short Overview of Chapel • Future Research • Conclusion

  8. Productivity Challenges for Peta-Scale Systems • Large-scale architectural parallelism • tens of thousands to hundreds of thousands of processors • component failures may occur frequently • Extreme non-uniformity in data access • more than 1000 cycles to access local memory • Applications: large, complex, and long-lived • multi-disciplinary, multi-language, multi-paradigm • dynamic, irregular, and adaptive • long-lived, surviving many hardware generations: support for efficient migration has high priority

  9. Key Requirements for High-Productivity Languages High-Level Support for: • Explicit concurrency • Locality-awareness • Distributed collections • Multi-lingual, multi-paradigm, multi-disciplinary programming-in-the-large

  10. Goal and Vision • Goal: Enhance productivity of scientists and engineers, without compromising performance • Vision: A programming environment built around a universal high-productivity language, a common intermediate language and execution model, and an associated common infrastructure: • a universal High Productivity Language, to serve as a standard for programming parallel systems over the next 20-30 years • a common Intermediate Language and Execution Model, providing a common infrastructure for multi-platform, multi-lingual, multi-paradigm compilation and performance-portable migration of legacy codes • an infrastructure for an Intelligent Programming Environment supporting autonomous system operation and self-tuning based on advanced expert system and introspection technology

  11. UNCOL Revisited: Portable Compilation/Tool Environment [Diagram: source programs in the HPL and in legacy languages lg 1 … lg m enter an architecture-independent unified compiler front end, which produces an IHPL intermediate program; after optimization transformations, architecture-specific back ends 1 … n emit target programs in TPL 1 … TPL n.]

  12. Contents • Introduction • Issues and Challenges for HPC Languages • From HPF to High Productivity Languages • A Short Overview of Chapel • Future Research • Conclusion

  13. The Path to High Productivity Languages • High Performance Fortran (HPF) Language Family • HPF Predecessors: CM-Fortran, Fortran D, Vienna Fortran • High Performance Fortran (HPF): HPF-1 (1993); HPF-2 (1997) • Post-HPF Developments: HPF+, JAHPF • OpenMP • ZPL • Partitioned Global Address Space (PGAS) Languages • Co-Array Fortran, UPC, Titanium • High-Productivity Languages • Chapel, X10, Fortress

  14. The High Performance Fortran Idea

HPF approach (a global computation plus data-distribution directives; communication is compiler-generated):

      processors P(NUMBER_OF_PROCESSORS)
      distribute(*,BLOCK) onto P :: A, B
      do while (.not. converged)
        do J=1,N
          do I=1,N
            B(I,J) = 0.25 * (A(I-1,J)+A(I+1,J)+A(I,J-1)+A(I,J+1))
          end do
        end do
        A(1:N,1:N) = B(1:N,1:N)

Message Passing approach (initialize MPI; each process performs the local computation over its J=1,M slice and hand-codes the boundary exchange):

      do while (.not. converged)
        do J=1,M
          do I=1,N
            B(I,J) = 0.25 * (A(I-1,J)+A(I+1,J)+A(I,J-1)+A(I,J+1))
          end do
        end do
        A(1:N,1:N) = B(1:N,1:N)
        if (MOD(myrank,2) .eq. 1) then
          call MPI_SEND(B(1,1),N,…,myrank-1,..)
          call MPI_RCV(A(1,0),N,…,myrank-1,..)
          if (myrank .lt. s-1) then
            call MPI_SEND(B(1,M),N,…,myrank+1,..)
            call MPI_RCV(A(1,M+1),N,…,myrank+1,..)
          endif
        else
          …
        endif
        …
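The global-view Jacobi sweep on this slide can be sketched in plain Python as a stand-in for the Fortran: one whole-grid update with no explicit sends or receives. The function name and grid representation (a list of lists with a one-cell boundary) are ours.

```python
def jacobi_step(a):
    """One 4-point Jacobi sweep over the interior of an (n+2) x (n+2) grid;
    boundary cells of the result are left at zero, as in the slide's B array."""
    n = len(a) - 2
    b = [[0.0] * (n + 2) for _ in range(n + 2)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            b[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j] +
                              a[i][j - 1] + a[i][j + 1])
    return b

grid = [[1.0] * 5 for _ in range(5)]   # constant grid: interior stays 1.0
grid = jacobi_step(grid)
```

In the HPF view this is the whole program; the distribution directive alone determines which process computes which `(i, j)` and what halo data must move.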

  15. HPF: Problems and Successes • HPF-1 lacked important functionality • data distributions do not support irregular data structures and algorithms • lack of flexibility for processor mapping and data/thread affinity • focus on SPMD data parallelism • Fortran 90 as the base language lacked vendor compiler support • Compiler and runtime support for some HPF features (e.g., dynamic data distributions) was not sufficiently mature However, the HPF concept survived: • An HPF+/JAHPF plasma code reached an efficiency of 40% on the Earth Simulator • explicit high-level formulation of communication patterns • explicit high-level control of communication schedules and “halos” • The basic idea underlying HPF has been re-incarnated, in a more general context, in the recently developed HPCS languages

  16. PGAS Language Overview • Co-Array Fortran, UPC, and Titanium are typical representatives • Based on an explicit SPMD model • Both private and shared data • Support for global distributed data structures • global address space is logically partitioned and mapped to processors • in general, static distinction between local and global references • Processor-centric view (“fragmented programming model”) • One-sided shared-memory communication • Collective communication and I/O libraries

  17. Example: Setting up a block-distributed array in Titanium

      // determine parameters of local block:
      Point<3> startCell = myBlockPos * numCellsPerBlockSide;
      Point<3> endCell = startCell + (numCellsPerBlockSide - [1,1,1]);
      // create local myBlock array:
      double [3d] myBlock = new double[startCell:endCell];
      // build the distributed structure:
      // declare blocks as a 1D-array of references (one element per processor)
      blocks.exchange(myBlock);

[Diagram: processors P0, P1, P2 each hold a local myBlock; after the exchange, the blocks array on every processor references all of them.]

Source: K. Yelick et al.: Parallel Languages and Compilers: Perspective from the Titanium Experience
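What Titanium's `blocks.exchange(myBlock)` accomplishes can be simulated in a few lines of Python: each processor contributes its local block, and afterwards every processor holds a directory of references to all blocks. Our `exchange` is a stand-in for the all-gather, not Titanium's actual API, and the P processors are simulated by list positions.

```python
def exchange(local_blocks):
    """All-gather: given the per-processor local blocks, return the
    per-processor directories; each processor ends up with references
    to every block."""
    return [list(local_blocks) for _ in local_blocks]

# one 3-element block per 'processor':
local = [[p * 10 + k for k in range(3)] for p in range(3)]
directory = exchange(local)
# 'processor' 1 can now reach 'processor' 2's block directly:
remote_block = directory[1][2]
```

The directory entries are references to the same block objects, mirroring the pointer structure in the diagram rather than copies of the data.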

  18. Example: Setting up a block-distributed array in Chapel

Standard Distribution Library:

      class block: Distribution {…}
      class cyclic: Distribution {…}
      …
      class sparse-brd: Distribution {…}

      const D: domain(3) distributed (block) = [l1..u1, l2..u2, l3..u3];
      …
      var A: [D] float;
      …

  19. HPCS Language Overview • HPCS Languages • Chapel (Cascade Project, led by Cray Inc.) • X10 (PERCS Project, led by IBM) • Fortress (HERO Project, led by Sun Microsystems Inc.) • Global name space and global data access • in general, no static distinction between local and global references • Explicit high-level specification of parallelism • Explicit high-level specification of locality • data distribution and alignment, affinity (on-clause) • High-level support for distributed collections • Support for data and task parallelism • Object orientation

  20. Contents • Introduction • Issues and Challenges for HPC Languages • From HPF to High Productivity Languages • A Short Overview of Chapel • Future Research • Conclusion

  21. The Cascade Project • Phase 1: Concept Study July 2002 -- June 2003 • Phase 2: Prototyping Phase July 2003 -- July 2006 • Led by Cray Inc. • Cascade Partners • Caltech/JPL • University of Notre Dame • Stanford University • Chapel is the Cascade High Productivity Language David Callahan, Brad Chamberlain, Steve Deitz, Roxana Diaconescu, John Plevyak, Hans Zima

  22. High-Level Control of Locality • Locality Control is Key for Performance • all modern HPC architectures have distributed memory • Fully Automatic Control is Beyond State-of-the-Art • automatic locality analysis had some limited successes for regular codes • for irregular and adaptive applications compiler knowledge is not sufficient to exploit locality efficiently; runtime analysis can be highly expensive • explicit user control of locality is essential for achieving high performance • Chapel Approach • responsibility for generating communication delegated to compiler/runtime system • locality specified by data distributions, data alignment, and affinity • all data distributions are user-defined • additional user-provided locality assertions help the compiler where static knowledge is not available

  23. Example: Matrix-Vector Multiplication (dense)

Version 1:

      var Mat: domain(2) = [1..m, 1..n];
      var MatCol: domain(1) = Mat(2);
      var MatRow: domain(1) = Mat(1);
      var A: [Mat] float;
      var v: [MatCol] float;
      var s: [MatRow] float;

      s = sum(dim=2) [i,j in Mat] A(i,j)*v(j);

Version 2 (distributions added, algorithm unchanged):

      var L: [1..p1, 1..p2] locale = reshape(Locales);
      var Mat: domain(2) distributed(myB,myB) on L = [1..m, 1..n];
      var MatCol: domain(1) aligned(*,Mat(2)) = Mat(2);
      var MatRow: domain(1) aligned(Mat(1),*) = Mat(1);
      var A: [Mat] float;
      var v: [MatCol] float;
      var s: [MatRow] float;

      s = sum(dim=2) [i,j in Mat] A(i,j)*v(j);

Sparse code can be formulated in a similar way
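The "algorithm unchanged" point of the two versions above can be illustrated in plain Python (our stand-in for the Chapel on the slide; both function names are ours). The second variant deals the rows out in blocks to p simulated locales, yet the per-row computation is byte-for-byte the same as in the global-view form.

```python
def matvec(A, v):
    """Global-view formulation: s(i) = sum over j of A(i,j) * v(j)."""
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in A]

def matvec_blocked(A, v, p):
    """Same algorithm with a row-block distribution over p 'locales';
    only the ownership of rows changes, not the computation."""
    n = len(A)
    s = [0.0] * n
    for loc in range(p):
        lo, hi = loc * n // p, (loc + 1) * n // p
        for i in range(lo, hi):   # the rows locale `loc` would compute
            s[i] = sum(aij * vj for aij, vj in zip(A[i], v))
    return s
```

In Chapel the partitioning loop disappears entirely: the distribution on `Mat` tells the compiler which locale evaluates which `(i, j)`.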

  24. Locality Control in Chapel: Basic Concepts • Locale • “locale”: abstract unit of locality, bound to an execution • user-defined views of locale sets • explicit allocation of data and computations on locales • Domain • first-class entity • components: index set, distribution, associated arrays, iterators • Array --- Mapping from a Domain to a Set of Variables • User-Defined Distributions • original ideas in Vienna Fortran and library-based approaches • user can work with distributions at three levels • naïve use of a predefined library distribution • explicit specification of a distribution by its global mapping • explicit specification of a distribution by global mapping and data layout

  25. Key Functionality of the Distribution Framework • Two levels: global mapping and layout mapping • User-Defined Global Mappings from Index Sets to Locales • “standard” distributions (block, block-cyclic, etc.) • distribution of irregular meshes, links to external partitioners • distribution of hierarchical structures (multiblock) • User-Defined Layout Specifications • layout specifies data arrangement within a locale • sparse data structures important target • Dynamic Reallocation, Redistribution • High-Level Control of Communication • user-defined specification of halos (ghost cells) • user-defined assertions on communication

  26. User-Defined Distributions: Global Mapping (1)

      /* declaration of distribution classes MyC and MyB: */
      class MyC: Distribution {
        const z: integer;    /* block size */
        const ntl: integer;  /* number of target locales */
        function map(i: index(source)): locale {  /* global mapping for MyC */
          return Locales(mod(ceil(i/z-1)+1, ntl));
        }
      }

      class MyB: Distribution {
        var bl: integer = ...;  /* block length */
        function map(i: index(source)): locale {  /* global mapping for MyB */
          return Locales(ceil(i/bl));
        }
      }

      /* use of distribution classes MyC and MyB in declarations: */
      const D1C: domain(1) distributed(MyC(z=100)) = 1..n1;
      const D1B: domain(1) distributed(MyB) on Locales(1..num_locales/10) = 1..n1;
      var A1: [D1C] float;
      var A2: [D1B] float;
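The two `map` functions above can be sketched in Python to make their behavior concrete. These are our translations, not part of any Chapel runtime; we also normalize the modulo in the MyC map so results land in 1..ntl (the slide's formula is ambiguous about whether locale indices start at 0 or 1), and we keep 1-based element indices to match the slide.

```python
import math

def map_MyC(i: int, z: int, ntl: int) -> int:
    """Block-cyclic map corresponding to MyC: blocks of z consecutive
    indices are dealt round-robin over ntl locales (1-based)."""
    return ((math.ceil(i / z) - 1) % ntl) + 1

def map_MyB(i: int, bl: int) -> int:
    """Block map corresponding to MyB: locale = ceil(i / bl) (1-based)."""
    return math.ceil(i / bl)
```

So with `z = 100` and four locales, indices 1..100 go to locale 1, 101..200 to locale 2, and index 401 wraps back to locale 1, which is exactly the layout `distributed(MyC(z=100))` requests.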

  27. User-Defined Distributions: Global Mapping (2)

      /* declaration of distribution class MyC1: */
      class MyC1: Distribution {  /* cyclic(1) */
        const ntl: integer;  /* number of target locales */
        function map(i: index(source)): locale {  /* global mapping for MyC1 */
          return Locales(mod(i-1, ntl) + 1);
        }
        /* set of local iterators: */
        iterator DistSegIterator(loc: index(target)): index(source) {
          const N: integer = getSource().extent;
          const k: integer = locale_index(loc);
          for i in k..N by ntl { yield(i); }
        }
        /* distribution segment: */
        function GetDistributionSegment(loc: index(target)): Domain {
          const N: integer = getSource().extent;
          const k: integer = locale_index(loc);
          return (k..N by ntl);
        }
      }

      /* use of distribution class MyC1 in declarations: */
      const D1C1: domain(1) distributed(MyC1()) on Locales(1..4) = 1..16;
      var A1: [D1C1] float;
      ...
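The relationship between the global map and the per-locale distribution segment above can be checked with a small Python sketch (names are ours): the segment enumerated by `k..N by ntl` contains exactly the indices that the cyclic(1) map assigns to locale k, and the segments of all locales partition the index set.

```python
def myc1_map(i: int, ntl: int) -> int:
    """cyclic(1) global map from the slide: mod(i-1, ntl) + 1 (1-based)."""
    return (i - 1) % ntl + 1

def distribution_segment(k: int, N: int, ntl: int):
    """Indices owned by locale k, matching the slide's
    `for i in k..N by ntl` local iterator."""
    return list(range(k, N + 1, ntl))

# for the declaration D1C1 = 1..16 on Locales(1..4):
segment_of_locale_2 = distribution_segment(2, 16, 4)   # 2, 6, 10, 14
```

Providing the segment/iterator alongside the map is what lets the compiler generate owner-computes loops without inverting the map at run time.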

  28. An Artificial Example: Banded Distribution*

[Diagram: an n x n matrix with antidiagonals A/d = { A(i,j) | d = i+j }, diagonals numbered 1..2n-1; bandwidth bw = 3; p = 4 locales.]

Distribution (global map): blocks of bw diagonals are cyclically mapped to locales.

Layout: each diagonal is represented as a one-dimensional dense array; arrays in a locale are referenced by a pointer array.

*This example was proposed by Brad Chamberlain (Cray Inc.)

  29. User-Defined Banded Distribution (1)

      /* declaration of distribution class Banded: */
      class Banded: Distribution {
        const b: integer;
        const n: integer = getSource()(1).extent();
        const p: integer = getTargetLocales().extent();
        const ndiags: integer = 2*n-1;
        var firstDiagsInDS: [1..p] domain(1);
        var DiagsInDS: [1..p] seq(integer) = nil;
        constructor Banded() {
          forall k in 1..p {
            firstDiagsInDS(k) = (k-1)*b+2 .. ndiags by b*p;
            forall d in firstDiagsInDS(k)
              diagsInDS(k) # d..min(d+b-1, 2*n);
          }
        }
        /* global mapping: */
        function map(i,j: integer): locale {
          return (Locales(mod((i+j-2)/b+1, p)));
        }
        /* set of local iterators: */
        iterator DistSegIterator(loc: locale): index(source) {
          const k: integer = locale_index(loc);
          for d in DiagsInDS {
            for i in first_i(d) .. first_i(d)+length(d)-1 {
              yield(i, d-i);
            }
          }
        }
      }

  30. User-Defined Banded Distribution (2)

      /* declaration of layout class BandedLayout: */
      class BandedLayout: LocalSegment {
        /* diagonals in distribution segment loc: */
        const DIAGS_k: domain = Banded.diagsInDS(k);
        /* local index domain: */
        const LocalDomain = [DIAGS_k] domain(1);  /* structure of local array segment */
        …
        constructor BandedLayout() {
          forall d in DIAGS_k { LocalDomain(d) = 1..Banded.length(d); }
        }
        function layout(i,j: integer): index(LocalDomain) {  /* layout mapping */
          return this.index(i+j, i - first_i(i+j) + 1);
        }
      }

      /* use of distribution class Banded and layout class BandedLayout in declarations: */
      const D: domain(2) distributed(Banded(b=bw), BandedLayout())
              on Locales[1..p] = [1..n, 1..n];
      iterator acrossDiagonal(d: index(diagD)): (integer, integer) {…}
      var A: [D] eltType;
      forall d in diagD {
        for dx in acrossDiagonal(d) { … A(dx) … }
      }
      …
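The two levels of the banded example, the global map over diagonals and the per-diagonal dense layout, can be sketched in Python. These helpers are our illustration of the idea, not a translation of the Chapel runtime; we identify a diagonal by d = i+j (1-based indices, so d runs over 2..2n) and normalize the slide's map formula so locales are numbered 1..p.

```python
def diag_of(i: int, j: int) -> int:
    """Diagonal number of element (i, j): all elements with equal i+j
    lie on one antidiagonal (1-based indices)."""
    return i + j

def banded_owner(i: int, j: int, bw: int, p: int) -> int:
    """Global map: blocks of bw consecutive diagonals are dealt
    cyclically to p locales (numbered 1..p)."""
    d = diag_of(i, j)                 # d ranges over 2 .. 2n
    return ((d - 2) // bw) % p + 1

def diag_layout(n: int, d: int):
    """Layout: diagonal d of an n x n matrix as a dense 1-D sequence;
    returns the (i, j) pairs it holds, in storage order."""
    return [(i, d - i) for i in range(max(1, d - n), min(n, d - 1) + 1)]
```

With bw = 3 and p = 4, diagonals 2..4 land on locale 1, 5..7 on locale 2, and so on cyclically; `diag_layout` is exactly the index set that one dense per-diagonal array in `LocalDomain(d)` would store.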

  31. Example: Matrix-Vector Multiplication (sparse CRS)

[Diagram: a sparse matrix partitioned into four blocks, each stored locally in CRS form with row-pointer arrays R0..R3, column-index arrays C0..C3, and data arrays D0..D3.]

      const D: domain(2) = [1..m, 1..n];
      const DD: domain(D) sparse(CRS) = …;
      distribute(DD, Block_CRS);
      var AA: [DD] float;
      …
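For reference, the CRS (compressed row storage) format each block uses can be sketched as a Python matrix-vector product. This is a standard 0-based CRS matvec, our illustration of the storage scheme rather than the slide's distributed Chapel code (whose R/C/D arrays are 1-based).

```python
def crs_matvec(row_ptr, col_idx, vals, v):
    """y = A * v for A in CRS form: row_ptr[i]..row_ptr[i+1] delimits
    row i's entries in col_idx (column indices) and vals (values)."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * v[col_idx[k]]
    return y

# the 3x3 matrix [[0,53,0],[19,0,0],[0,0,17]] in CRS:
row_ptr, col_idx, vals = [0, 1, 2, 3], [1, 0, 2], [53.0, 19.0, 17.0]
y = crs_matvec(row_ptr, col_idx, vals, [1.0, 2.0, 3.0])
```

In the distributed version each locale runs exactly this loop over its own R/C/D triple, plus the communication needed to assemble `v` and `y`, which Chapel's `Block_CRS` distribution is meant to hide.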

  32. User-Defined Halos • User-Defined Specification of a halo (ghost cells) for a distribution segment • Compiler/Runtime System • allocates images • defines mapping between remote objects and images • Halo Management • update • flush
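The halo `update` operation above can be illustrated with a minimal Python sketch for a 1-D block distribution; the data layout (each block carries one ghost cell on each side) and the function name are our assumptions, chosen only to show what the runtime does on an update.

```python
def halo_update(blocks):
    """Refresh one-cell halos for a 1-D block distribution. Each block is
    [left_ghost] + interior + [right_ghost]; ghosts are overwritten with
    the neighbouring block's boundary interior cells (blocks at the ends
    keep their old ghost values)."""
    for k, blk in enumerate(blocks):
        if k > 0:
            blk[0] = blocks[k - 1][-2]   # left ghost <- left neighbour's last interior cell
        if k < len(blocks) - 1:
            blk[-1] = blocks[k + 1][1]   # right ghost <- right neighbour's first interior cell

# two blocks with stale ghosts (9):
blocks = [[9, 1, 2, 9], [9, 3, 4, 9]]
halo_update(blocks)
```

A user-defined halo specification tells the compiler which ghost pattern to allocate; after an `update`, stencil loops can read ghosts as if they were local, and `flush` invalidates them when the interior changes.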

  33. Contents • Introduction • Issues and Challenges for HPC Languages • From HPF to High Productivity Languages • A Short Overview of Chapel • Future Research • Conclusion

  34. Intelligent Programming Environments Objective: Create an infrastructure for an intelligent programming environment that supports: • a portable framework for introspection • fault tolerance and resilience to component failures for large-scale systems • autonomy and dynamic adaptation (self-tuning) for complex applications • tool integration and support for high productivity • knowledge acquisition, learning, consulting, and knowledge presentation • shared user interface

  35. Case Study: Offline Performance Tuning [Diagram: the source program is processed by a transformation system (driven by restructuring commands) and a parallelizing compiler; the instrumented target program runs on the HPCS system under monitoring; an analysis component feeds a knowledge base whose actuators issue further restructuring commands, until the end of the tuning cycle.]

  36. Case Study: Online Performance Tuning [Diagram: as in the offline case, but the tuning loop is closed at run time: an expert system consumes monitoring data from the running instrumented target program on the HPCS system and drives actuators that restructure/recompile, re-instrument, or change function implementations on the fly.]

  37. Introspection Framework Overview [Diagram: the introspection framework connects an application to a knowledge base holding system knowledge (hardware, operating system, languages, compilers, libraries, …), application domain knowledge, application knowledge (components, semantics, performance experiments, …), and presentation knowledge; sensors feed an inference engine and an agent system for monitoring, analysis, and tuning, which acts back on the application through actuators.]

  38. Heterogeneous System Example: Future On-Board Systems • Future space missions will require enhanced on-board computing systems • deep-space missions, operating in hazardous environments • advanced sensor technology producing large amounts of high-quality data • latency and bandwidth limitations for transmissions between spacecraft and Earth • real-time on-board decisions concerning navigation, planning of science experiments, and analysis of unexpected phenomena • On-board systems consist of two major components: • radiation-hardened control system • heterogeneous, highly parallel, reduced-reliability computational subsystem • New, high-level, fault-tolerant programming models for such systems are needed [Diagram: on-board system structure — a spacecraft control computer linked to the computational subsystem, communicating with Earth.]

  39. Software-Enhanced Computational Subsystem [Diagram: a heterogeneous, massively parallel computational subsystem and the spacecraft control computer are wrapped by an introspection framework with global actuators and introspection sensors, providing fault tolerance and performance tuning; the system communicates with Earth.]

  40. Conclusion • MPI provides a key infrastructure for HPC that is here to stay • However, productivity and reliability considerations will, in the long term, enforce a programming paradigm in which MPI is the target – not the source • From a practical point of view, general agreement on a single HPC language combined with a common intermediate language, front end, and programming environment would be the best solution to address the difficult open problems • Acceptance of a new language depends on many criteria, including: • functionality and target code performance • mature compiler and runtime system technology • intelligent programming environment closely integrated with the language • familiarity of users with syntax and semantics for conventional (sequential) features • support by funding agencies and major vendors • Open research issues include models for heterogeneous systems, design of intelligent programming environments, fault tolerance and automatic performance tuning