Compiler, Languages, and Libraries

  1. Compiler, Languages, and Libraries
  ECE Dept., University of Tehran
  Parallel Processing Course Seminar
  Hadi Esmaeilzadeh
  810181079
  hadi@cad.ece.ut.ac.ir

  2. Introduction
  • Distributed systems are heterogeneous in:
    • Power
    • Architecture
    • Data representation
  • Data access latencies are significantly long and vary with the underlying network traffic
  • Network bandwidths are limited and can vary dramatically with the underlying load

  3. Programming Support Systems: Principles
  • Principle: each component of the system should do what it does best
  • The application developer should be able to concentrate on problem analysis and decomposition at a fairly high level of abstraction

  4. Programming Support Systems: Goals
  • Make applications easy to develop
  • Build applications that are portable across different architectures and computing configurations
  • Achieve high performance, close to what an expert programmer can achieve using the underlying features of the network and computing configurations
  • Exploit various forms of parallelism to balance the load across a heterogeneous configuration by
    • Minimizing the computation time
    • Matching the communication to the underlying bandwidths and latencies
  • Ensure that performance variability remains within certain bounds

  5. Autoparallelization
  • The user focuses on what is being computed rather than how
  • The performance penalty should be no worse than a factor of two
  • Automatic vectorization
  • Dependence analysis (see the sketch below)
  • Asynchronous (MIMD) parallel processing
  • Symmetric multiprocessors (SMP)
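  A small C sketch (the loops and names are illustrative, not from the slides) of what dependence analysis decides: the first loop has no loop-carried dependence and can be vectorized or parallelized automatically, while the second carries a dependence from one iteration to the next and cannot.

      #include <stddef.h>

      /* No loop-carried dependence: every iteration is independent, so a
         vectorizing or parallelizing compiler may execute the iterations
         in any order, or all at once. */
      void scale(double *a, const double *b, double s, size_t n) {
          for (size_t i = 0; i < n; i++)
              a[i] = s * b[i];
      }

      /* Loop-carried dependence: iteration i reads the value written by
         iteration i - 1, so the iterations must run in order and the loop
         cannot be automatically vectorized or parallelized as written. */
      void prefix_sum(double *a, size_t n) {
          for (size_t i = 1; i < n; i++)
              a[i] = a[i] + a[i - 1];
      }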

  6. Distributed Memory Architecture
  • Caches
  • Higher latency of large memories
  • Determine how to apportion data to the memories of processors in a way that
    • Maximizes local memory access
    • Minimizes communication
  • Regions of parallel execution have to be large enough to compensate for the overhead of initiation and synchronization
  • Interprocedural analysis and optimization
  • Mechanisms that involve the programmer in the design of the parallelization, as well as the problem solution, will be required

  7. Explicit Communication
  • Message passing to get data from remote memories
  • A single version of the program runs on all processors
  • The computation is specialized to each processor by extracting the processor number and indexing into its own data
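  A minimal C sketch of this SPMD style (the block partitioning and function names are assumptions made for illustration): every processor runs the same code and uses its own rank to select the slice of the data it works on.

      /* SPMD sketch: the same program runs everywhere; each processor
         derives its own index range from its rank (processor number). */
      void local_range(int rank, int nprocs, int n, int *lo, int *hi) {
          int chunk = (n + nprocs - 1) / nprocs;   /* ceiling division */
          *lo = rank * chunk;
          *hi = (*lo + chunk < n) ? *lo + chunk : n;
      }

      void compute(double *x, int n, int rank, int nprocs) {
          int lo, hi;
          local_range(rank, nprocs, n, &lo, &hi);
          for (int i = lo; i < hi; i++)
              x[i] = 2.0 * x[i];                   /* touch only the local slice */
      }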

  8. Send-Receive Model
  • Unlike a shared-memory environment, each processor not only receives the data it needs but also sends the data that other processors require
  • PVM
  • MPI
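  A minimal MPI sketch of the send-receive model; the ring exchange below is chosen purely for illustration. Note how both sides of every transfer participate explicitly.

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv) {
          int rank, nprocs;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

          /* Each processor sends its value to the right neighbor and
             receives the value it needs from the left neighbor. */
          double mine = (double)rank, from_left;
          int right = (rank + 1) % nprocs;
          int left  = (rank + nprocs - 1) % nprocs;

          MPI_Sendrecv(&mine, 1, MPI_DOUBLE, right, 0,
                       &from_left, 1, MPI_DOUBLE, left, 0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);

          printf("rank %d received %.1f from rank %d\n", rank, from_left, left);
          MPI_Finalize();
          return 0;
      }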

  9. Get-Put Model
  • The processor that needs data from a remote memory is able to explicitly get it, without requiring any explicit action by the remote processor
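  One concrete realization of the get-put model is MPI's one-sided communication; the sketch below (values chosen for illustration) pulls a number from a neighbor without the neighbor issuing a matching send.

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv) {
          int rank, nprocs;
          double mine, remote;
          MPI_Win win;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

          /* Expose one double per processor in a window that others may read. */
          mine = 100.0 + rank;
          MPI_Win_create(&mine, sizeof(double), sizeof(double),
                         MPI_INFO_NULL, MPI_COMM_WORLD, &win);

          /* Pull the neighbor's value directly; the neighbor takes no
             explicit action (no matching send is required). */
          MPI_Win_fence(0, win);
          MPI_Get(&remote, 1, MPI_DOUBLE, (rank + 1) % nprocs, 0, 1, MPI_DOUBLE, win);
          MPI_Win_fence(0, win);

          printf("rank %d got %.1f\n", rank, remote);
          MPI_Win_free(&win);
          MPI_Finalize();
          return 0;
      }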

  10. Discussion
  • The programmer is responsible for:
    • Decomposition of the computation
    • Accounting for the power of each individual processor
    • Load balancing
    • Layout of the memory
    • Management of latency
    • Organization and optimization of communication
  • Explicit communication can be thought of as an assembly language for grids

  11. Distributed Shared Memory
  • DSM is a vehicle for hiding the complexities of memory and communication management
  • The address space appears to the programmer as flat as on a single-processor machine
  • The hardware/software is responsible for retrieving data from remote memories by generating the needed communication

  12. Hardware Approach
  • Stanford DASH, HP/Convex Exemplar, SGI Origin
  • Local cache misses initiate data transfer from remote memory if needed

  13. Software Scheme
  • Shared virtual memory, e.g., TreadMarks
  • Relies on the paging mechanism of the operating system
  • Whole pages are transferred on demand between machines
  • This makes the granularity and latency significantly larger
  • Used in conjunction with relaxed memory consistency models and support for latency hiding

  14. Discussion
  • The programmer is freed from handling thread packages and parallel loops
  • Has performance penalties, and is therefore most useful for coarser-grained parallelism
  • Works best with some help from the programmer on the layout of memory
  • Is a promising strategy for simplifying the programming model

  15. Data-Parallel Languages
  • High performance on distributed memory requires:
    • Allocating data to the various processor memories so as to maximize locality and minimize communication
  • For scaling parallelism to hundreds or thousands of processors, data parallelism is necessary
  • Data parallelism: subdividing the data domain in some manner and assigning the subdomains to different processors (data layout)
  • These are the foundations for data-parallel languages
    • Fortran D, Vienna Fortran, CM Fortran, C*, data-parallel C, and pC++
    • High Performance Fortran (HPF) and High Performance C++ (HPC++)

  16. HPF
  • Provides directives for data layout on top of Fortran 90 and Fortran 95
  • The directives have no effect on the meaning of the program
  • They advise the compiler on how to assign elements of the program's arrays and data structures to different processors
  • These specifications are relatively machine independent
  • The principal focus is the layout of arrays
    • Arrays are typically associated with the data domains of the underlying problem
  • The principal drawback: limited support for problems on irregular meshes
    • Distribution via a run-time array
    • Generalized block distribution (blocks may be of different sizes)
  • For heterogeneous machines: block sizes can be adapted to the relative powers of the target machines (generalized block distribution), as sketched below
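  A small C sketch of such a generalized block distribution, with block sizes proportional to assumed relative node powers; the weighting scheme is an illustration, not an HPF-defined rule.

      #include <stdio.h>

      /* Generalized block distribution sketch: split n array elements into
         nprocs contiguous blocks whose sizes are proportional to each
         node's relative power. */
      void blocks_by_power(int n, int nprocs, const double *power, int *size) {
          double total = 0.0;
          int assigned = 0;
          for (int p = 0; p < nprocs; p++) total += power[p];
          for (int p = 0; p < nprocs; p++) {
              size[p] = (int)(n * power[p] / total);
              assigned += size[p];
          }
          size[nprocs - 1] += n - assigned;   /* hand rounding leftovers to the last block */
      }

      int main(void) {
          double power[3] = {1.0, 2.0, 1.5};  /* assumed relative node speeds */
          int size[3];
          blocks_by_power(1000, 3, power, size);
          for (int p = 0; p < 3; p++) printf("block %d: %d elements\n", p, size[p]);
          return 0;
      }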

  17. HPC++
  • Unsynchronized for-loops
  • Parallel template libraries, with parallel or distributed data structures as a basis

  18. Task Parallelism
  • Different components of the same computation are executed in parallel (see the sketch after this list)
  • Different tasks can be allocated to different nodes of the grid
  • Object parallelism: different tasks may be components of objects of different classes
  • Task parallelism need not be restricted to shared-memory systems and can be defined in terms of a communication library
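  A minimal C sketch of task parallelism using POSIX threads; the two task bodies are invented for illustration, and on a grid the same structure could be expressed across nodes with a communication library instead of shared-memory threads.

      #include <pthread.h>
      #include <stdio.h>

      /* Two different components of the same computation run in parallel:
         task parallelism, as opposed to applying one operation to many
         data elements. */
      void *simulate(void *arg)  { (void)arg; puts("running the simulation task");    return NULL; }
      void *visualize(void *arg) { (void)arg; puts("running the visualization task"); return NULL; }

      int main(void) {
          pthread_t t1, t2;
          pthread_create(&t1, NULL, simulate, NULL);
          pthread_create(&t2, NULL, visualize, NULL);
          pthread_join(t1, NULL);
          pthread_join(t2, NULL);
          return 0;
      }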

  19. HPF 2.0 Extensions for Task Parallelism
  • Can be implemented on both shared- and distributed-memory systems
  • Provide a way for a set of cases to be run in parallel, with no communication until synchronization at the end
  • Remaining problems in using HPF on a computational grid:
    • Load matching
    • Communication optimization

  20. Coarse-Grained Software Integration
  • The complete application is not a single, simple program
  • It is a collection of programs that must all be run, passing data to one another
  • The main technical challenge of integration is preventing performance degradation due to sequential processing of the various programs
  • Each program can be viewed as a task
  • Tasks are collected and matched to the power of the various nodes in the grid

  21. Latency Tolerance
  • Dealing with long memory or communication latencies
  • Latency hiding: data communication is overlapped with computation (e.g., software prefetching)
  • Latency reduction: programs are reorganized to reuse more data in local memories (e.g., loop blocking for cache), as sketched below
  • More complex to implement on heterogeneous distributed computers
    • Latencies are large and variable
    • More time must be spent on estimating running times
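  A C sketch of latency reduction by loop blocking (the tile size and loop are illustrative): the current tile of b[] stays cache-resident while it is reused across all i iterations.

      #define B 64   /* tile size; illustrative, would be tuned to the cache */

      /* Blocked version of a[i] += sum over j of b[j]: the j-loop is split
         into tiles of B elements so each tile of b[] is loaded once and
         reused from cache for every i, instead of being refetched. */
      void blocked_sum(double *a, const double *b, int n) {
          for (int jj = 0; jj < n; jj += B)
              for (int i = 0; i < n; i++)
                  for (int j = jj; j < jj + B && j < n; j++)
                      a[i] += b[j];
      }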

  22. Load Balancing
  • Spreading the calculation evenly across processors while minimizing communication
  • Simulated annealing, neural nets
  • Recursive bisection: at each stage, the work is divided into two equal parts (see the sketch after this list)
  • For the grid, the power of each node must be taken into account
  • Performance prediction of components is essential
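  A C sketch of weighted recursive bisection over a one-dimensional array of work items (the data layout and the even split are assumptions; on a grid the split point could instead follow the relative power of the two halves of the machine).

      /* Recursive bisection sketch: at each stage the work items are split
         into two parts of roughly equal total weight, and each part is
         assigned recursively to half of the processors. */
      void bisect(const double *work, int lo, int hi, int nparts, int *owner, int first) {
          if (nparts == 1) {
              for (int i = lo; i < hi; i++) owner[i] = first;   /* one part left: assign it */
              return;
          }
          double total = 0.0, half = 0.0;
          for (int i = lo; i < hi; i++) total += work[i];
          int cut = lo;
          while (cut < hi && half + work[cut] <= total / 2.0)   /* find the midpoint by weight */
              half += work[cut++];
          bisect(work, lo, cut, nparts / 2, owner, first);
          bisect(work, cut, hi, nparts - nparts / 2, owner, first + nparts / 2);
      }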

  23. Runtime Compilation
  • A problem with automatic load balancing (especially on irregular grids):
    • Unknown loop upper bounds
    • Unknown array sizes
  • Inspector/executor model (sketched below)
    • Inspector: executed a single time at run time; establishes a plan for efficient execution
    • Executor: executed on each iteration; carries out the plan defined by the inspector
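  A C sketch of the inspector/executor model for an irregular access pattern x[idx[i]]; the plan structure and the is_local/fetch_remote helpers are hypothetical placeholders standing in for the run-time system.

      #include <stdlib.h>

      typedef struct { int count; int *remote; } plan_t;   /* hypothetical communication plan */

      static int is_local(int i)                { return i < 1000; }  /* stub ownership test */
      static void fetch_remote(const plan_t *p) { (void)p; }          /* stub bulk communication */

      /* Inspector: run once at run time, when the index values are finally
         known; records which references are remote so the communication
         can be planned once and reused. */
      plan_t *inspector(const int *idx, int n) {
          plan_t *plan = malloc(sizeof *plan);
          plan->remote = malloc(n * sizeof *plan->remote);
          plan->count = 0;
          for (int i = 0; i < n; i++)
              if (!is_local(idx[i]))
                  plan->remote[plan->count++] = idx[i];
          return plan;
      }

      /* Executor: run on every iteration of the outer computation; carries
         out the communication plan, then performs the irregular gather. */
      void executor(double *y, const double *x, const int *idx, int n, const plan_t *plan) {
          fetch_remote(plan);
          for (int i = 0; i < n; i++)
              y[i] += x[idx[i]];
      }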

  24. Libraries
  • Functional library: parallelized versions of standard functions are applied to user-defined data structures (ScaLAPACK, FFTPACK)
  • Data structure library: a parallel data structure is maintained within the library, and its representation is hidden from the user (DAGH)
    • Well suited for OO languages
    • Provides maximum flexibility to the library developer to manage run-time challenges:
      • Heterogeneous networks
      • Adaptive gridding
      • Variable latencies
  • Drawback: library components are currently treated by compilers as black boxes
    • Some sort of collaboration between compiler and library might be possible, particularly in an interprocedural compilation

  25. Programming Tools
  • Tools like Pablo, Gist, and Upshot can show where performance bottlenecks exist
  • Performance-tuning tools

  26. Future Directions (Assumptions)
  • The user is responsible for both problem decomposition and assignment
  • Some kind of service negotiator runs prior to execution and determines the available nodes and their relative power
  • Some portion of compilation will be invoked after this service

  27. Task Compilation
  • Constructing a task graph, along with an estimate of the running time for each task
    • Task graph construction and decomposition
    • Performance estimation
  • Restructuring the program to better suit the target grid configuration
  • Assignment of components of the task graph to the available nodes (see the sketch after this list)
  • Java
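  A toy C sketch of the assignment step; the work estimates, node powers, and greedy rule are all illustrative assumptions. Each task goes to the node that would finish it earliest, given the node's relative power and the work already assigned to it.

      #include <stdio.h>

      int main(void) {
          double work[5]  = {10, 4, 7, 3, 8};   /* estimated work per task   */
          double power[2] = {1.0, 2.0};         /* relative power per node   */
          double busy[2]  = {0, 0};             /* accumulated time per node */

          for (int t = 0; t < 5; t++) {
              int best = 0;
              for (int n = 1; n < 2; n++)       /* pick the earliest-finishing node */
                  if (busy[n] + work[t] / power[n] < busy[best] + work[t] / power[best])
                      best = n;
              busy[best] += work[t] / power[best];
              printf("task %d -> node %d (node busy until %.1f)\n", t, best, busy[best]);
          }
          return 0;
      }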

  28. Grid Shared Memory (Challenges)
  • Different nodes have different page sizes and paging mechanisms
  • Good performance estimation
  • Managing the system-level interactions that provide DSM

  29. Global Grid Compilation
  • Providing a programming language and compilation strategy targeted to the grid
  • A mixture of parallelism styles: data parallelism and task parallelism
    • Data decomposition
    • Function decomposition
