Compiler, Languages, and Libraries

ECE Dept., University of Tehran

Parallel Processing Course Seminar

Hadi Esmaeilzadeh

Introduction



  • Distributed systems are heterogeneous:

    • Power

    • Architecture

    • Data Representation

  • Data access latencies are long and vary with the underlying network traffic

  • Network bandwidths are limited and can vary dramatically with the underlying load

Programming Support Systems: Principles

  • Principle: each component of the system should do what it does best

  • The application developer should be able to concentrate on problem analysis and decomposition at a fairly high level of abstraction

Programming Support Systems: Goals

  • They should make applications easy to develop

  • They should build applications that are portable across different architectures and computing configurations

  • They should achieve performance close to what an expert programmer can achieve using the underlying features of the network and computing configurations

  • They should exploit various forms of parallelism to balance load across a heterogeneous configuration

    • Minimizing the computation time

    • Matching the communication to the underlying bandwidths and latencies

    • Ensure the performance variability remains within certain bounds


  • The user focuses on what is being computed rather than how

  • The performance penalty should be no worse than a factor of two

  • Automatic vectorization

    • Dependence analysis

  • Asynchronous (MIMD) Parallel Processing

    • Symmetric multiprocessor (SMP)
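
As an illustration of the dependence analysis that automatic vectorization relies on, the following is a minimal sketch of the classic GCD test, assuming a single loop whose write and read subscripts are affine in the loop index. The function name and parameters are hypothetical, chosen only for this example. The test is conservative: it ignores loop bounds, so a "may depend" answer does not prove that a dependence exists.

```python
from math import gcd


def gcd_test_may_depend(c1, k1, c2, k2):
    """Conservative GCD dependence test for one loop.

    The loop writes a[c1*i + k1] and reads a[c2*j + k2].
    The equation c1*i + k1 == c2*j + k2 has an integer solution
    only if gcd(c1, c2) divides (k2 - k1). Loop bounds are ignored,
    so True means "a dependence cannot be ruled out".
    """
    g = gcd(c1, c2)
    if g == 0:
        # Both subscripts are constants: they collide only if equal.
        return k1 == k2
    return (k2 - k1) % g == 0


# a[2*i] = a[2*i + 1] + 1 : writes touch even elements, reads touch odd ones.
independent = not gcd_test_may_depend(2, 0, 2, 1)

# a[i] = a[i - 1] + 1 : each iteration reads the previous iteration's result.
carried = gcd_test_may_depend(1, 0, 1, -1)
```

In the first loop the test proves independence, so the iterations could be executed as vector operations. In the second it cannot, which matches the loop-carried dependence of distance one.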

Distributed Memory Architecture

  • Caches

  • Higher latency of large memories

  • Determine how to apportion data to the memories of processors in a way that

    • Maximizes local memory access

    • Minimizes communication

  • Regions of parallel execution must be large enough to compensate for the overhead of initiation and synchronization

  • Interprocedural analysis and optimization

  • Mechanisms that involve the programmer in the design of the parallelization as well as the problem solution will be required

Explicit Communication

  • Message passing to get data from remote memories

  • A single version of the program runs on all the processors

  • The computation is specialized to each processor by obtaining the processor number and using it to index that processor's own data
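
The single-program style described above can be sketched as follows. This is a minimal illustration, not tied to any particular message-passing library: the processor number is passed in as a plain argument, where a real program would obtain it from its communication library, and the function names are hypothetical.

```python
def local_range(rank, num_procs, n):
    """Return the half-open range [start, stop) of the n items owned by `rank`.

    Items are split into contiguous blocks; the first (n % num_procs)
    processors receive one extra item.
    """
    base, extra = divmod(n, num_procs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def spmd_partial_sum(rank, num_procs, data):
    """The same code runs on every processor; only `rank` differs."""
    start, stop = local_range(rank, num_procs, len(data))
    return sum(data[start:stop])


# Simulate four processors running the same program on ten items.
data = list(range(10))
partials = [spmd_partial_sum(r, 4, data) for r in range(4)]
total = sum(partials)  # stands in for a final reduction step
```

In a real distributed-memory program each processor would hold only its own slice of `data`, and the final sum would be a reduction carried out through messages.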

Send-Receive Model

  • A shared-memory environment

  • Each processor not only receives the data it needs but also sends the data that other processors require

  • PVM

  • MPI

Get-Put Model

  • The processor that needs data from a remote memory is able to explicitly get it without requiring any explicit action by the remote processor


  • Program is responsible for:

    • Decomposition of computation

      • The power of each individual processor

    • Load balancing

    • Layout of the memory

    • Management of latency

    • Organization and optimization of communication

  • Explicit communication can be thought of as an assembly language for grids

Distributed Shared Memory

  • DSM as a vehicle for hiding complexities of memory and communication management

  • To the programmer, the address space is as flat as on a single-processor machine

  • The hardware or software is responsible for retrieving data from remote memories by generating the needed communication

Hardware Approach

  • Stanford DASH, HP/Convex Exemplar, SGI Origin

  • Local cache misses initiate data transfer from remote memory if needed

Software Scheme

  • Shared Virtual Memory, TreadMarks

  • Relies on the paging mechanism in the operating system

  • Transfers whole pages on demand between operating systems

  • Makes both granularity and latency significantly larger

  • Used in conjunction with relaxed memory consistency models and support for latency hiding


  • Programmer is free from handling thread packaging and parallel loops

  • Has performance penalties, so it is most useful for coarser-grained parallelism

  • Works best with some help from the programmer on the layout of memory

  • Is a promising strategy for simplifying the programming model

Data-Parallel Languages

  • High performance on distributed memory:

    • Allocate data to the memories of the various processors to maximize locality and minimize communication

  • For scaling parallelism to hundreds or thousands of processors, data parallelism is necessary

    • Data parallelism: subdividing the data domain in some manner and assigning the subdomains to different processors (data layout)

  • These are the foundations for data-parallel languages

  • Fortran D, Vienna Fortran, CM Fortran, C*, data-parallel C, and pC++

  • High Performance Fortran (HPF), and High Performance C++ (HPC++)

High Performance Fortran (HPF)

  • HPF provides directives for data layout on top of Fortran 90 and Fortran 95

  • Directives have no effect on the meaning of the program

  • They advise the compiler on how to assign elements of the program's arrays and data structures to different processors

  • These specifications are relatively machine independent

  • The principal focus is the layout of arrays

  • Arrays are typically associated with the data domains of underlying problem

  • The principal drawback: limited support for problems on irregular meshes

    • Distribution via run-time array

    • Generalized block distribution (blocks to be of different sizes)

  • For heterogeneous machines: block sizes can be adapted to the powers of the target machines (generalized block distribution)
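
The idea of a generalized block distribution for heterogeneous machines can be sketched as follows. This is an illustrative calculation only, not HPF syntax or the behavior of any particular compiler. The function names are hypothetical, and the relative node powers are assumed to be known positive integers.

```python
def generalized_block_sizes(n, powers):
    """Split n array elements into contiguous blocks proportional to `powers`.

    `powers` holds one positive integer per processor (relative speed).
    Elements lost to integer rounding go to the processors with the
    largest remainders, so the sizes always sum to n.
    """
    total = sum(powers)
    sizes = [n * p // total for p in powers]
    leftover = n - sum(sizes)
    by_remainder = sorted(
        range(len(powers)),
        key=lambda i: (n * powers[i]) % total,
        reverse=True,
    )
    for i in by_remainder[:leftover]:
        sizes[i] += 1
    return sizes


def block_bounds(sizes):
    """Turn block sizes into half-open (start, stop) index ranges."""
    bounds, start = [], 0
    for size in sizes:
        bounds.append((start, start + size))
        start += size
    return bounds


# Three nodes, the third twice as powerful as the other two.
sizes = generalized_block_sizes(100, [1, 1, 2])
bounds = block_bounds(sizes)
```

With equal powers this reduces to an ordinary block distribution. With unequal powers, faster nodes own proportionally larger blocks of the array.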

High Performance C++ (HPC++)

  • Unsynchronized for-loops

  • Parallel template libraries, with parallel or distributed data structures as basis

Task Parallelism

  • Different components of the same computation are executed in parallel

  • Different tasks can be allocated to different nodes of the grid

  • Object parallelism (Different tasks may be components of objects of different classes)

  • Task parallelism need not be restricted to shared-memory systems and can be defined in terms of a communication library

HPF 2.0 Extensions for Task Parallelism

  • Can be implemented on both shared- and distributed-memory systems

  • Providing a way for a set of cases to be run in parallel with no communication until synchronization at the end

  • Remaining problems in using HPF on a computational grid:

    • Load matching

    • Communication optimization

Coarse-Grained Software Integration

  • Complete application is not a simple program

  • It is a collection of programs that must all be run, passing data to one another

  • The main technical challenge of the integration is how to prevent performance degradation due to sequential processing of the various programs

  • Each program could be viewed as a task

  • Tasks are collected and matched to the power of the various nodes in the grid

Latency Tolerance

  • Dealing with long memory or communication latencies

  • Latency hiding: data communication is overlapped with computation (software prefetching)

  • Latency reduction: programs are reorganized to reuse more data in local memories (loop blocking for cache)

  • More complex to implement on heterogeneous distributed computers

    • Latencies are large and variable

    • More time to be spent on estimating running times
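
Loop blocking, the latency-reduction technique mentioned above, can be sketched with a blocked matrix multiplication. This is a minimal illustration with a hypothetical function name. It shows only the loop structure: in pure Python the cache benefit is not observable, and the block size is an arbitrary assumption that a real compiler would tune to the cache.

```python
def matmul_blocked(a, b, n, block):
    """Multiply two n x n matrices (lists of lists) block by block.

    The three outer loops walk over block-sized tiles; the three inner
    loops work inside one tile, so the same small pieces of a, b and c
    are reused many times before the computation moves on.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    row_c = c[i]
                    for k in range(kk, min(kk + block, n)):
                        a_ik = a[i][k]
                        row_b = b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
product = matmul_blocked(a, b, 3, 2)
```

The result is the same as for the unblocked triple loop; only the order in which elements are visited changes.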

Load Balancing

  • Spreading the calculation evenly across processors while minimizing communication

  • Simulated annealing, neural nets

  • Recursive bisection: at each stage, the work is divided into two equal parts.

  • For the grid: the power of each node must be taken into account

    • Performance prediction of components is essential
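
A power-aware variant of recursive bisection can be sketched as follows. This is a minimal illustration under strong assumptions: the work is a one-dimensional sequence of items with known costs, node powers are known in advance, and communication cost is ignored. The function name is hypothetical.

```python
def recursive_bisection(costs, powers):
    """Assign a contiguous run of work items to each node.

    At each stage the nodes are split into two groups and the work is cut
    so that each group receives a share close to its share of the total
    power. Returns one list of item costs per node, in node order.
    """
    if len(powers) == 1:
        return [list(costs)]
    mid = len(powers) // 2
    target = sum(costs) * sum(powers[:mid]) / sum(powers)
    # Choose the cut whose prefix sum is closest to the target share.
    best_cut, best_gap, prefix = 0, abs(target), 0.0
    for i, cost in enumerate(costs, start=1):
        prefix += cost
        gap = abs(prefix - target)
        if gap < best_gap:
            best_cut, best_gap = i, gap
    return (recursive_bisection(costs[:best_cut], powers[:mid])
            + recursive_bisection(costs[best_cut:], powers[mid:]))


# Eight unit-cost items on four equal nodes, then on one fast and one slow node.
even_split = recursive_bisection([1] * 8, [1, 1, 1, 1])
skewed_split = recursive_bisection([1] * 8, [3, 1])
```

With equal powers each stage divides the work into two equal parts, as in the classic scheme. With unequal powers the cut moves toward the slower group.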

Runtime Compilation

  • A problem with automatic load-balancing (especially on irregular grids)

    • Unknown loop upper bounds

    • Unknown array sizes

  • Inspector/executor model

    • Inspector: executed a single time at run time, establishes a plan for efficient execution

    • Executor: executed on each iteration, carries out the plan defined by the inspector
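
The inspector/executor split can be sketched for an irregular gather, where the indices a processor reads are only known at run time. This is a simplified single-process simulation with hypothetical function names: the "communication" is a plain list lookup, where a real runtime system would exchange messages.

```python
def inspector(index_array, owned_start, owned_stop):
    """Run once: find which referenced elements are not stored locally.

    Returns the sorted list of remote global indices (the communication
    plan) and a map from global index to its slot in the ghost buffer.
    """
    remote = sorted({i for i in index_array
                     if not owned_start <= i < owned_stop})
    slot = {g: s for s, g in enumerate(remote)}
    return remote, slot


def executor(x_local, owned_start, ghost, slot, index_array):
    """Run on each iteration: follow the plan, no index analysis needed."""
    out = []
    for i in index_array:
        if i in slot:
            out.append(ghost[slot[i]])          # value fetched from elsewhere
        else:
            out.append(x_local[i - owned_start])  # value stored locally
    return out


x_global = [10 * i for i in range(8)]  # whole array, for the simulation only
owned_start, owned_stop = 2, 5         # this processor owns elements 2, 3, 4
x_local = x_global[owned_start:owned_stop]
index_array = [2, 7, 3, 0, 7]          # indirection known only at run time

remote, slot = inspector(index_array, owned_start, owned_stop)
ghost = [x_global[g] for g in remote]  # stands in for the message exchange
gathered = executor(x_local, owned_start, ghost, slot, index_array)
```

The plan is computed once, and each remote element is fetched a single time even when it is referenced repeatedly. That reuse is what makes the one-time inspection cost worthwhile.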


  • Functional library: parallelized versions of standard functions are applied to user-defined data structures (ScaLAPACK, FFTPACK)

  • Data structure library: a parallel data structure is maintained within the library, and its representation is hidden from the user (DAGH)

    • Well suited for OO languages

    • Provides max flexibility to the library developer to manage runtime challenges

      • Heterogeneous networks

      • Adaptive gridding

      • Variable latencies

  • Drawback: their components are currently treated by compilers as black boxes

    • Some sort of collaboration between compiler and library might be possible, particularly in an interprocedural compilation

Programming Tools

  • Tools like Pablo, Gist and Upshot can show where performance bottlenecks exist

  • Performance-tuning tools

Future Directions (Assumptions)

  • The user is responsible for both problem decomposition and assignment

  • Some kind of service negotiator runs prior to execution and determines the available nodes and their relative power

  • Some portion of compilation will be invoked after this service

Task Compilation

  • Constructing a task graph, along with an estimation of running time for each task

    • TG construction and decomposition

    • Performance Estimation

  • Restructuring the program to better suit the target grid configuration

  • Assignments of components of the TG to the available nodes

    • Java
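
The steps above (a task graph with running-time estimates, then assignment to the available nodes) can be sketched with a simple greedy list scheduler. This is an illustrative assumption, not a method given in the slides. It ignores communication costs, the names are hypothetical, and it uses the standard-library `graphlib` module (Python 3.9 or later) for the topological order.

```python
from graphlib import TopologicalSorter


def schedule(tasks, deps, powers):
    """Greedy assignment of a task graph to nodes of differing power.

    tasks:  {task name: estimated work}
    deps:   {task name: set of tasks that must finish first}
    powers: {node name: relative speed}
    Returns {task name: (node, start time, finish time)}.
    """
    node_free = {node: 0.0 for node in powers}
    placed = {}
    graph = {task: deps.get(task, ()) for task in tasks}
    for task in TopologicalSorter(graph).static_order():
        # A task may start only after all of its predecessors finish.
        ready = max((placed[p][2] for p in deps.get(task, ())), default=0.0)
        best = None
        for node, speed in powers.items():
            start = max(ready, node_free[node])
            finish = start + tasks[task] / speed
            if best is None or finish < best[2]:
                best = (node, start, finish)
        placed[task] = best
        node_free[best[0]] = best[2]
    return placed


tasks = {"a": 4, "b": 4, "c": 2}
deps = {"c": {"a", "b"}}
powers = {"fast": 2.0, "slow": 1.0}
plan = schedule(tasks, deps, powers)
```

Each task goes to the node that would finish it earliest, given the estimated work and the node's relative power. A grid-aware version would also need to account for the cost of moving data between nodes.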

Grid Shared Memory (Challenges)

  • Different nodes have different page sizes and paging mechanisms

  • Good Performance Estimation

  • Managing the system level interaction providing DSM

Global Grid Compilation

  • Providing a programming language and compilation strategy targeted to grid

  • Mixture of parallelism styles, data parallelism and task parallelism

    • Data decomposition

    • Function decomposition