Compiler, Languages, and Libraries


Compiler, Languages, and Libraries. ECE Dept., University of Tehran. Parallel Processing Course Seminar. Hadi Esmaeilzadeh, 810181079.


Compiler, Languages, and Libraries

ECE Dept., University of Tehran

Parallel Processing Course Seminar

Hadi Esmaeilzadeh



Introduction
  • Distributed systems are heterogeneous:
    • Power
    • Architecture
    • Data Representation
  • Data access latencies are long and vary with the underlying network traffic
  • Network bandwidths are limited and can vary dramatically with the underlying load
Programming Support Systems: Principles
  • Principle: each component of the system should do what it does best
  • The application developer should be able to concentrate on problem analysis and decomposition at a fairly high level of abstraction
Programming Support Systems: Goals
  • They should make applications easy to develop
  • Build applications that are portable across different architectures and computing configurations
  • Achieve high performance, close to what an expert programmer can achieve using the underlying features of the network and computing configurations
  • Exploit various forms of parallelism to balance the load across a heterogeneous configuration
    • Minimizing the computation time
    • Matching the communication to the underlying bandwidths and latencies
    • Ensuring that performance variability remains within certain bounds
  • The user focuses on what is being computed rather than how
  • The performance penalty should be no worse than a factor of two
  • Automatic vectorization
    • Dependence analysis
  • Asynchronous (MIMD) Parallel Processing
    • Symmetric multiprocessor (SMP)
Distributed Memory Architecture
  • Caches
  • Higher latency of large memories
  • Determine how to apportion data to the memories of the processors in a way that
    • Maximizes local memory access
    • Minimizes communication
  • Regions of parallel execution have to be large enough to compensate for the overhead of initiation and synchronization
  • Interprocedural analysis and optimization
  • Mechanisms that involve the programmer in the design of the parallelization as well as the problem solution will be required
Explicit Communication
  • Message passing to get data from remote memories
  • A single version of the program runs on all processors
  • The computation is specialized to each processor by having it obtain its processor number and use that number to index its own data
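As a minimal sketch of the last point, the following Python fragment shows one common way a single program can specialize itself by processor number. The function names `local_block` and `spmd_body` and the block layout are illustrative assumptions, not something taken from the slides; a real program would obtain `rank` and `nprocs` from its message-passing library rather than from a loop.

```python
def local_block(rank: int, nprocs: int, n: int) -> range:
    """Return the indices owned by processor `rank` under a block layout."""
    base, extra = divmod(n, nprocs)
    # The first `extra` processors own one additional element each.
    start = rank * base + min(rank, extra)
    size = base + (1 if rank < extra else 0)
    return range(start, start + size)


def spmd_body(rank: int, nprocs: int, data: list[float]) -> float:
    """Identical code on every processor; only `rank` selects the data touched."""
    return sum(data[i] for i in local_block(rank, nprocs, len(data)))


if __name__ == "__main__":
    data = [float(i) for i in range(10)]
    # Simulate four processors by calling the same body with different ranks.
    partials = [spmd_body(r, 4, data) for r in range(4)]
    print(partials, sum(partials))
```

Every processor runs the same `spmd_body`; the only difference between them is the value of `rank`, which determines the slice of `data` each one reads.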
Send-Receive Model
  • A shared-memory environment
  • Each processor not only receives the data it needs but also sends the data that other processors require
  • PVM
  • MPI
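The two-sided nature of this model can be sketched with Python threads and queues standing in for processors and message channels. This is only an illustration of the pattern, not the PVM or MPI API; `worker` and `exchange` are hypothetical names, and each "processor" needs one boundary value from its neighbour.

```python
import queue
import threading


def worker(rank, inbox, outbox, local, results):
    # Send the boundary value the neighbour needs ...
    outbox.put(local[-1] if rank == 0 else local[0])
    # ... and separately receive the value this processor needs.
    halo = inbox.get()
    results[rank] = sum(local) + halo


def exchange(left, right):
    """Run two cooperating 'processors' that must both send and receive."""
    to_right, to_left = queue.Queue(), queue.Queue()
    results = {}
    threads = [
        threading.Thread(target=worker, args=(0, to_left, to_right, left, results)),
        threading.Thread(target=worker, args=(1, to_right, to_left, right, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(exchange([1, 2, 3], [4, 5, 6]))
```

If either side omitted its `put`, the other would block forever in `get`, which is the bookkeeping burden the send-receive model places on the programmer.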
Get-Put Model
  • The processor that needs data from a remote memory is able to explicitly get it without requiring any explicit action by the remote processor
  • The program is responsible for:
    • Decomposition of computation
      • The power of each individual processor
    • Load balancing
    • Layout of the memory
    • Management of latency
    • Organization and optimization of communication
  • Explicit communication can be thought of as an assembly language for grids
Distributed Shared Memory
  • DSM as a vehicle for hiding complexities of memory and communication management
  • To the programmer, the address space appears as flat as that of a single-processor machine
  • The hardware/software is responsible for retrieving data from remote memories by generating the needed communication
Hardware Approach
  • Stanford DASH, HP/Convex Exemplar, SGI Origin
  • Local cache misses initiate data transfer from remote memory if needed
Software Scheme
  • Shared Virtual Memory, TreadMarks
  • Relies on the paging mechanism in the operating system
  • Transfers whole pages on demand between operating systems
  • Makes granularity and latency significantly large
  • Used in conjunction with relaxed memory consistency models and support for latency hiding
  • The programmer is freed from handling thread packaging and parallel loops
  • Has performance penalties and is therefore most useful for coarser-grained parallelism
  • Works best with some help from the programmer on the layout of memory
  • Is a promising strategy for simplifying the programming model
Data-Parallel Languages
  • High performance on distributed memory:
    • Allocate data to the various processor memories to maximize locality and minimize communication
  • For scaling parallelism to hundreds or thousands of processors, data parallelism is necessary
    • Data parallelism: subdividing the data domain in some manner and assigning the subdomains to different processors (data layout)
  • These are the foundations for data-parallel languages
  • Fortran D, Vienna Fortran, CM Fortran, C*, data-parallel C, and pC++
  • High Performance Fortran (HPF), and High Performance C++ (HPC++)
  • HPF provides directives for data layout on top of Fortran 90 and Fortran 95
  • Directives have no effect on the meaning of the program
  • They advise the compiler on how to assign elements of the program's arrays and data structures to different processors
  • These specifications are relatively machine independent
  • The principal focus is the layout of arrays
  • Arrays are typically associated with the data domains of the underlying problem
  • The principal drawback: limited support for problems on irregular meshes
    • Distribution via run-time array
    • Generalized block distribution (blocks to be of different sizes)
  • For heterogeneous machines: block sizes can be adapted to the powers of the target machines (generalized block distribution)
  • Unsynchronized for-loops
  • Parallel template libraries, with parallel or distributed data structures as basis
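The generalized block distribution mentioned above, with block sizes adapted to the power of each target machine, can be sketched as follows. This is an illustrative Python fragment, not HPF syntax; the function names and the largest-remainder rounding rule are assumptions made for the example.

```python
def generalized_block_sizes(n: int, powers: list[float]) -> list[int]:
    """Split n array elements into blocks proportional to each node's power."""
    total = sum(powers)
    ideal = [n * p / total for p in powers]   # fractional share per node
    sizes = [int(x) for x in ideal]           # round every share down
    leftover = n - sum(sizes)
    # Give the remaining elements to the nodes with the largest remainders.
    order = sorted(range(len(powers)),
                   key=lambda i: ideal[i] - sizes[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes


def block_bounds(sizes: list[int]) -> list[tuple[int, int]]:
    """Convert block sizes to half-open (start, end) index ranges."""
    bounds, start = [], 0
    for s in sizes:
        bounds.append((start, start + s))
        start += s
    return bounds


if __name__ == "__main__":
    sizes = generalized_block_sizes(10, [1.0, 2.0, 4.0])
    print(sizes, block_bounds(sizes))
```

A node with four times the relative power receives roughly four times as many array elements, while the blocks remain contiguous.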
Task Parallelism
  • Different components of the same computation are executed in parallel
  • Different tasks can be allocated to different nodes of the grid
  • Object parallelism (Different tasks may be components of objects of different classes)
  • Task parallelism need not be restricted to shared-memory systems and can be defined in terms of communication library
HPF 2.0 Extensions for Task Parallelism
  • Can be implemented on both shared- and distributed-memory systems
  • Providing a way for a set of cases to be run in parallel with no communication until synchronization at the end
  • Remaining problems in using HPF on a computational grid:
    • Load matching
    • Communication optimization
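The idea of running a set of cases in parallel with no communication until a final synchronization can be sketched in Python with the standard `concurrent.futures` module. This is an analogy only and does not reproduce HPF 2.0 syntax; `run_cases` is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor


def run_cases(cases, inputs):
    """Run independent cases concurrently and synchronize once at the end."""
    with ThreadPoolExecutor(max_workers=len(cases)) as pool:
        futures = [pool.submit(case, x) for case, x in zip(cases, inputs)]
        # The only synchronization point: wait for every case to finish.
        return [f.result() for f in futures]


if __name__ == "__main__":
    print(run_cases([lambda x: x + 1, lambda x: x * 2], [1, 5]))
```

Because the cases share no data while running, they can be placed on separate nodes without any communication during the computation itself.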
Coarse-Grained Software Integration
  • A complete application is not a simple program
  • It is a collection of programs that must all be run, passing data to one another
  • The main technical challenge of the integration is how to prevent performance degradation due to sequential processing of the various programs
  • Each program can be viewed as a task
  • Tasks are collected and matched to the power of the various nodes in the grid
Latency Tolerance
  • Dealing with long memory or communication latencies
  • Latency hiding: data communication is overlapped with computation (software prefetching)
  • Latency reduction: programs are reorganized to reuse more data in local memories (loop blocking for cache)
  • More complex to implement on heterogeneous distributed computers
    • Latencies are large and variable
    • More time to be spent on estimating running times
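Loop blocking, the latency-reduction technique named above, can be illustrated with a tiled matrix multiplication. The sketch below is in pure Python, so it shows only the loop structure and will not demonstrate the cache benefit itself; the function name and the default tile size are arbitrary assumptions.

```python
def matmul_blocked(a, b, block=2):
    """Multiply matrices a (n x m) and b (m x p) one tile at a time."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    # The outer three loops walk over tiles; the inner three stay inside one
    # tile, so the same small pieces of a, b and c are reused while "hot".
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + block, p)):
                            c[i][j] += aik * b[k][j]
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    print(matmul_blocked(a, identity))
```

The result is the same as for the untiled loop nest; only the order in which elements are visited changes.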
Load Balancing
  • Spreading the calculation evenly across processors while minimizing communication
  • Simulated annealing, neural nets
  • Recursive bisection: at each stage, the work is divided into two equal parts.
  • For the grid: the power of each node must be taken into account
    • Performance prediction of components is essential
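A minimal sketch of recursive bisection adapted to unequal node powers follows. Instead of cutting the work into two equal parts, each stage cuts it in proportion to the total power on each side. The one-dimensional list of work-item costs, the function name, and the "closest cut" rule are assumptions made for the example.

```python
def recursive_bisection(costs, powers, offset=0):
    """Return one half-open (start, end) range of work items per node."""
    if len(powers) == 1:
        return [(offset, offset + len(costs))]
    mid = len(powers) // 2
    # Fraction of the work that the left half of the nodes should receive.
    left_share = sum(powers[:mid]) / sum(powers)
    target = left_share * sum(costs)
    # Choose the cut whose left-hand cost is closest to the target.
    running, best_cut, best_err = 0.0, 0, abs(target)
    for i, c in enumerate(costs, start=1):
        running += c
        err = abs(running - target)
        if err < best_err:
            best_cut, best_err = i, err
    return (recursive_bisection(costs[:best_cut], powers[:mid], offset)
            + recursive_bisection(costs[best_cut:], powers[mid:],
                                  offset + best_cut))


if __name__ == "__main__":
    print(recursive_bisection([1.0] * 8, [3.0, 1.0]))
```

With equal powers this reduces to the classic scheme of halving the work at every stage. The per-item costs play the role of the performance predictions that the slide calls essential.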
Runtime Compilation
  • A problem with automatic load-balancing (especially on irregular grids)
    • Unknown loop upper bounds
    • Unknown array sizes
  • Inspector/executor model
    • Inspector: executed a single time at run time, establishes a plan for efficient execution
    • Executor: executed on each iteration, carries out the plan defined by the inspector
  • Functional library: parallelized versions of standard functions are applied to user-defined data structures (ScaLAPACK, FFTPACK)
  • Data structure library: a parallel data structure is maintained within the library, and its representation is hidden from the user (DAGH)
    • Well suited for OO languages
    • Provides max flexibility to the library developer to manage runtime challenges
      • Heterogeneous networks
      • Adaptive gridding
      • Variable latencies
  • Drawback: their components are currently treated by compilers as black boxes
    • Some sort of collaboration between compiler and library might be possible, particularly in an interprocedural compilation
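The inspector/executor model described above can be sketched for an irregular access pattern whose index array is known only at run time. In this illustrative Python fragment the names, the single global list standing in for remote memories, and the dictionary standing in for a communication step are all assumptions.

```python
def inspector(idx, owned):
    """Run once: find which referenced elements are not owned locally."""
    lo, hi = owned
    return sorted({j for j in idx if not lo <= j < hi})


def executor(x_global, idx, owned, schedule):
    """Run on every iteration: fetch the scheduled elements, then compute."""
    lo, hi = owned
    # Stand-in for communication: one batched fetch of the planned elements.
    ghost = {j: x_global[j] for j in schedule}
    local = x_global[lo:hi]
    return [local[j - lo] if lo <= j < hi else ghost[j] for j in idx]


if __name__ == "__main__":
    x = [10, 20, 30, 40, 50, 60]
    idx = [0, 5, 1, 4]            # known only at run time
    plan = inspector(idx, (0, 3)) # computed a single time
    for _ in range(3):            # reused on each iteration
        print(executor(x, idx, (0, 3), plan))
```

The cost of analysing `idx` is paid once by the inspector and amortized over every later call to the executor.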
Programming Tools
  • Tools like Pablo, Gist and Upshot can show where performance bottlenecks exist
  • Performance-tuning tools
Future Directions (Assumptions)
  • The user is responsible for both problem decomposition and assignment
  • Some kind of service negotiator runs prior to execution and determines the available nodes and their relative power
  • Some portion of compilation will be invoked after this service
Task Compilation
  • Constructing a task graph, along with an estimation of running time for each task
    • TG construction and decomposition
    • Performance Estimation
  • Restructuring the program to better suit the target grid configuration
  • Assignment of components of the TG to the available nodes
    • Java
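One possible way to combine a task graph, per-task running-time estimates, and relative node powers is a greedy list scheduler, sketched below in Python. The slides do not prescribe this algorithm; the function name, the earliest-finish-time heuristic, and the omission of communication costs are all assumptions. It uses the standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def assign_tasks(est_time, deps, node_power):
    """Greedily place each task on the node where it would finish earliest.

    est_time:   task -> estimated running time on a node of power 1.0
    deps:       task -> list of tasks it depends on (every task is a key)
    node_power: node -> relative power reported by the service negotiator
    """
    finish, placement = {}, {}
    node_free = {n: 0.0 for n in node_power}
    for task in TopologicalSorter(deps).static_order():
        # A task can start only after all of its predecessors have finished.
        ready = max((finish[d] for d in deps.get(task, ())), default=0.0)
        best = min(node_power,
                   key=lambda n: max(ready, node_free[n])
                   + est_time[task] / node_power[n])
        start = max(ready, node_free[best])
        finish[task] = start + est_time[task] / node_power[best]
        node_free[best] = finish[task]
        placement[task] = best
    return placement, max(finish.values())


if __name__ == "__main__":
    est = {"a": 4.0, "b": 4.0, "c": 2.0}
    deps = {"a": [], "b": [], "c": ["a", "b"]}
    print(assign_tasks(est, deps, {"fast": 2.0, "slow": 1.0}))
```

The quality of such an assignment depends directly on the accuracy of the running-time estimates, which is why performance estimation appears as its own step above.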
Grid Shared Memory (Challenges)
  • Different nodes have different page sizes and paging mechanisms
  • Good performance estimation
  • Managing the system-level interactions that provide DSM
Global Grid Compilation
  • Providing a programming language and compilation strategy targeted to grid
  • Mixture of parallelism styles, data parallelism and task parallelism
    • Data decomposition
    • Function decomposition