an extended openmp targeting on the hybrid architecture of smp cluster
Download
Skip this Video
Download Presentation
AN EXTENDED OPENMP TARGETING ON THE HYBRID ARCHITECTURE OF SMP-CLUSTER

Loading in 2 Seconds...

play fullscreen
1 / 21

AN EXTENDED OPENMP TARGETING ON THE HYBRID ARCHITECTURE OF SMP-CLUSTER - PowerPoint PPT Presentation


  • 100 Views
  • Uploaded on

AN EXTENDED OPENMP TARGETING ON THE HYBRID ARCHITECTURE OF SMP-CLUSTER. Author : Y. Zhao 、 C. Hu 、 S. Wang 、 S. Zhang Source : Proceedings of the 2nd IASTED international conference on Advances in computer science and technology Speaker : Cheng-Jung Wu. Outline. Introduction

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'AN EXTENDED OPENMP TARGETING ON THE HYBRID ARCHITECTURE OF SMP-CLUSTER' - sinead


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
an extended openmp targeting on the hybrid architecture of smp cluster

AN EXTENDED OPENMP TARGETING ON THE HYBRID ARCHITECTURE OF SMP-CLUSTER

Author:Y. Zhao、 C. Hu、S. Wang、 S. Zhang

Source :Proceedings of the 2nd IASTED international conference on Advances in computer science and technology

Speaker : Cheng-Jung Wu

outline
Outline
  • Introduction
  • Extensions in EOMP
    • Computing Resource Definition
    • Hierarchical Data Layout and Data Mapping
  • Execution Model for EOMP
  • Experiments and Results
    • Dot Product
    • Matrix multiplication under EOMP execution model on SMP cluster
  • Conclusion
introduction
Introduction
  • Clusters of shared-memory multiprocessors (SMPs)
    • More and more popular in High Performance Computing area
  • SMP clusters’ hybrid architectures
    • Supports for a wide range of parallel paradigm
  • Three programming paradigms
    • Standard message passing
    • Hybrid paradigm corresponding to the underlying architecture
    • Shared memory paradigm built on a Software Distribute Software Memory (SDSM)
  • Three major metrics
    • Performance、Portability 、Programmability
introduction1
Introduction
  • None of the three parallel programming paradigms can meet on all of the three metrics
  • New parallel paradigm (EOMP)
    • A compromising model
      • Balance the three major metrics
    • Features
      • Good programmability
      • Acceptable performance
  • Improve memory behavior
    • Data locality
    • The programs running on SMP cluster
      • Inter-node and intra-node data locality
extensions in eomp
Extensions in EOMP
  • Since OpenMP
    • Shared-memory systems
    • Lacks the support for distributed memory system
  • New directives
    • Computing resource definition
    • Data mapping
computing resource definition
Computing Resource Definition
  • Definitions
    • Virtual node (VN)
    • Virtual processor (VP)
  • VNs
    • Physical nodes
    • Target units of inter-node data distribution
  • VPs
    • Physical processors
    • Target units of intra-node data reallocation and task scheduling during compilation
computing resource definition1
Computing Resource Definition
  • Semantics of computing resource definition directives
    • Examples for processor mapping are given
hierarchical data layout and data mapping inter node data mapping
Hierarchical Data Layout and Data Mapping:Inter-node Data Mapping
  • Scalar data defined in the EOMP
    • Shared data at default
  • Every node gets an own copy of the data
  • Inter-node task parallel
    • Allows the shared scalar data be modified in certain nodes
  • Global addresses of distributed arrays
    • Llocal addresses
  • Inter-node data mapping distributes the mapped arrays to VNs
  • Semantics for inter-node data distribution directive:
    • #pragma eomp distribute a (BLOCK*) onto N
hierarchical data layout and data mapping intra node data mapping
Hierarchical Data Layout and Data Mapping:Intra-node Data Mapping
  • Shared memory data layout takes the advantage of global address
  • Technically
    • No further data mapping is required inside the nodes
  • In certain cases
    • Improper order of data access would decrease cache performance
      • false sharing or long-stride access
  • For instance:two threads always access the nearby array elements in memory at same time
    • cache performance may be very poor due to severe false sharing
  • Optimizations for intra-node data layout will be necessary
hierarchical data layout and data mapping intra node data mapping1
Hierarchical Data Layout and Data Mapping:Intra-node Data Mapping
  • An extreme example
    • Experiment on that circumstance shows an overall 90% reduction of L1 cache miss after the intra-node data reallocation optimization
    • (On 4-cpu IA64 SMP; the array a is of 1M size).
hierarchical data layout and data mapping intra node data mapping2
Hierarchical Data Layout and Data Mapping:Intra-node Data Mapping
  • Two strategies can be adopted to reduce cache miss
    • Rearrange the access order of each thread
      • Not always possible for compiler optimization
      • It depends closely on the source program structure
      • In the interleaving data case above, this means to avoid accessing the neighboring data in memory at the same time.
    • Reallocate the data layout in memory,
      • Not change data dependencies of the source program
      • Assures the correctness of this optimization
      • Store the data that accessed by the same thread in a contiguous memory block
hierarchical data layout and data mapping intra node data mapping3
Hierarchical Data Layout and Data Mapping:Intra-node Data Mapping
  • Intra-node data reallocation
    • programmer-specified directives
    • compiler reference analysis
  • Intra-node data reallocating data in memory
    • Additional time and space overheads
    • Evaluating the performance speedup of this optimization
    • The data locations have been changed
      • The reallocated data should be forbidden
  • Semantics for intra-node data reallocation directive
    • #pragma eomp distribute a (CYCLIC,*) intra
execution model for eomp
Inter-node barriers and broadcasts

Modifications of shared variables in the parallel section at the edges of task parallel region

Maintain data consistency

Inter-node communications use explicit message passing

Execution Model for EOMP
execution model for eomp1
Massage passing & multithreading program generated

By compiler first distributes data and schedule the tasks across nodes

Then deals with the intra-node data reallocation and task scheduling

Execution Model for EOMP
experiments and results dot product1
Experiments and Results:Dot Product
  • The experiment result shows that the efficiency of the EOMP based on the runtime library is similar to the MPI+OpenMP program (better under some cases)
  • But not good as pure MPI, because the amount of calculations in the dot product operation are not enough, comparing to the cost of inside-node scheduling
matrix multiplication under eomp execution model on smp cluster
Matrix multiplication under EOMP execution model on SMP cluster
  • C=A*B
    • A and C is distributed in rows
    • B is distributed in columns
matrix multiplication under eomp execution model on smp cluster1
Matrix multiplication under EOMP execution model on SMP cluster
  • Matrix size is small
    • The cost of inter-node scheduling and communications are relatively high (compared with the computation cost)
    • The three distributed memory models can not acquire a speedup
  • Matrix size becomes larger
    • The three distributed memory models achieve reasonable speedups
matrix multiplication under eomp execution model on smp cluster2
Matrix multiplication under EOMP execution model on SMP cluster
  • Notice that the EOMP model after intra-node data reallocation
    • Gets a high speedup when the matrix size is large
    • Showing that the improved intra-node cache performance can greatly benefit the overall performance of the program on SMP clusters
matrix multiplication under eomp execution model on smp cluster3
Matrix multiplication under EOMP execution model on SMP cluster
  • Peaks of EOMP-INDR curves in 500*500 and 1000*1000 cases
    • The effect of data reallocation is related with both the size of cache line local b
    • As the nodes become more and more, the size of local b on each node becomes smaller
    • That means the cache line may fill in more rows of local b, thus the cache misses is reduced
    • Explain why the peak in 500*500 multiplication case comes earlier than that of the 1000*1000 case
conclusion
Conclusion
  • The experiment result
    • Feasibility of our execution model
    • The benefit gained from intra-node data reallocation
  • For future work, we plan to develop a complete source to source EOMP compiler
    • Be based on ORC (Open Resource Compiler for IA64)
    • Our current runtime library prototype
    • Focusing on the communication generation and data management.
ad