AN EXTENDED OPENMP TARGETING ON THE HYBRID ARCHITECTURE OF SMP-CLUSTER

Author: Y. Zhao, C. Hu, S. Wang, S. Zhang

Source: Proceedings of the 2nd IASTED International Conference on Advances in Computer Science and Technology

Speaker: Cheng-Jung Wu


Outline

  • Introduction

  • Extensions in EOMP

    • Computing Resource Definition

    • Hierarchical Data Layout and Data Mapping

  • Execution Model for EOMP

  • Experiments and Results

    • Dot Product

    • Matrix multiplication under EOMP execution model on SMP cluster

  • Conclusion


Introduction

  • Clusters of shared-memory multiprocessors (SMPs)

    • Increasingly popular in the High Performance Computing area

  • SMP clusters’ hybrid architectures

    • Support a wide range of parallel paradigms

  • Three programming paradigms

    • Standard message passing

    • Hybrid paradigm corresponding to the underlying architecture

    • Shared-memory paradigm built on a Software Distributed Shared Memory (SDSM)

  • Three major metrics

    • Performance, portability, and programmability


Introduction

  • None of the three parallel programming paradigms does well on all three metrics

  • New parallel paradigm (EOMP)

    • A compromise model

      • Balances the three major metrics

    • Features

      • Good programmability

      • Acceptable performance

  • Improving memory behavior

    • Data locality

    • For programs running on an SMP cluster

      • Both inter-node and intra-node data locality matter


Extensions in EOMP

  • OpenMP was designed for

    • Shared-memory systems

    • It lacks support for distributed-memory systems

  • New directives

    • Computing resource definition

    • Data mapping


Computing Resource Definition

  • Definitions

    • Virtual node (VN)

    • Virtual processor (VP)

  • VNs

    • Physical nodes

    • Target units of inter-node data distribution

  • VPs

    • Physical processors

    • Target units of intra-node data reallocation and task scheduling during compilation


Computing Resource Definition

  • Semantics of computing resource definition directives

    • Examples of processor mapping are given on the slide (not reproduced in this transcript)
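
Because the transcript does not reproduce the directive syntax from the slide, the following is a hypothetical sketch, assuming illustrative directive names that are not confirmed by the paper, of how virtual nodes and virtual processors might be declared and mapped.

    /* Hypothetical sketch only: declare 4 virtual nodes (VNs), each holding
     * 2 virtual processors (VPs), mapped onto a 4-node dual-CPU SMP cluster.
     * Directive names and syntax are illustrative, not taken from the paper. */
    #pragma eomp nodes N(4)
    #pragma eomp processors P(2) onto N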


Hierarchical Data Layout and Data Mapping: Inter-node Data Mapping

  • Scalar data defined in EOMP

    • Shared by default

  • Every node gets its own copy of the data

  • Inter-node task parallelism

    • Allows the shared scalar data to be modified on certain nodes

  • Global addresses of distributed arrays

    • Are translated to local addresses

  • Inter-node data mapping distributes the mapped arrays to VNs

  • Semantics for inter-node data distribution directive:

    • #pragma eomp distribute a (BLOCK,*) onto N
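
A minimal usage sketch of this directive, assuming a two-dimensional array a and the VN arrangement N from the slide; the surrounding declarations are illustrative only.

    #define N_ROWS 1024
    #define N_COLS 1024
    double a[N_ROWS][N_COLS];

    /* Distribute the rows of a block-wise across the virtual nodes in N
     * (first dimension BLOCK, second dimension not distributed): each VN
     * owns one contiguous band of rows, which becomes the unit of the
     * inter-node message passing generated by the compiler. */
    #pragma eomp distribute a (BLOCK,*) onto N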


Hierarchical Data Layout and Data Mapping: Intra-node Data Mapping

  • Shared-memory data layout takes advantage of the global address space

  • Technically

    • No further data mapping is required inside the nodes

  • In certain cases

    • Improper order of data access would decrease cache performance

      • false sharing or long-stride access

  • For instance: two threads always access neighboring array elements in memory at the same time

    • Cache performance may be very poor due to severe false sharing

  • Optimizations for intra-node data layout will be necessary
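
A small OpenMP sketch of the interleaved-access pattern described above (illustrative only, not the paper's benchmark code): both threads write elements that share cache lines, so the lines bounce between the two processors.

    #include <omp.h>

    #define LEN 1048576
    static double a[LEN];

    /* Interleaved access: thread 0 writes a[0], a[2], a[4], ... while
     * thread 1 writes a[1], a[3], a[5], ...  Neighboring elements live in
     * the same cache line, so every write invalidates the other thread's
     * copy of that line (false sharing). */
    void interleaved_update(void)
    {
        #pragma omp parallel num_threads(2)
        {
            int tid = omp_get_thread_num();
            for (int i = tid; i < LEN; i += 2)
                a[i] += 1.0;
        }
    }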


Hierarchical Data Layout and Data Mapping: Intra-node Data Mapping

  • An extreme example

    • An experiment on that circumstance shows an overall 90% reduction in L1 cache misses after the intra-node data reallocation optimization

    • (On a 4-CPU IA-64 SMP; the array a is of size 1M.)


Hierarchical Data Layout and Data Mapping: Intra-node Data Mapping

  • Two strategies can be adopted to reduce cache misses

    • Rearrange the access order of each thread

      • Not always possible as a compiler optimization

      • It depends closely on the structure of the source program

      • In the interleaved-data case above, this means avoiding access to neighboring data in memory at the same time

    • Reallocate the data layout in memory

      • Does not change the data dependencies of the source program

      • Which assures the correctness of this optimization

      • Store the data accessed by the same thread in a contiguous memory block (see the sketch below)
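
A minimal sketch of the second strategy, under the same assumptions as the interleaved example earlier: the reallocated copy stores each thread's elements contiguously, so the two threads write to disjoint cache lines while the computation itself is unchanged.

    #include <omp.h>

    #define LEN 1048576
    static double a_re[LEN];   /* reallocated layout of the array */

    /* After reallocation, thread t owns the contiguous half
     * a_re[t*LEN/2 .. (t+1)*LEN/2 - 1].  Only the memory layout differs;
     * the data dependencies of the source program are unchanged, which is
     * what makes the optimization safe. */
    void blocked_update(void)
    {
        #pragma omp parallel num_threads(2)
        {
            int tid   = omp_get_thread_num();
            int chunk = LEN / 2;
            for (int i = tid * chunk; i < (tid + 1) * chunk; i++)
                a_re[i] += 1.0;
        }
    }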


Hierarchical Data Layout and Data Mapping: Intra-node Data Mapping

  • Intra-node data reallocation is driven by

    • Programmer-specified directives

    • Compiler reference analysis

  • Reallocating data in memory within a node

    • Introduces additional time and space overheads

    • These must be weighed when evaluating the performance speedup of this optimization

    • Because the data locations have been changed

      • Accessing the reallocated data through the original layout should be forbidden

  • Semantics for intra-node data reallocation directive

    • #pragma eomp distribute a (CYCLIC,*) intra
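
A short usage sketch of this directive, under one plausible reading that is not spelled out in the transcript: array elements are assigned to VPs cyclically, and the node-local data are reallocated so that each VP's elements become contiguous in memory.

    /* Illustrative only: reallocate the node-local part of a so that the
     * elements assigned cyclically to each VP are stored in a contiguous
     * block; "intra" restricts the reallocation to the node level. */
    #pragma eomp distribute a (CYCLIC,*) intra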


Execution Model for EOMP

  • Inter-node barriers and broadcasts at the edges of a task-parallel region

    • Propagate modifications of shared variables made in the parallel section

    • Maintain data consistency

  • Inter-node communications use explicit message passing


Execution Model for EOMP

  • A message-passing & multithreading program is generated

    • The compiler first distributes the data and schedules the tasks across nodes

    • Then it deals with intra-node data reallocation and task scheduling
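
A hedged sketch (not the compiler's actual output) of the kind of message-passing and multithreading code this model produces for a block-distributed loop.

    #include <mpi.h>
    #include <omp.h>

    /* Sketch of the generated structure: OpenMP threads work on the
     * node-local block (intra-node level), and explicit MPI calls at the
     * edges of the region keep shared data consistent across nodes
     * (inter-node level). */
    void generated_region(double *local_a, int local_n, double *shared_s,
                          int owner_rank, MPI_Comm comm)
    {
        /* intra-node: multithreaded work on the local block */
        #pragma omp parallel for
        for (int i = 0; i < local_n; i++)
            local_a[i] = 2.0 * local_a[i];

        /* inter-node: barrier plus broadcast of a shared scalar modified
         * on its owner node, as described on the previous slide */
        MPI_Barrier(comm);
        MPI_Bcast(shared_s, 1, MPI_DOUBLE, owner_rank, comm);
    }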



Experiments and Results: Dot Product

  • The experiment results show that the efficiency of EOMP based on the runtime library is similar to that of the MPI+OpenMP program (better in some cases)

  • But not as good as pure MPI, because the amount of computation in the dot product operation is too small compared with the cost of intra-node scheduling
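
For scale, a minimal MPI+OpenMP dot product of the kind used as a baseline here (a sketch, not the paper's code): the work per element is a single multiply-add, which is why scheduling and reduction overheads dominate.

    #include <mpi.h>
    #include <omp.h>

    /* Hybrid baseline: each node reduces its local block with OpenMP,
     * then MPI combines the partial sums across nodes. */
    double dot_product(const double *x, const double *y, int local_n,
                       MPI_Comm comm)
    {
        double local = 0.0, global = 0.0;

        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < local_n; i++)
            local += x[i] * y[i];

        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        return global;
    }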


Matrix multiplication under EOMP execution model on SMP cluster

  • C=A*B

    • A and C are distributed by rows

    • B is distributed by columns
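
A sketch of the local computation implied by this distribution, with hypothetical block sizes rows_per_node and cols_per_node: each node multiplies its rows of A by the columns of B it currently holds to produce one block of its rows of C (the remaining blocks need B columns held by other nodes).

    /* Each node holds rows_per_node rows of A and C and cols_per_node
     * columns of B (row-major).  With the locally available columns of B
     * it computes the corresponding column block of its rows of C. */
    void local_block_multiply(int n, int rows_per_node, int cols_per_node,
                              const double *A_loc,  /* rows_per_node x n */
                              const double *B_loc,  /* n x cols_per_node */
                              double *C_loc)        /* rows_per_node x cols_per_node */
    {
        for (int i = 0; i < rows_per_node; i++)
            for (int j = 0; j < cols_per_node; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A_loc[i * n + k] * B_loc[k * cols_per_node + j];
                C_loc[i * cols_per_node + j] = sum;
            }
    }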


Matrix multiplication under EOMP execution model on SMP cluster

  • When the matrix size is small

    • The cost of inter-node scheduling and communication is relatively high (compared with the computation cost)

    • The three distributed-memory models cannot achieve a speedup

  • When the matrix size becomes larger

    • The three distributed-memory models achieve reasonable speedups


Matrix multiplication under EOMP execution model on SMP cluster

  • Notice that the EOMP model after intra-node data reallocation

    • Gets a high speedup when the matrix size is large

    • Showing that the improved intra-node cache performance can greatly benefit the overall performance of the program on SMP clusters


Matrix multiplication under EOMP execution model on SMP cluster

  • Peaks of the EOMP-INDR curves in the 500*500 and 1000*1000 cases

    • The effect of data reallocation is related to both the cache line size and the size of the local block of B

    • As the number of nodes grows, the local block of B on each node becomes smaller

    • That means one cache line can cover more rows of the local B, so cache misses are reduced

    • This explains why the peak in the 500*500 case comes earlier than that of the 1000*1000 case


Conclusion

  • The experiment results show

    • The feasibility of our execution model

    • The benefit gained from intra-node data reallocation

  • For future work, we plan to develop a complete source-to-source EOMP compiler

    • Based on ORC (the Open Research Compiler for IA-64)

    • And on our current runtime library prototype

    • Focusing on communication generation and data management

