An Evaluation of Auto-Scoping in OpenMP

Michael Voss, Eric Chiu, Patrick Chow, Catherine Wong and Kevin Yuen

ECE Department

University of Toronto


An Overview of Auto-scoping

  • Dieter an Mey proposed Auto-scoping as an extension to OpenMP (www.cOMPunity.org)

  • Relieves users of the burden of explicit scoping, which is

    • error prone

    • tedious

  • A compromise between explicit and automatic parallelization

  • The analysis is similar to that of automatic parallelization

    • automatic parallelization is successful in 1 of 2 scientific programs

WOMPAT 2004


C$OMP PARALLEL DO SHARED(A,B)
C$OMP&PRIVATE(I,J)
      DO I = 1,100
        DO J = 1,100
          A(I,J) = A(J,I) + B(I,J)
        ENDDO
      ENDDO
C$OMP END PARALLEL DO

Using DEFAULT(AUTO)

C$OMP PARALLEL DO
C$OMP&DEFAULT(AUTO)
      DO I = 1,100
        DO J = 1,100
          A(I,J) = A(J,I) + B(I,J)
        ENDDO
      ENDDO
C$OMP END PARALLEL DO
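
The scoping that DEFAULT(AUTO) has to recover for a loop nest like this one can be illustrated with a toy classifier. The sketch below is written in Python rather than Fortran. Its rules (loop indices are private, arrays and read-only scalars are shared, scalars written before they are read are private) are simplifying assumptions for illustration, not the analysis that Polaris or any other compiler performs, and it does no dependence testing. The function name `auto_scope` and its input encoding are hypothetical.

```python
def auto_scope(loop_indices, statements, arrays):
    """Toy scoping classifier (illustrative assumption, not a real compiler pass).

    loop_indices: names of the DO-loop index variables
    statements:   list of (written_name, [read_names]) in program order
    arrays:       set of names that are arrays
    Returns a dict mapping each name to "PRIVATE", "SHARED", or "UNKNOWN".
    """
    scope = {idx: "PRIVATE" for idx in loop_indices}

    written = set()   # scalars assigned somewhere in the loop body
    exposed = set()   # scalars read before any assignment in the body
    names = set()
    for target, reads in statements:
        for name in reads:
            is_scalar = name not in arrays and name not in loop_indices
            if is_scalar and name not in written:
                exposed.add(name)
        if target not in arrays:
            written.add(target)
        names.add(target)
        names.update(reads)

    for name in sorted(names):
        if name in scope:
            continue                      # loop index, already private
        if name in arrays:
            scope[name] = "SHARED"        # assumes the loop is dependence-free
        elif name not in written:
            scope[name] = "SHARED"        # read-only scalar
        elif name not in exposed:
            scope[name] = "PRIVATE"       # written before read in each iteration
        else:
            scope[name] = "UNKNOWN"       # needs deeper analysis or user scoping
    return scope


# The loop nest above: A(I,J) = A(J,I) + B(I,J)
print(auto_scope(["I", "J"], [("A", ["A", "B", "I", "J"])], {"A", "B"}))
```

For the loop nest above, this classification matches the explicit SHARED(A,B) and PRIVATE(I,J) clauses of the hand-scoped version.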



Outline of Talk

  • Introduction

  • Implementing DEFAULT(AUTO) in Polaris

  • An evaluation of DEFAULT(AUTO) in Polaris

    • comparison with the Early Access (EA) Sun Studio 9 F95 compiler

  • A Discussion of runtime support

  • Related Work

  • Conclusion



Implementing DEFAULT(AUTO) in Polaris

  • Polaris is an auto-parallelizer for Fortran 77

  • Supports a range of advanced techniques

    • The Range Test

    • The Omega Test

    • Array and Scalar Privatization

    • Array and Scalar Reduction Recognition

    • Induction Variable Substitution

    • Interprocedural Constant Propagation

    • Most interprocedural optimization is done by inlining



Polaris as an OMP to OMP Translator

(Diagram: three paths through Polaris)

  • Original automatic parallelization path: Fortran 77 → Parser → DDtest pass → Reduction pass → Privatization pass → OpenMP Backend → Fortran 77 + OpenMP

  • OpenMP to explicitly threaded code path: Fortran 77 + OpenMP → Parser → Moerae Backend → Fortran 77 + Moerae calls

  • New OpenMP to OpenMP path: Fortran 77 + OpenMP → Polaris → Fortran 77 + OpenMP



Supporting DEFAULT(AUTO)

  • Parse DEFAULT(AUTO)

  • React appropriately to user directives

    • selective loop parallelization

    • no changes without AUTO directive

    • user scoping overrides Polaris scoping

      • can parallelize loops that cannot be fully auto-scoped

  • Limitations

    • only regions with PARALLEL DO semantics

    • bails out on general parallel regions
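
The rule that user scoping overrides Polaris scoping can be sketched as a simple merge in which explicit clauses win and the loop is parallelizable once nothing is left unresolved. This Python sketch is only an illustration of that precedence rule; the function name `merge_scoping`, the "UNKNOWN" marker, and the variable names in the usage lines are hypothetical and do not come from Polaris.

```python
def merge_scoping(inferred, user_clauses):
    """Combine compiler-inferred scopes with explicit user clauses.

    inferred:     dict name -> "PRIVATE" | "SHARED" | "UNKNOWN" (hypothetical
                  result of an auto-scoping analysis)
    user_clauses: dict name -> scope taken from explicit clauses
    Returns (merged, unresolved); explicit clauses take precedence.
    """
    merged = dict(inferred)
    merged.update(user_clauses)          # the user's scoping overrides the analysis
    unresolved = sorted(n for n, s in merged.items() if s == "UNKNOWN")
    return merged, unresolved


# Hypothetical case: the analysis cannot scope array D on its own,
# but an explicit SHARED(D) clause resolves it.
inferred = {"I": "PRIVATE", "T": "PRIVATE", "D": "UNKNOWN"}
merged, unresolved = merge_scoping(inferred, {"D": "SHARED"})
print(merged, unresolved)
```

With an empty `user_clauses`, `D` stays unresolved and, under the assumptions of this sketch, the region would be left unchanged.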



Example 1: No explicit scoping

!$OMP PARALLEL DEFAULT(AUTO)
      DO N = 1,7
        DO M = 1,7
!$OMP DO
          DO L = LSS(itsub),LEE(itsub)
            I = IG(L)
            J = JG(L)
            K = KG(L)
            LIJK = L2IJK(L)
            RHS(L,M) = RHS(L,M)
     +        - FJAC(LIJK,LM00,M,N)*DQCO(i-1,j,k,n,NB)*FM00(L)
     +        - FJAC(LIJK,LP00,M,N)*DQCO(i+1,j,k,n,NB)*FP00(L)
     +        - FJAC(LIJK,L0M0,M,N)*DQCO(i,j-1,k,n,NB)*F0M0(L)
     +        - FJAC(LIJK,L0P0,M,N)*DQCO(i,j+1,k,n,NB)*F0P0(L)
          ENDDO
!$OMP END DO NOWAIT
        ENDDO
      ENDDO
!$OMP END PARALLEL



Example 1: No explicit scoping (Polaris output)

!$OMP PARALLEL
!$OMP+DEFAULT(SHARED)
!$OMP+PRIVATE(M,L,N)
      DO n = 1, 7, 1
      DO m = 1, 7, 1
!$OMP DO
      DO l = lss(itsub), lee(itsub), 1
      rhs(l, m) = rhs(l, m)+(-dqco(ig(l), (-1)+jg(l), kg(l), n, nb))*
     *f0m0(l)*fjac(l2ijk(l), l0m0, m, n)+(-dqco(ig(l), 1+jg(l), kg(l), n
     *, nb))*f0p0(l)*fjac(l2ijk(l), l0p0, m, n)+(-dqco((-1)+ig(l), jg(l)
     *, kg(l), n, nb))*fjac(l2ijk(l), lm00, m, n)*fm00(l)+(-dqco(1+ig(l)
     *, jg(l), kg(l), n, nb))*fjac(l2ijk(l), lp00, m, n)*fp00(l)
      ENDDO
!$OMP END DO NOWAIT
      ENDDO
      ENDDO
!$OMP END PARALLEL



Example 2: Explicit scoping

      SUBROUTINE RECURSION(n,k,a,b,c,d,e,f,g,h,s)
      REAL*8 A(*),B(*),C(*),D(*),E(*),F(*),G(*),H(*)
      REAL*8 T,S
      INTEGER N,K,I
      S = 0.0D0
C$OMP PARALLEL SHARED(D)
C$OMP+DEFAULT(AUTO)
C$OMP DO
      DO I = 1,N
        T = F(I) + G(I)
        A(I) = B(I) + C(I)
        D(I+K) = D(I) + E(I)
        H(I) = H(I) * T
        S = S + H(I)
      END DO
C$OMP END DO
C$OMP END PARALLEL
      END



Example 2: Explicit scoping (Polaris output)

      SUBROUTINE recursion(n, k, a, b, c, d, e, f, g, h, s)
      DOUBLE PRECISION a, b, c, d, e, f, g, h, s, t
      INTEGER*4 i, k, n
      DIMENSION a(*), b(*), c(*), d(*), e(*), f(*), g(*), h(*)
      s = 0.0D0
!$OMP PARALLEL
!$OMP+DEFAULT(SHARED)
!$OMP+PRIVATE(T,I)
!$OMP DO
!$OMP+REDUCTION(+:s)
      DO i = 1, n, 1
      t = f(i)+g(i)
      a(i) = b(i)+c(i)
      d(i+k) = d(i)+e(i)
      h(i) = h(i)*t
      s = h(i)+s
      ENDDO
!$OMP END DO
!$OMP END PARALLEL
      RETURN
      END
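
The REDUCTION(+:s) clause in this example comes from reduction recognition. A minimal Python sketch of the idea follows: a scalar counts as a reduction candidate if it appears only in self-updates of the form s = s op expr with a single operator. The statement encoding, the function name `find_reductions`, and the restriction to `+` and `*` are assumptions made for this illustration; Polaris's actual pass also handles array reductions and is considerably more general.

```python
def find_reductions(statements, arrays):
    """Toy scalar reduction recognizer (illustrative assumption only).

    statements: list of (target, operator, operands) for a loop body,
                e.g. ("S", "+", ["S", "H"]) for S = S + H(I)
    arrays:     set of names that are arrays
    Returns a dict mapping each recognized reduction scalar to its operator.
    """
    updates = {}         # scalar -> operator used in its self-updates
    other_uses = set()   # names read or written outside a self-update
    for target, op, operands in statements:
        self_update = (target not in arrays
                       and op in ("+", "*")
                       and operands.count(target) == 1)
        if self_update:
            if updates.setdefault(target, op) != op:
                other_uses.add(target)           # mixed operators: reject
            other_uses.update(n for n in operands if n != target)
        else:
            other_uses.add(target)
            other_uses.update(operands)
    return {name: op for name, op in updates.items() if name not in other_uses}


# Loop body of the RECURSION subroutine, encoded by hand.
body = [
    ("T", "+", ["F", "G"]),   # T = F(I) + G(I)
    ("A", "+", ["B", "C"]),   # A(I) = B(I) + C(I)
    ("D", "+", ["D", "E"]),   # D(I+K) = D(I) + E(I)
    ("H", "*", ["H", "T"]),   # H(I) = H(I) * T
    ("S", "+", ["S", "H"]),   # S = S + H(I)
]
print(find_reductions(body, {"A", "B", "C", "D", "E", "F", "G", "H"}))
```

Only `S` qualifies here: `T` is overwritten each iteration (a privatization candidate rather than a reduction), and the other targets are arrays.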



Evaluation of DEFAULT(AUTO)

  • Fortran 77 Benchmarks from SPEC OpenMP

    • removed all explicit scoping

    • added DEFAULT(AUTO) to all regions

    • used Omni OpenMP compiler as backend (-O2)

  • Explicit-scoping speedup vs. auto-scoping speedup

    • four processor Xeon server

    • 1.8 GHz processors, 16 GBytes main memory

    • Hyperthreaded, but only used 1 thread per CPU

  • Also used EA Sun Studio 9 Fortran 95 compiler

    • supports DEFAULT(__AUTO)

    • report number of regions auto-scoped



Performance of Auto-scoping

Sun results are for the Early Access Version of the Sun Microsystems Studio 9 Fortran 95 compiler.



Discussion

  • Many regions were not fully analyzable

    • Polaris could not fully inline the regions

    • several regions were general parallel regions

  • Early Access Sun Studio 9 compiler

    • auto-scoped fewer regions in general

    • missed important regions in Swim and Mgrid

    • regions could be parallelized but not auto-scoped

  • Sun compiler could auto-scope some regions that Polaris could not

    • can analyze general parallel regions



A general parallel region from Wupwise: Polaris fails but the Sun compiler succeeds

C$OMP PARALLEL DEFAULT(AUTO)
      LSCALE = ZERO
      LSSQ = ONE
C$OMP DO
      DO IX = 1, 1 + (N - 1) *INCX, INCX
        IF (DBLE (X(IX)) .NE. ZERO) THEN
          ...
          LSSQ = ONE + LSSQ* (LSCALE / TEMP) ** 2
          LSCALE = TEMP
        END IF
        ...
      END DO
C$OMP END DO
C$OMP CRITICAL
      IF (SCALE .LT. LSCALE) THEN
        SSQ = ((SCALE / LSCALE) ** 2) * SSQ + LSSQ
        SCALE = LSCALE
      ELSE
        SSQ = SSQ + ((LSCALE / SCALE) ** 2) * LSSQ
      END IF
C$OMP END CRITICAL
C$OMP END PARALLEL



Runtime Support for Auto-scoping

  • add speculate directive for regions that cannot be auto-scoped

  • applies to very few regions in SPEC OpenMP

    • requires interprocedural marking of reads/writes

    • only 2 regions not auto-scoped can be fully analyzed

!$OMP PARALLEL
!$OMP+DEFAULT(SHARED)
!$OMP+PRIVATE(U51K,U41K,U31K,Q,U21K,M,K,I,U41,U31KM1,U51KM1,U21KM1)
!$OMP+PRIVATE(U41KM1,TMP,J)
!$OMP+SPECULATE(UTMP,RTMP)
!$OMP DO
!$OMP+LASTPRIVATE(FLUX2)
      DO j = jst, jend, 1
      ...
      ENDDO
!$OMP END DO
!$OMP END PARALLEL

(a region from the RHS subroutine of Applu)
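
The marking of reads and writes that such runtime support depends on can be sketched with a generic, conservative check: record which iteration touches which array element, then reject the speculation if any element written in one iteration is read or written in another. The Python sketch below is an assumption-laden illustration in the spirit of runtime dependence testing in general; it is not the scheme proposed in this talk, the semantics of SPECULATE are not specified here, and the function name `speculation_succeeds` and the access-trace encoding are hypothetical.

```python
def speculation_succeeds(accesses):
    """Conservative cross-iteration conflict check (illustrative only).

    accesses: list of (iteration, kind, element), kind is "R" or "W",
              recorded while the loop runs speculatively.
    Returns True if no element written in one iteration is read or
    written in a different iteration.
    """
    writers = {}   # element -> set of iterations that wrote it
    readers = {}   # element -> set of iterations that read it
    for iteration, kind, element in accesses:
        table = writers if kind == "W" else readers
        table.setdefault(element, set()).add(iteration)

    for element, w_iters in writers.items():
        if len(w_iters) > 1:
            return False                  # written by more than one iteration
        if readers.get(element, set()) - w_iters:
            return False                  # read by an iteration that did not write it
    return True


# Each iteration touches only its own element: speculation holds.
print(speculation_succeeds([(1, "W", 1), (1, "R", 1), (2, "W", 2), (2, "R", 2)]))
# Iteration 2 reads what iteration 1 wrote: speculation fails.
print(speculation_succeeds([(1, "W", 2), (2, "R", 2)]))
```

This check is deliberately strict: it would also reject temporaries that every iteration writes before reading, which a fuller scheme could treat as privatizable.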



Related Work

  • DEFAULT(AUTO) proposed by Dieter an Mey

  • Many commercial and research auto-parallelizers

    • Polaris, SUIF, CAPO, …

    • Perform parallelization and scoping

  • The EA Sun Studio 9 Fortran 95 Compiler

    • paper also here at WOMPAT

    • thanks to Yuan Lin for pointing me to it

  • Runtime dependence testing

    • Saltz, Rauchwerger, …



Conclusion

  • Implemented DEFAULT(AUTO) in Polaris

    • created full OpenMP to OpenMP translator

    • added facilities for auto-scoping

  • Evaluated implementation

    • 2 of 5 benchmarks fully auto-scoped

    • remainder showed significant loss of speedup

    • results different from EA Sun compiler

      • performance not portable across compilers

  • Discussed speculative parallelization support



Conclusion cont…

  • Combination of loop and region analyzer

    • Polaris auto-scoped more regions

    • Sun compiler can handle general regions

  • Performance may not be portable across compilers

    • never is but…

    • sacrifice performance for convenience

    • perhaps a useful tool during manual parallelization

  • Future work

    • general region support in Polaris
