

  1. A Study of Different Instantiations of the OpenMP Memory Model and Their Software Cache Implementations. Chen Chen, Joseph B. Manzano, Ge Gan, Guang R. Gao, Vivek Sarkar. April 21st, 2009.

  2. Outline • The OpenMP memory model is not well-defined • Our solution: Four well-defined instantiations of the OpenMP memory model • Implementations – Cache Protocols • Experimental Results

  3. Outline • The OpenMP memory model is not well-defined • Our solution: Four well-defined instantiations of the OpenMP memory model • Implementations – Cache Protocols • Experimental Results

  4. Situation of Shared Memory Parallel Programming: the 3-Tier Hierarchy of Programming Models
     • Joe Parallel Programmers: have some basic knowledge of parallel programming languages
     • Parallel Programming Specialists: have some basic knowledge of memory consistency and cache organizations
     • Computer System Specialists: experts on parallel system architecture

  5. The OpenMP Memory Model
     • Temporary view: registers, cache, or local storage; caching of variables is not required
     • Flush operation: enforces consistency for its flush-set; subject to reordering restrictions and a serialized requirement
     • Data-race programs: unspecified behavior
     [Figure: each thread has a temporary view of the memory and is connected to the main memory through an interconnection network.]
     Is the OpenMP memory model well-defined?
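     To make the temporary-view and flush machinery concrete, here is a minimal OpenMP C sketch (our illustration, not from the slides) of the classic flush-based handshake: without the flushes, the producer's write may stay in its temporary view and the consumer is not guaranteed to see it. The busy-wait on flag is for illustration only.

      #include <omp.h>
      #include <stdio.h>

      int data = 0, flag = 0;

      int main(void) {
      #pragma omp parallel sections num_threads(2)
          {
      #pragma omp section
              {                        /* producer */
                  data = 42;
      #pragma omp flush(data)          /* push data out of the temporary view first */
                  flag = 1;
      #pragma omp flush(flag)          /* then publish the flag */
              }
      #pragma omp section
              {                        /* consumer */
                  int ready = 0;
                  while (!ready) {
      #pragma omp flush(flag)          /* refresh flag from main memory */
                      ready = flag;
                  }
      #pragma omp flush(data)          /* re-read data, not a stale temporary-view copy */
                  printf("data = %d\n", data);
              }
          }
          return 0;
      }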

  6. The OpenMP Memory Model is not Well-defined
     • Complex semantics of the temporary view
     • Some threads have temporary views while others do not
     • Why access main memory if the temporary view already has a copy?
     • Unspecified semantics of data-race programs
     • Why is reordering (between flushes and memory accesses) still restricted?
     • Applications are limited to data-race-free programs
     • Unclear definition of the flush operation
     • Variables may “escape” the temporary view before a flush
     • The serialized requirement is unnecessary

  7. Outline • The OpenMP memory model is not well-defined • Our solution: Four well-defined instantiations of the OpenMP memory model • Implementations – Cache Protocols • Experimental Results

  8. Our Solution
     • Simple semantics for the temporary view: we defined ModelIDEAL with very simple semantics.
     • Specified behavior for all programs: we defined four instantiations of the OpenMP memory model, each with specified semantics for any program. (They are equivalent for data-race-free programs.)
     • Clear definition of the flush operation: we defined simple semantics for the flush operation and introduced the non-deterministic flush operation to solve the space limitation problem.

  9. ModelIDEAL
     • Each thread owns a temporary view with an infinitely large space.
     • Write: access the temporary view only.
     • Read: access the temporary view on a hit, or the main memory on a miss.
     • Flush: write back the “dirty values” and discard all values (on one thread).
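     A minimal C sketch of the ModelIDEAL rules above (our own illustration, not the paper's implementation): a per-thread tview_t stands in for the temporary view, with a fixed array in place of the unbounded space.

      #include <stdbool.h>
      #include <string.h>

      #define MEM_WORDS 1024              /* toy "main memory" size */

      int main_memory[MEM_WORDS];         /* shared main memory */

      typedef struct {                    /* one thread's temporary view */
          bool present[MEM_WORDS];        /* do we hold a copy of this word?     */
          bool dirty[MEM_WORDS];          /* is our copy newer than main memory? */
          int  value[MEM_WORDS];          /* the cached value                    */
      } tview_t;

      /* Write: only touches the temporary view (marks the word dirty). */
      void tv_write(tview_t *tv, int addr, int v) {
          tv->present[addr] = true;
          tv->dirty[addr]   = true;
          tv->value[addr]   = v;
      }

      /* Read: hit in the temporary view, or miss to main memory. */
      int tv_read(tview_t *tv, int addr) {
          if (tv->present[addr])
              return tv->value[addr];          /* hit */
          tv->present[addr] = true;            /* miss: fill from main memory */
          tv->dirty[addr]   = false;
          tv->value[addr]   = main_memory[addr];
          return tv->value[addr];
      }

      /* Flush: write back dirty words, then discard everything (one thread). */
      void tv_flush(tview_t *tv) {
          for (int a = 0; a < MEM_WORDS; a++)
              if (tv->present[a] && tv->dirty[a])
                  main_memory[a] = tv->value[a];
          memset(tv->present, 0, sizeof tv->present);
          memset(tv->dirty,   0, sizeof tv->dirty);
      }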

  10. ModelGF
      • Temporary views have limited space.
      • Non-deterministic flush: a flush operation can be performed at any time (to cope with the limited space).
      • Global flush: a flush operation acts on all of the threads.

  11. ModelLF
      • Temporary views have limited space.
      • Non-deterministic flush.
      • Local flush: a flush operation acts on one thread only.

  12. ModelRLF
      • Temporary views have limited space.
      • Non-deterministic flush.
      • Acquire: discard the “clean values”.
      • Release: write back the “dirty values”.
      • Barrier: acquire + release.
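      The three limited-space variants (slides 10-12) differ only in which temporary views a flush-like operation touches. A hedged sketch on top of the tview_t helpers from the ModelIDEAL sketch above (our naming, not the paper's):

      #define NUM_THREADS 4
      tview_t views[NUM_THREADS];          /* one temporary view per thread */

      /* ModelGF: a (possibly non-deterministic) flush empties every view. */
      void global_flush(void) {
          for (int t = 0; t < NUM_THREADS; t++)
              tv_flush(&views[t]);
      }

      /* ModelLF: a flush only empties the issuing thread's view. */
      void local_flush(int tid) {
          tv_flush(&views[tid]);
      }

      /* ModelRLF splits the flush into two halves.                 */
      /* Release: write back dirty words but keep the clean copies. */
      void release(tview_t *tv) {
          for (int a = 0; a < MEM_WORDS; a++)
              if (tv->present[a] && tv->dirty[a]) {
                  main_memory[a] = tv->value[a];
                  tv->dirty[a]   = false;      /* the value is now clean */
              }
      }

      /* Acquire: discard clean copies so later reads see fresh memory. */
      void acquire(tview_t *tv) {
          for (int a = 0; a < MEM_WORDS; a++)
              if (tv->present[a] && !tv->dirty[a])
                  tv->present[a] = false;
      }

      /* Barrier: acquire + release on the issuing thread's view; doing the
         release first means everything ends up written back and discarded. */
      void barrier_flush(tview_t *tv) {
          release(tv);
          acquire(tv);
      }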

  13. Outline • The OpenMP memory model is not well-defined • Our solution: Four well-defined instantiations of the OpenMP memory model • Implementations – Cache Protocols • Experimental Results

  14. Implementations – Cache Protocols
      • Each thread contains a cache which corresponds to its temporary view.
      • Each operation is performed on one cache line.
      • Per-location dirty bits in each cache line.
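      As an illustration of the per-location dirty bits (a sketch under our own naming; cache_line_t and write_back are not from the OPELL source), a software cache line can carry one dirty bit per word so that a write-back merges only the words this thread actually modified:

      #include <stdint.h>

      #define LINE_WORDS 32                        /* e.g. a 128-byte line of 4-byte words */

      typedef struct {
          uint32_t tag;                            /* which memory line is cached */
          uint8_t  valid[LINE_WORDS];              /* per-word valid bits         */
          uint8_t  dirty[LINE_WORDS];              /* per-word dirty bits         */
          int32_t  data[LINE_WORDS];               /* cached words                */
      } cache_line_t;

      /* Write back only the dirty words, so threads writing disjoint words
         of the same line do not clobber each other's updates.             */
      void write_back(cache_line_t *line, int32_t *memory_line) {
          for (int w = 0; w < LINE_WORDS; w++)
              if (line->dirty[w]) {
                  memory_line[w] = line->data[w];
                  line->dirty[w] = 0;
              }
      }

      Merging at word granularity is what lets several threads update different parts of one line without a hardware coherence protocol.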

  15. Centralized Directory for ModelGF
      • The directory resides in the main memory.
      • It keeps information about all of the caches.
      • A flush looks up the directory and informs the threads involved.
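      One way to read the directory idea (a hedged sketch with invented names; the paper's data layout may differ): the directory lives in main memory and records, per memory line, which threads currently cache it, so a global flush knows exactly whom to notify. The signal_thread callback is a hypothetical hook for that notification.

      #include <stdint.h>

      #define NUM_LINES   4096
      #define MAX_THREADS 8

      /* Directory kept in main memory: one sharer bitmap per memory line. */
      typedef struct {
          uint8_t sharers[NUM_LINES];      /* bit t set => thread t caches the line */
      } directory_t;

      directory_t dir;                     /* assumed to live in global memory */

      void dir_record_fill(int line, int tid)  { dir.sharers[line] |=  (uint8_t)(1u << tid); }
      void dir_record_evict(int line, int tid) { dir.sharers[line] &= (uint8_t)~(1u << tid); }

      /* A global flush looks up the directory and informs only the threads
         that actually hold copies of the line.                             */
      void global_flush_line(int line, void (*signal_thread)(int tid, int line)) {
          for (int t = 0; t < MAX_THREADS; t++)
              if (dir.sharers[line] & (1u << t))
                  signal_thread(t, line);
      }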

  16. Outline • The OpenMP memory model is not well-defined • Our solution: Four well-defined instantiations of the OpenMP memory model • Implementations – Cache Protocols • Experimental Results

  17. Cell Architecture
      • Very small local storage (256K) per SPE.
      • The local storage holds both data and instructions.
      • An SPE accesses global memory through DMA transfers.
      [Figure: a Power Processing Element (PPE) and eight Synergistic Processing Elements (SPEs), each with 256K of local storage, connected to the global memory by the Element Interconnect Bus (EIB).]
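      Because an SPE can only address its own local storage, a software-cache miss has to be serviced by a DMA transfer from global memory. A minimal sketch of that fill path (our illustration; dma_get is a memcpy stub standing in for a real MFC DMA call so the snippet stays self-contained):

      #include <string.h>
      #include <stdint.h>
      #include <stddef.h>

      #define LINE_BYTES 128

      typedef struct {
          uint64_t tag;                    /* global address of the cached line */
          int      valid;
          char     data[LINE_BYTES];       /* copy held in SPE local storage    */
      } sw_line_t;

      /* Stand-in for an asynchronous DMA get; a real SPE runtime would issue
         the transfer through the MFC and later wait on its tag group.        */
      static void dma_get(void *local, uint64_t global_addr, size_t size) {
          memcpy(local, (void *)(uintptr_t)global_addr, size);   /* stub only */
      }

      /* Miss handler: pull one line from global memory into local storage. */
      void fill_line(sw_line_t *line, uint64_t global_addr) {
          dma_get(line->data, global_addr, LINE_BYTES);
          line->tag   = global_addr;
          line->valid = 1;
      }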

  18. OPELL Framework
      • An open source toolchain / runtime effort to implement OpenMP for the CBE.
      • Single Source Compiler: generates sequential code for the PPE and parallel code for the SPEs.
      • Partition Manager: loads and unloads the SPEs’ code.
      • Software Cache (SWC): resides in the local storage.
      • Runtime System (PPE): task assignment; executes the sequential code.
      • Runtime System (SPE): triggers the Partition Manager to execute tasks; manages the software caches on the SPEs; communicates with the PPE runtime through remote function calls.

  19. Experimental Testbeds
      • Hardware (PlayStation 3™)
        • 3.2 GHz Cell Broadband Engine CPU (with 6 accessible SPEs)
        • 256MB of global shared memory
      • Software framework (OPELL): an open source toolchain / runtime effort to implement OpenMP for the CBE
      • Benchmarks
        • RandomAccess and Stream from the HPC Challenge benchmark suite
        • Integer Sort (IS), Embarrassingly Parallel (EP), and Multigrid (MG) from the NAS Parallel Benchmarks

  20. Summary of Main Results
      • Performance and scalability
        • ModelLF consistently outperforms ModelGF
        • ModelLF shows good scalability
      • Impact of cache line eviction
        • Cache line eviction has a significant impact on the performance gap between ModelLF and ModelGF
        • The gap widens as the cache size (per core/thread) shrinks
      • Programmability
        • In our experiments, only minor changes to the OpenMP code were needed
        • In other words, the performance advantage is achieved without compromising programmability

  21. ModelGF vs. ModelLF on Execution Time (cache size = 32K)
      • ModelLF consistently outperforms ModelGF.
      • Performance improvement: EP-A: 1.53×, IS-W: 1.32×, MG-W: 1.19×, RandomAccess: 1.05×, Stream: 1.36×.

  22. Speedup as a Function of the Number of SPEs under ModelLF
      • IS-W and EP-W achieve almost linear speedup.
      • MG-W performs worse because of unbalanced workloads (with 3, 5, or 6 SPEs).

  23. ModelGF vs. ModelLF on Execution Time and Cache Eviction Ratio for IS-W
      • The difference in normalized execution time increased from 0.15 to 0.25 as the cache size per SPE was decreased from 64KB to 4KB.
      • The two cache-eviction-ratio curves overlap because the cache settings are completely identical.

  24. ModelGF vs. ModelLF on Execution Time and Cache Eviction Ratio for MG-W
      • The difference in normalized execution time increased from 0.04 to 0.16 as the cache size per SPE was decreased from 32KB to 4KB.
      • The two cache-eviction-ratio curves overlap because the cache settings are completely identical.

  25. Conclusion and Future Work
      • Our contributions
        • Formalization of the OpenMP memory model
        • Performance studies of ModelGF and ModelLF
      • Future work
        • Studies of ModelRLF
        • Tests on more benchmarks
        • Evaluations on additional many-core architectures

  26. Acknowledgement
      • This work was supported by NSF (CNS-0509332, CSR-0720531, CCF-0833166, CCF-0702244) and other government sponsors.
      • Joseph B. Manzano, Ge Gan, Guang R. Gao, and Vivek Sarkar are co-authors of the paper.
      • Joseph B. Manzano and Guang R. Gao provided many useful comments on the slides.

  27. BACKUP

  28. Example (1)

      X = 0; p = &X; q = &X;
      #pragma omp parallel sections
      {
          #pragma omp section
          {   // Section 1, assume it is running on thread T1.
              1: *p = 1;
              2: #pragma omp flush (X)
          }
          #pragma omp section
          {   // Section 2, assume it is running on thread T2.
              3: *q = 2;
              4: #pragma omp flush (X)
          }
          #pragma omp section
          {   // Section 3, assume it is running on thread T3.
              5: #pragma omp flush (X)
              // Assume that the compiler cannot establish that p == q.
              6: v1 = *p;
              7: v2 = *q;
              8: v3 = *p;
          }
      }

      • ModelIDEAL: v1, v2 and v3 always read the same value (e.g. {v1==v2==v3==0 (or 1, 2)}).
      • ModelGF: v1, v2 and v3 may read different values (e.g. {v1==1, v2==v3==2} if there is a non-deterministic flush between statements 6 and 7).
      • ModelLF: v1, v2 and v3 may read different values.
      • The order 1-3-2-4-5-6-7-8 results in {v1==v2==v3==2} under ModelIDEAL, {v1==v2==v3==1} under ModelGF, and {v1==v2==v3==2} under ModelLF.

  29. Example (2)

      X = 0; p = &X; q = &X;
      #pragma omp parallel sections
      {
          #pragma omp section
          {   // Section 1, assume it is running on thread T1.
              1: *p = 1;
              2: #pragma omp critical   // A flush (acquire) here.
              3:     v1 = *p;
              4:                        // A flush (release) here.
          }
          #pragma omp section
          {   // Section 2, assume it is running on thread T2.
              5: *q = 2;
              6: #pragma omp critical   // A flush (acquire) here.
              7:     v1 = *q;
              8:                        // A flush (release) here.
          }
      }

      • ModelIDEAL, ModelGF, and ModelLF: statements 2 and 6 remove the values in the temporary views, so statements 3 and 7 have to access the main memory.
      • ModelRLF: statements 2 and 6 are acquire operations, so the values are preserved in the temporary views and statements 3 and 7 can read them from the temporary views.

  30. States of Cache Lines
      • Invalid: all the words of the cache line are invalid.
      • Clean: all the words of the cache line contain “clean values”.
      • Dirty: all the words of the cache line contain “dirty values”.
      • Clean-Dirty: a mix of clean and dirty words.
      • Invalid-Dirty: a mix of invalid and dirty words.
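      The five states follow directly from the per-word valid/dirty bits of slide 14. A small sketch (our own helper, not from the paper; it assumes a mix of invalid and clean words does not occur, because a fill loads the whole line):

      #include <stdint.h>

      typedef enum { INVALID, CLEAN, DIRTY, CLEAN_DIRTY, INVALID_DIRTY } line_state_t;

      /* Derive the line state from its per-word valid/dirty bits. */
      line_state_t classify(const uint8_t *valid, const uint8_t *dirty, int n) {
          int nvalid = 0, ndirty = 0;
          for (int w = 0; w < n; w++) {
              nvalid += valid[w] ? 1 : 0;
              ndirty += dirty[w] ? 1 : 0;
          }
          if (nvalid == 0)  return INVALID;
          if (nvalid < n)   return INVALID_DIRTY;   /* invalid + dirty words */
          if (ndirty == 0)  return CLEAN;
          if (ndirty == n)  return DIRTY;
          return CLEAN_DIRTY;                       /* clean + dirty words   */
      }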

  31. State transition diagram for the cache protocol of ModelGF and ModelLF
      [State transition diagram over the states Invalid, Clean, Dirty, Clean-Dirty, and Invalid-Dirty, with transitions driven by read, write, and flush operations.]
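      The exact edges of the diagram are hard to recover from the transcript, but the broad transitions follow from the state definitions on slide 30 and the flush semantics. A hedged sketch of a transition function, reusing line_state_t from the sketch above (our reading, not a verbatim reconstruction of the figure):

      typedef enum { OP_READ, OP_WRITE, OP_FLUSH } op_t;

      /* One plausible reading of the ModelGF/ModelLF protocol:
         - a read miss fills the missing words with clean data,
         - a write dirties one word without fetching the rest,
         - a flush writes back dirty words and invalidates the line.  */
      line_state_t next_state(line_state_t s, op_t op) {
          switch (op) {
          case OP_FLUSH:
              return INVALID;                           /* write back + discard  */
          case OP_READ:
              if (s == INVALID)       return CLEAN;          /* miss fills clean words     */
              if (s == INVALID_DIRTY) return CLEAN_DIRTY;    /* fetch missing words, merge */
              return s;
          case OP_WRITE:
              switch (s) {
              case INVALID:       return INVALID_DIRTY;
              case CLEAN:         return CLEAN_DIRTY;
              case INVALID_DIRTY: return INVALID_DIRTY;      /* DIRTY once every word is dirty */
              case CLEAN_DIRTY:   return CLEAN_DIRTY;        /* likewise                       */
              default:            return DIRTY;
              }
          }
          return s;
      }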

  32. State transition diagram for the cache protocol of ModelRLF
      [State transition diagram over the same five states, with transitions driven by read, write, acquire, release, barrier, and flush operations.]

  33. Overall Experimental Results: ModelGF vs. ModelLF
      • ModelLF consistently outperforms ModelGF.
