
Advances in PUMI for High Core Count Machines

  • Dan Ibanez, Micah Corah, Seegyoung Seol, Mark Shephard

  • 2/27/2013

  • Scientific Computation Research Center

  • Rensselaer Polytechnic Institute



Outline

  • Distributed Mesh Data Structure

  • Phased Message Passing

  • Hybrid (MPI/thread) Programming Model

  • Hybrid Phased Message Passing

  • Hybrid Partitioning

  • Hybrid Mesh Migration



Unstructured Mesh Data Structure

[Figure: the mesh data structure: pointers link the Mesh to its Parts, and within each Part the Regions, Faces, Edges, and Vertices.]



Distributed Mesh Representation

Mesh elements assigned to parts

  • Uniquely identified by handle or global ID
  • Treated as a serial mesh with the addition of part boundaries
    • Part boundary: groups of mesh entities on shared links between parts
    • Remote copy: duplicated entity copy on non-local part
    • Resident part set: list of parts where the entity exists
  • Can have multiple parts per process.
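
A minimal sketch of the bookkeeping such a distributed entity carries; the type and field names below are hypothetical illustrations, not PUMI's actual data structure.

    #include <cstdint>
    #include <map>
    #include <set>

    // Hypothetical record for a distributed mesh entity.
    struct MeshEntity {
      std::int64_t globalId;            // unique identifier across all parts
      int dimension;                    // 0=vertex, 1=edge, 2=face, 3=region
      std::set<int> residentParts;      // resident part set: parts where the entity exists
      std::map<int, MeshEntity*> remoteCopies;  // part id -> duplicated copy on a non-local part

      // An entity lies on a part boundary when more than one part holds it.
      bool onPartBoundary() const { return residentParts.size() > 1; }
    };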



Message Passing

  • Primitive functional set:
    • Size – members in group
    • Rank – ID of self in group
    • Send – non-blocking synchronous send
    • Probe – non-blocking probe
    • Receive – blocking receive
  • Non-blocking barrier (ibarrier):
    • API call 1: begin ibarrier
    • API call 2: wait for ibarrier termination
    • Used for phased message passing
    • Will be available in MPI 3; for now a custom solution is used
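
A sketch of this primitive set as a C++ interface, assuming messages are opaque byte buffers; the names are illustrative rather than the actual PUMI communication API.

    #include <vector>

    // Hypothetical interface mirroring the primitive functional set above.
    struct Comm {
      virtual int size() = 0;   // members in the group
      virtual int rank() = 0;   // ID of self in the group
      // non-blocking synchronous send: completes only once a matching receive has started
      virtual void send(int to, const std::vector<char>& data) = 0;
      virtual bool probe(int* from) = 0;                 // non-blocking probe
      virtual std::vector<char> receive(int from) = 0;   // blocking receive
      // non-blocking barrier: begin it, then poll for termination
      virtual void beginBarrier() = 0;
      virtual bool barrierDone() = 0;
      virtual ~Comm() {}
    };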



ibarrier Implementation

  • Built entirely from non-blocking point-to-point calls
  • For N ranks, lg(N) communication steps to and from rank 0 (a reduce followed by a broadcast)
  • Uses a separate MPI communicator

[Figure: reduce-then-broadcast tree over ranks 0–4, rooted at rank 0.]
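
MPI 3 exposes this functionality directly as MPI_Ibarrier; below is a minimal sketch of the two-call begin/wait API on a duplicated communicator (the pre-MPI-3 custom version builds the same behavior from non-blocking point-to-point calls):

    #include <mpi.h>

    // Two-call non-blocking barrier on its own duplicated communicator.
    struct IBarrier {
      MPI_Comm comm;
      MPI_Request req;

      explicit IBarrier(MPI_Comm base) { MPI_Comm_dup(base, &comm); }
      ~IBarrier() { MPI_Comm_free(&comm); }

      void begin() { MPI_Ibarrier(comm, &req); }   // API call 1: begin ibarrier
      bool done() {                                // API call 2: poll for termination
        int flag;
        MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
        return flag != 0;
      }
    };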



Phased Message Passing

  • Similar to Bulk Synchronous Parallel
  • Uses the non-blocking barrier:
    • Begin phase
    • Send all messages
    • Receive any messages sent this phase
    • End phase
  • Benefits:
    • Efficient termination detection when neighbors are unknown
    • Phases act as implicit barriers, which simplifies algorithms
    • Allows buffering of all messages per rank per phase



Phased Message Passing

  • Implementation:

    • Post all sends for this phase

    • While local sends incomplete: receive any

      • Now local sends complete (remember they are synchronous)

    • Begin “stopped sending” ibarrier

    • While ibarrier incomplete: receive any

      • Now all sends complete, can stop receiving

    • Begin “stopped receiving” ibarrier

    • While ibarrier incomplete: compute

      • Now all ranks stopped receiving, safe to send next phase

    • Repeat

[Figure: per-rank timeline of send and receive activity; the two ibarriers act as signal edges separating phases.]
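
A sketch of one phase in plain MPI terms, assuming byte-buffer messages and the MPI-3 MPI_Ibarrier (the custom ibarrier above would slot in the same way); this illustrates the algorithm, not the PUMI implementation itself.

    #include <mpi.h>
    #include <utility>
    #include <vector>

    typedef std::vector<std::pair<int, std::vector<char>>> Messages;

    // One phase: post synchronous sends, then keep receiving until the
    // "stopped sending" and "stopped receiving" ibarriers have completed.
    void runPhase(MPI_Comm comm, const Messages& outgoing, Messages& incoming) {
      // Post all sends for this phase (non-blocking *synchronous* sends).
      std::vector<MPI_Request> sends(outgoing.size());
      for (size_t i = 0; i < outgoing.size(); ++i)
        MPI_Issend(outgoing[i].second.data(), (int)outgoing[i].second.size(),
                   MPI_BYTE, outgoing[i].first, 0, comm, &sends[i]);

      // Receive one pending message, if any.
      auto receiveAny = [&]() {
        int flag; MPI_Status st;
        MPI_Iprobe(MPI_ANY_SOURCE, 0, comm, &flag, &st);
        if (!flag) return;
        int count; MPI_Get_count(&st, MPI_BYTE, &count);
        std::vector<char> buf(count);
        MPI_Recv(count ? &buf[0] : 0, count, MPI_BYTE,
                 st.MPI_SOURCE, 0, comm, MPI_STATUS_IGNORE);
        incoming.push_back(std::make_pair(st.MPI_SOURCE, buf));
      };

      // While local sends are incomplete, receive any messages (sends are
      // synchronous, so their completion means matching receives have started).
      int allSent = 0;
      while (!allSent) {
        MPI_Testall((int)sends.size(), sends.data(), &allSent, MPI_STATUSES_IGNORE);
        receiveAny();
      }
      // "Stopped sending" ibarrier: keep receiving until all ranks reach it.
      MPI_Request bar; int done = 0;
      MPI_Ibarrier(comm, &bar);
      while (!done) { MPI_Test(&bar, &done, MPI_STATUS_IGNORE); receiveAny(); }
      // "Stopped receiving" ibarrier: once it completes, the next phase may send.
      MPI_Ibarrier(comm, &bar);
      MPI_Wait(&bar, MPI_STATUS_IGNORE);  // could be overlapped with computation
    }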



Hybrid System

[Figure: mapping of the program onto Blue Gene/Q hardware: a process runs on a node and each of its threads runs on a core.]

*Processes per node and threads per core are variable



Hybrid Programming System

1. Message passing is the de facto standard programming model for distributed memory architectures.

2. The classic shared memory programming model: mutexes, atomic operations, lockless structures.

Most massively parallel code currently uses model 1.

The models are very different, and converting code from model 1 to model 2 is hard.



Hybrid Programming System

We will try message passing between threads.

Threads can send to other threads in the same process and to threads in a different process.

This is the same model as MPI, with “process” replaced by “thread”.

Porting is faster: only the message passing API changes.

Shared memory is still exploited; messages act as locks:

    Thread 1:               Thread 2:
      Write(A)                 Lock(lockA)
      Release(lockA)           Write(A)

becomes

    Thread 1:               Thread 2:
      Write(A)                 ReceiveFrom(1)
      SendTo(2)                Write(A)



Parallel Control Utility

Multi-threading API for hybrid MPI/thread mode:

  • Launch a function pointer on N threads
  • Get thread ID, number of threads in process
  • Uses pthreads directly

Phased communication API:

  • Send messages in batches per phase, detect end of phase

Hybrid MPI/thread communication API:

  • Uses hybrid ranks and size
  • Same phased API, automatically changes to hybrid when called within threads

Future: hardware queries by wrapping hwloc*

* Portable Hardware Locality (http://www.open-mpi.org/projects/hwloc/)
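
A sketch of the thread-launch part of such a utility, using pthreads directly: run a function pointer on N threads and let each thread query its ID and the thread count. The names are hypothetical, not the actual API.

    #include <pthread.h>
    #include <vector>

    namespace pcu_sketch {

    static pthread_key_t tidKey;   // per-thread slot holding the thread ID
    static int numThreads = 1;

    struct ThreadArg { void (*fn)(void*); void* data; long tid; };

    static void* trampoline(void* p) {
      ThreadArg* a = static_cast<ThreadArg*>(p);
      pthread_setspecific(tidKey, (void*)a->tid);  // remember this thread's ID
      a->fn(a->data);
      return 0;
    }

    int threadId()    { return (int)(long)pthread_getspecific(tidKey); }
    int threadCount() { return numThreads; }

    // Launch fn(data) on n threads and join them all.
    void run(int n, void (*fn)(void*), void* data) {
      numThreads = n;
      pthread_key_create(&tidKey, 0);
      std::vector<pthread_t> threads(n);
      std::vector<ThreadArg> args(n);
      for (int i = 0; i < n; ++i) {
        args[i].fn = fn; args[i].data = data; args[i].tid = i;
        pthread_create(&threads[i], 0, trampoline, &args[i]);
      }
      for (int i = 0; i < n; ++i)
        pthread_join(threads[i], 0);
      pthread_key_delete(tidKey);
    }

    } // namespace pcu_sketch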



Hybrid Message Passing

  • Everything is built from the primitives, so we need hybrid primitives:
    • Size: number of threads on the whole machine
    • Rank: machine-unique ID of the thread
    • Send, Probe, and Receive using hybrid ranks

    Process rank:   0            1
    Thread rank:    0  1  2  3   0  1  2  3
    Hybrid rank:    0  1  2  3   4  5  6  7
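
The table reduces to simple arithmetic; a minimal sketch assuming every process runs the same number of threads:

    // Hybrid rank arithmetic implied by the table above.
    int hybridSize(int mpiSize, int threadsPerProcess) {
      return mpiSize * threadsPerProcess;
    }
    int hybridRank(int mpiRank, int threadId, int threadsPerProcess) {
      return mpiRank * threadsPerProcess + threadId;
    }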



Hybrid Message Passing

  • Initial simple hybrid primitives: just wrap MPI primitives
    • MPI_Init_thread with MPI_THREAD_MULTIPLE
    • MPI rank = floor(Hybrid rank / threads per process)
    • MPI tag bit fields: [ from thread | to thread | hybrid tag ]

[Figure: layered communication stacks. MPI mode: phased layer over ibarrier over MPI primitives. Hybrid mode: phased layer over ibarrier over hybrid primitives over MPI primitives.]
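
A sketch of how a hybrid send could wrap the MPI primitive, mapping the hybrid rank to an MPI rank and packing the thread IDs into the tag bit fields. The field widths and names are assumptions for illustration, and the packed tag must stay within MPI_TAG_UB.

    #include <mpi.h>

    // Illustrative tag packing: [ from thread | to thread | hybrid tag ].
    static int packTag(int fromThread, int toThread, int hybridTag) {
      return (fromThread << 24) | (toThread << 16) | hybridTag;
    }

    // Hybrid rank r with t threads per process lives in MPI rank r / t as thread r % t.
    void hybridSend(const void* buf, int bytes, int destHybridRank, int hybridTag,
                    int threadsPerProcess, int myThreadId,
                    MPI_Comm comm, MPI_Request* req) {
      int destProcess = destHybridRank / threadsPerProcess;
      int destThread  = destHybridRank % threadsPerProcess;
      MPI_Issend(buf, bytes, MPI_BYTE, destProcess,
                 packTag(myThreadId, destThread, hybridTag), comm, req);
    }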



Hybrid Partitioning

  • Partition mesh to processes, then partition to threads

  • Map Parts to threads, 1-to-1

  • Share entities on inter-thread part boundaries

[Figure: four processes connected by MPI, each holding several parts; within each process the parts are handled by pthreads.]



Hybrid Partitioning

  • Entities are shared within a process
    • Part boundary entity is created once per process
    • Part boundary entity is shared by all local parts
    • Only owning part can modify entity (avoids almost all contention)
    • Remote copy: duplicate entity copy on another process
  • Parallel control utility can provide architecture info to the mesh, which is distributed accordingly.

[Figure: parts P0, P1, and P2 distributed across processes i and j, showing the inter-process boundary and the implicit intra-process part boundary.]



Mesh Migration

  • Moving mesh entities between parts

    • Input: local mesh elements to send to other parts

    • Other entities to move are determined by adjacencies

  • Complex Subtasks

    • Reconstructing mesh adjacencies

    • Re-structuring the partition model

    • Re-computing remote copies

  • Considerations

    • Neighborhoods change: try to maintain scalability despite loss of communication locality

    • How to benefit from shared memory



Mesh Migration

  • Migration Steps

    • (A) Mark destination part id
    • (B) Get affected entities and compute post-migration residence parts
    • (C) Exchange entities and update part boundary
    • (D) Delete migrated entities

[Figure: example on parts P0, P1, and P2 illustrating each step.]



Hybrid Migration

  • Shared memory optimizations:

    • Thread to part matching: use partition model for concurrency

    • Threads handle part boundary entities which they own

      • Other entities are ‘released’

    • Inter-process entity movement

      • Send entity to one thread per process

    • Intra-process entity movement

      • Send message containing pointer

[Figure: parts 0–3 across two processes; shared part boundary entities are released before migration and grabbed afterwards.]
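
A sketch of the inter- vs intra-process distinction: across processes the entity's data is serialized and sent to one thread of the destination process, while within a process the message only needs to carry a pointer to the shared entity. The types and helpers are hypothetical.

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Hypothetical minimal entity record, just enough for the illustration.
    struct MeshEntity { std::int64_t globalId; int dimension; };

    // Inter-process case: pack the data needed to recreate the entity remotely.
    static std::vector<char> serialize(const MeshEntity* e) {
      std::vector<char> buf(sizeof(MeshEntity));
      std::memcpy(&buf[0], e, sizeof(MeshEntity));
      return buf;
    }

    // Build the message body for moving an entity.
    // Intra-process: shared memory lets the message carry just the pointer.
    // Inter-process: the entity data is sent to one thread of the destination process.
    std::vector<char> migrationMessage(MeshEntity* e, bool destinationIsLocal) {
      if (destinationIsLocal) {
        std::vector<char> msg(sizeof(MeshEntity*));
        std::memcpy(&msg[0], &e, sizeof(MeshEntity*));
        return msg;
      }
      return serialize(e);
    }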



Hybrid Migration

  • Release shared entities

  • Update entity resident part sets

  • Move entities between processes

  • Move entities between threads

  • Grab shared entities

    Two-level temporary ownership: Master and Process Master

    Master: smallest resident part ID

    Process Master: smallest on-process resident part ID

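
A sketch of the two ownership rules, assuming the resident part set is a sorted set of global part IDs and that part IDs map to processes in contiguous blocks (an assumption for illustration):

    #include <set>

    // Master: smallest resident part ID.
    int masterPart(const std::set<int>& residentParts) {
      return *residentParts.begin();   // std::set iterates in ascending order
    }

    // Process Master: smallest resident part ID on the given process,
    // assuming part IDs map to processes as partId / partsPerProcess.
    int processMasterPart(const std::set<int>& residentParts,
                          int process, int partsPerProcess) {
      for (int part : residentParts)
        if (part / partsPerProcess == process)
          return part;
      return -1;   // no resident part on this process
    }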



Hybrid Migration

Representative Phase:

  • Old Master Part sends entity to new Process Master Parts

  • Receivers bounce back addresses of created entities

  • Senders broadcast union of all addresses

Example: old resident parts {1,2,3}, new resident parts {5,6,7}.

[Figure: parts 0–7; the old master sends the data to create a copy, each receiver bounces back the address of its local copy, and the sender broadcasts the addresses of all copies.]



Hybrid Migration

Many subtle complexities:

  • Most steps have to be done one dimension at a time

  • Assigning upward adjacencies causes thread contention

    • Use a separate phase of communication to make them

    • Use another phase to remove them when entities are deleted

  • Assigning downward adjacencies requires addresses on the new process

    • Use a separate phase to gather remote copies



Preliminary Results

  • Model: bi-unit cube

  • Mesh: 260K tets, 16 parts

  • Migration: sort by X coordinate



Preliminary Results

  • First test of the hybrid algorithm

  • Using 1 node of the CCNI Blue Gene/Q

  • Cases:

    • 16 MPI ranks, 1 thread per rank

      • 18.36 seconds for migration

      • 433 MB mesh memory use (sum of all MPI ranks)

    • 1 MPI rank, 16 threads per rank

      • 9.62 seconds for migration + thread create/join

      • 157 MB mesh memory use (sum of all threads)



Thank You

Seegyoung Seol – FMDB architect, part boundary sharing

Micah Corah – SCOREC undergraduate, threaded part loading

