Windows-NT based Distributed Virtual Parallel Machine


The MILLIPEDE Project, Technion, Israel

Windows-NT based Distributed Virtual Parallel Machine

http://www.cs.technion.ac.il/Labs/Millipede



What is Millipede?

A strong Virtual Parallel Machine: it employs non-dedicated distributed environments.

[Stack: Programs, on top of Implementations of Parallel Programming Languages, on top of the Distributed Environment.]


Programming Paradigms

The Millipede software stack, top to bottom:

  • Programming paradigms: SPLASH, Cilk/Calipso, CC++, Java, CParPar, ParC, ParFortran90, "Bare Millipede", and others
  • Millipede Layer: Events Mechanism (MJEC), Migration Services (MGS), Distributed Shared Memory (DSM), user-mode threads
  • Software packages: communication packages (U-Net, Transis, Horus, ...)
  • Operating system services: communication, threads, page protection, I/O



  • So, what's in a VPM?

  • Checklist:

    • Uses a non-dedicated cluster of PCs (+ SMPs)

    • Multi-threaded

    • Shared memory

    • User mode

    • Strong support for weak memory

    • Dynamic page and job migration

    • Load sharing for maximal locality of reference

    • Convergence to the optimal level of parallelism


Using a Non-Dedicated Cluster

  • Dynamically identify idle machines

  • Move work to idle machines

  • Evacuate busy machines

  • Do everything transparently to the native user

  • Co-existence of several parallel applications



Multi-Threaded Environments

  • Well known:

    • Better utilization of resources

    • An intuitive and high level of abstraction

    • Latency hiding by overlapping computation and communication

  • Natural for parallel programming paradigms & environments:

    • Programmer-defined maximum level of parallelism

    • Actual level of parallelism set dynamically; applications scale up and down

    • Nested parallelism

    • SMPs


Convergence to Optimal Speedup

The tradeoff: a higher level of parallelism vs. better locality of memory reference.

Optimal speedup is not necessarily achieved with the maximal number of computers.

The achieved level of parallelism depends on the program's needs and on the capabilities of the system.



No/Explicit/Implicit Access to Shared Memory

The same worker, three ways: PVM (explicit message passing), C-Linda (explicit tuple-space access), and "Bare" Millipede (implicit DSM access).

PVM:

    /* Receive data from master */
    msgtype = 0;
    pvm_recv(-1, msgtype);
    pvm_upkint(&nproc, 1, 1);
    pvm_upkint(tids, nproc, 1);
    pvm_upkint(&n, 1, 1);
    pvm_upkfloat(data, n, 1);

    /* Determine which slave I am (0..nproc-1) */
    for (i = 0; i < nproc; i++)
        if (mytid == tids[i]) { me = i; break; }

    /* Do calculations with data */
    result = work(me, n, data, tids, nproc);

    /* Send result to master */
    pvm_initsend(PvmDataDefault);
    pvm_pkint(&me, 1, 1);
    pvm_pkfloat(&result, 1, 1);
    msgtype = 5;
    master = pvm_parent();
    pvm_send(master, msgtype);

    /* Exit PVM before stopping */
    pvm_exit();

C-Linda:

    /* Retrieve data from DSM */
    rd("init data", ?nproc, ?n, ?data);

    /* Worker id is given at creation, no need to compute it now */

    /* Do calculation, put result in DSM */
    out("result", id, work(id, n, data, nproc));

"Bare" Millipede:

    result = work(milGetMyId(), n, data, milGetTotalIds());


Relaxed Consistency (avoiding false sharing and ping-pong)

  • Multiple page copies, with several consistency protocols: Sequential, CRUW, Sync(var), Arbitrary-CW Sync

  • Multiple relaxations for different shared variables within the same program

  • No broadcast, no central address servers (so it can work efficiently over interconnected LANs)

  • New protocols welcome (user-defined?!)

  • Step-by-step optimization towards maximal parallelism



LU decomposition of a 1024x1024 matrix written in SPLASH: advantages gained when reducing the consistency of a single variable (the Global structure):



MJEC - Millipede Job Event Control

An open mechanism with which various synchronization methods can be implemented.

  • A job has a unique system-wide id

  • Jobs communicate and synchronize by sending events

  • Although a job is mobile, its events follow it and reach its events queue wherever it goes

  • Event handlers are context-sensitive


MJEC (con't)

  • Modes:

    • In Execution Mode, arriving events are enqueued

    • In Dispatching Mode, events are dequeued and handled by a user-supplied dispatching routine


MJEC Interface

Registration and entering dispatch mode:

    milEnterDispatchingMode((FUNC)foo, void *context)

Post event:

    milPostEvent(id target, int event, int data)

Dispatcher routine syntax:

    int foo(id origin, int event, int data, void *context)

[Flowchart: from Execution Mode, milEnterDispatchingMode(func, context) first calls ret := func(INIT, context). Then, while ret != EXIT: if no event is pending, wait for one; dequeue the event and call ret := func(event, context). When ret == EXIT, the system calls ret := func(EXIT, context) and the job returns to Execution Mode.]


Experience with MJEC

  • ParC: ~250 lines; SPLASH: ~120 lines

  • Easy implementation of many synchronization methods: semaphores, locks, condition variables, barriers

  • Implementation of location-dependent services (e.g., graphical display)


Example - Barriers with MJEC

    Barrier() {
        milPostEvent(BARSERV, ARR, 0);
        milEnterDispatchingMode(wait_in_barrier, 0);
    }

    wait_in_barrier(src, event, context) {
        if (event == DEP)
            return EXIT_DISPATCHER;
        else
            return STAY_IN_DISPATCHER;
    }

[Diagram: jobs call BARRIER(...); each job posts an ARR event to the Barrier Server and waits in its own dispatcher.]



Example - Barriers with MJEC (con’t)

    BarrierServer() {
        milEnterDispatchingMode(barrier_server, info);
    }

    barrier_server(src, event, context) {
        if (event == ARR)
            enqueue(context.queue, src);
        if (should_release(context))
            while (context.cnt > 0) {
                milPostEvent(dequeue(context.queue), DEP, 0);
                context.cnt--;
            }
        return STAY_IN_DISPATCHER;
    }

[Diagram: the Barrier Server's dispatcher posts DEP events back to each waiting job's dispatcher, releasing them from BARRIER(...).]


Dynamic Page- and Job-Migration

  • Migration may occur in case of:

    • Remote memory access

    • Load imbalance

    • The user coming back from lunch

    • Improving locality by location rearrangement

  • Sometimes migration should be disabled:

    • by the system: ping-pong, critical sections

    • by the programmer: e.g., a control system


Locality of memory reference is THE dominant efficiency factor. Migration can help locality:

[Figures: Only Job Migration / Only Page Migration / Page & Job Migration.]


Load Sharing + Max. Locality = Minimum-Weight Multiway Cut

[Diagram: threads p, q, r and the pages they access, partitioned across hosts; the cut edges correspond to remote accesses.]


Problems with the Multiway Cut Model

  • NP-hard for #cuts > 2, and we have n > X,000,000. Polynomial 2-approximations are known

  • Not optimized for load balancing

  • Page replicas

  • The graph changes dynamically

  • Only external accesses are recorded ===> only partial information is available


Our Approach

[Diagram: per-job history records of remote accesses to pages 0, 1, 2.]

  • Record the history of remote accesses

  • Use this information when making decisions concerning load balancing/load sharing

  • Save old information to avoid repeating bad decisions (learn from mistakes)

  • Detect and resolve ping-pong situations

  • Do everything by piggybacking on communication that is taking place anyway


Ping Pong

Detection (local):

  1. Local threads attempt to use the page a short time after it leaves the local host

  2. The page leaves the host shortly after arrival

Treatment (by the ping-pong server):

  • Collect information regarding all participating hosts and threads

  • Try to locate an underloaded target host

  • Stabilize the system by locking in pages/threads



Optimization: TSP - Effect of Locality (15 cities, Bare Millipede)

[Graph: execution time in seconds (0-4000) vs. number of hosts (1-6), for NO-FS, OPTIMIZED-FS, and FS.]

In the NO-FS case false sharing is avoided by aligning all allocations to page size. In the other two cases each page is used by 2 threads: in FS no optimizations are used, and in OPTIMIZED-FS the history mechanism is enabled.



TSP on 6 hosts (k = number of threads falsely sharing a page):

    k | optimized? | # DSM-related | # ping-pong       | # thread   | execution
      |            | messages      | treatment msgs    | migrations | time (sec)
    --|------------|---------------|-------------------|------------|----------
    2 | Yes        | 5100          | 290               | 68         | 645
    2 | No         | 176120        | 0                 | 23         | 1020
    3 | Yes        | 4080          | 279               | 87         | 620
    3 | No         | 160460        | 0                 | 32         | 1514
    4 | Yes        | 5060          | 343               | 99         | 690
    4 | No         | 155540        | 0                 | 44         | 1515
    5 | Yes        | 6160          | 443               | 139        | 700
    5 | No         | 162505        | 0                 | 55         | 1442



Ping Pong Detection Sensitivity

TSP-1: [Graph: execution time in seconds (0-1000) vs. sensitivity setting (2-20).] Best results are achieved at maximal sensitivity, since all pages are accessed frequently.

TSP-2: [Graph: execution time in seconds (0-1100) vs. sensitivity setting (2-20).] Since part of the pages are accessed frequently and part only occasionally, maximal sensitivity causes unnecessary ping-pong treatment and significantly increases execution time.


Applications

  • Numerical computations: Multigrid

  • Model checking: BDDs

  • Compute-intensive graphics: Ray-Tracing, Radiosity

  • Games, search trees, pruning, tracking, CFD, ...


Performance Evaluation

Tuning parameters and open questions:

  • L - underloaded; H - overloaded

  • Delta (ms) - lock-in time

  • t/o delta - polling (MGS, DSM)

  • msg delta - system pages delta

  • T_epoch - max history time

  • ??? - remove old histories / refresh old histories

  • L_epoch - history length

  • page histories vs. job histories

  • migration heuristic - which function?

  • ping-pong - what is the initial noise? what frequency counts as ping-pong?



LU decomposition of a 1024x1024 matrix written in SPLASH: performance improvements when there are few threads on each host.



LU decomposition of a 2048x2048 matrix written in SPLASH: super-linear speedups due to the caching effect.



Jacobi relaxation on a 512x512 matrix (using 2 matrices, no false sharing), written in ParC.



Overhead of ParC/Millipede on a single host, tested with a tracking algorithm:



Info...

http://www.cs.technion.ac.il/Labs/Millipede

[email protected]

A release is available at the Millipede site!

