
# Parallel Algorithms - PowerPoint PPT Presentation

Parallel Algorithms. Sung Yong Shin TC Lab CS Dept. KAIST. Contents. 1. Background 2. Parallel Computers 3. PRAM 4. Parallel Algorithms. 1. Background. Von Neumann Machines sequential executing one instruction at a time


## PowerPoint Slideshow about 'Parallel Algorithms' - zaina

Presentation Transcript

### Parallel Algorithms

Sung Yong Shin

TC Lab

CS Dept. KAIST

1. Background

2. Parallel Computers

3. PRAM

4. Parallel Algorithms

1. Background

• Von Neumann Machines

• sequential

• executing one instruction at a time

• Inherent limitation: “not faster than electrical signals” (about 1 ft per nanosecond, 10^-9 sec)

• Parallelism or Concurrency: carrying out many operations simultaneously

• partition a complex problem in such a way that various parts of the work can be carried out independently and in parallel, and combine the results when all subcomputations are complete.

• need parallel computers to support this approach.
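The partition / compute independently / combine pattern can be sketched in a few lines. This is only an illustration, not part of the slides: the function name `parallel_sum` and the `workers` parameter are assumptions, and the thread pool shows the structure of the idea rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum(values, workers=4):
    """Partition `values`, sum each part independently, then combine."""
    if not values:
        return 0
    size = -(-len(values) // workers)  # ceiling division: length of each chunk
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, chunks))  # independent subcomputations
    return sum(partials)  # combine the partial results


total = parallel_sum(list(range(1, 101)))
```

The combine step here is itself sequential; the binary fan-in technique later in the deck shows how that step can be parallelized too.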

• Hardware-oriented

• A parallel architecture of a specific kind is built.

• The parallel algorithms for solving different problems are developed to make use of these hardware features to the best advantage.

• Problem-oriented

• Whether the parallel algorithms can truly enhance the speed of obtaining a solution to a given problem, or not.

• If so, how much ?

(i) The usefulness of parallel computers depends greatly on :

• suitable parallel algorithms

• parallel computer languages

“ A major rethinking needed”

(ii) Practical limitations of parallel computers: “too many factors to be considered”. How to abstract the essential ingredients from complex reality !!!

Which problems can be solved substantially faster using many processors rather than one processor ?

• Nicholas Pippenger (1976)

“ NC-class problems” ( Nick’s Class )

“ ultra-fast on a parallel computer with feasible amount of hardware”

( independent of the particular parallel model chosen )

• Inherent Parallelism

probably not possible now but for the future !!!

“ fascinating research topics”

[Figure: the class P, with NC inside it and the P-complete problems. NC: (log n)^m time using P(n) processors. Open question: P = NC ?]

Applications (needs)

• Computer vision / Image processing

• Computer Graphics

• Searching huge databases

• Artificial Intelligence

· · · · · · · ·

2. Parallel Computers

SIMD ( Single Instruction Multiple Data Stream )

MIMD ( Multiple Instruction Multiple Data Stream )

What does SISD stand for ?

[Figure: SISD. A program controls one function unit, which takes x and y from the data source and produces the result x+y.]

[Figure: SIMD. One program controls several function units, which take their operands from the data source and produce the results x+y, s+q and v+w.]

• SIMD

• array processors

• vector processors (pipelining)

[Figure: MIMD. Process1 to Process4 each have their own data source and function unit: a branch unit (YES / NO), a multiply unit producing x · y, a function unit producing w+v, and a divide unit producing s/q.]

Array Processors

[Figure: A control processor (master) issues instructions (for multiple data) to several arithmetic processors (slaves). Each arithmetic processor (PE) has its own memory, and the PEs are joined by a communication network.]

[Figure: Tightly coupled multiprocessors. Identical processors P, · · ·, P reach the memories M, · · ·, M through an interconnection network.]

[Figure: Loosely coupled multiprocessors. Identical processing elements (PEs), each a processor P with a memory M, are joined by an interconnection network.]

Vector (pipeline) processors

[Figure: A simplified pipeline for floating-point multiplication. Operand one and operand two pass through a functional unit with five stages (Stage 1 to Stage 5) to produce the result. Stage labels recoverable from the slide: "compare components", "align operands accordingly", "... exponents and multiply mantissas", "determine normalization factor", "normalize results".]

3. PRAM (Parallel Random Access Machine)

Processors

(i) p general-purpose processors

(ii) Each processor is connected to a large shared, random access memory M.

(iii) Each processor has a private (or local) memory for its own computation.

(iv) All communications among processors take place via the shared memory.

(v) The input for an algorithm is assumed to be in the first n memory cells, and the output is to be placed in the first cell.

(vi) All memory cells are initialized to be “0”.

[Figure: A PRAM. Processors P1, P2, P3, · · ·, Pp are connected through an interconnection to a shared memory M with cells 1, · · ·, m.]

(vii) All processors run the same program.

(viii) Each processor knows its own index.

(ix) A PRAM program may instruct processors to do different things depending on their indices.

Each step has three phases: read, computation, write.

Major Assumption

(i) PRAM processors are synchronized !!!

(1) All processors begin each step at the same time.

(2) All the processors that write at any step write at the same time.

(ii) Any number of processors may read the same memory cell concurrently !!!

Variants of PRAMs

CREW ( Concurrent Read Exclusive Write )

CRCW ( Concurrent Read Concurrent Write )

– Common-write

– Priority-write

Why not EREW ?

yes, if you want !!!
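The variants differ only in what happens when several processors write to the same cell in one step. The sketch below makes that concrete. The slide only names the variants, so the rules coded here are the usual textbook ones and are an assumption: common-write requires all writers to agree on the value, and priority-write lets the lowest-indexed processor win. The function name `resolve_writes` and the tuple layout are hypothetical.

```python
def resolve_writes(requests, variant):
    """Resolve the write phase of one PRAM step.

    requests: list of (processor_index, cell, value) tuples.
    variant:  "CREW", "common" (CRCW common-write) or "priority" (CRCW priority-write).
    Returns a dict mapping each written cell to the value it ends up holding.
    """
    by_cell = {}
    for proc, cell, value in requests:
        by_cell.setdefault(cell, []).append((proc, value))

    result = {}
    for cell, writers in by_cell.items():
        if variant == "CREW" and len(writers) > 1:
            raise ValueError(f"write conflict on cell {cell}")
        if variant == "common" and len({value for _, value in writers}) > 1:
            raise ValueError(f"processors disagree on cell {cell}")
        # Assumed priority rule: the lowest-indexed processor wins.
        # (For CREW and common-write every candidate value is the same anyway.)
        result[cell] = min(writers, key=lambda w: w[0])[1]
    return result
```

For example, two processors writing the same 1 into cell 1 is legal under common-write but raises an error under CREW.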

Other Models

[Other parallel architectures]

(a) A hypercube (dimension = 3)

(b) A bounded degree network (degree = 4)

(c) Octree model
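For the hypercube in (a), a common way to describe the wiring is to number the 2^d nodes in binary and connect two nodes when their numbers differ in exactly one bit. The sketch below assumes that numbering; the function name `hypercube_neighbors` is hypothetical.

```python
def hypercube_neighbors(node, dimension):
    """Neighbors of `node` in a hypercube: flip each of the `dimension` address bits."""
    return [node ^ (1 << bit) for bit in range(dimension)]


# In a hypercube of dimension 3 (8 nodes), every node has exactly 3 neighbors.
neighbors_of_5 = hypercube_neighbors(5, 3)
```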

4. Parallel Algorithms

• Binary Fan-in Technique

• Matrix multiplication

• Handling write conflicts

• Merging & Sorting

Binary Fan-in Technique

[Figure: A parallel tournament (finding Max). Processors P1, P3, P5 and P7 compare the keys x1, · · ·, x8 pairwise. Each round is a compute phase (comparison) followed by a write phase (save the winner); after three rounds M[1] = max.]

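[A tournament example showing the activity of all the processors.]

| Step | Value | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 |
|---|---|---|---|---|---|---|---|---|---|
| Step 0 | M | 16 | 12 | 1 | 17 | 23 | 19 | 4 | 8 |
| Step 0 | big | 16 | 12 | 1 | 17 | 23 | 19 | 4 | 8 |
| Step 1 | M | 16 | 12 | 1 | 17 | 23 | 19 | 4 | 8 |
| Step 1 | temp | 12 | 1 | 17 | 23 | 19 | 4 | 8 | –∞ |
| Step 1 | big := max (big, temp) | 16 | 12 | 17 | 23 | 23 | 19 | 8 | 8 |
| Step 2 | M (after write big) | 16 | 12 | 17 | 23 | 23 | 19 | 8 | 8 |
| Step 2 | temp | 17 | 23 | 23 | 19 | 8 | 8 | –∞ | –∞ |
| Step 2 | big := max (big, temp) | 17 | 23 | 23 | 23 | 23 | 19 | 8 | 8 |
| Step 3 | M (after write big) | 17 | 23 | 23 | 23 | 23 | 19 | 8 | 8 |
| Step 3 | temp | 23 | 19 | 8 | 8 | –∞ | –∞ | –∞ | –∞ |
| Step 3 | big := max (big, temp) | 23 | 23 | 23 | 23 | 23 | 19 | 8 | 8 |
| Final | M (after write big) | 23 | 23 | 23 | 23 | 23 | 19 | 8 | 8 |

The max ends up in M[1].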

read M[i] into big ;

incr := 1 ;

write –∞ { some very small value } into M[n+i] ;

for step := 1 to lg n do

read M[i+incr] into temp ;

big := max (big, temp) ;

incr := 2 * incr ;

write big into M[i]

end { for }

O( log n ) using n/2 processors

no write conflicts
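The tournament can be simulated sequentially by running each phase for all processors before moving to the next phase. This is a sketch under assumptions, not the author's code: the name `pram_max` is hypothetical, lists are 0-indexed (so M[1] on the slides is `M[0]` here), each processor is taken to read M[i+incr] into temp, and the step count is rounded up when n is not a power of 2.

```python
import math

NEG_INF = float("-inf")  # stands in for "some very small value"


def pram_max(keys):
    """Simulate the binary fan-in tournament.

    Returns (first n memory cells after the last step, number of steps).
    The maximum is left in cell 0.
    """
    n = len(keys)
    M = list(keys) + [NEG_INF] * n   # every P_i writes -inf into M[n+i]
    big = list(keys)                 # every P_i reads M[i] into big
    incr = 1
    steps = math.ceil(math.log2(n)) if n > 1 else 0
    for _ in range(steps):
        temp = [M[i + incr] for i in range(n)]        # read phase, all processors
        big = [max(b, t) for b, t in zip(big, temp)]  # computation phase
        incr *= 2
        M[:n] = big                                   # write phase, one cell per processor
    return M[:n], steps


cells, steps = pram_max([16, 12, 1, 17, 23, 19, 4, 8])
```

Each processor writes only its own cell M[i], which is why no write conflicts arise.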

Matrix Multiplication

O(n) using n^2 processors

What if using n^3 processors ?

O( log n )

Why ?
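One plausible answer to "Why ?": with n^3 processors, all n^3 products A[i][k] · B[k][j] can be formed in a single step, and the n products belonging to each entry can then be summed by binary fan-in in about lg n rounds. The simulation below counts those parallel rounds. The name `pram_matmul` and the round-counting convention are assumptions made for illustration.

```python
def pram_matmul(A, B):
    """Multiply two n x n matrices the way n^3 PRAM processors could.

    Returns (C, rounds), where rounds counts parallel steps:
    one for the products plus one per fan-in level.
    """
    n = len(A)
    # Round 1: processor P(i, j, k) computes a single product; all run at once.
    terms = [[[A[i][k] * B[k][j] for k in range(n)] for j in range(n)]
             for i in range(n)]
    rounds = 1
    width = n                      # partial sums still alive per entry
    while width > 1:
        half = (width + 1) // 2
        for i in range(n):
            for j in range(n):
                t = terms[i][j]
                # Each addition reads the upper half and writes the lower half,
                # so all additions of one round are independent of each other.
                for k in range(width // 2):
                    t[k] = t[k] + t[k + half]
        width = half
        rounds += 1
    C = [[terms[i][j][0] for j in range(n)] for i in range(n)]
    return C, rounds


C, rounds = pram_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

With only n^2 processors, each processor has to add its n products one after another, which gives the O(n) bound.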

Handling write conflict

Algorithm : Computing the OR of n Bits

Input : Bits x1, · · ·, xn in M[1], · · ·, M[n].

Output : x1 ∨ · · · ∨ xn in M[1].

If xi = 1, then Pi writes 1 in M[1].

O(1) using n processors

write conflict !!!

⇒ CRCW

– Common-write

– Priority-write
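A sequential stand-in for the one-step OR is shown below. The name `pram_or` is hypothetical and lists are 0-indexed, so M[1] on the slide is `M[0]` here. Every writer stores the same value 1, so the conflict is harmless under common-write.

```python
def pram_or(bits):
    """One CRCW step: every P_i with x_i = 1 writes 1 into the first cell."""
    M = list(bits)
    writers = [i for i, x in enumerate(bits) if x == 1]  # processors that write
    if writers:
        M[0] = 1  # all writers agree on the value, so common-write allows it
    return M[0]   # if nobody wrote, the first cell still holds x_1 = 0
```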

Fast algorithm for finding Max

[Example for the fast max-finding algorithm]

Initial memory contents (n = 4), cells 1 to 8: input 2 7 3 6, loser 0 0 0 0.

After Step 2: P12, P13, P14, P23, P24 and P34 each write a 1 into loser; memory is 2 7 3 6 1 0 1 1.

After Step 3: P23 writes 7 in M[1].

O(1) using n(n-1)/2 processors

common-write

Algorithm : Finding the Largest of n Keys

Input : n keys x1, x2,···, xn, initially in memory cells M[1], M[2],···, M[n] (n>2).

Output : The largest key will be left in M[1].

Comment : For clarity, the processors will be numbered Pi.j for 1 ≤ i < j ≤ n.

Step 1

Pi.j reads xi (from M[i]).

Step 2

Pi.j reads xj (from M[j]).

Pi.j compares xi and xj.

Let k be the index of the smaller key.

(If the keys are equal, let k be the smaller index.)

Pi.j writes 1 in loser[k].

{ At this point, every key other than the largest has lost a comparison. }

Step 3

Pi.i+1 reads loser[i] (and P1.n reads loser[n]).

Any processor that read a 0 writes xi in M[1]. (P1.n would write xn.)

{ Pi.i+1 already has xi in its local memory ; P1.n has xn. }
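The algorithm can be checked with a short simulation in which one loop iteration stands for one processor Pi.j. The name `fast_max` is hypothetical and indices are 0-based. All processors write the same value 1, so common-write suffices.

```python
from itertools import combinations


def fast_max(keys):
    """Simulate the constant-time max with one processor per pair i < j.

    Returns (largest key, loser array).
    """
    n = len(keys)
    loser = [0] * n
    # Step 2: every P(i, j) compares its two keys and marks the loser.
    for i, j in combinations(range(n), 2):
        # The smaller key loses; on a tie the smaller index loses.
        k = i if keys[i] <= keys[j] else j
        loser[k] = 1  # concurrent writes of the same value
    # Step 3: the single key that never lost is written to the first cell.
    winner = loser.index(0)
    return keys[winner], loser


largest, loser = fast_max([2, 7, 3, 6])
```

On the slide's example this gives the loser array 1 0 1 1 and the maximum 7.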

Merging and Sorting

merging

[Parallel merging]

(a) Assignment of processors to keys: P1, · · ·, Pn/2 take the keys x1, · · ·, xn/2 and Pn/2+1, · · ·, Pn take the y keys, with the keys stored in M[1], · · ·, M[n].

(b) Binary search steps; Pi finds j such that yj-1 < xi < yj.

(c) Output step: x1,…, xi-1 and y1,…, yj-1 come before xi in the merged list, so Pi writes xi in M[i+j-1].

O(log n) using n processors

no write conflict

Algorithm : Parallel Merging

Input : Two sorted lists of n/2 keys each, in the first n cells of memory.

Output : The merged list, in the first n cells of memory.

Comment : Each processor Pi has a local variable x (if i ≤ n/2) or y (if i > n/2) and other local variables for conducting its binary search. Each processor has a local variable position that will indicate where to write its key.

Initialization :

Pi reads M[i] into x (if i ≤ n/2) or into y (if i > n/2).

Pi does initialization for its binary search.

Binary search steps :

Processors Pi, for 1 ≤ i ≤ n/2, do binary search in M[n/2+1],…, M[n] to find the smallest j such that x < M[n/2+j], and assign i+j–1 to position. If there is no such j, Pi assigns n/2+i to position.

Processors Pn/2+i, for 1 ≤ i ≤ n/2, do binary search in M[1],…, M[n/2] to find the smallest j such that y < M[j], and assign i+j–1 to position. If there is no such j, Pn/2+i assigns n/2+i to position.

Output step :

Each Pi (for 1 ≤ i ≤ n) writes its key (x or y) in M[position].
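A sequential sketch of the merge is given below, with one loop iteration per processor and Python's `bisect` module doing the binary search. The name `parallel_merge` is hypothetical and positions are 0-based. One deliberate deviation from the slide, flagged as an assumption: the y processors use a non-strict comparison, so that equal keys in the two lists cannot be sent to the same cell.

```python
from bisect import bisect_left, bisect_right


def parallel_merge(xs, ys):
    """Merge two sorted lists by letting each key compute its own output position."""
    out = [None] * (len(xs) + len(ys))
    # x processors: position = own index + number of y keys that are <= x.
    for i, x in enumerate(xs):
        out[i + bisect_right(ys, x)] = x
    # y processors: position = own index + number of x keys that are < y
    # (tie-break chosen here so equal keys never collide).
    for i, y in enumerate(ys):
        out[i + bisect_left(xs, y)] = y
    return out


merged = parallel_merge([1, 4, 6, 9], [2, 3, 7, 8])
```

Every key lands in a distinct cell, so the output step needs no concurrent writes; the binary searches read shared cells concurrently, which the PRAM allows.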

Break the list into two halves.

Sort the two halves (recursively).

Merge the two sorted halves.

Algorithm : Sorting by Merging

Input : A list of n keys in M[1],…,M[n].

Output : The n keys sorted in nondecreasing order in M[1],…,M[n].

Comment : The indexing in the algorithm is easier if the number of keys is a power of 2, so the first step will “pad” the input with large keys at the end. We still use only n processors.

Pi writes ∞ (some large key) in M[n+i] ;

for t := 1 to lg n do

k := 2^(t-1) ; { the size of the lists being merged }

Pi,…, Pi+2k-1 merge the two sorted lists of size k beginning at M[i];

end { for }

O((log n)^2) using n processors
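The bound follows from lg n merge rounds, each of which is a parallel merge taking O(log n) time. The simulation below counts the rounds. It is a sketch under assumptions: the names `rank_merge` and `pram_merge_sort` are hypothetical, the merge breaks ties with a non-strict comparison on one side, and the keys are assumed to be finite numbers so that `float("inf")` works as the padding key.

```python
from bisect import bisect_left, bisect_right


def rank_merge(xs, ys):
    """Merge two sorted lists by computing each key's final position."""
    out = [None] * (len(xs) + len(ys))
    for i, x in enumerate(xs):
        out[i + bisect_right(ys, x)] = x
    for i, y in enumerate(ys):
        out[i + bisect_left(xs, y)] = y
    return out


def pram_merge_sort(keys):
    """Bottom-up sorting by merging; returns (sorted keys, number of merge rounds)."""
    n = len(keys)
    size = 1
    while size < n:              # pad up to the next power of 2
        size *= 2
    M = list(keys) + [float("inf")] * (size - n)
    k, rounds = 1, 0             # k is the size of the lists being merged
    while k < size:
        # All merges of one round touch disjoint blocks, so they can run in parallel.
        for start in range(0, size, 2 * k):
            M[start:start + 2 * k] = rank_merge(M[start:start + k],
                                                M[start + k:start + 2 * k])
        k *= 2
        rounds += 1
    return M[:n], rounds


result, rounds = pram_merge_sort([16, 12, 1, 17, 23, 19, 4, 8])
```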