## Parallel Processing

### Parallel Processing

Chapter 9

Problem:

- Branches, cache misses, and dependencies limit the available instruction-level parallelism (ILP)

Solution:

- Divide the program into parts
- Run each part on a separate CPU of a larger machine

Motivations

- Desktops are incredibly cheap
  - Hooking up 100 desktops is far cheaper than building a custom high-performance uniprocessor
- Squeezing out more ILP is difficult
  - More complexity/power is required each time
  - Would require a change in cooling technology

Challenges

- Parallelizing code is not easy
  - Languages, software engineering, software verification issues – beyond the scope of this class
- Communication can be costly
  - Our performance analysis ignores caches – in practice these costs are much higher
- Requires HW support
  - Multiple processes modifying the same data cause race conditions, and out-of-order processors arbitrarily reorder operations

Speedup

- Amdahl’s Law!!!!!!
- 70% of the program is parallelizable
- What is the highest speedup possible?
- 1 / (.30 + .70/∞) = 1 / .30 = 3.33
- What is the speedup with 100 processors?
- 1 / (.30 + .70/100) = 1 / .307 = 3.26


SIMD

[Diagram: a controller issues one instruction stream to a row of processors (P), each operating on its own data (D)]

- Controller fetches instructions
- All processors execute the same instruction
- Conditional instructions are the only way for variation

Taxonomy

- SISD – single instruction, single data
  - uniprocessor
- SIMD – single instruction, multiple data
  - vector machines, MMX extensions, graphics cards
- MISD – multiple instruction, single data
  - never really built – streaming apps? pipeline architectures?
- MIMD – multiple instruction, multiple data
  - most multiprocessors
  - cheap, flexible

Example

- Sum the elements in A[] and place the result in sum

```c
int sum = 0;
int i;
for (i = 0; i < n; i++)
    sum = sum + A[i];
```

Parallel Version – Shared Memory

```c
int A[NUM];
int numProcs;
int sum;
int sumArray[numProcs];

myFunction( (input arguments) )
{
    int myNum = …;      // this processor's rank
    int mySum = 0;
    int i;

    // Each processor sums its contiguous chunk of A
    for (i = (NUM/numProcs)*myNum; i < (NUM/numProcs)*(myNum+1); i++)
        mySum += A[i];
    sumArray[myNum] = mySum;

    barrier();          // wait until every partial sum is written

    if (myNum == 0)     // one processor combines the partial sums
    {
        for (i = 0; i < numProcs; i++)
            sum += sumArray[i];
    }
}
```

Why Synchronization?

- Why can’t you figure out when proc x will finish work?
- Cache misses
- Different control flow
- Context switches

Supporting Parallel Programs

- Synchronization
- Cache Coherence
- False Sharing

Synchronization

- Sum += A[i];
- Two processors, i = 0, i = 50
- Before the action:
- Sum = 5
- A[0] = 10
- A[50] = 33
- What is the proper result?

Synchronization

- Sum = Sum + A[i];
- Assembly for this statement, assuming:
  - A[i] is already in $t0
  - &Sum is already in $s0

```
lw  $t1, 0($s0)      # load Sum
add $t1, $t1, $t0    # add A[i]
sw  $t1, 0($s0)      # store Sum
```

Synchronization Problem

- Reading and writing memory is a non-atomic operation
- You cannot read and write a memory location in a single operation
- We need hardware primitives that allow us to read and write without interruption

Solution

- Software Solution
- “lock” – function that allows one processor to leave, all others to loop
- “unlock” – releases the next looping processor (or resets to allow next arriving proc to leave)
- Hardware
- Provide primitives that read & write in order to implement lock and unlock

Hardware: Implementing lock & unlock

- swap $1, 100($2)
  - Swaps the contents of $1 and M[$2+100]

Hardware: Implementing lock & unlock with swap

- If lock has 0, it is free
- If lock has 1, it is held

```
lock:   li   $t0, 1
loop:   swap $t0, 0($a0)     # atomically exchange $t0 and the lock
        bne  $t0, $0, loop   # old value was 1: lock held, retry

unlock: sw   $0, 0($a0)      # store 0 to release the lock
```

Outline

- Synchronization
- Cache Coherence
- False Sharing

Cache Coherence

[Diagram: P1 and P2, each with its own cache ($$$), sharing one DRAM. P1 and P2 have write-back caches.]

| Step | Action       | P1$ | P2$ | DRAM |
|------|--------------|-----|-----|------|
| –    | initial      | *   | *   | 7    |
| 1    | P2: Rd a     | *   | 7   | 7    |
| 2    | P2: Wr a, 5  | *   | 5   | 7    |
| 3    | P1: Rd a     | 5   | 5   | 5    |
| 4    | P2: Wr a, 3  | 5   | 3   | 5    |
| 5    | P1: Rd a     |     |     |      |

AAAAAAAAAAAAAAAAAAAAAH! Inconsistency!

- What will P1 receive from its load? 5
- What should P1 receive from its load? 3

Whatever are we to do?

- Write-Invalidate
- Invalidate that value in all others’ caches
- Set the valid bit to 0
- Write-Update
- Update the value in all others’ caches

Write Invalidate

P1 and P2 have write-back caches.

| Step | Action       | P1$ | P2$ | DRAM |
|------|--------------|-----|-----|------|
| –    | initial      | *   | *   | 7    |
| 1    | P2: Rd a     | *   | 7   | 7    |
| 2    | P2: Wr a, 5  | *   | 5   | 7    |
| 3    | P1: Rd a     | 5   | 5   | 5    |
| 4    | P2: Wr a, 3  | *   | 3   | 5    |
| 5    | P1: Rd a     | 3   | 3   | 3    |

Write Update

P1 and P2 have write-back caches.

| Step | Action       | P1$ | P2$ | DRAM |
|------|--------------|-----|-----|------|
| –    | initial      | *   | *   | 7    |
| 1    | P2: Rd a     | *   | 7   | 7    |
| 2    | P2: Wr a, 5  | *   | 5   | 7    |
| 3    | P1: Rd a     | 5   | 5   | 5    |
| 4    | P2: Wr a, 3  | 3   | 3   | 3    |
| 5    | P1: Rd a     | 3   | 3   | 3    |

Outline

- Synchronization
- Cache Coherence
- False Sharing

Cache Coherence: False Sharing w/ Invalidate

- P1 and P2 have a cache-line size of 4 words
- Access sequence:
  1. P2: Rd A[0]
  2. P1: Rd A[1]
  3. P2: Wr A[0], 5
  4. P1: Wr A[1], 3

Look closely at example

- P1 and P2 do not access the same element
- A[0] and A[1] are in the same cache block, so whenever one is in a cache, the other is too

Cache Coherence: False Sharing w/ Invalidate

P1 and P2 have a cache-line size of 4 words.

| Step | Action          | P1$    | P2$    |
|------|-----------------|--------|--------|
| –    | initial         | *      | *      |
| 1    | P2: Rd A[0]     | *      | A[0-3] |
| 2    | P1: Rd A[1]     | A[0-3] | A[0-3] |
| 3    | P2: Wr A[0], 5  | *      | A[0-3] |
| 4    | P1: Wr A[1], 3  | A[0-3] | *      |

False Sharing

- Different processors access different items in same cache block
- Leads to coherence cache misses

Cache Performance

```c
// Pn       = my processor number (rank)
// NumProcs = total active processors
// N        = total number of elements
// NElem    = N / NumProcs

// Interleaved: each processor writes every NumProcs-th element
for (i = 0; i < NElem; i++)
    A[NumProcs*i + Pn] = f(i);
```

vs.

```c
// Blocked: each processor writes one contiguous slice
for (i = Pn*NElem; i < (Pn+1)*NElem; i++)
    A[i] = f(i);
```

Which is better?

- Both access the same number of elements
- No processors access the same elements as each other

Why is the second better?

- Both access the same number of elements
- No processors access the same elements as each other
- Better Spatial Locality
- Less False Sharing
