Parallel Processing
Chapter 9

Presentation Transcript
slide2
Problem:
    • Branches, cache misses, dependencies limit the Instruction Level Parallelism (ILP) available
  • Solution:
slide3
Problem:
    • Branches, cache misses, dependencies limit the Instruction Level Parallelism (ILP) available
  • Solution:
    • Divide program into parts
    • Run each part on separate CPUs of larger machine
motivations1
Motivations
  • Desktops are incredibly cheap
    • Custom high-performance uniprocessor
    • Hook up 100 desktops
  • Squeezing out more ILP is difficult
motivations2
Motivations
  • Desktops are incredibly cheap
    • Custom high-performance uniprocessor
    • Hook up 100 desktops
  • Squeezing out more ILP is difficult
    • More complexity/power required each time
    • Would require change in cooling technology
challenges
Challenges
  • Parallelizing code is not easy
  • Communication can be costly
  • Requires HW support
challenges1
Challenges
  • Parallelizing code is not easy
    • Languages, software engineering, software verification issues – beyond the scope of this class
  • Communication can be costly
  • Requires HW support
challenges2
Challenges
  • Parallelizing code is not easy
    • Languages, software engineering, software verification issues – beyond the scope of this class
  • Communication can be costly
    • Performance analysis ignores caches - these costs are much higher
  • Requires HW support
challenges3
Challenges
  • Parallelizing code is not easy
    • Languages, software engineering, software verification issues – beyond the scope of this class
  • Communication can be costly
    • Performance analysis ignores caches - these costs are much higher
  • Requires HW support
    • Multiple processes modifying the same data cause race conditions, and out-of-order processors arbitrarily reorder memory accesses
performance speedup
Performance - Speedup
  • _____________________
  • 70% of the program is parallelizable
  • What is the highest speedup possible?
  • What is the speedup with 100 processors?
speedup
Speedup
  • Amdahl’s Law!!!!!!
  • 70% of the program is parallelizable
  • What is the highest speedup possible?
  • What is the speedup with 100 processors?
speedup1
Speedup
  • Amdahl’s Law!!!!!!
  • 70% of the program is parallelizable
  • What is the highest speedup possible?
    • 1 / (.30 + .70/∞) = 1 / .30 = 3.33
  • What is the speedup with 100 processors?

speedup2
Speedup
  • Amdahl’s Law!!!!!!
  • 70% of the program is parallelizable
  • What is the highest speedup possible?
    • 1 / (.30 + .70/∞) = 1 / .30 = 3.33
  • What is the speedup with 100 processors?
    • 1 / (.30 + .70/100) = 1 / .307 = 3.26

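The same arithmetic is easy to replay in code. A minimal sketch in C, using a hypothetical helper amdahl_speedup() (the function name and test values are illustrative, not from the slides):

#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - p) + p/n), where p is the
   parallelizable fraction and n is the number of processors. */
static double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    printf("n = 100:  %.2f\n", amdahl_speedup(0.70, 100.0)); /* ~3.26 */
    printf("n -> inf: %.2f\n", amdahl_speedup(0.70, 1e9));   /* ~3.33 */
    return 0;
}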
taxonomy
Taxonomy
  • SISD – single instruction, single data
  • SIMD – single instruction, multiple data
  • MISD – multiple instruction, single data
  • MIMD – multiple instruction, multiple data
taxonomy1
Taxonomy
  • SISD – single instruction, single data
    • uniprocessor
  • SIMD – single instruction, multiple data
  • MISD – multiple instruction, single data
  • MIMD – multiple instruction, multiple data
taxonomy2
Taxonomy
  • SISD – single instruction, single data
    • uniprocessor
  • SIMD – single instruction, multiple data
    • vector, MMX extensions, graphics cards
  • MISD – multiple instruction, single data
  • MIMD – multiple instruction, multiple data
slide18
SIMD

(Diagram: a single controller issuing instructions to a row of processor (P) / data memory (D) pairs.)

  • Controller fetches instructions
  • All processors execute the same instruction
  • Conditional instructions are the only way to get per-processor variation

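To make the last bullet concrete, here is a hedged C sketch (scalar code, names invented for illustration) of how an if/else is typically handled when every processor must execute the same instruction stream: both outcomes are computed, and a per-element mask selects the one that is kept.

/* Illustrative only: models masked execution on a SIMD machine. */
void simd_style_abs(const int *A, int *B, int n)
{
    for (int i = 0; i < n; i++) {
        int mask = (A[i] < 0);     /* every processor evaluates the test    */
        int neg  = -A[i];          /* ...and the "then" result              */
        int pos  =  A[i];          /* ...and the "else" result              */
        B[i] = mask ? neg : pos;   /* the mask picks which result is kept   */
    }
}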
taxonomy4
Taxonomy
  • SISD – single instruction, single data
    • uniprocessor
  • SIMD – single instruction, multiple data
    • vector, MMX extensions, graphics cards
  • MISD – multiple instruction, single data
    • Never built – pipeline architectures?!?
  • MIMD – multiple instruction, multiple data
taxonomy5
Taxonomy
  • SISD – single instruction, single data
    • uniprocessor
  • SIMD – single instruction, multiple data
    • vector, MMX extensions, graphics cards
  • MISD – multiple instruction, single data
    • Streaming apps?
  • MIMD – multiple instruction, multiple data
    • Most multiprocessors
    • Cheap, flexible
example
Example
  • Sum the elements in A[] and place result in sum

int sum = 0;
int i;

for (i = 0; i < n; i++)
    sum = sum + A[i];

parallel version shared memory1
Parallel Version: Shared Memory

int A[NUM];
int numProcs;
int sum;
int sumArray[numProcs];

myFunction( (input arguments) )
{
    int myNum = …;        /* this processor's number (rank) */
    int mySum = 0;
    int i;

    /* each processor sums its own contiguous chunk of A[] */
    for (i = (NUM/numProcs)*myNum; i < (NUM/numProcs)*(myNum+1); i++)
        mySum += A[i];

    sumArray[myNum] = mySum;
    barrier();            /* wait until every processor has stored its partial sum */

    if (myNum == 0)
    {
        for (i = 0; i < numProcs; i++)
            sum += sumArray[i];
    }
}

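The slides leave the launch code and barrier() unspecified. One possible realization, sketched here with POSIX threads (the thread API, the fixed NUM/NUMPROCS values, and the pthread_barrier_t are assumptions, not part of the slides):

#include <pthread.h>
#include <stdio.h>

#define NUM      1024
#define NUMPROCS 4

int A[NUM];
int sum = 0;
int sumArray[NUMPROCS];
pthread_barrier_t bar;                    /* stands in for barrier() */

void *myFunction(void *arg)
{
    int myNum = (int)(long)arg;           /* this thread's rank */
    int mySum = 0;

    for (int i = (NUM/NUMPROCS)*myNum; i < (NUM/NUMPROCS)*(myNum+1); i++)
        mySum += A[i];
    sumArray[myNum] = mySum;

    pthread_barrier_wait(&bar);           /* wait for all partial sums */
    if (myNum == 0)
        for (int i = 0; i < NUMPROCS; i++)
            sum += sumArray[i];
    return NULL;
}

int main(void)
{
    pthread_t tid[NUMPROCS];
    pthread_barrier_init(&bar, NULL, NUMPROCS);
    for (long i = 0; i < NUMPROCS; i++)
        pthread_create(&tid[i], NULL, myFunction, (void *)i);
    for (int i = 0; i < NUMPROCS; i++)
        pthread_join(tid[i], NULL);
    printf("sum = %d\n", sum);
    return 0;
}

Real code would also check return values and handle the case where NUM is not divisible by NUMPROCS.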
why synchronization
Why Synchronization?
  • Why can’t you figure out when proc x will finish work?
why synchronization1
Why Synchronization?
  • Why can’t you figure out when proc x will finish work?
    • Cache misses
    • Different control flow
    • Context switches
supporting parallel programs
Supporting Parallel Programs
  • Synchronization
  • Cache Coherence
  • False Sharing
synchronization
Synchronization
  • Sum += A[i];
  • Two processors, i = 0, i = 50
  • Before the action:
    • Sum = 5
    • A[0] = 10
    • A[50] = 33
  • What is the proper result?
synchronization1
Synchronization
  • Sum = Sum + A[i];
  • Assembly for this statement, assuming:
    • A[i] is already in $t0
    • &Sum is already in $s0

lw  $t1, 0($s0)
add $t1, $t1, $t0
sw  $t1, 0($s0)

synchronization ordering 1
Synchronization: Ordering #1

lw  $t1, 0($s0)       both processors load Sum:               5, 5
add $t1, $t1, $t0     i = 0 adds 10, i = 50 adds 33:          15, 38
sw  $t1, 0($s0)       the 15 is stored first, then the 38:    Sum ends as 38

synchronization ordering 2
Synchronization: Ordering #2

lw  $t1, 0($s0)       both processors load Sum:               5, 5
add $t1, $t1, $t0     i = 0 adds 10, i = 50 adds 33:          15, 38
sw  $t1, 0($s0)       the 38 is stored first, then the 15:    Sum ends as 15

Either way one update is lost; the intended result is 5 + 10 + 33 = 48.

synchronization problem
Synchronization Problem
  • Reading and writing memory is a non-atomic operation
    • You cannot read and write a memory location in a single operation
  • We need hardware primitives that allow us to read and write without interruption
solution
Solution
  • Software Solution
    • “lock” – function that allows one processor to leave, all others to loop
    • “unlock” – releases the next looping processor (or resets to allow next arriving proc to leave)
  • Hardware
    • Provide primitives that read & write in order to implement lock and unlock
software using lock and unlock
Software: Using lock and unlock

lock(&balancelock)
Sum += A[i]
unlock(&balancelock)

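The slides treat lock() and unlock() abstractly. One common software realization (an assumption here, not the slides' own implementation) is a POSIX mutex:

#include <pthread.h>

pthread_mutex_t balancelock = PTHREAD_MUTEX_INITIALIZER;
int Sum;
int A[100];                                 /* hypothetical data */

void add_element(int i)                     /* hypothetical helper */
{
    pthread_mutex_lock(&balancelock);       /* lock(&balancelock)   */
    Sum += A[i];                            /* critical section     */
    pthread_mutex_unlock(&balancelock);     /* unlock(&balancelock) */
}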
hardware implementing lock unlock
Hardware: Implementing lock & unlock
  • Swap $1, 100($2)
    • Swap the contents of $1 and M[$2+100]
hardware implementing lock unlock with swap
Hardware: Implementing lock & unlock with swap
  • If lock has 0, it is free
  • If lock has 1, it is held

lock:   li    $t0, 1
loop:   swap  $t0, 0($a0)
        bne   $t0, $0, loop

hardware implementing lock unlock with swap1
Hardware: Implementing lock & unlock with swap
  • If lock has 0, it is free
  • If lock has 1, it is held

lock:   li    $t0, 1
loop:   swap  $t0, 0($a0)      # atomically exchange $t0 with the lock word
        bne   $t0, $0, loop    # old value was 1, so the lock was held; retry

unlock: sw    $0, 0($a0)       # store 0 to release the lock

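For comparison, the same spin-lock idea can be written in C with compiler atomic builtins. This is an illustrative sketch using the GCC/Clang __atomic builtins, not the MIPS-style swap instruction the slides assume:

/* 0 = free, 1 = held; mirrors the swap-based loop above. */
static void lock(volatile int *l)
{
    /* atomically write 1 and get the old value back; if the old value
       was 1 the lock was already held, so keep spinning */
    while (__atomic_exchange_n(l, 1, __ATOMIC_ACQUIRE) != 0)
        ;
}

static void unlock(volatile int *l)
{
    __atomic_store_n(l, 0, __ATOMIC_RELEASE);   /* like: sw $0, 0($a0) */
}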
outline
Outline
  • Synchronization
  • Cache Coherence
  • False Sharing
cache coherence
Cache Coherence

(Diagram: P1 and P2, each with its own cache, both connected to a shared DRAM.)

  Current value of a in:    P1$    P2$    DRAM
  (initially)                *      *      7
  1. P2: Rd a
  2. P2: Wr a, 5
  3. P1: Rd a
  4. P2: Wr a, 3
  5. P1: Rd a

P1, P2 are write-back caches

cache coherence1
Cache Coherence

(Diagram: as before, with access 1 marked.)

  Current value of a in:    P1$    P2$    DRAM
  (initially)                *      *      7
  1. P2: Rd a
  2. P2: Wr a, 5
  3. P1: Rd a
  4. P2: Wr a, 3
  5. P1: Rd a

P1, P2 are write-back caches

cache coherence2
Cache Coherence

(Diagram: access 1 marked.)

  Current value of a in:    P1$    P2$    DRAM
  (initially)                *      *      7
  1. P2: Rd a                *      7      7
  2. P2: Wr a, 5
  3. P1: Rd a
  4. P2: Wr a, 3
  5. P1: Rd a

P1, P2 are write-back caches

cache coherence3
Cache Coherence

(Diagram: accesses 1 and 2 marked.)

  Current value of a in:    P1$    P2$    DRAM
  (initially)                *      *      7
  1. P2: Rd a                *      7      7
  2. P2: Wr a, 5
  3. P1: Rd a
  4. P2: Wr a, 3
  5. P1: Rd a

P1, P2 are write-back caches

cache coherence4
Cache Coherence

(Diagram: accesses 1 and 2 marked.)

  Current value of a in:    P1$    P2$    DRAM
  (initially)                *      *      7
  1. P2: Rd a                *      7      7
  2. P2: Wr a, 5             *      5
  3. P1: Rd a
  4. P2: Wr a, 3
  5. P1: Rd a

P1, P2 are write-back caches

cache coherence5
Cache Coherence

(Diagram: accesses 1 and 2 marked.)

  Current value of a in:    P1$    P2$    DRAM
  (initially)                *      *      7
  1. P2: Rd a                *      7      7
  2. P2: Wr a, 5             *      5      7
  3. P1: Rd a
  4. P2: Wr a, 3
  5. P1: Rd a

P1, P2 are write-back caches

cache coherence6
Cache Coherence

(Diagram: accesses 1-3 marked.)

  Current value of a in:    P1$    P2$    DRAM
  (initially)                *      *      7
  1. P2: Rd a                *      7      7
  2. P2: Wr a, 5             *      5      7
  3. P1: Rd a                5      5      5
  4. P2: Wr a, 3
  5. P1: Rd a

P1, P2 are write-back caches

cache coherence7
Cache Coherence

(Diagram: accesses 1-4 marked.)

  Current value of a in:    P1$    P2$    DRAM
  (initially)                *      *      7
  1. P2: Rd a                *      7      7
  2. P2: Wr a, 5             *      5      7
  3. P1: Rd a                5      5      5
  4. P2: Wr a, 3             5      3      5
  5. P1: Rd a

AAAAAAAAAAAAAAAAAAAAAH! Inconsistency!

cache coherence8
Cache Coherence

(Diagram: accesses 1-4 marked.)

  Current value of a in:    P1$    P2$    DRAM
  (initially)                *      *      7
  1. P2: Rd a                *      7      7
  2. P2: Wr a, 5             *      5      7
  3. P1: Rd a                5      5      5
  4. P2: Wr a, 3             5      3      5
  5. P1: Rd a

AAAAAAAAAAAAAAAAAAAAAH! Inconsistency!

What will P1 receive from its load?

cache coherence9
Cache Coherence

(Diagram: accesses 1-4 marked.)

  Current value of a in:    P1$    P2$    DRAM
  (initially)                *      *      7
  1. P2: Rd a                *      7      7
  2. P2: Wr a, 5             *      5      7
  3. P1: Rd a                5      5      5
  4. P2: Wr a, 3             5      3      5
  5. P1: Rd a

AAAAAAAAAAAAAAAAAAAAAH! Inconsistency!

What will P1 receive from its load? 5

What should P1 receive from its load?

cache coherence10
Cache Coherence

(Diagram: accesses 1-4 marked.)

  Current value of a in:    P1$    P2$    DRAM
  (initially)                *      *      7
  1. P2: Rd a                *      7      7
  2. P2: Wr a, 5             *      5      7
  3. P1: Rd a                5      5      5
  4. P2: Wr a, 3             5      3      5
  5. P1: Rd a

AAAAAAAAAAAAAAAAAAAAAH! Inconsistency!

What will P1 receive from its load? 5

What should P1 receive from its load? 3

whatever are we to do
Whatever are we to do?
  • Write-Invalidate
    • Invalidate that value in all others’ caches
    • Set the valid bit to 0
  • Write-Update
    • Update the value in all others’ caches
write invalidate
Write Invalidate

(Diagram: accesses 1-4 marked.)

  Current value of a in:    P1$    P2$    DRAM
  (initially)                *      *      7
  1. P2: Rd a                *      7      7
  2. P2: Wr a, 5             *      5      7
  3. P1: Rd a                5      5      5
  4. P2: Wr a, 3
  5. P1: Rd a

P1, P2 are write-back caches

write invalidate1
Write Invalidate

(Diagram: accesses 1-4 marked.)

  Current value of a in:    P1$    P2$    DRAM
  (initially)                *      *      7
  1. P2: Rd a                *      7      7
  2. P2: Wr a, 5             *      5      7
  3. P1: Rd a                5      5      5
  4. P2: Wr a, 3             *      3      5
  5. P1: Rd a

P1, P2 are write-back caches

write invalidate2
Write Invalidate

(Diagram: accesses 1-5 marked.)

  Current value of a in:    P1$    P2$    DRAM
  (initially)                *      *      7
  1. P2: Rd a                *      7      7
  2. P2: Wr a, 5             *      5      7
  3. P1: Rd a                5      5      5
  4. P2: Wr a, 3             *      3      5
  5. P1: Rd a                3      3      3

P1, P2 are write-back caches

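As a deliberately simplified illustration, here is a toy C model that reproduces the write-invalidate sequence above. The state names and helper functions are invented for this sketch; real protocols such as MESI track states per cache block and snoop a shared bus.

#include <stdio.h>

enum state { INVALID, CLEAN, DIRTY };
struct cache { enum state st; int val; };

struct cache c[2];            /* c[0] = P1's cache, c[1] = P2's cache */
int dram = 7;                 /* memory copy of a                     */

int rd(int p)
{
    if (c[p].st == INVALID) {                /* miss                       */
        int other = 1 - p;
        if (c[other].st == DIRTY) {          /* other cache owns the data  */
            dram = c[other].val;             /* write it back first        */
            c[other].st = CLEAN;
        }
        c[p].val = dram;
        c[p].st  = CLEAN;
    }
    return c[p].val;
}

void wr(int p, int v)
{
    c[1 - p].st = INVALID;    /* write-invalidate: kill the other copy     */
    c[p].val = v;
    c[p].st  = DIRTY;         /* write-back: DRAM is updated only later    */
}

int main(void)
{
    rd(1);                    /* 1. P2: Rd a            */
    wr(1, 5);                 /* 2. P2: Wr a, 5         */
    rd(0);                    /* 3. P1: Rd a   -> 5     */
    wr(1, 3);                 /* 4. P2: Wr a, 3         */
    printf("%d\n", rd(0));    /* 5. P1: Rd a   -> 3     */
    return 0;
}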
write update
Write Update

(Diagram: accesses 1-4 marked.)

  Current value of a in:    P1$    P2$    DRAM
  (initially)                *      *      7
  1. P2: Rd a                *      7      7
  2. P2: Wr a, 5             *      5      7
  3. P1: Rd a                5      5      5
  4. P2: Wr a, 3
  5. P1: Rd a

P1, P2 are write-back caches

write update1
Write Update

(Diagram: accesses 1-4 marked.)

  Current value of a in:    P1$    P2$    DRAM
  (initially)                *      *      7
  1. P2: Rd a                *      7      7
  2. P2: Wr a, 5             *      5      7
  3. P1: Rd a                5      5      5
  4. P2: Wr a, 3             3      3      3
  5. P1: Rd a

P1, P2 are write-back caches

write update2
Write Update

(Diagram: accesses 1-4 marked.)

  Current value of a in:    P1$    P2$    DRAM
  (initially)                *      *      7
  1. P2: Rd a                *      7      7
  2. P2: Wr a, 5             *      5      7
  3. P1: Rd a                5      5      5
  4. P2: Wr a, 3             3      3      3
  5. P1: Rd a                3      3      3

P1, P2 are write-back caches

outline1
Outline
  • Synchronization
  • Cache Coherence
  • False Sharing
cache coherence false sharing w invalidate
Cache Coherence: False Sharing w/ Invalidate

(Diagram: P1 and P2, each with a cache, share one DRAM.)

  Current contents in:      P1$       P2$
  (initially)                *         *
  1. P2: Rd A[0]
  2. P1: Rd A[1]
  3. P2: Wr A[0], 5
  4. P1: Wr A[1], 3

P1, P2 cache line size: 4 words

look closely at example
Look closely at example
  • P1 and P2 do not access the same element
  • A[0] and A[1] are in the same cache block, so whichever cache holds one of them also holds the other
cache coherence false sharing w invalidate1
Cache Coherence: False Sharing w/ Invalidate

(Diagram: P1 and P2 caches and DRAM.)

  Current contents in:      P1$       P2$
  (initially)                *         *
  1. P2: Rd A[0]             *         A[0-3]
  2. P1: Rd A[1]
  3. P2: Wr A[0], 5
  4. P1: Wr A[1], 3

P1, P2 cache line size: 4 words

cache coherence false sharing w invalidate2
Cache Coherence: False Sharing w/ Invalidate

(Diagram: P1 and P2 caches and DRAM.)

  Current contents in:      P1$       P2$
  (initially)                *         *
  1. P2: Rd A[0]             *         A[0-3]
  2. P1: Rd A[1]             A[0-3]    A[0-3]
  3. P2: Wr A[0], 5
  4. P1: Wr A[1], 3

P1, P2 cache line size: 4 words

cache coherence false sharing w invalidate3
Cache Coherence: False Sharing w/ Invalidate

(Diagram: P1 and P2 caches and DRAM.)

  Current contents in:      P1$       P2$
  (initially)                *         *
  1. P2: Rd A[0]             *         A[0-3]
  2. P1: Rd A[1]             A[0-3]    A[0-3]
  3. P2: Wr A[0], 5          *         A[0-3]
  4. P1: Wr A[1], 3

P1, P2 cache line size: 4 words

cache coherence false sharing w invalidate4
Cache Coherence: False Sharing w/ Invalidate

(Diagram: P1 and P2 caches and DRAM.)

  Current contents in:      P1$       P2$
  (initially)                *         *
  1. P2: Rd A[0]             *         A[0-3]
  2. P1: Rd A[1]             A[0-3]    A[0-3]
  3. P2: Wr A[0], 5          *         A[0-3]
  4. P1: Wr A[1], 3          A[0-3]    *

P1, P2 cache line size: 4 words

false sharing
False Sharing
  • Different/same processors access different/same items in different/same cache block
  • Leads to ___________ misses
false sharing1
False Sharing
  • Different processors access different items in same cache block
  • Leads to ___________ misses
false sharing2
False Sharing
  • Different processors access different items in same cache block
  • Leads to coherence cache misses
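A standard remedy is to keep each processor's data in its own cache block. Below is a minimal sketch for the earlier sumArray example, assuming a 64-byte cache line (the line size, struct name, and C11 _Alignas usage are illustrative, not from the slides):

#define CACHE_LINE 64                     /* assumed cache block size (bytes) */
#define NUMPROCS   4

struct padded_sum {
    _Alignas(CACHE_LINE) int value;       /* C11: start each slot on its own line */
    char pad[CACHE_LINE - sizeof(int)];   /* fill out the rest of the block       */
};

struct padded_sum sumArray[NUMPROCS];

Each processor then accumulates into sumArray[myNum].value exactly as before, but its writes no longer invalidate the cache blocks the other processors are using.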
cache performance
Cache Performance

// Pn       = my processor number (rank)
// NumProcs = total active processors
// N        = total number of elements
// NElem    = N / NumProcs

for (i = 0; i < NElem; i++)
    A[NumProcs*i + Pn] = f(i);            // interleaved: every NumProcs-th element

vs.

for (i = Pn*NElem; i < (Pn+1)*NElem; i++)
    A[i] = f(i);                          // blocked: one contiguous chunk per processor

which is better
Which is better?
  • Both access the same number of elements
  • No processors access the same elements as each other
why is the second better
Why is the second better?
  • Both access the same number of elements
  • No processors access the same elements as each other
  • Better Spatial Locality
why is the second better1
Why is the second better?
  • Both access the same number of elements
  • No processors access the same elements as each other
  • Better Spatial Locality
  • Less False Sharing