Rethinking Parallel Execution

Guri Sohi

(along with Matthew Allen, Srinath Sridharan, Gagan Gupta)

University of Wisconsin-Madison


Outline

  • From sequential to multicore

  • Reminiscing: Instruction Level Parallelism (ILP)

  • Canonical parallel processing and execution

  • Rethinking canonical parallel execution

  • Dynamic Serialization

  • Consequences of Dynamic Serialization

  • Wrap up



Microprocessor Generations

  • Generation 1: Serial

  • Generation 2: Pipelined

  • Generation 3: Instruction-level Parallel (ILP)

  • Generation 4: Multiple processing cores



Microprocessor Generations

Gen 1: Sequential (1970s)

Gen 2: Pipelined (1980s)

Gen 3: ILP (1990s)

Gen 4: Multicore (2000s)



From One Generation to Next

  • Significant debate and research

    • New solutions proposed

    • Old solutions adapt in interesting ways to become viable or even better than new solutions

  • Solutions that involve changes “under the hood” end up winning over others


From One Generation to Next

  • From Sequential to Pipelined

    • RISC (MIPS, Sun SPARC, Motorola 88k, IBM PowerPC) vs. CISC (Intel x86)

    • CISC architectures learned and employed RISC innovations

  • From Pipelined to Instruction-Level Parallel

    • Statically scheduled VLIW/EPIC

    • Dynamically scheduled superscalar


From One Generation to Next

  • From ILP to Multicore

    • Parallelism based upon canonical parallel execution model

    • Overcome constraints to canonical parallelization

      • Thread-level speculation (TLS)

      • Transactional memory (TM)


Reminiscing about ILP

  • Late 1980s to mid 1990s

  • Search for “post RISC” architecture

    • More accurately, instruction processing model

  • Desire to do more than one instruction per cycle—exploit ILP

  • Majority school of thought: VLIW/EPIC

  • Minority: out-of-order (OOO) superscalar



VLIW/EPIC School

  • Parallel execution requires a parallel ISA

  • Parallel execution determined statically (by compiler)

  • Parallel execution expressed in static program

    • Take program/algorithm parallelism and mold it to a given execution schedule for exploiting parallelism



VLIW/EPIC School

  • Creating effective parallel representations (statically) introduces several problems

    • Predication

    • Statically scheduling loads

    • Exception handling

    • Recovery code

  • Lots of research addressing these problems

  • Intel and HP pushed it as their future (Itanium)



OOO Superscalar

  • Create dynamic parallel execution from sequential static representation

    • dynamic dependence information is accurate

    • execution schedule is flexible

  • None of the problems associated with trying to create a parallel representation statically

  • Natural growth path with no demands on software



Lessons from ILP Generation

  • Significant consequences of trying to statically detect and express parallelism

  • Techniques that make “under the hood” changes are the winners

    • Even though they may have some drawbacks/overheads



The Multicore Generation

How to achieve parallel execution on multiple processors?

Solution critical to the long-term health of the computer and information technology industry

And thus the economy and society as we know it



The Multicore Generation

  • How to achieve parallel execution on multiple processors?

  • Over four decades of conventional wisdom in parallel processing

    • Mostly in the scientific application/HPC arena

    • Use this as basis

      Parallel Execution Requires a Parallel Representation



Canonical Parallel Execution Model

A: Analyze program to identify independence in program

  • independent portions executed in parallel

B: Create static representation of independence

  • synchronization to satisfy independence assumption

C: Dynamic parallel execution unwinds as per static representation

  • potential consequences due to static assumptions
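
To make steps A, B, and C concrete, here is a minimal sketch (my illustration, not from the slides) of a canonically parallelized reduction in C++: independence is asserted statically by chunking the loop, and a mutex conservatively synchronizes the one potentially dependent update.

    #include <mutex>
    #include <thread>
    #include <vector>

    // A: analysis claims iterations are independent except for the shared total.
    // B: static representation = fixed chunking plus a mutex for the dependence.
    // C: execution unwinds exactly as the static chunking dictates.
    void parallel_sum(const std::vector<int>& data, long& total, int nthreads) {
        std::mutex m;
        std::vector<std::thread> workers;
        size_t chunk = data.size() / nthreads;
        for (int t = 0; t < nthreads; ++t) {
            size_t lo = t * chunk;
            size_t hi = (t == nthreads - 1) ? data.size() : lo + chunk;
            workers.emplace_back([&, lo, hi] {
                long local = 0;
                for (size_t i = lo; i < hi; ++i) local += data[i];
                std::lock_guard<std::mutex> g(m);  // conservative synchronization
                total += local;
            });
        }
        for (auto& w : workers) w.join();
    }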



Canonical Parallel Execution Model

  • Like VLIW/EPIC, the canonical model creates a variety of problems that have led to a vast body of research

    • identifying independence

    • creating static representation

    • dynamic unwinding



Identifying Independence

  • Static program analysis

    • Over four decades of work

  • Hard to identify statically

    • Inherently dynamic properties

    • Must be conservative statically

  • Need to identify dependence in order to identify independence
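
A tiny hypothetical illustration of why static analysis must be conservative: whether the two updates below are independent depends on whether p and q alias, an inherently dynamic property of the program's inputs.

    void update(int* p, int* q) {
        *p += 1;  // independent of the next statement only if p != q,
        *q += 2;  // which a compiler often cannot prove statically
    }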



Creating Static Representation

  • Parallel representation for guaranteed independent work

  • Insert synchronization for potential dependences

    • Conservative synchronization moves parallel execution towards sequential execution



Dynamic Unwinding

  • Non-determinism

    • Changes to program state may not be repeatable

  • Race conditions

  • Several startup companies have formed to deal with this problem
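
As a minimal illustration of the non-determinism (my example, not from the slides): two threads increment a shared counter without synchronization, racing on the read-modify-write, so the final value can differ from run to run.

    #include <cstdio>
    #include <thread>

    int counter = 0;  // shared and unsynchronized: a data race

    int main() {
        auto body = [] { for (int i = 0; i < 1000000; ++i) counter++; };
        std::thread t1(body), t2(body);
        t1.join();
        t2.join();
        std::printf("%d\n", counter);  // rarely 2000000; varies across runs
    }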



Conventional Wisdom

Parallel Execution Requires a Parallel Representation

Consequences:

  • Must create parallel representation

  • For correct execution, must statically identify:

    • Independence for parallel representation

    • Dependence for synchronization

  • Source of enormous difficulty and complexity

    • Generally functions of input to program

    • Inherently dynamic properties



Current Approaches

  • Stick with canonical model and try to overcome limitations

  • Thread Level Speculation (TLS) and Transactional Memory (TM)

  • Techniques to allow programmer to program sequentially but automatically generate parallel representation

  • Techniques to handle non-determinism and race conditions.



TLS and TM

  • Overcome major constraint to creating static parallel representation

  • Likely in several upcoming microprocessors

    • Our work in the mid 1990s will be a key enabler

      • Already in Sun MAJC, NEC Merlot, Sun Rock



Static Program Representation

  • Can we get parallel execution without a parallel representation?  Yes

  • Can dynamic parallelization extract parallelism that is inaccessible to static methods?  Yes



Serialization Sets: What?

  • Sequential program representation and dynamic parallel execution

    • No static representation of independence

    • No locks and no explicit synchronization

  • “Under the hood” run time system dynamically determines and orders dependent computations

    • Independence, and thus parallelism, falls out as a side effect

  • Comparable or better performance than conventional parallel models



How? Big Picture

  • Write program in a well-structured object-oriented style

    • Method operates on data of associated object (ver. 1)

  • Identify parts of program for potential parallel execution

    • Make suitable annotations as needed

  • Dynamically determine data object touched by selected code

    • Identify dependence

  • Program thread assigns selected code to bins



How? Big Picture

  • Serialize computations to same object

    • Enforce dependence

    • Assign them to same bin; delegate thread executes computations in same bin sequentially

  • Do not look for/represent independence

    • Falls out as an effect of enforcing dependence

    • Computations in different bins execute in parallel

  • Updates to given state in same order as in sequential program

    • Determinism

    • No races

    • If the sequential execution is correct, the parallel execution is correct (for the same input)



Big Picture

[Diagram: the program thread delegates computations to Delegate Threads 0, 1, and 2.]


Serialization Sets: How?

  • Sequential program with annotations

    • Identify potentially independent methods

    • Associate a serializer with each object to express dependence

  • Serializer groups dependent method invocations into a serialization set

    • Runtime executes them in order to honor dependences

  • Independent method invocations in different sets

    • Runtime opportunistically parallelizes execution



Example: Debit/Credit Transactions

    trans_t* trans;
    while ((trans = get_trans ()) != NULL) {
        account_t* account = trans->account;
        if (trans->type == DEPOSIT)
            account->deposit (trans->amount);
        else if (trans->type == WITHDRAW)
            account->withdraw (trans->amount);
    }

Several static unknowns: the number of transactions, where each trans->account points, and whether there is a loop-carried dependence.



Multithreading Strategy

  • Read all transactions into an array

  • Divide chunks of the array among multiple threads

Each thread is oblivious to which accounts the other threads may access, so the methods must lock the account to ensure mutual exclusion:

    // Each thread processes its chunk [lo, hi) of the transaction array
    for (i = lo; i < hi; i++) {
        account_t* account = trans[i]->account;
        if (trans[i]->type == DEPOSIT)
            account->deposit (trans[i]->amount);
        else if (trans[i]->type == WITHDRAW)
            account->withdraw (trans[i]->amount);
    }
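
A minimal sketch of what that locking implies (an assumed implementation, not from the slides): every method acquires the account's mutex, so threads touching the same account serialize.

    #include <mutex>

    // Hypothetical lock-based account_t for the chunked multithreaded version
    struct account_t {
        double balance = 0.0;
        std::mutex m;

        void deposit (double amount) {
            std::lock_guard<std::mutex> g(m);  // per-account mutual exclusion
            balance += amount;
        }
        void withdraw (double amount) {
            std::lock_guard<std::mutex> g(m);
            balance -= amount;
        }
    };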



Example with Serialization Sets

    private <account_t> private_account_t;  // declare wrapped account type

    begin_nest ();                          // initiate nesting level
    trans_t* trans;
    while ((trans = get_trans ()) != NULL) {
        private_account_t* account = trans->account;
        if (trans->type == DEPOSIT)
            account->delegate (deposit, trans->amount);
        else if (trans->type == WITHDRAW)
            account->delegate (withdraw, trans->amount);
    }
    end_nest ();                            // end nesting level, implicit barrier

delegate marks a potentially-independent operation. At execution, delegate:

  • Creates a method invocation structure

  • Gets the serializer pointer from the base class

  • Enqueues the invocation in the serialization set



[Diagram: in the program context, each delegate call routes a method invocation into the serialization set for its account. SS #100 holds deposit $300, deposit $2000, withdraw $20, and withdraw $50; SS #200 holds two withdraw $1000 invocations; SS #300 holds deposit $5000 and withdraw $350.]


[Diagram: delegate threads (Delegate 0, Delegate 1) drain the serialization sets SS #100, SS #200, and SS #300. Each set's invocations execute sequentially in program order, while different sets execute in parallel.]

Race-free, determinate execution without synchronization!



Prometheus: C++ Library for SS

  • Template library

    • Compile-time instantiation of SS data structures

    • Metaprogramming for static type checking

  • Runtime orchestrates parallel execution

  • Portable

    • x86, x86_64, SPARC V9

    • Linux, Solaris
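
A rough sketch of how such a template wrapper might work (my assumption about the mechanics, not Prometheus source): the wrapper packages a member-function invocation, and a per-object queue stands in for the serialization set that a delegate thread would drain.

    #include <functional>
    #include <queue>

    // Hypothetical wrapper in the spirit of Prometheus's private<T>
    template <typename T>
    class private_obj {
        T obj_;
        std::queue<std::function<void(T&)>> set_;  // stand-in serialization set
    public:
        // Package a method invocation; the real runtime would hand it to a
        // delegate thread rather than queue it locally.
        template <typename Ret, typename Arg>
        void delegate(Ret (T::*method)(Arg), Arg arg) {
            set_.push([method, arg](T& o) { (o.*method)(arg); });
        }
        // Drain the set in program order (what a delegate thread would do).
        void run_all() {
            while (!set_.empty()) { set_.front()(obj_); set_.pop(); }
        }
    };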



Prometheus Runtime

  • Version 1.0

    • Dynamically extracts parallelism

    • Statically scheduled

    • No nested parallelism

  • Version 2.0

    • Dynamically extracts parallelism

    • Dynamically scheduled

      • Work-stealing scheduler

    • Supports nested parallelism
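
For flavor, a minimal sketch of the work-stealing idea behind a scheduler like Version 2.0's (an illustrative data structure, not the Prometheus implementation): the owning thread pushes and pops at one end of its queue, and idle threads steal from the other end.

    #include <deque>
    #include <functional>
    #include <mutex>
    #include <optional>

    // Illustrative work-stealing queue; real schedulers use lock-free deques.
    class WorkQueue {
        std::deque<std::function<void()>> tasks_;
        std::mutex m_;
    public:
        void push(std::function<void()> t) {
            std::lock_guard<std::mutex> g(m_);
            tasks_.push_back(std::move(t));
        }
        std::optional<std::function<void()>> pop() {    // owner takes newest task
            std::lock_guard<std::mutex> g(m_);
            if (tasks_.empty()) return std::nullopt;
            auto t = std::move(tasks_.back());
            tasks_.pop_back();
            return t;
        }
        std::optional<std::function<void()>> steal() {  // thief takes oldest task
            std::lock_guard<std::mutex> g(m_);
            if (tasks_.empty()) return std::nullopt;
            auto t = std::move(tasks_.front());
            tasks_.pop_front();
            return t;
        }
    };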



Network Packet Classification

    packet_t* packet;
    classify_t* classifier;
    vector<int> ruleCount(num_rules);
    vector<packet_queue_t> packet_queues;
    int packetCount = 0;

    for (i = 0; i < packet_queues.size(); i++) {
        while ((packet = packet_queues[i].get_pkt()) != NULL) {
            ruleID = classifier->softClassify (packet);
            ruleCount[ruleID]++;
            packetCount++;
        }
    }


Example with Serialization Sets

    private <classify_t> private_classify_t;  // declare wrapped classifier type
    vector<private_classify_t> classifiers;
    int packetCount = 0;
    vector<int> ruleCount(numRules, 0);
    int size = packet_queues.size();

    begin_nest ();
    for (i = 0; i < size; i++) {
        classifiers[i].delegate (&classifier_t::softClassify, packet_queues[i]);
    }
    end_nest ();

    // Reduce the per-classifier counts after the implicit barrier
    for (i = 0; i < size; i++) {
        ruleCount += classifiers[i].getRuleCount();
        packetCount += classifiers[i].getPacketCount();
    }



Network Intrusion Detection

  • Very common networking application

  • Most common program used: Snort

    • Open source version (like Linux)

    • But also commercial versions (Sourcefire)

  • Basic structure of computation also found in many other deep packet inspection applications

    • E.g., packet de-duplication (Riverbed)



Other Applications

  • Benchmarks

    • Lonestar, NU-MineBench, PARSEC, Phoenix

  • Conventional Parallelization

    • pthreads, OpenMP

  • Prometheus versions

    • Port each program to a sequential C++ program

    • Idiomatic C++: OO, inheritance, STL

    • Parallelize with serialization sets



Statically Scheduled Results

4-socket AMD Barcelona (4 cores per socket) = 16 total cores




Summary

  • Sequential program with annotations

    • No explicit synchronization, no locks

  • Programmers focus on keeping computation private to object state

    • Consistent with OO programming practices

  • Dependence-based model

    • Determinate race-free parallel execution

  • Do as well or better than incumbents but without their negatives

  • Can do things that are very hard for incumbents


