
The potential for Software-only thread-level speculation - PowerPoint PPT Presentation



The potential for Software-only thread-level speculation. Depth Oral Presentation. Co-Supervisors: Prof. Greg Steffan, Prof. Cristina Amza. Committee Members: Prof. Tarek Abdelrahman, Prof. Michael Voss, Prof. Ken Sevcik. By: Chuck (Chengyan) Zhao. April 25, 2005.

The potential for Software-only thread-level speculation

Depth Oral Presentation

Co-Supervisors: Prof. Greg Steffan

Prof. Cristina Amza

Committee Members

Prof. Tarek Abdelrahman

Prof. Michael Voss

Prof. Ken Sevcik

By: Chuck (Chengyan) Zhao

April 25, 2005


Chip Multi-Processor (CMP) is now everywhere

From all major companies:

  • IBM: Power 4, Power 5 …

  • Intel: Montecito, Smithfield …

  • AMD: Dual-core Opteron

  • Sun: MAJC

  • Sony, Toshiba, IBM: Cell

  • … …

[Figure: chip photos labelled Power 4, Dual-core Intel chip, Dual-core Opteron, Cell]

Abundant Chip Multiprocessors


Improving Throughput with a Chip Multi-Processor

[Figure: multiprogramming workload; applications mapped onto the processors (P) and caches (C) of a CMP, with an execution-time axis]

improve throughput


Improving Single Application Performance with a Chip Multi-Processor

[Figure: single application mapped onto the processors (P) and caches (C) of a CMP, with an execution-time axis]

need parallel threads to reduce execution time


Using Chip Multi-Processor for improvements

  • Improve throughput for multi-programming workload

    • Easy

    • CMP behaves like a normal MP

  • Improve single-application performance

    • Hard

    • Control and Data Dependence

    • Proposed approach: Thread-Level Speculation (TLS)

CMP trade-offs


Thread-Level Speculation (TLS)

[Figure: flowchart. Compile time: parallelize without dependency detection. Run time: detect violation; if No, commit modification; if Yes, squash and re-execute.]

  • Enable compiler to create parallel threads despite the existence of ambiguous data dependence

  • Optimistically parallelize at compile time

  • Detect violations and recover at runtime

Optimistic at compile time, detect and recover at runtime


Example of Thread-Level Speculation

Code to parallelize

for ( … ) {
    *p = …;
    … = … … *q;
}

Cannot be parallelized by a parallelizing compiler

  • Uncertain dependence between *p and *q

  • Might be runtime or user-input dependent

Break loop iterations into threads, explore uncertainty in each thread
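The uncertainty in the loop above can be made concrete with a small sketch. It assumes, purely for illustration, that the slide's pointers `p` and `q` are modelled as index arrays into one buffer, so that the aliasing pattern is only known once the arrays are filled in at run time. The names `run_loop` and `has_cross_iteration_dependence` are hypothetical and do not come from the presentation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical loop of the shape shown on the slide. Each iteration
 * writes through p[i] and then reads through q[i]. Whether iteration i
 * depends on an earlier iteration is decided by the index arrays,
 * which may come from user input at run time. */
int run_loop(int *data, const size_t *p, const size_t *q, size_t n)
{
    int sum = 0;
    assert(data != NULL && p != NULL && q != NULL);
    for (size_t i = 0; i < n; i++) {
        data[p[i]] = (int)i + 1;   /* *p = ...  */
        sum += data[q[i]];         /* ... = *q  */
    }
    return sum;
}

/* Returns 1 if some iteration reads a location that an EARLIER
 * iteration wrote (a cross-iteration flow dependence), 0 otherwise.
 * A static compiler cannot evaluate this when p and q are unknown. */
int has_cross_iteration_dependence(const size_t *p, const size_t *q,
                                   size_t n)
{
    for (size_t i = 1; i < n; i++)
        for (size_t j = 0; j < i; j++)
            if (q[i] == p[j])
                return 1;
    return 0;
}
```

With `p = {0, 1, 2}` and `q = {3, 4, 5}` the iterations are independent and could run in parallel; with `q = {3, 0, 1}` each iteration reads what the previous one wrote, so the same source loop is serial. TLS targets exactly this case, where the answer is unavailable at compile time.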


How Thread-Level Speculation works

[Figure: TLS execution-time diagram with labels: *q, violation, *p…, Recover, …*q]

exploit available thread-level parallelism
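The detect-and-recover cycle can be sketched as a sequential simulation. This is an illustration under stated assumptions, not the scheme from the presentation: every iteration (called an epoch here) performs one write and one read, all epochs first run against the pre-loop memory as if started simultaneously, and commits then happen in original loop order. The names `epoch_t`, `run_epoch`, and `tls_loop` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_EPOCHS = 64 };

typedef struct {
    size_t write_addr;  /* location written by the epoch          */
    int    write_val;   /* buffered (not yet committed) value     */
    size_t read_addr;   /* location read by the epoch             */
    int    read_val;    /* value the epoch observed for that read */
} epoch_t;

/* Execute iteration i: mem[p[i]] = i + 1; then read mem[q[i]].
 * The write stays in the epoch's buffer; the read sees the epoch's
 * own buffered write first, otherwise the given memory image. */
static void run_epoch(epoch_t *e, const int *mem, size_t i,
                      const size_t *p, const size_t *q)
{
    e->write_addr = p[i];
    e->write_val  = (int)i + 1;
    e->read_addr  = q[i];
    e->read_val   = (q[i] == p[i]) ? e->write_val : mem[q[i]];
}

/* Returns the sum of all values read; *squashes counts re-executions. */
int tls_loop(int *mem, const size_t *p, const size_t *q, size_t n,
             size_t *squashes)
{
    epoch_t ep[MAX_EPOCHS];
    int sum = 0;

    assert(n <= MAX_EPOCHS);
    *squashes = 0;

    /* Speculate: all epochs run against the pre-loop memory. */
    for (size_t i = 0; i < n; i++)
        run_epoch(&ep[i], mem, i, p, q);

    /* Commit in loop order, checking each epoch for a violation. */
    for (size_t i = 0; i < n; i++) {
        int violated = 0;
        if (ep[i].read_addr != ep[i].write_addr)
            for (size_t j = 0; j < i; j++)
                if (ep[j].write_addr == ep[i].read_addr)
                    violated = 1;  /* read a stale value */
        if (violated) {
            /* Squash and re-execute against the committed state. */
            run_epoch(&ep[i], mem, i, p, q);
            (*squashes)++;
        }
        mem[ep[i].write_addr] = ep[i].write_val;
        sum += ep[i].read_val;
    }
    return sum;
}
```

The result always matches sequential execution; what varies with the input is only how many epochs are squashed, which is the cost that speculation pays when the optimistic assumption fails.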


Thread-Level Speculation quick summary

  • Benefits

    • Reduce inter-thread communication time among cores

    • Scale

    • New parallel programming model

  • Types of implementations

    • Hardware only

    • Combined hardware and software

    • Software only

Thread-Level Speculation is good for Chip Multi-Processor


Thread-Level Speculation Implementation Diagram

[Figure: Thread-Level Speculation diagram with labels: SW-only approach, HW-only approach, Our approach]

Overall picture of Thread-Level Speculation


Thread-Level Speculation Implementation Comparison

  • Hardware-only approach

    • Lots of research

    • Good speedup in simulation

    • Nobody has built it yet

      • costly, risky

      • needs both HW + SW at the same time

    • Outcome

      • HW-only TLS looks promising

      • Significant hardware changes

  • Software-only approach: limited work, limited progress

    • Major problem: high overhead

      • Buffer memory for speculative states

      • Track each memory read + write: violation detection

      • Recover from failed speculation: re-execution

Quick summary on HW-only and SW-only approaches


Outline for the rest of the talk

  • Hardware TLS schemes

  • Software TLS schemes

  • Our scheme

    • Our goals

    • Starting point

    • Potential applications

  • Conclusion


Hardware-only Thread-Level Speculation

[Figure: Thread-Level Speculation diagram with labels: SW-only approach, HW-only approach, Our approach]

Overall picture of HW-only TLS approach


Hardware Thread-Level Speculation Schemes

  • Lots of hardware TLS research

    • CMU Stampede

    • Stanford Hydra

    • Wisconsin Multiscalar

    • UIUC I-ACOMA

    • UMN Super-threaded architecture

  • Convergence of hardware schemes

    • Use cache to buffer speculative state

    • Extend cache coherence protocol to track data dependence

Convergence of HW-only Thread-Level Speculation


Hardware TLS Schemes: quick summary

Result

  • TLS is promising

  • SPECint improvement: 30% - 100%

  • Depends on aggressiveness of the hardware support

[Figure: CMP with hardware speculative buffer and enhanced cache coherence protocol; processors (P), caches (C) holding speculative state (Sp-state), and a non-speculative cache]

Convergence of HW-only Thread-Level Speculation


Software-only Thread-Level Speculation

[Figure: Thread-Level Speculation diagram with labels: SW-only approach, HW-only approach, Our approach]

Overall picture of SW-only TLS approach


Software-only Thread-Level Speculation Schemes

  • LRPD Test: UIUC

  • VM for dependence tracking: Spiros’s, CMU

  • Cintra’s SW TLS: U Edinburgh

  • Problem of software-only approach: high overhead

  • Try to reduce it

  • overview of SW-only TLS approach


    LRPD Test (UIUC)

    [Figure: execution-time diagram; software dependence tracking, followed by the check "was parallel execution safe?"]

    + implemented entirely in software

    – applies only to array-based code

    – no partial parallelism: the entire loop will re-execute sequentially if there is any dependence

    Pros + Cons of LRPD
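The all-or-nothing behaviour can be illustrated with a much-simplified shadow-array check written in the spirit of the LRPD test. It is not the actual LRPD algorithm: privatization and reduction handling are omitted, the loop body (one write, then one read per iteration) is an assumption, and "parallel" execution is simulated by running the iterations in reverse order. All names (`shadow_t`, `doall_test_passes`, `run_checked`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { N_ELEMS = 8 };

typedef struct {
    bool written[N_ELEMS];       /* written by some iteration          */
    bool exposed_read[N_ELEMS];  /* read without a prior write in the
                                    same iteration                     */
    int  writers[N_ELEMS];       /* how many iterations wrote it       */
} shadow_t;

/* Mark one iteration that writes a[w] and then reads a[r]. */
static void mark_iteration(shadow_t *s, size_t w, size_t r)
{
    s->written[w] = true;
    s->writers[w]++;
    if (r != w)
        s->exposed_read[r] = true;
}

/* Post-loop analysis: true if no cross-iteration dependence was seen. */
bool doall_test_passes(const size_t *w, const size_t *r, size_t n)
{
    shadow_t s;
    memset(&s, 0, sizeof s);
    for (size_t i = 0; i < n; i++) {
        assert(w[i] < N_ELEMS && r[i] < N_ELEMS);
        mark_iteration(&s, w[i], r[i]);
    }
    for (size_t e = 0; e < N_ELEMS; e++) {
        if (s.written[e] && s.exposed_read[e])
            return false;        /* possible flow or anti dependence */
        if (s.writers[e] > 1)
            return false;        /* output dependence */
    }
    return true;
}

/* Run speculatively; on failure restore the array and redo the whole
 * loop sequentially. `a` must have N_ELEMS elements. */
int run_checked(int *a, const size_t *w, const size_t *r, size_t n,
                bool *reexecuted)
{
    int saved[N_ELEMS];
    int sum = 0;

    memcpy(saved, a, sizeof saved);
    for (size_t k = n; k-- > 0; ) {      /* simulated parallel order */
        a[w[k]] = (int)k + 1;
        sum += a[r[k]];
    }
    *reexecuted = !doall_test_passes(w, r, n);
    if (*reexecuted) {
        memcpy(a, saved, sizeof saved);  /* discard speculative work */
        sum = 0;
        for (size_t i = 0; i < n; i++) { /* sequential re-execution  */
            a[w[i]] = (int)i + 1;
            sum += a[r[i]];
        }
    }
    return sum;
}
```

A single conflicting element is enough to throw away all speculative work, which is the "no partial parallelism" drawback listed above.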


    Dependence tracking using Virtual Memory

    [Figure: execution-time diagram; software dependence tracking through VM pages, with virtual memory synchronization by transferring VM pages]

    Pros + Cons of VM Tracking


    CMU Spiros’s approach -- Dependence tracking using Virtual Memory

    • Coarse-grain, software-only

    • Based on memory tracking

      • virtual memory page protection mechanism

      • use software DSM (TreadMarks)

      • Synchronization through VM pages through cost analysis

    • Overhead is prohibitive

      • 2 sec sequential vs. 5 min parallel

      • Not a viable approach at this coarse granularity

    SW-TLS through VM Tracking is not attractive


    Cintra’s SW TLS: Memory tracking tuned for performance

    [Figure: execution-time diagram; efficient tracking for array references]

    Efficient but custom-made for arrays only

    Cintra’s software-only Thread-Level Speculation: quick summary

    • Features

      • Software simulation of an extended cache coherence protocol

        • Provides a speculative state transition table

      • Violation detection through speculative state comparison

      • Instruments each load and store

    • Pros + Cons:

      • + advanced implementation of LRPD test

      • + implemented entirely in software

      • + covers partial parallelism

      • – hand-crafted code for performance

      • – applies only to array-based code

    Summary of Cintra’s work


    Problems with Software Thread-Level Speculation

    • High overhead

      • Buffer speculative state

      • Track data dependence for all memory references

      • Re-execute in case of failed speculation

  • Potential speedup

    • largely unexplored

  • Possible directions for future research

    • Reduce overhead

    • Achieve speedup from TLS parallelism

  • Summary of Software TLS


    Our current Thread-Level Speculation approach

    [Figure: Thread-Level Speculation diagram with labels: SW-only approach, HW-only approach, Our approach]

    Overall position for our SW TLS approach


    Long-term future plan

    • Goals

      • Target

        • Chip Multi-Processors

        • Tightly-coupled MPs

      • Apply to general-purpose code: not only arrays

      • Minimize overhead

        • Capitalize on compiler analysis and optimizations

          • Idempotency analysis <done>

          • Synchronization and communications <done>

          • PPA: Probabilistic pointer analysis Framework (Jeff’s work) <progressing>

          • Minimal backup and buffer retrieval analysis <progressing>

          • … more analysis we will invent <todo>

    • SW-only approach: room to improve

    • Starting point: highly efficient software checkpointing

    Goals and Plans


    Starting point: efficient software checkpointing

    • Checkpoints are placed at selected program points in the source code

    • Buffer state changes between the current execution point and its latest checkpoint

    • Execution can always efficiently rewind to its latest checkpoint

    [Figure: program execution timeline with software checkpoints; memory changes are buffered between checkpoints]

    Introduce software checkpointing


    Potential use of Software checkpointing

    • Software Rollback

      • automatic software TLS support

      • foundation of future automatic TLS parallelization

    • Debug

      • controlled rewind

    • Enhance application reliability

    • Speculative optimizations in uni-processor programs

      • larger window size

      • deep branch speculation

      • speculative code motion

    what can software checkpointing do


    Software checkpointing schemes

    • Compiler analysis

      • Local: Basic Block level

        • Back up only the memory writes that need it

        • Optimize to minimize

          • Number of backups

          • Number of buffer retrievals

      • Global: procedure level

        • Populate buffers through control-flow graph

        • Iterate until buffer stabilizes

      • Inter-procedural level

    • Potential approaches for software backup

      • Undo backup

      • Todo backup

    build software checkpointing


    Undo backup

    • Compile-time analysis

    • Back up once

      • per distinct memory write

      • per Basic Block

    • Program continues to operate on the original (non-backup) memory

    • Action upon execution completion

      • Commit: trash buffer

      • Rollback: restore from buffer

    undo backup properties


    Undo backup example

    Program, Basic Block level:

    a = 10;
    b = 12;
    c = a + b;

    Undo backup memory (address, saved old value):

    (&a, [a])
    (&b, [b])
    (&c, [c])

    Undo backup action:

    conflicts check

      • Y: restore undo memory

      • N: trash undo memory

    Next Basic Block

    undo backup process
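The undo process above can be sketched as a small undo log. This is a minimal illustration, not the presenter's implementation: in the scheme described, the compiler decides statically where each backup goes, whereas this sketch stands in for that analysis with a run-time "already backed up" check. The names `undo_log_t`, `undo_backup`, `undo_commit`, `undo_rollback`, and `undo_block` are hypothetical, and the variables are passed by pointer only so that a caller can observe the effect.

```c
#include <assert.h>
#include <stddef.h>

enum { UNDO_CAPACITY = 16 };

typedef struct { int *addr; int old_value; } undo_entry_t;

typedef struct {
    undo_entry_t entries[UNDO_CAPACITY];
    size_t count;
} undo_log_t;

/* Save the current contents of *addr, once per distinct address. */
void undo_backup(undo_log_t *log, int *addr)
{
    for (size_t i = 0; i < log->count; i++)
        if (log->entries[i].addr == addr)
            return;                      /* already backed up */
    assert(log->count < UNDO_CAPACITY);
    log->entries[log->count].addr = addr;
    log->entries[log->count].old_value = *addr;
    log->count++;
}

/* Commit: memory already holds the new values, so trash the buffer. */
void undo_commit(undo_log_t *log)
{
    log->count = 0;
}

/* Rollback: restore every saved value, newest entry first. */
void undo_rollback(undo_log_t *log)
{
    while (log->count > 0) {
        log->count--;
        *log->entries[log->count].addr =
            log->entries[log->count].old_value;
    }
}

/* The slide's basic block, with a backup before each first write.
 * The program keeps operating directly on the original memory. */
void undo_block(undo_log_t *log, int *a, int *b, int *c)
{
    undo_backup(log, a);
    *a = 10;
    undo_backup(log, b);
    *b = 12;
    undo_backup(log, c);
    *c = *a + *b;        /* plain reads: no buffer lookup needed */
}
```

A caller would run `undo_block`, perform the conflict check, and then call either `undo_rollback` (conflict) or `undo_commit` (no conflict) before moving to the next basic block.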


    Todo backup

    • Performed at runtime

    • Happens on every memory write inside a Basic Block

    • Each following read might need to retrieve from the buffer

    • Action upon completion (reverse of Undo type)

      • Commit: write-back from buffer

      • Rollback: trash buffer

    todo backup properties


    Todo backup example

    Program, Basic Block level:

    *p = a;
    *q = b;
    …*p + *q;

    todo backup memory (address, new value):

    (p, a)
    (q, b)

    conflicts check

      • Y: trash todo backup

      • N: write todo backup to memory

    Next Block

    todo backup process
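The todo process above can be sketched as a write buffer that reads must consult. This is a minimal illustration under assumptions: the buffer is a small linear array, and a repeated write to the same address updates the existing entry, which is a simplification chosen here rather than something the slides specify. The names `todo_buf_t`, `todo_write`, `todo_read`, `todo_commit`, `todo_rollback`, and `todo_block` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { TODO_CAPACITY = 16 };

typedef struct { int *addr; int new_value; } todo_entry_t;

typedef struct {
    todo_entry_t entries[TODO_CAPACITY];
    size_t count;
} todo_buf_t;

/* Buffer a write instead of touching memory. */
void todo_write(todo_buf_t *buf, int *addr, int value)
{
    for (size_t i = 0; i < buf->count; i++) {
        if (buf->entries[i].addr == addr) {
            buf->entries[i].new_value = value;
            return;
        }
    }
    assert(buf->count < TODO_CAPACITY);
    buf->entries[buf->count].addr = addr;
    buf->entries[buf->count].new_value = value;
    buf->count++;
}

/* Every read checks the buffer first, then falls back to memory. */
int todo_read(const todo_buf_t *buf, const int *addr)
{
    for (size_t i = 0; i < buf->count; i++)
        if (buf->entries[i].addr == addr)
            return buf->entries[i].new_value;
    return *addr;
}

/* Commit: write the buffered values back to memory. */
void todo_commit(todo_buf_t *buf)
{
    for (size_t i = 0; i < buf->count; i++)
        *buf->entries[i].addr = buf->entries[i].new_value;
    buf->count = 0;
}

/* Rollback: memory was never modified, so just trash the buffer. */
void todo_rollback(todo_buf_t *buf)
{
    buf->count = 0;
}

/* The slide's basic block: *p = a; *q = b; ... *p + *q; */
int todo_block(todo_buf_t *buf, int *p, int *q, int a, int b)
{
    todo_write(buf, p, a);
    todo_write(buf, q, b);
    return todo_read(buf, p) + todo_read(buf, q);
}
```

Because addresses are compared at run time, the sketch stays correct even when `p` and `q` alias, which is why this style suits general-purpose pointers; the price is a buffer lookup on each read.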


    Backup Comparison

    • Undo

      • Pro: fast

        • Few backups

        • No need to retrieve from the buffer on reads

      • Con: memory address needs to be known statically

        • Scalar

        • Pointer to fixed location

    • Todo

      • Pro: handles both scalar and general-purpose pointer cases

      • Con: slow

        • Backs up on every memory write

        • Each following read may need to be retrieved from the buffer

    • In reality: both types are used

    pros + cons of undo and todo


    An example in reality: mixed mode

    Code to execute:

    int a, b, c;
    int * p, * q;

    (d) a = 1;
    (d) b = 2;
    (d) *p = 5;
    … …
    (u) c = a + b;
    … …
    (u) … = * q;

    Undo buffer:

    (&a, [a])
    (&b, [b])
    (&c, [c])

    Todo buffer:

    (p, 5)

    combined-backup process in reality
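The mixed-mode example can be sketched by keeping both buffers in one checkpoint structure. This is an illustration under a loudly stated assumption: the pointers `p` and `q` never point at the scalars `a`, `b`, or `c`. Without that guarantee, a direct read of `a` could miss a buffered write made through `p`. The slides do not say how this is established; the sketch simply asserts it. All names (`checkpoint_t`, `cp_*`, `mixed_block`) are hypothetical, and the scalars are passed by pointer only so that a caller can observe them.

```c
#include <assert.h>
#include <stddef.h>

enum { CP_CAPACITY = 16 };

typedef struct { int *addr; int value; } cp_entry_t;

typedef struct {
    cp_entry_t undo[CP_CAPACITY]; size_t n_undo; /* (address, old value) */
    cp_entry_t todo[CP_CAPACITY]; size_t n_todo; /* (address, new value) */
} checkpoint_t;

void cp_init(checkpoint_t *cp) { cp->n_undo = 0; cp->n_todo = 0; }

/* Undo path: save the old value, then the caller writes memory. */
static void cp_save_old(checkpoint_t *cp, int *addr)
{
    assert(cp->n_undo < CP_CAPACITY);
    cp->undo[cp->n_undo].addr = addr;
    cp->undo[cp->n_undo].value = *addr;
    cp->n_undo++;
}

/* Todo path: buffer the new value, leave memory untouched. */
static void cp_buffer_write(checkpoint_t *cp, int *addr, int value)
{
    assert(cp->n_todo < CP_CAPACITY);
    cp->todo[cp->n_todo].addr = addr;
    cp->todo[cp->n_todo].value = value;
    cp->n_todo++;
}

/* Todo path read: the newest buffered entry wins, else memory. */
static int cp_buffered_read(const checkpoint_t *cp, const int *addr)
{
    for (size_t i = cp->n_todo; i-- > 0; )
        if (cp->todo[i].addr == addr)
            return cp->todo[i].value;
    return *addr;
}

/* Commit: apply todo writes in order, discard the undo entries. */
void cp_commit(checkpoint_t *cp)
{
    for (size_t i = 0; i < cp->n_todo; i++)
        *cp->todo[i].addr = cp->todo[i].value;
    cp_init(cp);
}

/* Rollback: restore undo entries newest first, discard todo writes. */
void cp_rollback(checkpoint_t *cp)
{
    while (cp->n_undo > 0) {
        cp->n_undo--;
        *cp->undo[cp->n_undo].addr = cp->undo[cp->n_undo].value;
    }
    cp->n_todo = 0;
}

/* The slide's code: scalars use undo, pointer accesses use todo. */
int mixed_block(checkpoint_t *cp, int *a, int *b, int *c,
                int *p, const int *q)
{
    assert(p != a && p != b && p != c);   /* no-alias assumption */
    assert(q != a && q != b && q != c);
    cp_save_old(cp, a); *a = 1;
    cp_save_old(cp, b); *b = 2;
    cp_buffer_write(cp, p, 5);            /* *p = 5       */
    cp_save_old(cp, c); *c = *a + *b;     /* c = a + b    */
    return cp_buffered_read(cp, q);       /* ... = *q     */
}
```

Rollback and commit each touch both buffers, but in opposite ways: undo entries are restored on rollback and dropped on commit, while todo entries are dropped on rollback and written back on commit.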


    Selection of backups in reality

    • Combined approach

      • Undo: memory address known

        • Scalars

        • Pointers to fixed address

        • Compile-time analysis

      • Todo: memory address unknown

        • Normal pointers

        • Run-time analysis

    • Plan for implementation

      • Implement in SUIF as an optimization pass

      • Minimize performance drop

    use both types together in reality


    Conclusion

    • Thread-Level Speculation is compelling

      • Potentially large performance gains

    • Challenge

      • Software overhead

    • Limited SW TLS work

      • No previous SW TLS works on general-purpose programs

      • Killer advantage: compiler analyses

      • Modest starting point

        • efficient software checkpointing

    summary



    Concurrent HW-only Related Work

    Another view of HW-only Thread-Level Speculation Schemes

