The potential for Software-only thread-level speculation

Depth Oral Presentation

Co-Supervisors: Prof. Greg Steffan, Prof. Cristina Amza

Committee Members: Prof. Tarek Abdelrahman, Prof. Michael Voss, Prof. Ken Sevick

By: Chuck (Chengyan) Zhao

April 25, 2005


Chip Multi-Processor (CMP) is now everywhere

From all major companies:
  • IBM: Power 4, Power 5 …
  • Intel: Montecito, Smithfield …
  • AMD: Dual-core Opteron
  • Sun: MAJC
  • Sony, Toshiba, IBM: Cell
  • … …

[Figure: Power 4, dual-core Intel chip, dual-core Opteron, Cell]

Abundant Chip Multiprocessors

Improving Throughput with a Chip Multi-Processor

[Figure: multiprogramming workload; several applications plotted against execution time on a CMP with processors (P) and caches (C)]

Improve throughput

Improving Single Application Performance with a Chip Multi-Processor

[Figure: a single application plotted against execution time on a CMP with processors (P) and caches (C)]

Need parallel threads to reduce execution time

Using Chip Multi-Processor for improvements
  • Improve throughput for multi-programming workload
    • Easy
    • CMP behaves like a normal MP
  • Improve single-application performance
    • Hard
    • Control and Data Dependence
    • Proposed approach: Thread-Level Speculation (TLS)

CMP trade-offs

Thread-Level Speculation (TLS)
  • Enable the compiler to create parallel threads despite ambiguous data dependences
  • Optimistically parallelize at compile time
  • Detect violations and recover at runtime

[Figure: TLS flow. Compile time: parallelize without dependency detection. Run time: detect violation; if no, commit modification; if yes, squash and re-execute.]

Optimistic at compile time, detect and recover at runtime

Example of Thread-Level Speculation

Code to parallelize:

for ( … ) {
    *p = …;
    … = … … *q;
}

Cannot be parallelized by parallelizing compilers

  • Uncertain dependence between *p and *q
  • Might be runtime or user-input dependent

Break loop iterations into threads, explore uncertainty in each thread
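To make the ambiguity concrete, here is a minimal C sketch that is not taken from the slides. The names run_loop, iterations_are_independent, wr_off and rd_off are hypothetical, and because the slide elides the loop bounds and right-hand sides, the assignments below are placeholders. Whether two iterations conflict depends only on the runtime offsets, which is what a compiler cannot know statically.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instance of the slide's loop: p and q both point into
 * the same array, at offsets that are only known at run time. */
long run_loop(int *a, size_t wr_off, size_t rd_off, size_t n)
{
    assert(a != NULL);
    int *p = a + wr_off;
    const int *q = a + rd_off;
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        p[i] = (int)(i + 1);  /* stands in for "*p = ...;"      */
        sum += q[i];          /* stands in for "... = ... *q;"  */
    }
    return sum;
}

/* Returns 1 when no two different iterations touch the same element,
 * i.e. the iterations could safely run in parallel. */
int iterations_are_independent(size_t wr_off, size_t rd_off, size_t n)
{
    size_t dist = wr_off > rd_off ? wr_off - rd_off : rd_off - wr_off;
    /* dist == 0: each iteration reads only what it wrote itself.
     * dist >= n: the written and read ranges do not overlap.      */
    return dist == 0 || dist >= n;
}
```

With wr_off = 1 and rd_off = 0, every iteration reads the element the previous iteration wrote, so the loop is serial; with offsets 0 and 16 and n = 8 the same code is fully parallel.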

How Thread-Level Speculation works

[Figure: threads plotted against execution time under TLS; a read of *q in one thread violates a dependence on a write to *p in another thread, and the violating thread recovers]

Exploit available thread-level parallelism
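The speculate, detect, and recover cycle can be sketched in C as follows. This is an illustrative single-threaded simulation under stated assumptions, not the scheme of any system named in this talk: the type epoch_t, the functions run_epoch and tls_run, and the fixed MEM_WORDS memory are all hypothetical, and each "thread" is reduced to one write and one read.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MEM_WORDS 16

/* One loop iteration ("epoch"): mem[wr] = val; out = mem[rd] + 1; */
typedef struct {
    size_t wr;   /* index written by this epoch                 */
    int    val;  /* value written (kept private until commit)   */
    size_t rd;   /* index read by this epoch                    */
    int    out;  /* result computed from the value read         */
} epoch_t;

/* Execute the epoch's read against a given view of memory. The
 * epoch's own buffered write is forwarded to its own read. */
static void run_epoch(epoch_t *e, const int *view)
{
    e->out = (e->rd == e->wr ? e->val : view[e->rd]) + 1;
}

/* Runs n epochs speculatively, commits them in loop order, and
 * returns how many epochs were squashed and re-executed. */
int tls_run(int *mem, epoch_t *ep, size_t n)
{
    int snapshot[MEM_WORDS];
    int written[MEM_WORDS] = {0};
    int squashes = 0;

    memcpy(snapshot, mem, sizeof snapshot);

    /* Speculative phase: every epoch sees the pre-loop memory. On a
     * CMP these would run on different cores; here they run in turn. */
    for (size_t i = 0; i < n; i++) {
        assert(ep[i].wr < MEM_WORDS && ep[i].rd < MEM_WORDS);
        run_epoch(&ep[i], snapshot);
    }

    /* Commit phase, in original loop order. */
    for (size_t i = 0; i < n; i++) {
        /* Violation: an earlier epoch wrote what this epoch read. */
        if (ep[i].rd != ep[i].wr && written[ep[i].rd]) {
            squashes++;
            run_epoch(&ep[i], mem);  /* re-execute on committed state */
        }
        mem[ep[i].wr] = ep[i].val;
        written[ep[i].wr] = 1;
    }
    return squashes;
}
```

Only the epochs that actually read stale data pay the re-execution cost; the rest keep the work they did in parallel.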

Thread-Level Speculation quick summary
  • Benefits
    • Reduce inter-thread communication time among cores
    • Scale
    • New parallel programming model
  • Types of implementations
    • Hardware only
    • Combined hardware and software
    • Software only

Thread-Level Speculation is good for Chip Multi-Processor

Thread-Level Speculation Implementation Diagram

[Figure: Thread-Level Speculation implementations, showing the SW-only approach, the HW-only approach, and our approach]

Overall picture of Thread-Level Speculation

Thread-Level Speculation Implementation Comparison
  • Hardware-only approach
    • Lots of research
    • Good speedup in simulation
    • Nobody has built it yet
      • costly, risky
      • needs both HW + SW at the same time
    • Outcome
      • HW-only TLS looks promising
      • Significant hardware changes
  • Software-only approach: limited work, limited progress
    • Major problem: high overhead
      • Buffer memory for speculative states
      • Track each memory read + write: violation detection
      • Recover from failed speculation: re-execution

Quick summary on HW-only and SW-only approaches

Outline for the rest of the talk
  • Hardware TLS schemes
  • Software TLS schemes
  • Our scheme
    • Our goals
    • Starting point
    • Potential applications
  • Conclusion

Hardware-only Thread-Level Speculation

[Figure: Thread-Level Speculation implementations (SW-only approach, HW-only approach, our approach), positioned on the HW-only approach]

Overall picture of HW-only TLS approach

Hardware Thread-Level Speculation Schemes
  • Lots of hardware TLS research
    • CMU Stampede
    • Stanford Hydra
    • Wisconsin Multiscalar
    • UIUC IA-COMA
    • UMN Super-threaded architecture
  • Convergence of hardware schemes
    • Use cache to buffer speculative state
    • Extend cache coherence protocol to track data dependence

Convergence of HW-only Thread-Level Speculation

Hardware TLS Schemes: quick summary

Result:
  • TLS is promising
  • SPEC int improvement: 30% - 100%
  • Depends on aggressiveness of the hardware support

[Figure: CMP with hardware speculative buffer and enhanced cache coherence protocol; four processors (P), each with a cache (C) holding speculative state (Sp-state), plus one non-speculative cache]

Convergence of HW-only Thread-Level Speculation

Software-only Thread-Level Speculation

[Figure: Thread-Level Speculation implementations (SW-only approach, HW-only approach, our approach), positioned on the SW-only approach]

Overall picture of SW-only TLS approach

Software-only Thread-Level Speculation Schemes
  • LRPD Test: UIUC
  • VM for dependence tracking: Spiros’s, CMU
  • Cintra’s SW TLS: U Edinburgh
  • Problem of software-only approach: high overhead
  • Try to reduce it

Overview of SW-only TLS approach

LRPD Test (UIUC)

[Figure: loop iterations run in parallel over execution time with software dependence tracking, followed by a check: was parallel execution safe?]

  • + implemented entirely in software
  • – applies only to array-based code
  • – no partial parallelism
    • entire loop will re-execute sequentially if there is any dependence

Pros + Cons of LRPD
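The "run in parallel, then ask whether it was safe" idea can be illustrated with shadow arrays. The sketch below is a deliberately simplified illustration and not the published LRPD algorithm: privatization and reduction handling are omitted, and the access pattern is passed in as index arrays instead of being recorded during a real speculative run. The function name lrpd_like_test and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Iteration i is assumed to read A[rd[i]] and then write A[wr[i]].
 * Returns 1 if the iterations were independent, 0 if the loop must
 * be re-executed sequentially, -1 if shadow memory is unavailable. */
int lrpd_like_test(const size_t *rd, const size_t *wr,
                   size_t iters, size_t len)
{
    unsigned char *aw = calloc(len, 1);  /* element was written       */
    unsigned char *ar = calloc(len, 1);  /* element read by an
                                            iteration that did not
                                            write it                  */
    size_t total_writes = 0, marked = 0;
    int ok = 1;

    if (aw == NULL || ar == NULL) {
        free(aw);
        free(ar);
        return -1;
    }

    /* Marking phase: done alongside the speculative parallel run. */
    for (size_t i = 0; i < iters; i++) {
        assert(rd[i] < len && wr[i] < len);
        if (rd[i] != wr[i])
            ar[rd[i]] = 1;
        aw[wr[i]] = 1;
        total_writes++;
    }

    /* Analysis phase: runs once, after the loop has finished. */
    for (size_t e = 0; e < len; e++) {
        if (aw[e]) {
            marked++;
            if (ar[e])
                ok = 0;  /* written here, read by another iteration */
        }
    }
    if (total_writes != marked)
        ok = 0;          /* two iterations wrote the same element   */

    free(aw);
    free(ar);
    return ok;
}
```

The result is all-or-nothing: a single conflicting element fails the whole loop, which is the "no partial parallelism" limitation listed above.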

Dependence tracking using Virtual Memory

[Figure: threads plotted against execution time; software dependence tracking through VM pages; threads synchronize through virtual memory by transferring VM pages]

Pros + Cons of VM Tracking

CMU Spiros’s approach -- Dependence tracking using Virtual Memory
  • Coarse-grain, software-only
  • Based on memory tracking
    • virtual memory page protection mechanism
    • use software DSM (TreadMarks)
    • Synchronization through VM pages through cost analysis
  • Overhead is prohibitive
    • 2 sec (sequential) vs. 5 min (parallel)
    • Not a viable approach at this coarse granularity

SW-TLS through VM Tracking is not attractive
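The page-protection mechanism itself can be sketched in C. This is a generic illustration, not code from TreadMarks or the CMU work: it assumes Linux, tracks only writes (tracking reads would start the pages at PROT_NONE), and the names tracking_start, tracked_store, page_is_dirty and NPAGES are hypothetical. Each first write to a page costs a fault and a system call, which hints at why the overhead is so high.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 4

static char *region;                         /* tracked memory       */
static size_t page_size;
static volatile sig_atomic_t dirty[NPAGES];  /* 1 = page was written */

/* Fault handler: the first write to a read-only page lands here. */
static void on_fault(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)region;
    size_t page;
    (void)sig;
    (void)ctx;
    if (addr < base || addr >= base + NPAGES * page_size)
        _exit(1);  /* fault outside the tracked region: a real crash */
    page = (addr - base) / page_size;
    dirty[page] = 1;
    /* Unprotect so the faulting store is retried and succeeds.
     * Calling mprotect here is common practice on Linux, although
     * POSIX does not list it as async-signal-safe. */
    mprotect(region + page * page_size, page_size,
             PROT_READ | PROT_WRITE);
}

/* Maps NPAGES read-only pages and installs the handler.
 * Returns the region, or NULL on failure. */
char *tracking_start(void)
{
    struct sigaction sa;
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    region = mmap(NULL, NPAGES * page_size, PROT_READ,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return NULL;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return NULL;
    return region;
}

/* Store through a volatile pointer so the write is not reordered. */
void tracked_store(size_t offset, char value)
{
    assert(offset < NPAGES * page_size);
    *(volatile char *)(region + offset) = value;
}

int page_is_dirty(size_t page)
{
    assert(page < NPAGES);
    return dirty[page];
}

size_t tracked_page_size(void)
{
    return page_size;
}
```

Granularity is a whole page, so two threads writing unrelated variables on the same page would still appear to conflict.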

Cintra’s SW TLS: Memory tracking tuned for performance

[Figure: threads plotted against execution time, with efficient tracking for array references]

Efficient but custom-made for arrays only

Cintra’s software-only Thread-Level Speculation: quick summary
  • Features
    • Software simulation for extended cache coherence protocol
      • Provide speculative state transition table
    • Violation detection through speculative state comparison
    • Instrument each load and store
  • Pros + Cons:
    • + advanced implementation of LRPD test
    • + implemented entirely in software
    • + covers partial parallelism
    • – hand-crafted code for performance
    • – applies only to array-based code

Summary of Cintra’s work

Problems with Software Thread-Level Speculation
  • High overhead
    • Buffer speculative state
    • Track data dependences for all memory references
    • Re-execute in case of failed speculation
  • Potential speedup
    • largely unexplored
  • Possible directions for future research
    • Reduce overhead
    • Achieve speedup from TLS parallelism

Summary of Software TLS

Our current Thread-Level Speculation approach

[Figure: Thread-Level Speculation implementations (SW-only approach, HW-only approach, our approach), positioned on our approach]

Overall position for our SW TLS approach

Long term future plan
  • Goals
    • Target
      • Chip Multi-Processors
      • Tightly-coupled MPs
    • Apply to general-purpose code: not only arrays
    • Minimize overhead
      • Capitalize on compiler analysis and optimizations
        • Idempotency analysis <done>
        • Synchronization and communications <done>
        • PPA: Probabilistic pointer analysis Framework (Jeff’s work) <progressing>
        • Minimal backup and buffer retrieval analysis <progressing>
        • … more analysis we will invent <todo>
  • SW-only approach: room to improve
  • Starting point: highly efficient software checkpointing

Goals and Plans

Starting point: efficient software checkpointing
  • Some program points in source code
  • Buffer state changes between the current execution point and its latest checkpoint
  • Execution can always efficiently rewind to its latest checkpoint

[Figure: program execution timeline with software checkpoints; memory changes are buffered between checkpoints ("Buffer memory changes", "Buffer more memory changes")]

Introduce software checkpointing

Potential use of Software checkpointing
  • Software Rollback
    • automatic software TLS support
    • foundation of future automatic TLS parallelization
  • Debug
    • controlled rewind
  • Enhance application reliability
  • Speculative optimizations in uni-processor program
    • larger window size
    • deep branch speculation
    • speculative code motion

What can software checkpointing do

Software checkpointing schemes
  • Compiler analysis
    • Local: Basic Block level
      • Backup only needed memory writes
      • Optimize to minimize
        • Number of backups
        • Number of buffer retrievals
    • Global: procedural level
      • Populate buffers through control-flow graph
      • Iterate until buffer stabilizes
    • Inter-procedural level
  • Potential approaches for software backup
    • Undo backup
    • Todo backup

Build software checkpointing

Undo backup
  • Compile-time analysis
  • Backup once
    • per distinct memory write
    • per Basic Block
  • Program continues to operate on the original (non-backup) memory
  • Action upon execution completion
    • Commit: trash buffer
    • Rollback: restore from buffer

Undo backup properties

Undo backup example

Program, Basic Block level:

a = 10;
b = 12;
c = a + b;

Undo backup memory:

(&a, [a])
(&b, [b])
(&c, [c])

Undo backup action:
  • Conflicts check
    • Y: restore undo memory
    • N: trash undo memory
  • Next Basic Block

Undo backup process
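A minimal C sketch of the undo scheme, assuming each (&x, [x]) pair means "address of x, old contents of x". The types undo_entry and undo_log and the functions below are hypothetical names chosen for illustration; in the scheme described here the compiler would insert the saves, whereas this sketch calls them by hand.

```c
#include <assert.h>
#include <stddef.h>

#define UNDO_MAX 16

typedef struct { int *addr; int old; } undo_entry;  /* (&x, [x]) */
typedef struct { undo_entry e[UNDO_MAX]; size_t n; } undo_log;

/* Record the current contents of *addr before it is overwritten. */
void undo_save(undo_log *log, int *addr)
{
    assert(log->n < UNDO_MAX);
    log->e[log->n].addr = addr;
    log->e[log->n].old = *addr;
    log->n++;
}

/* Commit: memory already holds the new values, so just drop the log. */
void undo_commit(undo_log *log)
{
    log->n = 0;
}

/* Rollback: restore old values, newest entry first. */
void undo_rollback(undo_log *log)
{
    while (log->n > 0) {
        log->n--;
        *log->e[log->n].addr = log->e[log->n].old;
    }
}

/* The slide's basic block, with one backup per distinct write. */
void basic_block(undo_log *log, int *a, int *b, int *c)
{
    undo_save(log, a);
    undo_save(log, b);
    undo_save(log, c);
    *a = 10;
    *b = 12;
    *c = *a + *b;  /* reads go straight to memory, no buffer lookup */
}
```

Commit is nearly free and rollback costs one store per logged entry, which matches the "trash buffer" and "restore from buffer" actions on the slide.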

Todo backup
  • Perform at runtime
  • Happens on every memory write inside the Basic Block
  • Each following read might need to retrieve from buffer
  • Action upon completion (reverse of Undo type)
    • Commit: write-back from buffer
    • Rollback: trash buffer

Todo backup properties

Todo backup example

Program, Basic Block level:

*p = a;
*q = b;
…*p + *q;

Todo backup memory:

(p, a)
(q, b)

  • Conflicts check
    • Y: trash todo backup
    • N: write todo backup to memory
  • Next Block

Todo backup process
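A matching C sketch of the todo scheme, assuming each (p, a) pair means "target address p, pending value a". The names todo_entry, todo_buf, todo_write, todo_read, todo_commit and todo_rollback are hypothetical. Writes are appended to the buffer and never touch memory until commit, so every later read must search the buffer first.

```c
#include <assert.h>
#include <stddef.h>

#define TODO_MAX 16

typedef struct { int *addr; int val; } todo_entry;  /* (p, value) */
typedef struct { todo_entry e[TODO_MAX]; size_t n; } todo_buf;

/* Buffer a write instead of performing it: one entry per write. */
void todo_write(todo_buf *buf, int *addr, int val)
{
    assert(buf->n < TODO_MAX);
    buf->e[buf->n].addr = addr;
    buf->e[buf->n].val = val;
    buf->n++;
}

/* Read through the buffer: the newest pending write to addr wins,
 * otherwise fall back to the value in memory. */
int todo_read(const todo_buf *buf, const int *addr)
{
    for (size_t i = buf->n; i > 0; i--) {
        if (buf->e[i - 1].addr == addr)
            return buf->e[i - 1].val;
    }
    return *addr;
}

/* Commit: write the buffered values back, oldest first. */
void todo_commit(todo_buf *buf)
{
    for (size_t i = 0; i < buf->n; i++)
        *buf->e[i].addr = buf->e[i].val;
    buf->n = 0;
}

/* Rollback: memory was never modified, so just drop the buffer. */
void todo_rollback(todo_buf *buf)
{
    buf->n = 0;
}
```

The slide's block becomes todo_write(&buf, p, a); todo_write(&buf, q, b); followed by todo_read(&buf, p) + todo_read(&buf, q). Addresses are resolved at run time, so arbitrary pointers work, at the price of a buffer search on every read.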

Backup Comparison
  • Undo
    • Pro: fast
      • Few backups
      • No need to retrieve from buffer for reads
    • Con: memory address needs to be known statically
      • Scalar
      • Pointer to fixed location
  • Todo
    • Pro
      • Handles both scalar and general-purpose pointer cases
    • Con: slow
      • Backup once per memory write
      • Need to retrieve each following read from buffer
  • In reality: both types are used

Pros + cons of undo and todo

An example in reality: mixed mode

Code to execute:

int a, b, c;
int * p, * q;
(d) a = 1;
(d) b = 2;
(d) *p = 5;
… …
(u) c = a + b;
… …
(u) … = * q;

Undo buffer:

(&a, [a])
(&b, [b])
(&c, [c])

Todo buffer:

(p, 5)

Combined-backup process in reality

Selection of backups in reality
  • Combined approach
    • Undo: memory address known
      • Scalars
      • Pointers to fixed address
      • Compile-time analysis
    • Todo: memory address unknown
      • Normal pointers
      • Run-time analysis
  • Plan for implementation
    • Put into SUIF, as an optimization pass
    • Minimize performance drop

Use both types together in reality

Conclusion
  • Thread-Level Speculation is compelling
    • Potential large performance gains
  • Challenge
    • Software overhead
  • Limited SW TLS work
    • No previous SW TLS working on general-purpose programs
    • Killer advantage: compiler analyses
    • Modest starting point
      • efficient software checkpointing

Summary

Concurrent HW-only Related Work

Another view of HW-only Thread-Level Speculation Schemes