A Heterogeneous Lightweight Multithreaded Architecture

Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge,

Paul Springer, and Gary Block

University of Notre Dame

MTAAP 2007, CA

Outline
  • Heterogeneous Lightweight Multithreaded Architecture
  • Simulation environments, benchmarks and results
  • Conclusions and future work
Architecture Highlights
  • Processing-In-Memory (PIM) based
    • Effectively attacks the memory wall problem
  • Highly multithreaded
    • Successfully hides large latencies and contention
  • Heterogeneous, supports Extended Memory Semantics (EMS)
    • Extremely low overhead on context switching and synchronization
Multithreaded Processors
  • Multithreading reduces the processor idle time
  • Thread context is part of the processor

Multithreading Machines
  • 1960s: CDC 6600
  • 1970s: I/O Processor for the Space Shuttle
  • 1980s: Denelcor HEP
  • 1990s: Cray/Tera MTA
  • 2000+: Cray Eldorado
  • 2000+: Intel Xeon
  • 2000+: Sun Niagara

[Figure panels: Single Threaded, Multithreaded]

Lightweight Threads
  • Thread context (frame) is 32 double words (256 bytes)
    • Two double words are reserved for the thread status; the remaining 30 serve as general-purpose registers.
    • There is no other per-thread state, which makes multithreading easy.
  • Frames are stored in memory (No Register File)
    • Registers are aliases for memory locations
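
The frame layout above can be sketched as simple address arithmetic. This is only an illustration: the slides do not say which two slots hold the thread status or how frames are aligned, so the sketch assumes the status dwords come first and that frames are aligned to their own size. All names are hypothetical.

```python
DWORD_BYTES = 8           # one double word is 8 bytes
FRAME_DWORDS = 32         # thread context (frame) size given on the slide
STATUS_DWORDS = 2         # reserved for thread status (slot positions are an assumption)
NUM_REGISTERS = FRAME_DWORDS - STATUS_DWORDS   # 30 general-purpose registers
FRAME_BYTES = FRAME_DWORDS * DWORD_BYTES       # 256 bytes


def register_address(frame_base: int, reg: int) -> int:
    """Return the memory address that register `reg` of a frame aliases."""
    if not 0 <= reg < NUM_REGISTERS:
        raise ValueError(f"register index must be in 0..{NUM_REGISTERS - 1}")
    if frame_base % FRAME_BYTES != 0:
        # Assumption for this sketch only: frames are aligned to the frame size.
        raise ValueError("frame base is assumed to be frame-aligned")
    return frame_base + (STATUS_DWORDS + reg) * DWORD_BYTES
```

Because a register is just a memory location, switching threads amounts to switching the frame base; no register file has to be saved or restored.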
Lightweight Multithreading
  • Thread creation is fast and inexpensive: a single instruction
    • Contrast with pthread creation, which requires kernel intervention and as many as tens of thousands of instructions
  • Unbounded Multithreading
    • Threads are part of the memory system rather than the processor state.
    • “Unlimited number” of threads per processor.
    • Many opportunities for issuing an instruction.
  • Ultra-lightweight Processing
    • Unbounded multithreading requires low-overhead thread management and synchronization
    • Performed at the memory bank: greater data bandwidth, low overhead
Heterogeneous Architecture
  • Issue instruction from ready threads on each clock cycle
  • Architectural support for low overhead thread management

[Figure: Heterogeneous Architecture, with the Lightweight Processor Chip (LPC)]

Extended Memory Semantics (EMS)
  • Memory subsystem is constructed of 65-bit dwords
    • 64 bits of data
    • 1 extension bit (1: dword is full, 0: dword is empty)
  • Extends the Cray MTA empty/full (E/F) bits
    • Full/Empty: contains data or not
    • Extra states: metadata can contain a frame pointer
  • Same semantics apply to thread registers

[Figure: dword layout with 64 bits of data/metadata and an extension bit]
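
A 65-bit dword can be modelled as a 64-bit payload plus one extension bit. The sketch below is an illustration only: placing the extension bit at bit position 64 is an assumption made for the example, and the helper names are hypothetical.

```python
from typing import Tuple

DATA_BITS = 64
DATA_MASK = (1 << DATA_BITS) - 1


def pack_dword(data: int, full: bool) -> int:
    """Combine 64 bits of data with the extension bit (1 = full, 0 = empty)."""
    if not 0 <= data <= DATA_MASK:
        raise ValueError("data must fit in 64 bits")
    # Assumption: the extension bit sits above the data, at bit 64.
    return (int(full) << DATA_BITS) | data


def unpack_dword(word: int) -> Tuple[int, bool]:
    """Split a 65-bit word back into (data, full)."""
    return word & DATA_MASK, bool(word >> DATA_BITS)
```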
Single Producer/ Consumer on EMS
  • LWP behavior for a load_fe issued by thread T2 (target register R2) when location A is empty:
    • Location A changes state to “FVE: forward value, leave empty”.
    • The content of A becomes the target address of the forward operation (all registers also have a memory address).
Completing the Load
  • How does the LWP complete the load_fe?
    • store_ef arrives at A
    • Data associated with store is returned to T2:R2 – this completes the load_fe
    • Location A changes to the empty state.
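
The single producer/consumer case can be sketched as a small state machine. This is a software model under stated assumptions, not the LWP hardware: it assumes load_fe means "wait for full, then leave empty" and store_ef means "wait for empty, then leave full", and it raises an error where the slides say an EMS handler would take over. The class names and addresses are hypothetical.

```python
from dataclasses import dataclass

EMPTY, FULL, FVE = "empty", "full", "FVE"


@dataclass
class Cell:
    state: str = EMPTY
    content: int = 0  # data when FULL, target address when FVE


class Memory:
    def __init__(self):
        self.cells = {}

    def cell(self, addr):
        return self.cells.setdefault(addr, Cell())

    def _deliver(self, target, value):
        # Registers have memory addresses, so delivery is just a write.
        t = self.cell(target)
        t.state, t.content = FULL, value

    def load_fe(self, addr, target):
        c = self.cell(addr)
        if c.state == FULL:
            self._deliver(target, c.content)
            c.state, c.content = EMPTY, 0
        elif c.state == EMPTY:
            # Nothing to load yet: remember where to forward the value.
            c.state, c.content = FVE, target
        else:
            raise NotImplementedError("second consumer: needs an EMS handler")

    def store_ef(self, addr, value):
        c = self.cell(addr)
        if c.state == FVE:
            target = c.content
            c.state, c.content = EMPTY, 0  # forward value, leave empty
            self._deliver(target, value)
        elif c.state == EMPTY:
            c.state, c.content = FULL, value
        else:
            raise NotImplementedError("second producer: needs an EMS handler")


mem = Memory()
A, T2_R2 = 0x100, 0x2010        # hypothetical addresses
mem.load_fe(A, T2_R2)           # A is empty, so A becomes FVE holding T2:R2
mem.store_ef(A, 7)              # the store completes the pending load
```

After the store, the value 7 sits in T2:R2 and A is empty again, matching the sequence on the two slides above.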
A More Complex Situation
  • Consider a multiple producer/consumer problem such as locks.
    • Multiple threads (more than 3) all attempt to acquire the lock.
    • Memory requests will be queued up at the target location
    • An EMS handler thread is needed to handle the bookkeeping
EMS Handler Overhead
  • Invoking an EMS handler
    • Needed for synchronized memory operations beyond the hardware-supported single producer/consumer scenario
  • Overhead
    • Creating the handler threads
    • To queue up memory requests, handlers need to spin on the target memory address to get exclusive access
    • Significant overhead in LWP CPU time, NoC traffic and memory bandwidth
  • How to alleviate the overhead?
Ultra-Lightweight Processor
  • Alleviates the burden on the LWP
  • Handles thread synchronization and management, and complex atomic memory operations
  • Simple design, minimal circuitry
  • Located at the memory bank: greatest data bandwidth (wide-word), no NoC traffic when accessing memory
  • Multithreaded
Large-scale system

[Figure: large-scale system]

Outline
  • Heterogeneous Lightweight Multithreaded Architecture
  • Simulation environments, benchmarks and results
  • Conclusions and future work
Simulation Environment

  • DimC (Diminished C)
    • An extension of ANSI C
    • Exposes low-level architectural features
    • Supports lightweight multithreading
  • SALT (Simulator for the Analysis of LWP Timings)
    • Contains LWPs, ULWPs, NoC and memory subsystems

Benchmark Suite
  • Two categories of irregular problems:
  • Category 1: complicated control structures such as recursion
    • Such programs can achieve decent performance on conventional architectures, but only with great effort
    • Do not necessarily invoke the EMS handler or ULWP
    • Examples: N-Queens, Fibonacci
  • Category 2: complicated control structures and dynamic data structures
    • Very hard to parallelize effectively on conventional SMPs
    • EMS handler or ULWP support is necessary
    • Examples: competing agents, SAT solver kernel
N-Queens
  • Find all solutions to the problem of placing N queens on an N×N chessboard such that no queen can attack another
  • An irregular problem with dynamic parallel recursion
  • Thread behavior is hard to predict
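
The recursion behind N-Queens can be sketched serially as follows. This is a plain Python illustration, not the authors' DimC benchmark. On the architecture described here, each recursive call could in principle be spawned as its own lightweight thread, which is what makes the amount of parallelism data-dependent.

```python
def count_n_queens(n: int) -> int:
    """Count all placements of n non-attacking queens on an n x n board."""

    def place(row, cols, diag1, diag2):
        if row == n:
            return 1  # every row holds a queen: one complete solution
        total = 0
        for col in range(n):
            d1, d2 = row - col, row + col
            if col in cols or d1 in diag1 or d2 in diag2:
                continue  # square is attacked by an earlier queen
            # Each of these calls is an independent subproblem.
            total += place(row + 1, cols | {col}, diag1 | {d1}, diag2 | {d2})
        return total

    return place(0, frozenset(), frozenset(), frozenset())


print(count_n_queens(8))  # prints 92
```

How many branches survive at each row depends on the earlier placements, so the work per subproblem cannot be known in advance.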
Competing Agents
  • Multiple agents attempt to update a shared memory location simultaneously
  • Each agent is implemented by a single thread. All threads are evenly distributed over four LWPs inside a single LPC
  • Complicated control structures and dynamic data structures
  • Uses separate synchronized loads and stores
  • Used to characterize the effectiveness of the ULWP in reducing the cost of synchronization
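
The competing-agents pattern can be imitated in software with a one-slot queue standing in for a full/empty memory location. This is a rough analogy under stated assumptions: it uses OS threads and Python's `queue.Queue`, not LWP threads or EMS hardware. `get()` plays the role of a synchronized load that leaves the location empty, and `put()` the role of a synchronized store that leaves it full. The function name is hypothetical.

```python
import queue
import threading


def run_agents(num_agents: int, updates_per_agent: int) -> int:
    """Let several agents increment one shared location; return its final value."""
    cell = queue.Queue(maxsize=1)
    cell.put(0)  # the shared location starts full, holding 0

    def agent():
        for _ in range(updates_per_agent):
            value = cell.get()   # wait until full, take the value, leave empty
            cell.put(value + 1)  # location is empty now, so this fills it

    threads = [threading.Thread(target=agent) for _ in range(num_agents)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.get()


print(run_agents(8, 100))  # prints 800
```

Only the agent that emptied the location can refill it, so no update is lost even though all agents contend for the same address.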
SAT Solver/zChaff
  • SAT: the Boolean satisfiability problem (from propositional logic)
    • Fundamental to many problems in automated reasoning, CAD, CAM, machine vision, databases, robotics, IC design, computer architecture, and network design.
    • Given a Boolean formula (usually in CNF), check whether an assignment of truth values to the variables in the formula exists such that the formula evaluates to true.
    • For example, for a three-clause CNF formula [formula image not recovered], if x1 is true and x3 is false, then all three clauses are satisfied, regardless of the value of x2.
  • zChaff, a modern variant of the DPLL algorithm, is used to implement the SAT solver.
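
A small sketch of CNF evaluation may help here. The formula below is hypothetical and is not the one from the original slide; it was chosen only because it has the same property (x1 true and x3 false satisfy every clause, whatever x2 is). Literals are written as signed integers, a common convention. The exhaustive search is for illustration and is not DPLL or zChaff.

```python
from itertools import product

# Hypothetical formula: (x1 or x2) and (not x3 or not x2) and (x1 or not x3)
FORMULA = [[1, 2], [-3, -2], [1, -3]]


def evaluate(formula, assignment):
    """True if every clause has at least one literal that holds."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in formula
    )


def brute_force_sat(formula, num_vars):
    """Try every assignment; return a satisfying one, or None if none exists."""
    for values in product([False, True], repeat=num_vars):
        assignment = {i + 1: v for i, v in enumerate(values)}
        if evaluate(formula, assignment):
            return assignment
    return None


for x2 in (False, True):
    assert evaluate(FORMULA, {1: True, 2: x2, 3: False})
```

Exhaustive search grows as 2^n in the number of variables, which is why practical solvers use DPLL-style search with pruning instead.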
N-Queens
  • Successfully exploits all the available parallelism
    • Completely dynamic, ideal speedup
    • Saturation is due only to the small data set
  • Good performance can be achieved on conventional SMPs, but only with great extra effort
Competing Agents
  • The EMS handler is the bottleneck in high-contention situations
  • The heterogeneous architecture can achieve unbounded scalability
  • High contention is no longer a problem in the heterogeneous architecture
SAT Solver/zChaff on Conventional SMPs
  • Parallel implementation leads to performance degradation
  • The more processors, the worse the performance
  • Very hard to achieve good performance on conventional SMPs

Data from "Parallel Multithreaded Satisfiability Solver: Design and Implementation" by Yulik Feldman et al. (Intel)

SAT Solver/zChaff on Heterogeneous Architecture
  • Ideal speedup
  • Saturation is due only to the small data set
  • Successfully exploits all the available parallelism

[Figure: speedup over the serial version]

Outline
  • Heterogeneous Lightweight Multithreaded Architecture
  • Simulation environments, benchmarks and results
  • Conclusions and future work
Conclusions
  • The Heterogeneous Lightweight Multithreaded Architecture
    • Is a good solution for irregular problems that are hard or impossible to parallelize on conventional SMPs
    • Has very low overhead on context switching and synchronization
    • Can successfully hide latencies and contention
    • Can provide unbounded multithreading and scalability
    • Can exploit all possible parallelism inside an irregular problem
Future Work
  • Provide standard language support
  • Benchmark suites
  • Large-scale system performance
  • Comparison with conventional large-scale systems
Acknowledgments
  • DARPA
    • This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under its Contract No. NBCH3039003.
  • University of Notre Dame
  • Caltech/JPL
  • Cray