A Heterogeneous Lightweight Multithreaded Architecture

Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

University of Notre Dame

MTAAP 2007, CA



Outline
  • Heterogeneous Lightweight Multithreaded Architecture
  • Simulation environments, benchmarks and results
  • Conclusions and future work
Architecture Highlights
  • Processing-In-Memory (PIM) based
    • Effectively attacks the memory wall problem
  • Highly multithreaded
    • Hides large latencies and contention
  • Heterogeneous, supports Extended Memory Semantics (EMS)
    • Extremely low overhead on context switch and synchronization
Multithreaded Processors
  • Multithreading reduces the processor idle time
  • Thread context is part of the processor

Multithreading Machines
  • 1960s: CDC 6600
  • 1970s: I/O Processor for the Space Shuttle
  • 1980s: Denelcor HEP
  • 1990s: Cray/Tera MTA
  • 2000+: Cray Eldorado, Intel Xeon, Sun Niagara

Figure: Single Threaded vs. Multithreaded

Lightweight Threads
  • Thread context (frame) is 32 double words (256 bytes)
    • Two double words are reserved for the thread status; the other 30 are general-purpose registers.
    • There is no other per-thread state, which makes multithreading easy.
  • Frames are stored in memory (No Register File)
    • Registers are aliases for memory locations
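
The frame layout above can be sketched in a few lines. This is only an illustration: the slides give the sizes (32 double words, 2 for status, 30 registers), but the position of the status words within the frame, the frame alignment, and the helper name `register_address` are assumptions made here.

```python
DWORD_BYTES = 8            # one double word holds 64 bits of data
FRAME_DWORDS = 32          # from the slide: a frame is 32 double words
STATUS_DWORDS = 2          # from the slide: two double words of thread status
NUM_GPRS = FRAME_DWORDS - STATUS_DWORDS    # 30 general-purpose registers
FRAME_BYTES = FRAME_DWORDS * DWORD_BYTES   # 256 bytes


def register_address(frame_base: int, reg: int) -> int:
    """Return the memory address that register `reg` of a frame aliases.

    Assumption (not stated on the slides): the two status words occupy
    the first two slots of the frame and frames are 256-byte aligned.
    """
    if not 0 <= reg < NUM_GPRS:
        raise ValueError(f"register index {reg} out of range 0..{NUM_GPRS - 1}")
    if frame_base % FRAME_BYTES != 0:
        raise ValueError("frame base must be frame-aligned in this sketch")
    return frame_base + (STATUS_DWORDS + reg) * DWORD_BYTES
```

Because a register is just an address inside the frame, switching threads in this model amounts to switching the frame base; no register file has to be saved or restored.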
Lightweight Multithreading
  • Thread creation is fast and inexpensive: a single instruction
    • Contrast with pthread creation, which requires kernel intervention and as many as tens of thousands of instructions
  • Unbounded Multithreading
    • Threads are part of the memory system rather than the processor state.
    • “Unlimited number” of threads per processor.
    • Many opportunities for issuing an instruction.
  • Ultra-lightweight Processing
    • Unbounded multithreading requires low-overhead thread management and synchronization
    • Located at the memory bank: greater data bandwidth, low overhead
Heterogeneous Architecture
  • Issues an instruction from a ready thread on each clock cycle
  • Architectural support for low overhead thread management
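
To see why issuing from any ready thread hides latency, here is a toy cycle-level model. It is a sketch under stated assumptions, not the LWP's actual issue logic: every instruction is assumed to stall its thread for a fixed `latency`, the selection policy is assumed to be round-robin, and the function name `utilization` is made up for this example.

```python
def utilization(num_threads: int, latency: int, cycles: int) -> float:
    """Fraction of cycles in which some thread issues an instruction.

    Toy model: after issuing, a thread is not ready again until
    `latency` further cycles have passed (e.g. a memory access).
    """
    ready_at = [0] * num_threads   # first cycle at which each thread may issue
    next_thread = 0                # round-robin starting point (assumed policy)
    issued = 0
    for cycle in range(cycles):
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                issued += 1
                ready_at[t] = cycle + 1 + latency
                next_thread = (t + 1) % num_threads
                break              # at most one issue per cycle
    return issued / cycles


one = utilization(num_threads=1, latency=9, cycles=1000)
five = utilization(num_threads=5, latency=9, cycles=1000)
ten = utilization(num_threads=10, latency=9, cycles=1000)
```

In this model a single thread keeps the pipeline busy one cycle in ten, while ten threads cover the nine-cycle latency completely.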

Figure: Heterogeneous Architecture, Lightweight Processor Chip (LPC)

Extended Memory Semantics (EMS)

Figure: a dword with 64 bits of data/metadata and an extension bit
  • The memory subsystem is constructed of 65-bit dwords
    • 64 bits of data
    • 1 extension bit (1: dword is full, 0: dword is empty)
  • Extends Cray MTA E/F bits
    • Full/Empty: Contains data or not
    • Extra states: Metadata can contain frame pointer
  • Same semantics apply to thread registers
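
A 65-bit dword can be modeled as a Python integer. This is a minimal sketch: the slides do not say where the extension bit sits, so placing it above the 64 payload bits is an assumption, as are the names `pack` and `unpack`.

```python
MASK64 = (1 << 64) - 1   # the 64 data/metadata bits


def pack(payload: int, full: bool) -> int:
    """Build a 65-bit word: extension bit (assumed to be bit 64) plus payload."""
    if not 0 <= payload <= MASK64:
        raise ValueError("payload must fit in 64 bits")
    return (int(full) << 64) | payload


def unpack(word: int) -> tuple[int, bool]:
    """Split a 65-bit word into (payload, full)."""
    return word & MASK64, bool(word >> 64)


full_word = pack(42, full=True)        # a dword holding data
empty_word = pack(0x2010, full=False)  # an empty dword whose bits hold metadata
```

When the extension bit is 0, the 64 payload bits are free to carry metadata such as a frame pointer, which is what the extra states on the slide rely on.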
Single Producer/Consumer on EMS
  • LWP behavior for load_fe with A empty.
    • Location A changes state to “FVE: forward value, leave empty”
    • Content of A is the target address of the forward operation (all registers also have a memory address).
Completing the Load
  • How does the LWP complete the load_fe?
    • store_ef arrives at A
    • Data associated with store is returned to T2:R2 – this completes the load_fe
    • Location A changes to the empty state.
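
The load_fe/store_ef exchange on the last two slides can be sketched as a small state machine. Only two transitions come from the slides: load_fe on an empty location (enter FVE, store the forward target) and store_ef arriving at an FVE location (forward the data, return to empty). The other branches, and the names `Memory`, `State`, and `NeedsHandler`, are assumptions added to make the sketch complete.

```python
from enum import Enum


class State(Enum):
    EMPTY = "empty"
    FULL = "full"
    FVE = "forward value, leave empty"


class NeedsHandler(Exception):
    """Raised for cases beyond the single producer/consumer scenario."""


class Memory:
    def __init__(self):
        self.state = {}   # address -> State (a missing address counts as EMPTY)
        self.value = {}   # address -> data, or the forward target while in FVE

    def load_fe(self, addr, target):
        st = self.state.get(addr, State.EMPTY)
        if st is State.EMPTY:
            # From the slides: remember where the value must be forwarded.
            self.state[addr] = State.FVE
            self.value[addr] = target
        elif st is State.FULL:
            # Assumed: data already present, so deliver it and leave A empty.
            self._deliver(target, self.value.pop(addr))
            self.state[addr] = State.EMPTY
        else:
            raise NeedsHandler("second consumer while a load is pending")

    def store_ef(self, addr, data):
        st = self.state.get(addr, State.EMPTY)
        if st is State.FVE:
            # From the slides: forward the data to the waiting register,
            # which completes the load_fe, then leave A empty.
            self._deliver(self.value.pop(addr), data)
            self.state[addr] = State.EMPTY
        elif st is State.EMPTY:
            # Assumed: no consumer is waiting, so store and mark full.
            self.value[addr] = data
            self.state[addr] = State.FULL
        else:
            raise NeedsHandler("second producer while the location is full")

    def _deliver(self, target, data):
        # Registers also have memory addresses, so delivery is a plain write.
        self.value[target] = data
        self.state[target] = State.FULL


mem = Memory()
A, T2_R2 = 0x100, 0x2010          # hypothetical addresses for A and T2:R2
mem.load_fe(A, target=T2_R2)      # A is empty, so it enters the FVE state
pending = (mem.state[A], mem.value[A])
mem.store_ef(A, 99)               # the store completes the pending load
```

The `NeedsHandler` branches mark where the hardware path ends; anything with more than one waiting producer or consumer falls outside this sketch.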
A More Complex Situation
  • Consider a multiple producer/consumer problem such as locks.
    • Multiple threads (more than 3) all attempt to acquire the lock.
    • Memory requests will be queued up at the target location
    • An EMS handler thread is needed to handle the bookkeeping
EMS Handler Overhead
  • Invoking an EMS handler
    • Needed for synchronized memory operations beyond the hardware-supported single producer/consumer scenario
  • Overhead
    • Creating the handler threads
    • To queue up memory requests, handlers need to spin on the target memory address to get exclusive access
    • Significant overhead on LWP CPU time, NoC traffic and memory bandwidth
  • How to alleviate the overhead?
Ultra-Lightweight Processor
  • Alleviates the burden on the LWP
  • Handles thread synchronization and management, and complex atomic memory operations
  • Simple design, minimal circuitry
  • Located at the memory bank: greatest data bandwidth (wide-word), no NoC traffic when accessing memory
  • Multithreaded
Large-scale system

Figure: large-scale system

Outline
  • Heterogeneous Lightweight Multithreaded Architecture
  • Simulation environments, benchmarks and results
  • Conclusions and future work
Simulation Environment

  • DimC (Diminished C)
    • An extension of ANSI C
    • Exposes low-level architectural features
    • Supports lightweight multithreading
  • SALT (Simulator for the Analysis of LWP Timings)
    • Contains LWPs, ULWPs, NoC, and memory subsystems

Benchmark Suite
  • Two categories of irregular problems
  • Complicated control structures such as recursion
    • Such programs can achieve decent performance on conventional architectures, but only with great effort.
    • Do not necessarily invoke the EMS handler or ULWP
    • Benchmarks: N-Queens, Fibonacci
  • Complicated control structures and dynamic data structures
    • Very hard to parallelize effectively on conventional SMPs
    • EMS handler or ULWP support is necessary
    • Benchmarks: Competing Agents, SAT solver kernel
N-Queens
  • Find all solutions to the problem of placing N queens on an N×N chessboard such that no queen can attack another.
  • An irregular problem with dynamic parallel recursion
  • Thread behavior is hard to predict.
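
For reference, the recursion behind N-Queens can be written as a short sequential solution counter. This is a plain-Python sketch, not the DimC benchmark from the talk; the function name `count_solutions` and its parameters are chosen here for illustration.

```python
def count_solutions(n, row=0, cols=frozenset(), diag1=frozenset(), diag2=frozenset()):
    """Count placements of n non-attacking queens, one row at a time."""
    if row == n:
        return 1                      # every row holds a queen: one solution
    total = 0
    for col in range(n):
        # Skip squares attacked along a column or either diagonal.
        if col in cols or (row - col) in diag1 or (row + col) in diag2:
            continue
        # Each recursive call explores an independent subtree.
        total += count_solutions(
            n, row + 1, cols | {col}, diag1 | {row - col}, diag2 | {row + col}
        )
    return total
```

Each recursive call works on its own copy of the board state, so the subtrees are independent. How many of them exist, and how deep each goes, depends on the pruning, which is one way to read "thread behavior is hard to predict" if every call were spawned as a lightweight thread.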
Competing Agents
  • Multiple agents attempt to update a shared memory location simultaneously
  • Each agent is implemented by a single thread. All threads are evenly distributed over four LWPs inside a single LPC
  • Complicated control structures and dynamic data structures
  • Uses separate synchronized loads and stores
  • Used to characterize the effectiveness of the ULWP in reducing the cost of synchronization
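
The access pattern of this benchmark can be imitated on a conventional machine with ordinary threads. The sketch below emulates a full/empty location with `threading.Condition`; it says nothing about LWP or ULWP timing, and the names `FullEmptyCell` and `run_agents` are hypothetical.

```python
import threading


class FullEmptyCell:
    """Software emulation of one memory location with a full/empty bit."""

    def __init__(self, value):
        self._value = value
        self._full = True
        self._cond = threading.Condition()

    def load_fe(self):
        """Wait until full, read the value, leave the cell empty."""
        with self._cond:
            while not self._full:
                self._cond.wait()
            self._full = False
            self._cond.notify_all()
            return self._value

    def store_ef(self, value):
        """Wait until empty, write the value, leave the cell full."""
        with self._cond:
            while self._full:
                self._cond.wait()
            self._value = value
            self._full = True
            self._cond.notify_all()


def run_agents(num_agents, updates_per_agent):
    cell = FullEmptyCell(0)

    def agent():
        for _ in range(updates_per_agent):
            current = cell.load_fe()     # empties the cell: acts as acquire
            cell.store_ef(current + 1)   # refills the cell: acts as release

    threads = [threading.Thread(target=agent) for _ in range(num_agents)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.load_fe()


total = run_agents(num_agents=4, updates_per_agent=100)
```

Every agent that finds the cell empty has to wait, so all updates are serialized through one location. That waiting is the contention the benchmark is designed to stress.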
SAT Solver/zChaff
  • SAT: the Boolean satisfiability problem (from propositional logic)
    • Fundamental to many problems in automated reasoning, CAD, CAM, machine vision, database, robotics, IC design, computer architecture, and network design.
    • Given a Boolean formula (usually in CNF), check whether an assignment of Boolean truth values to the variables in the formula exists such that the formula evaluates to true.
    • For example, for the CNF formula on the slide, if x1 is true and x3 is false, then all three clauses are satisfied, regardless of the value of x2.
  • zChaff, a modern variant of the DPLL algorithm, is used to implement the SAT solver.
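
A small checker makes the CNF definition concrete. The formula used below is a hypothetical stand-in, (x1 ∨ x2) ∧ (¬x3 ∨ x2) ∧ (x1 ∨ ¬x3), chosen only because it has the same property as the slide's example; it is not the formula from the talk. The search is brute-force enumeration, not DPLL or zChaff, and all function names are made up here.

```python
from itertools import product


def clause_satisfied(clause, assignment):
    """A clause (list of signed variable numbers) needs one true literal."""
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)


def formula_satisfied(formula, assignment):
    """A CNF formula is true when every clause is satisfied."""
    return all(clause_satisfied(clause, assignment) for clause in formula)


def brute_force_sat(formula, num_vars):
    """Try every assignment; return the first satisfying one or None."""
    for values in product([False, True], repeat=num_vars):
        assignment = dict(enumerate(values, start=1))
        if formula_satisfied(formula, assignment):
            return assignment
    return None


# Hypothetical formula: positive n means x_n, negative n means NOT x_n.
EXAMPLE = [[1, 2], [-3, 2], [1, -3]]

holds_for_both = all(
    formula_satisfied(EXAMPLE, {1: True, 2: x2, 3: False}) for x2 in (False, True)
)
```

Enumeration takes 2^n steps in the worst case, which is why practical solvers prune the search instead of visiting every assignment.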
N-Queens
  • Successfully deploys all the parallelism
    • Completely dynamic, ideal speedup
    • Saturation is due only to the small data set
  • Good performance can be achieved on conventional SMPs, but only with great extra effort
Competing Agents
  • The EMS handler is the bottleneck under high contention
  • The heterogeneous architecture can achieve unbounded scalability
  • High contention is no longer a problem in the heterogeneous architecture
SAT Solver/zChaff on Conventional SMPs
  • Parallel implementation leads to performance degradation
  • The more processors, the worse the performance
  • Very hard to achieve good performance on conventional SMPs

Data from "Parallel Multithreaded Satisfiability Solver: Design and Implementation" by Yulik Feldman et al., Intel

SAT Solver/zChaff on Heterogeneous Architecture
  • Ideal speedup
  • Saturation is due only to the small data set
  • Successfully deployed all the parallelism

Figure: speedup over the serial version

Outline
  • Heterogeneous Lightweight Multithreaded Architecture
  • Simulation environments, benchmarks and results
  • Conclusions and future work
Conclusions
  • The Heterogeneous Lightweight Multithreaded Architecture
    • Is a good solution for irregular problems that are hard or impossible to parallelize on conventional SMPs
    • Has very low overhead on context switching and synchronization
    • Can successfully hide latencies and contention
    • Can provide unbounded multithreading and scalability
    • Can deploy all possible parallelism inside an irregular problem
Future Work
  • Provide standard language support
  • Benchmark suites
  • Large-scale system performance
  • Comparison with conventional large-scale systems
Acknowledgments
  • This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under its Contract No. NBCH3039003.
  • University of Notre Dame
  • Caltech/JPL
  • Cray