A Heterogeneous Lightweight Multithreaded Architecture

Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

University of Notre Dame

MTAAP 2007, CA

Outline

  • Heterogeneous Lightweight Multithreaded Architecture

  • Simulation environments, benchmarks and results

  • Conclusions and future work

Architecture Highlights

  • Processing-In-Memory (PIM) based

    • Effectively attacks the memory wall problem

  • Highly multithreaded

    • Hides large latencies and contention

  • Heterogeneous, with support for Extended Memory Semantics (EMS)

    • Extremely low overhead for context switching and synchronization

Multithreaded Processors

  • Multithreading reduces the processor idle time

  • Thread context is part of the processor

Multithreading machines:

  • 1960s: CDC 6600

  • 1970s: I/O processor for the Space Shuttle

  • 1980s: Denelcor HEP

  • 1990s: Cray/Tera MTA

  • 2000+: Cray Eldorado

  • 2000+: Intel Xeon

  • 2000+: Sun Niagara

Figure: single-threaded versus multithreaded execution

Lightweight Threads

  • Thread context (frame) is 32 double words (256 bytes)

    • Two double words are reserved for thread status; the remaining 30 serve as general-purpose registers

    • No other per-thread state, which makes multithreading easy

  • Frames are stored in memory (no register file)

    • Registers are aliases for memory locations
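The frame layout described on this slide can be illustrated with a minimal Python sketch. The sizes (32 double words, 2 status words, 30 registers) come from the slide; the class names, the flat dword-addressed memory, and the choice to put the status words at the start of the frame are assumptions made only for illustration.

```python
FRAME_DWORDS = 32      # thread context size in double words (from the slide)
DWORD_BYTES = 8        # one double word is 64 bits
STATUS_DWORDS = 2      # reserved for thread status (from the slide)
NUM_REGISTERS = FRAME_DWORDS - STATUS_DWORDS  # 30 general-purpose registers


class Memory:
    """Hypothetical flat memory, addressed in double words."""

    def __init__(self, num_dwords):
        self.dwords = [0] * num_dwords


class ThreadFrame:
    """A thread context that lives entirely in memory (no register file)."""

    def __init__(self, memory, base):
        self.memory = memory
        self.base = base  # dword index of the first word of the frame

    def register_address(self, r):
        # Assumption: the two status words come first, then R0..R29.
        if not 0 <= r < NUM_REGISTERS:
            raise IndexError("register index out of range")
        return self.base + STATUS_DWORDS + r

    def read(self, r):
        # Reading a register is just a memory read at its alias address.
        return self.memory.dwords[self.register_address(r)]

    def write(self, r, value):
        # Writing a register is just a memory write, truncated to 64 bits.
        self.memory.dwords[self.register_address(r)] = value & (2**64 - 1)
```

Because a register is only an alias for a memory location, any address that names a register can also be the target of an ordinary memory operation, which is what the later forwarding slides rely on.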

Lightweight Multithreading

  • Thread creation is fast and inexpensive: a single instruction

    • Contrast with pthread creation, which requires kernel intervention and as many as tens of thousands of instructions

  • Unbounded Multithreading

    • Threads are part of the memory system rather than the processor state.

    • “Unlimited number” of threads per processor.

    • Many opportunities for issuing an instruction.

  • Ultra-lightweight Processing

    • Unbounded multithreading requires low-overhead thread management and synchronization

    • Processing at the memory bank gives greater data bandwidth and low overhead

Heterogeneous Architecture

  • Issues an instruction from the ready threads on each clock cycle

  • Architectural support for low overhead thread management
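To illustrate why issuing from ready threads each cycle hides latency, here is a toy cycle-level model in Python. It is a sketch under simplifying assumptions that are not from the slides: every instruction stalls its thread for a fixed `latency`, at most one instruction issues per cycle, and ready threads are picked in round-robin order. The function name `simulate` is hypothetical.

```python
from collections import deque


def simulate(num_threads, instrs_per_thread, latency):
    """Return (total_cycles, utilization) for a toy single-issue processor.

    Assumption: after issuing, a thread cannot issue again until
    `latency` further cycles have passed (e.g. a memory access).
    """
    remaining = [instrs_per_thread] * num_threads
    ready_at = [0] * num_threads          # earliest cycle each thread may issue
    queue = deque(range(num_threads))     # round-robin order
    total = num_threads * instrs_per_thread
    cycle = 0
    issued = 0
    while issued < total:
        # Scan the queue once, issuing from the first ready thread found.
        for _ in range(len(queue)):
            tid = queue.popleft()
            if ready_at[tid] <= cycle:
                remaining[tid] -= 1
                issued += 1
                ready_at[tid] = cycle + 1 + latency
                if remaining[tid] > 0:
                    queue.append(tid)
                break
            queue.append(tid)             # not ready, try the next thread
        cycle += 1
    return cycle, issued / cycle


if __name__ == "__main__":
    print(simulate(1, 4, 3))   # one thread: most cycles are idle
    print(simulate(4, 4, 3))   # enough threads to cover the latency
```

In this model a single thread leaves the processor idle while it waits, whereas with as many threads as the stall length plus one, an instruction issues on every cycle.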

Figure: Lightweight Processor Chip (LPC)

Extended Memory Semantics (EMS)

Figure: memory dword layout, with 64 bits of data/metadata and one extension bit

  • The memory subsystem is built from 65-bit dwords

    • 64 bits of data

    • 1 extension bit (1: dword is full, 0: dword is empty)

  • Extends the Cray MTA full/empty (E/F) bits

    • Full/Empty: Contains data or not

    • Extra states: Metadata can contain frame pointer

  • Same semantics apply to thread registers
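As a small illustration of the 65-bit dword, the following Python sketch packs 64 bits of data together with one extension bit. Placing the extension bit above the data (as bit 64) is an assumption made for the example; the slides do not specify the physical layout, and the helper names are hypothetical.

```python
DATA_BITS = 64
DATA_MASK = (1 << DATA_BITS) - 1


def pack(data, ext):
    """Combine 64 bits of data with one extension bit into a 65-bit value."""
    # Assumption: the extension bit is stored as bit 64, above the data.
    return ((ext & 1) << DATA_BITS) | (data & DATA_MASK)


def unpack(dword):
    """Split a 65-bit value back into (data, extension_bit)."""
    return dword & DATA_MASK, (dword >> DATA_BITS) & 1


def is_full(dword):
    """Per the slide: extension bit 1 means full, 0 means empty."""
    return unpack(dword)[1] == 1
```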

Single Producer/Consumer on EMS

  • LWP behavior for a load_fe when location A is empty:

    • Location A changes state to “FVE: forward value, leave empty”

    • Content of A is the target address of the forward operation (all registers also have a memory address).

Completing the Load

  • How does the LWP complete the load_fe?

    • A store_ef arrives at A

    • The data associated with the store is returned to T2:R2, which completes the load_fe

    • Location A changes to the empty state.
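The single producer/consumer handshake on the last two slides can be sketched as a small state machine in Python. The operation names `load_fe` and `store_ef` and the FVE state come from the slides; the class `EmsMemory`, the dictionary-based storage, and the treatment of the register T2:R2 as a plain address are assumptions for illustration. Cases outside the single producer/consumer scenario are rejected here rather than modelled.

```python
from enum import Enum


class State(Enum):
    EMPTY = "empty"
    FULL = "full"
    FVE = "forward value, leave empty"


class EmsMemory:
    """Toy model of locations that carry a state alongside their content."""

    def __init__(self):
        self.state = {}    # address -> State (missing means EMPTY)
        self.content = {}  # address -> data, or a forward target when FVE

    def state_of(self, addr):
        return self.state.get(addr, State.EMPTY)

    def _deliver(self, target, value):
        # Registers also have memory addresses, so delivery is a plain write.
        self.content[target] = value
        self.state[target] = State.FULL

    def load_fe(self, addr, target):
        """Load when full, leave empty; the result goes to `target`."""
        st = self.state_of(addr)
        if st is State.FULL:
            self._deliver(target, self.content[addr])
            self.state[addr] = State.EMPTY
        elif st is State.EMPTY:
            # No data yet: remember where the value must be forwarded.
            self.state[addr] = State.FVE
            self.content[addr] = target
        else:
            raise NotImplementedError("second waiter: needs an EMS handler")

    def store_ef(self, addr, value):
        """Store when empty, leave full (or forward to a waiting load)."""
        st = self.state_of(addr)
        if st is State.EMPTY:
            self.content[addr] = value
            self.state[addr] = State.FULL
        elif st is State.FVE:
            # Complete the pending load_fe, then leave the location empty.
            self._deliver(self.content[addr], value)
            self.state[addr] = State.EMPTY
        else:
            raise NotImplementedError("store to a full word: needs an EMS handler")
```

A short usage example, with made-up addresses: calling `load_fe(100, 200)` on an empty location 100 leaves it in the FVE state holding 200; a later `store_ef(100, 7)` writes 7 to address 200 and returns location 100 to empty.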

A More Complex Situation

  • Consider a multiple producer/consumer problem such as locks.

    • Multiple threads (more than 3) all attempt to acquire the lock.

    • Memory requests will be queued up at the target location

    • An EMS handler thread is needed to handle the bookkeeping
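The kind of bookkeeping a handler has to do for a contended lock can be sketched as a waiter queue in Python. This is only a toy model: the class `QueuedLock`, the FIFO service order, and the thread identifiers are assumptions, and the slides do not describe the handler's actual data structures.

```python
from collections import deque


class QueuedLock:
    """Toy model of a lock word with requests queued at the target location."""

    def __init__(self):
        self.held_by = None     # thread id of the current owner, if any
        self.waiters = deque()  # pending requests, assumed FIFO

    def acquire(self, thread_id):
        """Return True if the lock is granted now, False if the request is queued."""
        if self.held_by is None:
            self.held_by = thread_id
            return True
        self.waiters.append(thread_id)
        return False

    def release(self, thread_id):
        """Release the lock and hand it to the next waiter; return the new owner."""
        if self.held_by != thread_id:
            raise RuntimeError("lock released by a thread that does not hold it")
        self.held_by = self.waiters.popleft() if self.waiters else None
        return self.held_by
```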

EMS Handler Overhead

  • Invoking an EMS handler

    • Needed for synchronized memory operations beyond the hardware-supported single producer/consumer scenario

  • Overhead

    • Creating the handler threads

    • To queue up memory requests, handlers need to spin on the target memory address to get exclusive access

    • Significant overhead in LWP CPU time, NoC traffic, and memory bandwidth

  • How to alleviate the overhead?

Ultra-Lightweight Processor

  • Offloads work from the LWP

  • Handles thread synchronization and management, and complex atomic memory operations

  • Simple design, minimal circuitry

  • Sits at the memory bank: greatest data bandwidth (wide-word) and no NoC traffic when accessing memory

  • Multithreaded

Large-Scale System


Outline

  • Heterogeneous Lightweight Multithreaded Architecture

  • Simulation environments, benchmarks and results

  • Conclusions and future work

Simulation Environment

  • DimC: Diminished C

    • An extension of ANSI C

    • Exposes low-level architectural features

    • Supports lightweight multithreading

  • SALT: Simulator for the Analysis of LWP Timings

    • Contains LWPs, ULWPs, NoC, and memory subsystems

Benchmark Suite

  • Two categories of irregular problems:

  • Complicated control structures such as recursion

    • Such programs can achieve decent performance on conventional architectures, but only with great effort

    • Do not necessarily invoke an EMS handler or the ULWP

    • N-Queens, Fibonacci

  • Complicated control structures and dynamic data structures

    • Very hard to parallelize effectively on conventional SMPs.

    • EMS handler or ULWP support is necessary

    • Competing agents, SAT solver kernel

N-Queens

  • Find all solutions to the problem of placing N queens on an N×N chessboard such that no queen can attack another

  • An irregular problem with dynamic parallel recursion

  • Thread behavior is hard to predict.
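The recursive structure of the benchmark can be illustrated with a short sequential Python sketch that counts solutions. This is not the authors' DimC code; it is a standard backtracking formulation, and the comment marks the point where, on the architecture described here, each recursive call could plausibly become its own lightweight thread.

```python
def count_solutions(n, row=0, cols=0, diag1=0, diag2=0):
    """Count placements of n non-attacking queens, one queen per row.

    cols, diag1 and diag2 are bitmasks of the columns and diagonals
    already attacked by queens placed in earlier rows.
    """
    if row == n:
        return 1
    total = 0
    for col in range(n):
        c = 1 << col
        d1 = 1 << (row + col)           # anti-diagonal index
        d2 = 1 << (row - col + n - 1)   # main-diagonal index, shifted to be >= 0
        if cols & c or diag1 & d1 or diag2 & d2:
            continue
        # Each call below is an independent subproblem. How many of them
        # exist depends on the board state, which is why thread behavior
        # is hard to predict when they are spawned as parallel threads.
        total += count_solutions(n, row + 1, cols | c, diag1 | d1, diag2 | d2)
    return total


if __name__ == "__main__":
    print(count_solutions(8))
```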

Competing Agents

  • Multiple agents attempt to update a shared memory location simultaneously

  • Each agent is implemented by a single thread. All threads are evenly distributed over four LWPs inside a single LPC

  • Complicated control structures and dynamic data structures

  • Uses separate synchronized loads and stores

  • Purpose: to characterize the effectiveness of the ULWP in reducing the cost of synchronization
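The access pattern of this benchmark can be imitated in Python with ordinary OS threads standing in for lightweight threads. The sketch below is an assumption-laden stand-in, not the benchmark itself: `FullEmptyWord` emulates a full/empty word with a condition variable, and each agent updates the shared location with a separate synchronized load (take the value, leave the word empty) followed by a synchronized store (write the value, leave the word full).

```python
import threading


class FullEmptyWord:
    """Software emulation of one memory word with a full/empty state."""

    def __init__(self, value=0):
        self._value = value
        self._full = True
        self._cond = threading.Condition()

    def load_fe(self):
        """Wait until the word is full, take its value, leave it empty."""
        with self._cond:
            while not self._full:
                self._cond.wait()
            self._full = False
            self._cond.notify_all()
            return self._value

    def store_ef(self, value):
        """Wait until the word is empty, write the value, leave it full."""
        with self._cond:
            while self._full:
                self._cond.wait()
            self._value = value
            self._full = True
            self._cond.notify_all()


def run_agents(n_agents, updates_per_agent):
    """Run the agents to completion and return the final shared value."""
    word = FullEmptyWord(0)

    def agent():
        for _ in range(updates_per_agent):
            v = word.load_fe()     # exclusive access: the word is now empty
            word.store_ef(v + 1)   # publish the update: the word is full again

    threads = [threading.Thread(target=agent) for _ in range(n_agents)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return word.load_fe()
```

While one agent holds the value, every other agent blocks on the empty word, so no update is lost; the cost of all that waiting is the synchronization overhead the benchmark is designed to expose.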

SAT Solver/zChaff

  • SAT: the Boolean satisfiability problem (from propositional logic)

    • Fundamental to many problems in automated reasoning, CAD, CAM, machine vision, databases, robotics, IC design, computer architecture, and network design

    • Given a Boolean formula (usually in CNF), check whether there is an assignment of truth values to its variables that makes the formula evaluate to true

    • For example, for the three-clause CNF formula shown on the slide, if x1 is true and x3 is false, then all three clauses are satisfied, regardless of the value of x2

  • zChaff, a modern variant of the DPLL algorithm, is used to implement the SAT solver
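To make the satisfiability check concrete, here is a small Python sketch. The formula in it is hypothetical: the slide's own formula is not preserved in this transcript, so `FORMULA` was chosen only so that it behaves like the example (three clauses, all satisfied when x1 is true and x3 is false). The solver is a plain exhaustive search, which is neither DPLL nor zChaff.

```python
from itertools import product

# Hypothetical CNF formula: (x1 or x2) and (not x3 or x2) and (x1 or not x3).
# A positive integer k stands for xk, a negative integer -k for "not xk".
FORMULA = [[1, 2], [-3, 2], [1, -3]]


def evaluate(formula, assignment):
    """Return True if every clause has at least one true literal."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in formula
    )


def brute_force_sat(formula):
    """Try every assignment; return a satisfying one, or None if none exists."""
    variables = sorted({abs(lit) for clause in formula for lit in clause})
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if evaluate(formula, assignment):
            return assignment
    return None
```

Exhaustive search visits up to 2^n assignments for n variables, which is why practical solvers prune the search space instead of enumerating it.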

N-Queens

  • Successfully exploits all the available parallelism

    • Completely dynamic, ideal speedup

    • Saturation is due only to the small data set

  • Good performance can be achieved on conventional SMPs, but only with great extra effort

Competing Agents

  • The EMS handler is the bottleneck under high contention

  • The heterogeneous architecture can achieve unbounded scalability

  • High contention is no longer a problem in the heterogeneous architecture

SAT Solver/zChaff on Conventional SMPs

  • The parallel implementation leads to performance degradation

  • The more processors, the worse the performance

  • Very hard to achieve good performance on conventional SMPs

Data from "Parallel Multithreaded Satisfiability Solver: Design and Implementation" by Yulik Feldman et al., Intel

SAT Solver/zChaff on Heterogeneous Architecture

  • Ideal speedup

  • Saturation is due only to the small data set

  • Successfully exploits all the available parallelism

Figure: speedup over the serial version


Outline

  • Heterogeneous Lightweight Multithreaded Architecture

  • Simulation environments, benchmarks and results

  • Conclusions and future work


Conclusions

  • The Heterogeneous Lightweight Multithreaded Architecture

    • Is a good solution for irregular problems that are hard or impossible to parallelize on conventional SMPs

    • Has very low overhead for context switching and synchronization

    • Can successfully hide latencies and contention

    • Can provide unbounded multithreading and scalability

    • Can exploit all the available parallelism in an irregular problem

Future Work

  • Provide standard language support

  • Benchmark suites

  • Large-scale system performance

  • Comparison with conventional large-scale systems



  • This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under its Contract No. NBCH3039003.

  • University of Notre Dame

  • Caltech/JPL

  • Cray