
Multithreaded architectures



  1. Multithreaded architectures

  2. Thread-level parallelism
  • Instruction-level parallelism exploits very fine-grained independent instructions
  • Thread-level parallelism is explicitly represented in the program by the use of multiple threads of execution that are inherently parallel (a short sketch follows below)
  • Goal: use multiple instruction streams to improve either (or both)
    • Throughput of computers that run many programs
    • Execution time of multi-threaded programs
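To make the idea concrete, here is a minimal C++ sketch (not from the slides; the array-sum workload, thread count, and names are illustrative assumptions). Each std::thread is an explicit, independent instruction stream working on its own slice of the data, so the threads are inherently parallel:

// Minimal sketch: thread-level parallelism expressed explicitly with
// std::thread. Each thread sums a disjoint slice of the array, so the
// per-thread work is independent.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<int> data(1'000'000, 1);
    const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<long long> partial(nthreads, 0);
    std::vector<std::thread> workers;

    const std::size_t chunk = data.size() / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t lo = t * chunk;
        std::size_t hi = (t + 1 == nthreads) ? data.size() : lo + chunk;
        // Each instruction stream works on its own slice: no dependencies.
        workers.emplace_back([&, lo, hi, t] {
            partial[t] = std::accumulate(data.begin() + lo, data.begin() + hi, 0LL);
        });
    }
    for (auto& w : workers) w.join();

    long long total = std::accumulate(partial.begin(), partial.end(), 0LL);
    std::cout << "sum = " << total << '\n';  // prints 1000000
}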

  3. Thread-level parallelism
  • Thread-level parallelism potentially allows huge speedups
    • It requires no complex superscalar architecture that scales poorly
    • It has no particular requirement for very complex compilers, as VLIW does
  • Instead, the burden of identifying and exploiting the parallelism falls mostly on the programmer
    • Programming with multiple threads is much more difficult than sequential programming
    • Debugging parallel programs is incredibly challenging (see the data-race sketch after this list)
  • But it's pretty easy to build a big, fast parallel computer
  • The main reason we have not had widely used parallel computers in the past is that they are too difficult (expensive), time consuming (expensive), and error prone to program
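As an illustration of why multithreaded programming is error prone, here is a short C++ sketch (an assumed example, not from the slides) of a classic data race: two threads increment a shared counter, and the unsynchronized version usually loses updates while the std::atomic version does not:

// Two threads increment a shared counter. The plain long has a data race:
// ++racy is a load, add, store sequence that can interleave across threads,
// so updates are lost and the result is usually less than 2000000.
#include <atomic>
#include <iostream>
#include <thread>

int main() {
    long racy = 0;              // unsynchronized: a data race (undefined behavior)
    std::atomic<long> safe{0};  // atomic read-modify-write: correct

    auto work = [&] {
        for (int i = 0; i < 1'000'000; ++i) {
            ++racy;  // races with the other thread
            ++safe;  // atomic increment, no race
        }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();

    std::cout << "racy = " << racy << " (often < 2000000)\n";
    std::cout << "safe = " << safe.load() << " (always 2000000)\n";
}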

  4. Multithreading in a processor core
  • Find a way to "hide" true data dependency stalls, cache miss stalls, and branch stalls by finding instructions (from other threads) that are independent of the stalling instructions
  • Multithreading – increases the utilization of resources on a chip by allowing multiple threads to share the functional units of a single processor
  • The processor must duplicate the state hardware for each thread – a separate register file, PC, instruction buffer, and store buffer per thread (a per-thread context sketch follows below)
  • The caches, TLBs, and branch predictors can be shared (although the miss rates may increase if they are not sized accordingly)
  • Memory can be shared through the virtual memory mechanisms
  • The hardware must support efficient thread context switching
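A simplified C++ sketch (an assumed software model, not a real design) of which state a multithreaded core replicates per hardware thread and which structures it can share:

// Per-thread versus shared state in a simplified multithreaded core model.
#include <array>
#include <cstdint>

constexpr int kThreads = 4;

// Replicated per hardware thread: the architectural state.
struct ThreadContext {
    std::array<std::uint64_t, 32> regfile{};  // separate register file
    std::uint64_t pc = 0;                     // separate program counter
    std::array<std::uint32_t, 8> ibuf{};      // separate instruction buffer
    std::array<std::uint64_t, 8> stbuf{};     // separate store buffer
};

// Shared among threads: caches, TLBs, branch predictors (miss rates may
// rise if these are not sized for the extra threads).
struct SharedCoreState {
    // icache, dcache, tlbs, branch_predictor ... (shared structures)
};

struct MultithreadedCore {
    std::array<ThreadContext, kThreads> contexts;  // one context per thread
    SharedCoreState shared;
    int active = 0;  // hardware switches 'active' cheaply between contexts
};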

  5. Types of Multithreading
  • Fine-grain – switch threads on every instruction issue
    • Round-robin thread interleaving, skipping stalled threads (a thread-select sketch follows below)
    • The processor must be able to switch threads on every clock cycle
    • Advantage – can hide throughput losses that come from both short and long stalls
    • Disadvantage – slows down the execution of an individual thread, since a thread that is ready to execute without stalls is delayed by instructions from other threads
  • Coarse-grain – switch threads only on costly stalls (e.g., L3 cache misses)
    • Advantages – thread switching doesn't have to be essentially free, and it is much less likely to slow down the execution of an individual thread
    • Disadvantage – limited, due to pipeline start-up costs, in its ability to overcome throughput loss
      • The pipeline must be flushed and refilled on thread switches
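A small C++ sketch (an assumed model, not from the slides) of the fine-grain thread-select policy described above: round-robin over the hardware threads each cycle, skipping any thread that is stalled:

// Fine-grain thread select: round-robin each cycle, skipping stalled threads.
#include <array>
#include <iostream>

constexpr int kThreads = 4;

struct Thread {
    int stall_cycles = 0;  // >0 while waiting on e.g. a cache miss
};

// Returns the thread to issue from this cycle, or -1 if all are stalled.
int selectThread(const std::array<Thread, kThreads>& threads, int last) {
    for (int i = 1; i <= kThreads; ++i) {
        int cand = (last + i) % kThreads;                  // round-robin order
        if (threads[cand].stall_cycles == 0) return cand;  // skip stalled threads
    }
    return -1;  // every thread stalled: the pipeline bubbles this cycle
}

int main() {
    std::array<Thread, kThreads> threads{};
    threads[1].stall_cycles = 3;  // pretend thread 1 missed in the cache
    int last = 0;
    for (int cycle = 0; cycle < 8; ++cycle) {
        int pick = selectThread(threads, last);
        if (pick >= 0) last = pick;
        std::cout << "cycle " << cycle << ": issue from thread " << pick << '\n';
        for (auto& t : threads)
            if (t.stall_cycles > 0) --t.stall_cycles;  // stalls drain over time
    }
}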

  6. Multithreaded Example: Sun's Niagara (UltraSPARC T1)
  • Eight fine-grain multithreaded, single-issue, in-order cores (no speculation, no dynamic branch prediction)
  [Block diagram: eight MT SPARC pipes connected through a crossbar to a 4-way banked L2$, memory controllers, and shared I/O functions]

  7. Niagara Integer Pipeline
  • Cores are simple (single-issue, 6-stage, no branch prediction), small, and power-efficient
  [Pipeline diagram: Fetch → Thread Select → Decode → Execute → Memory → WB; per-thread resources replicated four ways (PC logic x4, instruction buffers x4, register file x4, store buffers x4); the thread select mux is driven by instruction type, cache misses, traps & interrupts, and resource conflicts; I$/ITLB feed fetch, and D$/DTLB plus the crossbar interface serve the memory stage. From MPR, Vol. 18, #9, Sept. 2004]

  8. Simultaneous Multithreading (SMT)
  • A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled (superscalar) processor to exploit both program ILP and thread-level parallelism (TLP)
    • Most superscalar processors have more machine-level parallelism than most programs can effectively use (i.e., than they have ILP)
  • With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to dependencies among them
    • Need separate rename tables (reorder buffers) for each thread (a rename-map sketch follows below)
    • Need the capability to commit from multiple threads (i.e., from multiple reorder buffers) in one cycle
  • Intel's recent desktop and laptop processors mostly use SMT
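A C++ sketch (an assumed model; register counts and names are illustrative, and reclaiming old mappings at commit is omitted) of per-thread rename tables drawing from one shared physical register pool, which is what lets instructions from independent threads coexist in flight without false dependencies:

// Per-thread register renaming in a simplified SMT core model.
#include <array>
#include <optional>
#include <vector>

constexpr int kThreads = 2;
constexpr int kArchRegs = 32;
constexpr int kPhysRegs = 128;

struct RenameState {
    // One rename table per hardware thread: arch reg -> phys reg.
    std::array<std::array<int, kArchRegs>, kThreads> table{};
    std::vector<int> free_list;  // shared physical register pool

    RenameState() {
        for (int t = 0; t < kThreads; ++t)
            for (int a = 0; a < kArchRegs; ++a)
                table[t][a] = t * kArchRegs + a;  // initial 1:1 mapping
        for (int p = kThreads * kArchRegs; p < kPhysRegs; ++p)
            free_list.push_back(p);  // the rest start out free
    }

    // Rename a destination register for 'thread'; std::nullopt = rename stalls.
    std::optional<int> renameDest(int thread, int arch_reg) {
        if (free_list.empty()) return std::nullopt;
        int phys = free_list.back();
        free_list.pop_back();
        table[thread][arch_reg] = phys;  // later readers of arch_reg get phys
        return phys;
    }
};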

  9. Simultaneous Multithreading (SMT)
  • The hard part of building a processor with SMT is not designing the SMT hardware
    • SMT hardware relies on the parallel instruction execution of out-of-order processors
    • It's very simple: if two instructions belong to different threads, then there is no dependency between them
  • The hard part of building SMT processors is
    • Designing and building the underlying out-of-order superscalar processor architecture
    • Testing and debugging the processor in SMT mode
      • The parallelism is so fine-grained that it is hard to investigate, and instructions from different threads can execute in any order

  10. Threading on a 4-way Superscalar: Example
  [Figure: issue slots (horizontal) versus time (vertical) for coarse MT, fine MT, and SMT, showing how instructions from threads A, B, C, and D fill the four issue slots under each scheme]

  11. Multicore Xbox 360 – "Xenon" processor
  • Aim is to provide game developers with a balanced and powerful platform
  • Three SMT cores, each with 32KB L1 D-cache & I-cache; 1MB unified L2 cache
    • Two SMT threads per core
    • 165M transistors total
    • 3.2 GHz, near-PowerPC ISA
    • 2-issue, 21-stage pipeline, with 128 128-bit registers
    • Weak branch prediction – supported by software hinting
    • In-order instruction execution
    • Narrow cores – 2 INT units, 2 128-bit VMX SIMD units, 1 of everything else
  • An ATI-designed 500MHz GPU, 512MB of DDR3 DRAM
    • 337M transistors, 10MB framebuffer
    • 48 pixel shader cores, each with 4 ALUs

  12. Xenon Diagram
  [System block diagram: three cores (Core 0, Core 1, Core 2), each with L1D and L1I caches, sharing a 1MB unified L2; XMA decoder, SMC, and BIU/IO interface on the CPU side; GPU with 3D core, memory controllers MC0 and MC1, and 10MB EDRAM, attached to 512MB DRAM; video out through an analog chip; I/O: DVD, HDD port, front USBs (2), wireless MU ports (2 USBs), rear USB (1), Ethernet, IR, audio out, flash, and systems control]

  13. Xenon Diagram
