
Multithreaded Processors



Presentation Transcript


  1. Multi-Threaded Processor Architectures: The Tera MTA. Frank Casilio, Computer Engineering, May 15, 1997

  2. Problems with Multiprocessors • Cache Coherence • Writes To Memory • Memory Latency • Context Switching Time • Communication/Synchronization Latency • Poor Programming Model

  3. Motivation • Reduce/Tolerate Memory Latency • General Purpose Machine • Scalability • Shared Memory • Simpler Programming Model

  4. Typical Ways To Reduce Latency • On-Chip Cache (Shortens Round Trip To Memory) • Fast Buses & Networks • Hardware Synchronization • Prefetching

  5. Multi-Threading: The Concept • Support For Multiple Concurrent Hardware Contexts • Swap Contexts During Latencies • Tolerates Latency Instead Of Reducing It • Experimental Systems Have Existed Since The 50’s • Only 2 Commercial Systems Ever Produced: HEP and the Tera MTA

  6. Parameters That Affect Efficiency • Number Of Contexts Supported • Switching Overhead • Run Length (Granularity) • Average Latency To Be Hidden
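The four parameters on this slide combine in a simple analytic model of multithreaded processor efficiency (a saturation model in the style of Saavedra-Barrera's analysis). The sketch below is illustrative and not from the slides; the function name and the closed-form expressions are assumptions of that simple model:

```python
def efficiency(contexts, run_length, switch_cost, latency):
    """Fraction of cycles spent doing useful work in a multithreaded pipeline.

    Simple saturation model: each context computes for run_length cycles,
    pays switch_cost cycles to swap, then waits latency cycles for memory.
    With enough contexts the latency is fully hidden and only the switch
    overhead is lost; with too few, the pipeline sits idle part of the time.
    """
    # Number of contexts needed to fully cover the latency of one context:
    saturation = (run_length + latency + switch_cost) / (run_length + switch_cost)
    if contexts >= saturation:
        # Saturated: latency fully hidden, only switch overhead remains.
        return run_length / (run_length + switch_cost)
    # Linear region: each context contributes run_length useful cycles
    # out of every (run_length + latency + switch_cost) cycle period.
    return contexts * run_length / (run_length + latency + switch_cost)
```

For example, with a run length of 4 cycles, zero switch cost (as the Tera claims), and a 70-cycle memory latency, about 19 contexts are enough to reach full utilization.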

  7. Switching Theory • Determines How Often Contexts Switch • Directly Related To Cost • Two Different Types: Fine Grained and Coarse Grained

  8. Fine Grained Switching • Switches Contexts Every Cycle • Tolerates Many Long-Latency Operations • Requires More Contexts, Depending On Workload Requirements • Can Simplify Overall Processor Complexity

  9. Coarse Grained Switching • Switches Contexts After A Couple Of Cycles • Has Problems With Sporadic Latencies • Requires Fewer Contexts • Requires More Complex Processors

  10. The Tera MTA • First Commercial Multithreaded Machine Since 1978 • Uniform Shared Memory • Fine Grained Architecture • Scalable: Direct Relationship Between PE’s & Throughput

  11. The Tera MTA Cont’d • Toroidal Interconnection • 16-256 Processor Versions • 12 Million Dollar Base System

  12. Processor Characteristics • Support For 128 Threads • 16 Protection Domains • 0 Context Switching Overhead!!! • 1 GFLOP Peak Performance • 333 MHz Nominal Speed

  13. Processor Characteristics Cont’d • 3 Operations Per Instruction: 1 Memory Reference, 1 Arithmetic Operation, 1 Control (i.e., Branch) • Load-Store Architecture • 3 Addressing Modes • 31 64-bit GPR’s • 6 KW Of Power Dissipation Per Processor
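The three-operations-per-instruction format means each instruction word carries one memory, one arithmetic, and one control operation side by side, LIW style. The field widths and function names below are purely illustrative stand-ins, not the MTA's actual 64-bit encoding, which carries full operand specifiers:

```python
def pack_liw(mem_op, alu_op, ctl_op):
    """Pack three opcode stand-ins (8 bits each, illustrative only) into
    one wide instruction word: memory | arithmetic | control."""
    assert all(0 <= op < 256 for op in (mem_op, alu_op, ctl_op))
    return (mem_op << 16) | (alu_op << 8) | ctl_op

def unpack_liw(word):
    """Split a packed word back into its three operation fields, in the
    order the hardware would issue them together in one cycle."""
    return (word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF
```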

  14. Interconnection Network • 3-D Torus Contains 3p/2 Nodes • Packet Switching • 164-Bit Packets, 64 Bits Of Which Are Data • 3 Cycles Of Latency Per Node • Messages Are Assigned Random Priorities • 2.67 GB/s Bandwidth In Each Direction • 2 HIPPI Channels / Processor For Net Connection

  15. Memory • Either 2p or 4p Units, Interleaved 64 Ways • 8, 16, 32 and 64 Bit Addressable • 4 Bits Of Access State Per Word For Synchronization • Memory Units Equipped With Error Correcting Code • Memory Addresses Are Randomized Across All Banks • 16 MB DRAM Chips
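The access-state bits include a full/empty tag that lets loads and stores synchronize directly in memory: a synchronizing store waits until the word is empty and marks it full, and a synchronizing load waits until it is full and marks it empty. The class below is a software model of that behavior using ordinary locks, not the hardware mechanism; the method names `write_ef` and `read_fe` are illustrative:

```python
import threading

class FEWord:
    """Toy model of one memory word with a full/empty tag bit."""

    def __init__(self):
        self._cv = threading.Condition()
        self._full = False
        self._value = None

    def write_ef(self, value):
        """Wait until empty, store the value, set the tag to full."""
        with self._cv:
            while self._full:
                self._cv.wait()
            self._value, self._full = value, True
            self._cv.notify_all()

    def read_fe(self):
        """Wait until full, set the tag to empty, return the value."""
        with self._cv:
            while not self._full:
                self._cv.wait()
            self._full = False
            self._cv.notify_all()
            return self._value
```

A single `FEWord` thus acts as a one-element producer/consumer buffer with no separate lock variable, which is why the slides list it under synchronization support.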

  16. Input / Output • 20p MB/s In Each Direction • At Least p/16 Disk Arrays Are Required • Maximum Strategy Gen5 XL RAID, Sustained Bandwidth Of 130 MB/s • System Capacity Of 300p GB

  17. Operating System • Distributed Parallel Version Of Unix • Highly Concurrent Version Of Berkeley Unix • Allows Systems To Run p Tasks Truly In Parallel • Processes Are Broken Up Into Tasks By The OS • Streams Are Dynamically Created Without OS Intervention • Two-Tier Scheduler Provides Better Resource Allocation: PL Scheduler and PB Scheduler

  18. Software / Languages • Automatic Parallelization Of C, C++ & Fortran By The Compiler • Implicit And Explicit Parallelism Is Allowed • High Degree Of Cray Compatibility • Easy To Program Because Of The Architecture

  19. System Performance • 3.84-12.8 Times Performance Of Cray T90/32 • 1K x 1K Matrix Multiply in 50 ms • Integer Sort of 100M Keys in 36 ms

  20. Conclusion • Proven Effectiveness • Logical Step For Multiprocessor Computers • Still Very Pricey • Allows General Purpose Workloads • Scalable • Shared Memory

  21. Questions?

  22. Instruction Pipeline

  23. Breakdown Of A Task (diagram: a task divides into teams, and each team into virtual processors)

  24. Deciding The Number Of Contexts
