
Opportunities for Hardware Multithreading in Microprocessors and Microcontrollers




Presentation Transcript


  1. Opportunities for Hardware Multithreading in Microprocessors and Microcontrollers Theo Ungerer Systems and Networking University of Augsburg ungerer@informatik.uni-augsburg.de http://www.informatik.uni-augsburg.de/sik/

  2. Basic Principle of Multithreading Diagram: four hardware threads, each with its own register set, program counter (PC), and processor status register (PSR); a thread pointer selects the active context.
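
  The replicated context can be pictured as one small structure per hardware thread. A minimal sketch in C (the names and the 32-register count are illustrative assumptions, not the layout of any particular processor):

    #include <stdint.h>

    #define NUM_THREADS 4   /* four hardware thread slots, as in the diagram */
    #define NUM_REGS    32  /* assumed general-purpose register count */

    /* One hardware thread context: register set, PC, and PSR. */
    typedef struct {
        uint32_t regs[NUM_REGS];  /* per-thread register set */
        uint32_t pc;              /* per-thread program counter */
        uint32_t psr;             /* per-thread processor status register */
    } hw_context;

    /* The multithreaded core replicates the context and keeps a thread
       pointer selecting which context feeds the pipeline this cycle. */
    typedef struct {
        hw_context ctx[NUM_THREADS];
        unsigned   thread_ptr;    /* index of the active context */
    } mt_core;

    /* A context switch only moves the thread pointer; no registers are
       saved or restored, which is why it can take zero cycles. */
    static inline void switch_to(mt_core *core, unsigned tid) {
        core->thread_ptr = tid % NUM_THREADS;
    }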

  3. Multithreading in High-Performance Processors Hardware multithreading is the ability to pursue more than one thread within a processor pipeline. Typical features: multiple register sets, fast context switching. Main objective: performance gain by latency hiding for multithreaded workloads. Multithreading in high-performance microprocessors: • IBM RS64 IV (SStar) • Sun UltraSPARC V • Intel Xeon™

  4. Outline of the Presentation • Motivation • State-of-the-art • Multithreading • Multithreading for throughput increase • Multithreading for power reduction • Multithreading for embedded real-time systems • Conclusions & Research Opportunities

  5. Today's Multiple-Issue Processors Exploitation of instruction-level parallelism by a long instruction pipeline and by the superscalar or the VLIW/EPIC technique.

  6. Problem: Low Resource Utilization by Sequential Programs Diagram: issue slots plotted over processor cycles; losses arise from empty issue slots, either horizontal (some slots of a cycle unused, e.g. 1, 2, or 3 of 4) or vertical (an entire 4-slot cycle wasted).
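
  The two kinds of loss can be made concrete by tallying a per-cycle issue trace. A small sketch assuming a 4-wide issue processor and an illustrative trace resembling the figure (cycles issuing 3, 2, 0, and 1 instructions):

    #include <stdio.h>

    #define ISSUE_WIDTH 4

    /* Horizontal loss: unused slots in a cycle that issues something.
       Vertical loss: all slots empty in a cycle. */
    static void count_losses(const int issued[], int cycles,
                             int *horizontal, int *vertical) {
        *horizontal = 0;
        *vertical   = 0;
        for (int c = 0; c < cycles; c++) {
            if (issued[c] == 0)
                *vertical += ISSUE_WIDTH;               /* whole cycle wasted */
            else
                *horizontal += ISSUE_WIDTH - issued[c]; /* partially filled   */
        }
    }

    int main(void) {
        int issued[] = { 3, 2, 0, 1 };  /* instructions issued per cycle */
        int h, v;
        count_losses(issued, 4, &h, &v);
        printf("horizontal loss = %d, vertical loss = %d\n", h, v);
        return 0;  /* prints: horizontal loss = 6, vertical loss = 4 */
    }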

  7. Outline of the Presentation • Motivation • State-of-the-art • Multithreading • Multithreading for throughput increase • Multithreading for power reduction • Multithreading for embedded real-time systems • Conclusions & Research Opportunities

  8. Multithreading • Two basic multithreading techniques: • Interleaved Multithreading • Block Multithreading • Simultaneous Multithreading (SMT): • combines a wide-issue superscalar with multithreading, • issues instructions from several threads simultaneously.
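
  The two basic techniques differ mainly in their thread-selection policy: interleaved MT switches every cycle, block MT switches only on a long-latency event, and SMT selects per issue slot rather than per cycle. A minimal sketch of the first two policies (the round-robin order and the cache-miss trigger are simplifying assumptions):

    #include <stdbool.h>

    #define NUM_THREADS 4

    /* Interleaved multithreading: a different thread is selected every
       cycle, here in simple round-robin order. */
    static unsigned interleaved_next(unsigned current) {
        return (current + 1) % NUM_THREADS;
    }

    /* Block multithreading: stay with the current thread until it hits
       a long-latency event (e.g. a cache miss), then switch. */
    static unsigned block_next(unsigned current, bool long_latency_event) {
        return long_latency_event ? (current + 1) % NUM_THREADS : current;
    }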

  9. Basic Multithreading Techniques Diagram: pipeline occupancy for a single thread, interleaved MT, and block MT.

  10. SMT vs. CMP Diagram: issue-slot usage of a simultaneous multithreaded (SMT) processor compared with a chip multiprocessor (CMP).

  11. Characteristics of Multithreading • Latency utilization: the latencies that arise in the computation of a single instruction stream are filled by computations of another thread ⇒ throughput of multithreaded workloads is increased • Power reduction: less speculation is needed • Rapid context switching: appropriate for real-time applications

  12. Outline of the Presentation • Motivation • State-of-the-art • Multithreading • Multithreading for throughput increase • Multithreading for power reduction • Multithreading for embedded real-time systems • Conclusions & Research Opportunities

  13. Multithreading for Throughput Increase • Many research results with simulated SMT since 1995 • Some of our own research results: • Performance estimation of an SMT multimedia processor, • taking the transistor count and chip-space estimates of the models into account.

  14. Relevant Attributes for Rating Microprocessors • Performance • Resource requirement • Clock speed • Power consumption • Two tools: • Performance estimation tool • Transistor count and chip-space estimation tool

  15. Transistor Count and Chip-space Estimator • Vision: the resources of the baseline model should be adjusted such that the same chip space or the same transistor count is covered as in the new microarchitecture models. • We use an analytical method for memory-based structures like register files or internal queues and • an empirical method for logic blocks like control logic and functional units. • The half-feature size λ serves as the measure of length of a basic cell. • The estimator tool is available (also for SimpleScalar) at: http://www.informatik.uni-augsburg.de/lehrstuehle/info3/research/complexity/
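
  For the memory-based structures, a common first-order analytical model lets the width and height of each bit cell grow linearly with the number of ports and expresses the area in units of λ². The sketch below follows that model; the base cell size and per-port growth constants are illustrative assumptions, not the values used by the Augsburg estimator tool:

    /* First-order area estimate of a multi-ported register file or queue,
       in lambda^2 (half-feature size squared). */
    static double memory_area_lambda2(int entries, int bits_per_entry,
                                      int read_ports, int write_ports) {
        const double base_w   = 10.0;  /* base cell width, in lambda (assumed)  */
        const double base_h   = 10.0;  /* base cell height, in lambda (assumed) */
        const double per_port = 6.0;   /* extra wire pitch per port (assumed)   */
        int    ports  = read_ports + write_ports;
        double cell_w = base_w + per_port * ports;
        double cell_h = base_h + per_port * ports;
        return (double)entries * bits_per_entry * cell_w * cell_h;
    }

  Under this model the area of a structure such as the per-thread register file grows roughly quadratically with the port count, which is why chip-space cost matters when issue width and thread count scale the number of ports.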

  16. Execution-based Simulator: Baseline SMT Multimedia Processor Model

  17. Results of Performance and Hardware Cost Estimation • Demonstrated by two sets of models: (1) "maximum" processor models with an abundance of resources and (2) small processor models • The workload is an MPEG-2 decoder made multithreaded

  18. Simulation Parameters • Fixed parameters: • 1024-entry BTAC, gshare branch predictor (2 K 2-bit counters, 8-bit history, misprediction penalty 5 cycles) • 4-way set-associative D- and I-caches with 32-byte cache lines • 32 KB local on-chip RAM • 64-bit system bus, 4 MB main memory • Varied parameters: • 8–12 execution units • 256- and 32-entry reservation stations • 10 and 4 result buses • different D-cache sizes, D- and I-caches of 4 MB and 64 KB • Parameters varied with the number of threads: • 32 32-bit general-purpose registers and 40 rename registers (per thread) • 32- and 16-entry issue and retirement buffers (per thread) • fetch and decode bandwidth scaled with issue bandwidth and number of threads: 1x1 – 8x8

  19. Performance vs. Hardware Cost Estimation: Maximum Processor Models 4 MB I- and D-caches, 6 integer/multimedia units, 2 local load/store units

  20. Transistor Count and Chip Space Estimation of Maximum Processor Models

  21. Small Processor Models 64 KB I- and D-caches, 3 integer/multimedia units, 1 local load/store unit, 32-entry reservation stations, 16-entry issue and retirement buffers, 4 result buses, 2x4 fetch and decode bandwidth fixed

  22. Transistor Count and Chip Space Estimation of Small Processor Models

  23. Results • 4-threaded 8-issue SMT over a single-threaded 8-issue:
  maximum model: speedup 3, transistor increase 2%, chip-space increase 9%
  small model: speedup 1.5, transistor increase 9%, chip-space increase 27%
  • Commercial multithreaded processors: • Tera, MAJC, Alpha 21464, IBM Blue Gene, Sun UltraSPARC V • Network processors (Intel IXP, IBM PowerNP, Vitesse IQ2x00, Lextra, ...) • IBM RS64 IV: two-threaded block MT, reported 5% overhead • Intel Xeon™ (hyper-threading): two-threaded SMT, reported 5% overhead

  24. Outline of the Presentation • Motivation • State-of-the-art • Multithreading • Multithreading for throughput increase • Multithreading for power reduction • Multithreading for embedded real-time systems • Conclusions & Research Opportunities

  25. SMT for Reduction of Power Consumption • Observation: mispredictions cost energy • In today's superscalars, ~60% of the fetched and ~30% of the executed instructions are squashed • Idea: fill issue slots with less speculative instructions of other threads ⇒ simulations by Seng et al. 2000 show that ~22% less energy is consumed by using a power-aware scheduler
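
  One way to realize the idea is a fetch policy that prefers the least speculative ready thread each cycle. The sketch below uses the number of unresolved branches per thread as the speculation measure; this metric and the thread count are assumptions for illustration and not necessarily the policy of Seng et al.:

    #define NUM_THREADS 4

    /* Pick the thread to fetch from this cycle: among the ready threads,
       choose the one with the fewest unresolved branches, i.e. the least
       speculative one, so fewer fetched instructions risk being squashed. */
    static int power_aware_fetch(const int unresolved_branches[NUM_THREADS],
                                 const int ready[NUM_THREADS]) {
        int best = -1;
        for (int t = 0; t < NUM_THREADS; t++) {
            if (!ready[t])
                continue;
            if (best < 0 || unresolved_branches[t] < unresolved_branches[best])
                best = t;
        }
        return best;  /* -1 if no thread is ready */
    }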

  26. Outline of the Presentation • Motivation • State-of-the-art • Multithreading • Multithreading for throughput increase • Multithreading for power reduction • Multithreading for embedded real-time systems • Conclusions & Research Opportunities

  27. Multithreading in Embedded Real-time Systems – The Komodo Approach • Observation: multithreading allows a context-switching overhead of zero cycles • Idea: harness multithreading for embedded real-time systems • Komodo project: real-time Java based on a multithreaded Java microcontroller http://www.informatik.uni-augsburg.de/lehrstuehle/info3/research/komodo/indexEng.html

  28. Real-time Requirements • run-time predictability • isolation of the threads • programmability • real-time scheduling support • fast context switching Hard real-time: a deadline may never be missed Soft real-time: a deadline may occasionally be missed

  29. Komodo Solutions • Extremely fast context switching by hardware multithreading • Real-time scheduling in hardware • Based on a Java processor core • Predictability of all instruction executions by a careful hardware design

  30. Komodo Microcontroller Pipeline

  31. Komodo Microcontroller Design

  32. Hardware Real-time Scheduling • Real-time scheduler is realized in hardware (by the priority manager) • Scheduling decision every clock cycle • Four different scheduling algorithms implemented: • Fixed Priority Preemptive (FPP) • Earliest Deadline First (EDF) • Least Laxity First (LLF) • Guaranteed Percentage (GP)
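
  Since a new decision is taken every clock cycle, the priority manager reduces to a combinational selection among the ready threads. A sketch for the FPP case (the lower-number-means-higher-priority convention and the thread count are assumptions of this sketch):

    #define NUM_THREADS 4

    /* Fixed Priority Preemptive (FPP): each cycle the highest-priority
       ready thread wins the pipeline; a newly ready high-priority thread
       therefore preempts a lower-priority one on the next cycle. */
    static int fpp_select(const int ready[NUM_THREADS],
                          const int priority[NUM_THREADS]) {
        int winner = -1;
        for (int t = 0; t < NUM_THREADS; t++) {
            if (ready[t] && (winner < 0 || priority[t] < priority[winner]))
                winner = t;
        }
        return winner;  /* -1 if no thread is ready this cycle */
    }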

  33. Guaranteed Percentage Scheme Diagram: three event-handler threads with guaranteed shares (event A: 20%, event B: 40%, event C: 30%), each with a start and a deadline; on a conventional processor a deadline violation occurs because of context-switch overhead, while on a multithreaded processor all deadlines are met with surplus time.
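
  A sketch of the GP decision using the shares from the diagram (A 20%, B 40%, C 30%); the 100-cycle interval and the deficit-based selection are illustrative assumptions about how the shares might be spread over the interval:

    #define NUM_THREADS 3
    #define INTERVAL    100  /* scheduling interval in cycles (assumed) */

    /* Guaranteed Percentage (GP): each thread is granted a fixed share of
       the cycles in every interval, e.g. share_percent[] = { 20, 40, 30 }.
       Each cycle, grant the slot to the thread furthest behind its budget;
       the caller charges the granted cycle to cycles_used[winner] and
       clears the counters at the start of the next interval. */
    static int gp_select(const int cycles_used[NUM_THREADS],
                         const int share_percent[NUM_THREADS]) {
        int winner = -1, best_deficit = 0;
        for (int t = 0; t < NUM_THREADS; t++) {
            int budget  = share_percent[t] * INTERVAL / 100;
            int deficit = budget - cycles_used[t];
            if (deficit > best_deficit) {
                best_deficit = deficit;
                winner = t;
            }
        }
        return winner;  /* -1 once every guaranteed share is used up */
    }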

  34. Simulation Results A thread mix of IC, PID, and FFT was applied.

  35. Technical Data of the Komodo Prototype • Implementation of the Komodo core pipeline on a Xilinx XCV800 FPGA with 800k gates • ASIC synthesis of the whole microcontroller (0.18 µm technology): 340 MHz, 3 mm² chip
  data bit width: 32 bit
  address space: 19 bit
  number of threads: 4
  instruction window size: 8 bytes
  stack size: 128 entries
  external frequency: 33 MHz
  internal frequency: 8.25 MHz
  CLBs: 9 200
  number of gates: 133 000

  36. Chip-Space of Komodo Core Pipeline

  37. Reducing Power Consumption Using Real-time Scheduling in Hardware • Current work • Idea: use the information about thread states and configurations available within the priority manager for a "fine-grained" adaptation of power consumption and performance • Frequency and voltage adjustments in short time intervals, done by hardware
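
  A sketch of the intended mechanism: the priority manager already knows the sum of the admitted guaranteed percentages, so the hardware can scale the clock (and with it the supply voltage) down to just cover that load. The frequency steps and the linear load-to-frequency mapping are illustrative assumptions, not the Komodo implementation:

    /* Pick the lowest frequency step that still covers the admitted
       real-time load, given as the sum of guaranteed percentages. */
    static int pick_frequency_mhz(int total_guaranteed_percent) {
        static const int steps_mhz[] = { 43, 85, 170, 340 };  /* assumed steps */
        int needed_mhz = 340 * total_guaranteed_percent / 100;
        for (unsigned i = 0; i < sizeof steps_mhz / sizeof steps_mhz[0]; i++)
            if (steps_mhz[i] >= needed_mhz)
                return steps_mhz[i];
        return steps_mhz[3];  /* full speed if the load needs it */
    }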

  38. State of the Komodo Project • Software simulator • FPGA prototype • Real-time Java system • ASIC • Middleware for distributed embedded systems

  39. Conclusions on Multithreading in Real-time Environments Multithreaded processor cores: • Performance gain due to fast context switching (for hard real-time) and latency hiding (for soft and non-real-time) • More efficient event handling by interrupt service threads (ISTs) • Helper threads possible (garbage collection, debugging) Real-time scheduling in hardware: • Software overhead for real-time scheduling removed • More efficient power-saving mechanisms possible • Better predictability by isolation of threads (GP scheduling)

  40. Conclusions & Research Opportunities • Multithreading proves advantageous: • Latency hiding: speed-ups of 2–3 for SMT, lots of research done, next generation of microprocessors • Power reduction: 22% savings reported, not much research up to now • Fast context switching utilized by microcontrollers for real-time systems, not much research up to now • Research opportunities: • Scheduling in SMT, network processors, and multithreaded real-time systems • Thread speculation: how to speed up single-threaded programs? • Multithreading and power consumption • Multithreading in other communities: microcontrollers, SoCs • System software based on helper threads

  41. Acknowledgements • SMT Multimedia research group • Uli Sigmund and Heiko Oehring • Complexity estimation group • Marc Steinhaus, Reiner Kolla, Josep L. Larriba-Pey, Mateo Valero • Komodo project group • Jochen Kreuzinger, Matthias Pfeffer, Sascha Uhrig, Uwe Brinkschulte, Florentin Picioroaga, Etienne Schneider

  42. Microprocessors: Technology Prognosis up to 2012 • SIA (Semiconductor Industry Association) prognosis of 1997:

  43. Research Directions? • Increase performance of a single thread of control by more instruction-level speculation: • better branch prediction • trace cache and next-trace prediction • data dependence and value prediction • Increase throughput of a workload of multiple threads by utilizing thread-level and instruction-level parallelism: • chip multiprocessors • multithreading (hardware thread = thread or process) • thread speculation
