
Multiscalar Processors


Presentation Transcript


  1. Multiscalar Processors Presented by Matthew Misler Gurindar S. Sohi, Scott E. Breach, T. N. Vijaykumar University of Wisconsin-Madison ISCA ‘95

  2. Scalar Processors • Instruction Queue • Execution Unit • addu $20, $20, 16 • ld $23, SYMVAL-16($20) • move $17, $21 • beq $17, $0, SKIPINNER • ld $8, LELE($17)

  3. SuperScalar Processors • Instruction Queue • Execution Unit • addu $20, $20, 16 • ld $23, SYMVAL-16($20) • move $17, $21 • beq $17, $0, SKIPINNER • ld $8, LELE($17)

  4. Fetch-Execute • Paradigm has been around for about 60 years • Superscalar processors execute instructions out of order • Re-ordering is done sometimes in hardware, sometimes in software, sometimes both • Only a partial ordering of instructions is enforced

  5. Control Flow Graphs • Segments are split on control dependencies (conditional branches)

  6. Sequential “Walk” • Execution is a walk through the CFG, carried out with as much parallelism as possible • Speculative execution and branch prediction raise the level of parallelism • Sequential semantics must be preserved • Instructions can still execute out of order, but commit in order

  7. Multiscalars and Tasks • The CFG is broken down into tasks • A multiscalar steps through the CFG at the task level • No inspection of individual instructions within a task • Each task is assigned to one ‘processing unit’ • Multiple tasks can execute in parallel

  8. Multiscalar Microarchitecture • Sequencer • Circular queue of processing units, connected in a unidirectional ring • Each unit has an instruction cache, a processing element, and a register file • Interconnect • Data banks • Each bank has an address resolution buffer and a data cache (a structural sketch follows below)
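
As an illustration of this organization, here is a minimal sketch in C; it is not the paper's hardware description, and all names, sizes, and counts are assumptions:

    #include <stdint.h>

    #define NUM_UNITS 4   /* processing units on the ring (assumed) */
    #define NUM_BANKS 4   /* interleaved data banks (assumed) */
    #define NUM_REGS 32

    /* One stage of the unidirectional ring. */
    typedef struct {
        uint8_t  icache[4096];       /* private instruction cache */
        uint32_t regfile[NUM_REGS];  /* local copy of the register file */
        uint32_t create_mask;        /* registers this task may produce */
    } ProcessingUnit;

    /* One data bank on the far side of the interconnect. */
    typedef struct {
        uint32_t arb_addr[64];    /* address resolution buffer: tracked addresses */
        uint32_t arb_loaded[64];  /* per-address bitmask of units that loaded it */
        uint32_t arb_stored[64];  /* per-address bitmask of units that stored it */
        uint8_t  dcache[8192];
    } DataBank;

    typedef struct {
        ProcessingUnit units[NUM_UNITS];  /* circular queue of units */
        int head, tail;                   /* oldest and youngest active task */
        DataBank banks[NUM_BANKS];
        /* plus the sequencer: task-descriptor cache and prediction state */
    } Multiscalar;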

  9. Multiscalar Microarchitecture

  10. Outline • Multiscalar Microarchitecture • Tasks • Multiscalars In-Depth • Distribution of Cycles • Comparison to Other Paradigms • Performance • Conclusion

  11. Tasks • The sequencer distributes a task to a processing unit • The unit fetches and executes the task until completion • Instructions in the window are bounded • By the first instruction in the earliest executing task • By the last instruction in the latest executing task

  12. Tasks • The sequencer distributes a task to a processing unit • The unit fetches and executes the task until completion • The instruction window is bounded by • The first instruction in the earliest executing task • The last instruction in the latest executing task • So? Instruction windows can be huge

  13. Tasks Example • [Figure: a CFG with basic blocks A, B, C, D, E; the dynamic execution trace is A B C B B C D]

  14. Tasks Example • [Figure: the same CFG walked by three processing units, with per-task traces A B C B B C D, A B B C D, and A B C B C D E]

  15. Tasks • Hold true to sequential semantics inside each block • Enforce sequential order overall on tasks • The circular queue takes care of this part • In the previous example: • The head of the queue executes A B C B B C D • The middle unit executes A B B C D • The tail of the queue executes A B C B C D E

  16. Tasks • Registers • Create mask: the registers a task may produce for a future task • Values are forwarded down the ring • Accum mask: the union of the create masks of the active tasks • Memory • If a producer-consumer relationship is known, synchronize the loads and stores (a sketch of the mask bookkeeping follows below)
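
A minimal sketch of the register-mask bookkeeping, assuming 32 architectural registers and one create-mask word per unit; the function names are invented for illustration:

    #include <stdint.h>

    typedef uint32_t RegMask;  /* one bit per architectural register */

    /* Accum mask for unit `me`: the union of the create masks of the
     * active predecessor tasks, i.e. the registers that may still arrive
     * over the ring before the local copy can be trusted. */
    RegMask accum_mask(const RegMask create[], int head, int me, int nunits)
    {
        RegMask m = 0;
        for (int u = head; u != me; u = (u + 1) % nunits)
            m |= create[u];
        return m;
    }

    /* A read of register r must stall while an earlier task may still produce it. */
    int must_wait(RegMask accum, int r)
    {
        return (accum >> r) & 1u;
    }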

  17. Tasks • Memory (cont’d) • Unknown producer-consumer relationship • Conservative approach: wait • Aggressive approach: speculate • The conservative approach means sequential operation • The aggressive approach requires dynamic checking, squashing, and recovery

  18. Outline • Multiscalar Basics • Tasks • Multiscalars In-Depth • Distribution of Cycles • Comparison to Other Paradigms • Performance • Conclusion

  19. Multiscalar Programs • Code for the tasks • Small changes to the existing ISA • add a specification of tasks • no major overhaul • Structure of the CFG and tasks • Communication between tasks

  20. Control Flow Graph Structure • Successors • recorded in a task descriptor • Producing and consuming values • forward register values on their last update • the compiler can mark instructions: operate, or operate-and-forward • Stopping conditions • special condition, evaluate conditions, complete • All of these can be viewed as tag bits (a descriptor sketch follows below)
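
As a sketch, a task descriptor of the kind described here might look like the C struct below; the field names, widths, and the successor limit are assumptions for illustration, not the paper's encoding:

    #include <stdint.h>

    #define MAX_SUCC 4  /* assumed bound on successor targets */

    typedef struct {
        uint32_t first_pc;             /* address of the task's first instruction */
        uint32_t create_mask;          /* registers the task may produce */
        uint32_t successors[MAX_SUCC]; /* possible next-task entry points */
        uint8_t  nsucc;
        /* Per-instruction tag bits live with the code itself:
         *   forward bit - this write is the last update; send it down the ring
         *   stop bits   - the task ends here, possibly conditional on outcome */
    } TaskDescriptor;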

  21. Multiscalar Hardware • Walks through the CFG • Assigns tasks to processing units • Executes tasks in a ‘sequential’ order • The sequencer fetches the task descriptors • Using the address of the first instruction • Specifying the create masks • Constructing the accum mask • Using the task descriptor, it predicts the successor

  22. Multiscalar Hardware • Data banks • Updates to the cache are not speculative • Use of the Address Resolution Buffer • Detects violations of dependences • Initiates corrective action • If the ARB runs out of space, squash tasks • But not the head of the queue; it doesn’t use the ARB • Can stall rather than squash (the check is sketched below)
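
The dependence check the ARB performs can be sketched roughly as follows: a store from one unit squashes any sequentially later task that already loaded the same address speculatively. The entry layout and names are assumptions, and intervening stores that would have forwarded the correct value are ignored for simplicity:

    #include <stdint.h>

    typedef struct {
        uint32_t addr;     /* tracked memory address */
        uint32_t loaded;   /* bit u set: unit u performed a load of addr */
        uint32_t stored;   /* bit u set: unit u performed a store to addr */
    } ArbEntry;

    /* Record a store by `unit` and return the earliest later unit that must
     * be squashed (along with all tasks after it), or -1 if no violation.
     * Units are numbered in sequential task order for simplicity. */
    int arb_store(ArbEntry *e, int unit, int nunits)
    {
        e->stored |= 1u << unit;
        for (int u = unit + 1; u < nunits; u++)
            if (e->loaded & (1u << u))  /* a later task already used a stale value */
                return u;
        return -1;
    }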

  23. Multiscalar Hardware • Remember the earlier architectural picture?

  24. Multiscalar Hardware • It’s not the only possible architecture • A possible design shares functional units • A possible design puts the ARB and data cache on the same side as the processing units • Scaling the interconnect is non-trivial • Glossed over in the paper

  25. Outline • Multiscalar Basics • Tasks • Multiscalars In-Depth • Distribution of Cycles • Comparison to Other Paradigms • Performance • Conclusion

  26. Distribution of Cycles • Wasted cycles: • Non-useful computation (the work is squashed) • No computation (the unit is waiting) • Idle (the unit has no assigned task)

  27. Distribution of Cycles • Non-useful computation cycles • Determine useless computation early • Validate the prediction early • Check whether the next task is predicted correctly • e.g. test for loop exit at the start of the loop • Tasks violating sequentiality are squashed • To avoid this, try to synchronize memory communication as register communication is synchronized • Could delay the load for a number of cycles • Can use signal-wait synchronization (sketched below)
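
The signal-wait idea, delaying a known consumer load until its producer store has signaled, can be illustrated with a small software analogy; this is not the hardware mechanism, and all names are invented:

    #include <stdatomic.h>

    /* One flag per statically identified producer-consumer store/load pair. */
    atomic_int ready = 0;
    int shared_value;

    /* Producer task: perform the store, then signal. */
    void producer(int v)
    {
        shared_value = v;
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    /* Consumer task: wait for the signal instead of loading speculatively,
     * trading stall cycles for the risk of a squash. */
    int consumer(void)
    {
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;  /* in hardware this would be a stalled load, not busy-waiting */
        return shared_value;
    }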

  28. Distribution of Cycles • No computation cycles (contrast these with having no assigned task) • Dependences within the same task • Dependences between tasks (earlier/later) • Load balancing

  29. Outline • Multiscalar Basics • Tasks • Multiscalars In-Depth • Distribution of Cycles • Comparison to Other Paradigms • Performance • Conclusion

  30. Comparison to Other Paradigms • Branch prediction • The sequencer only needs to predict branches between tasks, not within them • Wide instruction window • A conventional window checks every instruction for readiness to issue; in a multiscalar, relatively few instructions are inspected per cycle

  31. Comparison to Other Paradigms • Issue logic • Superscalar processors need O(n²) issue logic for an n-entry window • Multiscalar issue logic is distributed • Each processing unit issues instructions independently • Loads and stores • Normally sequence numbers are needed to manage the load/store buffers • In a multiscalar, each unit’s loads and stores are handled independently (see the sketch below)
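
To make the issue-logic contrast concrete: conventional wakeup compares every completing result tag against the source tags of every window entry, giving O(n²) comparators for an n-entry window, while a multiscalar splits the window across units that each scan only a small local slice. A schematic sketch, with all names invented:

    /* Centralized wakeup: every one of up to N completing tags is matched
     * against both source tags of all N window entries -> O(N^2) comparisons. */
    void wakeup_centralized(int N, const int tag[], const int src1[],
                            const int src2[], int ready[])
    {
        for (int t = 0; t < N; t++)      /* each broadcast result tag */
            for (int e = 0; e < N; e++)  /* each window entry */
                if (src1[e] == tag[t] || src2[e] == tag[t])
                    ready[e] = 1;
    }

    /* In a multiscalar, the same total window is split into U independent
     * units of N/U entries each, so the matching work drops to roughly
     * U * (N/U)^2 = N^2 / U, and each unit's logic stays small. */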

  32. Comparison to Other Paradigms • A superscalar processor must discover the CFG as it decodes branches • A multiscalar only requires the compiler to split the code into tasks • Multiprocessors require all dependences to be known or conservatively provided for • A multiscalar can execute tasks in parallel speculatively even when the compiler cannot prove them independent

  33. Outline • Multiscalar Basics • Tasks • Multiscalars In-Depth • Distribution of Cycles • Comparison to Other Paradigms • Performance • Conclusion

  34. Performance • Simulated • 5-stage pipeline • [Table: functional unit latencies]

  35. Performance • Memory • Non-blocking loads and stores • 10-cycle latency for the first 4 words, 1 cycle for each additional 4 words • Instruction cache: 1 cycle for 4 words; 10+3 cycles on a miss • Data cache: 1 word per cycle; 10+3 cycles plus bus contention on a miss • 1024-entry cache of task descriptors

  36. Performance • [Figure: instruction counts] • Multiscalar code executes 12.2% more instructions on average

  37. Performance – In-Order

  38. Performance – Out-of-Order

  39. Performance – Summary • Most of the benchmarks achieve speedup • e.g. an average speedup of 1.924 on a 1-way in-order, 4-unit multiscalar • Worst case: 0.86 speedup (a slowdown) • Many prediction and memory-order squashes in gcc and xlisp • These lead to almost sequential execution • Keep in mind the 12.2% increase in instruction count

  40. Outline • Multiscalar Basics • Tasks • Multiscalars In-Depth • Distribution of Cycles • Comparison to Other Paradigms • Performance • Conclusion

  41. Conclusion • Divide the CFG into tasks • Assign tasks to processing units • Walk the CFG in task-sized steps • Demonstrates performance gains
