Exploiting Accidental Heterogeneity in Multicore Processors

1. Exploiting Accidental Heterogeneity in Multicore Processors. Raghuvardhan Moola, Varun Vats. ECE 753 (Spring 2011), University of Wisconsin, Madison

2. Introduction. Exploiting Accidental Heterogeneity: three key concepts: (I) Heterogeneity, (II) Accidental Heterogeneity, (III) Exploiting Accidental Heterogeneity.

3. I - Heterogeneity: Multicore processors with cores that have different instruction set architectures (ISAs): general-purpose cores plus some specialized cores. Reason: general-purpose processors do not always give the best power/performance, and different applications require different types of cores. Financial applications, for example, may require dual or triple modular redundancy (DMR or TMR).

4. II - Accidental Heterogeneity: Heterogeneous cores are becoming popular, but homogeneous cores still dominate because they are easier to program and provide more consistent performance. Originally homogeneous cores can become heterogeneous not by design but through defects that render a core unable to execute certain instructions. Modern processors contain a large amount of redundant logic solely for performance gains; such redundant structures can be used to compensate for defective logic.

5. III - Exploiting Accidental Heterogeneity: Traditional approach in a failure scenario: abandon the core. TMR/DMR approaches are used in Tandem NonStop and IBM zSeries. Drawbacks: waste of hardware resources, rapid performance degradation, reduced lifetime, cost inefficiency. However, defects are tolerable if properly managed: it is possible to salvage the healthy components of a faulty core and keep the core functional. The result is reduced performance and reduced execution ability, but extended lifetime and better utilization of resources.

6. III - Exploiting Accidental Heterogeneity: Cores should be able to isolate defective units and reconfigure themselves to stay functional. Key requirement: RECONFIGURABILITY. Three sub-tasks of reconfiguration: (1) Fault detection: detect the presence of a fault. (2) Fault diagnosis: identify the faulty component. (3) Reconfiguration/recovery: isolate the faulty component and restore the system to a functional state, leveraging some form of redundancy.
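
The three sub-tasks on this slide can be sketched as three small functions. This is a minimal illustration, not any surveyed architecture: the `Core` class, the unit names, and the `redundant_with` map are all hypothetical, and real detection and diagnosis would rely on hardware checkers or test patterns rather than a ready-made health flag.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    # Hypothetical model: unit name -> True if the unit is healthy.
    units: dict
    disabled: set = field(default_factory=set)


def detect(core):
    """Fault detection: is any unit that is still in use faulty?"""
    return any(not ok for name, ok in core.units.items()
               if name not in core.disabled)


def diagnose(core):
    """Fault diagnosis: which units that are still in use are faulty?"""
    return sorted(name for name, ok in core.units.items()
                  if not ok and name not in core.disabled)


def reconfigure(core, faulty, redundant_with):
    """Reconfiguration: isolate the faulty units.

    Returns True if every isolated unit has at least one healthy,
    enabled peer listed in redundant_with, i.e. the core can keep running.
    """
    for name in faulty:
        core.disabled.add(name)
    for name in faulty:
        peers = redundant_with.get(name, [])
        if not any(core.units.get(p, False) and p not in core.disabled
                   for p in peers):
            return False
    return True


core = Core(units={"alu0": True, "alu1": False, "fpu": True})
redundancy = {"alu0": ["alu1"], "alu1": ["alu0"]}
if detect(core):
    still_functional = reconfigure(core, diagnose(core), redundancy)
    print(still_functional, core.disabled)
```

In this toy run the faulty `alu1` is isolated and the core stays functional because `alu0` can take over; a fault in the unreplicated `fpu` would make `reconfigure` return False.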

7. Granularity of Reconfiguration: Reconfiguration can be supported at various levels, from ultra-fine-grain systems (replace individual logic gates) to coarser designs (isolate entire processor cores). Trend: potential lifetime enhancement increases with finer granularity, but so does implementation complexity. Hence there is a trade-off between ease of implementation and lifetime extension.

8. Our Contribution: (1) Classification, based on granularity of reconfiguration support. (2) Comparison and evaluation: area overhead, performance cost, lifetime throughput, complexity of implementation, targeted faults. (3) Identify promising techniques.

9. Our Contribution: Classification; Comparison and Evaluation; Identify promising techniques.

10. Classification: Granularity levels, from finest to coarsest: gate level, microarchitectural/module level, stage level, architectural level, core level.

11. Reconfiguration Granularity: Gate Level. The system replaces individual logic gates as they fail. Advantages: highest lifetime extension; high production yield; highly dependable. Drawbacks: highly complicated to implement; tremendous area overhead due to redundant gates and routing.

12. Reconfiguration Granularity: Microarchitectural/Module Level. The system replaces microarchitectural structures/modules (ALU, branch predictor, etc.). Advantages: comparatively easy to implement. Drawbacks: suitable primarily for superscalar cores, which have a good amount of inherent redundancy for performance gains; few opportunities exist for reconfiguration because most modules are unique; performance loss.

13. Reconfiguration Granularity: Stage Level. The system replaces pipeline stages as they fail. Advantages: stages are a convenient boundary for reconfiguration because cores divide work at the level of stages; easier to implement. Challenges: pipeline stages are tightly coupled and therefore difficult to isolate or replace; maintaining spare stages is area intensive.

14. Reconfiguration Granularity: Architectural Level. A defect in a core renders it unable to execute certain instructions; un-executable instructions are moved to a different core. Advantages: low area overhead; fairly easy to implement. Drawbacks: rapid performance degradation; low lifetime enhancement.

15. Reconfiguration Granularity: Core Level. The degraded functionality of a core is used. Advantages: easy to implement; low area overhead. Drawbacks: poor lifetime enhancement; poor utilization of hardware resources.

16. Reconfigurable Multicore Processor Architectures. Granularity levels: gate level approaches (Fine Grain Redundancy, FGR); microarchitectural/module level approaches; stage level approaches; architectural level approaches; core level approaches.

17. FGR: Fine Grain Redundancy. T. Nakura, K. Nose, and M. Mizuno, "Fine-Grain Redundant Logic Using Defect-Prediction Flip-Flops," IEEE International Solid-State Circuits Conference 2007. Key idea: fine-grain redundant logic for switching out the defective portion. Defects can be killer or latent. Killer: defects apparent at fabrication. Latent: defects that become apparent only in actual use.

18. FGR: Fine Grain Redundancy. Latent defects: partial insufficiency, cracked vias, extra metal, etc., which develop into open or shorted connections. Path delay increase: latent defects gradually show up in use as increased path delay.

19. FGR: Fine Grain Redundancy. Implementation

20. FGR: Fine Grain Redundancy. Advantages: raises production yield from 70% to 91%; prevents 80% of in-field failures caused by one or two latent defects; suited to markets that need highly dependable chips, such as the automotive industry. Drawbacks: if the area ratio of combinational logic to DFFs is 6:4, the area becomes about 2.5x larger; the area penalty would be 18% at 45nm.

21. Reconfigurable Multicore Processor Architectures. Reconfiguration levels: gate level approaches; microarchitectural/module level approaches (Rescue, Structural Duplication); stage level approaches; architectural level approaches; core level approaches.

22. Rescue. E. Schuchman and T. N. Vijaykumar, "Rescue: A Microarchitecture for Testability and Defect Tolerance," ISCA 2005. Key idea: an out-of-order, multiple-issue superscalar core can be viewed as two in-order half-pipelines, a frontend and a backend, connected by issue. Disable the entire half-pipeline way that is affected by a fault.

23. Rescue. Possible degradations tolerated: the frontend supports degraded fetch, decode and rename; the issue queue and the load/store queue can be degraded to half their original size; the backend supports faults in register read, execute, and memory or writeback. Extra logic added: a shifter stage after fetch so that instructions can be shifted around; a shifter stage after issue to route issued instructions to functional ways.

24. Rescue. Example: fetch stage. Instructions are fetched in parallel and passed in program order to the decode stage. If one or more of the frontend ways are faulty: assign the earliest instruction to the first fault-free way, the second instruction to the second fault-free way, and so on; stall fetch and assign the remaining instructions in the same way until all fetched instructions are processed. The routing stage is composed of muxes, one per frontend way, each choosing an instruction for its way.
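
The assignment rule on this slide can be written down as a short routine. The sketch below models only the rule as stated (earliest instruction to the first fault-free way, stall and repeat for the leftovers); the function name and the list-of-cycles return shape are assumptions made for illustration, not Rescue's actual mux control logic.

```python
def route_fetch_group(instructions, way_ok):
    """Map one fetched group onto the fault-free frontend ways.

    instructions: fetched instructions in program order.
    way_ok: one boolean per frontend way, True if the way is fault-free.
    Returns a list of cycles; each cycle is a dict {way index: instruction}.
    Every cycle after the first corresponds to a fetch stall.
    """
    good_ways = [w for w, ok in enumerate(way_ok) if ok]
    if not good_ways:
        raise ValueError("no fault-free frontend way left")
    cycles = []
    for start in range(0, len(instructions), len(good_ways)):
        chunk = instructions[start:start + len(good_ways)]
        # Earliest instruction goes to the first fault-free way, and so on.
        cycles.append(dict(zip(good_ways, chunk)))
    return cycles


# 4-wide frontend with way 1 faulty: three instructions go through in the
# first cycle, fetch stalls, and the fourth follows in a second cycle.
print(route_fetch_group(["i0", "i1", "i2", "i3"], [True, False, True, True]))
```

Program order is preserved because instructions are always handed to ways in increasing way index, which is the property the decode stage depends on.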

25. Rescue. IPC degradation

26. Rescue. Advantages: reduces IPC by only 4%; improves instruction throughput over core sparing by 12% and 22% at 32nm and 18nm, respectively. Overhead

27. Structural Duplication. J. Srinivasan, S. V. Adve, P. Bose, and J. A. Rivers, "Exploiting Structural Duplication for Lifetime Reliability Enhancement," ISCA 2005. Key idea: exploit microarchitectural redundancy for reliability enhancement. Three techniques: (1) Structural Duplication (SD): redundant structures are added to the processor and designated as spares. (2) Graceful Performance Degradation (GPD): based on inherent redundancy that is not required for functional correctness and can be exploited to improve reliability or extend lifetime. (3) SD + GPD: inherent redundancy as well as added spares are used for reliability enhancement.

28. Structural Duplication. Cost incurred. SD: the added cold spares cause area overhead; performance is not affected. GPD: performance-boosting redundant structures are used for lifetime extension, so there is no area overhead; performance degrades when a fault occurs.
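
The difference between the techniques can be made concrete with a toy model of a single replicated structure. This is a sketch under loud assumptions: performance is taken to be proportional to the number of working units, cold spares never fail while unused, and the function and parameter names are invented here. None of this comes from the paper's evaluation.

```python
def performance_after_failures(failures, active, spares=0, min_needed=None):
    """Relative performance of one replicated structure.

    failures:   number of unit failures so far
    active:     units used in normal operation
    spares:     cold spares added for SD (0 means no SD)
    min_needed: fewest units the core can run with under GPD
                (defaults to `active`, i.e. no degradation allowed)
    Returns 0.0 when the structure, and hence the core, has failed.
    """
    if min_needed is None:
        min_needed = active
    absorbed = min(failures, spares)          # failures hidden by spares
    working = active - (failures - absorbed)  # units still running
    if working < min_needed:
        return 0.0
    # Assumption: performance scales linearly with working units.
    return working / active


for label, kwargs in [
    ("SD", dict(active=2, spares=1)),
    ("GPD", dict(active=2, min_needed=1)),
    ("SD+GPD", dict(active=2, spares=1, min_needed=1)),
]:
    print(label, [performance_after_failures(f, **kwargs) for f in range(4)])
```

With two active units, SD holds full performance through the first failure and then fails, GPD drops to half performance at the first failure, and SD + GPD survives two failures before the structure is lost.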

29. Structural Duplication. Performance, area and MTTFs

30. Structural Duplication. Lifetimes

31. Structural Duplication. Performance/Cost (P/C) metric

32. Structural Duplication. Limitations: not all structures are replicated, so a failure in a non-replicated structure causes the core to fail; few groups of structures are allowed to degrade, so the lifetime extension is limited.

33. Reconfigurable Multicore Processor Architectures. Reconfiguration levels: gate level approaches; microarchitectural/module level approaches; stage level approaches (Core Cannibalization Architecture, StageNet, StageWeb); architectural level approaches; core level approaches.

34. Core Cannibalization Architecture. B. F. Romanescu and D. J. Sorin, "Core Cannibalization Architecture: Improving Lifetime Chip Performance for Multicore Processors in the Presence of Hard Faults," PACT 2008. Two types of cores. Cannibalizable cores (CCs): cores whose stages can be cannibalized when a fault occurs. Non-cannibalizable cores (NCs): cores whose stages cannot be cannibalized. In the absence of faults, CCs function like normal cores. When a fault occurs in an NC, stages of a CC are cannibalized to replace the faulty stage; stages are replaced only if the fault occurs in an NC.
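
The borrowing rule can be illustrated with a small counting model. This is a greedy sketch, not the paper's mechanism: the five stage names, the first-fit donor choice, and the rule that a CC stops running once it has donated any stage are assumptions made here, and the extra latency of a borrowed stage is ignored.

```python
# Hypothetical five-stage pipeline used only for this illustration.
STAGES = ("fetch", "decode", "execute", "memory", "writeback")


def working_cores(nc_faults, cc_faults):
    """Count cores that can still run after cannibalization.

    nc_faults / cc_faults: one set of faulty stage names per NC / CC.
    """
    # Healthy stages each CC can still donate.
    spare = [set(STAGES) - set(f) for f in cc_faults]
    used = set()   # CCs that have given away at least one stage
    alive = 0
    for faulty in nc_faults:
        picks = {}
        for stage in sorted(faulty):
            donor = next((i for i, s in enumerate(spare) if stage in s), None)
            if donor is None:
                break
            spare[donor].remove(stage)
            picks[stage] = donor
        if len(picks) == len(faulty):
            alive += 1
            used.update(picks.values())
        else:
            # This NC cannot be repaired; hand the stages back.
            for stage, donor in picks.items():
                spare[donor].add(stage)
    # A CC runs as a normal core only if it is fault-free and intact.
    alive += sum(1 for i, f in enumerate(cc_faults) if not f and i not in used)
    return alive


# Three NCs and one CC; two NCs fail in different stages.
print(working_cores([{"execute"}, {"fetch"}, set()], [set()]))
```

In this example one CC repairs two NCs, leaving three working cores, whereas simply disabling the two faulty NCs would leave only two.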

35. Core Cannibalization Architecture

36. Core Cannibalization Architecture. Placement of CCs is critical.

37. Core Cannibalization Architecture. Results: area overhead

38. Core Cannibalization Architecture. Results: lifetime performance

39. Core Cannibalization Architecture. Results: cumulative performance advantage

40. StageNet. S. Gupta, S. Feng, A. Ansari, and S. Mahlke, "StageNet: A Reconfigurable Fabric for Constructing Dependable CMPs," IEEE Transactions on Computers 2011. Key idea: a multicore processor designed as a reconfigurable network of processor pipeline stages. Pipeline stages are isolated processing elements that can be connected in arbitrary fashion to form a logical core. The network is formed by replacing pipeline registers with crossbar switches.

41. StageNet

42. StageNet. Two major components: the configuration manager and the interconnection crossbars. Configuration manager: constructs logical cores from the pool of available stages at boot-up; sets up the routing table on each pipeline stage; implemented in the OS (for higher flexibility). Interconnection crossbars: direct incoming operations to the correct destination stage; the destination stage is identified using the routing table on each stage.
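
What the configuration manager does at boot-up can be sketched as follows. The stage kinds, the stage identifiers, and the shape of the routing table (stage id mapped to the id of the next stage) are illustrative assumptions; the slide only states that logical cores are built from the available stages and that each stage receives a routing table.

```python
# Hypothetical pipeline order; stage ids such as "F0" are made up.
PIPELINE = ("fetch", "decode", "issue", "execute")


def build_logical_cores(pool):
    """Group healthy stages into logical cores.

    pool: stage kind -> list of healthy stage ids.
    Returns one routing table per logical core, each mapping a stage id
    to the id of the stage that its output should be sent to.
    """
    # The scarcest stage kind limits how many logical cores can be formed.
    count = min(len(pool.get(kind, [])) for kind in PIPELINE)
    cores = []
    for k in range(count):
        chain = [pool[kind][k] for kind in PIPELINE]
        cores.append(dict(zip(chain, chain[1:])))
    return cores


pool = {
    "fetch": ["F0", "F1"],
    "decode": ["D1"],          # D0 has failed and is not in the pool
    "issue": ["I0", "I1"],
    "execute": ["E0", "E1"],
}
print(build_logical_cores(pool))
```

With decode stage D0 lost, one logical core is still formed from F0, D1, I0 and E0, a mix that a conventional fixed pipeline could not use.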

43. StageNet. Fault tolerance: StageNet islands

44. StageNet. Results: area overhead

45. StageNet. Results: lifetime performance

46. StageNet. Results: cumulative performance

47. StageNet. Limitations: does not scale well (fully connected pipeline stages increase delay); crossbar switches are vulnerable to failure (single points of failure); cold spares increase area drastically; process variations are not addressed.

48. StageWeb. S. Gupta, A. Ansari, S. Feng, and S. Mahlke, "StageWeb: Interweaving Pipeline Stages into a Wearout and Variation Tolerant CMP Fabric," DSN 2010. Builds upon StageNet and addresses three of its limitations: scalability, network failure tolerance, and resilience to process variations.

49. StageWeb: Scalability and Reliability. Single crossbar configuration: island 1 is unable to form any logical StageNet slice (SNS); island 2 forms one logical SNS (SNS 0).

50. StageWeb: Scalability and Reliability. Overlapping crossbar configuration: island 1 is not able to form any logical SNS; islands 2 and 3 form one logical SNS each.

51. StageWeb: Scalability and Reliability. Overlapping and front-back crossbar configuration: one more logical SNS (SNS 2) is added over the overlapping crossbar configuration, resulting in three SNSs.
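
Why richer crossbar configurations recover more logical pipelines can be seen from a simple count. The sketch compares two extremes, fully isolated islands and a fabric where any working stage can be paired with any other; StageWeb's overlapping and front-back crossbars sit somewhere between the two. The stage kinds and the fault pattern are hypothetical and do not reproduce the figures on these slides.

```python
STAGE_KINDS = ("fetch", "decode", "issue", "execute")  # illustrative


def count_pipelines(islands, shared):
    """Number of logical pipelines that can be formed.

    islands: list of dicts, stage kind -> number of working stages.
    shared:  False -> stages combine only within their own island;
             True  -> any stage can combine with any other (upper bound).
    """
    if shared:
        return min(sum(isl.get(k, 0) for isl in islands)
                   for k in STAGE_KINDS)
    return sum(min(isl.get(k, 0) for k in STAGE_KINDS) for isl in islands)


islands = [
    {"fetch": 0, "decode": 1, "issue": 1, "execute": 1},  # fetch is dead
    {"fetch": 2, "decode": 1, "issue": 2, "execute": 2},  # one decode dead
]
print(count_pipelines(islands, shared=False))
print(count_pipelines(islands, shared=True))
```

In isolation the first island is useless and the second forms one pipeline; with sharing, the spare fetch stage of the second island rescues the first island's healthy stages and a second pipeline becomes possible.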

52. StageWeb. Throughput cost

53. StageWeb. Cumulative work

54. StageWeb: Interconnect Reliability. Simple crossbar: a single crossbar switch is used at each interconnection spot; no redundancy is maintained. Simple crossbar with spares: one cold spare is maintained for every crossbar in the system and is brought into use when the original develops a certain number of faults. Fault-tolerant crossbar (no spares): multiple paths exist from each input port to each output port; nearly eliminates the chance of crossbar failure; most expensive option (2x to 3x that of a simple crossbar).

55. Reconfigurable Multicore Processor Architectures. Reconfiguration levels: gate level approaches; microarchitectural/module level approaches; stage level approaches; core level approaches (ElastIC, Necromancer); architectural level approaches.

56. ElastIC*. D. Sylvester, D. Blaauw, and E. Karl, "ElastIC: An Adaptive Self-Healing Architecture for Unpredictable Silicon," IEEE Design and Test 2006. Key idea: employs run-time self-diagnosis to keep track of performance. Four key components. Processing elements (PEs): simple processors that contain reliability, performance and power monitors. Diagnostic and adaptive processing unit (DAP): performs detailed diagnostics of PEs (parametric variation and wear-out). Memory and interconnect systems: use ECC and redundancy to tackle functional failures. Scheduler: examines the state of each PE and distributes workload accordingly. *No simulation studies available.

57. ElastIC. DAP: conducts power and performance characterization of a PE by testing its operation at different frequencies and voltages; can initiate active healing of damaged components by taking advantage of the reversibility of several reliability effects such as NBTI and electromigration; is made immune to failures through aggressive redundancy. Scheduler: uses the data produced by the DAP to maximize performance by controlling the PEs' voltage and frequency; also steers processor traffic based on this data.
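
The slides do not say which policy the scheduler uses, so the following is only one plausible way to steer work using per-PE data from the DAP. It assumes the DAP reports a sustainable frequency for each PE and applies a greedy "earliest finish time" rule; the function name, the PE names, and the units are all invented for this sketch.

```python
def distribute(task_cycles, pe_freq):
    """Greedy workload distribution over PEs of unequal speed.

    task_cycles: work per task, in cycles.
    pe_freq:     PE name -> sustainable frequency (cycles per time unit),
                 assumed to come from DAP characterization.
    Returns PE name -> list of task indices assigned to it.
    """
    finish = {pe: 0.0 for pe in pe_freq}      # time at which each PE is free
    assignment = {pe: [] for pe in pe_freq}
    # Place the largest tasks first.
    for index, cycles in sorted(enumerate(task_cycles), key=lambda t: -t[1]):
        # Pick the PE on which this task would complete earliest.
        pe = min(finish, key=lambda p: finish[p] + cycles / pe_freq[p])
        finish[pe] += cycles / pe_freq[pe]
        assignment[pe].append(index)
    return assignment


# pe1 has degraded and now runs at half the frequency of pe0.
print(distribute([100, 100, 100], {"pe0": 2.0, "pe1": 1.0}))
```

Here the healthy PE ends up with two of the three equal tasks, which is the kind of state-aware workload distribution the slide describes.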

58. ElastIC. Limitations: not scalable to massively multicore architectures because the area and power overhead would be high; possibly very complex to implement.

59. Necromancer. A. Ansari, S. Feng, S. Gupta, and S. Mahlke, "Necromancer: Enhancing System Throughput by Animating Dead Cores," ISCA 2010. Key idea: execution traces on a defective core resemble fault-free execution. Partition the cores in a CMP into multiple groups; each group shares a lightweight core.

60. Necromancer. Relax the correct-execution constraint on a faulty core. Define a Similarity Index (SI), which measures the similarity between the program counter (PC) values of the faulty and fault-free executions. For an SI of 90%: at least 100K instructions before execution differs by 10%. Leverage high-level execution information (hints) from the faulty core to accelerate the animator core. Disable the hints if they are not profitable.

61. Necromancer. Resynchronize the faulty core whenever it strays too far from the correct execution path. Resynchronization takes about 100 cycles. At least 100K committed instructions in 85% of cases, hence less synchronization overhead.
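
The slides leave the exact definition of the Similarity Index open, so the sketch below takes one simple reading: the fraction of committed instructions whose PC matches between the faulty core's trace and a fault-free trace. The resynchronization check, the window size, and the threshold are hypothetical parameters chosen for illustration.

```python
def similarity_index(faulty_pcs, golden_pcs):
    """Fraction of positions at which the two PC traces agree."""
    pairs = list(zip(faulty_pcs, golden_pcs))
    if not pairs:
        return 1.0
    return sum(f == g for f, g in pairs) / len(pairs)


def needs_resync(faulty_pcs, golden_pcs, window=1000, threshold=0.9):
    """True if the most recent `window` commits are too dissimilar."""
    n = min(len(faulty_pcs), len(golden_pcs))
    lo = max(0, n - window)
    return similarity_index(faulty_pcs[lo:n], golden_pcs[lo:n]) < threshold


golden = [0, 4, 8, 12, 16, 20, 24, 28, 32, 36]
faulty = [0, 4, 8, 12, 16, 20, 24, 28, 64, 68]   # diverges at the end
print(similarity_index(faulty, golden))
print(needs_resync(faulty, golden, window=10))
```

A real implementation would compare against the correctly executing core at run time rather than a stored trace, and would weigh the roughly 100-cycle resynchronization cost against the value of the hints.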

62. Necromancer. High-level architecture

63. Necromancer. Can achieve 87.6% of the performance of a fully functioning core. Area and power overheads are 5.3% and 8.5%, respectively.

64. Reconfigurable Multicore Processor Architectures. Reconfiguration levels: gate level approaches; microarchitectural/module level approaches; stage level approaches; core level approaches; architectural level approaches (Architectural Core Salvaging).

65. Architectural Core Salvaging. M. Powell, A. Biswas, S. Gupta, and S. Mukherjee, "Architectural Core Salvaging in a Multi-Core Processor for Hard Error Tolerance," ISCA 2009. Key idea: even if individual cores cannot execute certain operations, the CPU die as a whole can still be ISA compliant. Migrate the offending thread to another core that can execute the instruction. Find a stable thread that does not use un-executable instructions and assign it to the defective core.

66. Architectural Core Salvaging. Relax the requirement: each core need not be fully functional. Non-replicated structures: non-essential. Potential of the method

67. Architectural Core Salvaging. Implementation. Detecting the presence of un-executable instructions: programmable lookup table. Transferring the architectural state to and from the core: similar to entering and leaving a deep-sleep power state. Migration and overhead: a thread migration is a thread swap; it is done over the existing interconnect; it costs on the order of tens to a few hundred cycles; the cost can be amortized as long as migrations are infrequent.
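
The lookup-table check and the thread swap can be sketched together. In this simplified model each core's programmable lookup table is a set of opcodes it can no longer execute, and a swap is chosen so that both threads land on a core that can run their next instruction. The function name, the data layout, and the opcode strings are assumptions made for illustration; transferring architectural state is not modeled.

```python
def migrate_if_needed(placement, next_opcode, unexecutable):
    """Swap threads away from cores that cannot run their next opcode.

    placement:    list, core index -> thread id (modified in place)
    next_opcode:  thread id -> opcode the thread is about to execute
    unexecutable: list, core index -> set of opcodes that core cannot run
                  (stands in for the programmable lookup table)
    Returns the number of thread swaps performed.
    """
    swaps = 0
    for core in range(len(placement)):
        op = next_opcode[placement[core]]
        if op not in unexecutable[core]:
            continue                      # table lookup: instruction is fine
        for other in range(len(placement)):
            other_op = next_opcode[placement[other]]
            if (other != core
                    and op not in unexecutable[other]
                    and other_op not in unexecutable[core]):
                # Thread migration is a swap of the two threads.
                placement[core], placement[other] = (
                    placement[other], placement[core])
                swaps += 1
                break
    return swaps


placement = ["A", "B"]                    # core 0 runs A, core 1 runs B
next_opcode = {"A": "fdiv", "B": "add"}
unexecutable = [{"fdiv"}, set()]          # core 0 lost its divider
print(migrate_if_needed(placement, next_opcode, unexecutable), placement)
```

Because each swap costs on the order of tens to a few hundred cycles, this only pays off when threads hit un-executable instructions rarely, which is the amortization condition stated on the slide.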

68. Architectural Core Salvaging. Advantages: covers a significant fraction of the core area (30% of the vulnerable area); requires only small changes to the microarchitecture.

69. Our Contribution: Classification; Comparison and Evaluation; Identify promising techniques.

70. Comparison and Evaluation

71. Our Contribution: Classification; Comparison and Evaluation; Promising Approaches.

72. Promising Approaches. Microarchitectural techniques: fairly low area overhead; achieve a performance level quite close to the base processor. Stage level techniques: low area overhead; high lifetime throughput. Architectural level techniques: low area overhead; high performance; low lifetime throughput.

73. Conclusion: Studied and classified reconfigurable multicore processor architectures based on granularity of reconfiguration. Compared and evaluated the architectures on area overhead, performance cost, implementation complexity and targeted faults. Techniques that employ reconfigurability at the microarchitectural, stage and architectural levels were identified as efficient.

74. Questions?
