Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction

Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy Ranganathan, Dean M. Tullsen

Proceedings of the 36th International Symposium on Microarchitecture (MICRO-36’03)



Advanced computer architecture CSE 8383

What Is a Multi-Core Architecture?

  • A multi-core processor delivers two or more complete execution units, or cores, in a single physical processor. All cores run at the same frequency and plug into a single processor socket. They also share the same platform interface, which connects them to memory, I/O, and storage resources.

General Idea

  • This paper proposes and evaluates single-ISA heterogeneous multi-core architectures as a mechanism to reduce processor power dissipation.

  • Main point: place heterogeneous cores on a single die and, for each application, choose the most power-efficient core that still satisfies the performance constraints, thereby saving power.


  • The architecture consists of a chip-level multiprocessor with multiple, diverse processor cores. These cores all execute the same instruction set, but include significantly different resources and achieve different performance and energy efficiency on the same application.

  • The operating system software tries to match each application to one of the cores so as to meet a defined objective function. For example, it may aim to meet a particular performance requirement with maximum energy efficiency.

Cores / Processors targeted

  • This work examines a diverse set of execution cores. In a processor where the objective function is static (and perhaps the workload is well known), some of the results indicate that a smaller set of cores (often two) will be sufficient to achieve very significant gains. However, if the objective function varies over time or workload, a larger set of cores has even greater benefit.

Close look

  • Four cores: Alpha EV4, EV5, EV6, and EV8

  • Each core has different power/performance characteristics

  • During execution, software dynamically chooses the core that best meets the power and performance needs

  • Only one core, running a single thread, is active at any given time

  • The goal is not performance increase, but power usage decrease


  • Processors are projected to consume 300 W by 2015

  • Existing CMP designs use only homogeneous cores

  • Applications with high ILP benefit from wider cores (e.g. EV8), whereas applications with low ILP use less power on narrower cores (e.g. EV4) with little loss in performance

  • No need to design cores from scratch because existing Alpha cores run on practically the same ISA


  • EV4: Alpha 21064

  • EV5: Alpha 21164

  • EV6: Alpha 21264

  • EV8-: single-threaded version of Alpha 21464 (based on “projected numbers”)

Cores, cont.

  • All cores are assumed to be implemented in 0.10 micron technology

  • The four cores are assumed to have private L1 data and instruction caches and to share a common L2 cache, phase-locked loop circuitry, and pins.

  • All cores run at 2.1 GHz (the frequency at which an EV6 core would run if its 600 MHz, 0.35 micron implementation were scaled to 0.10 micron)

  • All cores share an on-chip 3.5 MB 7-way set associative L2 cache (latencies were calculated using CACTI)

  • ISA differences are handled in one of two ways: either programs are compiled to the least common denominator (the EV4), or software traps are used on the older cores.
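
The 2.1 GHz figure above is consistent with a simple linear scaling rule. The sketch below assumes that clock frequency scales inversely with feature size; that rule and the function name `scaled_frequency_mhz` are assumptions of this example, not something the slides state.

```python
# Assumption: clock frequency scales inversely with feature size.
def scaled_frequency_mhz(base_mhz: float,
                         base_feature_um: float,
                         target_feature_um: float) -> float:
    """Scale a clock frequency from one process feature size to another."""
    return base_mhz * base_feature_um / target_feature_um


# EV6: 600 MHz at 0.35 micron, scaled to 0.10 micron.
ev6_scaled = scaled_frequency_mhz(600.0, 0.35, 0.10)
print(f"{ev6_scaled / 1000:.1f} GHz")  # prints "2.1 GHz"
```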



  • Wattch was used to simulate power usage, but had to be calibrated with scaling and offset factors to compare older technologies alongside newer technologies

  • CACTI was used to simulate L2 power consumption

  • 14 SPEC2000 benchmarks were run: 7 integer and 7 floating point

  • Benchmarks are simulated using SMTSIM in non-multithreading mode

  • Because several assumptions were based on common rules of thumb used in typical processor design, several sensitivity experiments were performed with widely different assumptions about the range of power dissipation in the core. These experiments made clear that power differences between cores dominate any power differences between applications on the same core

Core Switching

  • Switching done at the operating system level

  • Two options for switching granularity:

    • Granularity of application

    • Granularity of operating system timeslice intervals

  • A core switch involves flushing the caches and saving and restoring user state on the cores

  • Unused cores are completely powered down (therefore no leakage)

  • A core is estimated to power up in ~1000 cycles at 2.1 GHz

  • Switching overhead turns out to be negligible (~1%)
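
To put the power-up estimate in wall-clock terms, the short sketch below converts the cycle count into nanoseconds. The constant and function names are chosen for this example only.

```python
CLOCK_HZ = 2.1e9         # core clock frequency from the slides
POWER_UP_CYCLES = 1000   # estimated power-up latency from the slides


def cycles_to_ns(cycles: float, clock_hz: float) -> float:
    """Convert a cycle count at a given clock frequency to nanoseconds."""
    return cycles / clock_hz * 1e9


power_up_ns = cycles_to_ns(POWER_UP_CYCLES, CLOCK_HZ)
print(f"{power_up_ns:.0f} ns")  # prints "476 ns"
```

At roughly half a microsecond, the power-up delay is short next to the other switching costs listed above (cache flush, state save and restore).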


Switching Algorithms: Oracle-based dynamic switching using energy heuristic

  • With oracle knowledge of power requirements and performance potential, choose the core that would have the lowest energy consumption, as long as it performs within 10% of EV8-

  • Average energy reduction = 38%

  • Average performance degradation = 4%
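
The selection rule can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `CoreSample` type, the function names, and all numbers are invented for the example, and "within 10% of EV8-" is interpreted here as throughput of at least 90% of the EV8- throughput.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CoreSample:
    """Hypothetical per-interval measurement for one core."""
    name: str
    ips: float    # instructions per second over the interval
    watts: float  # average power over the interval


def energy_per_instruction(sample: CoreSample) -> float:
    # Energy per instruction (joules) = power / throughput.
    return sample.watts / sample.ips


def pick_min_energy(samples: List[CoreSample],
                    reference: str = "EV8-",
                    max_slowdown: float = 0.10) -> CoreSample:
    """Lowest-energy core among those meeting the performance floor."""
    ref = next(s for s in samples if s.name == reference)
    floor = ref.ips * (1.0 - max_slowdown)
    eligible = [s for s in samples if s.ips >= floor]
    return min(eligible, key=energy_per_instruction)


# Invented numbers, for illustration only (not taken from the paper).
interval = [
    CoreSample("EV4", 0.6e9, 5.0),
    CoreSample("EV5", 1.0e9, 10.0),
    CoreSample("EV6", 1.9e9, 18.0),
    CoreSample("EV8-", 2.0e9, 60.0),
]
print(pick_min_energy(interval).name)  # prints "EV6"
```

With these invented numbers, EV4 has the lowest energy per instruction overall, but it fails the performance floor, so the constraint pushes the choice to EV6.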


Results table

Switching Algorithms: Oracle-based dynamic switching using energy-delay heuristic

  • With oracle knowledge of power and performance, choose the core that would maximize IPS²/Watt, as long as it performs within 50% of EV8-

  • Average energy reduction = 73%

  • Average energy-delay reduction = 63%

  • Average performance degradation = 22%
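
Energy per instruction is W/IPS and delay per instruction is 1/IPS, so the energy-delay product per instruction is W/IPS². Maximizing IPS²/Watt is therefore the same as minimizing energy-delay. The sketch below applies that metric under the 50% constraint. The function names and all numbers are invented for illustration, and "within 50% of EV8-" is interpreted as throughput of at least half the EV8- throughput.

```python
from typing import Dict, Tuple


def ips2_per_watt(ips: float, watts: float) -> float:
    # Inverse of the per-instruction energy-delay product (W / IPS^2).
    return ips * ips / watts


def pick_best_energy_delay(cores: Dict[str, Tuple[float, float]],
                           reference: str = "EV8-",
                           max_slowdown: float = 0.50) -> str:
    """Core name with the highest IPS^2/W among those meeting the floor."""
    floor = cores[reference][0] * (1.0 - max_slowdown)
    eligible = {name: v for name, v in cores.items() if v[0] >= floor}
    return max(eligible, key=lambda name: ips2_per_watt(*eligible[name]))


# name -> (IPS, watts); invented numbers, not taken from the paper.
cores = {
    "EV4": (0.6e9, 3.0),
    "EV5": (1.2e9, 6.0),
    "EV6": (1.9e9, 18.0),
    "EV8-": (2.0e9, 60.0),
}
print(pick_best_energy_delay(cores))  # prints "EV5"
```

The looser 50% bound admits slower cores than the 10% bound of the energy heuristic, which is in line with the larger energy reduction and larger performance degradation reported above.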


Switching Algorithms: Static Core Selection

  • Choose a single core to run for the duration of execution, perhaps based on compiler analysis, profiling, past history, or simple sampling

  • Based on energy heuristic (performance constraint within 10% of EV8-)

    • Average energy savings = 32%.

    • Average performance degradation = 2.6%

  • Based on energy-delay heuristic

    • Average energy-delay savings = 31%

    • Average energy-delay² savings = 30%

Oracle-based dynamic switching using energy heuristic

Note that EV6 and EV8 are heavily used because of the performance constraint.

Some applications achieve no savings because switching was ruled out by the performance constraint (within 10% of EV8-).


Switching Algorithms: Realistic Dynamic Switching

  • Every 100 million instructions, one or more cores are sampled for 5 million instructions

  • Neighbor

    • One of the neighboring cores is chosen at random to be sampled

  • Neighbor-global

    • The neighboring core that would be “expected” to have the lowest energy-delay is sampled

  • Random

    • A core is chosen at random to be sampled

  • All

    • All cores are sampled
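
The four sampling policies can be sketched as a single function that returns the cores to sample next. This is a sketch under stated assumptions, not the authors' code: the core ordering, the definition of a "neighbor" as an adjacent core in that ordering, the exclusion of the currently running core from "random" and "all", and the `expected_energy_delay` table are all choices made for this example.

```python
import random
from typing import Dict, List, Optional

# Assumed ordering from simplest to most complex core.
CORES = ["EV4", "EV5", "EV6", "EV8-"]


def neighbors(current: str) -> List[str]:
    """Cores adjacent to `current` in the assumed ordering."""
    i = CORES.index(current)
    return [CORES[j] for j in (i - 1, i + 1) if 0 <= j < len(CORES)]


def cores_to_sample(heuristic: str,
                    current: str,
                    expected_energy_delay: Optional[Dict[str, float]] = None,
                    rng: Optional[random.Random] = None) -> List[str]:
    """Return the cores to sample in the next sampling phase."""
    rng = rng or random.Random()
    others = [c for c in CORES if c != current]
    if heuristic == "neighbor":
        return [rng.choice(neighbors(current))]
    if heuristic == "neighbor-global":
        if expected_energy_delay is None:
            raise ValueError("neighbor-global needs expected_energy_delay")
        # Hypothetical table of expected energy-delay per core.
        return [min(neighbors(current), key=lambda c: expected_energy_delay[c])]
    if heuristic == "random":
        return [rng.choice(others)]
    if heuristic == "all":
        return others
    raise ValueError(f"unknown heuristic: {heuristic}")


print(cores_to_sample("all", "EV6"))  # prints "['EV4', 'EV5', 'EV8-']"
```

After sampling, the scheduler would compare the sampled cores against the current one and switch if a sampled core scores better on the chosen metric; that comparison step is left out of this sketch.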

Realistic Dynamic Switching Results

  • Results shown normalized to EV8- performance

  • Realistic schemes achieved up to 93% of energy-delay gains of oracle-based schemes

  • Performance degradation of realistic schemes is less than in oracle-based schemes

  • Realistic schemes resulted in more core switching

Related work: power-related optimizations for processor design can be classified into two categories

  • (1) Work that uses gating for power management: this provides a turn-off option, but is limited by the granularity of the structures that can be gated

  • (2) Work that uses voltage and frequency scaling of the processor core to reduce power: this is limited by the process technology in which the processor is built, and the power reductions are uniform across the core, applying both to the portions that are performance-critical for the workload and to the portions that are not

Final words

Overall, having heterogeneous processor cores provides potentially greater power savings compared to previous approaches and greater flexibility and scalability of architecture design. Moreover, these previous approaches can still be used in a multi-core processor to greater advantage.

A multi-core heterogeneous architecture can support a range of execution characteristics not possible in an adaptable single-core processor, even one that employs aggressive gating.

Such an architecture can adapt not only to changing demands in a single application, but also to changing demands between applications, changing priorities or objective functions within a processor or between applications, or even changing operating environments.


  • Realistic dynamic switching algorithms show a decrease in energy and energy-delay with only a small decrease in performance

  • Single ISA heterogeneous multi-core processors using existing technology may be a way to curb power usage

  • Could this be implemented with multiple simultaneous threads? (ISCA 2004)


  • Based partly on a processor that does not actually exist (EV8-)

  • Assumed that all processors could simply be built on 0.10 micron technology and run at 2.1 GHz

  • 3.5 MB 7-way set associative L2 cache

    • Where did they come up with these numbers?

    • This would add latency and slow down performance compared to a single processor with a regularly sized cache

  • There was lots of tweaking of power numbers without many details or explanation

  • Why were only 14 SPEC2000 benchmarks used?

  • Since SMTSIM was used rather than SimpleScalar, the results cannot really be compared with other studies that were done using SimpleScalar