Memory Management Challenges in the Power-Aware Computing Era

Dr. Avi Mendelson,

Intel - Mobile Processors Architecture group
avi.mendelson@intel.com

and adjunct Professor in the CS and EE departments, Technion, Haifa
mendlson@{cs,ee}.technion.ac.il

Disclaimer
  • No Intel proprietary information is disclosed.
  • Every future estimate or projection is only speculation.
  • Responsibility for all opinions and conclusions falls on the author alone. 
    • It does not mean you cannot trust them… 

© Dr. Avi Mendelson - ISMM'2006

Before we start
  • Personal observation: focusing on low power resembles Alice Through the Looking-Glass: we are looking at the same old problems, but from the other side of the looking glass, and the landscape appears much different...

Out-of-the-box thinking is needed

© Dr. Avi Mendelson - ISMM'2006

Agenda
  • What power-aware architectures are and why they are needed
    • Background of the problem
    • Implications and architectural directions
  • Memory related issues
    • Static power implications
    • Dynamic power implications
  • Implications on software and memory management
  • Summary

© Dr. Avi Mendelson - ISMM'2006

The “power aware era”
  • Introduction
  • Implications on computer architectures

© Dr. Avi Mendelson - ISMM'2006

Moore's law

[Figure: transistors per die vs. year, 1970-2000, on a log scale from 10^2 to 10^9; one curve for memory (1K through 256M) and one for microprocessors (4004, 8080, 8086, 80286, i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, Pentium 4). Source: Intel]

  • "Doubling the number of transistors on a manufactured die every year" - Gordon Moore, Intel Corporation

© Dr. Avi Mendelson - ISMM'2006

In the last 25 years life was easy(*)
  • Ideal process technology allowed us to
    • Double transistor density every 30 months
    • Improve transistor speed by 50% every 15-18 months
    • Keep the same power density
    • Reduce the power of an old architecture, or introduce a new architecture with significant performance improvement at the same power
  • In reality
    • The process is usually not ideal, and more performance is needed than process scaling alone provides, so:
      • Die size, power, and power density increased over time

(*) Source: Fred Pollack, Micro-32

© Dr. Avi Mendelson - ISMM'2006

Processor power evolution

[Figure: max power (Watts, log scale from 1 to 100) vs. process node (1.5µm down to 0.13µm) for i386, i486, Pentium, Pentium w/MMX tech., Pentium Pro, Pentium II, Pentium III, and Pentium 4]

Traditionally: a new generation always increases power

Compactions: higher performance at lower power

Used to be "one size fits all": start with high power and shrink for mobile

© Dr. Avi Mendelson - ISMM'2006

Power & energy

Power

  • Dynamic power: consumed by all transistors that switch
    • P = a·C·V²·f - work done per time unit (Watts)
      (a: activity, C: capacitance, V: voltage, f: frequency)
  • Static power (leakage): consumed by all "inactive" transistors - depends on temperature and voltage.

Energy

    • Power integrated over a time period (Joules).

Energy efficiency

    • Energy × Delay (or Energy × Delay²)
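
A back-of-the-envelope calculation makes the formula concrete. This is a minimal sketch with illustrative parameter values (not numbers from the talk); it shows why voltage is the biggest lever on dynamic power:

```python
# Dynamic power: P = a * C * V^2 * f
# a: activity factor, C: switched capacitance (F), V: supply voltage (V), f: frequency (Hz)

def dynamic_power(a, c, v, f):
    return a * c * v ** 2 * f

def energy_delay_product(power_watts, delay_s):
    energy_j = power_watts * delay_s   # energy = power * time
    return energy_j * delay_s          # the Energy * Delay metric from the slide

# Illustrative values only: halving V cuts dynamic power 4x at the same frequency.
p_high = dynamic_power(a=0.2, c=1e-9, v=1.2, f=3e9)
p_low  = dynamic_power(a=0.2, c=1e-9, v=0.6, f=3e9)
print(p_high / p_low)  # -> 4.0
```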

© Dr. Avi Mendelson - ISMM'2006

Why high power matters

Power limitations

  • Higher power → higher current
    • Cannot exceed platform power delivery constraints
  • Higher power → higher temperature
    • Cannot exceed the thermal constraints (e.g., Tj < 100°C)
    • Increases leakage.
  • The heat must be controlled in order to avoid electromigration and other "chemical" reactions in the silicon
  • Avoid the "skin effect"

Energy

  • Affects battery life.
    • Consumer devices - the processor may consume most of the energy
    • Mobile computers (laptops) - the system (display, disk, cooling, power supply, etc.) consumes most of the energy
  • Affects the cost of electricity

© Dr. Avi Mendelson - ISMM'2006

The power crisis - power consumption

[Figure: industry power-consumption trend. Source: Cool Chips, Micro-32]

© Dr. Avi Mendelson - ISMM'2006

Power density

[Figure: power density (Watts/cm², log scale from 1 to 1000) vs. process node (1.5µm down to 0.07µm); the processor trend line (i386, i486, Pentium through Pentium 4) passes a hot plate, then extrapolates past a nuclear reactor and a rocket nozzle toward the Sun's surface]

* "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies" - Fred Pollack, Intel Corp., Micro32 conference keynote, 1999.

© Dr. Avi Mendelson - ISMM'2006

Conclusions so far
  • Currently, and in the near future, new processes keep providing more transistors, but the improvements in power reduction and speed are much smaller than in the past
  • Due to power consumption and power density constraints, we are limited in the amount of logic that can be devoted to improving single-thread performance
  • We can use the transistors to add more "low power density", "low power consumption" structures such as memory, assuming we can control the static power.
  • BUT, we still need to double the performance of new computer generations every two years (on average).

© Dr. Avi Mendelson - ISMM'2006

We must go parallel
  • To first order, power increases as the cube of frequency (voltage must scale with frequency, so P = a·C·V²·f grows roughly as f³).
  • Assuming that frequency approximates performance:
    • Doubling performance by doubling frequency increases power roughly eightfold
    • Doubling performance by adding another core increases power linearly (see the sketch below)
  • Conclusion: as long as enough parallelism exists, it is more efficient to achieve the same performance by doubling the number of cores rather than doubling the frequency.
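
A quick calculation illustrates the trade-off. This is a minimal sketch assuming the idealized cubic power model above and perfect parallel speedup (both are simplifications):

```python
# Idealized model from the slide: voltage scales with frequency,
# so dynamic power P = a*C*V^2*f grows roughly as f^3.

def power(freq_ratio):
    """Power relative to a baseline core running at freq_ratio = 1.0."""
    return freq_ratio ** 3

one_fast_core  = power(2.0)       # one core at 2x frequency: 8x power
two_slow_cores = 2 * power(1.0)   # two cores at 1x frequency: 2x power

print(one_fast_core, two_slow_cores)  # -> 8.0 2.0
# Same (ideal) throughput, 4x less power for the dual-core design.
```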

© Dr. Avi Mendelson - ISMM'2006

CPU architecture - multicores

[Figure: performance vs. power; the CMP curve lies above the uniprocessor curve (separated by the MP overhead) until both hit the power wall]

Uniprocessors have lower power efficiency due to higher speculation and complexity

Source: Tomer Morad, Ph.D. student, Technion

© Dr. Avi Mendelson - ISMM'2006

The new trend - parallel systems on a die
  • There are at least three camps in the computer architecture community
    • Multi-cores - systems will continue to contain a small number of "big cores" - Intel, AMD, IBM
    • Many-cores - systems will contain a large number of "small cores" - Sun T1 (Niagara)
    • Asymmetric cores - a combination of a small number of big cores and a large number of small cores - IBM Cell architecture.

© Dr. Avi Mendelson - ISMM'2006

New era in computer architectures

[Slide: a cartoon dialogue]

"I remember, in the late 80's it was clear that we could not improve the performance of single-threaded programs any longer, so we went parallel as well."

"But this is totally different! Now Alewife is called Niagara, DASH is called NoC, and a shared-bus architecture is called CMP!"

"Hmmm, you are right - it is a whole new world. We deserve it!"

"Let's hope this time we will have a happy ending."

© Dr. Avi Mendelson - ISMM'2006

From the opposite side of the mirror
  • There are many similarities between the motivation and the solutions we are building today and what was developed in the late 80's
  • But the root cause is different and the software environment is different, so a new approach is needed to come up with the right solutions
  • Power and power density are real physical limitations, so escaping them requires a new direction (biological computing?)

© Dr. Avi Mendelson - ISMM'2006

Agenda
  • What power-aware architectures are and why they are needed
    • Background on the problem
    • Implications and architectural directions
  • Memory related issues
    • Static power implications
    • Dynamic power implications
  • Implications on software and memory management
  • Summary

© Dr. Avi Mendelson - ISMM'2006

Memory related implications
  • The fraction of overall die area devoted to memory increases over time (cache memories)
  • Most of the memory contributes very little to active power consumption
    • Most of the active power that on-die memory consumes is spent in the L1 cache (which stays the same size or shrinks)
    • Most of the active power that off-die memory consumes is spent on the buses, interconnect, and coherency-related activities.
    • Snoop traffic may have a significant impact on active power
  • But a larger memory may consume significant static power (leakage) if not handled very carefully
    • Will be discussed in the next few slides

© Dr. Avi Mendelson - ISMM'2006

Accessing the cache can be power hungry if designed for speed

[Figure: a set-associative cache lookup; the virtual address indexes the TLB and the tag/data arrays in parallel, the matching way is selected from the set, and the data is sent to the CPU]

  • If access time is very important:
    • Memory cells are tuned for speed, not for power
    • All ways of the tag and data arrays are accessed in parallel
    • The TLB is accessed in parallel (which requires the indexed width of the cache to be no larger than a memory page)
    • Power is mainly spent in the sense amplifiers.
    • High associativity increases the power of external snoops as well.

© Dr. Avi Mendelson - ISMM'2006

There are many techniques to control the power consumption of the cache (and memory) if it is designed for power
  • Active power
    • Sequential access (tag first, and only then data) - see the sketch below
    • Optimized cells (can help both active and leakage power)
  • Passive (leakage) power
    • Will be discussed later on

BUT if the cache is large, the cumulative static power can be very significant.
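
A toy model shows why sequential (tag-first) access saves power as associativity grows. This is a minimal sketch with made-up unit energies; it only illustrates the scaling, not real circuit numbers:

```python
# Assumed (illustrative) energy costs per array read, in arbitrary units.
TAG_READ = 1.0
DATA_READ = 4.0   # a data way is much wider than a tag, so it costs more

def parallel_lookup_energy(ways):
    # Speed-optimized: read every tag AND every data way at once.
    return ways * (TAG_READ + DATA_READ)

def sequential_lookup_energy(ways):
    # Power-optimized: read all tags first, then only the matching data way.
    return ways * TAG_READ + DATA_READ

for n in (2, 4, 8, 16):
    print(n, parallel_lookup_energy(n), sequential_lookup_energy(n))
# Parallel lookup energy grows with associativity on every access;
# sequential lookup pays extra latency (a serialized tag check) instead.
```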

© Dr. Avi Mendelson - ISMM'2006

How to control static power - General
  • Process optimizations are out of the scope of this talk
  • Design techniques
    • Sleep transistors
    • Power gating
    • Forward and backward biasing (out of our scope)
  • Micro-architecture level
    • Hot spot control
    • Allowing advanced design techniques
  • Architectural level
    • Program behavior dependent techniques
    • Compiler and program’s hint based techniques
    • ACPI

© Dr. Avi Mendelson - ISMM'2006

Design techniques - a few examples

© Dr. Avi Mendelson - ISMM'2006

Architectural level - program behavior related techniques
  • We have known for a long time that most of the lines in the cache are "dead"
  • But dead lines are not free, since they consume leakage power
  • So, if we can predict which lines are dead, we can put them under
    • a sleep transistor that keeps the data (in case the prediction was not perfect) - drowsy cache
    • sleep transistors that lose the data, saving more leakage power but paying a higher penalty for mispredicting a dead line - cache decay (see the sketch below)
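
A minimal cache-decay sketch, under assumed parameters (the decay interval is a tuning knob, and this models only the policy, not the circuit):

```python
# Cache decay: a line untouched for DECAY_INTERVAL cycles is predicted dead
# and powered off, losing its data. A mispredicted line costs an extra miss.

DECAY_INTERVAL = 10_000  # cycles; an assumed tuning parameter

class Line:
    def __init__(self):
        self.powered = False     # is the line's sleep transistor on?
        self.last_access = 0

class DecayCache:
    def __init__(self, n_lines):
        self.lines = [Line() for _ in range(n_lines)]

    def access(self, index, now):
        """Return True on a hit; a decayed (powered-off) line counts as a miss."""
        line = self.lines[index]
        hit = line.powered
        line.powered = True      # (re)fill the line on access
        line.last_access = now
        return hit

    def decay_tick(self, now):
        """Periodically power off lines that look dead, to save leakage."""
        for line in self.lines:
            if line.powered and now - line.last_access >= DECAY_INTERVAL:
                line.powered = False   # data is lost; leakage is saved
```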

© Dr. Avi Mendelson - ISMM'2006

Architectural level - ACPI
  • An operating-system mechanism to control power consumption at the system level (we will focus on the CPU only)
  • Controls three aspects of the system:
    • C-state: when the system has nothing to do, how deep a sleep it can enter. Deeper → more power is saved, but more time is needed to wake up
    • P-state: when running, the minimum frequency (and voltage) the processor can run at without causing a noticeable slowdown
    • T-state: prevents the system from becoming too hot
  • Implementation: periodically check the activity of the system and decide whether to change the operating point (a sketch of such a governor follows)
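
A minimal P-state governor sketch; the frequency ladder, thresholds, and platform hooks are all assumptions for illustration, not the ACPI specification:

```python
import time

P_STATES_MHZ = [800, 1600, 2400, 3200]   # hypothetical frequency ladder

def next_p_state(utilization, current):
    """Pick the lowest frequency that keeps the system responsive."""
    if utilization > 0.8 and current < len(P_STATES_MHZ) - 1:
        return current + 1               # busy: step the frequency up
    if utilization < 0.3 and current > 0:
        return current - 1               # mostly idle: step down, save power
    return current

def governor(read_utilization, set_frequency_mhz, period_s=0.1):
    """read_utilization and set_frequency_mhz are assumed platform hooks."""
    state = 0
    while True:
        state = next_p_state(read_utilization(), state)
        set_frequency_mhz(P_STATES_MHZ[state])
        time.sleep(period_s)             # re-evaluate the operating point periodically
```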

© Dr. Avi Mendelson - ISMM'2006

Combining sleep transistors and ACPI - the Intel Core Duo example
  • Intel Core Duo was designed as a dual-core processor for low-power computers
  • The L2 cache is 2MB in Core Duo and 4MB in Core 2 Duo
  • When the system runs (C0 state), all the caches are active
    • A sophisticated DVS algorithm controls the T- and P-states
  • When starting to "nap" (C3), it flushes the L1 caches and cuts their power (to save leakage)
  • When in deep sleep (C4), it gradually shrinks the L2 until it is totally empty
  • How much performance does this cost? Sometimes you even gain performance; most of the time the cost is on the order of 1%
  • More details: Intel Technology Journal, May 2006

© Dr. Avi Mendelson - ISMM'2006

Agenda
  • What power-aware architectures are and why they are needed
    • Background on the problem
    • Implications and architectural directions
  • Memory related issues
    • Static power implications
    • Dynamic power implications
  • Implications on software and memory management
  • Summary

© Dr. Avi Mendelson - ISMM'2006

Memory management
  • Adding larger shared memory arrays on the die may not make sense any more
    • Access time to the LLC (last-level cache) is starting to become very slow
      • But it is still too fast to manage in software or the OS
    • May cause contention on resources (shared memory and buses)
      • Solving these problems may cost significant power
  • What are the alternatives? (work in progress)
    • COMA within the chip
    • Use of buffers instead of caches - a change of the programming model (see the sketch below)
      • May require a different approach to memory allocation
    • Separate the memory-protection mechanisms from the VM mechanism
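
A minimal sketch of what "buffers instead of caches" means for software: the program stages data explicitly, double-buffering transfers against computation. The dma_load hook is a hypothetical placeholder for an explicit transfer API (real designs, e.g. the Cell SPEs, use DMA engines):

```python
def process_stream(chunks, dma_load, compute):
    """Overlap explicit memory transfers with computation using two buffers."""
    pending = dma_load(chunks[0])              # start fetching the first chunk
    for i in range(len(chunks)):
        current = pending.wait()               # block until this chunk has arrived
        if i + 1 < len(chunks):
            pending = dma_load(chunks[i + 1])  # prefetch the next chunk meanwhile
        compute(current)                       # compute while the transfer proceeds
```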

© Dr. Avi Mendelson - ISMM'2006

Compiler and applications
  • Currently, most cache-related optimizations target performance
    • Many of them also help energy, since they improve the efficiency of the CPU.
    • But they may worsen the max power and power density
  • Increasing parallelism is THE key for future systems.
    • Speculation may hurt if not done with a high degree of confidence.
    • Do we need new programming models, such as transactional memory, for that?
  • Reducing working sets can help reduce leakage power if the system supports drowsy or decay caches (see the tiling sketch below)
  • The program may give the HW and the OS "hints" that can help improve the efficiency of power consumption
    • Response-time requirements
    • If new HW/SW interfaces are defined, we can control the power of the machine at a very fine granularity; e.g., when floating-point is not used, power it off to save leakage
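
Loop tiling is the classic working-set-reducing transformation. A minimal sketch (the tile size is an assumed tuning parameter): by touching one block at a time, only a small set of cache lines must stay live, leaving more lines for a drowsy/decay cache to power down:

```python
TILE = 64  # assumed tile size; chosen so a block fits in a small cache footprint

def transpose_tiled(a):
    """Transpose a square matrix block by block to keep the working set small."""
    n = len(a)
    out = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, TILE):
        for jj in range(0, n, TILE):
            # Only this TILE x TILE block is live in the cache at a time.
            for i in range(ii, min(ii + TILE, n)):
                for j in range(jj, min(jj + TILE, n)):
                    out[j][i] = a[i][j]
    return out
```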

© Dr. Avi Mendelson - ISMM'2006

Garbage collection
  • In many situations, not all the cores in the system will be active. An idle core can be used to perform GC (see the sketch below)
    • In this case we may want to do the GC at a very fine granularity.
  • Most CMP architectures share much of the memory hierarchy. GC done by one core may evict the cache contents and slow down the execution of the entire system.
    • Thus we may prefer to do the GC at a very coarse granularity in order to limit the overhead.
  • A new interface between HW and SW may be needed in order to allow new GC algorithms, or to optimize the execution of the system when using the current ones
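
A minimal sketch of the "GC on an idle core" policy, assuming a Unix-style load average is an acceptable idleness signal (a real runtime would use scheduler state and per-core idle counters instead):

```python
import gc
import os
import threading
import time

IDLE_LOAD = 0.5   # assumed threshold below which we treat the machine as idle

def collect_when_idle(period_s=1.0):
    """Run full collections only while the machine looks idle."""
    while True:
        one_minute_load, _, _ = os.getloadavg()   # Unix-only API
        if one_minute_load < IDLE_LOAD:
            gc.collect()   # do the GC work on otherwise-wasted idle cycles
        time.sleep(period_s)

threading.Thread(target=collect_when_idle, daemon=True).start()
```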

© Dr. Avi Mendelson - ISMM'2006

Summary and future directions
  • The power-aware era impacts all aspects of computer architecture
  • It forces the market to "go parallel", and may cause the memory portion of the die to increase over time
    • To take advantage of “cold transistors”
    • To reduce memory and IO bandwidth
  • We may need to start looking at new paradigms for memory usage and HW/SW interfaces
    • At all levels of the machine; e.g., programming models, OS etc.
    • New programming models such as Transactional Memory may become very important in order to allow better parallelism. Do we need to develop a new memory paradigm to support it?
  • Software will determine if the new (old) trend will become a major success or not.
    • Increase parallelism (but use speculative execution only with high confidence)
    • Control power
    • New SW/HW interfaces may be needed.

© Dr. Avi Mendelson - ISMM'2006

Questions?

© Dr. Avi Mendelson - ISMM'2006

Multi-cores - Intel, AMD, and IBM
  • Both Intel and AMD have dual-core processors
    • Intel uses a shared-cache architecture, while AMD introduced a split-cache architecture
    • AMD has announced that its next-generation processors will use a shared LLC (last-level cache) as well
  • Intel has announced that it is working on four-core processors. Analysts think that AMD is doing the same.
  • Intel has said that it will consider going to 8-way processors only after software catches up. AMD made a similar announcement.
  • For servers, analysts claim that Intel is building a 16-way Itanium-based processor for the 2009 time frame.
  • POWER4 has 2 cores, and POWER5 has 2 cores + SMT. IBM is considering moving both to 2 cores + SMT in the near future.
  • The Xbox has 3 Power4 cores.
  • All three companies promise to increase the number of cores at a pace that fits the market's needs.

© Dr. Avi Mendelson - ISMM'2006

Sun - SPARC T1: Niagara

[Figure: three panels, (a)-(c), showing thread execution timelines]

  • Looking at an in-order machine, each thread has computation time followed by a LONG memory access time (a)
  • If you put 4 such threads on a die, you can overlap I/O, memory, and computation (b)
  • You can use this approach to extend your system (c)
  • The Alewife project did this in the 80's

© Dr. Avi Mendelson - ISMM'2006

Cell architecture - IBM

[Figure: one BIG core and several small cores connected by a ring-based bus unit]

© Dr. Avi Mendelson - ISMM'2006