From 4-bit micros to Multi-cores: A brief history, Future Challenges, and how CEs can prepare for them


From 4-bit micros to Multi-cores: A brief history, Future Challenges, and how CEs can prepare for them. Ganesh Gopalakrishnan, School of Computing, University of Utah. NSF CSR-SMA: Toward Reliable and Efficient Message Passing Software Through Formal Analysis


From 4-bit micros to Multi-cores: A brief history, Future Challenges, and how CEs can prepare for them

Ganesh Gopalakrishnan,

School of Computing,

University of Utah

  • NSF CSR-SMA: Toward Reliable and Efficient Message Passing Software Through Formal Analysis (the “Gauss” project)
  • 2005-TJ-1318 (SRC/Intel Customization), Scaling Formal Methods Towards Hierarchical Protocols in Shared Memory Processors (the “MPV” project)
  • Microsoft HPC Innovation Center, “Formal Analysis and Code Generation Support for MPI”
“Desktops” turn into supercomputers

8 die x 2 CPUs x 2-way execution = 32-way shared memory machine!

Supercomputers have become fundamental tools that underlie all of engineering

(Image courtesy of Steve Parker, CSAFE, Utah)

(BlueGene/L - Image courtesy of IBM / LLNL)

From Simulation to Flight: Virtual Roll-out of Boeing 787 “Dreamliner”

Entire airplane being designed and flown inside a computer (simulation program).

The first plane to fly is the real one (not a mockup model).

(Photo courtesy of Boeing.)

This Talk
  • Some history
    • How the micro came about
    • Past predictions
  • The future
    • Multicores
    • Hardware Challenges
    • Programming them
    • How I am trying to help (my own research)
  • General awareness
    • International matters
    • Tips to survive, … and to excel
The birth of the micro
  • Intel’s 4004 and TI’s TMS-1000 were the first
  • 4004 – with cover removed (L) and on (R)
  • Patent awarded to TI !
  • Intel made single-chip computer for Datapoint
  • Marketed it as 8008 when Datapoint did not use the design
Revolution of the 70s and 80s
  • Intel : 4004, 4040, 8008, 8080, 8085, 8086, 80186, 80286, 80386, 80486, Pentium, PPro, … now “X86” (also Itanium)
  • Motorola: 6800, 6810, 6820, 68000, 68010, 68020, … then PowerPC (collab with IBM)
  • Other companies
  • Burst of activity – EVERY student wanted to build an embedded computer out of a micro in the 70s and 80s.
The micro killed the mini
  • It became amply clear in the 80s that it was going to replace “mainframes”
    • casual experiments conducted between Sun-2 (68020) versus Digital’s VAX 11/750 and 780
  • The birth of the IBM PC around 1980 started things going the µP’s way!
  • With the masses having a PC each, the Internet could be meaningfully reborn!
… and is in every supercomputer
  • John Hennessy’s prediction during SC’97: (http://news-service.stanford.edu/news/1997/november19/supercomp1119.html)
  • John Hennessy: “Today’s microprocessor chipping away at supercomputer market”
    • Traditionally designed supercomputers will vanish within a decade – they have!
  • Clusters of them fill vast rooms now!
IBM ASCI White Machine

Released in 2000

-- Peak Performance : 12.3 teraflops.

-- Processors used : IBM RS6000 SP Power3's - 375 MHz.

-- There are 8,192 of these processors

-- The total amount of RAM is 6 TB.

-- Two hundred cabinets, covering the area of two basketball courts.

IBM BlueGene/L

The first machine in the family, Blue Gene/L, is expected to operate at a peak performance of about 360 teraflops (360 trillion operations per second), and occupy 64 racks -- taking up only about the same space as half of a tennis court. Researchers at the Lawrence Livermore National Laboratory (LLNL) plan to use Blue Gene/L to simulate physical phenomena that require computational capability much greater than presently available, such as cosmology and the behavior of stellar binary pairs, laser-plasma interactions, and the behavior and aging of high explosives.

Now it’s the era of Multi-cores: e.g., Sun Niagara processor

8 CPU cores (80 cores demoed by Intel already…)

Energy advantages of multicores
  • Putting two simple CPUs on a chip achieves 80% of the performance per CPU with only 50% of the power per CPU → the chip as a whole gives 1.6x performance for the same power, PROVIDED we can keep the cores busy
  • Simple way to keep ‘em busy
    • Virus-checker in background while user computes
    • Photoshop in one and Windows on another
  • More complex ways to keep multiple cores busy are being investigated
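
The 1.6x figure is simple arithmetic, and a small sketch makes the "keep the cores busy" condition concrete. The 80% and 50% numbers are the slide's; the function names and the `busy_fractions` parameter are illustrative assumptions, and idle power savings are ignored.

```python
def chip_throughput(perf_per_core, busy_fractions):
    """Throughput relative to one complex CPU (= 1.0).

    perf_per_core: relative speed of each simple core (0.8 on the slide).
    busy_fractions: hypothetical fraction of time each core has useful work.
    """
    return perf_per_core * sum(busy_fractions)


def chip_power(power_per_core, n_cores):
    """Power relative to one complex CPU (= 1.0), ignoring idle savings."""
    return power_per_core * n_cores


both_busy = chip_throughput(0.8, [1.0, 1.0])  # two cores, always busy
one_idle = chip_throughput(0.8, [1.0, 0.0])   # second core never has work
power = chip_power(0.5, 2)

print(f"both busy: {both_busy:.1f}x performance at {power:.1f}x power")
print(f"one idle:  {one_idle:.1f}x performance at {power:.1f}x power")
```

Under this simplified model the second core has to be busy at least 25% of the time (0.8 × 1.25 = 1.0) before the dual-core chip matches the single complex CPU.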
So what are the design issues?

Lots! Here is a small subset:

  • Complex cache coherence protocols !
  • Silicon debugging is becoming a headache !
  • Programming apps is becoming hard !
  • The “Digital Divide”
1. Dual and Quad-cores are the norm these days. Their caches are visibly central

> 80% of chips shipped will be multi-core

(photo courtesy of Intel Corporation.)

What is cache coherence?
  • Illusion of global shared memory is preferred
  • Need mechanisms to keep caches consistent
    • Every read must fetch the data written by the latest write

P1            P2
read(a)       write(a,1)
…             …
read(a)       write(a,2)


P1            P2
read(a,2)     write(a,1)
…             …
read(a,1)     write(a,2)

With a coherent cache, the indicated outcome is not allowed


P1            P2
read(a,2)     write(a,1)
…             …
read(a,2)     write(a,2)

But this outcome is allowed
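
The allowed and disallowed outcomes can be told apart mechanically. The sketch below is a minimal illustration, not part of the talk: it assumes a single location `a`, distinct written values, and a known order of writes, and it checks that one processor never sees an older value after a newer one. The name `is_coherent` is hypothetical.

```python
def is_coherent(write_order, observed_reads):
    """Return True if one processor's reads of a single location respect the
    given order of writes, i.e. it never sees an older value after a newer one.

    Assumes every written value is distinct, as in the slide example.
    """
    position = {value: i for i, value in enumerate(write_order)}
    last_seen = -1
    for value in observed_reads:
        if position[value] < last_seen:
            return False  # went "back in time" to an overwritten value
        last_seen = position[value]
    return True


writes_by_p2 = [1, 2]                      # write(a,1) then write(a,2)
print(is_coherent(writes_by_p2, [2, 1]))   # False: the disallowed outcome
print(is_coherent(writes_by_p2, [2, 2]))   # True: the allowed outcome
```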

Cache Coherence Protocol Verification

My “MPV” research project develops techniques to ensure that cache coherence protocols are correct.

We use an approach called Model Checking.

We control the complexity of model checking through the Assume / Guarantee approach.

(Figure: chip-level view showing intra-cluster protocols, inter-cluster protocols, directories (dir) and memories (mem).)
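
To give a flavor of what model checking means here, the sketch below exhaustively explores every reachable state of a toy invalidation protocol and checks a single-writer invariant. It is an assumption-laden illustration only: the three-state (I/S/M) protocol, its atomic transitions, and all names are made up for this example and are far simpler than the hierarchical protocols the MPV project targets. It does not show the Assume / Guarantee technique.

```python
from collections import deque


def successors(state, invalidate_on_write=True):
    """Yield every state reachable in one atomic step of the toy protocol.

    Each cache line is 'I' (invalid), 'S' (shared) or 'M' (modified).
    """
    n = len(state)
    for i in range(n):
        if state[i] == "I":
            # Read miss: a Modified copy elsewhere is downgraded to Shared.
            nxt = ["S" if s == "M" else s for s in state]
            nxt[i] = "S"
            yield tuple(nxt)
        if state[i] != "M":
            # Write: other copies are invalidated (unless the bug is injected).
            nxt = ["I"] * n if invalidate_on_write else list(state)
            nxt[i] = "M"
            yield tuple(nxt)
        if state[i] != "I":
            # Eviction of the local copy.
            nxt = list(state)
            nxt[i] = "I"
            yield tuple(nxt)


def coherent(state):
    """Invariant: at most one writer, and no readers while a writer exists."""
    writers = state.count("M")
    return writers == 0 or (writers == 1 and state.count("S") == 0)


def model_check(n_caches=2, invalidate_on_write=True):
    """Breadth-first search over all reachable states.

    Returns (violating_state_or_None, number_of_states_seen).
    """
    init = ("I",) * n_caches
    seen = {init}
    queue = deque([init])
    while queue:
        state = queue.popleft()
        if not coherent(state):
            return state, len(seen)
        for nxt in successors(state, invalidate_on_write):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None, len(seen)


print(model_check(2))                             # no violation found
print(model_check(2, invalidate_on_write=False))  # reports a bad state
```

Even this toy shows why hierarchy hurts: the number of states grows quickly with the number of caches and protocol levels, which is the motivation for verifying smaller abstractions instead of the whole design.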

A caching hierarchy such as this is too hard to verify

(Figure: three clusters (Remote Cluster 1, Home Cluster, Remote Cluster 2), each containing L1 caches, an L2 Cache + Local Dir, and a RAC, all connected to a Global Dir and Main Memory.)

So we create several “mutually supporting” abstractions

(Figure: abstracted protocol in which the Home Cluster keeps its L1 caches and its L2 Cache + Local Dir, while Remote Clusters 1 and 2 are reduced to an abstracted L2 Cache + Local Dir’; each cluster has a RAC, connected to the Global Dir and Main Memory.)

Abstracted Protocol #2

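(Figure: abstracted protocol in which Remote Cluster 1 keeps its L1 caches and its L2 Cache + Local Dir, while the Home Cluster and Remote Cluster 2 are reduced to an abstracted L2 Cache + Local Dir’; each cluster has a RAC, connected to the Global Dir and Main Memory.)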

Problem 2: Silicon Debugging: Can’t see “inside” CPUs without paying a huge price
  • On-chip instrumentation is one way to “see” what is inside
  • One must put in several built-in test circuits
  • One must design with the option of bypassing new features

(Figure: four CPUs on a chip, contrasting invisible “miss” traffic with visible “miss” traffic.)

3: Programming Apps is hard! e.g. threads

Thread and process interactions need to coordinate

Otherwise something analogous to this will happen !

Teller 1                              Teller 2
Read bank balance ($100)              Read bank balance ($100)
Add $10 on scratch paper ($110)       Subtract $10 on scratch paper ($90)
Enter $110 into account               Enter $90 into account

USER LEFT WITH $90 – NOT WITH $100 !!
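
The teller scenario is a classic lost update. As a minimal sketch (function names are illustrative, not from the talk), the first function replays the interleaving step by step so the result is deterministic, and the second runs the same two updates as real threads that coordinate through a lock.

```python
import threading


def lost_update():
    """Replay the slide's interleaving step by step (deterministic)."""
    account = 100
    teller1 = account          # Teller 1 reads $100
    teller2 = account          # Teller 2 reads $100
    teller1 += 10              # scratch paper: $110
    teller2 -= 10              # scratch paper: $90
    account = teller1          # Teller 1 enters $110
    account = teller2          # Teller 2 enters $90, overwriting the deposit
    return account


def coordinated_update():
    """Same two updates, but each read-modify-write holds a lock."""
    account = {"balance": 100}
    lock = threading.Lock()

    def adjust(delta):
        with lock:  # only one teller may read and write at a time
            account["balance"] = account["balance"] + delta

    tellers = [threading.Thread(target=adjust, args=(d,)) for d in (10, -10)]
    for t in tellers:
        t.start()
    for t in tellers:
        t.join()
    return account["balance"]


print(lost_update())         # 90
print(coordinated_update())  # 100
```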

Programming Msg. Passing Supercomputers can be quite tricky

My “Gauss” project (in collaboration with Robert M. Kirby) ensures that supercomputer programs do not contain bugs, and also perform efficiently.

Virtually all supercomputers are programmed using the “MPI” communication library.

Misusing this library can often result in bugs that show up only after porting.

P1                         P2
MPI_SEND(to P2, Msg)       MPI_SEND(to P1, Msg)
MPI_RECV(from P2, Msg)     MPI_RECV(from P1, Msg)

If the system does not provide sufficient buffering, the sends may both block, thus causing a deadlock!
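
Why the bug "shows up only after porting" can be seen with a toy simulator. This is not MPI and makes no claim about any real MPI implementation; it is a hypothetical model in which a send completes either by taking a free buffer slot or, with zero buffering, by handing over directly to a receive that is already waiting. All names are assumptions made for the example.

```python
def completes(programs, buffer_slots):
    """Return True if every process finishes, False if they deadlock.

    programs: dict mapping a process name to a list of ("send" | "recv", peer).
    buffer_slots: messages the (hypothetical) system can buffer per channel.
    """
    pc = {p: 0 for p in programs}   # index of each process's next operation
    buffered = {}                   # (src, dst) -> messages in flight

    def at(p, op):
        return pc[p] < len(programs[p]) and programs[p][pc[p]] == op

    progress = True
    while progress:
        progress = False
        for p in programs:
            if pc[p] == len(programs[p]):
                continue
            kind, peer = programs[p][pc[p]]
            if kind == "send":
                if buffered.get((p, peer), 0) < buffer_slots:
                    # Room in the buffer: the send returns immediately.
                    buffered[(p, peer)] = buffered.get((p, peer), 0) + 1
                    pc[p] += 1
                    progress = True
                elif buffer_slots == 0 and at(peer, ("recv", p)):
                    # No buffering: hand over to an already-waiting receive.
                    pc[p] += 1
                    pc[peer] += 1
                    progress = True
            elif buffered.get((peer, p), 0) > 0:
                # Receive: take a buffered message if one has arrived.
                buffered[(peer, p)] -= 1
                pc[p] += 1
                progress = True
    return all(pc[p] == len(programs[p]) for p in programs)


head_to_head = {
    "P1": [("send", "P2"), ("recv", "P2")],
    "P2": [("send", "P1"), ("recv", "P1")],
}
reordered = {
    "P1": [("send", "P2"), ("recv", "P2")],
    "P2": [("recv", "P1"), ("send", "P1")],
}

print(completes(head_to_head, buffer_slots=1))  # True: buffering hides the bug
print(completes(head_to_head, buffer_slots=0))  # False: both sends block
print(completes(reordered, buffer_slots=0))     # True: send is paired with recv
```

The same program passes on the system with buffering and deadlocks on the one without, which is exactly the porting hazard; reordering one side so that a send always meets a receive removes the dependence on buffering.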

Simulation code that does automatic load balancing is difficult to write and debug

(Photo courtesy NHTSA)

LOTS of hard problems remain open
  • How to provide memory bandwidth?
    • Put multicore CPU chip on top of highly dense DRAM chip (e.g. 8 GB)
    • Most users will buy just “one of those”
    • Others will buy SDRAM module add-ons
      • Slow access for now
      • Optical interconnect is an active research area
    • Higher memory bandwidth solutions coming
      • So the real challenge remains programming!
        • Insights from recent Microsoft visit
Emerging Programming Paradigms
  • Microsoft’s Task Parallel Library
  • Intel’s Threading Building Blocks
  • OpenMP, Cluster OpenMP, CUDA, Cilk
  • Transactional memories
  • Special purpose paradigms
    • LINQ and PLINQ for Relational Databases
    • Game Programming: roll customized solutions
Emerging Programming Paradigms
  • Transaction Memories!
    • Users cause too many bugs when programming using locks
    • Transaction memories allow shared memory threads to “watch” each other’s read/write actions
    • Conflicting accesses can rollback and retry
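
The rollback-and-retry idea can be sketched in a few lines. The toy below is an assumption, not any real transactional memory system: each shared variable carries a version number, a transaction records the versions it read, and at commit time it either publishes its writes (if nothing it read has changed) or discards them and reruns. The names `TVar` and `atomically` are illustrative only.

```python
import threading


class TVar:
    """A shared variable with a version counter bumped on every commit."""

    def __init__(self, value):
        self.value = value
        self.version = 0


_commit_lock = threading.Lock()  # serializes only the short commit step


def atomically(transaction):
    """Run transaction(read, write) until it commits without conflict."""
    while True:
        reads = {}   # TVar -> version seen when first read
        writes = {}  # TVar -> tentative new value (private until commit)

        def read(tvar):
            if tvar in writes:
                return writes[tvar]
            reads.setdefault(tvar, tvar.version)
            return tvar.value

        def write(tvar, value):
            writes[tvar] = value

        result = transaction(read, write)
        with _commit_lock:
            if all(tvar.version == seen for tvar, seen in reads.items()):
                for tvar, value in writes.items():
                    tvar.value = value
                    tvar.version += 1
                return result
        # Another thread committed a conflicting write: roll back and retry.


account = TVar(100)


def teller(delta, times):
    for _ in range(times):
        atomically(lambda read, write: write(account, read(account) + delta))


threads = [threading.Thread(target=teller, args=(d, 200)) for d in (10, -10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(account.value)  # deposits and withdrawals cancel out
```

The programmer writes no lock around the balance update; a conflict is detected at commit and the losing transaction simply runs again.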
Problem 4: Huge! The “digital divide”: need plenty of outreach
  • Better CE / CS projects in SLVSEF
  • Mentoring
Learn from History – Learn Computer History
  • If you want to understand today, you have to search yesterday.  ~Pearl Buck
  • Things are changing SO fast that basic principles are often being diluted
  • Get excited by studying computer history and seeing how much better off we are (also be chagrined by all the lost opportunity!)
Where to learn computer history?
  • Computer History Museum, Mountain View
  • Intel Museum, Santa Clara
  • Boston Computer Museum
  • Many in the UK (Manchester, London, …)
  • Travel widely – be inspired by what you see!
It is important to understand the International Scene
  • Lessons from MSR India
    • Amazing talent-pool
    • Relatively high availability of talent
  • Lessons from Intel India
    • Talent-pool still lacks depth and abilities of many of our CEs
    • We can stay competitive in hardware for a LONG time to come
  • Apply for international internships!
Gradual loss of manufacturing → death
  • Lots of manufacturing happening outside the US
  • Fear not – CE / CS jobs are still on the rise
    • Huge demand forecast within the US
  • THE REAL DANGER
    • Loss of manufacturing kills pride and incentive to learn – we don’t want that in CE
Recipe for success
  • The best ideas don’t always work
    • Wait for the world to be ready for the ideas
    • The devil is in the detail
    • Too much established momentum
    • Decide goal (short-term impact vs. long-term)
  • Quiet tenacity
    • Tenacity without ruffling feathers needlessly
    • Work hard! work smart! learn theory! be a champion algorithm / program designer! learn advanced hardware design!
  • Learn to write extremely clearly and precisely!
  • Learn to give inspiring talks! (be inspired first!)