
Presentation Transcript



All the chips outside… and around the PC: what new platforms? Apps? Challenges, what’s interesting, and what needs doing?

Gordon Bell

Bay Area Research Center

Microsoft Corporation



Architecture changes when everyone and everything is mobile! Power, security, RF, WWW, display, data-types (e.g. video & voice)… it’s the application of architecture!



The architecture problem

  • The apps

    • Data-types: video, voice, RF, etc.

    • Environment: power, speed, cost

  • The material: clock, transistors…

  • Performance… it’s about parallelism (see the note after this list)

    • Program & programming environment

    • Network e.g. WWW and Grid

    • Clusters

    • Multiprocessors

    • Storage, cluster, and network interconnect

    • Processor and special processing

    • Multi-threading and multiple processors per chip

    • Instruction Level Parallelism vs. vector processors
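
A standard way to quantify the parallelism point above (a textbook formula, not from the slides): Amdahl’s law bounds the speedup of a program whose parallelizable fraction is f when run on p processors:

$$S(p) \;\le\; \frac{1}{(1-f) + f/p}$$

so even with unlimited processors the speedup is capped at 1/(1 - f).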



IP On Everything



Poochi



Sony Playstation export limits



PC At An Inflection Point?

[Chart: PCs vs. non-PC devices and Internet]

It needs to continue to be upward. These scalable systems provide the highest technical (Flops) and commercial (TPC) performance. They drive microprocessor competition!



The Dawn Of The PC-Plus Era, Not The Post-PC Era… devices aggregate via PCs!!!

[Diagram: consumer PCs at the center, surrounded by mobile companions, TV/AV, household management, communications, and automation & security.]



PC will prevail for the next decade as a dominant platform … 2nd to smart, mobile devices

  • Moore’s Law increases performance; alternatively, it reduces prices

  • PC server clusters with low cost OS beat proprietary switches, smPs, and DSMs

  • Home entertainment & control …

    • Very large disks (1TB by 2005) to “store everything”

    • Screens to enhance use

  • Mobile devices, etc. dominate WWW >2003!

  • Voice and video become important apps!

C = Commercial; C’ = Consumer



Where’s the action? Problems?

  • Constraints: speech, video, mobility, RF, GPS, security… Moore’s Law, including network speed

  • Scalability and high performance processing

    • Building them: Clusters vs DSM

    • Structure: where’s the processing, memory, and switches (disk and ip/tcp processing)

    • Micros: getting the most from the nodes

  • Not ISAs: change can delay the Moore’s Law effect … and wipe out software investment! Please, please, just interpret my object code!

  • System on a chip alternatives… apps drive

    • Data-types (e.g. voice, video, RF), performance, portability/power, and cost



High Performance Computing

A 60+ year view



High performance architecture/program timeline

1950 . . 1960 . . 1970 . . 1980 . . 1990 . . 2000
Vacuum tubes → transistors → MSI (minis) → micros → RISC → nMicro

Sequential programming (single execution stream) spans the whole period.
SIMD / vector machines follow.
Parallelization…
Parallel programs, aka cluster computing:
  multicomputers (the MPP era)
  ultracomputers (10X in size & price! 10x MPP)
  “in situ” resources (100x in //sm: NOW, VLSCC)
  geographically dispersed (Grid)



Computer types

[Diagram: computer types arranged by connectivity, from WAN/LAN through SAN and DSM to shared memory (SM). Networked supers and the GRID (Legion, Condor) span the WAN/LAN end; clusters built from micro and vector nodes (Beowulf, NT clusters, NOW, T3E, SP2 (mP)) sit in the SAN column; SGI DSM and clusters of SGI DSM occupy the DSM column; and VPPuni, NEC mP, NEC super, Cray X…T (all mPv), mainframes, multis, workstations (WSs), and PCs occupy the shared-memory column.]



Technical computer types

[Diagram: technical computer types by connectivity (WAN/LAN, SAN, DSM, SM). Old world (one program stream): VPPuni, NEC mP, T series, NEC super, Cray X…T (all mPv), mainframes, multis, WSs, PCs, and SGI DSM. New world, clustered computing (multiple program streams): networked supers, GRID (Legion, Condor), Beowulf, SP2 (mP), NOW, and clusters of SGI DSM, built from micro and vector nodes.]



Dead Supercomputer Society



ACRI

Alliant

American Supercomputer

Ametek

Applied Dynamics

Astronautics

BBN

CDC

Convex

Cray Computer

Cray Research

Culler-Harris

Culler Scientific

Cydrome

Dana/Ardent/Stellar/Stardent

Denelcor

Elexsi

ETA Systems

Evans and Sutherland Computer

Floating Point Systems

Galaxy YH-1

Goodyear Aerospace MPP

Gould NPL

Guiltech

Intel Scientific Computers

International Parallel Machines

Kendall Square Research

Key Computer Laboratories

MasPar

Meiko

Multiflow

Myrias

Numerix

Prisma

Tera

Thinking Machines

Saxpy

Scientific Computer Systems (SCS)

Soviet Supercomputers

Supertek

Supercomputer Systems

Suprenum

Vitesse Electronics




SCI Research c1985-1995

  • 35 university and corporate R&D projects

  • 2 or 3 successes…

  • All the rest failed to work or be successful



How to build scalables?

To cluster or not to cluster… don’t we need a single, shared memory?



Application Taxonomy

Technical:

  • General purpose, non-parallelizable codes (PCs have it!)

  • Vectorizable

  • Vectorizable & //able (supers & small DSMs)

  • Hand tuned, one-of

  • MPP coarse grain

  • MPP embarrassingly // (clusters of PCs...)

Commercial:

  • Database

  • Database/TP

  • Web host

  • Stream audio/video

If central control & rich, then IBM or large SMPs; else PC clusters.



SNAP … c1995: Scalable Network And Platforms. A view of computing in 2000+. We all missed the impact of WWW!

Gordon Bell

Jim Gray



Computing SNAP built entirely from PCs

[Diagram: a space, time (bandwidth), & generation scalable environment. A wide-area global network ties together: legacy mainframe & minicomputer servers & terminals; portables and mobile nets; wide & local area networks for terminals, PCs, workstations, & servers; person servers (PCs); scalable computers built from PCs; centralized & departmental uni- & mP servers (UNIX & NT), becoming centralized & departmental servers built from PCs; and TC = TV + PC in the home (via CATV, ATM, or satellite).]



Bell Prize and Future Peak Tflops (t)

[Chart: Bell Prize results and peak Tflops over time, with XMP, NCube, CM2, NEC, and *IBM machines and the Petaflops study target marked.]



Top 10 TPC-C

The top two Compaq systems are 1.1X and 1.5X faster than the IBM SPs, at 1/3 the price of IBM and 1/5 the price of Sun.



Courtesy of Dr. Thomas Sterling, Caltech



Five Scalabilities

Size scalable -- designed from a few components, with no bottlenecks

Generation scaling -- no rewrite/recompile or user effort to run across generations of an architecture

Reliability scaling… choose any level

Geographic scaling -- compute anywhere (e.g. multiple sites or in situ workstation sites)

Problem x machine scalability -- ability of an algorithm or program to exist at a range of sizes that run efficiently on a given, scalable computer.

Problem x machine space => run time: problem scale, machine scale (#p), and run time together imply speedup and efficiency.
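
Reading the last point in the usual notation (notation assumed, not from the slides): with T(1) the run time on one processor and T(p) the run time on p processors,

$$S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}$$

Problem x machine scalability asks that efficiency stay high as both the problem scale and the machine scale (#p) grow.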



Why I gave up on large smPs & DSMs

  • Economics: Perf/Cost is lower… unless a commodity

  • Economics: Longer design time & life. Complex. => Poorer tech tracking & end of life performance.

  • Economics: Higher, uncompetitive costs for processor & switching. Sole sourcing of the complete system.

  • DSMs … NUMA! Latency matters. Compiler, run-time, O/S locate the programs anyway.

  • Aren’t scalable. Reliability requires clusters. Start there.

  • They aren’t needed for most apps… hence, a small market unless one can find a way to lock in a user base, as IBM did with Token Ring vs Ethernet.



FVCORE Performance: Finite Volume Community Climate Model, joint code development by NASA, LLNL, and NCAR

[Chart: FVCORE performance compared across the NEC SX-5 and SX-4, the Cray C90-16 (max), and the Cray T3E (max).]



Architectural Contrasts – Vector vs Microprocessor

Vector system: CPU with vector registers (8 KBytes) fed from memory; 500 MHz; two results per clock.

Microprocessor system: CPU with 1st & 2nd level caches (8 MBytes) in front of memory; 600 MHz; two results per clock (will be 4 in the next-generation SGI).

Contrast points: vector lengths arbitrary vs. fixed; vectors fed at low vs. high speed.

Cache based systems are nothing more than “vector” processors with a highly programmable “vector” register set (the caches). These caches are 1000x larger than the vector registers on a Cray vector system, and provide the opportunity to execute vector work at a very high sustained rate. In particular, note that 512-CPU Origins contain 4 GBytes of cache. This is larger than most problems of interest, and offers a tremendous opportunity for high performance across a large number of CPUs. This has been borne out in fact at NASA Ames.
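
A quick consistency check on the aggregate-cache claim, using the 8 MB of cache per CPU cited above:

$$512\ \text{CPUs} \times 8\ \text{MB/CPU} = 4\ \text{GB of cache}$$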



Convergence to one architecture

mPs continue to be the main line



“Jim, what are the architectural challenges … for clusters?”

  • WANs (and even LANs) faster than backplanes, at 40 Gbps (see the conversion after this list)

  • End of busses (fc=100 MBps)… except on a chip

  • What are the building blocks or combinations of processing, memory, & storage?

  • InfiniBand (http://www.infinibandta.org) starts at OC48, but it may not go far or fast enough if it ever exists. OC192 is being deployed.
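
As a unit conversion for the first two bullets (my arithmetic, not the slide’s):

$$40\ \text{Gbps} = 5\ \text{GB/s} \approx 50 \times 100\ \text{MB/s}$$

i.e. a 40 Gbps link carries roughly fifty times the cited 100 MBps Fibre Channel bus rate.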



What is the basic structure of these scalable systems?

  • Overall

  • Disk connection, especially with respect to Fibre Channel

  • SAN, especially with fast WANs & LANs



Modern scalable switches … also hide a supercomputer

  • Scale from <1 to 120 Tbps of switch capacity

  • 1 Gbps ethernet switches scale to 10s of Gbps

  • SP2 scales from 1.2 Gbps



GB plumbing from the baroque: evolving from the 2 dance-hall model

Mp — S — Pc
         |—— S.fc — Ms
         |—— S.Cluster
         |—— S.WAN —

evolving to:

Mp Pc Ms — S.LAN/Cluster/WAN — …

(PMS notation: Pc = processor, Mp = primary memory, Ms = secondary memory/disk, S = switch.)



SNAP Architecture



ISTORE Hardware Vision

  • System-on-a-chip enables computer & memory to be added without significantly increasing the size of the disk

  • 5-7 year target:

  • MicroDrive: 1.7” x 1.4” x 0.2”; 2006: ? (see the worked numbers after this list)

    • 1999: 340 MB, 5400 RPM, 5 MB/s, 15 ms seek

    • 2006: 9 GB, 50 MB/s ? (1.6X/yr capacity, 1.4X/yr BW)

  • Integrated IRAM processor

    • 2x height

  • Connected via crossbar switch

    • growing like Moore’s law

  • 16 MBytes; 1.6 Gflops; 6.4 Gops

  • 10,000+ nodes in one rack!

  • 100/board = 1 TB; 0.16 Tflops
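
The 2006 MicroDrive projections follow from compounding the stated yearly rates over the seven years from 1999 (my arithmetic, matching the slide’s targets):

$$340\ \text{MB} \times 1.6^{7} \approx 9.1\ \text{GB}, \qquad 5\ \text{MB/s} \times 1.4^{7} \approx 53\ \text{MB/s}$$

Likewise, 100 such drives per board give roughly 100 x 9 GB ≈ 1 TB and 100 x 1.6 Gflops = 0.16 Tflops.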



The Disk Farm? Or a System On a Card?

The 500 GB disc card: a 14" card carrying an array of discs. It can be used as 100 discs, 1 striped disc, 50 FT (fault-tolerant) discs, … etc., giving LOTS of accesses/second and LOTS of bandwidth. A few disks are replaced by 10s of GBytes of RAM and a processor to run apps!!



Map of Gray Bell Prize results

[Map: single-thread, single-stream TCP/IP, desktop-to-desktop via 7 hops, Win 2K out-of-the-box performance*. Endpoints shown: Redmond/Seattle, WA; San Francisco, CA; Arlington, VA; New York. One path covers 5626 km in 10 hops.]



Ubiquitous 10 GBps SANs in 5 years

  • 1 Gbps Ethernet is a reality now.

    • Also Fibre Channel, Myrinet, GigaNet, ServerNet, ATM, …

  • 10 Gbps x4 WDM deployed now (OC192)

    • 3 Tbps WDM working in the lab

  • In 5 years, expect 10x, wow!!

[Chart: link rates stepping up from 5 MBps through 20, 40, 80, and 120 MBps (1 Gbps) toward 1 GBps.]



The Promise of SAN/VIA: 10x in 2 years (http://www.ViArch.org/)

  • Yesterday:

    • 10 MBps (100 Mbps Ethernet)

    • ~20 MBps tcp/ip saturates 2 cpus

    • round-trip latency ~250 µs

  • Now

    • Wires are 10x faster Myrinet, Gbps Ethernet, ServerNet,…

    • Fast user-level communication

      • tcp/ip ~ 100 MBps 10% cpu

      • round-trip latency is ~15 µs

  • 1.6 Gbps demoed on a WAN



Processor improvements… 90% of ISCA’s focus



We get more of everything



Mainframes, minis, micros, and RISC



Computer ops/sec x word length / $



Growth of microprocessor performance

[Chart: performance in Mflop/s (log scale, 0.01 to 10,000) vs. year, 1980-1998. Supers: Cray 1S, Cray X-MP, Cray Y-MP, Cray 2, Cray C90, Cray T90. Micros: 8087, 80287, 6881, 80387, R2000, i860, RS6000/540, RS6000/590, Alpha. The micro curve climbs steadily toward the super curve.]



Albert Yu predictions ‘96

When              2000       2006       Ratio
Clock (MHz)       900        4,000      4.4x
MTransistors      40         350        8.75x
Mops              2,400      20,000     8.3x
Die (sq. in.)     1.1        1.4        1.3x
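
For scale, the 2000-to-2006 ratios imply these compound annual growth rates over six years (my arithmetic, not Yu’s):

$$4.4^{1/6} \approx 1.28\ (\sim 28\%/\text{yr for clock}), \qquad 8.75^{1/6} \approx 1.44\ (\sim 44\%/\text{yr for transistors})$$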



Processor Limit: DRAM Gap

[Chart: CPU performance grows ~60%/yr (“Moore’s Law”) while DRAM performance grows ~7%/yr over 1980-2000, so the processor-memory performance gap grows ~50%/year.]

  • Alpha 21264 full cache miss / instructions executed: 180 ns / 1.7 ns = 108 clks x 4, or 432 instructions (worked out below)

  • Caches in Pentium Pro: 64% area, 88% transistors

*Taken from Patterson-Keeton Talk to SigMod
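
Spelling out the 21264 bullet (assuming the roughly 600 MHz, 4-issue Alpha 21264 of that era):

$$\frac{180\ \text{ns miss}}{\approx 1.7\ \text{ns per clock}} \approx 108\ \text{clocks}, \qquad 108 \times 4\ \text{instructions/clock} \approx 432\ \text{instructions per miss}$$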



The “memory gap”

  • Multiple (e.g. 4) processors per chip, in order to increase the ops/chip while waiting out the inevitable access delays

  • Or alternatively, multi-threading (MTA)

  • Vector processors with a supporting memory system

  • System-on-a-chip… to reduce chip boundary crossings



If system-on-a-chip is the answer, what is the problem?

  • Small, high volume products

    • Phones, PDAs,

    • Toys & games (to sell batteries)

    • Cars

    • Home appliances

    • TV & video

  • Communication infrastructure

  • Plain old computers… and portables



SOC Alternatives… not including C/C++ CAD Tools

  • The blank sheet of paper: FPGA

  • Auto design of a basic system: Tensilica

  • Standardized, committee designed components*, cells, and custom IP

  • Standard components including more application specific processors *, IP add-ons and custom

  • One chip does it all: SMOP

    *Processors, memory, communication & memory links



Xilinx: 10M gates, 500M transistors, 0.12 micron



Free 32 bit processor core



System-on-a-chip alternatives



Cradle: Universal Microsystem, trading Verilog & hardware for C/C++

UMS : VLSI = microprocessor : special systems = software : hardware

  • Single part for all apps

  • App spec’d @ run time using FPGA & ROM

  • 5 quad mPs at 3 Gflops/quad = 15 Gflops

  • Single shared memory space, caches

  • Programmable periphery including: 1 GB/s; 2.5 Gips; PCI, 100baseT, FireWire

  • $4 per Gflops; 150 mW/Gflops



UMS Architecture

  • Memory bandwidth scales with processing

  • Scalable processing, software, I/O

  • Each app runs on its own pool of processors

  • Enables durable, portable intellectual property



Recapping the challenges

  • Scalable systems

    • Latency in a distributed memory

    • Structure of the system and nodes

    • Network performance for OC192 (10 Gbps)

    • Processing nodes and legacy software

  • Mobile systems… power, RF, voice, I/O

    • Design time!



The End

