Climate Machine Update

David Donofrio

RAMP Retreat, 8/20/2008

Agenda
  • Project Overview
  • Tensilica Architecture and Design Flow
  • Tensilica Tools Demo
  • Why we need RAMP
  • Current Progress
  • Next Steps
A New Approach to HPC
  • Current HPC Design approach:
    • Leverage commodity processors from Intel, AMD, etc.
    • Once machine is built, optimize problems to run on it
    • Power wall prevents scaling to exaflop performance
    • Power is the new design point

Olukotun and Sutter

Moore’s Law still in effect, but the number of processors doubles every 18 months rather than the clock rate

A New Approach to HPC
  • Our approach:
    • Identify application, then tailor machine using semi-custom design
    • Optimize CPU architecture and further extend with semi-custom ISA
    • Leverage auto-tuning to access architecture specific optimizations
    • Even if each simple core is 1/4 as computationally efficient as a complex core, you can fit hundreds on a single die and be 100x more power efficient
  • Learn from embedded market where Flops / Watt and rapid design cycles are crucial
    • Start with building blocks from embedded designs rather than full custom ASIC
    • Preserve ability to run general purpose C code
  • Application Target: 1km Scale Climate Model

Tailor machine architecture to application to reduce waste

Climate Model Resource Requirements
  • DOE has identified high-resolution climate modeling as a leading justification for exascale computing
  • Must express 20M-way parallelism
  • Requires performance of 200 Pflops peak
  • Simulation must run 1000x faster than real time
  • Amenable to massively concurrent architectures composed of power-efficient embedded cores
  • Actively working with the climate science community to enable a new icosahedral model
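
The requirements above imply some simple per-thread and wall-clock figures. The following is a back-of-envelope sketch only; it assumes one execution context per unit of parallelism and a 365-day year, neither of which is stated on the slide.

```python
# Back-of-envelope arithmetic using only the figures quoted on the slide.
PEAK_FLOPS = 200e15           # 200 Pflops peak
PARALLELISM = 20e6            # 20M-way parallelism
SPEEDUP_VS_REAL_TIME = 1000   # simulation runs 1000x faster than real time

# Assumption: one execution context per unit of parallelism.
flops_per_thread = PEAK_FLOPS / PARALLELISM

# Assumption: a 365-day year.
HOURS_PER_YEAR = 365 * 24
wall_hours_per_sim_year = HOURS_PER_YEAR / SPEEDUP_VS_REAL_TIME
wall_days_per_sim_century = 100 * 365 / SPEEDUP_VS_REAL_TIME

print(f"Peak rate per thread: {flops_per_thread / 1e9:.1f} Gflops")
print(f"Wall-clock time per simulated year: {wall_hours_per_sim_year:.2f} hours")
print(f"Wall-clock time per simulated century: {wall_days_per_sim_century:.1f} days")
```

Under these assumptions each thread needs roughly 10 Gflops of peak throughput, and a simulated century takes a little over a month of wall-clock time.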


Randall / CSU

Tensilica Processor Design Flow
  • Complete Solution: Hardware, Software and Verification
  • Fully customizable
    • Required base ISA ensures general-purpose applications still run
  • Processor configuration submitted to Tensilica’s servers where synthesis is performed
    • Returned design can be spun for ASIC or FPGA
    • Bit file available for Avnet boards
  • Building block approach drastically reduces design cycle time compared to full-custom design

Tensilica Inc.

Tensilica Architecture Features
  • Verilog-like TIE language allows for custom ISA extensions
    • Functional and performance verification built in
    • Auto generated compiler intrinsics
    • 64-bit IEEE double-precision floating point implemented in TIE and available
  • Custom VLIW support
  • Inter-processor communication easily enabled through:
    • TIE Ports
    • TIE Queues
      • Access to direct HW support for interprocessor communication
    • TIE Lookups
      • Allows interface to external ROMs or other RTL blocks
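
To illustrate the queue-style inter-processor communication listed above, here is a small software model of a fixed-depth FIFO between a producer core and a consumer core. It is a conceptual sketch in Python, not TIE syntax and not Tensilica's implementation; the class name, the depth, and the consumer rate are all hypothetical.

```python
from collections import deque


class BoundedQueue:
    """Software model of a fixed-depth hardware FIFO between two cores."""

    def __init__(self, depth):
        self.depth = depth
        self._items = deque()

    def try_push(self, value):
        """Return False when the queue is full (the producer would stall)."""
        if len(self._items) >= self.depth:
            return False
        self._items.append(value)
        return True

    def try_pop(self):
        """Return (True, value), or (False, None) when empty (the consumer would stall)."""
        if not self._items:
            return False, None
        return True, self._items.popleft()


def run(depth=2, n_values=5):
    """Producer pushes every cycle; consumer pops every other cycle (hypothetical rates)."""
    queue = BoundedQueue(depth)
    to_send = list(range(n_values))
    received = []
    producer_stalls = 0
    cycle = 0
    while len(received) < n_values:
        if to_send:
            if queue.try_push(to_send[0]):
                to_send.pop(0)
            else:
                producer_stalls += 1
        if cycle % 2 == 1:
            ok, value = queue.try_pop()
            if ok:
                received.append(value)
        cycle += 1
    return received, producer_stalls


received, producer_stalls = run()
print(received, producer_stalls)
```

The model preserves ordering and shows the back-pressure behaviour one would expect from a hardware queue: a faster producer stalls once the FIFO fills.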
Tensilica Performance Debug
  • Processor viewed as black box
  • State can be compressed (via HW) and pushed out through the JTAG port
    • Intended for program replay
  • Xtensa trace port gives real-time visibility into internal pipeline state with unprecedented detail
    • Cache ($) hit / miss with virtual address
    • Branch taken / not taken
    • Call / return
    • Resource dependency
    • Etc…
  • Opportunity for hundreds of performance counters to be made available

Tensilica Inc.

Why we need RAMP
  • Fast, accurate emulation enables:
    • Dual nested loop of HW / SW co-design
      • Preliminary work using Stanford SM sim shows significant improvement in power efficiency using automated HW/SW co-tuning
      • RAMP is critical to accelerating this loop
    • Rapid prototyping and analysis of Tensilica architectural options
    • Inter-processor communication architecture exploration
    • Running FULL climate code providing a more complete performance picture
  • Cycle-accurate simulator currently runs at ~100 kHz vs. 50 MHz on a Virtex-5 (V5) FPGA
    • Extensive HW performance counter data enables an emulation environment with similar resolution but much greater speed
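
The simulator and FPGA clock rates quoted above translate into a large difference in turnaround time. The sketch below works through that arithmetic; the workload size of 1e12 cycles is a hypothetical figure chosen only for illustration, and it assumes one emulated cycle per FPGA clock.

```python
# Rates quoted on the slide.
SIMULATOR_HZ = 100e3   # cycle-accurate software simulator, ~100 kHz
FPGA_HZ = 50e6         # emulation on the Virtex-5, 50 MHz

speedup = FPGA_HZ / SIMULATOR_HZ

# Hypothetical workload size, for illustration only.
WORKLOAD_CYCLES = 1e12

simulator_days = WORKLOAD_CYCLES / SIMULATOR_HZ / 86400
fpga_hours = WORKLOAD_CYCLES / FPGA_HZ / 3600

print(f"Emulation speedup: {speedup:.0f}x")
print(f"Simulator: {simulator_days:.1f} days, FPGA: {fpga_hours:.1f} hours")
```

At these rates a run that would occupy the software simulator for months finishes on the FPGA within a working day.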

Tensilica-provided emulation environment kick-starts this effort
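
The "dual nested loop" of HW / SW co-design mentioned on this slide can be pictured as an outer search over hardware configurations with an inner software auto-tuning pass for each one. The sketch below is a toy illustration only: the `measure` cost model, the cache sizes, and the block sizes are all hypothetical stand-ins for measurements that would come from the emulator, and it is not the project's actual co-tuning tool.

```python
def measure(cache_kb, block):
    """Hypothetical stand-in for an emulator measurement: (runtime_s, power_w)."""
    working_set_kb = block * block * 8 / 1024          # block x block doubles
    penalty = 1.0 if working_set_kb <= cache_kb else 3.0  # toy cache-miss penalty
    runtime_s = penalty * (1.0 + 16.0 / block)         # larger blocks amortize overhead
    power_w = 0.05 + 0.002 * cache_kb                  # larger caches cost power
    return runtime_s, power_w


def energy(cache_kb, block):
    """Energy in joules for one run under the toy model."""
    runtime_s, power_w = measure(cache_kb, block)
    return runtime_s * power_w


def co_tune(cache_sizes_kb, block_sizes):
    """Outer loop over hardware options, inner loop auto-tunes the software."""
    best = None
    for cache_kb in cache_sizes_kb:
        # Inner loop: pick the software block size that suits this hardware.
        tuned_block = min(block_sizes, key=lambda b: energy(cache_kb, b))
        candidate = (energy(cache_kb, tuned_block), cache_kb, tuned_block)
        if best is None or candidate < best:
            best = candidate
    return best


best_energy, best_cache_kb, best_block = co_tune([8, 16, 32, 64], [16, 32, 64])
print(f"cache={best_cache_kb} KB, block={best_block}, energy={best_energy:.3f} J")
```

Each evaluation of `measure` stands for a full run of the code on one configuration, which is why fast emulation matters: the number of runs grows with the product of the hardware and software search spaces.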

Current Status
  • ML505 used for initial design exploration
    • Basic Xtensa processor + JTAG and memory controller occupies ~50% of a Virtex-5 50T
    • Runs at 50 MHz
      • ASIC in 65G process runs at 650 MHz
  • On-chip debug working
  • Can load / run programs using main memory synthesized from BRAM
  • DRAM interface coded - currently being debugged
  • RTL license recently obtained - full simulation environment (in ModelSim) being brought up
Next Steps…
  • Transition to BEE3 from ML505
  • Bring up XTOS environment on a single Xtensa processor on BEE3
  • Run single column of climate code on single processor
    • Demo at SC’08 in November
    • Continue HW / SW co-tuning optimization
  • Begin multi-processor emulation
    • Emulation of a single 32-core socket using networked BEE3s
    • Running the full 2-million-line climate model
The Need for Exascale Computing


  • DOE has identified high-resolution climate modeling as a leading justification for exascale computing
    • 1 km resolution targeted for accurate cloud resolving model
  • Difficult to scale existing systems
    • HPC design using commodity processors estimated to draw 179 MW
    • BlueGene design estimated to draw 20 MW
    • By leveraging embedded cores and a more application-specific design, a power envelope of 3-5 MW is projected
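
The three power estimates above can be compared directly. This short sketch computes the ratios using only the slide's numbers; it assumes all three estimates refer to the same target workload, which the slide implies but does not state.

```python
# Power estimates quoted on the slide, in megawatts.
COMMODITY_MW = 179.0
BLUEGENE_MW = 20.0
EMBEDDED_LOW_MW = 3.0
EMBEDDED_HIGH_MW = 5.0

# Ratio ranges: the smaller figure uses the 5 MW end of the embedded projection.
vs_commodity = (COMMODITY_MW / EMBEDDED_HIGH_MW, COMMODITY_MW / EMBEDDED_LOW_MW)
vs_bluegene = (BLUEGENE_MW / EMBEDDED_HIGH_MW, BLUEGENE_MW / EMBEDDED_LOW_MW)

print(f"Commodity vs. embedded: {vs_commodity[0]:.1f}x to {vs_commodity[1]:.1f}x")
print(f"BlueGene vs. embedded: {vs_bluegene[0]:.1f}x to {vs_bluegene[1]:.1f}x")
```

On these figures the projected embedded design draws roughly 36x to 60x less power than the commodity estimate and roughly 4x to 7x less than the BlueGene estimate.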

Randall / CSU

LBNL will seek an external vendor to build the machine if our approach is proven valid - LBNL is not entering the commercial HPC market.