Climate Machine Update

David Donofrio

RAMP Retreat

8/20/2008

Agenda

  • Project Overview

  • Tensilica Architecture and Design Flow

  • Tensilica Tools Demo

  • Why we need RAMP

  • Current Progress

  • Next Steps

A New Approach to HPC

  • Current HPC Design approach:

    • Leverage commodity processors from Intel, AMD, etc.

    • Once machine is built, optimize problems to run on it

    • Power wall prevents scaling to exaflop performance

    • Power is the new design point

Olukotun and Sutter

Moore’s Law is still in effect, but the number of processors doubles every 18 months rather than the clock rate

A New Approach to HPC

  • Our approach:

    • Identify application, then tailor machine using semi-custom design

    • Optimize CPU architecture and further extend with semi-custom ISA

    • Leverage auto-tuning to access architecture specific optimizations

    • Even if each simple core is 1/4 as computationally efficient as a complex core, you can fit hundreds on a single die and be 100x more power efficient

  • Learn from embedded market where Flops / Watt and rapid design cycles are crucial

    • Start with building blocks from embedded designs rather than full custom ASIC

    • Preserve ability to run general purpose C code

  • Application Target: 1km Scale Climate Model

    Tailor machine architecture to application to reduce waste
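The efficiency claim above can be checked with simple arithmetic. The sketch below is illustrative only: it takes the slide's two figures (1/4 the per-core performance, 100x the power efficiency) and derives the per-core power ratio they imply. The function name and the definition of efficiency as performance divided by power are assumptions, not taken from the presentation.

```python
from fractions import Fraction

def implied_power_ratio(perf_ratio, efficiency_gain):
    """Power of a simple core relative to a complex core.

    Assumes efficiency = performance / power, so
    efficiency_gain = perf_ratio / power_ratio.
    """
    return Fraction(perf_ratio) / Fraction(efficiency_gain)

# Figures from the slide: 1/4 the performance, 100x the power efficiency.
power_ratio = implied_power_ratio(Fraction(1, 4), 100)
print(power_ratio)  # 1/400
```

Under that assumption, each simple core would need to draw about 1/400 of a complex core's power. Put another way, 400 simple cores would fit in one complex core's power budget and deliver 100x its throughput, provided the application parallelizes fully.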

Climate Model Resource Requirements

  • DOE has identified high-resolution climate modeling as a leading justification for exascale computing

  • Must express 20M-way parallelism

  • Requires performance of 200 Pflops peak

  • Simulation must run 1000x faster than real time

  • Amenable to massively concurrent architectures composed of power efficient embedded cores.

  • Actively working with the climate science community to enable new Icosahedral model


Randall / CSU
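To put the requirements above in per-core terms, here is a rough back-of-the-envelope sketch. It assumes the peak rate is divided evenly across the parallel streams and uses a 365-day year; both are simplifying assumptions, and the function names are hypothetical.

```python
PEAK_FLOPS = 200 * 10**15      # 200 Pflops peak (from the slide)
PARALLELISM = 20 * 10**6       # 20M-way parallelism (from the slide)
SPEEDUP_VS_REAL_TIME = 1000    # simulation runs 1000x faster than real time

def flops_per_stream(peak_flops, parallelism):
    """Peak rate each parallel stream must sustain, assuming an even split."""
    return peak_flops / parallelism

def wall_clock_hours(simulated_days, speedup):
    """Wall-clock hours needed to simulate the given number of days."""
    return simulated_days * 24.0 / speedup

per_stream = flops_per_stream(PEAK_FLOPS, PARALLELISM)
print(f"{per_stream / 1e9:.0f} Gflops per parallel stream")
print(f"{wall_clock_hours(365, SPEEDUP_VS_REAL_TIME):.2f} hours per simulated year")
```

With these figures, each stream needs roughly 10 Gflops of peak performance, and one simulated year takes a little under nine hours of wall-clock time.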

Tensilica Processor Design Flow

  • Complete Solution: Hardware, Software and Verification

  • Fully customizable

    • Required base ISA ensures general-purpose applications can still run

  • Processor configuration submitted to Tensilica’s servers where synthesis is performed

    • Returned design can be spun for ASIC or FPGA

    • Bit file available for Avnet boards

  • Building block approach drastically reduces design cycle time compared to full-custom design

Tensilica Inc.

Tensilica Architecture Features

  • Verilog-like TIE language allows for custom ISA extensions

    • Functional and performance verification built in

    • Auto generated compiler intrinsics

    • 64-bit IEEE double-precision floating point implemented in TIE and available

  • Custom VLIW support

  • Inter-processor communication easily enabled through:

    • TIE Ports

    • TIE Queues

      • Access to direct HW support for interprocessor communication

    • TIE Lookups

      • Allows interface to external ROMs or other RTL blocks

Tensilica Performance Debug

  • Processor viewed as black box

  • State can be compressed (via HW) and pushed out JTAG port

    • Intended for program replay

  • Xtensa trace port gives real-time visibility into internal pipeline state with unprecedented detail

    • Cache hit / miss with virtual address

    • Branch taken / not taken

    • Call / return

    • Resource dependency

    • Etc…

  • Opportunity for hundreds of performance counters to be made available

Tensilica Inc.

Why we need RAMP

  • Fast, accurate emulation enables:

    • Dual nested loop of HW / SW co-design

      • Preliminary work using Stanford SM sim shows significant improvement in power efficiency using automated HW/SW co-tuning

      • RAMP is critical to accelerating this loop

    • Rapid prototyping and analysis of Tensilica architectural options

    • Inter-processor communication architecture exploration

    • Running FULL climate code providing a more complete performance picture

  • Cycle-accurate simulator currently runs at ~100 kHz vs. 50 MHz on V5

    • Extensive HW performance counter data enables an emulation environment with similar resolution but much greater speed
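The dual nested loop of HW / SW co-design described above might be sketched as follows. This is a minimal illustration, not the project's actual tuning flow: the `cotune` function, the `measure` callback, the configuration dictionaries, and the toy cost model are all hypothetical. In practice each `measure` call would be a run on the emulator or simulator, which is why emulation speed matters.

```python
def cotune(hw_configs, sw_configs, measure):
    """Return (best_efficiency, hw, sw) over all combinations.

    Outer loop: candidate hardware configurations.
    Inner loop: software tuning of the code for that hardware.
    `measure(hw, sw)` is a hypothetical callback returning (flops, watts).
    """
    best = None
    for hw in hw_configs:
        # Inner loop: find the best software variant for this hardware.
        best_sw = None
        for sw in sw_configs:
            flops, watts = measure(hw, sw)
            eff = flops / watts
            if best_sw is None or eff > best_sw[0]:
                best_sw = (eff, hw, sw)
        if best_sw is not None and (best is None or best_sw[0] > best[0]):
            best = best_sw
    return best

def toy_measure(hw, sw):
    # Made-up model: larger caches cost power; a block that fits in
    # the cache runs faster, and larger blocks amortize loop overhead.
    cache_kb = hw["cache_kb"]
    block_kb = sw["block_kb"]
    flops = (100.0 if block_kb <= cache_kb else 40.0) + block_kb
    watts = 1.0 + 0.05 * cache_kb
    return flops, watts

hw_space = [{"cache_kb": k} for k in (8, 16, 32)]
sw_space = [{"block_kb": b} for b in (4, 8, 16, 32)]
print(cotune(hw_space, sw_space, toy_measure))
```

The point of the sketch is the structure: the software is re-tuned for every hardware candidate, so the comparison between hardware options is made at each one's best achievable flops per watt rather than with a single fixed binary.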

Tensilica-provided emulation environment kick-starts this effort
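The gap between the two clock rates quoted above can be made concrete with a short calculation. The rates come from the slide; the workload size of one billion cycles is a hypothetical figure chosen only for illustration, and the sketch assumes one emulated cycle per FPGA clock.

```python
SIM_HZ = 100_000        # cycle-accurate software simulator, ~100 kHz (from the slide)
FPGA_HZ = 50_000_000    # Xtensa on the Virtex-5 FPGA, 50 MHz (from the slide)

def speedup(fast_hz, slow_hz):
    """How many times faster the first platform executes cycles."""
    return fast_hz / slow_hz

def wall_seconds(cycles, hz):
    """Wall-clock seconds to execute the given number of cycles."""
    return cycles / hz

cycles = 10**9  # hypothetical workload size, for illustration only

print(f"speedup: {speedup(FPGA_HZ, SIM_HZ):.0f}x")
print(f"simulator: {wall_seconds(cycles, SIM_HZ) / 3600:.2f} hours")
print(f"FPGA: {wall_seconds(cycles, FPGA_HZ):.0f} seconds")
```

At these rates the FPGA is about 500x faster, so a run that takes hours in the software simulator finishes in seconds on the FPGA.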

Current Status

  • ML505 used for initial design exploration

    • Basic Xtensa processor + JTAG and memory controller uses ~50% of a Virtex 5 50T

    • Runs at 50MHz

      • ASIC in 65G process runs at 650MHz

  • On-chip debug working

  • Can load / run programs using main memory synthesized from BRAM

  • DRAM interface coded - currently being debugged

  • RTL license recently obtained - full simulation environment (in ModelSim) being brought up

Next Steps…

  • Transition to BEE3 from ML505

  • Bring up XTOS environment on single Xtensa processor on BEE3

  • Run single column of climate code on single processor

    • Demo at SC’08 in November

    • Continue HW / SW co-tuning optimization

  • Begin multi-processor emulation

    • Emulation of single socket, 32 core, using networked BEE3s

    • Running full 2-million-line climate model

The Need for Exascale Computing


  • DOE has identified high-resolution climate modeling as a leading justification for exascale computing

    • 1 km resolution targeted for accurate cloud resolving model

  • Difficult to scale existing systems

    • HPC design using commodity processors estimated to draw 179MW

    • BlueGene design estimated to draw 20MW

    • Leveraging embedded cores and a more application-specific design, a power envelope of 3-5 MW is projected

Randall / CSU
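The three power estimates above can be compared directly. The sketch below uses only the megawatt figures from the slide; the conversion to annual energy assumes continuous operation at the estimated draw for a 365-day year, which is a simplifying assumption, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years

# (low, high) estimated power draw in MW, from the slide.
designs_mw = {
    "commodity processors": (179, 179),
    "BlueGene": (20, 20),
    "embedded cores, application-specific": (3, 5),
}

def annual_energy_gwh(megawatts):
    """Energy used in one year of continuous operation, in GWh."""
    return megawatts * HOURS_PER_YEAR / 1000

def reduction_vs(baseline_mw, low_mw, high_mw):
    """Range of power reduction factors relative to a baseline."""
    return baseline_mw / high_mw, baseline_mw / low_mw

for name, (low, high) in designs_mw.items():
    print(f"{name}: {annual_energy_gwh(low):.2f} to {annual_energy_gwh(high):.2f} GWh/year")

worst, best = reduction_vs(179, 3, 5)
print(f"vs. commodity: {worst:.1f}x to {best:.1f}x less power")
```

On these estimates the embedded-core design draws roughly 36x to 60x less power than the commodity-processor design, and 4x to about 6.7x less than the BlueGene design.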

LBNL will seek an external vendor to build the machine if our approach is proven valid - LBNL is not entering the commercial HPC market.