
Stanford Streaming Supercomputer (SSS) Project Meeting



Presentation Transcript


  1. Stanford Streaming Supercomputer (SSS) Project Meeting
  Bill Dally, Pat Hanrahan, and Ron Fedkiw
  Computer Systems Laboratory, Stanford University
  October 2, 2001

  2. Agenda
  • Introductions (now)
  • Vision – subset of ASCI review slides
  • Goals for the quarter
  • Schedule of meetings for the quarter

  3. Computation is inexpensive and plentiful
  • NVIDIA GeForce3: ~80 Gflops/sec, ~800 Gops/sec
  • Velio VC3003: 1 Tb/s I/O BW
  • DRAM: < $0.20/MB

  4. But supercomputers are very expensive
  • Cost more per GFLOPS, GUPS, and GByte than low-end machines
  • Hard to achieve a high fraction of peak performance on global problems
  • Based on clusters of CPUs that are scaling at only 20%/year vs. 50% historically

  5. Microprocessors no longer realize the potential of VLSI
  [Chart comparing growth rates; labels from the slide: 52%/year, 19%/year, 74%/year, and gaps of 30:1, 1,000:1, and 30,000:1]

  6. Streaming processors leverage emerging technology
  • Streaming supercomputer can achieve
    • $20/GFLOPS, $2/M-GUPS
    • Scalable to PFLOPS and 10^13 GUPS
  • Enabled by
    • Stream architecture
      • Exposes and exploits parallelism and locality
      • High arithmetic intensity (ops/BW)
      • Hides latency
    • Efficient interconnection networks
      • High global bandwidth
      • Low latency

  7. What is stream processing?
  • Streams expose data parallelism
  • Operations within a kernel operate on local data
  • Kernels can be partitioned across chips to exploit control parallelism
  [Diagram: stereo depth-extraction pipeline; Image 0 and Image 1 each pass through a pair of convolve kernels into an SAD kernel that produces a Depth Map]
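To make the model concrete, here is a minimal sketch of that pipeline in plain C++ (std::vector stands in for a stream and the function names are our own; this is not the project's kernel language):

#include <vector>
#include <cstddef>

// A "stream" modeled as a plain vector of records.
using Stream = std::vector<float>;

// Kernel: 3-tap convolution. Each output element depends only on a small
// local neighborhood of the input, so every element can be computed in
// parallel (data parallelism) and no global state is touched.
Stream convolve3(const Stream& in, float w0, float w1, float w2) {
    Stream out(in.size(), 0.0f);
    for (std::size_t i = 1; i + 1 < in.size(); ++i)
        out[i] = w0 * in[i - 1] + w1 * in[i] + w2 * in[i + 1];
    return out;
}

// Kernel: element-wise absolute difference of two filtered images,
// a toy stand-in for the SAD stage of the depth-map pipeline.
Stream sad(const Stream& a, const Stream& b) {
    Stream out(a.size(), 0.0f);
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
        out[i] = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
    return out;
}

int main() {
    Stream image0(1024, 1.0f), image1(1024, 2.0f);
    // The two convolve chains are independent (control parallelism) and
    // join at the SAD kernel, mirroring the slide's diagram.
    Stream left  = convolve3(convolve3(image0, 0.25f, 0.5f, 0.25f), 0.25f, 0.5f, 0.25f);
    Stream right = convolve3(convolve3(image1, 0.25f, 0.5f, 0.25f), 0.25f, 0.5f, 0.25f);
    Stream depth = sad(left, right);
    return depth.empty() ? 1 : 0;
}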

  8. Why does it get good performance – easily?
  [Diagram: on-chip bandwidth hierarchy; off-chip SDRAM at 2 GB/s, the Stream Register File at 32 GB/s, and the ALU clusters at 544 GB/s]
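A back-of-envelope reading of those numbers (the arithmetic below is ours, using only the bandwidths on the slide):

\[
\frac{544\ \mathrm{GB/s}}{32\ \mathrm{GB/s}} = 17,
\qquad
\frac{544\ \mathrm{GB/s}}{2\ \mathrm{GB/s}} = 272
\]

Operand bandwidth inside the ALU clusters is roughly 272 times the off-chip SDRAM bandwidth, so as long as the application exposes enough locality for the stream register file to capture, each word fetched from memory can feed a few hundred local operand references and the ALUs stay busy without a proportional increase in memory bandwidth.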

  9. Architecture of a Streaming Supercomputer

  10. Streaming processor

  11. A layered software system simplifies stream programming

  12. Domain-specific language example: Marble shader in RTSL

  // Sum four octaves of a noise texture into a turbulence value
  float turbulence4_imagine_scalar (texref noise, float4 pos) {
      fragment float4 addr1 = pos;
      fragment float4 addr2 = pos * {2, 2, 2, 1};
      fragment float4 addr3 = pos * {4, 4, 4, 1};
      fragment float4 addr4 = pos * {8, 8, 8, 1};
      fragment float val;
      val = (0.5) * texture(noise, addr1)[0];
      val = val + (0.25) * texture(noise, addr2)[0];
      val = val + (0.125) * texture(noise, addr3)[0];
      val = val + (0.0625) * texture(noise, addr4)[0];
      return val;
  }

  // Map a scalar onto an RGB marble color ramp
  float3 marble_color(float x) {
      float x2;
      x = sqrt(x + 1.0) * .7071;
      x2 = sqrt(x);
      return { .30 + .6*x2, .30 + .8*x, .60 + .4*x2 };
  }

  // Surface shader: add turbulence to the scaled object-space y coordinate,
  // take a sine, and use the resulting marble color to modulate the diffuse term
  surface shader float4 shiny_marble_imagine (texref noise) {
      float4 Cd = lightmodel_diffuse({ 0.4, 0.4, 0.4, 1 }, { 0.5, 0.5, 0.5, 1 });
      float4 Cs = lightmodel_specular({ 0.35, 0.35, 0.35, 1 }, Zero, 20);
      fragment float y;
      fragment float4 pos = Pobj * {10, 10, 10, 1};
      y = pos[1] + 3.0 * turbulence4_imagine_scalar(noise, pos);
      y = sin(y * pi);
      return ({marble_color(y), 1.0f} * Cd + Cs);
  }

  13. Lights, Normals, & Materials Shader Traverser Ray Gen Intersector Stream-level application descriptionexample: SHARP Raytracer • Computation expressed as streams of records passing through kernels • Similar to computation required for Monte-Carlo radiation transport Camera Grid Triangles Rays + Rays Hits VoxID Rays Rays Pixels

  14. Expected application performance
  • Arithmetic-limited applications
    • Includes applications where domain decomposition can be applied
      • Like TFLO and LES
    • Expected to achieve a large fraction of peak performance
  • Communication-limited applications
    • Such as applications requiring matrix solution Ax = b
    • At the very least will benefit from high global bandwidth
    • We hope to find new methods to solve matrix equations using streaming
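As one deliberately naive illustration of why Ax = b stresses communication: in a Jacobi-style sweep every row update is independent (data parallel), but each sweep reads the entire previous solution vector, so global bandwidth sets the pace. The sketch below is our own toy example, not one of the new streaming methods the slide hopes for, and it uses a tiny dense matrix only to stay short:

#include <vector>
#include <cstddef>

// One Jacobi sweep for Ax = b: x_i(k+1) = (b_i - sum_{j != i} A_ij * x_j(k)) / A_ii.
// The per-row work touches only row i of A, but every row needs all of x(k).
std::vector<double> jacobi_sweep(const std::vector<std::vector<double>>& A,
                                 const std::vector<double>& b,
                                 const std::vector<double>& x) {
    std::size_t n = b.size();
    std::vector<double> x_next(n);
    for (std::size_t i = 0; i < n; ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < n; ++j)
            if (j != i) sum += A[i][j] * x[j];
        x_next[i] = (b[i] - sum) / A[i][i];
    }
    return x_next;
}

int main() {
    // Tiny diagonally dominant system, just to exercise the sweep.
    std::vector<std::vector<double>> A = {{4, 1}, {2, 5}};
    std::vector<double> b = {1, 2}, x = {0, 0};
    for (int iter = 0; iter < 25; ++iter)
        x = jacobi_sweep(A, b, x);
    return x.size() == 2 ? 0 : 1;
}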

  15. Conclusion
  • Computation is cheap yet supercomputing is expensive
  • Streams enable supercomputing to exploit advantages of emerging technology
    • by exposing locality and concurrency
  • Order of magnitude cost/performance improvement for both arithmetic-limited and communication-limited codes
    • $20/GFLOPS and $2/M-GUPS
    • Scalable from desktop (1 TFLOPS) to machine room (1 PFLOPS)
  • A layered software system using domain-specific languages simplifies stream programming
    • MCRT, ODEs, PDEs
  • Early results on graphics and image processing are encouraging

  16. Plan for AY2001-2002

  17. Project Goals for Fall Quarter AY2001-2002
  • Map two applications to the stream model
    • Candidates: fluid flow (TFLO) and molecular dynamics
  • Define a high-level stream programming language
    • Generalize stream access without destroying locality
  • Draft strawman SSS architecture and identify key issues

  18. Meeting Schedule, Fall Quarter AY2001-2002
  Goal: shared knowledge base and vision across the project
  • 10/9 – TFLO (Juan)
  • 10/16 – RTSL (Bill M.)
  • 10/23 – Molecular Dynamics (Eric)
  • 10/30 – Imagine and its programming system (Ujval)
  • 11/6 – C*, ZPL, etc. + SPL brainstorming (Ian)
  • 11/13 – Metacompilation (Ben C.)
  • 11/20 – Application follow-up (Ron/Heinz)
  • 11/27 – Strawman architecture (Ben S.)
  • 12/4 – Streams vs. CMP (Blue Gene/Light, etc.) (Bill D.)
