Learning From the Stanford/DOE Visualization Cluster
Presentation Transcript

  1. Learning From the Stanford/DOE Visualization Cluster
  Mike Houston, Greg Humphreys, Randall Frank, Pat Hanrahan

  2. Outline
  • Stanford’s current cluster
    • Design decisions
    • Performance evaluation
    • Bottleneck evaluation
  • Cluster “Landscape”
    • General classification
    • Bottleneck evaluation
  • Stanford’s next cluster
    • Design goals
    • Research directions

  3. Stanford/DOE Visualization Cluster
  The Chromium Cluster

  4. Cluster Configuration (Jan. 2000)
  • Cluster: 32 graphics nodes + 4 server nodes
  • Computer: Compaq SP750
    • 2 processors (800 MHz PIII Xeon, 133 MHz FSB)
    • i840 core logic (big issue for vis clusters)
      • Simultaneous fast graphics and networking
    • Network: 64-bit, 66 MHz PCI
    • Graphics: AGP-4x
    • 256 MB memory
    • 18 GB SCSI 160 disk (+ 3 x 36 GB on servers)
  • Graphics (Sept. 2002)
    • 16 NVIDIA GeForce3 w/ DVI (64 MB)
    • 16 NVIDIA GeForce4 Ti 4200 w/ DVI (128 MB)
  • Network
    • Myrinet 64-bit, 66 MHz (LANai 7)

  5. Graphics Evaluation
  • NVIDIA GeForce3
    • 25 MTri/s triangle rate observed
    • 680 MPix/s fill rate observed
  • NVIDIA GeForce4
    • 60 MTri/s triangle rate observed
    • 800 MPix/s fill rate observed
  • Read Pixels performance (see the microbenchmark sketch after this slide)
    • 35 MPix/s (140 MB/s) RGBA
    • 22 MPix/s (87 MB/s) depth
  • Draw Pixels performance
    • 45 MPix/s (180 MB/s) RGBA
    • 21 MPix/s (85 MB/s) depth
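  The Read/Draw Pixels rates above come down to how fast glReadPixels and glDrawPixels move a full frame between the framebuffer and host memory. The slides do not say how the figures were taken; the following C/GLUT microbenchmark is only a sketch of that kind of measurement, with the window size, iteration count, and RGBA format chosen arbitrarily.

/*
 * Hypothetical readback/draw-pixels microbenchmark (GLUT).  Window size,
 * iteration count, and timing method are assumptions, not the original
 * harness used for the 140 MB/s and 180 MB/s numbers above.
 */
#include <GL/glut.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define W 1024
#define H 768
#define ITERS 100

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

static void display(void)
{
    unsigned char *buf = malloc((size_t)W * H * 4);   /* RGBA, 8 bits/channel */
    double t0, t1;
    int i;

    glClear(GL_COLOR_BUFFER_BIT);

    /* Time glReadPixels: framebuffer -> host memory */
    glFinish();
    t0 = now();
    for (i = 0; i < ITERS; i++)
        glReadPixels(0, 0, W, H, GL_RGBA, GL_UNSIGNED_BYTE, buf);
    glFinish();
    t1 = now();
    printf("read: %.1f MPix/s\n", ITERS * (double)W * H / (t1 - t0) / 1e6);

    /* Time glDrawPixels: host memory -> framebuffer */
    glRasterPos2f(-1.0f, -1.0f);                      /* draw from bottom-left */
    t0 = now();
    for (i = 0; i < ITERS; i++)
        glDrawPixels(W, H, GL_RGBA, GL_UNSIGNED_BYTE, buf);
    glFinish();
    t1 = now();
    printf("draw: %.1f MPix/s\n", ITERS * (double)W * H / (t1 - t0) / 1e6);

    free(buf);
    exit(0);
}

int main(int argc, char **argv)
{
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE);
    glutInitWindowSize(W, H);
    glutCreateWindow("readback benchmark");
    glutDisplayFunc(display);
    glutMainLoop();
    return 0;
}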

  6. Network Evaluation
  • Myrinet LANai 7 PCI64A boards
    • Theoretical limit: 160 MB/s
    • 142 MB/s observed peak under Linux
    • ~100 MB/s observed sustained under Linux (see the bandwidth-test sketch after this slide)
  • ServerNet not chosen
    • Driver support
    • Large switching infrastructure required
  • Gigabit Ethernet
    • Performance and scalability concerns
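  The peak-versus-sustained distinction above is what a simple point-to-point bandwidth test reports. The Stanford numbers were taken over Myrinet (whose GM API is not shown here); the sketch below is only a generic TCP-socket analogue of such a test, with the port, chunk size, and total transfer size chosen arbitrarily.

/*
 * Generic point-to-point bandwidth test over TCP sockets, offered only as a
 * sketch of how sustained-throughput numbers like those above can be taken.
 * It is not the Myrinet/GM path used on the Stanford cluster.
 *
 *   receiver:  ./bwtest server
 *   sender:    ./bwtest client <receiver-hostname>
 */
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#define PORT   5001
#define CHUNK  (1 << 20)          /* 1 MB per send/recv   */
#define TOTAL  (512LL << 20)      /* move 512 MB in total */

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(int argc, char **argv)
{
    char *buf = malloc(CHUNK);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(PORT);

    if (argc >= 2 && strcmp(argv[1], "server") == 0) {
        int lsock = socket(AF_INET, SOCK_STREAM, 0), sock;
        addr.sin_addr.s_addr = INADDR_ANY;
        bind(lsock, (struct sockaddr *)&addr, sizeof addr);
        listen(lsock, 1);
        sock = accept(lsock, NULL, NULL);
        while (read(sock, buf, CHUNK) > 0)   /* sink everything sent to us */
            ;
        write(sock, "k", 1);                 /* ack: all data consumed     */
        close(sock);
    } else if (argc >= 3 && strcmp(argv[1], "client") == 0) {
        int sock = socket(AF_INET, SOCK_STREAM, 0);
        struct hostent *h = gethostbyname(argv[2]);
        long long sent = 0;
        double t0, t1;
        memcpy(&addr.sin_addr, h->h_addr_list[0], h->h_length);
        connect(sock, (struct sockaddr *)&addr, sizeof addr);
        t0 = now();
        while (sent < TOTAL) {
            ssize_t n = write(sock, buf, CHUNK);
            if (n <= 0) break;
            sent += n;
        }
        shutdown(sock, SHUT_WR);             /* tell the server we are done */
        read(sock, buf, 1);                  /* wait until it has read all  */
        t1 = now();
        close(sock);
        printf("%.1f MB/s sustained\n", sent / (t1 - t0) / 1e6);
    } else {
        fprintf(stderr, "usage: %s server | client <host>\n", argv[0]);
        return 1;
    }
    free(buf);
    return 0;
}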

  7. Myrinet Issues
  • Fairness: Clients starved of network resources
    • Implemented a credit scheme to minimize congestion (see the flow-control sketch after this slide)
  • Lack of buffering in switching fabric
    • Causes poor performance in high-load conditions
    • Open issue
  [Figures: partitioned cluster vs. unpartitioned cluster]
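  The credit scheme mentioned above bounds how much data any one client may have in flight toward a server, so a single busy client cannot starve the others. The sketch below is a single-process toy illustration of credit-based flow control, not the actual Chromium/Myrinet code; the credit count and message count are made up.

/*
 * Toy illustration of credit-based flow control.  In the real system the
 * credits travel back to the client as small control messages; here the
 * "client" and "server" are just two halves of one loop.
 */
#include <stdio.h>

#define CREDITS   4      /* max packets a client may have in flight   */
#define MESSAGES  10     /* packets the client wants to send in total */

int main(void)
{
    int credits = CREDITS;   /* granted to the client by the server up front */
    int sent = 0, consumed = 0;

    while (consumed < MESSAGES) {
        /* Client side: only send while holding credits, so a slow server
         * (or a congested, bufferless switch) never piles up unbounded data. */
        while (sent < MESSAGES && credits > 0) {
            credits--;
            sent++;
            printf("client: sent packet %d (credits left: %d)\n", sent, credits);
        }

        /* Server side: consume one packet, then return a credit so the
         * client may send again. */
        consumed++;
        credits++;
        printf("server: consumed packet %d, returned 1 credit\n", consumed);
    }
    return 0;
}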

  8. i840 Chipset Evaluation
  • 66 MHz, 64-bit PCI performance not full speed:
    • 210 MB/s PCI read (40% of theoretical peak)
    • 288 MB/s PCI write (54% of theoretical peak)
    • Combined read/write: ~121 MB/s
  • AGP
    • Fast Writes / Side Band Addressing unstable under Linux

  9. Sort-First Performance
  • Configuration
    • Application runs on the client
    • Primitives distributed to the servers (see the tile-sorting sketch after this slide)
  • Tiled display
    • 4x3 tiles @ 1024x768 each
    • Total resolution: 4096x2304 (9 megapixels)
  • Quake 3: 50 fps
  • Atlantis: 450 fps
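  In sort-first rendering, "distributing primitives" means deciding, per primitive, which tile servers must receive it, typically from its screen-space bounding box. The sketch below illustrates that sorting step for the 4x3 grid of 1024x768 tiles above; the triangle coordinates and the send_to_server() stub are hypothetical stand-ins for Chromium's actual packing and network path.

/*
 * Sketch of the "sorting" step in sort-first rendering: a primitive's
 * screen-space bounding box decides which tile server(s) it is sent to.
 */
#include <stdio.h>

#define TILE_W   1024
#define TILE_H   768
#define TILES_X  4
#define TILES_Y  3

typedef struct { float x, y; } Vec2;

/* Stand-in for packing the primitive and shipping it over the network. */
static void send_to_server(int tile_x, int tile_y, const Vec2 tri[3])
{
    printf("triangle -> server for tile (%d,%d)\n", tile_x, tile_y);
    (void)tri;
}

static void sort_triangle(const Vec2 tri[3])
{
    /* Screen-space bounding box of the triangle. */
    float minx = tri[0].x, maxx = tri[0].x;
    float miny = tri[0].y, maxy = tri[0].y;
    int i, tx, ty;
    for (i = 1; i < 3; i++) {
        if (tri[i].x < minx) minx = tri[i].x;
        if (tri[i].x > maxx) maxx = tri[i].x;
        if (tri[i].y < miny) miny = tri[i].y;
        if (tri[i].y > maxy) maxy = tri[i].y;
    }

    /* Send the triangle to every tile its bounding box overlaps. */
    for (ty = 0; ty < TILES_Y; ty++)
        for (tx = 0; tx < TILES_X; tx++) {
            float x0 = tx * TILE_W, y0 = ty * TILE_H;
            if (maxx >= x0 && minx < x0 + TILE_W &&
                maxy >= y0 && miny < y0 + TILE_H)
                send_to_server(tx, ty, tri);
        }
}

int main(void)
{
    /* A triangle straddling the boundary between tiles (0,0) and (1,0). */
    Vec2 tri[3] = { {900, 100}, {1200, 150}, {1000, 400} };
    sort_triangle(tri);
    return 0;
}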

  10. Sort-Last Performance
  • Configuration
    • Parallel rendering on multiple nodes
    • Composite to the final display node (see the compositing sketch after this slide)
  • Volume rendering on 16 nodes
    • 1.57 GVox/s [Humphreys 02]
    • 1.82 GVox/s (tuned), 9/02
    • 256x256x1024 volume [1] rendered twice
  [1] Data courtesy of G.A. Johnson, G.P. Cofer, S.L. Gewalt, and L.W. Hedlund of the Duke Center for In Vivo Microscopy (an NIH/NCRR National Resource)
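  In sort-last rendering, each node renders its share of the data and the partial frames are merged at the display node. The sketch below shows the classic depth-compare composite used for opaque geometry; the volume-rendering numbers above instead require an ordered alpha blend, and the buffer size, node count, and fake buffer contents here are arbitrary.

/*
 * Sketch of the sort-last composite step for opaque geometry: the display
 * node merges color+depth buffers read back from each render node, keeping
 * the nearest fragment per pixel.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define W 64
#define H 64
#define NODES 4

typedef struct {
    uint32_t color[W * H];   /* packed RGBA, as returned by glReadPixels */
    float    depth[W * H];   /* depth buffer, 0.0 = near, 1.0 = far      */
} Framebuffer;

/* Merge 'src' into 'dst', keeping whichever fragment is closer per pixel. */
static void composite_depth(Framebuffer *dst, const Framebuffer *src)
{
    int i;
    for (i = 0; i < W * H; i++) {
        if (src->depth[i] < dst->depth[i]) {
            dst->depth[i] = src->depth[i];
            dst->color[i] = src->color[i];
        }
    }
}

int main(void)
{
    Framebuffer *final   = malloc(sizeof *final);
    Framebuffer *partial = malloc(sizeof *partial);
    int n, i;

    /* Start from an empty frame: background color, maximum depth. */
    memset(final->color, 0, sizeof final->color);
    for (i = 0; i < W * H; i++)
        final->depth[i] = 1.0f;

    for (n = 0; n < NODES; n++) {
        /* In the real system this buffer arrives over the network after a
         * glReadPixels on render node n; here we just fake some contents. */
        for (i = 0; i < W * H; i++) {
            partial->color[i] = 0xff000000u | (uint32_t)(n * 60 + 40);
            partial->depth[i] = (float)((i + n) % 100) / 100.0f;
        }
        composite_depth(final, partial);
    }

    printf("composited %d partial frames into the final image\n", NODES);
    free(final);
    free(partial);
    return 0;
}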

  11. Cluster Accomplishments
  • Development platform
    • WireGL
    • Chromium
    • Cluster configuration replicated
  • Interactive performance
    • 256x512x1024 volume @ 15 fps
    • 9-megapixel Quake 3 @ 50 fps

  12. Sources of Bottlenecks
  • Sort-First
    • Packing speed (processor)
    • Primitive distribution (network and bus)
    • Rendering (processor and graphics chip)
  • Sort-Last
    • Rendering (graphics chip)
    • Composite (network, bus, and read/draw pixels)

  13. Bottleneck Evaluation – Stanford
  • Sort-First: Processor and Network
  • Sort-Last: Network and Read/Draw

  14. The Landscape of Graphics Clusters
  • Many options
    • Low end: <$2500/node
    • Mid end: ~$5000/node
    • High end: >$7500/node
  • Tradeoffs
    • Different bottlenecks
    • Price/performance
    • Scalability
    • Usage
  • Evaluation
    • Based on published benchmarks and specs

  15. Cluster Interconnect Options
  • Many choices
    • GigE: ~100 MB/s
    • Myrinet 2000 (http://www.myrinet.com): 245 MB/s
    • SCI/Dolphin (http://www.dolphinics.com): 326 MB/s
    • Quadrics (http://www.quadrics.com): 340 MB/s
  • Future options
    • 10 GigE
    • InfiniBand
    • HyperTransport

  16. Low End
  • General definition
    • Single CPU
    • Consumer mainboard
    • Integrated graphics
    • High-speed commodity network
  • Example node configuration
    • NVIDIA nForce2
    • AMD Athlon 2400+
    • 512 MB DDR
    • GigE and 10/100
    • 1U rack chassis
  • Estimated price: $1500

  17. Bottleneck Evaluation – Low End
  • Bus/Network limited

  18. Mid End
  • General definition
    • Dual processor
    • “Workstation” mainboard
    • High-performance bus (64-bit PCI or PCI-X)
    • High-speed commodity / low-end cluster interconnect
    • High-end consumer graphics board
  • Example node configuration
    • Intel i860
    • Dual Intel P4 Xeon 2.4 GHz
    • 2 GB RDRAM
    • ATI Radeon 9700
    • GigE onboard + Myrinet 2000
    • 2U rack chassis
  • Estimated price: $4000

  19. Bottleneck Evaluation – Mid End
  • Sort-First: Network limited
  • Sort-Last: Read/Draw and Network limited

  20. High End
  • General definition
    • Dual or quad processor
    • Cutting-edge bus (PCI-X, HyperTransport, PCI Enhanced)
    • High-speed commodity / high-end cluster interconnect
    • “Professional” graphics board
    • RAID system
  • Example node configuration
    • ServerWorks GC-WS
    • Dual P4 Xeon 2.6 GHz
    • NVIDIA Quadro4 900 XGL
    • 4 GB DDR
    • GigE onboard + InfiniBand
  • Estimated price: $7500

  21. Bottleneck Evaluation – High End
  • Sort-First: Well balanced
  • Sort-Last: Read/Draw limited

  22. Balanced System is Key
  • Only as fast as the slowest component
  • Spend money where it matters!

  23. Goals for Next Cluster
  • Performance
    • Sort-Last: 5 GVox/s, 1 GTri/s
    • Sort-First at 4096x2304: Quake 3 @ >100 fps
  • Research
    • Remote visualization
    • Time-varying datasets
    • Compositing

  24. What we plan to build
  • 16-node cluster, 1U nodes
  • Mainboard chipsets
    • Intel Placer
    • ServerWorks GC-WS
    • AMD Hammer
  • Memory: 2-4 GB
  • Graphics chip
    • NVIDIA NV30
    • ATI R300/350
  • Interconnect: InfiniBand, Quadrics
  • Disk: IDE RAID or SCSI

  25. Continuing Chipset Issues
  • Why do chipsets perform so poorly?
  • “Workstation”
    • Intel i860
      • 215 MB/s read (40% of theoretical)
      • 300 MB/s write (56% of theoretical)
    • AMD 760MPX
      • 300 MB/s read (56% of theoretical)
      • 312 MB/s write (59% of theoretical)
  • “Server”
    • ServerWorks ServerSet III LE
      • 423 MB/s read (79% of theoretical)
      • 486 MB/s write (91% of theoretical)
  • Why can’t a “server” have an AGP slot?
  Performance numbers from http://www.conservativecomputer.com

  26. Ongoing Bottlenecks
  • Readback performance
    • Will be fixed “soon”
    • Hardware compositing?
  • Chipset performance
    • Achieves only a fraction of theoretical bandwidth
    • Need faster busses in commodity chipsets
  • Network performance
    • Scalability
    • Fast is VERY expensive

  27. Conclusions
  • What we still need
    • More vendors
    • More chipsets
    • More performance
  • Graphics clusters are getting better
    • Chipsets
    • Interconnects
    • Form factor
    • Processing
    • Graphics chips
  • Things are really starting to get interesting!