Learning From the Stanford/DOE Visualization Cluster


Mike Houston, Greg Humphreys, Randall Frank, Pat Hanrahan

Outline
  • Stanford’s current cluster
    • Design decisions
    • Performance evaluation
    • Bottleneck evaluation
  • Cluster “Landscape”
    • General classification
    • Bottleneck evaluation
  • Stanford’s next cluster
    • Design goals
    • Research directions
Cluster Configuration (Jan. 2000)
  • Cluster: 32 graphics nodes + 4 server nodes
  • Computer: Compaq SP750
    • 2 processors (800 MHz PIII Xeon, 133MHz FSB)
    • i840 core logic (big issue for vis-clusters)
      • Simultaneous fast graphics and networking
      • Network: 64-bit, 66 MHz PCI
      • Graphics: AGP-4x
    • 256 MB memory
    • 18GB SCSI 160 disk (+ 3*36GB on servers)
  • Graphics (Sept. 2002)
    • 16 NVIDIA GeForce3 w/ DVI (64 MB)
    • 16 NVIDIA GeForce4 Ti 4200 w/ DVI (128 MB)
  • Network
    • Myrinet 64-bit, 66 MHz (LANai 7)
Graphics Evaluation
  • NVIDIA GeForce3
    • 25 MTri/s triangle rate observed
    • 680 MPix/s fill rate observed
  • NVIDIA GeForce4
    • 60 MTri/s triangle rate observed
    • 800 MPix/s fill rate observed
  • Read Pixels performance
    • 35 MPix/s (140 MB/s) RGBA
    • 22 MPix/s (87 MB/s) Depth
  • Draw Pixels performance
    • 45 MPix/s (180 MB/s) RGBA
    • 21 MPix/s (85 MB/s) Depth
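The MB/s figures in parentheses follow from the pixel rates if each pixel occupies 4 bytes. That pixel size is an assumption (8-bit RGBA, or a 32-bit depth value); the slide does not state it, and the small differences for depth (87 vs. 88, 85 vs. 84) are presumably rounding in the measurements. A minimal sketch of the conversion, plus the upper bound it implies on whole-frame transfers for one 1024x768 tile:

```python
# Assumption: 4 bytes per pixel (8-bit RGBA, or a 32-bit depth value).
BYTES_PER_PIXEL = 4

def bandwidth_mb_s(mpix_per_s, bytes_per_pixel=BYTES_PER_PIXEL):
    """Convert a pixel rate in MPix/s to a transfer rate in MB/s."""
    return mpix_per_s * bytes_per_pixel

def full_frames_per_s(mpix_per_s, width=1024, height=768):
    """Upper bound on whole-frame transfers per second at a given pixel rate."""
    return mpix_per_s * 1e6 / (width * height)

# Observed rates from the slide, in MPix/s.
observed = {
    "ReadPixels RGBA": 35,
    "ReadPixels depth": 22,
    "DrawPixels RGBA": 45,
    "DrawPixels depth": 21,
}

for name, rate in observed.items():
    print(f"{name}: {bandwidth_mb_s(rate)} MB/s, "
          f"{full_frames_per_s(rate):.1f} frames/s at 1024x768")
```

Under these assumptions, RGBA readback alone caps a single node at roughly 44 full tiles per second, before any network cost is counted.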
Network Evaluation
  • Myrinet LANai 7 PCI64A boards
    • Theoretical Limit: 160 MB/s
    • 142 MB/s observed peak under Linux
    • ~100 MB/s observed sustained under Linux
  • ServerNet not chosen
    • Driver support
    • Large switching infrastructure required
  • Gigabit Ethernet
    • Performance and scalability concerns
Myrinet Issues
  • Fairness: Clients starved of network resources
    • Implemented credit scheme to minimize congestion
  • Lack of buffering in switching fabric
    • Causes poor performance in high load conditions
    • Open issue
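The slide names a credit scheme but gives no details, so the following is only a generic sketch of credit-based flow control, not the cluster's actual implementation. The class and method names are hypothetical. The idea: a sender may transmit only while it holds credits, each credit stands for one free receive buffer, and the receiver hands a credit back whenever it frees a buffer.

```python
from collections import deque

class CreditedLink:
    """Toy sender/receiver pair with credit-based flow control (hypothetical)."""

    def __init__(self, credits):
        self.credits = credits     # receive buffers the sender may still fill
        self.in_flight = deque()   # messages occupying receiver buffers
        self.pending = deque()     # messages held back until a credit arrives

    def send(self, msg):
        """Queue msg; return True if anything was transmitted right away."""
        self.pending.append(msg)
        return self._drain()

    def _drain(self):
        """Transmit pending messages while credits remain."""
        sent = False
        while self.pending and self.credits > 0:
            self.in_flight.append(self.pending.popleft())
            self.credits -= 1
            sent = True
        return sent

    def receiver_consume(self):
        """Receiver frees one buffer and returns a credit to the sender."""
        if not self.in_flight:
            return None
        msg = self.in_flight.popleft()
        self.credits += 1
        self._drain()
        return msg

link = CreditedLink(credits=2)
accepted = [link.send(i) for i in range(4)]
print(accepted, list(link.in_flight), list(link.pending))
```

With two credits, the third and fourth messages wait at the sender instead of piling up in the switching fabric, which is the congestion the scheme is meant to limit.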

[Figure: partitioned cluster vs. unpartitioned cluster]

i840 Chipset Evaluation
  • 66 MHz, 64-bit PCI does not reach full speed:
    • 210 MB/s PCI read (40% of theoretical peak)
    • 288 MB/s PCI write (54% of theoretical peak)
    • Combined read/write ~121 MB/s
  • AGP
    • Fast Writes / Side Band Addressing unstable under Linux
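The percentages above can be checked against the theoretical peak of a 64-bit, 66 MHz PCI bus, which is one 8-byte transfer per clock. The sketch below assumes the nominal clock of about 66.67 MHz, giving a peak near 533 MB/s; the function names are illustrative. With that peak the read figure comes out near 39%, which the slide rounds to 40%.

```python
# Assumption: the nominal "66 MHz" PCI clock is 200/3 MHz (about 66.67 MHz).
PCI_CLOCK_MHZ = 200 / 3

def pci_peak_mb_s(bus_bits=64, clock_mhz=PCI_CLOCK_MHZ):
    """Theoretical peak in MB/s: one bus-width transfer per clock cycle."""
    return bus_bits / 8 * clock_mhz

def percent_of_peak(observed_mb_s, peak_mb_s):
    """Observed throughput as a percentage of the theoretical peak."""
    return 100.0 * observed_mb_s / peak_mb_s

peak = pci_peak_mb_s()
print(f"peak: {peak:.0f} MB/s")
for label, mb_s in [("read", 210), ("write", 288), ("combined read/write", 121)]:
    print(f"{label}: {mb_s} MB/s = {percent_of_peak(mb_s, peak):.0f}% of peak")
```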
Sort-First Performance
  • Configuration
    • Application runs on client
    • Primitives distributed to servers
  • Tiled Display
    • 4x3 @ 1024x768
    • Total resolution: 4096x2304 (9 Megapixel)
  • Quake 3
    • 50 fps
  • Atlantis
    • 450 fps
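In sort-first rendering, each primitive has to reach every server whose tile it touches. The slides do not describe how the sorting is done, so the following is a generic sketch rather than the WireGL or Chromium implementation: it maps a primitive's screen-space bounding box to tile indices on the 4x3 wall of 1024x768 tiles. The function name and the half-open integer pixel box are assumptions.

```python
TILE_W, TILE_H = 1024, 768   # per-tile resolution from the slide
COLS, ROWS = 4, 3            # 4x3 tiled display from the slide

def overlapped_tiles(xmin, ymin, xmax, ymax):
    """Return (col, row) for every tile touched by a screen-space bounding box.

    Coordinates are integer pixels on the full wall. The box is half-open:
    it covers xmin <= x < xmax and ymin <= y < ymax.
    """
    # Clamp the box to the wall.
    x0, y0 = max(xmin, 0), max(ymin, 0)
    x1, y1 = min(xmax, TILE_W * COLS), min(ymax, TILE_H * ROWS)
    if x0 >= x1 or y0 >= y1:
        return []  # empty box, or entirely off the wall
    first_col, last_col = x0 // TILE_W, (x1 - 1) // TILE_W
    first_row, last_row = y0 // TILE_H, (y1 - 1) // TILE_H
    return [(c, r)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]

print(TILE_W * COLS, TILE_H * ROWS)             # full wall resolution
print(overlapped_tiles(1000, 700, 1100, 800))   # box straddling a tile corner
```

A box that straddles a tile corner is sent to four servers, which is one reason primitive distribution loads the network in this configuration.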
Sort-Last Performance
  • Configuration
    • Parallel rendering on multiple nodes
    • Composite to final display node
  • Volume Rendering on 16 nodes
    • 1.57 GVox/s [Humphreys 02]
    • 1.82 GVox/s (tuned) 9/02
    • 256x256x1024 volume¹, rendered twice

¹ Data courtesy of G. A. Johnson, G. P. Cofer, S. L. Gewalt, and L. W. Hedlund from the Duke Center for In Vivo Microscopy (an NIH/NCRR National Resource)
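The composite step in sort-last rendering merges the partial images produced by each node. As an illustration only, the sketch below uses a per-pixel depth test, the simplest form of sort-last compositing for opaque geometry. It is an assumption that this resembles the cluster's pipeline; volume rendering would more typically blend partial images in visibility order instead. Images are plain Python lists here, and all names are hypothetical.

```python
def depth_composite(color_a, depth_a, color_b, depth_b):
    """Per pixel, keep the fragment nearer to the viewer (smaller depth)."""
    out_color, out_depth = [], []
    for ca, da, cb, db in zip(color_a, depth_a, color_b, depth_b):
        if da <= db:
            out_color.append(ca)
            out_depth.append(da)
        else:
            out_color.append(cb)
            out_depth.append(db)
    return out_color, out_depth

def composite_all(layers):
    """Fold the (color, depth) images from all render nodes into one image."""
    color, depth = layers[0]
    for next_color, next_depth in layers[1:]:
        color, depth = depth_composite(color, depth, next_color, next_depth)
    return color, depth

# Two three-pixel "images"; depth 1.0 marks the far plane (background).
node0 = (["red", "bg", "bg"], [0.2, 1.0, 1.0])
node1 = (["blue", "blue", "bg"], [0.5, 0.4, 1.0])
print(composite_all([node0, node1]))
```

Each merge needs both color and depth from every node, so the composite cost is tied to the read/draw pixel rates and the network bandwidth.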

Cluster Accomplishments
  • Development Platform
    • WireGL
    • Chromium
  • Cluster configuration replicated
  • Interactive Performance
    • 256x512x1024 volume @ 15fps
    • 9 Megapixel Quake3 @ 50fps
Sources of Bottlenecks
  • Sort-First
    • Packing speed (processor)
    • Primitive distribution (network and bus)
    • Rendering (processor and graphics chip)
  • Sort-Last
    • Rendering (graphics chip)
    • Composite (network, bus, and read/draw pixels)
Bottleneck Evaluation – Stanford
  • Sort-First: Processor and Network
  • Sort-Last: Network and Read/Draw
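A rough way to see the sort-last result is to treat compositing as a pipeline whose throughput is set by its slowest stage. The sketch below plugs in the rates measured earlier in the talk and assumes 4 bytes per pixel to convert MB/s to MPix/s. The pipeline model itself is a simplification introduced here (it ignores depth traffic, overlap between stages, and contention), so the output is indicative only.

```python
BYTES_PER_PIXEL = 4  # assumption: 8-bit RGBA

def bottleneck(stage_rates):
    """Return (stage name, rate) of the slowest stage in a pipeline."""
    name = min(stage_rates, key=stage_rates.get)
    return name, stage_rates[name]

# Stage throughputs in MPix/s, derived from figures quoted in the slides.
stanford_sort_last = {
    "read pixels (RGBA)": 35.0,
    "PCI combined read/write": 121 / BYTES_PER_PIXEL,
    "Myrinet sustained": 100 / BYTES_PER_PIXEL,
    "draw pixels (RGBA)": 45.0,
}

print(bottleneck(stanford_sort_last))
```

Under these assumptions the sustained network rate (about 25 MPix/s) is the tightest stage, with the bus and pixel readback close behind.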
The Landscape of Graphics Clusters
  • Many Options
    • Low End <$2500/node
    • Mid End ~$5000/node
    • High End >$7500/node
  • Tradeoffs
    • Different bottlenecks
    • Price/Performance
    • Scalability
    • Usage
  • Evaluation
    • Based on published benchmarks and specs
Cluster Interconnect Options
  • Many choices
    • GigE
      • ~100 MB/s
    • Myrinet 2000 (http://www.myrinet.com)
      • 245 MB/s
    • SCI/Dolphin (http://www.dolphinics.com)
      • 326 MB/s
    • Quadrics (http://www.quadrics.com)
      • 340 MB/s
  • Future options
    • 10 GigE
    • Infiniband
    • HyperTransport
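To put these bandwidths in rendering terms, the sketch below estimates how many uncompressed 1024x768 RGBA frames per second each interconnect could carry. It assumes 4 bytes per pixel, treats 1 MB as 10^6 bytes, and ignores protocol overhead and contention, so the results are optimistic upper bounds rather than measurements.

```python
# Bandwidth figures from the slide, in MB/s.
INTERCONNECT_MB_S = {
    "GigE": 100,
    "Myrinet 2000": 245,
    "SCI/Dolphin": 326,
    "Quadrics": 340,
}

def frames_per_second(bandwidth_mb_s, width=1024, height=768, bytes_per_pixel=4):
    """Upper bound on uncompressed frames per second over one link."""
    frame_mb = width * height * bytes_per_pixel / 1e6
    return bandwidth_mb_s / frame_mb

for name, mb_s in INTERCONNECT_MB_S.items():
    print(f"{name}: {frames_per_second(mb_s):.1f} frames/s")
```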
Low End
  • General Definition
    • Single CPU
    • Consumer Mainboard
    • Integrated Graphics
    • High Speed commodity network
  • Example Node Configuration
    • Nvidia NForce2
    • AMD Athlon 2400+
    • 512 MB DDR
    • GigE and 10/100
    • 1U rack chassis
    • Estimated Price: $1500
Mid End
  • General Definition
    • Dual Processor
    • “Workstation” mainboard
    • High performance bus
      • 64-bit PCI or PCI-X
    • High Speed Commodity / Low end cluster interconnect
    • High-End consumer graphics board
  • Example Node Configuration
    • Intel i860
    • Dual Intel P4 Xeon 2.4GHz
    • 2GB RDRAM
    • ATI Radeon 9700
    • GigE onboard + Myrinet 2000
    • 2U rack chassis
    • Estimated Price: $4000
Bottleneck Evaluation – Mid End
  • Sort-First: Network limited
  • Sort-Last: Read/Draw and Network limited
High End
  • General Definition
    • Dual or Quad processor
    • Cutting edge bus
      • PCI-X, HyperTransport, PCI Enhanced
    • High Speed Commodity/ High end cluster interconnect
    • “Professional” graphics board
    • RAID system
  • Example Node Configuration
    • ServerWorks GC-WS
    • Dual P4 Xeon 2.6GHz
    • Nvidia Quadro4 900XGL
    • 4GB DDR
    • GigE onboard + Infiniband
    • Estimated Price: $7500
Bottleneck Evaluation – High End
  • Sort-First: Well balanced
  • Sort-Last: Read/Draw limited
Balanced System is Key
  • Only as fast as slowest component
    • Spend money where it matters!
Goals for Next Cluster
  • Performance
    • Sort-Last
      • 5 GVox/s
      • 1 GTri/s
    • Sort-First at 4096x2304
      • Quake3 @ >100fps
  • Research
    • Remote visualization
    • Time-varying datasets
    • Compositing
What we plan to build
  • 16 Node cluster, 1U nodes
  • Mainboard chipsets
    • Intel Placer
    • ServerWorks GC-WS
    • AMD Hammer
  • Memory
    • 2-4GB
  • Graphics Chip
    • Nvidia NV30
    • ATI R300/350
  • Interconnect
    • Infiniband, Quadrics
  • Disk
    • IDE RAID or SCSI
Continuing Chipset Issues
  • Why do chipsets perform so poorly?
    • “Workstation”
      • Intel i860
        • 215 MB/s read (40% of theoretical)
        • 300 MB/s write (56% of theoretical)
      • AMD 760MPX
        • 300 MB/s read (56% of theoretical)
        • 312 MB/s write (59% of theoretical)
    • “Server”
      • ServerWorks ServerSet III LE
        • 423 MB/s read (79% of theoretical)
        • 486 MB/s write (91% of theoretical)
  • Why can’t a “server” have an AGP slot?

Performance numbers from http://www.conservativecomputer.com

Ongoing Bottlenecks
  • Readback performance
    • Will be fixed “soon”
    • Hardware compositing?
  • Chipset Performance
    • Chipsets achieve only a fraction of theoretical peak
    • Need faster busses in commodity chipsets
  • Network Performance
    • Scalability
    • Fast is VERY expensive
Conclusions
  • What we still need
    • More vendors
    • More chipsets
    • More performance
  • Graphics Clusters are getting better
    • Chipsets
    • Interconnects
    • Form factor
    • Processing
    • Graphics Chips
  • Things are really starting to get interesting!