
Scalable Cluster Interconnect Overview & Technology Roadmap

Patrick Geoffray, Opinionated Software Developer (patrick@myri.com)

June 2002 Top500 List
  • Clusters are the fastest growing category of supercomputers in the TOP500 List. (But why don't the TOP500 authors classify many other systems, such as the IBM SP, as clusters?)
    • 80 clusters (16%) in the June 2002 list
    • 43 clusters (8.6%) in the November 2001 list
    • 33 clusters (6.6%) in the June 2001 list
  • The June 2002 list includes 21 Myrinet clusters, led by the University of Heidelberg HELICS cluster, ranked #35.
    • http://helics.iwr.uni-heidelberg.de/gallery/index.html
  • Additionally, 145 HP Superdome/Hyperplex systems (classified as MPPs) use Myrinet for communication between the Superdome SMP hosts.
  • One-third of the supercomputers in the June 2002 TOP500 list use Myrinet technology!
The new 512-host dual-Xeon cluster at LSU

Integrator: Atipa
HPL benchmark: 2.1 Tflops

Myrinet-2000 Links, 2+2 Gbit/s, full duplex

Note: The signaling rate on these links is 2.5 GBaud, which, after 8b/10b encoding, yields a data rate of 2 Gbit/s.

Advantages of fiber: small-diameter, lightweight, flexible cables; reliability; EMC; 200m length; connector size. (See http://www.myri.com/news/01723/)
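As a quick check on these numbers (an illustrative sketch, not Myricom code), the arithmetic below derives the 2 Gbit/s data rate from the 2.5 GBaud signaling rate and the 8b/10b overhead, and the 250+250 MB/s per-link figure used for the two-port PCI-X interfaces later in this deck:

    #include <stdio.h>

    int main(void)
    {
        /* Myrinet-2000 link: 2.5 GBaud per direction, 8b/10b encoded. */
        double baud_per_dir  = 2.5e9;                       /* symbols/s, each direction */
        double data_bits     = baud_per_dir * 8.0 / 10.0;   /* 2e9 bit/s = 2 Gbit/s      */
        double bytes_per_dir = data_bits / 8.0;             /* 250e6 B/s = 250 MB/s      */

        printf("per direction: %.1f Gbit/s = %.0f MB/s\n",
               data_bits / 1e9, bytes_per_dir / 1e6);
        printf("full duplex  : %.0f+%.0f MB/s per link\n",
               bytes_per_dir / 1e6, bytes_per_dir / 1e6);
        /* Two links on a dual-port NIC: 2 x (250+250) MB/s = 1000 MB/s,
           roughly matching a 64-bit/133 MHz PCI-X host interface. */
        printf("two links    : %.0f MB/s aggregate\n", 4.0 * bytes_per_dir / 1e6);
        return 0;
    }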

Links: Changes Planned
  • June 2002 - first chips (Lanai XM) with multi-protocol (X) ports
    • A multi-protocol (X) port can act as a Myrinet port, long-range-Myrinet port, GbE port, or InfiniBand port.
    • Interoperability between Myrinet, GbE, & InfiniBand.
  • February 2003 - PCI-X interfaces with two ports
    • A good alternative to introducing faster links.
    • 2 x (250+250) MB/s = 1 GB/s, a good match to 1 GB/s PCI-X.
    • Operating the two links as one is enabled by dispersive routing in GM 2.
  • Early 2003 - SerDes integrated into Myricom custom-VLSI chips
    • To be used initially in a SAN-Serial chip, then in a 32-port Xbar-switch chip.
    • 2+2 Gb/s data rate, 2.5+2.5 GBaud (8b/10b encoded) links are also used as the base PHY by PCI Express. Myricom plans to support PCI Express -- initially 4x -- as soon as PCI Express hosts become available.
  • 2004 - “4x” Myrinet (multi-protocol) links
    • Most product volume is expected to continue with “1x” links through 2006.
In progress: Multi-protocol switch line card

Firmware-development prototype (M3-SW16-2E2X4F) of a switch line card with 2 GbE ports, 2 “X” ports over 850nm fiber, and 4 Myrinet ports.

Physical-level and protocol conversion between the Myrinet switch on the line card and the GbE or “X” ports on the front panel is performed by a Lanai XM chip on each of these ports. The protocol-conversion firmware – e.g., between native GbE and Ethernet over Myrinet – for each Lanai XM is loaded during line-card initialization. The Lanai XM is the first chip in the Lanai X (10) family.
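As a rough, hypothetical illustration of what such protocol-conversion firmware does conceptually (the actual Myrinet packet layout, type values, and firmware structure are not shown on this slide), the sketch below wraps a received GbE frame in a Myrinet source route plus a packet-type tag before forwarding it into the fabric:

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical encapsulation: a Myrinet source route, a packet-type tag,
       then the unmodified Ethernet frame.  Names and values are illustrative. */
    #define MAX_ROUTE_BYTES   8
    #define PKT_TYPE_ETHERNET 0x0009   /* illustrative value only */

    struct myrinet_encap {
        uint8_t  route[MAX_ROUTE_BYTES]; /* per-hop crossbar output selectors  */
        uint8_t  route_len;
        uint16_t type;                   /* which protocol the payload carries */
    };

    /* Convert one native GbE frame into an Ethernet-over-Myrinet packet.
       Returns the total encapsulated length written to 'out'. */
    size_t encapsulate_gbe_frame(const uint8_t *frame, size_t frame_len,
                                 const struct myrinet_encap *hdr, uint8_t *out)
    {
        size_t off = 0;
        memcpy(out + off, hdr->route, hdr->route_len);   /* source route first   */
        off += hdr->route_len;
        out[off++] = (uint8_t)(hdr->type >> 8);          /* packet-type tag      */
        out[off++] = (uint8_t)(hdr->type & 0xff);
        memcpy(out + off, frame, frame_len);             /* frame passed through */
        return off + frame_len;
    }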

LANai X (10) as a protocol converter

[Block diagram: LANai XM as a protocol converter. A SerDes or GbE PHY connects the X network interface to the line-card front-panel port; a SAN network interface connects to a port of the line card's XBar16. Both interfaces have send/receive DMA engines sharing the L-bus memory interface, x72b SRAM, and a 225 MHz RISC with its memory. Control & memory initialization is reachable from the line-card µC over JTAG. Port modes, under program control: Myrinet, long-range fiber, InfiniBand, or GbE. This circuitry is repeated for each line-card port.]

Switches: Changes Planned
  • Switches with a mix of Myrinet, long-range-Myrinet, GbE, and (possibly) InfiniBand ports (starting 4Q02).
    • High-degree switches with GbE ports may find a market for “Beowulf” clusters that use next-generation hosts with GbE on the motherboard.
  • The use of dispersive routing (GM 2) allows better utilization of Myrinet Clos networks, and high availability (HA) at a finer time scale (see the sketch after this list).
  • A more capable monitoring line card that can run Linux (1Q03).
  • Possible late-2003 introduction of switches based on the XBar32.
    • “Clos256+256” switch in a 12U rack-mount enclosure (?).
  • Very few “Myrinet” changes until the advent of “4x Myrinet” links (2004).
  • Switch pricing (~$400/port) is expected to remain unchanged.
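As a generic sketch of the idea behind dispersive routing (illustrative only; GM 2's actual route-selection policy is not described here), a sender that holds several precomputed source routes to the same destination can spread successive packets across them, which loads the internal links of a Clos network more evenly and tolerates a failed path at a fine time scale:

    #include <stdio.h>

    #define ROUTES_PER_DEST 4   /* e.g., one route through each spine crossbar */

    /* Per-destination state: a small set of precomputed source routes. */
    struct dest_routes {
        int next;                       /* index of the route to use next   */
        int route_id[ROUTES_PER_DEST];  /* stand-ins for real source routes */
    };

    /* Dispersive routing in its simplest form: rotate through the available
       routes so that consecutive packets take different paths. */
    int pick_route(struct dest_routes *d)
    {
        int r = d->route_id[d->next];
        d->next = (d->next + 1) % ROUTES_PER_DEST;
        return r;
    }

    int main(void)
    {
        struct dest_routes d = { 0, { 10, 11, 12, 13 } };
        for (int pkt = 0; pkt < 8; pkt++)
            printf("packet %d -> route %d\n", pkt, pick_route(&d));
        return 0;
    }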
64-bit, 66MHz, Myrinet/PCI interfaces
  • PCI64B, 133MHz RISC and memory
    • 1067 MB/s memory bandwidth
  • PCI64C, 200MHz RISC and memory
    • 1600 MB/s memory bandwidth

[Block diagram: the LANai 9 chip and the PCI DMA chip on the interface card; 533 MB/s across the 64-bit/66 MHz PCI bus, 1067 or 1600 MB/s to the local memory, and 500 MB/s (250+250 MB/s) on the Myrinet link.]
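The bandwidth figures quoted above follow directly from path width and clock rate; a quick check (assuming the 8-byte-wide local-memory path stated above and standard 64-bit/66 MHz PCI):

    #include <stdio.h>

    /* Peak bandwidth of an 8-byte-wide path clocked at f MHz. */
    double mb_per_s(double mhz) { return mhz * 8.0; }

    int main(void)
    {
        printf("PCI64B local memory: %.0f MB/s\n", mb_per_s(133.33)); /* ~1067 */
        printf("PCI64C local memory: %.0f MB/s\n", mb_per_s(200.0));  /*  1600 */
        printf("64-bit/66 MHz PCI  : %.0f MB/s\n", mb_per_s(66.67));  /*  ~533 */
        /* The Myrinet-2000 link itself contributes 250+250 = 500 MB/s. */
        return 0;
    }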

Interfaces: Changes Planned
  • PCI-X interfaces with several new features
    • Initially (4Q02): 225MHz x 8B memory & 225MHz RISC; one port (Lanai XP)
      • ~5.5µs GM latency (vs. 7µs today). MPI, VI, etc., latency will decrease correspondingly.
      • Higher level of integration, and the starting point for low-cost Myrinet/PCI-X interfaces.
        • Pricing of low-end Myrinet/PCI-X interfaces is expected to decrease by 50% in the first 18 months.
    • Then (1Q03): 300MHz x 8B memory & 300MHz RISC; one or two ports (Lanai 2XP)
      • ~4.8µs GM latency.
      • The starting point for high-end Myrinet/PCI-X interfaces.
    • Self-initializing
      • Simplified initialization; self-test capability.
      • Allows diskless booting through the Myrinet.
      • Blade and other high-density applications.
    • PCI-X series of interfaces will require GM 2.
  • PCI-Express interfaces in 2H03
  • Lanai 11 with on-chip memory & 4x Myrinet port in 2004
    • Open-ended performance growth to 600+MHz RISC and memory.
LANai XP PCI-X Interface

[Block diagram: LANai XP. A 225 MHz RISC; a PCI-X & DMA engine attached to the PCI-X bus; an X network interface with send/receive DMA engines and a SerDes to the PCI-card port; an L-bus memory interface to x72b SRAM; control & memory initialization; and interface EEPROM & JTAG.]

Lanai-XP-based Interface (M3F-PCIXD-2)

[Photo of the M3F-PCIXD-2 interface card, with callouts: EEPROM, SRAM, crystals, 2.5V regulator, SerDes, Lanai XP, fiber transceiver.]

High-End LANai 2XP PCI-X Interface

[Block diagram: LANai 2XP. Two X network interfaces, each with its own send/receive DMA engines and SerDes to a separate PCI-card port; a 300 MHz RISC; a PCI-X & DMA engine attached to the PCI-X bus; an L-bus memory interface to x72b SRAM; control & memory initialization; and interface EEPROM & JTAG.]

Current Choice of Myrinet Software Interfaces
  • The GM API
    • Low level, but some applications are programmed at this level
  • TCP/IP
    • Actually, “ethernet emulation,” included in all GM releases
      • 1.8 Gb/s TCP/IP (netperf benchmarks)
  • MPICH-GM
    • An implementation of the Argonne MPICH directly over GM (a minimal usage example follows this list)
  • VI-GM
    • An implementation of the VI Architecture API directly over GM
      • Possibly relevant to InfiniBand compatibility
  • Sockets-GM
    • An implementation of UNIX or Windows sockets (or DCOM) over GM. Completely transparent to application programs. Use the same binaries!
      • Sockets-GM/GM/Myrinet is similar to the proposed SDP/InfiniBand.
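Because MPICH-GM is a standard MPICH implementation layered directly on GM, ordinary MPI codes run on it unchanged; a minimal example (the mpicc/mpirun invocations in the comment are generic MPICH usage, not taken from this presentation):

    /* ping.c - minimal MPI round trip.  Build with the MPICH compiler wrapper,
       e.g. "mpicc ping.c -o ping", and run with "mpirun -np 2 ./ping". */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, msg = 0;
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            msg = 42;
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1 */
            MPI_Recv(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &st);
            printf("rank 0 got %d back\n", msg);
        } else if (rank == 1) {
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);   /* echo back */
        }

        MPI_Finalize();
        return 0;
    }

The same source compiles and runs over any MPICH device; switching to MPICH-GM changes only the MPICH build underneath.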
Myrinet Software: Changes Planned
  • GM 2, a restructuring of the GM control program
    • New features, such as the get function. Alpha release available now.
    • Improvements in mapping and in traffic dispersion to take better advantage of the Clos networks, and to use multiple-port NICs.
    • Interrupt coalescing in Ethernet emulation (a generic sketch of the idea follows this list).
  • Increasing emphasis on Myrinet interoperability with GbE and (perhaps) InfiniBand.
  • Applications support.
    • Myricom is now supporting selected application developers.
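As a generic sketch of interrupt coalescing (illustrative only; GM's actual implementation is not described here), a receive path can defer raising a host interrupt until several packets have accumulated or a short timeout has expired, which cuts the per-packet interrupt overhead of Ethernet emulation at high packet rates:

    #include <stdbool.h>
    #include <stdint.h>

    /* Tunables: interrupt after this many packets or this much time,
       whichever comes first.  Values are illustrative. */
    #define COALESCE_MAX_PKTS 32
    #define COALESCE_MAX_USEC 100

    struct coalesce_state {
        unsigned pending_pkts;    /* packets received since the last interrupt */
        uint64_t first_pkt_usec;  /* arrival time of the oldest pending packet */
    };

    /* Called for every received packet; returns true when the interface
       should actually interrupt the host now. */
    bool should_interrupt(struct coalesce_state *s, uint64_t now_usec)
    {
        if (s->pending_pkts == 0)
            s->first_pkt_usec = now_usec;
        s->pending_pkts++;

        if (s->pending_pkts >= COALESCE_MAX_PKTS ||
            now_usec - s->first_pkt_usec >= COALESCE_MAX_USEC) {
            s->pending_pkts = 0;  /* one interrupt covers everything pending */
            return true;
        }
        return false;             /* keep batching */
    }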


GM’s Ethernet Emulation
  • In addition to its OS-bypass features, GM also presents itself to the host operating system as an Ethernet interface. This “Ethernet emulation” feature of GM allows Myrinet to carry any packet traffic and protocols that can be carried on Ethernet, including TCP/IP and UDP/IP.
  • TCP/IP and UDP/IP performance over GM depends primarily on the host-CPU performance and the host OS's IP protocol stack.
  • Following is the netperf TCP/IP and UDP/IP performance of GM-1.5.2.1 between a pair of dual 2.0 GHz Intel Pentium-4 Xeon hosts that use the Serverworks Grand Champion chipset. The test machines were running Red Hat 7.3 and the Red Hat 2.4.18-4smp Linux kernel. The measured, short-message, one-way, TCP or UDP latency was ~31µs.

              Bandwidth    CPU Utilization (Sender)    CPU Utilization (Receiver)
    TCP/IP    1853 Mb/s             50%                           74%
    UDP/IP    1962 Mb/s             41%                           45%
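The figures above were measured with netperf; as a minimal illustration that ordinary sockets code exercises this same path unchanged (and, per the Sockets-GM slide, can also run over GM directly with the same binaries), here is a toy TCP sender that streams 1 GB to a byte sink and reports the achieved bandwidth. The port number and buffer size are arbitrary choices, not values from the presentation:

    /* stream.c - toy TCP throughput test: "./stream <host>" sends 1 GB to
       port 5001 and prints the achieved bandwidth.  Any byte sink works as
       the receiver, for example a trivial accept-and-read loop. */
    #include <arpa/inet.h>
    #include <netdb.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define PORT     5001
    #define CHUNK    (64 * 1024)
    #define TOTAL_MB 1024

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s <host>\n", argv[0]); return 1; }

        struct hostent *he = gethostbyname(argv[1]);
        if (!he) { fprintf(stderr, "unknown host %s\n", argv[1]); return 1; }

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_port   = htons(PORT);
        memcpy(&addr.sin_addr, he->h_addr_list[0], he->h_length);

        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
            perror("connect");
            return 1;
        }

        static char buf[CHUNK];   /* payload contents don't matter */
        long long total = (long long)TOTAL_MB * 1024 * 1024, sent = 0;

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        while (sent < total) {
            ssize_t n = write(fd, buf, sizeof buf);
            if (n <= 0) { perror("write"); return 1; }
            sent += n;
        }
        gettimeofday(&t1, NULL);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%.0f Mb/s\n", 8.0 * sent / 1e6 / secs);
        close(fd);
        return 0;
    }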