
Scalable Reconfigurable Interconnects


Presentation Transcript


  1. Scalable Reconfigurable Interconnects. Ali Pinar, Lawrence Berkeley National Laboratory. Joint work with Shoaib Kamil, Lenny Oliker, and John Shalf. CSCAPES Workshop, Santa Fe, June 11, 2008.

  2. Ultra-scale systems rely on increased concurrency, and concurrency has grown enormously since 2004. How do we connect such huge numbers of processors?

  3. What is a good interconnect for ultra-scale systems? [Figures: torus and fat-tree topologies] • Mesh/torus networks provide limited performance. • Fat-trees are widely used due to their flexibility: 94 of the top 100 Top500 systems in 2004, 72 of the top 100 in 2007. • The cost of a fat-tree scales as O(P lg P). • At large processor counts, the cost of the interconnect dominates the cost of the compute power.
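To make the O(P lg P) scaling concrete, here is a back-of-the-envelope sketch in Python. The port-count formulas and machine sizes are illustrative assumptions, not figures from the talk; they only show how quickly fat-tree cost pulls away from a fixed-degree torus.

```python
# Back-of-the-envelope sketch (illustrative assumptions, not data from the
# talk): compare how switch-port counts grow for a full-bandwidth fat-tree,
# roughly O(P lg P) ports, versus a 3D torus, which uses a fixed 6 links per
# node and therefore grows as O(P).
import math

def fat_tree_ports(p):
    return p * math.log2(p)      # O(P lg P) scaling cited on the slide

def torus_ports(p):
    return 6 * p                 # 3D torus: six neighbors per node

for p in (1_024, 16_384, 262_144):
    ratio = fat_tree_ports(p) / torus_ports(p)
    print(f"P = {p:>7,}: fat-tree needs ~{ratio:.1f}x the ports of a torus")
```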

  4. Step-by-step approach • Characterize the communication requirements of applications. • This replaces theoretical metrics with practical ones. • Minimize the interconnect requirements: • Choice of subdomains • Task-to-processor mapping • Scheduling of messages • Design alternative interconnects: • Static networks: fit-trees • Reconfigurable networks

  5. Static Applications

  6. Static Applications

  7. Most messages are small. Employ a separate network for low-bandwidth messages.

  8. Most fat-tree ports are not utilized: more than 50% of the ports of a fat-tree go unused.

  9. Clever task-to-processor allocation yields better results. Hops reduced by an average of 25%; improved latency!
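The metric behind this slide is the number of network hops a mapping induces. Below is a minimal sketch, not the authors' allocation algorithm: the ring workload, the 4x4 mesh, and the snake ordering are my own toy example, chosen only to show how a smarter mapping lowers the hop count.

```python
# Illustrative sketch: count the hops a task-to-processor mapping induces on a
# 2D mesh, then compare a naive identity mapping against a snake ordering.

def mesh_hops(a, b, width):
    """Manhattan distance between processors a and b in a width x width mesh."""
    ax, ay = a % width, a // width
    bx, by = b % width, b // width
    return abs(ax - bx) + abs(ay - by)

def total_hops(comm_edges, mapping, width):
    """Sum of hop counts over all task-to-task messages under a mapping."""
    return sum(mesh_hops(mapping[s], mapping[t], width) for s, t in comm_edges)

width = 4
ring = [(i, (i + 1) % 16) for i in range(16)]     # toy workload: a 16-task ring
identity = list(range(16))                        # task i runs on processor i
# Snake (boustrophedon) ordering keeps ring neighbors adjacent in the mesh.
snake = [r * width + (c if r % 2 == 0 else width - 1 - c)
         for r in range(width) for c in range(width)]

print("identity mapping hops:", total_hops(ring, identity, width))  # 30
print("snake mapping hops   :", total_hops(ring, snake, width))     # 18
```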

  10. Do we need the fat-tree bandwidth? • We need the flexibility of a fat-tree, but not the full bandwidth. • The bandwidth requirement can be decreased with careful placement of tasks. • Proposed alternative: fit-trees. • Idea: analyze the communication requirements of applications and design the interconnect for what is really needed.

  11. Even all-to-all communication does not need a fat-tree. • All-to-all communication is the bottleneck for FFT. • Clever scheduling of messages reduces the bandwidth requirement. • Conventional algorithms for all-to-all communication do not distribute communication evenly. • The savings are even more pronounced in FFT with a 2D decomposition. [Chart: bandwidth per tree level at each communication step for the standard, randomized, and optimal schedules]
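A small sketch of the unevenness claim, under my own assumptions rather than the talk's data: for the conventional "at step s, node i sends to (i + s) mod P" all-to-all schedule, count how many messages per step cross between the two halves of the machine, i.e., the traffic the top level of a fat-tree must carry.

```python
# Per-step root-level crossings for a conventional shifted all-to-all schedule.
P = 64
crossings = []
for s in range(1, P):                        # step 0 is a self-send; skip it
    step_cross = sum(1 for i in range(P)
                     if (i < P // 2) != (((i + s) % P) < P // 2))
    crossings.append(step_cross)

print("peak crossings per step:", max(crossings))                     # 64
print("mean crossings per step:", f"{sum(crossings)/len(crossings):.1f}")  # ~32.5
# A schedule that spreads the same total traffic evenly across steps would only
# need the mean, roughly half the peak -- the kind of saving the slide cites.
```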

  12. Fit-trees: the network should fit the application • Key observation: the scalability of an application is related to the locality of its computation. • Implication: the required bandwidth decreases as we go higher in the tree. • Fitness ratio (f): the ratio of the bandwidth between two successive layers. • 2D domains: f ≈ 1.4 • 3D domains: f ≈ 1.2 [Diagram: a fat-tree carries bandwidth N at every level, while a fit-tree's per-level bandwidth changes by a factor of f between levels]
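A minimal sketch under my reading of the slide's definition: each level toward the root provisions 1/f of the bandwidth of the level below it. The fitness ratio f = 1.4 for 2D domains comes from the slide; the leaf count, level count, and the use of summed per-level bandwidth as a cost proxy are my own illustrative assumptions.

```python
# Fit-tree vs. fat-tree provisioning under an assumed geometric decay by f.

def fit_tree_levels(n_leaves, levels, f):
    """Bandwidth provisioned at each level, leaf level first."""
    return [n_leaves / f**lvl for lvl in range(levels)]

N, levels = 4096, 6
fat = [N] * levels                          # fat-tree: full bandwidth everywhere
fit = fit_tree_levels(N, levels, f=1.4)     # fit-tree with the 2D fitness ratio

print("fat-tree cost proxy:", sum(fat))
print("fit-tree cost proxy:", round(sum(fit)))
print("savings            :", f"{1 - sum(fit) / sum(fat):.0%}")   # roughly half
```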

  13. Fit-trees provide scalability

  14. HFAST • Hybrid Flexibly-Assignable Switch Topology • Use Layer-1 (circuit) switches to configure Layer-2 (packet) switches at run time (O(10-100 ms) reconfiguration cost). • The hardware to do so exists (optical networks). • Layer-1 switches are cheaper per port (no dynamic decisions; like a telephone switchboard). • Collective communication uses a separate low-latency, low-bandwidth tree network (as in IBM BlueGene).
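To put the O(10-100 ms) reconfiguration cost in perspective, here is a rough amortization sketch; all the numbers are illustrative assumptions, not measurements from the talk.

```python
# Break-even estimate: a Layer-1 rewiring pays off only if the new topology is
# used long enough afterwards to win the reconfiguration time back.
reconfig_cost_s = 0.05          # 50 ms, mid-range of the O(10-100 ms) cost
saving_per_msg_s = 2e-6         # assume each message gets 2 us cheaper after rewiring
break_even_msgs = reconfig_cost_s / saving_per_msg_s
print(f"break-even after ~{break_even_msgs:,.0f} messages")   # ~25,000 messages
```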

  15. How to use HFAST • Improved task-to-processor assignments • Even at runtime: • Migrate processes with little overhead • Adapt to changing communication requirements • Avoid defragmentation at the system level • Build an interconnect for each application • Avoid overprovisioning the communication resources
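A hypothetical sketch of the "build an interconnect for each application" idea. The function, the greedy heuristic, and the toy communication graph are my own simplification, not part of HFAST: it groups heavily communicating processors onto the same packet switch, which is the assignment a Layer-1 circuit switch would then patch in before the job starts.

```python
# Toy provisioning: partition processors onto packet switches so that heavy
# communication stays inside a single switch wherever possible.

def assign_switches(comm_edges, n_procs, ports_per_switch):
    """comm_edges: dict {(a, b): volume}. Returns processor -> switch id."""
    switch_of = {}
    load = []                                        # processors per switch
    for a, b in sorted(comm_edges, key=lambda e: -comm_edges[e]):
        for p in (a, b):
            if p in switch_of:
                continue
            partner = b if p == a else a
            target = switch_of.get(partner)
            if target is not None and load[target] < ports_per_switch:
                switch_of[p] = target                # join the partner's switch
                load[target] += 1
            else:
                switch_of[p] = len(load)             # open a new switch
                load.append(1)
    # Processors with no recorded traffic all share one spare switch id.
    for p in range(n_procs):
        switch_of.setdefault(p, len(load))
    return switch_of

# Toy example: two tightly coupled groups of four, with one weak cross link.
edges = {(0, 1): 9, (1, 2): 9, (2, 3): 9, (4, 5): 9, (5, 6): 9, (6, 7): 9, (3, 4): 1}
print(assign_switches(edges, n_procs=8, ports_per_switch=4))
# -> {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}
```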

  16. Processor allocation for adaptive applications. We obtain 41% and 53% of the ideal savings in hops.

  17. Conclusions • The massive concurrency of ultra-scale machines will require new interconnects. • We cannot afford to overprovision resources. • There is no magic solution that is good for all applications. • Flexibility or reconfigurability is necessary. • The technology for reconfigurable networks is available. • We need to: • reduce the resource requirements, • design networks for typical workloads, and • design methods to build networks for a given application.
