
Scalable Algorithms for Massive-scale Graphs


Presentation Transcript


  1. Fabrizio Petrini, IBM T.J. Watson. Scalable Algorithms for Massive-scale Graphs. KAUST, SC11, November 2011

  2. LinkedIn Network

  3. Data in the real world: Large Complex Networks
     Internet: nodes are the routers, edges are the connections. Problems include discovering the network topology and solving bandwidth, flow, and shortest-path problems (1.8 billion Internet users and growing).
     Network security: large graphs that can be used to dynamically track network behavior and identify potential attacks.
     Domains: social networks, Internet, finance, biological, transportation.
     Data source: DIMACS, Wikipedia
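For unweighted graphs, the shortest-path problems mentioned above reduce to breadth-first search, which is also the kernel the Graph500 benchmark (discussed later in this talk) measures at scale. Below is a minimal single-threaded sketch in C over a CSR-style adjacency list; the graph_t type and bfs_distances function are illustrative names, not code from the talk.

#include <stdlib.h>

/* CSR-style adjacency list: row[i]..row[i+1] indexes the neighbors of vertex i. */
typedef struct {
    int n;      /* number of vertices */
    int *row;   /* offsets into col[], size n+1 */
    int *col;   /* concatenated adjacency lists */
} graph_t;

/* Returns a malloc'd array of hop distances from source (-1 = unreachable). */
int *bfs_distances(const graph_t *g, int source)
{
    int *dist  = malloc(g->n * sizeof *dist);
    int *queue = malloc(g->n * sizeof *queue);
    for (int i = 0; i < g->n; i++) dist[i] = -1;

    int head = 0, tail = 0;
    dist[source] = 0;
    queue[tail++] = source;

    while (head < tail) {
        int u = queue[head++];
        for (int e = g->row[u]; e < g->row[u + 1]; e++) {
            int v = g->col[e];
            if (dist[v] < 0) {            /* first visit = fewest hops */
                dist[v] = dist[u] + 1;
                queue[tail++] = v;
            }
        }
    }
    free(queue);
    return dist;
}

The distributed, SPI-based Graph500 implementation discussed later partitions the graph across nodes and replaces this serial frontier expansion with bulk message exchanges.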

  4. Data in the real world: Large Complex Networks
     Social networks: nodes represent people, joined through personal or professional acquaintance, or through groups and communities.
     Facebook: from 100 to 150 million users in 4 months; more than 400 million users currently.
     Twitter: 75 million users (Jan 2010), adding 6.2 million new users per month.
     Domains: social networks, Internet, finance, biological, transportation.
     Data source: Nielsen Report, The Inquirer

  5. Blue Gene/Q packaging hierarchy, from chip to system
     1. Chip: 16+2 PowerPC cores
     2. Single Chip Module
     3. Compute card: one chip module, 8/16 GB DDR3 memory, heat spreader for H2O cooling
     4. Node card: 32 compute cards, optical modules, link chips; 5D torus
     5a. Midplane: 16 node cards
     5b. I/O drawer: 8 I/O cards with 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
     6. Rack: 2 midplanes
     7. System: 96 racks, 20 PF/s
     • Sustained single-node performance: 10x Blue Gene/P, 20x Blue Gene/L
     • MF/Watt: 6x Blue Gene/P, 10x Blue Gene/L (~2 GF/W, Green 500 criteria)
     • Software and hardware support for programming models that exploit node hardware concurrency

  6. Message Latency and Message Rate (single-threaded; per-stage breakdown shown as a chart)
     Total MPI latency: 2864 pclk. Total MPI overhead: 1172 pclk.
     Our Graph500 implementation relies on SPI instead.
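For context on how per-message figures like these are typically obtained, the sketch below is a standard MPI ping-pong microbenchmark that measures one-way latency through the full MPI stack. It uses only stock MPI calls and is not the SPI path that the Graph500 implementation uses to bypass that software overhead.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iters = 10000;   /* round trips to average over */
    char byte = 0;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency is half of the measured round trip */
        printf("one-way latency: %.3f us\n", (t1 - t0) / (2.0 * iters) * 1e6);

    MPI_Finalize();
    return 0;
}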

  7. Blue Gene/Q compute chip architecture (block diagram: PPC cores with L1 prefetchers and quad FPUs, a full crossbar switch to 2 MB L2 slices, DDR-3 controllers to external DDR3, network/DMA unit, and PCI Express sharing the chip I/O)
     • 16+1 core SMP; each core is 4-way hardware threaded
     • Transactional memory and thread-level speculation
     • Quad floating-point unit on each core; 204.8 GF peak per node
     • Frequency: 1.6 GHz
     • 563 GB/s bisection bandwidth to the shared L2 (Blue Gene/L at LLNL has 700 GB/s for the entire system)
     • 32 MB shared L2 cache, reached through a full crossbar switch
     • 42.6 GB/s DDR3 bandwidth (1.333 GHz DDR3; 2 channels, each with chipkill protection)
     • 10 intra-rack and inter-rack interprocessor links, each at 2.0 GB/s (5-D torus)
     • One I/O link at 2.0 GB/s to the I/O subsystem; chip I/O shares function with PCI Express
     • 16 GB memory per node
     • 55 watts chip power
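Assuming the quad FPU retires one fused multiply-add per lane each cycle (8 flops per cycle per core), the quoted peak works out to 16 cores × 8 flops/cycle × 1.6 GHz = 204.8 GF per node.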

  8. Scalable atomic operations (fetch_and_inc, used for example as a queuing lock)
     • Cost: 1 round trip + 4N L2 cycles, where N is the number of threads
     • For N = 64 and an L2 round trip of 75 cycles: 75 + 4×64 = 331 cycles
     • Compared to roughly 9600 cycles for the standard implementation
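The queuing lock mentioned above can be built from fetch_and_inc as a ticket lock; the portable C11 sketch below (ticket_lock_t and its functions are illustrative names) shows the idea. It is not the Blue Gene/Q implementation: on Blue Gene/Q the fetch_and_inc executes at the shared L2, which is what keeps the acquire cost near one L2 round trip plus a few cycles per waiting thread, whereas this portable version spins in each thread's own cache.

#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* ticket dispenser: one fetch_and_inc per acquire */
    atomic_uint now_serving;   /* ticket currently allowed into the critical section */
} ticket_lock_t;

static void ticket_lock_init(ticket_lock_t *l)
{
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

static void ticket_lock_acquire(ticket_lock_t *l)
{
    /* fetch_and_inc: take a ticket and join the FIFO queue of waiters */
    unsigned my_ticket = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                                   memory_order_relaxed);
    /* spin until our ticket is served; acquire pairs with the release below */
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my_ticket)
        ;
}

static void ticket_lock_release(ticket_lock_t *l)
{
    /* hand the lock to the next ticket in line */
    atomic_store_explicit(&l->now_serving,
        atomic_load_explicit(&l->now_serving, memory_order_relaxed) + 1,
        memory_order_release);
}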
