
Graphs, Data Mining, and High Performance Computing

Bruce Hendrickson

Sandia National Laboratories, Albuquerque, NM

University of New Mexico, Computer Science Dept.


Outline

  • High performance computing

  • Why current approaches can’t work for data mining

  • Test case: graphs for knowledge representation

  • High performance graph algorithms, an oxymoron?

  • Implications for broader data mining community

  • Future trends


Data Mining and High Performance Computing

  • “We can only consider simple algorithms”

    • Data too big for anything but O(n) algorithms

    • Often have some kind of real-time constraints

  • This greatly limits the kinds of questions we can address

    • Terascale data gives different insights than gigascale data

    • Current search capabilities are wonderful, but innately limited

  • Can high-performance computing make an impact?

    • What if our algorithms ran 100x faster and could use 100x more memory? 1000x?

  • Assertion: Quantitative improvements in capabilities result in qualitative changes in the science that can be done.


Modern Computers

  • Fast processors, slow memory

  • Use memory hierarchy to keep processor fed

    • Stage some data in smaller, faster memory (cache)

    • Can dramatically enhance performance

      • But only if accesses have spatial or temporal locality

        • Use accessed data repeatedly, or use nearby data next

  • Parallel computers are collections of these

    • Pivotal for a processor to own most of the data it needs

  • Memory patterns determine performance

    • Processor speed hardly matters
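
As an aside (not from the talk), a minimal C++ sketch of the point above: the same array, scanned sequentially versus chased pointer-by-pointer in random order, differs enormously in runtime even though the arithmetic is identical. The array size and RNG seed are arbitrary illustrative choices.

```cpp
// Minimal sketch (illustrative, not from the talk): sequential scan vs.
// pointer chasing over the same data, showing how memory access patterns,
// not processor speed, dominate performance.
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const std::size_t n = 1u << 24;              // ~16M elements, bigger than cache
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), 0);

    // Sequential scan: cache lines and hardware prefetching keep the core fed.
    auto t0 = std::chrono::steady_clock::now();
    std::size_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) sum += next[i];
    auto t1 = std::chrono::steady_clock::now();

    // Sattolo's algorithm: turn `next` into one big random cycle so the
    // traversal below is a pure pointer chase with no spatial locality.
    std::mt19937_64 rng(12345);
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }

    // Pointer chase: each load depends on the previous one and almost
    // always misses in cache, so latency dominates.
    auto t2 = std::chrono::steady_clock::now();
    std::size_t j = 0;
    for (std::size_t i = 0; i < n; ++i) j = next[j];
    auto t3 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::printf("sequential: %.1f ms   pointer chase: %.1f ms   (checks: %zu %zu)\n",
                ms(t1 - t0).count(), ms(t3 - t2).count(), sum, j);
    return 0;
}
```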


High Performance Computing

  • Largely the purview of science and engineering communities

    • Machines, programming models, & algorithms to serve their needs

    • Can these be utilized by learning and data mining communities?

    • Search companies make great use of parallelism for simple things

      • But not general purpose

  • Goals

    • Large (cumulative) core for holding big data sets

    • Fast and scalable performance for complex algorithms

    • Ease of programmability


Algorithms We’ve Seen This Week

  • Hashing (of many sorts)

  • Feature detection

  • Sampling

  • Inverse index construction

  • Sparse matrix and tensor products

  • Training

  • Clustering

  • All of these involve

    • complex memory access patterns

    • only small amounts of computation

  • Performance dominated by latency – waiting for data
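
The sparse matrix products above illustrate this well. A compressed-sparse-row matrix-vector product (an illustrative sketch, not code from the talk) does a single multiply-add per nonzero, but the load of x at an index read from the col array is indirect and data dependent, so the processor spends most of its time waiting on memory.

```cpp
// Illustrative CSR sparse matrix-vector product (not code from the talk).
// The indirect, data-dependent load x[A.col[k]] has poor locality and only
// one multiply-add per access, so latency, not arithmetic, sets the runtime.
#include <cstddef>
#include <vector>

struct CsrMatrix {
    std::size_t nrows = 0;
    std::vector<std::size_t> rowptr;  // nrows + 1 offsets into col/val
    std::vector<std::size_t> col;     // column index of each nonzero
    std::vector<double>      val;     // value of each nonzero
};

std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.nrows, 0.0);
    for (std::size_t i = 0; i < A.nrows; ++i) {
        double sum = 0.0;
        for (std::size_t k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k)
            sum += A.val[k] * x[A.col[k]];   // irregular, latency-bound read
        y[i] = sum;
    }
    return y;
}
```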


Architectural Challenges

  • Runtime is dominated by latency

    • Lots of indirect addressing, pointer chasing, etc.

    • Perhaps many at once

  • Very little computation to hide memory costs

  • Access pattern can be data dependent

    • Prefetching unlikely to help

    • Usually only want small part of cache line

  • Potentially abysmal locality at all levels of memory hierarchy

    • Bad serial and abysmal parallel performance


Graphs for Knowledge Representation

  • Graphs can capture rich semantic structure in data

  • More complex than “bag of features”

  • Examples:

    • Protein interaction networks

    • Web pages with hyperlinks

    • Semantic web

    • Social networks, etc.

  • Algorithms of interest include

    • Connectivity (of various sorts)

    • Clustering and community detection

    • Common motif discovery

    • Pattern matching, etc.
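
To make the flavor of these analyses concrete, here is a small illustrative sketch (not from the talk) of one of the simplest motif computations, triangle counting, on an adjacency-list graph. Like the other algorithms listed, its cost is dominated by irregular reads of neighbor lists rather than by arithmetic.

```cpp
// Illustrative sketch: counting triangles (a simple motif) in an undirected
// graph stored as sorted adjacency lists. Each triangle u < v < w is counted
// exactly once, via the common neighbors of (u, v) that are greater than v.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using Graph = std::vector<std::vector<std::size_t>>;  // adj[v], sorted ascending

std::uint64_t count_triangles(const Graph& adj) {
    std::uint64_t triangles = 0;
    for (std::size_t u = 0; u < adj.size(); ++u) {
        for (std::size_t v : adj[u]) {
            if (v <= u) continue;  // handle each edge once, with u < v
            // Intersect the parts of N(u) and N(v) that lie above v.
            auto wu = std::upper_bound(adj[u].begin(), adj[u].end(), v);
            auto wv = std::upper_bound(adj[v].begin(), adj[v].end(), v);
            while (wu != adj[u].end() && wv != adj[v].end()) {
                if (*wu < *wv)      ++wu;
                else if (*wv < *wu) ++wv;
                else { ++triangles; ++wu; ++wv; }   // common neighbor found
            }
        }
    }
    return triangles;
}
```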



Finding Threats = Subgraph Isomorphism

Image Source:

T. Coffman, S. Greenblatt, S. Marcus, “Graph-based technologies for intelligence analysis,” CACM 47(3), March 2004, pp. 45–47


Mohammed Jabarah (Canadian citizen handed over to US authorities on suspicion of links to 9/11).  

Omar Khadr (at Guantanamo)

Thanks to Kevin McCurley


Graph-Based Informatics: Data

  • Graphs can be enormous

    • High performance computing may be needed for memory and performance

  • Graphs are highly unstructured

    • High variance in number of neighbors

    • Little or no locality – Not partitionable

    • Experience with scientific-computing graphs is of limited utility

  • Terrible locality in memory access patterns


Desirable Architectural Features

  • Low latency / high bandwidth

    • For small messages!

  • Latency tolerant

  • Light-weight synchronization mechanisms

  • Global address space

    • No data partitioning required

    • Avoid memory-consuming profusion of ghost-nodes

    • No local/global numbering conversions

  • One machine with these properties is the Cray MTA-2

    • And successor XMT


Massive Multithreading: The Cray MTA-2

  • Slow clock rate (220 MHz)

  • 128 “streams” per processor

  • Global address space

  • Fine-grain synchronization

  • Simple, serial-like programming model

  • Advanced parallelizing compilers

Latency Tolerant:

important for Graph Algorithms


Cray MTA Processor

No Processor Cache!

Hashed Memory!

  • Each thread can have 8 memory refs in flight

  • Round trip to memory ~150 cycles


How Does the MTA Work?

  • Latency tolerance via massive multi-threading

    • Context switch in a single tick

    • Global address space, hashed to reduce hot-spots

    • No cache or local memory. Context switch on memory request.

    • Multiple outstanding loads

  • Remote memory request doesn’t stall processor

    • Other streams work while your request gets fulfilled

  • Light-weight, word-level synchronization

    • Minimizes access conflicts

  • Flexibly supports dynamic load balancing

  • Notes:

    • MTA-2 is 7 years old

    • Largest machine is 40 processors


Case Study: MTA-2 vs. BlueGene/L

  • With LLNL, implemented S-T shortest paths in MPI

  • Ran on IBM/LLNL BlueGene/L, world’s fastest computer

  • Finalist for 2005 Gordon Bell Prize

    • 4B-vertex, 20B-edge Erdős–Rényi random graph

    • Analysis: touches about 200K vertices

    • Time: 1.5 seconds on 32K processors

  • Ran similar problem on MTA-2

    • 32 million vertices, 128 million edges

    • Measured: touches about 23K vertices

    • Time: 0.7 seconds on one processor, 0.09 seconds on 10 processors

  • Conclusion: 4 MTA-2 processors = 32K BlueGene/L processors
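
For reference, the kernel both machines are accelerating is breadth-first search on an unweighted graph. A serial illustrative sketch is below (it is not the parallel MPI or MTA code; the graph type and function name are invented for the example).

```cpp
// Serial sketch of the underlying s-t shortest-path kernel on an unweighted
// graph: breadth-first search from s, stopping once t is dequeued. The
// parallel MTA-2 and MPI implementations in the talk accelerate this kernel.
#include <cstddef>
#include <queue>
#include <vector>

using Graph = std::vector<std::vector<std::size_t>>;  // adjacency lists

// Returns the number of edges on a shortest s-t path, or -1 if t is unreachable.
long long st_shortest_path(const Graph& adj, std::size_t s, std::size_t t) {
    std::vector<long long> dist(adj.size(), -1);
    std::queue<std::size_t> frontier;
    dist[s] = 0;
    frontier.push(s);
    while (!frontier.empty()) {
        std::size_t u = frontier.front();
        frontier.pop();
        if (u == t) return dist[t];
        for (std::size_t v : adj[u]) {
            if (dist[v] == -1) {          // first visit gives the shortest distance
                dist[v] = dist[u] + 1;
                frontier.push(v);
            }
        }
    }
    return -1;
}
```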


But Speed Isn’t Everything

  • Unlike the MTA code, the MPI code is limited to Erdős–Rényi graphs

    • Can’t support power-law graphs, which are pervasive in informatics

  • MPI code is 3 times larger than MTA-2 code

    • Took considerably longer to develop

  • MPI code can only solve this very special problem

    • MTA code is part of general and flexible infrastructure

  • MTA easily supports multiple, simultaneous users

  • But … MPI code runs everywhere

    • MTA code runs only on MTA/Eldorado and on serial machines


Multithreaded Graph Software Design

  • Build generic infrastructure for core operations including…

    • Breadth-first search (e.g. short paths)

    • Distributed local searches (e.g. subgraph isomorphism)

    • Rich filtering operations (numerous applications)

  • Separate basic kernels from instance specifics

  • Infrastructure is challenging to write

    • Parallelization & performance challenges reside in infrastructure

    • Must port to multiple architectures

  • But with infrastructure in place, application development is highly productive and portable


Customizing Behavior: Visitors

  • Idea from BOOST (Lumsdaine)

  • Application programmer writes small visitor functions

    • Get invoked at key points by basic infrastructure

      • E.g. when a new vertex is visited, etc.

    • Adjust behavior or copy data; build tailored knowledge products

  • For example, with one breadth-first-search routine, you can…

    • Find short paths

    • Construct spanning trees

    • Find connected components, etc.

  • Architectural dependence is hidden in infrastructure

    • Applications programming is highly productive

  • Use just enough C++ for flexibility, but not too much

  • Note: Code runs on serial Linux, Windows, Mac machines
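
A minimal sketch of the visitor idea (illustrative only; this is not the actual MTA/Eldorado infrastructure or the Boost interface): one generic breadth-first-search routine exposes a hook, and small visitor structs reuse it to compute distances or a spanning tree.

```cpp
// Minimal visitor-style BFS sketch (illustrative; not the real infrastructure).
// The traversal is written once; behavior is customized by small visitors.
#include <cstddef>
#include <queue>
#include <vector>

using Graph = std::vector<std::vector<std::size_t>>;  // adjacency lists

template <typename Visitor>
void bfs(const Graph& adj, std::size_t root, Visitor& vis) {
    std::vector<char> visited(adj.size(), 0);
    std::queue<std::size_t> q;
    visited[root] = 1;
    vis.discover(root, root);                 // hook: root first seen
    q.push(root);
    while (!q.empty()) {
        std::size_t u = q.front();
        q.pop();
        for (std::size_t v : adj[u]) {
            if (!visited[v]) {
                visited[v] = 1;
                vis.discover(v, u);           // hook: v discovered via edge (u, v)
                q.push(v);
            }
        }
    }
}

// Visitor 1: shortest (hop-count) distances from the root.
struct DistanceVisitor {
    std::vector<long long> dist;
    explicit DistanceVisitor(std::size_t n) : dist(n, -1) {}
    void discover(std::size_t v, std::size_t parent) {
        dist[v] = (v == parent) ? 0 : dist[parent] + 1;
    }
};

// Visitor 2: a BFS spanning tree stored as a parent array (n = "unreached").
struct SpanningTreeVisitor {
    std::vector<std::size_t> parent;
    explicit SpanningTreeVisitor(std::size_t n) : parent(n, n) {}
    void discover(std::size_t v, std::size_t p) { parent[v] = p; }
};
```

Running the same routine once per unreached vertex with a component-labeling visitor likewise yields connected components; only the visitor changes, never the traversal.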


Eldorado Graph Infrastructure: C++ Design Levels

[Diagram: C++ design levels spanning the Graph, Data Structure, Algorithm, and “Visitor” classes and the roles of infrastructure programmer, algorithms programmer, and analyst support; the infrastructure layers give parallelism and hide most concurrency, while the visitor layer gets parallelism for free.]

Inspired by Boost GL, but not Boost GL


Kahan’s Algorithm for Connected Components


Infrastructure Implementation of Kahan’s Algorithm

[Diagram: Kahan’s Phase I visitor drives the search kernel (tricky); the Phase II visitor (trivial) drives a Shiloach-Vishkin CRCW kernel (tricky); Kahan’s Phase III visitor is also trivial.]


Infrastructure Implementation of Kahan’s Algorithm (continued)

[Diagram: per-vertex “component” values start “empty” and are made “full”; in Phase I, the search waits until both endpoints of an edge are “full” and then adds the pair to a hash table.]
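
Kahan’s algorithm and Shiloach-Vishkin are parallel CRCW formulations; as a rough serial stand-in (illustrative only, not Kahan’s algorithm itself), the sketch below conveys the labeling idea they build on: repeatedly pull the smaller component label across each edge until nothing changes.

```cpp
// Serial label-propagation sketch of connected components (illustrative;
// not Kahan's algorithm or Shiloach-Vishkin, but the same basic idea:
// propagate the minimum vertex label within each component to a fixed point).
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

using Edge = std::pair<std::size_t, std::size_t>;

std::vector<std::size_t> connected_components(std::size_t n,
                                               const std::vector<Edge>& edges) {
    std::vector<std::size_t> comp(n);
    std::iota(comp.begin(), comp.end(), 0);   // every vertex starts in its own component
    bool changed = true;
    while (changed) {
        changed = false;
        for (const auto& [u, v] : edges) {
            std::size_t lo = std::min(comp[u], comp[v]);
            if (comp[u] != lo) { comp[u] = lo; changed = true; }
            if (comp[v] != lo) { comp[v] = lo; changed = true; }
        }
    }
    return comp;  // vertices in the same component end up with the same label
}
```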




“Bully” Algorithm Implementation

[Diagram: edge e is traversed if it would be traversed anyway, or if an atomic test on the destination (an or/and/replace operation that locks the destination word while testing) returns true.]
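
The guarded traversal in the diagram can be approximated on commodity hardware with an atomic compare-and-swap; the sketch below is illustrative only (the MTA/Eldorado machines use word-level locks and full/empty bits rather than std::atomic), and the names are invented for the example.

```cpp
// Rough sketch of the guarded-traversal idea above, using compare-and-swap
// as a stand-in for the MTA's word-level synchronization: traverse edge
// (u, v) only if we can atomically lower v's component label to u's.
#include <atomic>
#include <cstddef>
#include <vector>

// comp[v] holds v's current component label; many threads may call this
// concurrently on different edges.
inline bool try_claim(std::vector<std::atomic<std::size_t>>& comp,
                      std::size_t u, std::size_t v) {
    std::size_t mine   = comp[u].load(std::memory_order_relaxed);
    std::size_t theirs = comp[v].load(std::memory_order_relaxed);
    while (mine < theirs) {
        // Replace comp[v] only if it still holds `theirs`; on failure,
        // `theirs` is refreshed with the current value and we retry or give up.
        if (comp[v].compare_exchange_weak(theirs, mine,
                                          std::memory_order_acq_rel)) {
            return true;   // we lowered the label, so traverse the edge
        }
    }
    return false;          // another thread already set an equal or smaller label
}
```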



MTA-2 Scaling of Connected Components

[Chart: MTA-2 scaling of connected components on a power-law (highly unstructured) graph, with labeled runtimes of 5.41 s and 2.91 s.]


A Renaissance in Architecture

  • Bad news

    • Power considerations limit the improvement in clock speed

  • Good news

    • Moore’s Law marches on

  • Real estate on a chip is no longer at a premium

    • On a processor, much is already memory control

    • Tiny bit is computing (e.g. floating point)

  • The future is not like the past…



Example: AMD Opteron

[Series of annotated chip diagrams: nearly all of the Opteron is devoted to avoiding or tolerating memory latency (L1 instruction and data caches, L2 cache, load/store unit, out-of-order execution and memory/coherency logic, memory controller, DDR and HyperTransport memory and I/O interfaces); only the small integer and floating-point execution units are labeled as the actual “COMPUTER.”]

Thanks to Thomas Sterling


Consequences

  • Current response: stamp out more processors.

    • Multicore processors. Not very imaginative.

    • Makes life worse for most of us

  • Near future trends

    • Multithreading to tolerate latencies

    • MTA-like capability on commodity machines

      • Potentially big impact on data-centric applications

  • Further out

    • Application-specific circuitry

      • E.g. hashing, feature detection, etc.

    • Reconfigurable hardware?

      • Adapt circuits to the application at run time


Summary

  • Massive Multithreading has great potential for data mining & learning

  • Software development is challenging

    • correctness

    • performance

  • Well designed infrastructure can hide many of these challenges

    • Once built, infrastructure enables high productivity

  • Potential to become mainstream. Stay tuned…


Acknowledgements

  • Jon Berry

  • Simon Kahan, Petr Konecny (Cray)

  • David Bader, Kamesh Madduri (Ga. Tech) (MTA s-t connectivity)

  • Will McClendon (MPI s-t connectivity)

