
High Productivity Computing


Presentation Transcript


  1. High Productivity Computing Large-scale Knowledge Discovery: Uses, Infrastructure, and Algorithms Steve Reinhardt Principal PM Architect Microsoft Prof. John Gilbert, UCSB Dr. Viral Shah, UCSB/ISC

  2. Agenda • Uses of knowledge discovery • Distinction from data mining • Infrastructure • Algorithms • Practicalities

  3. Context for Knowledge Discovery From Debbie Gracio and Ian Gorton, PNNL Data Intensive Computing Initiative

  4. Homeland security • Problem: Need to detect concerted (concealed) actions • Data is huge, from many media types • Many types of possible attacks • Detection sometimes needed on tactical time-scales Coffman, Greenblatt, and Marcus, “Graph-based Technologies for Intelligence Analysis”, CACM

  5. Cheminformatics: Identifying molecular precursors of negative trial outcomes • Problem: Modest numbers (O(1K)) of patients in clinical trials are insufficient to detect rare negative effects • Possible solution: Identify the molecular precursors at sub-injurious concentrations • Typically multiple precursors are needed to identify a negative effect • Causal network models generate clinically testable hypotheses for avoidance [Figure: The combined “systems profile” for EGF inhibition] Elliston et al., “Systems Pharmacology: An Application of Systems Biology”

  6. Manufacturing: Aircraft engine failure detection • Problem: Actual or borderline failures are expensive to fix and potentially disruptive of operations • Recognizing failure signatures before they become real is much better (repair cost, operational disruption) • Two uses: • Initial detection of signatures • Operational detection Courtesy of Rolls Royce PLC

  7. Telecommunications: Wireless traffic categorization • Nonnegative matrix factorization identifies essential components of traffic • Analyst labels different types of external behavior Karpinski, Gilbert, and Belding, “Non-parametric discrete mixture model recovery via non-negative matrix factorization”

  8. Knowledge Discovery Workflow 1. Cull relevant data 2. Build input graph 3. Analyze input graph 4. Visualize result graph [Diagram: data sources (gene, email, Twitter, video, sensor, web) feed the cull step; steps 2-4 operate on the graph in memory]
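  A minimal M-style sketch of the four-step workflow, assuming a pre-culled edge list on disk; cluster_graph and visualize are hypothetical helpers, not actual Star-P/KDT APIs:

    % 1. Cull: load a pre-filtered edge list [src dst weight] produced upstream
    edges = load('culled_edges.txt');
    % 2. Build: represent the input graph as a sparse adjacency matrix
    n = max(max(edges(:, 1:2)));
    G = sparse(edges(:,1), edges(:,2), edges(:,3), n, n);
    % 3. Analyze: run an analysis, e.g. a clustering routine (hypothetical)
    labels = cluster_graph(G);
    % 4. Visualize: hand the result graph and labels to a viewer (hypothetical)
    visualize(G, labels);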

  9. Agenda • Uses of knowledge discovery • Infrastructure • Algorithms • Practicalities

  10. Infrastructure: Microsoft Approach • Economic history is on the side of mass-consumption tool-builders, not artisans • Ancient scribes -> quill pens -> mass-produced pencils (Dixon Ticonderoga) • Early expert-only automobiles -> automobiles usable by anyone (Ford) • Refrigerator-sized motion-picture cameras -> hand-helds (Sony, …) • HPC’s impact on society is low because it is hard to use • Disruptive change: parallelism everywhere (intra-chip, intra-node, inter-node, cloud) and the need for applications to respond • Microsoft is investing to enable developers and domain experts to move to this parallel world

  11. A Cross-section of Today’s Tools: Tools / Programming Models / Runtimes • Tools: Visual Studio 2010 (parallel debugger, profiler, concurrency analysis); Microsoft Research: race detection, fuzzing • Managed languages: Visual F#, Axum • Managed libraries: Parallel LINQ, Task Parallel Library, Rx, data structures; DryadLINQ • Native libraries: Parallel Pattern Library, Async Agents Library, data structures • Runtimes: managed concurrency runtime (ThreadPool); native concurrency runtime (task scheduler, resource manager) • Operating system: Windows 7 / Server 2008 R2, HPC Server; threads, UMS threads • Key distinguishes shipping (Visual Studio 2010 / .NET 4) from research/incubation

  12. Cluster-Aware Microsoft TC Technologies • Memory-centric (data-parallel, loop-parallel): Star-P (product; for domain specialists) • Disk-centric: DryadLINQ and Dryad (research/incubation; for professional developers), HPC SOA, MPI • Windows HPC Server spanning the cluster nodes: job scheduling, diagnostics, system monitoring

  13. Microsoft TC Tools for Knowledge Discovery Workflow • Step 1, cull relevant data (disk): DryadLINQ, over sources such as gene data, email, Twitter, video, web data, … • Steps 2-4, build / analyze / visualize the graph (memory): Star-P / KDT

  14. DryadLINQ: Query + Plan + Parallel Execution • Dryad • Distributed-memory coarse-grain run-time • Generalized MapReduce • Using computational vertices and communication channels to form a dataflow execution graph • LINQ (Language INtegrated Query) • A query-style language interface to Dryad • Traditional relational operators (e.g., Select, Join, GroupBy) • Scaling for histogram example • Initial data 10.2TB, using 1,800 cluster nodes, 43,171 execution-graph vertices spawning 11,072 processes, creating 33GB output data in 11.5 minutes of execution [Diagram: a job manager directs vertices (V) on cluster nodes over a control plane; the data plane moves data via files, TCP, FIFO, and the network]

  15. KDT: A Toolbox for Knowledge Discovery • Inputs: files, Dryad streams, OPeNDAP, Hadoop • Graph layer: graph primitives (connected components, maximal independent sets, …); graph abstractions and patterns (vertices, visitors, breadth-first search) • Analysis layer: clustering (barycentric, K-means), betweenness centrality, classification (support vector machines, Markov models, Bayesian), dimensionality reduction / factorization (eigenvalues/vectors, singular values, nonnegative matrix factorization, …), optimization, visualization (IN-SPIRE, …) • Sparse matrix layer: linear algebra (sparse mat*vec, mat*mat, …), solvers (MUMPS, SuperLU, …), data structures (sparse matrices, …) • Core Star-P: parallel constructs (data- and loop-parallel), parallel I/O (HDF5, POSIX fopen/fread/…), utility (sort, indexing) • Targets GPU/accelerator hardware
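  To make the sparse-matrix style concrete, here is a minimal sketch (not the actual KDT code) of one of the listed graph primitives, breadth-first search, expressed as repeated sparse matrix-vector products over an n-by-n adjacency matrix A (symmetric for an undirected graph); under Star-P the same M code would operate on distributed sparse matrices:

    function level = bfs_spmv (A, source)
    % BFS_SPMV : breadth-first search as repeated sparse mat*vec
    n = size(A, 1);
    level = -ones(n, 1);                  % -1 marks unvisited vertices
    frontier = sparse(source, 1, 1, n, 1);
    depth = 0;
    while nnz(frontier) > 0
      level(frontier ~= 0) = depth;       % record distance for the current frontier
      frontier = A * frontier;            % expand the frontier by one hop
      frontier = frontier .* (level < 0); % drop vertices already visited
      depth = depth + 1;
    end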

  16. MATLAB Star-P Bridges Scientists to HPCs Star-P enables domain experts to use parallel, big-memory systems via productivity languages (e.g., the M language of MATLAB) Knowledge discovery scaling with Star-P • Kernels to 55B edges between 5B vertices, on 128 cores (consuming 4TB memory) • Compact applications to 1B edges on 256 cores

  17. Agenda • Uses of knowledge discovery • Infrastructure • Algorithms • Practicalities

  18. Algorithms • Must be usable by a non-graph-expert on very large data • E.g., automatically detecting convergence (see the sketch below) • Must have practical computational complexity • O(|V|^2) or O(|E|^2) is not practical • Core Star-P sparse matrix algorithms are O(|E|+|V|), not O(|V|^3) or O(|E|^1.5) • State of the art (e.g., for clustering) • Some agreement on best current algorithms • Girvan-Newman community detection: repeated recalculation of betweenness centrality is too expensive • Non-negative matrix factorization: can give good results, but difficult to calibrate • Broad agreement that current algorithms are not good enough • E.g., they only work for a given number of clusters and don’t support multi-membership • Intense work on new algorithms • Better algorithms will arise from domain specialists working with at-scale data interactively
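  As one example of shielding the non-expert from tuning, an iterative routine can detect convergence automatically with a relative-change test; a minimal M sketch with illustrative defaults, not Star-P’s actual mechanism:

    function x = iterate_until_converged (step, x0, tol, maxit)
    % STEP is a function handle computing one iteration; X0 is the initial state
    x = x0;
    for k = 1:maxit
      xnew = step(x);
      if norm(xnew - x) <= tol * max(1, norm(x))  % relative-change convergence test
        x = xnew;
        return
      end
      x = xnew;
    end
    warning('no convergence within %d iterations', maxit);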

  19. Agenda • Uses of knowledge discovery • Infrastructure • Algorithms • Practicalities

  20. Practicalities • Voltaire: “The perfect is the enemy of the good.” • Today’s best algorithms are valuable in themselves, so worth propagating to a wider audience • And we need to foster rapid development of new, better algorithms • => Provide robust, scalable infrastructure (Star-P) for algorithm development • => Seed development with an open-source library of the best current algorithms

  21. Summary • Knowledge discovery is a high-value technique relevant to many disciplines • Data sizes require cluster technologies for both Cull (disk) and Analyze (memory) steps • Rapid algorithm development is essential • DryadLINQ (disk) and Star-P (memory) robustly implement key infrastructure at scale • <<Watch this space>>

  22. © 2010 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista, Windows 7, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

  23. boundary

  24. Dryad: General-purpose Coarse-Grain Data Parallelism • A generalization of MapReduce • Using computational vertices and communication channels to form a dataflow execution graph • Automatically partitions queries to run where input data resides on disk, creating partitioned intermediate results or outputs [Diagram: a job manager and job scheduler direct vertices (V) on cluster nodes over a control plane; the data plane moves data via files, TCP, FIFO, and the network]

  25. LINQ – Language INtegrated Query • A query-style language interface to Dryad • Traditional relational operators (e.g., Select, Join, GroupBy) • Integrated into .NET programming languages • Data model • Much more flexible than SQL tables • Integrated with C#, etc., today • How to integrate with domain-specialist languages?

  26. DryadLINQ: Query + Plan + Parallel Execution Dryad execution • Distributed execution plan • Static optimizations: pipelining, aggregation, etc. • Dynamic optimizations: data-dependent partitioning, dynamic aggregation, etc. • Automatic code generation • Vertex code that runs on vertices • Code to serialize transfers across channels • Automatically distributed to cluster machines • Separate LINQ query from its local context • Distribute referenced objects to cluster machines • Distribute application DLLs to cluster machines • Scaling for histogram example • Initial data 10.2TB, using 1,800 cluster nodes, 43,171 execution-graph vertices spawning 11,072 processes, creating 33GB output data in 11.5 minutes of execution

  27. References • http://www.microsoft.com/science • http://www.microsoft.com/hpc • http://www.microsoft.com/hpc/en/us/developer-resources.aspx • http://www.microsoft.com/hpc/en/us/product-documentation.aspx • http://resourcekit.windowshpc.net/home.html • http://social.microsoft.com/Forums/en-US/category/windowshpc • http://www.microsoft.com/visualstudio/products/2010 • http://msdn.microsoft.com/en-us/concurrency/default.aspx • http://www.solverfoundation.com/ • http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=3313856b-02bc-4bdd-b8b6-541f5309f2ce • http://paralleldwarfs.codeplex.com/ • http://www.osl.iu.edu/research/mpi.net • http://msdn.microsoft.com/en-us/fsharp/default.aspx • http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx • http://research.microsoft.com/en-us/projects/Accelerator/ • http://www.microsoft.com/downloads/details.aspx?FamilyID=9F943B2B-53EA-4F80-84B2-F05A360BFC6A&displaylang=en

  28. Agenda • GASNet-on-MS-MPI project • Layered accelerator-ready knowledge discovery toolbox • Dryad / Star-P linkage • System configuration for experiments with Microsoft technologies

  29. Layered Accelerator-Ready KDT (noodling with John Gilbert) • High-level algorithms (e.g., betweenness centrality, all-pairs shortest path) are implemented in terms of a few key kernels (e.g., spmat*spmat, s-t connectivity) • Local accelerators (i.e., GPU) can overload kernels at the local level • Global communication is still handled by Star-P’s global address space • Global accelerators (i.e., XMT) can overload at the global level, subsuming local and global phases • Benefits • Graph algorithm researchers always write in the very high-level language • System-focused developers can plug in lower-level or mid-level kernels • Questions • Does the linear-algebra approach relax the requirement for single-thread latency tolerance? • Which problems are hardest in the linear-algebra approach? [Diagram: a layered stack from betweenness centrality and all-pairs shortest path down through spmat*spmat to XMT, GPU, and CPU back-ends]

  30. Dryad and Star-P • Dryad http://research.microsoft.com/en-us/projects/Dryad/ • Run-time for data-parallel coarse-grained (disk-based) parallel queries • Generalizes MapReduce capabilities to less-structured task graphs (DAG) • Intended for Internet-scale searches • Interface: SQL-like language integration with C#, considering M interface • Dryad optimizes the DAG to minimize data size and motion, preserve affinity • Dryad executes the DAG, passing data between vertices via files/pipes/network • Star-P / Knowledge Discovery Toolbox • Star-P: distributed memory run-time for (memory-based) data-parallel and loop-parallel operations; programmed via M language of MATLAB™ • KDT: small set of graph-analysis operations implemented in terms of sparse matrices • Infrastructure already scaled to 4TB graph and 512 cores

  31. Dryad Example: Query histogram computation • Input: log file (n partitions) • Extract queries from log partitions • Re-partition by hash of query (k buckets) • Compute histogram within each bucket
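  A sequential M analogue of this plan, to make the data flow concrete; the log lines and the hash are toys, and Dryad runs each stage in parallel across partitions:

    % Parse: extract the query term from each log line
    lines = {'GET /s?q=cats', 'GET /s?q=dogs', 'GET /s?q=cats'};
    tokens = regexp(lines, 'q=(\w+)', 'tokens', 'once');
    queries = cellfun(@(t) t{1}, tokens, 'UniformOutput', false);
    % Re-partition: assign each query to one of k buckets by a toy hash
    k = 4;
    bucket = mod(cellfun(@(q) sum(double(q)), queries), k) + 1;
    % Count: compute the histogram within each bucket
    for b = 1:k
      [u, ~, idx] = unique(queries(bucket == b));
      if isempty(u), continue; end
      counts = accumarray(idx, 1);
      for i = 1:numel(u)
        fprintf('bucket %d: %s -> %d\n', b, u{i}, counts(i));
      end
    end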

  32. Dryad Example: Efficient histogram topology • Vertex types: P = parse lines, D = hash distribute, S = quicksort, C = count occurrences, MS = merge sort, M = non-deterministic merge • [Diagram: n input partitions flow through parse/distribute/sort/count stages into k hash buckets, whose merge/sort/count stages produce the final histogram; composite stages Q', T, and R are built from these vertex types]

  33. Dryad Example: Final histogram refinement • 1,800 computers, 43,171 vertices, 11,072 processes, 11.5 minutes • [Diagram: successive Q', T, and R stages reduce the 10.2 TB input to 154 GB, 118 GB, and 33.4 GB, with stage widths of 99,713, 10,405, 217, and 450 vertex instances]

  34. Simply Linking Dryad and Star-P
    % Run a Dryad job, with results returned into Star-P variables
    % (jstart/jend replace the original start/end arguments, since 'end' is a reserved word in M)
    [i, j, value, label] = dryad('myjob', jstart, jend);
    % Create a graph and run betweenness centrality on it
    G = graph(i, j, value, label);
    bc = betcentrality(G);

  35. Possible System Configuration • Cray CX1000 quarter-rack (CX1000-C line, up to 1.95 TF, $150k-$700k) • Head node plus … • 18 compute blades, each with • Two 6-core sockets • 24GB DRAM (2GB/core) [432GB aggregate] • One 250GB drive • Windows HPC Server and Cluster Manager • 3-year, 24x7 support • On-site installation • $193,473 • Options: • GPUs: to explore DPC++ • More disk: to explore Dryad

  36. Agenda • Approach • Specifics • Examples • Use of accelerators

  37. Agenda • Approach • Specifics • Examples • Use of accelerators

  38. Vision for Large-scale Knowledge Discovery • Enable the domain expert to explore big unstructured data interactively • Domain expert: scientist or analyst, not a math or graph expert • Explore: human-guided characterization, from simple statistics to complex clustering or factoring, even when the best algorithm is not known; ideally done from a visual-analytic GUI, allowing the analyst to discern structure in the data via exploration • Big: 20GB+ of data commonplace, >1TB largest • Unstructured data: e.g., arising from metabolic networks, climate change, social interactions, and Internet traffic • Interactively: common queries take O(10 seconds); scale system size to preserve interactivity • Depends on a variety of algorithms, each general-purpose, reusable, and usable by non-experts; KDT implements many key algorithms and is extensible with others

  39. Agenda • Approach • Specifics • Examples • Use of accelerators

  40. Value for Graph-analytic Algorithm Developers • Develop the algorithm with the M language on the desktop • Extend (slightly) for big data (parallel execution on a server) • Exploit robust implementations of key algorithms and primitives • Exploit already-scaled infrastructure: up to 512 cores, up to 5TB memory

  41. Agenda • Approach • Specifics • Examples • Use of accelerators

  42. A complex kernel (SSCA#2 v1.1, kernel 4)
    function leader = kernel4f (G)
    % KERNEL4F : SSCA#2 Kernel 4 -- Graph Clustering
    n = length (G);   % number of vertices ('n' was used but never defined in the original)
    % Find a maximal independent set in G
    [IS, misrounds] = mis (G);
    fprintf ('MIS rounds: %d. MIS nodes: %d\n', misrounds, length(IS));
    % Find the neighbors of each node from the IS
    neighFromIS = neighbors (IS, 1);
    % Pick one of the neighboring IS nodes as a leader
    [ign, leader] = max (neighFromIS, [], 2);
    % Collect votes from neighbors
    [I, J] = find (G);
    S = sparse (I, leader(J), 1, n, n);
    % Pick the most popular leader among neighbors and join that cluster
    [ign, leader] = max (S, [], 2);
  Discovers the underlying clique structure of an undirected graph **Gilbert, Reinhardt, and Shah
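  For reference, a minimal sketch of the mis() routine the kernel depends on, in the style of Luby's randomized maximal-independent-set algorithm over a 0/1 adjacency matrix; this is illustrative, not the actual Star-P implementation:

    function [IS, rounds] = mis (G)
    % MIS : maximal independent set via Luby's randomized algorithm
    n = size(G, 1);
    undecided = true(n, 1);
    inIS = false(n, 1);
    rounds = 0;
    while any(undecided)
      rounds = rounds + 1;
      r = rand(n, 1);
      r(~undecided) = 0;                              % only undecided vertices compete
      nbrmax = full(max(G * spdiags(r, 0, n, n), [], 2)); % best competing neighbor value
      winners = undecided & (r > nbrmax);             % strict local maxima join the set
      inIS = inIS | winners;
      nbrs = logical(G * double(winners));            % neighbors of winners drop out too
      undecided = undecided & ~winners & ~nbrs;
    end
    IS = find(inIS);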

  43. App: Computational Ecology • Modeling dispersal of species within a habitat (to maximize range) • Large geographic areas, linked with GIS data • Blend of numerical and combinatorial algorithms Brad McRae and Paul Beier, “Circuit theory predicts gene flow in plant and animal populations”, PNAS, Vol. 104, no. 50, December 11, 2007

  44. Results • Solution time reduced from 3 days (desktop) to 5 minutes (14 processors) for typical problems • Aiming for much larger problems: Yellowstone-to-Yukon (Y2Y)

  45. App: Factoring network flow behavior **Karpinski, Almeroth, and Belding

  46. Algorithmic exploration • Many NMF variants exist in the literature • Not clear how useful they are on large data • Not clear how to calibrate them (i.e., the number of iterations to converge) • NMF algorithms combine linear algebra and optimization methods • Basic and “improved” NMF factorization algorithms implemented (see the sketch after this list): • Euclidean (Lee & Seung 2000) • K-L divergence (Lee & Seung 2000) • Semi-nonnegative (Ding et al. 2006) • Left/right-orthogonal (Ding et al. 2006) • Bi-orthogonal tri-factorization (Ding et al. 2006) • Sparse euclidean (Hoyer et al. 2002) • Sparse divergence (Liu et al. 2003) • Non-smooth (Pascual-Montano et al. 2006)
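  As a concrete anchor for the list above, a minimal M sketch of the simplest variant, the euclidean multiplicative updates of Lee & Seung (2000); the random initialization and fixed iteration count are illustrative, and real implementations add convergence tests and sparsity handling:

    function [W, H] = nmf_euclidean (A, k, maxit)
    % NMF_EUCLIDEAN : factor nonnegative A (m-by-n) as W*H with W m-by-k, H k-by-n
    [m, n] = size(A);
    W = rand(m, k);                                % nonnegative random initialization
    H = rand(k, n);
    for it = 1:maxit
      H = H .* (W' * A) ./ (W' * W * H + eps);     % multiplicative update keeps H >= 0
      W = W .* (A * H') ./ (W * (H * H') + eps);   % likewise for W
    end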

  47. NMF traffic analysis results NMF identifies essential components of the traffic Analyst labels different types of external behavior

  48. Sorting performance **Cheng, Shah, Gilbert, and Edelman

  49. Scaling Performance: cSSCA#2 on 128 cores • Timings scale well for large graphs: • 2x problem size -> 2x time • 2x problem size & 2x processors -> same time

  50. Agenda • Approach • Specifics • Examples • Use of accelerators
