Transforming Big Data with D4M

Transforming Big Data with D4M. Jeremy Kepner MIT Lincoln Laboratory 3 October 2012.

Presentation Transcript


  1. Transforming Big Data with D4M Jeremy Kepner MIT Lincoln Laboratory 3 October 2012 This work is sponsored by the Department of the Air Force under Air Force Contract #FA8721-05-C-0002.  Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government.

  2. Acknowledgements • Nicholas Arcolano • Michelle Beard • Bob Bond • Josh Haines • Matthew Schmidt • Ben Miller • Benjamin O’Gwynn • Tamara Yu • Bill Arcand • Bill Bergeron • David Bestor • Chansup Byun • Matt Hubbell • Pete Michaleas • Julie Mullen • Andy Prout • Albert Reuther • Tony Rosa • Charles Yee • Dylan Hutchinson

  3. Outline • Introduction • Theory • Results • Summary

  4. Example Applications of Graph Analytics • ISR: Graphs represent entities and relationships detected through multi-INT sources; 1,000s – 1,000,000s tracks and locations; GOAL: Identify anomalous patterns of life • Social: Graphs represent relationships between individuals or documents; 10,000s – 10,000,000s individuals and interactions; GOAL: Identify hidden social networks • Cyber: Graphs represent communication patterns of computers on a network; 1,000,000s – 1,000,000,000s network events; GOAL: Detect cyber attacks or malicious software • Cross-Mission Challenge: Detection of subtle patterns in massive multi-source noisy datasets

  5. Four Ecosystems Dominate Cloud Computing • Enterprise: Interactive, On-demand, Elastic • Big Compute: High performance, Parallel Languages, Scientific computing • Big Data: Java, Map/Reduce, Easy admin • DBMS: Indexing, Search, Security • Each ecosystem is at the center of a multi-$B market • Pros/cons of each are numerous; diverging hardware/software • Some missions can exist wholly in one ecosystem; some can’t

  6. Four Ecosystems Dominate Cloud Computing • Enterprise: Interactive, On-demand, Elastic • Big Compute: High performance, Parallel Languages, Scientific computing • Big Data: Java, Map/Reduce, Easy admin • DBMS: Indexing, Search, Security • LLGridMapReduce provides a map/reduce interface in a big compute environment • D4M provides an interactive parallel scientific computing environment to databases

  7. Big Data + Big Compute Challenge • Database Worldview: “It’s the data!” Delivering data is the end • Supercomputing Worldview: “It’s the computer!” Delivering data is the start • [Figure labels: Shared Compute, Shared Data, Separate Compute, Separate Data] • Database and supercomputing views are fundamentally different • Have never coexisted; do not know how to coexist • Big Data “Analytics” are forcing them together • Current standard practice duplicates hardware and data

  8. Big Data + Big Compute Stack • Novel Analytics for: Text, Cyber, Bio • Weak Signatures, Noisy Data, Dynamics • High Level Composable API: D4M (“Databases for Matlab”) • Array Algebra • Distributed Database/Distributed File System • Distributed Database: Accumulo (triple store) • High Performance Computing: LLGrid + Hadoop • Interactive Super-computing • Combining Big Compute and Big Data enables entirely new domains

  9. High Level Language: D4M (http://www.mit.edu/~kepner/D4M) • D4M: Dynamic Distributed Dimensional Data Model • Associative Arrays • Numerical Computing Environment • Distributed Database • [Figure: graph with vertices A, B, C, D, E; Query: Alice, Bob, Cathy, David, Earl] • A D4M query returns a sparse matrix or a graph for statistical signal processing or graph analysis in MATLAB • D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization

  10. Outline • Introduction • Theory • Associative Arrays • Incidence Matrix • Results • Summary

  11. What are Spreadsheets and Big Tables? • Spreadsheets are the most commonly used analytical structure on Earth (100M users/day?) • Big Tables (Google, Amazon, …) store most of the analyzed data in the world (Exabytes?) • Simultaneous diverse data: strings, dates, integers, reals, … • Simultaneous diverse uses: matrices, functions, hash tables, databases, … • No formal mathematical basis; zero papers in AMA or SIAM

  12. D4M Key Concept: Associative Arrays Unify Four Abstractions • Extends associative arrays to 2D and mixed data types: A('alice ','bob ') = 'cited ' or A('alice ','bob ') = 47.0 • Key innovation: 2D is 1-to-1 with triple store: ('alice ','bob ','cited ') or ('alice ','bob ',47.0) • [Figure labels: A^T x, x, A^T; alice, bob, carl; cited]
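The one-to-one correspondence between a 2D associative array and a triple store can be illustrated with a minimal plain-Python sketch. This is not the D4M MATLAB API; the helper names `from_triples`, `to_triples`, and `transpose` are hypothetical, and the second triple is made-up sample data.

```python
# Sketch: a 2D associative array as a dict keyed by (row, col).
# Plain Python for illustration; this is not the D4M API.

def from_triples(triples):
    """Build a {(row, col): value} mapping from (row, col, value) triples."""
    return {(r, c): v for r, c, v in triples}

def to_triples(assoc):
    """Flatten the mapping back into sorted (row, col, value) triples."""
    return sorted((r, c, v) for (r, c), v in assoc.items())

def transpose(assoc):
    """Swap row and column keys."""
    return {(c, r): v for (r, c), v in assoc.items()}

# The second triple is invented sample data.
triples = [("alice", "bob", "cited"), ("alice", "carl", "cited")]
A = from_triples(triples)

print(A[("alice", "bob")])        # cited
print(to_triples(A) == triples)   # True
print(to_triples(transpose(A)))
```

Because every entry is exactly one (row, column, value) triple, converting in either direction loses nothing, which is what makes a triple store a natural backing store for such arrays.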

  13. Composable Associative Arrays • Key innovation: mathematical closure • All associative array operations return associative arrays • Enables composable mathematical operations: A + B, A - B, A & B, A | B, A * B • Enables composable query operations via array indexing: A('alice bob ',:), A('alice',:), A('al* ',:), A('alice : bob ',:), A(1:2,:), A == 47.0 • Simple to implement in a library (~2000 lines) in programming environments with first-class support of 2D arrays, operator overloading, and sparse linear algebra • Complex queries with ~50x less effort than Java/SQL • Naturally leads to high performance parallel implementation
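Closure is what lets queries chain: each query returns the same kind of object it was given. The sketch below shows the idea in plain Python with dicts keyed by (row, col). It does not use D4M's indexing syntax; the function names and sample values are hypothetical.

```python
# Sketch: row queries that always return another associative array.
# Plain Python dicts keyed by (row, col); not the D4M API.

def rows_in(assoc, keys):
    """Select rows whose key is in an explicit collection."""
    return {(r, c): v for (r, c), v in assoc.items() if r in keys}

def rows_prefix(assoc, prefix):
    """Select rows whose key starts with a prefix (in the spirit of 'al*')."""
    return {(r, c): v for (r, c), v in assoc.items() if r.startswith(prefix)}

def rows_range(assoc, lo, hi):
    """Select rows with lo <= key <= hi in string order."""
    return {(r, c): v for (r, c), v in assoc.items() if lo <= r <= hi}

def values_equal(assoc, target):
    """Keep only entries whose value equals target."""
    return {k: v for k, v in assoc.items() if v == target}

# Invented sample data.
A = {
    ("alice", "bob"): 47.0,
    ("bob", "carl"): 1.0,
    ("carl", "alice"): 47.0,
}

# Every query returns the same kind of object, so queries compose.
result = values_equal(rows_range(A, "alice", "bob"), 47.0)
print(result)   # {('alice', 'bob'): 47.0}
```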

  14. Associative Array Definitions • Keys and values are from the infinite strict totally ordered set $S$ • Associative array $A(k) : S^d \to S$, $k = (k_1, \ldots, k_d)$, is a partial function from $d$ keys (typically 2) to 1 value, where $A(k_i) = v_i$ and $\emptyset$ otherwise • Binary operations on associative arrays $A_3 = A_1 \oplus A_2$, where $\oplus = \cup_{f()}$ or $\cap_{f()}$, have the properties • If $A_1(k_i) = v_1$ and $A_2(k_i) = v_2$, then $A_3(k_i)$ is $v_1 \cup_{f()} v_2 = f(v_1, v_2)$ or $v_1 \cap_{f()} v_2 = f(v_1, v_2)$ • If $A_1(k_i) = v$ or $\emptyset$ and $A_2(k_i) = \emptyset$ or $v$, then $A_3(k_i)$ is $v \cup_{f()} \emptyset = v$ or $v \cap_{f()} \emptyset = \emptyset$ • High level usage dictated by these definitions • Deeper algebraic properties set by the collision function f() • Frequent switching between “algebras” (how spreadsheets are used)
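A minimal sketch of these binary operations, reading the two families as union-like (a lone value survives) and intersection-like (a lone value is dropped), with the collision function f() applied only where both operands hold a value. The names `union_f` and `intersect_f` and the sample arrays are assumptions made for illustration.

```python
# Sketch of the two binary-operation families on associative arrays.
# A key that is absent from a dict plays the role of the empty value.
import operator

def union_f(a1, a2, f):
    """Keys from either operand; f resolves collisions; v with empty gives v."""
    out = dict(a1)
    for k, v2 in a2.items():
        out[k] = f(out[k], v2) if k in out else v2
    return out

def intersect_f(a1, a2, f):
    """Only keys present in both operands; v with empty gives empty."""
    return {k: f(v1, a2[k]) for k, v1 in a1.items() if k in a2}

# Invented sample data.
A1 = {("alice", "bob"): 2, ("alice", "carl"): 5}
A2 = {("alice", "bob"): 3, ("bob", "carl"): 7}

print(union_f(A1, A2, operator.add))
print(intersect_f(A1, A2, operator.add))
print(union_f(A1, A2, max))
```

Swapping `operator.add` for `max` changes the algebra without touching the operation itself, which mirrors the slide's point that the collision function sets the deeper algebraic properties.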

  15. Theory Questions • Associative arrays can be constructed from a few definitions • Similar to linear algebra, but applicable to a wider range of data • Key questions: • Which linear algebra properties do apply to associative arrays? (intuitive) • Which linear algebra properties do not apply to associative arrays? (watch out) • Which associative array properties do not apply to linear algebra? (new) • [Figure labels: Associative Arrays, Linear Algebra; regions: new, watch out, intuitive]

  16. References • Book: “Graph Algorithms in the Language of Linear Algebra” • Editors: Kepner (MIT-LL) and Gilbert (UCSB) • Contributors: • Bader (Ga Tech) • Bliss (MIT-LL) • Bond (MIT-LL) • Dunlavy (Sandia) • Faloutsos (CMU) • Fineman (CMU) • Gilbert (UCSB) • Heitsch (Ga Tech) • Hendrickson (Sandia) • Kegelmeyer (Sandia) • Kepner (MIT-LL) • Kolda (Sandia) • Leskovec (CMU) • Madduri (Ga Tech) • Mohindra (MIT-LL) • Nguyen (MIT) • Radar (MIT-LL) • Reinhardt (Microsoft) • Robinson (MIT-LL) • Shah (UCSB)

  17. Outline • Introduction • Theory • Associative Arrays • Incidence Matrix • Results • Summary

  18. Digraphs are Black & White

  19. The World is Color Artist: Ann Pibal; Painting: “XCRS”

  20. 5 Edge Colors • Blue • Silver • Green • Orange • Pink

  21. 20 Vertices • Vertices labeled V1 through V20

  22. 1 Isolated Standard Edge • P4

  23. 12 Multi Edges • B1,S1,G1,O1,O2,P1 • B2,S2,G2,O3,O4,P2

  24. 18 Hyper Edges • P5 • B1,S1,G1,O1,O2,P1 • P8 • B2,S2,G2,O3,O4,P2 • O5 • P7 • P3 • P6

  25. 27 Edge Orderings • O5 < P3,P6,P7,P8 • O5 < B1,S1,G1,O1,O2,P1 • O5 < B2,S2,G2,O3,O4,P2 < P7,P8 • [Figure labels: P5; B1,S1,G1,O1,O2,P1; P8; B2,S2,G2,O3,O4,P2; O5; P7; P3; P6]

  26. 52 Standard Multi Edges • P5x2 • (B1,S1,G1,O1,O2,P1)x2 • P8x2 • (B2,S2,G2,O3,O4,P2)x4 • O5x5 • P7x2 • P3x3 • P6x2

  27. Summary Observations • Standard edge representation fragments hyper edges: information is lost • Digraph representation compresses multi-edges: information is lost • Matrix representation drops edge labels: information is lost • Standard graph representation drops edge order: information is lost • Need edge representation that preserves information

  28. Solution: Incidence Matrix
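To make the contrast concrete, here is a small sketch of an incidence-style representation with one row per edge, next to the vertex-by-vertex projection that an adjacency matrix keeps. The edge labels reuse names from the slides, but the endpoints assigned to them are hypothetical and are not read off the painting.

```python
# Sketch: an incidence representation with one row per edge.
# Endpoints are hypothetical, chosen only to show each edge kind.
from collections import Counter

# Each edge row lists the vertices it leaves ("out") and enters ("in").
# Rows live in a list, so edge order is preserved.
edges = [
    ("O5", {"out": ["V1"], "in": ["V2"]}),
    ("B1", {"out": ["V1"], "in": ["V2"]}),              # multi edge: same endpoints as O5
    ("P5", {"out": ["V3"], "in": ["V4", "V5", "V6"]}),  # hyper edge: more than two vertices
]

def adjacency(edge_rows):
    """Project to vertex-by-vertex counts (the product E_out' * E_in)."""
    adj = Counter()
    for _label, ends in edge_rows:
        for u in ends["out"]:
            for v in ends["in"]:
                adj[(u, v)] += 1
    return adj

adj = adjacency(edges)
print(adj[("V1", "V2")])   # 2
print(len(adj))            # 4
```

The projection reduces the two labeled edges between V1 and V2 to a count of 2 and splits P5 into three pairwise entries, while the edge rows themselves still carry labels, order, multiplicity, and hyper edges.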

  29. Outline • Introduction • Theory • Results • Network monitoring example • Bioinformatics example • Summary

  30. Graph Construction Using D4M: Explode Schema • Pipeline: Raw Data → CSV Files → Distributed Database → Assoc. Arrays • Dense Table: use as row indices • Exploded Table: create columns for each unique type/value pair
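A minimal sketch of the explode step in plain Python, assuming the dense table is a list of records and that `log_id` supplies the row index, as in the log example used on the later slides. The function name `explode` and the `field|value` key format are modeled on those slides, not taken from the D4M source.

```python
# Sketch: explode a dense table into type|value columns.
# Field names and values follow the log example on the later slides.

dense = [
    {"log_id": "001", "src_ip": "128.0.0.1", "server_ip": "208.29.69.138"},
    {"log_id": "002", "src_ip": "192.168.1.2", "server_ip": "157.166.255.18"},
]

def explode(rows, key_field):
    """Return {(row_key, 'field|value'): 1} for every non-key field."""
    exploded = {}
    for row in rows:
        row_key = f"{key_field}|{row[key_field]}"
        for field, value in row.items():
            if field != key_field:
                exploded[(row_key, f"{field}|{value}")] = 1
    return exploded

E = explode(dense, "log_id")
for (r, c), v in sorted(E.items()):
    print(r, c, v)
```

Each distinct field/value pair becomes its own column, so the exploded table is very sparse and every stored value is simply 1.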

  31. Graph Construction Using D4M: Storing Exploded Data as Triples • Pipeline: Raw Data → CSV Files → Distributed Database → Assoc. Arrays • D4M stores the triple data representing both the exploded table and its transpose • [Figure labels: Exploded Table; Table Triples; Table Transpose Triples]
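One plausible reason for storing the transpose as well is that a row-keyed store can then answer column lookups directly. The sketch below uses plain dicts as stand-ins for the two database tables; it is not the Accumulo or D4M API, and the third log entry is sample data in the style of the later slides.

```python
# Sketch: keep triples for the table and for its transpose, so that
# lookups by row key and by column key are both direct.
# Plain dicts stand in for database tables; this is not the Accumulo API.
from collections import defaultdict

triples = [
    ("log_id|001", "src_ip|128.0.0.1", 1),
    ("log_id|002", "src_ip|192.168.1.2", 1),
    ("log_id|003", "src_ip|128.0.0.1", 1),
]

table = defaultdict(dict)     # row key -> {column key: value}
table_t = defaultdict(dict)   # column key -> {row key: value}
for row, col, val in triples:
    table[row][col] = val
    table_t[col][row] = val

# Finding every log entry for one source IP is a single keyed lookup
# in the transpose, with no scan over all rows.
print(sorted(table_t["src_ip|128.0.0.1"]))   # ['log_id|001', 'log_id|003']
```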

  32. Graph Construction Using D4M: Construct Associative Arrays • Pipeline: Raw Data → CSV Files → Distributed Database → Assoc. Arrays • D4M Query #1: keys = T(:,'time_stamp|10/May/2011:00:00:00,:,time_stamp|13/May/2011:23:59:59,'); • Result: ('log_id|001','time_stamp|11/May/2011:09:52:53',1) ('log_id|002','time_stamp|12/May/2011:13:24:11',1) ('log_id|003','time_stamp|13/May/2011:11:05:12',1) ...

  33. Graph Construction Using D4M: Construct Associative Arrays • Pipeline: Raw Data → CSV Files → Distributed Database → Assoc. Arrays • D4M Query #1: keys = T(:,'time_stamp|10/May/2011:00:00:00,:,time_stamp|13/May/2011:23:59:59,'); • D4M Query #2: data = T(Row(keys), :); • Result: ('log_id|001','server_ip|208.29.69.138',1) ('log_id|001','src_ip|128.0.0.1',1) ('log_id|001','time_stamp|11/May/2011:09:52:53',1) ... ('log_id|002','server_ip|157.166.255.18',1) ('log_id|002','src_ip|192.168.1.2',1) ('log_id|002','time_stamp|12/May/2011:13:24:11',1) ... ('log_id|003','server_ip|74.125.224.72',1) ('log_id|003','src_ip|128.0.0.1',1) ('log_id|003','time_stamp|13/May/2011:11:05:12',1) ...

  34. Graph Construction Using D4M: Construct Associative Arrays • Pipeline: Raw Data → CSV Files → Distributed Database → Assoc. Arrays • D4M Query #1: keys = T(:,'time_stamp|10/May/2011:00:00:00,:,time_stamp|13/May/2011:23:59:59,'); • D4M Query #2: data = T(Row(keys), :); • Associative Array Algebra: G = data(:,'src_ip|*').' * data(:,'server_ip|*'); • Result: ('src_ip|128.0.0.1','server_ip|208.29.69.138',1) ('src_ip|128.0.0.1','server_ip|74.125.224.72',1) ('src_ip|192.168.1.2','server_ip|157.166.255.18',1) ...

  35. Graph Construction Using D4M: Construct Associative Arrays • Pipeline: Raw Data → CSV Files → Distributed Database → Assoc. Arrays • D4M Query #1: keys = T(:,'time_stamp|10/May/2011:00:00:00,:,time_stamp|13/May/2011:23:59:59,'); • D4M Query #2: data = T(Row(keys), :); • Associative Array Algebra: G = data(:,'src_ip|*').' * data(:,'server_ip|*'); Adj(G); • Graphs can be constructed with minimal effort using D4M queries and associative array algebra
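The three steps (time-range query, row fetch, transpose-multiply) can be traced on the sample triples from the slides with a plain-Python sketch. It imitates the logic only and is not D4M code; `T`, `keys`, `data`, and `G` mirror the slide's variable names. The range test relies on string order, which works for this sample because all timestamps fall in the same month.

```python
# Sketch of the three steps on the sample triples from the slides.
# Plain Python; T, keys, data, and G mirror the slide's variable names.
from collections import Counter

T = [
    ("log_id|001", "server_ip|208.29.69.138", 1),
    ("log_id|001", "src_ip|128.0.0.1", 1),
    ("log_id|001", "time_stamp|11/May/2011:09:52:53", 1),
    ("log_id|002", "server_ip|157.166.255.18", 1),
    ("log_id|002", "src_ip|192.168.1.2", 1),
    ("log_id|002", "time_stamp|12/May/2011:13:24:11", 1),
    ("log_id|003", "server_ip|74.125.224.72", 1),
    ("log_id|003", "src_ip|128.0.0.1", 1),
    ("log_id|003", "time_stamp|13/May/2011:11:05:12", 1),
]

lo = "time_stamp|10/May/2011:00:00:00"
hi = "time_stamp|13/May/2011:23:59:59"

# Query #1: rows that have a column key inside the time range (string order).
keys = {r for r, c, _ in T if lo <= c <= hi}

# Query #2: every triple belonging to those rows.
data = [(r, c, v) for r, c, v in T if r in keys]

# Algebra: src' * server, summing products over shared log_id rows.
G = Counter()
for r1, c1, v1 in data:
    if not c1.startswith("src_ip|"):
        continue
    for r2, c2, v2 in data:
        if r2 == r1 and c2.startswith("server_ip|"):
            G[(c1, c2)] += v1 * v2

for (src, dst), n in sorted(G.items()):
    print(src, dst, n)
```

The three entries of `G` match the source-to-server triples listed on slide 34.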

  36. Accumulo Ingestion Scalability Study: LLGridMapReduce With A Python Application • Accumulo Database: 1 Master + 7 Tablet servers • 4 Mil e/s • Data #1: 5 GB of 200 files • Data #2: 30 GB of 1000 files

  37. Outline • Introduction • Theory • Results • Network monitoring example • Bioinformatics example • Summary

  38. Relative Cost per DNA Sequence • [Figure labels: Big Data; Energy Efficient; Portable Sequencer; High Volume Sequencer] • Source: Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program. Available at: www.genome.gov/sequencingcosts. Accessed 03/08/2012

  39. Example Disease Outbreak • May-July 2011: Virulent E. coli Outbreak, Germany • Timeline events: Outbreak identified; Spanish cucumbers implicated; DNA sequence released; Sprouts identified • [Figure labels: diarrhea, kidney, Deaths; source: www.rki.de EHEC final report] • Conclusions: • Identification of E. coli source too late to have substantial impact on illnesses • Publishing sequence data allowed for broad community to fully characterize pathogen • Sequencing and crowd source analysis showed promising potential -> Still too slow

  40. Sequence Matching, Graph, Sparse Matrix Multiply in D4M • Collected Sample: unknown bacteria • RNA Reference Set: reference bacteria • A1: unknown sequence ID by sequence word (10mer) • A2: reference sequence ID by sequence word (10mer) • A1 A2': unknown sequence ID by reference sequence ID • Associative arrays provide a natural framework for sequence matching
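As a rough illustration of sequence matching as a sparse product A1 A2', the sketch below builds two sequence-by-word arrays and counts the words each pair of sequences shares. The sequences and IDs are invented, the word length is shortened from 10 to 4 for readability, and the helper names are hypothetical; this is plain Python, not D4M.

```python
# Sketch: match unknown sequences to references by shared k-mers.
# Sequences and IDs are made up; k is shortened from 10 to 4 for readability.
from collections import Counter

def kmers(seq, k):
    """Set of distinct length-k words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def incidence(seqs, k):
    """{(sequence_id, word): 1}, a sparse sequence-by-word array."""
    return {(sid, w): 1 for sid, s in seqs.items() for w in kmers(s, k)}

def multiply_transpose(a1, a2):
    """A1 * A2': count words shared by each (row of A1, row of A2) pair."""
    by_word = {}
    for (sid, w), v in a2.items():
        by_word.setdefault(w, []).append((sid, v))
    out = Counter()
    for (sid1, w), v1 in a1.items():
        for sid2, v2 in by_word.get(w, []):
            out[(sid1, sid2)] += v1 * v2
    return out

unknown = {"u1": "ACGTACGG"}
reference = {"r1": "TTACGTAC", "r2": "GGGGCCCC"}

A1 = incidence(unknown, 4)
A2 = incidence(reference, 4)
matches = multiply_transpose(A1, A2)
print(matches)   # Counter({('u1', 'r1'): 4})
```

Only pairs that share at least one word appear in the result, so the output stays sparse even when the reference set is large.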

  41. Database Automatically Computes Reference 10mer Distribution • [Figure labels: 5%, 0.5%, 50%] • Using the 10mer distribution, one can quickly select the reference 10mers that maximally differentiate sample sequences and eliminate most 10mers
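One way to picture selecting differentiating words from a distribution is to count how many reference sequences contain each word and discard the common ones. The sketch below is a guess at the general idea only: the data is invented, words are 4 letters rather than 10, and the 50% cutoff is an arbitrary illustration that is not tied to the percentages in the figure.

```python
# Sketch: count how many reference sequences contain each word, then keep
# only the rarer words. The 50% cutoff here is an arbitrary illustration.
from collections import Counter

# Invented data: reference sequence ID -> set of words it contains.
reference_words = {
    "r1": {"ACGT", "CGTA", "GTAC"},
    "r2": {"ACGT", "TTTT"},
    "r3": {"ACGT", "CGTA", "GGGG"},
    "r4": {"ACGT", "CCCC"},
}

# Word -> number of reference sequences containing it (a column sum).
counts = Counter(w for words in reference_words.values() for w in words)

max_fraction = 0.5
n_refs = len(reference_words)
selective = sorted(w for w, n in counts.items() if n / n_refs < max_fraction)
print(selective)   # ['CCCC', 'GGGG', 'GTAC', 'TTTT']
```

A word present in every reference, such as `ACGT` here, cannot tell references apart, so dropping it shrinks the comparison without losing discriminating power.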

  42. Leveraging “Big Data” Technologies for High Speed Sequence Matching • High performance triple store database trades computations for lookups • Used Apache Accumulo database to accelerate comparison by 100x • Used Lincoln D4M software to reduce code size by 100x • [Figure labels: BLAST; D4M: 100x smaller; D4M + Triple Store: 100x faster]

  43. Summary • Big data is found across a wide range of areas • Document analysis • Computer network analysis • DNA Sequencing • Currently there is a gap in big data analysis tools for algorithm developers • D4M fills this gap by providing algorithm developers composable associative arrays that admit linear algebraic manipulation
