transforming big data with d4m
Download
Skip this Video
Download Presentation
Transforming Big Data with D4M

Loading in 2 Seconds...

play fullscreen
1 / 43

Transforming Big Data with D4M - PowerPoint PPT Presentation


  • 121 Views
  • Uploaded on

Transforming Big Data with D4M. Jeremy Kepner MIT Lincoln Laboratory 3 October 2012.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Transforming Big Data with D4M' - thimba


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
transforming big data with d4m

Transforming Big Data with D4M

Jeremy Kepner

MIT Lincoln Laboratory

3 October 2012

This work is sponsored by the Department of the Air Force under Air Force Contract #FA8721-05-C-0002.  Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government.

acknowledgements
Acknowledgements
  • Nicholas Arcolano
  • Michelle Beard
  • Bob Bond
  • Josh Haines
  • Matthew Schmidt
  • Ben Miller
  • Benjamin O’Gwynn
  • Tamara Yu
  • Bill Arcand
  • Bill Bergeron
  • David Bestor
  • ChansupByun
  • Matt Hubbell
  • Pete Michaleas
  • Julie Mullen
  • AndyProut
  • Albert Reuther
  • Tony Rosa
  • Charles Yee
  • Dylan Hutchinson
outline
Outline
  • Introduction
  • Theory
  • Results
  • Summary
example applications of graph analytics
Example Applications of Graph Analytics

ISR

Social

Cyber

  • Graphs represent entities and relationships detected through multi-INT sources
  • 1,000s – 1,000,000s tracks and locations
  • GOAL: Identify anomalous patterns of life
  • Graphs represent relationships between individuals or documents
  • 10,000s – 10,000,000s individual and interactions
  • GOAL: Identify hidden social networks
  • Graphs represent communication patterns of computers on a network
  • 1,000,000s – 1,000,000,000s network events
  • GOAL: Detect cyber attacks or malicious software
  • Cross-Mission Challenge: Detection of subtle patterns in massivemulti-source noisy datasets
four ecosystems dominate cloud computing
Four Ecosystems Dominate Cloud Computing

Enterprise

Big Compute

- Interactive

- On-demand

- Elastic

- High performance

- Parallel Languages

- Scientific computing

- Java

- Map/Reduce

- Easy admin

- Indexing

- Search

- Security

Big Data

DBMS

  • Each ecosystem is at the center of a multi-$B market
  • Pros/cons of each are numerous; diverging hardware/software
  • Some missions can exist wholly in one ecosystem; some can’t
four ecosystems dominate cloud computing1
Four Ecosystems Dominate Cloud Computing

LLGrid

Enterprise

Big Compute

- Interactive

- On-demand

- Elastic

- High performance

- Parallel Languages

- Scientific computing

MapReduce

- Java

- Map/Reduce

- Easy admin

- Indexing

- Search

- Security

Big Data

DBMS

  • LLGridMapReduce provides map/reduce interface in a big compute environment
  • D4M provides an interactive parallel scientific computing environment to databases
big data big compute challenge
Big Data + Big Compute Challenge

Database Worldview

“It’s the data!”

Delivering data is the end

Supercomputing Worldview

“It’s the computer!”

Delivering data is the start

Shared Compute

Shared Data

Separate Compute

Separate Data

  • Database and supercomputing views are fundamentally different
  • Have never coexisted; do not know how to coexist
  • Big Data “Analytics” are forcing them together
  • Current standard practice duplicates hardware and data
big data big compute stack
Big Data + Big Compute Stack

Novel Analytics for:

Text, Cyber, Bio

Weak Signatures,

Noisy Data,

Dynamics

B

High Level Composable API: D4M (“Databases for Matlab”)

A

Array Algebra

C

E

Distributed Database/ Distributed File System

Distributed Database: Accumulo (triple store)

High Performance Computing: LLGrid + Hadoop

Interactive Super-computing

  • Combining Big Compute and Big Data enables entirely new domains
high level language d4m http www mit edu kepner d4m
High Level Language: D4Mhttp://www.mit.edu/~kepner/D4M

D4M

Dynamic

Distributed

Dimensional

Data

Model

  • Associative Arrays
  • Numerical Computing Environment

Distributed Database

B

A

C

Query:

Alice

Bob

Cathy

David

Earl

E

D

A D4M query returns a sparse matrix or a graph…

…for statistical signal processing or graph analysis in MATLAB

D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization

outline1
Outline
  • Introduction
  • Theory
    • Associate Arrays
    • Incidence Matrix
  • Results
  • Summary
what are spreadsheets and big tables
What are Spreadsheets and Big Tables?

Big Tables

Spreadsheets

  • Spreadsheets are the most commonly used analytical structure on Earth (100M users/day?)
  • Big Tables (Google, Amazon, …) store most of the analyzed data in the world (Exabytes?)
  • Simultaneous diverse data: strings, dates, integers, reals, …
  • Simultaneous diverse uses: matrices, functions, hash tables, databases, …
  • No formal mathematical basis; Zero papers in AMA or SIAM
d4m key concept associative arrays unify four abstractions
D4M Key Concept:Associative Arrays Unify Four Abstractions
  • Extends associative arrays to 2D and mixed data types

A(\'alice \',\'bob \') = \'cited \'

orA(\'alice \',\'bob \') = 47.0

  • Key innovation: 2D is 1-to-1 with triple store(\'alice \',\'bob \',\'cited \')

or(\'alice \',\'bob \',47.0)

ATx

x

AT

bob

bob

cited

carl

alice

cited

carl

alice

composable associative arrays
Composable Associative Arrays
  • Key innovation: mathematical closure
    • All associative array operations return associative arrays
  • Enables composable mathematical operations

A + B A - B A & B A|B A*B

  • Enables composable query operations via array indexing

A(\'alice bob \',:) A(\'alice\',:) A(\'al* \',:)

A(\'alice : bob \',:) A(1:2,:) A == 47.0

  • Simple to implement in a library (~2000 lines) in programming environments with: 1st class support of 2D arrays, operator overloading, sparse linear algebra
  • Complex queries with ~50x less effort than Java/SQL
  • Naturally leads to high performance parallel implementation
associative array definitions
Associative Array Definitions
  • Keys and values are from the infinite strict totally ordered set S
  • Associative arrayA(k) : SdS, k=(k1,…,kd), is a partial function from d keys (typically 2) to 1 value, where

A(ki) = vi and  otherwise

  • Binary operations on associative arrays A3 = A1 A2,

where  = f()orf(), have the properties

    • If A1(ki) = v1 and A2(ki) = v2, then A3(ki)is

v1f() v2 = f(v1,v2)orv1f() v2 = f(v1,v2)

    • IfA1(ki) = v orand A2(ki) = orv, then A3(ki)is

v f() = v orv f() = 

  • High level usage dictated by these definitions
  • Deeper algebraic properties set by the collision function f()
  • Frequent switching between “algebras” (how spreadsheets are used)
theory questions
Theory Questions
  • Associative arrays can be constructed from a few definitions
  • Similar to linear algebra, but applicable to a wider range of data
  • Key questions
    • Which linear algebra properties do apply to associative arrays (intuitive)
    • Which linear algebra properties do not apply to associative arrays (watch out)
    • Which associative array properties do not apply to linear algebra (new)

Associative

Arrays

Linear

Algebra

new

watch out

intuitive

references
References
  • Book: “Graph Algorithms in the Language of Linear Algebra”
  • Editors: Kepner (MIT-LL) and Gilbert (UCSB)
  • Contributors:
    • Bader (Ga Tech)
    • Bliss (MIT-LL)
    • Bond (MIT-LL)
    • Dunlavy (Sandia)
    • Faloutsos (CMU)
    • Fineman (CMU)
    • Gilbert (USCB)
    • Heitsch (Ga Tech)
    • Hendrickson (Sandia)
    • Kegelmeyer (Sandia)
    • Kepner (MIT-LL)
    • Kolda (Sandia)
    • Leskovec (CMU)
    • Madduri (Ga Tech)
    • Mohindra (MIT-LL)
    • Nguyen (MIT)
    • Radar (MIT-LL)
    • Reinhardt (Microsoft)
    • Robinson (MIT-LL)
    • Shah (USCB)
outline2
Outline
  • Introduction
  • Theory
    • Associate Arrays
    • Incidence Matrix
  • Results
  • Summary
the world is color
The World is Color

Artist: Ann Pibal; Painting: “XCRS”

5 edge colors
5 Edge Colors

Blue

Silver

Green

Orange

Pink

Artist: Ann Pibal; Painting: “XCRS”

20 vertices
20 Vertices

V12

V14

V3

V17

V8

V19

V13

V7

V20

V9

V11

V2

V6

V16

V5

V10

V1

V15

V4

V18

Artist: Ann Pibal; Painting: “XCRS”

1 isolated standard edge
1 Isolated Standard Edge

P4

Artist: Ann Pibal; Painting: “XCRS”

12 multi edges
12 Multi Edges

B1,S1,G1,O1,O2,P1

B2,S2,G2,O3,O4,P2

Artist: Ann Pibal; Painting: “XCRS”

18 hyper edges
18 Hyper Edges

P5

B1,S1,G1,O1,O2,P1

P8

B1,S1,G1,O1,O2,P1

B2,S2,G2,O3,O4,P2

B2,S2,G2,O3,O4,P2

O5

P7

P3

P6

Artist: Ann Pibal; Painting: “XCRS”

27 edge orderings
27 Edge Orderings

O5 < P3,P6,P7,P8

O5 < B1,S1,G1,O1,O2,P1

O5 < B2,S2,G2,O3,O4,P2 < P7,P8

P5

B1,S1,G1,O1,O2,P1

P8

B2,S2,G2,O3,O4,P2

O5

P7

P3

P6

Artist: Ann Pibal; Painting: “XCRS”

52 standard multi edges
52 Standard Multi Edges

P5x2

(B1,S1,G1,O1,O2,P1)x2

P8x2

(B2,S2,G2,O3,O4,P2)x4

O5x5

P7x2

P3x3

P6x2

Artist: Ann Pibal; Painting: “XCRS”

summary observations
Summary Observations
  • Standard edge representation fragments hyper edges
    • Information is lost
  • Digraph representation compresses multi-edges
    • Information is lost
  • Matrix representation drops edge labels
    • Information is lost
  • Standard graph representation drops edge order
    • Information is lost
  • Need edge representation that preserves information

Artist: Ann Pibal; Painting: “XCRS”

solution incidence matrix
Solution: Incidence Matrix

Artist: Ann Pibal; Painting: “XCRS”

outline3
Outline
  • Introduction
  • Theory
  • Results
    • Network monitoring example
    • Bioinformatics example
  • Summary
graph construction using d4m explode schema
Graph Construction Using D4M:Explode Schema

Raw Data

CSV Files

Distributed Database

Assoc.Arrays

Create columns for each unique type/value pair

Dense Table

Use as row indices

Exploded Table

graph construction using d4m storing exploded data as triples
Graph Construction Using D4M:Storing Exploded Data as Triples

Raw Data

CSV Files

Distributed Database

Assoc.Arrays

Exploded Table

D4M stores the triple data representing both the exploded table and its transpose

Table Triples

Table Transpose Triples

graph construction using d4m construct associative arrays
Graph Construction Using D4M:Construct Associative Arrays

Raw Data

CSV Files

Distributed Database

Assoc.Arrays

D4M Query #1

keys = T(:,’time_stamp|10/May/2011:00:00:00’,:, ... ’time_stamp|13/May/2011:23:59:59’,);

(‘log_id|001’,‘time_stamp|11/May/2011:09:52:53’,1)

(‘log_id|002’,‘time_stamp|12/May/2011:13:24:11’,1)

(‘log_id|003’,‘time_stamp|13/May/2011:11:05:12’,1)

...

graph construction using d4m construct associative arrays1
Graph Construction Using D4M:Construct Associative Arrays

Raw Data

CSV Files

Distributed Database

Assoc.Arrays

D4M Query #1

keys = T(:,’time_stamp|10/May/2011:00:00:00’,:, ... ’time_stamp|13/May/2011:23:59:59’,);

D4M Query #2

data = T(Row(keys), :);

(‘log_id|001’,‘server_ip|208.29.69.138’,1)

(‘log_id|001’,‘src_ip|128.0.0.1’,1)

(‘log_id|001’,‘time_stamp|11/May/2011:09:52:53’,1)

...

(‘log_id|002’,‘server_ip|157.166.255.18’,1)

(‘log_id|002’,‘src_ip|192.168.1.2’,1)

(‘log_id|002’,‘time_stamp|12/May/2011:13:24:11’,1)

...

(‘log_id|003’,‘server_ip|74.125.224.72’,1)

(‘log_id|003’,‘src_ip|128.0.0.1’,1)

(‘log_id|003’,‘time_stamp|13/May/2011:11:05:12’,1)

...

graph construction using d4m construct associative arrays2
Graph Construction Using D4M:Construct Associative Arrays

Raw Data

Raw Data

CSV Files

CSV Files

Distributed Database

Distributed Database

Assoc.Arrays

Assoc.Arrays

D4M Query #1

keys = T(:,’time_stamp|10/May/2011:00:00:00’,:, ... ’time_stamp|13/May/2011:23:59:59’,);

D4M Query #2

data = T(Row(keys), :);

Associative Array Algebra

G = data(:,’src_ip|*’).’ * data(:,’server_ip|*’);

(‘src_ip|128.0.0.1’,‘server_ip|208.29.69.138’,1)

(‘src_ip|128.0.0.1’,‘server_ip|74.125.224.72’,1)

(‘src_ip|192.168.1.2’,‘server_ip|157.166.255.18’,1)

...

graph construction using d4m construct associative arrays3
Graph Construction Using D4M:Construct Associative Arrays

Raw Data

CSV Files

Distributed Database

Assoc.Arrays

D4M Query #1

keys = T(:,’time_stamp|10/May/2011:00:00:00’,:, ... ’time_stamp|13/May/2011:23:59:59’,);

D4M Query #2

data = T(Row(keys), :);

Associative Array Algebra

G = data(:,’src_ip|*’).’ * data(:,’server_ip|*’);

Adj(G);

  • Graphs can be constructed with minimal effort using D4M queries and associative array algebra
accumulo ingestion scalability study llgrid mapreduce with a python application
Accumulo Ingestion Scalability StudyLLGridMapReduce With A Python Application

Accumulo Database: 1 Master + 7 Tablet servers

4 Mil e/s

Data #1:

5 GB of 200 files

Data #2:

30 GB of 1000 files

outline4
Outline
  • Introduction
  • Theory
  • Results
    • Network monitoring example
    • Bioinformatics example
  • Summary
relative cost per dna sequence
Relative Cost per DNA Sequence

Big Data

Energy Efficient

Portable

Sequencer

High Volume Sequencer

Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program Available at: www.genome.gov/sequencingcosts. Accessed 03/08/2012

example disease outbreak
Example Disease Outbreak

May-July 2011 - Virulent E. Coli Outbreak Germany

Outbreak

identified

Spanish

Cucumbers

implicated

diarrhea

kidney

DNA

Sequence

released

Sprouts

Identified

Deaths

www.rki.de EHEC final report

Conclusions: Identification of E. Coli source too late to have substantial impact on illnesses

Publishing sequence data allowed for broad community to fully characterize pathogen

Sequencing and crowd source analysis showed promising potential -> Still too slow

sequence matching graph sparse matrix multiply in d4m
Sequence Matching  Graph  Sparse Matrix Multiply in D4M

Collected Sample

RNA Reference Set

reference

bacteria

unknown

bacteria

A1

A2

unknown

sequence ID

reference

sequence ID

A1 A2\'

sequence word (10mer)

sequence word (10mer)

reference

sequence ID

unknown sequence ID

  • Associative arrays provide a natural framework for sequence matching
database automatically computes reference 10mer distribution
Database Automatically ComputesReference 10mer Distribution

5%

0.5%

50%

  • Using 10mer distribution can quickly select reference 10mers that maximally differentiate sample sequences and eliminate most 10mers
leveraging big data technologies for high speed sequence m atching
Leveraging “Big Data” Technologies for High Speed Sequence Matching

100x smaller

D4M

  • High performance triple store database trades computations for lookups
  • Used Apache Accumulo database to accelerate comparison by 100x
  • Used Lincoln D4M software to reduce code size by 100x

BLAST

100x faster

D4M +

Triple Store

summary
Summary
  • Big data is found across a wide range of areas
    • Document analysis
    • Computer network analysis
    • DNA Sequencing
  • Currently there is a gap in big data analysis tools for algorithm developers
  • D4M fills this gap by providing algorithm developers composable associative arrays that admit linear algebraic manipulation
ad