
Embracing the Data Deluge: Data-Intensive Computing for the Masses

Jimmy Lin, University of Maryland. Tuesday, July 13, 2010.


Presentation Transcript


  1. Embracing the Data Deluge: Data-Intensive Computing for the Masses Jimmy Lin University of Maryland Tuesday, July 13, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

  2. Introduction • We live in a world of large data… • Staying relevant requires embracing it! • In text processing… • Emergence and dominance of empirical, data-driven research • Constant danger: uninteresting conclusions on “toy” datasets (or experiments taking forever) • In the natural sciences… • Emergence of the 4th Paradigm: data-intensive eScience • Difficult computer science problems! • How do we practically scale to large datasets? • Case study in text processing: statistical machine translation • Case study in bioinformatics: DNA sequence alignment

  3. How much data? • Google processes 20 PB a day (2008) • Wayback Machine has 3 PB + 100 TB/month (3/2009) • eBay has 6.5 PB of user data + 50 TB/day (5/2009) • Facebook has 36 PB of user data + 80-90 TB/day (6/2010) • CERN’s LHC: 15 PB a year (any day now) • LSST: 6-10 PB a year (~2015) “640K ought to be enough for anybody.”

  4. No data like more data! s/knowledge/data/g; (Banko and Brill, ACL 2001; Brants et al., EMNLP 2007) How do we get here if we’re not Google?

  5. Path to data nirvana? cheap commodity clusters (or utility computing) + simple, distributed programming models = data-intensive computing for the masses! Why is this different? Source: flickr (turtlemom_nancy/2046347762)

  6. Parallel computing is hard! Fundamental issues: scheduling, data distribution, synchronization, inter-process communication, robustness, fault tolerance, … Architectural issues: Flynn’s taxonomy (SIMD, MIMD, etc.), network topology, bisection bandwidth, UMA vs. NUMA, cache coherence. Different programming models: message passing vs. shared memory. Different programming constructs: mutexes, condition variables, barriers, …; masters/slaves, producers/consumers, work queues, … Common problems: livelock, deadlock, data starvation, priority inversion, …; dining philosophers, sleeping barbers, cigarette smokers, … The reality: the programmer shoulders the burden of managing concurrency… (I want my students developing new machine learning algorithms, not debugging race conditions)

  7. Source: Ricardo Guimarães Herrmann

  8. Source: MIT Open Courseware

  9. The datacenter is the computer! Source: NY Times (6/14/2006)

  10. MapReduce

  11. MapReduce • Functional programming meets distributed processing • Independent per-record processing in parallel • Aggregation of intermediate results to generate final output • Programmers specify two functions: map (k, v) → <k’, v’>* reduce (k’, v’) → <k’, v’>* • All values with the same key are sent to the same reducer • The execution framework handles everything else… • Handles scheduling • Handles data management, transport, etc. • Handles synchronization • Handles errors and faults
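To make the two-function contract concrete, here is a minimal word-count sketch in the map/reduce style just described. It is an illustration only: the function names and the in-memory shuffle are my own, not Hadoop’s or Google’s actual API, and a real framework would distribute all of this across machines.

```python
from collections import defaultdict

# map(k, v) -> <k', v'>*: emit (word, 1) for every word in a line
def map_fn(key, line):
    return [(word, 1) for word in line.split()]

# reduce(k', [v']) -> <k', v'>*: sum the counts for one word
def reduce_fn(word, counts):
    return [(word, sum(counts))]

def run_mapreduce(records, map_fn, reduce_fn):
    # Shuffle and sort: group intermediate values by key.
    # A real framework does this across machines; here it is a dict.
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    output = []
    for key in sorted(groups):  # all values for one key go to one reducer
        output.extend(reduce_fn(key, groups[key]))
    return output

docs = [(0, "it was the best of times"), (1, "it was the worst of times")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# [('best', 1), ('it', 2), ('of', 2), ('the', 2), ('times', 2), ('was', 2), ('worst', 1)]
```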

  12. [Dataflow diagram] Four map tasks process input pairs (k1, v1) … (k6, v6) in parallel and emit intermediate pairs: (a, 1), (b, 2), (c, 3), (c, 6), (a, 5), (c, 2), (b, 7), (c, 8). Shuffle and sort aggregates values by key: a → [1, 5], b → [2, 7], c → [2, 3, 6, 8]. Three reducers then each process one key group and emit the final output pairs (r1, s1), (r2, s2), (r3, s3).

  13. [Execution overview, adapted from Dean and Ghemawat, OSDI 2004] (1) The user program submits the job to the master. (2) The master schedules map tasks and reduce tasks onto workers. (3) Map workers read the input splits (split 0 … split 4). (4) Map output is written to local disk as intermediate files. (5) Reduce workers remote-read the intermediate data. (6) Reduce workers write the final output files (output file 0, output file 1).

  14. MapReduce Implementations • Google has a proprietary implementation in C++ • Bindings in Java, Python • Hadoop is an open-source implementation in Java • Development led by Yahoo, used in production • Now an Apache project • Rapidly expanding software ecosystem • Lots of custom research implementations • For GPUs, cell processors, etc.

  15. Case Study #1: Statistical Machine Translation Chris Dyer (Linguistics Ph.D., 2010)

  16. Statistical Machine Translation [Pipeline diagram] Training data: parallel sentences (e.g. “vi la mesa pequeña” / “i saw the small table”). Word alignment followed by phrase extraction produces phrase pairs such as (vi, i saw) and (la mesa pequeña, the small table), which make up the translation model. Target-language text (e.g. “he sat at the table”, “the service was good”) trains the language model. The decoder combines both models to turn a foreign input sentence (“maria no daba una bofetada a la bruja verde”) into an English output sentence (“mary did not slap the green witch”).

  17. Translation as a Tiling Problem [Diagram] The source sentence “Maria no dio una bofetada a la bruja verde” is covered with overlapping candidate phrase translations, e.g. “Mary”, “not”, “did not”, “give a slap”, “slap”, “to the”, “the witch”, “green witch”; the decoder must choose and order a set of tiles that covers the whole sentence.
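For intuition, here is a deliberately naive monotone tiler in Python. It is a sketch only: the tiny phrase table is hand-built from the tiles shown on this slide, there is no scoring or reordering, and real decoders search over many competing tilings using translation and language model probabilities.

```python
# Toy phrase table built from the tile examples above (illustrative only).
phrase_table = {
    ("maria",): "mary",
    ("no",): "did not",
    ("dio", "una", "bofetada"): "slap",
    ("a", "la"): "the",
    ("bruja", "verde"): "green witch",
}

def tile(source_words):
    """Cover the source left to right, greedily taking the longest
    source phrase found in the table at each position."""
    out, i = [], 0
    while i < len(source_words):
        for k in range(len(source_words) - i, 0, -1):
            phrase = tuple(source_words[i:i + k])
            if phrase in phrase_table:
                out.append(phrase_table[phrase])
                i += k
                break
        else:
            out.append(source_words[i])  # pass unknown words through
            i += 1
    return " ".join(out)

print(tile("maria no dio una bofetada a la bruja verde".split()))
# mary did not slap the green witch
```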

  18. The Data Bottleneck “Every time I fire a linguist, the performance of our … system goes up.” - Fred Jelinek

  19. Statistical Machine Translation [The same pipeline as slide 16] We’ve built MapReduce implementations of two of these components: word alignment and phrase extraction!

  20. HMM Alignment: Giza [Chart: alignment running time on a single-core commodity server]

  21. HMM Alignment: MapReduce [Chart: running time on a single-core commodity server vs. a 38-processor cluster]

  22. HMM Alignment: MapReduce [Chart: the 38-processor cluster compared against a 1/38 single-core reference]

  23. What’s the point? • The optimally-parallelized version doesn’t exist! • MapReduce occupies a sweet spot in the design space for a large class of problems: • Fast… in terms of running time + scaling characteristics • Easy… in terms of programming effort • Cheap… in terms of hardware costs Chris Dyer, Aaron Cordova, Alex Mont, and Jimmy Lin. Fast, Easy, and Cheap: Construction of Statistical Machine Translation Models with MapReduce. Proceedings of the Third Workshop on Statistical Machine Translation at ACL 2008

  24. Case Study #2: DNA Sequence Alignment Michael Schatz (Computer Science Ph.D., 2010)

  25. Strangely-Formatted Manuscript • Dickens: A Tale of Two Cities • Text written on a long spool It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

  26. … With Duplicates • Dickens: A Tale of Two Cities • “Backup” on four more copies: five identical spools in total, each reading “It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …”

  27. Shredded Book Reconstruction • Dickens accidentally shreds the manuscript • How can he reconstruct the text? • 5 copies x 138,656 words / 5 words per fragment = 138k fragments • The short fragments from every copy are mixed together • Some fragments are identical [Diagram: the five copies shredded into short overlapping fragments such as “It was the best of”, “was the best of times,”, “the worst of times, it”, …]

  28. Greedy Assembly [Diagram: the fragments sorted for greedy merging, e.g. “It was the best of” followed by the many fragments beginning “of times, it was the”, “the age of wisdom, it”, “times, it was the worst”, …] The repeated sequences make the correct reconstruction ambiguous! Alternative: model sequence reconstruction as a graph problem… (see the sketch below)
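As a concrete illustration of the ambiguity, here is a toy greedy-overlap assembler (my own sketch, not an actual assembler from the talk). It repeatedly merges the pair of fragments with the largest suffix/prefix overlap; with repeated phrases, ties between equally good overlaps are broken arbitrarily, which is exactly how greedy assembly goes wrong.

```python
def overlap(a, b):
    # Longest suffix of a (list of words) that is a prefix of b.
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def greedy_assemble(fragments):
    frags = [f.split() for f in fragments]
    while len(frags) > 1:
        # Find the best-overlapping ordered pair; ties broken arbitrarily.
        k, i, j = max((overlap(a, b), i, j)
                      for i, a in enumerate(frags)
                      for j, b in enumerate(frags) if i != j)
        if k == 0:
            break  # nothing overlaps any more
        merged = frags[i] + frags[j][k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)]
        frags.append(merged)
    return [" ".join(f) for f in frags]

print(greedy_assemble(["it was the best", "the best of times,", "of times, it was"]))
# ['of times, it was the best of times,']  (one possible greedy outcome,
# not the original text -- the repeats made the reconstruction ambiguous)
```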

  29. de Bruijn Graph Construction • Dk = (V, E) • V = all length-k subfragments (k < l) • E = directed edges between consecutive subfragments (nodes overlap by k-1 words) • Locally constructed graph reveals the global structure • Overlaps between sequences implicitly computed • Example: the original fragment “It was the best of” yields the directed edge “It was the best” → “was the best of” (see the sketch below) de Bruijn, 1946; Idury and Waterman, 1995; Pevzner, Tang, Waterman, 2001
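A toy sketch of this construction in Python, using word tokens rather than DNA bases to match the Dickens example (the function name and the choice k = 4 are my own):

```python
from collections import defaultdict

def de_bruijn(fragments, k):
    """Nodes are length-k word windows; a directed edge joins each
    window to the next one, with which it overlaps by k-1 words."""
    edges = defaultdict(set)
    for frag in fragments:
        words = frag.split()
        windows = [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]
        for a, b in zip(windows, windows[1:]):
            edges[a].add(b)
    return edges

graph = de_bruijn(["It was the best of", "was the best of times,"], k=4)
for src, dsts in graph.items():
    for dst in dsts:
        print(" ".join(src), "->", " ".join(dst))
# It was the best -> was the best of
# was the best of -> the best of times,
```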

  30. de Bruijn Graph Assembly [Graph over four-word windows: “It was the best” → “was the best of” → “the best of times,” → …, with paths rejoining at the repeated windows such as “it was the”] A unique Eulerian tour of the graph reconstructs the original text. If a unique tour does not exist, try to simplify the graph as much as possible. (A tour-finding sketch follows below.)
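For intuition, here is a compact Hierholzer-style tour finder. It is an illustration only: the small graph and start node are invented, and it assumes an Eulerian path from the start actually exists.

```python
def eulerian_tour(edges, start):
    """Hierholzer's algorithm: keep following unused edges; when a
    dead end is hit, emit the node and back up, splicing sub-tours."""
    graph = {u: list(vs) for u, vs in edges.items()}
    stack, tour = [start], []
    while stack:
        u = stack[-1]
        if graph.get(u):
            stack.append(graph[u].pop())  # follow an unused edge
        else:
            tour.append(stack.pop())      # dead end: emit and back up
    return tour[::-1]

# Node B is repeated, like the repeated "it was the" windows above.
edges = {"A": ["B"], "B": ["C", "E"], "C": ["D"], "D": ["B"]}
print(eulerian_tour(edges, "A"))
# ['A', 'B', 'C', 'D', 'B', 'E']  -- every edge used exactly once
```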

  31. de Bruijn Graph Assembly [Simplified graph: non-branching paths compressed into single nodes such as “It was the best of times, it”, “it was the worst of times, it”, “of times, it was the”, “it was the age of”, “the age of wisdom, it was the”, “the age of foolishness”] A unique Eulerian tour of the graph reconstructs the original text. If a unique tour does not exist, try to simplify the graph as much as possible.

  32. [Diagram: a sequencer turns the subject genome into a few billion short reads, e.g. GATGCTTACTATGCGGGCCCC, CGGTCTAATGCTTACTATGC, …, which must then be assembled back together] Human genome: 3 Gbp. A few billion short reads (~100 GB compressed data). Present solutions: large shared-memory machines or clusters with high-speed interconnects. Can we get by with MapReduce on cheap commodity clusters?

  33. Graph Compression Randomized solution • Randomly assign H / T to each compressible node • Compress H → T links Challenges • Nodes are stored on different machines • Nodes can only access their direct neighbors
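A rough single-machine simulation of one randomized round (my own sketch of the idea, not Schatz’s actual MapReduce code; node names are invented). Each node flips heads or tails using only its own state, and a head whose successor flipped tails absorbs it, so the decision needs nothing beyond direct neighbors:

```python
import random

def compress_round(chain):
    """One randomized compression round on a linear chain of nodes.
    A node that flipped H absorbs its successor if it flipped T."""
    flips = {node: random.choice("HT") for node in chain}
    out, i = [], 0
    while i < len(chain):
        if (i + 1 < len(chain)
                and flips[chain[i]] == "H" and flips[chain[i + 1]] == "T"):
            out.append(chain[i] + "+" + chain[i + 1])  # compress H -> T link
            i += 2
        else:
            out.append(chain[i])
            i += 1
    return out

random.seed(0)
chain = [f"n{i}" for i in range(42)]  # 42 nodes, echoing the next slide
while len(chain) > 1:
    chain = compress_round(chain)
    print(len(chain), "nodes left")
```

In expectation about a quarter of the links compress per round (P[H then T] = 1/4), so the chain shrinks by a constant factor each round, qualitatively like the progression on the next slides.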

  34. Fast Graph Compression Initial Graph: 42 nodes

  35. Fast Graph Compression Round 1: 26 nodes (38% savings)

  36. Fast Graph Compression Round 2: 15 nodes (64% savings)

  37. Fast Graph Compression Round 3: 6 nodes (86% savings)

  38. Fast Graph Compression Round 4: 5 nodes (88% savings)

  39. Contrail • De novo assembly of the human genome • Genome: African male NA18507 (SRA000271, Bentley et al., 2008) • Input: 3.5B 36bp reads, 210bp insert (~40x coverage) [Diagrams: clipping tips (dead-end branches like B’) and popping bubbles (redundant parallel paths through B and B’)] Assembly stages, with node count N and max contig length: Initial: N > 7 B, max 27 bp; Compressed: N > 1 B, max 303 bp; Clip tips: N = 5.0 M, max 14,007 bp; Pop bubbles: N = 4.2 M, max 20,594 bp. Assembly of Large Genomes with Cloud Computing. Schatz MC, Sommer D, Kelley D, Pop M, et al. In preparation.

  40. Source: flickr (fatboyke/2918399820)

  41. Source: flickr (60in3/2338247189)

  42. Best thing since sliced bread? • Distributed programming models: • MapReduce is the first • Definitely not the only • And probably not even the best • Alternatives: Pig, Dryad/DryadLINQ, Pregel, etc. • It’s all about the right level of abstraction • The von Neumann architecture won’t cut it anymore • Separating the what from the how • Developer specifies the computation that needs to be performed • Execution framework handles actual execution • Framework hides system-level details from the developers

  43. The datacenter is the computer! What are the appropriate abstractions for the datacenter computer? Source: NY Times (6/14/2006)

  44. Source: flickr (infidelic/3008675635)

  45. Commoditization of large-data processing capabilities allows us to ride the rising tide! Source: Wikipedia (Tide)

  46. Questions?
