
BigData Data Structures and Algorithms



Presentation Transcript


  1. BigData Data Structures and Algorithms

    NetApp University Day Student Workshop – March 2013. Sai Susarla, Y Giridhar Appaji Nag
  2. What is BigData – the 4 “V”s
  - Volume: the size and volume of information is very large
  - Velocity: data arrives fast, and decisions based on it may be time sensitive
  - Variety: data need not be structured and may have been gathered from different sources
  - Veracity: the ability to trust the information enough to make decisions based on it
  BigData holds the promise of useful information derived from looking at the data as a whole, rather than as independent parts containing the same total amount of data.
  3. Examples of BigData Applications
  4. The Large Hadron Collider
  - Experimental setup for answering physics questions about the interactions and forces between fundamental particles
  - 150 million sensors deliver data 40 million times each second, from 600 million collisions per second: high Velocity of data
  - Scientists store and work with only about 0.001% of the sensor stream, about 25 PB of unique data per year and 200 PB after replication: high Volume of data
  5. Linked Life Data
  - Platform for semantic data integration and efficient reasoning over bio-medical and pharmaceutical data
  - Uses the Resource Description Framework (RDF), storing facts as [subject, predicate, object] triples; billions of triples are common: high Volume of data
  - Distributed and redundant, and syndicates large amounts of heterogeneous bio-medical knowledge: high Veracity and Variety of data
  6. FICO Fraud Manager
  - Solution for proactive payment-card fraud detection
  - Adaptive real-time analytics to detect credit card fraud: high Velocity of data
  - Protects over 2 billion active accounts, and a multiple of that number in transactions: high Volume of data
  7. Deduplication of Data
  - Deduplication examines data sets or I/O streams at a sub-file and/or cross-file level to identify duplicate data and store only the unique data (see the sketch below)
  - Must support very high throughput for ingesting data into the system: high Volume of data
  - “Inline” deduplication removes duplicate data as it is ingested into the system: high Velocity of data processing
  - Must never fail to store new unique data: high Veracity of data
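  To make the sub-file view concrete, here is a minimal sketch that splits a byte stream into blocks and fingerprints each one. The 4 KiB block size and SHA-256 checksums are illustrative assumptions; real systems may also use variable-size, content-defined blocks.

      import hashlib

      BLOCK_SIZE = 4096  # assumed fixed block size

      def fingerprints(data):
          """Yield (offset, checksum) for every block in the stream."""
          for offset in range(0, len(data), BLOCK_SIZE):
              block = data[offset:offset + BLOCK_SIZE]
              yield offset, hashlib.sha256(block).hexdigest()

      data = b"A" * 8192 + b"B" * 4096
      fps = list(fingerprints(data))
      # The first two blocks hash identically, so only one copy of that
      # block needs to be stored.
      assert fps[0][1] == fps[1][1] != fps[2][1]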
  8. Importance of BigData Data Structures and Algorithms
  9. Fast Indexing and Small Indices
  “I'm trying to create indexes on a table with 308 million rows. It took ~20 minutes to load the table but 10 days to build indexes on it. The table's MYD file is 3.2G and its MYI file is 7.7G.” (http://bugs.mysql.com/9544)
  10. Write Optimization is Important
  “Select queries were slow until I added an index onto the timestamp field... Adding the index really helped our reporting, BUT now the inserts are taking forever.” (a comment on mysqlperformanceblog.com)
  11. Fast Incremental Updates
  “By converting the indexing system to an incremental system, we are able to process individual documents as they are crawled. This reduced the average document processing latency by a factor of 100, and the average age of a document appearing in a search result dropped by nearly 50 percent.” (http://research.google.com/pubs/pub36726.html)
  12. Data Structures and Algorithms, With Emphasis on the Deduplication Problem
  13. Problems with BigData
  Characteristics of BigData:
  - Can’t usually fit all the data in memory
  - Can’t process all of it with the CPU power of one machine
  - Can’t possibly move all of it to one place for processing
  These characteristics apply to non-BigData too, but the 4 “V”s make solutions to these problems even harder. We will look at them in detail in the context of the deduplication problem.
  14. Background Deduplication
  - Scan the data written to disk and compute checksums of all the new data blocks
  - Look up the newly computed checksums in an existing checksum database
  - Replace references to duplicate blocks with references to the existing blocks, and delete the “new” duplicate blocks
  - Merge the unique new checksum entries into the checksum database
  15. Background Deduplication via Merge Sort
  (Illustration: a merge sort over checksums for first-time data, and a merge of sorted new checksums into the existing sorted database; a sketch of the merge step follows.)
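  A sketch of that merge step, under the assumption that the checksum database is kept as a sorted sequence: sort the batch of new checksums, then stream-merge the two sorted runs, separating duplicates from unique entries (heapq.merge plays the role an external merge sort would on disk).

      import heapq

      def merge_new_checksums(database, new_checksums):
          """Merge sorted new checksums into a sorted database."""
          merged, duplicates = [], []
          for fp in heapq.merge(database, sorted(new_checksums)):
              if merged and merged[-1] == fp:
                  duplicates.append(fp)  # block is already stored
              else:
                  merged.append(fp)      # unique entry, keep it
          return merged, duplicates

      db, dups = merge_new_checksums(["aa", "cc", "ee"], ["bb", "cc"])
      # db == ["aa", "bb", "cc", "ee"], dups == ["cc"]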
  16. Inline Deduplication
  - Compute checksums of the data blocks in incoming data streams
  - Look up each computed checksum in an existing checksum database
  - On a match, write meta-data to disk pointing to the existing data location and discard the data block
  - On a mismatch, write the data to disk and add the new checksum entry to the database (a sketch of this write path follows)
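  A minimal sketch of that write path, with a Python list standing in for the block store and a dict for the checksum database:

      import hashlib

      def inline_write(block, index, disk):
          """Write a block, or only a reference if it is already stored."""
          fp = hashlib.sha256(block).hexdigest()
          if fp not in index:       # mismatch: store the new block
              index[fp] = len(disk)
              disk.append(block)
          return index[fp]          # match: meta-data points here

      disk, index = [], {}
      first = inline_write(b"x" * 4096, index, disk)
      second = inline_write(b"x" * 4096, index, disk)
      assert first == second and len(disk) == 1  # duplicate discarded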
  17. Bloom Filters
  - Space-efficient probabilistic data structure for set membership
  - “n” objects are hashed into an “m”-bit filter using “k” hash functions
  - All “k” bits set to 1 => the object may be present
  - Any one of the “k” bits == 0 => the object is definitely not present
  - False positives are possible, with probability p ≈ (1 − e^(−kn/m))^k; there are no false negatives
  18. Basic Bloom Filter Illustration
  (Illustration: a Bloom filter with m = 18 and k = 3.)
  Given “n” and an allowed false-positive probability “p”, the standard sizing is m = −n·ln(p) / (ln 2)² bits and k = (m/n)·ln 2 hash functions.
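  A minimal Bloom filter sketch using those sizing formulas; the two halves of one SHA-256 digest are combined via double hashing to simulate the k independent hash functions:

      import hashlib
      import math

      class BloomFilter:
          def __init__(self, n, p):
              # size the filter from n and p using the formulas above
              self.m = max(1, int(-n * math.log(p) / math.log(2) ** 2))
              self.k = max(1, round(self.m / n * math.log(2)))
              self.bits = bytearray((self.m + 7) // 8)

          def _positions(self, item):
              digest = hashlib.sha256(item).digest()
              h1 = int.from_bytes(digest[:8], "big")
              h2 = int.from_bytes(digest[8:16], "big")
              return [(h1 + i * h2) % self.m for i in range(self.k)]

          def add(self, item):
              for pos in self._positions(item):
                  self.bits[pos // 8] |= 1 << (pos % 8)

          def __contains__(self, item):
              return all(self.bits[pos // 8] >> (pos % 8) & 1
                         for pos in self._positions(item))

      bf = BloomFilter(n=1000, p=0.01)
      bf.add(b"fingerprint-1")
      assert b"fingerprint-1" in bf  # no false negatives
      # b"fingerprint-2" in bf is usually False, but can be a false positive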
  19. Bloom Filter in a Key-Value Store
  20. False Positive Rates in Bloom Filters
  (Table: false positive rates for a 1 PB data lookup with various Bloom filter size / block size combinations; a worked example follows.)
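  As a worked example under assumed parameters (1 PB of data in 4 KiB blocks, with a few candidate filter sizes expressed as bits per block):

      import math

      n = 10 ** 15 // 4096  # ~2.4e11 block fingerprints in 1 PB

      for bits_per_block in (1, 2, 4, 8):
          m = n * bits_per_block
          k = max(1, round(m / n * math.log(2)))  # near-optimal k
          p = (1 - math.exp(-k * n / m)) ** k
          print(f"{bits_per_block} bits/block: k={k}, "
                f"false positive rate ~ {p:.3f}")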
  21. BigData Algorithms Across Networks
  22. Motivation
  - Large-scale data processing needs hundreds or thousands of CPUs, across the many machines that hold the data, and these machines may fail routinely
  - Many tasks process huge amounts of data and produce large amounts of data
  - Parallel programming paradigms (e.g. MPI) are specialized in nature and difficult to use
  23. Enter Map-Reduce
  A new programming model in which input and output are each a set of key/value pairs. The programmer specifies two functions:
  - map(in_key, in_value) -> list(out_key, intermediate_value): processes an input key/value pair and produces a set of intermediate pairs
  - reduce(out_key, list(intermediate_value)) -> list(out_value): combines all intermediate values for a particular key and produces a set of merged output values
  Inspired by similar primitives in LISP and other functional languages. A minimal sketch of the model follows.
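  A single-process sketch of the model using word count as the standard example; a real framework runs the map and reduce tasks in parallel across many machines:

      from collections import defaultdict

      def map_fn(doc_name, text):
          for word in text.split():
              yield word, 1        # intermediate (out_key, value) pair

      def reduce_fn(word, counts):
          yield sum(counts)        # merged output value

      def map_reduce(inputs, map_fn, reduce_fn):
          intermediate = defaultdict(list)
          for key, value in inputs:                 # map phase
              for out_key, out_value in map_fn(key, value):
                  intermediate[out_key].append(out_value)
          return {k: list(reduce_fn(k, vs))         # reduce phase
                  for k, vs in intermediate.items()}

      result = map_reduce([("doc1", "big data big ideas")],
                          map_fn, reduce_fn)
      # result == {"big": [2], "data": [1], "ideas": [1]}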
  24. Map-Reduce Illustration
  25. Map-Reduce for Reverse Web Link Graph
  - Map: outputs <target, source> pairs for each link to a target URL found in a page named source
  - An intermediate “sort by target” step groups the pairs
  - Reduce: concatenates the list of all source URLs associated with a given target URL and emits the pair <target, list(source)>
  This job is sketched below. Exercise: can the background deduplication problem be solved using Map-Reduce?
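  The reverse web-link graph expressed against the map_reduce sketch above, with pages assumed to arrive as (url, list of outgoing links):

      def link_map(source, outlinks):
          for target in outlinks:
              yield target, source   # invert the edge

      def link_reduce(target, sources):
          yield sources              # all pages that link to target

      pages = [("a.html", ["b.html", "c.html"]),
               ("b.html", ["c.html"])]
      graph = map_reduce(pages, link_map, link_reduce)
      # graph == {"b.html": [["a.html"]],
      #           "c.html": [["a.html", "b.html"]]}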
  26. Map-Reduce Over the Network
  27. Fault Tolerance in Map-Reduce
  On worker failure:
  - Detect the failure via periodic heartbeats
  - Re-execute completed and in-progress map tasks
  - Re-execute in-progress reduce tasks
  - Task completion is committed through a master
  Master failure is assumed to be unlikely.
  28. Problems for the Break-Out Session
  29. Problem – Data Volume
  - If the volume of data processed for your research work suddenly increases 1000 (or perhaps 100000) times, what would you do?
  - How do the run-time complexity measures of your algorithms begin to change and matter?
  - How would you handle imbalances in the availability of resources in your problem area? E.g. lots of memory, but not on one machine; or lots of machines, but only enough electricity to power a few of them at a time
  30. Problems – Data Organisation
  - If you want to build a customized system to hold and process your data, what requirements would you place on such a system? Which semantics (e.g. from the ACID semantics of a database) would you be willing to relax?
  - What if the reliability of that data doesn’t matter “beyond a point”?
  - How would you deal with unreliable components in your system? E.g. unreliable memory, borrowed software that can occasionally crash, big changes in the bandwidth of network links
  31. Problems – Approximate Answers
  - Is there a chance that you don’t care whether you get the right answers to your questions 99% of the time? Are there parameters you measure that need not be correct all the time?
  - What if you could get 80% of the answers right using only 20% of the processing time? Could you get the remaining 20% of the answers using less than the remaining 80% of the time?
  - How would you aggregate results, from say two or more runs, to get 100% of the answers right?
  32. Reference Material
  33. Distributed Graph Algorithms
  - Map-Reduce is suitable for “aggregation of data” and SQL-like queries
  - Computation over graphs is more amenable to a “message passing model” (see the sketch below)
  - Exercise: read the “Pregel: A System for Large-Scale Graph Processing” paper
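  A toy illustration of the vertex-centric, message-passing style the Pregel paper describes (this is not the Pregel API): in each superstep, every vertex with pending messages updates its state and sends messages along its out-edges, here computing single-source shortest paths:

      import math

      def pregel_sssp(edges, source):
          """edges: vertex -> list of (neighbor, weight) pairs."""
          dist = {v: math.inf for v in edges}
          messages = {source: [0]}                 # superstep 0
          while messages:                          # halt when no messages
              next_messages = {}
              for v, incoming in messages.items():
                  candidate = min(incoming)
                  if candidate < dist[v]:          # update vertex state
                      dist[v] = candidate
                      for neighbor, weight in edges[v]:
                          next_messages.setdefault(neighbor, []).append(
                              candidate + weight)  # message on out-edge
              messages = next_messages
          return dist

      edges = {"a": [("b", 1), ("c", 4)], "b": [("c", 1)], "c": []}
      # pregel_sssp(edges, "a") == {"a": 0, "b": 1, "c": 2}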
  34. Other Interesting Work
  - FastBit: Efficient Compressed Bitmap Indices for Fast Queries
  - Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications
  - Dremel: Interactive Analysis of Web-Scale Datasets
  35. A Few Conferences/Workshops
  - Workshop on Algorithms for Modern Massive Data Sets (MMDS): http://mmds.stanford.edu/
  - Algorithms and Data Structures Symposium: http://www.wads.org/
  - ACM SIGSPATIAL Workshop on Analytics for Big GeoSpatial data: http://www.sigspatial.org/
  - IEEE Big Data Congress: http://www.ieeebigdata.org
  36. Credits
  - Wikimedia Commons for various public domain pictures
  - TokuDB presentation for the MySQL anecdotes
  - OSDI 2004 MapReduce presentation