
BigData Data Structures and Algorithms



Presentation Transcript


  1. BigData Data Structures and Algorithms

    NetApp University Day Student Workshop – March 2013. Sai Susarla, Y Giridhar Appaji Nag
  2. What is BigData – the 4 “V”s
  - Volume: the size and volume of information is very large
  - Velocity: data arrives fast, and decisions based on it may be time sensitive
  - Variety: data need not be structured and may have been gathered from different sources
  - Veracity: the ability to trust the information enough to make decisions based on it
  BigData holds the promise of useful information derived from looking at the data as a whole, rather than as independent parts containing the same total amount of data.
  3. Examples of BigData Applications
  4. The Large Hadron Collider
  - Experimental setup for answering physics questions about the interactions and forces between fundamental particles
  - 150 million sensors deliver data 40 million times each second, from 600 million collisions per second: high Velocity of data
  - Scientists store and work with only about 0.001% of the sensor stream, about 25 PB of unique data per year and 200 PB after replication: high Volume of data
  5. Linked Life Data
  - Platform for semantic data integration and efficient reasoning over bio-medical and pharmaceutical data
  - Uses the Resource Description Framework (RDF), storing facts as [subject, predicate, object] triples; billions of triples are common: high Volume of data
  - Distributed and redundant, and syndicates large amounts of heterogeneous bio-medical knowledge: high Veracity and Variety of data
  6. FICO Fraud Manager
  - Solution for proactive payment-card fraud detection
  - Adaptive real-time analytics to detect credit card fraud: high Velocity of data
  - Protects over 2 billion active accounts, and a multiple of that number in transactions: high Volume of data
  7. Deduplication of Data
  - Deduplication examines data sets or I/O streams at a sub-file and/or cross-file level to identify duplicate data and store only the unique data (see the sketch below)
  - Must support very high throughput for ingesting data into the system: high Volume of data
  - “Inline” deduplication removes duplicate data as it is ingested into the system: high Velocity of data processing
  - Must never fail to store new unique data: high Veracity of data
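  To make the sub-file view concrete, here is a minimal sketch that splits a byte stream into blocks and fingerprints each one. The 4 KiB block size and SHA-256 checksums are illustrative assumptions; real systems may also use variable-size, content-defined blocks.

      import hashlib

      BLOCK_SIZE = 4096  # assumed fixed block size

      def fingerprints(data):
          """Yield (offset, checksum) for every block in the stream."""
          for offset in range(0, len(data), BLOCK_SIZE):
              block = data[offset:offset + BLOCK_SIZE]
              yield offset, hashlib.sha256(block).hexdigest()

      data = b"A" * 8192 + b"B" * 4096
      fps = list(fingerprints(data))
      # The first two blocks hash identically, so only one copy of that
      # block needs to be stored.
      assert fps[0][1] == fps[1][1] != fps[2][1]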
  8. Importance of BigData Data Structures and Algorithms
  9. Fast Indexing and Small Indices
  “I'm trying to create indexes on a table with 308 million rows. It took ~20 minutes to load the table but 10 days to build indexes on it. The table's MYD file is 3.2G and its MYI file is 7.7G.” (http://bugs.mysql.com/9544)
  10. Write Optimization is Important
  “Select queries were slow until I added an index onto the timestamp field... Adding the index really helped our reporting, BUT now the inserts are taking forever.” (a comment on mysqlperformanceblog.com)
  11. Fast Incremental Updates
  “By converting the indexing system to an incremental system, we are able to process individual documents as they are crawled. This reduced the average document processing latency by a factor of 100, and the average age of a document appearing in a search result dropped by nearly 50 percent.” (http://research.google.com/pubs/pub36726.html)
  12. Data Structures and Algorithms, With Emphasis on the Deduplication Problem
  13. Problems with BigData
  Characteristics of BigData:
  - Can’t usually fit all the data in memory
  - Can’t process all of it with the CPU power of one machine
  - Can’t possibly move all of it to one place for processing
  These characteristics apply to non-BigData too, but the 4 “V”s make solutions to these problems even harder. We will look at them in detail in the context of the deduplication problem.
  14. Background Deduplication
  - Scan the data written to disk and compute checksums of all the new data blocks
  - Look up the newly computed checksums in an existing checksum database
  - Replace references to duplicate blocks with references to the existing blocks, and delete the “new” duplicate blocks
  - Merge the unique new checksum entries into the checksum database
  15. Background Deduplication via Merge Sort
  (Illustration: a merge sort over checksums for first-time data, and a merge of sorted new checksums into the existing sorted database; a sketch of the merge step follows.)
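  A sketch of that merge step, under the assumption that the checksum database is kept as a sorted sequence: sort the batch of new checksums, then stream-merge the two sorted runs, separating duplicates from unique entries (heapq.merge plays the role an external merge sort would on disk).

      import heapq

      def merge_new_checksums(database, new_checksums):
          """Merge sorted new checksums into a sorted database."""
          merged, duplicates = [], []
          for fp in heapq.merge(database, sorted(new_checksums)):
              if merged and merged[-1] == fp:
                  duplicates.append(fp)  # block is already stored
              else:
                  merged.append(fp)      # unique entry, keep it
          return merged, duplicates

      db, dups = merge_new_checksums(["aa", "cc", "ee"], ["bb", "cc"])
      # db == ["aa", "bb", "cc", "ee"], dups == ["cc"]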
  16. Inline Deduplication
  - Compute checksums of the data blocks in incoming data streams
  - Look up each computed checksum in an existing checksum database
  - On a match, write meta-data to disk pointing to the existing data location and discard the data block
  - On a mismatch, write the data to disk and add the new checksum entry to the database (a sketch of this write path follows)
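  A minimal sketch of that write path, with a Python list standing in for the block store and a dict for the checksum database:

      import hashlib

      def inline_write(block, index, disk):
          """Write a block, or only a reference if it is already stored."""
          fp = hashlib.sha256(block).hexdigest()
          if fp not in index:       # mismatch: store the new block
              index[fp] = len(disk)
              disk.append(block)
          return index[fp]          # match: meta-data points here

      disk, index = [], {}
      first = inline_write(b"x" * 4096, index, disk)
      second = inline_write(b"x" * 4096, index, disk)
      assert first == second and len(disk) == 1  # duplicate discarded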
  17. Bloom Filters
  - Space-efficient probabilistic data structure for set membership
  - “n” objects are hashed into an “m”-bit filter using “k” hash functions
  - All “k” bits set to 1 => the object may be present
  - Any one of the “k” bits == 0 => the object is definitely not present
  - False positives are possible, with probability p ≈ (1 − e^(−kn/m))^k; there are no false negatives
  18. Basic Bloom Filter Illustration
  (Illustration: a Bloom filter with m = 18 and k = 3.)
  Given “n” and an allowed false-positive probability “p”, the standard sizing is m = −n·ln(p) / (ln 2)² bits and k = (m/n)·ln 2 hash functions.
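  A minimal Bloom filter sketch using those sizing formulas; the two halves of one SHA-256 digest are combined via double hashing to simulate the k independent hash functions:

      import hashlib
      import math

      class BloomFilter:
          def __init__(self, n, p):
              # size the filter from n and p using the formulas above
              self.m = max(1, int(-n * math.log(p) / math.log(2) ** 2))
              self.k = max(1, round(self.m / n * math.log(2)))
              self.bits = bytearray((self.m + 7) // 8)

          def _positions(self, item):
              digest = hashlib.sha256(item).digest()
              h1 = int.from_bytes(digest[:8], "big")
              h2 = int.from_bytes(digest[8:16], "big")
              return [(h1 + i * h2) % self.m for i in range(self.k)]

          def add(self, item):
              for pos in self._positions(item):
                  self.bits[pos // 8] |= 1 << (pos % 8)

          def __contains__(self, item):
              return all(self.bits[pos // 8] >> (pos % 8) & 1
                         for pos in self._positions(item))

      bf = BloomFilter(n=1000, p=0.01)
      bf.add(b"fingerprint-1")
      assert b"fingerprint-1" in bf  # no false negatives
      # b"fingerprint-2" in bf is usually False, but can be a false positive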
  19. Bloom Filter in a Key-Value Store
  20. False Positive Rates in Bloom Filters
  (Table: false positive rates for a 1 PB data lookup with various Bloom filter size / block size combinations; a worked example follows.)
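  As a worked example under assumed parameters (1 PB of data in 4 KiB blocks, with a few candidate filter sizes expressed as bits per block):

      import math

      n = 10 ** 15 // 4096  # ~2.4e11 block fingerprints in 1 PB

      for bits_per_block in (1, 2, 4, 8):
          m = n * bits_per_block
          k = max(1, round(m / n * math.log(2)))  # near-optimal k
          p = (1 - math.exp(-k * n / m)) ** k
          print(f"{bits_per_block} bits/block: k={k}, "
                f"false positive rate ~ {p:.3f}")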
  21. BigData Algorithms Across Networks
  22. Motivation
  - Large-scale data processing needs hundreds or thousands of CPUs, across the many machines that hold the data, and these machines may fail routinely
  - Many tasks process huge amounts of data and produce large amounts of data
  - Parallel programming paradigms (e.g. MPI) are specialized in nature and difficult to use
  23. Enter Map-Reduce
  A new programming model in which input and output are each a set of key/value pairs. The programmer specifies two functions:
  - map(in_key, in_value) -> list(out_key, intermediate_value): processes an input key/value pair and produces a set of intermediate pairs
  - reduce(out_key, list(intermediate_value)) -> list(out_value): combines all intermediate values for a particular key and produces a set of merged output values
  Inspired by similar primitives in LISP and other functional languages. A minimal sketch of the model follows.
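  A single-process sketch of the model using word count as the standard example; a real framework runs the map and reduce tasks in parallel across many machines:

      from collections import defaultdict

      def map_fn(doc_name, text):
          for word in text.split():
              yield word, 1        # intermediate (out_key, value) pair

      def reduce_fn(word, counts):
          yield sum(counts)        # merged output value

      def map_reduce(inputs, map_fn, reduce_fn):
          intermediate = defaultdict(list)
          for key, value in inputs:                 # map phase
              for out_key, out_value in map_fn(key, value):
                  intermediate[out_key].append(out_value)
          return {k: list(reduce_fn(k, vs))         # reduce phase
                  for k, vs in intermediate.items()}

      result = map_reduce([("doc1", "big data big ideas")],
                          map_fn, reduce_fn)
      # result == {"big": [2], "data": [1], "ideas": [1]}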
  24. Map-Reduce Illustration
  25. Map-Reduce for Reverse Web Link Graph
  - Map: outputs <target, source> pairs for each link to a target URL found in a page named source
  - An intermediate “sort by target” step groups the pairs
  - Reduce: concatenates the list of all source URLs associated with a given target URL and emits the pair <target, list(source)>
  This job is sketched below. Exercise: can the background deduplication problem be solved using Map-Reduce?
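  The reverse web-link graph expressed against the map_reduce sketch above, with pages assumed to arrive as (url, list of outgoing links):

      def link_map(source, outlinks):
          for target in outlinks:
              yield target, source   # invert the edge

      def link_reduce(target, sources):
          yield sources              # all pages that link to target

      pages = [("a.html", ["b.html", "c.html"]),
               ("b.html", ["c.html"])]
      graph = map_reduce(pages, link_map, link_reduce)
      # graph == {"b.html": [["a.html"]],
      #           "c.html": [["a.html", "b.html"]]}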
  26. Map-Reduce Over the Network
  27. Fault Tolerance in Map-Reduce
  On worker failure:
  - Detect the failure via periodic heartbeats
  - Re-execute completed and in-progress map tasks
  - Re-execute in-progress reduce tasks
  - Task completion is committed through a master
  Master failure is assumed to be unlikely.
  28. Problems for the Break-Out Session
  29. Problem – Data Volume
  - If the volume of data processed for your research work suddenly increases 1000 (or perhaps 100000) times, what would you do?
  - How do the run-time complexity measures of your algorithms begin to change and matter?
  - How would you handle imbalances in the availability of resources in your problem area? E.g. lots of memory, but not on one machine; or lots of machines, but only enough electricity to power a few of them at a time
  30. Problems – Data Organisation
  - If you want to build a customized system to hold and process your data, what requirements would you place on such a system? Which semantics (e.g. from the ACID semantics of a database) would you be willing to relax?
  - What if the reliability of that data doesn’t matter “beyond a point”?
  - How would you deal with unreliable components in your system? E.g. unreliable memory, borrowed software that can occasionally crash, big changes in the bandwidth of network links
  31. Problems – Approximate Answers
  - Is there a chance that you don’t care whether you get the right answers to your questions 99% of the time? Are there parameters you measure that need not be correct all the time?
  - What if you could get 80% of the answers right using only 20% of the processing time? Could you get the remaining 20% of the answers using less than the remaining 80% of the time?
  - How would you aggregate results, from say two or more runs, to get 100% of the answers right?
  32. Reference Material
  33. Distributed Graph Algorithms
  - Map-Reduce is suitable for “aggregation of data” and SQL-like queries
  - Computation over graphs is more amenable to a “message passing model” (see the sketch below)
  - Exercise: read the “Pregel: A System for Large-Scale Graph Processing” paper
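  A toy illustration of the vertex-centric, message-passing style the Pregel paper describes (this is not the Pregel API): in each superstep, every vertex with pending messages updates its state and sends messages along its out-edges, here computing single-source shortest paths:

      import math

      def pregel_sssp(edges, source):
          """edges: vertex -> list of (neighbor, weight) pairs."""
          dist = {v: math.inf for v in edges}
          messages = {source: [0]}                 # superstep 0
          while messages:                          # halt when no messages
              next_messages = {}
              for v, incoming in messages.items():
                  candidate = min(incoming)
                  if candidate < dist[v]:          # update vertex state
                      dist[v] = candidate
                      for neighbor, weight in edges[v]:
                          next_messages.setdefault(neighbor, []).append(
                              candidate + weight)  # message on out-edge
              messages = next_messages
          return dist

      edges = {"a": [("b", 1), ("c", 4)], "b": [("c", 1)], "c": []}
      # pregel_sssp(edges, "a") == {"a": 0, "b": 1, "c": 2}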
  34. Other Interesting Work
  - FastBit: Efficient Compressed Bitmap Indices for Fast Queries
  - Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications
  - Dremel: Interactive Analysis of Web-Scale Datasets
  35. A Few Conferences/Workshops
  - Workshop on Algorithms for Modern Massive Data Sets (MMDS): http://mmds.stanford.edu/
  - Algorithms and Data Structures Symposium: http://www.wads.org/
  - ACM SIGSPATIAL Workshop on Analytics for Big GeoSpatial data: http://www.sigspatial.org/
  - IEEE Big Data Congress: http://www.ieeebigdata.org
  36. Credits
  - Wikimedia Commons for various public domain pictures
  - TokuDB presentation for the MySQL anecdotes
  - OSDI 2004 MapReduce presentation