
Map/Reduce in Practice



Presentation Transcript


  1. Map/Reduce in Practice Hadoop, HBase, MongoDB, Accumulo, and related Map/Reduce-enabled data stores

  2. How we got here • Google uses Map/Reduce, GFS, and BigTable to provide its services • Hadoop uses MapReduce, HDFS, and HBase to provide the same capabilities in the open • Related stuff… Accumulo, Cassandra, MongoDB

  3. In the beginning was the Google • Larry and Sergey had a lot of data • Needed fast, distributed, large files • Needed location awareness • GFS was born

  4. Processing that data • Needed some way to process it all efficiently • Move processing to the data • Distributed processing • Only transfer minimal results • Map/Reduce
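The model on this slide can be illustrated with the classic word-count example. This is a single-process Python sketch of the map, shuffle, and reduce phases, purely illustrative; it is not the Hadoop API, and the function names are hypothetical:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for each word in the input split."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster, the map and reduce calls run on different workers, and only the shuffled intermediate pairs cross the network, which is why transferring minimal results matters.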

  5. Files are good, structure is better • Map/Reduce naturally produces and functions on structured data (key => value pairs) • Needed a way to efficiently store and access data • BigTable • Compressed, sparse, distributed, multidimensional

  6. Open, sort of • Google told the world about this great stuff: • Dean, Jeffrey and Ghemawat, Sanjay. “MapReduce: Simplified Data Processing on Large Clusters,” OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004. • Chang, Fay et al. “Bigtable: A Distributed Storage System for Structured Data,” OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, November, 2006. • But they weren’t sharing the implementations

  7. Hadoop: Map/Reduce for the masses • Open source Apache project • Derived from the Google papers • Consists of the Hadoop kernel, MapReduce, and HDFS • Also related projects: Hive, HBase, ZooKeeper, etc.

  8. Hadoop Architecture

  9. MapReduce Layer • Takes Jobs, which are split into Tasks • Tasks are executed on worker nodes that, ideally, store the data the task needs to process • If that’s not possible, the task attempts to execute on a worker node in the same rack as the data • Tasks might be map tasks or reduce tasks, depending on what the job tracker needs at the time
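The locality preference described above (node-local first, then rack-local, then anywhere) can be sketched as follows. This is a simplified illustration; `choose_worker` and its parameters are hypothetical names, not Hadoop's actual scheduler code:

```python
def choose_worker(block_locations, block_racks, idle_workers, worker_rack):
    """Pick a worker for a task, preferring data-local, then rack-local nodes.

    block_locations: set of nodes holding the task's input block
    block_racks:     set of racks those nodes live in
    idle_workers:    workers currently free to take a task
    worker_rack:     mapping from worker name -> rack name
    """
    # 1. Data-local: a free worker that already stores the block.
    for w in idle_workers:
        if w in block_locations:
            return w
    # 2. Rack-local: a free worker on the same rack as a replica.
    for w in idle_workers:
        if worker_rack[w] in block_racks:
            return w
    # 3. Fall back to any free worker (the block is read over the network).
    return next(iter(idle_workers), None)

# Example: the block lives on n1 (rack r1); n3 sits on rack r2.
best = choose_worker({"n1"}, {"r1"}, ["n3", "n1"], {"n1": "r1", "n3": "r2"})
```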

  10. HDFS Layer • Consists of a namenode, a secondary namenode, and datanodes • Despite its name, the secondary namenode periodically checkpoints the namenode’s metadata; it is not a hot standby • Datanodes hold redundant copies of each block, generally 2 copies on one rack and a third copy on a different rack • Exposes data location information to the Jobtracker so tasks can be distributed to workers close to the data • Not a POSIX file system, and can’t be mounted directly

  11. Other Storage • Hadoop is flexible about what storage system is used • Alternatives include Amazon S3, CloudStore, FTP filesystems, and read-only HTTP(S) file systems • Only HDFS and CloudStore are rack-aware, though • Multiple data store implementations are supported • Also, HDFS isn’t restricted to Hadoop: HBase and other projects use it as storage

  12. HBase • Basically open-source BigTable • Non-relational, distributed, sparse, multi-dimensional, compressed data • Tables can be input/output for MapReduce jobs run in Hadoop • Supports Bloom filters • Another thing borrowed from BigTable • Can tell you definitively that a value isn’t in the column, but only that it might be there

  13. Data Model • Data is stored as rows with a single key, a timestamp, and multiple column families • Data is sorted by key, but otherwise there aren’t any indexes • Supports 4 operations: Get, Put, Scan, Delete • Deletes don’t actually delete: they write a tombstone marking the cell dead, and later compactions clean it up
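This data model can be illustrated with a toy Python sketch. The `ToyTable` class is a hypothetical model for illustration, not the HBase client API: rows are kept sorted by key (the only index), cells are versioned by timestamp, and deletes are written as tombstones:

```python
from bisect import insort

TOMBSTONE = object()  # sentinel written by delete(); compaction would purge it

class ToyTable:
    """Toy model of an HBase-style table: rows sorted by key, cells
    addressed by (column family, qualifier), versioned by timestamp."""

    def __init__(self):
        self.keys = []   # sorted row keys (the only index)
        self.rows = {}   # row key -> {(family, qualifier): [(ts, value), ...]}

    def put(self, row, family, qualifier, value, ts):
        if row not in self.rows:
            insort(self.keys, row)
            self.rows[row] = {}
        self.rows[row].setdefault((family, qualifier), []).append((ts, value))

    def get(self, row, family, qualifier):
        """Return the newest version of a cell, or None if absent/deleted."""
        versions = self.rows.get(row, {}).get((family, qualifier), [])
        if not versions:
            return None
        ts, value = max(versions, key=lambda tv: tv[0])
        return None if value is TOMBSTONE else value

    def delete(self, row, family, qualifier, ts):
        # A delete just writes a tombstone; nothing is removed yet.
        self.put(row, family, qualifier, TOMBSTONE, ts)

    def scan(self, start, stop):
        """Yield row keys in [start, stop); cheap because keys are sorted."""
        for k in self.keys:
            if start <= k < stop:
                yield k

table = ToyTable()
table.put("row1", "cf", "a", "x", ts=1)
table.put("row2", "cf", "a", "y", ts=1)
table.delete("row1", "cf", "a", ts=2)   # tombstone shadows the older put
```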

  14. Digression: Bloom Filters • Maintains a bit array, like a hash table • Each item, when inserted into the column, is hashed with k different algorithms, and each resulting index bit is set to 1 • To determine if a value is in the table, hash it with the k algorithms and check whether all the indexes are set to 1. If one or more is 0, the value definitely isn’t there • But there is a non-zero probability that all k bits are 1 even though the value was never inserted (a false positive) • Insert-only: entries can’t be removed, since you never know which other entries set the same bits
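The mechanism above fits in a few lines of Python. This is a toy illustration, not HBase's implementation; real filters use fast non-cryptographic hashes such as MurmurHash, and the sizes m and k here are arbitrary:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus k salted hash functions.
    Lookups can return false positives but never false negatives."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indexes(self, value):
        # Derive k indexes by salting one hash with the function number.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, value):
        for idx in self._indexes(value):
            self.bits[idx] = 1

    def might_contain(self, value):
        # All k bits set: value *might* be present (or bits collided).
        # Any bit clear: value is definitely absent.
        return all(self.bits[idx] for idx in self._indexes(value))

bf = BloomFilter()
bf.add("row-17")
```

With sensible m and k the false-positive rate stays small, so most scans for absent values can be skipped entirely at the cost of a few bits per entry.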

  15. So, why bother? • Column scans are expensive, and that’s about the only way to find stuff in a column that’s not the key

  16. Accumulo • HBase for the NSA • Provides basically the same functionality as HBase, but with security • Adds a new element to the key, Column Visibility • Stores a logical combination of security labels that must be satisfied at query time for the key/value to be returned • Hence a single table can store data at various security levels, and users only see what they’re allowed to see
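The query-time label check can be sketched as below. This is a toy illustration with hypothetical names (`visible`, `scan`): it supports only flat `&` (all labels required) and `|` (any label suffices) expressions, while real Accumulo visibility expressions also allow parentheses and nesting:

```python
def visible(expression, authorizations):
    """Toy check of an Accumulo-style column visibility expression
    against a user's set of authorization labels."""
    if not expression:
        return True  # unlabeled data is visible to everyone
    if "&" in expression:
        return all(label in authorizations for label in expression.split("&"))
    return any(label in authorizations for label in expression.split("|"))

def scan(table, authorizations):
    """Filter key/value pairs at query time, as a tablet server would."""
    return [(key, value) for key, visibility, value in table
            if visible(visibility, authorizations)]

# One table, mixed security levels; each entry is (key, visibility, value).
table = [("rowA", "secret&us", 1),
         ("rowB", "public",    2),
         ("rowC", "uk|us",     3)]
results = scan(table, {"public", "us"})
```

A user holding only `public` and `us` sees rowB and rowC; rowA requires `secret` as well, so it is silently filtered out of the scan.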

  17. Cassandra • A lot like HBase, with BigTable inspiration, but also inspired by Amazon Dynamo (a cloud key/value store) • Also has column families (and even supercolumns), but allows secondary indexes • Distribution and replication are tunable • Writes are faster than reads, so good for logging, etc.

  18. Cassandra vs. HBase • Basically comes down to the CAP theorem: • You have to pick two of Consistency, Availability, and Partition tolerance; you can’t have all 3 • Cassandra chooses AP, though you can get consistency if you can tolerate greater latency • By default it provides weak consistency • HBase chooses CP, but availability may suffer: in the event of a partition (node failure), data won’t be available if it can’t be guaranteed to be consistent with committed operations

  19. MongoDB • Document-oriented storage • Full index support • Replication and high availability • Auto-sharding to scale horizontally • JavaScript-based querying • Map/Reduce • GridFS storage

  20. Conclusion • There are a lot of options out there, and more all the time • An RDBMS offers the most functionality, but stumbles at the scalability problem • Key/value stores scale, but require a different processing model • The best option will be determined by a combination of data and task
