CS525: Big Data Analytics

Presentation Transcript

  1. CS525: Big Data Analytics
  MapReduce Computing Paradigm & Apache Hadoop Open Source
  Fall 2013, Elke A. Rundensteiner

  2. Large-Scale Data Analytics
  • Many enterprises turn to the Hadoop computing paradigm for big data applications. Hadoop vs. databases:
  • Hadoop: scalability (petabytes of data, thousands of machines); flexibility in accepting all data formats (no schema); efficient fault-tolerance support; commodity, inexpensive hardware
  • Databases: performance (indexing, tuning, data organization techniques); focus on read + write, concurrency, correctness, convenience, high-level access; advanced features such as full query support, clever optimizers, views and security, data consistency, ...

  3. What is Hadoop?
  • Hadoop is a software framework for distributed processing of large datasets across large clusters of (commodity hardware) computers:
  • Large datasets: terabytes or petabytes of data
  • Large clusters: hundreds or thousands of nodes
  • Open-source implementation of Google's MapReduce
  • Simple programming model: MapReduce
  • Simple data model: flexible for any data

  4. Hadoop Framework
  • Two main layers:
  • Distributed file system (HDFS)
  • Execution engine (MapReduce)
  • Hadoop is designed as a master-slave, shared-nothing architecture

  5. Key Ideas of Hadoop
  • Automatic parallelization & distribution
  • Hidden from the end-user
  • Fault tolerance and automatic recovery
  • Failed nodes/tasks recover automatically
  • Simple programming abstraction
  • Users provide two functions, “map” and “reduce”

  6. Who Uses Hadoop?
  • Google: invented the MapReduce computing paradigm
  • Yahoo: developed Hadoop, the open-source implementation of MapReduce
  • Integrators: IBM, Microsoft, Oracle, Greenplum
  • Adopters: Facebook, Amazon, AOL, Netflix, LinkedIn
  • Many others ...

  7. Hadoop Distributed File System (HDFS)
  • Centralized namenode: maintains metadata info about files
  • Many datanodes (1000s): store the actual data
  • Files are divided into blocks (64 MB); e.g., a file F with 5 blocks is spread across the datanodes
  • Each block is replicated N times (default N = 3)
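
  A minimal sketch of loading a file into HDFS through the Java FileSystem API, under the block-size and replication defaults above; the paths and class name are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPut {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Illustrative: request 3 replicas per block (the HDFS default).
            conf.set("dfs.replication", "3");
            FileSystem fs = FileSystem.get(conf);
            // Hypothetical paths; HDFS splits the file into 64 MB blocks
            // and replicates each block across the datanodes.
            fs.copyFromLocalFile(new Path("/tmp/fileF.txt"),
                                 new Path("/user/demo/fileF.txt"));
            fs.close();
        }
    }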

  8. HDFS File System Properties
  • Large space: an HDFS instance may consist of thousands of server machines for storage
  • Replication: each data block is replicated
  • Failure: failure is the norm rather than the exception
  • Fault tolerance: automated detection of faults and automatic recovery

  9. Map-Reduce Execution Engine (Example: Color Count)
  • Map: consumes input blocks on HDFS, produces (k, v) pairs, e.g. (green, 1)
  • Shuffle & sorting based on k: groups each key's values, e.g. (green, [1,1,1,1,1,1, ...])
  • Reduce: consumes (k, [v]), produces (k', v') pairs, e.g. (green, 100)
  • Users only provide the “Map” and “Reduce” functions
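
  A sketch of the Color Count “Map” and “Reduce” functions in Hadoop's Java API; the assumption that each input line holds a single color name is ours, since the slide draws the colors as icons:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ColorCount {
        // Map: emit (color, 1) for every input line.
        public static class ColorMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(new Text(line.toString().trim()), ONE);
            }
        }

        // Reduce: sum the 1s for each color, emit (color, total).
        public static class ColorReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text color, Iterable<IntWritable> counts,
                                  Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                ctx.write(color, new IntWritable(sum));
            }
        }
    }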

  10. MapReduce Engine
  • The Job Tracker is the master node (runs with the namenode)
  • Receives the user’s job
  • Decides how many tasks will run (number of mappers)
  • Decides where to run each mapper (locality)
  • Example: a file with 5 blocks → run 5 map tasks; run the task reading block 1 on Node 1 or Node 3, whichever holds a replica of that block

  11. MapReduce Engine
  • The Task Tracker is the slave node (one runs on each datanode)
  • Receives a task from the Job Tracker
  • Runs the task to completion (either a map or a reduce task)
  • Communicates with the Job Tracker to report its progress
  • Example: one map-reduce job consisting of 4 map tasks and 3 reduce tasks (a driver sketch for such a job follows below)
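
  A sketch of the driver that would submit such a job with the Hadoop 1.x (Job Tracker / Task Tracker) Java API; it reuses the hypothetical ColorCount classes from slide 9 and pins 3 reduce tasks to match the example, while the number of map tasks follows from the input splits:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ColorCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "color count");   // Hadoop 1.x style
            job.setJarByClass(ColorCountDriver.class);
            job.setMapperClass(ColorCount.ColorMapper.class);
            job.setReducerClass(ColorCount.ColorReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // 3 reduce tasks, as in the example; the map task count is
            // derived from the number of input splits, not set here.
            job.setNumReduceTasks(3);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }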

  12. About Key-Value Pairs
  • The developer provides the Mapper and Reducer functions
  • The developer decides what is the key and what is the value
  • The developer must follow the key-value pair interface (made concrete in the sketch below)
  • Mappers:
  • Consume <key, value> pairs
  • Produce <key, value> pairs
  • Shuffling and Sorting:
  • Groups all values sharing a key, across all mappers,
  • sorts the keys and passes them to a particular reducer
  • in the form <key, <list of values>>
  • Reducers:
  • Consume <key, <list of values>>
  • Produce <key, value>
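
  In Hadoop's Java API, this contract appears directly in the generic type parameters of Mapper and Reducer; a minimal sketch (the concrete Writable types chosen here are illustrative):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>: consumes <key, value>
    // pairs and produces <key, value> pairs via context.write(...).
    class MyMapper extends Mapper<LongWritable, Text,   // in: byte offset, input line
                                  Text, IntWritable> {  // out: chosen by the developer
    }

    // Reducer: after shuffle & sort, reduce() is invoked once per key
    // with <key, <list of values>> and produces <key, value> pairs.
    class MyReducer extends Reducer<Text, IntWritable,
                                    Text, IntWritable> {
    }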

  13. MapReduce Phases
  • (Figure: the phases of a job: map → shuffle & sort → reduce)

  14. Another Example: Word Count
  • Job: count the occurrences of each word in a data set
  • (Figure: input data is split among the map tasks, whose output is shuffled to the reduce tasks)
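
  The corresponding code is the classic Hadoop WordCount in Java; a sketch consistent with the slide's job description (a driver like the one after slide 11 would submit it):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map: tokenize each line, emit (word, 1) for every word.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                StringTokenizer tok = new StringTokenizer(line.toString());
                while (tok.hasMoreTokens()) {
                    word.set(tok.nextToken());
                    ctx.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts,
                                  Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                ctx.write(word, new IntWritable(sum));
            }
        }
    }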

  15. Summary: Hadoop vs. Typical DB