CPS216: Advanced Database Systems (Data-intensive Computing Systems)
This presentation is the property of its rightful owner.
Sponsored Links
1 / 15

Shivnath Babu PowerPoint PPT Presentation


  • 103 Views
  • Uploaded on
  • Presentation posted in: General

CPS216: Advanced Database Systems (Data-intensive Computing Systems) Introduction to MapReduce and Hadoop. Shivnath Babu. Word Count over a Given Set of Web Pages. see1 bob1 throw 1 see 1 spot 1 run 1. bob1 run 1 see 2 spot 1 throw1. see bob throw.

Download Presentation

Shivnath Babu

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Shivnath babu

CPS216: Advanced Database Systems (Data-intensive Computing Systems)Introduction to MapReduce and Hadoop

Shivnath Babu


Word count over a given set of web pages

Word Count over a Given Set of Web Pages

  • see1

  • bob1

  • throw 1

  • see 1

  • spot 1

  • run 1

  • bob1

  • run 1

  • see 2

  • spot 1

  • throw1

  • see bob throw

  • see spot run

Can we do word count in parallel?


The mapreduce framework pioneered by google

The MapReduce Framework (pioneered by Google)


Automatic parallel execution in mapreduce google

Automatic Parallel Execution in MapReduce (Google)

Handles failures automatically, e.g., restarts tasks if a node fails; runs multiples copies of the same task to avoid a slow task slowing down the whole job


Mapreduce in hadoop 1

MapReduce in Hadoop (1)


Mapreduce in hadoop 2

MapReduce in Hadoop (2)


Mapreduce in hadoop 3

MapReduce in Hadoop (3)


Data flow in a mapreduce program in hadoop

Data Flow in a MapReduce Program in Hadoop

 1:many

  • InputFormat

  • Map function

  • Partitioner

  • Sorting & Merging

  • Combiner

  • Shuffling

  • Merging

  • Reduce function

  • OutputFormat


Lifecycle of a mapreduce job

Map function

Reduce function

Run this program as a

MapReduce job

Lifecycle of a MapReduce Job


Lifecycle of a mapreduce job1

Map function

Reduce function

Run this program as a

MapReduce job

Lifecycle of a MapReduce Job


Shivnath babu

Lifecycle of a MapReduce Job

Time

Reduce

Wave 1

Input

Splits

Reduce

Wave 2

Map

Wave 1

Map

Wave 2

How are the number of splits, number of map and reduce

tasks, memory allocation to tasks, etc., determined?


Job configuration parameters

190+ parameters in Hadoop

Set manually or defaults are used

Job Configuration Parameters


How to sort data using hadoop

How to sort data using Hadoop?


Let us look at a complete example mapreduce program in hadoop

Let us look at a complete example MapReduce program in Hadoop


  • Login