
Presentation Transcript


http://www.excelonlineclasses.co.nr/

[email protected]



Excel Online Classes offers the following services:

  • Online Training

  • Development

  • Testing

  • Job support

  • Technical Guidance

  • Job Consultancy

  • Any needs of the IT sector



Nagarjuna K

MapReduce Anatomy



AGENDA

  • Anatomy of MapReduce

    • MR work flow

    • Hadoop data types

    • Mapper

    • Reducer

    • Partitioner

    • Combiner

  • Input Split vs Block Size



Anatomy of MR

[Diagram: MapReduce workflow. Input data on NODE 1 and NODE 2 is fed to Map tasks, which produce interim data; the interim data is partitioned and shuffled across nodes, aggregated by Reduce tasks, and the results are written to output nodes.]



Hadoop data types

  • MR defines specific key and value types so that data can be moved across the cluster

  • Values  Writable

  • Keys  WritableComparable<T>

    • WritableComparable = Writable+Comparable<T>



Frequently used key/value
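
A minimal sketch of a few frequently used built-in types from org.apache.hadoop.io; the variable names and values are illustrative.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

public class CommonTypes {
  public static void main(String[] args) {
    Text word = new Text("hadoop");              // UTF-8 string key or value
    LongWritable offset = new LongWritable(0L);  // e.g. byte-offset keys
    IntWritable count = new IntWritable(1);      // e.g. counts in word count
    NullWritable none = NullWritable.get();      // placeholder when no key/value is needed
    System.out.println(word + "\t" + offset + "\t" + count + "\t" + none);
  }
}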



Custom Writable

  • For any class to be a value, it has to implement org.apache.hadoop.io.Writable

    • write(DataOutput out)

    • readFields(DataInput in)
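
A minimal sketch of a custom value type; the Point2D class and its fields are made up for illustration.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class Point2D implements Writable {
  private double x;
  private double y;

  public Point2D() {}          // Hadoop needs a no-arg constructor to instantiate it

  public void write(DataOutput out) throws IOException {
    out.writeDouble(x);        // serialize the fields in a fixed order
    out.writeDouble(y);
  }

  public void readFields(DataInput in) throws IOException {
    x = in.readDouble();       // deserialize in exactly the same order
    y = in.readDouble();
  }
}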



Custom key

  • For any class to be a key, it has to implement org.apache.hadoop.io.WritableComparable<T>

    • In addition to write() and readFields():

    • compareTo(T o)
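
A minimal sketch of a custom key type; the YearKey class is made up for illustration.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearKey implements WritableComparable<YearKey> {
  private int year;

  public YearKey() {}

  public void write(DataOutput out) throws IOException {
    out.writeInt(year);
  }

  public void readFields(DataInput in) throws IOException {
    year = in.readInt();
  }

  public int compareTo(YearKey other) {
    // Keys must define a total ordering; MR uses it to sort map output.
    return Integer.compare(year, other.year);
  }
}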



Checkout Writables

  • Check out a few of the built-in Writable and WritableComparable implementations

  • Time to write your own writables



MapReduce libraries

  • Two API packages in Hadoop

    • org.apache.hadoop.mapred.*

    • org.apache.hadoop.mapreduce.*



Mapper

  • Should implement

    org.apache.hadoop.mapred.Mapper<K1,V1,K2,V2>

    • void configure(JobConf job)

      • All the parameters specified in the XML config files are available here.

      • Any parameters set explicitly are also available.

      • Called before data processing starts

    • void map(K1 key, V1 value, OutputCollector<K2,V2> output, Reporter reporter)

      • Called once per record; data processing happens here (see the sketch below)

    • void close()

      • Should close any open files, DB connections, etc.

    • Reporter provides extra information about the mapper to the TaskTracker (TT)
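
A sketch of a mapper against this interface, word-count style; the class name, tokenization, and the word-count scenario are illustrative, not from the slides.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void configure(JobConf job) {
    // Runs once before any map() call; read job parameters here if needed.
  }

  @Override
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    // Called once per record (here: byte offset + line of text).
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        output.collect(word, ONE);   // emit (word, 1)
      }
    }
  }

  @Override
  public void close() throws IOException {
    // Release any files or DB connections opened in configure().
  }
}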



Mappers - default



Reducer

  • Should implement

    org.apache.hadoop.mapred.Reducer<K2,V2,K3,V3>

    • Sorts the incoming data based on key and groups together all the values for a key

    • Reduce function is called for every key in the sorted order

      • void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter)

    • Reporter provides extra information about the reducer to the TaskTracker (TT)
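
A matching word-count reducer sketch; the class name is illustrative.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    // All values for this key arrive together, keys in sorted order.
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}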



Reducer - default



Partitioner

  • implements Partitioner<K,V>

    • configure()

    • int getPartition( … )

      • Returns a partition number in the range 0 ≤ return value < number of reducers

  • Generally, implement Partitioner so that the same key always goes to the same reducer
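
A sketch of a hash-based partitioner; Hadoop already ships a HashPartitioner that does essentially this, so the class below only illustrates the shape of the interface.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class WordPartitioner implements Partitioner<Text, IntWritable> {

  @Override
  public void configure(JobConf job) {
    // Read any parameters needed to decide the partitioning.
  }

  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Must return a value in [0, numReduceTasks); the same key always maps
    // to the same partition, so all of its values meet in one reducer.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}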



Reading and Writing

  • Generally two kinds of files in Hadoop

    • Text (plain, XML, HTML, …)

    • Binary (Sequence files)

      • A Hadoop-specific compressed binary file format.

      • Optimized for passing output from one MR job to the next

    • We can also customize these formats



Input Format

  • HDFS block size

  • Input splits



Blocks in HDFS

  • A big file is divided into multiple blocks and stored in HDFS.

  • This is a physical division of data

  • dfs.block.size (default 64 MB)

[Diagram: a LARGE FILE split into BLOCK 1, BLOCK 2, BLOCK 3 and BLOCK 4]



Input Splits and Records

  • Input split is a LOGICAL division of the data

    • A chunk of data processed by a single mapper

    • Further divided into records

    • Map processes these records one at a time

      • Record = key + value

    • How to correlate to a DB table

      • Group of rows → split

      • Row → record



InputSplit

public interface InputSplit extends Writable {
  long getLength() throws IOException;
  String[] getLocations() throws IOException;
}

  • It doesn’t contain the data

    • Only locations where the data is present

    • Helps the JobTracker place tasks on TaskTrackers (data locality).

  • getLength() → splits with greater length are executed first



InputFormat

  • How the data gets to the mapper

    • The InputFormat determines the input splits and how each split is divided into records.

      public interface InputFormat<K, V> {
        InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
        RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException;
      }



InputFormat

  • Mapper

    • getRecordReader() is called to get a RecordReader

    • Once the record reader is obtained,

      • The map method is called repeatedly until the end of the split is reached



RecordReader

K key = reader.createKey();
V value = reader.createValue();
while (reader.next(key, value)) {
  mapper.map(key, value, output, reporter);
}



Job Submission -- retrospection

  • JobClient running the job

    • Gets input splits by calling getSplits() in InputFormat

    • Determines data locations for the splits

    • Sends these locations to the JobTracker

    • JobTracker assigns mappers appropriately.

      • Data locality



In-built InputFormats



FileInputFormat

  • Base class for all InputFormat implementations that use files as input

  • Defines

    • Which files to include for the job

    • Implementation for generating splits



FileInputFormat

  • Set of files → converted into a number of splits

    • Splits only large files…. HOW LARGE ?

    • Larger than the block size

  • Can we control it ?



Calculating Split Size

  • An application may impose a minimum split size greater than the block size.

  • There is usually no good reason to do that

    • Data locality is lost



FileInputFormat

  • Min split size

    • We might set it larger than the block size

    • But data locality may then be lost to some extent

  • Split size calculated by formula

    • max(minimumSize, min(maximumSize, blockSize))

    • By default

      • minimumSize < blockSize < maximumSize
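
A small sketch of how the formula behaves; the byte sizes below are illustrative.

public class SplitSizeDemo {
  // max(minimumSize, min(maximumSize, blockSize))
  static long computeSplitSize(long minSize, long maxSize, long blockSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  public static void main(String[] args) {
    long blockSize = 64L * 1024 * 1024;   // 64 MB default dfs.block.size
    long minSize = 1L;                    // default minimum
    long maxSize = Long.MAX_VALUE;        // default maximum

    // With the defaults the split size equals the block size (64 MB).
    System.out.println(computeSplitSize(minSize, maxSize, blockSize));

    // Raising the minimum above the block size produces larger splits,
    // at the cost of data locality (e.g. mapred.min.split.size = 128 MB).
    System.out.println(computeSplitSize(128L * 1024 * 1024, maxSize, blockSize));
  }
}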



File Information in the mapper

  • configure(JobConf job)
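
A sketch of reading file information inside configure(), assuming the map.input.file property that the old API sets for each map task; the class name is illustrative.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class FileAwareMapperBase extends MapReduceBase {
  private String inputFile;

  @Override
  public void configure(JobConf job) {
    // Path of the file that the current split belongs to (if set by the framework).
    inputFile = job.get("map.input.file");
  }
}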



TextInputFormat

  • Default FileInputFormat

    • Each line is a value

    • The byte offset of the line is the key

  • Example

    • Run identity mapper program
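
A sketch of a driver for the identity mapper example; the paths, job name, and the choice of a map-only job are assumptions, not from the slides.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class IdentityDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(IdentityDriver.class);
    conf.setJobName("identity-text");

    conf.setInputFormat(TextInputFormat.class);  // key = byte offset, value = line
    conf.setMapperClass(IdentityMapper.class);   // pass records through unchanged
    conf.setNumReduceTasks(0);                   // map-only job
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}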



Input Splits and HDFS Blocks

  • Logical records defined by FileInputFormat don't usually fit neatly into HDFS blocks.

    • Every file is written as a sequence of bytes.

    • When 64 MB is reached, a new block is started

    • At that point a logical record may be only half written

    • So the other half of the logical record goes into the next HDFS block.



Input Splits and HDFS Blocks

  • So even with data locality, some remote reading is done: a slight overhead.

    • Split gives logical record boundaries

    • Blocks – physical boundaries (size)



Small Files

  • Very small files are inefficient in the mapper phase

  • Imagine 1 GB of data

    • 64 MB blocks → 16 files → 16 mappers

    • 100 KB files → ~10,000 files → ~10,000 mappers



CombineFileInputFormat

  • Packs many files into a single split

    • Data locality is taken into consideration

  • MR performs best when operating at the disk transfer rate, not at the seek rate

  • This also helps when processing large files



NLineInputFormat

  • Same as TextInputFormat

  • Each split is guaranteed to have N lines

  • mapred.line.input.format.linespermap
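
A sketch of wiring this up, assuming the old-API class org.apache.hadoop.mapred.lib.NLineInputFormat; N = 100 is arbitrary.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class NLineSetup {
  static void configure(JobConf conf) {
    conf.setInputFormat(NLineInputFormat.class);
    // Each mapper receives roughly this many input lines per split.
    conf.setInt("mapred.line.input.format.linespermap", 100);
  }
}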



KeyValueTextInputFormat

  • Each line in text file is a record

  • First separator character divides key and value

    • Default is ‘\t’

  • Controlling property

    • key.value.separator.in.input.line
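
A sketch of switching the separator from the default tab to a comma; the comma is only an example.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

public class KeyValueSetup {
  static void configure(JobConf conf) {
    conf.setInputFormat(KeyValueTextInputFormat.class);
    // Everything before the first ',' becomes the key, the rest the value.
    conf.set("key.value.separator.in.input.line", ",");
  }
}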



SequenceFileInputFormat<K,V>

  • InputFormat for reading sequence files

  • User defined Key K

  • User defined Value V

  • They are splittable files.

    • Well suited for MR

    • They support compression

    • They can store arbitrary types
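
A sketch of reading a sequence file of (Text, IntWritable) pairs; the key and value types are assumptions and must match whatever was written into the file.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class SequenceInputSetup {
  static void configure(JobConf conf) {
    conf.setInputFormat(SequenceFileInputFormat.class);
    // Map input keys/values come straight from the types stored in the file.
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(IntWritable.class);
  }
}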



OutputFormat



TextOutputFormat

  • Keys and values are written separated by '\t' by default.

    • Controlled by the mapred.textoutputformat.separator parameter

      Counterpart of KeyValueTextInputFormat

  • The key or value can be suppressed by using NullWritable
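
A sketch of changing the separator and suppressing the key with NullWritable; the comma separator is only an example.

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;

public class TextOutputSetup {
  static void configure(JobConf conf) {
    conf.setOutputFormat(TextOutputFormat.class);
    conf.set("mapred.textoutputformat.separator", ",");
    // Emitting NullWritable as the key writes only the values to the file.
    conf.setOutputKeyClass(NullWritable.class);
    conf.setOutputValueClass(Text.class);
  }
}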

