Presentation Transcript
http://www.excelonlineclasses.co.nr/

[email protected]

Excel Online Classes offers the following services:

  • Online Training
  • Development
  • Testing
  • Job support
  • Technical Guidance
  • Job Consultancy
  • Any other needs of the IT sector
AGENDA
  • Anatomy of MapReduce
    • MR work flow
    • Hadoop data types
    • Mapper
    • Reducer
    • Partitioner
    • Combiner
  • Input Split vs Block Size
Anatomy of MR

[Diagram: input data split across nodes → Map tasks → interim data → partitioning and shuffling → Reduce tasks → output stored on nodes]

Hadoop data types
  • MR requires well-defined key and value types so data can move across the cluster
  • Values → Writable
  • Keys → WritableComparable<T>
    • WritableComparable = Writable + Comparable<T>
Custom Writable
  • For any class to be used as a value, it has to implement org.apache.hadoop.io.Writable (see the sketch below)
    • write(DataOutput out)
    • readFields(DataInput in)
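A minimal sketch of a custom value class; the name PointWritable and its two int fields are hypothetical, used only to illustrate the two required methods:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical value type holding two ints
public class PointWritable implements Writable {
  private int x;
  private int y;

  public PointWritable() { }                     // no-arg constructor required by Hadoop

  public void write(DataOutput out) throws IOException {
    out.writeInt(x);                             // serialize fields in a fixed order
    out.writeInt(y);
  }

  public void readFields(DataInput in) throws IOException {
    x = in.readInt();                            // deserialize in the same order
    y = in.readInt();
  }
}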
Custom key
  • For any class to be used as a key, it has to implement org.apache.hadoop.io.WritableComparable<T> (see the sketch below)
    • write(DataOutput out) and readFields(DataInput in), plus
    • compareTo(T o)
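A minimal sketch of a custom key class; IdKey wrapping a single long is hypothetical. It has the Writable methods plus compareTo, which defines the sort order used when keys are grouped for the reducer:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key type wrapping a long id
public class IdKey implements WritableComparable<IdKey> {
  private long id;

  public IdKey() { }                             // no-arg constructor required by Hadoop

  public void write(DataOutput out) throws IOException { out.writeLong(id); }

  public void readFields(DataInput in) throws IOException { id = in.readLong(); }

  public int compareTo(IdKey other) {            // sort order of the keys
    return Long.compare(id, other.id);
  }

  public int hashCode() {                        // used by the default HashPartitioner
    return (int) (id ^ (id >>> 32));
  }
}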
Checkout Writables
  • Check out a few of the built-in Writable and WritableComparable implementations
  • Time to write your own Writables
MapReduce libraries
  • Two API packages in Hadoop
    • org.apache.hadoop.mapred.* (the old API, used in these slides)
    • org.apache.hadoop.mapreduce.* (the new API)
Mapper
  • Should implement (sketch below):

org.apache.hadoop.mapred.Mapper<K1,V1,K2,V2>

      • void configure(JobConf job)
        • All the parameters specified in the XML config files are available here
        • Any parameters set explicitly are also available
        • Called before data processing starts
      • void map(K1 key, V1 value, OutputCollector<K2,V2> output, Reporter reporter)
        • Called for each record; this is where data processing happens
      • void close()
        • Should close any open files, DB connections, etc.
      • Reporter provides extra information about the mapper to the TaskTracker (TT)
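A word-count style sketch of an old-API mapper; the class name and tokenizing logic are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void configure(JobConf job) {
    // called once before any map() call; job parameters are available here
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        output.collect(word, ONE);   // emit (word, 1) for each token
      }
    }
  }

  public void close() throws IOException {
    // close any files / DB connections opened in configure()
  }
}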
Reducer
  • Should implement (sketch below):

org.apache.hadoop.mapred.Reducer<K2,V2,K3,V3>

    • Sorts the incoming data by key and groups together all the values for a key
    • The reduce function is called for every key, in sorted order
      • void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter)
    • Reporter provides extra information about the reducer to the TaskTracker (TT)
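The matching word-count reducer sketch for the old API:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {       // all values for this key, grouped and sorted by key
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}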
Partitioner
  • implements Partitioner<K,V>
    • configure()
    • int getPartition( … )
      • 0 ≤ return value < number of reducers
  • Generally, implement Partitioner so that identical keys go to the same reducer (sketch below)
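A minimal Partitioner sketch for the word-count types above; it mirrors what the default HashPartitioner does, so identical keys always land on the same reducer:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class WordPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {
    // read any configuration needed to decide the partitioning
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // identical keys always map to the same reducer; result is in [0, numPartitions)
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}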
Reading and Writing
  • Generally two kinds of files in Hadoop
    • Text (plain, XML, HTML, …)
    • Binary (Sequence)
      • A Hadoop-specific binary file format that supports compression
      • Optimized for passing output from one MR job to another
    • We can also customize these formats
Input Format
  • HDFS block size
  • Input splits
Blocks in HDFS
  • A big file is divided into multiple blocks and stored in HDFS
  • This is a physical division of the data
  • dfs.block.size (64 MB by default)

[Diagram: LARGE FILE divided into BLOCK 1, BLOCK 2, BLOCK 3, BLOCK 4]
Input Splits and Records

LOGICAL DIVISION

  • Input split
    • A chunk of data processed by a single mapper
    • Further divided into records
    • The map function processes these records one at a time
      • Record = key + value
    • Analogy with a DB table
      • Group of rows → split
      • Row → record
InputSplit

public interface InputSplit extends Writable {
  long getLength() throws IOException;
  String[] getLocations() throws IOException;
}

  • It doesn’t contain the data
    • Only the locations where the data is present
    • Helps the JobTracker assign TaskTrackers close to the data (data locality)
  • getLength() → splits with greater length are processed first 
InputFormat
  • How the data gets to the mapper
    • The InputFormat takes care of creating InputSplits and dividing the splits into records

public interface InputFormat<K, V> {
  InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
  RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException;
}
InputFormat
  • Mapper
    • getRecordReader() is called to get a RecordReader
    • Once the record reader is obtained,
      • the map method is called repeatedly until the end of the split
RecordReader

K key = reader.createKey();
V value = reader.createValue();
while (reader.next(key, value)) {
  mapper.map(key, value, output, reporter);
}

Job Submission -- Retrospection
  • JobClient runs the job (driver sketch below)
    • Gets InputSplits by calling getSplits() on the InputFormat
    • Determines the data locations for the splits
    • Sends these locations to the JobTracker
    • The JobTracker assigns mappers appropriately
      • Data locality
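A minimal old-API driver sketch that ties the earlier mapper and reducer sketches together; paths come from the command line and the job name is illustrative. JobClient.runJob() performs the split calculation and submission described above:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("word-count");

    conf.setMapperClass(WordCountMapper.class);      // mapper sketched earlier
    conf.setReducerClass(WordCountReducer.class);    // reducer sketched earlier
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);   // computes splits, submits to the JobTracker, waits for completion
  }
}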
FileInputFormat
  • Base class for all InputFormat implementations that use files as input
  • Defines
    • Which files to include for the job
    • An implementation for generating splits
FileInputFormat
  • Set of files → converted into a number of splits
    • Splits only large files… HOW LARGE?
    • Larger than the block size
  • Can we control it?
Calculating Split Size
  • An application may impose a minimum split size greater than the block size
  • There is rarely a good reason to do that
    • Data locality is lost
FileInputFormat
  • Min split size
    • We might set it to be larger than the block size
    • But data locality may then be lost to some extent
  • Split size is calculated by the formula (worked example below)
    • max(minimumSize, min(maximumSize, blockSize))
    • By default
      • minimumSize < blockSize < maximumSize, so the split size equals the block size
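A small worked example of the formula, assuming a tiny minimum and an effectively unbounded maximum as the defaults (illustrative values):

// splitSize = max(minimumSize, min(maximumSize, blockSize))
long minimumSize = 1L;                    // default: tiny
long maximumSize = Long.MAX_VALUE;        // default: effectively unbounded
long blockSize   = 64L * 1024 * 1024;     // 64 MB

long splitSize = Math.max(minimumSize, Math.min(maximumSize, blockSize));            // = 64 MB (the block size)

// Forcing minimumSize up to 128 MB yields 128 MB splits, at the cost of some data locality
long forcedSplit = Math.max(128L * 1024 * 1024, Math.min(maximumSize, blockSize));   // = 128 MB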
File Information in the Mapper
  • Available from the JobConf passed to configure(JobConf job)
TextInputFormat
  • The default FileInputFormat
    • Each line is a value
    • The line's byte offset in the file is the key
  • Example
    • Run an identity mapper program (sketch below)
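A small driver sketch (paths and class name illustrative) that runs the built-in IdentityMapper over TextInputFormat; since it is a map-only job, the output shows the (byte offset, line) pairs the mapper received:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class IdentityDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(IdentityDriver.class);
    conf.setJobName("identity");

    conf.setInputFormat(TextInputFormat.class);    // key = byte offset, value = line
    conf.setMapperClass(IdentityMapper.class);     // passes (key, value) through unchanged
    conf.setNumReduceTasks(0);                     // map-only job, pairs written directly to output
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}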
Input Splits and HDFS Blocks
  • Logical records defined by FileInputFormat do not usually fit neatly into HDFS blocks
    • Every file is written as a sequence of bytes
    • When 64 MB is reached, a new block is started
    • At that point a logical record may be only half written
    • So the other half of the logical record goes into the next HDFS block
Input Splits and HDFS Blocks
  • So even with data locality, some remote reading is done: a slight overhead
    • Splits follow logical record boundaries
    • Blocks are physical boundaries (size)
Small Files
  • Files that are very small are inefficient in the mapper phase
  • Imagine 1 GB of input
    • 64 MB files → 16 files → 16 mappers
    • 100 KB files → roughly 10,000 files → 10,000 mappers 
CombineFileInputFormat
  • Packs many files into a single split
    • Data locality is taken into consideration
  • MR performs best when operating at the disk transfer rate, not the seek rate
  • It also helps when processing large files
NLineInputFormat
  • Same as TextInputFormat, except that
  • Each split is guaranteed to have N lines
  • N is controlled by mapred.line.input.format.linespermap (example below)
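Assuming a JobConf named conf as in the driver sketches earlier, switching to NLineInputFormat might look like this (10 lines per split is an illustrative value):

// each mapper's split gets exactly 10 input lines
conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class);
conf.setInt("mapred.line.input.format.linespermap", 10);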
KeyValueTextInputFormat
  • Each line in the text file is a record
  • The first separator character divides key and value
    • Default is ‘\t’
  • Controlling property (example below)
    • key.value.separator.in.input.line
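Again assuming a JobConf named conf, a sketch of using KeyValueTextInputFormat with a comma as the separator (the separator choice is illustrative):

// split each line at the first ',' into (key, value) instead of the default '\t'
conf.setInputFormat(org.apache.hadoop.mapred.KeyValueTextInputFormat.class);
conf.set("key.value.separator.in.input.line", ",");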
SequenceFileInputFormat<K,V>
  • InputFormat for reading sequence files
  • User-defined key K
  • User-defined value V
  • Sequence files are splittable
    • Well suited for MR
    • They support compression
    • They can store arbitrary types
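Assuming a JobConf named conf, reading sequence files is a one-line switch; the mapper then receives whatever key and value types were stored in the file:

// keys and values are deserialized with the types recorded in the sequence file itself
conf.setInputFormat(org.apache.hadoop.mapred.SequenceFileInputFormat.class);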
TextOutputFormat
  • Keys and values are written tab-separated by default
    • Controlled by the mapred.textoutputformat.separator parameter

The counterpart of KeyValueTextInputFormat.

  • The key or value can be suppressed by using NullWritable (example below)
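Assuming a JobConf named conf, a sketch of customizing TextOutputFormat and suppressing the key (the ',' separator is illustrative):

conf.setOutputFormat(org.apache.hadoop.mapred.TextOutputFormat.class);
conf.set("mapred.textoutputformat.separator", ",");   // use ',' instead of the default '\t'

// In the reducer, emitting a NullWritable key leaves only the value in the output:
//   output.collect(org.apache.hadoop.io.NullWritable.get(), value);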