Presentation Transcript


  1. http://www.excelonlineclasses.co.nr/ excel.onlineclasses@gmail.com

  2. Excel Online Classes offers the following services: • Online Training • Development • Testing • Job Support • Technical Guidance • Job Consultancy • Any IT-sector needs

  3. Nagarjuna K MapReduce Anatomy

  4. AGENDA • Anatomy of MapReduce • MR workflow • Hadoop data types • Mapper • Reducer • Partitioner • Combiner • Input Split vs Block Size

  5. Anatomy of MR • [Diagram: input data spread across nodes → Map tasks on each node produce interim data → partitioning and shuffling → Reduce tasks → output written to storage nodes]

  6. Hadoop data types • MR requires keys and values to be of defined types so they can move across the cluster • Values → Writable • Keys → WritableComparable<T> • WritableComparable = Writable + Comparable<T>

  7. Frequently used key/value

  8. Custom Writable • For any class to be a value, it has to implement org.apache.hadoop.io.Writable • write(DataOutput out) • readFields(DataInput in) • See the sketch below
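A minimal sketch of a custom value class implementing Writable; the PointWritable name and its fields are illustrative, not from the slides:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical value class: a point with x and y coordinates.
public class PointWritable implements Writable {
    private int x;
    private int y;

    // Hadoop needs a no-arg constructor to create instances via reflection.
    public PointWritable() {}

    public PointWritable(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize the fields in a fixed order.
        out.writeInt(x);
        out.writeInt(y);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialize in the same order the fields were written.
        x = in.readInt();
        y = in.readInt();
    }
}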

  9. Custom key • For any class to be a key, it has to implement org.apache.hadoop.io.WritableComparable<T>, i.e. the Writable methods plus • compareTo(T o) • See the sketch below
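A minimal sketch of a custom key class implementing WritableComparable<T>; the YearTempKey name and its ordering are illustrative. A real key should normally also override hashCode() and equals() so the default partitioner behaves consistently:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key class: a (year, temperature) pair sorted by year, then temperature.
public class YearTempKey implements WritableComparable<YearTempKey> {
    private int year;
    private int temperature;

    public YearTempKey() {}

    public YearTempKey(int year, int temperature) {
        this.year = year;
        this.temperature = temperature;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeInt(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();
        temperature = in.readInt();
    }

    @Override
    public int compareTo(YearTempKey other) {
        // The framework uses this method to sort keys during the shuffle/sort phase.
        int cmp = Integer.compare(year, other.year);
        return cmp != 0 ? cmp : Integer.compare(temperature, other.temperature);
    }
}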

  10. Checkout Writables • Check out a few of the Writables and WritableComparables • Time to write your own Writables

  11. MapReduce libraries • Two libraries in Hadoop • org.apache.hadoop.mapred.* • org.apache.hadoop.mapreduce.*

  12. Mapper • Should implement org.apache.hadoop.mapred.Mapper<K1,V1,K2,V2> • void configure(JobConf job) • All the parameters specified in the XMLs are available here • Any parameters set explicitly are also available • Called before data processing starts • void map(K1 key, V1 value, OutputCollector<K2,V2> output, Reporter reporter) • Data processing happens here • void close() • Should close any files, DB connections etc. • Reporter passes extra information from the mapper to the TaskTracker • See the sketch below
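A minimal word-count mapper sketch against the old org.apache.hadoop.mapred API; the job parameter wordcount.lowercase and the counter names are hypothetical:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// K1 = LongWritable (byte offset), V1 = Text (line), K2 = Text (word), V2 = IntWritable (1).
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    private boolean toLowerCase;

    @Override
    public void configure(JobConf job) {
        // Called once before any records are processed; job parameters are available here.
        toLowerCase = job.getBoolean("wordcount.lowercase", false); // hypothetical parameter
    }

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = toLowerCase ? value.toString().toLowerCase() : value.toString();
        for (String token : line.split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            output.collect(word, ONE);
            // Extra information passed back to the TaskTracker.
            reporter.incrCounter("WordCount", "TOKENS", 1);
        }
    }

    @Override
    public void close() throws IOException {
        // Release any files or connections opened in configure().
    }
}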

  13. Mappers -default

  14. Reducer • Should implement org.apache.hadoop.mapred.Reducer • Sorts the incoming data based on key and groups together all the values for a key • The reduce function is called for every key in the sorted order • void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter) • Reporter passes extra information from the reducer to the TaskTracker • See the sketch below
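A matching word-count reducer sketch for the old API; it simply sums the counts the framework has already sorted and grouped per key:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical word-count reducer: sums all the 1s emitted for each word.
public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        // The framework has already grouped all values for this key.
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}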

  15. Reducer -default

  16. Partitioner • Implements Partitioner<K,V> • configure() • int getPartition(...) • Returns a value in the range [0, number of reducers) • Generally, implement Partitioner so that the same keys go to one reducer • See the sketch below
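A minimal partitioner sketch; routing by the first character of the key is only an illustration, the real requirement being a stable return value in [0, numPartitions):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Hypothetical partitioner: routes keys by their first character so that
// the same key always lands on the same reducer.
public class FirstCharPartitioner implements Partitioner<Text, IntWritable> {

    @Override
    public void configure(JobConf job) {
        // Read any job parameters needed for partitioning here.
    }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Must return a value in [0, numPartitions).
        String s = key.toString();
        char first = s.isEmpty() ? ' ' : s.charAt(0);
        return (first & Integer.MAX_VALUE) % numPartitions;
    }
}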

  17. Reading and Writing • Generally two kinds of files in Hadoop • Text (plain, XML, HTML ...) • Binary (Sequence) • A Hadoop-specific compressed binary file format • Optimized for passing output from one MR job to another • We can customize it • See the sketch below
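A short sketch of writing a SequenceFile; the output path and the key/value classes are chosen only for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Writes a few key/value pairs into a SequenceFile.
public class SequenceFileWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/demo.seq"); // hypothetical output path

        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(fs, conf, path, IntWritable.class, Text.class);
            IntWritable key = new IntWritable();
            Text value = new Text();
            for (int i = 0; i < 5; i++) {
                key.set(i);
                value.set("record-" + i);
                writer.append(key, value); // key/value classes must match those given above
            }
        } finally {
            if (writer != null) {
                writer.close();
            }
        }
    }
}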

  18. Input Format • HDFS block size • Input splits

  19. Blocks in HDFS • A big file is divided into multiple blocks and stored in HDFS • This is a physical division of data • dfs.block.size (64 MB by default) • [Diagram: LARGE FILE divided into BLOCK 1, BLOCK 2, BLOCK 3, BLOCK 4]

  20. Input Splits and Records • LOGICAL DIVISION • Input split • A chunk of data processed by a mapper • Further divided into records • Map processes these records • Record = key + value • How to correlate to a DB table • Group of rows → split • Row → record

  21. InputSplit public interface InputSplit extends Writable { long getLength() throws IOException; String[] getLocations() throws IOException; } • It doesn't contain the data • Only the locations where the data is present • Helps the JobTracker arrange TaskTrackers (data locality) • getLength: the split with the greater length is executed first

  22. InputFormat • How we get the data to the mapper • InputSplits, and how the splits are divided into records, are taken care of by the InputFormat. public interface InputFormat<K, V> { InputSplit[] getSplits(JobConf job, int numSplits) throws IOException; RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException; }

  23. RecordReader K key = reader.createKey(); V value = reader.createValue(); while (reader.next(key, value)) { mapper.map(key, value, output, reporter); }

  24. FileInputFormat • Base class for all implementations of InputFormat that use files as input • Defines • Which files to include for the job • The implementation for generating splits

  25. FileInputFormat • Set of files → number of splits • Splits only large files.... HOW LARGE? • Larger than the block size • Can we control it?

  26. FileInputFormat • Min split size • We might set it to be larger than the block size • But the concept of data locality may then be lost to some extent • Split size is calculated by the formula • max(minimumSize, min(maximumSize, blockSize)) • By default • minimumSize < blockSize < maximumSize • See the worked example below
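A quick worked example of the split-size formula; the numbers are only illustrative (64 MB block size, with the usual defaults of a 1-byte minimum and Long.MAX_VALUE maximum):

// Sketch of the split-size calculation, not the actual FileInputFormat code.
public class SplitSizeDemo {
    static long computeSplitSize(long minimumSize, long maximumSize, long blockSize) {
        return Math.max(minimumSize, Math.min(maximumSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;

        // Defaults: the split size equals the block size (64 MB).
        System.out.println(computeSplitSize(1, Long.MAX_VALUE, blockSize));

        // Raising the minimum split size above the block size forces larger splits,
        // at the cost of some data locality.
        System.out.println(computeSplitSize(128L * 1024 * 1024, Long.MAX_VALUE, blockSize));
    }
}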

  27. Calculating Split Size

  28. File Information in the mapper • configure(JobConf job)

  29. TextInputFormat • Default FileInputFormat • Each line is a value • The byte offset is the key • Example • Run an identity mapper program (see the sketch below)
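A minimal driver sketch that runs the built-in identity mapper over TextInputFormat using the old API, so the output shows the byte-offset keys and line values directly; the input and output paths are hypothetical:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Passes TextInputFormat records straight through to the output.
public class IdentityJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(IdentityJob.class);
        conf.setJobName("identity");

        conf.setInputFormat(TextInputFormat.class);   // the default, shown explicitly
        conf.setOutputFormat(TextOutputFormat.class);

        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);

        conf.setOutputKeyClass(LongWritable.class);   // byte offset of each line
        conf.setOutputValueClass(Text.class);         // the line itself

        FileInputFormat.setInputPaths(conf, new Path("/tmp/in"));    // hypothetical
        FileOutputFormat.setOutputPath(conf, new Path("/tmp/out"));  // hypothetical

        JobClient.runJob(conf);
    }
}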

  30. Input Splits and HDFS Blocks • Logical records defined by FileInputFormat don't usually fit neatly into HDFS blocks • Every file is written as a sequence of bytes • When 64 MB is reached, a new block starts • At that point a logical record may be only half written • So the other half of the logical record goes into the next HDFS block

  31. Input Splits and HDFS Blocks • So even with data locality, some remote reading is done: a slight overhead • Splits give logical record boundaries • Blocks give physical boundaries (size)

  32. Other default InputFormats
