
Presentation Transcript


  1. http://www.excelonlineclasses.co.nr/ excel.onlineclasses@gmail.com http://www.excelonlineclasses.co.nr/

  2. Excel Online Classes offers following services: • Online Training • Development • Testing • Job support • Technical Guidance • Job Consultancy • Any needs of IT Sector

  3. Nagarjuna K MapReduce Anatomy

  4. AGENDA • Anatomy of MapReduce • MR work flow • Hadoop data types • Mapper • Reducer • Partitioner • Combiner • Input Split vs Block Size

  5. Anatomy of MR (Diagram: input data is split across Node 1 and Node 2; each node runs a Map task producing interim data; the interim data is partitioned and shuffled to Reduce tasks; each Reduce task writes its output to a storage node.)

  6. Hadoop data types • MR requires well-defined key and value types so that data can move across the cluster • Values  Writable • Keys  WritableComparable<T> • WritableComparable = Writable + Comparable<T>

  7. Frequently used key/value types

  8. Custom Writable • For any class to be a value, it has to implement org.apache.hadoop.io.Writable • write(DataOutput out) • readFields(DataInput in)
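As a sketch of the write/readFields contract, here is a hypothetical value class (PointWritable and its fields are invented for illustration). It uses only java.io, so it runs without Hadoop; in a real job it would declare implements org.apache.hadoop.io.Writable:

```java
import java.io.*;

// Hypothetical value class illustrating the Writable contract.
// Only java.io is used so the sketch is self-contained.
public class PointWritable {
    private int x;
    private int y;

    public PointWritable() { }                 // no-arg constructor needed for deserialization
    public PointWritable(int x, int y) { this.x = x; this.y = y; }

    // Serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeInt(x);
        out.writeInt(y);
    }

    // Read the fields back in exactly the same order they were written.
    public void readFields(DataInput in) throws IOException {
        x = in.readInt();
        y = in.readInt();
    }

    public int getX() { return x; }
    public int getY() { return y; }
}
```

The fixed field order matters: readFields must mirror write byte for byte, because Hadoop streams the raw bytes between nodes with no self-describing schema.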

  9. Custom key • For any class to be a key, it has to implement org.apache.hadoop.io.WritableComparable<T> • i.e., the Writable methods + compareTo(T o)
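A key adds compareTo on top of write/readFields. Here is a minimal sketch with an invented YearKey class, again in plain Java; the real class would implement org.apache.hadoop.io.WritableComparable<YearKey>:

```java
import java.io.*;

// Hypothetical key class: Writable-style serialization plus Comparable,
// mirroring WritableComparable<T> without a Hadoop dependency.
public class YearKey implements Comparable<YearKey> {
    private int year;

    public YearKey() { }
    public YearKey(int year) { this.year = year; }

    public void write(DataOutput out) throws IOException { out.writeInt(year); }
    public void readFields(DataInput in) throws IOException { year = in.readInt(); }

    // Keys need a total order so MapReduce can sort them before the reduce phase.
    @Override
    public int compareTo(YearKey other) {
        return Integer.compare(this.year, other.year);
    }

    public int getYear() { return year; }
}
```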

  10. Check out Writables • Check out a few of the Writables and WritableComparables • Time to write your own Writables

  11. MapReduce libraries • Two libraries in Hadoop • org.apache.hadoop.mapred.* • org.apache.hadoop.mapreduce.*

  12. Mapper • Should implement org.apache.hadoop.mapred.Mapper<K1,V1,K2,V2> • void configure(JobConf job) • All the parameters specified in the XMLs are available here • Any parameters explicitly set are also available • Called before data processing starts • void map(K1 key, V1 value, OutputCollector<K2,V2> output, Reporter reporter) • Called once per record; this is where data processing happens • void close() • Should close any files, DB connections etc. • Reporter passes extra information about the mapper to the TT (TaskTracker)
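The map() contract above can be sketched with the classic word-count mapper. OutputCollector and Reporter are replaced by a plain List so the sketch runs without Hadoop; here K1 = byte offset, V1 = line of text, K2 = word, V2 = count:

```java
import java.util.*;

// Hypothetical stand-in for a word-count mapper: emits (word, 1) for every
// word in the incoming line. In Hadoop, output.add(...) would be
// output.collect(new Text(word), new IntWritable(1)).
public class WordCountMapper {
    public static void map(long offset, String line,
                           List<Map.Entry<String, Integer>> output) {
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                output.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
    }
}
```

The byte-offset key is ignored here, which is typical: most text-processing mappers only care about the value.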

  13. Mappers – default

  14. Reducer • Should implement org.apache.hadoop.mapred.Reducer • Sorts the incoming data by key and groups together all the values for a key • The reduce function is called for every key, in sorted order • void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter) • Reporter passes extra information about the reducer to the TT

  15. Reducer – default

  16. Partitioner • implements Partitioner<K,V> • configure() • int getPartition( … ) • 0 <= return value < no. of reducers • Generally, implement Partitioner so that identical keys go to the same reducer
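The usual getPartition() body is a hash of the key modulo the number of reducers, which is what Hadoop's default HashPartitioner does. A plain-Java sketch (String key assumed for illustration):

```java
// Hash-based partitioning: the same key always maps to the same reducer,
// and the result is always in [0, numReducers).
public class HashPartition {
    public static int getPartition(String key, int numReducers) {
        // Mask the sign bit so a negative hashCode can't produce a negative index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```

Determinism is the point: because the partition depends only on the key, all values for one key land on one reducer, which is what makes the grouped reduce() call possible.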

  17. Reading and Writing • Generally two kinds of files in Hadoop • Text (plain, XML, HTML, …) • Binary (Sequence) • A Hadoop-specific compressed binary file format • Optimized for transferring output from one MR job to another • We can customize it

  18. Input Format • HDFS block size • Input splits

  19. Blocks in HDFS • A big file is divided into multiple blocks and stored in HDFS • This is a physical division of the data • dfs.block.size (64 MB default) (Diagram: a LARGE FILE divided into BLOCK 1, BLOCK 2, BLOCK 3, BLOCK 4)

  20. Input Splits and Records • Logical division of data • Input split: a chunk of data processed by one mapper • Further divided into records • Map processes these records • Record = key + value • DB-table analogy: group of rows  split; row  record

  21. InputSplit public interface InputSplit extends Writable { long getLength() throws IOException; String[] getLocations() throws IOException; } • It doesn't contain the data • Only the locations where the data is present • Helps the JobTracker assign TaskTrackers (data locality) • Splits with a greater getLength() are executed first

  22. InputFormat • How we get the data to the mapper • The InputFormat takes care of generating InputSplits and dividing each split into records public interface InputFormat<K, V> { InputSplit[] getSplits(JobConf job, int numSplits) throws IOException; RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException; }

  23. InputFormat • Mapper • getRecordReader() is called to get a RecordReader • Once the record reader is obtained, • the map method is called repeatedly until the end of the split

  24. RecordReader K key = reader.createKey(); V value = reader.createValue(); while (reader.next(key, value)) { mapper.map(key, value, output, reporter); }

  25. Job Submission – retrospection • The JobClient running the job • gets InputSplits by calling getSplits() on the InputFormat • determines data locations for the splits • sends these locations to the JobTracker • The JobTracker assigns mappers appropriately • Data locality

  26. In-built InputFormats

  27. FileInputFormat • Base class for all InputFormat implementations that use files as input • Defines • which files to include for the job • an implementation for generating splits

  28. FileInputFormat • A set of files  converted into a number of splits • Splits only large files…. HOW LARGE? • Larger than the block size • Can we control it?

  29. Calculating Split Size • An application may impose a minimum split size greater than the block size • There is usually no good reason to do that • Data locality is lost

  30. FileInputFormat • Min split size • We might set it larger than the block size • But data locality may then be lost to some extent • Split size is calculated by the formula • max(minimumSize, min(maximumSize, blockSize)) • By default • minimumSize < blockSize < maximumSize
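The split-size formula above is small enough to express directly in code, which makes the two cases easy to check:

```java
// splitSize = max(minimumSize, min(maximumSize, blockSize))
// With the defaults (minimumSize < blockSize < maximumSize) this
// collapses to blockSize, so one split per HDFS block.
public class SplitSize {
    public static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }
}
```

Raising the minimum above the block size forces splits larger than a block, which is exactly the case where a mapper must read part of its split from a remote node.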

  31. File Information in the mapper • configure(JobConf job)

  32. TextInputFormat • The default FileInputFormat • Each line is a value • The byte offset is the key • Example • Run the identity mapper program

  33. Input Splits and HDFS Blocks • Logical records defined by FileInputFormat don't usually fit neatly into HDFS blocks • Every file is written as a sequence of bytes • When 64 MB is reached, a new block starts • At that point a logical record may be only half written • So the other half of the logical record goes into the next HDFS block

  34. Input Splits and HDFS Blocks • So even with data locality, some remote reading is done… a slight overhead • Splits give logical record boundaries • Blocks give physical boundaries (size)

  35. Small Files • Files that are very small are inefficient in the mapper phase • Imagine 1 GB of data • 64 MB files – 16 files – 16 mappers • 100 KB files – ~10,000 files – ~10,000 mappers 

  36. CombineFileInputFormat • Packs many files into a single split • Data locality is taken into consideration • MR performs best when operating at disk transfer rate, not at seek rate • This helps in processing large files too

  37. NLineInputFormat • Same as TextInputFormat • but each split is guaranteed to have N lines • mapred.line.input.format.linespermap

  38. KeyValueTextInputFormat • Each line in the text file is a record • The first separator character divides key and value • Default is '\t' • Controlling property • key.value.separator.in.input.line
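The "first separator divides key and value" rule can be sketched in a few lines of plain Java (the class name and the no-separator fallback of an empty value are this sketch's assumptions about the behavior):

```java
// Sketch of how KeyValueTextInputFormat turns a line into a record:
// everything before the FIRST separator is the key, the rest is the value.
public class KeyValueSplit {
    public static String[] split(String line, char separator) {
        int i = line.indexOf(separator);
        if (i < 0) {
            return new String[] { line, "" };   // no separator: whole line is the key
        }
        return new String[] { line.substring(0, i), line.substring(i + 1) };
    }
}
```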

  39. SequenceFileInputFormat<K,V> • InputFormat for reading sequence files • User-defined key K • User-defined value V • They are splittable files • Well suited for MR • They support compression • They can store arbitrary types

  40. OutputFormat

  41. TextOutputFormat • Keys and values are stored tab-separated by default • mapred.textoutputformat.separator parameter • Counterpart of KeyValueTextInputFormat • Can suppress the key or value by using NullWritable
