MapReduce in Action

gigi

Presentation Transcript


  1. MapReduce in Action. Data Mining Group @ Xiamen University, College of Information Science and Technology. Team 306, led by Chen Lin

  2. Contents • 1. Basic MapReduce Programs • 2. Advanced MapReduce • 3. Beyond the Horizon • 4. Discussion

  3. Basic MapReduce Programs [Diagram: Job, Job Configuration, Master JobTracker]

  4. Basic MapReduce Programs: Job Configuration? • Implement interfaces (Java classes) • Configure the environment

  5. Interfaces • InputFormat • Mapper • Combiner • Partitioner • Reducer • OutputFormat

  6. Configure • JVM options: mapred.child.java.opts • Local directory: mapred.local.dir • InputPath and OutputPath • How many map/reduce tasks?

  7. Basic MapReduce Program [Data-flow diagram: Text, InputFormat, InputSplit, Map, Reduce, OutputFormat; type labels: <K1,V2>, List<K1,V1>, K1,List<V1>]

  8. Basic MapReduce

  9. PARTITIONERS AND COMBINERS • Combiners are an optimization in MapReduce that allows for local aggregation before the shuffle and sort phase. • The partitioner determines which reducer will be responsible for processing a particular key, and the execution framework uses this information to copy the data to the right location during the shuffle and sort phase.
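The partitioner's role described above can be sketched in plain Python. This is a minimal illustration, not Hadoop's API: the function names `partition` and `route` are hypothetical, and `zlib.crc32` stands in for whatever hash function a real partitioner would use.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Map a key to a reducer index in the range [0, num_reducers)."""
    # crc32 is used as a stable hash; Python's built-in hash() for str
    # is salted per process, so it would not be repeatable across runs.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route(pairs, num_reducers: int):
    """Distribute (key, value) pairs into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets
```

Because the reducer index depends only on the key, every value for a given key ends up in the same bucket, which is the property the shuffle and sort phase relies on.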

  10. Basic MapReduce Program: InputFormat • TextInputFormat • KeyValueTextInputFormat • SequenceFileInputFormat • NLineInputFormat • Creating a custom InputFormat

  11. InputFormat • TextInputFormat - Each line in the text files is a record. The key is the byte offset of the line, and the value is the content of the line. • KeyValueTextInputFormat - Each line in the text files is a record. The first separator character divides each line. Everything before the separator is the key, and everything after is the value. The separator is set by the key.value.separator.in.input.line property, and the default is the tab (\t) character. • NLineInputFormat - Same as TextInputFormat, but each split is guaranteed to have exactly N lines. The mapred.line.input.format.linespermap property, which defaults to one, sets N.
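How these three formats turn raw text into records can be illustrated with a small pure-Python sketch. It only mimics the behaviour described above and does not call Hadoop; the function names are hypothetical. Two details are assumptions: a line without a separator is treated as a key with an empty value, and the last split may hold fewer than N lines when the line count is not a multiple of N.

```python
def text_input_records(data: bytes):
    """TextInputFormat-style records: (byte offset of the line, line content)."""
    records, offset = [], 0
    for line in data.splitlines(keepends=True):
        records.append((offset, line.rstrip(b"\r\n").decode("utf-8")))
        offset += len(line)  # advance by the raw line length, newline included
    return records


def key_value_records(data: bytes, separator: str = "\t"):
    """KeyValueTextInputFormat-style records: split at the first separator."""
    records = []
    for line in data.decode("utf-8").splitlines():
        # Assumption: with no separator, the whole line is the key.
        key, _, value = line.partition(separator)
        records.append((key, value))
    return records


def n_line_splits(lines, n: int = 1):
    """NLineInputFormat-style splits: consecutive groups of n lines."""
    return [lines[i:i + n] for i in range(0, len(lines), n)]
```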

  12. Basic MapReduce Program: types for the key/value pairs

  13. Summary for basic programs: what is a complete MapReduce job? • Code for the mapper, reducer, combiner, and partitioner, along with job configuration parameters. • The execution framework handles everything else.

  14. Advanced MapReduce • Chaining MapReduce jobs • Local aggregation • Secondary sorting • Working on Hadoop files

  15. Chaining MapReduce jobs • So far, you have been doing data processing tasks that a single MapReduce job can accomplish. • But as you get more comfortable writing MapReduce programs and take on more ambitious data processing tasks, you will find that many complex tasks need to be broken down into simpler subtasks, each accomplished by an individual MapReduce job.
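A toy example of chaining, written as an in-memory simulation rather than real Hadoop code: the first job counts words, and the second job takes that output as its input and groups words by their count. The helper `run_job` and all mapper/reducer names are hypothetical and exist only for this sketch.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """One in-memory MapReduce pass: map, group by key, reduce in key order."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    output = []
    for key in sorted(groups):  # stands in for the shuffle and sort phase
        output.extend(reducer(key, groups[key]))
    return output


# Job 1: word count.
def tokenize(_offset, line):
    for word in line.split():
        yield word, 1


def sum_counts(word, counts):
    yield word, sum(counts)


# Job 2: group words by how often they occur.
def invert(word, count):
    yield count, word


def collect(count, words):
    yield count, sorted(words)


def word_frequency_index(lines):
    """Chain the two jobs: the output of job 1 is the input of job 2."""
    counts = run_job(enumerate(lines), tokenize, sum_counts)
    return run_job(counts, invert, collect)
```

In a real cluster the intermediate result would be written to files between the two jobs; here it is simply a Python list.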

  16. LOCAL AGGREGATION • In Hadoop, intermediate results are written to local disk before being sent over the network. • Reductions in the amount of intermediate data translate into increases in algorithmic efficiency. • With a combiner, it is possible to substantially reduce both the number and size of key-value pairs that need to be shuffled from the mappers to the reducers.

  17. Pseudo-code for computing the mean of values associated with the same string.

  18. LOCAL AGGREGATION: is it right?

  19. LOCAL AGGREGATION • 1. Combiners must have the same input and output key-value types. • 2. Combiners are optimizations that must not change the correctness of the algorithm. • Hadoop makes no guarantees on how many times combiners are called; it could be zero, one, or multiple times.

  20. LOCAL AGGREGATION: correct usage
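The slide figures for the mean computation are not part of this transcript, so the following is only a sketch of the usual argument, simulated in plain Python with hypothetical function names. A combiner that emits a local mean gives the wrong answer, because a mean of means is not the overall mean. A combiner that adds (sum, count) pairs has the same input and output type and gives the same result whether it runs or not.

```python
from collections import defaultdict


def mean(values):
    return sum(values) / len(values)


def wrong_pipeline(outputs_per_mapper):
    """Incorrect: each combiner emits a local mean; the reducer averages the means."""
    partial = defaultdict(list)
    for mapper_output in outputs_per_mapper:
        local = defaultdict(list)
        for key, value in mapper_output:
            local[key].append(value)
        for key, values in local.items():
            partial[key].append(mean(values))
    return {key: mean(means) for key, means in partial.items()}


def right_pipeline(outputs_per_mapper, use_combiner=True):
    """Correct: mappers emit (sum, count); combiner and reducer both add pairs."""
    partial = defaultdict(list)
    for mapper_output in outputs_per_mapper:
        pairs = [(key, (value, 1)) for key, value in mapper_output]
        if use_combiner:
            # Combiner: (key, (sum, count)) in, (key, (sum, count)) out.
            local = defaultdict(lambda: (0, 0))
            for key, (s, c) in pairs:
                ls, lc = local[key]
                local[key] = (ls + s, lc + c)
            pairs = list(local.items())
        for key, pair in pairs:
            partial[key].append(pair)
    # Reducer: divide the total sum by the total count.
    return {
        key: sum(s for s, _ in ps) / sum(c for _, c in ps)
        for key, ps in partial.items()
    }
```

With one mapper holding the values 1, 2, 3 and another holding 10 for the same key, the true mean is 4.0; the first pipeline returns 6.0, while the second returns 4.0 with or without the combiner.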

  21. SECONDARY SORTING • Sometimes we also need to sort by value. • (k1, m1, v8) • (k1, m2, v1) • (k1, m3, v7) • ... • (k2, m1, v2) • (k2, m2, v6) • (k2, m3, v9) • Instead of k1 → (m1, v8), emit (k1, m1) → (v8)
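The idea of moving part of the value into the key can be sketched as follows. This is an in-memory Python illustration under the assumption that the framework sorts by the full composite key while grouping (and partitioning) uses only the original key; the function name `secondary_sort` is hypothetical.

```python
from itertools import groupby


def secondary_sort(records):
    """records: iterable of (k, m, v). Returns {k: [(m, v), ...]} ordered by m."""
    # Move m into the key, so the pair becomes (k, m) -> v.
    composite = [((k, m), v) for k, m, v in records]
    # The framework's sort now orders by k first, then by m.
    composite.sort(key=lambda pair: pair[0])
    # Group on the original key k only, as a custom partitioner and
    # grouping comparator would do in a real job.
    result = {}
    for k, group in groupby(composite, key=lambda pair: pair[0][0]):
        result[k] = [(key[1], v) for key, v in group]
    return result
```

Each reducer call then sees the values for one k already ordered by m, without buffering and sorting them in memory itself.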

  22. Beyond the horizon • It's a shame. • The remaining topics play an important role in MapReduce, but they are beyond my horizon. • So I need all your help to master them together.

  23. Beyond the horizon • Create a custom InputFormat • Create a custom Partitioner • Manipulate local files • Streaming for other languages • Pipes for C++

  24. Beyond the horizon • Joining data from different sources • Hive • Pig • Multiple file output • HBase

  25. Joining data from different sources • Customers file: CSV format, record fields (Customer ID, Name, Phone Number) • Orders file: CSV format, record fields (Customer ID, Order ID, Price, Purchase Date)

  26. Joining data from different sources • Customers: Joey Leung,555-555-55; Edward,123-456-7890; Jose Madriz,281-330-8004; David Stork,408-555-0000; ... • Orders: A,12.95,02-Jun-2008; B,88.25,20-may-2008; C,32.00,30-Nov-2007; D,25.02,22-Jan-2009 • Joined result: Joey Leung,555-555-5555,B,88.25,20-May-2008; Edward,123-456-7890,C,32.00,30-Nov-2007; Jose Madriz,281-330-8004,A,12.95,02-Jun-2008; Jose Madriz,281-330-8004,D,25.02,22-Jan-2009
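One common way to express such a join in MapReduce terms is a reduce-side join: both files are keyed by Customer ID, and each reducer call pairs the customer record with its orders. The sketch below simulates that in plain Python. The function name is hypothetical, it assumes every line starts with the Customer ID field as listed for the two files, and it produces an inner join (customers without orders are dropped).

```python
from collections import defaultdict


def reduce_side_join(customers, orders):
    """Inner join of CSV lines on their first field (Customer ID)."""
    groups = defaultdict(lambda: {"customer": None, "orders": []})
    # Map phase: key every record by Customer ID and remember its source.
    for line in customers:
        cust_id, rest = line.split(",", 1)
        groups[cust_id]["customer"] = rest
    for line in orders:
        cust_id, rest = line.split(",", 1)
        groups[cust_id]["orders"].append(rest)
    # Reduce phase: for each Customer ID, pair the customer with each order.
    joined = []
    for cust_id in sorted(groups):
        group = groups[cust_id]
        if group["customer"] is None:
            continue  # order without a matching customer
        for order in group["orders"]:
            joined.append(f"{cust_id},{group['customer']},{order}")
    return joined
```

For example, with the made-up ID 2 assigned to `Jose Madriz,281-330-8004`, the orders `2,A,12.95,02-Jun-2008` and `2,D,25.02,22-Jan-2009` produce two joined lines for that customer.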

  27. Data Mining Group @ Xiamen University. Thank you!
