MapReduce


Presentation Transcript


  1. MapReduce (presenter: 黃威凱, first-year master's student, Computer Science and Information Engineering)

  2. Outline • Purpose • Example • Method • Advanced

  3. Purpose

  4. Purpose • Data mining • Data processing

  5. Example

  6. Example • Find the maximum temperature for each year • Data source: National Climatic Data Center (NCDC) • The data is stored in a line-oriented ASCII format, in which each line is a record • There is a directory for each year from 1901 to 2001, each containing a gzipped file for each weather station with its readings for that year

  7. Example (Data format)

  8. Example (Gzipped files, example for 1990) • % ls raw/1990 | head • 010010-99999-1990.gz • 010014-99999-1990.gz • 010015-99999-1990.gz • 010016-99999-1990.gz • 010017-99999-1990.gz • 010030-99999-1990.gz • 010040-99999-1990.gz • 010080-99999-1990.gz • 010100-99999-1990.gz • 010150-99999-1990.gz

  9. Method

  10. Method • Analyzing the data with Unix tools • Analyzing the data with Hadoop

  11. Method (Unix tools)

  12. Method (Unix tools) • Here is the beginning of a run: • % ./max_temperature.sh • 1901 317 • 1902 244 • 1903 289 • 1904 256 • 1905 283 • ... • The complete run for the century took 42 minutes in a single run on one EC2 High-CPU Extra Large Instance.

  13. Method (Hadoop) • Use MapReduce • Map • Shuffle • Reduce

  14. Method (Hadoop) • Map function • Pull out the year and the air temperature from each record • Emit them as (year, temperature) key-value pairs
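The map step on this slide can be sketched in a few lines. This is a minimal Python illustration, not the Hadoop Java Mapper from the deck: the function name `map_record` and the tab-separated record layout are assumptions made for the example, and the real NCDC fixed-width format is not reproduced here.

```python
def map_record(line):
    # Map step: turn one raw record into a (year, temperature) pair.
    # Assumes a hypothetical simplified record "<year><TAB><temperature>";
    # this is NOT the real NCDC fixed-width layout.
    year, temperature = line.strip().split("\t")
    return int(year), int(temperature)


# Hypothetical input records, one reading per line.
records = ["1950\t0", "1950\t22", "1950\t-11", "1949\t111", "1949\t78"]

# Each record is mapped independently, which is what lets map tasks run in parallel.
pairs = [map_record(r) for r in records]
print(pairs)
```

Because each call depends only on its own input line, the records could be split across any number of map tasks without changing the resulting pairs.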

  15. Method (Hadoop) • Map function • The shuffle • Each reduce task is fed by many map tasks.
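What the shuffle hands to the reducers can be approximated as a group-by-key over all map output. The sketch below is plain Python, not Hadoop's implementation (which also partitions and sorts across machines); the function name `shuffle` and the sample pairs are illustrative assumptions.

```python
from collections import defaultdict


def shuffle(pairs):
    # Group every value under its key, then order the groups by key.
    # A single-process stand-in for the framework's shuffle phase.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


# Hypothetical output collected from several map tasks.
map_output = [(1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)]

reduce_input = shuffle(map_output)
print(reduce_input)  # [(1949, [111, 78]), (1950, [0, 22, -11])]
```

Each resulting group holds values that may have come from different map tasks, which is why one reduce task is fed by many map tasks.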

  16. Method (Hadoop) • Reduce function • Iterate through the list of readings for each year and pick the maximum • Input: • (1949, [111, 78]) • (1950, [0, 22, -11]) • Output: • (1949, 111) • (1950, 22)
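The reduce step above amounts to taking the maximum of each year's list. A minimal Python sketch using the slide's own input follows; the function name `reduce_max` is an assumption for illustration and this is not the Hadoop Java Reducer.

```python
def reduce_max(year, readings):
    # Reduce step: collapse all readings for one year into the maximum.
    return year, max(readings)


# Input taken from the slide: one (year, list of readings) pair per key.
reduce_input = [(1949, [111, 78]), (1950, [0, 22, -11])]

output = [reduce_max(year, readings) for year, readings in reduce_input]
print(output)  # [(1949, 111), (1950, 22)]
```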

  17. Method (Hadoop) • Data flow

  18. Method (Hadoop) • Java MapReduce: Mapper example

  19. Method (Hadoop) • Java MapReduce: Reducer example

  20. Method (Hadoop) • Java MapReduce: Job example • Supports multiple input paths

  21. Advanced

  22. Advanced • Case1

  23. Advanced • Case2

  24. Advanced • Case3

  25. Advanced • Combiner functions on map output • Example • Map 1 output: (1950, 0), (1950, 20), (1950, 10) • Map 2 output: (1950, 25), (1950, 15) • Values grouped per map: • Map 1: (1950, [0, 20, 10]) • Map 2: (1950, [25, 15]) • Reduce input without a combiner: • (1950, [0, 20, 10, 25, 15]) • Reduce input with a combiner: • (1950, [20, 25])
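The combiner example on this slide can be traced in code. The sketch below is plain Python under stated assumptions: `combine` is a hypothetical helper that keeps only the local maximum per year for one map task, and the grouping loop stands in for the shuffle; none of this is Hadoop's API.

```python
def combine(pairs):
    # Combiner: runs on ONE map task's output and keeps the local maximum per year.
    local_max = {}
    for year, temp in pairs:
        if year not in local_max or temp > local_max[year]:
            local_max[year] = temp
    return sorted(local_max.items())


# Map outputs from the slide.
map1 = [(1950, 0), (1950, 20), (1950, 10)]
map2 = [(1950, 25), (1950, 15)]

# Only one pair per year leaves each map task.
combined = combine(map1) + combine(map2)

# Stand-in for the shuffle: group the combined pairs by year.
reducer_input = {}
for year, temp in combined:
    reducer_input.setdefault(year, []).append(temp)

# The reducer still just takes the maximum of what it receives.
result = {year: max(temps) for year, temps in reducer_input.items()}
print(reducer_input)  # {1950: [20, 25]}
print(result)         # {1950: 25}
```

The final answer matches the run without a combiner because taking a maximum of partial maxima gives the overall maximum. A function such as the mean does not have this property, so it could not be used as a combiner in the same direct way.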
