
15-826: Multimedia Databases and Data Mining

15-826: Multimedia Databases and Data Mining. Extra: intro to Hadoop, by C. Faloutsos. Resources: software (http://hadoop.apache.org/); map/reduce paper [Dean & Ghemawat] (http://labs.google.com/papers/mapreduce.html); tutorial: see part 1, foils #9-20.


Presentation Transcript


  1. 15-826: Multimedia Databases and Data Mining. Extra: intro to Hadoop. C. Faloutsos

  2. Resources
     • Software: http://hadoop.apache.org/
     • Map/reduce paper [Dean & Ghemawat]: http://labs.google.com/papers/mapreduce.html
     • Tutorial (see part 1, foils #9-20): videolectures.net/kdd2010_papadimitriou_sun_yan_lsdm/
     (c) 2012 C. Faloutsos

  3. Motivation: Scalability
     • Google: > 450,000 processors in clusters of ~2000 processors each [Barroso, Dean, Hölzle, “Web Search for a Planet: The Google Cluster Architecture”, IEEE Micro 2003]
     • Yahoo: 5 PB of data [Fayyad, KDD’07]
     • Problem: machine failures, on a daily basis
     • How to parallelize data mining tasks, then?
     • A: map/reduce; Hadoop is the open-source clone (http://hadoop.apache.org/)

  4. 2-minute intro to Hadoop
     • master-slave architecture; n-way replication (default n=3)
     • ‘group by’ of SQL (in a parallel, fault-tolerant way)
     • e.g., find the histogram of word frequency: compute local histograms, then merge them into a global histogram
     • select course-id, count(*) from ENROLLMENT group by course-id

  5. 2-minute intro to Hadoop
     • master-slave architecture; n-way replication (default n=3)
     • ‘group by’ of SQL (in a parallel, fault-tolerant way)
     • e.g., find the histogram of word frequency: compute local histograms (map), then merge them into a global histogram (reduce)
     • select course-id, count(*) from ENROLLMENT group by course-id
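The "local histograms, then merge" idea on the slide can be sketched in plain Python, without Hadoop. This is only an illustration of the group-by/count pattern: the ENROLLMENT rows, the student names, and the way the rows are divided into splits are hypothetical, and the function names are not part of any Hadoop API.

```python
from collections import Counter

def local_histogram(split):
    """Map step: count course ids within one split of the ENROLLMENT rows."""
    return Counter(course_id for course_id, _student in split)

def merge_histograms(histograms):
    """Reduce step: sum the per-split counts into one global histogram."""
    total = Counter()
    for histogram in histograms:
        total.update(histogram)  # Counter.update adds counts
    return total

# Hypothetical ENROLLMENT rows (course_id, student), already divided into two splits.
splits = [
    [("15-826", "alice"), ("15-415", "bob")],
    [("15-826", "carol"), ("15-826", "dave")],
]

global_histogram = merge_histograms(local_histogram(s) for s in splits)
print(dict(global_histogram))  # {'15-826': 3, '15-415': 1}
```

Each call to `local_histogram` depends only on its own split, which is what lets the map step run on separate machines; only the small per-split counts need to be brought together for the merge.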

  6. Details
     [Figure: MapReduce execution. The User Program forks the Master and the workers; the Master assigns map and reduce tasks. Mappers read Split0, Split1, Split2 of the InputData (on HDFS) and write their output locally; Reducers do a remote read and sort, then write Output File0 and Output File1.]
     • By default: 3-way replication
     • Late/dead machines: ignored, transparently (!)

  7. More details (thanks to U Kang for the animations)

  8. Background: MapReduce
     MapReduce/Hadoop framework
     • HDFS: fault-tolerant, scalable, distributed storage system
     • Mapper (Map 0, Map 1, Map 2): reads data from HDFS, outputs (k,v) pairs
     • Shuffle: output sorted by the key
     • Reducer (Reduce 0, Reduce 1): reads the output from the mappers, outputs new (k,v) pairs to HDFS
     • Programmers need to provide only the map() and reduce() functions
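To make the map, shuffle, reduce pipeline above concrete, here is a minimal single-process sketch in Python. It is not Hadoop's API: `run_mapreduce`, `map_word_length`, and `reduce_count` are hypothetical names, the input list stands in for data read from HDFS, and the shuffle is modelled as an in-memory sort by key.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, shuffle (sort by key), and reduce in a single process."""
    # Map phase: every input record may emit any number of (key, value) pairs.
    intermediate = [pair for record in records for pair in map_fn(record)]
    # Shuffle phase: sorting by key makes equal keys adjacent.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: each key is handed to reduce_fn with all of its values.
    results = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = [value for _key, value in group]
        results.extend(reduce_fn(key, values))
    return results

# User-provided functions: count how many words have each length.
def map_word_length(word):
    yield (len(word), 1)

def reduce_count(key, values):
    yield (key, sum(values))

words = ["map", "reduce", "hdfs", "sort", "key"]
print(run_mapreduce(words, map_word_length, reduce_count))  # [(3, 2), (4, 2), (6, 1)]
```

Only `map_word_length` and `reduce_count` are specific to the task; the driver is generic, which mirrors the slide's point that programmers supply just map() and reduce().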

  9. Background: MapReduce
     Example: histogram of fruit names
     • Map (Map 0, Map 1, Map 2), reading from HDFS:
         map( fruit ) { output(fruit, 1); }
       emits (apple, 1) (apple, 1) (strawberry, 1) (orange, 1)
     • Shuffle
     • Reduce (Reduce 0, Reduce 1), writing to HDFS:
         reduce( fruit, v[1..n] ) { sum = 0; for(i=1; i <= n; i++) sum = sum + v[i]; output(fruit, sum); }
       emits (orange, 1) (strawberry, 1) (apple, 2)
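The fruit-histogram pseudocode translates almost line for line into runnable Python. The sketch below assumes the input is the four fruit names implied by the mapper output on the slide; `fruit_histogram` is a hypothetical stand-in for the shuffle that Hadoop would perform between the two user functions.

```python
from collections import defaultdict

def map_fruit(fruit):
    """map(fruit): emit (fruit, 1) for every input record."""
    yield (fruit, 1)

def reduce_fruit(fruit, values):
    """reduce(fruit, v[1..n]): add up the ones emitted for this fruit."""
    total = 0
    for value in values:
        total += value
    yield (fruit, total)

def fruit_histogram(fruits):
    """Stand-in for the framework: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for fruit in fruits:
        for key, value in map_fruit(fruit):
            grouped[key].append(value)
    return dict(
        pair
        for key in sorted(grouped)
        for pair in reduce_fruit(key, grouped[key])
    )

fruits = ["apple", "apple", "strawberry", "orange"]
print(fruit_histogram(fruits))  # {'apple': 2, 'orange': 1, 'strawberry': 1}
```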

  10. Conclusions
      • Convenient mechanism to process tera- and petabytes of data on hundreds of machines or more
      • User provides map() and reduce() functions
      • Hadoop handles the rest (e.g., machine failures)
