Distributed MapReduce Team B

Presentation Transcript


  1. Distributed MapReduce Team B. Presented by: • Christian Bryan • Matthew Dailey • Greg Opperman • Nate Piper • Brett Ponsler • Samuel Song • Alex Ostapenko • Keilin Bickar

  2. Introduction

  3. Functional Languages

  4. What makes MapReduce Special? • The Map function comes from Lisp (McCarthy et al., 1958) • The Reduce function: the paper's example is summing up occurrences • The combination? Behind-the-scenes action: one user to n computers, where the user's only insight into n is the speed at which the computation completes.
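  As a hedged illustration of those two building blocks, in the same Scheme used on the next slide (the word list is made up, and reduce is written out here as a simple fold rather than taken from any library):

     ;; reduce: collapse a list with a binary operation and a starting value
     (define (reduce op init lst)
       (if (null? lst)
           init
           (op (car lst) (reduce op init (cdr lst)))))

     (define words (list "the" "map" "the" "reduce" "the"))
     ;; map: emit a 1 for every occurrence of "the", 0 otherwise
     (define ones (map (lambda (w) (if (equal? w "the") 1 0)) words))
     ;; reduce: sum the emitted counts
     (reduce + 0 ones)   ; => 3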

  5. Example of an abstraction

     ;; myL is not defined on the slide; the results below imply
     (define myL (list 1 2 3 4 5 6 7 8 9 10 11 12))

     ;; hand-written recursion: keep even elements, double odd ones
     (define appendEven
       (lambda (x)
         (cond ((empty? x) empty)
               ((= 0 (remainder (car x) 2))
                (cons (car x) (appendEven (cdr x))))
               (else
                (cons (* 2 (car x)) (appendEven (cdr x)))))))

     ;; the same per-element logic, factored out so map can apply it
     (define appendEvenMap
       (lambda (x)
         (cond ((= 0 (remainder x 2)) x)
               (else (* 2 x)))))

     (appendEven myL)          ; => (list 2 2 6 4 10 6 14 8 18 10 22 12)
     (map appendEvenMap myL)   ; => (list 2 2 6 4 10 6 14 8 18 10 22 12)

  6. SCALABLE???

  7. Goals of a Distributed System • Transparency • Scalability • More fault tolerant than a standalone system. Problems when scaling: • Monotonicity: statements can't be retracted • Which computer is correct? • Many points of failure

  8. Naturally Distributable

  9. The 'map' and 'reduce' functions themselves. • 'map' takes in a function and a set of data. • That set of data is partitioned and ready to go. • Function + Data = Convenient. Why?
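  A rough sketch of why: once the data is split into partitions, the same function can be applied to each partition with no knowledge of the others, so each application could run on a different machine. The chunking helpers below are illustrative only, not part of any MapReduce library:

     ;; split a list into pieces of at most n elements
     (define (chunk lst n)
       (if (null? lst)
           '()
           (cons (first-n lst n) (chunk (drop-n lst n) n))))
     (define (first-n lst n)
       (if (or (null? lst) (= n 0))
           '()
           (cons (car lst) (first-n (cdr lst) (- n 1)))))
     (define (drop-n lst n)
       (if (or (null? lst) (= n 0))
           lst
           (drop-n (cdr lst) (- n 1))))

     ;; made-up data, partitioned ahead of time
     (define partitions (chunk (list 1 2 3 4 5 6 7 8) 3))
     ;; apply the same function to every partition independently
     (map (lambda (part) (map (lambda (x) (* x x)) part)) partitions)
     ;; => (list (list 1 4 9) (list 16 25 36) (list 49 64))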

  10. More... • 'reduce' is less convenient. • It takes in an operation and a dataset. • GFS helps out a lot.
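  One hedged sketch of why reduce is the less convenient half: before a reduce can run, every intermediate value for a given key must be gathered in one place (in the real system, the shuffle over local disks and GFS does this). The grouping helper below is illustrative only:

     ;; intermediate (key . value) pairs, as several map tasks might emit them
     (define pairs (list (cons "the" 1) (cons "map" 1) (cons "the" 1)))

     ;; group values by key before any reduce can be applied
     (define (group-by-key ps)
       (if (null? ps)
           '()
           (let* ((key  (car (car ps)))
                  (same (filter (lambda (p) (equal? (car p) key)) ps))
                  (rest (filter (lambda (p) (not (equal? (car p) key))) ps)))
             (cons (cons key (map cdr same))
                   (group-by-key rest)))))

     (group-by-key pairs)   ; => (list (list "the" 1 1) (list "map" 1))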

  11. Distributing Map and Reduce • User writes a Map function: (k1,v1) → list(k2,v2) • Next, the user writes a Reduce function: (k2,list(v2)) → list(v2) • A specification file defines inputs, outputs, and tuning parameters • It is passed to the MapReduce function • The MapReduce library handles the rest!
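  A minimal sketch of those two signatures for word counting, hedged: these are user-level functions only, the names are invented here, and nothing below is Google's actual API. k1 is a document name, v1 its list of words, k2 a word, and v2 a count:

     ;; Map: (k1, v1) -> list(k2, v2); emit ("word" . 1) per occurrence
     (define (word-count-map docname contents)
       (map (lambda (w) (cons w 1)) contents))

     ;; Reduce: (k2, list(v2)) -> list(v2); sum the counts for one word
     (define (word-count-reduce word counts)
       (list (apply + counts)))

     (word-count-map "doc1" (list "the" "map" "the"))
     ;; => (list (cons "the" 1) (cons "map" 1) (cons "the" 1))
     (word-count-reduce "the" (list 1 1))
     ;; => (list 2)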

  12. Productivity Improvements • Programmers no longer have to program for the network • A simplified library makes a program distributed and can be reused • Programmers can focus on the problem instead of its distributed implementation • Quote from Google: "Fun to use" • Programmers having fun are more productive

  13. MapReduce Performance

  14. Measured Performance • ~1800 machines with 2 GHz processors and 4 GB of RAM were used • First test task: search through ~1 TB of data for a particular pattern • Second test task: sort ~1 TB of data

  15. Test 1 (searching) • Input split into 64 MB pieces • Machines are assigned work until all are busy, at about 55 seconds in • Sources of delay: startup, opening files, the locality optimization
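  Rough arithmetic on the slide's own figures (an approximation, not a number taken from the paper): splitting ~1 TB into 64 MB pieces yields on the order of fifteen thousand map tasks, several per machine across ~1800 workers.

     ;; approximate split count: 1 TB / 64 MB
     (/ (* 1024 1024) 64)                ; => 16384 pieces, roughly
     ;; spread over ~1800 machines
     (exact->inexact (/ 16384 1800))     ; => about 9 map tasks per machine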

  16. Test 2 (sorting)

  17. Criticism from the Database Systems Community • It reuses very old concepts • Poor implementation (no use of indices) • Limited set of features (no notion of views)

  18. Fault Tolerance

  19. Worker Failure • The master pings workers periodically • A worker "fails" if it does not respond within a certain amount of time • All map tasks completed or in progress on that worker are reset to the idle state • They become eligible for rescheduling

  20. Worker Failure • Completed reduce tasks are not reset, because their output is stored in a global file system rather than locally on the failed machine • All workers are notified of the changes in worker assignments • The result is resilience to large-scale worker failure
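  A minimal sketch of the bookkeeping just described, assuming a made-up task representation (a task here is simply (list id kind state worker)); none of these names come from the paper or Google's implementation:

     (define (task-kind t)   (cadr t))
     (define (task-state t)  (caddr t))
     (define (task-worker t) (cadddr t))

     ;; after a worker fails, return the task list with the affected tasks idle
     (define (reset-tasks tasks failed-worker)
       (map (lambda (t)
              (cond
                ;; map tasks on the failed machine go back to idle, even if
                ;; completed, because their output lived on its local disk
                ((and (eq? (task-worker t) failed-worker)
                      (eq? (task-kind t) 'map))
                 (list (car t) 'map 'idle #f))
                ;; in-progress reduce tasks are rescheduled as well
                ((and (eq? (task-worker t) failed-worker)
                      (eq? (task-kind t) 'reduce)
                      (eq? (task-state t) 'in-progress))
                 (list (car t) 'reduce 'idle #f))
                ;; completed reduce tasks keep their state: output is in GFS
                (else t)))
            tasks))

     (reset-tasks (list (list 1 'map 'completed 'worker-3)
                        (list 2 'reduce 'completed 'worker-3))
                  'worker-3)
     ;; => (list (list 1 'map 'idle #f) (list 2 'reduce 'completed 'worker-3))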

  21. Master Failure • The master writes periodic checkpoints • Upon failure, a new copy can be started from the last checkpoint • Failure of the single master is unlikely • The current implementation simply aborts the computation upon master failure

  22. Google Cluster Configuration • Large clusters of commodity PCs connected together with switched Ethernet • Typically dual-processor x86 machines running Linux, with 2-4 GB of memory • Inexpensive IDE disks attached directly to individual machines • Commodity networking hardware, typically either 100 megabits/second or 1 gigabit/second at the machine level

  23. Google Cluster Operation • Users submit jobs to a scheduling system; each job consists of a set of tasks and is mapped by the scheduler to a set of available machines within a cluster • A distributed file system (GFS) is used to manage the data stored on the disks • GFS uses replication to provide availability and reliability on top of unreliable hardware

  24. Networking

  25. Cost Efficiency • As of April 2004, Google had spent about $250 million on hardware • This includes equipment other than CPUs, such as routers and firewalls • Approximately: • 63,272 machines • 126,554 CPUs • 253,088 GHz of processing power • 126,544 GB of RAM • 5,062 TB of hard drive space • About 253 teraflops (trillion floating-point operations per second)

  26. Cost Efficiency • As of January 2005: NEC's Earth Simulator supercomputer in Japan • $250 million • 41 teraflops • Much more expensive per teraflop than a large cluster of personal computers

  27. Cost Efficiency • In 2003, Virginia Tech built a cluster of 1,100 Apple computers • Cost: $5 million • 10 teraflops • 3rd most powerful at the time • A traditional supercomputer would have cost much more
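  Putting the three price tags above on a common footing, using only the figures from the last three slides (rough division, nothing more):

     ;; dollars per teraflop
     (exact->inexact (/ 250000000 253))   ; Google cluster   => ~$0.99 million
     (exact->inexact (/ 250000000 41))    ; Earth Simulator  => ~$6.1 million
     (exact->inexact (/ 5000000 10))      ; Virginia Tech    => $0.5 million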

  28. Cost Efficiency • Disadvantages: • must deal with limited network bandwidth • must constantly monitor for hardware failure

  29. Conclusion

  30. Questions?
