
Introduction to Data Center Computing

Derek Murray, October 2010


Presentation Transcript


  1. Introduction to Data Center Computing Derek Murray October 2010

  2. What we’ll cover • Techniques for handling “big data” • Distributed storage • Distributed computation • Focus on recent papers describing real systems

  3. Example: web search Crawling Indexing Querying WWW

  4. A system architecture? Network Computers Storage

  5. Data Center architecture Core switch Rack switch Rack switch Rack switch Server Server Server Server Server Server Server Server Server

  6. Distributed storage • High volume of data • High volume of read/write requests • Fault tolerance

  7. Brewer’s CAP theorem (2000)

  8. The Google file system (2003) GFS Master Client Chunk server Chunk server Chunk server
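
The master/client/chunk server split on this slide can be illustrated with a toy read path. This is a minimal sketch, not the GFS protocol: `ToyMaster`, `register`, `locate` and `plan_read` are hypothetical names, and the only detail taken from the GFS paper is the fixed 64 MB chunk size. The point it shows is that the client turns a byte offset into a chunk index, asks the master only for metadata (which servers hold that chunk), and would then fetch the data from a chunk server directly.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as reported in the GFS paper


class ToyMaster:
    """Hypothetical in-memory stand-in for the master's chunk-location table."""

    def __init__(self):
        # (filename, chunk_index) -> list of chunk server names holding a replica
        self._chunks = {}

    def register(self, filename, chunk_index, servers):
        self._chunks[(filename, chunk_index)] = list(servers)

    def locate(self, filename, chunk_index):
        return self._chunks[(filename, chunk_index)]


def plan_read(master, filename, offset):
    """Return (chunk index, replica servers, offset within the chunk)."""
    chunk_index = offset // CHUNK_SIZE          # which chunk holds this byte
    servers = master.locate(filename, chunk_index)  # metadata-only request
    return chunk_index, servers, offset % CHUNK_SIZE
```

In this sketch the master never touches file data, which is the design choice that keeps a single master from becoming a bandwidth bottleneck.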

  9. Dynamo (2007) Client
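
The slide shows only the client, but the Dynamo paper describes placing keys on a ring with consistent hashing and replicating each key on the next N nodes. The following is a minimal sketch of that idea under simplifying assumptions: `HashRing` and `preference_list` are hypothetical names, each node gets a single position on the ring (no virtual nodes), and failure handling and quorums are left out.

```python
import bisect
import hashlib


def _position(key):
    """Map a string to a point on the ring using MD5, treated as a large integer."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hashing ring with one position per node."""

    def __init__(self, nodes):
        self._ring = sorted((_position(node), node) for node in nodes)
        self._positions = [pos for pos, _ in self._ring]

    def preference_list(self, key, n=3):
        """Return the first n nodes found walking clockwise from the key's position."""
        start = bisect.bisect_right(self._positions, _position(key))
        count = min(n, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]
```

Because a key's replicas depend only on its neighbours on the ring, adding or removing a node moves only the keys adjacent to that node.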

  10. Distributed computation • Parallel distributed processing • Single Program, Multiple Data (SPMD) • Fault tolerance • Applications

  11. Task farming Master Worker Worker Worker Storage
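
The master/worker/storage layout on this slide can be sketched on a single machine with threads standing in for worker computers. This is an illustration only: `run_task_farm` and its parameters are hypothetical names, a dictionary stands in for shared storage, and there is no fault tolerance.

```python
import queue
import threading


def run_task_farm(tasks, work_fn, num_workers=3):
    """Master: enqueue every task, start the workers, then gather results in task order."""
    pending = queue.Queue()
    for task_id, payload in enumerate(tasks):
        pending.put((task_id, payload))

    storage = {}                      # stands in for shared storage
    storage_lock = threading.Lock()

    def worker():
        while True:
            try:
                task_id, payload = pending.get_nowait()
            except queue.Empty:
                return                # all tasks were queued up front, so empty means done
            result = work_fn(payload)
            with storage_lock:
                storage[task_id] = result

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return [storage[task_id] for task_id in range(len(tasks))]
```

Workers pull tasks rather than having them pushed, so a faster worker naturally takes on more of the tasks.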

  12. MapReduce (2004)
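
The MapReduce programming model can be shown with a sequential, single-process sketch. It is not the distributed implementation from the paper: `map_reduce`, `word_count_map` and `word_count_reduce` are hypothetical names, and the grouping step here is an in-memory dictionary rather than a shuffle across machines.

```python
from collections import defaultdict


def map_reduce(inputs, map_fn, reduce_fn):
    """Apply map_fn to each record, group emitted values by key, then reduce each group."""
    intermediate = defaultdict(list)
    for record in inputs:
        for key, value in map_fn(record):
            intermediate[key].append(value)
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}


def word_count_map(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.split():
        yield word.lower(), 1


def word_count_reduce(word, counts):
    """Sum the counts emitted for one word."""
    return sum(counts)
```

Because each map call sees one record and each reduce call sees one key, both phases can be spread across many workers without the user writing any coordination code.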

  13. Dryad (2007) • Arbitrary directed acyclic graph (DAG) • Vertices and channels • Topological ordering
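
A topological ordering of a job graph can be computed with Kahn's algorithm, sketched below. This is a generic illustration of the ordering idea, not Dryad's scheduler; `topological_order` and the vertex names in the usage note are hypothetical.

```python
from collections import deque


def topological_order(vertices, channels):
    """Order vertices so each one appears after all vertices that feed it.

    channels is a list of (source, destination) pairs.
    Raises ValueError if the graph contains a cycle.
    """
    indegree = {v: 0 for v in vertices}
    downstream = {v: [] for v in vertices}
    for src, dst in channels:
        downstream[src].append(dst)
        indegree[dst] += 1

    ready = deque(v for v in vertices if indegree[v] == 0)  # vertices with no inputs
    order = []
    while ready:
        vertex = ready.popleft()
        order.append(vertex)
        for nxt in downstream[vertex]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:        # all inputs of nxt are now available
                ready.append(nxt)

    if len(order) != len(vertices):
        raise ValueError("graph has a cycle")
    return order
```

For example, with vertices `read_a`, `read_b`, `join`, `write` and channels from both readers into `join` and from `join` into `write`, the readers come first and `write` comes last. Every vertex in the `ready` queue at the same time has all of its inputs available, so those vertices could run in parallel.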

  14. DryadLINQ (2008) • Language Integrated Query (LINQ)
      var table = PartitionedTable.Get<int>("…");
      var result = from x in table select x * x;
      int sumSquares = result.Sum();
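
The sum-of-squares query on this slide is data-parallel: each partition of the table can be squared and summed independently, and the partial sums combined at the end. The sketch below shows that decomposition in Python rather than C#. It is an illustration of the idea, not a description of the plan DryadLINQ generates, and `sum_squares_partitioned` is a hypothetical name.

```python
def sum_squares_partitioned(partitions):
    """Square and sum each partition on its own, then combine the partial sums."""
    # Each partial sum depends on one partition only, so these could run on different machines.
    partial_sums = [sum(x * x for x in partition) for partition in partitions]
    # Only the small partial sums need to be brought together for the final step.
    return sum(partial_sums)
```

Combining partial sums works because addition is associative, so the result does not depend on how the table was partitioned.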

  15. Scheduling issues • Heterogeneous performance • Sharing a cluster fairly • Data locality

  16. Percolator (2010) • Built on Google BigTable • Transactions via snapshot isolation • Per-column notifications (triggers)

  17. Skywriting and Ciel (2010) • Universal distributed execution engine • Script language for distributed programs • Opportunities for student projects…

  18. References
      • Storage
        • Ghemawat et al., “The Google File System”, Proceedings of SOSP 2003
        • DeCandia et al., “Dynamo: Amazon’s Highly-Available Key-value Store”, Proceedings of SOSP 2007
      • Computation
        • Dean and Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, Proceedings of OSDI 2004
        • Isard et al., “Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks”, Proceedings of EuroSys 2007
        • Yu et al., “DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language”, Proceedings of OSDI 2008
        • Olston et al., “Pig Latin: A Not-So-Foreign Language for Data Processing”, Proceedings of SIGMOD 2008
        • Murray and Hand, “Scripting the Cloud with Skywriting”, Proceedings of HotCloud 2010
      • Scheduling
        • Zaharia et al., “Improving MapReduce Performance in Heterogeneous Environments”, Proceedings of OSDI 2008
        • Isard et al., “Quincy: Fair Scheduling for Distributed Computing Clusters”, Proceedings of SOSP 2009
        • Zaharia et al., “Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling”, Proceedings of EuroSys 2010
      • Transactions
        • Peng and Dabek, “Large-Scale Incremental Processing using Distributed Transactions and Notifications”, Proceedings of OSDI 2010

  19. Conclusions • Data centers achieve high performance with commodity parts • Efficient storage requires application-specific trade-offs • Data-parallelism simplifies distributed computation on the data

  20. Questions • Now or after the lecture • Email • Derek.Murray@cl.cam.ac.uk • Web • http://www.cl.cam.ac.uk/~dgm36/
