
What is System Research? Why does it Matter?


Presentation Transcript


  1. What is System Research? Why does it Matter? Zheng Zhang, Research Manager, System Research Group, Microsoft Research Asia

  2. Outline • A perspective of computer system research • By Roy Levin (Managing Director of MSR-SVC) • Overview of MSRA/SRG activities • Example projects

  3. What is Systems Research? • What makes it research? • Something no one has done before. • Something that might not work. • Universities are known for doing Systems Research, but Industry does it too: • VAX Clusters (DEC) • Cedar (Xerox) • System R (IBM) • NonStop (Tandem) • And, more recently, Microsoft

  4. What is Systems Research? • What specialties does it encompass? Computer Architecture, Networks, Operating Systems, Protocols, Programming Languages, Databases, Distributed Applications, Measurement, Security, Simulation (and others...) • Design + implementation + validation • (implementation includes simulation)

  5. What’s Different About Systems Research Now? • Scale! • Geographic extent • “Machine room” systems are a research niche. • Administrative extent (multiple domains) • Validation at scale • A single organization often can’t do it. • Consortia, collaborations • “Test beds for hire” (e.g., PlanetLab) • Industrial systems as data sources • And perhaps as test beds?

  6. What’s hot today, and will continue to be? • Distributed systems! • Web 2.0 is all about distributed systems: • Protocol/presentation revolution: • HTML → XML, DHTML → AJAX, HTTP → RSS/Atom… • Service mash-ups are really happening • Infrastructure: • Huge clusters inside (e.g. MSN, Google, Yahoo) • Even bigger networks outside (e.g. P2P, social networks) • Very complex to understand • Fertile ground for advancing the theoretical/algorithmic side • Very challenging to build and test • That’s what research is about, isn’t it?

  7. MSRA/SRG Activity Overview

  8. SRG research focus: the theory & practice of distributed system research [Diagram: the brain (basic research) and the hand (systems/tools) feed each other: problems flow one way, solutions the other, and the systems improve in turn; labels include "InspectorMorse" tools, "beauty contest" problems, exploratory systems/experiments, and low-hanging fruits] • “Practical” theory work: failure model, membership protocol, distributed data structures, DHT spec, … • Maze: large-scale wide-area P2P file sharing • WiDS: distributed system building package • BitVault: machine-room storage system and its applications (HPC & end-user use)

  9. Summary of Results (last 9 months)

  10. Some projects in SRG • Building large-scale systems • BitVault, WiDS and BSR involvement • Large-scale P2P systems • A collaboration project with Beijing University • My view on Grid and P2P computing • Which one would you like to hear?

  11. BitVault and WiDS (plus contributions from BSR)

  12. BitVault: brick-based reliable storage for huge amounts of reference data • Components: SOMO monitor, check-in/check-out, catalog, load balance, delete, soft-state distributed index, object replication & placement, repair protocol • Membership and Routing Layer (MRL): scalable broadcast, anti-entropy, leafset protocol • Design points: • Low TCO • Highly reliable • Simple architecture ($400 a piece) • Entirely developed/maintained with WiDS • Adequate performance
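
To make the interplay of catalog, membership layer, and repair protocol concrete, here is a minimal sketch of the bookkeeping such a brick-based store needs when the membership layer reports a failed brick. This is an illustration only, not BitVault code; the names (Catalog, onBrickFailure, kTargetReplicas) and the replication degree of 3 are assumptions.

// Sketch only: how a brick-based store of this shape might trigger repair
// when the membership/routing layer declares a brick dead.
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

using ObjectId = std::string;
using BrickId  = int;
constexpr int kTargetReplicas = 3;   // assumed replication degree

struct Catalog {
    // Soft-state index: object -> bricks currently holding a replica.
    std::map<ObjectId, std::set<BrickId>> replicas;

    // Called when the MRL reports a brick failure. Returns the objects
    // whose replica count dropped below the target, i.e. the repair work.
    std::vector<ObjectId> onBrickFailure(BrickId dead) {
        std::vector<ObjectId> toRepair;
        for (auto& [obj, holders] : replicas) {
            holders.erase(dead);
            if (static_cast<int>(holders.size()) < kTargetReplicas)
                toRepair.push_back(obj);
        }
        return toRepair;
    }
};

int main() {
    Catalog cat;
    cat.replicas["blob-42"] = {1, 2, 3};
    for (const auto& obj : cat.onBrickFailure(2))
        std::cout << "re-replicate " << obj << " from a surviving brick\n";
}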

  13. BitVault: repair speed • Google File System: 440MB/s in a 227-server cluster • BitVault is comparable or better [Charts: performance under failure; repair rate vs. # of servers]

  14. The “black art” of building a distributed system • Typical pipeline: pseudo-code/protocol spec (TLA+, Spec#, SPIN) → small-scale simulation → implementation (v.001) → debug → implementation (v.01) → debug → implementation (v.1) → performance debug • Pain points: • Unscalable “distributed” human “log mining” • Non-deterministic bugs • Code divergence • Simulation at 1/1000 of the real deployment scale (esp. P2P)

  15. Goal: a generic toolkit for integrated distributed system/protocol development • Reduce debugging pain • Spend as much energy in a single address space as possible • Isolate non-deterministic bugs and reproduce them in simulation • Remove the human from the log-mining business • … • Eliminate code divergence • Use the same code across development stages (e.g. simulation/executable) • Scale the performance study • Implement an efficient ultra-large-scale simulation platform • Interface with formal methods • Ultimately: TLA+ spec → implementation

  16. WiDS: an API set for the programmer • Application logic: state machine, event-driven, object-oriented • Protocol instance: periodic and one-time timers; PostMessage and async callbacks; timer and message handlers • Isolates the implementation from the runtime
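
As a rough illustration of the shape of such an API (a sketch only; the real WiDS class and method names may differ), application logic is written against an abstract runtime that delivers timer and message events, so the same protocol object can later be driven by either a simulator or the real network:

// Illustrative sketch of an event-driven protocol instance. All names here
// (Runtime, OnTimer, OnMessage, PostMessage, SetTimer) are assumptions, not
// the actual WiDS interfaces.
#include <cstdint>
#include <iostream>
#include <string>

struct Message { uint64_t from; std::string body; };

// Services the runtime would expose to application logic.
class Runtime {
public:
    virtual void PostMessage(uint64_t dst, const Message& m) = 0;
    virtual void SetTimer(int timerId, double delaySec, bool periodic) = 0;
    virtual ~Runtime() = default;
};

// Application logic: a toy ping/pong protocol written only against Runtime.
class PingPong {
public:
    PingPong(uint64_t self, Runtime& rt) : self_(self), rt_(rt) {}
    void Start(uint64_t peer) { peer_ = peer; rt_.SetTimer(1, 5.0, true); }
    void OnTimer(int id)      { if (id == 1) rt_.PostMessage(peer_, {self_, "ping"}); }
    void OnMessage(const Message& m) {
        if (m.body == "ping") rt_.PostMessage(m.from, {self_, "pong"});
        else std::cout << self_ << ": pong from " << m.from << "\n";
    }
private:
    uint64_t self_, peer_ = 0;
    Runtime& rt_;
};

// Trivial runtime that just logs calls, enough to exercise the handlers.
class LoggingRuntime : public Runtime {
public:
    void PostMessage(uint64_t dst, const Message& m) override {
        std::cout << "send to " << dst << ": " << m.body << "\n";
    }
    void SetTimer(int id, double d, bool p) override {
        std::cout << "timer " << id << " every " << d << "s periodic=" << p << "\n";
    }
};

int main() {
    LoggingRuntime rt;
    PingPong node(1, rt);
    node.Start(2);
    node.OnTimer(1);               // fire the heartbeat once by hand
    node.OnMessage({2, "pong"});   // and deliver a reply by hand
}

The design point the slide makes is visible here: the protocol object never touches sockets or wall-clock timers directly, which is what allows the runtime underneath to be swapped.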

  17. WiDS: as development environment • All protocol instances (with their timers and message handlers) run in one address space on top of WiDS-Dev, driven by an event wheel over a network model • Single-address-space debugging of multiple instances • Small-scale simulation (~10K instances)
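
A minimal sketch of the event-wheel idea (a reconstruction for illustration, not WiDS internals): all simulated instances share one process, and a priority queue ordered by virtual time dispatches their timer and message events, which is what makes single-address-space debugging and ~10K-instance simulation practical.

// Sketch of an event wheel: one process, many simulated instances, events
// ordered by virtual time.
#include <cstdint>
#include <functional>
#include <iostream>
#include <queue>
#include <vector>

struct Event {
    double time;                      // virtual time at which the event fires
    std::function<void()> fire;       // the handler to run (timer or message)
};
struct Later {                        // makes the priority queue a min-heap on time
    bool operator()(const Event& a, const Event& b) const { return a.time > b.time; }
};

class EventWheel {
public:
    void At(double t, std::function<void()> f) { q_.push({t, std::move(f)}); }
    void Run() {
        while (!q_.empty()) {
            Event e = q_.top(); q_.pop();
            now_ = e.time;            // advance virtual time, then dispatch
            e.fire();
        }
    }
    double Now() const { return now_; }
private:
    std::priority_queue<Event, std::vector<Event>, Later> q_;
    double now_ = 0.0;
};

int main() {
    EventWheel wheel;
    // Three "instances" in the same address space, each scheduling a message
    // delivery to the next one with a simulated network delay.
    for (int node = 0; node < 3; ++node) {
        wheel.At(1.0 + node, [&, node] {
            std::cout << "t=" << wheel.Now() << " node " << node
                      << " delivers a message to node " << (node + 1) % 3 << "\n";
        });
    }
    wheel.Run();
}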

  18. WiDS: as deployment environment • The same protocol instances (timers, message handlers) now run on top of WiDS-Comm, which sends and receives over the real network • Ready to run!

  19. The WiDS-enabled process • Spec → protocol → implementation (this is where WiDS starts) → debug with WiDS-Dev → performance evaluation, including large scale, with WiDS-Par → deployment with WiDS-Comm → optimization • No code divergence • Large-scale study • Virtualizes the distributed-system debugging process • WiDS-Par has been used to test 2 million real instances using 250+ PCs

  20. What makes a storage system reliable? • MTTDL: Mean Time To Data Loss: “After the system is loaded with data objects, how long on average can it run before it permanently loses the first data object?” • Two factors: • Data repair speed • Sensitivity to concurrent failures

  21. Sequential Placement • Pro: Low likelihood of data loss when concurrent failures occur

  22. Repair in Sequential Placement • Con: Low parallel repair degree leading to relatively high likelihood of concurrent failures

  23. Random Placement • Con: sensitive to concurrent failures

  24. Repair in Random Placement • Pro: High parallel repair degree leading to low likelihood of concurrent failures

  25. Comparison (MTTF = 1000 days, B = 3GB/s, b = 20MB/s, c = 500GB, user data = 1PB) • Random placement is better with large object sizes • Random placement is bad with small object sizes
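
A back-of-envelope reading of these parameters (a sketch, not the paper's analysis; k denotes the number of bricks assumed able to help repair one failed brick's data in parallel):

% c = data held by a failed brick, b = per-brick repair bandwidth,
% B = aggregate system bandwidth, k = repair parallelism (assumption).
\[
  T_{\text{repair}}(k) \;\approx\; \frac{c}{\min(k\,b,\;B)},
  \qquad\text{e.g.}\quad
  \frac{500\ \text{GB}}{20\ \text{MB/s}} \approx 7\ \text{hours (small }k\text{)},
  \quad
  \frac{500\ \text{GB}}{3\ \text{GB/s}} \approx 3\ \text{minutes (large }k\text{)}.
\]

A shorter repair window leaves fewer chances for additional failures to strike before repair completes, which is the repair-speed advantage of spreading replicas. The counterweight, and presumably why random placement fares worse with small objects in the figure, is that with many small objects scattered everywhere, almost every combination of simultaneous brick failures destroys all replicas of some object.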

  26. ICDCS’05 result summary • Established the first framework for analyzing object placement’s impact on reliability • Upshot: • Spread your replicas as widely as you can • Up to the point where bandwidth is fully utilized for repair • Spreading beyond that will hurt reliability • Core algorithm adopted by many MSN large-scale storage projects/products

  27. Ongoing work: • More on object placement: • We are looking at systems with extreme longevity • Heterogeneous capacity and other dynamics have not been factored in yet • Improving WiDS further: • It is still so hard to debug!! • Ideas: • Replay facility to take logs from deployment • Time travel inside simulation • Use model and invariant checking to identify the fault location and path • See SOSP’05 poster

  28. Maze (with the Beijing Univ. Maze team)

  29. Maze File Sharing System • The largest in China • On CERNET, popular with college students • Population: 1.4 million registered accounts; 30,000+ online users • More than 200 million files • More than 13TB (!) transferred every day • Completely developed, operated and deployed by an academic team • Logs added since the collaboration with MSRA last year • Enables detailed study from all angles

  30. A Rare System for Academic Studies • WORLD’04: system architecture • IPTPS’05: the “free-rider” problem • AEPP’05: statistics of shared objects and traffic patterns • Incentives to promote sharing → collusion and cheating • Trust and fairness → can we defeat collusion?

  31. Maze Architecture: the Server Side • Just like Napster… not a DHT! • Historical reason: P2P sharing add-on for the T-net FTP search engine

  32. Maze: Incentive Policies • New users: points = 4096 • Point changes: • Uploads: +1.5 points per MB • Downloads: at most -1.0 point per MB • Gives users more motivation to contribute • Service differentiation: • Order download requests by T = Now - 3log(Points) • First-come-first-serve + large-points-first-serve • Users with P < 512 have a download bandwidth of 200Kb/s • Available since Maze 5.0.3; extensively discussed in the Maze forum before being implemented
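
The arithmetic of this policy is simple enough to sketch directly (an illustration, not Maze source code; the slide does not state the log base, so natural log is assumed here):

// Sketch of the Maze incentive arithmetic as described on the slide.
#include <algorithm>
#include <cmath>
#include <iostream>

struct User {
    double points = 4096.0;                            // new users start with 4096
    void OnUploadMB(double mb)   { points += 1.5 * mb; }
    void OnDownloadMB(double mb) { points -= 1.0 * mb; }    // "at most -1.0 point/MB"
    // Requests are served in increasing T: first-come-first-serve, with large
    // point balances effectively moving a request earlier in the queue.
    double RequestRank(double nowSec) const {
        return nowSec - 3.0 * std::log(std::max(points, 1.0));
    }
    // Users below 512 points are throttled to 200 Kb/s (others uncapped here).
    double DownloadCapKbps() const { return points < 512.0 ? 200.0 : 1e9; }
};

int main() {
    User freeRider, contributor;
    contributor.OnUploadMB(2000);      // uploaded ~2 GB
    freeRider.OnDownloadMB(4000);      // downloaded ~4 GB
    std::cout << "contributor rank: " << contributor.RequestRank(1000) << "\n"
              << "free rider rank:  " << freeRider.RequestRank(1000) << "\n"
              << "free rider cap:   " << freeRider.DownloadCapKbps() << " Kb/s\n";
}

The logarithm gives diminishing returns to very large point balances, while the hard thresholds create a strong incentive to inflate points, which is one way to read the collusion findings on the next slide.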

  33. Collusion Behavior in Maze (partial result) • The first-ever study of its kind • Modeling and simulation done • Deployment and measurement in two months [Charts: 221,000 pairs whose duplication degree > 1; the top 100 links with the most redundant traffic]
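
The slide does not define “duplication degree”; one plausible reading, sketched below purely for illustration, is the number of repeat transfers of the same file between the same uploader/downloader pair, which would surface exactly the kind of redundant-traffic links the chart ranks:

// Illustrative sketch only: the actual Maze metric may be defined differently.
#include <iostream>
#include <map>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

struct Transfer { int uploader; int downloader; std::string fileHash; };

// Duplication degree of a pair = number of transfers between that pair that
// repeat a file the pair has already exchanged.
std::map<std::pair<int,int>, int> DuplicationDegree(const std::vector<Transfer>& log) {
    std::map<std::tuple<int,int,std::string>, int> seen;
    std::map<std::pair<int,int>, int> degree;
    for (const auto& t : log) {
        int priorCount = seen[{t.uploader, t.downloader, t.fileHash}]++;
        if (priorCount > 0) degree[{t.uploader, t.downloader}]++;
    }
    return degree;
}

int main() {
    std::vector<Transfer> log = {
        {1, 2, "A"}, {1, 2, "A"}, {1, 2, "A"},   // suspicious: same file three times
        {3, 4, "B"},                             // ordinary transfer
    };
    for (const auto& [pr, d] : DuplicationDegree(log))
        std::cout << pr.first << "->" << pr.second << " duplication degree " << d << "\n";
}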

  34. The Ecosystem of Maze Research [Diagram: a loop from problem/application to the deployed system (System0), its logs, simulation/development, and successive models (Model0, Model1, …?)] • Common to all system research • More difficult in a live system: you can’t go back!

  35. Grid and P2P Computing

  36. Knowing the Gap is Often More Important (or you risk falling off the cliff!) • The gap is often a manifestation of physical law (the speed of light) • The gap between wide-area (Grid) and cluster/HPC can be just as wide as the one between HPC and sensor networks • Many impossibility results exist • Negative results are not a bad thing • The bad thing is that many are unaware of them! • Examples: • The impossibility of consensus in an asynchronous network • The impossibility of achieving consistency, availability and partition resilience simultaneously
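
For reference, the two results named here are usually stated as follows (standard formulations, paraphrased):

\begin{itemize}
  \item \textbf{FLP impossibility (Fischer, Lynch, Paterson, 1985).} In an asynchronous message-passing system in which even one process may crash, no deterministic protocol guarantees consensus, i.e. agreement, validity, and termination, in every execution.
  \item \textbf{CAP (Brewer 2000; Gilbert and Lynch 2002).} A distributed service cannot simultaneously guarantee consistency (linearizability), availability of every request, and tolerance of network partitions; when a partition occurs, it must sacrifice either consistency or availability.
\end{itemize}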

  37. What is Grid? Or, the Problem that I see • Historically associated with HPC • Dressed up when running short on gas • Problematically borrows concepts from an environment governed by different laws • The Internet as a grand JVM is unlikely • We need to extract common system infrastructure after gaining enough application experience • “Sharing” and “collaboration” are labels applied without careful investigation • Where/what is the 80-20 sweet spot? • Likewise, adding the P2P spin should be done carefully

  38. What is Grid? (cont.) • Grid <= HPC + Web Services • HPC isn’t done yet; check Google • Why? • You need HPC to run the apps, or store the data • Services have clear boundaries • Interoperable protocols bound the services

  39. P2P computing: inspiration from Cellular Automata [A New Kind of Science, Wolfram, 2002] [Figure: a CA program and the computation it produces] • Similar to traditional parallel computing logic: read input data; for a while { compute output data region; input edge regions from neighbors }
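
The loop on the slide is the standard block-decomposition pattern; here is a small self-contained sketch of it for a 1D automaton (the 2D case adds more edges but is otherwise identical; the update rule and block size are arbitrary choices for illustration):

// Each block computes its interior ("compute output data region") and then
// exchanges its edge cells with its neighbours ("input edge regions").
#include <array>
#include <iostream>
#include <vector>

constexpr int kBlock = 8;

// One block with a one-cell halo on each side.
struct Block {
    std::array<int, kBlock + 2> cells{};   // [0] and [kBlock+1] are halos
    void Step() {                          // toy update rule
        std::array<int, kBlock + 2> next = cells;
        for (int i = 1; i <= kBlock; ++i)
            next[i] = (cells[i - 1] ^ cells[i + 1]) | cells[i];
        cells = next;
    }
};

int main() {
    std::vector<Block> blocks(4);
    blocks[0].cells[kBlock / 2] = 1;       // seed one live cell
    for (int step = 0; step < 5; ++step) {
        // 1) exchange edge regions between neighbouring blocks
        for (size_t b = 0; b + 1 < blocks.size(); ++b) {
            blocks[b].cells[kBlock + 1] = blocks[b + 1].cells[1];
            blocks[b + 1].cells[0]      = blocks[b].cells[kBlock];
        }
        // 2) advance every block one CA step
        for (auto& blk : blocks) blk.Step();
    }
    for (const auto& blk : blocks)
        for (int i = 1; i <= kBlock; ++i) std::cout << blk.cells[i];
    std::cout << "\n";
}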

  40. Many Applications Follow the Same Model • Enterprise computing • MapReduce and similar tasks in data processing • Sorting and querying • Coarse-grain scientific computing • Engineering and product design • Meteorological simulation • Molecular biology simulation • Bioinformatics computation • Hunting for the next low-hanging fruit after SETI@home and Folding@home

  41. WAN/LAN does not matter when C/B is large; what matters is more processes! Is it feasible? [Diagram: a 2D CA partitioned into blocks of side n on a LAN vs. blocks of side N (N >> n) on a WAN; the key quantity is computing density / traffic (instr/byte) versus the platform’s C0/B0]
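
A back-of-envelope version of the 2D CA argument (a reconstruction of the slide's point; the per-cell cost gamma and bytes-per-edge-cell s are assumptions for illustration):

% Per step, a block of side n updates n^2 cells but only exchanges its ~4n
% edge cells, so the instructions-per-byte ratio grows linearly with n:
\[
  \frac{\text{compute}}{\text{traffic}}
  \;\approx\; \frac{\gamma\, n^{2}}{4\, s\, n} \;=\; \frac{\gamma}{4 s}\, n .
\]
% A WAN has a much worse compute/bandwidth ratio C0/B0 than a LAN, but choosing
% a large enough block side N (N >> n) pushes gamma*N/(4s) past even the WAN's
% C0/B0, which is the sense in which WAN vs. LAN stops mattering when C/B is large.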

  42. What’s the point of Grid, or not-Grid? • Copying ready-made concepts across contexts is easy • But it often does not work • Each context is governed by the laws of physics • We need to start building and testing applications • Then we can define what a “Grid OS” is truly about

  43. Thanks
