Deconstructing Commodity Storage Clusters

Haryadi S. Gunawi, Nitin Agrawal, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Univ. of Wisconsin - Madison. Jiri Schindler, Corporation.

Presentation Transcript


  1. Deconstructing Commodity Storage Clusters • Haryadi S. Gunawi, Nitin Agrawal, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Univ. of Wisconsin - Madison • Jiri Schindler, Corporation

  2. Storage system • Storage systems • Important components of large-scale systems • Multi-billion dollar industry • Often composed of high-end storage servers • A big box with lots of disks inside • The simple question • How does a storage server work? • Simple but hard: closed storage subsystem design

  3. Why do we need to know? • Better modeling • How the system behaves under different workloads • Example in storage industry: capacity model for capacity planning • A model is limited if the available information is limited • Product validation • Validate what product specs say • Performance numbers alone cannot confirm them • Critical evaluation of design and implementation choices • Control what is occurring inside

  4. Traditionally black box • Highly customized and proprietary hardware and OS • Hitachi Lightning, NetApp Filers, EMC Symmetrix • EMC Symmetrix: disk/cache manager, proprietary OS • Internal information is hidden behind standard interfaces • [Figure: Client, Storage System shown as "?", Acks]

  5. Modern graybox storage system • Cluster of commodity PCs running commodity OS • Google FS cluster, HP FAB, EMC Centera • Advantages of commodity storage clusters • Direct internal observation – visible probe points • Leverage existing standardized tools • [Figure: Client, Switches, commodity PCs making up the Storage System, Update DB]

  6. Intra-box Techniques • Two “Intra-box” techniques • Observation • System perturbation • Two components of analysis • Deduce structure of main communication protocol • Object Read and Write protocol • Internal policy decisions • Caching, prefetching, write buffering, load balancing, etc.

  7. Goal and EMC Author • Objectives • Feasibility of deconstructing commodity storage clusters without source code • Results achieved without EMC assistance • EMC Author • Evaluates correctness of our findings • Gives insights behind their design decisions

  8. Outline • Introduction • EMC Centera Overview • Intra-box tools • Deducing Protocol • Observation and Delay Perturbation • Inferring Policies • System Perturbation • Conclusion

  9. Centera Topology • [Figure: Clients, WAN, Access Nodes (AN 1, AN 2), LAN, Storage Nodes (SN 1 to SN 6)]

  10. Commodity OS • Client: Client SDK, TCP • Access Node: Centera Software, Linux, TCP/UDP, Reiserfs, IDE driver • Storage Node: Centera Software, Linux, TCP/UDP, Reiserfs, IDE driver • Network: WAN, LAN

  11. Probe Points – Observation • Internal probe points • Trace traffic using standardized tools • tcpdump: trace network traffic • Pseudo Device Driver: trace disk traffic • [Figure: software stacks of Client, Access Node, and Storage Node, with a tcpdump probe on each node and the pseudo device driver between Reiserfs and the IDE drives]

  12. Probe Points – Perturbation • Perturbing system at probe points • Modified NistNet: delay particular messages • Pseudo Dev. Driver: delay disk I/O traffic • Additional Load • CPU Load: High priority while loop, while(1) {..} • Disk Load: File copy, cp fX fY • [Figure: software stacks of Client, Access Node, and Storage Node, with modified NistNet and tcpdump on each node, the pseudo device driver with delay above the IDE drives, and a user-level load process on the Storage Node]
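
The two load generators on this slide can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function names, the 1 MiB file size, and the durations are assumptions, and the sketch does not raise process priority, which the slide's "high priority" loop would require (for example by running it under elevated scheduling priority).

```python
import os
import shutil
import tempfile
import time


def cpu_load(seconds):
    """Spin in a tight loop for `seconds`; returns the iteration count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def disk_load(src_path, dst_path, copies):
    """Copy src_path over dst_path `copies` times; returns bytes written."""
    written = 0
    for _ in range(copies):
        shutil.copyfile(src_path, dst_path)
        written += os.path.getsize(dst_path)
    return written


if __name__ == "__main__":
    print("busy-loop iterations:", cpu_load(0.1))
    with tempfile.TemporaryDirectory() as tmp:
        fx = os.path.join(tmp, "fX")  # file names follow the slide's cp fX fY
        fy = os.path.join(tmp, "fY")
        with open(fx, "wb") as f:
            f.write(b"\0" * 1024 * 1024)  # hypothetical 1 MiB source file
        print("bytes copied:", disk_load(fx, fy, copies=3))
```

In an experiment these would run on one storage node while the write workload runs, so that the loaded node can be compared against unloaded ones.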

  13. Outline • Introduction • EMC Centera Overview • Deducing Protocol • Observation and Delay Perturbation • Inferring Policies • System Perturbation • Conclusion

  14. Understanding the protocol • Understanding Read/Write protocol • Read and Write implementations in big distributed storage systems are not simple • Deconstruct the protocol structure • Which pieces are involved? • Where is data sent? • Is data reliably stored, mirrored, or striped?

  15. Observing Write Protocol • Deconstruct protocol using passive observation • Run a series of write workloads • Observe network and disk traffic • Correlation tools: convert traces into protocol structure • [Figure: Client issues write( ) to EMC Centera; Access Nodes an1, an2; Storage Nodes sn1 to sn6]
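
A correlation step of the kind mentioned on this slide might look like the sketch below: per-probe traces are merged by timestamp into one timeline. The `Event` record, the trace contents, and the assumption that node clocks are synchronized are all hypothetical; the slides do not describe the actual tool or trace format.

```python
import heapq
from collections import namedtuple

# Hypothetical trace record: when, where, which probe, and what was seen.
Event = namedtuple("Event", "time node kind detail")


def correlate(*traces):
    """Merge per-probe traces (each already sorted by time) into one timeline."""
    return list(heapq.merge(*traces, key=lambda e: e.time))


def protocol_steps(timeline):
    """Render the merged timeline as 'node: kind detail' steps."""
    return [f"{e.node}: {e.kind} {e.detail}" for e in timeline]


# Made-up sample traces, for illustration only.
an_net = [Event(0.0, "an1", "net", "write-request -> sn1"),
          Event(4.0, "an1", "net", "write-complete -> client")]
sn_net = [Event(1.0, "sn1", "net", "data -> sn2"),
          Event(3.0, "sn1", "net", "commit -> an1")]
sn_disk = [Event(2.0, "sn1", "disk", "WRITE")]

timeline = correlate(an_net, sn_net, sn_disk)
for step in protocol_steps(timeline):
    print(step)
```

The merged order only shows what happened after what; it does not by itself establish which step depends on which.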

  16. Observation Results • Object Write Protocol findings • Phase 1: Write request establishment • Phase 2: Data transfer • Phase 3: Disk write, notify other SNs, commit • Phase 4: Series of acknowledgements • Determine general properties • Primary SN handles generation of 2nd copy • Two new TCP connections per object write • [Figure: message timeline among Client, Access Node, Primary SN, and Secondary SN, with messages labeled Write Req., TCP Setup, Request Ack., Data Transfer, Transfer Ack., Write-Commit, Write Complete, and software ACKs]

  17. Resolving Dependencies • Cannot conclude dependencies from observation only • B after A != B depends on A • Must delay A, and see if B is delayed • From observation only: primary commit (pc) appears to depend on secondary commit (sc) and the synchronous disk write • Conclude causality by delaying: • disk write traffic and • secondary commit • [Figure: timeline among AN, Primary SN, and Secondary SN showing sc and pc]
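
The "delay A and see if B is delayed" test can be illustrated with a toy model. Here a small simulated system stands in for the real cluster; the event names, durations, and dependency table are invented for the example and are not measurements from the talk.

```python
def completion_times(durations, deps, extra_delay=None):
    """Finish time of each event; an event starts once all its deps finish."""
    extra_delay = extra_delay or {}
    finish = {}

    def visit(ev):
        if ev not in finish:
            start = max((visit(d) for d in deps.get(ev, ())), default=0.0)
            finish[ev] = start + durations[ev] + extra_delay.get(ev, 0.0)
        return finish[ev]

    for ev in durations:
        visit(ev)
    return finish


def depends_on(durations, deps, a, b, delay=5.0):
    """Perturbation test: does delaying event `a` push back event `b`?"""
    base = completion_times(durations, deps)
    perturbed = completion_times(durations, deps, {a: delay})
    return perturbed[b] > base[b]


# Hypothetical hidden structure that the experimenter cannot see directly.
durations = {"transfer_ack": 1.0, "disk_write": 3.0,
             "secondary_commit": 2.0, "primary_commit": 1.0}
deps = {"primary_commit": ("disk_write", "secondary_commit")}

print(depends_on(durations, deps, "transfer_ack", "primary_commit"))
print(depends_on(durations, deps, "secondary_commit", "primary_commit"))
```

In this model the primary commit always follows the transfer ack, yet delaying the ack leaves the commit time unchanged, while delaying the secondary commit moves it. The injected delay has to exceed any slack: a 0.5 delay on `secondary_commit` would be hidden behind the slower `disk_write`.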

  18. Delaying a Particular Message • Need to delay a particular message • Leverage packet sizes • Modify NistNet • Delay specific message, not link • Ex: delay sc (90 bytes) • [Figure: on the Primary SN, modified NistNet sits below TCP/UDP, Linux, and CentraStar; an incoming packet with size = 90 goes to the delay queue, any other packet passes through. The message timeline among Client, Access Node, Primary SN, and Secondary SN is annotated with message sizes of 299, 509, 161, 289, 375, 321, 539, and 4 bytes, with sc at 90 bytes]
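
The size-based filter on this slide can be sketched as a small simulation. This is not NistNet's interface; the `deliver` function, the arrival times, and the 10-unit delay are assumptions, and only the 90-byte match value comes from the slide.

```python
import heapq


def deliver(packets, match_size=90, delay=10.0):
    """Hold back packets of `match_size` bytes by `delay`; pass others through.

    packets: iterable of (arrival_time, size_in_bytes).
    Returns a list of (delivery_time, size_in_bytes) in delivery order.
    """
    queue = []
    for seq, (arrival, size) in enumerate(packets):
        release = arrival + delay if size == match_size else arrival
        # seq breaks ties so equal release times keep arrival order
        heapq.heappush(queue, (release, seq, size))
    delivered = []
    while queue:
        release, _, size = heapq.heappop(queue)
        delivered.append((release, size))
    return delivered


# Made-up arrival times; sizes taken from the slide's annotations.
packets = [(0.0, 509), (1.0, 90), (2.0, 161)]
print(deliver(packets))  # [(0.0, 509), (2.0, 161), (11.0, 90)]
```

Matching on size delays one message type without slowing the whole link, which works only as long as no other message on that path has the same size.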

  19. Delaying secondary-commit • Resolving first dependency • Delay secondary commit → primary commit also gets delayed • Primary commit depends on the receipt of secondary commit • [Figure: timeline among AN, Primary SN, and Secondary SN with a delay added to the secondary commit]

  20. Delaying disk I/O traffic • Delay disk writes at primary storage node • From observation and delay: primary commit depends on the secondary commit message and the synchronous disk write • [Figure: on the Primary SN, a disk request from ReiserFS passes through the pseudo device; a WRITE goes to the delay queue, anything else goes straight to the IDE driver. The timeline shows the delayed disk write, the secondary commit, and the primary commit]

  21. Ability to analyze internal designs • Intra-box techniques: Observation and perturbation by delay • Able to deduce Object Write protocol • Gives the ability to analyze internal design decisions • Serial vs. Parallel • Primary SN handles the generation of 2nd copy (Serial) vs. AN handles both 1st and 2nd (Parallel) • EMC Centera: write throughput is more important • Decrease load on access nodes – increase write throughput • New internal TCP connections per object write • vs. using persistent connections to remove TCP setup cost • Prefer simplicity – no need to manage persistent conn. for all requests • [Figure: two diagrams of Client, AN, SN1, and SN2 with transfers numbered 1 and 2, one serial and one parallel]
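
The access-node load argument can be made concrete with a back-of-the-envelope model. It assumes two copies per object and counts only object payload, ignoring requests, acknowledgements, and TCP setup; the function name and the 1 MB object size are illustrative.

```python
def access_node_bytes_sent(object_size, copies=2, serial=True):
    """Payload bytes the access node sends to storage nodes for one write.

    Serial: the AN sends one copy and the primary SN creates the rest.
    Parallel: the AN sends every copy itself.
    """
    return object_size if serial else object_size * copies


size = 1_000_000  # hypothetical 1 MB object
print(access_node_bytes_sent(size, serial=True))   # 1000000
print(access_node_bytes_sent(size, serial=False))  # 2000000
```

Under these assumptions the serial design halves the payload each access node pushes per write, at the cost of an extra hop before the second copy exists.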

  22. Outline • Introduction • EMC Centera Overview • Deducing Protocol • Inferring Policies • Various system perturbation • Conclusion

  23. Inferring internal policies • Write policies • Level of replication, Load balancing, Caching/buffering • Read policies • Caching, Prefetching, Load balancing • Try to infer • Is a particular policy implemented? • At which level is it implemented? • Ex: Read Caching at Client, Access Node, Storage Node?

  24. System Perturbation • Perturb the system • Delay and extra load • 4 common load-balancing factors: • CPU load: high-priority while loop • Disk load: background file copy • Active TCP connections • Network delay • [Figure: Client issues write() to the Access Node, which must choose among SN 1, SN 2, SN 3, ...; perturbations shown are CPU load, active TCP connections, and added network delay]

  25. Write Load Balancing • What factors determine which storage nodes are selected? • Experiment: • Observe which primary storage nodes are selected • Without load: writes are balanced • With load: writes skew toward unloaded nodes • [Figure: the AN choosing between sn#1 and sn#2, first with both unloaded, then with sn#2 loaded]
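
Tallying which node is picked as primary, as in this experiment, can be sketched like this. The selection lists below are invented to show the calculation and are not results from the talk; in practice they would come from the network traces.

```python
from collections import Counter


def primary_share(selected_primaries):
    """Fraction of writes for which each node was chosen as primary SN."""
    counts = Counter(selected_primaries)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {node: n / total for node, n in counts.items()}


# Made-up samples: one run with no load, one with load added on sn2.
baseline = ["sn1", "sn2"] * 50
with_load = ["sn1"] * 80 + ["sn2"] * 20

print(primary_share(baseline))   # {'sn1': 0.5, 'sn2': 0.5}
print(primary_share(with_load))  # {'sn1': 0.8, 'sn2': 0.2}
```

Comparing the shares between the unloaded and loaded runs, one perturbation at a time, indicates which factors the selection policy reacts to.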

  26. Write Load Balancing Results • [Chart: sn#1 vs. sn#2 under five conditions: Normal (no perturbation), additional CPU load (+CPU), disk load (+Disk), network load (+TCP), and incoming network delay (+Delay)]

  27. Summary of findings • EMC Centera: Simplicity and Reliability

  28. Conclusion • Intra-box: • Observe and perturb • Deconstruct protocol and infer policies • No access to source code • Power of probe points • More observation places • Ability to control the system • Systems built with more externally visible probe points • Systems more readily understood, analyzed, and debugged • Higher-performing, more robust and reliable computer systems

  29. Questions?
