
Before Terabytes Fall: Disk reliability in Windows Vista and beyond

Matthew Kerner, Program Manager, Windows Diagnosis, Microsoft Corporation. Frank Shu, Program Manager, WDEG-Storage, Microsoft Corporation.


Presentation Transcript


  1. Before Terabytes Fall: Disk reliability in Windows Vista and beyond • Matthew Kerner, Program Manager, Windows Diagnosis, Microsoft Corporation • Frank Shu, Program Manager, WDEG-Storage, Microsoft Corporation

  2. Windows Storage Devices: Strategic pillars • Storage Fabrics (Server/Enterprise): Leading platform enabling storage fabric adoption • Personal Storage (Client/Consumer): Optimized platform features enabling your Windows experience, here and now • Optical Platform (Client/Consumer): Timely, comprehensive, quality platform support for optical devices • Preferred Storage Platform (Partner/Customer): Preferred platform for developing, deploying, and using storage devices

  3. Session Outline • Introduction (Frank Shu) • Windows Vista Disk Diagnostics (Matthew Kerner) • Future Technology (Frank Shu) • Demo (Microsoft and Samsung)

  4. What Matters Most To Our Users? • A consumer bought a new computer and it works great at work and at home. She couldn’t do her everyday tasks without it. What matters most to her? • CPU power • Network connection • Battery life • Something else…

  5. The Answer Is… The Data

  6. Protecting Data: Windows Vista disk diagnostics Matthew Kerner

  7. Quantifying Disk Failures • Catastrophic disk failures • ~200 disks replaced per week at Microsoft in 2003 • Top driver of hardware-related calls to Microsoft support on both client and server • Based on Microsoft figures, disk failures cost many millions of dollars per year in enterprises • Localized failures (bad blocks) • Kernel and user-mode crashes • 1.7% of customer-reported Microsoft Online Crash Analysis crashes are due to disk errors • Application hangs during read recovery

  8. Disk Failure Mitigations • Prevention • Hybrid hard disks (mobile systems) • RAID • Catastrophic failure recovery • Data backup • Disk replacement • Localized failure recovery • Repair from redundant copy • Restore from backup

  9. Windows Vista Disk Diagnostics • Purpose: Save user data before catastrophic disk failure • Client SKUs • Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T.) polling triggers diagnostic • Uses S.M.A.R.T. trip status only; no threshold/attribute comparison • Warns user of impending failure and walks them through backup and replacement • Windows Vista backup improvements

  10. Disk Diagnostics Details • Disk class driver polls S.M.A.R.T. status hourly as it has done since Windows 2000 • Based on industry feedback, no use of Disk Self-Test or attribute comparison • Failure triggers user-mode code • Filter out duplicate failures • Log SMART READ LOG details to OS event log • Device error count from summary error log sector • Life timestamp from most recent error log entry • Trigger user-context interactive resolution • Customizable by Group Policy • Print instructions, walk user through backup
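
The flow on slide 10 (hourly poll, duplicate filtering, event logging, then interactive resolution) can be sketched as follows. This is a minimal illustration and not the Windows Vista implementation: `SmartStatus`, `DiskDiagnostic`, and their fields are hypothetical names, keying the duplicate filter on a disk identifier is an assumption, and the real polling happens in the disk class driver rather than in Python.

```python
from dataclasses import dataclass, field

POLL_INTERVAL_SECONDS = 3600  # slide 10: S.M.A.R.T. status is polled hourly


@dataclass(frozen=True)
class SmartStatus:
    """Hypothetical snapshot of one poll result for one disk."""
    disk_id: str
    tripped: bool            # trip status only, no attribute comparison
    device_error_count: int  # from the summary error log sector
    life_timestamp: int      # from the most recent error log entry


@dataclass
class DiskDiagnostic:
    """Hypothetical user-mode handler that reacts to a tripped status."""
    event_log: list = field(default_factory=list)
    reported: set = field(default_factory=set)

    def handle(self, status: SmartStatus) -> bool:
        """Return True if the user should be walked through backup."""
        if not status.tripped:
            return False
        if status.disk_id in self.reported:
            return False  # filter out duplicate failures (assumed: one per disk)
        self.reported.add(status.disk_id)
        # Log the two SMART READ LOG details named on the slide.
        self.event_log.append(
            (status.disk_id, status.device_error_count, status.life_timestamp)
        )
        return True
```

In this sketch a caller would invoke `handle` once per poll; only the first tripped status for a given disk returns `True` and adds a log entry, which mirrors the "filter out duplicate failures" step.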

  11. Startup Repair/Windows Recovery Environment • Purpose: Recover from non-bootable states, including those caused by disk failures • Automatic failover on boot failure to recovery partition • Optionally deployed by OEM • Available on installation media • Hands-free diagnosis and repair of top non-boot issues

  12. Corrupted File Recovery • Purpose: Turn repeat user-mode crashes caused by corrupted system binaries into one-time crash with silent repair from cache • Windows Error Reporting crash handler triggers diagnostic on inpage error crashes due to bad blocks • Diagnoses corrupted system files • Silent repair from System File Cache
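
The repair step on slide 12 can be pictured with a small sketch. The slides do not say how a corrupted system file is detected, so the hash comparison below is purely an assumption for illustration, and `sha256_of` and `repair_from_cache` are hypothetical helper names rather than Windows APIs.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def repair_from_cache(system_file: Path, cached_copy: Path) -> bool:
    """Replace system_file with cached_copy if their contents differ.

    Assumed detection rule: the file counts as corrupted when it is missing
    or its hash differs from the cached copy. Returns True when a repair
    was performed.
    """
    if system_file.exists() and sha256_of(system_file) == sha256_of(cached_copy):
        return False  # file matches the cache, nothing to repair
    shutil.copy2(cached_copy, system_file)  # silent repair from the cached copy
    return True
```

A second call after a successful repair returns `False`, which matches the goal of turning repeat crashes into a one-time crash.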

  13. Windows Vista Disk Diagnostics • Matthew Kerner, Program Manager, Windows Diagnosis

  14. Opportunities For Future Technology • Proactive failure prevention • Reduce scenario pain by enabling resolutions other than just data recovery • Requires finer-grained failure description to help host choose the best resolution • Increase warning time before failures to allow users to save data

  15. Future Technology: Protecting User Data And Preventing Hard Drive Failure Proactively • Frank Shu

  16. What Is PRCS? • Proactive Reporting and Correcting Safeguard (PRCS) enables a device and host to correct failure conditions proactively • Device can report hostile conditions before damage or failure occurs • Host reacts to a device event in real time based on policy and user preference • A proposal for the PRCS protocol has been submitted to T13

  17. Why Is PRCS Important? • Users’ digital data is more valuable than ever before • Disk drive capacity continues to increase • Not every PC user can afford RAID • Deliver on opportunities for improvements beyond S.M.A.R.T.

  18. Goals Of PRCS • Proactively protect user data • Improve the user experience when data is at risk • Reduce OEMs’ customer support costs • Reduce warranty costs for disk drive vendors

  19. PRCS Features • Device monitors its own conditions in real time • Reduce host monitoring performance impact • Device sends meaningful PRCS events to the host for correction of hostile conditions and data protection • No translations or guesses required • Host acts on device’s PRCS event proactively according to policy and user preference
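
The host-side behavior described on slides 16 and 19 (act on a device event according to policy and user preference) can be sketched as a lookup. The slides do not define PRCS event types or host actions, so every name here (`Condition`, `Action`, `DEFAULT_POLICY`, `choose_action`) is hypothetical, and the precedence rule (user preference first, then policy) is an assumption.

```python
from enum import Enum


class Condition(Enum):
    """Hypothetical hostile conditions; not taken from the PRCS proposal."""
    HIGH_TEMPERATURE = "high_temperature"
    VIBRATION = "vibration"
    SHOCK = "shock"


class Action(Enum):
    """Hypothetical host responses."""
    NOTIFY_USER = "notify_user"
    START_BACKUP = "start_backup"
    IGNORE = "ignore"


# Hypothetical administrator policy mapping each condition to a response.
DEFAULT_POLICY = {
    Condition.HIGH_TEMPERATURE: Action.NOTIFY_USER,
    Condition.VIBRATION: Action.NOTIFY_USER,
    Condition.SHOCK: Action.START_BACKUP,
}


def choose_action(condition, user_overrides=None, policy=DEFAULT_POLICY):
    """Pick the host action: user preference wins, then policy, then notify."""
    if user_overrides and condition in user_overrides:
        return user_overrides[condition]
    return policy.get(condition, Action.NOTIFY_USER)
```

Because the device reports the condition itself, the host in this sketch needs no attribute translation: it only maps the reported condition to a response.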

  20. PRCS Advantages • PRCS is proactive • Taking a corrective action before errors occur • Protecting data when it is at risk • PRCS is designed for end users, not just computer experts • No need to understand a cryptic message to benefit from PRCS. For example: “The previous self-test completed having the electrical element of the test failed” • PRCS enables transparent mitigation of a hostile condition or a recovery process • Users do not need to configure a self-test mode or reporting method • Users control policy as desired

  21. Proactive Disk Diagnostics • Debasis Baral, Vice President of Engineering, Samsung

  22. HDD Reliability 101 • HDD reliability and performance are negatively impacted by extremes in the following operating conditions • Temperature (demo) • Vibration (demo) • Shock (demo) • Duty cycle • Altitude • Humidity • A combination of the above conditions • A history of the above combinations

  23. Reliability Versus Temperature (Ref.: Samsung reliability tests) • HDD life decreases with temperature • Failure rates increase exponentially with temperature for all HDD suppliers • Environmental temperature increase from 25 °C to 100 °C could translate into 10 to 50x shorter life • Figure: Samsung HDD Lab Engineering Sample Data
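
If life shortens exponentially with temperature, the 10 to 50x reduction over the 75 °C span from 25 °C to 100 °C implies a constant multiplier per degree. The sketch below works that out. The pure exponential model and the function names are assumptions made for illustration; they are not Samsung's model or data.

```python
def per_step_factor(total_factor, span_c=75.0, step_c=10.0):
    """Life-reduction multiplier per step_c degrees.

    Assumes life falls by total_factor over span_c degrees along a pure
    exponential curve, so each equal temperature step costs the same ratio.
    """
    return total_factor ** (step_c / span_c)


def relative_life(temp_c, total_factor, ref_c=25.0, span_c=75.0):
    """Life at temp_c relative to life at ref_c (1.0 at the reference)."""
    return total_factor ** (-(temp_c - ref_c) / span_c)
```

Under this assumption, `per_step_factor(10.0)` and `per_step_factor(50.0)` come to roughly 1.36 and 1.68, so each additional 10 °C would shorten life by about a third to two thirds of its previous value's reciprocal ratio, and `relative_life(100.0, 50.0)` recovers the 1/50 endpoint from the slide.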

  24. Performance Versus Vibration • Data throughput or drive performance can be significantly affected in the presence of vibration • Effect of vibration is reversible • Cumulative effects of vibration on long-term drive reliability are a subject of ongoing research • Figure: Samsung HDD Lab Engineering Sample Data

  25. Reliability Versus Shock • Excessive shock is the major cause of failure in both PC and consumer electronics environments • Figure: Shock Modeling • Operating shock damage: scratches; damage by corners, leading edge, and side edges of the slider • Non-operating shock damage • Courtesy: E. Jayson and Frank Talke, UC San Diego

  26. Reliability Design Guidelines • Failure modes and failure rates of disk drives depend on their operating environments • Temperature and handling (shock and vibration) are major factors impacting HDD reliability • HDD reliability will be enhanced if the OS detects and manages reliability risks and stress events intelligently (PRCS) • Users can improve HDD data reliability by correctly responding to PRCS events

  27. PRCS • Kai Chen, Microsoft Corporation • Debasis Baral, Samsung

  28. Call To Action • Test your drives with Windows Vista Disk Diagnostics and send feedback • Ensure your drives comply with ATA-7 specs to surface device error count and life timestamp • Engage with the Startup Repair team to build a plan for Startup Repair in OEM factory processes • Participate in T13 discussions on PRCS • Plan your device designs in line with PRCS guidelines

  29. Additional Resources • Whitepapers • Windows Recovery Environment/Startup Repair/Built-in Diagnostics: http://www.microsoft.com/technet/windowsvista/evaluate/feat/relperf.mspx • Feedback/Questions • Windows Vista Disk Diagnosis: Dfdfeed @ microsoft.com • Corrupt File Recovery: Dfdfeed @ microsoft.com • Windows Recovery Environment/Startup Repair: Recovery @ microsoft.com • PRCS: Prcsdisc @ microsoft.com

  30. © 2006 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
