
System Management Planners: Transforming (High-level Specifications) → (Configuration Actions)

System Management Planners: Transforming (High-level Specifications)  (Configuration Actions). Sandeep Uttamchandani IBM Almaden Research Center. Jim Gray's Turing award speech “What next? - A dozen IT research goals”, 1999. Build a system used by millions of people each day



  1. System Management Planners: Transforming (High-level Specifications) → (Configuration Actions) Sandeep Uttamchandani IBM Almaden Research Center

  2. Jim Gray's Turing award speech “What next? - A dozen IT research goals”, 1999 Build a system • used by millions of people each day • administered and managed by a ½ time person. • On hardware fault, order replacement part • On overload, adjust automatically

  3. Automated Management: A Growing Necessity! • Demand for IT Management: • Growing number of applications and data footprint (988 exabytes in 2010 compared to 161 exabytes in 2006 -- IDC) • Government regulatory compliance (e.g., HIPAA, Sarbanes-Oxley), Disaster Recovery Planning, Application Performance requirements, Provisioning Planning • Growing number of heterogeneous devices, management protocols, application requirements and policies • Supply of Administrators: • 1 Storage Admin manages approximately 300GB-1000GB of storage, while enterprises are moving towards petabyte-scale systems • Lack of end-to-end knowledge: Application + Servers + Networks + Storage • Skilled administrators are scarce and costly

  4. Talk Outline • Problem Drill-down: Understanding the System Administrative tasks • Taxonomy of Approaches for Automation • Management Planners • Building Blocks • Putting it together

  5. Deploying a new Application within an Enterprise Data-Center • Find a server – create a new Virtual Machine • Install and configure application • Select a Storage controller • Find a storage pool -- create a new volume with required capacity • Select a FC switch with available ports -- connect to server and storage controller • Zone the switch followed by LUN Masking and mapping

  6. Application Performance Management (I) • Heterogeneous Component Models (DAS, NAS, iSCSI) • Workload access variations (read/write ratio, rand/seq ratio, request-size, …) • Failures (hardware failures, software bugs, operator errors) • Load Surges • Observe, Analyze, Act [Figure: SPC OLTP trace; plots of request size over time and IOPS over time]

  7. Application Performance Management (II) • Problem Determination • Impact Analysis • Root-cause diagnosis • Event mining and configuration changes • Load balancing • Which application to move? Where? When? • Adding Hardware • Servers, Network, Storage? Where?

  8. Post deployment Tasks • Performance Management • Disaster Recovery (Availability Management) • Regulatory Compliance • Security • Hardware changes • Changing applications and IT goals • …

  9. Administrator’s Dream [Figure: Current State, Task Requirements and Objective Functions feed a box marked "?", which produces System Configuration Details/Corrective Actions]

  10. Taxonomy of Existing Approaches

  11. Expert Systems: Capturing Human Problem Solving • Mycin Expert System: Series of disease symptoms → Rule-based Inference → Pin-point Bacteria/Medication • Example rule: IF the infection is primary-bacteremia AND the site of the culture is one of the sterile sites AND the suspected portal of entry is the gastrointestinal tract THEN there is suggestive evidence (0.7) that infection is bacteroid.
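
The rule quoted on this slide can be read as a pattern match over known facts that yields a conclusion with a certainty factor. The following is a minimal sketch of that idea, not Mycin's actual implementation: the `Rule` class, the `fire` function, the attribute names and the contents of `STERILE_SITES` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    conditions: dict   # attribute name -> set of accepted values
    conclusion: str
    certainty: float   # certainty factor attached to the conclusion


def fire(rules, facts):
    """Return (conclusion, certainty) for every rule whose conditions all hold."""
    results = []
    for rule in rules:
        if all(facts.get(attr) in accepted
               for attr, accepted in rule.conditions.items()):
            results.append((rule.conclusion, rule.certainty))
    return results


# Hypothetical list of sterile sites; the slide does not enumerate them.
STERILE_SITES = {"blood", "cerebrospinal-fluid"}

RULES = [
    Rule(
        conditions={
            "infection": {"primary-bacteremia"},
            "culture_site": STERILE_SITES,
            "portal_of_entry": {"gastrointestinal-tract"},
        },
        conclusion="bacteroid",
        certainty=0.7,
    ),
]

facts = {
    "infection": "primary-bacteremia",
    "culture_site": "blood",
    "portal_of_entry": "gastrointestinal-tract",
}
matches = fire(RULES, facts)
print(matches)
```

Because the knowledge lives in `RULES` and the matching logic lives in `fire`, new rules can be added without touching the inference code, which is the knowledge-base/reasoning-engine split the expert-systems slides refer to.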

  12. Taxonomy of existing approaches

  13. Management Planners: Model-based + Declarative Specifications

  14. Management Planner architecture [Figure: a Declarative Specification (e.g., "Minimize high-priority workloads violating SLOs") and a Knowledge-base of Predictors of System Behavior (component capabilities, workload dependencies on individual components, effects of action invocation; self-evolving predictors using machine learning) feed a Reasoning Engine (Objective, Constrained Optimizer, Configuration/Action Selection). The Managed [Storage] System supplies Capability, Current Status and Trigger, and receives the Corrective Action.]

  15. Building Blocks • Requirements (Declarative Specifications) • Collecting data from devices • Creating device performance models • Formalizing the optimization problem

  16. Knowledge-base: Intuition for generating models Models are mathematical functions e.g. r = ax + by Input variables = {x, y}; Output variables = {r}; Constants {a, b} Generating a function (curve-fitting approach): • Step 1: Designer Specification: Enumerates related parameters (r is a function of x, y, and z) • Step 2: Creating a Baseline model: Off-line data collection; for values of r, x, y, z, determine a best-fit curve (i.e. values of coefficients, and the form of function) • Step 3: Continuous on-line refinement of the functions with additional monitored data
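
The three steps above can be sketched for the slide's example form r = ax + by. This is an illustrative least-squares fit using running sums, so that the same object serves both the off-line baseline (Step 2) and the on-line refinement (Step 3). The class name, method names and sample values are assumptions made for this sketch; the slide does not say which fitting method the planners use.

```python
class TwoInputLinearModel:
    """Least-squares fit of r = a*x + b*y, refined one sample at a time."""

    def __init__(self):
        # Running sums that make up the 2x2 normal equations.
        self.sxx = self.sxy = self.syy = self.sxr = self.syr = 0.0

    def observe(self, x, y, r):
        """Fold one monitored sample (x, y, r) into the running sums."""
        self.sxx += x * x
        self.sxy += x * y
        self.syy += y * y
        self.sxr += x * r
        self.syr += y * r

    def coefficients(self):
        """Solve the normal equations for (a, b) with Cramer's rule."""
        det = self.sxx * self.syy - self.sxy * self.sxy
        if abs(det) < 1e-12:
            raise ValueError("samples do not determine a and b uniquely")
        a = (self.sxr * self.syy - self.syr * self.sxy) / det
        b = (self.syr * self.sxx - self.sxr * self.sxy) / det
        return a, b

    def predict(self, x, y):
        a, b = self.coefficients()
        return a * x + b * y


# Step 2: baseline from off-line samples (made-up data with a = 2, b = 0.5).
model = TwoInputLinearModel()
for x, y, r in [(1, 2, 3.0), (2, 1, 4.5), (3, 3, 7.5)]:
    model.observe(x, y, r)
baseline = model.coefficients()

# Step 3: on-line refinement with one additional monitored sample.
model.observe(4, 0, 8.0)
refined = model.coefficients()
print(baseline, refined)
```

Keeping only the sums means refinement costs constant time per new sample, at the price of fixing the functional form up front; choosing the form itself (Step 2's "form of function") is outside this sketch.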

  17. Component Models • Representation: Response time = c(req_size, r/w_ratio, rand/seq_ratio, req_rate, cache_hit_rate) • Bootstrapping: • Offline calibration tests OR • Performance specifications from vendor • Challenges: • Interleaving of workload streams • Caching effects due to sharing • Related Work: • CART model [CMU] • Table-based approach [HP] [Figure: Quadratic fit: S = 3.284, r = 0.838; Linear fit (non-saturated case): S = 0.2509, r = 0.989]

  18. Capturing Mean Workload Models • Representation: Component Load = wn(application request_rate), e.g. load at the storage controller and switch for 1000 database (OLTP) transactions • Bootstrapping: • Initial monitoring phase OR • Libraries for application workloads (e.g. OLTP, Decision support, E-mail) • Challenges: • Mean value not sufficient for real-world workloads**; using Cumulative Distribution Functions (CDF) • Related Work: • ClockWork (trend prediction) [IBM] • Using ARIMA for Predictive IO prefetching [UIUC] [Figure: Capturing Variance; traces: SPC OLTP, Harvard Campus]
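
To illustrate why a mean value alone can mislead, the sketch below compares the mean of a bursty load trace with its empirical CDF. The sample values and function names are invented for this example and are not taken from the SPC OLTP or Harvard traces named on the slide.

```python
import math


def mean(samples):
    return sum(samples) / len(samples)


def empirical_cdf(samples, value):
    """Fraction of samples less than or equal to value."""
    return sum(1 for s in samples if s <= value) / len(samples)


def percentile(samples, fraction):
    """Smallest sample whose empirical CDF reaches `fraction` (nearest rank)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]


# Hypothetical per-interval request rates with one load surge.
iops = [100, 110, 90, 105, 95, 100, 900, 100, 110, 90]

avg = mean(iops)                   # pulled up by the single surge
median = percentile(iops, 0.5)     # what a typical interval looks like
p95 = percentile(iops, 0.95)       # what the component must survive
below_avg = empirical_cdf(iops, avg)
print(avg, median, p95, below_avg)
```

In this made-up trace the mean is far above the typical interval and far below the surge, so sizing a component to the mean would describe neither case; the CDF exposes both.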

  19. Optimization Formalism: Linear Programming • Objective function: While solving the SLA violation, minimize the throttling of high priority workloads • Variables: Throttle value for each workload • Constraints: • The response time of the components for a given component load • The request-rate at the component arriving from the workload streams • Change in the application request-rate with throttling • Latency SLA of the workloads
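
As a rough illustration of the formulation above, consider a deliberately simplified case: the response-time and latency-SLA constraints are collapsed into a single assumed `capacity`, the largest aggregate request rate at which the component still meets its latency SLA. With one such constraint, the linear program "minimize priority-weighted throttling subject to total admitted load ≤ capacity, 0 ≤ t_i ≤ 1" reduces to a fractional knapsack, which a greedy pass solves exactly. The function name, the workload names and all numbers are hypothetical; the planners cited in this talk use richer models and a general solver.

```python
def plan_throttles(workloads, capacity):
    """Choose throttle fractions t_i in [0, 1] for each workload.

    workloads: {name: (priority, request_rate)} with request_rate > 0.
    capacity:  assumed maximum aggregate request rate that meets the SLA.
    Minimizes sum(priority_i * t_i) subject to
    sum(rate_i * (1 - t_i)) <= capacity.
    """
    excess = sum(rate for _, rate in workloads.values()) - capacity
    throttles = {name: 0.0 for name in workloads}
    # Cheapest load to shed first: lowest priority per unit of request rate.
    by_cost = sorted(workloads, key=lambda n: workloads[n][0] / workloads[n][1])
    for name in by_cost:
        if excess <= 0:
            break
        _, rate = workloads[name]
        shed = min(rate, excess)
        throttles[name] = shed / rate
        excess -= shed
    return throttles


workloads = {
    "oltp": (10, 400),     # (priority, request rate)
    "reports": (3, 300),
    "backup": (1, 300),
}
throttles = plan_throttles(workloads, capacity=600)
print(throttles)
```

Here the low-priority backup stream is throttled completely, the reports stream partially, and the high-priority OLTP stream not at all. Once response time depends non-trivially on the mix of workloads, the single-capacity shortcut no longer holds and a real LP or constrained optimizer is needed.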

  20. Management Planners: A reality! • Throttling Planner [Usenix’05] • SMART: Performance Management Planner [Usenix’06] • SAN Planner in IBM TotalStorage Productivity Center • Disaster Recovery Planner • End-to-end Provisioning Planner • …

  21. Ongoing RADLab Research

  22. Summary… • Data-centers are growing to petabyte scale and beyond • Need for Automation • Administrative Tasks range from simple firmware upgrades to complex provisioning and disaster recovery planning • Back-of-the-envelope calculations are no longer feasible • Management Planners • Map high-level declarative specification to configuration commands • Hide the underlying device configuration, performance, event details • Automatically create and continuously refine device models

  23. Food for thought… “How accurate can the models be?” “How accurate do the models need to be?” Research Spectrum • Representation & creation of domain knowledge - Understanding of system details - Feature-set selection - Machine learning techniques • Formalisms for selection & execution of actions - Constrained optimization techniques - Handling uncertainty and inaccuracies - Variably aggressive action execution • Pragmatic rules-of-thumb - Models don’t need to be perfectly accurate - It is not critical to select the most “optimal” action invocation, but rather to avoid the worst ones - Creating domain knowledge is not a one-time activity; it requires incremental addition and evolution - Automate the common case

  24. Thank you! Sandeep Uttamchandani (sandeepu@us.ibm.com) http://www.almaden.ibm.com/StorageSystems/Storage_Management_and_Solutions/

  25. Backup

  26. Policy-Based Interface • The term “Policy” means different things to different people • Service Class • Goals • Constraints • Best Practices • Rules of thumb • If-Then-Scope-Priority (IBM’s PMAC model) • Some users want a lot of control over the policy specification • Other users want pre-packaged service classes (like Gold/Silver/Bronze), which they subsequently fine-tune to create customized service classes

  27. Aperi: Open standard initiatives • Aperi’s goal • Deliver an open-source common management platform through the contribution and development of actual code. • The common platform will implement SNIA’s SMI-S specification for management of heterogeneous devices. • Targeted Benefits • Improve speed to market of new advanced tools designed for ease of use • Reduce the need for customers to replace storage management platforms when purchasing new hardware or software • Encourage vendors to support industry standards for management in their hardware implementations [Figure: Aperi, an open-source storage management community: the Aperi Common Open Source Platform surrounded by Startups, Academia, Fabric, “Innovators”, VCs, Storage and System vendors; initial members]

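  28. Optimization Formalism (cont.) [Figure: quadrant chart of Latency(%) against IOps(%), both axes running from 0 to 1, with regions EXCEED, FAILED, MEET and LUCKY]
  Objective function: Minimize $\sum_i p_{ai}\, p_{bi}\, A_i / SLA_i$, where $p_{ai}$ = Workload priority, $p_{bi}$ = Quadrant priority, and
  $A_i = SLA_i - a(current\_throughput_i, t_i)$ if $SLA_i > a(current\_throughput_i, t_i)$; $A_i = 0$ otherwise.
  That is: Minimize $\sum_i p_{ai}\, p_{bi}\, [SLA_i - a(current\_throughput_i, t_i)] / SLA_i$
  Constraints:
  $cachehit_i \cdot hittime_i + (1 - cachehit_i) \cdot c\big(\sum_i a(current\_throughput_i, t_i)\big) \le SLA_i$
  $0 \le t_i \le 1$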

  29. Related Fields: A Lot to Learn! Non-Procedural Specifications Research • “Procedural-is-best” controversy • Separation of facts and formalisms • Strategies such as “backtracking” to search the knowledge-base • Logic-based • Network-based • Relational model Machine learning Research • Correlating observed behavior with system parameters • Statistical Learning techniques: Neural Networks, SVMs • Gray-box approaches such as the Snowball project • Supervised/Reinforcement • Boosting Expert Systems Research • Low Road: Dendral • Middle Road: Mycin, R1 • High Road: Sophie • Architecture: Knowledge-base & Reasoning engine • Knowledge-base encodes (generic) domain knowledge • Reasoning engine can interpret knowledge in multiple ways

  30. A Typical Data-center Application [Figure: SAP Application Server (executables on an NTFS file system) backed by three DB servers: Windows with IBM DB2 (Database Managed Storage), AIX with IBM DB2 (System Managed Storage), and Windows with Oracle (Database Managed Storage). The databases sit on a Logical Volume Manager, logical volumes and JFS file systems, mapped to Data, Log and Temp volumes.]

  31. Application Downtime = $$$ Losses • Enterprises require 24 x 7 availability of business-critical applications • Application Availability = Ensuring availability of multiple tiers • Storage Controllers • SAN Appliances • Servers • Virtual Machines • Databases/File-systems • Failures come in several flavors • Virus failures • Mis-configuration errors • Subsystem failures • Site failures (hurricanes, planes)

  32. Preparing for IT Disasters: Administrator’s Task-list • Planning • Understand DR requirements • Evaluate replication services available at storage and other levels • Analyze existing copy services configuration (if any); Generate a DR plan • Deployment • Configure various replication technologies from IBM and non-IBM vendors • Replication at different levels, namely the database, server, operating system and storage level (e.g., RM, HACMP, SRDF, MSCS, VCS) • Validation • Validate DR plans against configuration changes (e.g., changes in zoning) and changes in application characteristics (e.g., write rate) • Continuous Optimization • Optimize DR plans for unused copy relationships • Recommend updates to existing configuration based on hardware and software changes

  33. Preparing for IT Disasters: A Consultant’s Gold-mine • Planning: Complex Search Space, Manual & Error-prone • DR Requirements (RMAF questionnaire); Storage and Server Characteristics • Replication Technology Characteristics • Constraints: Interoperability, # sites, dollar cost • Best Practices • Deployment: Requires expertise in multiple replication technologies • Vendor-specific CLI commands/API for creating/updating/deleting copy pairs, sessions, consistency groups; done manually today by administrators • Validation and Continuous Optimization: 24X7 impact analysis • Analyze impact of configuration changes and application properties • Periodic sampling at primary and secondary storage for Recovery Point Objective (RPO)

  34. Related Work: Creating models [Figure: spectrum from Analytical Approaches (y = ax5 + bx11) to Black-box Approaches (y = a1x1 + a2x2 + … + a100x100), with MonitorMining in between (y = f(x5, x11), f derived from monitored data). Labels on the figure: Brittle, error-prone; Convergence, Accuracy; Representation of models? Evolution of models? Incomplete designer specifications?] • John Wilkes Ecosystem: Minerva, Ergastulum, Hippodrome • Modeling disk behavior, formulas for data prefetching, modeling migration • [Bre94], [Bre95], [Mat97], [Agr00], [Nob97], [Men01], [Laz84], [Vap95], [Vud01] • Case-based reasoning • Multi-relational mining • Table-based models • CART [CMU], CMiner [UIUC]

  35. Related Work: Policy-based Management (Pattern-based Procedure Invocation [Hewitt67])
  Example: Rules for the Prefetch knob
    • [Event]: Latency_violation [Condition]: If ((Memory_available > 70) && (access_pattern < 0.4 sequential) && (read/write > 0.4)) [Action]: Prefetch = 1.2*Prefetch
    • Event: Latency_violation If {(15 < Memory_available > 70 && FC_interconnect_available > 60) && (access_pattern > 0.7 sequential && read/write > 0.4)} Prefetch = 1.4*Prefetch
    • Event: Latency_violation If {(Memory_available > 70 && FC_interconnect_available > 60) && (0.4 < access_pattern < 0.7 sequential && read/write > 0.4)} Prefetch = 1.3*Prefetch
    • Event: Latency_violation If {(Memory_available < 15) && (access_pattern > 0.8 sequential && read/write > 0.4)} Prefetch = 1.2*Prefetch
    • Event: Latency_not_met If {(Memory_available < 15) && (access_pattern < 0.8 sequential && read/write > 0.4)} Prefetch = 1.05*Prefetch
    • Event: Latency_not_met If {(FC_interconnect_available < 20) && (access_pattern > 0.8 sequential && read/write > 0.4)} Prefetch = 1.3*Prefetch
    ………AND MORE…………………..
  • Complexity • Level of details in terms of thresholds and invocation values • Deciding among the action set • Number of rules and conflicts analysis: O(Resource-state x Workload x Action-sets x Current-behavior) • Brittleness • Closely tied with system configurations, workloads and action-sets • No systematic model/approach for refining specifications
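
For readers unfamiliar with event-condition-action policies, here is a small sketch of how rules like the ones above might be evaluated. Three of the slide's rules are transcribed as data; the `PolicyRule` class, the state field names and the `matching_multipliers` helper are hypothetical names introduced for this example, not part of any IBM policy engine.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PolicyRule:
    event: str
    condition: Callable[[dict], bool]  # predicate over the observed state
    prefetch_multiplier: float         # action: Prefetch = multiplier * Prefetch


RULES = [
    PolicyRule(
        "Latency_violation",
        lambda s: s["memory_available"] > 70
        and s["sequential_fraction"] < 0.4
        and s["read_write"] > 0.4,
        1.2,
    ),
    PolicyRule(
        "Latency_violation",
        lambda s: s["memory_available"] > 70
        and s["fc_interconnect_available"] > 60
        and 0.4 < s["sequential_fraction"] < 0.7
        and s["read_write"] > 0.4,
        1.3,
    ),
    PolicyRule(
        "Latency_not_met",
        lambda s: s["fc_interconnect_available"] < 20
        and s["sequential_fraction"] > 0.8
        and s["read_write"] > 0.4,
        1.3,
    ),
]


def matching_multipliers(rules, event, state):
    """Return the multiplier of every rule triggered by this event and state."""
    return [r.prefetch_multiplier for r in rules
            if r.event == event and r.condition(state)]


state = {
    "memory_available": 80,
    "fc_interconnect_available": 70,
    "sequential_fraction": 0.5,
    "read_write": 0.6,
}
fired = matching_multipliers(RULES, "Latency_violation", state)
prefetch = 10.0
if len(fired) == 1:
    prefetch *= fired[0]   # unambiguous: apply the single matching action
print(fired, prefetch)
```

The sketch also hints at the brittleness listed above: every threshold is hard-coded, a state that matches no rule yields no action, and a state that matches several rules needs a separate conflict-resolution step that the rules themselves do not provide.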
