Autonomic Runtime System: Design and Evaluation for SAMR Applications *

Salim Hariri, High Performance Distributed Computing Laboratory, The University of Arizona. http://www.ece.arizona.edu/~hpdc. Supported by: NSF, DOE, DARPA, Intel, Raytheon and AOL grants.

Presentation Transcript


  1. Autonomic Runtime System: Design and Evaluation for SAMR Applications* Salim Hariri High Performance Distributed Computing Laboratory The University of Arizona http://www.ece.arizona.edu/~hpdc Supported by: NSF, DOE, DARPA, Intel, Raytheon and AOL grants

  2. Outline • Motivation and objectives • Autonomia: An Autonomic Control and Management Environment • Self-Optimization • Self-Protection • Concluding Remarks

  3. Information Technology and Biology Convergence • Our system design methods and management tools seem inadequate for handling the complexity, size, and heterogeneity of today's and future information systems • Biological systems have evolved strategies to cope with dynamic, complex, highly uncertain constraints

  4. Current Design and Development of Computing Systems • Different fields evolved separately and targeted few domains/applications

  5. New System Construction: Part to the Whole Approach • Adds Complexity • High-Cost • Interoperability Issues

  6. Autonomic Computing System: Holistic Approach [Figure: Autonomic Building Blocks (Self-Healing, Self-Optimizing, Self-Configuring, and Self-Protecting Components) are combined into Autonomic Computing Systems, for example a Secure, Fault-Tolerant System or a High-Performance, Fault-Tolerant System.]

  7. Autonomia: An Autonomic Control and Management Environment • Provide dynamically programmable control and management services to support the development and deployment of autonomic applications • Provide Autonomic Runtime Services (self-healing, self-configuring, self-protecting, self-optimizing) • Provide automated deployment, registration, and discovery of autonomic components • Provide automated configuration of autonomic applications and system resources

  8. [Figure: Autonomia architecture. Components shown: User’s Application; Application Management Editor; Autonomic Runtime System with Autonomic Runtime Services (Self Configuring, Self Healing, Self Optimizing, Self Protecting); AIK Repository (ACA Specifications, Policy, Component State, Resource State); Policy Engine; Application Runtime Manager (ARM) with Planning Engine, Monitoring & Analysis Engine, Scheduling Engine, Knowledge, Event Server, and Coordinator; CRM: Component Runtime Manager; computational components ACA1 ... ACAm; VEE: Virtual Execution Environment, one per application, VEE(App1) ... VEE(Appn); High Performance Computing Environment (HPCE).]

  9. Autonomia Process Flow [Figure: the Autonomia architecture diagram (User’s Application, Application Management Editor, Autonomic Runtime System and its Autonomic Runtime Services, AIK Repository, Policy Engine, Application Runtime Manager (ARM), Component Runtime Managers (CRM), Virtual Execution Environments (VEE), High Performance Computing Environment (HPCE)) annotated with process-flow steps 1 to 4.]

  10. Application Execution

  11. Self-Optimizing: Design and Evaluation

  12. Current Implementations Intractable for Large Problems

  13. [Figure: Wildfire Autonomic Runtime Manager (WARM). Inputs to the Dynamic Data Driven Wildfire Model: terrain characteristics, regional weather, local weather (temperature, humidity, wind speed, wind direction, clouds, precipitation, lightning), sensors, survey flights, fuel conditions, firefighting activities, smoke locations and concentration, GPS, satellite, and a georeferenced distributed DB. Output: predicted fire behavior (location, intensity, geometry, propagation). WARM components: Analysis Objectives, Natural Region Characterization (NR1 Burned, NR2 Burning, NR3 Unburned), Active Performance Model, Planning Engine, Knowledge Repository, Resource State, Application State, System Capability Module (CPU, Memory, Bandwidth, Availability, Access Policy), Resource History Module, Monitor, Autonomic Scheduling, Virtual Computation Units (VCU), Virtual Resource Unit, Execution. Heterogeneous, Dynamic Computational Environment: Beowulf, IBM SP2, Linux cluster, MPP. Also labeled: Actual Wild Fire, Model Development Environment.]

  14. Forest Fire Cell Space: Dynamic Repartitioning [Figure: initial partitioning of the cell space into natural regions (NR2, NR3, NR5); the burning zone receives finer gridding and the burned zone coarser gridding.]

  15. Wild Fire Simulation Physics • The entire area is represented as a 2-D cell space. The weather and vegetation conditions are assumed to be uniform within a cell, but may vary across the cell space • When a cell is ignited, its state changes from “unburned” to “burning”. During its “burning” phase, the fire propagates to its eight neighbors along the eight directions. • As the simulation time advances, the fire propagates from the first ignition cell to other cells.
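
The cell-space rules above can be sketched as a small cellular automaton. This is only an illustration, not the simulator used in the talk: the state names, the `step` function, and the rule that a burning cell ignites all eight neighbors in a single step and then becomes burned are assumptions. The slide states that weather and vegetation vary across cells, which a real model would use to decide when the fire reaches each neighbor.

```python
# Hypothetical cell states; the slide only names "unburned" and "burning".
UNBURNED, BURNING, BURNED = 0, 1, 2

# Offsets to the eight neighbors of a cell.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1),
             (0, -1),           (0, 1),
             (1, -1),  (1, 0),  (1, 1)]


def step(grid):
    """Advance the grid by one time step and return the new grid.

    Assumption: every burning cell ignites each unburned neighbor in one
    step and then becomes burned.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != BURNING:
                continue
            new[r][c] = BURNED
            for dr, dc in NEIGHBORS:
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNBURNED:
                    new[nr][nc] = BURNING
    return new


def burning_cells(grid):
    """Count burning cells, i.e. the computational load of this step."""
    return sum(row.count(BURNING) for row in grid)
```

Counting burning cells per region of the grid gives the per-processor load that the later slides try to balance.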

  16. Parallel Wild Fire Simulation Analysis • The composition of execution time at time step t for 4 processors. • To decrease T(t), make the computation time on each processor as even as possible, which minimizes the synchronization time. • The Imbalance Ratio (IR) characterizes the degree of imbalance
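
The formula for the imbalance ratio did not survive in this transcript. The sketch below assumes one common definition, the gap between the slowest and fastest processor relative to the fastest, and assumes that a synchronized step takes as long as its slowest processor. Both the definition and the function names are assumptions for illustration.

```python
def step_time(comp_times):
    """Time of one synchronized step: all processors wait for the slowest."""
    return max(comp_times)


def imbalance_ratio(comp_times):
    """Assumed definition: (max - min) / min of per-processor compute times."""
    fastest, slowest = min(comp_times), max(comp_times)
    if fastest <= 0:
        raise ValueError("compute times must be positive")
    return (slowest - fastest) / fastest


def needs_repartitioning(comp_times, threshold):
    """True when the imbalance ratio exceeds the given threshold."""
    return imbalance_ratio(comp_times) > threshold
```

With four processors taking 1, 2, 3, and 4 seconds, the step takes 4 seconds and three processors spend part of it idle, which is the synchronization time the slide wants to minimize.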

  17. Fire Simulation Example • The example shows the imbalance ratio at different time steps. As the simulation advances, the imbalance gets worse. [Figure panels: t = 1, t = N, t = 2N]

  18. Self-Optimization • Monitor the state of the fire simulation to obtain the computation load at any time step • Monitor the state of the underlying system to obtain the computation capacity • Monitor the imbalance ratio at any time step • If the imbalance ratio is larger than a given threshold, dynamically adjust the workload among processors at run time.

  19. Self-Optimization Algorithm • Obtain the total workload at time t • Estimate the computation time of one burning cell on processor p, taking the system load into account, where L(p,t) is the length of the CPU queue on processor p at time t • Calculate the average execution time of one burning cell

  20. Self-Optimization Algorithm (cont’d) • To balance the load on each processor, the processor allocation factor (PAF) is defined as inversely proportional to the processor execution time with respect to the average execution time. • Calculate the Processor Load Ratio (PLR) that characterizes the capacities of the processors Note that: • Calculate the workload assigned to processor p at time step t, workload(p,t)
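
The formulas on these two slides are missing from the transcript, so the following is a sketch under stated assumptions rather than the algorithm as presented. It assumes that the per-cell time on a processor grows linearly with the CPU queue length L(p,t), that PAF is the average per-cell time divided by the processor's per-cell time, and that PLR is PAF normalized so the ratios sum to 1. All function and parameter names are hypothetical.

```python
def cell_time(base_time, queue_length):
    """Assumed load model: per-cell time scales with (1 + CPU queue length)."""
    return base_time * (1 + queue_length)


def plan_workload(total_workload, base_times, queue_lengths):
    """Split `total_workload` burning cells across processors.

    Returns a list of integer cell counts, one per processor, that sums to
    `total_workload`.
    """
    times = [cell_time(b, q) for b, q in zip(base_times, queue_lengths)]
    average = sum(times) / len(times)

    # PAF: inversely proportional to the processor's time, relative to average.
    paf = [average / t for t in times]

    # PLR: PAF normalized so that the ratios sum to 1.
    total_paf = sum(paf)
    plr = [f / total_paf for f in paf]

    # Integer shares via the largest-remainder method so nothing is lost.
    exact = [total_workload * r for r in plr]
    shares = [int(x) for x in exact]
    leftover = total_workload - sum(shares)
    by_fraction = sorted(range(len(exact)),
                         key=lambda i: exact[i] - shares[i], reverse=True)
    for i in by_fraction[:leftover]:
        shares[i] += 1
    return shares
```

For example, four identical processors with CPU queue lengths 0, 1, 1, and 3 would receive 400, 200, 200, and 100 of 900 burning cells under these assumptions, so the idle processor gets the largest share.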

  21. Fire Simulation Example with Self-Optimization Algorithm • With the self-optimization algorithm, the imbalance is dramatically reduced. [Figure panels: t = 1, t = N, t = 2N]

  22. [Figure: Wildfire Autonomic Runtime Manager (WARM) running the DDWM, annotated with steps 1 to 8. Labeled parts: Online Monitoring and Analysis, Online Planning, Autonomic Scheduling, Execution (DDWM VRUs); Monitor, Active Performance Model, Planning Engine, Scheduler, Knowledge Repository, Resource State, Application State, System Capability Module (CPU, Memory, Bandwidth, Availability, Access Policy), Resource History Module, Virtual Computation Units (VCU), Virtual Resource Unit; natural regions NR1 Burned, NR2 Burning, NR3 Unburned; Heterogeneous, Dynamic Computational Environment (Beowulf, IBM SP2, Linux cluster, MPP).]

  23. Experimental results • Problem size is 64K and the number of processors is 8 • With self-optimization, the imbalance ratio is kept close to the threshold. Without self-optimization, the imbalance ratio grows as the simulation advances

  24. Experimental results (cont’d) • Problem size is 64K and the number of processors is 8. • Without self-optimization, the per-processor execution times for one time step become increasingly uneven as the simulation advances. • With self-optimization, the per-processor execution times for one time step stay almost evenly distributed as the simulation advances.

  25. Experimental results (cont’d)

      Problem size (256*256 = 64K)
      Number of Processors | Execution Time with Static Partition (s) | Execution Time with Dynamic Partition (s) | Percentage Improvement
      8  | 2441.88 | 1540.58 | 36.91%
      16 | 1824.43 | 1132.79 | 37.91%

      Problem size (512*512 = 256K)
      Number of Processors | Execution Time with Static Partition (s) | Execution Time with Dynamic Partition (s) | Percentage Improvement
      8  | 16868.04 | 11244.40 | 33.34%
      16 | 11121.66 | 7859.89  | 29.33%
      32 | 9093.39  | 6092.23  | 33%
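
The slide does not say how the improvement percentage is computed. The reported figures appear consistent with the reduction in execution time relative to the static partition, which is the assumption made in this short check; the function name is illustrative.

```python
def improvement(static_s, dynamic_s):
    """Percentage reduction in execution time relative to static partitioning."""
    return (static_s - dynamic_s) / static_s * 100.0


# (static, dynamic) execution times in seconds, as reported on the slide.
reported = [
    (2441.88, 1540.58),    # reported as 36.91%
    (1824.43, 1132.79),    # reported as 37.91%
    (16868.04, 11244.40),  # reported as 33.34%
    (11121.66, 7859.89),   # reported as 29.33%
    (9093.39, 6092.23),    # reported as 33%
]

for static_s, dynamic_s in reported:
    print(f"{static_s:>9.2f} s -> {dynamic_s:>9.2f} s: "
          f"{improvement(static_s, dynamic_s):.2f}%")
```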

  26. Memory-based Proactive Runtime Partitioning • Optimize performance using a memory-based approach • Minimize the number of page faults and balance work among processors • Memory function model for RM3D • W is the application workload, a_i are PF-based heuristics • Memory-based processor grouping and workload partitioning • Lightly (X-), moderately (X), or heavily (X+) loaded groups based on a 2-level threshold, with N-, N, and N+ processors respectively • Work in group X- transferred to X+ with unit of work being • Sort processors in X+ in ascending order of available memory • Checks are made for processors with corresponding least available memory • Threshold conditions for work transfers must be met • After work transfers, new memory-based work partitioning ratios are computed as
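
The grouping step can be sketched as follows. The slide does not preserve the threshold definitions or the transfer formulas, so this sketch assumes the 2-level threshold is applied to each processor's available memory, with more free memory meaning a lighter load. The function name, the parameter names, and the megabyte unit are hypothetical.

```python
def group_by_memory(available_mb, low_mb, high_mb):
    """Classify processors into X- (light), X (moderate), and X+ (heavy).

    Assumption: a processor with at least `high_mb` of free memory is lightly
    loaded, and one with less than `low_mb` is heavily loaded.
    """
    groups = {"X-": [], "X": [], "X+": []}
    for proc, mem in enumerate(available_mb):
        if mem >= high_mb:
            groups["X-"].append(proc)
        elif mem < low_mb:
            groups["X+"].append(proc)
        else:
            groups["X"].append(proc)
    # As on the slide: sort X+ in ascending order of available memory.
    groups["X+"].sort(key=lambda p: available_mb[p])
    return groups
```

Only the grouping and the sort are shown; the transfer unit and the new partitioning ratios depend on formulas that are not recoverable from the transcript.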

  27. Memory-based Proactive Runtime Partitioning • Better performance → moderately, heavily loaded scenarios • Most processors have less available memory • Frequent page faults resulting in long application delays • Memory-based algorithm yields better performance [Figure: Memory-based proactive adaptation performance gain for RM3D application with base grid size 128*32*32 on 8 processors]

  28. CPU-based Proactive Runtime Partitioning • The adaptive system-sensitive partitioner uses system capacities and the obtained performance function to compute the relative computational capacity of each processor • System Capacity Calculation • N processors, the total work to be assigned is L • The runtime monitors application and system state • Application state: level of refinement, number, shape and aspect ratio of refined patches • System state: computational load, memory availability, link bandwidth • The performance engine selects the appropriate performance function to predict the execution time of the application for the next time step • is the execution time on processor k • The PF of RM3D on processor k for a given load X1 and AMR level X2 is empirically defined as:
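
The capacity formula is not preserved in the transcript. One plausible form, shown here purely as an assumption, is a weighted sum of each processor's normalized CPU, memory, and bandwidth availability, with the total work L then split in proportion to the resulting relative capacities. The weights and all names are illustrative.

```python
def relative_capacities(cpu, memory, bandwidth, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Relative capacity of each processor as a weighted sum (assumed form).

    Each input is a list with one availability measurement per processor.
    The weights are expected to sum to 1, so the capacities also sum to 1.
    """
    def normalize(values):
        total = sum(values)
        return [v / total for v in values]

    p, m, b = normalize(cpu), normalize(memory), normalize(bandwidth)
    w_cpu, w_mem, w_bw = weights
    return [w_cpu * pi + w_mem * mi + w_bw * bi for pi, mi, bi in zip(p, m, b)]


def assign_work(total_work, capacities):
    """Give processor k the share L_k = C_k * L of the total work L."""
    return [total_work * c for c in capacities]
```

A compute-bound application might weight CPU availability most heavily, while a memory-bound one would shift weight toward the memory term.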

  29. CPU-based Proactive System Sensitive Runtime Partitioning [Figure: CPU-based proactive partitioning performance gain on 16 processors (base grid size: 64*16*16)]

  30. Autonomia Self-Healing [Figure components: User application; Application Management Editor; Autonomic Runtime System; Autonomic Middleware Services; Self-Healing Service; AIK; Application Runtime Manager; Application Fault Manager with a self-healing monitoring and analyzing engine (monitoring, analyzer) and a self-healing planning and execution engine (planning, execution); Knowledge; Event Server; Mobile Agent System; Component Fault Managers; Heterogeneous Environment.]

  31. Self-Healing Engine

  32. Self-Protection Methodology [Figure components: Online Monitoring, Adaptive Analysis, Self Healing Engine, Data Mining, Statistic Engine, Real Network Running Environment]

  33. Measurement Attributes for Different Protocols • Inside a network element, the measurement attributes can be monitored at different protocol layers. • During the attack (DoS attack, SQL slammer worm, email worm, etc.), significant behaviors will be observed.

  34. Illustrative Network Example • Server networks: 12 routers and 30 servers • Client networks: 150 clients, 30 routers • Router-to-router links are 100 Mbps; router-to-client-node links are 30 Mbps and 10 Mbps • Traffic configuration: legitimate client traffic through the same interface as attack traffic to other servers; legitimate client traffic through a different interface to the attacked server; legitimate client traffic through the same interface to the attacked server and towards attack targets; legitimate (heavy) server traffic through a different interface and towards other clients; attack traffic [Figure: topology with Client Net 0 to Client Net 3, Server Net 1, and Server Net 2]

  35. Abnormality Distance (AD) • The Abnormality Distance of the measurement attributes is used as an abnormality metric for profile modeling of the component behavior. It is defined in terms of the mean and variance, under normal operating conditions, of the online measurement of attribute k. [Figure: ADtcp_out versus packet number] The figure shows ADtcp_out based on a single measurement attribute; a larger magnitude of ADtcp_out indicates abnormal behavior that might be due to an attack.
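
The AD formula itself is missing from the transcript. The sketch below assumes a squared deviation from the normal-condition mean, scaled by the normal-condition variance, which is one standard way to turn a mean and variance into a distance. The function names and the use of the population variance are assumptions.

```python
import statistics


def fit_profile(normal_samples):
    """Mean and variance of attribute k measured under normal operation."""
    return statistics.mean(normal_samples), statistics.pvariance(normal_samples)


def abnormality_distance(x, mean, variance):
    """Assumed form of AD_k: squared deviation scaled by the normal variance."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    return (x - mean) ** 2 / variance
```

A measurement at the normal mean gives a distance of 0, and the distance grows quadratically as the measurement moves away, so spikes like those during an attack stand out clearly.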

  36. Multivariate Analysis Techniques on Network Attack Detection • Measurement Attributes • tcpOut: rate of legitimate outgoing TCP segments • tcpTotal: rate of legitimate plus spoofed outgoing TCP segments • NRC: Normal Region Center, the baseline profile for the normal state • AD: Abnormality Distance [Figure: tcpTotal versus tcpOut plane showing the Normal Region bounded by LCLtcpout, UCLtcpout, LCLtcptotal, and UCLtcptotal, the center NRC, a point A, and the abnormality distance AD]
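
A two-attribute version of the normal region can be sketched as below. The slide names the control limits (LCL, UCL) and the center (NRC) but not how they are derived, so placing the limits at the mean plus or minus k standard deviations and measuring AD as a Euclidean distance in standard-deviation units are assumptions, as are all function names.

```python
import math
import statistics


def normal_region(normal_samples, k=3.0):
    """Build a baseline from (tcpOut, tcpTotal) pairs seen in normal operation.

    Returns the center (NRC), the per-attribute standard deviations, and the
    per-attribute (LCL, UCL) limits, assumed to sit at mean +/- k * std.
    """
    columns = list(zip(*normal_samples))
    center = [statistics.mean(col) for col in columns]
    spread = [statistics.pstdev(col) for col in columns]
    limits = [(m - k * s, m + k * s) for m, s in zip(center, spread)]
    return center, spread, limits


def abnormality_distance(point, center, spread):
    """Distance of a point from the NRC, in units of standard deviations."""
    return math.sqrt(sum(((x - m) / s) ** 2
                         for x, m, s in zip(point, center, spread)))


def is_abnormal(point, limits):
    """True when any attribute falls outside its control limits."""
    return any(not (lcl <= x <= ucl) for x, (lcl, ucl) in zip(point, limits))
```

A point whose tcpOut looks normal but whose tcpTotal is far above its upper limit falls outside the region, which is the situation spoofed outgoing traffic would produce.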

  37. Validation on Attacker Side – Spoofed TCP SYN Attack • Attack intensity and duration are adjustable • TCP SYN attack traffic is spoofed • The number of incoming/outgoing packets alone does not reveal the attack • Analyzing it jointly with the total TCP network activity can reveal the attack.

  38. Autonomia Self-Protection Architecture [Figure components: Online Monitoring (raw traffic w.r.t. metrics 1 ... n); Analysis Engine (Information Theory, abnormality function w.r.t. metrics 1 ... m, normal/abnormal characterization); Policy Translator; Autonomic Runtime Engine; actions: Change Network Topology, Change Network Configuration Parameters]

  39. Working Flow of the Analysis Engine • Information theory is used to identify the most important features that can be extracted from network data. • A genetic algorithm is applied to the training data to obtain the threshold and coefficients used by the linear detection rule. • The threshold and coefficients are then used to detect a wide range of attacks during the testing period.
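
The first and last steps of this flow can be illustrated with a short sketch. Information gain is used here as one common information-theoretic feature score; the talk does not say which measure was used, so that choice is an assumption. The linear rule takes its coefficients and threshold as given, standing in for values that a genetic algorithm would produce. All names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy reduction obtained by splitting the labels on a discrete feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def linear_rule(features, coefficients, threshold):
    """Flag a record as an attack when the weighted feature sum exceeds the threshold."""
    score = sum(c * x for c, x in zip(coefficients, features))
    return score > threshold
```

Features with the highest gain would be kept as inputs to the linear rule, whose coefficients and threshold are then tuned on the training data.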

  40. Network Attack Feature Extraction • Discrete Features • Base dataset has a larger sample size • Discrete features provide little semantic information [Figure panels: Total Dataset, Probe + Normal, DoS + Normal, R2L + Normal, U2R + Normal]

  41. Network Attack Feature Extraction (Cont.) • Continuous Features • Compared with the discrete features, some continuous features provide more information to the final detection • Information provided by the continuous features is much more meaningful • A partition strategy is deployed in the discretization of the continuous features • Heuristic algorithms (e.g. Genetic Algorithm) are used to determine the optimal partition • Combining both discrete and continuous features provides a better detection rate [Figure panels: Discrete Features on Total Dataset, Continuous Features on Total Dataset]

  42. Experimental Results • We compare our approach, which is based on discrete features, with a fuzzy classifier evolved using Ctree and with the results of the winning group in the KDDCup’99 contest.

  43. Results – Discrete vs. Cont. & Combined • We compare the results of using discrete and continuous features respectively

  44. Summary and Concluding Remarks • Increased complexity, heterogeneity, uncertainty, and scale require new paradigms to design, control and manage systems and applications • Systems and applications need to operate reliably, securely, efficiently and cost-effectively • A holistic approach is needed that can dynamically integrate and address all these issues simultaneously at the layers of the system and application hierarchy • Autonomic computing provides an interesting, pragmatic approach to address these issues • Many challenges lie ahead, including composing and analyzing in real time the operations and states of systems and applications; new bio-inspired metrics are needed that accurately characterize and quantify the normal and abnormal states of systems and applications