QoS-driven Lifecycle Management of Service-oriented Distributed Real-time & Embedded Systems

Aniruddha Gokhale • Assistant Professor, ISIS, Dept. of EECS, Vanderbilt University, Nashville, Tennessee • a.gokhale@vanderbilt.edu • www.dre.vanderbilt.edu/~gokhale • February 16th, 2006 • www.dre.vanderbilt.edu

Service-oriented Style of Distributed Real-time & Embedded Systems • Regulating & adapting to (dis)continuous changes in runtime environments • e.g., online prognostics, dependable upgrades • Satisfying tradeoffs between multiple (often conflicting) QoS demands • e.g., secure, real-time, reliable, etc. • Satisfying QoS demands in the face of fluctuating and/or insufficient resources • e.g., mobile ad hoc networks (MANETs)

Characteristics of SOA-style DRE Systems • Manifestation of Service-Oriented Architectures (SOA) in the distributed real-time & embedded (DRE) systems space • Applications composed of one or more “operational strings” of services • A service is a component or an assembly of components • Dynamic (re)deployment of services into operational strings is necessary • New class of QoS (performance + survivability) requirements • Realized using enabling component middleware technologies, e.g., CCM, .NET and J2EE

QoS Issues for SOA-style DRE Systems • Per-component concern – choice of implementation • Depends on resources, compatibility with other components in assembly • Communication concern – choice of communication mechanism used • Assembly concerns – what components to assemble dynamically? What order? What configurations end-to-end are valid? • Failure recovery concern – what is the unit of failover? • Sharing concern – shared components will need proactive survivability since their failure affects several services simultaneously • Availability concern – what is the degree of redundancy? What replication styles to use? Does it apply to whole assembly? • Deployment concern – how to select resources? Risk alleviation?

Tangled Concerns in SOA-style DRE Systems • Demonstrates numerous tangled para-functional concerns • Significant sources of variability that affect end-to-end QoS (performance + survivability) • Separation of Concerns & Managing Variability is the Key [Figure: concerns grouped by lifecycle phase: Design-time, Deployment-time, Run-time]

(1) Design-time Variability Management in SOA-style DRE Systems • Focus on: • - Separation of concerns • - “What if” analysis (analytical methods, simulation methods, model-driven generative programming for “what if”) • - Understanding the impact of individual concerns • Students involved: Krishnakumar Balasubramanian, Jaiganesh Balasubramanian, Gan Deng, Amogh Kavimandan, James Hill, Sumant Tambe, Arundhati Kogekar, Dimple Kaul • Work partly supported by DARPA PCES program (PI), DARPA ARMS Program (PI on subcontracts from Lockheed Martin ATL), & NSF CSR-SMA Program (PI)

Separation of Concerns using CoSMIC • Project Lead and PI DARPA PCES program • CoSMIC project focuses on separation of deployment and configuration concerns • Model-driven generative programming framework • Complementary technology to CIAO and DAnCE middleware • www.dre.vanderbilt.edu/cosmic • CoSMIC tools e.g., PICML used for separation of concerns in operational strings • Captures the data model of the OMG D&C specification • Synthesis of static deployment plans for DRE components • New capabilities being added for static deployment planning Work supported by DARPA PCES Program, PI

Case Study for “What if” Analysis: Virtual Router • Network services need support for efficient (de)multiplexing, dispatching and routing/forwarding • e.g., VPN service provided by a virtual router (VR) • Provides differentiated services to customers, e.g., prioritized service • VPN setup messages must be efficiently (de)multiplexed, serviced and forwarded • Implemented using middleware • Need to estimate capacity of the system at design-time • Problem boils down to capacity planning and estimating performance of configured middleware

Performance Analysis of Reactor Pattern in VR • Customers send VPN setup messages to router • VPN setup messages manifest as events at the VR • VR must service these events (e.g., resource allocation) and honor the prioritized service, if any • Accepted messages are forwarded • Events could be dropped in overload conditions • The Reactor architectural pattern allows event-driven applications to demultiplex & dispatch service requests that are delivered to an application from one or more clients • Reactor pattern decouples the detection, demultiplexing, & dispatching of events from the handling of events • Participants include the Reactor, Event handle, Event demultiplexer, abstract and concrete event handlers
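The participants listed above can be sketched in a few lines of Python. This is only an illustration of the pattern, not the VR middleware: the class names `EventHandler`, `SetupMessageHandler` and `Reactor`, the `demo` function, and the message text are all hypothetical, and Python's standard `selectors` module stands in for the event demultiplexer.

```python
import selectors
import socket
from abc import ABC, abstractmethod


class EventHandler(ABC):
    """Abstract event handler; the reactor depends only on this interface."""

    def __init__(self, sock: socket.socket) -> None:
        self.sock = sock  # the event handle this handler is bound to

    @abstractmethod
    def handle_input(self) -> None:
        """Called by the reactor when the handle becomes readable."""


class SetupMessageHandler(EventHandler):
    """Concrete handler (hypothetical): records each VPN setup message."""

    def __init__(self, sock: socket.socket, log: list) -> None:
        super().__init__(sock)
        self.log = log

    def handle_input(self) -> None:
        self.log.append(self.sock.recv(1024).decode())


class Reactor:
    """Detects, demultiplexes and dispatches events; it never handles them."""

    def __init__(self) -> None:
        self._demux = selectors.DefaultSelector()  # event demultiplexer

    def register(self, handler: EventHandler) -> None:
        self._demux.register(handler.sock, selectors.EVENT_READ, data=handler)

    def handle_events(self, timeout: float = 1.0) -> int:
        ready = self._demux.select(timeout)
        for key, _mask in ready:
            key.data.handle_input()  # dispatch to the registered handler
        return len(ready)

    def close(self) -> None:
        self._demux.close()


def demo() -> list:
    """One customer sends one setup message; the reactor dispatches it."""
    customer, router = socket.socketpair()
    log: list = []
    reactor = Reactor()
    reactor.register(SetupMessageHandler(router, log))
    customer.sendall(b"vpn-setup:customer-1")
    reactor.handle_events()
    reactor.close()
    customer.close()
    router.close()
    return log
```

The reactor only sees the abstract `handle_input` hook, which is the decoupling the slide describes: new event types are added by registering new concrete handlers, without touching the dispatch loop.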

Modeling VR Capabilities in a Reactor • Consider VPN service for two customer classes • Reactor accepts and handles two types of input events • Differentiated services for two classes • Events are handled in prioritized order • Each event type has a separate queue to hold the incoming events. Buffer capacity for events of type one is N1 and of type two is N2. • Event arrivals are Poisson for type one and type two events with rates λ1 and λ2, resp. • Event service time is exponential for type one and type two events with rates μ1 and μ2, resp. • Model of a single-threaded, select-based reactor implementation

Performance Metrics of Interest for Reactor • Throughput: • -Number of events that can be processed • -Applications such as telecommunications call processing. • Queue length: • -Queuing for the event handler queues. • -Appropriate scheduling policies for applications with real-time requirements. • Total number of events: • -Total number of events in the system. • -Scheduling decisions. • -Resource provisioning required to sustain system demands. • Probability of event loss: • -Events discarded due to lack of buffer space. • -Safety-critical systems. • -Levels of resource provisioning. • Response time: • -Time taken to service the incoming event. • -Bounded response time for real-time systems.

Performance Analysis using Stochastic Reward Nets • Stochastic Reward Nets (SRNs) are an extension to Generalized Stochastic Petri Nets (GSPNs), which are an extension to Petri Nets. • Extend the modeling power of GSPNs by allowing: • - Guard functions • - Marking-dependent arc multiplicities • - General transition probabilities • - Reward rates at the net level • Allow model specification at a level closer to intuition. • Solved using tools such as SPNP (Stochastic Petri Net Package). [Figure: SRN model of the reactor, panels (a) and (b), with places B1, B2, S1, S2, StSnpSht, SnpShtInProg and transitions A1, A2, Sr1, Sr2, Sn1, Sn2, T_SrvSnpSht, T_EndSnpSht; legend: transition, immediate transition, place, token, inhibitor arc]

Modeling the Reactor using SRN (1/2) • Models arrivals, queuing, and prioritized service of events. • Transitions A1 and A2: Event arrivals. • Places B1 and B2: Buffer/queues. • Places S1 and S2: Service of the events. • Transitions Sr1 and Sr2: Service completions. • Inhibitor arcs: Place B1 and transition A1 with multiplicity N1 (likewise B2, A2, N2) • - Prevents firing of transition A1 when there are N1 tokens in place B1. • Inhibitor arc from place S1 to transition Sr2: • - Offers prioritized service to an event of type one over an event of type two. • - Prevents firing of transition Sr2 when there is a token in place S1. [Figure: SRN model of the reactor, panels (a) and (b); callouts: event arrival, drop events on overflow, service queue, prioritized service, servicing the event, service completion]
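The two inhibitor-arc rules above follow the usual Petri net enabling rule: a transition may fire only if every input place holds enough tokens and every inhibitor place holds fewer tokens than the arc multiplicity. A minimal sketch, assuming a dictionary-based marking; the `Transition` class and `is_enabled` function are hypothetical names, and guard functions and transition priorities are deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class Transition:
    name: str
    inputs: dict = field(default_factory=dict)      # place -> tokens required
    inhibitors: dict = field(default_factory=dict)  # place -> blocking multiplicity


def is_enabled(t: Transition, marking: dict) -> bool:
    """Enabling rule with inhibitor arcs (no guards, no priorities)."""
    has_inputs = all(marking.get(p, 0) >= n for p, n in t.inputs.items())
    not_inhibited = all(marking.get(p, 0) < n for p, n in t.inhibitors.items())
    return has_inputs and not_inhibited


N1 = 5  # buffer capacity for type-one events (one of the deck's two cases)

# Arrival A1 has no input place; it is blocked once B1 holds N1 tokens.
A1 = Transition("A1", inhibitors={"B1": N1})

# Service completion Sr2 needs a token in S2 and is blocked by a token in S1.
Sr2 = Transition("Sr2", inputs={"S2": 1}, inhibitors={"S1": 1})
```

With this rule, `is_enabled(A1, {"B1": N1})` is false (overflow events are dropped), and `is_enabled(Sr2, {"S1": 1, "S2": 1})` is false until the type-one event has completed service.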

Modeling the Reactor using SRN (2/2) • Process of taking successive snapshots • Reactor waits for new events when currently enabled events are handled • Sn1 enabled: Token in StSnpSht & Tokens in B1 & No Token in S1. • Sn2 enabled: Token in StSnpSht & Tokens in B2 & No Token in S2. • T_SrvSnpSht enabled: Token in S1 and/or S2. • T_EndSnpSht enabled: No token in S1 and S2. • Sn1 and Sn2 have same priority • T_SrvSnpSht lower priority than Sn1 and Sn2 [Figure: SRN model of the reactor, panels (a) and (b), showing the snapshot places StSnpSht and SnpShtInProg]
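The snapshot behaviour can also be explored with a small stochastic simulation. The sketch below is an independent approximation written for illustration, not the SPNP solution of the SRN: it assumes that a snapshot takes at most one event from each non-empty queue, serves type one before type two, and that arrivals finding a full buffer are dropped. The function name `simulate_reactor` and its parameters are hypothetical, and its output is not claimed to match the deck's SPNP estimates.

```python
import random


def simulate_reactor(lam, mu, cap, horizon=20000.0, seed=1):
    """Simulate the two-class snapshot reactor.

    lam, mu, cap: pairs (type one, type two) of arrival rates, service
    rates and buffer capacities. Returns per-class throughput and loss.
    """
    rng = random.Random(seed)
    now = 0.0
    queue = [0, 0]   # tokens in B1, B2
    snapshot = []    # event types still to be served in this snapshot
    arrived, dropped, served = [0, 0], [0, 0], [0, 0]

    while True:
        # Idle reactor: start a snapshot with one event per non-empty queue.
        # Type one is appended first, so it is served first (priority).
        if not snapshot:
            for k in (0, 1):
                if queue[k] > 0:
                    queue[k] -= 1
                    snapshot.append(k)

        # Competing exponential clocks: two arrivals and, if busy, a service.
        service_rate = mu[snapshot[0]] if snapshot else 0.0
        total = lam[0] + lam[1] + service_rate
        now += rng.expovariate(total)
        if now >= horizon:
            break

        pick = rng.uniform(0.0, total)
        if pick < lam[0]:
            k = 0
        elif pick < lam[0] + lam[1] or not snapshot:
            k = 1
        else:
            served[snapshot.pop(0)] += 1  # service completion
            continue

        arrived[k] += 1
        if queue[k] < cap[k]:
            queue[k] += 1
        else:
            dropped[k] += 1  # buffer full: event is lost

    return {
        "throughput": [s / horizon for s in served],
        "loss": [d / a if a else 0.0 for d, a in zip(dropped, arrived)],
    }
```

A call such as `simulate_reactor((0.5, 0.5), (2.0, 2.0), (1, 1))` uses the rates from the deck's SPNP runs; rerunning with capacities `(5, 5)` shows how larger buffers reduce the loss fraction.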

VR SRN: Performance Estimates
Perf. metric     N1 = N2 = 1          N1 = N2 = 5
                 #1        #2         #1        #2
Throughput       0.37/s    0.37/s     0.40/s    0.40/s
Queue length     0.065     0.065      0.12      0.12
Total events     0.25      0.27       0.32      0.35
Loss probab.     0.065     0.065      0.00026   0.00026
• SRN model solved using Stochastic Petri Net Package (SPNP) to obtain estimates of performance metrics. • Parameter values: λ1 = 0.5/sec, λ2 = 0.5/sec, μ1 = 2.0/sec, μ2 = 2.0/sec. • Two cases: N1 = N2 = 1, and N1 = N2 = 5. • Observations: • Probability of event loss is higher when the buffer space is 1 • Total number of events of type two is higher than type one. • Events of type two stay in the system longer than events of type one. • May degrade the response time of event requests for class 2 customers compared to requests from class 1 customers

VR SRN: Sensitivity Analysis • Analyze the sensitivity of performance metrics to variations in input parameter values. • Vary λ1 from 0.5/sec to 2.0/sec. • Values of other parameters: λ2 = 0.5/sec, μ1 = 2.0/sec, μ2 = 2.0/sec, N1 = N2 = 5. • Compute performance measures for each one of the input values. • Observations: • Throughput of event requests from customer class #1 increases, but rate of increase declines. • Throughput of event requests from customer class #2 remains unchanged.
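The shape of the first observation can be reproduced with a much simpler stand-in model. The sketch below sweeps the arrival rate through a single-class M/M/1/K queue using the textbook blocking formula; it is not the two-class SRN and says nothing about class #2. The function names are hypothetical, and K = 6 is an assumption (five buffer slots plus the event in service).

```python
def mm1k_throughput(lam: float, mu: float, k: int) -> float:
    """Accepted-event rate of an M/M/1/K queue (K counts the one in service)."""
    rho = lam / mu
    if abs(rho - 1.0) < 1e-12:
        p_loss = 1.0 / (k + 1)  # limiting case rho = 1
    else:
        p_loss = (1.0 - rho) * rho**k / (1.0 - rho ** (k + 1))
    return lam * (1.0 - p_loss)


def sweep(mu: float = 2.0, k: int = 6):
    """Vary the arrival rate from 0.5/sec to 2.0/sec, as on the slide."""
    rates = [0.5, 1.0, 1.5, 2.0]
    return [(lam, mm1k_throughput(lam, mu, k)) for lam in rates]


if __name__ == "__main__":
    for lam, x in sweep():
        print(f"arrival rate {lam:.1f}/sec -> throughput {x:.3f}/sec")
```

Throughput keeps rising with the arrival rate, but each step adds less than the previous one because a growing share of arrivals finds the buffer full.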

Middleware Pattern Simulations in OMNeT++ • OMNeT++ is a discrete event simulator for networked systems • Developers write C++ code for simulation • www.omnetpp.org [Figure: OMNeT++ workflow; labels: .ned files, modules and submodules (Mod, Submod1, Submod2) with their .h/.cpp sources, OMNeT++ message file, OMNeT++ initialization file, simulation kernel, UI library, statistics, output vector file, output scalar file, visualization and animation]

The Simulation Model for Reactor [Figure: simulation modules: Event Generator, Reactor, Synchronous Event Demultiplexer, Event Handlers with queues, Statistics Collector]

Addressing Middleware Variability Challenges • Although middleware provides reusable building blocks that capture commonalities, these blocks and their compositions incur variabilities that impact performance in significant ways. • Per-Block Configuration Variability • Incurred due to variations in implementations & configurations for a patterns-based building block • E.g., single-threaded versus thread-pool based reactor implementation dimension that crosscuts the event demultiplexing strategy (e.g., select, poll, WaitForMultipleObjects) • Compositional Variability • Incurred due to variations in the compositions of these building blocks • Need to address compatibility in the compositions and individual configurations • Dictated by needs of the domain • E.g., Leader-Follower makes no sense in a single-threaded Reactor
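A compatibility check of this kind can be expressed as a small validation function. The sketch below encodes only the options and the single rule named on the slide; `ReactorConfig`, `problems` and the string values are hypothetical names chosen for illustration, not part of any middleware API.

```python
from dataclasses import dataclass

THREADING_MODELS = {"single", "thread_pool"}
DEMUX_STRATEGIES = {"select", "poll", "WaitForMultipleObjects"}


@dataclass(frozen=True)
class ReactorConfig:
    threading: str                 # per-block choice: threading model
    demux: str                     # per-block choice: demultiplexing strategy
    leader_follower: bool = False  # compositional choice


def problems(cfg: ReactorConfig) -> list:
    """Return the list of incompatibilities found in a configuration."""
    issues = []
    if cfg.threading not in THREADING_MODELS:
        issues.append(f"unknown threading model: {cfg.threading}")
    if cfg.demux not in DEMUX_STRATEGIES:
        issues.append(f"unknown demultiplexing strategy: {cfg.demux}")
    # Compositional rule from the slide: Leader-Follower needs several threads.
    if cfg.leader_follower and cfg.threading == "single":
        issues.append("Leader-Follower makes no sense in a single-threaded Reactor")
    return issues
```

For example, `problems(ReactorConfig("single", "select", leader_follower=True))` reports the Leader-Follower conflict, while the same composition on a thread pool passes.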

Automation Goals for “What if” Analysis • Build and validate performance models for invariant parts of middleware building blocks • Weaving of variability concerns manifested in a building block into the performance models • Compose and validate performance models of building blocks mirroring the anticipated software design of DRE systems • Estimate end-to-end performance of composed system • Iterate until design meets performance requirements • Applying design-time performance analysis techniques to estimate the impact of variability in middleware-based DRE systems [Figure: variability is woven into the invariant model of a pattern to produce refined models of the pattern, which are composed into a system model driven by workloads]

Automating & Scaling the “What if” Process • Model-driven Generative technologies • Developed the SRN Modeling Language (SRNML) in GME • Applied C-SAW framework (from Univ of Alabama, Birmingham) for model scalability R&D supported by NSF CSR-SMA Program in collaboration with Dr. Jeff Gray (UAB) and Dr. Swapna Gokhale (UConn)

Analyzing Impact of Individual Concerns • Borrow concepts from physical systems to analyze the impact of individual concerns on end-to-end system • Method of joints, method of sections, free body diagrams, equilibrium conditions Engineering Mechanics – Statics & Dynamics – for analyzing impact of concerns?

Engineering Mechanics for DRE Systems • A concern is viewed as a “force” • Challenges • Directionality – are concerns vectors? • Rigidity – are assemblies rigid or deformable? • Force distribution – does a concern have components along Cartesian axes? • Well-defined structures – do software components have properties like trusses? • Second order effects – transient effects showing up elsewhere • Notion of friction – these are probably the capacities of resources

(2) Deployment-time Intelligence • Near optimal deployment planning decisions • Specialized middleware stacks • Students involved: Arvind Krishna (graduated), Jaiganesh Balasubramanian, Gan Deng, Dimple Kaul, Arundhati Kogekar, Amogh Kavimandan • Work partly supported by DARPA ARMS Program, PI on subcontracts from Lockheed Martin ATL

Deployment Challenges • Service workloads and resource capacity issues – service placement depends on workloads and available resources • Component accessibility patterns – component survivability depends on its degree of sharing • Differentiated levels of service – affects resource provisioning and survivability strategies • Service failover – different failover possibilities, e.g., as a whole assembly, part of an assembly, or one component at a time • Resource sharing – increases the risk that component(s) require a proactive survivability strategy • No one-size-fits-all dependability strategy – cannot dictate one FT strategy on all services

Service Placement Problem • A resource configuration is a tuple RC = (C, D, HC, EC) where: • C is a set of computation nodes, each attributed by: • - PI(c): processing index (capacity) • - MI(c): memory index • - RI(c): reliability index • D is a set of data access units of types in {Ai, Sj} • HC: C → P(D) is a map associating each c in C with a set of data access units (P(D) is the power set of D) • EC ⊆ C × C is a set of communication links, each attributed by: • - BI(e): bandwidth index • - RI(e): reliability index • System performance can be measured in a variety of ways. Considering a task assignment TA: T → C: • Resource utilization: for processing, it is defined as the average of all task processing utilizations (formula not preserved in this text) • Memory utilization MU(TA) and link utilization LU(TA) can be defined similarly • System utilization factor: the weighted sum percentage of utilizing the system resources • Reliability is trickier to measure. In general, the reliability of a given computation string is the product of the reliability indices of the underlying nodes and communication edges. • The reliability factor RF(TA) for a given task assignment TA depends on: • - The reliability of all its computation strings • - The group reliability of the underlying nodes (taking into account their relative distances) • - The resource utilization of the system: the more heavily the hardware is utilized, the less reliable it is [Figure: example resource configuration with computation nodes C1 to C4 and data access units A1 to A3, S1, S3, S4]
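The string-reliability rule above (product of the reliability indices of the nodes and links a computation string uses) is easy to make concrete. The sketch below is a minimal illustration under that rule only: the `Node` and `Link` classes, the function name and all numeric values are hypothetical, and the reliability factor RF(TA), which also depends on group reliability and utilization, is not modelled.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    pi: float  # PI(c): processing index (capacity)
    mi: float  # MI(c): memory index
    ri: float  # RI(c): reliability index, assumed to lie in [0, 1]


@dataclass(frozen=True)
class Link:
    ends: tuple  # names of the two nodes the link connects
    bi: float    # BI(e): bandwidth index
    ri: float    # RI(e): reliability index, assumed to lie in [0, 1]


def string_reliability(nodes, links) -> float:
    """Product of the reliability indices along one computation string."""
    return math.prod(n.ri for n in nodes) * math.prod(e.ri for e in links)


# Hypothetical two-node string: a task on C1 feeding a task on C2.
c1 = Node("C1", pi=100.0, mi=512.0, ri=0.99)
c2 = Node("C2", pi=80.0, mi=256.0, ri=0.98)
e12 = Link(("C1", "C2"), bi=10.0, ri=0.95)
example = string_reliability([c1, c2], [e12])
```

Because the indices multiply, every extra node or link on a string can only lower its reliability, which is one reason placement decisions and survivability decisions interact.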

Specializations via Generative Programming • GME-based POSAML language for POSA2 pattern language • Generative programming to synthesize FOCUS and AspectC++ rules • Synthesize specialized middleware stacks for distributed deployment of operational strings.

(3) Run-time QoS-aware Mechanisms • Focus on autonomic mechanisms • Survivability & fault tolerance • Students involved: Jaiganesh Balasubramanian, Sumant Tambe, Jules White, Nishanth Shankaran • Work supported by DARPA ARMS Program, PI on subcontracts from Lockheed Martin ATL, BBN Technologies, & Telcordia

Distributed Virtual Container Approach • Virtual Container Concept for Component M/W • Based on a virtualization idea • Spans boundaries across all the replicas, which could be placed on different physical nodes • Provides a single point for resource provisioning & component programming • Seamless environment for configuring FT, LB, online swapping • Handles fine-grained checkpointing across all the replicas in virtual container • Reliable multicast & state synchronization confined to a virtual container • Maintains information about how the replicas are connected to the external component assemblies • Salient features • Provides an operating context for the components/assemblies requiring QoS • Relieves programmer from having to configure the middleware for QoS support • Clients are oblivious to replication • Normal container programming model • Middleware hides the virtualization details [Figure: a virtual container spanning primary and secondary replicas]

Run-time QoS & Survivability Mechanisms • A configurable approach to survivability including micro- (infrastructure) & macro- (assembly & operational string) level strategies • Micro-level strategies monitor infrastructure state to make proactive decisions at • Component level (swapping & migration) • Middleware level (configurations) • Component Server Level (process resource allocations) • Node level (multiple components) • Macro-level strategies monitor assembly health to make failover decisions • Failover based on type of failover unit • Affects service placement decisions • May involve load balancing • State synchronization issues • Replication styles (hidden by FT strategies) • Initial prototype developed using Component-Integrated ACE ORB (CIAO) & Deployment & Configuration Engine (DAnCE) (www.dre.vanderbilt.edu)

Research Summary • R&D in new, holistic approaches to end-to-end QoS management in services-enabled distributed real-time & embedded systems [Figure: layered stack: Applications, Middleware, OS & Protocols, Hardware]
