Architecture’s Role in Enterprise Transformation Programs

Architecture’s Role in Enterprise Transformation Programs Iain Mortimer & Rupert Brown V0.4 April 2008

Objectives for today • Overview of ML GIS Transformation Program • Application Availability Stream • Defining an SDLC • Application and Systems Monitoring

Merrill Lynch – a snapshot • Founded 1914 • Global Financial services company • Wealth management • Capital markets • Advisory • Operates in ~38 countries • Client assets of about 2 trillion US dollars • 22nd in Fortune 500 • 64,200 employees

Personal introductions

ML Transformation Program • 70/30 Investment • Straight-Thru-Processing • Application Availability • Global Sourcing • Client Satisfaction • E-Channels These six goals deliver against what clients want and our businesses need

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable” Leslie Lamport ACM SIGACT News, 34, Mar. 2003. Application Availability

The Availability Problem Space Applications Infrastructure Application Portfolio Management Monitor Business Resource Management & Governance SDLC / QA ITIL Technology Service Management Grow the business Run the Business

Application availability by numbers • Engineering challenge • Establish >99.95% availability • 344 tier 0 & tier 1 systems identified in scope • AMRS, EMEA, PacRim == 2x IOCC • Colossal transaction volumes • Streams of work • Monitoring • Reporting • Systems Development Life Cycle • QA
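
To make the >99.95% target concrete, the arithmetic below converts an availability percentage into a downtime budget. This is a worked illustration only: the slide does not state the measurement window, so a calendar year and a 30-day month are assumptions, and the function name is hypothetical.

```python
# Downtime budget implied by an availability target (simple arithmetic).
# Assumption: the window is a 365-day year unless another period is passed in.
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_budget_minutes(availability_pct: float,
                            period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the maximum downtime, in minutes, allowed by an availability target."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * period_hours * 60.0


yearly = downtime_budget_minutes(99.95)
monthly = downtime_budget_minutes(99.95, period_hours=30 * 24)
print(f"99.95% over a year:  {yearly:.1f} minutes")   # about 262.8 minutes
print(f"99.95% over 30 days: {monthly:.1f} minutes")  # about 21.6 minutes
```

Under these assumptions each in-scope system may be unavailable for a little under four and a half hours per year.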

SDLC Applications Infrastructure Business Resource Management & Governance SDLC / QA ITIL Technology Grow the business Run the Business

SDLC Objectives Introduction • Establish a common top level set of stages to guide all projects • Strengthen risk reviews to ensure our most critical projects have the correct technical focus • Standardise project reporting • Redefine key technical review points to increase quality and visibility of technology • Define a core set of key artefacts which are used by many teams across the life cycle • Support greater multi-team working • Reduce complexity and improve documentation, particularly for the support teams

What does an SDLC define? Introduction Our Systems Development Life Cycle (SDLC) establishes a simple-to-use, lightweight mechanism for managing projects. Through this globally common SDLC, senior management will have greater visibility of, and control over, software engineering. At its heart an SDLC defines: • Stages - What needs to be done • Reviews - What needs to be checked • Roles - Which teams are involved and why • Artefacts - What needs to be recorded/documented The SDLC is deliberately NOT a methodology. In designing it, a number of common methodologies were considered to ensure that the SDLC supports their use. Teams should ensure that any methodology they wish to use conforms to the standard artefacts, undertakes technical review and reports against the common stages.

The importance of Artefacts • Technology processes all create artefacts and their dependency relationships • A software project can be measured mechanically on the completeness of its structure of artefacts and their dependencies
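
One way the "measured mechanically" idea could be sketched: treat artefacts as nodes in a dependency graph and score the fraction that are complete and rest on complete dependencies. The artefact names, the `Artefact` class and the scoring rule are illustrative assumptions, not the SDLC's actual artefact set or metric.

```python
from dataclasses import dataclass, field


@dataclass
class Artefact:
    name: str
    depends_on: list[str] = field(default_factory=list)
    complete: bool = False


def completeness(artefacts: dict[str, Artefact]) -> float:
    """Fraction of artefacts that are complete and whose dependencies all are too."""
    def satisfied(name: str, seen: frozenset[str] = frozenset()) -> bool:
        art = artefacts.get(name)
        if art is None or not art.complete or name in seen:
            return False  # missing, incomplete, or part of a dependency cycle
        return all(satisfied(dep, seen | {name}) for dep in art.depends_on)

    if not artefacts:
        return 0.0
    return sum(satisfied(n) for n in artefacts) / len(artefacts)


# Hypothetical project: the test plan is written, but it rests on an
# unfinished detailed design, so it does not count towards the score.
project = {
    "requirements": Artefact("requirements", complete=True),
    "high_level_design": Artefact("high_level_design", ["requirements"], True),
    "detailed_design": Artefact("detailed_design", ["high_level_design"], False),
    "test_plan": Artefact("test_plan", ["requirements", "detailed_design"], True),
}
score = completeness(project)
print(f"artefact completeness: {score:.0%}")
```

Counting an artefact only when its whole dependency chain is complete is a deliberate choice here: it penalises documents produced out of order, which a simple "how many documents exist" count would miss.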

SDLC Routes Introduction – Routes through the SDLC A key factor in any SDLC is balancing the competing needs of risk control and weight of effort. Three broad routes are defined, each entailing a different level of work at each stage and a different depth and number of reviews. The choice of route is broadly based on: • the tier of the system • risk • the amount of work to be carried out • project manager’s judgement

SDLC Stages Introduction - stages The SDLC has seven key stages which help guide a project from its initial idea to a well supported running system. The specific activities within any stage are not prescriptive but they do highlight the major things that should be considered. Clearly projects using iterative methodologies will move up and down the stages with each iteration.

SDLC Reviews: • Large • Small • Esaf Introduction - reviews The SDLC defines a small number of review points. These ensure that projects remain under financial control, that their progress is reported in a consistent manner, that they conform to our technical strategy and that risk is managed. The reviews are categorised into business and technical audiences. The number of reviews a project will undertake depends on the SDLC route. A project manager may undertake further reviews at other points in the lifecycle if the project warrants them. There are three broad ways in which the reviews take place: • Delegated to the project for self-service • Mostly delegated to the project for self-service but with exception reporting for a few key aspects • Full review by a panel It is expected that the great majority (>90%) of projects will undertake self-service reviews. Sample review to ensure project review is adequate?? • Initial Authorisation • Outline Authorisation • Full Authorisation • Accept System • High level design • Detailed Design • Go/No Go Technical reviews replace existing PTB/PTO reviews

Monitoring Applications Infrastructure Monitor Business Resource Management & Governance ITIL Technology Grow the business Run the Business

Monitoring: Overall problem scope • Greater clarity of business impact will lead to improved processes and applications • The mapping of applications to flows is unique to ML • Today our monitoring is based on platform-specific tools • Data and transaction rates are continuing to increase, which in turn will drive event volume • We need an accurate catalogue and topology map of our platforms

Scope of business activities to be monitored

External pressures on banks • Volume, Latency, Reference Data • Market Data Rates and other major feeds continue to increase exponentially • Low latency, DMA and Algorithmic Trading are combining to cause significant feedback loops with subsequent volume spikes. • Latency metrics from Monitoring and Order Book systems are becoming as significant as the prices and volume quotes on them • System event rates are approaching those of major telcos

Marketplace Observations • Marketplace is weak • None of the leading enterprise platform vendors have been able to demonstrate large scale “dogfood” implementations of their own technology platforms in 2007 • All enterprise monitoring solutions seem to be grown by an acquisition “strategy” rather than core engineering effort. • Technical solutions exist to many Finance Sector problems but recent focus is on Market data and Trading Systems Package monitoring.

B*M Confusion • Business Activity Monitoring: Most vendors are pushing little more than Web 2.x widgets coupled to back-end SQL or data warehouse sources to draw nice pictures • Credit Crunch Money Saving Tip: Much of this “limited value” can be replaced by Excel 2007 services and SharePoint • Business Service Monitoring: Vendors are naively assuming that organizations can convert, or have converted, their entire enterprise to very limited N-tier (where 2<=N<=4) “SOA” architectures • Neither mechanism contributes significantly to the improvement of: Root Cause Problem Determination; Application and Enterprise Architecture Improvement; Business Process Improvement

Industry Landscape

Our Architecture Objectives • Define a unifying, extensible, technology-proof fabric to embrace existing and future monitoring tools • Provide a single, high-performance event space to support multiple application and infrastructure support roles and processes • Enable ML to focus on best-of-breed monitoring tools • Enable continuous, systematic process improvement supported by a consistent, extensible range of dashboards

Macro observations • There is no Silver Bullet • There are no significant reference architectures, widely recognized industry best practices or academic research at the enterprise level in Finance • There are no recognised enterprise monitoring solution consultancy practices in the Financial Services Sector

Some challenges: Dashboards and the “truth” • Many Dashboards – One Source of Data: There are many different operational roles and processes that all need to derive their actions from a single source of the truth • There are many different dashboards required to support these “Mission Specialists” • Roles, Processes and Organizations change independently of the data • The CMDB Bottleneck: Analysis of current CMDB offerings has determined that they will struggle to sustain the reverse lookup rates we will require to map device events to platforms and then to applications
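
The reverse lookup described above (device event to platform to application) can be illustrated with a precomputed in-memory index, so that each event costs one dictionary lookup rather than a CMDB query. This is a minimal sketch under stated assumptions: the inventory shape, host names and application names are all hypothetical, and a real deployment would need to keep the index in step with the CMDB.

```python
from collections import defaultdict

# Hypothetical forward inventory: application -> platform -> devices.
INVENTORY = {
    "order-router": {"linux-cluster-a": ["host01", "host02"]},
    "risk-engine": {"linux-cluster-a": ["host02"], "grid-b": ["host07"]},
}


def build_reverse_index(inventory):
    """Precompute device -> {(platform, application)} for constant-time lookups."""
    index = defaultdict(set)
    for app, platforms in inventory.items():
        for platform, devices in platforms.items():
            for device in devices:
                index[device].add((platform, app))
    return index


REVERSE = build_reverse_index(INVENTORY)


def applications_for(device: str) -> set[str]:
    """Applications affected by an event raised on the given device."""
    return {app for _platform, app in REVERSE.get(device, set())}


print(sorted(applications_for("host02")))  # ['order-router', 'risk-engine']
print(sorted(applications_for("host99")))  # [] (unknown device)
```

Building the index once and reading it many times trades memory and refresh complexity for lookup speed, which is the trade the slide's lookup-rate concern points towards.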

Models and Dashboards

Some Basic Design Tenets • We have to be able to fully understand the impact of technology solutions and issues on Business Flows and Processes so that they can be continuously optimized to maximum ROI • Data Content and Latency: Monitoring Data must contain sufficient detail to determine root cause of technical problems wherever possible. Monitoring Platforms must be able to provide detailed insight into our lowest-latency and highest-volume flows ahead of business demands and at sub-1ms granularity • Automation: Monitoring Events must be able to correctly trigger transactional automated break-fix processes and dynamic capacity fulfilment • Standardization: The monitoring platform data will factually direct the future technical strategy and standards

Some Detail: Normalized Physical Architecture • Our implementation approach is to codify existing server inventory and classify servers into 5 application tiers: Service/UI Façade; Distribution; General Purpose Compute; Database; Tx Gateway • Basic Application Dashboards: Will give a uniform view of each application by functional tier • Can then measure availability of each tier of each application in a consistent fashion • Identify and triage weak architectures
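
As an illustration of measuring availability per tier in a consistent fashion, the sketch below classifies health-check samples by the five tiers named on the slide and reports the weakest tier of an application. The sample format, the application name and the function names are assumptions made for the example; the deck does not describe how availability is actually sampled.

```python
from collections import defaultdict
from enum import Enum


class Tier(Enum):
    FACADE = "Service/UI Façade"
    DISTRIBUTION = "Distribution"
    COMPUTE = "General Purpose Compute"
    DATABASE = "Database"
    TX_GATEWAY = "Tx Gateway"


def tier_availability(samples):
    """samples: iterable of (application, Tier, is_up) health checks.

    Returns {(application, Tier): fraction of checks that were up}.
    """
    up = defaultdict(int)
    total = defaultdict(int)
    for app, tier, is_up in samples:
        total[(app, tier)] += 1
        up[(app, tier)] += bool(is_up)
    return {key: up[key] / total[key] for key in total}


def weakest_tier(availability, app):
    """Tier of `app` with the lowest measured availability (a triage starting point)."""
    tiers = {tier: value for (a, tier), value in availability.items() if a == app}
    return min(tiers, key=tiers.get)  # raises ValueError if the app has no samples


# Hypothetical samples for one application.
samples = (
    [("fx-pricing", Tier.COMPUTE, True)] * 4
    + [("fx-pricing", Tier.DATABASE, True)] * 3
    + [("fx-pricing", Tier.DATABASE, False)]
)
availability = tier_availability(samples)
print(weakest_tier(availability, "fx-pricing").value)  # Database
```

Because every application is reduced to the same five tiers, the resulting figures can be compared across applications, which is what makes triage of weak architectures possible.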

Event model, transport and correlation – Logical view • Need to deal with Event Rates approaching Market Data Rates (>20K Events per second Globally) • System events now driven by business (market events) • Be mindful of scale growth in events and trading • Need to decide on most appropriate blend of technology • Cannot buy rulesets off the shelf • May be able to use multivariate analysis to determine significant correlations to validate rulesets
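
A very simple stand-in for the multivariate analysis mentioned above is a pairwise Pearson correlation over metric time series, flagging strongly correlated pairs as candidates for validating a ruleset. This is a sketch only: the metric names, the sample values and the 0.9 threshold are assumptions, and a real analysis would use a proper multivariate method over far larger windows.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def significant_pairs(series, threshold=0.9):
    """Return (metric_a, metric_b, r) for pairs with |r| at or above the threshold."""
    hits = []
    for a, b in combinations(sorted(series), 2):
        r = pearson(series[a], series[b])
        if abs(r) >= threshold:
            hits.append((a, b, round(r, 3)))
    return hits


# Hypothetical per-interval samples.
series = {
    "market_data_rate": [10, 20, 30, 40, 50],
    "queue_depth": [1, 2, 3, 4, 5.5],
    "cpu_temp": [5, 3, 6, 2, 4],
}
for a, b, r in significant_pairs(series):
    print(f"{a} ~ {b}: r={r}")
```

In this toy data only the market data rate and queue depth move together, which is the kind of relationship a hand-written correlation rule could then be checked against.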

Further precedence and correlation • NB Correlation occurs at multiple layers: Components within a physical or virtual server; Architectural tiers of an application; Components of a Business Process; Components of an end-to-end Flow of connected processes • We correlate both technology events and business KPI metrics • Multivariate analysis can be applied to these data flows to heuristically surface the key correlations

Event model namespace

Monitoring event precedence model • There is a clear precedence to the classes of monitoring event • Most monitoring tools are clustered at the software and session monitoring end of the spectrum • A physical error on a device or platform will produce a cascade of events for all the software modules that run on it and all the platforms that are connected to it • The precedence hierarchy is used by the correlation engines distributed across the monitoring infrastructure to identify the most significant “root cause” event • The projected data rates mean that a significant amount of compute resource will be needed to perform the event correlation actions
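
The role of the precedence hierarchy can be sketched as follows: given a burst of related events, choose the one whose class ranks highest as the likely root cause. The four event classes and their ordering are assumptions inferred from the text (physical errors cascade into software and session events); the deck's actual precedence model is in a diagram that is not reproduced here, and the time window is likewise hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class EventClass(IntEnum):
    # Hypothetical ordering: lower value = higher precedence.
    PHYSICAL = 0
    PLATFORM = 1
    SOFTWARE = 2
    SESSION = 3


@dataclass(frozen=True)
class Event:
    source: str
    event_class: EventClass
    timestamp: float  # seconds


def root_cause(events, window=5.0):
    """Pick the highest-precedence event within `window` seconds of the first event.

    Ties on precedence go to the earliest event. Returns None for an empty burst.
    """
    if not events:
        return None
    start = min(e.timestamp for e in events)
    burst = [e for e in events if e.timestamp - start <= window]
    return min(burst, key=lambda e: (e.event_class, e.timestamp))


# Hypothetical cascade: session and software alarms arrive before the
# physical fault that caused them is reported.
events = [
    Event("fix-session-12", EventClass.SESSION, 0.2),
    Event("pricing-service", EventClass.SOFTWARE, 0.1),
    Event("disk-array-3", EventClass.PHYSICAL, 0.3),
]
print(root_cause(events).source)  # disk-array-3
```

Note that arrival order is not trusted: the physical fault is selected even though it is reported last, which is the point of ranking by class rather than by time.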

Overall architecture layers + stores view

Solution heat map • In order to define the solution architecture, a number of PoCs were carried out to provide point solutions and assess technology capabilities • At this point we have established coverage of all the necessary technical components; however, we will need to carry out a broader end-to-end integration PoC

Summary • Our Monitoring architecture surfaces the structural performance of an application in the context of the governing business process • Our SDLC surfaces the structural components of an application from its business requirements • The combination of performance and development metrics will allow us to optimize the resources we need to satisfy our business demand Applications Infrastructure Application Portfolio Management Monitor Business Resource Management & Governance SDLC / QA ITIL Technology Service Management Grow the business Run the Business

The need to know Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns - the ones we don't know we don't know Donald Rumsfeld
