Fault Tolerant and Resilient Web Services • By Terry B. Bobbie, Systems Engineer, Raytheon ITSS • Bobbie@usgs.gov • April 18, 2002 • Raytheon Contractor for the USGS at the EROS Data Center
Goals of this briefing • Examine “Fault Tolerant and Resilient” • Introduce an approach to mapping your requirements to service offerings • Foster “out of the cube” thinking • Learn from open discussions • Gather some feedback and have some fun
So why all the hubbub? • Your data and service needs may now carry an importance not recognized before • Importance of hazard and emergency response information • Protection of life and property • Business continuity • Matured technology and services • Improved, more reliable services
Have our requirements changed? (suppliers and consumers) • Private Industry, Academia, and Government • Needs and requirements are diverse, unique, and varied, and may not lend themselves to “stove-piped” solutions • Diverse community of users • Unique uses, data, and user elements (i.e., sophistication, access requirements, delivery requirements) • The “we’ve always done it this way” approach may no longer be valid
Fault Tolerant and Resilient • What is “Fault Tolerant and Resilient”? • No Fault Web … where packets collide with each other (injury) on the information superhighway without individual ownership of responsibility (no individual packet liability) • Resilient Web … where injured packets repair themselves to “good as new” while “on-the-fly” (and sue the switches for pain, suffering and BIG $$$)
How about a design based on replication of … < ? > • [Diagram: the same name, aa, mapped to three replicated hosts, each labeled aa->xxhost.domain]
Example technologies used: 3 Servers, One Interface • 3 Sun quad 450s • Replicated file systems (Andrew File System) • DNS configurations • Cisco Distributed Director will provide uninterrupted access to mirrored information • Load balancing between available national modules • Only available modules remain in the pick list
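The load-balancing behavior listed above (traffic spread across the mirrors, with only available modules kept in the pick list) can be illustrated with a minimal Python sketch. It does not reproduce Cisco Distributed Director or its configuration; the `PickList` class, the `is_up` health-check callback, and the host names are all hypothetical.

```python
import itertools
from typing import Callable, Iterable


class PickList:
    """Round-robin selection over mirrored hosts that skips unhealthy ones.

    Hypothetical sketch: `is_up` stands in for whatever health check the
    real front end performs (e.g. periodically probing each mirror).
    """

    def __init__(self, hosts: Iterable[str], is_up: Callable[[str], bool]):
        self.hosts = list(hosts)
        self.is_up = is_up
        self._cycle = itertools.cycle(self.hosts)

    def available(self) -> list[str]:
        """Hosts that currently pass the health check (the live pick list)."""
        return [host for host in self.hosts if self.is_up(host)]

    def next_host(self) -> str:
        """Return the next available host in round-robin order."""
        # Try each mirror at most once per call before giving up.
        for _ in range(len(self.hosts)):
            host = next(self._cycle)
            if self.is_up(host):
                return host
        raise RuntimeError("no mirrored host is currently available")


if __name__ == "__main__":
    down = {"mirror-b.example.org"}  # pretend this mirror has failed
    picker = PickList(
        ["mirror-a.example.org", "mirror-b.example.org", "mirror-c.example.org"],
        is_up=lambda host: host not in down,
    )
    print(picker.available())
    print([picker.next_host() for _ in range(4)])
```

With mirror-b marked down, requests alternate between mirror-a and mirror-c; once mirror-b passes its health check again it rejoins the rotation, and clients never see a change in the name they use.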
Benefits of Fault Tolerance and Resiliency • Improved reliability: geographically distribute public access to content • Improved customer service: serve the public from high-bandwidth sites and reclaim bandwidth for data transfer • Improved management of content: • Allow for distributed content management where appropriate while consolidating physical location
Benefits of Fault Tolerance and Resiliency • Improved security: • Authentication and firewalls • Sophisticated file access • Kerberos authenticated editing of web pages from any system with an AFS client: desktop, laptop or server. At the office, home or away! • Reduced System Administration Requirements • Near 100% reliability for data and information • Protects against • network failures • server failure • natural disaster
An approach to analyzing service opportunities • Phased approach • 1st phase is discovery, understanding, and translation of Web Service Requirements • 2nd phase is discovery, understanding, and translation of vendor market opportunities • 3rd phase is crosswalking (mapping) requirements to vendor services available
Phase 1 - Analyzing requirements • Characterize Web hosting requirements • Examples include • Real-Time gathering and reporting • WWW pages • Images • Flat files • Databases • Each may differ in its characteristics relative to • Data • Manipulation • Access
Phase 1 - Web hosting requirements - Real-Time • Real-Time • An event or series of events that, by its nature and mission characteristics, requires periodic data collection and subsequent timely delivery. In some cases this could be described as “on-demand”: a master process runs to collect changing data, and a corresponding slave process is made available for query and delivery, returning a data element or series of elements collected at a specific moment in real time and delivered quickly and efficiently. Should another request be made with all parameters equal, one could expect the delivered content to be different. An example is sampling a digital clock. Each second, the new time is passed to a query and delivery staging area. This area is made available for query and, when queried, delivers its content(s) in real time (no delay). Each new sample may overwrite the previous one or may be concatenated to construct a series. Repeated queries can then request anything from the single current entry to the full series (current to oldest) or any subset in between.
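The clock example above can be sketched in Python as a staging area that a collecting ("master") process fills and a query ("slave") side reads. This is a minimal illustration under stated assumptions: the `StagingArea` class and its method names are hypothetical, and a bounded `collections.deque` is one possible way to get both the "overwrite" (history of 1) and "concatenate into a series" behaviors.

```python
import time
from collections import deque
from typing import Any, Optional


class StagingArea:
    """Query-and-delivery staging area filled by a collecting ("master") process.

    history=1 means each new sample overwrites the previous one; a larger
    history concatenates samples into a bounded series (oldest dropped first).
    """

    def __init__(self, history: int = 1):
        self._samples: deque = deque(maxlen=history)

    def publish(self, timestamp: float, value: Any) -> None:
        """Called by the collector each time it samples the source."""
        self._samples.append((timestamp, value))

    def latest(self) -> Optional[tuple]:
        """The single current entry, or None if nothing has been collected yet."""
        return self._samples[-1] if self._samples else None

    def query(self, start: Optional[float] = None, end: Optional[float] = None) -> list:
        """Retained samples with start <= timestamp <= end, newest first.

        With no bounds this returns the whole series, from current to oldest.
        """
        return [
            (t, v)
            for t, v in reversed(self._samples)
            if (start is None or t >= start) and (end is None or t <= end)
        ]


if __name__ == "__main__":
    # Sample a clock once per second, keeping at most the last 60 readings.
    area = StagingArea(history=60)
    for _ in range(3):
        now = time.time()
        area.publish(now, time.strftime("%H:%M:%S", time.localtime(now)))
        time.sleep(1)
    print(area.latest())  # current time only
    print(area.query())   # series, current to oldest
```

Because each sample is timestamped, two queries with identical parameters made at different moments return different content, which is the defining trait of the real-time requirement.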
Phase 1 - Web hosting requirements – WWW pages • WWW pages • Content delivered within WWW pages may include text-based information, documents, or graphics that are vital to information and research, but it can generally be described as static: each user request or “hit” returns the same front page information. • Front pages of WWW servers that act as “directories or portals” of information may be static, requiring updates only as often as the listing changes. Such a page (a listing of directories) remains static until a directory is added or deleted. The content the directory points to, however, may not be hosted by the same source as the directory and therefore may not be static.
Phase 1 - Web hosting requirements - Images • Images • Images can be large or small, compressed or uncompressed, and of different formats. Many are JPEG (or another common format) and are used for logos, hosted pictures, graphic representations, etc. Some images are pre-generated (like a logo), while others are generated dynamically from user input. • Static images are generated once and rarely change (most often attached to WWW pages as static graphics delivered on each request or hit) • Dynamic images are created on request, with user input defining the criteria for graphic generation (e.g., geospatial data rendering)
Phase 1 - Web hosting requirements - Databases • Databases • Many of today’s WWW pages contain user-selectable parameters that may change and differ by user subject matter or interest. Custom user input may span a broad, open-ended (effectively unlimited) set of input parameters, much like a keyword query. A good example is where results are returned based on input parameters selected, chosen, or otherwise obtained from a very large number of choices. WWW search engines are designed and built on the assumption that user input may not be entirely predictable (i.e., the keyword could be any word, or combination of words, in the English language of over one million words). • One may counter that even “infinite” input has a known set of boundaries (i.e., everything has an end limit or boundary). In the context of this definition, “infinite” should be taken to mean a very large order of magnitude. • USGS has many examples of this requirement today. One example uses user-selectable boundaries as input criteria to deliver geospatial data: the same database, populated with a known set of data files, is queried with different input parameters and combinations, and different geospatial information is delivered for each unique query.
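The geospatial example above (a fixed set of data files queried with user-selected boundaries) can be sketched as a bounding-box overlap test. This is a minimal Python illustration, not the USGS implementation: the `BBox` class, `select_files` function, file names, and footprints are hypothetical, and boxes crossing the 180° meridian are not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Bounding box in decimal degrees; assumes west <= east (no antimeridian crossing)."""
    west: float
    south: float
    east: float
    north: float

    def intersects(self, other: "BBox") -> bool:
        # Two boxes overlap unless one lies entirely beyond an edge of the other.
        return not (
            self.east < other.west
            or other.east < self.west
            or self.north < other.south
            or other.north < self.south
        )


def select_files(catalog: dict[str, BBox], user_box: BBox) -> list[str]:
    """Names of catalogued data files whose footprint overlaps the user's box."""
    return sorted(name for name, footprint in catalog.items() if footprint.intersects(user_box))


if __name__ == "__main__":
    # Hypothetical catalog: a fixed set of data files and their footprints.
    catalog = {
        "tile_sd.dat": BBox(-104.1, 42.5, -96.4, 45.9),
        "tile_co.dat": BBox(-109.1, 37.0, -102.0, 41.0),
        "tile_fl.dat": BBox(-87.6, 24.5, -80.0, 31.0),
    }
    print(select_files(catalog, BBox(-97.0, 43.0, -96.0, 44.0)))   # small user box
    print(select_files(catalog, BBox(-110.0, 20.0, -80.0, 50.0)))  # large user box
```

The catalog never changes between the two calls; only the user's boundaries do, yet each query delivers a different set of files.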
Phase 1 - Web hosting data characteristics • Data Characteristics • Frequency of update requirements • How often the data requires updating, modification, or deletion (e.g., hourly, weekly, monthly, dynamic) • Volume of data • Quantity as it relates to storage requirements • Geographic Scope and Context • Data may be relevant to global, national, regional, or local needs and may require service from multiple locations
Phase 1 - Web hosting manipulation requirements • Manipulation Characteristics • None (text-like) • Data is served as a flat file without manipulation • On-the-fly graphics generation • Graphics are generated or rendered before presentation to a user • Database query • A lookup is executed based on user input parameters • Other special processing (Java-based, map object rendering, etc.)
Phase 1 - Web hosting access requirements • Access characteristics • Frequency of use (hits, files served, etc.) • How often requests are serviced in a given period • Fault tolerance limit (Low, Medium, High) • Importance of availability (L, M, H) • Volume of units served per period • 150 WWW pages (25KB ea.) delivered hourly • 250 images (500MB ea.) delivered per 24 hr day • 500 database queries & responses per 8 hr business day • 350 “GIF-on-the-fly” deliveries per 24 hr day • Expected delivery time per request
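The access characteristics above can be captured as a small record per content type, from which the sustained transfer rate follows directly. A minimal Python sketch, assuming decimal units (1 KB = 1000 bytes, the convention the case-study arithmetic uses); the `AccessProfile` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Decimal units, matching the case-study arithmetic (9000KB/hour = 2500 bytes/sec).
KB = 1_000
MB = 1_000_000
HOUR = 3_600
DAY = 86_400


@dataclass
class AccessProfile:
    """Access characteristics for one type of hosted content (hypothetical field names)."""
    name: str
    units_per_period: int                     # pages, images, queries, ... served
    unit_size_bytes: float                    # average size of one delivered unit
    period_seconds: float                     # length of the counting period
    delivery_seconds: Optional[float] = None  # expected delivery time per request
    fault_tolerance: Optional[str] = None     # "L", "M" or "H"
    availability: Optional[str] = None        # importance of availability: "L", "M" or "H"

    def sustained_rate(self) -> float:
        """Average bytes/sec needed to serve the stated volume over the period."""
        return self.units_per_period * self.unit_size_bytes / self.period_seconds

    def per_request_rate(self) -> float:
        """Bytes/sec one delivery needs to finish within the expected delivery time."""
        if self.delivery_seconds is None:
            raise ValueError(f"{self.name}: no expected delivery time given")
        return self.unit_size_bytes / self.delivery_seconds


if __name__ == "__main__":
    profiles = [
        AccessProfile("WWW pages", 150, 25 * KB, HOUR),
        AccessProfile("Images", 250, 500 * MB, DAY),
    ]
    for p in profiles:
        print(f"{p.name}: {p.sustained_rate():,.0f} bytes/sec sustained")
```

For the two examples with stated sizes, the WWW pages need only about 1 KB/sec sustained while the images need about 1.45 MB/sec, a gap of more than three orders of magnitude. The database and GIF-on-the-fly examples give no unit size, so their rates cannot be computed from the listed figures alone.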
An approach to analyzing service opportunities • Phased approach – Phase 2 • 1st phase is discovery, understanding, and translation of Web Service Requirements • 2nd phase is discovery, understanding, and translation of vendor market opportunities • 3rd phase is crosswalking (mapping) requirements to vendor services available
Phase 2 - Gain an understanding of services available • Web Services opportunities • Vendor supplied • Host site supplied • Combinations of any or all • Others?
Phase 2 - Characteristics of Service Opportunities • Key Characteristic Descriptions • Data • Local storage capability / capacity • Responsiveness to periodic or cyclic changes in source data (e.g., new WWW page or content, added/deleted/changed image files, database content and architecture, real-time data gathering) • Change Management Strategy and Plans (out-of-service maintenance, scheduled maintenance, access permissions, content change, software and platform changes, etc.) • Geographic context (local, regional, national, global)
Phase 2 - Characteristics of Service Opportunities • Key Characteristic Descriptions • Manipulation • Local processing capability/capacity • Scalability of end-to-end response to events (e.g., excess capacity or headroom in networks, CPU, memory, I/O interfaces, storage, other surge capability, etc.)
Phase 2 - Characteristics of Service Opportunities • Key Characteristic Descriptions • Access • Bandwidth capability/capacity • Service redundancy (networks, platforms, other infrastructure) • Responsiveness (response time) to requests for serving data to end users • Geographic context (locations are local, regional, national, global) • Data delivery guarantee
Phase 2 - Characteristics of Service Opportunities • Key Characteristic Descriptions • Misc. (may apply to any or all of the categories) • Uptime guarantee • Security Management Strategy and Plans (system level, content, customer identity, etc.) • Prioritized Users (i.e., can the vendor apply a priority scheme to users based on volume, frequency, emergency response, etc.?) • Operations and Service Level Agreements (backup strategies, 24x7 system monitoring, trouble analysis and resolution, network management, technical support to end users and customer, contingency plans, etc.)
An approach to analyzing service opportunities • Phased approach – Phase 3 • 1st phase is discovery, understanding, and translation of Web Service Requirements • 2nd phase is discovery, understanding, and translation of vendor market opportunities • 3rd phase is crosswalking (mapping) requirements to vendor services available
Phase 3 - a crosswalk analysis of requirements and services • Case Study • Requirement # 1 – WWW pages • Data Requirements • Frequency of update (monthly) • Volume is 500 MB (stored pages, graphics, work area) • Manipulation • None (text based page with small static graphics)
Phase 3 - a crosswalk analysis of requirements and services • Case Study – Requirement # 1 - Cont’d • Access • 180 pages served per hour (history = 3 per min) • Fault tolerance is high (outages are acceptable) • Importance of availability is low (not required to safeguard human life and property) • Volume is 180 x 50KB = 9000KB (9MB) per hour, or 150KB/min, or 2500 bytes/sec (sustained rate) • An expected delivery time of 5 sec per 50KB page implies a delivery rate requirement of 10KB/sec
Phase 3 - a crosswalk analysis of requirements and services • Case Study • Requirement # 2 – Image file generation • Data Requirements • Frequency of update (hourly updates required) • Volume is 300 TB (image graphics) • Manipulation • High (GIF-on-the-fly graphics generation)
Phase 3 - a crosswalk analysis of requirements and services • Case Study – Requirement # 2 - Cont’d • Access • 10 files served per hour (history) • Fault tolerance is low (very few outages acceptable) • Importance of availability is medium (some requirement to safeguard human life and property) • Volume is 10 x 500MB = 5000MB per hour, or about 1.389MB/sec (sustained rate) • An expected delivery time of 1 hr per 500MB file implies a delivery rate requirement of about 0.139MB/sec per file
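The rate arithmetic in both case studies follows two simple relations. As a worked summary, with N units of average size S delivered per period T, and an expected delivery time t_d per unit (decimal units, 1 KB = 1000 bytes, as in the case studies):

```latex
\begin{aligned}
R_{\text{sustained}} &= \frac{N\,S}{T}, \qquad r_{\text{delivery}} = \frac{S}{t_d} \\[4pt]
\text{Req.\ 1:}\quad R_{\text{sustained}} &= \frac{180 \times 50\ \text{KB}}{3600\ \text{s}} = 2.5\ \text{KB/s}, \qquad
r_{\text{delivery}} = \frac{50\ \text{KB}}{5\ \text{s}} = 10\ \text{KB/s} \\[4pt]
\text{Req.\ 2:}\quad R_{\text{sustained}} &= \frac{10 \times 500\ \text{MB}}{3600\ \text{s}} \approx 1.389\ \text{MB/s}, \qquad
r_{\text{delivery}} = \frac{500\ \text{MB}}{3600\ \text{s}} \approx 0.139\ \text{MB/s}
\end{aligned}
```

For Requirement #2 the two figures are linked: by Little's law, 10 files arriving per hour, each taking about 1 hour to deliver, means roughly 10 deliveries in progress at once, and 10 × 0.139 MB/s ≈ 1.389 MB/s. For Requirement #1 the per-request figure (10 KB/s) is the more demanding of the two.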
The Magic Algorithm • Matrix Score • Overall Score = Pass
The Magic Algorithm • Matrix Score • Overall Score = Fail
Crosswalk Matrix • In this case, only Vendor B meets all requirements
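One plausible way to compute the Pass/Fail scores in the crosswalk matrix is sketched below in Python. It assumes each matrix cell is a pass/fail comparison of one vendor capability against one requirement, and that the overall score is Pass only when every cell passes. The class and function names, the uptime thresholds, and all vendor figures are hypothetical, chosen so that, as in the case study, only Vendor B passes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    name: str
    storage_bytes: float       # data volume that must be hosted
    sustained_rate_bps: float  # average bytes/sec over the period
    needs_on_the_fly: bool     # dynamic graphics generation required?
    min_uptime: float          # hypothetical threshold, e.g. 0.999


@dataclass(frozen=True)
class VendorOffer:
    name: str
    storage_bytes: float
    bandwidth_bps: float
    supports_on_the_fly: bool
    uptime_guarantee: float


def score(req: Requirement, offer: VendorOffer) -> dict[str, bool]:
    """One matrix row: pass/fail per characteristic for a requirement/vendor pair."""
    return {
        "storage": offer.storage_bytes >= req.storage_bytes,
        "bandwidth": offer.bandwidth_bps >= req.sustained_rate_bps,
        "manipulation": offer.supports_on_the_fly or not req.needs_on_the_fly,
        "uptime": offer.uptime_guarantee >= req.min_uptime,
    }


def crosswalk(reqs: list[Requirement], offers: list[VendorOffer]) -> dict[str, bool]:
    """Overall score per vendor: Pass (True) only if every cell for every requirement passes."""
    return {o.name: all(all(score(r, o).values()) for r in reqs) for o in offers}


if __name__ == "__main__":
    reqs = [
        # Storage and rate figures from the case studies; uptime thresholds are hypothetical.
        Requirement("WWW pages", 500e6, 180 * 50e3 / 3600, False, 0.95),
        Requirement("Image generation", 300e12, 10 * 500e6 / 3600, True, 0.999),
    ]
    offers = [  # entirely hypothetical vendor figures
        VendorOffer("Vendor A", 1e12, 10e6, True, 0.9999),
        VendorOffer("Vendor B", 500e12, 10e6, True, 0.999),
        VendorOffer("Vendor C", 1e15, 1e6, False, 0.99),
    ]
    for vendor, passed in crosswalk(reqs, offers).items():
        print(vendor, "Pass" if passed else "Fail")
```

An all-cells-must-pass rule is the strictest reading of the matrix; a weighted score would be a natural alternative where some shortfalls are tolerable and cost is weighed against them.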
Summary • Using facts and data, characterize your requirements • Analyze vendor service offerings and opportunities • Map requirements to vendor services • Perform cost analysis • Explore other options (½ full or ½ empty) • Expect that not all needs can be fully met by vendors • Analyze the cost-benefit tradeoffs