Societal Scale Information Systems. Jim Demmel, Chief Scientist, EECS and Math Depts. www.citris.berkeley.edu


Presentation Transcript


  1. UC Santa Cruz. Societal Scale Information Systems. Jim Demmel, Chief Scientist, EECS and Math Depts. www.citris.berkeley.edu

  2. Outline • Scope • Problems to be solved • Overview of projects

  3. Societal-Scale Information System (SIS) [diagram: a massive cluster of clusters on Gigabit Ethernet forms the “server” side, delivering scalable, reliable, secure services to the “client” side of information appliances and MEMS sensors]

  4. Desirable SIS Features - Problems to solve • Integrates diverse components seamlessly • Easy to build new services from existing ones • Adapts to interfaces/users • Non-stop, always connected • Secure

  5. Projects • Sahara • Service Architecture for Heterogeneous Access, Resources, and Applications • Katz, Joseph, Stoica • OceanStore and Tapestry • Kubiatowicz, Joseph • ROC and iStore • Recovery-Oriented Computing and Intelligent Store • Patterson, Yelick, Fox (Stanford) • Millennium and PlanetLab • Culler, Kubiatowicz, Stoica, Shenker • Many faculty away on retreats • www.cs.berkeley.edu/~bmiller/saharaRetreat.htm • www.cs.berkeley.edu/~bmiller/ROCretreat.htm

  6. The “Sahara” Project • Service • Architecture for • Heterogeneous • Access, • Resources, and • Applications www.cs.berkeley.edu/~bmiller/saharaRetreat.html

  7. Scenario: Service Composition [diagram: the JAL restaurant guide service composed from an NTT DoCoMo UI, a Babelfish translator, and the Zagat guide, serving users in Tokyo and Salt Lake City across the NTT DoCoMo and Sprint networks]

  8. Sahara Research Focus • New mechanisms, techniques for end-to-end services w/ desirable, predictable, enforceable properties spanning potentially distrusting service providers • Tech architecture for service composition & inter-operation across separate admin domains, supporting peering & brokering, and diverse business, value-exchange, access-control models • Functional elements • Service discovery • Service-level agreements • Service composition under constraints • Redirection to a service instance • Performance measurement infrastructure • Constraints based on performance, access control, accounting/billing/settlements • Service modeling and verification

  9. Technical Challenges • Trust management and behavior verification • Meet promised functionality, performance, availability • Adapting to network dynamics • Actively respond to shifting server-side workloads and network congestion, based on pervasive monitoring & measurement • Awareness of network topology to drive service selection • Adapting to user dynamics • Resource allocation responsive to client-side workload variations • Resource provisioning and management • Service allocation and service placement • Interoperability across multiple service providers • Interworking across similar services deployed by different providers

  10. Service Composition Models • Cooperative • Individual component service providers interact in distributed fashion, with distributed responsibility, to provide an end-to-end composed service • Brokered • Single provider, the Broker, uses functionalities provided by underlying service providers, encapsulates these to compose an end-to-end service • Examples • Cooperative: roaming among separate mobile networks • Brokered: JAL restaurant guide
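
To make the brokered model concrete, here is a minimal Python sketch, not Sahara code: the GuideService, TranslatorService, and Broker names are invented for illustration. A single broker encapsulates two underlying providers into one end-to-end restaurant-guide service, echoing the JAL scenario.

    # Minimal sketch of the brokered composition model (illustrative names only).
    class GuideService:
        def lookup(self, city):
            # Stand-in backend: return restaurant listings for a city.
            return [f"{city} restaurant {i}" for i in range(1, 3)]

    class TranslatorService:
        def translate(self, text, target_lang):
            # Stand-in translation; a real provider would call an external service.
            return f"[{target_lang}] {text}"

    class Broker:
        """Encapsulates the component services behind one end-to-end interface."""
        def __init__(self, guide, translator):
            self.guide = guide
            self.translator = translator

        def restaurant_guide(self, city, target_lang):
            listings = self.guide.lookup(city)
            return [self.translator.translate(entry, target_lang) for entry in listings]

    broker = Broker(GuideService(), TranslatorService())
    print(broker.restaurant_guide("Tokyo", "en"))

In the cooperative model there would be no single Broker object; the component providers would coordinate among themselves to deliver the composed service.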

  11. Mechanisms for Service Composition (1) • Measurement-based Adaptation • Examples • Host distance monitoring and estimation service • Universal In-box: exchange network and server load • Content Distribution Networks: redirect client to closest service instance
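
A rough sketch of the redirection idea, assuming latency is the only metric: probe each candidate instance a few times and send the client to the one with the lowest average measured latency. The instance names and the random probe values are placeholders, not part of any real CDN.

    import random

    def probe_latency(instance):
        # Stand-in for an RTT measurement to the instance (e.g. a small probe request).
        return random.uniform(10, 200)  # milliseconds

    def choose_instance(instances, samples=3):
        # Average a few probes per instance and redirect to the closest one.
        def avg_latency(inst):
            return sum(probe_latency(inst) for _ in range(samples)) / samples
        return min(instances, key=avg_latency)

    instances = ["cdn-tokyo.example.net", "cdn-sf.example.net", "cdn-nyc.example.net"]
    print("redirecting client to", choose_instance(instances))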

  12. Mechanisms for Service Composition (2) • Utility-based Resource Allocation Mechanisms • Examples • Auctions to dynamically allocate resources; applied for spectrum/bandwidth resource assignments • Congestion pricing (same idea for power) • Voice port allocation to user-initiated calls in H.323 gateway/Voice over IP service management • Wireless LAN bandwidth allocation and management • H.323 gateway selection, redirection, and load balancing for Voice over IP services
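
As an illustration of utility-based allocation, the toy sealed-bid, second-price auction below allocates a single unit of bandwidth to the bidder declaring the highest utility. The bidders and bid values are made up, and real spectrum or VoIP-port auctions are considerably richer.

    def second_price_auction(bids):
        """bids: dict of bidder -> declared utility. Winner pays the second-highest bid."""
        if not bids:
            return None, 0.0
        ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
        winner, _ = ranked[0]
        price = ranked[1][1] if len(ranked) > 1 else ranked[0][1]
        return winner, price

    bids = {"voip-gateway-A": 12.0, "wlan-user-B": 9.5, "cdn-node-C": 7.0}
    winner, price = second_price_auction(bids)
    print(f"{winner} wins the bandwidth slot and pays {price}")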

  13. Mechanisms for Service Composition (3) • Trust Mgmt/Verification of Service & Usage • Authentication, Authorization, Accounting Services • Credential transformations to enable cross-domain service invocation • Federated admin domains with credential transformation rules based on established agreements • AAA server makes authorization decisions • Service Level Agreement Verification • Verification and usage monitoring to ensure properties specified in SLA are being honored • Border routers monitoring control traffic from different providers to detect malicious route advertisements
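
The SLA-verification element can be pictured as a monitor comparing measured usage against promised properties. The sketch below is an assumption-laden illustration (the thresholds and sample data are invented), checking a latency bound and an availability floor.

    sla = {"max_latency_ms": 100, "min_availability": 0.999}

    def check_sla(latency_samples_ms, successful, total):
        violations = []
        # 95th-percentile latency from the collected samples.
        p95 = sorted(latency_samples_ms)[int(0.95 * len(latency_samples_ms))]
        if p95 > sla["max_latency_ms"]:
            violations.append(f"95th-percentile latency {p95} ms exceeds {sla['max_latency_ms']} ms")
        availability = successful / total
        if availability < sla["min_availability"]:
            violations.append(f"availability {availability:.4f} below {sla['min_availability']}")
        return violations

    print(check_sla([40, 55, 60, 80, 150], successful=9990, total=10000))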

  14. OceanStore: Global-Scale Persistent Storage

  15. OceanStore Context: Ubiquitous Computing • Computing everywhere: • Desktop, Laptop, Palmtop • Cars, Cellphones • Shoes? Clothing? Walls? • Connectivity everywhere: • Rapid growth of bandwidth in the interior of the net • Broadband to the home and office • Wireless technologies such as CDMA, satellite, laser

  16. Questions about information: • Where is persistent information stored? • Want: Geographic independence for availability, durability, and freedom to adapt to circumstances • How is it protected? • Want: Encryption for privacy, signatures for authenticity, and Byzantine commitment for integrity • Can we make it indestructible? • Want: Redundancy with continuous repair and redistribution for long-term durability • Is it hard to manage? • Want: automatic optimization, diagnosis and repair • Who owns the aggregate resources? • Want: Utility Infrastructure!

  17. Utility-based Infrastructure [diagram: a federated OceanStore spanning providers such as Sprint, AT&T, IBM, Pac Bell, and a Canadian OceanStore] • Transparent data service provided by a federation of companies: • Monthly fee paid to one service provider • Companies buy and sell capacity from each other

  18. OceanStore Assumptions • Untrusted Infrastructure: • The OceanStore is comprised of untrusted components • Only ciphertext within the infrastructure • Responsible Party: • Some organization (i.e. service provider) guarantees that your data is consistent and durable • Not trusted with content of data, merely its integrity • Mostly Well-Connected: • Data producers and consumers are connected to a high-bandwidth network most of the time • Exploit multicast for quicker consistency when possible • Promiscuous Caching: • Data may be cached anywhere, anytime • Optimistic Concurrency via Conflict Resolution: • Avoid locking in the wide area • Applications use object-based interface for updates

  19. First Implementation [Java]: • Event-driven state-machine model • Included Components • Initial floating replica design • Conflict resolution and Byzantine agreement • Routing facility (Tapestry) • Bloom Filter location algorithm • Plaxton-based locate and route data structures • Introspective gathering of tacit info and adaptation • Language for introspective handler construction • Clustering, prefetching, adaptation of network routing • Initial archival facilities • Interleaved Reed-Solomon codes for fragmentation • Methods for signing and validating fragments • Target Applications • Unix file-system interface under Linux (“legacy apps”) • Email application, proxy for web caches, streaming multimedia applications
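
Since the prototype above lists a Bloom-filter location algorithm among its components, here is a generic Bloom filter sketch to show the idea of probabilistic membership tests used when locating objects. It is not the OceanStore/Tapestry implementation, and the sizes are arbitrary.

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=1024, num_hashes=4):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = 0  # bit vector stored as a Python int

        def _positions(self, item):
            # Derive several bit positions from independent hashes of the item.
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.num_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits |= (1 << pos)

        def might_contain(self, item):
            # False positives are possible; false negatives are not.
            return all(self.bits & (1 << pos) for pos in self._positions(item))

    f = BloomFilter()
    f.add("object-guid-1234")
    print(f.might_contain("object-guid-1234"), f.might_contain("object-guid-9999"))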

  20. OceanStore Conclusions • OceanStore: everyone’s data, one big utility • Global Utility model for persistent data storage • OceanStore assumptions: • Untrusted infrastructure with a responsible party • Mostly connected with conflict resolution • Continuous on-line optimization • OceanStore properties: • Provides security, privacy, and integrity • Provides extreme durability • Lower maintenance cost through redundancy, continuous adaptation, self-diagnosis and repair • Large scale system has good statistical properties

  21. OceanStore Prototype: running with 5 other sites worldwide

  22. Recovery-Oriented Computing Philosophy • People/HW/SW failures are facts to cope with, not problems to solve (“Peres’s Law”) • Improving recovery/repair improves availability • UnAvailability = MTTR / MTTF (assuming MTTR << MTTF) • 1/10th MTTR just as valuable as 10X MTBF • Recovery/repair is how we cope with above facts • Since a major Sys Admin job is recovery after failure, ROC also helps with maintenance/TCO, and Total Cost of Ownership is 5-10X HW/SW • www.cs.berkeley.edu/~bmiller/ROCretreat.htm
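
A quick worked example of the UnAvailability = MTTR / MTTF rule above, with made-up hour values, just to show that cutting MTTR to a tenth buys the same unavailability as a tenfold MTTF/MTBF improvement:

    def unavailability(mttr_hours, mttf_hours):
        return mttr_hours / mttf_hours  # valid approximation when MTTR << MTTF

    print(unavailability(1.0, 1000.0))    # baseline: 0.001 (about 8.8 hours of downtime per year)
    print(unavailability(0.1, 1000.0))    # 10x faster repair: 0.0001
    print(unavailability(1.0, 10000.0))   # 10x longer time to failure: 0.0001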

  23. ROC approach • Collect data to see why services fail • Operators cause > 50% of failures • Create benchmarks to measure dependability • Benchmarks inspire and enable researchers, and name names to spur commercial improvements • Margin of Safety (from Civil Engineering) • Overprovision to handle the unexpected • Create and evaluate techniques to help • Undo for system administrators in the field • Partitioning to isolate errors, upgrade in the field • Fault insertion to test emergency systems in the field

  24. Availability benchmarking 101 • Availability benchmarks quantify system behavior under failures, maintenance, recovery • They require • A realistic (fault) workload for the system • Fault injection to simulate failures • Human operators to perform repairs • New winner is fastest to recover, vs. fastest normal behavior (99% conf.) [chart: QoS degradation after a failure, with the repair time until service returns to normal] Source: A. Brown and D. Patterson, “Towards availability benchmarks: a case study of software RAID systems,” Proc. USENIX, 18-23 June 2000
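
The overall shape of such a benchmark run can be sketched as follows: drive a workload, inject a fault partway through, and log quality of service over time so the repair interval can be read off the trace. Everything here (durations, the 0.4 degraded-QoS level, the stubbed system) is invented for illustration.

    def run_benchmark(duration_s=10.0, fault_at_s=3.0, repair_takes_s=4.0, step_s=0.5):
        trace = []                      # (simulated seconds, QoS as fraction of normal)
        t = 0.0
        while t < duration_s:
            if fault_at_s <= t < fault_at_s + repair_takes_s:
                qos = 0.4               # degraded service while the injected fault is being repaired
            else:
                qos = 1.0               # normal behavior before the fault and after repair
            trace.append((t, qos))
            t += step_s
        return trace

    for t, qos in run_benchmark():
        print(f"t={t:4.1f}s  QoS={qos}")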

  25. ISTORE – Hardware Techniques for Availability • Cluster of Storage Oriented Nodes (SON) • Scalable, tolerates partial failures, automatic redundancy • Heavily instrumented hardware • Sensors for temp, vibration, humidity, power, intrusion • Independent diagnostic processor on each node • Remote control of power; collects environmental data • Diagnostic processors connected via independent network • On-demand network partitioning/isolation • Allows testing, repair of online system • Managed by diagnostic processor • Built-in fault injection capabilities • Used for hardware introspection • Important for AME benchmarking

  26. ISTORE Software Techniques for Availability • Reactive introspection • “Mining” available system data • Proactive introspection • Isolation + fault insertion => test recovery code • Semantic redundancy • Use of coding and application-specific checkpoints • Self-scrubbing data structures • Check (and repair?) complex distributed structures • Load adaptation for performance faults • Dynamic load balancing for “regular” computations • Benchmarking • Define quantitative evaluations for AME
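
As a tiny illustration of the self-scrubbing idea (check, and optionally repair, a distributed structure), the sketch below walks three in-memory replicas of a key-value map, detects entries that disagree, and repairs them by majority vote. The data and the repair policy are invented, not the ISTORE mechanism itself.

    from collections import Counter

    replicas = [
        {"k1": "a", "k2": "b"},
        {"k1": "a", "k2": "X"},   # corrupted entry
        {"k1": "a", "k2": "b"},
    ]

    def scrub(replicas):
        repaired = 0
        keys = set().union(*replicas)
        for key in keys:
            values = [r.get(key) for r in replicas]
            majority, _ = Counter(values).most_common(1)[0]
            for r in replicas:
                if r.get(key) != majority:
                    r[key] = majority   # repair the disagreeing replica from the majority
                    repaired += 1
        return repaired

    print("repaired", scrub(replicas), "entries")
    print(replicas)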

  27. ISTORE Status • ISTORE Hardware • All 80 Nodes (boards) manufactured • PCB backplane: in layout • 32 node system running May 2002 • Software • 2-node system running -- boots OS • Diagnostic Processor SW and device driver done • Network striping done; fault adaptation ongoing • Load balancing for performance heterogeneity done • Benchmarking • Availability benchmark example complete • Initial maintainability benchmark complete, revised strategy underway

  28. ISTORE Prototype

  29. UCB Clusters • Millennium Central Cluster • 99 Dell 2300/6400/6450 Xeon Dual/Quad: 332 processors • Total: 211GB memory, 3TB disk • Myrinet 2000 + 1000Mb fiber ethernet • OceanStore/ROC cluster, Astro cluster, Math cluster, Cory cluster, more • CITRIS Cluster 1: 3/2002 deployment (Intel Donation) • 4 Dell Precision 730 Itanium Duals: 8 processors • Total: 8GB memory, 128GB disk • Myrinet 2000 + 1000Mb copper ethernet • CITRIS Cluster 2: 9/2002 deployment (Intel Donation) • ~128 Dell McKinley class Duals: 256 processors • Total: ~512GB memory, ~8TB disk • Myrinet 2000 + 1000Mb copper ethernet • Network Expansion Needed! - and underway • UCB shift from NorTel plus expansion • Ganglia cluster management in the SuSE distribution • hundreds of companies using it

  30. CITRIS Network Rollout

  31. Millennium Top Users • 800 users total on central cluster, many of which are CITRIS users • 75 major users for 2/2002: average 65% total CPU utilization • Independent component analysis – machine learning algorithms (fbach) • ns-2, a packet-level network simulator (machi) • Parallel AI algorithms for controlling a 4-legged robot (ang) • Image recognition (lwalker): 2 hours on cluster vs. 2 weeks on local resources • Network simulations for infrastructure to track moving objects over a wide area (mdenny) • Analyzing trends in BGP routing tables (sagarwal) • Boundary extraction and segmentation of natural images (dmartin) • Optical simulation and high-quality rendering (adamb) • Titanium – compiler and runtime system design for high-performance parallel programming languages (bonachea) • AMANDA – neutrino detection from polar ice core samples (amanda) http://ganglia.millennium.berkeley.edu

  32. Planet-Lab Motivation • A new class of services & applications is emerging that spreads over a sizable fraction of the web • CDNs as the first examples • Peer-to-peer, ... • Architectural components are beginning to emerge • Distributed hash tables to provide scalable translation • Distributed storage, caching, instrumentation, mapping, ... • The next internet will be created as an overlay on the current one • as was the last one • it will be defined by its services, not its transport • translation, storage, caching, event notification, management • There is NO vehicle to try out the next n great ideas in this area
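
To give a feel for the distributed-hash-table building block mentioned above, here is a minimal consistent-hashing sketch: keys map to the first node at or after their hash on a ring, so adding or removing a node moves only a small share of keys. Node names are placeholders, and real DHTs add routing and replication on top of this.

    import bisect
    import hashlib

    def h(value):
        # Map a string to a position on a 32-bit hash ring.
        return int(hashlib.sha256(value.encode()).hexdigest(), 16) % (2 ** 32)

    class HashRing:
        def __init__(self, nodes):
            self.ring = sorted((h(n), n) for n in nodes)

        def lookup(self, key):
            positions = [pos for pos, _ in self.ring]
            idx = bisect.bisect(positions, h(key)) % len(self.ring)  # wrap around the ring
            return self.ring[idx][1]

    ring = HashRing(["node-berkeley", "node-princeton", "node-cambridge"])
    for k in ["object-A", "object-B", "object-C"]:
        print(k, "->", ring.lookup(k))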

  33. Structure of the PlanetLab • >1000 viewpoints on the internet • 10-100 resource-rich sites at network crossroads • Typical use involves a slice across a substantial subset of nodes • Dual-role by design • Research testbed • large set of geographically distributed machines • diverse & realistic network conditions • classic ‘controlled’ SIGCOMM, INFOCOM studies • Deployment platform • services: design → evaluation → client base • → composite services • nodes: proxy path → physical path • make it useful to people

  34. Initial Researchers (Mar 02) • Rice • Peter Druschel • Utah • Jay Lepreau • CMU • Srini Seshan • Hui Zhang • UCSD • Stefan Savage • Columbia • Andrew Campbell • ICIR • Scott Shenker • Mark Handley • Eddie Kohler • Intel Research • David Culler • Timothy Roscoe • Sylvia Ratnasamy • Gaetano Borriello • Satya • Milan Milenkovic • Duke • Amin Vahdat • Jeff Chase • Princeton • Larry Peterson • Randy Wang • Vivek Pai • Washington • Tom Anderson • Steven Gribble • David Wetherall • MIT • Frans Kaashoek • Hari Balakrishnan • Robert Morris • David Anderson • Berkeley • Ion Stoica • Joe Hellerstein • Eric Brewer • John Kubi • see http://www.cs.berkeley.edu/~culler/planetlab

  35. Initial Planet-Lab Candidate Sites (planned as of July 2002) [map of candidate sites: Uppsala, UBC, Copenhagen, UW, Cambridge, WI, Chicago, UPenn, Amsterdam, Harvard, Utah, Intel Seattle, Karlsruhe, Tokyo, Intel MIT, Intel OR, Beijing, Barcelona, Intel Berkeley, Cornell, CMU, ICIR, Princeton, UCB, St. Louis, Columbia, Duke, UCSB, WashU, KY, UCLA, GIT, Rice, UCSD, UT, ISI, Melbourne]
