
Industrial Systems


Presentation Transcript


  1. Industrial Systems Imranul Hoque and Sonia Jahid CS 525

  2. ACMS: The Akamai Configuration Management System A. Sherman, P. Lisiecki, A. Berkheimer, J. Wein NSDI 2005

  3. What is Akamai? • Akamai is a CDN • Founded by: Daniel Lewin, Tom Leighton, Jonathan Seelig, Preetish Nijhawan • Customers include: Yahoo, Google, Microsoft, Apple, Xerox, AOL, MTV … • Trivia: • Akamai is a Hawaiian word meaning intelligent • D. Lewin was aboard AA Flight 11 during 9/11 • Al-Jazeera was Akamai’s customer from March 28, 2003 to April 2, 2003

  4. How Does Akamai Work? Image source: Wikipedia

  5. Challenges • 15,000 servers • 1200+ different networks • 60+ countries • Customers want to maintain close control over: • HTML cache timeouts • whether to allow cookies • whether to store session data • Config files must be propagated quickly

  6. Challenges (2) • Single Server vs. Server Farm • A non-trivial fraction of servers may be down • Servers are widely dispersed • Config changes are generated from widely dispersed places • A server recovering from failure needs to be brought up to date quickly

  7. Requirements • High Fault Tolerance and Availability • Should have multiple entry points for accepting and storing configuration updates • Efficiency and Scalability • Must deliver updates within reasonable time • Persistent Fault Tolerant Storage • Must store updates and deliver them asynchronously to unavailable machines once they become available • Correctness • Should order updates correctly • Acceptance Guarantee • An accepted update submission should be propagated to the Akamai CDN

  8. Architecture [Diagram: a Publisher and several ACMS Storage Points connected to the Akamai CDN; steps labeled Publish, Accept & Upload, Replication, Agreement, and Propagation]

  9. Quorum-based Replication • An update should be replicated and agreed upon by a quorum of Storage Points (SPs) • Quorum = majority • Each SP maintains connectivity by exchanging liveness messages with its peers • The Network Operations Command Center (NOCC) observes these statistics • A red alert is raised if a majority of SPs fail to report pairwise connectivity to a quorum

  10. Quorum-based Replication (2) • Acceptance Algorithm • Two phases: Replication & Agreement • Replication Phase • The accepting SP creates a temporary file with a unique filename (foo.A.1234) • Replicates this file to a quorum of SPs • If successful, starts the agreement phase • Agreement Phase • Vector Exchange (VE) protocol

  11. Vector Exchange (Example) [Diagram: Storage Points A, B, and C exchange bit vectors (e.g., 1,0,0 → 1,1,0 → 1,1,1) until each sees a quorum of bits set and the update is agreed]
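
A minimal sketch of the Vector Exchange idea described above, not the ACMS implementation: each Storage Point sets its own bit in a per-update bit vector once it has stored the update, merges vectors received from peers, and treats the update as agreed once a majority of bits are set. The StoragePoint class and the simulated message order are illustrative assumptions.

```python
# Toy sketch of Vector Exchange (VE) agreement among Storage Points (SPs).
class StoragePoint:
    def __init__(self, sp_id, n_sps):
        self.sp_id = sp_id
        self.n_sps = n_sps
        self.vector = [0] * n_sps          # one bit per SP for this update

    def store_and_vote(self):
        """Persist the update locally (elided), then set our own bit."""
        self.vector[self.sp_id] = 1
        return list(self.vector)

    def merge(self, other_vector):
        """Merge a vector received from a peer (bitwise OR)."""
        self.vector = [a | b for a, b in zip(self.vector, other_vector)]

    def agreed(self):
        """Agreement is reached once a majority of bits are set."""
        return sum(self.vector) > self.n_sps // 2


# Simulated exchange among three SPs (A=0, B=1, C=2), mirroring the example slide.
sps = [StoragePoint(i, 3) for i in range(3)]
v_a = sps[0].store_and_vote()                     # A stores the update: 1,0,0
sps[1].merge(v_a); v_b = sps[1].store_and_vote()  # B learns and votes:  1,1,0
sps[2].merge(v_b); v_c = sps[2].store_and_vote()  # C learns and votes:  1,1,1
print([sp.agreed() for sp in sps])                # [False, True, True]
```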

  12. Recovery via Index Merging • The acceptance algorithm guarantees that at least a quorum of SPs stores each update • SPs should sync up any missed updates • The recovery protocol is known as Index Merging • Configuration files are organized in a tree: the Index Tree • Configuration files are split into groups • A Group Index file lists the UIDs of the latest agreed-upon updates for each file in the group • The Root Index file lists all Group Index files with their latest modification timestamps

  13. Recovery via Index Merging (2) • In each round a Storage Point: • picks a random set of (Q-1) SPs [Q = majority] • downloads and parses the index files from those SPs • on detecting a more recent UID for a file, the SP updates its tree and downloads the file from one of its peers • To avoid frequent parsing: • SPs remember the timestamps of one another's index files • Uses HTTP IMS (If-Modified-Since) requests
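
A toy sketch of one index-merging round under the description above. The index is modeled as a flat dict from file name to UID, a larger UID is assumed to denote a newer agreed update, and `download` is a placeholder for fetching the file from a peer; in the real system, unchanged index files are skipped via HTTP If-Modified-Since rather than re-parsed.

```python
# Illustrative sketch of one Index Merging round, not the ACMS implementation.
import random

def merge_round(my_index, peer_indexes, quorum_size, download):
    """Pull index files from Q-1 random peers and sync any missed updates."""
    for peer_index in random.sample(peer_indexes, quorum_size - 1):
        for fname, uid in peer_index.items():
            if uid > my_index.get(fname, -1):
                # A peer knows a more recent update: fetch it, update our tree.
                download(fname, uid)
                my_index[fname] = uid
    return my_index

# Example: this SP missed an update to "customer42.cfg" that two peers have.
mine  = {"customer42.cfg": 7, "customer99.cfg": 3}
peers = [{"customer42.cfg": 8, "customer99.cfg": 3},
         {"customer42.cfg": 8, "customer99.cfg": 3},
         {"customer42.cfg": 7, "customer99.cfg": 3}]
merge_round(mine, peers, quorum_size=3,
            download=lambda f, u: print(f"downloading {f} (uid {u})"))
```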

  14. Data Delivery • Receivers run on each of the 15,000 nodes and check for configuration file updates • Receivers learn about the latest configuration files in the same way SPs merge their index files • Receivers are interested only in the subset of the index tree that describes their subscriptions • Receivers periodically query the snapshots on the SPs to learn of any updates • If an update matches one of their subscriptions, receivers download the file via an HTTP IMS request • Downloads are optimized by Akamai's own caching

  15. Operational Experience • Prototype version of ACMS • Consisted of a single primary Accepting Storage Point replicating submissions to a few secondary SPs • Drawback: single point of failure • Quorum Assumption • 36 instances of an SP being disconnected from the quorum for more than 10 minutes due to network outages during January to September 2004 • In all instances there was an operating quorum among the other SPs

  16. Evaluation • 14,276 total file submissions with 5 SPs over a 48-hour period • Average file size: 121 KB

  17. Evaluation (2) • Tail of Propagation Measurement: • Another random sample of 300 machines over a 4-day period • Looked at the propagation of short files (< 20 KB) • 99.8% of the time files were received within 2 minutes of becoming available • 99.96% of the time within 4 minutes [Chart: random sampling of 250 nodes]

  18. Evaluation (3) • Scalability • Front-end scalability is dominated by replication • For 5 SPs and an average file size of 121 KB, VE overhead is 0.4% of the replication bandwidth • For 15 SPs, VE overhead is 1.2% • For larger numbers of SPs, consistent hashing can be used to split the actual storage

  19. Discussion • Fault-Tolerant Replication • Distributed file systems: Coda, Pangaea, Bayou • All of these attempt to improve availability at the expense of consistency • ACMS must provide a high level of consistency • The two-phase acceptance algorithm used by ACMS is similar in nature to Two-Phase Commit • VE was inspired by the concept of vector clocks and uses a quorum-based approach similar to Paxos and BFS • Comparison with Software Update Systems • LCFG and Novadigm: span a single network or a few networks • Windows Update: updates can be delayed, centralized

  20. Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, R. Gruber OSDI 2006

  21. What is Bigtable? • A distributed storage system for managing structured data • A sparse, distributed, persistent, multidimensional sorted map • Designed to scale to very large amounts of data • Implemented and used by Google • Web indexing • Google Earth • Orkut • Etc.

  22. Is It a Database? • Doesn't support a full relational data model • Each value in Bigtable is an uninterpreted array of bytes • Doesn't speak SQL • Wouldn't pass the ACID test

  23. Data Model • Indexed by: (row: string, column: string, time: int64)

  24. Data Model: Rows • Row keys are arbitrary strings, up to 64 KB in size • Reads and writes under a single row key are atomic • Data is maintained in lexicographic order by row key • The row range of a table is dynamically partitioned into tablets • A tablet (100-200 MB) is the unit of distribution and load balancing • Tablets allow efficient reads of short row ranges

  25. Data Model: Column Families • Column keys (family:qualifier, where the qualifier is optional) are grouped into column families • A small number (in the hundreds) of distinct families per table, but an unbounded number of columns

  26. Data Model: Timestamps • Each cell may contain multiple versions of data, indexed by timestamp • Timestamps are assigned by Bigtable or by client applications • Clients specify either that only the last n versions of a cell be kept, or that only sufficiently recent versions be kept (e.g., values written in the last 7 days) • Supports a garbage-collection mechanism for old versions
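
Putting the data-model slides together, here is a toy in-memory sketch of the map (row:string, column:string, time:int64) → string, with a "keep the last n versions" garbage-collection policy. ToyTable and the example keys are illustrative assumptions, not Bigtable's API.

```python
# Toy model of Bigtable's sparse, sorted, multidimensional map (not Bigtable).
from collections import defaultdict

class ToyTable:
    def __init__(self, max_versions=3):
        # cells[row][column] is a list of (timestamp, value), newest first.
        self.cells = defaultdict(lambda: defaultdict(list))
        self.max_versions = max_versions        # "keep last n versions" GC policy

    def write(self, row, column, timestamp, value):
        versions = self.cells[row][column]
        versions.append((timestamp, value))
        versions.sort(reverse=True)             # newest version first
        del versions[self.max_versions:]        # garbage-collect old versions

    def read(self, row, column):
        """Return the most recent value for a cell, or None if absent."""
        versions = self.cells[row][column]
        return versions[0][1] if versions else None

    def rows_sorted(self):
        """Rows are maintained in lexicographic order by row key."""
        return sorted(self.cells)

t = ToyTable()
t.write("com.cnn.www", "contents:", 5, "<html>...v5")
t.write("com.cnn.www", "contents:", 6, "<html>...v6")
t.write("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
print(t.read("com.cnn.www", "contents:"))       # newest version: "<html>...v6"
```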

  27. Building Blocks • Built on several other pieces of Google infrastructure • The distributed Google File System (GFS) • A cluster management system • Handles machine failures, monitors machine status, manages resources, schedules jobs • The SSTable file format • Provides a persistent, ordered, immutable map from keys to values • Internally contains a sequence of blocks, typically 64 KB each • A block index is used to look up blocks
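
A minimal sketch of an SSTable-style lookup, assuming a toy file split into small in-memory blocks: the block index (first key of each block) is binary-searched, then a single block is scanned. ToySSTable is an illustration of the idea, not Google's file format.

```python
# Toy sketch of a block-indexed, immutable, key-ordered table (SSTable-style).
import bisect

class ToySSTable:
    def __init__(self, sorted_items, block_size=2):
        # Split sorted (key, value) pairs into blocks of block_size entries.
        self.blocks = [sorted_items[i:i + block_size]
                       for i in range(0, len(sorted_items), block_size)]
        self.index = [block[0][0] for block in self.blocks]  # first key per block

    def get(self, key):
        # Binary-search the in-memory block index, then scan one block.
        i = bisect.bisect_right(self.index, key) - 1
        if i < 0:
            return None
        for k, v in self.blocks[i]:
            if k == key:
                return v
        return None

sst = ToySSTable([("a", 1), ("c", 2), ("f", 3), ("k", 4), ("z", 5)])
print(sst.get("f"), sst.get("b"))   # 3 None
```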

  28. Building Blocks (2) • Chubby lock service: • Provides directories and small files that can be used as locks • Clients maintain a session with the Chubby service • Bigtable uses Chubby to: • Ensure there is at most 1 active master at a time • Store the bootstrap location of Bigtable data • Discover live tablet servers • If Chubby is unavailable, Bigtable is unavailable

  29. Implementation • Three major components: • A library linked into every client • One master server • Assigns tablets to tablet servers (TSs) • Detects the addition and expiration of TSs • Balances TS load • Performs garbage collection of files in GFS • Handles schema changes

  30. Implementation (2) • Many tablet servers • Each manages a set of tablets (ten to a thousand) • Handles read/write requests to its tablets • Splits tablets that have grown too large

  31. Tablet Location • A 3-level hierarchy • The client library caches tablet locations • If the client cache is empty: 3 network round-trips • If the client cache is stale: up to 6 network round-trips • The client library also prefetches tablet locations
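
A toy sketch of the three-level location lookup with a client-side cache, assuming each level is one network round-trip (Chubby file → root METADATA tablet → other METADATA tablet → user tablet location); ToyLocator and its dictionaries are illustrative stand-ins, not the Bigtable client library.

```python
# Toy model of the three-level tablet-location hierarchy with a client cache.
class ToyLocator:
    def __init__(self, chubby, root_tablet, metadata_tablets):
        self.chubby = chubby                      # {"root": root tablet location}
        self.root_tablet = root_tablet            # {METADATA tablet: its location}
        self.metadata_tablets = metadata_tablets  # {row key: user tablet location}
        self.cache = {}                           # client-side location cache
        self.round_trips = 0

    def locate(self, row_key):
        if row_key in self.cache:                 # warm cache: no network traffic
            return self.cache[row_key]
        _ = self.chubby["root"];        self.round_trips += 1  # 1. Chubby file
        _ = self.root_tablet["meta-1"]; self.round_trips += 1  # 2. root tablet
        loc = self.metadata_tablets[row_key]                   # 3. METADATA tablet
        self.round_trips += 1
        self.cache[row_key] = loc
        return loc

locator = ToyLocator({"root": "ts-root"}, {"meta-1": "ts-meta"}, {"row-17": "ts-42"})
print(locator.locate("row-17"), locator.round_trips)  # cold cache: 3 round-trips
print(locator.locate("row-17"), locator.round_trips)  # warm cache: still 3
```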

  32. Tablet Assignment • A TS acquires a lock on a uniquely-named file in a specific Chubby directory upon startup • The master periodically polls each TS about the status of its lock • If a TS reports that it has lost its lock, the master tries to acquire the lock itself; if it can, Chubby is alive and the TS is at fault • The master then deletes the TS's file and moves the tablets assigned to it into the set of unassigned tablets (see the sketch after the next slide)

  33. Tablet Assignment (2) • When a master is started by the cluster management system: • It grabs a unique master lock in Chubby • Finds live TSs from Chubby • Discovers existing tablet assignments from the TSs • Scans the METADATA table to learn the full set of tablets • Adds any tablets not already assigned to the unassigned set • The set of tablets changes when tablets are split or merged, or when tables are created or deleted
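
A toy sketch of the master's reassignment check from slide 32: when a TS reports a lost lock, the master probes Chubby; if the master can take the lock, Chubby is alive and the TS is at fault, so the master deletes the TS's file and queues its tablets for reassignment. ToyChubby and ToyTabletServer are illustrative stand-ins, not the Bigtable implementation.

```python
# Toy model of the master's tablet-reassignment check.
class ToyChubby:
    def __init__(self):
        self.locks = {}                              # lock file -> holder
    def try_acquire(self, path, holder):
        if self.locks.get(path) in (None, holder):
            self.locks[path] = holder
            return True
        return False
    def delete(self, path):
        self.locks.pop(path, None)

class ToyTabletServer:
    def __init__(self, name, tablets):
        self.lock_file = f"/locks/{name}"
        self.tablets = tablets
        self.holds_lock = True                       # set False to simulate lock loss
    def reports_lock_held(self):
        return self.holds_lock

def check_tablet_server(master, ts, chubby, unassigned):
    """Master's periodic poll of one tablet server."""
    if ts.reports_lock_held():
        return
    # TS lost its lock: probe Chubby directly. If the master can take the
    # lock, Chubby is alive and the TS is at fault, so delete the TS's file
    # and queue its tablets for reassignment.
    if chubby.try_acquire(ts.lock_file, master):
        chubby.delete(ts.lock_file)
        unassigned.extend(ts.tablets)

chubby, unassigned = ToyChubby(), []
ts = ToyTabletServer("ts-7", ["tablet-a", "tablet-b"])
ts.holds_lock = False                                # simulate an expired session
check_tablet_server("master", ts, chubby, unassigned)
print(unassigned)                                    # ['tablet-a', 'tablet-b']
```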

  34. Tablet Serving • A tablet's persistent state is stored in GFS • Recent updates are stored in memory in a sorted buffer called the memtable • Writes and reads are performed after authorization checks

  35. Compactions • Minor compaction: • Converts the memtable into a new SSTable, shrinking TS memory usage • Merging compaction: • Merges a few SSTables and the memtable into one new SSTable • Major compaction: • A merging compaction that rewrites all SSTables into exactly one SSTable • The SSTable produced by a major compaction contains no deleted data
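
A minimal sketch of a merging compaction under simplified assumptions: sources are merged newest first so the newest value for a key wins, and a major compaction additionally drops deletion markers so no deleted data survives. The DELETED sentinel and dict-based merge are illustrative, not Bigtable's on-disk format.

```python
# Toy sketch of merging and major compactions.
DELETED = object()   # stand-in for a deletion marker (tombstone)

def merging_compaction(memtable, sstables, major=False):
    """Merge the memtable and SSTables into one new sorted SSTable.
    Sources are given newest first, so the newest value for each key wins."""
    merged = {}
    for source in [sorted(memtable.items())] + list(sstables):
        for key, value in source:
            merged.setdefault(key, value)            # keep the newest version only
    if major:
        # A major compaction rewrites everything and drops deletion markers.
        merged = {k: v for k, v in merged.items() if v is not DELETED}
    return sorted(merged.items())

memtable = {"b": "new-b", "c": DELETED}              # "c" was deleted recently
old_sstables = [[("a", "old-a"), ("b", "old-b")], [("c", "old-c")]]
print(merging_compaction(memtable, old_sstables, major=True))
# [('a', 'old-a'), ('b', 'new-b')]  -- "c" and its tombstone are gone
```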

  36. Refinements • Locality groups: group multiple column families together, e.g., the language and checksum families in the WebPage example • Compression: clients control whether the SSTables for a locality group are compressed (at the block level) • Caching: scan cache and block cache maintained by the TS • Bloom filters: allow asking whether an SSTable might contain any data for a given row/column pair
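
A small Bloom-filter sketch for the last refinement: it answers "might this SSTable contain data for this row/column pair?" with no false negatives, so most lookups for absent keys can skip the disk read. ToyBloomFilter and its parameters are illustrative, not Bigtable's implementation.

```python
# Toy Bloom filter over row/column keys.
import hashlib

class ToyBloomFilter:
    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = [0] * n_bits

    def _positions(self, key):
        # Derive n_hashes bit positions from independent hashes of the key.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means definitely absent; True means "might be present".
        return all(self.bits[pos] for pos in self._positions(key))

bf = ToyBloomFilter()
bf.add("com.cnn.www/anchor:cnnsi.com")
print(bf.might_contain("com.cnn.www/anchor:cnnsi.com"))   # True
print(bf.might_contain("com.example/contents:"))          # almost surely False
```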

  37. Performance Evaluation • Benchmarks measure the number of 1000-byte values read/written per second per TS • N tablet servers, with N varied • TSs, master, test clients, and GFS servers all ran on the same set of machines • Row keys were partitioned into 10N equal-sized ranges

  38. Discussion • Provides an API in C++ • Achieves high availability and performance • Works well in practice: more than 60 Google applications use it • A good and flexible industrial storage system

  39. MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat OSDI 2004

  40. Motivation • Special-purpose computations at Google process large amounts of raw data: • Crawled documents • Web request logs • The outputs include: • Inverted indices • A representation of the graph structure of web documents • The number of pages crawled per host • The most frequent queries in a given day • The input data is large and the computation must be distributed

  41. Motivation (2) • Issues: • How to parallelize the computation • How to distribute the data • How to handle failures • All of these obscure the original simple computation with large amounts of complex code • Solution: MapReduce

  42. MapReduce • A programming model and associated implementation for processing and generating large data sets • Consists of Map and Reduce functions • Inspired by the map and reduce primitives of Lisp and other functional languages • Map is applied to the input to compute a set of intermediate key/value pairs • Reduce combines the derived data appropriately • Allows large computations to be parallelized easily

  43. Example • Text 1: it is what it is • Text 2: what is it • Text 3: it is a banana • Map tasks are assigned to workers • Example inspired by the original presentation by the authors at OSDI

  44. Example (2): Map • Worker 1: • (it 1), (is 1), (what 1), (it 1), (is 1) • Worker 2: • (what 1), (is 1), (it 1) • Worker 3: • (it 1), (is 1), (a 1), (banana 1)

  45. Example (3): Reduce Input • Worker 1: • (a 1) • Worker 2: • (banana 1) • Worker 3: • (is 1), (is 1), (is 1), (is 1) • Worker 4: • (it 1), (it 1), (it 1), (it 1) • Worker 5: • (what 1), (what 1)

  46. Example (4): Reduce Output • Worker 1: • (a 1) • Worker 2: • (banana 1) • Worker 3: • (is 4) • Worker 4: • (it 4) • Worker 5: • (what 2)
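
The word-count example above, written as a minimal single-process sketch in the MapReduce style: `map_fn` emits (word, 1) pairs, an in-memory sort stands in for the shuffle, and `reduce_fn` sums the counts per word. This is a toy stand-in for the distributed runtime, not the MapReduce library.

```python
# Toy word count in the MapReduce programming style.
from itertools import groupby

def map_fn(doc_name, text):
    """Map: emit (word, 1) for every word in the document."""
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum the counts emitted for one word."""
    return word, sum(counts)

documents = {"Text 1": "it is what it is",
             "Text 2": "what is it",
             "Text 3": "it is a banana"}

# Map phase: each "worker" processes one document.
intermediate = [pair for name, text in documents.items()
                for pair in map_fn(name, text)]
# Shuffle: group intermediate pairs by key, as the runtime would.
intermediate.sort()
results = [reduce_fn(word, (c for _, c in pairs))
           for word, pairs in groupby(intermediate, key=lambda kv: kv[0])]
print(results)   # [('a', 1), ('banana', 1), ('is', 4), ('it', 4), ('what', 2)]
```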

  47. Execution Overview

  48. Execution Overview (2) • Master Data Structures • For each map and reduce task: its state and the identity of its worker machine • Locations and sizes of the intermediate files produced by map tasks • Fault Tolerance • The master pings workers periodically • Map tasks on a failed worker are re-executed (their output is on local disk); completed reduce tasks are not (their output is in the global file system) • Master failure is assumed to be unlikely

  49. Execution Overview (3) • Locality • MapReduce takes the location information of the input files into account and attempts to schedule a map task on a machine near the data • Task Granularity • M and R should be much larger than the number of workers, for dynamic load balancing and faster failure recovery • R is often constrained by users because the output of each reduce task is a separate file • Backup Tasks • Some machines may take an unusually long time to finish (stragglers) • Near the end, the master schedules backup executions of the remaining in-progress tasks

  50. Refinements • Partitioning Function • Key k is assigned to reduce worker hash(k) % R • Users can specify an alternative, e.g., hash(hostname(k)) % R • Combiner Function • Data can be partially merged before being sent over the network • Skipping Bad Records • Optionally discard the few records that cause the program to crash • Local Execution • Sequential execution on a local machine for debugging • Status Information • Shows progress, errors, and output files (see the sketch below)
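
A small sketch of the partitioning-function and combiner refinements, under toy assumptions: `hostname` here is a hypothetical helper built on `urlparse`, and the combiner simply merges (key, count) pairs locally before they would be sent over the network. None of this is the MapReduce library's API.

```python
# Toy partitioner and combiner in the spirit of the refinements slide.
from collections import Counter
from urllib.parse import urlparse

R = 4   # number of reduce tasks

def default_partition(key):
    return hash(key) % R                    # default: hash(k) % R

def hostname(url_key):
    """Hypothetical helper: extract the host from a URL key."""
    return urlparse(url_key).netloc

def host_partition(key):
    # User-specified partitioner: all URLs from one host go to the same
    # reduce task, so they end up in the same output file.
    return hash(hostname(key)) % R

def combiner(pairs):
    """Merge (key, count) pairs locally before the shuffle."""
    combined = Counter()
    for key, count in pairs:
        combined[key] += count
    return list(combined.items())

print(host_partition("http://cnn.com/a") == host_partition("http://cnn.com/b"))  # True
print(combiner([("it", 1), ("it", 1), ("is", 1)]))   # [('it', 2), ('is', 1)]
```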
