
Industrial Systems


Presentation Transcript


  1. Industrial Systems Imranul Hoque and Sonia Jahid CS 525

  2. ACMS: The Akamai Configuration Management System A. Sherman, P. Lisiecki, A. Berkheimer, J. Wein NSDI 2005

  3. What is Akamai? • Akamai is a CDN • Founded by: Daniel Lewin, Tom Leighton, Jonathan Seelig, Preetish Nijhawan • Customers include: Yahoo, Google, Microsoft, Apple, Xerox, AOL, MTV … • Trivia: • Akamai is a Hawaiian word meaning intelligent • D. Lewin was aboard AA Flight 11 during 9/11 • Al-Jazeera was Akamai’s customer from March 28, 2003 to April 2, 2003

  4. How Does Akamai Work? Image source: Wikipedia

  5. Challenges • 15,000 servers • 1200+ different networks • 60+ countries • Customers want to maintain close control over: • HTML cache timeouts • whether to allow cookies • whether to store session data • Config files must be propagated quickly

  6. Challenges (2) • Single Server vs. Server Farm • A non-trivial fraction of servers may be down • Servers are widely dispersed • Config changes are generated from widely dispersed places • A server recovering from failure needs to be brought up to date quickly

  7. Requirements • High Fault Tolerance and Availability • Should have multiple entry points for accepting and storing configuration updates • Efficiency and Scalability • Must deliver updates within reasonable time • Persistent Fault Tolerant Storage • Must store updates and deliver them asynchronously to unavailable machines once they become available • Correctness • Should order updates correctly • Acceptance Guarantee • An accepted update submission should be propagated to the Akamai CDN

  8. Architecture [Diagram: a Publisher and several ACMS Storage Points connected to the Akamai CDN; steps labeled Publish, Accept & Upload, Replication, Agreement, and Propagation]

  9. Quorum-based Replication • An update should be replicated and agreed upon by a quorum of Storage Points (SPs) • Quorum = majority • Each SP maintains connectivity by exchanging liveness messages with its peers • The Network Operations Command Center (NOCC) observes these statistics • A red alert is raised if a majority of SPs fail to report pairwise connectivity to a quorum

  10. Quorum-based Replication (2) • Acceptance Algorithm • Two phases: Replication & Agreement • Replication Phase • The accepting SP creates a temporary file with a unique filename (foo.A.1234) • Replicates this file to a quorum of SPs • If successful, starts the agreement phase • Agreement Phase • Vector Exchange (VE) protocol

  11. Vector Exchange (Example) [Diagram: Storage Points A, B, and C exchange bit vectors (e.g., 1,0,0 → 1,1,0 → 1,1,1) until each sees a quorum of bits set and the update is agreed]
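
A minimal sketch of the Vector Exchange idea described above, not the ACMS implementation: each Storage Point sets its own bit in a per-update bit vector once it has stored the update, merges vectors received from peers, and treats the update as agreed once a majority of bits are set. The StoragePoint class and the simulated message order are illustrative assumptions.

```python
# Toy sketch of Vector Exchange (VE) agreement among Storage Points (SPs).
class StoragePoint:
    def __init__(self, sp_id, n_sps):
        self.sp_id = sp_id
        self.n_sps = n_sps
        self.vector = [0] * n_sps          # one bit per SP for this update

    def store_and_vote(self):
        """Persist the update locally (elided), then set our own bit."""
        self.vector[self.sp_id] = 1
        return list(self.vector)

    def merge(self, other_vector):
        """Merge a vector received from a peer (bitwise OR)."""
        self.vector = [a | b for a, b in zip(self.vector, other_vector)]

    def agreed(self):
        """Agreement is reached once a majority of bits are set."""
        return sum(self.vector) > self.n_sps // 2


# Simulated exchange among three SPs (A=0, B=1, C=2), mirroring the example slide.
sps = [StoragePoint(i, 3) for i in range(3)]
v_a = sps[0].store_and_vote()                     # A stores the update: 1,0,0
sps[1].merge(v_a); v_b = sps[1].store_and_vote()  # B learns and votes:  1,1,0
sps[2].merge(v_b); v_c = sps[2].store_and_vote()  # C learns and votes:  1,1,1
print([sp.agreed() for sp in sps])                # [False, True, True]
```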

  12. Recovery via Index Merging • The acceptance algorithm guarantees that at least a quorum of SPs stores each update • SPs should sync up any missed updates • The recovery protocol is known as Index Merging • Configuration files are organized in a tree: the Index Tree • Configuration files are split into groups • A Group Index file lists the UIDs of the latest agreed-upon updates for each file in the group • The Root Index file lists all Group Index files with their latest modification timestamps

  13. Recovery via Index Merging (2) • In each round a Storage Point: • picks a random set of (Q-1) SPs [Q = majority] • downloads and parses the index files from those SPs • on detecting a more recent UID for a file, the SP updates its tree and downloads the file from one of its peers • To avoid frequent parsing: • SPs remember the timestamps of one another's index files • Uses HTTP IMS (If-Modified-Since) requests
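
A toy sketch of one index-merging round under the description above. The index is modeled as a flat dict from file name to UID, a larger UID is assumed to denote a newer agreed update, and `download` is a placeholder for fetching the file from a peer; in the real system, unchanged index files are skipped via HTTP If-Modified-Since rather than re-parsed.

```python
# Illustrative sketch of one Index Merging round, not the ACMS implementation.
import random

def merge_round(my_index, peer_indexes, quorum_size, download):
    """Pull index files from Q-1 random peers and sync any missed updates."""
    for peer_index in random.sample(peer_indexes, quorum_size - 1):
        for fname, uid in peer_index.items():
            if uid > my_index.get(fname, -1):
                # A peer knows a more recent update: fetch it, update our tree.
                download(fname, uid)
                my_index[fname] = uid
    return my_index

# Example: this SP missed an update to "customer42.cfg" that two peers have.
mine  = {"customer42.cfg": 7, "customer99.cfg": 3}
peers = [{"customer42.cfg": 8, "customer99.cfg": 3},
         {"customer42.cfg": 8, "customer99.cfg": 3},
         {"customer42.cfg": 7, "customer99.cfg": 3}]
merge_round(mine, peers, quorum_size=3,
            download=lambda f, u: print(f"downloading {f} (uid {u})"))
```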

  14. Data Delivery • Receivers run on each of the 15,000 nodes and check for configuration file updates • Receivers learn about the latest configuration files in the same way SPs merge their index files • Receivers are interested only in the subset of the index tree that describes their subscriptions • Receivers periodically query the snapshots on the SPs to learn of any updates • If an update matches one of their subscriptions, receivers download the file via an HTTP IMS request • Downloads are optimized by Akamai's own caching

  15. Operational Experience • Prototype version of ACMS • Consisted of a single primary Accepting Storage Point replicating submissions to a few secondary SPs • Drawback: single point of failure • Quorum Assumption • 36 instances of an SP being disconnected from the quorum for more than 10 minutes due to network outages during January to September 2004 • In all instances there was an operating quorum among the other SPs

  16. Evaluation • 14,276 total file submissions with 5 SPs over a 48-hour period • Average file size: 121 KB

  17. Evaluation (2) • Tail of Propagation Measurement: • Another random sample of 300 machines over a 4-day period • Looked at the propagation of short files (< 20 KB) • 99.8% of the time files were received within 2 minutes of becoming available • 99.96% of the time within 4 minutes [Chart: random sampling of 250 nodes]

  18. Evaluation (3) • Scalability • Front-end scalability is dominated by replication • For 5 SPs and an average file size of 121 KB, VE overhead is 0.4% of the replication bandwidth • For 15 SPs, VE overhead is 1.2% • For larger numbers of SPs, consistent hashing can be used to split the actual storage

  19. Discussion • Fault-Tolerant Replication • Distributed file systems: Coda, Pangaea, Bayou • All of these attempt to improve availability at the expense of consistency • ACMS must provide a high level of consistency • The two-phase acceptance algorithm used by ACMS is similar in nature to Two-Phase Commit • VE was inspired by the concept of vector clocks and uses a quorum-based approach similar to Paxos and BFS • Comparison with Software Update Systems • LCFG and Novadigm: span a single network or a few networks • Windows Update: updates can be delayed, centralized

  20. Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, R. Gruber OSDI 2006

  21. What is Bigtable? • A distributed storage system for managing structured data • A sparse, distributed, persistent, multidimensional sorted map • Designed to scale to very large amounts of data • Implemented and used by Google • Web indexing • Google Earth • Orkut • Etc.

  22. Is It a Database? • Doesn't support a full relational data model • Each value in Bigtable is an uninterpreted array of bytes • Doesn't speak SQL • Wouldn't pass the ACID test

  23. Data Model • Indexed by: (row: string, column: string, time: int64)

  24. Data Model: Rows • Row keys are arbitrary strings, up to 64 KB in size • Reads and writes under a single row key are atomic • Data is maintained in lexicographic order by row key • The row range of a table is dynamically partitioned into tablets • A tablet (100-200 MB) is the unit of distribution and load balancing • Tablets allow efficient reads of short row ranges

  25. Data Model: Column Families • Column keys (family:qualifier, where the qualifier is optional) are grouped into column families • A small number (in the hundreds) of distinct families per table, but an unbounded number of columns

  26. Data Model: Timestamps • Each cell may contain multiple versions of data, indexed by timestamp • Timestamps are assigned by Bigtable or by client applications • Clients specify either that only the last n versions of a cell be kept, or that only sufficiently recent versions be kept (e.g., values written in the last 7 days) • Supports a garbage-collection mechanism for old versions
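
Putting the data-model slides together, here is a toy in-memory sketch of the map (row:string, column:string, time:int64) → string, with a "keep the last n versions" garbage-collection policy. ToyTable and the example keys are illustrative assumptions, not Bigtable's API.

```python
# Toy model of Bigtable's sparse, sorted, multidimensional map (not Bigtable).
from collections import defaultdict

class ToyTable:
    def __init__(self, max_versions=3):
        # cells[row][column] is a list of (timestamp, value), newest first.
        self.cells = defaultdict(lambda: defaultdict(list))
        self.max_versions = max_versions        # "keep last n versions" GC policy

    def write(self, row, column, timestamp, value):
        versions = self.cells[row][column]
        versions.append((timestamp, value))
        versions.sort(reverse=True)             # newest version first
        del versions[self.max_versions:]        # garbage-collect old versions

    def read(self, row, column):
        """Return the most recent value for a cell, or None if absent."""
        versions = self.cells[row][column]
        return versions[0][1] if versions else None

    def rows_sorted(self):
        """Rows are maintained in lexicographic order by row key."""
        return sorted(self.cells)

t = ToyTable()
t.write("com.cnn.www", "contents:", 5, "<html>...v5")
t.write("com.cnn.www", "contents:", 6, "<html>...v6")
t.write("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
print(t.read("com.cnn.www", "contents:"))       # newest version: "<html>...v6"
```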

  27. Building Blocks • Built on several other pieces of Google infrastructure • The distributed Google File System (GFS) • A cluster management system • Handles machine failures, monitors machine status, manages resources, schedules jobs • The SSTable file format • Provides a persistent, ordered, immutable map from keys to values • Internally contains a sequence of blocks, typically 64 KB each • A block index is used to look up blocks
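
A minimal sketch of an SSTable-style lookup, assuming a toy file split into small in-memory blocks: the block index (first key of each block) is binary-searched, then a single block is scanned. ToySSTable is an illustration of the idea, not Google's file format.

```python
# Toy sketch of a block-indexed, immutable, key-ordered table (SSTable-style).
import bisect

class ToySSTable:
    def __init__(self, sorted_items, block_size=2):
        # Split sorted (key, value) pairs into blocks of block_size entries.
        self.blocks = [sorted_items[i:i + block_size]
                       for i in range(0, len(sorted_items), block_size)]
        self.index = [block[0][0] for block in self.blocks]  # first key per block

    def get(self, key):
        # Binary-search the in-memory block index, then scan one block.
        i = bisect.bisect_right(self.index, key) - 1
        if i < 0:
            return None
        for k, v in self.blocks[i]:
            if k == key:
                return v
        return None

sst = ToySSTable([("a", 1), ("c", 2), ("f", 3), ("k", 4), ("z", 5)])
print(sst.get("f"), sst.get("b"))   # 3 None
```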

  28. Building Blocks (2) • Chubby lock service: • Provides directories and small files that can be used as locks • Clients maintain a session with the Chubby service • Bigtable uses Chubby to: • Ensure there is at most 1 active master at a time • Store the bootstrap location of Bigtable data • Discover live tablet servers • If Chubby is unavailable, Bigtable is unavailable

  29. Implementation • Three major components: • A library linked into every client • One master server • Assigns tablets to tablet servers (TSs) • Detects the addition and expiration of TSs • Balances TS load • Performs garbage collection of files in GFS • Handles schema changes

  30. Implementation (2) • Many tablet servers • Each manages a set of tablets (ten to a thousand) • Handles read/write requests to its tablets • Splits tablets that have grown too large

  31. Tablet Location • A 3-level hierarchy • The client library caches tablet locations • If the client cache is empty: 3 network round-trips • If the client cache is stale: up to 6 network round-trips • The client library also prefetches tablet locations
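
A toy sketch of the three-level location lookup with a client-side cache, assuming each level is one network round-trip (Chubby file → root METADATA tablet → other METADATA tablet → user tablet location); ToyLocator and its dictionaries are illustrative stand-ins, not the Bigtable client library.

```python
# Toy model of the three-level tablet-location hierarchy with a client cache.
class ToyLocator:
    def __init__(self, chubby, root_tablet, metadata_tablets):
        self.chubby = chubby                      # {"root": root tablet location}
        self.root_tablet = root_tablet            # {METADATA tablet: its location}
        self.metadata_tablets = metadata_tablets  # {row key: user tablet location}
        self.cache = {}                           # client-side location cache
        self.round_trips = 0

    def locate(self, row_key):
        if row_key in self.cache:                 # warm cache: no network traffic
            return self.cache[row_key]
        _ = self.chubby["root"];        self.round_trips += 1  # 1. Chubby file
        _ = self.root_tablet["meta-1"]; self.round_trips += 1  # 2. root tablet
        loc = self.metadata_tablets[row_key]                   # 3. METADATA tablet
        self.round_trips += 1
        self.cache[row_key] = loc
        return loc

locator = ToyLocator({"root": "ts-root"}, {"meta-1": "ts-meta"}, {"row-17": "ts-42"})
print(locator.locate("row-17"), locator.round_trips)  # cold cache: 3 round-trips
print(locator.locate("row-17"), locator.round_trips)  # warm cache: still 3
```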

  32. Tablet Assignment • A TS acquires a lock on a uniquely-named file in a specific Chubby directory upon startup • The master periodically polls each TS about the status of its lock • If a TS reports that it has lost its lock, the master tries to acquire the lock itself; if it can, Chubby is alive and the TS is at fault • The master then deletes the TS's file and moves the tablets assigned to it into the set of unassigned tablets (see the sketch after the next slide)

  33. Tablet Assignment (2) • When a master is started by the cluster management system: • It grabs a unique master lock in Chubby • Finds live TSs from Chubby • Discovers existing tablet assignments from the TSs • Scans the METADATA table to learn the full set of tablets • Adds any tablets not already assigned to the unassigned set • The set of tablets changes when tablets are split or merged, or when tables are created or deleted
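
A toy sketch of the master's reassignment check from slide 32: when a TS reports a lost lock, the master probes Chubby; if the master can take the lock, Chubby is alive and the TS is at fault, so the master deletes the TS's file and queues its tablets for reassignment. ToyChubby and ToyTabletServer are illustrative stand-ins, not the Bigtable implementation.

```python
# Toy model of the master's tablet-reassignment check.
class ToyChubby:
    def __init__(self):
        self.locks = {}                              # lock file -> holder
    def try_acquire(self, path, holder):
        if self.locks.get(path) in (None, holder):
            self.locks[path] = holder
            return True
        return False
    def delete(self, path):
        self.locks.pop(path, None)

class ToyTabletServer:
    def __init__(self, name, tablets):
        self.lock_file = f"/locks/{name}"
        self.tablets = tablets
        self.holds_lock = True                       # set False to simulate lock loss
    def reports_lock_held(self):
        return self.holds_lock

def check_tablet_server(master, ts, chubby, unassigned):
    """Master's periodic poll of one tablet server."""
    if ts.reports_lock_held():
        return
    # TS lost its lock: probe Chubby directly. If the master can take the
    # lock, Chubby is alive and the TS is at fault, so delete the TS's file
    # and queue its tablets for reassignment.
    if chubby.try_acquire(ts.lock_file, master):
        chubby.delete(ts.lock_file)
        unassigned.extend(ts.tablets)

chubby, unassigned = ToyChubby(), []
ts = ToyTabletServer("ts-7", ["tablet-a", "tablet-b"])
ts.holds_lock = False                                # simulate an expired session
check_tablet_server("master", ts, chubby, unassigned)
print(unassigned)                                    # ['tablet-a', 'tablet-b']
```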

  34. Tablet Serving • A tablet's persistent state is stored in GFS • Recent updates are stored in memory in a sorted buffer called the memtable • Writes and reads are performed after authorization checks

  35. Compactions • Minor compaction: • Converts the memtable into a new SSTable, shrinking TS memory usage • Merging compaction: • Merges a few SSTables and the memtable into one new SSTable • Major compaction: • A merging compaction that rewrites all SSTables into exactly one SSTable • The SSTable produced by a major compaction contains no deleted data
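
A minimal sketch of a merging compaction under simplified assumptions: sources are merged newest first so the newest value for a key wins, and a major compaction additionally drops deletion markers so no deleted data survives. The DELETED sentinel and dict-based merge are illustrative, not Bigtable's on-disk format.

```python
# Toy sketch of merging and major compactions.
DELETED = object()   # stand-in for a deletion marker (tombstone)

def merging_compaction(memtable, sstables, major=False):
    """Merge the memtable and SSTables into one new sorted SSTable.
    Sources are given newest first, so the newest value for each key wins."""
    merged = {}
    for source in [sorted(memtable.items())] + list(sstables):
        for key, value in source:
            merged.setdefault(key, value)            # keep the newest version only
    if major:
        # A major compaction rewrites everything and drops deletion markers.
        merged = {k: v for k, v in merged.items() if v is not DELETED}
    return sorted(merged.items())

memtable = {"b": "new-b", "c": DELETED}              # "c" was deleted recently
old_sstables = [[("a", "old-a"), ("b", "old-b")], [("c", "old-c")]]
print(merging_compaction(memtable, old_sstables, major=True))
# [('a', 'old-a'), ('b', 'new-b')]  -- "c" and its tombstone are gone
```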

  36. Refinements • Locality groups: group multiple column families together, e.g., the language and checksum families in the WebPage example • Compression: clients control whether the SSTables for a locality group are compressed (at the block level) • Caching: scan cache and block cache maintained by the TS • Bloom filters: allow asking whether an SSTable might contain any data for a given row/column pair
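
A small Bloom-filter sketch for the last refinement: it answers "might this SSTable contain data for this row/column pair?" with no false negatives, so most lookups for absent keys can skip the disk read. ToyBloomFilter and its parameters are illustrative, not Bigtable's implementation.

```python
# Toy Bloom filter over row/column keys.
import hashlib

class ToyBloomFilter:
    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = [0] * n_bits

    def _positions(self, key):
        # Derive n_hashes bit positions from independent hashes of the key.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means definitely absent; True means "might be present".
        return all(self.bits[pos] for pos in self._positions(key))

bf = ToyBloomFilter()
bf.add("com.cnn.www/anchor:cnnsi.com")
print(bf.might_contain("com.cnn.www/anchor:cnnsi.com"))   # True
print(bf.might_contain("com.example/contents:"))          # almost surely False
```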

  37. Performance Evaluation • Benchmarks measure the number of 1000-byte values read/written per second per TS • N tablet servers, with N varied • TSs, master, test clients, and GFS servers all ran on the same set of machines • Row keys were partitioned into 10N equal-sized ranges

  38. Discussion • Provides an API in C++ • Achieves high availability and performance • Works well in practice: more than 60 Google applications use it • A good and flexible industrial storage system

  39. MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat OSDI 2004

  40. Motivation • Special-purpose computations at Google process large amounts of raw data: • Crawled documents • Web request logs • The outputs include: • Inverted indices • A representation of the graph structure of web documents • The number of pages crawled per host • The most frequent queries in a given day • The input data is large and the computation must be distributed

  41. Motivation (2) • Issues: • How to parallelize the computation • How to distribute the data • How to handle failures • All of these obscure the original simple computation with large amounts of complex code • Solution: MapReduce

  42. MapReduce • A programming model and associated implementation for processing and generating large data sets • Consists of Map and Reduce functions • Inspired by the map and reduce primitives of Lisp and other functional languages • Map is applied to the input to compute a set of intermediate key/value pairs • Reduce combines the derived data appropriately • Allows large computations to be parallelized easily

  43. Example • Text 1: it is what it is • Text 2: what is it • Text 3: it is a banana • Map tasks are assigned to workers • Example inspired by the original presentation by the authors at OSDI

  44. Example (2): Map • Worker 1: • (it 1), (is 1), (what 1), (it 1), (is 1) • Worker 2: • (what 1), (is 1), (it 1) • Worker 3: • (it 1), (is 1), (a 1), (banana 1)

  45. Example (3): Reduce Input • Worker 1: • (a 1) • Worker 2: • (banana 1) • Worker 3: • (is 1), (is 1), (is 1), (is 1) • Worker 4: • (it 1), (it 1), (it 1), (it 1) • Worker 5: • (what 1), (what 1)

  46. Example (4): Reduce Output • Worker 1: • (a 1) • Worker 2: • (banana 1) • Worker 3: • (is 4) • Worker 4: • (it 4) • Worker 5: • (what 2)
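
The word-count example above, written as a minimal single-process sketch in the MapReduce style: `map_fn` emits (word, 1) pairs, an in-memory sort stands in for the shuffle, and `reduce_fn` sums the counts per word. This is a toy stand-in for the distributed runtime, not the MapReduce library.

```python
# Toy word count in the MapReduce programming style.
from itertools import groupby

def map_fn(doc_name, text):
    """Map: emit (word, 1) for every word in the document."""
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum the counts emitted for one word."""
    return word, sum(counts)

documents = {"Text 1": "it is what it is",
             "Text 2": "what is it",
             "Text 3": "it is a banana"}

# Map phase: each "worker" processes one document.
intermediate = [pair for name, text in documents.items()
                for pair in map_fn(name, text)]
# Shuffle: group intermediate pairs by key, as the runtime would.
intermediate.sort()
results = [reduce_fn(word, (c for _, c in pairs))
           for word, pairs in groupby(intermediate, key=lambda kv: kv[0])]
print(results)   # [('a', 1), ('banana', 1), ('is', 4), ('it', 4), ('what', 2)]
```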

  47. Execution Overview

  48. Execution Overview (2) • Master Data Structures • For each map and reduce task: its state and the identity of its worker machine • Locations and sizes of the intermediate files produced by map tasks • Fault Tolerance • The master pings workers periodically • Map tasks on a failed worker are re-executed (their output is on local disk); completed reduce tasks are not (their output is in the global file system) • Master failure is assumed to be unlikely

  49. Execution Overview (3) • Locality • MapReduce takes the location information of the input files into account and attempts to schedule a map task on a machine near the data • Task Granularity • M and R should be much larger than the number of workers, for dynamic load balancing and faster failure recovery • R is often constrained by users because the output of each reduce task is a separate file • Backup Tasks • Some machines may take an unusually long time to finish (stragglers) • Near the end, the master schedules backup executions of the remaining in-progress tasks

  50. Refinements • Partitioning Function • Key k is assigned to reduce worker hash(k) % R • Users can specify an alternative, e.g., hash(hostname(k)) % R • Combiner Function • Data can be partially merged before being sent over the network • Skipping Bad Records • Optionally discard the few records that cause the program to crash • Local Execution • Sequential execution on a local machine for debugging • Status Information • Shows progress, errors, and output files (see the sketch below)
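
A small sketch of the partitioning-function and combiner refinements, under toy assumptions: `hostname` here is a hypothetical helper built on `urlparse`, and the combiner simply merges (key, count) pairs locally before they would be sent over the network. None of this is the MapReduce library's API.

```python
# Toy partitioner and combiner in the spirit of the refinements slide.
from collections import Counter
from urllib.parse import urlparse

R = 4   # number of reduce tasks

def default_partition(key):
    return hash(key) % R                    # default: hash(k) % R

def hostname(url_key):
    """Hypothetical helper: extract the host from a URL key."""
    return urlparse(url_key).netloc

def host_partition(key):
    # User-specified partitioner: all URLs from one host go to the same
    # reduce task, so they end up in the same output file.
    return hash(hostname(key)) % R

def combiner(pairs):
    """Merge (key, count) pairs locally before the shuffle."""
    combined = Counter()
    for key, count in pairs:
        combined[key] += count
    return list(combined.items())

print(host_partition("http://cnn.com/a") == host_partition("http://cnn.com/b"))  # True
print(combiner([("it", 1), ("it", 1), ("is", 1)]))   # [('it', 2), ('is', 1)]
```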
