Explore distributed systems, including distributed SDN controllers, GFS, Dynamo, PNUTS, data stores, and cloud computing models. Learn about availability, consistency, and techniques for managing data in distributed environments.
Detour: Distributed Systems Techniques & Case Studies I
• Distributing (logically) centralized SDN controllers
  • the NIB needs to be maintained by multiple (distributed) SDN controllers
  • multiple SDN controllers may need to concurrently read or write the same shared state
  • a distributed state management problem!
• Look at three case studies from distributed systems
  • Google File System (GFS)
  • Amazon's Dynamo
  • Yahoo!'s PNUTS
Distributed Data Stores & Consistency Models: Availability & Performance vs. Consistency Trade-offs
• Traditional (transactional) database systems:
  • query model: expressive query language, e.g., SQL
  • ACID properties: Atomicity, Consistency, Isolation, and Durability
  • efficiency: very expensive to implement at large scale!
• Many real Internet applications/systems do not require strong consistency, but do require high availability
  • Google File System: many reads, few writes (mostly appends)
  • Amazon's Dynamo: simple query model, small data objects, but needs to be "always writable" at massive scale
  • Yahoo!'s PNUTS: databases with relaxed consistency for web apps requiring more than "eventual consistency" (e.g., ordered updates)
• Implicit/explicit assumption: applications can often tolerate, or know best how to handle, inconsistencies (if they happen rarely), but care more about availability & performance
Data Center and Cloud Computing
• Data center: large server farms + data warehouses
  • not simply for web/web services
  • managed infrastructure: expensive!
• From web hosting to cloud computing
  • individual web/content providers: must provision for peak load
    • expensive, and typically resources are under-utilized
  • web hosting: a third party provides and owns the (server farm) infrastructure, hosting web services for content providers
  • "server consolidation" via virtualization
[Figure: virtualized server stack (App / Guest OS / VMM), with the App and Guest OS under client web-service control]
Cloud Computing
• Cloud computing and cloud-based services:
  • beyond web-based "information access" or "information delivery"
  • computing, storage, ...
• Cloud computing, NIST definition: "Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction."
• Models of cloud computing
  • "Infrastructure as a Service" (IaaS), e.g., Amazon EC2, Rackspace
  • "Platform as a Service" (PaaS), e.g., Microsoft Azure
  • "Software as a Service" (SaaS), e.g., Google
Data Centers: Key Challenges
• With thousands of servers within a data center:
  • how to write applications (services) for them?
  • how to allocate resources, and manage them?
  • in particular, how to ensure performance, reliability, availability, ...?
• Scale and complexity bring other key challenges
  • with thousands of machines, failures are the default case!
  • load-balancing, handling "heterogeneity," ...
• Data center (server cluster) as a "computer"
  • "super-computer" vs. "cluster computer"
  • a single "super-high-performance" and highly reliable computer vs. a "computer" built out of thousands of "cheap & unreliable" PCs
  • pros and cons?
Google Scale and Philosophy
• Lots of data
  • copies of the web, satellite data, user data, email and USENET, Subversion backing store
• Workloads are large and easily parallelizable
• No commercial system is big enough
  • couldn't afford it if there were one
  • it might not have made appropriate design choices
• But truckloads of low-cost machines
  • 450,000 machines (NYTimes estimate, June 14th 2006)
• Failures are the norm
  • even reliable systems fail at Google scale
  • software must tolerate failures
  • which machine an application is running on should not matter
• Firm believers in the "end-to-end" argument
• Care about perf/$, not absolute machine perf
Typical Cluster at Google
[Figure: each machine runs Linux with a scheduler slave and a GFS chunkserver, plus user tasks and BigTable servers; cluster-wide services include the cluster scheduling master, the lock service, the GFS master, and the BigTable master]
Google: System Building Blocks
• Google File System (GFS): raw storage
• (Cluster) Scheduler: schedules jobs onto machines
• Lock service: distributed lock manager
  • can also reliably hold tiny files (100s of bytes) with high availability
• Bigtable: a multi-dimensional database
• MapReduce: simplified large-scale data processing
• ...
Google File System: Key Design Considerations
• Component failures are the norm
  • hardware component failures, software bugs, human errors, power supply issues, ...
  • solutions: built-in mechanisms for monitoring, error detection, fault tolerance, automatic recovery
• Files are huge by traditional standards
  • multi-GB files are common, billions of objects
  • most writes (modifications or "mutations") are appends
  • two types of reads: a large number of "streaming" (i.e., sequential) reads, with a small number of "random" reads
• High concurrency (multiple "producers/consumers" on a file)
  • atomicity with minimal synchronization
• Sustained bandwidth is more important than latency
GFS Architectural Design
• A GFS cluster:
  • a single master + multiple chunkservers per master
  • running on commodity Linux machines
• A file: a sequence of fixed-size chunks (64 MB each)
  • each chunk labeled with a 64-bit globally unique ID
  • stored at chunkservers (as "native" Linux files, on local disk)
  • each chunk mirrored across (by default 3) chunkservers
• Master server: maintains all metadata
  • namespace, access control, file-to-chunk mappings, garbage collection, chunk migration
  • why only a single master? (with read-only shadow masters)
    • simple, and it only answers chunk-location queries from clients!
• Chunkservers ("slaves" or "workers"):
  • interact directly with clients, perform reads/writes, ...
GFS Architecture: Illustration
• GFS clients
  • consult the master for metadata
  • typically ask for multiple chunk locations per request
  • access data directly from chunkservers
• Separation of control and data flows
Chunk Size and Metadata
• Chunk size: 64 MB
  • fewer chunk-location requests to the master
  • a client can perform many operations on a chunk
    • reduces overhead to access a chunk
    • can establish a persistent TCP connection to a chunkserver
  • fewer metadata entries
    • metadata can be kept in memory (at the master)
    • in-memory data structures allow fast periodic scanning
  • some potential problems with fragmentation
• Metadata (sketched below)
  • file and chunk namespaces (files and chunk identifiers)
  • file-to-chunk mappings
  • locations of each chunk's replicas
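A minimal sketch (not Google's code) of the in-memory metadata a GFS-style master might keep; the class and field names are illustrative assumptions, not the real GFS data structures.

```python
CHUNK_SIZE = 64 * 2**20  # 64 MB chunks

class MasterMetadata:
    def __init__(self):
        self.namespace = {}        # full path name -> file metadata ("flat names", prefix-compressed in GFS)
        self.file_chunks = {}      # full path name -> list of 64-bit chunk handles, one per 64 MB region
        self.chunk_locations = {}  # chunk handle -> set of chunkserver addresses (rebuilt from heartbeats)
        self.chunk_version = {}    # chunk handle -> version number, used to detect stale replicas

    def chunk_for_offset(self, path, offset):
        """Map a byte offset within a file to the chunk handle that holds it."""
        index = offset // CHUNK_SIZE
        return self.file_chunks[path][index]
```

The small per-chunk footprint (a handle, a version, and a few locations) is what lets all metadata stay in the master's memory.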
Chunk Locations and Logs
• Chunk locations:
  • the master does not keep a persistent record of chunk locations
  • it polls chunkservers at startup, and uses heartbeat messages to monitor chunkservers: simplicity!
  • because of chunkserver failures, it is hard to keep a persistent record of chunk locations
  • on-demand approach vs. coordination
    • on-demand wins when changes (failures) are frequent
• Operation log
  • maintains a historical record of critical metadata changes
    • namespace and file-to-chunk mappings
  • for reliability and consistency, the operation log is replicated on multiple remote machines ("shadow masters")
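To make the "on-demand" idea concrete, here is a hedged sketch of a master that rebuilds chunk locations purely from chunkserver reports (startup polls or heartbeats) and expires entries for silent servers; the method names and timeout policy are assumptions, not the real GFS protocol.

```python
import time

class LocationTracker:
    def __init__(self, heartbeat_timeout=60):
        self.locations = {}   # chunk handle -> set of chunkserver ids
        self.last_seen = {}   # chunkserver id -> timestamp of last heartbeat
        self.heartbeat_timeout = heartbeat_timeout

    def on_heartbeat(self, server_id, reported_chunks):
        """A chunkserver reports the chunks it currently stores."""
        self.last_seen[server_id] = time.time()
        for handle in reported_chunks:
            self.locations.setdefault(handle, set()).add(server_id)

    def expire_dead_servers(self):
        """Drop location entries for servers that have stopped heartbeating."""
        now = time.time()
        dead = {s for s, t in self.last_seen.items() if now - t > self.heartbeat_timeout}
        for servers in self.locations.values():
            servers -= dead
```

Nothing here needs to be persisted: after a master restart, the same state is reconstructed from the next round of reports.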
Clients and APIs
• GFS is not transparent to clients
  • requires clients to perform certain "consistency" verification (using chunk ID & version #), make snapshots (if needed), ...
• APIs:
  • open, delete, read, write (as expected)
  • append: at least once, possibly with gaps and/or inconsistencies among clients
  • snapshot: quickly create a copy of a file
• Separation of data and control:
  • clients issue control (metadata) requests to the master server
  • clients issue data requests directly to chunkservers
  • clients cache metadata, but do no caching of data
    • no consistency difficulties among clients
    • streaming reads (read once) and append writes (write once) don't benefit much from caching at the client
System Interaction: Read
• Client sends the master: read(file name, chunk index)
• Master's reply: chunk ID, chunk version #, locations of replicas
• Client sends the "closest" chunkserver with a replica: read(chunk ID, byte range)
  • "closest" is determined by IP address on a simple rack-based network topology
• Chunkserver replies with the data (see the client-side sketch below)
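A hedged, client-side sketch of this read path. The `master_rpc` and `chunkserver_rpc` stubs and `pick_closest` heuristic are hypothetical placeholders, not real GFS APIs; the point is that metadata comes from the master once, and the data itself never flows through it.

```python
CHUNK_SIZE = 64 * 2**20

def pick_closest(replicas):
    # Placeholder for GFS's rack-aware, IP-based choice; here we just take the first replica.
    return replicas[0]

def gfs_read(master_rpc, chunkserver_rpc, path, offset, length):
    chunk_index = offset // CHUNK_SIZE
    # 1. Ask the master for the chunk's ID, version, and replica locations (clients cache this).
    chunk_id, version, replicas = master_rpc.lookup(path, chunk_index)
    # 2. Pick the "closest" replica.
    server = pick_closest(replicas)
    # 3. Read the byte range directly from that chunkserver, passing the version so a
    #    stale replica can be detected.
    start = offset % CHUNK_SIZE
    return chunkserver_rpc.read(server, chunk_id, version, (start, start + length))
```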
System Interactions: Write and Record Append
• Write and record append (atomic)
  • slightly different semantics: record append is "atomic"
• The master grants a chunk lease to one chunkserver (the primary) and replies to the client
• The client first pushes data to all chunkservers
  • pushed linearly: each replica forwards the data as it receives it
  • pipelined transfer: 13 MB/second with a 100 Mbps network
• The client then issues a write/append request to the primary chunkserver
• The primary chunkserver determines the order of updates to all replicas
  • for record append: the primary chunkserver checks whether the record append would exceed the maximum chunk size
  • if yes, it pads the chunk (and asks the secondaries to do the same), then asks the client to retry the append on the next chunk (sketched below)
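A rough sketch of the record-append decision at the primary, under the assumptions above (64 MB chunks, data already pushed, primary assigns the offset). The `chunk`, `primary`, and `secondaries` objects are hypothetical stand-ins, not the actual GFS implementation.

```python
CHUNK_SIZE = 64 * 2**20

def record_append(primary, secondaries, chunk, record):
    # Data has already been pushed linearly to all replicas; only control messages flow here.
    if chunk.length + len(record) > CHUNK_SIZE:
        # Would overflow: pad this chunk on all replicas, then tell the client to
        # retry the append on the file's next chunk.
        primary.pad(chunk)
        for s in secondaries:
            s.pad(chunk)
        return "RETRY_NEXT_CHUNK"
    offset = chunk.length                 # the primary picks the offset (the serial order)
    primary.apply(chunk, offset, record)
    for s in secondaries:
        s.apply(chunk, offset, record)    # all replicas apply the record at the same offset
    return offset                         # at-least-once: a failed attempt may be retried, duplicating the record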
Leases and Mutation Order
• Lease:
  • 60-second timeout; can be extended indefinitely
  • extension requests are piggybacked on heartbeat messages
  • after a timeout expires, the master can grant new leases
• Use leases to maintain a consistent mutation order across replicas (sketched below)
  • the master grants a lease to one of the replicas -> the primary
  • the primary picks a serial order for all mutations
  • the other replicas follow the primary's order
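A minimal sketch of lease bookkeeping at the master, assuming the 60-second timeout and heartbeat-piggybacked extensions described above; the choice of which replica becomes primary is an illustrative policy, not the real one.

```python
import time

LEASE_SECONDS = 60

class LeaseTable:
    def __init__(self):
        self.leases = {}   # chunk handle -> (primary chunkserver, expiry time)

    def grant(self, chunk, replicas):
        primary, expiry = self.leases.get(chunk, (None, 0))
        if primary is not None and time.time() < expiry:
            return primary                  # an unexpired lease is still valid
        primary = replicas[0]               # pick one replica as primary (illustrative policy)
        self.leases[chunk] = (primary, time.time() + LEASE_SECONDS)
        return primary

    def extend(self, chunk, primary):
        """Called when a heartbeat from the current primary piggybacks an extension request."""
        if chunk in self.leases and self.leases[chunk][0] == primary:
            self.leases[chunk] = (primary, time.time() + LEASE_SECONDS)
```

Because only an unexpired lease holder may order mutations, and the master waits for expiry before re-granting, there is at most one primary per chunk at a time.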
Consistency Model
• Changes to the namespace (i.e., metadata) are atomic
  • done by the single master server!
  • the master uses its log to define a global total order of namespace-changing operations
• Relaxed consistency for file data
  • concurrent changes are consistent but "undefined"
  • defined: after a data mutation, the file region is consistent, and all clients see the entire mutation
  • an append is atomically committed at least once
    • occasional duplications
• All changes to a chunk are applied in the same order to all replicas
• Version numbers are used to detect missed updates (stale replicas)
Master Namespace Management & Logs
• Namespace: files and their chunks
  • metadata maintained as "flat names", no hard/symbolic links
  • full path name to metadata mapping, with prefix compression
• Each node in the namespace has an associated read-write lock (-> a total global order, no deadlock)
  • concurrent operations can be properly serialized by this locking mechanism
• Metadata updates are logged
  • logs are replicated on remote machines
  • global snapshots (checkpoints) are taken to truncate the logs (checkpoints can be created while updates keep arriving)
• Recovery: latest checkpoint + subsequent log files (sketched below)
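An illustrative sketch of checkpoint-plus-log recovery for the master's namespace metadata; the file formats, JSON encoding, and operation types are assumptions made for the example, not GFS's actual on-disk formats.

```python
import json, os

def apply_namespace_op(state, op):
    if op["type"] == "create":
        state[op["path"]] = op["meta"]
    elif op["type"] == "delete":
        state.pop(op["path"], None)

def recover_master_state(checkpoint_path, log_path):
    # 1. Load the latest checkpoint (a compact snapshot of the namespace).
    state = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    # 2. Replay the operation-log records written after the checkpoint, in order,
    #    re-deriving the global total order of namespace changes.
    if os.path.exists(log_path):
        with open(log_path) as f:
            for line in f:
                apply_namespace_op(state, json.loads(line))
    return state
```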
Replica Placement
• Goals:
  • maximize data reliability and availability
  • maximize network bandwidth
• Need to spread chunk replicas across machines and racks
• Higher priority given to re-replicating chunks with lower replication factors
• Limited resources spent on replication
Other Operations
• Locking operations
  • one lock per path; a directory can be modified concurrently
  • to access /d1/d2/leaf, need to lock /d1, /d1/d2, and /d1/d2/leaf
  • each thread acquires read locks on the directories and a write lock on the file (sketched below)
  • totally ordered locking to prevent deadlocks
• Garbage collection:
  • simpler than eager deletion, due to
    • unfinished replica creation, lost deletion messages
  • deleted files are hidden for three days, then garbage collected
  • combined with other background ops (e.g., taking snapshots)
  • a safety net against accidents
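A small sketch of the locking rule above: read locks on every ancestor directory, a write lock on the leaf, all acquired in one fixed (total) order so two threads can never deadlock. The lexicographic ordering here is an illustrative choice.

```python
def locks_for_write(path):
    """E.g., '/d1/d2/leaf' -> read locks on '/d1' and '/d1/d2', write lock on '/d1/d2/leaf'."""
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    read_locks = sorted(ancestors)   # acquire in a fixed global order (here: lexicographic) to avoid deadlock
    write_lock = path
    return read_locks, write_lock

# Usage: two concurrent creations under /home can both proceed, since each only
# read-locks /home and write-locks its own new file.
print(locks_for_write("/d1/d2/leaf"))
```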
Fault Tolerance and Diagnosis
• Fast recovery
  • the master and chunkservers are designed to restore their state and start in seconds, regardless of termination conditions
• Chunk replication
• Data integrity
  • a chunk is divided into 64-KB blocks, each with its own checksum
  • verified at read and write times (sketched below)
  • also background scans for rarely used data
• Master replication
  • shadow masters provide read-only access when the primary master is down
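A sketch of block-level integrity checking: every 64-KB block carries its own checksum, verified before data is returned to a client. CRC32 is used here only as a stand-in checksum; the exact algorithm is an assumption.

```python
import zlib

BLOCK = 64 * 1024

def checksum_blocks(data):
    """Compute one checksum per 64-KB block (stored alongside the chunk)."""
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def verify_read(data, stored_checksums):
    """Verify each block before returning data; a mismatch means the replica is corrupt."""
    for i, expected in enumerate(stored_checksums):
        block = data[i * BLOCK:(i + 1) * BLOCK]
        if zlib.crc32(block) != expected:
            raise IOError(f"corrupted block {i}: report to master and re-replicate from a good copy")
    return data
```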
GFS: Summary
• GFS is a distributed file system that supports large-scale data-processing workloads on commodity hardware
• GFS occupies a different point in the design space
  • component failures as the norm
  • optimized for huge files
• Success: used actively by Google to support its search service and other applications
  • but performance may not be good for all apps
    • assumes a read-once, write-once workload (no client caching!)
• GFS provides fault tolerance
  • replicating data (via chunk replication), fast and automatic recovery
• GFS has a simple, centralized master that does not become a bottleneck
• Semantics are not transparent to apps ("end-to-end" principle?)
  • apps must verify file contents to cope with inconsistent regions and repeated appends (at-least-once semantics)
Highlights of Dynamo
• Dynamo: a key-value data store at massive scale
  • used, e.g., to maintain users' shopping-cart info
• Key design goals: highly available and resilient at massive scale, while also meeting SLAs!
  • i.e., all customers have a good experience, not simply most!
• Target workload & usage scenarios:
  • simple read/write operations on a (relatively small) data item uniquely identified by a key; usually less than 1 MB
  • services must be able to configure Dynamo to consistently achieve their latency and throughput requirements
  • used by internal services: non-hostile environments
• System interface: get(key), put(key, context, object)
Dynamo: Key Partitioning, Replication & Sloppy Quorum for Read/Write
• Number of key replicas >= N (here N = 3)
• Each key is associated with a preference list of N ranked (virtual) nodes
• Sloppy quorum: R + W > N (sketched below)
  • each read is handled by a (read) coordinator -- any node in the ring is fine
  • each write is handled by a (write) coordinator -- the highest-ranked available node in the preference list
  • read via get(): read from all N replicas; success once the coordinator receives R responses
  • write via put(): write to all N replicas; success once the coordinator receives W-1 "write OK" acks (in addition to its own local write)
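A hedged sketch of quorum-style reads and writes with N replicas, W write acks, and R read responses (R + W > N). The node objects and their put/get methods are hypothetical; a real Dynamo coordinator also performs hinted handoff and vector-clock reconciliation, which are omitted here.

```python
def quorum_write(preference_list, key, value, n=3, w=2):
    acks = 0
    for node in preference_list[:n]:       # the top-N healthy nodes for this key
        try:
            node.put(key, value)
            acks += 1
        except ConnectionError:
            continue                        # "sloppy": skip failed nodes (real Dynamo uses hinted handoff)
    return acks >= w                        # success once W replicas acknowledge

def quorum_read(preference_list, key, n=3, r=2):
    versions = []
    for node in preference_list[:n]:
        try:
            versions.append(node.get(key))
        except ConnectionError:
            continue
        if len(versions) >= r:
            break
    return versions                         # caller reconciles divergent versions using vector clocks
```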
Dynamo: Vector Clocks
[Figure: vector-clock version evolution of an object over time]
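A minimal vector-clock sketch to accompany the figure: each write coordinated by a node bumps that node's counter, one version "descends from" another iff it dominates it component-wise, and incomparable versions are conflicts the client must reconcile. This is an illustrative implementation, not Dynamo's code.

```python
class VectorClock:
    def __init__(self, counters=None):
        self.counters = dict(counters or {})   # node id -> counter

    def increment(self, node):
        """Called by the coordinator node when it handles a write."""
        self.counters[node] = self.counters.get(node, 0) + 1

    def descends_from(self, other):
        """True if self has seen every update other has (no conflict)."""
        return all(self.counters.get(n, 0) >= c for n, c in other.counters.items())

    def merge(self, other):
        """Clock for the reconciled value after the client resolves a conflict (e.g., merging carts)."""
        merged = dict(self.counters)
        for n, c in other.counters.items():
            merged[n] = max(merged.get(n, 0), c)
        return VectorClock(merged)
```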
Highlights of PNUTS
• PNUTS: a massively parallel and geographically distributed database system for Yahoo!'s web apps
  • data storage organized as hashed or ordered tables
  • a hosted, centrally managed, geographically distributed service with automated load-balancing & fail-over
• Target workload
  • managing session state, content metadata, user-generated content such as tags & comments, etc., for web applications
• Key design goals:
  • scalability
  • response time and geographic scope
  • high availability and fault tolerance
  • relaxed consistency guarantees
    • more than the eventual consistency supported by GFS & Dynamo
PNUTS Overview
• Data model and features
  • exposes a simple relational model to users, & supports single-table scans with predicates
  • includes: scatter-gather ops, asynchronous notification, bulk loading
• Fault tolerance
  • redundancy at multiple levels: data, metadata, serving components, etc.
  • leverages the consistency model to support highly available reads & writes even after a failure or partition
• Pub-sub message system: topic-based YMB (message broker)
• Record-level mastering: writing synchronously to all copies would be too expensive!
  • make all high-latency ops asynchronous: allow local writes, and use record-level mastering to serve all requests locally
• Hosting: a hosted service shared by many applications
PNUTS Data & Query Model
• A simplified relational data model
  • data organized into tables of records with attributes
  • in addition to typical data types, a "blob" data type is allowed
  • schemas are flexible:
    • new attributes can be added at any time without halting query or update activity
    • records are not required to have values for all attributes
  • each record has a primary key: delete(key)/update(key)
• Query language: PNUTS supports
  • selection and projection from a single table
  • both hashed tables (for point access) and ordered tables (for scans)
  • get(key), multi-get(list-of-keys), scan(range[, predicate]) (see the toy sketch below)
  • no support for "complex" queries, e.g., "join" or "group-by"
  • in the near future, interfaces to Hadoop, Pig Latin, ...
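A toy, in-memory stand-in for this query model (point gets, multi-gets, and predicate scans over an ordered table with a flexible schema). It is purely illustrative and in no way Yahoo!'s implementation.

```python
import bisect

class OrderedTable:
    def __init__(self):
        self.keys, self.rows = [], {}      # sorted primary keys + key -> record (dict of attributes)

    def put(self, key, record):
        if key not in self.rows:
            bisect.insort(self.keys, key)
        self.rows[key] = record            # flexible schema: records may carry different attributes

    def get(self, key):
        return self.rows.get(key)

    def multi_get(self, keys):
        return [self.rows.get(k) for k in keys]

    def scan(self, lo, hi, predicate=lambda r: True):
        """Range scan over [lo, hi] with an optional selection predicate."""
        i, j = bisect.bisect_left(self.keys, lo), bisect.bisect_right(self.keys, hi)
        return [self.rows[k] for k in self.keys[i:j] if predicate(self.rows[k])]
```

A hashed table would look the same except that keys are placed by hash value, so only point access (no ordered scans) is efficient.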
PNUTS Consistency Model
• Applications typically manipulate one record at a time
• PNUTS provides per-record timeline consistency
  • all replicas of a given record apply all updates to the record in the same order (one replica is designated the record's "master")
  • each record version is identified as v.generation.version
• A range of APIs with varying levels of consistency guarantees (sketched below):
  • reads: read-any, read-critical(required-version), read-latest
  • writes: write, test-and-set-write(required-version)
• Future: (i) bundled updates; (ii) "more" relaxed consistency to cope with major (regional data center) failures
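A sketch of per-record timeline consistency at the record's master: every write bumps the record's version, and test-and-set-write applies only if the caller has seen the latest version. A single integer stands in for v.generation.version; names and structure are illustrative assumptions, not the PNUTS implementation.

```python
class RecordMaster:
    def __init__(self):
        self.value, self.version = None, 0   # version stands in for v.generation.version

    def write(self, value):
        self.version += 1                    # the master assigns the single update order (the "timeline")
        self.value = value
        return self.version

    def test_and_set_write(self, required_version, value):
        if self.version != required_version:
            return None                      # someone else updated the record first; caller must retry
        return self.write(value)

    def read_critical(self, required_version):
        """On a replica: return the value only if it is at least as new as required_version."""
        return self.value if self.version >= required_version else None
```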
PNUTS System Architecture: Interval Mappings
• Tables are partitioned into tablets, each tablet stored on one server per region
  • each tablet: ~100s of MBs to a few GBs
• Planned scale: 1,000 servers per region, each holding 1,000 tablets
  • with ~100-byte keys, the interval mapping table fits in 100s of MBs of RAM
  • with tablets of ~500 MB, this supports a database of ~500 TB (lookup sketched below)
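A sketch of the interval-mapping lookup a PNUTS router might perform: a sorted list of tablet boundary keys maps any record key to the tablet (and hence server) that owns it. The back-of-envelope numbers above follow from roughly 1M tablets (1,000 servers x 1,000 tablets) times ~100-byte keys for the mapping, and ~500 MB per tablet for the data. The class and its fields are illustrative assumptions.

```python
import bisect

class IntervalMapping:
    def __init__(self, boundaries, tablets):
        self.boundaries = boundaries   # sorted split keys, e.g. ["grape", "lemon", "peach"]
        self.tablets = tablets         # len(tablets) == len(boundaries) + 1

    def tablet_for(self, key):
        """Return the tablet whose key interval contains this key."""
        return self.tablets[bisect.bisect_right(self.boundaries, key)]

# Usage: mapping.tablet_for("mango") returns the tablet covering keys in ("lemon", "peach"].
# A hashed table uses the same structure over hash-value ranges instead of key ranges.
```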
Interval Mappings
[Figure: interval mappings for an ordered table (split by key ranges) and a hashed table (split by hash-value ranges)]
PNUTS: Other Features
• Yahoo! Message Broker (YMB)
  • topic-based pub/sub system
  • together with PNUTS: the Yahoo! Sherpa data service platform
• YMB and wide-area data replication
  • data updates are considered "committed" once they are published to YMB
  • YMB asynchronously propagates each update to the different regions and applies it to all replicas
  • YMB provides "logging" and guarantees that all published messages will be delivered to all subscribers
  • YMB logs are purged only after PNUTS verifies that the update has been applied to all replicas
• Consistency via YMB and mastership
  • YMB provides partial ordering of published messages
  • per-record mastering: updates are directed to the record's master first, then propagated to the other replicas by publishing to YMB
• Recovery: can survive storage-unit failures; tablet boundaries are kept synchronized across tablet replicas; a lost tablet is recovered by copying from a remote replica