
Distributed Storage: Key-Value Stores and Security

This lecture covers distributed storage, key-value stores, and security in operating systems and systems programming. It reviews the Two-Phase Commit algorithm, covers Remote Procedure Call (RPC) implementation, distributed file systems (NFS and AFS), and distributed key-value stores.

Presentation Transcript


  1. CS162 Operating Systems and Systems Programming, Lecture 23: Distributed Storage, Key-Value Stores, Security. April 27th, 2015. Prof. John Kubiatowicz. http://cs162.eecs.Berkeley.edu

  2. Recall: Two-Phase Commit (2PC)
  • Distributed transaction: two or more machines agree to do something, or not do it, atomically
  • Two-Phase Commit:
    • One coordinator
    • N workers (replicas)
  • High-level algorithm description:
    • Coordinator asks all workers if they can commit
    • If all workers reply "VOTE-COMMIT", then coordinator broadcasts "GLOBAL-COMMIT"; otherwise coordinator broadcasts "GLOBAL-ABORT"
    • Workers obey the GLOBAL messages
  • Use a persistent, stable log on each machine to keep track of what you are doing
    • If a machine crashes, when it wakes up it first checks its log to recover the state of the world at the time of the crash
  (A minimal coordinator sketch follows below.)
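To make the high-level description concrete, here is a minimal Python sketch of the coordinator side only; the workers' prepare()/commit()/abort() methods and the write_log() stand-in are assumptions for illustration, and real 2PC also needs timeouts and crash recovery on both sides.

```python
def write_log(entry):
    # Stand-in for appending to a persistent, stable log (e.g. an fsync'd file).
    print("LOG:", entry)

def two_phase_commit(workers, transaction):
    # Phase 1: ask every worker whether it can commit.
    votes = [w.prepare(transaction) for w in workers]   # each replies VOTE-COMMIT / VOTE-ABORT

    # Phase 2: log the global decision first, so a crashed coordinator can
    # finish the protocol from its log after restart, then broadcast it.
    if all(v == "VOTE-COMMIT" for v in votes):
        write_log(("GLOBAL-COMMIT", transaction))
        for w in workers:
            w.commit(transaction)
        return True
    else:
        write_log(("GLOBAL-ABORT", transaction))
        for w in workers:
            w.abort(transaction)
        return False
```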

  3. Brief Aside: Remote Procedure Call
  • Raw messaging is a bit too low-level for programming
    • Must wrap up information into a message at the source
    • Must decide what to do with the message at the destination
    • May need to sit and wait for multiple messages to arrive
  • Better option: Remote Procedure Call (RPC)
    • Calls a procedure on a remote machine
    • Client calls: remoteFileSystemRead("rutabaga");
    • Translated automatically into a call on the server: fileSysRead("rutabaga");
  • Implementation:
    • Request-response message passing (under the covers!)
    • "Stub" provides glue on client/server
    • Client stub is responsible for "marshalling" arguments and "unmarshalling" the return values
    • Server-side stub is responsible for "unmarshalling" arguments and "marshalling" the return values
  • Marshalling involves (depending on the system):
    • Converting values to a canonical form, serializing objects, copying arguments passed by reference, etc.
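As a rough illustration of the client-stub glue described above (not the code of any particular RPC system), here is a minimal Python sketch that marshals arguments as JSON and ships them over a socket; the endpoint and procedure names are made up, and real systems generate such stubs from an IDL.

```python
import json
import socket

# Hypothetical endpoint for the remote file server (illustration only).
SERVER_ADDR = ("fileserver.example.com", 9000)

def remote_call(procedure, *args):
    """Client stub: marshal the call, send it, wait for and unmarshal the reply."""
    request = json.dumps({"proc": procedure, "args": args}).encode()
    with socket.create_connection(SERVER_ADDR) as sock:
        sock.sendall(request)            # request message carries proc name + arguments
        sock.shutdown(socket.SHUT_WR)    # tell the server we are done sending
        reply = sock.makefile("rb").read()
    return json.loads(reply)["result"]   # unmarshal the return value

def remote_file_system_read(filename):
    # Looks like a local call to the programmer; actually a network round trip.
    return remote_call("fileSysRead", filename)
```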

  4. RPC Information Flow
  (Diagram: on Machine A, the Client (caller) makes a call into the Client Stub, which bundles the arguments and sends a packet via the Packet Handler over the Network to mbox1 on Machine B; there the Packet Handler receives it, the Server Stub unbundles the arguments and calls the Server (callee). Return values travel back the same way: the Server Stub bundles them, they are sent to mbox2 on Machine A, and the Client Stub unbundles them and returns to the Client.)

  5. RPC Details
  • Equivalence with regular procedure call:
    • Parameters → request message
    • Result → reply message
    • Name of procedure: passed in request message
    • Return address: mbox2 (client return mailbox)
  • Stub generator: compiler that generates stubs
    • Input: interface definitions in an "interface definition language" (IDL)
      • Contains, among other things, types of arguments/return
    • Output: stub code in the appropriate source language
      • Code for client to pack message, send it off, wait for result, unpack result and return to caller
      • Code for server to unpack message, call procedure, pack results, send them off
  • Cross-platform issues:
    • What if client/server machines are different architectures or in different languages?
      • Convert everything to/from some canonical form
      • Tag every item with an indication of how it is encoded (avoids unnecessary conversions)

  6. RPC Details (continued)
  • How does the client know which mbox to send to?
    • Need to translate the name of the remote service into a network endpoint (remote machine, port, possibly other info)
    • Binding: the process of converting a user-visible name into a network endpoint
      • This is another word for "naming" at the network level
      • Static: fixed at compile time
      • Dynamic: performed at runtime
  • Dynamic binding
    • Most RPC systems use dynamic binding via a name service
      • Name service provides dynamic translation of service → mbox
    • Why dynamic binding?
      • Access control: check who is permitted to access the service
      • Fail-over: if a server fails, use a different one
  • What if there are multiple servers?
    • Could give flexibility at binding time
      • Choose an unloaded server for each new client
    • Could provide the same mbox (router-level redirect)
      • Choose an unloaded server for each new request
      • Only works if no state is carried from one call to the next
  • What if multiple clients?
    • Pass a pointer to a client-specific return mbox in the request
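A toy sketch of dynamic binding, assuming a simple in-process registry in place of a real name service; the service name and endpoints are invented for illustration.

```python
# Toy name service for dynamic binding (illustrative only; real RPC systems
# use a separate directory service such as a portmapper or registry).
SERVICE_REGISTRY = {
    # service name -> (host, port); these entries are made-up examples
    "fileSys": ("fs1.example.com", 9000),
}

def bind(service_name):
    """Dynamic binding: translate a user-visible service name into a network endpoint at runtime."""
    try:
        return SERVICE_REGISTRY[service_name]
    except KeyError:
        raise LookupError(f"no server currently registered for {service_name!r}")

# A client stub would call bind("fileSys") before a session (or cache the result);
# doing the lookup at runtime is what enables fail-over and access-control checks.
```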

  7. Problems with RPC
  • Non-atomic failures
    • Different failure modes in a distributed system than on a single machine
    • Consider many different types of failures:
      • User-level bug causes address space to crash
      • Machine failure, kernel bug causes all processes on the same machine to fail
      • Some machine is compromised by a malicious party
    • Before RPC: whole system would crash/die
    • After RPC: one machine crashes or is compromised while others keep working
    • Can easily result in an inconsistent view of the world
      • Did my cached data get written back or not?
      • Did the server do what I requested or not?
    • Answer? Distributed transactions / Byzantine commit
  • Performance
    • Cost of procedure call ≪ same-machine RPC ≪ network RPC
    • Means programmers must be aware that RPC is not free
      • Caching can help, but may make failure handling complex

  8. Cross-Domain Communication / Location Transparency
  • How do address spaces communicate with one another?
    • Shared memory with semaphores, monitors, etc.
    • File system
    • Pipes (1-way communication)
    • "Remote" procedure call (2-way communication)
  • RPCs can be used to communicate between address spaces on different machines or on the same machine
    • Services can be run wherever it's most appropriate
    • Access to local and remote services looks the same
  • Examples of modern RPC systems:
    • CORBA (Common Object Request Broker Architecture)
    • DCOM (Distributed COM)
    • RMI (Java Remote Method Invocation)

  9. Microkernel Operating Systems
  (Diagram: a monolithic structure, with apps running over one kernel containing the file system, windowing, VM, networking, and threads, contrasted with a microkernel structure, where the file system, windowing, networking, and threads run as servers in separate address spaces and communicate over RPC above a small microkernel.)
  • Example: split the kernel into application-level servers
    • The file system looks remote, even though it is on the same machine
  • Why split the OS into separate domains?
    • Fault isolation: bugs are more isolated (build a firewall)
    • Enforces modularity: allows incremental upgrades of pieces of software (client or server)
    • Location transparent: service can be local or remote
      • For example, in the X windowing system: each X client can be on a separate machine from the X server; neither has to run on the machine with the frame buffer

  10. Network-Attached Storage and the CAP Theorem
  • Consistency:
    • Changes appear to everyone in the same serial order
  • Availability:
    • Can get a result at any time
  • Partition-tolerance:
    • System continues to work even when the network becomes partitioned
  • Consistency, Availability, Partition-Tolerance (CAP) Theorem: cannot have all three at the same time
    • Otherwise known as "Brewer's Theorem"

  11. Distributed File Systems
  (Diagram: a client reads file data over the network from a server; example mount commands: mount kubi:/jane, mount coeus:/sue, mount kubi:/prog.)
  • Distributed file system:
    • Transparent access to files stored on a remote disk
  • Naming choices (always an issue):
    • Hostname:localname: name files explicitly
      • No location or migration transparency
    • Mounting of remote file systems
      • System manager mounts a remote file system by giving a name and a local mount point
      • Transparent to user: all reads and writes look like local reads and writes to the user, e.g. /users/sue/foo → /sue/foo on the server
    • A single, global name space: every file in the world has a unique name
      • Location transparency: servers can change and files can move without involving the user

  12. Simple Distributed File System
  (Diagram: two clients send Read (RPC) and Write (RPC) requests to the server, which returns Data and ACKs; the only cache is at the server.)
  • Remote disk: reads and writes forwarded to the server
    • Use Remote Procedure Calls (RPC) to translate file system calls into remote requests
    • No local caching, though there can be caching at the server side
  • Advantage: server provides a completely consistent view of the file system to multiple clients
  • Problems? Performance!
    • Going over the network is slower than going to local memory
    • Lots of network traffic / not well pipelined
    • Server can be a bottleneck

  13. Use of Caching to Reduce Network Load
  (Diagram: clients cache F1:V1 after read(f1); one client writes F1:V2, which reaches the server, but another client's cache still returns V1; a crash can lose cached data that was never committed to the server.)
  • Idea: use caching to reduce network load
    • In practice: use a buffer cache at both source and destination
  • Advantage: if open/read/write/close can be done locally, don't need to do any network traffic... fast!
  • Problems:
    • Failure:
      • Client caches have data not committed at the server
    • Cache consistency!
      • Client caches not consistent with the server or with each other

  14. Failures
  • What if the server crashes? Can the client wait until the server comes back up and continue as before?
    • Any data in server memory but not on disk can be lost
    • Shared state across RPC: what if the server crashes after a seek? Then, when the client does "read", it will fail
    • Message retries: suppose the server crashes after it does UNIX "rm foo", but before the acknowledgment?
      • The message system will retry: send it again
      • How does it know not to delete it again? (Could solve with a two-phase commit protocol, but NFS takes a more ad hoc approach)
  • Stateless protocol: a protocol in which all information required to process a request is passed with the request
    • Server keeps no state about the client, except as hints to help improve performance (e.g. a cache)
    • Thus, if the server crashes and is restarted, requests can continue where they left off (in many cases)
  • What if the client crashes?
    • Might lose modified data in the client cache

  15. Administrivia
  • Midterm 2 grading
    • In progress. Hopefully done by the end of the week.
    • Solutions have been posted
  • Final Exam
    • Friday, May 15th, 2015
    • 3-6 PM, Wheeler Auditorium
    • All material from the course
      • With slightly more focus on the second half, but you are still responsible for all the material
    • Two sheets of notes, both sides
    • Will need a dumb calculator
  • Should be working on Project 3!
    • Checkpoint 1 this Wednesday

  16. Administrivia (con't)
  • Final lecture topics submitted to me:
    • Real-time operating systems
    • Peer-to-peer systems and/or distributed systems
    • OS trends in the mobile phone industry (Android, etc.)
      • Differences from traditional OSes?
    • GPU and many-core programming (and/or OSes?)
    • Virtual machines and/or trusted hardware for security
    • Systems programming for non-standard computer systems
      • i.e. quantum computers, biological computers, ...
    • Net neutrality and/or making the Internet faster
    • Mesh networks
    • Device drivers
    • A couple of votes for dragons...
  • This is a lot of topics...

  17. Network File System (NFS)
  • Three layers for the NFS system:
    • UNIX file-system interface: open, read, write, close calls + file descriptors
    • VFS layer: distinguishes local from remote files
      • Calls the NFS protocol procedures for remote requests
    • NFS service layer: bottom layer of the architecture
      • Implements the NFS protocol
  • NFS protocol: RPC for file operations on the server
    • Reading/searching a directory
    • Manipulating links and directories
    • Accessing file attributes / reading and writing files
  • Write-through caching: modified data committed to the server's disk before results are returned to the client
    • Lose some of the advantages of caching
    • Time to perform write() can be long
    • Need some mechanism for readers to eventually notice changes! (more on this later)

  18. NFS Continued
  • NFS servers are stateless; each request provides all arguments required for execution
    • E.g. reads include information for the entire operation, such as ReadAt(inumber, position), not Read(openfile)
    • No need to perform a network open() or close() on a file: each operation stands on its own
  • Idempotent: performing a request multiple times has the same effect as performing it exactly once
    • Example: server crashes between disk I/O and message send; client re-sends the read; server does the operation again
    • Example: read and write file blocks: just re-read or re-write the file block, no side effects
    • Example: what about "remove"? NFS does the operation twice and the second time returns an advisory error
  • Failure model: transparent to the client system
    • Is this a good idea? What if you are in the middle of reading a file and the server crashes?
    • Options (NFS provides both):
      • Hang until the server comes back up (next week?)
      • Return an error (of course, most applications don't know they are talking over a network)
  (A sketch of a stateless, idempotent read handler appears below.)
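A minimal sketch of what a stateless, idempotent read handler might look like (not the actual NFS wire protocol; the export path and block size are made-up constants for illustration).

```python
import os

BLOCK_SIZE = 8192
EXPORT_ROOT = "/srv/export"          # hypothetical export directory

def handle_read_at(file_id, position):
    """Serve ReadAt(file_id, position): every request carries all it needs,
    the server keeps no per-client state, and duplicate retries are harmless."""
    path = os.path.join(EXPORT_ROOT, file_id)   # real NFS uses opaque file handles, not names
    with open(path, "rb") as f:                 # no long-lived open file kept on the server
        f.seek(position)
        return f.read(BLOCK_SIZE)

# Contrast with a stateful Read(openfile): if the server crashed after a seek,
# the client's next read would see the wrong offset or fail outright.
```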

  19. NFS Cache Consistency
  (Diagram: one client writes F1:V2 to the server and gets an ACK; another client with F1:V1 cached later asks the server "F1 still ok?" and gets back "No: (F1:V2)", after which it caches F1:V2.)
  • NFS protocol: weak consistency
    • Client polls the server periodically to check for changes
      • Polls the server if data hasn't been checked in the last 3-30 seconds (exact timeout is a tunable parameter)
      • Thus, when a file is changed on one client, the server is notified, but other clients use the old version of the file until the timeout
    • What if multiple clients write to the same file?
      • In NFS, can get either version (or parts of both)
      • Completely arbitrary!
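A rough sketch of that polling rule, assuming a hypothetical get_server_mtime() attribute-check call and a fixed 30-second window (NFS's actual timeout is tunable, 3-30 seconds).

```python
import time

POLL_INTERVAL = 30.0   # seconds; standing in for NFS's tunable 3-30 second window

class CachedFile:
    """Weak-consistency sketch: trust the cached copy until it is older than the
    poll interval, then ask the server whether the file changed."""

    def __init__(self, data, mtime):
        self.data = data
        self.mtime = mtime                 # modification time seen when we cached
        self.last_checked = time.time()

    def read(self, get_server_mtime, fetch_from_server):
        now = time.time()
        if now - self.last_checked > POLL_INTERVAL:
            self.last_checked = now
            server_mtime = get_server_mtime()
            if server_mtime != self.mtime:          # someone else changed the file
                self.data = fetch_from_server()     # until this moment we served stale data
                self.mtime = server_mtime
        return self.data
```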

  20. Sequential Ordering Constraints
  (Timeline diagram: starting from file contents "A", Client 1 writes B and Client 2 writes C; reads issued by the clients at various points in time return A, A or B, or parts of B or C, depending on when they occur relative to the writes.)
  • What sort of cache coherence might we expect?
    • i.e. what if one CPU changes a file, and before it's done, another CPU reads the file?
  • Example: start with file contents = "A"
  • What would we actually want?
    • Assume we want the distributed system to behave exactly the same as if all processes were running on a single system
      • If a read finishes before the write starts, get the old copy
      • If a read starts after the write finishes, get the new copy
      • Otherwise, get either the new or the old copy
  • For NFS:
    • If a read starts more than 30 seconds after the write, get the new copy; otherwise, could get a partial update

  21. Andrew File System
  • Andrew File System (AFS, late 80's) → DCE DFS (commercial product)
  • Callbacks: server records who has a copy of the file
    • On changes, server immediately tells all with the old copy
    • No polling bandwidth (continuous checking) needed
  • Write-through on close
    • Changes not propagated to the server until close()
    • Session semantics: updates visible to other clients only after the file is closed
      • As a result, do not get partial writes: all or nothing!
      • Although, for processes on the local machine, updates are visible immediately to other programs that have the file open
  • In AFS, everyone who has the file open sees the old version
    • Don't get newer versions until the file is reopened

  22. Andrew File System (con't)
  • Data cached on the local disk of the client as well as in memory
    • On open with a cache miss (file not on local disk):
      • Get the file from the server, set up a callback with the server
    • On write followed by close:
      • Send a copy to the server; the server tells all clients with copies to fetch the new version from the server on the next open (using callbacks)
  • What if the server crashes? Lose all callback state!
    • Reconstruct callback information from clients: go ask everyone "who has which files cached?"
  • AFS pro: relative to NFS, less server load:
    • Disk as cache → more files can be cached locally
    • Callbacks → server not involved if the file is read-only
  • For both AFS and NFS: the central server is the bottleneck!
    • Performance: all writes → server, cache misses → server
    • Availability: server is a single point of failure
    • Cost: server machine's high cost relative to a workstation

  23. Implementation of NFS

  24. Enabling Factor: Virtual Filesystem (VFS)
  • VFS: virtual abstraction similar to a local file system
    • Provides virtual superblocks, inodes, files, etc.
    • Compatible with a variety of local and remote file systems
      • Provides an object-oriented way of implementing file systems
  • VFS allows the same system call interface (the API) to be used for different types of file systems
    • The API is to the VFS interface, rather than to any specific type of file system
  • In Linux, "VFS" stands for "Virtual Filesystem Switch"
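The real VFS is kernel C, but the dispatch idea can be sketched in Python: one abstract interface, several filesystem implementations, and a mount table that routes each path to the right one. All class and mount names here are invented for illustration.

```python
class FileSystem:                      # the common interface every filesystem implements
    def read(self, path): raise NotImplementedError
    def write(self, path, data): raise NotImplementedError

class LocalFS(FileSystem):
    def __init__(self): self.files = {}
    def read(self, path): return self.files.get(path, b"")
    def write(self, path, data): self.files[path] = data

class RemoteFS(FileSystem):
    def read(self, path): return b"(would issue a remote read RPC for %s)" % path.encode()
    def write(self, path, data): pass  # would issue a remote write RPC

MOUNTS = {"/": LocalFS(), "/users": RemoteFS()}   # hypothetical mount table

def vfs_for(path):
    """Pick the filesystem whose mount point is the longest prefix of the path."""
    mount = max((m for m in MOUNTS if path.startswith(m)), key=len)
    return MOUNTS[mount]

def sys_read(path):                    # same "system call" regardless of filesystem type
    return vfs_for(path).read(path)
```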

  25. VFS Common File Model in Linux
  • Four primary object types for VFS:
    • superblock object: represents a specific mounted filesystem
    • inode object: represents a specific file
    • dentry object: represents a directory entry
    • file object: represents an open file associated with a process
  • There is no specific directory object (VFS treats directories as files)
  • May need to fit the model by faking it
    • Example: make it look like directories are files
    • Example: make it look like you have inodes, superblocks, etc.

  26. Linux VFS
  • An operations object is contained within each primary object type to set the operations of specific filesystems
    • "super_operations": methods that the kernel can invoke on a specific filesystem, e.g. write_inode() and sync_fs()
    • "inode_operations": methods that the kernel can invoke on a specific file, such as create() and link()
    • "dentry_operations": methods that the kernel can invoke on a specific directory entry, such as d_compare() or d_delete()
    • "file_operations": methods that a process can invoke on an open file, such as read() and write()
  • There are a lot of operations
  (Diagram: write() in user space → sys_write() in the VFS → the filesystem's write method → physical media.)

  27. Key-Value Storage
  • Handle huge volumes of data, e.g., PBs
    • Store (key, value) tuples
  • Simple interface
    • put(key, value); // insert/write "value" associated with "key"
    • value = get(key); // get/read data associated with "key"
  • Used sometimes as a simpler but more scalable "database"
  (A minimal in-memory sketch of this interface appears below.)
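A minimal single-machine sketch of this interface, backed by an ordinary Python dict; a real key-value store partitions and replicates this state across many nodes, as the following slides discuss.

```python
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Insert/write 'value' associated with 'key'."""
        self._data[key] = value

    def get(self, key):
        """Get/read the data associated with 'key' (None if absent)."""
        return self._data.get(key)

# Example usage with a made-up key in the spirit of the next slide:
store = KeyValueStore()
store.put("customerID:42", {"buying_history": [], "credit_card": "..."})
profile = store.get("customerID:42")
```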

  28. Key Values: Examples
  • Amazon:
    • Key: customerID
    • Value: customer profile (e.g., buying history, credit card, ...)
  • Facebook, Twitter:
    • Key: UserID
    • Value: user profile (e.g., posting history, photos, friends, ...)
  • iCloud/iTunes:
    • Key: movie/song name
    • Value: movie, song

  29. Examples
  • Amazon
    • DynamoDB: internal key-value store used to power Amazon.com (shopping cart)
    • Simple Storage Service (S3)
  • BigTable/HBase/Hypertable: distributed, scalable data storage
  • Cassandra: "distributed data management system" (developed by Facebook)
  • Memcached: in-memory key-value store for small chunks of arbitrary data (strings, objects)
  • eDonkey/eMule: peer-to-peer sharing systems
  • ...

  30. Key-Value Store
  • Also called Distributed Hash Tables (DHT)
  • Main idea: partition the set of key-value pairs across many machines

  31. Challenges
  • Fault tolerance: handle machine failures without losing data and without degradation in performance
  • Scalability:
    • Need to scale to thousands of machines
    • Need to allow easy addition of new machines
  • Consistency: maintain data consistency in the face of node failures and message losses
  • Heterogeneity (if deployed as a peer-to-peer system):
    • Latency: 1 ms to 1000 ms
    • Bandwidth: 32 Kb/s to 100 Mb/s

  32. Key Questions
  • put(key, value): where do you store a new (key, value) tuple?
  • get(key): where is the value associated with a given "key" stored?
  • And, do the above while providing:
    • Fault tolerance
    • Scalability
    • Consistency

  33. Directory-Based Architecture
  • Have a node maintain the mapping between keys and the machines (nodes) that store the values associated with the keys
  (Diagram: put(K14, V14) goes to the master/directory, whose table maps K5→N2, K14→N3, K105→N50; the master forwards put(K14, V14) to node N3, which stores K14:V14.)

  34. Directory-Based Architecture
  • Have a node maintain the mapping between keys and the machines (nodes) that store the values associated with the keys
  (Diagram: get(K14) goes to the master/directory, which looks up K14→N3, fetches V14 from N3, and returns V14 to the requester.)

  35. Directory-Based Architecture
  • Having the master relay the requests → recursive query
  • Another method: iterative query (this slide)
    • Return the node to the requester and let the requester contact the node
  (Diagram: put(K14, V14) goes to the master/directory, which replies "N3"; the requester then sends put(K14, V14) directly to N3.)

  36. Directory-Based Architecture
  • Having the master relay the requests → recursive query
  • Another method: iterative query
    • Return the node to the requester and let the requester contact the node
  (Diagram: get(K14) goes to the master/directory, which replies "N3"; the requester then sends get(K14) directly to N3 and receives V14.)

  37. Discussion: Iterative vs. Recursive Query
  • Recursive query:
    • Advantages:
      • Faster, as typically the master/directory is closer to the nodes
      • Easier to maintain consistency, as the master/directory can serialize puts()/gets()
    • Disadvantages: scalability bottleneck, as all values go through the master/directory
  • Iterative query:
    • Advantages: more scalable
    • Disadvantages: slower, harder to enforce data consistency
  (Diagram: side-by-side recursive and iterative get(K14) flows through the master/directory and node N3. A sketch contrasting the two styles follows below.)
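The contrast can be sketched in a few lines of Python; the Master and Node classes and the N1/N2/N3/N50 names are toy stand-ins mirroring the slides' example, not a real implementation.

```python
class Node:
    def __init__(self, store):
        self.store = store
    def get(self, key):
        return self.store.get(key)

class Master:
    def __init__(self, directory, nodes):
        self.directory = directory   # key -> node name
        self.nodes = nodes           # node name -> Node

    def get_recursive(self, key):
        """Recursive query: the master relays the request, so the value flows through it."""
        return self.nodes[self.directory[key]].get(key)

    def locate(self, key):
        """Iterative query, step 1: only tell the requester which node holds the key."""
        return self.directory[key]

nodes = {"N2": Node({"K5": "V5"}), "N3": Node({"K14": "V14"}), "N50": Node({"K105": "V105"})}
master = Master({"K5": "N2", "K14": "N3", "K105": "N50"}, nodes)

# Recursive: one call to the master; the value passes through it (potential bottleneck).
assert master.get_recursive("K14") == "V14"

# Iterative: ask the master for the node, then contact that node directly (more scalable).
node_name = master.locate("K14")
assert nodes[node_name].get("K14") == "V14"
```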

  38. Fault Tolerance
  • Replicate value on several nodes
  • Usually, place replicas on different racks in a datacenter to guard against rack failures
  (Diagram: put(K14, V14) goes to the master/directory, which records K14→N1,N3 and propagates the put recursively so that both N1 and N3 store K14:V14.)

  39. Fault Tolerance
  • Again, we can have:
    • Recursive replication (previous slide)
    • Iterative replication (this slide)
  (Diagram: the requester learns the replica list N1, N3 from the master/directory and sends put(K14, V14) to N1 and N3 itself; both store K14:V14.)

  40. Fault Tolerance
  • Or we can use a recursive query and iterative replication...
  (Diagram: put(K14, V14) goes to the master/directory, which records K14→N1,N3 and sends the put to N1 and N3 itself; both store K14:V14.)

  41. Scalability
  • Storage: use more nodes
  • Number of requests:
    • Can serve requests in parallel from all nodes on which a value is stored
    • Master can replicate a popular value on more nodes
  • Master/directory scalability:
    • Replicate it
    • Partition it, so different keys are served by different masters/directories
      • How do you partition?

  42. Scalability: Load Balancing
  • Directory keeps track of the storage availability at each node
    • Preferentially insert new values on nodes with more storage available
  • What happens when a new node is added?
    • Cannot insert only new values on the new node. Why?
    • Move values from the heavily loaded nodes to the new node
  • What happens when a node fails?
    • Need to replicate values from the failed node to other nodes

  43. Consistency
  • Need to make sure that a value is replicated correctly
  • How do you know a value has been replicated on every node?
    • Wait for acknowledgements from every node
  • What happens if a node fails during replication?
    • Pick another node and try again
  • What happens if a node is slow?
    • Slow down the entire put()? Pick another node?
  • In general, with multiple replicas:
    • Slow puts and fast gets

  44. Consistency (cont'd)
  • If there are concurrent updates (i.e., puts to the same key), may need to make sure that the updates happen in the same order
    • put(K14, V14') and put(K14, V14'') reach N1 and N3 in reverse order
    • What does get(K14) return?
      • Undefined!
  (Diagram: two clients issue put(K14, V14') and put(K14, V14'') through the master/directory; because the puts arrive at N1 and N3 in different orders, the two replicas end up holding different values for K14.)

  45. Consistency (cont'd)
  • Large variety of consistency models:
    • Atomic consistency (linearizability): reads/writes (gets/puts) to replicas appear as if there were a single underlying replica (single system image)
      • Think "one update at a time"
      • Transactions
    • Eventual consistency: given enough time, all updates will propagate through the system
      • One of the weakest forms of consistency; used by many systems in practice
    • And many others: causal consistency, sequential consistency, strong consistency, ...

  46. Quorum Consensus
  • Improve put() and get() operation performance
  • Define a replica set of size N
    • put() waits for acknowledgements from at least W replicas
    • get() waits for responses from at least R replicas
    • W + R > N
  • Why does it work?
    • There is at least one node that contains the update
  • Why might you use W + R > N + 1?
  (A minimal quorum read/write sketch follows below.)
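Here is a rough Python sketch of quorum reads and writes (not any particular system's protocol); the version numbers, the in-memory replicas list, and the absence of real failure handling are simplifications for illustration.

```python
import random

# With W + R > N, every read quorum overlaps every write quorum, so at least
# one contacted replica holds the latest write; version numbers let the reader
# pick that value out.
N, W, R = 3, 2, 2
replicas = [dict() for _ in range(N)]     # each replica: key -> (version, value)

def put(key, value, version):
    """Write to replicas until at least W have acknowledged."""
    acks = 0
    for rep in random.sample(replicas, N):    # in reality, failed/slow nodes may never reply
        rep[key] = (version, value)
        acks += 1
        if acks >= W:
            return True                       # quorum reached; remaining replicas lag behind
    return False

def get(key):
    """Read from R replicas and return the value with the highest version."""
    responses = [rep[key] for rep in random.sample(replicas, R) if key in rep]
    if not responses:
        return None
    return max(responses)[1]                  # (version, value): newest version wins

put("K14", "V14", version=1)
assert get("K14") == "V14"
```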

  47. Quorum Consensus Example
  • N = 3, W = 2, R = 2
  • Replica set for K14: {N1, N2, N4}
  • Assume the put() on N3 fails
  (Diagram: put(K14, V14) is sent to the replicas; two of them ACK and store K14:V14, while one put fails and never ACKs.)

  48. Quorum Consensus Example
  • Now, issuing get() to any two nodes out of the three will return the answer
  (Diagram: get(K14) is sent to two nodes; one returns nil (it missed the put), the other returns V14.)

  49. Scaling Up the Directory
  • Challenge:
    • The directory contains a number of entries equal to the number of (key, value) tuples in the system
    • Can be tens or hundreds of billions of entries in the system!
  • Solution: consistent hashing
    • Associate to each node a unique id in a uni-dimensional space 0..2^m-1
      • Partition this space across the machines
    • Assume keys are in the same uni-dimensional space
    • Each (key, value) is stored at the node with the smallest ID larger than the key

  50. Key-to-Node Mapping Example
  • m = 6 → ID space: 0..63
  • Node 8 maps keys [5, 8]
  • Node 15 maps keys [9, 15]
  • Node 20 maps keys [16, 20]
  • ...
  • Node 4 maps keys [59, 4] (wrapping around the ring)
  (Diagram: a ring of IDs 0..63 with nodes at 4, 8, 15, 20, 32, 35, 44, and 58; key 14 with value V14 is stored at node 15.)
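To make the mapping rule concrete, here is a small Python sketch of the example above; the node IDs are taken from the slide's ring, and a real system would hash node identifiers and keys into the ID space rather than assigning small integers directly.

```python
import bisect

M = 6
RING_SIZE = 2 ** M                       # ID space 0..63
NODE_IDS = sorted([4, 8, 15, 20, 32, 35, 44, 58])

def node_for_key(key):
    """Return the node responsible for 'key': the node with the smallest ID >= key
    (matching the bracketed ranges above), wrapping around the top of the ring."""
    key %= RING_SIZE
    i = bisect.bisect_left(NODE_IDS, key)    # first node ID >= key
    return NODE_IDS[i % len(NODE_IDS)]       # wrap: keys above 58 go to node 4

assert node_for_key(14) == 15    # matches the slide: K14/V14 stored at node 15
assert node_for_key(61) == 4     # wrap-around: keys in [59, 4] map to node 4
```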
