CS519: Lecture 9

Distributed File Systems


File Service

  • Implemented by a user/kernel process called file server

  • A system may have one or several file servers running at the same time

  • Two models for file services

    • upload/download: files move between server and clients, few operations (read file & write file), simple, requires storage at client, good if whole file is accessed

    • remote memory access: files stay at server, rich interface with many operations, less space at client, efficient for small accesses

Operating System Theory


Directory Service

  • Provides naming usually within a hierarchical file system

  • Clients can have the same view (global root directory) or different views of the file system (remote mounting)

  • Location transparency: the location of the file doesn’t appear in the name of the file

    • ex: /server1/dir1/file specifies the server but not where the server is located -> the server can move in the network without changing the path

  • Location independence: a single name space that looks the same on all machines, files can be moved between servers without changing their names -> difficult

Two-Level Naming

  • Symbolic name (external), e.g. prog.c; binary name (internal), e.g. local i-node number as in Unix

  • Directories provide the translation from symbolic to binary names

  • Binary name format

    • i-node: no cross references among servers

    • (server, i-node): a directory in one server can refer to a file on a different server

    • Capability specifying address of server, number of file, access permissions, etc

    • {binary_name+}: binary names refer to the original file and all of its backups

File Sharing Semantics

  • UNIX semantics: total ordering of R/W events

    • easy to achieve in a non-distributed system

    • in a distributed system with one server and multiple clients with no caching at client, total ordering is also easily achieved since R and W are immediately performed at server

  • Session semantics: writes are guaranteed to become visible only when the file is closed

    • allow caching at client with lazy updating -> better performance

    • if two or more clients simultaneously write: one file (last one or non-deterministically) replaces the other

File Sharing Semantics (cont’d)

  • Immutable files: create and read file operations (no write)

    • writing a file means to create a new one and enter it into the directory replacing the previous one with the same name: atomic operations

    • collision in writing: last copy or nondeterministically

    • what happens if the old copy is being read?

  • Transaction semantics: mutual exclusion on file accesses; either all file operations are completed or none is. Good for banking systems

File System Properties

  • Observed in a study by Satyanarayanan (1981)

    • most files are small (< 10K)

    • reading is much more frequent than writing

    • most R&W accesses are sequential (random access is rare)

    • most files have a short lifetime -> create the file on the client

    • file sharing is unusual -> caching at client

    • the average process uses only a few files

Server System Structure

  • File + directory service: combined or not

  • Cache directory hints at client to accelerate the path name look up – directory and hints must be kept coherent

  • State information about clients at the server

    • stateless server: no client information is kept between requests

    • stateful server: servers maintain state information about clients between requests

Stateless vs. Stateful

Stateful Servers

  • shorter messages

  • better performance (info in memory until close)

  • open/close at server

  • file locking possible

  • read ahead possible

Stateless Servers

  • requests are self-contained

  • better fault tolerance

  • open/close at client (fewer msgs)

  • no space reserved for tables

  • thus, no limit on open files

  • no problem if client crashes

Caching

  • Three possible places: server’s memory, client’s disk, client’s memory

  • Caching in server’s memory: avoids disk access but still network access

  • Caching at client’s disk (if available): tradeoff between disk access and remote memory access

  • Caching at client usually in main memory

    • inside each process address space: no sharing at client

    • in the kernel: kernel involvement on hits

    • in a separate user-level cache manager: flexible and efficient if paging can be controlled from user-level

  • Server-side caching eliminates coherence problem. Client-side cache coherence? Next…

Client Cache Coherence in DFS

  • How to maintain coherence (according to a model, e.g. UNIX semantics or session semantics) of copies of the same file at various clients

  • Write-through: writes are sent to the server as soon as they are performed at the client -> high traffic; requires cache managers to check (modification time) with the server before they can provide cached content to any client

  • Delayed write: coalesces multiple writes; better performance but ambiguous semantics

  • Write-on-close: implements session semantics

  • Central control: file server keeps a directory of open/cached files at clients -> Unix semantics, but problems with robustness and scalability; problem also with invalidation messages because clients did not solicit them

File Replication

  • Multiple copies are maintained, each copy on a separate file server - multiple reasons:

    • Increase reliability: file accessible even if a server is down

    • Improve scalability: reduce the contention by splitting the workload over multiple servers

  • Replication transparency

    • explicit file replication: programmer controls replication

    • lazy file replication: copies made by the server in background

    • use group communication: all copies made at the same time in the foreground

  • How should replicas be modified? Next…

Modifying Replicas: Voting Protocol

  • Updating all replicas using a coordinator works but is not robust (if coordinator is down, no updates can be performed) => Voting: updates (and reads) can be performed if some specified # of servers agree.

  • Voting Protocol:

    • A version # (incremented at write) is associated with each file

    • To perform a read, a client has to assemble a read quorum of Nr servers; similarly, a write quorum of Nw servers for a write

    • If Nr + Nw > N (N = total number of replicas), then any read quorum will contain at least one copy with the most recent file version

    • For reading, client contacts Nr active servers and chooses the file with largest version #

    • For writing, client contacts Nw active servers asking them to write. Succeeds if they all say yes.
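
The quorum rule above can be sketched in a few lines of Python. This is an illustrative sketch only: the `Replica` class and the function names are hypothetical, servers are modeled as in-memory objects, and the extra check `2 * Nw > N` is an added assumption (it is not on the slide) so that two write quorums always overlap.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One server's copy of the file (in-memory stand-in for a file server)."""
    version: int = 0
    data: bytes = b""


def check_quorum_sizes(n, nr, nw):
    # Nr + Nw > N comes from the slide; 2 * Nw > N is an added assumption so
    # that two write quorums always overlap and version numbers stay unique.
    if not (nr + nw > n and 2 * nw > n):
        raise ValueError("quorum sizes do not guarantee overlap")


def quorum_write(replicas, quorum, data):
    """Write data to every server in the write quorum with a new version #."""
    version = max(replicas[i].version for i in quorum) + 1
    for i in quorum:
        replicas[i].version = version
        replicas[i].data = data
    return version


def quorum_read(replicas, quorum):
    """Return (version, data) of the newest copy found in the read quorum."""
    newest = max((replicas[i] for i in quorum), key=lambda r: r.version)
    return newest.version, newest.data


# N = 5 replicas, Nr = 2, Nw = 4.
replicas = [Replica() for _ in range(5)]
check_quorum_sizes(n=5, nr=2, nw=4)
quorum_write(replicas, [0, 1, 2, 3], b"v1")
latest = quorum_read(replicas, [3, 4])  # overlaps the write quorum at server 3
```

Server 4 missed the write, yet the read quorum still sees the new version because it must include at least one server from the write quorum.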

Modifying Replicas: Voting Protocol

  • Nr is usually small (reads are frequent), but Nw is usually close to N (want to make sure all replicas are updated). Problem with achieving a write quorum in the presence of server failures

  • Voting with ghosts: allows a write quorum to be established when several servers are down by temporarily creating dummy (ghost) servers (at least one server in the quorum must be real)

  • Ghost servers are not permitted in a read quorum (they don’t have any files)

  • When a server comes back up, it must first restore its copy by obtaining a read quorum

Network File System (NFS)

  • A stateless DFS implemented at Sun

  • An NFS server exports directories

  • Clients access exported directories by mounting them

  • Because NFS is stateless, OPEN and CLOSE operations are not provided by the server (implemented at the client)

  • NFS provides file locking but UNIX file semantics is not achieved because of client caching

    • dirty cache blocks are sent back by clients in chunks, every 30 sec or at close

    • a timer is associated with each cache block at the client (3 sec for data blocks, 30 sec for directory blocks). When the timer expires, the entry is discarded (if clean, of course)

    • when a file is opened, the last modification time at the server is checked
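
A minimal sketch of the client-side caching rules listed above, assuming a hypothetical `ClientCache` class and explicit `now` timestamps (so the behavior is easy to follow). The 3 s and 30 s timers are the values from the slide; everything else, including the data layout, is an illustrative assumption and not the actual NFS client code.

```python
DATA_TTL = 3.0   # seconds, data blocks (value from the slide)
DIR_TTL = 30.0   # seconds, directory blocks (value from the slide)


class ClientCache:
    def __init__(self):
        self.blocks = {}       # (file, block_no) -> [data, dirty, expires_at]
        self.mtime_seen = {}   # file -> server modification time at last open

    def put(self, file, block_no, data, now, is_dir=False):
        ttl = DIR_TTL if is_dir else DATA_TTL
        self.blocks[(file, block_no)] = [data, False, now + ttl]

    def get(self, file, block_no, now):
        entry = self.blocks.get((file, block_no))
        if entry is None:
            return None
        data, dirty, expires_at = entry
        if now >= expires_at and not dirty:
            del self.blocks[(file, block_no)]  # timer expired and clean: discard
            return None
        return data

    def on_open(self, file, server_mtime):
        """Compare the server's modification time with the one seen earlier."""
        if self.mtime_seen.get(file) != server_mtime:
            stale = [k for k, e in self.blocks.items()
                     if k[0] == file and not e[1]]
            for key in stale:
                del self.blocks[key]  # server copy changed: drop clean blocks
        self.mtime_seen[file] = server_mtime
```

Between timer expirations and opens, a client can still read stale data, which is why UNIX semantics are not achieved.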

Recent Research in DFS

  • Petal & Frangipani (DEC SRC): 2-layer DFS system

  • xFS (Berkeley) : a serverless network file system

Petal: Distributed Virtual Disks

  • A distributed storage system that provides a virtual disk abstraction separate from the physical resource

  • The virtual disk is globally accessible to all Petal clients on the network

  • Virtual disks are implemented on a cluster of servers that cooperate to manage a pool of physical disks

  • Advantages

    • recover from any single failure

    • transparent reconfiguration and expandability

    • load and capacity balancing

    • low-level service (lower than a DFS) that handles distribution problems

Petal

Virtual to Physical Translation

  • <virtual disk, virtual offset> -> <server, physical disk, physical offset>

  • Three data structures: virtual disk directory, global map, and physical map

  • The virtual disk directory and global map are globally replicated and kept consistent

  • Physical map is local to each server

  • One level of indirection (virtual disk to global map) is necessary to allow transparent reconfiguration. We’ll discuss reconfiguration soon

Virtual to Physical Translation (cont’d)

  • The virtual disk directory translates the virtual disk identifier into a global map identifier

  • The global map determines the server responsible for translating the given offset (a virtual disk may be spread over multiple physical disks). The global map also specifies the redundancy scheme for the virtual disk

  • The physical map at specific server translates global map identifier and the offset to a physical disk and an offset within that disk. Physical map is similar to a page table
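
The three-step lookup can be sketched with plain dictionaries. The identifiers (`vd0`, `gmap7`, `s0`…), the 64 KB `CHUNK` striping unit, and the round-robin choice of server are all assumptions made for illustration; the slides do not say how the global map picks the server for an offset.

```python
CHUNK = 64 * 1024  # hypothetical striping unit

# Globally replicated structures.
virtual_disk_directory = {"vd0": "gmap7"}
global_maps = {
    "gmap7": {"servers": ["s0", "s1", "s2"],
              "redundancy": "chained-declustering"},
}

# One physical map per server: (global map id, chunk #) -> (disk, base offset).
physical_maps = {
    "s0": {("gmap7", 0): ("disk1", 0x10000)},
    "s1": {("gmap7", 1): ("disk0", 0x40000)},
    "s2": {("gmap7", 2): ("disk3", 0x00000)},
}


def translate(vdisk, offset):
    """<virtual disk, virtual offset> -> <server, physical disk, physical offset>."""
    gmap_id = virtual_disk_directory[vdisk]            # step 1: directory
    servers = global_maps[gmap_id]["servers"]
    chunk, within = divmod(offset, CHUNK)
    server = servers[chunk % len(servers)]             # step 2: global map
    disk, base = physical_maps[server][(gmap_id, chunk)]  # step 3: physical map
    return server, disk, base + within
```

Like a page table, the physical map only translates the chunk number; the offset within the chunk is carried over unchanged.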

Support for Backup

  • Petal simplifies a client’s backup procedure by providing a snapshot mechanism

  • Petal generates snapshots of virtual disks using copy-on-write. Creating a snapshot requires pausing the client’s application to guarantee consistency

  • A snapshot is a virtual disk that cannot be modified

  • Snapshots require a modification to the translation scheme. The virtual disk directory translates a virtual disk id into a pair <global map id, epoch #> where epoch # is incremented at each snapshot

  • At each snapshot a new tuple with a new epoch is created in the virtual disk directory. The snapshot takes the old epoch #

  • All accesses to the virtual disk are made using the new epoch #, so that any write to the original disk creates new entries in the new epoch rather than overwriting the blocks in the snapshot
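
The epoch mechanism can be illustrated with a small sketch. The `VirtualDiskDirectory` class, its method names, and the per-block dictionary of epochs are hypothetical simplifications; the read rule used here (take the newest entry whose epoch is not later than the disk's epoch) is an assumption consistent with the copy-on-write description above.

```python
class VirtualDiskDirectory:
    def __init__(self):
        self.entries = {}      # virtual disk id -> (global map id, epoch #)
        self.blocks = {}       # (global map id, offset) -> {epoch #: data}
        self.read_only = set() # snapshots cannot be modified

    def create(self, vdisk, gmap_id):
        self.entries[vdisk] = (gmap_id, 0)

    def write(self, vdisk, offset, data):
        if vdisk in self.read_only:
            raise PermissionError("snapshots are read-only")
        gmap_id, epoch = self.entries[vdisk]
        # The write lands in the current epoch; older epochs are untouched.
        self.blocks.setdefault((gmap_id, offset), {})[epoch] = data

    def read(self, vdisk, offset):
        gmap_id, epoch = self.entries[vdisk]
        versions = self.blocks.get((gmap_id, offset), {})
        visible = [e for e in versions if e <= epoch]
        return versions[max(visible)] if visible else None

    def snapshot(self, vdisk, snapshot_name):
        gmap_id, epoch = self.entries[vdisk]
        self.entries[snapshot_name] = (gmap_id, epoch)  # snapshot keeps old epoch
        self.read_only.add(snapshot_name)
        self.entries[vdisk] = (gmap_id, epoch + 1)      # disk moves to new epoch
```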

Virtual Disk Reconfiguration

  • Needed when a new server is added or the redundancy scheme is changed

  • Steps to perform it at once (not incrementally) and in the absence of any other activity:

    • create a new global map with desired redundancy scheme and server mapping

    • change all virtual disk directories to point to the new global map

    • redistribute data to the servers according to the translation specified in the new global map

  • The challenge is to perform it incrementally and concurrently with normal client requests

Incremental Reconfiguration

  • First two steps as before; step 3 done in background starting with the translations in the most recent epoch that have not yet been moved

  • Old global map is used to perform read translations which are not found in the new global map

  • A write request only accesses the new global map to avoid consistency problems

  • Limitation: the mapping of the entire virtual disk must be changed before any data is moved -> lots of new global map misses on reads -> high traffic. Solution: relocate only a portion of the virtual disk at a time. Read requests for portion of virtual disk being relocated cause misses, but not requests to other areas
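
A toy model of the read/write rules during incremental reconfiguration, assuming a hypothetical `ReconfiguringDisk` class in which each global map is reduced to a dictionary from offset to data. It only illustrates the fallback logic, not the actual data movement between servers.

```python
class ReconfiguringDisk:
    def __init__(self, old_contents):
        self.old = dict(old_contents)  # data still placed by the old global map
        self.new = {}                  # data placed by the new global map

    def read(self, offset):
        if offset in self.new:
            return self.new[offset]
        return self.old.get(offset)    # new-map miss: fall back to the old map

    def write(self, offset, data):
        self.new[offset] = data        # writes only ever touch the new map

    def move_one(self):
        """Background step: relocate one block; returns False when none are left."""
        if not self.old:
            return False
        offset, data = self.old.popitem()
        if offset not in self.new:     # never clobber a newer client write
            self.new[offset] = data
        return True
```

Every read that misses in the new map costs an extra lookup, which is the traffic problem the last bullet addresses by relocating one portion of the disk at a time.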

Redundancy with Chained Data Placement

  • Petal uses chained-declustering data placement

  • two copies of each data block are stored on neighboring servers

  • every pair of neighboring servers has data blocks in common

  • if server 1 fails, servers 0 and 2 will share server 1’s read load (not server 3)

server 0    server 1    server 2    server 3

d0          d1          d2          d3

d3          d0          d1          d2

d4          d5          d6          d7

d7          d4          d5          d6
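
The layout in the table can be reproduced with a short sketch. The function names are hypothetical, and treating the first copy in each stripe as the primary (with the second copy on the next server) is an assumption read off the table.

```python
def placement(block, n_servers):
    """Return (server holding the first copy, server holding the second copy)."""
    first = block % n_servers
    return first, (first + 1) % n_servers


def layout(n_blocks, n_servers):
    """Per-server list of block numbers, in the row order used by the table."""
    servers = [[] for _ in range(n_servers)]
    for start in range(0, n_blocks, n_servers):
        stripe = range(start, min(start + n_servers, n_blocks))
        for b in stripe:
            servers[placement(b, n_servers)[0]].append(b)  # first-copy row
        for b in stripe:
            servers[placement(b, n_servers)[1]].append(b)  # second-copy row
    return servers


def read_candidates(block, n_servers, failed=()):
    """Servers that can still serve a read of this block."""
    return [s for s in placement(block, n_servers) if s not in failed]


table = layout(8, 4)
```

With server 1 down, `read_candidates(0, 4, failed={1})` leaves only server 0 and `read_candidates(1, 4, failed={1})` leaves only server 2, matching the statement that servers 0 and 2 pick up server 1’s read load.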

Chained Data Placement (cont’d)

  • In case of failure, each server can offload some of its original read load to the next/previous server. Offloading can be cascaded across servers to uniformly balance load

  • Advantage: with a simple mirrored redundancy, the failure of a server would result in a 100% load increase to another server

  • Disadvantage: less reliable than simple mirroring - if a server fails, the failure of either one of its two neighbor servers will result in data becoming unavailable

  • In Petal, one copy is called primary, the other secondary

  • Read requests can be serviced by any of the two servers, while write requests must always try the primary first to prevent deadlock (blocks are locked before reading or writing, but writes require access to both servers)

Read Request

  • The Petal client tries primary or secondary server depending on which one has the shorter queue length. (Each client maintains a small amount of high-level mapping information that is used to route requests to the “most appropriate” servers. If a request is sent to an inappropriate server, the server returns an error code, causing the client to update its hints and retry the request)

  • The server that receives the request attempts to read the requested data

  • If not successful, the client tries the other server

Write Request

  • The Petal client tries the primary server first

  • The primary server marks data busy and sends the request to its local copy and the secondary copy

  • When both complete, the busy bit is cleared and the operation is acknowledged to the client

  • If not successful, the client tries the secondary server

  • If the secondary server detects that the primary server is down, it marks the data element as stale on stable storage before writing to its local disk

  • When the primary server comes up, the primary server has to bring all data marked stale up-to-date during recovery

  • Similar if secondary server is down

Petal Prototype

Petal Performance - Latency

Single client generates requests to random disk offsets

Petal Performance - Throughput

Each of 4 clients making random requests to single VD.

Failed configuration = one of 4 servers has crashed

Petal Performance - Scalability

Frangipani

  • Petal provides disk interface -> need a file system

  • Frangipani is a file system designed to take full advantage of Petal

  • Frangipani’s main characteristics:

    • All users are given a consistent view of the same set of files

    • Servers can be added without changing configuration of existing servers or interrupting their operation

    • Tolerates and recovers from machine, network, and disk failures

    • Very simple internally: a set of cooperating machines that use a common store and synchronize access to that store with locks

Frangipani

  • Petal takes much of the complexity out of Frangipani

    • Petal provides highly available storage that can scale in throughput and capacity

  • However, Frangipani improves on Petal, since:

    • Petal has no provision for sharing the storage among multiple clients

    • Applications use a file-based interface rather than the disk-like interface provided by Petal

  • Problems with Frangipani on top of Petal:

    • Some logging occurs twice (once in Frangipani and once in Petal)

    • Cannot use disk location in placing data, because Petal virtualizes disks

    • Frangipani locks entire files and directories as opposed to individual blocks

Frangipani Structure

Frangipani: Disk Layout

  • A Frangipani file system uses only 1 Petal virtual disk

  • Petal provides 2^64 bytes of “virtual” disk space

    • Commits real disk space when actually used (written)

  • Frangipani breaks disk into regions

    • 1st region stores configuration parameters and housekeeping info

    • 2nd region stores logs – each Frangipani server uses a portion of this region for its log. Can have up to 256 logs.

    • 3rd region holds allocation bitmaps, describing which blocks in the remaining regions are free. Each server locks a different portion.

    • 4th region holds inodes

    • 5th region holds small data blocks (4 Kbytes each)

    • Remainder of Petal disk holds large data blocks (1 Tbyte each)

Frangipani: File Structure

  • First 16 blocks (64 KB) of a file are stored in small blocks

  • If file becomes larger, store the rest in a 1 TB large block
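
The arithmetic behind this layout can be written out as a small sketch. The sizes come from the slides (16 small blocks of 4 KB, one 1 TB large block); the `locate` function and its return format are hypothetical.

```python
SMALL_BLOCK = 4 * 1024                  # 4 KB small blocks
N_SMALL = 16                            # first 16 blocks of a file
SMALL_BYTES = SMALL_BLOCK * N_SMALL     # 64 KB held in small blocks
LARGE_BLOCK = 2 ** 40                   # one 1 TB large block
MAX_FILE_SIZE = SMALL_BYTES + LARGE_BLOCK


def locate(file_offset):
    """Map a byte offset in a file to (kind, block index, offset in block)."""
    if not 0 <= file_offset < MAX_FILE_SIZE:
        raise ValueError("offset outside the maximum file size")
    if file_offset < SMALL_BYTES:
        return ("small", file_offset // SMALL_BLOCK, file_offset % SMALL_BLOCK)
    return ("large", 0, file_offset - SMALL_BYTES)
```

Under these sizes a file can hold at most 64 KB + 1 TB.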

Frangipani: Dealing with Failures

  • Write-ahead redo logging of metadata; user data is not logged

  • Each Frangipani server has its own private log

  • Only after a log record is written to Petal does the server modify the actual metadata in its permanent locations

  • If a server crashes, the system detects the failure and another server uses the log to recover

    • Because the log is on Petal, any server can get to it.
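
A minimal sketch of write-ahead redo logging with recovery by another server. Petal is modeled as a shared dictionary and `FrangipaniServer` is a hypothetical class; the sketch leaves out details a real system needs, such as bounding the log and avoiding the replay of updates that have already been superseded.

```python
class FrangipaniServer:
    def __init__(self, name, petal):
        self.name = name
        self.petal = petal                       # shared store, visible to all servers
        petal["logs"].setdefault(name, [])       # this server's private log

    def update_metadata(self, key, value, crash_before_apply=False):
        # 1. The log record reaches Petal first.
        self.petal["logs"][self.name].append((key, value))
        if crash_before_apply:
            return                               # simulate a crash after logging
        # 2. Only then is the metadata modified in its permanent location.
        self.petal["meta"][key] = value

    def recover(self, failed_name):
        """Redo the failed server's logged updates, then empty its log."""
        for key, value in self.petal["logs"][failed_name]:
            self.petal["meta"][key] = value
        self.petal["logs"][failed_name].clear()


petal = {"logs": {}, "meta": {}}
a = FrangipaniServer("a", petal)
b = FrangipaniServer("b", petal)
a.update_metadata("inode7.size", 10)
a.update_metadata("inode7.size", 20, crash_before_apply=True)
size_before = petal["meta"]["inode7.size"]   # still the old value
b.recover("a")                               # any server can read a's log on Petal
size_after = petal["meta"]["inode7.size"]
```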

Frangipani: Synchronization & Coherence

  • Frangipani has a lock for each log segment, allocation bitmap segment, and each file

  • Multiple-reader/single-writer locks. In case of conflicting requests, the owner of the lock is asked to release or downgrade it to remove the conflict

  • A read lock allows a server to read data from disk and cache it. If server is asked to release its read lock, it must invalidate the cache entry before complying

  • A write lock allows a server to read or write data and cache it. If a server is asked to release its write lock, it must write dirty data to disk and invalidate the cache entry before complying. If a server is asked to downgrade the lock, it must write dirty data to disk before complying

Frangipani: Lock Service

  • Fully distributed lock service for fault tolerance and scalability

  • How to release locks owned by a failed Frangipani server?

    • The failure of a server is discovered when its “lease” expires. A lease is obtained by the server when it first contacts the lock service. All locks acquired are associated with the lease. Each lease has an expiration time (30 seconds) after its creation or last renewal. A server must renew its lease before it expires

    • When a server fails, the locks that it owns cannot be released until its log is processed and any pending updates are written to Petal
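
The lease rules can be sketched as follows. The `LockService` class is a hypothetical, centralized stand-in (the real service is distributed), locks are simplified to exclusive ones, and time is passed in explicitly; only the 30-second lease length is taken from the slide.

```python
LEASE_SECONDS = 30.0  # expiration time quoted on the slide


class LockService:
    def __init__(self):
        self.lease_expiry = {}  # server -> time at which its lease expires
        self.lock_owner = {}    # lock name -> server holding it

    def open_or_renew_lease(self, server, now):
        expiry = self.lease_expiry.get(server)
        if expiry is not None and now >= expiry:
            raise RuntimeError("lease already expired")
        self.lease_expiry[server] = now + LEASE_SECONDS

    def acquire(self, server, lock, now):
        if now >= self.lease_expiry[server]:
            raise RuntimeError("no valid lease")
        holder = self.lock_owner.setdefault(lock, server)
        return holder == server

    def expired_servers(self, now):
        """Servers considered failed because their lease ran out."""
        return sorted(s for s, t in self.lease_expiry.items() if now >= t)

    def release_failed_server(self, server, log_replayed):
        # Locks stay held until the failed server's log has been processed.
        if not log_replayed:
            raise RuntimeError("replay the failed server's log first")
        self.lock_owner = {l: s for l, s in self.lock_owner.items()
                           if s != server}
        del self.lease_expiry[server]
```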

Frangipani: Performance

Frangipani: Scalability

xFS (Context & Motivation)

  • A server-less network file system that works over a cluster of cooperative workstations

  • Moving away from central FS is motivated by three factors

    • hardware opportunity: fast switched LANs provide aggregate bandwidth that scales with the number of machines in the network

    • user demand is increasing: e.g., multimedia

    • limitations of central FS approach:

      • limited scalability

      • expensive

      • replication for availability increases complexity and operation latency

xFS (Contribution & Limitations)

  • A well-engineered approach which takes advantage of several research ideas: RAID, LFS, cooperative caching

  • A truly distributed network file system (no central bottleneck)

    • control processing distributed across the system on per-file granularity

    • storage distributed using a software RAID and a log-based network striping (Zebra)

    • use cooperative caching to use portions of client memory as a large, global file cache

  • Limitation: requires machines to trust each other

RAID in xFS

  • RAID partitions a stripe of data into N-1 data blocks and a parity block (the exclusive-OR of the bits of data blocks)

  • Data and parity blocks are stored on different storage servers

  • Provides both high bandwidth and fault tolerance

  • Traditional RAID drawbacks:

    • multiple accesses for small writes

    • hardware RAID expensive (special hardware to compute parity)
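
The parity computation is simple enough to show directly. This sketch, with hypothetical function names, builds a stripe of N-1 data blocks plus one XOR parity block and rebuilds any single lost block from the survivors.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise exclusive-OR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by their parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def reconstruct(stripe, lost_index):
    """Rebuild the block at lost_index from all the other blocks."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


stripe = make_stripe([b"\x01\x02", b"\x04\x08", b"\x10\x20"])
rebuilt = reconstruct(stripe, 1)   # pretend the server holding block 1 failed
```

The same sketch also hints at the small-write problem: changing one data block means the parity block must be read, recomputed, and rewritten as well.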

LFS in xFS

  • High-performance writes: buffer writes in memory to write them to disk in large, contiguous, fixed-size groups called log segments

  • Writes are always appended as logs

  • imap to locate i-nodes: stored in memory and periodically checkpointed to disk

  • Simple recovery procedure: get the last checkpoint, then roll forward, reading the later segments in the log and updating the imap and i-nodes

  • Free disk management through a log cleaner: coalesces old, partially empty segments into a smaller number of full segments -> cleaning overhead can sometimes be large

Zebra

  • Combines LFS and RAID: LFS’s large writes make writes to the network RAID efficient

  • Implements RAID in software

  • Writes coalesced into a private per-client log

  • Log-based striping:

    • log segment split into log fragments which are striped over the storage servers

    • parity fragment computation is local (no network access)

  • Deltas stored in the log encapsulate modifications to file system states that must be performed atomically - used for recovery

Metadata and Data Distribution

  • A centralized FS stores all data blocks on its local disks

    • manages location of metadata

    • maintains a central cache of data blocks in its memory

    • manages cache consistency metadata that lists which clients in the system are caching each block (not NFS)

xFS: Metadata and Data Distribution

  • Stores data on storage servers

  • Splits metadata management among multiple managers that can dynamically alter the mapping from a file to its manager

  • Uses cooperative caching that forwards data among client caches under the control of the managers

  • The key design challenge: how to locate data and metadata in such a completely distributed system

xFS: Data Structures

Manager Map

  • Allows clients to determine which manager to contact for a file

  • Manager map is globally replicated (it is small)

  • Two translations are necessary to allow manager remapping

    • external file name -> file index number (directory)

    • index number -> manager (manager map)

  • Manager map can also be used for a coarse-grained workload balancing among managers

  • File manager controls disk location metadata (imap & i-node) and cache consistency state (list of clients caching the block or which client has ownership for writing)
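
The two translations can be sketched with a dictionary and a list. The names are hypothetical, and selecting the map entry with `index % len(manager_map)` is an assumption made for illustration; the slides only state that the index number is looked up in the manager map.

```python
# Translation 1 (directory): external file name -> file index number.
directory = {"/home/a/prog.c": 4711}

# Translation 2 (manager map, small and globally replicated):
# map entry -> manager currently responsible for it.
manager_map = ["mgr0", "mgr1", "mgr2", "mgr3"]


def manager_for(path):
    index = directory[path]
    return manager_map[index % len(manager_map)]


before = manager_for("/home/a/prog.c")
manager_map[3] = "mgr1"            # remap: mgr1 takes over this entry
after = manager_for("/home/a/prog.c")
```

Because only the manager map entry changes, the file keeps its name and index number while its manager moves, which is what makes remapping and coarse-grained load balancing possible.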

Read Operation

Write Operation

  • Clients buffer writes in their local memory until committed to a stripe group of storage servers

  • Since xFS uses LFS, a write changes the disk address of the modified block

  • After a client commits a segment to a storage server it notifies the modified blocks’ managers to modify their index nodes and imaps

  • Index nodes and data blocks do not have to be simultaneously committed because in Zebra the client’s log includes a delta that allows reconstruction of the manager’s data structure in the event of a crash

Cache Consistency

  • Per-block rather than per-file

  • Ownership-based similar to a DSM scheme

  • To modify a block a client must get the ownership from the manager

  • The manager invalidates any other cached copies of the block, then gives write permission (ownership) to the client

  • Ownership can be revoked by the manager

  • Manager keeps the list of clients caching each block
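
A sketch of the per-block, ownership-based protocol from the manager's point of view. The `Manager` class and its method names are hypothetical, message passing is replaced by direct calls, and revoking ownership when another client reads the block is an assumed policy.

```python
class Manager:
    def __init__(self):
        self.cachers = {}  # block -> set of clients caching it
        self.owner = {}    # block -> client holding write ownership

    def read(self, client, block):
        """Register a reader; revoke ownership held by a different client."""
        if self.owner.get(block) not in (None, client):
            del self.owner[block]
        self.cachers.setdefault(block, set()).add(client)

    def acquire_ownership(self, client, block):
        """Invalidate all other cached copies, then grant write permission."""
        invalidated = self.cachers.get(block, set()) - {client}
        self.cachers[block] = {client}
        self.owner[block] = client
        return sorted(invalidated)  # clients that were told to drop their copy
```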

Log cleaner in xFS

  • Distributed

  • Relies on utilization status which is also distributed: maintained by the client who wrote that segment

  • A leader in each group initiates cleaning and decides which cleaners should clean the stripe group’s segments

  • Each cleaner receives a subset of segments to clean

  • Cleaners use optimistic concurrency control to resolve conflicts between cleaner updates and normal writes

  • In case of a conflict (because a client is writing a block as it is cleaned) the manager ensures that client update takes precedence over the cleaner’s update

xFS: Performance
