Distributed File Systems

Distributed File Systems Chapter 16

Distributed Systems • Introduction – advantages of distributed systems 15. Structures – network types, design 16. File Systems – naming, cache updating 17. Coordination – event ordering, mutual exclusion

Internet Multi-CPU Systems M M M C+M C+M C C C C C C C C C Shared memory M C Inter- connect C C C M C C C C C C C C+M C+M C+M M M M Wide-area distributed system Shared-memory multiprocessor Message-passing multicomputer Tanenbaum, Modern Operating Systems, 2nd Ed., p. 505

Examples of Multi-CPU Systems • Multiprocessors – quad CPU PC • Multicomputer – 512 nodes in a room working on pharmaceutical modelling • Distributed System – Thousands of machines loosely cooperating over the Internet Tanenbaum, p. 549

Types of Multi-CPU Systems

Interconnect Topologies Grid Single switch Ring Hypercube Cube Smallest diameter Most links Double Torus Tanenbaum, Modern Operating Systems, 2nd Ed., p. 528

Chapter 16 Distributed-File Systems • Background • Naming and Transparency • Remote File Access • Stateful versus Stateless Service • File Replication • Example Systems

Background • Distributed file system (DFS) – a distributed implementation of the classical time-sharing model of a file system, where multiple users share files and storage resources. • A DFS manages set of dispersed storage devices • Overall storage space is composed of different, remotely located, smaller storage spaces. • A component unit is the smallest set of files that can be stored on a single machine, independently of other units. • There is usually a correspondence between constituent storage spaces and sets of files.

DFS Parts • Service – software entity running on one or more machines and providing a particular type of function to a priori unknown clients. • Server – service software running on a single machine. • Client – process that can invoke a service using a set of operations that forms its client interface. So a file system provides file services to clients.

DFS Features • Client Interface • A set of primitive file operations (create, delete, read, write). • Transparency • Local and remote files are indistinguishable • The multiplicity of its servers and storage devices should appear invisible. • Response time is ideally comparable to that of a local file system

Application Application Application Application Middleware Middleware Middleware Middleware Solaris Windows Mac OS Linux Pentium Macintosh Pentium SPARC Network DFS Implementation • Various Implementations • Part of a distributed operating system, or • A software layer managing communication between conventional Operating Systems Tanenbaum, p. 551

Naming and Transparency • Naming – mapping between logical and physical objects. • Multilevel mapping – abstraction of a file that hides the details of how and where on the disk the file is actually stored. • A transparent DFS hides the location in the network where the file is stored. • A file can be replicated in several sites, • the mapping returns a set of the locations of this file’s replicas • both the existence of multiple copies and their location are hidden.

Naming Structures • Location transparency – file name does not reveal the file’s physical storage location. e.g. /server1/dir/dir2/x says that file is located on server1, but it does not tell where that server is located • File name still denotes a specific, although hidden, set of physical disk blocks. • Convenient way to share data. • Can expose correspondence between component units and machines. • However, if file x is large, the system might like to move x from server1 to server2, but the path name would change from /server1/dir/dir2/x to /server2/dir/dir2/x

Naming Structures • Location independence – file name does not need to be changed when the file’s physical storage location changes. • Better file abstraction. • Promotes sharing the storage space itself. • Separates the naming hierarchy from the storage-devices hierarchy, allowing file migration • Difficult to achieve, few experimental examples only (e.g. Andrews File System) • Even remote mounting will not achieve location independence, since it is not normally possible to move a file from one file group (the unit of mounting) to another, and still be able to use the old path name.

Naming Schemes — Three Main Approaches • Combination names: • Files named by combination of their host name and local name; • Guarantees a unique systemwide name. e.g. host:local-name • Mounting file systems: • Attach remote directories to local directories, giving the appearance of a coherent directory tree; • Automount allows mounts to be done on demand • Global name structure • Total integration of the component file systems. • Spans all the files in the system. • Location-independent file identifiers link files to component units

Types of Middleware • Document-based: • Each page has a unique address • Hyperlinks within each page point to other pages • File System based: • Distributed system looks like a local file system • Shared Object based • All items are objects, bundled with access procedures called methods • Coordination-based • The network appears as a large, shared memory

Document-based Middleware • Make a distributed system look like a giant collection of hyperlinked documents • E.g. hyperlinks on web pages. • Steps in accessing web page http://www.acm.org/dl/faq.html • Browser asks DNS for IP address of www.acm.org • DNS replies with 199.222.69.151 • Browser connects by TCP to Port 80 of 199.222.69.151 • Browser requests file dl/faq.html • TCP connection is released • Browser displays all text in dl/faq.html • Browser fetches and displays all images in dl/faq.html

File-system based Middleware • Make a distributed system look like a great big file system • Single global file system, with users all over the world able to read and write files for which they have authorization Server Client Client Server Old file New file Upload/download model e.g. AFS Remote access model e.g. NFS

Remote File Access • Reduce network traffic by retaining recently accessed disk blocks in a cache, so that repeated accesses to the same information can be handled locally. • If needed data not already cached, a copy of data is brought from the server to the user. • Accesses are performed on the cached copy. • Files identified with one master copy residing at the server machine, but copies of (parts of) the file are scattered in different caches. • Cache-consistency problem – keeping the cached copies consistent with the master file. • Network virtual memory, with backing store at a remote server

Network Cache Location • Disk Cache • More reliable, survive crashes. • Main-Memory Cache • Permit workstations to be diskless. • Data can be accessed more quickly. • Technology trend to bigger, less expensive memories. • Server caches (used to speed up disk I/O) are in main memory regardless of where user caches are located; using main-memory caches on the user machine permits a single caching mechanism for servers and users. e.g. NFS has memory caching, optional disk cache

Cache Update Policy • Write-through policy – write data through to disk as soon as they are placed on any cache. Reliable, but poor performance. • Delayed-write policy – modifications written to the cache and then written through to the server later. • Fast – write accesses complete quickly. • Less reliable – unwritten data lost whenever a user machine crashes. • Update on flush from cache • But flushes happen at irregular intervals • Update on regular scan • Scan cache, flush blocks that have been modified since the last scan (NFS). • Write-on-close – write data back to the server when the file is closed (AFS). • Best for files that are open for long periods and frequently modified.

Consistency • Is locally cached copy of the data consistent with the master copy? How to verify validity of cached data? • Client-initiated approach • Client initiates a validity check. • Server checks whether the local data are consistent with the master copy. • Check before every access, or timed checks • Server-initiated approach • Server records, for each client, the (parts of) files it caches. • When server detects a potential inconsistency, it reacts e.g. When same file is open for read and write on different clients

Caching and Remote Service • Caching • Faster, especially with locality in file accessing • Servers contacted only occasionally (rather than for each access). • Reduced server load and network traffic • Enhanced potential for scalability. • Lower network overhead, as data is transmitted in bigger chunks • Remote server method • Useful for diskless machines • Avoids cache-consistency problem • Inter-machine interface mirrors the local user-file-system interface

Stateful File Service • Mechanism. • Client opens a file. • Server fetches information about the file from its disk, stores it in its memory, and gives the client a connection identifier unique to the client and the open file. • Identifier is used for subsequent accesses until the session ends. • Server must reclaim the main-memory space used by clients who are no longer active. • Increased performance. • Fewer disk accesses. • Stateful server knows if a file was opened for sequential access and can thus read ahead the next blocks.

Stateless File Server • Mechanism • Each request self-contained. • No state information retained between requests. • Each request identifies the file and position in the file. • File open and close are local to the client • Design implications • Reliable, survives server crashes • Slower, with longer request messages • System-wide file names needed, to avoid name translation • Idempotent – File requests should leave server unchanged

Recovery from Failures • Stateful server • Server failure – loses its volatile state • Restore state by recovery protocol in dialog with clients, or • Abort operations that were underway when the crash occurred. • Client failure • Server needs to be aware of client failures in order to reclaim space allocated to record the state of crashed client processes (orphan detection and elimination). • Stateless server • Server failure and recovery almost unnoticeable. • Newly refreshed server can respond to a self-contained request without any difficulty.

File Replication • Replicas of the same file on failure-independent machines. • Improves availability, shortens service time. • Replicated file name mapped to a particular replica. • Existence of replicas should be invisible to higher levels. • Replicas distinguished from one another by different lower-level names. • Updates – replicas of a file denote the same logical entity • thus an update to any replica must be reflected in all other replicas e.g. Locus OS. • Demand replication – reading a nonlocal replica causes it to be cached locally, thereby generating a new nonprimary replica. • Updates are made to the primary copy, others are invalid (e.g. Ibis)

Andrew Distributed Computing Environment • History • under development since 1983 at Carnegie-Mellon University. • Name honours Andrew Carnegie and Andrew Mellon • Highly scalable; • the system is targeted to span over 5000 workstations. • Distinguishes between client machines (workstations) and dedicated server machines. • Servers and clients run slightly modified UNIX • Workstation LAN clusters interconnected by a WAN.

Andrew File System (AFS) • Clients are presented with a partitioned space of file names: a local name space and a shared name space. • Dedicated servers, called Vice, present the shared name space to the clients as a homogeneous, identical, and location transparent file hierarchy. • The local name space is the root file system of a workstation, from which the shared name space descends. • Workstations run the Virtue protocol to communicate with Vice, and are required to have local disks where they store their local name space. • Servers collectively are responsible for the storage and management of the shared name space.

AFS File Operations • Andrew caches entire files from servers. • A workstation interacts with Vice servers only during opening and closing of files. • Venus runs locally in the kernel on each workstation • Caches files from Vice when they are opened, • Stores modified copies of files back when they are closed. • Caches contents of directories and symbolic links, for path-name translation • Reading and writing bytes of a file • Done by the kernel without Venus intervention on the cached copy.

Types of Middleware • Document-based: (e.g. Web) • Each page has a unique address • Hyperlinks within each page point to other pages • File System based: (e.g NFS, AFS) • Distributed system looks like a local file system • Shared Object based (e.g. CORBA, Globe) • All items are objects, bundled with access procedures called methods • Coordination-based (e.g. Linda, Jini) • The network appears as a large, shared memory

Shared Object based Middleware • Objects • Everything is an object, a collection of variables bundled with access procedures called methods • Processes invoke methods to access the variables • Common Object Request Broker Architecture (CORBA) • Client processes on client machines can invoke operations on objects on (possibly) remote server machines • To match objects from different machines, Object Request Brokers (ORBs) are interposed between client and server to allow them to match up • Interface Definition Language (IDL) • Tells what methods the object exports, • Tells what parameter types each object expects

Client ORB Client ORB Operating System Operating System CORBA Model Server code Client stub Skeleton Client code server client Object adapter IIOP Tanenbaum, p. 567

CORBA • Allows different client and server applications to communicate • e.g. a C++ program can use CORBA to access a COBOL database • ORB (Object Request Broker) • implements the interface specified by the IDL • ORB is on both client and server side • IIOP (Internet InterORB Protocol) • specifies how ORBs can communicate • Stub – Client-side library of IDL object specs • Skeleton – Server-side procedure for IDL-spec’d object • Objectadapter • wrapper that registers object, • generates object references, • activates the object

Remote Method Invocation Procedure • Process creates CORBA object, receives its reference • Reference is available to be passed to other processes, or stored in an object database for lookup • Client process acquires a reference to the object • Client process marshals required parameters into a parcel • Client process contacts client ORB • Client ORB sends the parcel to the server ORB • Server ORB arranges for invocation of method on the object

Globe System • Scope • Scales to 1 billion users and 1 trillion objects e.g. stock prices, sports scores • Method • Replicate object, spread load over replicas • Every globe object has a class object with its methods • The object interface is a table of pointers, each a <method pointer, state pointer> pair • State pointers can point to interfaces such as mailboxes, each with its own language or function e.g. business mail, personal mail e.g. languages such as C, C++, Java, assembly

List Messages Read Messages Append Messages Delete Messages … … … … … … … … Globe Object Class object contains the method State of Mailbox 2 State of Mailbox 1 Interface used to access Mailbox 2 Interface used to access Mailbox 1

Accessing a Globe Object • Reading • Process looks it up, finds a contact address (e.g IP, port) • Security check, then object binding • Class object (code) loaded into caller’s address space • Instantiate a copy of its state • Process receives a pointer to its standard interface • Process invokes methods using the interface pointer • Writing • According to object replication policy: • Obtain a sequence number from the sequencer • Multicast a message containing the sequence number, operation name and parameters to all other processes bound to the object • Apply writes in order of sequence, to master, and update replicas

… … Globe Object Interface Control subobject Semantic subobject Replication subobject Communication subobject Security subobject Operating System Messages

Subobjects in a Globe Object • Control subobject • Accepts incoming invocations, distributes tasks • Semantics subobject • Actually does the work required by object interface; only part actually programmed by coder • Replication subobject • Manages object replication (e.g. all active, or master-slave) • Security suboject – implements security policy • Communication subobject – network protocols (e.g. IP v4)

Coordination-based Middleware • Linda • Developed at Yale, 1986 • Users appear to share a big memory, known as tuple space • Processes on any machine can insert tuples into tuple space or remove tuples from tuple space • Publish/Subscribe, 1993 • Processes connected by a broadcast network • Each process can be a producer of information, a consumer, or both • Jini • From Sun Microsystems, 1999 • Self-contained Jini devices are plugged into a network, not a computer • Each device offers or uses services

Linda • tuples • Like a structure in C, pure data, with no associated methods e.g. (“abc, 2, 5) (“matrix-1”, 1, 6, 3.14) (“family”, “is-sister”, “Stephany”, “Roberta”) • Operations • Out – put a tuple into tuple space e.g. out(“abc”, 2, 5); • In – retrieve a tuple from tuple space e.g. in((“abc”, 2, ?i); • addressed by content rather than <name, address> • tuple space is searched for a match to the specified contents • Process is blocked until a match is found • Read a tuple, but leave it in tuple space • Eval to evaluate tuple parameters and the resulting tuple put out

Publish/subscribe • Publishing • New information broadcast as a tuple on the network • Tuple has a subject line with multiple fields separated by periods • Processes can subscribe to certain subjects • Subscribing • The tuple daemon on each machine copies all broadcasted tuples into its RAM • It inspects each subject line, forwards a copy to each interested process. Producer WAN LAN Consumer Daemon Information router

Jini • Network-centric computing • An attempt to change from CPU-centric computing • Many self-contained Jini devices offer services to the others • e.g. Computer, cell phone, printer, palmtop, TV set, stereo • A loose confederation of devices, with no central administration • Coded in JVM (Java Virtual Machine language) • Joining a Jini federation • Broadcasts a message asking for a lookup service • Uses the discovery protocol to find the service • Lookup service sends code to register the new device • Device acquires a lease to register for a fixed time • The registration proxy can be sent to other devices looking for service

Jini • JavaSpaces • Entries like Linda tuples, but strongly typed e.g. Employee entry could have <string, integer, integer, boolean> to accommodate <name, department, telephone, works fulltime> • Operations • Write – put an entry into JavaSpace, specifying the lease time • Read – copy an entry that matches a template out of JavaSpace • Take – copy and remove an entry that matches a template • Notify – notify the caller when a matching entry is written • Transactions can be atomic, so multiple methods can be safely grouped – all or none will execute

Distributed File Systems