ecs150 Fall 2007: Operating System #5: File Systems (chapters: 6.4~6.7, 8)

Dr. S. Felix Wu

Computer Science Department

University of California, Davis

http://www.cs.ucdavis.edu/~wu/

[email protected]

File System Abstraction

  • Files

  • Directories


[Figure: the file-system layering]

System-call interface
Active file entries
VNODE Layer or VFS
Local naming (UFS)
FFS
Buffer cache
Block or character device driver
Hardware


DIR *dirp = opendir(const char *filename);

struct dirent *direntp = readdir(dirp);

struct dirent {
        ino_t d_ino;
        char d_name[NAME_MAX+1];
};
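A minimal, self-contained sketch of this API in action — it walks a directory and prints each entry's i-node number and name (the default path "." is just an example):

#include <dirent.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    const char *path = (argc > 1) ? argv[1] : ".";
    DIR *dirp = opendir(path);                 /* open the directory stream */
    if (dirp == NULL) {
        perror("opendir");
        return 1;
    }
    struct dirent *direntp;
    while ((direntp = readdir(dirp)) != NULL)  /* one dirent per entry */
        printf("%8lu  %s\n", (unsigned long)direntp->d_ino, direntp->d_name);
    closedir(dirp);
    return 0;
}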

[Figure: a directory is an array of dirent entries; each entry's i-node number points to an i-node, and each i-node represents a file.]

Local versus Remote

  • System Call Interface

  • V-node

  • Local versus remote

    • NFS or i-node

    • Stackable File System

  • Hard-disk blocks

File-System Structure

  • File structure

    • Logical storage unit

    • Collection of related information

  • File system resides on secondary storage (disks).

  • File system organized into layers.

  • File control block – storage structure consisting of information about a file.

File → Disk

  • separate the disk into blocks

  • separate the file into blocks as well

  • paging from file to disk

blocks: 4 - 7 - 2 - 10 - 12

How to represent the file??

How to link these 5 pages together??

BitTorrent pieces

  • One big file (X gigabytes) with a number of pieces (5%) already downloaded (and being shared with others).

  • How much disk space do we need at this moment?

Hard Disk

  • Track, Sector, Head

    • Track + Heads → Cylinder

  • Performance

    • seek time

    • rotation time

    • transfer time

  • LBA

    • Linear Block Addressing
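To make the CHS-to-LBA mapping concrete, here is a small sketch using the classic conversion formula (the 16-head, 63-sector geometry is only an illustrative assumption):

#include <stdio.h>

/* Classic CHS -> LBA conversion: sectors are numbered from 1, heads and
 * cylinders from 0. The geometry values below are examples, not fixed. */
static unsigned long chs_to_lba(unsigned long c, unsigned long h, unsigned long s,
                                unsigned long heads_per_cylinder,
                                unsigned long sectors_per_track)
{
    return (c * heads_per_cylinder + h) * sectors_per_track + (s - 1);
}

int main(void)
{
    /* e.g., cylinder 2, head 3, sector 4 on a 16-head, 63-sector geometry */
    printf("LBA = %lu\n", chs_to_lba(2, 3, 4, 16, 63));   /* prints 2208 */
    return 0;
}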

File → Disk blocks

[Figure: file blocks 0–4 map to disk blocks 4, 7, 2, 10, and 12.]

  • What are the disadvantages?

  • disk access can be slow for “random access”.

  • How big is each block? 64 bytes? 68 bytes?

Kernel Hacking Session

  • This Friday from 7:30 p.m. until midnight…

  • 3083 Kemper

    • Bring your laptop

    • And bring your mug…

A File System

[Figure: a disk divided into partitions; each partition contains a boot block (b), a superblock (s), the i-list (an array of i-nodes), and the directory and data blocks (d).]

One Logical File → Physical Disk Blocks

efficient representation & access

An i-node

[Figure: an i-node's block pointers mapping to a file; how many entries fit in one disk block? Typical: each block is 8K or 16K bytes.]

inode (index node) structure

  • meta-data of the file (field: size in bytes):

    • di_mode 2
    • di_nlinks 2
    • di_uid 2
    • di_gid 2
    • di_size 4
    • di_addr 39
    • di_gen 1
    • di_atime 4
    • di_mtime 4
    • di_ctime 4


[Figure: the file-system layering]

System-call interface
Active file entries
VNODE Layer or VFS
Local naming (UFS)
FFS
Buffer cache
Block or character device driver
Hardware

A File System

[Figure: a disk divided into partitions; each partition contains a boot block (b), a superblock (s), the i-list (an array of i-nodes), and the directory and data blocks (d).]


struct ufs2_dinode {
        u_int16_t     di_mode;        /*   0: IFMT, permissions; see below. */
        int16_t       di_nlink;       /*   2: File link count. */
        u_int32_t     di_uid;         /*   4: File owner. */
        u_int32_t     di_gid;         /*   8: File group. */
        u_int32_t     di_blksize;     /*  12: Inode blocksize. */
        u_int64_t     di_size;        /*  16: File byte count. */
        u_int64_t     di_blocks;      /*  24: Bytes actually held. */
        ufs_time_t    di_atime;       /*  32: Last access time. */
        ufs_time_t    di_mtime;       /*  40: Last modified time. */
        ufs_time_t    di_ctime;       /*  48: Last inode change time. */
        ufs_time_t    di_birthtime;   /*  56: Inode creation time. */
        int32_t       di_mtimensec;   /*  64: Last modified time. */
        int32_t       di_atimensec;   /*  68: Last access time. */
        int32_t       di_ctimensec;   /*  72: Last inode change time. */
        int32_t       di_birthnsec;   /*  76: Inode creation time. */
        int32_t       di_gen;         /*  80: Generation number. */
        u_int32_t     di_kernflags;   /*  84: Kernel flags. */
        u_int32_t     di_flags;       /*  88: Status flags (chflags). */
        int32_t       di_extsize;     /*  92: External attributes block. */
        ufs2_daddr_t  di_extb[NXADDR];/*  96: External attributes block. */
        ufs2_daddr_t  di_db[NDADDR];  /* 112: Direct disk blocks. */
        ufs2_daddr_t  di_ib[NIADDR];  /* 208: Indirect disk blocks. */
        int64_t       di_spare[3];    /* 232: Reserved; currently unused */
};
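To see how di_db[] and di_ib[] translate a logical file block number into a disk block, here is a hedged sketch; the 12 direct pointers and three indirection levels follow the UFS convention, but read_ptr() is a hypothetical helper, not kernel code:

/* Illustrative only: map a logical block number (lbn) to a disk block,
 * assuming 12 direct pointers and `ptrs_per_blk` pointers per block. */
typedef long long daddr_ex;
extern daddr_ex read_ptr(daddr_ex blk, long idx);  /* hypothetical: fetch one
                                                      pointer from an indirect block */

daddr_ex bmap(daddr_ex di_db[12], daddr_ex di_ib[3],
              long lbn, long ptrs_per_blk)
{
    if (lbn < 12)                               /* direct blocks */
        return di_db[lbn];
    lbn -= 12;
    if (lbn < ptrs_per_blk)                     /* single indirect */
        return read_ptr(di_ib[0], lbn);
    lbn -= ptrs_per_blk;
    if (lbn < (long)ptrs_per_blk * ptrs_per_blk) {   /* double indirect */
        daddr_ex ind = read_ptr(di_ib[1], lbn / ptrs_per_blk);
        return read_ptr(ind, lbn % ptrs_per_blk);
    }
    lbn -= (long)ptrs_per_blk * ptrs_per_blk;        /* triple indirect */
    daddr_ex i1 = read_ptr(di_ib[2], lbn / ((long)ptrs_per_blk * ptrs_per_blk));
    daddr_ex i2 = read_ptr(i1, (lbn / ptrs_per_blk) % ptrs_per_blk);
    return read_ptr(i2, lbn % ptrs_per_blk);
}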


struct ufs1_dinode {
        u_int16_t     di_mode;        /*   0: IFMT, permissions; see below. */
        int16_t       di_nlink;       /*   2: File link count. */
        union {
                u_int16_t oldids[2];  /*   4: Ffs: old user and group ids. */
        } di_u;
        u_int64_t     di_size;        /*   8: File byte count. */
        int32_t       di_atime;       /*  16: Last access time. */
        int32_t       di_atimensec;   /*  20: Last access time. */
        int32_t       di_mtime;       /*  24: Last modified time. */
        int32_t       di_mtimensec;   /*  28: Last modified time. */
        int32_t       di_ctime;       /*  32: Last inode change time. */
        int32_t       di_ctimensec;   /*  36: Last inode change time. */
        ufs1_daddr_t  di_db[NDADDR];  /*  40: Direct disk blocks. */
        ufs1_daddr_t  di_ib[NIADDR];  /*  88: Indirect disk blocks. */
        u_int32_t     di_flags;       /* 100: Status flags (chflags). */
        int32_t       di_blocks;      /* 104: Blocks actually held. */
        int32_t       di_gen;         /* 108: Generation number. */
        u_int32_t     di_uid;         /* 112: File owner. */
        u_int32_t     di_gid;         /* 116: File group. */
        int32_t       di_spare[2];    /* 120: Reserved; currently unused */
};

BitTorrent pieces

File size: 10 GB

Pieces downloaded: 512 MB

How much disk space do we need?


#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* for sleep() */

int
main(void)
{
    FILE *f1 = fopen("./sss.txt", "w");
    int i;

    for (i = 0; i < 1000; i++) {
        /* seek to a random offset, creating holes: a sparse file */
        fseek(f1, rand(), SEEK_SET);
        fprintf(f1, "%d%d%d%d", rand(), rand(), rand(), rand());
        if (i % 100 == 0) sleep(1);
    }
    fflush(f1);
    return 0;
}

# ./t
# ls -l ./sss.txt

An i-node

[Figure: an i-node's block pointers mapping to a file; how many entries fit in one disk block? Typical: each block is 1K.]

i-node

  • How many disk blocks can a FS have?

  • How many levels of i-node indirection will be necessary to store a file of 2G bytes? (i.e., 0, 1, 2, or 3)

  • What is the largest possible file size in i-node?

  • What is the size of the i-node itself for a file of 10GB with only 512 MB downloaded?

Answer

  • How many disk blocks can a FS have?

    • 2^64 or 2^32: the pointer (to blocks) size is 8/4 bytes.

  • How many levels of i-node indirection will be necessary to store a file of 2G (2^31) bytes? (i.e., 0, 1, 2, or 3)

    • 12·2^10 + 2^8·2^10 + 2^8·2^8·2^10 + 2^8·2^8·2^8·2^10 >? 2^31

  • What is the largest possible file size in i-node?

    • 12·2^10 + 2^8·2^10 + 2^8·2^8·2^10 + 2^8·2^8·2^8·2^10

    • 2^64 − 1

    • 2^32 · 2^10

You need to consider three issues and find the minimum!
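The arithmetic can be checked mechanically; a small sketch, assuming 1K blocks and 4-byte pointers (so 2^8 pointers per block):

#include <stdio.h>

int main(void)
{
    unsigned long long blk  = 1ULL << 10;   /* 1 KB block size (assumed) */
    unsigned long long ptrs = blk / 4;      /* 2^8 pointers per block (4-byte ptrs) */

    unsigned long long direct = 12 * blk;
    unsigned long long single = ptrs * blk;
    unsigned long long dbl    = ptrs * ptrs * blk;
    unsigned long long triple = ptrs * ptrs * ptrs * blk;

    printf("direct:            %llu bytes\n", direct);
    printf("+ single indirect: %llu bytes\n", direct + single);
    printf("+ double indirect: %llu bytes\n", direct + single + dbl);
    printf("+ triple indirect: %llu bytes\n", direct + single + dbl + triple);
    /* with these parameters a 2 GB (2^31-byte) file needs triple indirection:
       direct + single + double = 12 KB + 256 KB + 64 MB < 2^31 */
    return 0;
}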

Answer: Lower Bound

  • How many pointers?

    • 512MB divided by the block size (1K)

    • 512K pointers times 8 (4) bytes = 4 (2) MB

BitTorrent pieces

File size: 10 GB

Pieces downloaded: 512 MB

How much disk space do we need?

Answer: Upper Bound

  • In the worst case, EVERY indirection block has at least one entry!

  • How many indirection blocks?

    • Single: 1 block

    • Double: 1 + 2^8

    • Triple: 1 + 2^8 + 2^16

  • Total ≈ 2^16 blocks × 1K = 64 MB

    • 2^14 × 1K = 16 MB (UFS2 inode)

Answer (4)

  • 2 MB ~ 64 MB (UFS1)

  • 4 MB ~ 16 MB (UFS2)

  • Answer: sss.txt ≈ 17 MB

    • ≈ 16 MB (inode indirection blocks)

    • 1000 writes × 1K ≈ 1 MB

An i-node

[Figure: an i-node's block pointers mapping to a file; how many entries fit in one disk block? Typical: each block is 1K.]

A File System

[Figure: a disk divided into partitions; each partition contains a boot block (b), a superblock (s), the i-list (an array of i-nodes), and the directory and data blocks (d).]

FFS and UFS

  • /usr/src/sys/ufs/ffs/*

    • Higher-level: directory structure

    • Soft updates & Snapshot

  • /usr/src/sys/ufs/ufs/*

    • Lower-level: buffer, i-node

# of i-nodes

  • UFS1: pre-allocation

    • about 3% of the disk, typically less than 25% used.

  • UFS2: dynamic allocation

    • still a limited # of i-nodes

di_size vs. di_blocks

  • ???

One Logical File → Physical Disk Blocks

efficient representation & access

di_size vs. di_blocks

  • Logical

  • Physical

  • fstat

  • du
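One way to watch the two numbers diverge on a sparse file like sss.txt is stat(2): st_size reports the logical length, while st_blocks counts the 512-byte units actually allocated. A minimal sketch:

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
    struct stat st;
    const char *path = (argc > 1) ? argv[1] : "./sss.txt";

    if (stat(path, &st) != 0) {
        perror("stat");
        return 1;
    }
    printf("logical size (di_size):    %lld bytes\n", (long long)st.st_size);
    printf("physical size (di_blocks): %lld x 512 bytes\n", (long long)st.st_blocks);
    return 0;
}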

Extended Attributes in UFS2

  • Attributes associated with the File

    • di_extb[2];

    • two blocks, but indirection if needed.

  • Format

    • Length: 4 bytes

    • Name Space: 1 byte

    • Content Pad Length: 1 byte

    • Name Length: 1 byte

    • Name: padded to a multiple of 8 bytes

    • Content: variable length

  • Applications: ACL, Data Labelling
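A hedged C rendering of that on-disk layout (field names here are illustrative; FreeBSD's actual declaration lives in sys/extattr.h):

#include <sys/types.h>   /* BSD-style u_int32_t / u_int8_t */

/* Illustrative layout of one UFS2 extended-attribute record, following the
 * byte counts above; not copied from the kernel headers. */
struct ea_record {
        u_int32_t ea_length;        /* 4: total length of this record */
        u_int8_t  ea_namespace;     /* 1: attribute namespace */
        u_int8_t  ea_contentpadlen; /* 1: padding after the content */
        u_int8_t  ea_namelength;    /* 1: length of the name */
        char      ea_name[1];       /* name, padded to a multiple of 8 bytes;
                                       the content follows the padded name */
};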

Some thoughts….

  • What can you do with “extended attributes”?

  • How to design/implement?

    • Should/can we do it with “Stackable File Systems”?

    • Otherwise, the program that manipulates the EAs will have to be very UFS2-dependent, or use FiST with a UFS2 optimization option.

  • Are there any counter examples?

    • security and performance considerations.

A File System

[Figure: a disk divided into partitions; each partition contains a boot block (b), a superblock (s), the i-list (an array of i-nodes), and the directory and data blocks (d).]


struct dirent {
        ino_t d_ino;
        char d_name[NAME_MAX+1];
};

struct stat {
        …
        short nlinks;
        …
};

[Figure: a directory is an array of dirent entries; each entry's i-node number points to an i-node, and each i-node represents a file.]


[Figure: on-disk layout of a small file tree.
  i-node 2: directory / (root wheel, drwxr-xr-x, Apr 1 2004) — entries: . → 2, .. → 2, usr → 4, vmunix → 5
  i-node 4: directory /usr (root wheel, drwxr-xr-x, Apr 1 2004) — entries: . → 4, .. → 2, bin → 7, foo → 6
  i-node 5: file /vmunix (root wheel, rwxr-xr-x, Apr 15 2004) — text and data
  i-node 6: file /usr/foo (kirk staff, rw-rw-r--, Jan 19 2004) — contents “Hello World!”
  i-node 7: directory /usr/bin (root wheel, drwxr-xr-x, Apr 1 2004) — entries: . → 7, .. → 4, ex → 9, groff → 10, vi → 9
  i-node 9: file /usr/bin/vi (bin bin, rwxr-xr-x, Apr 15 2004) — reached by the two hard links ex and vi]

What is the difference?

  • ln -s /usr/src/sys/sys/proc.h ppp.h

  • ln /usr/src/sys/sys/proc.h ppp.h

Hard versus Symbolic

  • ln -s /usr/src/sys/sys/proc.h ppp.h

    • Link to anything, on any mounted partition

    • Delete a symbolic link?

  • ln /usr/src/sys/sys/proc.h ppp.h

    • Link only to a “file” (not a directory)

    • Link only within the same partition -- why?

    • Delete a hard link?
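The same pair of operations via the underlying system calls, in a minimal sketch (the target path is the example above; the new link names are illustrative):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* symbolic link: a new file whose contents name the target path */
    if (symlink("/usr/src/sys/sys/proc.h", "ppp_sym.h") != 0)
        perror("symlink");

    /* hard link: a new directory entry for the same i-node; it fails across
       partitions, since a directory entry stores only an i-node number */
    if (link("/usr/src/sys/sys/proc.h", "ppp_hard.h") != 0)
        perror("link");
    return 0;
}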


struct ufs2_dinode {
        u_int16_t     di_mode;        /*   0: IFMT, permissions; see below. */
        int16_t       di_nlink;       /*   2: File link count. */
        u_int32_t     di_uid;         /*   4: File owner. */
        u_int32_t     di_gid;         /*   8: File group. */
        u_int32_t     di_blksize;     /*  12: Inode blocksize. */
        u_int64_t     di_size;        /*  16: File byte count. */
        u_int64_t     di_blocks;      /*  24: Bytes actually held. */
        ufs_time_t    di_atime;       /*  32: Last access time. */
        ufs_time_t    di_mtime;       /*  40: Last modified time. */
        ufs_time_t    di_ctime;       /*  48: Last inode change time. */
        ufs_time_t    di_birthtime;   /*  56: Inode creation time. */
        int32_t       di_mtimensec;   /*  64: Last modified time. */
        int32_t       di_atimensec;   /*  68: Last access time. */
        int32_t       di_ctimensec;   /*  72: Last inode change time. */
        int32_t       di_birthnsec;   /*  76: Inode creation time. */
        int32_t       di_gen;         /*  80: Generation number. */
        u_int32_t     di_kernflags;   /*  84: Kernel flags. */
        u_int32_t     di_flags;       /*  88: Status flags (chflags). */
        int32_t       di_extsize;     /*  92: External attributes block. */
        ufs2_daddr_t  di_extb[NXADDR];/*  96: External attributes block. */
        ufs2_daddr_t  di_db[NDADDR];  /* 112: Direct disk blocks. */
        ufs2_daddr_t  di_ib[NIADDR];  /* 208: Indirect disk blocks. */
        int64_t       di_spare[3];    /* 232: Reserved; currently unused */
};


struct dirent {
        ino_t d_ino;
        char d_name[NAME_MAX+1];
};

struct stat {
        …
        short nlinks;
        …
};

[Figure: a directory is an array of dirent entries; each entry's i-node number points to an i-node, and each i-node represents a file.]

File System Buffer Cache

[Figure: the buffer cache sits between applications and the disk.]

application: read/write files

OS: translates files to disk blocks; maintains the buffer cache

hardware: controls disk accesses (read/write blocks)

Any problems?

File System Consistency

  • To maintain file system consistency, the ordering of updates from the buffer cache to disk is critical

  • Example:

    • if the directory block is written back before the i-node and the system crashes, the directory structure will be inconsistent

File System Consistency

  • File systems almost always use a buffer/disk cache for performance reasons

  • This problem is critical especially for the blocks that contain control information: i-nodes, the free list, directory blocks

  • Two copies of a disk block (buffer cache, disk) → a consistency problem if the system crashes before all the modified blocks are written back to disk

  • Write back critical blocks from the buffer cache to disk immediately

  • Data blocks are also written back periodically: sync

Two Strategies

  • Prevention

    • Use un-buffered I/O when writing i-nodes or pointer blocks

    • Use buffered I/O for other writes and force sync every 30 seconds

  • Detect and Fix

    • Detect the inconsistency

    • Fix them according to the “rules”

    • fsck (the File System Checker)

File System Integrity

  • Block consistency:

    • Block-in-use table

    • Free-list table

  • File consistency:

    • how many directory entries point to that i-node?

    • nlink?

    • three cases: D == L, L > D, D > L

      • What to do with the latter two cases?

[Figure: fsck's two per-block counter tables — occurrences of each block in files (block-in-use table) and on the free list. A block counted 0/0 is missing; 0/2 appears twice on the free list; 2/0 is shared by two files.]
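A toy version of that block-consistency pass; the two in-memory arrays stand in for the real scans of every i-node and the free list, and NBLOCKS is illustrative:

#include <stdio.h>

#define NBLOCKS 16

int main(void)
{
    int in_use[NBLOCKS]  = {0};  /* filled by walking every i-node (stubbed) */
    int in_free[NBLOCKS] = {0};  /* filled by walking the free list (stubbed) */

    in_use[3] = 1; in_free[5] = 1;   /* consistent examples */
    in_free[9] = 2;                  /* block 9: on the free list twice */
    /* block 7 stays 0/0: missing */

    for (int b = 0; b < NBLOCKS; b++) {
        if (in_use[b] + in_free[b] == 0)
            printf("block %d missing: add it to the free list\n", b);
        else if (in_free[b] > 1)
            printf("block %d duplicated in free list: rebuild the free list\n", b);
        else if (in_use[b] > 1)
            printf("block %d in multiple files: copy it so each file has its own\n", b);
        else if (in_use[b] == 1 && in_free[b] == 1)
            printf("block %d both used and free: remove it from the free list\n", b);
    }
    return 0;
}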

File System Integrity

  • File system states

    (a) consistent

    (b) missing block

    (c) duplicate block in free list

    (d) duplicate data block

Metadata Operations

  • Metadata operations modify the structure of the file system

    • Creating, deleting, or renaming files, directories, or special files

    • Directory & I-node

  • Data must be written to disk in such a way that the file system can be recovered to a consistent state after a system crash

Metadata Integrity

  • FFS uses synchronous writes to guarantee the integrity of metadata

    • Any operation modifying multiple pieces of metadata will write its data to disk in a specific order

    • These writes will be blocking

  • Guarantees integrity and durability of metadata updates

Deleting a file (I)

[Figure: a directory block with entries abc → i-node-1, def → i-node-2, ghi → i-node-3.]

Assume we want to delete file “def”

Deleting a file (II)

[Figure: i-node-2 is gone, but the directory entry “def” still points at it (marked “?”).]

Cannot delete the i-node before the directory entry “def”

Deleting a file (III)

  • The correct sequence is

    • Write to disk the directory block containing the deleted directory entry “def”

    • Write to disk the i-node block containing the deleted i-node

  • Leaves the file system in a consistent state

    Creating a file (I)

[Figure: a directory block with entries abc → i-node-1, ghi → i-node-3.]

Assume we want to create new file “tuv”

    Creating a file (II)

[Figure: the entry “tuv” has been added, but it has no i-node yet (marked “?”).]

Cannot write the directory entry “tuv” before its i-node

    Creating a file (III)

    • The correct sequence is

      • Write to disk the i-node block containing the new i-node

      • Write to disk the directory block containing the new directory entry

  • Leaves the file system in a consistent state
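A user-level sketch of the idea, using fsync(2) to force each metadata write to disk in order; the two file descriptors are illustrative stand-ins for the i-node block and the directory block:

#include <unistd.h>

/* Illustrative ordering for file creation: persist the i-node block first,
 * then the directory block, so a crash never leaves a directory entry
 * pointing at an uninitialized i-node. */
static void create_ordered(int fd_inode_blk, const void *inode, size_t ilen,
                           int fd_dir_blk, const void *entry, size_t dlen)
{
    write(fd_inode_blk, inode, ilen);
    fsync(fd_inode_blk);          /* blocking: the i-node reaches the disk */

    write(fd_dir_blk, entry, dlen);
    fsync(fd_dir_blk);            /* only now does the entry become visible */
}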

    Synchronous Updates

    • Used by FFS to guarantee consistency of metadata:

      • All metadata updates are done through blocking writes

    • Increases the cost of metadata updates

    • Can significantly impact the performance of the whole file system

    SOFT UPDATES

    • Use delayed writes (write back)

    • Maintain dependency information about cached pieces of metadata:

      This i-node must be updated before/after this directory entry

    • Guarantee that metadata blocks are written to disk in the required order

    3 Soft Update Rules

    • Never point to a structure before it has been initialized.

    • Never reuse a resource before nullifying all previous pointers to it.

    • Never reset the old pointer to a live resource before the new pointer has been set.

    Problem #1 with S.U.

    • Synchronous writes guaranteed that metadata operations were durable once the system call returned

    • Soft Updates guarantee that file system will recover into a consistent state but not necessarily the most recent one

      • Some updates could be lost


What are the dependency relationships?

We want to delete file “foo” and create new file “bar”

[Figure: directory Block A holds the old entry “foo” and the NEW entry “bar”; i-node Block B holds i-node-2 (foo's, to be freed) and the NEW i-node-3 (bar's).]


Circular Dependency

We want to delete file “foo” and create new file “bar”

[Figure: the same two blocks; the delete requires Block A to be written first (Y-1st), while the create requires Block B to be written first (X-2nd).]

    Problem #2 with S.U.

    • Cyclical dependencies:

      • Same directory block contains entries to be created and entries to be deleted

      • These entries point to i-nodes in the same block

    • Brainstorming:

      • How to resolve this issue in S.U.?

    FS: buffer or disk??

    • They appear in both, and we try to synchronize them…

    Disk

[Figure: on disk, directory Block A still holds the entry “foo” and i-node Block B still holds i-node-2.]

    Buffer

[Figure: in the buffer cache, Block A holds the NEW entry “bar” and Block B holds the NEW i-node-3.]

    Synchronize??

[Figure: the merged blocks — Block A: “foo” (being deleted) plus the NEW “bar”; Block B: i-node-2 (being freed) plus the NEW i-node-3.]


Solution in S.U.

    • Roll back metadata in one of the blocks to an earlier, safe state

      (the safe state does not contain the new directory entry)

[Figure: Block A′ is the rolled-back copy of the directory block.]



    • Write first the block whose metadata were rolled back (Block A′ in the example)

    • Write the blocks that can be written once the first block is on disk (Block B in the example)

    • Roll forward the block that was rolled back

    • Write that block

    • Breaks the cyclical dependency, but Block A must now be written twice


Before any write operation: SU dependency checking (roll back if necessary)

After any write operation: SU dependency processing (task-list updating; roll forward if necessary)

Journaling: metadata operations and recovery

    • Journaling systems maintain an auxiliary log that records all meta-data operations

    • Write-ahead logging ensures that the log is written to disk before any blocks containing data modified by the corresponding operations.

      • After a crash, can replay the log to bring the file system to a consistent state
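A toy write-ahead-logging sketch of that rule; the record format and the two file descriptors are illustrative, not any real journal layout:

#include <stdio.h>
#include <unistd.h>

/* Toy write-ahead logging: the log record must reach the disk before the
 * in-place metadata write it describes. */
static void journaled_update(int log_fd, int meta_fd,
                             off_t off, const char *newval, size_t len)
{
    char rec[128];
    int n = snprintf(rec, sizeof rec, "UPDATE off=%lld len=%zu\n",
                     (long long)off, len);
    write(log_fd, rec, n);
    fsync(log_fd);                 /* 1. the log entry reaches disk first */

    pwrite(meta_fd, newval, len, off);
    fsync(meta_fd);                /* 2. then the actual metadata block */
    /* if we crash between 1 and 2, replaying the log redoes the update */
}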

Journaling: metadata operations and recovery

    • Log writes are performed in addition to the regular writes

    • Journaling systems incur log write overhead but

      • Log writes can be performed efficiently because they are sequential (block operation consideration)

      • Metadata blocks do not need to be written back after each update

Journaling: metadata operations and recovery

    • Journaling systems can provide

      • same durability semantics as FFS if log is forced to disk after each meta-data operation

      • the laxer semantics of Soft Updates if log writes are buffered until entire buffers are full

Soft Updates vs. Journaling

    • Advantages?

    • Disadvantages?

With Soft Updates??

Do we still need fsck at boot time?

Recover the Missing Resources

    • In the background, in an active FS…

      • We don’t want to wait for the lengthy FSCK process to complete…

    • A related issue:

      • the virus scanning process

      • what happens if we get a new virus signature?

Snapshot of the FS

    • backup and restore

    • reliably dump an active file system

      • how would we dump “consistent” snapshots of our 40 GB FS today? (at midnight…)

    • “background FSCK checks”

What is a snapshot? (I mean “conceptually”.)

    • Freeze all activities related to the FS.

    • Copy everything to “some space”.

    • Resume the activities.

    How do we efficiently implement this concept such that the activities will only be blocked for about 0.25 seconds, and we don’t have to buy a really big hard drive?

Copy-on-Write

Snapshot: a file

Logical size versus physical size

Example

# mkdir /backups/usr/noon
# mount -u -o snapshot /usr/snap.noon /usr
# mdconfig -a -t vnode -u 0 -f /usr/snap.noon
# mount -r /dev/md0 /backups/usr/noon
/* do whatever you want to test it */
# umount /backups/usr/noon
# mdconfig -d -u 0
# rm -f /usr/snap.noon

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* for sleep() */

int
main(void)
{
    FILE *f1 = fopen("./sss.txt", "w");
    int i;

    for (i = 0; i < 1000; i++) {
        /* seek to a random offset, creating holes: a sparse file */
        fseek(f1, rand(), SEEK_SET);
        fprintf(f1, "%d%d%d%d", rand(), rand(), rand(), rand());
        if (i % 100 == 0) sleep(1);
    }
    fflush(f1);
    return 0;
}

Example

# mkdir /backups/usr/noon
# mount -u -o snapshot /usr/snap.noon /usr
# mdconfig -a -t vnode -u 0 -f /usr/snap.noon
# mount -r /dev/md0 /backups/usr/noon
/* do whatever you want to test it */
# umount /backups/usr/noon
# mdconfig -d -u 0
# rm -f /usr/snap.noon

Example

# mkdir /backups/usr/noon
# mount -u -o snapshot /usr/snap.noon /usr
# mdconfig -a -t vnode -u 0 -f /usr/snap.noon
# mount -r /dev/md0 /backups/usr/noon
/* do whatever you want to test it */
# umount /backups/usr/noon
# mdconfig -d -u 0
# rm -f /usr/snap.noon


Copy-on-Write

A File System

[Figure: an i-node and its block pointers for a file; how many entries fit in one disk block?]

A Snapshot i-node

[Figure: a snapshot i-node for a file; each block pointer is either “not used” or “not yet copied”.]

Copy-on-write

[Figure: before a live block is overwritten, its old contents are copied into the snapshot; pointers still marked “not used” or “not yet copied” share the live blocks.]

Copy-on-write

[Figure: after the copy, the snapshot i-node points at its own copy of the old block, while the file's i-node points at the new data.]
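A toy sketch of that copy-on-write step; the block table, the sentinel, and the fixed block size are all illustrative:

#include <stdlib.h>
#include <string.h>

#define BLKSZ 1024
#define NOT_YET_COPIED NULL   /* sentinel: snapshot still shares the live block */

/* Before the file system overwrites live_blk, give the snapshot its own copy. */
static void cow_before_write(char **snap_ptr, const char *live_blk)
{
    if (*snap_ptr == NOT_YET_COPIED) {
        *snap_ptr = malloc(BLKSZ);            /* allocate a snapshot block */
        memcpy(*snap_ptr, live_blk, BLKSZ);   /* preserve the old contents */
    }
    /* the live block may now be overwritten; the snapshot keeps the old data */
}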

Multiple Snapshots

    • about 20 snapshots

    • Interactions/sharing among snapshots

Snapshot of the FS

    • backup and restore

    • reliably dump an active file system

      • how would we dump “consistent” snapshots of our 40 GB FS today? (at midnight…)

    • “background FSCK checks”

VFS: the FS Switch

[Figure: user space → syscall layer (file, uio, etc.) → Virtual File System (VFS) → {network protocol stack (TCP/IP), FFS, LFS, NFS, *FS, etc.} → device drivers]

    • Sun Microsystems introduced the virtual file system interface in 1985 to accommodate diverse filesystem types cleanly.

      • VFS allows diverse specific file systems to coexist in a file tree, isolating all FS-dependencies in pluggable filesystem modules.

VFS was an internal kernel restructuring with no effect on the syscall interface. It incorporates object-oriented concepts: a generic procedural interface with multiple implementations, based on abstract objects with dynamic method binding by type… in C. Other abstract interfaces in the kernel: device drivers, file objects, executable files, memory objects.
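“Dynamic method binding in C” concretely means a struct of function pointers per filesystem type; a minimal sketch (the names are illustrative, not the real VOP macros):

#include <stdio.h>

/* Illustrative vnode-ops vector: each filesystem supplies its own table. */
struct vnodeops {
    int (*vop_open)(const char *name);
    int (*vop_read)(const char *name, char *buf, int len);
};

static int ufs_open(const char *n) { printf("ufs: open %s\n", n); return 0; }
static int ufs_read(const char *n, char *b, int l) { (void)b; printf("ufs: read %d from %s\n", l, n); return 0; }
static int nfs_open(const char *n) { printf("nfs: open %s over RPC\n", n); return 0; }
static int nfs_read(const char *n, char *b, int l) { (void)b; printf("nfs: read %d from %s over RPC\n", l, n); return 0; }

static struct vnodeops ufs_ops = { ufs_open, ufs_read };
static struct vnodeops nfs_ops = { nfs_open, nfs_read };

struct vnode { struct vnodeops *v_op; const char *v_name; };

int main(void)
{
    char buf[64];
    struct vnode local  = { &ufs_ops, "/etc/motd" };
    struct vnode remote = { &nfs_ops, "/net/server/file" };
    /* the same generic call dispatches to the right filesystem */
    local.v_op->vop_open(local.v_name);
    local.v_op->vop_read(local.v_name, buf, 64);
    remote.v_op->vop_open(remote.v_name);
    remote.v_op->vop_read(remote.v_name, buf, 64);
    return 0;
}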


vnode

[Figure: the syscall layer sits above the vnode layer; NFS and UFS sit below; inactive vnodes are kept on a free list.]

    • In the VFS framework, every file or directory in active use is represented by a vnode object in kernel memory.

Each vnode has a standard file attributes struct. The generic vnode points at a filesystem-specific struct (e.g., inode, rnode), seen only by the filesystem. Each specific file system maintains a cache of its resident vnodes. Vnode operations are macros that vector to filesystem-specific procedures.

vnode Operations and Attributes

vnode attributes (vattr):
    type (VREG, VDIR, VLNK, etc.), mode (9+ bits of permissions), nlink (hard link count), owner user ID, owner group ID, filesystem ID, unique file ID, file size (bytes and blocks), access time, modify time, generation number

directories only:
    vop_lookup (OUT vpp, name)
    vop_create (OUT vpp, name, vattr)
    vop_remove (vp, name)
    vop_link (vp, name)
    vop_rename (vp, name, tdvp, tvp, name)
    vop_mkdir (OUT vpp, name, vattr)
    vop_rmdir (vp, name)
    vop_symlink (OUT vpp, name, vattr, contents)
    vop_readdir (uio, cookie)
    vop_readlink (uio)

files only:
    vop_getpages (page**, count, offset)
    vop_putpages (page**, count, sync, offset)
    vop_fsync ()

generic operations:
    vop_getattr (vattr)
    vop_setattr (vattr)
    vhold()
    vholdrele()

Network File System (NFS)

[Figure: on the client, user programs enter the syscall layer and VFS, which routes local files to UFS and remote files to the NFS client; the NFS client talks over the network to the NFS server, which goes through the server's syscall layer/VFS down to its own UFS.]

vnode Cache

[Figure: vnodes are hashed by HASH(fsid, fileid); inactive vnodes hang off the VFS free list head.]

Active vnodes are reference-counted by the structures that hold pointers to them:
    - system open file table
    - process current directory
    - file system mount points
    - etc.

Each specific file system maintains its own hash of vnodes (BSD):
    - the specific FS handles initialization
    - the free list is maintained by VFS

vget(vp): reclaim cached inactive vnode from the VFS free list
vref(vp): increment reference count on an active vnode
vrele(vp): release reference count on a vnode
vgone(vp): vnode is no longer valid (file is removed)

struct vnode {
        struct mtx v_interlock;         /* lock for "i" things */
        u_long v_iflag;                 /* i vnode flags (see below) */
        int v_usecount;                 /* i ref count of users */
        long v_numoutput;               /* i writes in progress */
        struct thread *v_vxthread;      /* i thread owning VXLOCK */
        int v_holdcnt;                  /* i page & buffer references */
        struct buflists v_cleanblkhd;   /* i SORTED clean blocklist */
        struct buf *v_cleanblkroot;     /* i clean buf splay tree */
        int v_cleanbufcnt;              /* i number of clean buffers */
        struct buflists v_dirtyblkhd;   /* i SORTED dirty blocklist */
        struct buf *v_dirtyblkroot;     /* i dirty buf splay tree */
        int v_dirtybufcnt;
        /* … */

Distributed FS

[Figure: ftp.cs.ucdavis.edu exports fs0: /dev/hd0a, a root tree (/ with usr, sys, dev, etc, bin); Server.yahoo.com exports fs0: /dev/hd0e, another tree (/ with local, adm, home, lib, bin).]


logical disks

[Figure: fs0: /dev/hd0a holds the root tree (/ with usr, sys, dev, etc, bin); fs1: /dev/hd0e holds a second tree (/ with local, adm, home, lib, bin) grafted in at /usr.]

mount -t ufs /dev/hd0e /usr

mount -t nfs 152.1.23.12:/export/cdrom /mnt/cdrom

Correctness

    • One-copy Unix Semantics

      • every modification to every byte of a file has to be immediately and permanently visible to every client.

Correctness

    • One-copy Unix Semantics

      • every modification to every byte of a file has to be immediately and permanently visible to every client.

      • Conceptually: sequential access to the FS

        • Makes sense in a local file system

        • Single processor versus shared memory

    • Is this necessary?

DFS Architecture

    • Server

      • storage for the distributed/shared files.

      • provides an access interface for the clients.

    • Client

      • consumer of the files.

      • runs applications in a distributed environment.

[Figure: applications on the client invoke open/close, read/write, opendir/stat, and readdir against the server's access interface.]

NFS (SUN, 1985)

    • Based on RPC (Remote Procedure Call) and XDR (Extended Data Representation)

    • Server maintains no state

      • a READ on the server opens, seeks, reads, and closes

      • a WRITE is similar, but the buffer is flushed to disk before closing

    • Server crash: client continues to try until server reboots – no loss

    • Client crashes: client must rebuild its own state – no effect on server

RPC - XDR

    • RPC: Standard protocol for calling procedures in another machine

    • Procedure is packaged with authorization and admin info

    • XDR: standard format for data, because manufacturers of computers cannot agree on byte ordering.


rpcgen

[Figure: rpcgen takes an RPC program definition (its data structures and procedures) and generates RPC client.c, RPC.h, and RPC server.c.]

NFS Operations

    • Every operation is independent: server opens file for every operation

    • File identified by handle -- no state information retained by server

    • client maintains mount table, v-node, offset in file table etc.

    What do these imply???


[Figure: on a client computer, application programs make UNIX system calls into the UNIX kernel; the virtual file system routes operations on local files to the UNIX file system (or another local file system) and operations on remote files to the NFS client, which speaks the NFS protocol (remote operations) to the NFS server on the server computer; the server's virtual file system hands the requests to its own UNIX file system.]

mount -t nfs home.yahoo.com:/pub/linux /mnt/linux


State-ful vs. State-less

    • A server is fully aware of its clients

      • does the client have the newest copy?

      • what is the offset of an opened file?

      • “a session” between a client and a server!

    • A server is completely unaware of its clients

      • memory-less: I do not remember you!!

      • Just tell me what you want to get (and where).

      • I am not responsible for your offset values (the client needs to maintain the state).
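The contrast in C: with a state-less server, every read names the file handle and offset explicitly, so the client keeps the offset. A hedged sketch (nfs_read_rpc() is an illustrative stand-in for the NFS READ procedure, not a real API):

#include <stddef.h>

/* Hypothetical stub for the NFS READ remote procedure: the request carries
 * everything the server needs; the server remembers nothing. */
extern int nfs_read_rpc(const char *file_handle, long offset,
                        char *buf, size_t len);

struct client_file {          /* all session state lives on the client */
    const char *handle;       /* opaque file handle from the server */
    long offset;              /* the client's own notion of position */
};

static int client_read(struct client_file *f, char *buf, size_t len)
{
    int n = nfs_read_rpc(f->handle, f->offset, buf, len);
    if (n > 0)
        f->offset += n;       /* the client, not the server, advances it */
    return n;
}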

The State

[Figure: a state-ful server tracks open/read/stat/lseek sessions and holds the offset for its applications; a state-less server leaves the offset with the applications on the client.]

Network File Sharing

    • Server side:

      • rpcbind (portmap)

      • mountd - responds to mount requests (sometimes called rpc.mountd).

        • Relies on several files

          • /etc/dfs/dfstab,

          • /etc/exports,

          • /etc/netgroup

      • nfsd - serves files - actually a call to kernel level code.

      • lockd – file locking daemon.

      • statd – manages locks for lockd.

      • rquotad – manages quotas for exported file systems.

Network File Sharing

    • Client Side

      • biod - client side caching daemon

      • mount must understand the hostname:directory convention.

      • Filesystem entries in /etc/[v]fstab tell the client what filesystems to mount.

Unix file semantics

    • NFS:

      • open a file with read-write mode

      • later, the server’s copy becomes read-only mode

      • now, the application tries to write it!!

Problems with NFS

    • Performance is not scalable:

      • maybe it is OK for a local office.

      • it will be horrible with large-scale systems.


    • Similar to UNIX file caching for local files:

      • pages (blocks) from disk are held in a main memory buffer cache until the space is required for newer pages. Read-ahead and delayed-write optimisations.

      • For local files, writes are deferred to next sync event (30 second intervals)

      • Works well in local context, where files are always accessed through the local cache, but in the remote case it doesn't offer necessary synchronization guarantees to clients.

    • NFS v3 servers offers two strategies for updating the disk:

      • write-through - altered pages are written to disk as soon as they are received at the server. When a write() RPC returns, the NFS client knows that the page is on the disk.

      • delayed commit - pages are held only in the cache until a commit() call is received for the relevant file. This is the default mode used by NFS v3 clients. A commit() is issued by the client whenever a file is closed.


    • Server caching does nothing to reduce RPC traffic between client and server

      • further optimisation is essential to reduce server load in large networks

      • NFS client module caches the results of read, write, getattr, lookup and readdir operations

      • synchronization of file contents (one-copy semantics) is not guaranteed when two or more clients are sharing the same file.

    • Timestamp-based validity check

      • reduces inconsistency, but doesn't eliminate it

      • validity condition for cache entries at the client: (T − Tc < t) ∨ (Tm_client = Tm_server)

      • t is configurable (per file) but is typically set to 3 seconds for files and 30 secs. for directories

      • it remains difficult to write distributed applications that share files with NFS

    where t = freshness guarantee, Tc = time when the cache entry was last validated, Tm = time when the block was last updated at the server, T = current time
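The same condition as a one-line predicate (a sketch; time values are in seconds and the names mirror the legend above):

#include <stdbool.h>
#include <time.h>

/* NFS-style cache validity: recently validated, or timestamps still agree. */
static bool cache_entry_valid(time_t T, time_t Tc, time_t t_fresh,
                              time_t Tm_client, time_t Tm_server)
{
    return (T - Tc < t_fresh) || (Tm_client == Tm_server);
}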

AFS: client and server

    • State-ful clients and servers.

    • Caching the files at clients.

      • File close ==> check in the changes.

    • How to maintain consistency?

      • Using “callbacks” in v2/3 (Valid or Cancelled)

[Figure: applications open/read through the client's cache; the server's callback invalidates the cache and the client re-caches.]

Why AFS?

    • Shared files are infrequently updated

    • Local cache of a few hundred megabytes

      • now 50~100 gigabytes

    • Unix workload:

      • Files are small, Read Operations dominated, sequential access is common, read/written by one user, reference bursts.

      • Are these still true?

Fault Tolerance in AFS

    • a server crashes

    • a client crashes

      • check for call-back tokens first.

Problems with AFS

    • Availability

    • what happens if call-back itself is lost??

GFS – Google File System

    • “failures” are the norm

    • Multiple-GB files are common

    • Append rather than overwrite

      • Random writes are rare

    • Can we relax the consistency?

The Master

    • Maintains all file system metadata.

      • names space, access control info, file to chunk mappings, chunk (including replicas) location, etc.

    • Periodically communicates with chunkservers in HeartBeat messages to give instructions and check state

    ecs150, Fall 2007


    The master154 l.jpg
The Master

    • Helps make sophisticated chunk placement and replication decisions, using global knowledge

    • For reading and writing, client contacts Master to get chunk locations, then deals directly with chunkservers

      • Master is not a bottleneck for reads/writes

    ecs150, Fall 2007


    Chunkservers l.jpg
Chunkservers

    • Files are broken into chunks. Each chunk has an immutable, globally unique 64-bit chunk handle.

      • handle is assigned by the master at chunk creation

    • Chunk size is 64 MB

    • Each chunk is replicated on 3 (default) servers

Clients

    • Linked to apps using the file system API.

    • Communicates with master and chunkservers for reading and writing

      • Master interactions only for metadata

      • Chunkserver interactions for data

    • Only caches metadata information

      • Data is too large to cache.

Chunk Locations

    • Master does not keep a persistent record of locations of chunks and replicas.

    • Polls chunkservers for this at startup, and when chunkservers join or leave.

    • Stays up to date by controlling placement of new chunks and through HeartBeat messages (when monitoring chunkservers)

CODA

    • Server Replication:

      • if one server goes down, I can get another.

    • Disconnected Operation:

      • if all go down, I will use my own cache.

Disconnected Operation

    • Continue critical work when that repository is inaccessible.

    • Key idea: caching data.

      • Performance

      • Availability

    • Server Replication

Consistency

    • If John updates file X on server A and Mary reads file X on server B…

    Read-one & Write-all

Read x & Write (N−x+1)

[Figure: a read touches any x replicas; a write touches any N−x+1 replicas, so every read quorum overlaps every write quorum.]
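Why the quorums always overlap: x + (N − x + 1) = N + 1 > N, so at least one replica sits in both sets. A tiny sketch (the replica array and version counters are illustrative):

#include <stdio.h>

#define N  7   /* total replicas */
#define RQ 3   /* read quorum: x */
#define WQ 5   /* write quorum: N - x + 1 = 7 - 3 + 1 */

int main(void)
{
    int version[N] = {0};

    /* a write updates any WQ replicas, stamping a newer version */
    for (int i = 0; i < WQ; i++)
        version[i] = 1;

    /* a read of any RQ replicas must see at least one version-1 copy:
       RQ + WQ = N + 1 > N, so the two sets always intersect */
    int newest = 0;
    for (int i = N - RQ; i < N; i++)   /* worst case: the "last" RQ replicas */
        if (version[i] > newest) newest = version[i];
    printf("newest version seen by the read: %d\n", newest);
    return 0;
}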

Example: R3W4 (6+1)

Initial    0 0 0 0 0 0
Alice-W2   2 0 2 2 0
Bob-W      2 3 3 3 3 0
Alice-R    2 3 3 3 3 0
Chris-W    2 1 1 1 1 0
Dan-R      2 1 1 1 1 0
Emily-W7   7 1 1 1 7
Frank-R    7 7 1 1 1 7
