  1. Today • Implementing filesystems on disk • Trade-offs and performance • Look at some of the VFS objects for Linux • No complete listings • Example of filesystem implementation • Ext2/3

  2. Filesystem Implementation A possible file system layout

  3. Files consist of blocks of data • Where to store/allocate blocks? • How to find files/blocks? • What is a good block size? [Figure: logical block addresses 1 to 12 mapped to scattered physical block addresses]

  4. Implementing Files (1) (a) Contiguous allocation of disk space for 7 files (b) State of the disk after files D and E have been removed

  5. Contiguous Allocation • Finding files/blocks is easy • Offset + number of blocks • Excellent read performance • Fragmentation • Compaction • Reuse of holes • Need to know max file size when allocating • Where could this allocation be useful? • What is the standard alternative to static allocation in computer science (think arrays in C)?
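
The "offset + number of blocks" lookup can be sketched as follows. This is a minimal illustration, assuming a hypothetical per-file record `struct extent` and a helper `contiguous_lookup`; neither name comes from the slides.

```c
#include <stdint.h>

/* Hypothetical per-file record for contiguous allocation:
 * the file occupies physical blocks [start, start + nblocks). */
struct extent {
    uint32_t start;    /* first physical block of the file */
    uint32_t nblocks;  /* number of blocks reserved for the file */
};

/* Translate a logical block number inside the file to a physical block.
 * Returns 0 on success, -1 if the logical block lies outside the file. */
int contiguous_lookup(const struct extent *f, uint32_t logical,
                      uint32_t *physical)
{
    if (logical >= f->nblocks)
        return -1;
    *physical = f->start + logical;  /* one addition, no disk access */
    return 0;
}
```

The lookup needs no disk access, which is why finding blocks is easy; the price is that `nblocks` has to be fixed when the file is allocated.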

  6. Implementing Files (2) Storing a file as a linked list of disk blocks How much data can be stored in 10 blocks?

  7. Linked List Allocation • No holes, no pre-allocation problem • Only address of first block needs to be stored • Finding block n is expensive • Need to read all n-1 blocks prior to block n • Size of the data in a block is not 2^x • The pointer is not data • Both disadvantages can be removed using a new data structure. Which one?
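
Both drawbacks can be made concrete with a small in-memory model. This is a sketch only: the 1 KB block size, the 4-byte pointer, the end marker and all names (`struct lblock`, `chain_capacity`, `nth_block`) are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE  1024u             /* assumed block size in bytes */
#define PTR_SIZE    sizeof(uint32_t)  /* assumed size of the next pointer */
#define END_OF_FILE 0xFFFFFFFFu       /* hypothetical end-of-chain marker */

/* One disk block: the next pointer takes space away from the data. */
struct lblock {
    uint32_t next;
    unsigned char data[BLOCK_SIZE - PTR_SIZE];
};

/* Usable data bytes in a chain of nblocks blocks. */
size_t chain_capacity(size_t nblocks)
{
    return nblocks * (BLOCK_SIZE - PTR_SIZE);
}

/* Find block n (0-based) of a file whose first block is `first`.
 * *reads counts the blocks that had to be read just to follow pointers. */
uint32_t nth_block(const struct lblock *disk, uint32_t first, size_t n,
                   size_t *reads)
{
    uint32_t cur = first;
    *reads = 0;
    while (n-- > 0 && cur != END_OF_FILE) {
        cur = disk[cur].next;  /* the block must be read to learn its successor */
        (*reads)++;
    }
    return cur;
}
```

Under these assumptions, 10 blocks hold 10 × 1020 = 10200 bytes rather than 10240, and reaching block n costs n block reads.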

  8. Implementing Files – FAT • File A: 4 – 7 – 2 – 10 – 12 • Idea: store the pointers in a table • Fast random access • Table can be stored in RAM • Full 2^x block size • This method is called FAT (File Allocation Table) • Disadvantage: table size. 20 GB disk, block size 1 KB → 20M blocks → 80 MB (4-byte entries) or 60 MB (3-byte entries) • What can we do to reduce the RAM requirement?
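
The table lookup and the size estimate can be checked with a few lines of C. This is a sketch under assumptions: the end marker `FAT_EOF` and the function names are made up here, and the table is a plain array indexed by block number.

```c
#include <stddef.h>
#include <stdint.h>

#define FAT_EOF 0xFFFFFFFFu  /* hypothetical end-of-chain marker */

/* Bytes of RAM needed for a table with one entry per disk block. */
uint64_t fat_table_bytes(uint64_t disk_bytes, uint64_t block_bytes,
                         uint64_t entry_bytes)
{
    return (disk_bytes / block_bytes) * entry_bytes;
}

/* Physical block holding logical block n (0-based) of the file that
 * starts at `first`, or FAT_EOF if the chain is shorter than that.
 * Only the in-memory table is consulted; no data block is read. */
uint32_t fat_lookup(const uint32_t *fat, uint32_t first, size_t n)
{
    uint32_t cur = first;
    while (n-- > 0 && cur != FAT_EOF)
        cur = fat[cur];
    return cur;
}
```

For file A from the slide (4, 7, 2, 10, 12), `fat[4]` is 7, `fat[7]` is 2 and so on, so logical block 2 is found in physical block 2 after two table lookups. With binary units, `fat_table_bytes` gives 80 MB for 4-byte entries on a 20 GB disk with 1 KB blocks.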

  9. FAT → inodes • Do we actually need to have the whole table in memory all the time? • Actually, only open files need to be there… • Split the table into per-file tables, called inodes (index nodes)

  10. Implementing Files (4) An example i-node

  11. Indirect Addressing
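
A rough sketch of what indirect addressing buys. The layout is an assumption, since the slide gives no numbers: 12 direct pointers plus one single, one double and one triple indirect pointer, each pointer 4 bytes (the layout ext2 uses). The function names are made up.

```c
#include <stdint.h>

#define NDIRECT   12u  /* assumed number of direct pointers in the inode */
#define PTR_BYTES 4u   /* assumed size of one block pointer */

/* Maximum number of data blocks reachable from one inode. */
uint64_t max_file_blocks(uint64_t block_bytes)
{
    uint64_t p = block_bytes / PTR_BYTES;  /* pointers per indirect block */
    return NDIRECT + p + p * p + p * p * p;
}

/* Levels of indirection needed to reach a logical block:
 * 0 = direct, 1 = single, 2 = double, 3 = triple, -1 = beyond the limit. */
int indirection_level(uint64_t logical, uint64_t block_bytes)
{
    uint64_t p = block_bytes / PTR_BYTES;
    if (logical < NDIRECT)
        return 0;
    logical -= NDIRECT;
    if (logical < p)
        return 1;
    logical -= p;
    if (logical < p * p)
        return 2;
    logical -= p * p;
    if (logical < p * p * p)
        return 3;
    return -1;
}
```

With 1 KB blocks each indirect block holds 256 pointers, so small files are reached directly and only large files pay for the extra reads.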

  12. Implementing Directories • Directories are files with inode pointers • Directory systems should translate a name to a file (inode) • dentry keeps this info in VFS • We also need to store the attributes of a file • Directly in the directory • In the inodes

  13. Implementing Directories (1) (a) A simple directory containing fixed-size entries, with the disk addresses and attributes in the directory entry (b) Directory in which each entry just refers to an inode

  14. Implementing Directories (2) • Two ways of handling long file names in directory • (a) In-line • (b) In a heap

  15. Locating /usr/ast/mbox

  16. Shared Files File system containing a shared file

  17. Hard/Soft Links • Hard links are actually the same file • Share the same inode • Will be seen as the same file everywhere • Same owner • Same contents • Same permissions • Keeps counter • Symbolic links are dereferenced • A special file • Different owners/permissions • Can cross filesystem boundaries

  18. Demo Execute as u1=user1, u2=user2 (make sure that user2 has write permissions) • u1: echo Hi > file-u1 • u2: ln file-u1 file-u2 • u2: ln -s file-u1 file-u2-s • u2: cat file-u2 • u2: cat file-u2-s • u1: echo again >> file-u1 • u1: rm file-u1 • u2: cat file-u2 • u2: cat file-u2-s What is the output of lines 4, 5 and 8, 9? Why?
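
The same sequence can be replayed from C with the POSIX calls `link`, `symlink` and `unlink`. This is a single-user sketch, so the ownership aspect of the demo is not modeled; `struct link_result`, `link_demo` and the two helpers are hypothetical names.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Append text to path, creating the file if needed. Returns 0 on success. */
static int append_text(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, text, strlen(text));
    close(fd);
    return n == (ssize_t)strlen(text) ? 0 : -1;
}

/* Read up to cap-1 bytes of path into buf. Returns bytes read or -1. */
static ssize_t read_text(const char *path, char *buf, size_t cap)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, cap - 1);
    close(fd);
    if (n >= 0)
        buf[n] = '\0';
    return n;
}

struct link_result {
    nlink_t links_before;  /* inode link count before the rm */
    nlink_t links_after;   /* inode link count after the rm */
    char via_hard[64];     /* contents read through the hard link */
    int symlink_readable;  /* 1 if the symlink still resolves */
};

/* Replay the demo inside directory dir. Returns 0 on success. */
int link_demo(const char *dir, struct link_result *r)
{
    char orig[256], hard[256], soft[256], tmp[64];
    struct stat st;

    snprintf(orig, sizeof orig, "%s/file-u1", dir);
    snprintf(hard, sizeof hard, "%s/file-u2", dir);
    snprintf(soft, sizeof soft, "%s/file-u2-s", dir);

    if (append_text(orig, "Hi\n") != 0) return -1;     /* echo Hi > file-u1 */
    if (link(orig, hard) != 0) return -1;              /* ln file-u1 file-u2 */
    if (symlink("file-u1", soft) != 0) return -1;      /* ln -s file-u1 file-u2-s */
    if (append_text(orig, "again\n") != 0) return -1;  /* echo again >> file-u1 */
    if (stat(hard, &st) != 0) return -1;
    r->links_before = st.st_nlink;
    if (unlink(orig) != 0) return -1;                  /* rm file-u1 */
    if (stat(hard, &st) != 0) return -1;
    r->links_after = st.st_nlink;
    if (read_text(hard, r->via_hard, sizeof r->via_hard) < 0) return -1;
    r->symlink_readable = read_text(soft, tmp, sizeof tmp) >= 0;
    return 0;
}
```

After the `unlink`, the hard link still reaches the inode (its link count drops from 2 to 1 and both lines of text remain readable), while the symbolic link only stores the name `file-u1` and no longer resolves.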

  19. Shared Files (a) Situation prior to linking (b) After the link is created (c) After the original owner removes the file

  20. Mounting • The directory inode indicates that it is a mount point [Figure: directory tree rooted at / with entries usr, bin, tmp and windows; entries below the mount point: Windows, Temp, Documents and Settings]

  21. Disk Space Management • Dark line (left hand scale) gives data rate of a disk • Dotted line (right hand scale) gives disk space efficiency • All files are 2 KB • Horizontal axis: block size

  22. Keeping track of free blocks (a) Storing the free list on a linked list (b) A bit map

  23. Keeping track of free blocks • Bitmaps are generally smaller • Linked lists can use free blocks … • Only one block of the linked list needed in main memory • The others are read/written on demand • Problems?
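
A bitmap allocator fits in a few lines. This is a minimal sketch, assuming the convention that a set bit means "block in use"; `struct bitmap` and the `bm_*` functions are illustrative names.

```c
#include <stddef.h>
#include <stdint.h>

/* Free-block bitmap: bit i is 1 when block i is in use (assumed convention). */
struct bitmap {
    uint8_t *bits;
    size_t nblocks;
};

static int bm_in_use(const struct bitmap *bm, size_t i)
{
    return (bm->bits[i / 8] >> (i % 8)) & 1;
}

/* Allocate the first free block. Returns its number, or -1 if none is free. */
long bm_alloc(struct bitmap *bm)
{
    for (size_t i = 0; i < bm->nblocks; i++) {
        if (!bm_in_use(bm, i)) {
            bm->bits[i / 8] |= (uint8_t)(1u << (i % 8));
            return (long)i;
        }
    }
    return -1;
}

/* Mark block i as free again. */
void bm_free(struct bitmap *bm, size_t i)
{
    bm->bits[i / 8] &= (uint8_t)~(1u << (i % 8));
}

/* Size of the bitmap itself: one bit per block, rounded up to whole bytes. */
size_t bm_bytes(size_t nblocks)
{
    return (nblocks + 7) / 8;
}
```

For the 20M-block disk used earlier the bitmap takes 2.5 MB regardless of how full the disk is, whereas a list with 4-byte entries is smaller only when fewer than one block in 32 is free.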

  24. Keeping track of free blocks (a) Almost-full block of pointers to free disk blocks in RAM - three blocks of pointers on disk (b) Result of freeing a 3-block file (c) Alternative strategy for handling 3 free blocks - shaded entries are pointers to free disk blocks

  25. Quota Quotas for keeping track of each user’s disk use

  26. Backups • Performing filesystem backups is essential for reliable systems • Two types • Full • Incremental • Typically a mixed algorithm is used • How to keep track of which files to save?

  27. Backups • A filesystem to be dumped • Squares are directories, circles are files • Shaded items have been modified since the last dump • Each directory and file is labeled by its i-node number • Figure callout: file that has not changed

  28. Backups • Commonly all modified files and directories above them are stored • Can restore on another filesystem • Individual files can be restored from incremental backup • Bitmaps are used to find the modified inodes

  29. Backups 4 phases of the algorithm • Recursively mark each dir and each modified inode (a) • Recursively unmark non-modified dirs (b) • Dump all directories (c) • Dump all modified inodes (d)
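
The marking phases can be sketched on a small in-memory tree. Everything here is an assumption made for illustration: `struct node` stands in for an inode, the `marked` flag stands in for a bit in the dump bitmap, and the actual dumping in phases 3 and 4 is replaced by counting.

```c
#include <stddef.h>

/* Hypothetical in-memory model of one inode in the tree to be dumped. */
struct node {
    int is_dir;           /* 1 for directories, 0 for files */
    int modified;         /* changed since the last dump */
    int marked;           /* stands in for the bit in the dump bitmap */
    size_t nchild;
    struct node **child;  /* children, for directories only */
};

/* Phase 1: mark every directory and every modified file. */
void phase1_mark(struct node *n)
{
    n->marked = n->is_dir || n->modified;
    for (size_t i = 0; i < n->nchild; i++)
        phase1_mark(n->child[i]);
}

/* Phase 2: unmark directories with nothing modified in or below them.
 * Returns nonzero if the subtree rooted at n contains a modified inode. */
int phase2_unmark(struct node *n)
{
    int keep = n->modified;
    for (size_t i = 0; i < n->nchild; i++)
        keep |= phase2_unmark(n->child[i]);  /* visit every child */
    if (n->is_dir && !keep)
        n->marked = 0;
    return keep;
}

/* Stand-in for phases 3 and 4: count the marked directories (want_dirs = 1)
 * or the marked files (want_dirs = 0) that a real dump would write out. */
size_t count_marked(const struct node *n, int want_dirs)
{
    size_t total = (n->marked && n->is_dir == want_dirs) ? 1 : 0;
    for (size_t i = 0; i < n->nchild; i++)
        total += count_marked(n->child[i], want_dirs);
    return total;
}
```

Directories on the path to a modified file stay marked after phase 2, which is what allows the dump to be restored on another filesystem.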

  30. The Common File Model • struct file (from <fs.h>): represents an open file in a process • struct dentry (from <dcache.h>): represents a directory entry • struct inode (from <fs.h>): represents a file in the filesystem • struct super_block: represents a filesystem [Figure: Process, struct file, struct dentry, struct inode, file on disk, struct super_block]

  31. task_struct (sched.h) Remember: • Each process is represented using a task_struct • Keeps “a list” of open files • files_struct

      struct task_struct {
          volatile long state;
          struct thread_info *thread_info;
          atomic_t usage;
          unsigned long flags;
          unsigned long ptrace;
          int lock_depth;
          int prio, static_prio;
          struct list_head run_list;
          prio_array_t *array;
          unsigned long sleep_avg;
          long interactive_credit;
          […]
          /* file system info */
          int link_count, total_link_count;
          struct tty_struct *tty; /* NULL if no tty */
          /* ipc stuff */
          struct sysv_sem sysvsem;
          /* CPU-specific state of this task */
          struct thread_struct thread;
          /* filesystem information */
          struct fs_struct *fs;
          /* open file information */
          struct files_struct *files;
          /* namespace */
          struct namespace *namespace;
          /* signal handlers */
          struct signal_struct *signal;
          struct sighand_struct *sighand;
          […]
      };

      struct files_struct {
          atomic_t count;
          spinlock_t file_lock;
          int max_fds;
          int max_fdset;
          int next_fd;
          struct file ** fd; /* current fd array */
          fd_set *close_on_exec;
          fd_set *open_fds;
          fd_set close_on_exec_init;
          fd_set open_fds_init;
          struct file * fd_array[NR_OPEN_DEFAULT];
      };

  32. File (fs.h) The file object: • Created by the OS when a file is opened • Does not exist on disk! No “dirty” bit is needed • Several processes can use the same file object • Contains a list of pointers to operations on this file

      struct file {
          struct list_head f_list;
          struct dentry *f_dentry;        /* directory entry for the file! */
          struct vfsmount *f_vfsmnt;
          struct file_operations *f_op;   /* set by the OS when file loaded from inode */
          atomic_t f_count;               /* file reference count */
          unsigned int f_flags;
          mode_t f_mode;
          loff_t f_pos;                   /* current file pointer (offset) */
          struct fown_struct f_owner;
          unsigned int f_uid, f_gid;
          int f_error;
          struct file_ra_state f_ra;
          unsigned long f_version;
          void *f_security;
          [..]
      };

  33. Operations of Files

      struct file_operations {
          struct module *owner;
          loff_t (*llseek) (struct file *, loff_t, int);
          ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
          ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
          ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
          ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
          int (*readdir) (struct file *, void *, filldir_t);
          unsigned int (*poll) (struct file *, struct poll_table_struct *);
          int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
          int (*mmap) (struct file *, struct vm_area_struct *);
          int (*open) (struct inode *, struct file *);
          int (*flush) (struct file *);
          int (*release) (struct inode *, struct file *);
          int (*fsync) (struct file *, struct dentry *, int datasync);
          int (*aio_fsync) (struct kiocb *, int datasync);
          int (*fasync) (int, struct file *, int);
          int (*lock) (struct file *, int, struct file_lock *);
          ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
          ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
          ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void __user *);
          ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
          unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
      };

  34. dentry (directory entry) • Dentry does not represent directories! • inodes represent directories • Used in directory related operations • Pathname lookup • Created on the fly

  35. dentry /users/aja/crap/exam.tex 1 dentry and 1 inode for each component
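
Which components get a dentry can be illustrated by splitting the pathname. This is only a sketch of the splitting step; the kernel's lookup also consults the dentry cache and the inodes for every component. `path_components` is a hypothetical helper, and names longer than 63 characters are truncated in this version.

```c
#include <stddef.h>
#include <string.h>

/* Split a pathname into its components. Copies up to max names into out
 * and returns the total number of components found. */
size_t path_components(const char *path, char out[][64], size_t max)
{
    size_t count = 0;
    while (*path) {
        while (*path == '/')  /* skip separators */
            path++;
        size_t len = strcspn(path, "/");
        if (len == 0)
            break;
        if (count < max) {
            size_t n = len < 63 ? len : 63;
            memcpy(out[count], path, n);
            out[count][n] = '\0';
        }
        count++;
        path += len;
    }
    return count;
}
```

For `/users/aja/crap/exam.tex` this yields `users`, `aja`, `crap` and `exam.tex`: four components below the root directory, each of which needs a dentry and an inode during lookup.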

  36. dentry cache • Dentry objects are created on the fly • Time consuming! • Inefficient • dentry objects are often reused soon after creation • Store dentry objects in a SW cache • The dentry cache (remember dcache.h)

  37. Software Caches • The frequently used (created/destroyed) objects are stored/allocated in SW caches • Basically three caches exist in Linux • User mode memory (VM) • Slab allocator (common structures/objects) • Page cache (inodes, disk blocks) • Disk caches (the Page Cache) are used to cache disk accesses (not VM pages!!) • Crucial to system performance! • Must also be part of the page replacement algorithm • Bovet, Ch. 17

  38. dentry Cache • Unused dentry objects stored in a list • Allows easy LRU replacement • A hash table (name → dentry) • Allows fast lookup • Dentry states: • In use – used, and contains valid info • Unused – not used, but points to valid inode • Negative – the inode does not exist, kept to speed up lookups • Free – contains no valid info (stored in the slab cache) • Figure callout: can safely be deleted by the page replacement algorithm

  39. dentry (dcache.h) dentry: • Associates the components of a pathname to their inodes • Does not exist on disk

      struct dentry {
          atomic_t d_count;
          unsigned long d_vfs_flags; /* moved here to be on same cacheline */
          spinlock_t d_lock; /* per dentry lock */
          struct inode * d_inode; /* Where the name belongs to - NULL is negative */
          struct list_head d_lru; /* LRU list */
          struct list_head d_child; /* child of parent list */
          struct list_head d_subdirs; /* our children */
          struct list_head d_alias; /* inode alias list */
          unsigned long d_time; /* used by d_revalidate */
          struct dentry_operations *d_op;
          struct super_block * d_sb; /* The root of the dentry tree */
          unsigned int d_flags;
          int d_mounted;
          void * d_fsdata; /* fs-specific data */
          struct rcu_head d_rcu;
          struct dcookie_struct * d_cookie; /* cookie, if any */
          unsigned long d_move_count; /* to indicated moved dentry while lockless lookup */
          struct qstr * d_qstr; /* quick str ptr used in lockless lookup and concurrent d_move */
          struct dentry * d_parent; /* parent directory */
          struct qstr d_name;
          struct hlist_node d_hash; /* lookup hash list */
          struct hlist_head * d_bucket; /* lookup hash bucket */
          unsigned char d_iname[DNAME_INLINE_LEN_MIN]; /* small names */
      } ____cacheline_aligned;

  40. inode (fs.h) • There is also an inode cache (inode.c)

      struct inode {
          struct hlist_node i_hash;
          struct list_head i_list;
          struct list_head i_sb_list;
          struct list_head i_dentry;
          unsigned long i_ino;
          atomic_t i_count;
          umode_t i_mode;
          unsigned int i_nlink;
          uid_t i_uid;
          gid_t i_gid;
          dev_t i_rdev;
          loff_t i_size;
          struct timespec i_atime;
          struct timespec i_mtime;
          struct timespec i_ctime;
          unsigned int i_blkbits;
          unsigned long i_blksize;
          unsigned long i_version;
          unsigned long i_blocks;
          unsigned short i_bytes;
          spinlock_t i_lock;
          struct semaphore i_sem;
          struct inode_operations *i_op;    /* list of operations supported on this file(system) */
          struct file_operations *i_fop;
          struct super_block *i_sb;
          struct file_lock *i_flock;
          struct address_space *i_mapping;  /* structure with pointers to the page cache */
          struct address_space i_data;
          struct dquot *i_dquot[MAXQUOTAS];
          /* These three should probably be a union */
          struct list_head i_devices;
          struct pipe_inode_info *i_pipe;
          struct block_device *i_bdev;
          struct cdev *i_cdev;
          int i_cindex;
          unsigned long i_dnotify_mask;
          struct dnotify_struct *i_dnotify;
          unsigned long i_state;
          unsigned int i_flags;
          unsigned char i_sock;
          atomic_t i_writecount;
          void *i_security;
          u32 i_generation;
          union {
              void *generic_ip;
          } u;
      #ifdef __NEED_I_SIZE_ORDERED
          seqcount_t i_size_seqcount;
      #endif
      };

  41. inode_operations (fs.h)

      struct inode_operations {
          int (*create) (struct inode *, struct dentry *, int, struct nameidata *);
          struct dentry * (*lookup) (struct inode *, struct dentry *, struct nameidata *);
          int (*link) (struct dentry *, struct inode *, struct dentry *);
          int (*unlink) (struct inode *, struct dentry *);
          int (*symlink) (struct inode *, struct dentry *, const char *);
          int (*mkdir) (struct inode *, struct dentry *, int);
          int (*rmdir) (struct inode *, struct dentry *);
          int (*mknod) (struct inode *, struct dentry *, int, dev_t);
          int (*rename) (struct inode *, struct dentry *, struct inode *, struct dentry *);
          int (*readlink) (struct dentry *, char __user *, int);
          int (*follow_link) (struct dentry *, struct nameidata *);
          void (*truncate) (struct inode *);
          int (*permission) (struct inode *, int, struct nameidata *);
          int (*setattr) (struct dentry *, struct iattr *);
          int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
          int (*setxattr) (struct dentry *, const char *, const void *, size_t, int);
          ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
          ssize_t (*listxattr) (struct dentry *, char *, size_t);
          int (*removexattr) (struct dentry *, const char *);
      };

  42. struct address_space • Stores pages in the page cache as a radix tree • Remember digital search trees (tries)? • Allows fast lookup and sorting • Retrieve all dirty blocks • Read more in: Bovet, Ch. 15

  43. super_block (fs.h) • Used to store filesystem specific information • This reflects VFS’s view of the fs!

      struct super_block {
          struct list_head s_list; /* Keep this first */
          dev_t s_dev; /* search index; _not_ kdev_t */
          unsigned long s_blocksize;
          unsigned long s_old_blocksize;
          unsigned char s_blocksize_bits;
          unsigned char s_dirt;
          unsigned long long s_maxbytes; /* Max file size */
          struct file_system_type * s_type;
          struct super_operations * s_op;
          struct dquot_operations * dq_op;
          struct quotactl_ops * s_qcop;
          struct export_operations * s_export_op;
          unsigned long s_flags;
          unsigned long s_magic;
          struct dentry * s_root;
          struct rw_semaphore s_umount;

  44. super_block (fs.h), continued

          struct semaphore s_lock;
          int s_count;
          int s_syncing;
          int s_need_sync_fs;
          atomic_t s_active;
          void * s_security;
          struct list_head s_inodes; /* all inodes */
          struct list_head s_dirty; /* dirty inodes */
          struct list_head s_io; /* parked for writeback */
          struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */
          struct list_head s_files;
          struct block_device * s_bdev;
          struct list_head s_instances;
          struct quota_info s_dquot; /* Diskquota specific options */
          char s_id[32]; /* Informational name */
          struct kobject kobj; /* anchor for sysfs */
          void * s_fs_info; /* Filesystem private info */
          /*
           * The next field is for VFS *only*. No filesystems have any business
           * even looking at it. You had been warned.
           */
          struct semaphore s_vfs_rename_sem; /* Kludge */
      };

  45. Examples of Filesystems Let’s look at two examples • Ext2 – popular and robust • Ext3 – extended with journaling Bovet, Ch. 18

  46. Ext2 • Basic features • Native to Linux • Variable block size • “Related” blocks stored in Block Groups • Pre-allocates blocks to allow file growth • Supports fast symlinks

  47. Ext2 • Ext2 partition: Boot Block, Block Group 0, …, Block Group n • Each block group: Super Block (1 block), Block group descriptors (n blocks), Data Block Bitmap (1 block), inode Bitmap (1 block), inode Table (n blocks), Data Blocks (n blocks) • Super block and block group descriptors: copy in every block group • Each block group descriptor contains: pointer to block bitmap, pointer to inode bitmap, pointer to inode table, free blocks count, free inodes count, directory count, pads • Data block bitmap: one bit for each block in the group • Number of block groups: about s/(8b), s = partition size (in blocks), b = block size (bytes)
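
The s/(8b) estimate follows from the block bitmap occupying exactly one block. A small sketch, assuming s is counted in blocks and ignoring details such as the first data block offset; the function names are made up.

```c
#include <stdint.h>

/* A block bitmap that fits in one block of b bytes describes 8*b blocks. */
uint64_t blocks_per_group(uint64_t block_bytes)
{
    return 8 * block_bytes;
}

/* Number of block groups for a partition of s blocks: s / (8*b), rounded up. */
uint64_t block_groups(uint64_t partition_blocks, uint64_t block_bytes)
{
    uint64_t per_group = blocks_per_group(block_bytes);
    return (partition_blocks + per_group - 1) / per_group;
}

/* Index of the block group that contains a given block (simplified). */
uint64_t group_of_block(uint64_t block, uint64_t block_bytes)
{
    return block / blocks_per_group(block_bytes);
}
```

For example, an 8 GB partition with 4 KB blocks has 2M blocks; each group covers 32K blocks (128 MB), which gives 64 block groups.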

  48. Disk vs. Memory structs • There needs to be a mapping • VFS ↔ disk structures • inode ↔ ext2_inode • Superblock ↔ ext2_super_block • Most structures are stored in page cache • Some operations are generic VFS and some ext2-specific

  49. Ext2 Disk data structure for an inode (fixed 128 bytes size) • i_size: effective length of file • i_blocks: #blocks allocated to file • i_block: pointer to the blocks • i_file_acl: pointer to extended attributes

      struct ext2_inode {
          __u16 i_mode; /* File mode */
          __u16 i_uid; /* Low 16 bits of Owner Uid */
          __u32 i_size; /* Size in bytes */
          __u32 i_atime; /* Access time */
          __u32 i_ctime; /* Creation time */
          __u32 i_mtime; /* Modification time */
          __u32 i_dtime; /* Deletion Time */
          __u16 i_gid; /* Low 16 bits of Group Id */
          __u16 i_links_count; /* Links count */
          __u32 i_blocks; /* Blocks count */
          __u32 i_flags; /* File flags */
          union osd1; /* OS dependent 1 */
          __u32 i_block[EXT2_N_BLOCKS]; /* Pointers to blocks */
          __u32 i_generation; /* File version (for NFS) */
          __u32 i_file_acl; /* File ACL */
          __u32 i_dir_acl; /* Directory ACL */
          __u32 i_faddr; /* Fragment address */
          union osd2; /* OS dependent 2 */
      };

  50. File Size • i_size and i_blocks do not always match • Internal fragmentation in blocks: i_size < i_blocks*512 • File “holes”: i_size > i_blocks*512 • Example: echo -n "S" | dd of=hole bs=1024 seek=6 • Creates a file named hole that contains zeroes followed by an ‘S’ (i_size is 6145 bytes) • Only one block is allocated
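
The same hole can be created from C by seeking past the end of an empty file before writing. This is a sketch using the POSIX calls `lseek` and `fstat`; `make_hole` is a hypothetical helper, and whether the skipped range really stays unallocated depends on the filesystem.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create path with a single 'S' at offset 6*1024, like
 * dd bs=1024 seek=6, and report the resulting metadata in *st.
 * Returns 0 on success, -1 on error. */
int make_hole(const char *path, struct stat *st)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    int ok = lseek(fd, 6 * 1024, SEEK_SET) == 6 * 1024  /* skip 6 KB, write nothing */
          && write(fd, "S", 1) == 1
          && fstat(fd, st) == 0;
    close(fd);
    return ok ? 0 : -1;
}
```

After the call, `st->st_size` corresponds to i_size and `st->st_blocks` (counted in 512-byte units) to i_blocks, so on a filesystem that supports holes `st_size` exceeds `st_blocks * 512`.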