The Linux file system modules

Nezer J. Zaidenberg

Administration
• In the 29.1 recitation I will publish the questions for ex. 1 and ex. 2, and the solution to ex. 2.
• Students who have not yet submitted ex. 2 must do so before 29.1.
• All students who submitted homework must schedule an oral exam before 29.1 or they will fail the homework!
• Students who cannot meet the 29.1 deadline for a good reason should inform me. We will work something out.

Administration 2
• You should submit ex. 3 before the test, or request an extension before the test.
• If you do not request an extension (by sending me an email with your team members' IDs), we will publish your final grade after the exam.
• Please send the requests to nzaidenberg@mac.com
• We will not accept requests after the exam, and if you have sent a request you should still submit the exercise.

Administration 3
• A review session before the test will be held one day before the exam, at noon.
• (Room will be announced.)
• I will answer all your questions and go over the questions we asked in HW 1 and 2, plus some issues raised in the lectures and in the file system exercise.

Back to file system

What should we know
• What a file system is
• How the VFS calls file-system-specific functions via a "virtual table" ("inheritance in C")
• How to operate (start/stop) VMware
• How to write simple (hello world) modules
• How to write file system modules that register a file system and read the super-block
• How to debug using printk and /var/log/messages
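The "virtual table" idea can be sketched in plain user-space C. This is only an illustration, not kernel code: every name starting with toy_ and the uxfs_/minix_ helper functions are made up for the example, and the block sizes are arbitrary. The real VFS tables (struct super_operations, struct inode_operations, struct file_operations) have many more members.

```c
#include <assert.h>
#include <stddef.h>

/* Table of function pointers: the "virtual methods" of a file system. */
struct toy_fs_ops {
    const char *(*name)(void);
    int (*block_size)(void);
};

/* A mounted file system only carries a pointer to its operations table. */
struct toy_fs {
    const struct toy_fs_ops *ops;
};

/* Two "derived" file systems, each supplying its own implementations. */
static const char *uxfs_name(void) { return "uxfs"; }
static int uxfs_block_size(void) { return 512; }

static const char *minix_name(void) { return "minix"; }
static int minix_block_size(void) { return 1024; }

static const struct toy_fs_ops uxfs_ops = { uxfs_name, uxfs_block_size };
static const struct toy_fs_ops minix_ops = { minix_name, minix_block_size };

/* Generic ("VFS") code: it never names a concrete file system. */
static int toy_vfs_block_size(const struct toy_fs *fs)
{
    assert(fs != NULL && fs->ops != NULL);
    return fs->ops->block_size();
}

static const char *toy_vfs_name(const struct toy_fs *fs)
{
    assert(fs != NULL && fs->ops != NULL);
    return fs->ops->name();
}
```

Calling toy_vfs_block_size() on a struct toy_fs whose ops field points at uxfs_ops ends up in uxfs_block_size(), and on one pointing at minix_ops in minix_block_size(). The generic code stays the same, which is exactly the role the VFS plays.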

What next
• Successful mount
• Successful ls
• Successful open/touch
• Successful read/write
• Successful mkdir/rmdir
• Successful mmap/munmap
• List of functions to implement
• List of kernel functions we can use

A word of caution...
• In order not to show all my cards...
• I have taken code from 3 different sources:
• My uxfs
• Minix
• Ext2
• This way you can still think about ex. 3 without getting all the code...
• But beware: not everything is done exactly the same way in all file systems.
• You will also see examples of how "inheritance" is implemented in the Linux file system code. (Think of a "generic file system" from which uxfs, minix and ext2 inherit.)

Working with block devices
• References
• P. 348 (scanning for a uxfs file system), UNIX Filesystems: very simplified
• Chapter 15.2, Understanding the Linux Kernel, 3rd edition: much more than we need

Buffer head bread basics
• When we access a block on a block device we call the bread() function.
• bread() reads a block from the block device and returns a buffer_head object (the data can later be accessed through this object).
• Each call to bread() must be followed by a call to brelse(), which releases the buffer.
• A second bread() of the same block before brelse() does not read the disk again: it returns the same cached buffer_head with one more reference, so each bread() still needs its own brelse().
• sb_bread() is a wrapper around bread().
• sb_bread(sb, block) == bread(sb->s_dev, block, sb->s_blocksize), where block is the block number (the offset on the device, counted in blocks). In 2.6 kernels the underlying call is __bread(sb->s_bdev, block, sb->s_blocksize).
• We will use sb_bread() in most code samples (brelse() still applies).

Buffer head writing and reading
• To write a buffer head we mark it as dirty using mark_buffer_dirty(struct buffer_head *).
• Dirty buffers are written to disk periodically by the kernel. brelse() itself only drops our reference; to force an immediate write, call sync_dirty_buffer() before releasing.
• To access the data that was read, we use the b_data member of struct buffer_head.
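The read / modify / mark dirty / release cycle can be modelled in user space. The sketch below is a toy, not the kernel buffer cache: toy_disk, struct toy_buffer and all toy_ functions are hypothetical names, the "disk" is just an array, and write-back happens only when toy_flush() is called (standing in for the kernel's periodic write-back).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16
#define TOY_NBLOCKS    4

/* The "disk": a plain array standing in for a block device. */
static char toy_disk[TOY_NBLOCKS][TOY_BLOCK_SIZE];

/* Simplified stand-in for struct buffer_head. */
struct toy_buffer {
    char b_data[TOY_BLOCK_SIZE]; /* in-memory copy of the block */
    int  b_count;                /* how many users hold the buffer */
    int  dirty;                  /* modified since it was read? */
    int  valid;                  /* has the block been read in? */
};

static struct toy_buffer toy_cache[TOY_NBLOCKS];

/* Like bread(): return the cached buffer, reading the block if needed. */
static struct toy_buffer *toy_bread(int block)
{
    struct toy_buffer *bh;

    if (block < 0 || block >= TOY_NBLOCKS)
        return NULL;
    bh = &toy_cache[block];
    if (!bh->valid) {
        memcpy(bh->b_data, toy_disk[block], TOY_BLOCK_SIZE);
        bh->valid = 1;
    }
    bh->b_count++;
    return bh;
}

/* Like mark_buffer_dirty(): only flags the buffer, nothing is written yet. */
static void toy_mark_dirty(struct toy_buffer *bh)
{
    bh->dirty = 1;
}

/* Like brelse(): drop one reference. */
static void toy_brelse(struct toy_buffer *bh)
{
    assert(bh->b_count > 0);
    bh->b_count--;
}

/* Stand-in for the periodic write-back of dirty buffers. */
static void toy_flush(void)
{
    for (int i = 0; i < TOY_NBLOCKS; i++) {
        if (toy_cache[i].valid && toy_cache[i].dirty) {
            memcpy(toy_disk[i], toy_cache[i].b_data, TOY_BLOCK_SIZE);
            toy_cache[i].dirty = 0;
        }
    }
}
```

A typical sequence is toy_bread(1), change b_data, toy_mark_dirty(), toy_brelse(). The array toy_disk only changes once toy_flush() runs, which mirrors why a dirty super-block buffer may reach the real disk some time after the code that modified it has returned.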

Examples: ux_put_super + ux_write_super
void ux_put_super(struct super_block *s)
{
    struct ux_fs *fs = (struct ux_fs *)s->s_fs_info;
    struct buffer_head *bh = fs->u_sbh;

    printk(KERN_ERR "scipio : ux_put_super %s %d\n", __FILE__, __LINE__);
    kfree(fs);
    brelse(bh);  /* bh was saved above, so it is still valid after kfree(fs) */
}

ux_write_super 1/2
void ux_write_super(struct super_block *sb)
{
    struct ux_fs *fs = (struct ux_fs *)sb->s_fs_info;
    struct buffer_head *bh = fs->u_sbh;

    printk(KERN_ERR "Scipio write super was called %s %d\n", __FILE__, __LINE__);
    lock_kernel();

ux_write_super 2/2
    printk(KERN_ERR "Scipio write super after lock kernel %s %d\n", __FILE__, __LINE__);
    if (!(sb->s_flags & MS_RDONLY)) {
        mark_buffer_dirty(bh);
    }
    sb->s_dirt = 0;
    printk(KERN_ERR "Scipio write super before unlock kernel %s %d\n", __FILE__, __LINE__);
    unlock_kernel();
    printk(KERN_ERR "Scipio write super after unlock kernel %s %d\n", __FILE__, __LINE__);
}

Completing the mount operation, and an initial discussion of locking

So what does mount(1) check after mounting
• The mount operation also reads the root inode, verifying that the mount really succeeded and that the root inode is a directory.
• Some of you have demonstrated a mount that fails with a "not a directory" message.
• For mount(1) to complete successfully we need the XX_iget implementation.
• (The kernel knows which inode is the root because we hand it to d_alloc_root().)

ux_iget(): my iget (porting the book)
struct inode *ux_iget(struct super_block *sb, unsigned long ino)
{
    struct buffer_head *bh;
    struct ux_inode *di;
    int block;
    struct inode *inode;

    printk(KERN_ERR "scipio : ux_iget was called %s %d\n", __FILE__, __LINE__);
    inode = iget_locked(sb, ino);

My ux_iget (2/6)
    if (!inode) {
        printk(KERN_ERR "scipio : ux_iget iget_locked failed %s %d\n", __FILE__, __LINE__);
        return ERR_PTR(-ENOMEM);
    }
    if (!(inode->i_state & I_NEW))
        return inode;  /* found in the inode cache, already initialised */
    if (ino < UX_ROOT_INO || ino > UX_MAXFILES) {
        printk("uxfs: Bad inode number %lu\n", ino);
        printk(KERN_ERR "scipio : ux_iget bad inode number %lu, %s %d\n", ino, __FILE__, __LINE__);
        goto ux_iget_error;
    }

My ux_iget (3/6)
    // Note that for simplicity, there is only one inode per block!
    block = UX_INODE_BLOCK + ino;
    bh = sb_bread(inode->i_sb, block);
    if (!bh) {
        /* ino is an unsigned long, so it is printed with %lu */
        printk(KERN_ERR "scipio : ux_iget problem with sb_bread on inode %lu %s %d\n", ino, __FILE__, __LINE__);
        goto ux_iget_error;
    }
    di = (struct ux_inode *)(bh->b_data);
    inode->i_mode = di->i_mode;

My ux_iget (4/6)
    if (di->i_mode & S_IFDIR) {
        inode->i_mode |= S_IFDIR;
        inode->i_op = &ux_dir_inops;
        inode->i_fop = &ux_dir_operations;
    } else if (di->i_mode & S_IFREG) {
        inode->i_mode |= S_IFREG;
        inode->i_op = &ux_file_inops;
        inode->i_fop = &ux_file_operations;
        inode->i_mapping->a_ops = &ux_aops;
    }

My ux_iget (5/6)
    inode->i_uid = di->i_uid;
    inode->i_gid = di->i_gid;
    inode->i_nlink = di->i_nlink;
    inode->i_size = di->i_size;
    inode->i_blocks = di->i_blocks;
    inode->i_atime.tv_sec = di->i_atime;
    inode->i_mtime.tv_sec = di->i_mtime;
    inode->i_ctime.tv_sec = di->i_ctime;
    inode->i_atime.tv_nsec = 0;
    inode->i_mtime.tv_nsec = 0;

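My ux_iget (6/6)
    inode->i_ctime.tv_nsec = 0;
    /*
     * i_private is a single pointer, so allocate room for the copy of the
     * on-disk inode instead of copying over the pointer field itself.
     * (It has to be kfree()d again when the inode is destroyed.)
     */
    inode->i_private = kmalloc(sizeof(struct ux_inode), GFP_KERNEL);
    if (!inode->i_private) {
        brelse(bh);
        goto ux_iget_error;
    }
    memcpy(inode->i_private, di, sizeof(struct ux_inode));
    brelse(bh);
    unlock_new_inode(inode);
    printk(KERN_ERR "scipio : ux_iget before return %s %d\n", __FILE__, __LINE__);
    return inode;

ux_iget_error:
    printk(KERN_ERR "scipio : ux_iget had error %s %d\n", __FILE__, __LINE__);
    iget_failed(inode);
    return ERR_PTR(-EINVAL);
}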

The new iget_locked()
Old way:
• iget() method
• Each fs had read_inode()
• Disappeared: 2.6.25 (not so very long ago!)
• Problems: with style and locking
New way:
• Each file system has fs_iget(), which calls iget_locked().
• iget_locked() searches for the inode in the inode cache (shared memory). If it is there, it is returned. If not, a new inode marked I_NEW is returned and fs_iget() fills it in from disk.
• (Naturally, all shared memory operations are locked.)
For more information: http://kerneltrap.org/Linus/Removing_iget_and_read_inode
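The "look in the cache first, read from disk only for new inodes" pattern can be sketched in user space. This is a toy model under stated assumptions, not the kernel implementation: struct toy_inode, toy_iget_locked(), toy_fs_iget() and the toy_disk_reads counter are hypothetical, there is no real locking, and "reading from disk" is replaced by computing a fake size.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_INODES 8
#define TOY_I_NEW      1  /* inode was just allocated, not yet filled in */

struct toy_inode {
    unsigned long i_ino;
    int i_state;   /* TOY_I_NEW while the file system is filling it in */
    int i_count;   /* reference count */
    int in_use;    /* is this cache slot occupied? */
    int i_size;    /* example of a field loaded "from disk" */
};

static struct toy_inode toy_icache[TOY_MAX_INODES];
static int toy_disk_reads; /* counts how often we went to "disk" */

/* Like iget_locked(): find the inode in the cache or allocate a new one. */
static struct toy_inode *toy_iget_locked(unsigned long ino)
{
    struct toy_inode *free_slot = NULL;

    for (int i = 0; i < TOY_MAX_INODES; i++) {
        struct toy_inode *in = &toy_icache[i];

        if (in->in_use && in->i_ino == ino) {
            in->i_count++;
            return in;            /* cache hit: already initialised */
        }
        if (!in->in_use && !free_slot)
            free_slot = in;
    }
    if (!free_slot)
        return NULL;              /* cache full */
    free_slot->in_use = 1;
    free_slot->i_ino = ino;
    free_slot->i_count = 1;
    free_slot->i_state = TOY_I_NEW;
    return free_slot;
}

/* Like a file system's fs_iget(): fill in only inodes that are new. */
static struct toy_inode *toy_fs_iget(unsigned long ino)
{
    struct toy_inode *inode = toy_iget_locked(ino);

    if (!inode)
        return NULL;
    if (!(inode->i_state & TOY_I_NEW))
        return inode;             /* came from the cache, nothing to read */
    toy_disk_reads++;             /* stand-in for sb_bread() + copying fields */
    inode->i_size = (int)ino * 100;
    inode->i_state &= ~TOY_I_NEW; /* stand-in for unlock_new_inode() */
    return inode;
}
```

Asking for the same inode number twice returns the same object with a higher reference count and only one "disk read", which is the behaviour the I_NEW check in ux_iget relies on.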

Some more kernel operations
• printk: we know
• kmalloc/kfree: same as the non-kernel malloc/free, except that kmalloc takes an extra flags parameter (use GFP_KERNEL). More on this: kzalloc = kmalloc + set memory to zero; kcalloc = like the normal calloc.
• Most strXXX and memXXX functions are usable in the kernel just as in user mode (though the implementation is built into the kernel rather than coming from a library).
• Complete kernel API reference: http://www.gelato.unsw.edu.au/~dsw/public-files/kernel-docs/kernel-api/index.html

Just a word of caution
• The Linux kernel is an evolving beast, with APIs coming and going and practically no attempt at backward compatibility.
• Examples: iget() and read_inode() were removed in kernel 2.6.25 (leaving iget_locked() as the way to get an inode), while kzalloc was added in 2.6.14 (and doesn't appear in the API reference).
• The kernel progresses via emails and posts on mailing lists, and everything is documented. When in doubt, ask Google.

Reading an inode from disk, minix style: fs/minix/bitmap.c
struct minix_inode *
minix_V1_raw_inode(struct super_block *sb, ino_t ino, struct buffer_head **bh)
{
    int block;
    struct minix_sb_info *sbi = minix_sb(sb);
    struct minix_inode *p;

    if (!ino || ino > sbi->s_ninodes) {
        printk("Bad inode number on dev %s: %ld is out of range\n",
               sb->s_id, (long)ino);
        return NULL;
    }

fs/minix/bitmap.c (continued)
    ino--;
    block = 2 + sbi->s_imap_blocks + sbi->s_zmap_blocks +
            ino / MINIX_INODES_PER_BLOCK;
    *bh = sb_bread(sb, block);
    if (!*bh) {
        printk("Unable to read inode block\n");
        return NULL;
    }
    p = (void *)(*bh)->b_data;
    return p + ino % MINIX_INODES_PER_BLOCK;
}
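The block arithmetic in minix_V1_raw_inode can be tried out in user space. The helper below is a sketch with hypothetical names (struct toy_inode_pos, toy_locate_inode); it repeats the same formula, with the bitmap sizes and the number of inodes per block passed in as parameters instead of being read from the super-block.

```c
#include <assert.h>

/* Where an inode lives on a minix-style disk. */
struct toy_inode_pos {
    unsigned long block;  /* disk block holding the inode */
    unsigned long index;  /* slot of the inode inside that block */
};

/*
 * ino is 1-based. Blocks 0 and 1 are the boot block and the super-block,
 * followed by the inode bitmap blocks and the zone bitmap blocks; the
 * inode table starts right after them.
 */
static struct toy_inode_pos toy_locate_inode(unsigned long ino,
                                             unsigned long imap_blocks,
                                             unsigned long zmap_blocks,
                                             unsigned long inodes_per_block)
{
    struct toy_inode_pos pos;

    assert(ino >= 1 && inodes_per_block > 0);
    ino--;  /* inode numbers start at 1, slots start at 0 */
    pos.block = 2 + imap_blocks + zmap_blocks + ino / inodes_per_block;
    pos.index = ino % inodes_per_block;
    return pos;
}
```

As a worked example, assume one inode bitmap block, one zone bitmap block and 32 inodes per block (illustrative values). Inode 1 is then slot 0 of block 4, inode 32 is slot 31 of block 4, and inode 33 is slot 0 of block 5. Compare this with uxfs, where block = UX_INODE_BLOCK + ino because there is exactly one inode per block.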

Writing an inode
• Done via a call to iput(). (This will also call your routines.)
• iput() marks the inode as used one less time. When the usage count reaches zero, the inode is written to disk and freed.
• iget()/iget_locked() increase the usage count by 1.

write_inode (from minix): fs/minix/inode.c
static int minix_write_inode(struct inode *inode, int wait)
{
    brelse(minix_update_inode(inode));
    return 0;
}

Still minix: fs/minix/inode.c
static struct buffer_head *minix_update_inode(struct inode *inode)
{
    if (INODE_VERSION(inode) == MINIX_V1)
        return V1_minix_update_inode(inode);
    else
        return V2_minix_update_inode(inode);
}

More from minix: fs/minix/inode.c
static struct buffer_head *V1_minix_update_inode(struct inode *inode)
{
    struct buffer_head *bh;
    struct minix_inode *raw_inode;
    struct minix_inode_info *minix_inode = minix_i(inode);
    int i;

And... fs/minix/inode.c
    raw_inode = minix_V1_raw_inode(inode->i_sb, inode->i_ino, &bh);
    if (!raw_inode)
        return NULL;
    …
    mark_buffer_dirty(bh);
    return bh;
}

Creating new files
• When we call touch, for example, we need to allocate a new inode.
• We allocate a Linux inode and also a file system inode that the Linux inode points to.
• Please note: alloc_inode is a new method (it does not appear in the UNIX Filesystems book). Do not confuse it with Pate's ux_ialloc(), which finds a free inode on disk.

How ext2 allocates an inode
static struct inode *ext2_alloc_inode(struct super_block *sb)
{
    struct ext2_inode_info *ei;

    ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep, GFP_KERNEL);
    if (!ei)
        return NULL;
    // scipio : I removed some #ifdef
    ei->i_block_alloc_info = NULL;
    ei->vfs_inode.i_version = 1;
    return &ei->vfs_inode;
}

For those who find it weird: ext2.h
struct ext2_inode_info {
    __le32 i_data[15];
    __u32 i_flags;
    __u32 i_faddr;
    …
    struct mutex truncate_mutex;
    struct inode vfs_inode;
    struct list_head i_orphan; /* unlinked but open inodes */
};

Explaining
struct inode is embedded in ext2_inode_info, so with simple pointer arithmetic one can find the enclosing ext2_inode_info from a pointer to the inode. That is done via the function static inline struct ext2_inode_info *EXT2_I(struct inode *inode).
(Though it may be more correct to say that an ext2 inode "is an" inode rather than "has an" inode, kernel developers are more interested in speed and memory locality than in OOP. I've implemented it with two mallocs and it also works.)
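The pointer arithmetic behind EXT2_I() can be shown in user space. In the kernel this is done with the container_of() macro; the sketch below rebuilds the same idea with the standard offsetof(). The struct and function names (toy_vfs_inode, toy_fs_inode_info, TOY_I, toy_container_of) are hypothetical stand-ins for struct inode, struct ext2_inode_info and EXT2_I.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the generic VFS inode. */
struct toy_vfs_inode {
    unsigned long i_ino;
};

/* Stand-in for ext2_inode_info: private fields plus an embedded inode. */
struct toy_fs_inode_info {
    unsigned int i_flags;
    struct toy_vfs_inode vfs_inode;
};

/* Step back from the embedded member to the structure that contains it. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Counterpart of EXT2_I(): generic inode in, file system inode out. */
static struct toy_fs_inode_info *TOY_I(struct toy_vfs_inode *inode)
{
    assert(inode != NULL);
    return toy_container_of(inode, struct toy_fs_inode_info, vfs_inode);
}
```

Because both structures come from a single allocation, the VFS can be handed &ei->vfs_inode and the file system can get back to ei with one subtraction and no lookup. This only works for inodes that really are embedded in a struct toy_fs_inode_info, which is why the allocation has to go through the file system's own alloc_inode method.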

Speed is of the utmost importance to kernel developers (but I am more than willing to explain any line of code)

Get block/put block
Works roughly the same as with inodes, but via a different data structure: blocks are read using sb_bread() and released using brelse() after we mark the block as dirty.
We may want to do our own locking (especially on SMP systems).

Kernel spinlocks and the BKL
• A kernel spinlock is a mutual-exclusion lock. Unlike a recursive mutex, it must not be taken again by the code that already holds it.
• While the lock is held, nobody else can obtain it (the other CPU busy-waits until it is released).
• Previous versions of Linux relied on the "Big Kernel Lock" (BKL): a single global lock, so taking it locks out every other BKL user in the kernel, even in unrelated parts. The BKL, unlike a spinlock, may be taken recursively by the same task.
• This lock is being phased out...
• But for simplicity and improved stability it may be a good idea to wrap the body of each of your functions in a lock_kernel() ... unlock_kernel() pair.
• (The BKL is released with unlock_kernel().)
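What "spinning" on a lock means can be sketched in user space with C11 atomics. This is not the kernel's spinlock_t and has none of its interrupt or preemption handling; struct toy_spinlock and the toy_spin_ functions are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* A minimal test-and-set spinlock. */
struct toy_spinlock {
    atomic_flag locked;
};

static void toy_spin_init(struct toy_spinlock *l)
{
    atomic_flag_clear(&l->locked);
}

/* Try once; returns true if the lock was free and is now ours. */
static bool toy_spin_trylock(struct toy_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

/* Busy-wait ("spin") until the lock becomes free. */
static void toy_spin_lock(struct toy_spinlock *l)
{
    while (!toy_spin_trylock(l))
        ; /* spin */
}

static void toy_spin_unlock(struct toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

While the lock is held, toy_spin_trylock() fails for everyone, including the holder. A holder that called toy_spin_lock() a second time would therefore spin forever, which is why a lock of this kind must never be taken recursively.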

Example in kernel code (from fs/ext2/inode.c): BKL
    lock_kernel();
    ext2_update_dynamic_rev(sb);
    EXT2_SET_RO_COMPAT_FEATURE(sb,
            EXT2_FEATURE_RO_COMPAT_LARGE_FILE);
    unlock_kernel();
    ext2_write_super(sb);

Reading directories
• As far as inodes are concerned, a directory's inode is read exactly like any other inode.
• A directory needs its own operations struct (ux_dir_operations in the iget code above), with different functions than the file operations of a regular file.
• When reading a directory, one has to fill a dirent structure for each entry (this is why the dirent structure has an inode number, which we never used in user space).

List of useful directory functions
• d_alloc_root (p. 349, UNIX Filesystems): allocates the root dentry for the root inode, so the kernel knows where to start reading
• filldir (p. 353, UNIX Filesystems): copies directory entries to user space
• d_XXX (see the kernel API): manipulate the kernel directory cache; each does exactly what its name implies
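The way a directory read hands entries to a filldir-style callback can be modelled in user space. The sketch below is not the kernel interface: struct toy_dirent, toy_filldir_t, toy_readdir() and toy_remember_last() are hypothetical, and the callback signature is simplified compared with the kernel's filldir_t.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified on-disk directory entry: inode number 0 marks a free slot. */
struct toy_dirent {
    unsigned long d_ino;
    char d_name[28];
};

/* Callback in the spirit of filldir: returns non-zero to stop the walk. */
typedef int (*toy_filldir_t)(void *buf, const char *name, size_t namlen,
                             unsigned long ino);

/* Like a readdir method: walk the entries and hand each one to the callback. */
static int toy_readdir(const struct toy_dirent *entries, size_t count,
                       void *buf, toy_filldir_t filldir)
{
    int emitted = 0;

    for (size_t i = 0; i < count; i++) {
        if (entries[i].d_ino == 0)
            continue;                       /* unused slot */
        if (filldir(buf, entries[i].d_name, strlen(entries[i].d_name),
                    entries[i].d_ino))
            break;                          /* the caller's buffer is full */
        emitted++;
    }
    return emitted;
}

/* Example callback: remember the last entry seen. */
struct toy_last_entry {
    char name[28];
    unsigned long ino;
};

static int toy_remember_last(void *buf, const char *name, size_t namlen,
                             unsigned long ino)
{
    struct toy_last_entry *last = buf;

    assert(namlen < sizeof(last->name));
    memcpy(last->name, name, namlen);
    last->name[namlen] = '\0';
    last->ino = ino;
    return 0;
}
```

The file system only knows how to walk its own on-disk entries; what happens to each name and inode number is decided by the callback. In the kernel that callback is what packs the entries into the dirent structures that user space sees.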

NOW WHAT
• You should be able to create a file system (using a userland mkfs).
• You should be able to create a file system that supports reading and writing files and directories. (You have all the APIs and the kernel examples; feel free to examine other file systems.)
• You should be able to dig into mmap and links on your own... but I'll cover that next lecture.

Some more implementation hints
• It may be a good idea to turn SELinux off while working on the kernel:
• echo 0 > /selinux/enforce
• It may be an even better idea to keep SELinux from starting, or to make it permissive:
• Edit /etc/selinux/config and set one of
• SELINUX=disabled (SELinux is off)
• SELINUX=permissive (only generates warnings)
• http://www.geocities.com/ravikiran_uvs/articles/rkfs.html is a helpful (Beej-like) manual on how to write file system kernel modules; it may be worth your time.

It's never too late to start digging into the kernel!
