Chapter 5: The Kernel API - System Calls

1. Chapter 5: The Kernel API - System Calls

2. Objectives
  • Introduce notions of mode, space and context.
  • Distinguish interrupts, exceptions and traps and identify how they are used for kernel entry/exit.
  • Introduce mechanisms for tracing system calls.
  • Discuss implications of blocking system calls.
  • Briefly consider Intel low-level hardware events.
  • Carefully examine system_call, the Linux system call entry code.
  • Examine implementations of several system calls.
  • Describe how to implement a new system call.

3. Mode, Space, Context
  • Mode: hardware-restricted execution state
      • restricted access, privileged instructions
      • user mode vs. kernel mode
      • “dual-mode architecture”, “protected mode”
      • Intel supports 4 protection “rings”: 0 kernel, 1 unused, 2 unused, 3 user
  • Space: kernel (system) vs. user (process) address space
      • requires MMU support (virtual memory)
      • “userland”: any process address space; there are many user address spaces
      • reality: kernel is often mapped into user process space
  • Context: kernel activity on “behalf” of ???
      • process: on behalf of current process
      • system: unrelated to current process (maybe no process!)
      • example: “interrupt context”, where blocking is not allowed!

4. User Mode, Process Context

5. Kernel Mode, Process Context

6. Kernel Mode, System Context
  • Stephen Tweedie claims “All kernel code executes in a process context (except during startup)”.
  • He also says that it is possible for an interrupt to occur during a context switch.
  • So it is uncommon for there to be no user process mapped. The real problem is not knowing which process is mapped.
  • Kernel-mode, system-context activities occur asynchronously and may be entirely unrelated to the current process.

7. User Mode, System Context?

8. Interrupts and Exceptions
  • Interrupts: asynchronous device-to-CPU communication
      • example: service request, completion notification
      • aside: IPI, interprocessor interrupt (another CPU!)
      • system may be interrupted in either kernel or user mode
      • interrupts are logically unrelated to current processing
  • Exceptions: synchronous hardware error notification
      • example: divide-by-zero (AU), illegal address (MMU)
      • exceptions are caused by current processing
  • Software interrupts (traps)
      • synchronous “simulated” interrupt
      • allows controlled “entry” into the kernel from userland

9. Kernel Entry and Exit

10. Cost of Crossing the “Kernel Barrier”
  • more than a procedure call
  • less than a context switch
  • costs:
      • vectoring mechanism
      • establishing kernel stack
      • validating parameters
  • kernel mapped to user address space?
      • updating page map permissions
  • kernel in a separate address space?
      • reloading page maps
      • invalidating cache, TLB
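
The “more than a procedure call” claim can be checked roughly from user space. The following is a minimal sketch, assuming a Linux system with glibc; the helper names (plain_call, kernel_call, ns_per_call) are hypothetical, and the numbers it produces vary widely with hardware and kernel version.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* An ordinary user-mode procedure; noinline keeps the call from being folded away. */
__attribute__((noinline)) static long plain_call(void)
{
    static volatile long counter;
    return ++counter;
}

/* A call that really enters the kernel: syscall(2) bypasses any library-level caching. */
static long kernel_call(void)
{
    return syscall(SYS_getpid);
}

/* Average wall-clock nanoseconds per invocation of fn over iters calls. */
static double ns_per_call(long (*fn)(void), long iters)
{
    struct timespec t0, t1;
    volatile long sink = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        sink += fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}
```

Comparing ns_per_call(plain_call, 1000000) with ns_per_call(kernel_call, 1000000) gives a crude feel for the gap; it measures only the round trip, not the cache and TLB effects listed on the slide.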

11. System Calls vs. Library Calls
  • man 2
  • historical evolution of # of calls
      • Unix 6e (~50), Solaris 7 (~250)
      • Linux 2.0 (~160), Linux 2.2 (~190), Linux 2.4 (~220)
  • library calls vs. system call possibilities:
      • library call never invokes system call
      • library call sometimes invokes system call
      • library call always invokes system call
      • system call not available via library
  • can invoke system call “directly” via assembly code
  • man 2: undocumented, unimplemented, obsolete
  • “externals” vs. “internals”
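
The three routes to the same system call can be placed side by side. This is a sketch under stated assumptions: the slides describe 32-bit Intel Linux of the 2.2 era, while the inline assembly here assumes x86-64 and its `syscall` instruction, and falls back to syscall(2) on other architectures. The function names are hypothetical.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Route 1: the ordinary library wrapper. */
static long via_library(void)
{
    return (long)getpid();
}

/* Route 2: the generic syscall(2) entry point, which takes the call number. */
static long via_syscall(void)
{
    return syscall(SYS_getpid);
}

/* Route 3: "directly" via assembly (x86-64 only in this sketch). */
static long via_assembly(void)
{
#if defined(__x86_64__)
    long ret;
    /* rax carries the call number in and the result out;
     * the syscall instruction clobbers rcx and r11. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "0"((long)SYS_getpid)
                      : "rcx", "r11", "memory");
    return ret;
#else
    return syscall(SYS_getpid);   /* fallback: no assembly on this architecture */
#endif
}
```

All three return the same process id; only the amount of library code between the caller and the kernel differs.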

12. Blocking System Calls
  • system calls may block “in the kernel”
  • “slow” system calls may block indefinitely
      • reads, writes of pipes, terminals, net devices
      • some ipc calls, pause, some opens and ioctls
      • disk I/O is NOT slow (it will eventually complete)
  • blocking slow calls may be “interrupted” by a signal
      • returns EINTR
      • problem: slow calls must be wrapped in a loop
  • BSD introduced “automatic restart” of slow interrupted calls
  • POSIX didn’t specify semantics
  • Linux
      • no automatic restart by default
      • specify restart when setting signal handler (SA_RESTART)

13. Tracing Process Signals and System Calls
  • ptrace(): allows parent process to observe/control child
      • child stops before signal delivery or system call execution
      • parent waits for child
      • parent can view/modify child state
      • possible to “attach” and “reparent” existing processes
      • architecture dependent
  • strace: useful diagnostic application to trace processes
      • strace whatever
  • Solaris uses more sophisticated /proc mechanism
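
The stop/wait/resume cycle that strace builds on can be sketched as follows. This assumes Linux with glibc and an environment that permits ptrace; count_syscall_stops is a hypothetical name, and a real tracer would also read registers at each stop to decode the call.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child, trace it, and count its syscall-entry and syscall-exit stops.
 * Returns the number of stops seen, or -1 on error. */
static int count_syscall_stops(void)
{
    pid_t child = fork();
    if (child == -1)
        return -1;

    if (child == 0) {
        /* Child: ask to be traced, then stop so the parent can take over. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(1);
        raise(SIGSTOP);
        syscall(SYS_getpid);          /* a system call for the parent to observe */
        _exit(0);
    }

    int status;
    if (waitpid(child, &status, 0) == -1)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;                    /* child exited before it could be traced */

    /* Mark syscall stops with bit 0x80 so they differ from signal stops. */
    ptrace(PTRACE_SETOPTIONS, child, NULL, (void *)(long)PTRACE_O_TRACESYSGOOD);

    int stops = 0;
    for (;;) {
        /* Resume the child until its next syscall entry or exit. */
        if (ptrace(PTRACE_SYSCALL, child, NULL, NULL) == -1) {
            kill(child, SIGKILL);
            waitpid(child, &status, 0);
            return -1;
        }
        if (waitpid(child, &status, 0) == -1)
            return -1;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            break;
        if (WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80))
            stops++;
        /* Any other stop is a signal; this sketch resumes without delivering it. */
    }
    return stops;
}
```

Each system call normally produces two stops (entry and exit), which is why the count is higher than the number of calls the child makes.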

14. Sample strace -r Output

    > strace -r sync
    0.000000 execve("/bin/sync", ["sync"], [/* 21 vars */]) = 0
    0.001002 brk(0) = 0x804a178
    0.000192 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40014000
    0.000164 open("/etc/", O_RDONLY) = -1 ENOENT (No such file or directory)
    0.000133 open("/etc/", O_RDONLY) = 4
    0.000069 fstat(4, {st_mode=S_IFREG|0644, st_size=20404, ...}) = 0
    0.000120 old_mmap(NULL, 20404, PROT_READ, MAP_PRIVATE, 4, 0) = 0x40015000
    0.000075 close(4) = 0
    0.000064 open("/lib/", O_RDONLY) = 4
    0.000076 fstat(4, {st_mode=S_IFREG|0755, st_size=4101324, ...}) = 0
    0.000096 read(4, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\210\212"..., 4096) = 4096
    0.000192 old_mmap(NULL, 1001564, PROT_READ|PROT_EXEC, MAP_PRIVATE, 4, 0) = 0x4001a000
    0.000083 mprotect(0x40107000, 30812, PROT_NONE) = 0
    0.000058 old_mmap(0x40107000, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0xec000) = 0x40107000
    0.000137 old_mmap(0x4010b000, 14428, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x4010b000
    0.000080 close(4) = 0
    0.000102 mprotect(0x4001a000, 970752, PROT_READ|PROT_WRITE) = 0
    0.001043 mprotect(0x4001a000, 970752, PROT_READ|PROT_EXEC) = 0
    0.000248 munmap(0x40015000, 20404) = 0
    0.000077 personality(PER_LINUX) = 0
    0.000127 getpid() = 2225
    0.000193 brk(0) = 0x804a178
    0.000054 brk(0x804a1b0) = 0x804a1b0
    0.000097 brk(0x804b000) = 0x804b000
    0.000130 sync() = 0
    0.015855 _exit(0) = ?

15. Sample strace -c Output

    > strace -c sync
    execve("/bin/sync", ["sync"], [/* 21 vars */]) = 0
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     97.47    0.008277        8277         1           sync
      0.85    0.000072          24         3         1 open
      0.45    0.000038          38         1           read
      0.40    0.000034           7         5           old_mmap
      0.37    0.000031          10         3           mprotect
      0.19    0.000016          16         1           munmap
      0.11    0.000009           2         4           brk
      0.08    0.000007           4         2           fstat
      0.05    0.000004           2         2           close
      0.02    0.000002           2         1           getpid
      0.02    0.000002           2         1           personality
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.008492                    24         1 total

16. Low-level Intel “Event” Mechanisms
  • Intel provides very complex hardware protection
  • task: execution environment
      • provides hardware support for context switching (not used by Linux)
  • 4 protection levels
  • lots of segments and descriptors
      • segments and descriptors all have privilege levels
  • hardware support for “stack swapping” on privilege change
      • avoids holes where privileged code crashes because there is no stack space
  • task gate: context switch to privileged code
  • call gate: execute privileged code with stack swapping
  • interrupt gate: call gate with interrupts disabled
  • trap gate: interrupt gate with interrupts still enabled

17. System Call Dispatch Table
  • Broad system call categories:
      • files, i/o, devices
      • memory, processes
      • ipc, time, misc
  • System call listing: include/unistd.h (include/asm-i386/unistd.h)

18. System Calls (2.2)

19. System Calls (2.2)

    semop           shmat           sync
    send            shmctl          syscalls
    sendfile        shmdt           sysctl
    sendmsg         shmget          sysfs
    sendto          shmop           sysinfo
    setcontext      shutdown        syslog
    setdomainname   sigaction       time
    setegid         sigaltstack     times
    seteuid         sigblock        truncate
    setfsgid        siggetmask      umask
    setfsuid        sigmask         umount
    setgid          signal          uname
    setgroups       sigpause        undocumented
    sethostid       sigpending      unimplemented
    sethostname     sigprocmask     unlink
    setitimer       sigreturn       uselib
    setpgid         sigsetmask      ustat
    setpgrp         sigsuspend      utime
    setpriority     sigvec          utimes
    setregid        socket          vfork
    setresgid       socketcall      vhangup
    setresuid       socketpair      vm86
    setreuid        ssetmask        wait
    setrlimit       stat            wait3
    setsid          statfs          wait4
    setsockopt      stime           waitpid
    settimeofday    stty            write
    setuid          swapoff         writev
    setup           swapon
    sgetmask        symlink

20. system_call
  • arch/i386/kernel/entry.S:ENTRY(system_call)
  • SAVE_ALL
  • get current task struct
  • syscall # not OK? → badsys
  • traced? → tracesys
  • dispatch specific syscall → *(sys_call_table[call_number])
  • save return value
  • bottom half active? → handle_bottom_half
  • need to reschedule? → reschedule
  • signal pending? → signal_return (do_signal)
  • RESTORE_ALL
  • return_from_exception
  • return_from_intr

21. lcall7

22. Example System Calls
  • sys_foo, do_foo idiom
      • all system calls proper begin with sys_
      • often delegate to do_ function for the real work
  • asmlinkage
      • gcc magic to keep parameters on the stack
      • avoids register optimizations
  • sys_ni_syscall
      • just returns -ENOSYS!
      • guards position 0 in table (catch uninitialized bugs)
      • fills “holes” for obsolete syscalls or library-implemented calls
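
The table dispatch performed by system_call, together with the sys_/do_ idiom and the sys_ni_syscall placeholder, can be modeled in ordinary user-space C. This is only an illustrative sketch: the table size, the one-argument signature, and the sys_double and sys_negate entries are invented for the example and do not correspond to real kernel entries.

```c
#include <errno.h>

#define NR_MODEL_SYSCALLS 4   /* hypothetical table size for this sketch */

typedef long (*syscall_fn)(long arg);

/* Placeholder for unimplemented or obsolete slots: just fail with -ENOSYS. */
static long sys_ni_syscall(long arg)
{
    (void)arg;
    return -ENOSYS;
}

/* sys_/do_ idiom: the sys_ entry point delegates the real work to do_. */
static long do_double(long value) { return 2 * value; }
static long sys_double(long arg)  { return do_double(arg); }

static long sys_negate(long arg)  { return -arg; }

static const syscall_fn model_sys_call_table[NR_MODEL_SYSCALLS] = {
    [0] = sys_ni_syscall,   /* position 0 is guarded */
    [1] = sys_double,
    [2] = sys_ni_syscall,   /* a "hole" left by an obsolete call */
    [3] = sys_negate,
};

/* Model of the dispatch step: range-check the number, then call through the table. */
static long model_system_call(unsigned long nr, long arg)
{
    if (nr >= NR_MODEL_SYSCALLS)
        return -ENOSYS;     /* the "badsys" path */
    return model_sys_call_table[nr](arg);
}
```

For example, model_system_call(1, 21) goes through sys_double and do_double, while numbers 0, 2, and anything past the end of the table all come back as -ENOSYS.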

23. Example System Calls: sys_time
  • kernel/time.c:sys_time
  • just return the number of seconds since Jan 1, 1970
      • available as volatile CURRENT_TIME (xtime.tv_sec)
  • snapshot current time
  • check user-supplied pointer for validity
  • copy time to user space (asm/uaccess.h:put_user)
  • return time snapshot or error

24. Example System Calls: sys_reboot
  • kernel/sys.c:sys_reboot
  • require CAP_SYS_BOOT capability
  • check “magic numbers” (0xfee1dead, Torvalds family birthdays)
  • acquire the “big kernel lock”
  • switch options
      • shutdown in various ways: restart, halt, poweroff
      • “user-specified” shutdown command for some architectures
      • toggle control-alt-delete processing
  • go through reboot_notifier callbacks as appropriate
  • unlock and return error if failure
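
The two checks that open sys_reboot can be modeled without rebooting anything. A sketch, assuming the Linux user-space headers are installed: the LINUX_REBOOT_MAGIC constants come from <linux/reboot.h>, while the function names and the boolean standing in for the capability check are hypothetical.

```c
#include <errno.h>
#include <linux/reboot.h>

/* The first magic number is fixed; the second may be any of several accepted values. */
static int reboot_magic_ok(unsigned int magic1, unsigned int magic2)
{
    return magic1 == LINUX_REBOOT_MAGIC1 &&
           (magic2 == LINUX_REBOOT_MAGIC2  ||
            magic2 == LINUX_REBOOT_MAGIC2A ||
            magic2 == LINUX_REBOOT_MAGIC2B);
}

/* Model of the opening checks: capability first, then the magic numbers.
 * has_cap_sys_boot stands in for the kernel's capability test. */
static long model_reboot_checks(int has_cap_sys_boot,
                                unsigned int magic1, unsigned int magic2)
{
    if (!has_cap_sys_boot)
        return -EPERM;
    if (!reboot_magic_ok(magic1, magic2))
        return -EINVAL;
    return 0;   /* the real call would now take the lock and switch on the command */
}
```

The magic numbers exist to make an accidental reboot unlikely: a stray call with garbage arguments fails with -EINVAL rather than halting the machine.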

25. Example System Calls: sys_sysinfo
  • kernel/info.c:sys_sysinfo
  • allocate a local struct to return info to user space
  • disable (clear) interrupts to keep info consistent
  • calculate uptime
  • calculate 1, 5, 15 minute “load averages”
      • average length of run queue over interval
      • use confusing int math to avoid floating-point inefficiency
  • enable (set) interrupts
  • return number of processes and some mem stats
  • copy local struct values to user space (copy_to_user)
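
The “confusing int math” is fixed-point arithmetic: values are scaled by 2^11 so that fractions fit in an integer. The sketch below reproduces the style of calculation with constants as they appear in the kernel's scheduler header of that era; the function names are illustrative, and the update is assumed to run once every 5 seconds.

```c
#define FSHIFT   11                /* bits of fractional precision */
#define FIXED_1  (1UL << FSHIFT)   /* 1.0 in fixed point (2048) */
#define EXP_1    1884UL            /* exp(-5s/1min) scaled by FIXED_1 */
#define EXP_5    2014UL            /* exp(-5s/5min) scaled by FIXED_1 */
#define EXP_15   2037UL            /* exp(-5s/15min) scaled by FIXED_1 */

/* One exponential-decay step:
 *   load = load * exp + active * (1 - exp)
 * where load and active are fixed-point values and exp is one of EXP_*. */
static unsigned long calc_load(unsigned long load, unsigned long exp,
                               unsigned long active)
{
    load *= exp;
    load += active * (FIXED_1 - exp);
    return load >> FSHIFT;
}

/* Split a fixed-point value into integer part and two-digit fraction for printing. */
static unsigned long load_int(unsigned long x)
{
    return x >> FSHIFT;
}

static unsigned long load_frac(unsigned long x)
{
    return ((x & (FIXED_1 - 1)) * 100) >> FSHIFT;
}
```

Starting from an idle system, one update with a single runnable task, calc_load(0, EXP_1, FIXED_1), yields 164, which load_int and load_frac render as 0.08.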

26. Adding a System Call
  • link statically or implement as a kernel module
  • allocate a number from sys_call_table
  • export sys_whatever
  • validate all parameters!
  • return appropriate error codes
  • use uaccess.h macros as necessary
  • create a “library wrapper” with _syscallN macros
      • linux/unistd.h
      • _syscallN(return_type, entry, type1, arg1, type2, arg2, …)
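
On current systems the _syscallN macros are no longer available to user programs, so a hand-written wrapper is usually built on syscall(2) instead. The sketch below wraps gettid, used here only as a stand-in for a newly added call; my_gettid is a hypothetical name, and a new call would need its own number in place of SYS_gettid.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Roughly what _syscall0(pid_t, gettid) used to generate: a function that
 * loads the call number and traps into the kernel. syscall(2) follows the
 * usual library convention of returning -1 and setting errno on failure. */
static pid_t my_gettid(void)
{
    return (pid_t)syscall(SYS_gettid);
}
```

In a single-threaded program my_gettid() returns the same value as getpid(), which makes a convenient sanity check for the wrapper.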

27. Summary
  • System calls represent the primary kernel API.
  • A system call is one way to enter kernel (“protected”) mode.
  • Crossing the “kernel barrier” is expensive.
  • System calls are usually wrapped in library routines.
  • Blocking “slow” system calls may be interrupted by a signal.
  • It is possible to “trace” system calls with ptrace().
  • Intel Linux implements system calls using software interrupts (int 0x80) and low-level gate descriptors; the lcall7 entry point uses a “call gate”.
  • System calls often require copying data between user and kernel space.
