
Linux System Calls: Introduction and Implementation in Linux Kernel 3.7

Learn about system calls in the Linux operating system and how to add a system call in Linux Kernel 3.7. Explore the concept of APIs vs system calls and understand the execution flow and operations performed by system calls.

wcole

Presentation Transcript


  1. Linux Operating System 許 富 皓

  2. System Calls [1][2][3]

  3. Adding a System Call in Linux: How to Add a System Call in Linux Kernel 3.7

  4. System Call • Operating systems offer processes running in User Mode a set of interfaces to interact with hardware devices such as • the CPU • disks and • printers. • Unix systems implement most interfaces between User Mode processes and hardware devices by means of system calls issued to the kernel.

  5. POSIX APIs vs. System Calls • An application programmer interface (API) is a function definition that specifies how to obtain a given service. • A system call is an explicit request to the kernel made via a software interrupt.

  6. From a Wrapper Routine to a System Call • Unix systems include several libraries of functions that provide APIs to programmers. • Some of the APIs defined by the libc standard C library refer to wrapper routines (routines whose only purpose is to issue a system call). • Usually, each system call has a corresponding wrapper routine, which defines the API that application programs should employ.

  7. APIs and System Calls • An API does not necessarily correspond to a specific system call. • First of all, the API could offer its services directly in User Mode. (For something abstract such as math functions, there may be no reason to make system calls.) • Second, a single API function could make several system calls. • Moreover, several API functions could make the same system call, but wrap extra functionality around it.

  8. Example of Different APIs Issuing the Same System Call • In Linux, the malloc( ), calloc( ), and free( ) APIs are implemented in the libc library. • The code in this library keeps track of the allocation and deallocation requests and uses the brk( ) system call to enlarge or shrink the process heap. • P.S.: See the section "Managing the Heap" in Chapter 9.

  9. The Return Value of a Wrapper Routine • Most wrapper routines return an integer value, whose meaning depends on the corresponding system call. • A return value of -1 usually indicates that the kernel was unable to satisfy the process request. • A failure in the system call handler may be caused by • invalid parameters • a lack of available resources • hardware problems, and so on. • The specific error code is contained in the errno variable, which is defined in the libc library.
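The wrapper convention described above can be observed from User Mode. The following is a minimal sketch, assuming a POSIX system; the helper name errno_after_bad_close is hypothetical. It passes an invalid file descriptor to the close( ) wrapper, which should fail with -1 and leave the error code EBADF in errno.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Calls the close() wrapper with an invalid descriptor and returns the
 * errno value the wrapper left behind. Returns 0 if the call
 * unexpectedly succeeds.
 */
int errno_after_bad_close(void)
{
    errno = 0;
    if (close(-1) == -1)    /* the wrapper signals failure with -1 */
        return errno;       /* the specific cause is in errno */
    return 0;
}
```

A caller would typically pass the returned code to strerror( ) to obtain a readable message.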

  10. Execution Flow of a System Call • When a User Mode process invokes a system call, the CPU switches to Kernel Mode and starts the execution of a kernel function. • As we will see in the next section, in the 80x86 architecture a Linux system call can be invoked in two different ways. • The net result of both methods, however, is a jump to an assembly language function called the system call handler.

  11. System Call Number • Because the kernel implements many different system calls, the User Mode process must pass a parameter called the system call number to identify the required system call. • The eax register is used by Linux for this purpose.
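The role of the system call number can be illustrated with the generic syscall( ) function of the GNU C library, which takes the number as its first argument and places it in the register the architecture expects (eax on 32-bit x86). This is a sketch that assumes Linux with glibc; the helper name getpid_by_number is hypothetical.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid: the system call number */
#include <unistd.h>        /* syscall(), getpid() */

/* Issues getpid by its number instead of through its dedicated wrapper. */
long getpid_by_number(void)
{
    return syscall(SYS_getpid);
}
```

Both paths end in the same service routine, so the result should match what the getpid( ) wrapper returns.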

  12. The Return Value of a System Call • All system calls return an integer value. • The conventions for these return values are different from those for wrapper routines. • In the kernel • positive or 0 values denote a successful termination of the system call • negative values denote an error condition • In the latter case, the value is the negation of the error code that must be returned to the application program in the errno variable. • The errno variable is not set or used by the kernel. Instead, the wrapper routines handle the task of setting this variable after a return from a system call.
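The translation between the two conventions can be sketched as follows. This is an illustrative model, not the libc source: the function name wrapper_result is hypothetical, and the sketch assumes that only values in the range -4095 to -1 are treated as negated error codes, which is the range Linux reserves for errors.

```c
#include <errno.h>

/*
 * Converts a raw kernel return value into the wrapper convention.
 * Assumption of this sketch: values in [-4095, -1] are negated error codes.
 */
long wrapper_result(long kernel_ret)
{
    if (kernel_ret < 0 && kernel_ret >= -4095) {
        errno = (int)-kernel_ret;   /* errno is set in User Mode, not by the kernel */
        return -1;                  /* wrapper convention for failure */
    }
    return kernel_ret;              /* success: pass the value through */
}
```

For example, a kernel return value of -ENOENT would become a wrapper return value of -1 with errno set to ENOENT.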

  13. Operations Performed by a System Call • The system call handler, which has a structure similar to that of the other exception handlers, performs the following operations: • Saves the contents of most registers in the Kernel Mode stack. • This operation is common to all system calls and is coded in assembly language. • Handles the system call by invoking a corresponding C function called the system call service routine. • Exits from the handler: • the registers are loaded with the values saved in the Kernel Mode stack • the CPU is switched back from Kernel Mode to User Mode. • This operation is common to all system calls and is coded in assembly language.

  14. Naming Rules of System Call Service Routines • The name of the service routine associated with the xyz( ) system call is usually sys_xyz( ); there are, however, a few exceptions to this rule.

  15. Control Flow Diagram of a System Call • The arrows denote the execution flow between the functions. • The terms "SYSCALL" and "SYSEXIT" are placeholders for the actual assembly language instructions that switch the CPU, respectively, from User Mode to Kernel Mode and from Kernel Mode to User Mode.

  16. System Call Dispatch Table • To associate each system call number with its corresponding service routine, the kernel uses a system call dispatch table, which is stored in the sys_call_table array and has NR_syscalls[1][2] entries. • The nth entry contains the service routine address of the system call having number n.

  17. NR_syscalls • The NR_syscalls macro is just a static limit on the maximum number of implementable system calls; it does not indicate the number of system calls actually implemented. • Indeed, each entry of the dispatch table may contain the address of the sys_ni_syscall( ) function, which is the service routine of the "nonimplemented" system calls; it just returns the error code -ENOSYS.

  18. sys_call_table Array
      const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
              /*
               * Smells like a compiler bug -- it doesn't work
               * when the & below is removed.
               */
              [0 ... __NR_syscall_max] = &sys_ni_syscall,
      #include <asm/syscalls_32.h>
      };
      • The initializer "[0 ... __NR_syscall_max] = &sys_ni_syscall" stores the address of the function sys_ni_syscall in every element of the sys_call_table array. • The entries in the included file asm/syscalls_32.h then reinitialize some elements of the array.

  19. File asm/syscalls_32.h • File asm/syscalls_32.h is generated during the kernel build; in the x86 tree of Linux 3.7 it is produced from the system call table file arch/x86/syscalls/syscall_32.tbl.

  20. Partial Content of File asm/syscalls_32.h
      __SYSCALL_I386(0, sys_restart_syscall, sys_restart_syscall)
      __SYSCALL_I386(1, sys_exit, sys_exit)
      __SYSCALL_I386(2, sys_fork, stub32_fork)
      __SYSCALL_I386(3, sys_read, sys_read)
      __SYSCALL_I386(4, sys_write, sys_write)
      __SYSCALL_I386(5, sys_open, compat_sys_open)
      __SYSCALL_I386(6, sys_close, sys_close)
      __SYSCALL_I386(7, sys_waitpid, sys32_waitpid)
      __SYSCALL_I386(8, sys_creat, sys_creat)
      __SYSCALL_I386(9, sys_link, sys_link)
      __SYSCALL_I386(10, sys_unlink, sys_unlink)
      :
      __SYSCALL_I386(28, sys_fstat, sys_fstat)
      __SYSCALL_I386(29, sys_pause, sys_pause)
      __SYSCALL_I386(30, sys_utime, compat_sys_utime)
      __SYSCALL_I386(33, sys_access, sys_access)
      __SYSCALL_I386(34, sys_nice, sys_nice)
      __SYSCALL_I386(36, sys_sync, sys_sync)
      :

  21. Macro __SYSCALL_I386
      #define __SYSCALL_I386(nr, sym, compat) [nr] = sym,
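The way the range initializer and the __SYSCALL_I386 macro work together can be tried in User Mode. The sketch below is a scaled-down model, not kernel code: every name prefixed with demo_ or DEMO_, the table size, and the two service routines are hypothetical. It relies on the GNU C range designator "[first ... last]", so it assumes GCC or Clang.

```c
#include <errno.h>

#define DEMO_NR_SYSCALL_MAX 7   /* hypothetical, tiny table for illustration */

typedef long (*demo_call_ptr_t)(void);

/* Stand-in for sys_ni_syscall(): the "nonimplemented" service routine. */
static long demo_ni_syscall(void) { return -ENOSYS; }

/* Two hypothetical service routines. */
static long demo_sys_one(void)  { return 1; }
static long demo_sys_four(void) { return 4; }

/* Same shape as the kernel macro: each use becomes "[nr] = sym,". */
#define DEMO_SYSCALL(nr, sym, compat) [nr] = sym,

static const demo_call_ptr_t demo_call_table[DEMO_NR_SYSCALL_MAX + 1] = {
    /* Every slot starts out as "nonimplemented". */
    [0 ... DEMO_NR_SYSCALL_MAX] = demo_ni_syscall,
    /* Later designators override the default for implemented calls. */
    DEMO_SYSCALL(1, demo_sys_one, demo_sys_one)
    DEMO_SYSCALL(4, demo_sys_four, demo_sys_four)
};

/* Calls the routine in slot nr; the caller must keep nr within the table. */
long demo_call(unsigned int nr)
{
    return demo_call_table[nr]();
}
```

Slots 1 and 4 end up pointing at their own routines, while every other slot keeps returning -ENOSYS.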

  22. Ways to Invoke a System Call • Applications can invoke a system call in two different ways: • By executing the int $0x80 assembly language instruction; in older versions of the Linux kernel, this was the only way to switch from User Mode to Kernel Mode. • By executing the sysenter assembly language instruction, introduced in the Intel Pentium II microprocessors; Linux has supported this instruction since the 2.6 kernel.

  23. Ways to Exit a System Call • The kernel can exit from a system call thus switching the CPU back to User Mode in two ways: • By executing the iret assembly language instruction. • By executing the sysexit assembly language instruction, which was introduced in the Intel Pentium II microprocessors together with the sysenter instruction.

  24. Interrupt Descriptor Table • A system table called Interrupt Descriptor Table (IDT) associates each interrupt or exception vector with the address of the corresponding interrupt or exception handler. • The IDT must be properly initialized before the kernel enables interrupts. • The IDT format is similar to that of the GDT and the LDTs. • Each entry corresponds to an interrupt or an exception vector and consists of an 8-byte descriptor. Thus, a maximum of 256 x 8 = 2048 bytes are required to store the IDT.

  25. idtr CPU register • The idtr CPU register allows the IDT to be located anywhere in memory: it specifies both the IDT base linear address and its limit (maximum length). • It must be initialized before enabling interrupts by using the lidt assembly language instruction.

  26. Types of IDT Descriptors • The IDT may include three types of descriptors: • Task gate • Interrupt gate • Trap gate (the type used by system calls)

  27. Layout of a Trap Gate

  28. Vector 128 of the Interrupt Descriptor Table • Vector 128 (0x80 in hexadecimal) is associated with the kernel entry point. • The trap_init( ) function, invoked during kernel initialization, sets up the Interrupt Descriptor Table entry corresponding to vector 128 as shown in the next slide.

  29. Set the IDT Entry for System Calls
      #ifdef CONFIG_X86_32
      #define SYSCALL_VECTOR 0x80
      #endif
      #ifdef CONFIG_X86_32
              set_system_trap_gate(SYSCALL_VECTOR, &system_call);
              set_bit(SYSCALL_VECTOR, used_vectors);
      #endif

  30. set_system_trap_gate(0x80, &system_call) • The call loads the following values into the gate descriptor fields: • Segment Selector • The __KERNEL_CS Segment Selector of the kernel code segment. • Offset • The pointer to the system_call( ) system call handler. • Type • Set to 15 (0x0f). Indicates that the exception is a Trap and that the corresponding handler does not disable maskable interrupts. • DPL (Descriptor Privilege Level) • Set to 3. This allows processes in User Mode to invoke the exception handler • Therefore, when a User Mode process issues an int $0x80 instruction, the CPU switches into Kernel Mode and starts executing instructions from the system_call address.
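The field values listed above can be packed into an 8-byte descriptor as in the following sketch. It models the 32-bit 80x86 gate format in User Mode only; the struct and function names are hypothetical, and the handler address and selector value used in any call are illustrative inputs rather than values taken from a running kernel.

```c
#include <stdint.h>

/* 8-byte gate descriptor of the 32-bit 80x86 IDT (illustrative model). */
struct demo_gate {
    uint16_t offset_low;    /* handler offset, bits 0..15 */
    uint16_t selector;      /* Segment Selector of the code segment */
    uint8_t  reserved;      /* unused for trap gates, kept at zero */
    uint8_t  type_attr;     /* P (bit 7), DPL (bits 5..6), type (bits 0..3) */
    uint16_t offset_high;   /* handler offset, bits 16..31 */
};

#define DEMO_GATE_TRAP 0xF  /* type 15: trap gate, maskable interrupts stay enabled */

/* Builds a present gate from a handler address, selector, type, and DPL. */
struct demo_gate demo_make_gate(uint32_t handler, uint16_t selector,
                                unsigned type, unsigned dpl)
{
    struct demo_gate g;
    g.offset_low  = (uint16_t)(handler & 0xFFFF);
    g.selector    = selector;
    g.reserved    = 0;
    g.type_attr   = (uint8_t)(0x80 | ((dpl & 3) << 5) | (type & 0xF));
    g.offset_high = (uint16_t)(handler >> 16);
    return g;
}
```

With type 15 and DPL 3 the attribute byte comes out as 0xEF, and 256 such descriptors occupy the 2048 bytes mentioned for the whole IDT.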

  31. Save Registers • The system_call( ) function starts by saving the system call number and all the CPU registers that may be used by the exception handler on the stack except for eflags, cs, eip, ss, and esp, which have already been saved automatically by the control unit.

  32. Code to Save Registers
              # system call handler stub
      ENTRY(system_call)
              RING0_INT_FRAME                 # can't unwind into user space anyway
              ASM_CLAC
              pushl_cfi %eax                  # save orig_eax
              SAVE_ALL
              GET_THREAD_INFO(%ebp)
      • The function then stores the address of the thread_info data structure of the current process in ebp. • This is done by taking the value of the kernel stack pointer and rounding it down to a multiple of 8 KB, because thread_info lies at the lowest address of the 8 KB Kernel Mode stack area.
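Because the thread_info structure sits at the lowest address of the Kernel Mode stack area, its address can be computed by clearing the low-order bits of the stack pointer. The sketch below shows that arithmetic in User Mode, assuming an 8 KB stack area; the names demo_thread_info_base and DEMO_THREAD_SIZE are hypothetical, and the stack pointer values are made up.

```c
#include <stdint.h>

#define DEMO_THREAD_SIZE 8192u   /* assumed 8 KB Kernel Mode stack area */

/* Clears the low 13 bits of the stack pointer (a bitwise AND with -8192). */
uint32_t demo_thread_info_base(uint32_t esp)
{
    return esp & ~(DEMO_THREAD_SIZE - 1u);
}
```

Any stack pointer inside the same 8 KB area maps to the same base address, which is why the kernel does not need a separate pointer to find thread_info.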

  33. Call Frame Information Directives • CFI directives are directives of the GNU assembler (as). • "The CFI directives are used for debugging. It allows the debugger to unwind a stack." [Stott] • "On some architectures, exception handling must be managed with Call Frame Information directives." [Ninefingers]

  34. RING0_INT_FRAME (1)
      .macro RING0_INT_FRAME
              CFI_STARTPROC simple
              CFI_SIGNAL_FRAME
              CFI_DEF_CFA esp, 3*4
              /*CFI_OFFSET cs, -2*4;*/
              CFI_OFFSET eip, -3*4
      .endm

  35. RING0_INT_FRAME (2) • The cfi_ignore macro is empty; hence, all CFI_xxx macros defined as cfi_ignore can be ignored.
      .macro cfi_ignore a=0, b=0, c=0, d=0
      .endm
      #define CFI_STARTPROC           cfi_ignore
      #define CFI_ENDPROC             cfi_ignore
      #define CFI_DEF_CFA             cfi_ignore
      #define CFI_DEF_CFA_REGISTER    cfi_ignore
      #define CFI_DEF_CFA_OFFSET      cfi_ignore
      #define CFI_ADJUST_CFA_OFFSET   cfi_ignore
      #define CFI_OFFSET              cfi_ignore
      #define CFI_REL_OFFSET          cfi_ignore
      #define CFI_REGISTER            cfi_ignore
      #define CFI_RESTORE             cfi_ignore
      #define CFI_REMEMBER_STATE      cfi_ignore
      #define CFI_RESTORE_STATE       cfi_ignore
      #define CFI_UNDEFINED           cfi_ignore
      #define CFI_ESCAPE              cfi_ignore
      #define CFI_SIGNAL_FRAME        cfi_ignore

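  36. Graphic Explanation of the Register-Saving Process • [Figure: the Kernel Mode stack after the registers are saved. From higher to lower addresses it holds ss, esp, eflags, cs, and eip (saved by hardware), then original eax, gs, fs, es, ds, eax, ebp, edi, esi, edx, ecx, and ebx, with %esp pointing to the last saved value. The figure also labels the thread_info structure and the esp0, esp, and eip fields of thread.]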

  37. Check Trace-related Flags • Next, the system_call( ) function checks whether specific flags, such as _TIF_SYSCALL_TRACE and _TIF_SYSCALL_AUDIT, in the flags [1][2] field of the thread_info structure are set, that is, whether the system call invocations of the executed program are being traced by a debugger. • If this is the case, system_call( ) invokes the functions syscall_trace_entry( ) and syscall_trace_leave( ). • syscall_trace_entry( ) is invoked right before the execution of the system call service routine. • syscall_trace_leave( ) is invoked after the execution of the system call service routine. • These two functions stop current and thus allow the debugging process to collect information about it.

  38. Validity Check • A validity check is then performed on the system call number passed by the User Mode process. • If it is greater than or equal to the number of entries in the system call dispatch table, the system call handler terminates:
              cmpl $(NR_syscalls), %eax
              jae syscall_badsys
              :
      syscall_badsys:
              movl $-ENOSYS,PT_EAX(%esp)
              jmp resume_userspace
      • If the system call number is not valid, the function stores the -ENOSYS value in the stack location where the eax register has been saved, that is, at offset 24 from the current stack top. • It then jumps to resume_userspace (see below). In this way, when the process resumes its execution in User Mode, it will find a negative return code in eax.
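The offset of 24 bytes follows from the order in which the registers are saved: six 4-byte registers (ebx, ecx, edx, esi, edi, ebp) lie between the stack top and the saved eax. The sketch below models the saved frame as a C structure to make that arithmetic checkable; the structure and helper names are hypothetical, and the layout is an illustration of the 32-bit frame rather than a copy of the kernel header.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the saved-register frame, lowest address (stack top) first. */
struct demo_pt_regs {
    int32_t bx, cx, dx, si, di, bp;   /* offsets 0 to 20 */
    int32_t ax;                       /* offset 24: return value slot */
    int32_t ds, es, fs, gs;
    int32_t orig_ax;                  /* system call number as passed in */
    int32_t ip, cs, flags, sp, ss;    /* saved by the hardware */
};

#define DEMO_PT_EAX offsetof(struct demo_pt_regs, ax)

/* Mirrors "movl $-ENOSYS,PT_EAX(%esp)" for an invalid system call number. */
void demo_mark_badsys(struct demo_pt_regs *regs)
{
    regs->ax = -ENOSYS;
}
```

When the frame is popped on the way back to User Mode, the value stored in this slot is what the process sees in eax.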

  39. Return Code of Invalid System Call • [Figure: the same Kernel Mode stack layout as in slide 36, with -ENOSYS stored in the slot of the saved eax.]

  40. Invoke a System Call Service Routine • Finally, the specific service routine associated with the system call number contained in eax is invoked: call *sys_call_table(,%eax,4) • Because each entry in the dispatch table is 4 bytes long, the kernel finds the address of the service routine to be invoked by multiplying the system call number by 4, adding the initial address of the sys_call_table dispatch table, and extracting a pointer to the service routine from that slot in the table.
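The range check and the indexed indirect call can be modeled in C as follows. This is a User Mode sketch with a hypothetical four-entry table and hypothetical demo_ names. On 32-bit x86 each entry is 4 bytes, which is where the scale factor of 4 in the call instruction comes from; the sketch uses sizeof so that it also holds on platforms with wider pointers.

```c
#include <errno.h>
#include <stddef.h>

#define DEMO_NR_SYSCALLS 4u   /* hypothetical table size */

typedef long (*demo_service_t)(long arg);

static long demo_sys_ni(long arg)     { (void)arg; return -ENOSYS; }
static long demo_sys_double(long arg) { return 2 * arg; }

static const demo_service_t demo_table[DEMO_NR_SYSCALLS] = {
    demo_sys_ni, demo_sys_double, demo_sys_ni, demo_sys_ni
};

/* Byte offset of slot nr from the start of the table: nr * entry size. */
size_t demo_slot_offset(unsigned int nr)
{
    return nr * sizeof(demo_table[0]);
}

/* Range check (cmpl/jae) followed by the indirect call through the table. */
long demo_dispatch(unsigned int nr, long arg)
{
    if (nr >= DEMO_NR_SYSCALLS)
        return -ENOSYS;             /* same result as syscall_badsys */
    return demo_table[nr](arg);
}
```

The comparison is unsigned, like the jae instruction, so a negative number passed by the process is rejected along with numbers that are too large.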

  41. Exiting from a System Call • When the system call service routine terminates, the system_call( ) function gets its return code from eax and stores it in the stack location where the User Mode value of the eax register is saved: movl %eax, 24(%esp) • Thus, the User Mode process will find the return code of the system call in the eax register.

  42. Prepare the Return Code of the System Call • [Figure: the same Kernel Mode stack layout as in slide 36, with the return code of the service routine stored in the slot of the saved eax.]

  43. Check Flags • Then, the system_call( ) function disables the local interrupts and checks the flags in the thread_info structure of current:
      #define DISABLE_INTERRUPTS(x) cli
      #define _TIF_ALLWORK_MASK \
              (_TIF_SIGPENDING|_TIF_NEED_RESCHED|_TIF_SINGLESTEP|\
               _TIF_ASYNC_TLB|_TIF_NOTIFY_RESUME)
              DISABLE_INTERRUPTS(CLBR_ANY)
              TRACE_IRQS_OFF
              movl TI_flags(%ebp), %ecx
              testl $_TIF_ALLWORK_MASK, %ecx  # current->work
              jne syscall_exit_work
      restore_all:

  44. Return to User Mode • The flags field is at offset 8 in the thread_info structure. • The mask _TIF_ALLWORK_MASK selects specific flags. • If none of these flags is set, the function executes the instruction at label restore_all: this code • restores the contents of the registers saved on the Kernel Mode stack • executes an iret assembly language instruction to resume the User Mode process.
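The mask test that decides between returning immediately and doing extra work can be sketched in C. The bit positions below are hypothetical and chosen only for illustration (the real definitions are in the kernel's thread_info header), and all DEMO_ and demo_ names are made up for this sketch.

```c
/* Hypothetical bit positions, for illustration only. */
#define DEMO_TIF_NOTIFY_RESUME (1u << 1)
#define DEMO_TIF_SIGPENDING    (1u << 2)
#define DEMO_TIF_NEED_RESCHED  (1u << 3)
#define DEMO_TIF_SINGLESTEP    (1u << 4)
#define DEMO_TIF_UNRELATED     (1u << 20)   /* a flag outside the mask */

#define DEMO_ALLWORK_MASK \
    (DEMO_TIF_SIGPENDING | DEMO_TIF_NEED_RESCHED | \
     DEMO_TIF_SINGLESTEP | DEMO_TIF_NOTIFY_RESUME)

/*
 * Returns 1 when some masked flag is set (the "jne" branch is taken),
 * and 0 when execution would fall through to restore_all.
 */
int demo_has_exit_work(unsigned int ti_flags)
{
    return (ti_flags & DEMO_ALLWORK_MASK) != 0;
}
```

Flags outside the mask do not delay the return to User Mode, which is why the test uses a bitwise AND instead of comparing the whole field with zero.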

  45. Handling Work Indicated by the Flags • If any of the flags is set, then there is some work to be done before returning to User Mode. • If any flag defined in macro _TIF_WORK_SYSCALL_EXIT is set, the system_call( ) function invokes the syscall_trace_leave( ) function, then jumps to the resume_userspace label. • If no flag defined in macro _TIF_WORK_SYSCALL_EXIT is set, the function jumps to the work_pending label. • The code at the resume_userspace and work_pending labels checks for • rescheduling requests • virtual-8086 mode • pending signals • single stepping • Eventually, a jump is made to the restore_all label to resume the execution of the User Mode process.

  46. Issuing a System Call via the sysenter Instruction • The int assembly language instruction is inherently slow because it performs several consistency and security checks. • The sysenter instruction, dubbed in Intel documentation as "Fast System Call," provides a faster way to switch from User Mode to Kernel Mode.

  47. Set up Registers • The sysenter assembly language instruction makes use of three special registers that must be loaded with the following information: • SYSENTER_CS_MSR • The Segment Selector of the kernel code segment • SYSENTER_EIP_MSR • The linear address of the kernel entry point • SYSENTER_ESP_MSR • The kernel stack pointer • "MSR" is an acronym for "Model-Specific Register" and denotes a register that is present only in some models of 80x86 microprocessors.

  48. Go into Kernel • When the sysenter instruction is executed, the CPU control unit: • (1) Copies the content of SYSENTER_CS_MSR into cs. • (2) Copies the content of SYSENTER_EIP_MSR into eip. • (3) Copies the content of SYSENTER_ESP_MSR into esp. • (4) Adds 8 to the value of SYSENTER_CS_MSR, and loads this value into ss. • Therefore, the CPU switches to Kernel Mode and starts executing the first instruction of the kernel entry point.

  49. Why SYSENTER_CS_MSR + 8 Is Loaded into ss? • As we have seen in the section "The Linux GDT" in Chapter 2: • The kernel stack segment coincides with the kernel data segment. • The corresponding descriptor follows the descriptor of the kernel code segment in the Global Descriptor Table. • Therefore, step 4 loads the proper Segment Selector in the ss register.
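The register loads performed by sysenter, including the "+ 8" for ss, can be modeled with plain structures. This sketch only mimics the data movement in User Mode; the structure and function names are hypothetical, and the selector and address values passed in are illustrative. Since each GDT descriptor is 8 bytes long, adding 8 to a selector selects the next descriptor in the table.

```c
#include <stdint.h>

struct demo_sysenter_msrs { uint32_t cs, eip, esp; };      /* the three SYSENTER MSRs */
struct demo_cpu           { uint32_t cs, eip, esp, ss; };  /* registers being loaded */

/* Applies the four register loads performed by the sysenter instruction. */
void demo_sysenter(struct demo_cpu *cpu, const struct demo_sysenter_msrs *msr)
{
    cpu->cs  = msr->cs;        /* kernel code segment */
    cpu->eip = msr->eip;       /* kernel entry point */
    cpu->esp = msr->esp;       /* kernel stack pointer */
    cpu->ss  = msr->cs + 8;    /* next GDT descriptor: kernel data and stack segment */
}
```

For example, if the code segment selector were 0x60, ss would be loaded with 0x68.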

  50. The Mechanics of SYSENTER • All Model-Specific Registers are 64-bit registers. • They are loaded from EDX:EAX using the WRMSR instruction. • The MSR index in the ECX register tells the WRMSR instruction which MSR to load. • The RDMSR instruction works the same way, but it stores the current value of an MSR into EDX:EAX. • The programming manual for the CPU in use specifies which index to use for any given MSR.
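The EDX:EAX convention amounts to splitting a 64-bit value into two 32-bit halves. The sketch below shows only that arithmetic, since WRMSR and RDMSR themselves are privileged and cannot be executed in User Mode; the function names are hypothetical.

```c
#include <stdint.h>

/* Splits a 64-bit MSR value into the halves WRMSR expects in EDX:EAX. */
void demo_msr_split(uint64_t value, uint32_t *edx, uint32_t *eax)
{
    *edx = (uint32_t)(value >> 32);   /* high 32 bits */
    *eax = (uint32_t)value;           /* low 32 bits */
}

/* Rebuilds the 64-bit value from EDX:EAX, as after RDMSR. */
uint64_t demo_msr_join(uint32_t edx, uint32_t eax)
{
    return ((uint64_t)edx << 32) | eax;
}
```

For the three SYSENTER registers, Intel documents the indices 0x174 (CS), 0x175 (ESP), and 0x176 (EIP) as the values to place in ECX.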
