
UNIX/Web Application Tutorial






Presentation Transcript


  1. UNIX/Web Application Tutorial William R. Sullivan CTO WHAM Engineering & Software

  2. Three Important Components of Web Applications • Threads • Scheduling • Memory Management

  3. Three Important Components of Web Applications • Threads • What they are • Where they are • What they do • Thread Synchronization • Mutexes • Serialization and Concurrency

  4. Three Important Components • Memory Concepts • Address Space Management • Address Translation • Locality of Reference

  5. Three Important Components • Performance issues that can’t be solved with hardware • Scalability of Applications • Memory Management in C++ applications • Memory Management in Java applications

  6. Thread Characteristics • Schedulable entity • Consumes CPU resource • Contention Scope • Where it is scheduled

  7. What is a Thread? • A path of execution within a program • This can be a function that runs as an infinite loop or that simply returns when it is done • The name thread comes from the idea that a fabric is made up of many single threads. A program can be considered as many different threads of execution.

  8. Thread Resources • Has a stack even if it is local • Has a runtime context even if it is local • Includes general purpose register set • Condition and floating point registers • For a global scope thread there is a kernel level representation of the thread • Uses process address space and I/O • Associated with some start-up function

  9. Thread Contention Scope • Global (this is now default on Solaris 2.9) • contends with all threads • Context Switches occur in the kernel • Local (default on AIX4.3+, Solaris 2.5-8) • contends with threads within process at library level first • Context Switches are fast and efficient

  10. Thread Mappings [Diagram: local-scope threads are multiplexed onto kernel threads by the thread library in the user protection domain; each global-scope thread maps one-to-one onto a kernel thread; kernel threads are dispatched onto the CPUs in the kernel protection domain]

  11. Managing Thread Mappings • Thread Library manages local to global scope thread mappings normally • AIXTHREAD_MNRATIO overrides whatever ratio the library defaults to • pthread_create accepts a thread attribute and scope can be set to process or system (local or global)

  12. Thread Synchronization • Mutexes • Mutex - Mutual Exclusion • Acts as a gate where threads wait • The wait isn’t fair; the next thread enabled is effectively random • At the lowest level they are a spin-lock based on a test-and-set instruction implemented in hardware

  13. Thread Synchronization • Condition Variables • Consist of a Mutex and a Predicate which is usually a variable used for counting • Used for implementing master-slave thread communication and thread-thread communication • Can be used to implement a fair FIFO access scheme for variables protected by a mutex

  14. Mutex Contention - How it occurs • Mutex contention occurs when multiple threads attempt to acquire the same mutex and resolve the conflict in the kernel • When two threads attempt to obtain a mutex, one wins and the other spins in a loop testing the status of the mutex until either • a maximum spin count is reached (and then the thread blocks) • the mutex is released and the waiting thread gets it • When more than two threads attempt to obtain a single mutex, all but one will spin; this is wasteful since only one will ever get the mutex next

  15. Mutex Contention - Where it occurs • In your application due to a single locking point that is entered frequently by all or many threads in your application (malloc/free) • In the operating system where there is conflict on a single point of high use by many programs (the dispatch queue)

  16. Mutex Contention - What it Costs [Diagram: one thread holds the lock inside func() while the remaining threads spin-wait to enter] Thread 1 executes func() in time ts. Thread 2 spins for ts, then executes func() in ts, for a total of 2ts. Thread n spins for (n-1)ts, then executes func() in ts, for a total of nts. The average execution time for func() across n threads is (1/n) * sum(k=1..n) k*ts = (ts/n) * n(n+1)/2 = ts(n+1)/2

  17. Mutex Contention - What it Costs • The generalized magnification factor for a critical section when n threads collide is given by (n+1)/2 • If the contention persists and causes queuing on the critical section, the magnification factor for the execution time of the critical section is given by q, where q is the average queue depth • Mutex contention can rapidly degrade the performance of programs as concurrency is increased

  18. Pathology of Mutex Contention • What can you look for to detect mutex contention? • CPU time not scaling linearly with workload • High system-to-user CPU ratio on Solaris • Increased cost per transaction as workload increases • Reduced throughput with higher concurrency • More threads active with less work being done • We will look at an example of how mutex contention causes the same work to cost 20x as much CPU in our scheduling example.

  19. Scheduling Policies • FIFO • Runs until it yields, blocks, or is preempted by a higher priority thread • Fixed priority • Round Robin • Fixed priority • Other • Implementation defined

  20. Scheduling Concepts Solaris • Priority -- A number associated with a run queue from which the dispatcher selects threads to run. The highest number queue is searched first. • Quantum -- An amount of time the thread is allowed to run without losing the CPU.

  21. Scheduling Concepts Solaris • Preemption -- The process of bumping a running thread from the CPU in favor of an interrupt or a real-time thread requesting a kernel preemption. • Tick -- The timing interval at which synchronous scheduling decisions are made. On Solaris this is every 10ms.

  22. Scheduling Concepts Solaris • Priority Queues -- Per CPU dispatch queues for threads needing service. There is a dispatch queue for each priority level.

  23. Scheduling Classes Solaris • Time Share (TS) • The idea is to give small jobs the best response. Long running jobs get less favored priority at the expense of short jobs. Scheduling policy is priority RR. • Real Time (RT) • Fixed Priority for life or until manually changed by super-user. Scheduling policy is priority RR.

  24. Scheduling Classes (cont.) • Interactive (IA) • A special case of TS created for GUI based threads. A boost in priority is always given to the thread in the focus window • System (SYS) • These threads run in kernel mode under the FIFO scheduling policy

  25. Dispatch Algorithm - Per CPU • Check global kernel preempt queue for interrupt threads or system threads • Look for highest priority thread on RT, TS or IA queues (per CPU queues) • CPU structure includes a bit mask representing each priority queue • Look on other CPU queues for work if there is none on its own

  26. Priority Calculation Solaris • Determined by a table for the scheduling class associated with an LWP • Value between 0 and 59 for TS and IA • Value between 60 and 99 for SYS • Fixed Value between 100 and 159 for RT • Value between 160 and 169 for interrupts

  27. Time Share Table -- Time Sharing Dispatcher Configuration, RES=1000

      ts_quantum  ts_tqexp  ts_slpret  ts_maxwait  ts_lwait  # PRIORITY LEVEL
      200         0         50         0           50        #  0
      160         0         51         0           51        # 10
      120         10        52         0           52        # 20
      80          20        53         0           53        # 30
      80          29        54         0           54        # 39
      40          30        55         0           55        # 40
      40          39        58         0           59        # 49
      40          40        58         0           59        # 50
      40          41        58         0           59        # 51
      40          42        58         0           59        # 52
      40          43        58         0           59        # 53
      40          44        58         0           59        # 54
      40          45        58         0           59        # 55
      40          46        58         0           59        # 56
      40          47        58         0           59        # 57
      40          48        58         0           59        # 58
      20          49        59         32000       59        # 59

  28. TS Column Meanings • ts_quantum – time to run on CPU • ts_tqexp – next priority after the quantum is used • ts_slpret – next priority after wakeup • ts_maxwait – how many seconds to wait without getting a priority boost or a quantum • ts_lwait – priority after ts_maxwait is exceeded

  29. Priority Calculation Example • Two threads in the TS class with different work loads • fast thread only has 25ms to spend every 100ms • busy thread has 500ms work to do each second • Many fast threads take precedence and the busy thread provides poor response

  30. Priority Calculation Example [Chart: priorities of the busy thread (bt) and fast thread (ft) over scheduler ticks 0-11, on a scale of 39-59; both start near priority 59, the busy thread sinks as it consumes full quanta, while the fast thread sleeps often and is repeatedly returned to high priority]

  31. Scheduling Matters on Solaris • We have a case here of two processes that are doing the same work but with different RPC implementations. The two RPCs are different architectures to implement the same solution. • The server processes are the same but the client calls them using two different RPC protocols • Work done is the same in each case, but one is efficient and the other isn’t. One is plagued by mutex contention in a critical code location used by all threads in the application.

  32. Scheduling Matters on Solaris • Point of the exercise was to show how an inefficient process could impact an efficient process (will second-hand smoke hurt me?) • Many business areas “that have no time for optimization” will use the excuse that they bought X CPUs on this system and they can use them however they see fit. • This scenario was developed to determine if that argument was specious; it was, as we will see.

  33. Scheduling Matters on Solaris • Servers (rpc_test) implement the same operations using a tcp rpc or a udp rpc • Several clients were started up to access servers in each mode. • One server was accessed in tcp mode then udp mode while two others were tcp only • One server was accessed in udp mode only

  34. Scheduling Matters on Solaris 82,000 UDP RPCs, 91.6s total CPU, or about .0011s/RPC

  35. Scheduling Matters on Solaris PID 22063 - 42,275 TCP RPCs, 317 CPU seconds, or .0075s/RPC; the same server in UDP mode - 42,700 UDP RPCs, 31.9 CPU seconds, or .00075s/RPC PID 22066 - 28,600 TCP RPCs, 227 CPU seconds, or .0079s/RPC PID 22068 - 16,200 TCP RPCs, 123 CPU seconds, or .0075s/RPC

  36. Scheduling Matters on Solaris [Chart not recovered]

  37. Scheduling Matters on Solaris [Chart not recovered]

  38. Process Addressing Hierarchy • Program Address • Address between 0 and 0xffffffff that is produced by your program • All programs produce the same range of addresses • Virtual Address • Program Address as seen by the Address Translation Hardware (MMU)

  39. Address Hierarchy (cont) • Physical Address • Address emitted by the MMU after translation takes place • This interfaces with the system memory bus to actually reference RAM data

  40. Memory Management Terms • Addressing Fault • A failure of the addressing hardware to be able to translate a virtual address to a physical address • Protection Fault • A failure of the segment driver to allow access to a program address that produced an addressing fault

  41. Memory Management Terms • Process Context • The collection of registers both machine level and general purpose used by a kernel thread as it runs on a CPU • This context is used when addressing faults occur to resolve them. The context is saved when a thread is switched out by the dispatcher.

  42. Memory Management Elements • Virtual Memory (VM) system manages all virtual memory objects in the system • Virtual Memory mappings are contained in Segments of up to 4Gb on Solaris and 256Mb on AIX • Segments are the level at which memory is protected and shared by processes in the system • Segments are contained in Address Spaces

  43. Pages • Physical memory is divided into pages • The size of a page is dependent on the hardware but VM doesn’t care how big they are • Segments are divided into pages • Segment pages are mapped to physical memory pages by the VM system

  44. Hardware Address Translation [Diagram: Programs A, B, and C each present the same virtual address, 0x25900, as input to the Memory Management Unit, which emits a physical address as output]

  45. Hardware Address Translation • In the previous slide we have three programs applying the same virtual address to the MMU • What physical address gets emitted? • It emits the one for the last context it was programmed with • The others will produce addressing faults (they don’t have the correct context)

  46. Hardware Address Translation [Diagram of a TLB: the program address plus the process context feed a virtual address generator, producing a virtual page address/number; this is compared against the tag RAM entries, and on a match the corresponding page-number (PN) RAM entry drives the hardware address bus]

  47. Locality of Reference • This simply refers to the fact that the next location fetched from memory is close to the one fetched before it • As long as it is in the same page, no new virtual mapping needs to be created • Programs with poor locality of reference rarely get extra performance from faster CPUs

  48. Address Space Management in C++ Applications • The operator new is used to create instances of C++ class objects and invokes the class constructor • delete invokes the class destructor • If no class-specific operator new or operator delete is provided, the global versions, typically built on malloc and free, are used • This leads to poorly performing applications where lots of construction and destruction occurs for a specific class

  49. A Poisonous Mix • Ammonia and bleach, combined, produce a toxic gas, chloramine. If you combine these in your home, you had better get out fast. • Threads and C++ applications are a poisonous mix as well, which can be seen in the following case example.

  50. Webc Performance on ES6000 20-way for 1000 requests: elapsed time 850 seconds, 3500 CPU seconds
