
Servers: Concurrency and Performance

Learn about server structure, OS system calls, threads, sockets, server bottlenecks, and server structuring alternatives for HTTP servers.



  1. Servers: Concurrency and Performance Jeff Chase Duke University

  2. CPS 110 Fall 08 • These slides are included for completeness. • I used some of them to illustrate how servers are structured and how they use the OS system call interface (e.g., *nix), threads, and sockets. • The server bottleneck material (e.g., Little’s Law) was also discussed along with scheduling earlier in the semester. It won’t be tested, but you should understand how/why response time grows so rapidly as systems saturate, and the idea of “thrashing” or “congestion collapse” that often causes throughput to drop at saturation. • Server structuring alternatives (event structures, etc.) will not be tested.

  3. HTTP Server • Creates a socket (socket) • Binds to an address (bind) • Listens, setting up the accept backlog (listen) • Can call accept to block waiting for connections • (Can call select to check for data on multiple sockets) • Handles a request such as: GET /index.html HTTP/1.0\n<optional body, multiple lines>\n\n
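To make those calls concrete, here is a minimal sketch in C of the socket/bind/listen/accept pattern above (port 8080 is a hypothetical choice; error checks and real HTTP parsing are omitted):

  #include <string.h>
  #include <unistd.h>
  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <sys/socket.h>

  int main(void) {
      int lfd = socket(AF_INET, SOCK_STREAM, 0);          /* create a TCP socket */
      struct sockaddr_in addr;
      memset(&addr, 0, sizeof(addr));
      addr.sin_family = AF_INET;
      addr.sin_addr.s_addr = htonl(INADDR_ANY);           /* bind to any local address */
      addr.sin_port = htons(8080);                        /* hypothetical port */
      bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
      listen(lfd, 128);                                   /* set up the accept backlog */
      for (;;) {
          int cfd = accept(lfd, NULL, NULL);              /* block until a client connects */
          char buf[4096];
          ssize_t n = read(cfd, buf, sizeof(buf));        /* e.g. "GET /index.html HTTP/1.0" */
          if (n > 0)
              write(cfd, "HTTP/1.0 200 OK\r\n\r\n", 19);  /* trivial canned response */
          close(cfd);
      }
  }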

  4. Inside your server — [Figure: the request path inside the server: packet queues → listen queue → accept queue → server application (Apache, Tomcat/Java, etc.); the measures of interest are offered load, response time, throughput, and utilization]

  5. Example: Video On Demand [MIT/Morris]

  Client() {
    fd = connect("server");
    write(fd, "video.mpg");
    while (!eof(fd)) {
      read(fd, buf);
      display(buf);
    }
  }

  Server() {
    while (1) {
      cfd = accept();
      read(cfd, name);
      fd = open(name);
      while (!eof(fd)) {
        read(fd, block);
        write(cfd, block);
      }
      close(cfd);
      close(fd);
    }
  }

  How many clients can the server support? Suppose, say, 200 kb/s video on a 100 Mb/s network link?

  6. Performance “analysis” • Server capacity: • Network (100 Mbit/s) • Disk (20 Mbyte/s) • Obtained performance: one client stream • The server is limited by its software structure • If a video stream is 200 Kbit/s, the server should be able to support far more than one client: 100 Mbit/s ÷ 200 Kbit/s = 500 concurrent streams. [MIT/Morris]

  7. WebServer Flow
  [Figure: TCP socket space on hosts 128.36.232.5 and 128.36.230.2, one socket per row:
    state: listening; address: {*.6789, *.*}; completed connection queue; sendbuf; recvbuf
    state: established; address: {128.36.232.5:6789, 198.69.10.10:1500}; sendbuf; recvbuf
    state: listening; address: {*.25, *.*}; completed connection queue; sendbuf; recvbuf]
  Create ServerSocket → connSocket = accept() → read request from connSocket → read local file → write file to connSocket → close connSocket
  Discussion: what does each step do and how long does it take?

  8. Web Server Processing Steps: Accept Client Connection → Read HTTP Request Header → Find File → Send HTTP Response Header → Read File → Send Data. Steps may block waiting on disk I/O or waiting on the network. We want to be able to process requests concurrently.

  9. Process States and Transitions — [Figure: state diagram with states running (user), running (kernel), ready, and blocked; transitions: trap/return and interrupt/exception between user and kernel mode, Sleep (running → blocked), Wakeup (blocked → ready), Run (ready → running), Yield (running → ready)]

  10. Server Blocking • accept() when no connect requests are waiting on the listen queue • What if the server has multiple ports to listen on? • E.g., 80 for HTTP, 443 for HTTPS • open/read/write on server files • read() on a socket, if the client is sending too slowly • write() on a socket, if the client is receiving too slowly • Yup, TCP has flow control like pipes. What if the server blocks while serving one client, and another client has work to do?
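One standard escape hatch, sketched below: mark a socket nonblocking with fcntl(), so a read() that would stall returns immediately with EAGAIN/EWOULDBLOCK and the server can go do other work (fd and buf are assumed from surrounding context; error checks omitted):

  #include <errno.h>
  #include <fcntl.h>
  #include <unistd.h>

  /* Mark the socket nonblocking. */
  int flags = fcntl(fd, F_GETFL, 0);
  fcntl(fd, F_SETFL, flags | O_NONBLOCK);

  ssize_t n = read(fd, buf, sizeof(buf));
  if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
      /* No data yet: serve another client instead of blocking here. */
  }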

  11. Under the Hood — [Figure: queueing network: requests start at arrival rate λ, visit the CPU, loop through an I/O device on each I/O request/completion, and exit; throughput is λ until some center saturates]

  12. Concurrency and Pipelining — [Figure: before: the CPU, disk, and network handle each request's stages one at a time; after: stages of different requests overlap in a pipeline across CPU, disk, and network]

  13. Better single-server performance • Goal: run at server’s hardware speed • Disk or network should be bottleneck • Method: • Pipeline blocks of each request • Multiplex requests from multiple clients • Two implementation approaches: • Multithreaded server • Asynchronous I/O [MIT/Morris]

  14. Concurrent threads or processes • Using multiple threads/processes • so that only the flow processing a particular request is blocked • Java: extends Thread or implements Runnable interface Example: a Multi-threaded WebServer, which creates a thread for each request

  15. Multiple Process Architecture — [Figure: Process 1 … Process N, each in a separate address space, each running the full pipeline: Accept Conn → Read Request → Find File → Send Header → Read File → Send Data] • Advantages • Simple programming while addressing the blocking issue • Disadvantages • Many processes; large context switch overheads • Consumes much memory • Optimizations involving sharing information among processes (e.g., caching) are harder across separate address spaces. (A fork-based sketch follows below.)
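A minimal fork-per-connection sketch in C (handle_client() is a hypothetical stand-in for the read-request/find-file/send-response pipeline; lfd is the listening socket; error checks omitted):

  #include <signal.h>
  #include <unistd.h>

  signal(SIGCHLD, SIG_IGN);       /* let the kernel reap exited children */
  for (;;) {
      int cfd = accept(lfd, NULL, NULL);
      if (fork() == 0) {          /* child: serve this one client */
          close(lfd);             /* the child doesn't need the listening socket */
          handle_client(cfd);     /* hypothetical per-request pipeline */
          close(cfd);
          _exit(0);
      }
      close(cfd);                 /* parent: the child owns cfd now */
  }

Each child gets a private copy of the parent's address space, which is exactly why sharing a cache across processes is harder.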

  16. Using Threads — [Figure: Thread 1 … Thread N in one shared address space, each running: Accept Conn → Read Request → Find File → Send Header → Read File → Send Data] • Advantages • Lower context switch overheads • Shared address space simplifies optimizations (e.g., caches) • Disadvantages • Need kernel-level threads (why?) • Some extra memory needed to support multiple stacks • Need thread-safe programs, synchronization

  17. Threads • A thread is a schedulable stream of control. • defined by CPU register values (PC, SP) • suspend: save register values in memory • resume: restore registers from memory • Multiple threads can execute independently: • They can run in parallel on multiple CPUs... • - physical concurrency • …or arbitrarily interleaved on a single CPU. • - logical concurrency • Each thread must have its own stack.

  18. Multithreaded server [MIT/Morris]

  server() {
    while (1) {
      cfd = accept();
      read(cfd, name);
      fd = open(name);
      while (!eof(fd)) {
        read(fd, block);
        write(cfd, block);
      }
      close(cfd);
      close(fd);
    }
  }

  for (i = 0; i < 10; i++)
    threadfork(server);

  • When one thread blocks waiting for I/O, the thread scheduler runs another thread • What about references to shared data? • Synchronization
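A pthreads rendering of the threadfork() idiom might look like the sketch below (server() is the blocking loop above; error checks omitted). This is where kernel-level threads matter: one thread blocking in read() must not stall the others.

  #include <pthread.h>

  void *worker(void *arg) {
      server();                        /* the blocking accept/read/write loop above */
      return NULL;
  }

  int main(void) {
      pthread_t tid[10];
      for (int i = 0; i < 10; i++)
          pthread_create(&tid[i], NULL, worker, NULL);
      for (int i = 0; i < 10; i++)
          pthread_join(tid[i], NULL);  /* the workers loop forever, so this blocks */
      return 0;
  }

Having all ten threads block in accept() on the same listening socket is fine: the kernel hands each new connection to exactly one of them. Any data the threads share (e.g., a block cache) still needs locks.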

  19. Event-Driven Programming • One execution stream: no CPU concurrency. • Register interest in events (callbacks). • Event loop waits for events, invokes handlers. • No preemption of event handlers. • Handlers generally short-lived. [Figure: an event loop dispatching to event handlers] [Ousterhout 1995]

  20. Single Process Event Driven (SPED) — [Figure: one Event Dispatcher fans requests out to the stages Accept Conn, Read Request, Find File, Send Header, Read File, Send Data] • Single threaded • Asynchronous (non-blocking) I/O • Advantages • Single address space • No synchronization • Disadvantages • In practice, disk reads still block

  21. Asynchronous Multi-Process Event Driven (AMPED) — [Figure: an Event Dispatcher drives the stages Accept Conn, Read Request, Find File, Send Header, Read File, Send Data, with helper processes for disk I/O] • Like SPED, but use helper processes/threads for disk I/O • Use IPC to communicate with the helpers • Advantages • Shared address space for most web server functions • Concurrency for disk I/O • Disadvantages • IPC between the main thread and the helpers. This hybrid model is used by the “Flash” web server.

  22. Event-Based Concurrent Servers Using I/O Multiplexing • Maintain a pool of connected descriptors. • Repeat the following forever: • Use the Unix select function to block until: • (a) New connection request arrives on the listening descriptor. • (b) New data arrives on an existing connected descriptor. • If (a), add the new connection to the pool of connections. • If (b), read any available data from the connection • Close connection on EOF and remove it from the pool. [CMU 15-213]

  23. Select • If a server has many open sockets, how does it know when one of them is ready for I/O? int select(int n, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout); • Issues with scalability: alternative event interfaces (e.g., Linux epoll, BSD kqueue) have been offered.
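A sketch of the select() loop from slides 22–23 in C (lfd is the listening socket; pool management and error checks simplified):

  #include <sys/select.h>
  #include <unistd.h>

  fd_set pool, ready;
  FD_ZERO(&pool);
  FD_SET(lfd, &pool);                       /* always watch the listening socket */
  int maxfd = lfd;
  for (;;) {
      ready = pool;                         /* select() overwrites its argument */
      select(maxfd + 1, &ready, NULL, NULL, NULL);
      if (FD_ISSET(lfd, &ready)) {          /* (a) new connection request */
          int cfd = accept(lfd, NULL, NULL);
          FD_SET(cfd, &pool);
          if (cfd > maxfd) maxfd = cfd;
      }
      for (int fd = 0; fd <= maxfd; fd++) { /* (b) data on a connected socket */
          if (fd == lfd || !FD_ISSET(fd, &ready)) continue;
          char buf[4096];
          ssize_t n = read(fd, buf, sizeof(buf));
          if (n <= 0) { close(fd); FD_CLR(fd, &pool); }  /* EOF: drop from pool */
          /* else: process buf[0..n) without blocking */
      }
  }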

  24. Asynchronous I/O [MIT/Morris]

  struct callback {
    bool (*is_ready)(void *arg);
    void (*cb)(void *arg);
    void *arg;
  };

  main() {
    while (1) {
      for (c = each callback) {
        if (c->is_ready(c->arg))
          c->cb(c->arg);
      }
    }
  }

  • Code is structured as a collection of handlers • Handlers are nonblocking • Create new handlers for blocking operations • When the operation completes, call the handler

  25. Asynchronous server [MIT/Morris]

  init() { on_accept(accept_cb); }

  accept_cb(cfd) { on_readable(cfd, name_cb); }

  on_readable(fd, fn) {
    c = new callback(test_readable, fn, fd);
    add c to callback list;
  }

  name_cb(cfd) {
    read(cfd, name);
    fd = open(name);
    on_readable(fd, read_cb);
  }

  read_cb(cfd, fd) {
    read(fd, block);
    on_writable(cfd, write_cb);
  }

  write_cb(cfd, fd) {
    write(cfd, block);
    on_readable(fd, read_cb);
  }

  26. Multithreaded vs. Async [MIT/Morris]

  Multithreaded:
  • Hard to program: locking code
  • Need to know what blocks
  • Coordination explicit
  • State stored on the thread’s stack
  • Memory allocation implicit
  • Context switch may be expensive
  • Multiprocessors

  Async:
  • Hard to program: callback code
  • Need to know what blocks
  • Coordination implicit
  • State passed around explicitly
  • Memory allocation explicit
  • Lightweight context switch
  • Uniprocessors

  27. Coordination example [MIT/Morris]
  • Threaded server: a thread for the network interface; an interrupt wakes up the network thread; a protected (locks and condition variables) buffer is shared between the server threads and the network thread
  • Asynchronous I/O: poll for packets (how often to poll?), or have the interrupt generate an event; be careful: disable interrupts when manipulating the callback queue

  28. One View Threads!

  29. Should You Abandon Threads? • No: important for high-end servers (e.g. databases). • But, avoid threads wherever possible: • Use events, not threads, for GUIs, distributed systems, low-end servers. • Only use threads where true CPU concurrency is needed. • Where threads are needed, isolate usage in a threaded application kernel: keep most of the code single-threaded. [Figure: event-driven handlers atop a threaded kernel] [Ousterhout 1995]

  30. Another view • Events obscure control flow • For programmers and tools [Figure: web server request path: Accept Conn. → Read Request → Pin Cache → Read File → Write Response → Exit] [von Behren]

  31. Control Flow • Events obscure control flow • For programmers and tools [Figure: web server request path: Accept Conn. → Read Request → Pin Cache → Read File → Write Response → Exit] [von Behren]

  32. Exceptions • Exceptions complicate control flow • Harder to understand program flow • Cause bugs in cleanup code [Figure: web server request path: Accept Conn. → Read Request → Pin Cache → Read File → Write Response → Exit] [von Behren]

  33. State Management • Events require manual state management • Hard to know when to free • Use GC or risk bugs [Figure: web server request path: Accept Conn. → Read Request → Pin Cache → Read File → Write Response → Exit] [von Behren]

  34. [Figure: Thread 1 … Thread N pipelines, repeated from slide 16: Accept Conn → Read Request → Find File → Send Header → Read File → Send Data]

  35. Internet Growth and Scale — [Figure: the Internet cloud] How do you handle all those client requests raining on your server?

  36. Servers Under Stress — [Figure: performance vs. load (concurrent requests, or arrival rate): the ideal curve rises and then holds flat; the real curve peaks when some resource hits its maximum, then degrades in overload as some resource thrashes] [von Behren]

  37. Response Time Components • Wire time (request) + • Queuing time + • Service demand + • Wire time (response). Depends on • Cost/length of the request • Load conditions at the server [Figure: latency vs. offered load]

  38. Queuing Theory for Busy People — [Figure: M/M/1 Service Center: a request stream at offered load arrives at rate λ, waits in the queue, and is processed with mean service demand D] • Big Assumptions • Queue is First-Come-First-Served (FIFO, FCFS). • Request arrivals are independent (Poisson arrivals). • Requests have independent service demands. • i.e., arrival interval and service demand are exponentially distributed (denoted “M”).

  39. Utilization • What is the probability that the center is busy? • Answer: some number between 0 and 1. • What percentage of the time is the center busy? • Answer: some number between 0 and 100. • These are interchangeable: called utilization U. • If the center is not saturated, i.e., it completes all its requests in some bounded time, then: • U = λD (arrivals/T × service demand) • the “Utilization Law” • The probability that the service center is idle is 1 − U.

  40. Little’s Law • For an unsaturated queue in steady state, mean response time R and mean queue length N are governed by Little’s Law: N = λR. • Why? Suppose a task T is in the system for R time units. • During that time: • λR new tasks arrive. • N tasks depart (all the tasks ahead of T). • But in steady state, the flow in balances the flow out. • Note: this means that throughput X = λ.
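A quick illustration (the numbers are ours, not from the slides): if requests arrive at λ = 100 requests/s and each spends R = 0.05 s in the system, then on average the server holds N = λR = 100 × 0.05 = 5 requests, queued or in service.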

  41. Inverse Idle Time “Law” — [Figure: R vs. U, with R growing without bound as U approaches 1 (100%)] The service center saturates as 1/λ approaches D: small increases in λ cause large increases in the expected response time R. • Little’s Law gives response time R = D/(1 − U). • Intuitively, each task T’s response time is R = D + DN. • Substituting λR for N: R = D + DλR. • Substituting U for λD: R = D + UR. • Then R − UR = D, so R(1 − U) = D, giving R = D/(1 − U).

  42. Why Little’s Law Is Important 1. Intuitive understanding of FCFS queue behavior. • Compute response time from demand parameters (λ, D). • Compute N: how much storage is needed for the queue. 2. Notion of a saturated service center. • Response times rise rapidly with load and are unbounded. • At 50% utilization, a 10% increase in load increases R by 10%. • At 90% utilization, a 10% increase in load increases R by 10x. 3. Basis for predicting performance of queuing networks. • Cheap and easy “back of napkin” estimates of system performance based on observed behavior and proposed changes, e.g., capacity planning, “what if” questions.
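To see the blow-up concretely (numbers ours): take D = 10 ms, so R = D/(1 − U). At U = 0.50, R = 20 ms; a 10% load increase to U = 0.55 gives R ≈ 22 ms, about 10% worse. At U = 0.90, R = 100 ms; a 10% load increase to U = 0.99 gives R = 1000 ms, 10x worse.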

  43. What does this tell us about server behavior at saturation?

  44. Under the Hood (revisited) — [Figure: queueing network, repeated from slide 11: requests start at arrival rate λ, visit the CPU, loop through an I/O device on each I/O request/completion, and exit; throughput is λ until some center saturates]

  45. Common Bottlenecks • No more File Descriptors • Sockets stuck in TIME_WAIT • High Memory Use (swapping) • CPU Overload • Interrupt (IRQ) Overload [Aaron Bannert]

  46. Scaling Server Sites: Clustering — [Figure: clients connect through a smart switch with virtual IP addresses (VIPs) to a server array; the switch operates at L4 (TCP) or L7 (HTTP, SSL, etc.)]
  • Goals: server load balancing; failure detection; access control filtering; priorities/QoS; request locality; transparent caching
  • What to switch/filter on? L3: source IP and/or VIP; L4: (TCP) ports, etc.; L7: URLs and/or cookies; L7: SSL session IDs

  47. Scaling Services: Replication — [Figure: a client reaches Site A or Site B across the Internet] Distribute the service load across multiple sites. How do we select a server site for each client or request? Is it scalable?

  48. Extra Slides (Any new information on the following slides will not be tested.)

  49. Event-Based Concurrent Servers Using I/O Multiplexing — repeated verbatim from slide 22. [CMU 15-213]

  50. Problems of Multi-Threaded Servers • High resource usage, context switch overhead, contended locks • Too many threads → throughput meltdown, response time explosion • Solution: bound the total number of threads (a bounded thread pool sketch follows below)
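One way to impose that bound, sketched in C: pre-create a fixed pool of worker threads that all block in accept() on the shared listening socket, so at most NWORKERS requests are in service at once (NWORKERS and handle_client() are hypothetical names; error checks omitted):

  #include <pthread.h>
  #include <unistd.h>
  #include <sys/socket.h>

  #define NWORKERS 16                          /* the bound: tune to the hardware */

  static int lfd;                              /* shared listening socket */

  void *worker(void *arg) {
      for (;;) {
          int cfd = accept(lfd, NULL, NULL);   /* each connection goes to one worker */
          handle_client(cfd);                  /* hypothetical per-request handler */
          close(cfd);
      }
  }

  /* In main(), after socket/bind/listen on lfd: */
  pthread_t tid[NWORKERS];
  for (int i = 0; i < NWORKERS; i++)
      pthread_create(&tid[i], NULL, worker, NULL);

Connections beyond the pool's capacity simply wait in the kernel's accept backlog rather than spawning new threads, which is the bound the slide calls for.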
