
File Server Performance

  1. File Server Performance: AFS vs YFS

  2. Accepted AFS Limitations Alter Deployments
  • Use large numbers of small file servers
  • Use many small partitions per file server
  • Restrict the number of processors to 1 or 2
  • Limit the network bandwidth to 1 Gbit
  • Avoid workloads requiring:
    • Multiple clients creating / removing entries in a single directory
    • Multiple clients writing to or reading from a single file
    • More clients than file server worker threads accessing a single volume
  • Avoid applications requiring features that AFS does not offer:
    • Byte-range locking, extended attributes, per-file ACLs, etc.

  3. Instead of fixing the core problems, organizations have …
  • Deployed isolation file servers and complex monitoring to detect hot volumes and quarantine them
  • Developed complex workarounds including vicep-access, OSD, and OOB
  • Segregated RW and RO access into separate cells and constructed their own volume management systems to “vos release” volumes from the RW cell to the RO cells
  • Used the AFS name space for some tasks and other “high performance” file systems for others
    • NFSv3, NFSv4, Lustre, GPFS, Panasas, and others

  4. At what cost?
  • Additional servers cost money
    • US$6,800 per year according to Cornell University
    • Including hardware depreciation, support contracts, maintenance, power and cooling, and staff time
  • Increased complexity for end users
  • Multiple backup strategies

  5. The YFS Premise
  • Maintain the data and the name space
  • Fix the performance problems
  • Enhance the functionality to match Apple/Microsoft first-class file systems
  • Improve security
  • Save money

  6. Talk Outline
  • What are the bottlenecks in AFS and why do they exist?
  • What can be done to maximize the performance of an AFS file server?
  • How scalable is a YFS file server?

  7. AFS RX
  • File server throughput is bound by the amount of data the listener thread can read from the network during any time period
  • As Simon Wilkinson likes to say:
    • “There are only two things wrong with AFS RX, the protocol and the implementation.”

  8. AFS RX: The Protocol Issues
  • Incorrect Round Trip Time calculations
  • Incorrect Retransmission Timeout implementation
  • Window size vs congested networks
    • Broken window management makes congested networks worse
  • Soft ACKs and Hard ACKs
    • Twice as many ACKs as necessary
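  For reference, the estimator that RX's round-trip and retransmission timing is usually measured against is the standard one from RFC 6298. The sketch below is illustrative only (it is not the RX code, and the struct and function names are invented for this example):

```c
/* Illustrative sketch of the standard RTT/RTO estimator (RFC 6298),
 * shown as a reference point; this is NOT the RX implementation. */
#include <math.h>

struct rtt_state {
    double srtt;    /* smoothed round-trip time (seconds)   */
    double rttvar;  /* round-trip time variance             */
    double rto;     /* retransmission timeout               */
    int    first;   /* initialize to 1: no sample taken yet */
};

static void rtt_sample(struct rtt_state *s, double r /* measured RTT */)
{
    if (s->first) {
        s->srtt   = r;
        s->rttvar = r / 2.0;
        s->first  = 0;
    } else {
        /* alpha = 1/8, beta = 1/4, as recommended by RFC 6298 */
        s->rttvar = 0.75 * s->rttvar + 0.25 * fabs(s->srtt - r);
        s->srtt   = 0.875 * s->srtt + 0.125 * r;
    }
    s->rto = s->srtt + 4.0 * s->rttvar;
    if (s->rto < 1.0)        /* RFC 6298 lower bound of one second */
        s->rto = 1.0;
}
```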

  9. AFS RX: Implementation Issues
  • Lock contention
    • 20% of runtime spent waiting for locks
  • UDP context switching
    • Every packet processed on a different CPU
    • Cache line invalidation

  10. Simon’s RX Performance Talk
  • For the full details, see http://tinyurl.com/p8c8yqs

  11. The Legacy of LWP
  • Light Weight Processes (LWP) is the cooperative threading model used for the original AFS implementation
  • Only one thread can execute at a time
  • Threads yield voluntarily or when blocking for I/O
  • Data access is implicitly protected by single execution
  • All state changes between yields are effectively atomic. In other words:
    • Acquire + Release + Yield == Never Acquired
    • Acquire A + Acquire B == Acquire B + Acquire A
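  A minimal sketch of why this matters, using a hypothetical shared counter (not AFS code): under LWP the unlocked update is safe because no other thread can run between the read and the write, while under pthreads the identical code is a data race and needs the lock the conversion had to add.

```c
/* Hypothetical example, not AFS code: a counter shared by worker threads. */
#include <pthread.h>

static long active_calls;

/* Under LWP this needs no lock: the thread cannot be preempted, so the
 * read-modify-write is effectively atomic as long as it does not yield
 * or block for I/O in the middle. */
void call_started_lwp(void)
{
    active_calls++;                       /* no other thread can run here */
}

/* Under pthreads the same statement is a data race: another thread can
 * run between the load and the store, so a lock (or atomic op) is needed. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

void call_started_pthreads(void)
{
    pthread_mutex_lock(&counter_lock);
    active_calls++;
    pthread_mutex_unlock(&counter_lock);
}
```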

  12. The pthreads Conversion
  • When converting a cooperatively threaded application to pthreads, it is faster to add global locks protecting the data structures that are accessed across I/O than to redesign the data structures and the work flow
  • AFS 3.4 added pthread file servers by adding a minimal number of global locks to each package
  • AFS 3.6 added finer-grained, but still global, locks

  13. The Many Locks
  • AFS file servers must acquire many mutexes during the processing of each RPC (* = global)
  • RX
    • peer_hash*, conn_hash*, peer, conn_call, conn_data, stats*, free_packet_queue*, free_call_queue*, event_queue*, and more
  • viced
    • H* [host table, callbacks]
    • FS* [stats]
    • VOL* [volume metadata]
    • VNODE [file/dir]

  14. Lock Contention
  • Threads are scheduled onto a processor and must give up their time slice whenever a required lock is unavailable
  • With multiple processors, a thread may run on a different processor each time it is scheduled
  • Any data not in that processor’s cache, or that has been invalidated, must be fetched; locks are data in memory whose state changes on every acquire and release
  • Two side effects of global locks:
    • Only one thread at a time can make progress
    • Multiple processor cores hurt performance
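  A hypothetical illustration of the cache effect (not AFS code): every acquire/release writes the mutex's cache line, so a single global lock forces that line to bounce between cores, while per-thread state padded to its own cache line keeps the writes local.

```c
/* Hypothetical illustration of why one global lock hurts on many cores. */
#include <pthread.h>

/* Shared version: the mutex and the counter it guards live on cache
 * lines that every core repeatedly steals from the others. */
static pthread_mutex_t glock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long total_ops;

void op_shared(void)
{
    pthread_mutex_lock(&glock);
    total_ops++;
    pthread_mutex_unlock(&glock);
}

/* Per-thread version: each thread writes only its own cache line
 * (padded to 64 bytes), so cores do not invalidate each other. */
#define MAX_THREADS 256
struct per_thread_counter {
    unsigned long ops;
    char pad[64 - sizeof(unsigned long)];   /* one counter per cache line */
};
static struct per_thread_counter per_thread[MAX_THREADS];

void op_per_thread(int my_index)
{
    per_thread[my_index].ops++;             /* no shared writes, no lock */
}
```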

  15. AFS Cache Coherency via Callbacks
  • An AFS file server promises its clients that, for a fixed period of time, it will notify them if the metadata or data state of an accessed object changes
  • For read-write volumes, one callback promise per file object
  • For read-only volumes, one callback promise per volume regardless of how many file objects are accessed
  • Today, many file servers are deployed with callback tables containing millions of entries
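  A simplified sketch of the bookkeeping this implies; the types and field names below are hypothetical and far simpler than the real viced callback package, but each promise has to record roughly this: which object it covers, which client to notify, and when it expires.

```c
/* Illustrative sketch of a callback promise record; not the viced code. */
#include <time.h>

struct afs_fid {                    /* which object the promise covers     */
    unsigned int volume;
    unsigned int vnode;             /* per-file for RW volumes; for RO     */
    unsigned int unique;            /* volumes one promise covers the      */
};                                  /* whole volume                        */

struct callback_promise {
    struct afs_fid fid;             /* object (or volume) being promised   */
    unsigned int   host;            /* client to notify on change          */
    time_t         expires;         /* promise is only valid until then    */
    struct callback_promise *next;  /* hash-chain link in the table        */
};
```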

  16. Host Table Contention
  • A host table, and hash tables for looking up host entries by IP address and UUID, are protected by a single global lock
  • Host entries have their own locks; to avoid hard deadlocks, locking an entry requires dropping the global lock, obtaining the entry lock, and then re-obtaining the global lock
  • Soft deadlocks occur when multiple threads are blocked on an entry lock while the thread holding it is blocked waiting for the global lock
  • Lock contention occurs multiple times for each new rx connection and each time a call is scheduled
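  A rough sketch of that locking dance (hypothetical code, not the actual viced host package): the global table lock is dropped, the entry lock taken, then the global lock reacquired, which is exactly where threads pile up when the entry's holder is itself waiting for the global lock.

```c
/* Hypothetical sketch of the drop-global / take-entry / retake-global
 * ordering described above; names are illustrative, not viced symbols. */
#include <pthread.h>

struct host {
    pthread_mutex_t lock;          /* per-entry lock                  */
    int refcount;
    /* ... address, UUID, callback state ... */
};

static pthread_mutex_t host_glock = PTHREAD_MUTEX_INITIALIZER;  /* table + hashes */

void lock_host(struct host *h)
{
    /* caller holds host_glock and has found h in the hash tables */
    h->refcount++;                        /* keep h alive while unlocked   */
    pthread_mutex_unlock(&host_glock);    /* drop the global lock first... */
    pthread_mutex_lock(&h->lock);         /* ...then take the entry lock   */
    pthread_mutex_lock(&host_glock);      /* ...then retake the global one */
}
```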

  17. Callback Table Contention
  • The callback table is protected by the same global lock as the host table
  • Each new or updated callback promise requires exclusive access to the table
  • Notifying registered clients of state changes (breaking callbacks) requires exclusive access
  • Garbage collection of expired callbacks (5-minute intervals) requires exclusive access
  • Exceeding the callback table limit requires exclusive access for immediate garbage collection and premature callback notification

  18. Impact of Host and Callback Table Contention
  • The larger the callback table, the longer exclusive access is held for garbage collection and callback breaks
  • While exclusive access is held, no new calls can be scheduled and no existing calls can complete

  19. AFS Worker Thread Pool
  • Increasing the worker thread pool permits additional calls to be scheduled instead of blocking in the rx wait queue
  • The primary benefit of scheduling is that locks provide a filtering mechanism to decide which calls can make progress; calls on the rx wait queue can never make progress if the thread pool is exhausted
  • The downside of a larger thread pool is increased lock contention and more CPU time wasted on thread scheduling

  20. Worker Thread Pool
  • Start with the “large” configuration
    • -L
  • Make the thread pool as large as possible
    • For 1.4, -p 128
    • For 1.6, -p 256
  • Set the number of directory buffers to twice the thread count
    • -b 512

  21. Volume and Vnode Caches
  • Volume cache larger than the total volume count
    • -vc <number of volumes plus some>
  • Small vnode cache (files)
    • -s <10 x volume count>
  • Large vnode cache (directories)
    • -l <3 x volume count>
  • If volumes are very large, higher multiples may be required

  22. Callback Tables and Thrashing
  • The callback table must be large enough to avoid thrashing
    • -cb <volume-count * 13 * vnode-count>
    • That value * 72 bytes should not exceed 10% of the machine’s physical memory
  • Use “xstat_fs_test -collID 3 -onceonly” to monitor the “GetSomeSpaces” value; if it is non-zero, increase the -cb value

  23. UDP Tuning
  • UDP receive buffer
    • Must be large enough to receive all packets for in-process calls
    • <thread-count * winsize (32) * packet-size>
    • -udpsize 16777216
    • Won’t take effect unless the OS is configured to match
  • UDP send buffer
    • -sendsize 2097152
    • (2^21) unless the client chunk size is larger
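  Putting slides 20–23 together, an invocation along the following lines is what the recommendations add up to. The numbers are illustrative only (they assume roughly 1,000 volumes on the server and enough RAM that the callback table stays well under the 10% rule); derive real values from the formulas above.

```
# Illustrative sketch only -- sizes assume ~1,000 volumes; use the formulas above
fileserver -L -p 256 -b 512 \
           -vc 1200 -s 10000 -l 3000 \
           -cb 1500000 \
           -udpsize 16777216 -sendsize 2097152
# 1,500,000 callback entries * 72 bytes is roughly 103 MB of table

# On Linux, -udpsize only takes effect if the kernel limit is at least as large
sysctl -w net.core.rmem_max=16777216
```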

  24. Mount vicep* with noatime
  • The AFS protocol does not expose the last access time to clients
  • Nor does the AFS file server make use of it
  • Turn off last access time updates to avoid large amounts of unnecessary disk I/O unrelated to serving the needs of clients
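  For example, on a Linux file server the vice partitions can be remounted with noatime; the device name and filesystem type below are placeholders for whatever the partition actually uses.

```
# Remount an existing vice partition without access-time updates
mount -o remount,noatime /vicepa

# Or make it permanent in /etc/fstab (device and fs type are examples)
/dev/sdb1  /vicepa  ext4  defaults,noatime  0  2
```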

  25. Syncing Data to Disk
  • Syncing data to disk is very expensive. If you trust your UPS and have a good battery-backed caching storage adapter, we recommend reducing the frequency of sync operations
  • For 1.6.5, a new option:
    • -sync onclose

  26. YFS File Servers Scale Far Beyond AFS
  • YFS file servers experience much less contention between threads
    • RPCs take less time to complete
    • Store operations do not block simultaneous Fetch requests
  • One YFS file server can replace at least 30 AFS file servers
    • Max in-flight RPCs per AFS server = 240
    • Max in-flight RPCs per YFS server = 16,000 (dynamic)
    • 240 * 30 = 7,200

  27. How Fast Can RX/UDP Go?
  • Up to 8.2 Gbit/second per listener thread

  28. SLAC Testing
  • SLAC has experienced file server meltdowns for years; a large number of file servers are deployed to distribute load and isolate the volumes accessed by users
  • One YFS file server satisfied 500 client nodes for nearly 24 hours without noticeable delays
    • 1 Gbit NIC, 8 processor cores, 6 Gbit/sec local RAID disk
    • 800 operations per second
    • 55 MB/sec FetchData
    • 5 MB/sec StoreData

  29. Other Benefits
  • 2038-safe
  • 100 ns time resolution
  • 2^64 volumes
  • 2^96 vnodes per volume
  • 2^64 max quota / volume / partition size
  • Per-file ACLs
  • Volume security policies
  • Maximum ACL / wire privacy
  • Servers do not run as “root”
  • Linux O_DIRECT
  • Mandatory locking
  • IPv6 network stack

  30. Security, Security, Security
  • RXGK
    • GSS-API authentication
    • AES-256/SHA-1 wire privacy
  • File server wire security policies
    • File servers cannot serve volumes with stronger required policies
  • Combined identity tokens
  • Keyed cache managers / machine IDs
  • Maximum volume ACL prevents data leaks
