
I/O Performance Analysis and Tuning: From the Application to the Storage Device


Presentation Transcript


  1. I/O Performance Analysis and Tuning: From the Application to the Storage Device Henry Newman Instrumental, Inc. MSP Area Computer Measurement Group March 2, 2006

  2. Tutorial Goal • To provide the attendee with an understanding of the history and techniques used in I/O performance analysis. “Knowledge is Power” - Sir Francis Bacon

  3. Agenda • Common terminology • I/O and applications • Technology trends and their impact • Why knowing the data path is important • Understanding the data path • Application performance • Performance analysis • Examples of I/O performance issues • Summary

  4. Common Terminology Using the same nomenclature

  5. The Data Path • Before you can measure system efficiency, it is important to understand how the H/W and S/W work together end-to-end or along the “data path”. • Let’s review some of the terminology along the data path…

  6. Terminology/Definitions • DAS • Direct Attached Storage • SAN • Storage Area Network • NAS • Network Attached Storage shared via TCP/IP • SAN Shared File System • File system that supports shared data between multiple servers • Sharing is accomplished via a metadata server or distributed lock manager

  7. Direct Attached Storage [Diagram: clients on a Local Area Network (LAN) connect to servers 1 through N; each server runs its own application and file system and reaches its own disks through its own F/C switch]

  8. Storage Area Network [Diagram: clients on a Local Area Network (LAN) connect to servers 1 and 2, each running its own application and file system; the servers share a RAID controller and disks through a common F/C switch]

  9. Network Attached Storage [Diagram: clients and application servers 1 through N on a Local Area Network (LAN) access a NAS server, whose own O/S and file system manage the disks]

  10. File System Terminology • File System Superblock • Describes the layout of the file system • The location and designation of volumes being used • The type of file system, layout, and parameters • The location of file system metadata within the file system and other file system attributes • File System Metadata • This is the data which describes the layout of the files and directories within a file system • File System Inode • This is file system data which describes the location, access information and other attributes of files and directories
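
As a concrete illustration of the inode attributes listed above, the sketch below (not from the original slides) uses the POSIX stat() call to print a file's inode number, size, and modification time for whatever file name you pass on the command line.

    /* Minimal sketch: print a few inode attributes via stat(2). */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <time.h>

    int main(int argc, char **argv)
    {
        struct stat sb;

        if (argc != 2 || stat(argv[1], &sb) != 0) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }
        printf("inode number    : %llu\n", (unsigned long long)sb.st_ino);
        printf("size in bytes   : %lld\n", (long long)sb.st_size);
        printf("512-byte blocks : %lld\n", (long long)sb.st_blocks);
        printf("last modified   : %s", ctime(&sb.st_mtime));
        return 0;
    }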

  11. Other Terms • HSM - Hierarchical Storage Management • Management of files that are viewed as if they are on disk within the file system, but are generally stored on a near line or off-line device such as tape, SATA, optical and/or other lower performance media. • LUN - Logical Unit Number • Term used in SCSI protocol to describe a target such as a disk drive, RAID array and/or tape drive.

  12. Other Terms (cont.) • Volume Manager (VM) • Manages multiple LUNs grouped into a file system • Can usually stripe data across volumes or concatenate volumes (filling one volume before moving on to fill the next) • Usually has a fixed stripe size allocated to each LUN before allocating the next LUN • For well tuned systems this is the RAID allocation (stripe width) or a multiple of it
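
As a hypothetical worked example (the numbers are chosen only for illustration): an 8+1 RAID set with a 64 KB segment per data disk has a stripe width of 8 x 64 KB = 512 KB, so setting the volume manager allocation to 512 KB, or a multiple of it, keeps each VM allocation aligned with full RAID stripes.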

  13. Round-Robin File System Allocation [Diagram: files 1 through 8 placed one per LUN, round-robin, across a file system stripe group of LUNs used to match bandwidth]

  14. Round-Robin File System [Diagram: the stripe group populated with three (3) files, then shown after File 3 is removed]

  15. Striped File System Allocation • With stripe allocation, all writes go to all devices based on the allocation within the volume manager • Each file is not allocated on a single disk, but across all disks [Diagram: files 1 through 8 striped across all LUNs]

  16. Striped File System Allocation [Diagram: the stripe group populated with three (3) files, then shown fragmented after File 3 is removed]

  17. Microsoft NTFS Layout [Diagram: a newly formatted NTFS volume laid out as Boot, MFT, Free Space, Metadata, Free Space] • Data and metadata are mixed and can easily become fragmented • Head seeks on the disks are a big issue, given the different access patterns for data (long block sequential) and metadata (short block random)

  18. SAN Shared File System (SSFS) • The ability to share data between systems directly attached to the same devices • Accomplished through SCSI connectivity and a specialized file system and/or communications mechanism • Fibre Channel • iSCSI • Other communications methods

  19. SAN Shared File System (SSFS) • Different types of SAN file systems allow multiple writers to the same file system, and the same file to be open from more than one machine • POSIX semantics were never designed with shared file systems in mind • Let’s take a look at 2 different types of SSFS…

  20. Centralized Metadata SSFS [Diagram: clients and application servers on a Local Area Network (LAN), which carries the metadata traffic; one server runs the metadata file system while the others run client file systems; all servers reach the RAID controller and disks through an F/C switch]

  21. Distributed Metadata SSFS [Diagram: clients and application servers on a LAN, which may carry local file system data traffic depending on the implementation; each server runs its own application, lock manager, and file system; all servers reach the RAID controller and disks through an F/C switch]

  22. More on SSFS • Metadata server approaches do not scale as well as distributed lock managers for client counts over 64 • Lustre and GPFS are some examples of distributed metadata • Panasas scales similarly to the distributed metadata approaches, but views files as objects

  23. Definition - Direct I/O • I/O which bypasses the server's memory-mapped cache and goes directly to disk • Some file systems can automatically switch between paged I/O and direct I/O depending on I/O size • Some file systems require special attributes to force direct I/O for specific files or directories, or enable it via an API • Emerging technologies often call data movement directly to the device “Direct Memory Addressing” or “DMA” • Similar to what is done with MPI communications

  24. Direct I/O Improvements • Direct I/O is similar to “raw” I/O used in database transactions • CPU usage for direct I/O • Can be as little as 5% of paged I/O • Direct I/O improves performance • If the data is written to disk and not reused by the application • Direct I/O is best used with large requests • This might not improve performance for some file systems

  25. Well Formed I/O • The application, operating system, file system, volume manager and storage device all have an impact on I/O • I/O that is “well formed” must satisfy requirements from all of these areas for the I/O to move efficiently • I/O that is well formed reads and writes data in multiples of the basic block size of these devices
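
A minimal sketch of the idea, assuming a 512 byte basic block (the common disk sector size discussed later in this tutorial); real devices add further requirements such as memory alignment and RAID stripe alignment:

    /* Sketch: is a request "well formed" for a 512-byte block device? */
    #include <stdbool.h>
    #include <stdio.h>

    #define BLOCK 512

    static bool well_formed(long long offset, long long length)
    {
        /* Both the start and the end of the request must fall on block boundaries. */
        return (offset % BLOCK == 0) && (length % BLOCK == 0);
    }

    int main(void)
    {
        printf("%d\n", well_formed(0, 262144));   /* 1: well formed     */
        printf("%d\n", well_formed(1, 262144));   /* 0: not well formed */
        return 0;
    }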

  26. Well Formed I/O and Metadata • If a file system does not separate data and metadata and their space is co-located, data alignment can suffer because metadata is interspersed with data • Large I/O requests are not necessarily allocated sequentially • File systems allocate data based on internal allocation algorithms • Multiple write streams prevent sequential allocation

  27. Well Formed & Direct I/O from the OS • Even if you use the O_DIRECT option, I/O cannot move from user space to the device unless it begins and ends on 512 byte boundaries • On some systems additional OS requirements are mandated • Page alignment • 32 KB requests often times related to page alignment • Just because memory is aligned does not mean that the file system or RAID is aligned • These are out of your control

  28. Well Formed & Direct I/O from Device • Just because data is aligned in the OS does not mean it is aligned for the device • I/O for disk drives must begin and end on 512 byte boundaries • And of course you have the RAID alignment issues • More on this later

  29. Volume Managers (VMs) • For many file systems, the VMs control the allocation to each device • VMs often have different allocation sizes than the file system • Making read/write requests equal to or in multiples of the VM allocation generally improves performance • Some VMs have internal limits that prevent large numbers of I/O requests from being queued

  30. Device Alignment • Almost all modern RAID devices have a fixed allocation per device • Ranges from 4 KB to 512 KB are common • File systems face the same alignment issue with RAID controllers as memory does with the operating system

  31. Direct I/O Examples (Well Formed) • Any I/O request that begins and ends on a 512 byte boundary is well formed* • Example: a request that begins at byte 0 and is 262,144 bytes long * Well formed in terms of the disk, not the RAID

  32. Direct I/O Examples (Not Well Formed) • I/O that is not well formed can be broken by some file systems into well formed parts and non-well formed parts • Example: a request that begins at byte 1 and is 262,144 bytes long • 1st part: bytes 1-511 (511 bytes) are moved through the system buffer to reach the first 512 byte boundary • 2nd part: bytes 512-262,143 move directly • 3rd part: the final byte falls in the trailing partial sector, which is read and buffered
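
The arithmetic behind this split, as a small sketch using the same numbers (no I/O is issued; how a given file system actually breaks up the request is implementation dependent):

    /* Sketch: split a misaligned request into a buffered head, a direct
       (well formed) middle, and a buffered tail. */
    #include <stdio.h>

    #define SECTOR 512

    int main(void)
    {
        long long offset = 1;          /* request starts at byte 1 ...  */
        long long length = 262144;     /* ... and is 262,144 bytes long */
        long long end    = offset + length;

        long long head   = (SECTOR - offset % SECTOR) % SECTOR;  /* bytes up to the next boundary */
        long long tail   = end % SECTOR;                         /* bytes past the last boundary  */
        long long direct = length - head - tail;                 /* the well formed middle        */

        printf("buffered head : %lld bytes\n", head);    /* 511     */
        printf("direct middle : %lld bytes\n", direct);  /* 261,632 */
        printf("buffered tail : %lld bytes\n", tail);    /* 1       */
        return 0;
    }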

  33. Well Formed I/O Impact • I/O that is not well formed causes significant overhead in the kernel to read data that is not aligned • The impact depends on other factors such as page alignment • Other impacts on the RAID depend on the RAID configuration and its allocation size

  34. I/O and Applications What is the data path?

  35. What Happens with I/O • I/O can take different paths within the operating system depending on the type of I/O request • These different paths have a dramatic impact on performance • Application I/O falls into two types that take different paths • C library buffered I/O • System calls

  36. I/O Data Flow Example [Diagram: data flows from program space through the C library buffer and the page cache, and on some systems a file system cache, out to storage. Numbered paths: (1) raw I/O with no file system, or direct I/O; (2) all I/O under file system read/write calls; (3) file system metadata and data, most file systems; (4) as data is aged it is moved from the page cache to storage; (5) data moved to the file system cache on some systems] • All data goes through the system buffer cache • High overhead, as data must compete with user operations for system cache

  37. C Library Buffered • Library buffer size • The size of the stdio.h buffer. Generally, this is between 1,024 bytes and 8,192 bytes and can be changed on some systems by calls to setvbuf() • Moving data via the C library requires multiple memory moves and/or memory remapping to copy data from user space to the library buffer to the storage device • Library I/O generally has much higher overhead than system calls because you have to make more system calls, given the small request sizes
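
A minimal sketch of enlarging the stdio buffer with setvbuf(); the 1 MB size and the file name datafile are illustrative choices, not recommendations from the slides:

    /* Sketch: give a stdio stream a large, fully buffered I/O buffer. */
    #include <stdio.h>
    #include <stdlib.h>

    #define BIGBUF (1024 * 1024)

    int main(void)
    {
        FILE *fp = fopen("datafile", "r");          /* hypothetical file */
        if (fp == NULL) { perror("fopen"); return 1; }

        char *buf = malloc(BIGBUF);
        /* setvbuf() must be called before the first read or write on the stream. */
        if (buf == NULL || setvbuf(fp, buf, _IOFBF, BIGBUF) != 0) {
            fprintf(stderr, "setvbuf failed\n");
            return 1;
        }

        char line[256];
        while (fgets(line, sizeof(line), fp) != NULL)
            ;   /* each small fgets() request is now served from the 1 MB buffer */

        fclose(fp);
        free(buf);
        return 0;
    }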

  38. C Library I/O Performance • If I/O is random and the buffer is bigger than the request, more data will be read than is needed • If I/O is sequential • If the buffer is bigger than the request, then data is read ahead • If the buffer is smaller than the request size, then multiple system calls will be required • Library buffering helps only if the buffer is larger than the request • And it needs to be significantly larger, given the extra overhead to move the data or remap the pages

  39. System Calls • UNIX system calls are generally more efficient for random or sequential I/O • The exception is sequential I/O with small requests, where C library I/O with a large setvbuf() buffer can do better • System calls allow you to perform asynchronous I/O • Control returns to the program immediately, and you wait for the acknowledgment only when you need the data to be on the device
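
A minimal sketch of asynchronous I/O using the POSIX AIO interface (aio.h), again with a hypothetical file named datafile; on Linux this is typically linked with -lrt. It shows the pattern described above: the call returns immediately, and the program waits for completion only when it needs the data.

    /* Sketch: issue an asynchronous read, do other work, then wait for it. */
    #include <aio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        static char buf[4096];
        int fd = open("datafile", O_RDONLY);        /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof(buf);
        cb.aio_offset = 0;

        if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

        /* Control returns immediately; other work can be done here. */

        const struct aiocb *list[1] = { &cb };
        aio_suspend(list, 1, NULL);                 /* block only when the data is needed */
        printf("read %zd bytes\n", aio_return(&cb));

        close(fd);
        return 0;
    }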

  40. Vendor Libraries • Some vendors have custom libraries that • Manage data alignment • Provide circular asynchronous buffering • Allow readahead • Cray, IBM and SGI all have libraries which can significantly improve I/O performance for some applications • There is currently no standard in this area • There is an effort by DOE to develop similar technology for Linux

  41. Technology Trends and Their Impact What is changing and what is not

  42. Block Device History • The concept of block devices has been around for a long time…at least 35 years • A block device is a data storage or transfer device that manipulates data in groups of a fixed size • For example, a disk, whose block size is usually 512 bytes for SCSI devices

  43. SCSI Technology History • The SCSI standard has been in place for a long time as well • There is an excellent historical account of SCSI at http://www.pcguide.com/ref/hdd/if/scsi/over.htm • Though the SCSI history is interesting and the technology has been advanced by many companies, the standard itself was published in 1986 • Which makes it roughly 20 years old

  44. Changes Have Been Limited • Since the advent of block devices and the SCSI protocol, modest changes have been made to support • Interface changes, new device types, and some changes for error recovery and performance • Nothing has really changed in the basic concepts of the protocol • Currently there is no communication regarding data topology between block devices and SCSI • Although one new technology has promise - more on OSD later

  45. Relative Latency for Data Access [Chart: approximate minimum and maximum relative latency, spanning roughly 12 orders of magnitude, for CPU registers, L1 cache, L2 cache, memory, disk, NAS, and tape]

  46. Relative Bandwidth for Data [Chart: approximate minimum and maximum relative bandwidth in GB/sec, spanning roughly 6 orders of magnitude, for CPU registers, L1 cache, L2 cache, memory, disk, NAS, and tape]

  47. Performance Increases (1977-2005) [Chart: relative increases in CPU performance, disk drive transfer rate, RPMs, seek+latency, capacity, and RAID disk read/write performance]

  48. Bandwidth per GB of Capacity [Chart: MB/sec per GB of capacity for a 1977 single disk, a 300 GB single disk, 300 GB RAID 4+1 and 8+1, and 300 GB SATA RAID 4+1 and 8+1; the values fall from roughly 37.5 MB/sec per GB in 1977 to well under 1 MB/sec per GB for the modern configurations]

  49. Modern Bandwidth/Capacity

  50. 4 KB IOPS for a Single Device [Chart: number of 4 KB I/Os per second for a 1977 CDC Cyber 819, a 2005 300 GB Seagate Cheetah 10K.7, and a 2005 400 GB SATA drive; the values shown are roughly 270, 765, and 1,655]
