
I/O Subsystem Chapter 8



  1. I/O Subsystem, Chapter 8. N. Guydosh, 4/28/04+

  2. Introduction • Amazing variation of characteristics and behaviors • Characteristics largely driven by technology • Not as “elegant” as processors or memory systems • Traditionally the study of I/O took a “back seat” to processors and memory • An unfortunate situation, because a computer system is useless without I/O, and Amdahl’s law tells us that ultimately I/O is the performance bottleneck. See the example in section 8.1. Typical I/O configuration: Fig. 8.1

  3. I/O Performance Metrics • A point of confusion: in I/O systems, KB, MB, etc. are traditionally powers of 10: 1000, 1,000,000 bytes; but in memory/processor systems these are powers of 2: 1024, 1,048,576 • For simplicity let’s ignore the small difference and use only one base, say 2. • “Supercomputer” I/O benchmarks • Typically for check-pointing the machine – want maximum bytes/sec on output. • Transaction processing (TP) • Response time and throughput important • Lots of small I/O events, thus the number of disk accesses per second is more important than “bytes/sec” • Reliability very important • File system I/O benchmarks • These exercise the I/O system with I/O commands; examples for UNIX: MakeDir, Copy, ScanDir (traverse a directory tree), ReadAll (scan every byte in every file once), Make (compiling and linking)

  4. Types & Characteristics of I/O Devices • Again, diversity is the problem here • Devices differ significantly in: behavior; “partner” – purely machine-interfaced or human-interfaced; data rate – ranges from a few bytes/sec to tens of millions of bytes/sec. • See text for descriptions of various devices commonly in use • Disk access time calculation: • See book on disk organization • Components of access time: average seek time – move the head to the desired track; rotational latency – wait for the sector to reach the head (on average half a rotation, i.e. 0.5/RPM minutes); transfer time – time to read or write a sector; sometimes queuing time is included – waiting for a request to get serviced. • Disk density and size affect performance and usefulness
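The access-time components listed above can be sketched as a small calculation. This is a minimal sketch; the parameter values in the usage line are illustrative, not taken from the text.

```python
def disk_access_time_ms(seek_ms, rpm, transfer_mb_s, sector_kb, queue_ms=0.0):
    """Average time to read or write one sector, in milliseconds."""
    # Rotational latency: on average the head waits half a rotation,
    # i.e. 0.5 / RPM minutes, converted here to milliseconds.
    rotational_ms = 0.5 / rpm * 60 * 1000
    # Transfer time: sector size divided by the media transfer rate.
    transfer_ms = (sector_kb / 1000) / transfer_mb_s * 1000
    return seek_ms + rotational_ms + transfer_ms + queue_ms

# Illustrative numbers: 8 ms seek, 5400 RPM, 5 MB/sec media rate, 0.5 KB sector.
t = disk_access_time_ms(8, 5400, 5, 0.5)
```

Note that seek plus rotational latency usually dominates for small transfers, which is why small random I/Os are so much slower than sequential ones.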

  5. Connecting The System: Busses • A “bus” connects subsystems together • Connects processor, memory, and I/O devices together • Consists of a set of wires with control logic and a well-defined protocol for using the bus • Protocol is implemented in hardware • A “standard” bus design was a prime factor in the success of the personal computer • Purchase a base system and “grow” it by adding off-the-shelf components • Historically a very chaotic aspect of the computer industry – the “bus wars” ... PCI wins, Microchannel loses! • Busses are a key factor in the overall performance of a computer system

  6. Connecting The System: Busses (cont.) • Some bus tradeoffs: • Advantage: flexibility in adding new devices & peripherals • Disadvantage: a serially reusable resource ==> only one communication at a time – a bottleneck • Two performance goals • High bandwidth (data rate, MB/sec) • Low latency • Bus consists of a set of data lines and control lines • Data lines carry both addresses and raw data • Because the bus is shared, we need a protocol to decide who uses it next • Bus transaction (send address & receive or send data) • Terminology is from the point of view of memory (confusing!) • Input: writes data to memory from I/O • Output: reads data from memory to I/O • See examples in figs. 8.7, 8.8

  7. Connecting The System: Busses (cont.) Mem read cmd & address on bus  access data in memory  “data ready” response & data on bus  write to disk. Fig. 8.7 Output operation: data from memory “outputted” to device

  8. Connecting The System: Busses (cont.) Write reg cmd & addr on bus  data on bus  read from disk. Fig. 8.8 Input operation: data to memory “inputted” from device

  9. Types of Busses • Backplane (motherboard) bus • Interconnects backplane components • Plug-in feature • Typical “standard” busses (ISA, AT, PCI ...) • Connects to other busses • Processor–memory bus • Usually proprietary • High speed • Direct connection of processor to memory, with links to other busses • I/O bus • Typically does not connect directly to memory • Usually bridges to the backplane or processor–memory bus • Examples: SCSI, IDE, EIDE, …

  10. Types of Busses (cont.) • A lot of functional overlap in the above 3 types of busses • Can put memory directly on the backplane bus • Logic is needed to interconnect busses (bridge chips) • Ex: backplane to I/O bus • A system may have a single backplane bus: • Ex: old PCs (ISA/AT) • See fig 8.9, p. 659 for examples ==>

  11. Types of Busses – Examples (Fig. 8.9) Single backplane: older PCs. Processor/memory bus as the main bus – could be a PCI backplane in modern computers. All 3 types of busses utilized: proprietary processor–memory bus (old IBM?), I/O bus (ex: EIDE bus in a PC), backplane bus (ex: PCI)

  12. Synchronous Vs. Asynchronous Busses • Synchronous • Bus includes a clock line in the control lines • Protocol is not very data dependent • Protocol tightly coupled to the clock – highly synchronized, completely clock driven • The only asynchronous part is the generation of commands or requests • A model for this type of bus is an FSM • Disadvantages: all devices on the bus must run at the same clock speed; lines must be short due to the clock skew problem • Advantage: can have high performance in special applications such as processor–memory bussing • Sometimes used for the processor–memory bus

  13. Synchronous Vs. Asynchronous Busses (cont.) • Asynchronous • Very little clock dependency • Event driven • Keeps in step via handshaking – see the example in figure 8.10 • Very versatile • Bus can be “arbitrarily” long • Common for standard busses • Ex: SBus (Sun), Microchannel, PCI • Can even connect busses/devices using different clocks • Disadvantage: lower performance … due to handshaking? • A model for this type of bus is a pair of interacting FSMs • See fig 8.11, p. 664 ... see the performance analysis on pp. 662-663, based on figure 8.10

  14. Handshaking on an Asynchronous Bus Operation: data from memory to device. Initially: device raises ReadReq and puts the address on the data lines 1. Mem sees ReadReq, reads the address from the data bus, & raises Ack 2. I/O device sees the Ack line high & releases ReadReq & the data lines 3. Mem sees ReadReq low & drops the Ack line to ack the ReadReq signal 4. Mem puts the data on the data lines and asserts DataRdy 5. I/O sees DataRdy, reads the data, and signals Ack 6. Mem sees Ack & drops DataRdy and releases the data lines 7. I/O sees DataRdy drop and drops the Ack line Note: bus is bi-directional Question: what happens if an Ack fails to get issued? Color coding: colored signals are from the device; black signals are from memory Fig. 8.10
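The seven steps above can be traced with a toy sequential model of the protocol. This is a hypothetical sketch (the function and variable names are mine, not from the figure); a real bus runs the two sides concurrently, but the causal ordering of line changes is exactly the numbered sequence.

```python
def async_read_handshake(memory, address):
    """Device reads one word from memory; returns (data, trace of steps)."""
    lines = {"ReadReq": 0, "Ack": 0, "DataRdy": 0}
    trace = []
    # Initially: device raises ReadReq and puts the address on the data lines.
    lines["ReadReq"] = 1
    bus = address
    # 1. Memory sees ReadReq, latches the address, raises Ack.
    latched = bus; lines["Ack"] = 1; trace.append(1)
    # 2. Device sees Ack high, releases ReadReq and the data lines.
    lines["ReadReq"] = 0; bus = None; trace.append(2)
    # 3. Memory sees ReadReq low, drops Ack.
    lines["Ack"] = 0; trace.append(3)
    # 4. Memory puts the data on the bus and asserts DataRdy.
    bus = memory[latched]; lines["DataRdy"] = 1; trace.append(4)
    # 5. Device sees DataRdy, reads the data, raises Ack.
    data = bus; lines["Ack"] = 1; trace.append(5)
    # 6. Memory sees Ack, drops DataRdy and releases the data lines.
    lines["DataRdy"] = 0; bus = None; trace.append(6)
    # 7. Device sees DataRdy low, drops Ack -- bus is idle again.
    lines["Ack"] = 0; trace.append(7)
    assert all(v == 0 for v in lines.values())  # all control lines released
    return data, trace
```

If an Ack were never issued (the question above), the other side would wait forever: nothing in the protocol itself times out.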

  15. FSM Model of Asynchronous Bus, based on the example in fig. 8.10 The numbers in each state correspond to the numbered steps in fig. 8.10 Fig. 8.11

  16. An Example (pp. 662-663) • Referring to the example in fig 8.10: we will compare the asynchronous bandwidth (BW) with a synchronous approach • Asynchronous: • 40 ns per handshake (one of the 7 steps) • Synchronous: • Clock cycle = 50ns • Each bus transmission takes one clock cycle • Both schemes: 32-bit data bus and one-word reads from a 200ns memory • Synchronous: • Send address to memory: 50ns, read memory: 200ns, send data to device: 50ns, for a total time of 300 ns • BW = 4 bytes/300ns = 13.3 MB/sec • Asynchronous: • Can overlap steps 2, 3, and 4 with the memory access time • Step 1: 40ns • Steps 2, 3, 4: maximum{3x40ns, 200ns} = 200ns (steps 2, 3, 4 “hidden” by the memory access) • Steps 5, 6, 7: 3x40 = 120ns • BW = 4 bytes/(40+200+120)ns = 11.1 MB/sec • Observation: synchronous is only 20% faster, due to the overlap in handshaking • Comment: asynchronous is usually preferred because it is more technology independent and more versatile in handling different device speeds
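The two bandwidth figures can be recomputed directly from the stated parameters (a sketch; MB here means 10^6 bytes, matching the slide's arithmetic):

```python
# 40 ns per handshake step, 50 ns synchronous clock, 200 ns memory,
# one 4-byte word per read.
def sync_bw_mb_s():
    total_ns = 50 + 200 + 50              # send address + memory read + send data
    return 4 / total_ns * 1000            # bytes/ns -> MB/sec

def async_bw_mb_s():
    step_1 = 40
    steps_2_3_4 = max(3 * 40, 200)        # hidden behind the 200 ns memory access
    steps_5_6_7 = 3 * 40
    return 4 / (step_1 + steps_2_3_4 + steps_5_6_7) * 1000
```

The ratio is 360/300 = 1.2, i.e. the synchronous bus is 20% faster, as stated.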

  17. An Example (pp. 665-666) – The Effect of Block Size on Synchronous Bus Bandwidth • Bus description • Two cases to consider: memory & bus system supporting access of 4-word blocks (case 1) and 16-word blocks (case 2), where a word is 32 bits in each case • 64-bit (2-word) synchronous bus clocked at 200MHz (5 ns/cycle), each 64-bit transfer taking 1 clock cycle; 1 clock cycle needed to send the initial address • Two idle clock cycles needed between bus operations – bus assumed to be idle before an access • A memory access for the first 4 words is 200ns (40 cycles) and each additional set of 4 words is 20 ns (4 cycles) • Assume that a bus transfer of the most recently read data and a read of the next 4 words can be overlapped. • Summary: memory is accessed 4 words at a time but data must be sent over the bus in two 2-word shots (2 cycles) since the bus is only 2 words wide. • Find: sustained bandwidth, latency (xfr time of 256 words), & # bus transactions/sec for a read of 256 words in the two cases: 4-word blocks and 16-word blocks. Note: interpret a “bus transaction” as transferring a (4- or 16-word) block.

  18. An Example (pp. 665-666) – Case 1: 4-word Block Transfers • 1 clock cycle to send the address of the block to memory • The 200 MHz bus has a 5ns period (5ns/cycle); memory access time (1st (and only) 4 words) is 200ns; #cycles to read memory = (memory access time)/(clock cycle time) = 200ns/5ns = 40 cycles • 2 clock cycles to send the data from memory, since we transfer 64 bits = 2 words per cycle and a block is 4 words • 2 idle cycles between this transfer and the next • Note: no overlap here because the entire block is transferred in one access. Overlap occurs only within a block for multiple accesses – as in case 2 (next). • Total number of cycles for a block = 45 cycles. 256 words to be read results in 256/4 = 64 blocks (transactions), thus 45x64 = 2880 cycles needed for the transfer; latency = 2880 cycles x 5ns/cycle = 14,400 ns; # bus transactions/sec = 64/14400ns = 4.44M transactions/sec; BW = (256x4) bytes/14400ns = 71.11 MB/sec
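The case-1 arithmetic can be packaged as a quick check (a sketch; the function name is mine):

```python
def four_word_read_stats(total_words=256, cycle_ns=5):
    """Cycles per block, total latency (ns), transactions/sec, MB/sec."""
    cycles_per_block = 1 + 40 + 2 + 2        # addr + mem access + transfer + idle
    blocks = total_words // 4                # one bus transaction per 4-word block
    latency_ns = cycles_per_block * blocks * cycle_ns
    txns_per_sec = blocks / (latency_ns * 1e-9)
    bw_mb_s = (total_words * 4) / latency_ns * 1000   # bytes/ns -> MB/sec
    return cycles_per_block, latency_ns, txns_per_sec, bw_mb_s
```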

  19. An Example (pp. 665-666) – Case 2: 16-word Block Transfers • Timing for a 1-block (16-word) transfer: the first access is essentially case 1. Note: a 16-word block is read in four 4-word shots, thus there will be overlap. Total = 1 + 40 + 16 = 57 cycles … was 45 for 4-word block Number of transactions (blocks) needed = 256/16 = 16 transactions … was 64 for 4-word blk Total transfer time = 57x16 = 912 cycles … was 2880 for 4-word block Latency = 912 cycles x 5 ns/cycle = 4560 ns … was 14,400ns for 4-word block Transactions/sec = 16/4560 ns = 3.51M transactions/sec … was 4.44M for 4-word block BW = (256x4)/4560ns = 224.56 MB/sec … was 71.11 for 4-word block
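And the case-2 version, recomputing the same quantities (a sketch; note the 4 visible cycles per 4-word group, since each later group's 20 ns memory access overlaps the previous group's 2 transfer + 2 idle cycles):

```python
def sixteen_word_read_stats(total_words=256, cycle_ns=5):
    """Cycles per 16-word block, latency (ns), transactions/sec, MB/sec."""
    cycles_per_block = 1 + 40 + 4 * 4        # addr + first access + 4 groups
    blocks = total_words // 16
    latency_ns = cycles_per_block * blocks * cycle_ns
    txns_per_sec = blocks / (latency_ns * 1e-9)
    bw_mb_s = (total_words * 4) / latency_ns * 1000
    return cycles_per_block, latency_ns, txns_per_sec, bw_mb_s
```

Recomputing (256x4) bytes/4560 ns gives about 224.56 MB/sec: roughly 3x the case-1 bandwidth, at the cost of fewer but longer transactions.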

  20. Controlling Bus Access • Only one device controls the bus at a time • Bus control – the “bus master” • Controls access to the bus • Initiates & controls all bus requests • Slave • Never generates its own requests • Responds to read and write requests • Processor: always a master • Memory: usually a slave • Having a single bus master could create a bottleneck • Processor would be involved in every bus transaction • See fig 8.12 for an example

  21. Bus Control With a Single Master The disk makes a request to the processor: a data xfr from memory to disk. The processor responds by asserting the read request line to memory. The processor acks to the disk that the request is being processed. The disk now places the desired address on the bus. Fig. 8.12

  22. Controlling Bus Access – Multiple Masters • Bus arbitration – deciding which master gets control of the bus: p. 669 • A chip (arbiter) decides which device gets the bus next • Typically each device has a dedicated request line to the arbiter • The arbiter will eventually issue a grant (separate line to the device) • The device now is master, uses the bus, and then signals the arbiter when it is done with the bus. • Devices have priorities • The bus arbiter may invoke a “fairness” rule for a low-priority device which is waiting • Arbitration time is overhead and should be overlapped with bus transfers whenever possible – maybe use physically separate lines for arbitration.

  23. Arbitration Schemes p. 670 • Daisy chain • Chain runs from high- to low-priority devices • A device making a request takes the grant but does not pass it on; the grant is passed on only by non-requesting devices – no fairness, possible starvation.
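The pass-the-grant behavior, and why starvation is possible, can be shown with a toy model (a sketch; names are mine, not from the text):

```python
def daisy_chain_grant(requesting):
    """requesting: list of bools, index 0 = device electrically closest to the
    arbiter (highest priority). The grant propagates down the chain and is
    consumed by the first requesting device it reaches."""
    for device, wants_bus in enumerate(requesting):
        if wants_bus:
            return device      # takes the grant, does not pass it on
    return None                # no requests; grant goes unused
```

As long as device 0 keeps requesting, every arbitration round returns 0, so a device further down the chain can wait forever: that is the starvation problem.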

  24. Arbitration Schemes p. 670 • Centralized, parallel • Multiple request lines; the chosen device becomes master; requires a central arbiter – a potential bottleneck • Used by PCI • Distributed arbitration – self selection • Multiple request lines • Request: place an id code on the bus – by examining the bus a device can determine its priority • No need for a central arbiter; needs more lines for requests (ex: NuBus for Apple/Mac) • Distributed arbitration by collision detection • Free-for-all – request the bus at will • A collision detection scheme then resolves who gets it • Ethernet uses this.

  25. I/O To Memory, Processor, OS Interfaces • Questions (p. 673) • How do I/O requests transform to device commands and get transferred to a device? • How are data transfers between device and memory done? • What is the role of the operating system? • The OS: • Device drivers operating at kernel/supervisory mode. • Performs interrupt handling & DMA services. • Functions: commands to I/O; respond to I/O signals ... some are interrupts; control data transfer ... buffers, DMA, other algorithms, control priorities.

  26. Commands To I/O Devices • Two basic approaches: • Direct I/O (programmed I/O or “PIO”) • Memory mapped I/O • PIO • Special I/O instructions: in/out for Intel • The “address” associated with in/out is put on the address bus, but the op-code context causes the I/O interface ... usually device registers ... to be accessed, causing I/O activity • The address is an I/O port • Memory mapped => see next

  27. Commands To I/O Devices (cont.) • Memory mapped • A certain portion of the address space is reserved for I/O devices • A program communicates with a device the same way it does with memory: memory instructions are used • If the address is in the “device space” range, the device controller responds with appropriate commands to the device ... read/write • User programs are not allowed to access memory mapped I/O space • The address used by an instruction encodes both the device identity & the type of data transmission • Memory mapped is usually faster than PIO because DMA is available

  28. I/O - Processor Communication: Polling/Memory Mapped • Polling is the simplest way for I/O to communicate with the processor • Periodically check status bits to see what to do next; the I/O device posts status in a special register, ex: “I am busy” • The processor continually checks for status using either PIO or memory mapped I/O • Wastes a lot of processor time because processors are faster than I/O devices. • Many of the polls occur when the waited-for event has not yet happened • OK for slow devices such as a mouse • Under OS control, polls can be limited to periods only when the device is active – thus allowing polling even for faster devices – cheap I/O!

  29. I/O - Example • Examples for slow, medium & high speed devices. Determine the impact of polling overhead for 3 devices. Assume the number of clock cycles per poll is 400 and a 500 MHz clock. In all cases no data can be missed. • Example 1 – a mouse polled 30 times/sec: cycles/sec for polling = 30 polls x 400 cyc/poll = 12,000 cyc/sec; % of processor cycles consumed = 12000/500MHz = 0.002%. Negligible impact on performance. • Example 2 – a floppy disk. Transfers data to the processor in 16-bit (2-byte) units and has a data rate of 50 KB/sec. Polling rate = (50 KB/sec)/(2 bytes/poll) = 25K polls/sec; cycles/sec for polling = 25K polls/sec x 400 cyc/poll = 10^7 cyc/sec; % of processor cycles consumed = (10^7 cyc/sec)/500MHz = 2%. Still tolerable

  30. I/O - Example (cont.) • Example 3 – a hard drive. Transfers data in four-word chunks. Transfer rate is 4MB/sec. Must poll at the data rate in 4-word chunks: (4MB/sec)/(16 bytes/xfr), or a polling rate of 250K polls/sec; cycles/sec for polling = (250K polls/sec) x (400 cyc/poll) = 10^8 cyc/sec; % of processor cycles consumed = (10^8 cyc/sec)/500MHz = 20% • 1/5 of the processor would be used in polling the disk! Not acceptable. • The bottom line: polling works OK for low speed devices but not for high speed devices.
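All three polling calculations are the same formula with different required polling rates (a sketch; 400 cycles/poll and a 500 MHz clock, as in the examples):

```python
def polling_fraction(polls_per_sec, cycles_per_poll=400, clock_hz=500e6):
    """Fraction of processor cycles consumed by polling."""
    return polls_per_sec * cycles_per_poll / clock_hz

mouse  = polling_fraction(30)            # 30 polls/sec            -> ~0.002%
floppy = polling_fraction(50e3 / 2)      # 50 KB/sec in 2-byte units -> 2%
disk   = polling_fraction(4e6 / 16)      # 4 MB/sec in 16-byte chunks -> 20%
```

The required polling rate, and hence the overhead, scales linearly with the device data rate, which is why polling stops being viable for fast devices.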

  31. Interrupt Driven I/O • The problem with simple polling is that it must be done even when nothing is happening – during a waiting period • When CPU processing is needed for an I/O event, the processor is interrupted. • Interrupts are asynchronous • Not associated with any particular instruction • Allows instruction completion (compare with exceptions in chapter 5) • An interrupt must convey further information such as the identity of the device and its priority. • Convey this additional information by using vectored interrupts or a cause register.

  32. Interrupt Scheme The “granularity” of an interrupt is a single machine instruction. The check for pending interrupts and the processing of interrupts is done between instructions being executed, i.e., the current instruction is completed before a pending interrupt is processed

  33. Overhead for Interrupt Driven I/O • Using the previous example of a hard drive (p. 676): data transfers in 4-word chunks; transfer rate of 4MB/sec • Assume the overhead for each transfer, including the interrupt, is 500 clock cycles • Find the % of processor consumed if the hard drive is only transferring data 5% of the time – causing CPU interaction. • Answer: the interrupt rate for a busy disk would be the same as the previous polling rate, to match the transfer rate: (250K interrupts/sec) x 500 cycles/interrupt = 125x10^6 cyc/sec; % processor consumed during an XFR = 125x10^6/500MHz = 25%; assuming the disk is transferring data 5% of the time, % processor consumed (average) = 25% x 5% = 1.25%. No overhead when the disk is not actually transferring data – an improvement over polling.
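The interrupt-overhead arithmetic above, as a sketch (500 cycles per interrupt, 500 MHz clock, interrupt rate equal to the 250K/sec polling rate):

```python
def interrupt_fraction(interrupts_per_sec=250e3, cycles=500, clock_hz=500e6):
    """Fraction of processor cycles consumed while the disk is transferring."""
    return interrupts_per_sec * cycles / clock_hz

busy = interrupt_fraction()    # 25% while the disk is actively transferring
avg  = busy * 0.05             # 1.25% on average, disk busy 5% of the time
```

The key difference from polling is the 0.05 factor: overhead is paid only while the device is actually transferring.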

  34. DMA I/O • Polling and interrupt driven I/O are best with lower bandwidth devices where cost is more of a factor. • Both polling and interrupt driven I/O put the burden of moving data and managing the transfer on the CPU. • Even though the processor may continue processing during an I/O access, it ultimately must move the I/O data from the device when that data becomes available, or perhaps from some I/O buffer to main memory. • In our previous example of an interrupt driven hard disk, even though the CPU does not have to wait for every I/O event to complete, it would still consume 25% of the CPU cycles while the disk is transferring data. See p. 680. • Interrupt driven I/O for high bandwidth devices can be greatly improved if we make a device controller transfer data directly to memory without involving the processor: DMA (Direct Memory Access).

  35. DMA I/O (cont.) • DMA is a specialized processor that transfers data between memory and an I/O device while the CPU goes on with other tasks. • DMA is external to the CPU and must act as a bus master. • The CPU first sets up the “DMA registers” with a memory address & the number of bytes to be transferred. • To the requesting program, this may be seen as setting up a “control block” in memory. • DMA is frequently part of the controller for a device. • Interrupts are still used with DMA, but only to inform the processor that the I/O transfer is complete or an error occurred. • DMA is a form of multi- or parallel processing – not a new idea: IBM channels for mainframes in the ’60s. • Channels are programmable (with channel control words), whereas DMA is generally not programmable.

  36. DMA I/O – How It Works • Three steps of DMA • Processor sets up DMA: device id, operation, source/destination, number of bytes to transfer • DMA controller “arbitrates” for the bus; supplies the correct commands to the device, source, destination, etc.; then lets the data “rip”. Fancy buffering may be used ... ping/pong buffers. May be multi-channeled • Interrupt the processor on completion of DMA or on error • DMA can still have contention with the processor in competing for the memory and the bus. • Problem: “cycle stealing” – when there is bus/memory contention while the CPU is accessing a memory word during a DMA xfr, DMA wins out and the CPU will pause instruction execution for that memory cycle (the cycle was “stolen”).

  37. Overhead Using DMA • Again use the previous disk example on page 676. • Assume the initial setup of DMA takes 1000 CPU cycles • Assume interrupt handling for DMA completion takes 500 CPU cycles • The hard drive has a transfer rate of 4MB/sec and uses DMA • The average transfer size from disk is 8KB • What % of the 500MHz CPU is consumed if the disk is actively transferring 100% of the time? Ignore any bus contention between the CPU and the DMA controller. • Answer: each DMA transfer takes 8KB/(4MB/sec) = 0.002 sec/xfr; when the disk is constantly transferring, it takes: (1000 + 500 cyc/xfr)/(0.002 sec/xfr) = 750,000 clock cyc/sec; since the CPU runs at 500MHz: % of processor consumed = (750,000 cyc/sec)/500 MHz = 0.0015 ≈ 0.2%
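The DMA-overhead arithmetic above, as a sketch (1000 setup + 500 completion cycles, 8 KB transfers at 4 MB/sec, 500 MHz clock, disk 100% busy):

```python
def dma_fraction(setup=1000, done=500, xfr_bytes=8e3, rate_b_s=4e6,
                 clock_hz=500e6):
    """Fraction of processor cycles consumed managing DMA transfers."""
    seconds_per_xfr = xfr_bytes / rate_b_s       # 0.002 sec per 8 KB transfer
    cycles_per_sec = (setup + done) / seconds_per_xfr
    return cycles_per_sec / clock_hz
```

Note the per-transfer CPU cost is now fixed (setup + completion interrupt) regardless of transfer size, so larger transfers drive the overhead down further, versus the 25% of interrupt-per-chunk I/O.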

  38. DMA: Virtual Vs. Physical Addressing (p. 683) • In a VM system, should DMA use virtual addresses or physical addresses? – this topic in the book is at best flaky; here is my take on it: • If virtual addresses are used: • Contiguous pages in VM may not be contiguous in PM. • The DMA request is made by specifying a virtual address for the starting point of the data to be transferred and the number of bytes to be transferred. • The DMA unit will have to translate VA to PA for all reads/writes to/from memory – a performance problem. Actually the address translation may be done by the OS, which will provide DMA with the physical addresses: a “scatter/gather” operation – fancy DMA controllers may be able to chain a series of pages for a single request of more than one page – the OS provides a list of physical page frame addresses corresponding to the multi-page DMA block in VM. Or: restrict the DMA block sizes to integral pages & translate the starting address. • If physical addresses are used, they may not be contiguous in virtual memory – if a page boundary is crossed. Must constrain all DMA transfers to stay within a single page, or requests must be for a page at a time. • Also the OS must be savvy enough not to relocate pages in the target/source region during a DMA transfer.

  39. DMA: Memory Coherency • DMA & memory/cache systems • Without DMA, all memory access is through address translation and the cache • With DMA, data is transferred to/from main memory directly, bypassing the cache ==> coherency problem • DMA reads/writes go to main memory • No cache between the processor & the DMA controller • The value of a memory location seen by DMA & the CPU may differ • If DMA writes into main memory at a location which is also held in the cache, the cache data seen by the CPU will be obsolete. • If the cache is write-back, and DMA reads a value directly from main memory before the cache does a write back (due to “lazy” write backs), then the value read by DMA will be obsolete. … remember there is a possibility that DMA will take priority in accessing memory over the CPU – to its disadvantage. • Possible solutions: see next ==>

  40. DMA: Memory Coherency (cont.) • Some solutions: see pp. 683-684 • Route all I/O activity through the cache: performance hit and may be costly; may flush out good data needed by the processor ... I/O data may not be that critical to the processor at the time it arrives; the working set may be messed up. • OS selectively invalidates the cache for an I/O-to-memory operation, or forces a write back for an I/O read from memory (a memory-to-I/O operation) – called cache flushing. (There may be some “read/write” terminology confusion here!) Some HW support is needed here. • A hardware mechanism to selectively flush (or invalidate) cache entries. This is a common mechanism used in multiprocessor systems where there are many caches for a common main memory (the MP cache coherency problem). The same technique works for I/O – after all, DMA is a form of multiprocessing.

  41. Designing an I/O System – The Problem • Specifications for a system • CPU maximum instruction rate: 300 MIPS; average number of CPU instructions per I/O in the OS: 50,000 • Bandwidth of the memory backplane bus: 100 MB/sec • SCSI-2 controllers with a transfer rate of 20 MB/sec; the SCSI bus on each controller can accommodate up to 7 disks • Disk drives: read/write bandwidth of 5 MB/sec, average seek + rotational latency of 10 ms • The workload this system must support: • 64 KB reads – sequential on a track • The user program needs 100,000 instructions per I/O operation. This is distinct from the instructions in the OS. • The problem: find the maximum sustainable I/O rate and the number of disks and SCSI controllers required. Assume that reads can always be done on an idle disk if one exists – ignore disk conflicts.

  42. Designing an I/O System – The Solution • Strategy: there are two fixed components in the system: the memory bus and the CPU. Find the I/O rate that each component can sustain and determine which of these is the bottleneck. • Each I/O takes 100,000 user instructions and 50,000 OS instructions: max I/O rate for the CPU = (instruction rate)/(instructions per I/O) = (300x10^6)/[(50+100)x10^3] = 2000 I/Os per sec • Each I/O transfers 64KB, thus: max I/O rate of the backplane bus = (bus BW)/(bytes per I/O) = (100x10^6)/(64x10^3) = 1562 I/Os per sec • The bus is the bottleneck … design the system to support the bus performance of 1562 I/Os per sec. • Number of disks needed to accommodate 1562 I/Os per sec: time per I/O at the disk = seek/rotational latency + transfer time = 10ms + 64KB/(5MB/sec) = 22.8 ms. Thus each disk can complete 1/22.8ms = 43.9 I/Os per sec. To saturate the bus, we need (1562 I/Os per sec)/(43.9 I/Os per sec) = 36 disks. • How many SCSI busses is this? Required transfer rate per disk = xfr size/xfr time = 64KB/22.8ms ≈ 2.81MB/sec. Assume we can use all the SCSI bus BW. We can place SCSI BW/xfr rate per disk = (20MB/sec)/(2.81MB/sec) ≈ 7.1 ==> 7 disks on each SCSI bus. Note a SCSI bus can support a max of 7 disks. For 36 disks we need 36/7 = 5.14 ==> 6 buses.
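The whole sizing chain above can be recomputed end to end (a sketch; the function name is mine, the numbers are the spec values from the problem slide, and 64KB/22.8ms comes out near 2.8 MB/sec per disk, which still fits 7 disks per SCSI bus):

```python
import math

def size_io_system():
    """Returns (sustained I/Os per sec, disks, disks per bus, SCSI buses)."""
    cpu_ios = 300e6 / (100e3 + 50e3)         # CPU limit: 2000 I/Os per sec
    bus_ios = 100e6 / 64e3                   # bus limit: 1562.5 I/Os per sec
    sustained = min(cpu_ios, bus_ios)        # the bus is the bottleneck
    io_time = 0.010 + 64e3 / 5e6             # seek/rotation + transfer = 22.8 ms
    disk_ios = 1 / io_time                   # ~43.9 I/Os per sec per disk
    disks = math.ceil(sustained / disk_ios)  # 36 disks to saturate the bus
    per_disk_bw = 64e3 / io_time             # ~2.81 MB/sec per disk
    per_bus = min(7, int(20e6 / per_disk_bw))  # SCSI-2 limit: 7 disks per bus
    buses = math.ceil(disks / per_bus)       # 6 SCSI controllers
    return sustained, disks, per_bus, buses
```

The structure of the calculation is the general method: compute each component's I/O rate, take the minimum as the sustainable rate, then provision the variable components (disks, controllers) to match it.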
