
File Processing

File Processing, by Seonggyu Kim. File structure design: main memory is small, speedy, and volatile; auxiliary memory offers large capacity at much less cost, has slow access time, and is nonvolatile. A file structure is a representation for data in files together with the operations for accessing them.


Presentation Transcript


  1. File Processing By Seonggyu Kim

  2. File Structure Design • Memory – small, speedy, volatile • Auxiliary Memory • large capacity at much less cost • slow access time • nonvolatile • File Structure – representations for data in files and of operations for accessing them

  3. Good File Structure Design • Good File Structure Design does not let applications spend a lot of time waiting for the files on disks • Reduce the number of disk accesses • Reduce the size of file while meeting the specification of an Application

  4. Information & Data • Information and data • Data D (in disk, tape) is processed by P (computer) to give information I • I = P(D)

  5. File Operations • Physical files, Logical files • Open, Read/Write, Close • Basic File Operations • Creation • Skeleton design (definition) • Data collection & Validation • loading

  6. File Operations continued • Update : insertion, deletion, modification • Retrieval for inquiry and report • Comprehensive retrieval – all • Selective retrieval value • Qualification based on relationship • Maintenance • Restructuring – changes in file structure • Reorganization – changes in file organization

  7. Performance Criteria • Memory • Constant access time • Number of comparisons • Files in Auxiliary Memory • Number of accesses • Access methods and time • Response time • File activity ratio

  8. File System Organization • Flat file system • Tree-structured file system [Figure: directory tree rooted at / with nodes bin, usr, dev, cc, yacc, bin, lib, console, kbd, TAPE]

  9. Secondary Storage • Good file design uses knowledge of disk and tape performance to arrange data in ways that minimize access costs • Characteristics of secondary storage • DASD(Direct Access Storage Device) • SASD(Sequential Access Storage Device)

  10. Disks • Direct Access Storage Device • Magnetic disks • Hard disk • Floppy disk • Zip, Jaz • Optical disks • CD, DVD • Magneto-optical disks • Storage device and storage medium

  11. Hard Disks • Magnetic Disk • Measurement • Number of Recording Surfaces • (Sustained) Data Transfer Rate • Access Time • Rotational Delay • Recording Density • Mean Time Between Failure

  12. Sector, Track, Cylinder

  13. Organizing Tracks by Sector physical Placement - Interleaving

  14. Sector Organization 1 • Cluster – a fixed number of contiguous sectors • requires just one seek for all sectors in a cluster • FAT (File Allocation Table) • Extent – consists entirely of contiguous clusters

  15. Sector Organization 2 • Fragmentation –no convenient fit between records and sectors • Spanning • Block – organized to hold an integral number of logical records • Blocking factor – number of records in a block

  16. Estimating Capacities • Track size = number of sectors per track * sector size (512B) • Cylinder size = number of tracks per cylinder * track size • Drive Size = number of cylinders * cylinder size

  17. Estimating Space Needs • A drive with • Number of sectors / track = 63 • Number of tracks / cylinder = 16 • Number of cylinders = 4092 • How many cylinders are needed to store 50,000 records of 256 B each? • Cylinder size = 63 * 16 * 512 B = 516,096 B • Number of cylinders = (50,000 * 256) / (63 * 16 * 512) ≈ 24.8, so 25 cylinders
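The estimate on this slide can be checked with a few lines of Python. This is a minimal sketch: the function name `cylinders_needed` is made up for illustration, and it assumes records pack into sectors with no wasted space (two 256 B records fit exactly in one 512 B sector).

```python
import math

SECTOR_SIZE = 512          # bytes per sector, as on the capacity slide
SECTORS_PER_TRACK = 63
TRACKS_PER_CYLINDER = 16

def cylinders_needed(num_records, record_size):
    """Whole cylinders needed, assuming records pack with no wasted space."""
    cylinder_size = SECTORS_PER_TRACK * TRACKS_PER_CYLINDER * SECTOR_SIZE
    total_bytes = num_records * record_size
    # A partly used cylinder still has to be allocated, so round up.
    return math.ceil(total_bytes / cylinder_size)

print(cylinders_needed(50_000, 256))
```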

  18. Cost of Disk Access • Access time = seek time + rotational delay + transfer time • Seek time – time taken to move the access arm to the cylinder where data resides • Rotational Delay – time taken to position the R/W head to the sector • Staggered tracks on a cylinder

  19. Rotational Delay • Rotation delay = • Transfer time =

  20. Some Timing Computation • A 9.1GB disk with 8 ms average seek time, 10000 rpm, 526 cylinders, 16 tracks per cylinder • How many sectors per track? • Rotational Delay? • The Time taken to read 34,000 256B records randomly and sequentially?
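One way to work through these questions is sketched below. The formulas are assumptions rather than the slide's own (which did not survive in the transcript): 512 B sectors, 1 GB = 10^9 bytes, average rotational delay equal to half a revolution, transfer time proportional to the fraction of a track read, and a sequential read that pays for a single seek and ignores track-to-track movement. The function name `disk_timing` is hypothetical.

```python
SECTOR_SIZE = 512  # bytes; assumed, matching the capacity slide

def disk_timing(capacity_bytes, cylinders, tracks_per_cylinder, rpm,
                avg_seek_ms, num_records, record_size):
    """Rough timing estimates under the assumptions stated in the lead-in."""
    rotation_ms = 60_000 / rpm                      # one full revolution
    avg_rotational_delay_ms = rotation_ms / 2       # half a revolution on average
    bytes_per_track = capacity_bytes / (cylinders * tracks_per_cylinder)
    sectors_per_track = bytes_per_track / SECTOR_SIZE
    # Time to transfer one record: its share of a track times one revolution.
    record_transfer_ms = record_size / bytes_per_track * rotation_ms
    # Random access: every record pays seek + rotational delay + transfer.
    random_ms = num_records * (avg_seek_ms + avg_rotational_delay_ms
                               + record_transfer_ms)
    # Sequential access: one seek and one rotational delay, then pure transfer.
    sequential_ms = (avg_seek_ms + avg_rotational_delay_ms
                     + num_records * record_transfer_ms)
    return {
        "sectors_per_track": sectors_per_track,
        "avg_rotational_delay_ms": avg_rotational_delay_ms,
        "random_ms": random_ms,
        "sequential_ms": sequential_ms,
    }

timing = disk_timing(9.1e9, 526, 16, 10_000, 8, 34_000, 256)
for name, value in timing.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the random read is dominated by the per-record seek and rotational delay, which is why it comes out far slower than the sequential read.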

  21. Disk as Bottleneck • Striping • RAID(Redundant Array of Inexpensive Disks) • Buffering • RAM Disk • Disk Cache

  22. CD-ROM • Child of CD audio • Infrared laser reading pits and lands • Address = minute:second:sector number 75 sectors / second
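The minute:second:sector address maps to a linear sector number by simple arithmetic. A minimal sketch, assuming sectors are counted from address 0:0:0 (any lead-in offset used on real discs is ignored here); the function names are illustrative.

```python
SECTORS_PER_SECOND = 75  # from the slide

def to_sector_number(minute, second, sector):
    """Convert a minute:second:sector address to a linear sector number."""
    return (minute * 60 + second) * SECTORS_PER_SECOND + sector

def to_address(sector_number):
    """Convert a linear sector number back to (minute, second, sector)."""
    seconds, sector = divmod(sector_number, SECTORS_PER_SECOND)
    minute, second = divmod(seconds, 60)
    return minute, second, sector

print(to_sector_number(1, 0, 0))   # sectors in one minute of playing time
print(to_address(4501))
```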

  23. CLV & CAV • Constant Linear Velocity(CLV) • spiral track • same recording density • Constant Angular Velocity(CAV) • concentric tracks and pie-shaped sectors • same speed

  24. CD-ROM strengths & weaknesses • Slow seek time • Slow data transfer rate • Read-only access • Large capacity needs indexes and structures to overcome CD-ROM’s poor performance • Permanent archives • Mass production

  25. Digital Versatile / Video Disk • Double sided, Red laser • 4.7 GB ~17 GB

  26. SSD(Solid State Drive) trim

  27. A Journey of a Byte • The user's program calls write(textfile, ch, 1), where ch in the user's data area holds 'P' • The file manager invokes the I/O processor • The I/O processor program moves the byte from the system buffer through the I/O processor to the disk controller

  28. Buffer Management • A buffer compensates for the speed difference between memory and secondary storage in I/O-bound jobs • Double buffering uses two buffers, so the CPU can fill or read one buffer while I/O is performed on the other • Buffer pooling employs a pool of buffers • LRU (Least Recently Used) replacement strategy • LFU (Least Frequently Used), FIFO
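Buffer pooling with LRU replacement can be sketched as follows. The class `BufferPool` and the `read_block` callable are hypothetical names introduced for illustration; a real file manager would also handle dirty buffers and write-back, which this sketch omits.

```python
from collections import OrderedDict

class BufferPool:
    """A pool of block buffers with least-recently-used replacement."""

    def __init__(self, capacity, read_block):
        self.capacity = capacity
        self.read_block = read_block      # hypothetical: block number -> data
        self.buffers = OrderedDict()      # oldest entry first, newest last
        self.disk_reads = 0

    def get(self, block_no):
        if block_no in self.buffers:
            # Hit: mark the buffer as most recently used.
            self.buffers.move_to_end(block_no)
            return self.buffers[block_no]
        if len(self.buffers) >= self.capacity:
            # Pool is full: evict the least recently used buffer.
            self.buffers.popitem(last=False)
        self.disk_reads += 1
        data = self.read_block(block_no)
        self.buffers[block_no] = data
        return data

pool = BufferPool(capacity=2, read_block=lambda n: f"block {n}")
for block in (1, 2, 1, 3, 2):
    pool.get(block)
print(pool.disk_reads, list(pool.buffers))
```

Swapping the eviction line for a different policy (for example evicting the oldest inserted buffer regardless of use) would turn this into FIFO.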

  29. A Stream File AmesMary123MapleEvanstonIL60201MasonAlan90EastgateAdaOK74820 • Name Ames Mary Mason Alan • Street 123 Maple 90 Eastgate • City Evanston IL60201 Ada OK74820

  30. Field & Record Organization • Field – the smallest logically (or conceptually) meaningful unit of information in a file • Record – a set of fields that belong together • Organization affects the way we save, retrieve, and manage data in a file

  31. Field Structures
    • Unstructured stream: AmesMary123MapleEvanstonIL60201MasonAlan90EastgateAdaOK74820
    • Fixed-length fields (fast and easy access, but chopping or fragmentation): Ames Mary … Mason Alan …
    • Each field with a length indicator: 04Ames04Mary … 05Mason04Alan …
    • Fields with delimiters: Ames|Mary| … Mason|Alan| …
    • Fields expressed as keyword=value: last=Ames|first=Mary| …
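The three self-describing field formats on this slide can be parsed with a few short functions. This is a sketch under assumptions not stated on the slide: length indicators are exactly two decimal digits, and "|" is the delimiter and never appears inside a field. The function names are illustrative.

```python
def parse_length_prefixed(data):
    """Parse fields that each start with a two-digit length indicator."""
    fields, pos = [], 0
    while pos < len(data):
        length = int(data[pos:pos + 2])
        pos += 2
        fields.append(data[pos:pos + length])
        pos += length
    return fields

def parse_delimited(data, delimiter="|"):
    """Parse fields that are each terminated by a delimiter."""
    parts = data.split(delimiter)
    if parts and parts[-1] == "":
        parts.pop()                     # drop the empty piece after the last delimiter
    return parts

def parse_keyword_value(data, delimiter="|"):
    """Parse keyword=value fields into a dictionary."""
    pairs = (item.split("=", 1) for item in parse_delimited(data, delimiter))
    return {keyword: value for keyword, value in pairs}

print(parse_length_prefixed("04Ames04Mary"))
print(parse_delimited("Ames|Mary|"))
print(parse_keyword_value("last=Ames|first=Mary|"))
```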

  32. Record Structures • Fixed-length Fields • Fixed number of Fields • Each record with length indicator • Second file to keep track of the beginning address for each record • Record with delimiter

  33. Record Access • Canonical form for a key • Ames, ames, AMES • Primary key, secondary key • Primary key – dataless not real data – unchanging • Sequential Search • File Access

  34. Beyond Record Structures • Headers & Self-describing files • Metadata –data describing the primary data in a file • Mixing object types in a file Header notes images header … Simple=“T” Maxis=4 Scale=0.015 • Portability and Standardization

  35. Secure Communication • E-commerce • Digital signature • Masking (substitution): the message is masked in such a way that the resulting message, sent over an open communication channel, seems harmless and inconspicuous • Veiling (transposition): veiled messages are usually not masked at all, but are embedded among other items in a regular pattern so that the result takes the form of yet another message; acrostics are an example • Data encryption

  36. Transposition • Transposition simply changes the relative positions of letters within a message • It is usually used as one stage of a more complex cryptosystem (such as key-based encryption) • A columnar transposition first needs a keyword; the message is then written in rows beneath the keyword • Example: csetrmeseseasg (the letters of "secret messages" rearranged)
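A columnar transposition can be sketched in a few lines. The slide does not give its keyword or its read-out order, so this follows one common textbook variant as an assumption: columns are read top to bottom in the alphabetical order of the keyword letters. The keyword "CAB" and the message are made-up illustration values.

```python
def columnar_encrypt(message, keyword):
    """Write message in rows under keyword, read columns in keyword order."""
    width = len(keyword)
    # Column indices sorted by keyword letter; ties keep left-to-right order.
    order = sorted(range(width), key=lambda i: (keyword[i], i))
    # message[col::width] is column `col` read from top to bottom.
    return "".join(message[col::width] for col in order)

# Rows under C A B:  a b c
#                    d e f
print(columnar_encrypt("abcdef", "CAB"))
```

Every letter of the plaintext survives in the output, only its position changes, which is what separates transposition from substitution.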

  37. Caesar Substitution • One of the simplest monoalphabetic substitutions • One of the easiest to break • A simple substitution cipher in which each plain text letter is replaced by the letter three places down the alphabet, so that M is replaced by P, and so on • plain text: this is a simple cipher • cipher text: vjku ku c ukorng ekrjgt (this sample shifts each letter two places rather than three)
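A shift cipher of this kind is easy to express with a translation table. The sketch below takes the shift as a parameter; the function name `caesar` is illustrative. A shift of 2 reproduces the slide's sample cipher text, while a shift of 3 gives the classic M to P mapping.

```python
import string

def caesar(text, shift):
    """Shift every letter `shift` places down the alphabet, wrapping around."""
    lower = string.ascii_lowercase
    k = shift % 26
    shifted = lower[k:] + lower[:k]
    table = str.maketrans(lower + lower.upper(), shifted + shifted.upper())
    # Characters outside the alphabet (spaces, digits) pass through unchanged.
    return text.translate(table)

print(caesar("this is a simple cipher", 2))
print(caesar("M", 3))
print(caesar(caesar("this is a simple cipher", 3), -3))  # decrypt by shifting back
```

Because there are only 25 useful shifts, trying them all breaks the cipher immediately, which is why the slide calls it one of the easiest to break.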

  38. Polybius Chequerboard • Polybius was the name of the Greek who invented a system of converting alphabetic characters into numeric characters. It was devised to enable messages to be easily signaled using torches. 31345115 3215
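The chequerboard itself is not reproduced in the transcript, so the sketch below assumes the usual 5 x 5 square in which I and J share a cell and each letter becomes its row digit followed by its column digit. Under that assumption the slide's digits decode to "love me".

```python
# Assumed layout: standard 5 x 5 square, with J folded into I.
SQUARE = ["abcde", "fghik", "lmnop", "qrstu", "vwxyz"]

def polybius_encode(text):
    """Replace each letter by its row and column number in the square."""
    out = []
    for ch in text.lower():
        if ch == "j":
            ch = "i"
        for row_no, row in enumerate(SQUARE, start=1):
            col = row.find(ch)
            if col != -1:
                out.append(f"{row_no}{col + 1}")
                break
        else:
            out.append(ch)              # keep spaces and other characters
    return "".join(out)

def polybius_decode(digits):
    """Turn each pair of digits back into the letter at that row and column."""
    out, i = [], 0
    while i < len(digits):
        if digits[i].isdigit():
            row_no, col_no = int(digits[i]), int(digits[i + 1])
            out.append(SQUARE[row_no - 1][col_no - 1])
            i += 2
        else:
            out.append(digits[i])
            i += 1
    return "".join(out)

print(polybius_decode("31345115 3215"))
print(polybius_encode("love me"))
```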

  39. Map cipher • Map ciphers are maps that look normal but have a secret cipher hidden within. • To create a map cipher, create a set of symbols that stand for each letter in the alphabet. The example below uses tree branches and a matrix to create an alphabet of trees. The position of each letter in the matrix determines the number of branches on the left or right side of the tree.

  40. Key-based Encryption • Plain text is encrypted into cipher text, and cipher text is decrypted back into plain text, with an n-bit key • Example where the key is "4232"; repeat the key as many times as necessary to cover the whole message:
      Plain text      1 2 3 4 5 6 5 4 3 2 1
      Key             4 2 3 2 4 2 3 2 4 2 3
      Encrypted text  5 4 6 6 9 8 8 6 7 4 4
    • A brute-force attack involves running through possible combinations of keys and applying them to the cryptosystem until the message is decrypted • Most 56-bit key cryptosystems can be broken in less than one week
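The numeric example on this slide adds each key digit to the corresponding plain text digit, repeating the key as needed. A minimal sketch of that scheme, with illustrative function names; plain addition without wrap-around is what the slide's numbers show.

```python
from itertools import cycle

def encrypt(plain, key):
    """Add the repeating key to the plain text, digit by digit."""
    return [p + k for p, k in zip(plain, cycle(key))]

def decrypt(cipher, key):
    """Subtract the repeating key to recover the plain text."""
    return [c - k for c, k in zip(cipher, cycle(key))]

plain = [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]
key = [4, 2, 3, 2]
cipher = encrypt(plain, key)
print(cipher)
print(decrypt(cipher, key))
```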

  41. Symmetrical Key • Private-key encryption is easy to implement in hardware • Typical uses: encrypted files stored on the hard disk, data sent to someone close by • Stream ciphers can encrypt a single bit of plaintext at a time, whereas block ciphers encrypt multiple bits (a block) of data at once (normally 64 bits) • Disadvantages include: • The authenticity of the originator of the data cannot be verified • The private key has to be transmitted over a very secure channel • Across a network of users, a large number of keys may be needed to support one-to-one communication between each pair of users • Examples: DES (Data Encryption Standard), AES (Advanced Encryption Standard)

  42. Asymmetrical Key • Public key encryption was invented in 1976 to circumvent the problems of managing the private key • No need to send both the encrypted message and the key to the target • Public key encryption can be used for authentication via the digital signature mechanism, so the message is protected not only in secrecy but also in integrity • Disadvantages include: • Public key ciphers generally require longer keys than symmetric ciphers to achieve the same level of security • They also take much longer to decrypt than symmetric methods • Examples: RSA (based on very large prime numbers), PGP (Pretty Good Privacy)

  43. Organizing Files for Performance • Data Compression • Reclaim Spaces in Files • Searching • Keysorting

  44. Data Compression • Irreversible: speech compression • Reversible: • compact notation, e.g., the 50 states • run-length encoding, e.g., 22 23 24 24 24 24 24 24 24 25 … becomes 22 23 ff 24 07 25 … • variable-length codes • Small files: • Use less storage • Can be transmitted faster • Can be processed faster
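The run-length example on this slide replaces a run of seven 24s with the triple ff 24 07: a marker byte, the repeated value, and the run length. A minimal sketch follows. The threshold of more than three repeats and the rule that a literal ff is always written as a run are assumptions added so that the encoding stays reversible.

```python
RUN_MARKER = 0xFF   # marker byte used in the slide's example

def rle_encode(data):
    """Replace long runs by (marker, value, run length)."""
    out, i = [], 0
    while i < len(data):
        value, run = data[i], 1
        while i + run < len(data) and data[i + run] == value and run < 255:
            run += 1
        if run > 3 or value == RUN_MARKER:
            # A run code costs three bytes, so short runs are left as they are.
            out += [RUN_MARKER, value, run]
        else:
            out += [value] * run
        i += run
    return out

def rle_decode(data):
    """Expand every (marker, value, run length) triple back into a run."""
    out, i = [], 0
    while i < len(data):
        if data[i] == RUN_MARKER:
            out += [data[i + 1]] * data[i + 2]
            i += 3
        else:
            out.append(data[i])
            i += 1
    return out

pixels = [0x22, 0x23] + [0x24] * 7 + [0x25]
encoded = rle_encode(pixels)
print(" ".join(f"{b:02x}" for b in encoded))
```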

  45. Code – Fundamental Concepts • A code is a mapping of source messages into codewords • Distinct: the mapping is 1:1 • Uniquely decodable, e.g., a prefix-free code • Source messages: aa bbb cccc dddd eeeee ffffff ggggggg • Block-block code: ASCII, EBCDIC • a → 000, b → 001, c → 010, … • Variable-variable code • aa → 0, bbb → 1, cccc → 10, dddd → 11, eeeee → 100

  46. Shannon-Fano Coding • List the source messages a(i) and their probabilities p(a(i)) in order of nonincreasing probability • Divide the list into two groups whose total probabilities are as nearly equal as possible • Assign 0 as the first codeword digit to each message in the first group and 1 to each message in the second group • Divide each of these groups by the same criterion, appending additional code digits, until each subset contains only one message
      Message  Probability  Shannon-Fano  Huffman
      g        8/40         00            00
      f        7/40         010           110
      e        6/40         011           111
      d        5/40         100           010
      space    5/40         101           101
      c        4/40         110           011
      b        3/40         1110          1000
      a        2/40         1111          1001
    • How about for the set of probabilities { .35, .17, .17, .16, .15 }?
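The four steps above can be sketched as a short recursive function. The function name and the tie-breaking rule (take the first split that gives the smallest difference between the two groups) are assumptions; with the slide's weights in fortieths this reproduces the Shannon-Fano column of the table.

```python
def shannon_fano(messages):
    """messages: list of (symbol, weight) in nonincreasing weight order."""
    codes = {}

    def split(items, prefix):
        if len(items) == 1:
            codes[items[0][0]] = prefix or "0"
            return
        total = sum(weight for _, weight in items)
        running, best_i, best_diff = 0, 1, None
        # Find the cut that makes the two group totals as equal as possible.
        for i in range(1, len(items)):
            running += items[i - 1][1]
            diff = abs(total - 2 * running)
            if best_diff is None or diff < best_diff:
                best_i, best_diff = i, diff
        split(items[:best_i], prefix + "0")   # first group gets a 0
        split(items[best_i:], prefix + "1")   # second group gets a 1

    split(list(messages), "")
    return codes

weights = [("g", 8), ("f", 7), ("e", 6), ("d", 5),
           ("space", 5), ("c", 4), ("b", 3), ("a", 2)]
for symbol, code in shannon_fano(weights).items():
    print(symbol, code)
```

Running the same function on the weights 35, 17, 17, 16, 15 is a direct way to explore the question at the end of the slide.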

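  47. Huffman Encoding
      Letter       a    b    c    d    e    f
      Probability  0.4  0.2  0.1  0.1  0.1  0.1
    • Successive reductions, merging the two smallest probabilities each time: (.4 .2 .1 .1 .1 .1), (.4 .2 .2 .1 .1), (.4 .2 .2 .2), (.4 .4 .2), (.6 .4), (1.0)
    • [Figure: Huffman tree with branches labelled 0 and 1 and leaves a, b, c, d, e, f]
    • Efficiency: 3 bits per letter with a fixed-length code versus 0.4*1 + 0.2*2 + 0.1*4*4 = 2.4 bits • Dynamic Huffman Encoding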
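The merging procedure can be sketched with a priority queue. This version tracks only codeword lengths, and weights are given in tenths to keep the arithmetic exact. The function name is illustrative. Ties can be broken in different ways, so the individual lengths may differ from the tree on the slide, but every Huffman tree for these probabilities gives the same average of 2.4 bits.

```python
import heapq
from itertools import count

def huffman_code_lengths(weights):
    """Return the codeword length of each symbol in a Huffman code."""
    tiebreak = count()   # keeps heap entries comparable when weights are equal
    heap = [(w, next(tiebreak), [sym]) for sym, w in weights.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(weights, 0)
    while len(heap) > 1:
        # Merge the two least probable subtrees.
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:
            lengths[sym] += 1            # every merged symbol moves one level deeper
        heapq.heappush(heap, (w1 + w2, next(tiebreak), syms1 + syms2))
    return lengths

weights = {"a": 4, "b": 2, "c": 1, "d": 1, "e": 1, "f": 1}   # tenths
lengths = huffman_code_lengths(weights)
average_bits = sum(weights[s] * lengths[s] for s in weights) / 10
print(lengths, average_bits)
```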

  48. Redundancy • John F. Kennedy's 1961 inaugural address: "Ask not what your country can do for you -- ask what you can do for your country" • Concept of separate words • "ask" appears two times → 1 • "what" appears two times → 2 • "your" appears two times → 3 • "country" appears two times → 4 • "can" appears two times → 5 • "do" appears two times → 6 • "for" appears two times → 7 • "you" appears two times → 8 • Result: "1 not 2 3 4 5 6 7 8 -- 1 2 8 5 6 7 3 4"
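The word-level substitution on this slide can be sketched as follows: every word that occurs more than once gets a number, assigned in order of first appearance. The function name is illustrative, and lowercasing the text first is an assumption made so that "Ask" and "ask" count as the same word.

```python
from collections import Counter

def dictionary_encode(text):
    """Replace each repeated word by a number; return the text and the dictionary."""
    words = text.lower().split()
    counts = Counter(words)
    codes, out = {}, []
    for word in words:
        if counts[word] > 1:
            # Number repeated words in order of first appearance.
            codes.setdefault(word, str(len(codes) + 1))
            out.append(codes[word])
        else:
            out.append(word)
    return " ".join(out), codes

quote = ("Ask not what your country can do for you -- "
         "ask what you can do for your country")
encoded, dictionary = dictionary_encode(quote)
print(encoded)
print(dictionary)
```

The dictionary has to be stored or transmitted along with the encoded text, so the saving only pays off when words repeat often enough.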

  49. Searching for Patterns • LZ adaptive dictionary-based algorithm • Searching for repeated patterns in "Ask not what your country can do for you -- ask what you can do for your country" • ask__ → 1 • what__ → 2 • you → 3 • r__country → 4 • __can__do__for__you → 5 • Result: "1not__2345__--__12354"

  50. Some Remarks on Data Compression • Communications, backup, database, still images, audio & video • Lossy compression – much smaller file, indistinguishable to human ear or eye • JPEG, MPEG-1,2,4,7,21 MP3 (MPEG-1 Layer 3), AC-3 • Lossless compression • Huffman, GIF, PNG, TIFF
