File Processing By Seonggyu Kim
File Structure Design • Memory – small, speedy, volatile • Auxiliary Memory • large capacity at much less cost • slow access time • nonvolatile • File Structure – representations for data in files and of operations for accessing them
Good File Structure Design • Good File Structure Design does not let applications spend a lot of time waiting for the files on disks • Reduce the number of disk accesses • Reduce the size of file while meeting the specification of an Application
Information & Data • Data (D), stored on disk or tape, is processed (P) by the computer to produce information (I) • I = P(D)
File Operations • Physical files, Logical files • Open, Read/Write, Close • Basic File Operations • Creation • Skeleton design (definition) • Data collection & Validation • loading
File Operations continued • Update : insertion, deletion, modification • Retrieval for inquiry and report • Comprehensive retrieval – all • Selective retrieval value • Qualification based on relationship • Maintenance • Restructuring – changes in file structure • Reorganization – changes in file organization
Performance Criteria • Memory • Constant access time • Number of comparisons • Files in Auxiliary Memory • Number of accesses • Access methods and time • Response time • File activity ratio
File System Organization • Flat file system • Tree-structured file system • (Figure: directory tree rooted at /, with directories bin, usr and dev and entries such as cc, yacc, bin, lib, console, kbd and TAPE)
Secondary Storage • Good file design uses knowledge of disk and tape performance to arrange data in ways that minimize access costs • Characteristics of secondary storage • DASD(Direct Access Storage Device) • SASD(Sequential Access Storage Device)
Disks • Direct Access Storage Device • Magnetic disks • Hard disk • Floppy disk • Zip, Jaz • Optical disks • CD, DVD • Magneto-optical disks • (Figure labels: storage device, storage medium)
Hard Disks • Magnetic Disk • Measurement • Number of Recording Surfaces • (Sustained) Data Transfer Rate • Access Time • Rotational Delay • Recording Density • Mean Time Between Failures (MTBF)
Organizing Tracks by Sector • Physical placement of sectors – interleaving
Sector Organization 1 • Cluster – a fixed number of contiguous sectors • requires just one seek for all the sectors in a cluster • FAT (File Allocation Table) • Extent – consists entirely of contiguous clusters
Sector Organization 2 • Fragmentation –no convenient fit between records and sectors • Spanning • Block – organized to hold an integral number of logical records • Blocking factor – number of records in a block
Estimating Capacities • Track size = number of sectors per track * sector size (512B) • Cylinder size = number of tracks per cylinder * track size • Drive Size = number of cylinders * cylinder size
Estimating Space Needs • A drive with • Number of sectors / track = 63 • Number of tracks / cylinder = 16 • Number of cylinders = 4092 • How many cylinders are needed to store 50,000 records of 256 B each? • Cylinder size = 63 * 16 * 512 B = 516,096 B • # of cylinders = (50,000 * 256) / (63 * 16 * 512) ≈ 24.8, so 25 cylinders
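The capacity formulas and the space estimate can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and it assumes the 256-byte records pack into 512-byte sectors with no wasted space.

```python
import math

SECTOR_SIZE = 512  # bytes per sector, as on the slide


def cylinder_size(sectors_per_track, tracks_per_cylinder, sector_size=SECTOR_SIZE):
    """Bytes per cylinder = sectors/track * tracks/cylinder * sector size."""
    return sectors_per_track * tracks_per_cylinder * sector_size


def cylinders_needed(num_records, record_size, sectors_per_track, tracks_per_cylinder):
    """Whole cylinders needed, assuming records are packed without gaps."""
    total_bytes = num_records * record_size
    return math.ceil(total_bytes / cylinder_size(sectors_per_track, tracks_per_cylinder))


print(cylinder_size(63, 16))                  # 516096
print(cylinders_needed(50_000, 256, 63, 16))  # 25
```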
Cost of Disk Access • Access time = seek time + rotational delay + transfer time • Seek time – time taken to move the access arm to the cylinder where data resides • Rotational Delay – time taken to position the R/W head to the sector • Staggered tracks on a cylinder
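Rotational Delay • Average rotational delay = 1/2 * time for one revolution = 1/2 * (60 / rpm) seconds • Transfer time = (number of bytes transferred / number of bytes on a track) * time for one revolution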
Some Timing Computation • A 9.1GB disk with 8 ms average seek time, 10000 rpm, 526 cylinders, 16 tracks per cylinder • How many sectors per track? • Rotational Delay? • The Time taken to read 34,000 256B records randomly and sequentially?
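One way to work through these questions is sketched below. It is a rough model, not a definitive answer: it assumes 1 GB = 10**9 bytes, 512-byte sectors, an average rotational delay of half a revolution, and a sequential read that pays for one seek and one rotational delay and ignores track and cylinder switches.

```python
SECTOR_SIZE = 512            # bytes (assumed)
DISK_BYTES = 9.1e9           # assumes 1 GB = 10**9 bytes
SEEK_MS = 8.0                # average seek time
RPM = 10_000
CYLINDERS = 526
TRACKS_PER_CYLINDER = 16

rotation_ms = 60_000 / RPM               # time for one full revolution
rotational_delay_ms = rotation_ms / 2    # average: half a revolution
track_bytes = DISK_BYTES / (CYLINDERS * TRACKS_PER_CYLINDER)
sectors_per_track = track_bytes / SECTOR_SIZE


def transfer_ms(num_bytes):
    """Time to pass num_bytes under the head, as a fraction of a revolution."""
    return num_bytes / track_bytes * rotation_ms


def random_read_ms(num_records, record_size):
    """Every record pays a seek, a rotational delay and its own transfer."""
    return num_records * (SEEK_MS + rotational_delay_ms + transfer_ms(record_size))


def sequential_read_ms(num_records, record_size):
    """One seek and one rotational delay, then one continuous transfer."""
    return SEEK_MS + rotational_delay_ms + transfer_ms(num_records * record_size)


print(f"sectors per track: about {sectors_per_track:.0f}")
print(f"rotational delay:  {rotational_delay_ms} ms")
print(f"random read:       {random_read_ms(34_000, 256) / 1000:.0f} s")
print(f"sequential read:   {sequential_read_ms(34_000, 256):.0f} ms")
```

Under these assumptions the random read is dominated by the 11 ms of seek and rotational delay paid for each record, which is why it takes minutes while the sequential read takes a fraction of a second.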
Disk as Bottleneck • Striping • RAID(Redundant Array of Inexpensive Disks) • Buffering • RAM Disk • Disk Cache
CD-ROM • Child of CD audio • Infrared laser reading pits and lands • Address = minute:second:sector number 75 sectors / second
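The minute:second:sector address maps to a linear sector number using the 75 sectors per second stated above. A small sketch, with illustrative function names that do not come from any CD-ROM API:

```python
SECTORS_PER_SECOND = 75  # from the slide


def cd_address_to_sector(minute, second, sector):
    """Convert a minute:second:sector address to a linear sector index."""
    return (minute * 60 + second) * SECTORS_PER_SECOND + sector


def sector_to_cd_address(index):
    """Inverse mapping: linear sector index back to (minute, second, sector)."""
    seconds, sector = divmod(index, SECTORS_PER_SECOND)
    minute, second = divmod(seconds, 60)
    return minute, second, sector


print(cd_address_to_sector(16, 22, 34))  # 73684
print(sector_to_cd_address(73684))       # (16, 22, 34)
```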
CLV & CAV • Constant Linear Velocity(CLV) • spiral track • same recording density • Constant Angular Velocity(CAV) • concentric tracks and pie-shaped sectors • same speed
CD-ROM strengths & weaknesses • Slow seek time • Slow data transfer rate • Read-only access • Large capacity needs indexes and structures to overcome CD-ROM’s poor performance • Permanent archives • Mass production
Digital Versatile / Video Disk • Double sided, Red laser • 4.7 GB ~17 GB
A Journey of a Byte • User's program: … write(textfile, ch, 1) … • File manager: invokes the I/O processor • I/O processor program • (Figure: the byte ch = 'P' moves from the user's data area to the system buffer, then through the I/O processor to the disk controller)
Buffer Management • A buffer compensates for the speed difference between memory and secondary storage in I/O-bound jobs • Double buffering – two buffers let the CPU fill or read one buffer while I/O is performed on the other • Buffer pooling employs a pool of buffers • LRU (Least Recently Used) replacement strategy • LFU (Least Frequently Used), FIFO
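A buffer pool with LRU replacement can be sketched in a few lines. The class name and the `read_block` callback are hypothetical stand-ins for a real disk read; the sketch only shows the replacement policy, not write-back of modified buffers.

```python
from collections import OrderedDict


class LRUBufferPool:
    """Fixed-size pool of block buffers with least-recently-used replacement."""

    def __init__(self, capacity, read_block):
        self.capacity = capacity
        self.read_block = read_block   # hypothetical callback: block number -> bytes
        self.buffers = OrderedDict()   # block number -> data, oldest use first
        self.disk_reads = 0

    def get(self, block_no):
        if block_no in self.buffers:
            # Hit: mark this buffer as the most recently used.
            self.buffers.move_to_end(block_no)
            return self.buffers[block_no]
        if len(self.buffers) >= self.capacity:
            # Miss with a full pool: evict the least recently used buffer.
            self.buffers.popitem(last=False)
        self.disk_reads += 1
        data = self.read_block(block_no)
        self.buffers[block_no] = data
        return data


pool = LRUBufferPool(2, lambda n: b"block%d" % n)
for block in (1, 2, 1, 3, 1, 2):
    pool.get(block)
print(pool.disk_reads)  # 4
```

In the usage above, blocks 1 and 2 are read, block 1 is a hit, block 3 evicts block 2, block 1 is a hit again, and block 2 must be read a second time.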
A Stream File • AmesMary123MapleEvanstonIL60201MasonAlan90EastgateAdaOK74820 • Name: Ames Mary / Mason Alan • Street: 123 Maple / 90 Eastgate • City: Evanston IL 60201 / Ada OK 74820
Field & Record Organization • Field – the smallest logically (or conceptually) meaningful unit of information in a file • Record – a set of fields that belong together • Organization affects the way we save, retrieve, and manage data in a file
Field Structures • Stream: AmesMary123MapleEvanstonIL60201MasonAlan90EastgateAdaOK74820 • Fixed-length fields: Ames Mary … Mason Alan … • Fast and easy access • Chopping or fragmentation • Each field with a length indicator: 04Ames04Mary … 05Mason04Alan … • Fields with delimiters: Ames|Mary| … Mason|Alan| … • Fields expressed as keyword=value: last=Ames|first=Mary| …
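The four field structures can be compared side by side with a short sketch. The function names, the 10-character field width, and the two-digit length prefix are assumptions made for illustration.

```python
def fixed_length(fields, width=10):
    """Pad (or chop) every field to the same width."""
    return "".join(f.ljust(width)[:width] for f in fields)


def length_prefixed(fields):
    """Precede each field with a two-digit length indicator (fields under 100 chars)."""
    return "".join(f"{len(f):02d}{f}" for f in fields)


def delimited(fields, delimiter="|"):
    """End each field with a delimiter that never occurs inside a field."""
    return "".join(f + delimiter for f in fields)


def keyword_value(pairs, delimiter="|"):
    """Self-describing fields written as keyword=value."""
    return "".join(f"{key}={value}{delimiter}" for key, value in pairs)


print(repr(fixed_length(["Ames", "Mary"])))                  # 'Ames      Mary      '
print(length_prefixed(["Ames", "Mary"]))                     # 04Ames04Mary
print(delimited(["Ames", "Mary"]))                           # Ames|Mary|
print(keyword_value([("last", "Ames"), ("first", "Mary")]))  # last=Ames|first=Mary|
```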
Record Structures • Fixed-length records (e.g. built from fixed-length fields) • Fixed number of fields • Each record with a length indicator • Second file to keep track of the beginning address of each record • Records with delimiters
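As one illustration, the "length indicator" record structure can be combined with delimited fields. The sketch below assumes a 2-byte little-endian length prefix, ASCII text, and "|" as the field delimiter; none of these choices come from the slides.

```python
import io
import struct


def write_record(stream, fields, delimiter=b"|"):
    """Write one record as a 2-byte length followed by delimited fields."""
    body = b"".join(f.encode("ascii") + delimiter for f in fields)
    stream.write(struct.pack("<H", len(body)))
    stream.write(body)


def read_records(stream, delimiter=b"|"):
    """Yield each record as a list of field strings until end of file."""
    while True:
        header = stream.read(2)
        if len(header) < 2:
            return
        (length,) = struct.unpack("<H", header)
        body = stream.read(length)
        # The trailing delimiter leaves one empty piece after split; drop it.
        yield [f.decode("ascii") for f in body.split(delimiter)[:-1]]


buf = io.BytesIO()
write_record(buf, ["Ames", "Mary", "123 Maple", "Evanston", "IL", "60201"])
write_record(buf, ["Mason", "Alan", "90 Eastgate", "Ada", "OK", "74820"])
buf.seek(0)
for record in read_records(buf):
    print(record)
```

Because each record carries its own length, a reader can skip over a record without parsing its fields.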
Record Access • Canonical form for a key • Ames, ames, AMES all map to one form • Primary key, secondary key • Primary key – should be dataless (not real data) and unchanging • Sequential Search • File Access
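A canonical key form and a sequential search might look like the sketch below. Upper-casing with surrounding whitespace removed is only one possible canonical form, and the function names are illustrative.

```python
def canonical_key(key):
    """Map every spelling of a key (Ames, ames, AMES) to a single form."""
    return key.strip().upper()


def sequential_search(records, key, key_field=0):
    """Scan the records in order; return the position of the first match or -1."""
    target = canonical_key(key)
    for position, record in enumerate(records):
        if canonical_key(record[key_field]) == target:
            return position
    return -1


records = [["Ames", "Mary"], ["Mason", "Alan"]]
print(sequential_search(records, "mason"))  # 1
print(sequential_search(records, "Smith"))  # -1
```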
Beyond Record Structures • Headers & Self-describing files • Metadata –data describing the primary data in a file • Mixing object types in a file Header notes images header … Simple=“T” Maxis=4 Scale=0.015 • Portability and Standardization
Secure Communication • E-commerce • Digital signature • Masking – substitution • The message is masked in such a way that the resulting message, sent over an open communication channel, seems harmless and inconspicuous • Veiling – transposition • Veiled messages are usually not masked at all, but are combined with other items in a regular pattern so that the result takes the form of yet another message, called an acrostic • Data Encryption
Transposition • Transposition simply changes the relative positions of the letters within a message. • Usually used as one stage of a more complex cryptosystem (such as in applying key-based encryption) • To perform a columnar transposition, a keyword is first needed. The message is written in rows beneath the keyword, and the columns are then typically read out in the alphabetical order of the keyword's letters. • Example ciphertext: csetrmeseseasg
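A columnar transposition can be sketched as follows. The keyword ZEBRAS and the message are a commonly used textbook illustration, not the slide's own example, and the sketch assumes the variant that leaves the last row unpadded.

```python
def columnar_encrypt(message, keyword):
    """Write message in rows under keyword, then read columns in keyword order."""
    width = len(keyword)
    # Column order: alphabetical by keyword letter, ties broken left to right.
    order = sorted(range(width), key=lambda i: (keyword[i], i))
    # message[col::width] picks every character that falls in column `col`.
    return "".join(message[col::width] for col in order)


print(columnar_encrypt("WEAREDISCOVEREDFLEEATONCE", "ZEBRAS"))
# EVLNACDTESEAROFODEECWIREE
```

The letters of the message are unchanged; only their positions move, which is what distinguishes transposition from substitution.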
Caesar Substitution • One of the simplest monoalphabetic substitutions • One of the easiest to break • A simple substitution cipher in which each plaintext letter is replaced by the letter three places down the alphabet, so that M is replaced by P, and so on • plain text: this is a simple cipher • cipher text: wklv lv d vlpsoh flskhu
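A shift cipher is only a few lines of code. This sketch uses the shift of three described above, wraps around at the end of the alphabet, and leaves spaces and punctuation untouched; the function name is illustrative.

```python
import string


def caesar(text, shift=3):
    """Replace each letter by the one `shift` places down the alphabet."""
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    k = shift % 26
    table = str.maketrans(lower + upper,
                          lower[k:] + lower[:k] + upper[k:] + upper[:k])
    return text.translate(table)


print(caesar("M"))                           # P
print(caesar("this is a simple cipher"))     # wklv lv d vlpsoh flskhu
print(caesar("wklv lv d vlpsoh flskhu", -3)) # this is a simple cipher
```

With only 25 useful shifts, trying every one by hand is enough to break the cipher.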
Polybius Chequerboard • Polybius was the Greek who devised a system for converting alphabetic characters into numeric characters. It was designed so that messages could be signaled easily using torches. • Example: 31345115 3215
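The sketch below assumes the common 5 x 5 layout of the Latin alphabet in which I and J share a cell and each letter becomes its row digit followed by its column digit. The slide does not show its grid, so this layout is an assumption; under it, the digits 31345115 3215 decode to LOVE ME.

```python
# Assumed 5 x 5 square, read row by row; J is folded into I.
SQUARE = "ABCDEFGHIKLMNOPQRSTUVWXYZ"


def polybius_encode(text):
    """Replace each letter by its row and column digits; keep other characters."""
    out = []
    for ch in text.upper():
        if ch == "J":
            ch = "I"
        if ch in SQUARE:
            row, col = divmod(SQUARE.index(ch), 5)
            out.append(f"{row + 1}{col + 1}")
        else:
            out.append(ch)
    return "".join(out)


def polybius_decode(digits):
    """Turn each pair of digits back into a letter; keep other characters."""
    out = []
    i = 0
    while i < len(digits):
        if digits[i].isdigit():
            row, col = int(digits[i]), int(digits[i + 1])
            out.append(SQUARE[(row - 1) * 5 + (col - 1)])
            i += 2
        else:
            out.append(digits[i])
            i += 1
    return "".join(out)


print(polybius_encode("love me"))        # 31345115 3215
print(polybius_decode("31345115 3215"))  # LOVE ME
```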
Map cipher • Map ciphers are maps that look normal but have a secret cipher hidden within. • To create a map cipher, create a set of symbols that stand for each letter in the alphabet. The example below uses tree branches and a matrix to create an alphabet of trees. The position of each letter in the matrix determines the number of branches on the left or right side of the tree.
Key-based Encryption • (Figure: plain text is encrypted with an n-bit key into cipher text, and cipher text is decrypted with the key back into plain text) • Repeat the key as many times as necessary to cover the whole message, then add each key digit to the matching plain-text digit; here the key is "4232":
Plain text      1 2 3 4 5 6 5 4 3 2 1
Key (repeated)  4 2 3 2 4 2 3 2 4 2 3
Encrypted text  5 4 6 6 9 8 8 6 7 4 4
• A brute-force attack runs through the possible keys, applying each one to the cryptosystem until the message is decrypted. • Most 56-bit key cryptosystems can be broken in less than one week.
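The repeated-key example on this slide can be reproduced with a short sketch. It follows the slide's arithmetic (digit plus key digit, no wrap-around), which is a teaching device rather than a real cipher; the function names are illustrative.

```python
from itertools import cycle


def encrypt(plain, key):
    """Add the repeating key to the plain text, position by position."""
    return [p + k for p, k in zip(plain, cycle(key))]


def decrypt(cipher, key):
    """Subtract the repeating key to recover the plain text."""
    return [c - k for c, k in zip(cipher, cycle(key))]


plain = [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]
key = [4, 2, 3, 2]
cipher = encrypt(plain, key)
print(cipher)                # [5, 4, 6, 6, 9, 8, 8, 6, 7, 4, 4]
print(decrypt(cipher, key))  # [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]
```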
Symmetrical Key • Private (secret) key; easy to implement in hardware • Typical uses: encrypted files stored on the hard disk, data sent to someone close by • Stream ciphers can encrypt a single bit of plaintext at a time, whereas block ciphers encrypt a block of multiple bits at a time (64 bits in DES, 128 bits in AES) • Disadvantages include: • The authenticity of the originator of the data cannot be verified • The private key has to be transmitted over a very secure channel • Across a network of users, a large number of keys may be needed for one-to-one communication between each pair of users • Examples: DES (Data Encryption Standard), AES (Advanced Encryption Standard)
Asymmetrical Key • Public key encryption was invented in 1976 to circumvent the problems of managing the private key • No need to send the key to the target along with the encrypted message • Public key encryption can be used for authentication via the digital signature mechanism, so a message is protected not only in secrecy but also in integrity • Disadvantages include: • Public key ciphers generally require longer keys than symmetric ciphers to achieve the same level of security • They also take much longer to decrypt than symmetric methods • Examples: RSA (based on very large prime numbers), PGP (Pretty Good Privacy)
Organizing Files for Performance • Data Compression • Reclaim Spaces in Files • Searching • Keysorting
Data Compression • Irreversible – speech compression • Reversible • compact notation, e.g. the 50 states • run-length encoding, e.g. 22 23 24 24 24 24 24 24 24 25 … becomes 22 23 ff 24 07 25 … • variable-length codes • Benefits of small files • Use less storage • Can be transmitted faster • Can be processed faster
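The run-length example above uses ff as a run marker followed by the repeated value and the run length. A minimal sketch of that scheme follows; the threshold of four repeats and the rule that a literal ff byte is always written as a run are assumptions added to make the encoding reversible.

```python
RUN_MARKER = 0xFF  # marker byte, as in the slide's example


def rle_encode(data, min_run=4):
    """Replace runs of a repeated byte with marker, value, count."""
    out = bytearray()
    i = 0
    while i < len(data):
        value = data[i]
        run = 1
        while i + run < len(data) and data[i + run] == value and run < 255:
            run += 1
        if run >= min_run or value == RUN_MARKER:
            # A literal marker byte is also written as a run so decoding stays unambiguous.
            out += bytes([RUN_MARKER, value, run])
        else:
            out += bytes([value]) * run
        i += run
    return bytes(out)


def rle_decode(encoded):
    """Expand every marker, value, count triple back into a run."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == RUN_MARKER:
            value, run = encoded[i + 1], encoded[i + 2]
            out += bytes([value]) * run
            i += 3
        else:
            out.append(encoded[i])
            i += 1
    return bytes(out)


data = bytes.fromhex("2223" + "24" * 7 + "25")
print(rle_encode(data).hex(" "))  # 22 23 ff 24 07 25
```

Run-length encoding only pays off when the data really contains long runs; a file with none comes out the same size or slightly larger.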
Code – Fundamental Concepts • A code is a mapping of source messages into codewords • Distinct – the mapping is 1:1 • Uniquely decodable – e.g. a prefix-free code • Example source: aa bbb cccc ddddd eeeeee fffffff gggggggg • Block-block code: ASCII, EBCDIC • a 000, b 001, c 010, .. • Variable-variable code • aa 0, bbb 1, cccc 10, ddddd 11, eeeeee 100
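Whether a set of codewords is prefix-free can be tested mechanically. The sketch below relies on the fact that, once the codewords are sorted, any codeword that is a prefix of another is also a prefix of its immediate successor; the function name is illustrative.

```python
def is_prefix_free(codewords):
    """True if no codeword is a prefix of another codeword."""
    words = sorted(codewords)
    return not any(b.startswith(a) for a, b in zip(words, words[1:]))


print(is_prefix_free(["000", "001", "010"]))         # True
print(is_prefix_free(["0", "1", "10", "11", "100"])) # False
```

The second set is the variable-variable code listed above: it is not prefix-free, since 1 is a prefix of 10, so a decoder cannot tell where one codeword ends without extra information.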
Shannon-Fano Coding • List source messages a(i) and their probabilities p(a(i)) in order of nonincreasing probability • Divide the list into two groups whose total probabilities are as nearly equal as possible • Assign 0 as the first digit of the codeword of each message in the first group, and 1 to each message in the second group • Divide each of these groups by the same criterion and append further code digits until each subset contains only one message
Message  Probability  Shannon-Fano  Huffman
g        8/40         00            00
f        7/40         010           110
e        6/40         011           111
d        5/40         100           010
space    5/40         101           101
c        4/40         110           011
b        3/40         1110          1000
a        2/40         1111          1001
• How about for the set of probabilities { .35, .17, .17, .16, .15 }?
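The splitting procedure can be sketched recursively. The function names and the tie-breaking rule (take the first split that is closest to equal) are assumptions; integer weights are used instead of probabilities to avoid rounding issues.

```python
def shannon_fano(weights):
    """weights: list of (symbol, weight) pairs in nonincreasing order of weight."""
    codes = {symbol: "" for symbol, _ in weights}

    def split(group):
        if len(group) < 2:
            return
        total = sum(w for _, w in group)
        running, best_cut, best_diff = 0, 1, None
        for cut in range(1, len(group)):
            running += group[cut - 1][1]
            diff = abs(total - 2 * running)  # |first group total - second group total|
            if best_diff is None or diff < best_diff:
                best_cut, best_diff = cut, diff
        for symbol, _ in group[:best_cut]:
            codes[symbol] += "0"
        for symbol, _ in group[best_cut:]:
            codes[symbol] += "1"
        split(group[:best_cut])
        split(group[best_cut:])

    split(list(weights))
    return codes


def average_length(weights, codes):
    """Average codeword length in bits per message."""
    total = sum(w for _, w in weights)
    return sum(w * len(codes[s]) for s, w in weights) / total


table = [("g", 8), ("f", 7), ("e", 6), ("d", 5), ("space", 5), ("c", 4), ("b", 3), ("a", 2)]
print(shannon_fano(table))

question = [("p1", 35), ("p2", 17), ("p3", 17), ("p4", 16), ("p5", 15)]
print(average_length(question, shannon_fano(question)))  # 2.31
```

For the first input this reproduces the Shannon-Fano column of the table. For { .35, .17, .17, .16, .15 } it yields codeword lengths 2, 2, 2, 3, 3, an average of 2.31 bits, whereas an optimal (Huffman) code for the same probabilities averages 2.30 bits.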
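Huffman Encoding
Letter       a    b    c    d    e    f
Probability  0.4  0.2  0.1  0.1  0.1  0.1
• Repeatedly merge the two smallest probabilities: {.4 .2 .1 .1 .1 .1} → {.4 .2 .2 .1 .1} → {.4 .2 .2 .2} → {.4 .4 .2} → {.6 .4} → {1.0} • (Figure: the resulting tree, with branches labelled 0 and 1, places a one level below the root, b two levels below, and c, d, e, f four levels below) • Efficiency: 3 bits per letter with a fixed-length code versus 0.4*1 + 0.2*2 + 0.1*4*4 = 2.4 bits on average • Dynamic Huffman Encoding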
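The merging procedure can be sketched with a priority queue. This is a minimal illustration, not the slide's exact tree: when probabilities tie, different but equally good trees are possible, so the individual codewords may differ from the figure while the average length stays at 2.4 bits. Integer weights (tenths) stand in for the probabilities.

```python
import heapq
from itertools import count


def huffman_codes(weights):
    """weights: dict symbol -> weight. Returns dict symbol -> codeword."""
    tie = count()  # unique tie-breaker so the heap never compares the dicts
    heap = [(w, next(tie), {symbol: ""}) for symbol, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w0, _, codes0 = heapq.heappop(heap)  # smallest weight
        w1, _, codes1 = heapq.heappop(heap)  # second smallest
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (w0 + w1, next(tie), merged))
    return heap[0][2]


weights = {"a": 4, "b": 2, "c": 1, "d": 1, "e": 1, "f": 1}
codes = huffman_codes(weights)
total_bits = sum(weights[s] * len(codes[s]) for s in weights)
print(codes)
print(total_bits / sum(weights.values()))  # 2.4
```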
Redundancy • John F. Kennedy's 1961 inaugural address: "Ask not what your country can do for you -- ask what you can do for your country" • Concept of separate words: each word that appears twice is given a number • "ask" → 1 • "what" → 2 • "your" → 3 • "country" → 4 • "can" → 5 • "do" → 6 • "for" → 7 • "you" → 8 • Result: "1 not 2 3 4 5 6 7 8 -- 1 2 8 5 6 7 3 4"
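The word-numbering idea can be sketched as follows. The function name is illustrative, and the sketch assumes case is ignored and numbers are handed out in order of first appearance.

```python
from collections import Counter


def number_repeated_words(text):
    """Replace every word that occurs more than once by a dictionary number."""
    words = text.lower().split()
    counts = Counter(words)
    numbers = {}
    out = []
    for word in words:
        if counts[word] > 1:
            # First appearance gets the next free number; later ones reuse it.
            numbers.setdefault(word, len(numbers) + 1)
            out.append(str(numbers[word]))
        else:
            out.append(word)
    return " ".join(out), numbers


quote = "Ask not what your country can do for you -- ask what you can do for your country"
encoded, dictionary = number_repeated_words(quote)
print(encoded)  # 1 not 2 3 4 5 6 7 8 -- 1 2 8 5 6 7 3 4
```

The dictionary itself has to be stored or sent along with the encoded text, so the saving only becomes real for longer texts.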
Searching for Patterns • LZ adaptive dictionary-based algorithm • Searching for repeated patterns in "Ask not what your country can do for you -- ask what you can do for your country" (__ stands for the space between words) • ask__ → 1 • what__ → 2 • you → 3 • r__country → 4 • __can__do__for__you → 5 • Result: "1not__2345__--__12354"
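Expanding the encoded string with the five-entry dictionary confirms that nothing is lost. This sketch only decodes the slide's hand-built dictionary; it is not an implementation of an LZ algorithm, which would build the dictionary adaptively while reading the input.

```python
# The pattern dictionary from the slide; "__" stands for a space.
DICTIONARY = {
    "1": "ask__",
    "2": "what__",
    "3": "you",
    "4": "r__country",
    "5": "__can__do__for__you",
}


def expand(encoded):
    """Replace each dictionary digit by its pattern; copy everything else."""
    return "".join(DICTIONARY.get(ch, ch) for ch in encoded)


quote = "Ask not what your country can do for you -- ask what you can do for your country"
decoded = expand("1not__2345__--__12354")
print(decoded == "__".join(quote.lower().split()))  # True
```

Unlike the whole-word scheme, the patterns here cross word boundaries, which is why 21 characters are enough to represent the sentence.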
Some Remarks on Data Compression • Applications: communications, backup, databases, still images, audio & video • Lossy compression – much smaller files, with losses that are often hard for the human ear or eye to notice • JPEG, MPEG-1, 2, 4, 7, 21, MP3 (MPEG-1 Layer 3), AC-3 • Lossless compression • Huffman, GIF, PNG, TIFF