CIS 402: File Management Techniques Chapter 5

CIS 402: File Management Techniques Chapter 5 Managing Files of Records

Chapter Objectives • Extend the file structure concepts of Chapter 4: • Search keys and canonical forms • Sequential search and direct access • File access and file organization • Examine other kinds of file structures in terms of • Abstract data models • Metadata • Object-oriented file access • Extensibility • Examine issues of portability and standardization.

Record Access • Record Key • Canonical form : a standard form of a key • e.g. Ames or ames or AMES (need conversion) • Distinct keys : uniquely identify a single record • Primary keys, Secondary keys, Candidate keys • Primary keys should be dataless (not updatable) • Primary keys should be unchanging • Social-security-number: good primary key • but, 999-99-9999 for all non-registered aliens • Measurement of work: • Comparisons: occur in main memory • Disk accesses: main bottleneck
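The canonical-form idea can be sketched in a few lines. The slide only says that a conversion is needed; converting to uppercase and stripping surrounding blanks is an assumption made here for illustration, and CanonicalKey is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: convert a key to a canonical form (uppercase, no
// leading or trailing blanks) so "Ames", "ames" and "AMES " compare equal.
// The choice of uppercase is an assumption; any single agreed form works.
std::string CanonicalKey(const std::string& key)
{
    const char* blanks = " \t";
    std::size_t first = key.find_first_not_of(blanks);
    if (first == std::string::npos) return "";   // key was all blanks
    std::size_t last = key.find_last_not_of(blanks);
    std::string canon = key.substr(first, last - first + 1);
    std::transform(canon.begin(), canon.end(), canon.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return canon;
}
```

Both the key stored with a record and the key supplied in a search request would pass through the same conversion before they are compared.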

Sequential Search • Sequential search is the least efficient search method. Our main pursuit for the duration of the term is to present improved search methods • O(n), n : the number of records • Use record blocking to reduce work • A block of several records • fields < records < blocks • Still O(n), but blocking decreases the number of seeks • search is sequential within each block • e.g. 4000 records, 512 bytes each, sector size 512 bytes • Unblocked (sector-sized buffers): 512-byte (½K) buffer => average 2000 READ() calls • Blocked (16 recs / block) : 8K size buffer => average 125 READ() calls • Can further improve upon performance by using block key containing last record key to avoid searching within blocks where data can’t be
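The READ() counts in the example follow from simple arithmetic, sketched below. AverageReads is a hypothetical helper; it assumes a successful search in which every record is equally likely to be the target, and it uses the slide's approximation that about half of the blocks are read.

```cpp
// Average number of READ() calls for a successful sequential search,
// assuming every record is equally likely to be the target.
// With blocking, one READ() fetches recsPerBlock records at once.
long AverageReads(long numRecords, long recsPerBlock)
{
    long numBlocks = (numRecords + recsPerBlock - 1) / recsPerBlock; // round up
    return numBlocks / 2;   // about half the blocks are read on average
}

// Buffer size needed to hold one block of fixed-length records.
long BlockBufferBytes(long recsPerBlock, long recordSize)
{
    return recsPerBlock * recordSize;
}
```

With 4000 records, AverageReads(4000, 1) gives 2000 and AverageReads(4000, 16) gives 125, and BlockBufferBytes(16, 512) gives the 8K buffer quoted above.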

Sequential Search: Best Uses • UNIX sequential processing commands • cat, wc, grep • When is Sequential Search Superior? • Repetitive hits • Searching for patterns in ASCII files • Searching records with a certain secondary key value • Small Search Set • Processing files with few records • Devices/media most hospitable to sequential access • tape • binary file on disk

Direct Access • Access a record without searching • O(1) operation • RRN ( Relative Record Number ) • Gives relative position of the record • Locating a record by RRN is an O(n) process with variable-length records • Easy with fixed-length records: RRN*sizeof(record) • View file as collection of records, not bytes; all byte info is internal • Byte offset = n × r • r : record size • n : RRN value • Class IOBuffer includes • direct read (DRead) • direct write (DWrite) • take byte offset as argument, along with stream • use polymorphism to pick correct Read/Write fns.
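A minimal sketch of the byte offset = n × r calculation for fixed-length records. ByteOffset and ReadByRRN are hypothetical helpers, not part of the IOBuffer class; the sketch assumes the file has no header record in front of record 0.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Byte offset of the record with relative record number rrn
// in a file of fixed-length records (no header assumed).
long ByteOffset(long rrn, long recordSize)
{
    return rrn * recordSize;
}

// Hypothetical helper: read the fixed-length record at rrn directly.
// Returns true on success; the record bytes are stored in out.
bool ReadByRRN(std::istream& file, long rrn, long recordSize, std::string& out)
{
    file.clear();                                    // reset any earlier EOF state
    file.seekg(ByteOffset(rrn, recordSize), std::ios::beg);
    std::string buf(static_cast<std::size_t>(recordSize), '\0');
    if (!file.read(&buf[0], recordSize)) return false;   // short read: no such record
    out = buf;
    return true;
}
```

No other record is touched on the way, which is why the operation is O(1). If the file starts with a header, its size would be added to the offset.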

Choosing Record Length and Structure • Record length is related to the size of the fields • Access vs. fragmentation vs. implementation • Fixed length record • fixed-length fields • e.g. OHIO 10847115 7264.9 4133035 3 1180317COLUMBUS • variable-length fields • e.g. delimited: OHIO|10847115|7|264.9|41330|35|3|1|1803|17|COLUMBUS\0....\0 • Unused space portion is filled with null character in C

Header Records • File as a Self-Describing Object • General information about file • date and time of recent update, • number of records • size of record, fields (fixed-length record & field) • delimiter (variable-length field) • Often placed at the beginning of the file • Pascal did not naturally support header records (File is a repeated collection of the same type) • Use variant records (depending on context) • In C: union • polymorphic structure

IO Buffer Class definition
Abstract base class for file buffers

class IOBuffer
{
public:
    virtual int Read(istream &) = 0;         // read a buffer from the stream
    virtual int Write(ostream &) const = 0;  // write a buffer to the stream

    // these are the direct access read and write operations
    virtual int DRead(istream &, int recref);         // read specified record
    virtual int DWrite(ostream &, int recref) const;  // write specified record

    // these header operations return the size of the header
    virtual int ReadHeader(istream &);
    virtual int WriteHeader(ostream &) const;

protected:
    int Initialized;  // TRUE if buffer is initialized
    char *Buffer;     // character array to hold field values
};

IO Buffer Class definition • Full definition of buffer class hierarchy • WriteHeader method : • writes the header string at the beginning of the file. Possible strings: • “Variable” • “Fixed” • Returns size of header written • ReadHeader method : • reads the header id string. Must be the expected record type, variable or fixed length • If the string matches that subclass’ header string, returns size of header • any other string causes return of -1 (header doesn’t match buffer) • DWrite/DRead methods : • operate using the byte address of the record as the record reference. Methods begin by seeking to the requested spot.
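The header protocol described above can be sketched with two free functions. This is an illustration under assumptions, not the class hierarchy's actual code: the functions take the expected id string as a parameter instead of being virtual methods, and the constant names are hypothetical.

```cpp
#include <cstring>
#include <istream>
#include <ostream>
#include <sstream>

// Hypothetical header id strings, following the slide's "Variable"/"Fixed".
const char* const kVariableHeader = "Variable";
const char* const kFixedHeader    = "Fixed";

// Write the id string at the beginning of the stream.
// Returns the size of the header written, or -1 on failure.
int WriteHeader(std::ostream& stream, const char* headerStr)
{
    stream.seekp(0, std::ios::beg);
    int size = static_cast<int>(std::strlen(headerStr));
    stream.write(headerStr, size);
    return stream.good() ? size : -1;
}

// Read the id string from the beginning of the stream.
// Returns the header size if it matches expected, otherwise -1.
int ReadHeader(std::istream& stream, const char* expected)
{
    char found[32] = {0};
    int size = static_cast<int>(std::strlen(expected));
    if (size >= static_cast<int>(sizeof(found))) return -1;
    stream.clear();
    stream.seekg(0, std::ios::beg);
    stream.read(found, size);
    if (!stream.good()) return -1;              // file shorter than the header
    return std::strncmp(found, expected, size) == 0 ? size : -1;
}
```

A file written with the "Fixed" header yields 5 when opened by a fixed-length buffer and -1 when opened by a variable-length one, so a mismatch is caught before any record is read.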

Encapsulating Record I/O Operations in a Single Class • Good design for making objects persistent • provide operation to read and write objects directly • Write operation until now : • two operations : • pack into a buffer • write the buffer to a file • Class ‘RecordFile’ • supports a write operation that takes an object and writes it to a file. • use of buffers is encapsulated within the class • must be generalized, as it is built with a generic type

Encapsulating Record I/O Operations in a Single Class • Class ‘RecordFile’ • uses C++ template features to become generic • definition of the template class RecordFile

template <class RecType>
class RecordFile : public BufferFile
{
public:
    int Read(RecType& record, int recaddr = -1);
    int Write(const RecType& record, int recaddr = -1);
    RecordFile(IOBuffer& buffer) : BufferFile(buffer) { }
};

// template method bodies
template <class RecType>
int RecordFile<RecType>::Read(RecType &record, int recaddr)
{
    int readAddr, result;
    readAddr = BufferFile::Read(recaddr);  // read the record into the buffer
    if (!readAddr) return -1;
    result = record.Unpack(Buffer);        // unpack the buffer into the object
    if (!result) return -1;
    return readAddr;                       // address of the record read
}

template <class RecType>
int RecordFile<RecType>::Write(const RecType &record, int recaddr)
{
    int result;
    result = record.Pack(Buffer);          // pack the object into the buffer
    if (!result) return -1;
    return BufferFile::Write(recaddr);     // write the buffer to the file
}

File Access and File Organization • File organization: variable-length records, fixed-length records • File access: sequential access, direct access • There is a difference between file access and file organization. • Variable-length records • Sequential access is suitable • Fixed-length records • Direct access and sequential access are possible • Note: Book references to Pascal are completely obsolete. It is unusual in present-day programming languages to be unable to freely maneuver within a file

Abstract Data Model • Data objects such as documents, images, sound • Abstract Data Model does not view data as it appears on a particular medium. • application-oriented view • application shielded from details of storage on medium • How to specify a file’s content? • Headers and Self-describing files • e.g. images: jpg: ÿØÿà JFIF gif: GIF89a • e.g. sounds: mp3: ÿûD EQ¹à wav: RIFF$P WAVEfmt
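Because these files are self-describing, a program can guess a file's type from its first bytes. The sketch below is a hypothetical helper, SniffType, that checks only the jpg, gif and wav signatures shown above; it also accepts the older GIF87a signature, which is not on the slide, and it leaves mp3 out.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: guess a file's type from its first bytes.
// Only jpg, gif and wav signatures are checked; mp3 is omitted.
std::string SniffType(const unsigned char* data, std::size_t len)
{
    if (len >= 3 && data[0] == 0xFF && data[1] == 0xD8 && data[2] == 0xFF)
        return "jpg";                        // JPEG start-of-image marker
    if (len >= 6 && (std::memcmp(data, "GIF89a", 6) == 0 ||
                     std::memcmp(data, "GIF87a", 6) == 0))
        return "gif";
    if (len >= 12 && std::memcmp(data, "RIFF", 4) == 0 &&
                     std::memcmp(data + 8, "WAVE", 4) == 0)
        return "wav";                        // RIFF container holding WAVE data
    return "unknown";
}
```

The bytes ÿØÿ on the slide are 0xFF 0xD8 0xFF displayed as text, which is the first test in the function.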

Metadata • Data that describe the primary data in a file • e.g. <Meta> in html • Store in the header record • Standard format • As shown on previous slide

Mixing object Types in a file • Each field is identified using “keyword = value” • Index table with tags • e.g.

Object-oriented file access • Separate the translation to and from the physical format from the application (representation-independent file access) • provide a function to handle access (OO style) • encapsulate details • read_image() is image file type independent; method determines file type

Program find_star
    :
    read_image(“star1”, image)
    process image
    :
end find_star

(Figure: image in RAM; star1 and star2 on disk)

Extensibility • Advantage of using tags • Identify object within files • do not require a priori knowledge of the types of objects • New type of object • implement method for reading and writing in appropriate module (separate concerns) • call the method.
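One way to picture tag-based extensibility is a table that maps each tag to the routine that reads that kind of object. TagRegistry and Reader below are hypothetical names, and the readers return a short description string only to keep the sketch small.

```cpp
#include <functional>
#include <map>
#include <string>

// Hypothetical registry mapping a tag to the routine that reads that object type.
using Reader = std::function<std::string(const std::string& payload)>;

class TagRegistry
{
public:
    // Supporting a new object type means registering one more reader.
    void Register(const std::string& tag, Reader reader) { readers_[tag] = reader; }

    // Dispatch on the tag; unknown tags are reported and can be skipped.
    std::string Read(const std::string& tag, const std::string& payload) const
    {
        auto it = readers_.find(tag);
        if (it == readers_.end()) return "skipped unknown tag " + tag;
        return it->second(payload);
    }

private:
    std::map<std::string, Reader> readers_;
};
```

The code that walks the file never changes when a new object type appears; only a new reader is written and registered, which is the separation of concerns the slide describes.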

Factors Affecting Portability • Differences among operating systems • e.g. CR/LF line endings in DOS • Differences among languages • physical layout of files may be constrained by language limitations • Differences in machine architectures • byte order: e.g. Unix: htonl/htons, ntohl/ntohs • Differences among platforms • e.g. EBCDIC vs. ASCII
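The byte-order problem can be avoided by fixing the order in the file format itself. The sketch below uses hypothetical helpers, PackUint32 and UnpackUint32, to store a 32-bit integer most-significant byte first using shifts, so it gives the same file bytes on any machine without relying on the Unix conversion routines.

```cpp
#include <cstdint>

// Store a 32-bit value most-significant byte first (network byte order),
// regardless of the host machine's own byte order.
void PackUint32(std::uint32_t value, unsigned char out[4])
{
    out[0] = static_cast<unsigned char>(value >> 24);
    out[1] = static_cast<unsigned char>(value >> 16);
    out[2] = static_cast<unsigned char>(value >> 8);
    out[3] = static_cast<unsigned char>(value);
}

// Rebuild the value from four bytes stored most-significant byte first.
std::uint32_t UnpackUint32(const unsigned char in[4])
{
    return (static_cast<std::uint32_t>(in[0]) << 24) |
           (static_cast<std::uint32_t>(in[1]) << 16) |
           (static_cast<std::uint32_t>(in[2]) << 8)  |
            static_cast<std::uint32_t>(in[3]);
}
```

Writing the four packed bytes, instead of the integer's in-memory image, keeps the file readable on both big-endian and little-endian machines.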

Achieving Portability • Standardization • Standard physical record format • extensible, simple • Standard binary encoding for data elements • IEEE, XDR • File structure conversion • Number and text conversion • Established, well-known methods of conversion

Achieving Portability • File system differences • Block size is 512 bytes on UNIX systems • Block size is 2880 bytes on many non-UNIX systems • UNIX and Portability • UNIX supports portability by being commonly available on a large number of platforms • UNIX provides a utility called dd • dd : facilitates data conversion
