Organizing files for performance Chapter 6
6.1 Data compression • Advantages of reduced file size • Redundancy reduction: state code example • Repeating sequences: run length encoding • Variable length code • static (Morse code) • dynamic (Huffman code) • Irreversible compression (e.g., JPEG) • Unix routines (append .z to compressed files)
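
Run length encoding can be illustrated with a minimal sketch. The (count, value) pair layout and the function names below are assumptions made for illustration; practical schemes often use a special run-marker byte instead of encoding every run.

```python
def rle_encode(data: bytes) -> list[tuple[int, int]]:
    """Collapse each run of identical bytes into a (count, value) pair."""
    runs: list[tuple[int, int]] = []
    for value in data:
        if runs and runs[-1][1] == value:
            # Same byte as the previous run: extend that run by one.
            runs[-1] = (runs[-1][0] + 1, value)
        else:
            # A different byte starts a new run of length 1.
            runs.append((1, value))
    return runs


def rle_decode(runs: list[tuple[int, int]]) -> bytes:
    """Expand (count, value) pairs back into the original bytes."""
    return b"".join(bytes([value]) * count for count, value in runs)
```

Encoding pays off only when the data contains long runs; a file with no repeated bytes would grow, since every byte becomes a pair.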
6.2 Reclaiming space • “Holes” arise when • variable length records are updated • fixed or variable length records are deleted • Compaction (for deleted records) • mark deleted records • allows undelete to be implemented • periodically run compaction program
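
The mark-then-compact idea might look like the sketch below. It assumes each record starts with a one-byte status flag (a space for live, "*" for deleted); that flag layout and the helper names are illustrative, not taken from the slides.

```python
LIVE, DELETED = b" ", b"*"  # assumed one-byte status flags


def make_record(payload: bytes) -> bytes:
    """Prefix the payload with a 'live' status flag."""
    return LIVE + payload


def mark_deleted(records: list[bytes], rrn: int) -> None:
    """Flag a record as deleted; its data stays in place."""
    records[rrn] = DELETED + records[rrn][1:]


def undelete(records: list[bytes], rrn: int) -> None:
    """Restore a marked record; possible only before compaction runs."""
    records[rrn] = LIVE + records[rrn][1:]


def compact(records: list[bytes]) -> list[bytes]:
    """Copy every record that is not marked deleted to a new 'file'."""
    return [r for r in records if r[:1] != DELETED]
```

Because deletion only changes the flag, undelete works until the next compaction pass, which is when the space is actually reclaimed.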
6.2.2 Dynamic reclamation • Simple approach: search sequentially until space is found to insert a new record; drawback: very slow • Alternative uses linked list stack to allow immediate access to an empty slot, if available; stack may be kept in deleted record slots, with RRN of top in header record.
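
A minimal in-memory sketch of the linked list stack follows. The class name, the use of -1 as the end-of-list value, and the tuple that stands in for a deleted slot are all assumptions; in a real file the link would be written into the deleted record's bytes and the head into the header record.

```python
NO_SLOT = -1  # assumed end-of-list marker


class FixedFile:
    """Fixed-length record file with an avail list kept as a stack."""

    def __init__(self) -> None:
        self.head = NO_SLOT  # header field: RRN of the top of the stack
        self.slots: list = []  # each slot: a record, or ("*", next_rrn) if deleted

    def delete(self, rrn: int) -> None:
        # Push: the deleted slot stores the old top, then becomes the new top.
        self.slots[rrn] = ("*", self.head)
        self.head = rrn

    def insert(self, record: str) -> int:
        if self.head != NO_SLOT:
            # Pop: reuse the most recently deleted slot, no searching needed.
            rrn = self.head
            self.head = self.slots[rrn][1]
            self.slots[rrn] = record
        else:
            # No free slot: append at the end of the file.
            rrn = len(self.slots)
            self.slots.append(record)
        return rrn
```

For example, after inserting three records and deleting RRN 0 and then RRN 2, the next insert reuses slot 2 and the one after that reuses slot 0.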
6.2.3 Variable length records • Same scheme (linked list stack) may be used, except byte offset rather than RRN must be used as link • Deleted records go on top of stack, but stack must be searched when adding records to find a space big enough to accommodate each new record
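
For variable length records, the same stack might be sketched as below, with byte offsets as links and a search for a large enough hole. The dictionary stands in for the size and link fields that would be stored inside the deleted records themselves; the names are hypothetical.

```python
NO_SLOT = -1  # assumed end-of-list marker


class AvailList:
    """Avail list for variable-length records, linked by byte offset."""

    def __init__(self) -> None:
        self.head = NO_SLOT
        # byte offset -> (size of hole, offset of next hole)
        self.holes: dict[int, tuple[int, int]] = {}

    def push(self, offset: int, size: int) -> None:
        """A deleted record goes on top of the stack."""
        self.holes[offset] = (size, self.head)
        self.head = offset

    def take(self, needed: int) -> int:
        """Walk the list and unlink the first hole that is big enough."""
        prev, cur = NO_SLOT, self.head
        while cur != NO_SLOT:
            size, nxt = self.holes[cur]
            if size >= needed:
                if prev == NO_SLOT:
                    self.head = nxt
                else:
                    prev_size, _ = self.holes[prev]
                    self.holes[prev] = (prev_size, nxt)
                del self.holes[cur]
                return cur
            prev, cur = cur, nxt
        return NO_SLOT  # nothing fits: caller appends at end of file
```

Unlike the fixed-length case, insertion is no longer constant time, since the list may have to be traversed to find a fit.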
6.2.4 Fragmentation • Internal • fixed length records • “unsophisticated” variable length scheme • External: variable length records • smaller record is placed in a larger slot • leftover space is added to available list • Coalescing holes (good test question)
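
Coalescing holes can be sketched as merging free regions that touch each other. The (offset, size) representation and the function name are assumptions for illustration.

```python
def coalesce(holes: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge physically adjacent (offset, size) holes into larger ones."""
    merged: list[tuple[int, int]] = []
    for offset, size in sorted(holes):
        if merged and merged[-1][0] + merged[-1][1] == offset:
            # This hole starts exactly where the previous one ends.
            prev_offset, prev_size = merged[-1]
            merged[-1] = (prev_offset, prev_size + size)
        else:
            merged.append((offset, size))
    return merged
```

The sort is what makes this straightforward here; an avail list kept in deletion order gives no cheap way to tell which holes are neighbors.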
6.2.5 Placement strategies • First fit: first record slot that’s big enough • Best fit: sort slots in ascending order by size, then use first fit • Worst fit: sort in descending order • no need to search: just use first space if it’s big enough • leftover space may be enough for another record
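
The three strategies can be compared with a small sketch that works on a list of free slot sizes and returns the size of the slot chosen. The function names are illustrative, and sorting on every call is a simplification; a real avail list would be kept in order as slots are added.

```python
from typing import Optional


def first_fit(sizes: list[int], needed: int) -> Optional[int]:
    """Return the first slot, in list order, that is big enough."""
    return next((s for s in sizes if s >= needed), None)


def best_fit(sizes: list[int], needed: int) -> Optional[int]:
    """Ascending order, then first fit: the smallest slot that fits."""
    return first_fit(sorted(sizes), needed)


def worst_fit(sizes: list[int], needed: int) -> Optional[int]:
    """Descending order: only the largest slot needs to be checked."""
    ordered = sorted(sizes, reverse=True)
    return ordered[0] if ordered and ordered[0] >= needed else None
```

With free slots of 40, 12, 70, and 25 bytes and a 20-byte record, first fit picks 40, best fit picks 25 (leaving a 5-byte fragment), and worst fit picks 70 (leaving 50 bytes, possibly enough for another record).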
6.3.2 Binary search • relational ops for search key • retrieval by RRN • object-oriented presentation of algorithm • implementation with templates • compilation with class definitions
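
A binary search that retrieves records by RRN might look like the sketch below. It uses an in-memory byte stream in place of a disk file, and the record size, key size, and helper names are assumptions; the slides refer to a templated object-oriented version that is not reproduced here.

```python
import io

RECORD_SIZE = 16  # assumed fixed record length in bytes
KEY_SIZE = 6      # assumed key field width, space padded


def build_file(keys: list[bytes]) -> io.BytesIO:
    """Build a sorted file of fixed-length records (key + filler data)."""
    body = b"".join(
        k.ljust(KEY_SIZE) + b"data".ljust(RECORD_SIZE - KEY_SIZE)
        for k in sorted(keys)
    )
    return io.BytesIO(body)


def read_record(f: io.BytesIO, rrn: int) -> bytes:
    """Direct access: seek to RRN * record size and read one record."""
    f.seek(rrn * RECORD_SIZE)
    return f.read(RECORD_SIZE)


def binary_search(f: io.BytesIO, key: bytes, count: int) -> int:
    """Return the RRN of the record with the given key, or -1."""
    low, high = 0, count - 1
    while low <= high:
        mid = (low + high) // 2
        record_key = read_record(f, mid)[:KEY_SIZE].rstrip()
        if record_key == key:
            return mid
        if record_key < key:
            low = mid + 1
        else:
            high = mid - 1
    return -1
```

Each loop iteration costs one seek and one read, which is why the number of probes matters more than the comparisons themselves.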
6.3.3-4 Search performance • complexity for binary search is O(log₂ n), compared to O(n) for sequential search • records must be sorted on search key • disk sort is prohibitively expensive • “internal sort” allows direct accesses in memory
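
To make the gap concrete, the worst-case probe count for binary search can be set against the rough average for a successful sequential search. The function names are illustrative, and n/2 is the usual approximation for the sequential average.

```python
def max_binary_probes(n: int) -> int:
    """Worst-case probes for binary search: floor(log2 n) + 1, for n >= 1."""
    return n.bit_length()


def avg_sequential_probes(n: int) -> float:
    """Approximate average probes for a successful sequential search."""
    return n / 2
```

For a 1,000-record file this gives at most 10 probes against about 500; for 100,000 records, at most 17 against about 50,000.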
6.3.5 Limitations • number of disk accesses for binary search is still significant for large files • keeping a file sorted can be less efficient than using sequential search; merge technique addresses this problem • internal sort is limited to small files that fit entirely in memory
6.4 Keysort • only keys are kept in memory • each key is kept with its RRN (keynode) • keynode array is sorted in memory • data file can be sorted by reading records in the order of the sorted keynodes and writing them to a new file • keynodes can be written as an index file
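
The keysort steps can be sketched as follows. A Python list stands in for the data file, and the function and parameter names are assumptions for illustration.

```python
from typing import Callable


def keysort(
    records: list[str], key_of: Callable[[str], str]
) -> tuple[list[tuple[str, int]], list[str]]:
    """Sort (key, RRN) pairs in memory, then emit records in key order."""
    # Only the key and the RRN of each record are held as keynodes.
    keynodes = [(key_of(rec), rrn) for rrn, rec in enumerate(records)]
    keynodes.sort()
    # Reading by RRN in keynode order yields the sorted file. On disk,
    # each of these reads is a seek to an arbitrary position.
    sorted_records = [records[rrn] for _, rrn in keynodes]
    return keynodes, sorted_records
```

Writing out the sorted keynodes on their own, rather than rewriting the data file, gives the index file mentioned in the last bullet and avoids moving the records at all.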
6.4.4 Pinned records • available list (of deleted record slots) • records whose physical locations are referenced in other records are pinned