File organisation
This presentation is the property of its rightful owner.
Sponsored Links
1 / 56

File Organisation PowerPoint PPT Presentation


  • 94 Views
  • Uploaded on
  • Presentation posted in: General

File Organisation. Placing File on Disk. File – a sequence of records Records Record type Record fields Data type Number of bytes in a field fixed Variable. Record Characteristics. A logical view: SELECT * FROM STUDENTS or (Smith, 17, 1, CS) , (Brown, 8, 2, CS) or

Download Presentation

File Organisation

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


File organisation

File Organisation


Placing file on disk

Placing File on Disk

  • File – a sequence of records

  • Records

    • Record type

    • Record fields

    • Data type

    • Number of bytes in a field

      • fixed

      • Variable


Record characteristics

Record Characteristics

  • A logical view:

    • SELECT * FROM STUDENTS or

    • (Smith, 17, 1, CS) , (Brown, 8, 2, CS) or

    • STUDENT(Name, Number, Class, Major)

  • A physical view:

    • (20 bytes + 4 bytes + 4 bytes + 3 bytes)

    • data types determine record length

      - -records can be of fixed or variable length


Fixed versus variable length records

Fixed Versus Variable Length Records

  • FIXED LENGTH:

    • every record has same fields

    • field can be located relative to record start

  • VARIABLE LENGTH - FIELDS:

    • Some fields have unknown length

    • use a field separator

    • Use a record terminator

  • WHAT IF RECORDS ARE SMALLER THAN A BLOCK? - BLOCKING FACTOR

  • WHAT IF RECORDS ARE LARGER THAN A BLOCK? - SPANNING RECORDS


Record blocking

Record blocking

Allocating records to disk blocks

  • Unspanned records

    • Each record is fully contained in one block

    • Many records in one block

    • Blocking factor bfr – number of recordsthat fit in one block

      Example: Block size B = 1024 record size (fixed) R = 150

      bfr = 1024/150  = 6 (floor and ceiling functions)

  • Spanned organization

    • Record ‘continued’ on the consecutive block

    • Required pointer to point the block with the remainder of a record

  • If records are of a variable length , then bfr could represent the average number of records per bloc (the rounding function does not apply)


File structure

File structure

  • File – as a set of pages (disk blocks) storing records

  • File header

    • Record format, types of separators

    • Block address(es)

  • Blocks allocated

    • Contiguous

    • Linked (use of block pointers)

    • Linked clusters

    • Indexed


Searching for a record

Searching for a record

Search for a record on disk,

  • one or more file blocks copied into buffers.

  • Programs search for the desired record in the buffers, using the information in the file header.

    • If the address of the block with desired record is not known, the search programs must do a linear search through the file blocks. Each file block is copied into a buffer and searched either until the record is located or all the file blocks have been searched unsuccessfully.

      The goal of a good file organization is to locate the block that contains a desired record with a minimal number of block transfers


Operations on files

Operations on Files

Because of complex path from stored data to user, DBMS offer a range of I/O operations:

OPEN - access the file and prepare pointer

FIND (LOCATE) - find first record

FINDNEXT

FINDALL - set

READ

INSERT

DELETE

MODIFY

CLOSE

REORGANISE - set

READ-ORDERED (FIND-ORDERED) - set


File organization and access method

File organization and access method.

  • Difference between the terms

    • file organization and

    • access method.

  • A file organization is organization of the data of a file into records, blocks, and access structures;

    • way of placing records and blocks on the storage medium

  • An access method provides a group of operations that can be applied to a file resulting in retrieval, modification and reorganisation.

    • One file organization can accept many different access methods Some access methods, though, can be applied only to files with specific file organization.

      For example, one cannot apply an indexed access method to a file without an index


Why do access methods matter

Why do Access Methods matter

The unit of transfer between disk and main memory is a block

  • Data must be in memory for the DBMS to use it

  • DBMS memory is handled in units of a page, e.g. 4K, 8K. Pages in memory represent one or more hardware blocks from the disk

  • If a single item is needed, the whole block is transferred

  • Time taken for an I/O depends on the location of the data on the disk and is lower if the number of seek times and rotational delays are small, we remember that:access time = seek times + rotational delays + transfer times

  • The reason many DBMS do not rely on the OS file system is:

    • higher level DB operations, e.g. JOIN, have a known pattern of page accesses and can be translated into known sets of I/O operations

    • buffer manager can PRE-FETCH pages by anticipating the next request. This is especially efficient when the required data are stored CONTIGUOUSLY on disk


Simple file organisations

Simple File Organisations

Unorderedfiles of records: Heap or Pile file

  • New records inserted at EOF, or anywhere

  • locating a record is by a linear search

  • insertion is easy

  • retrieval of an individual record, or in any order, is difficult (time consuming).

  • Question. How many blocks in average one needs to reed to find a single record ?

    Fast: Select * from Course

    Slow: Select count(*) from Course

    group by Course_Number


Operations on unordered file

Operations on Unordered File

Inserting a new record is very efficient:

  • The address of the last file block is kept in the file header

  • The last disk block of the file is copied into a buffer page;

  • The new record is added or new page is opened; the page is then rewritten back to disk block.

    Searching for a record using any search condition in a file stored in b blocks

  • Linear search through the file, block by block

    • Cost = b/2 block transfers. on average, if only one record satisfies the search condition,

    • Cost = b block transfers. If no records or several records satisfy the search condition. program must read and search all b blocks in the file.

      To delete a record,

  • find its block and copy the block into a buffer page,

  • delete the record from the buffer,

  • rewrite the updated page back to the disk block.

    Note: Unused space in the block could be used in future for a new record if suitable (some book keeping necessary on unused space in file blocks))


Special deletion procedures

Special Deletion Procedures

Technique used for record deletion

  • Each record has an extra byte or bit, called a deletion marker set to ‘1’ at insertion *)

  • DO not remove deleted record, but reset its deletion marker to ‘0’ when deleted

  • Record with deletion marker set to 0 is not used by application programs

  • From time to time reorganise the file: physically remove deleted records or reclaim unused space.

    *) Just for simplicity we assume that values of deletion markers are ‘0’ or ‘1’. A system actually can choose other characters or combination of bits as values of deletion markers.


Simple file organisations1

Simple File Organisations

Ordered files of records - sequential files

  • still extremely useful in DBM (auditing, recovery, security…)

  • A record field is nominated and records are ordered based on that field

    • Ordering key

  • insertion is expensive

  • retrieval is easy (efficient) if exploiting the sort order

  • binary search reduces time significantly

    Fast: Select * from Course order by <order>

    Slow: Select * from Course where <any other attribute> = c


Retrieval update in sorted files

Retrieval & Update in Sorted Files

  • Binary search on ordering field to find block with key = k:

    B = # of blocks; High:= B; Low := 0

    Do while not (Found or NotThere)

    Read Block

    Mid = (Low + High) / 2

    If k < key field of first record in the block

    Then High = Mid - 1

    Else If k > key field of last record

    Then Low = Mid + 1

    Else If k record is in the buffer

    Then FoundElse NotThere

    end


Operations on ordered file

Operations on Ordered File

Searching for records when criteria are specified in terms of ordering field

  • Reading the records in order of the ordering key values is extremely efficient,

  • Finding the next record from the current one in order of the ordering key usually requires no additional block accesses,

    • the next record is in the same block or in the next block

  • using a search condition based on the value of an ordering key field results in faster access when the binary search technique is used,

  • A binary search can be done on the blocks rather than on the records.. A binary search usually accesses log2(b) blocks, whether the record is found or not

  • No advantage if search criterion is specified in terms of non ordering fields


Operations on ordered file1

Operations on Ordered File

Inserting records is expensive. To insert a record

  • find its correct position in the file, based on its ordering field value, - cost log2(b)

  • make space in the file to insert the record in that position.

  • on the average, half the records of the file must be moved to make space for the new record.

  • these file blocks must be read and rewritten to keep the order. Cost of insertion is then =b/2 block transfers

    Deleting record.

  • Find the record using binary search based on ordering field value, - cost log2(b

  • Delete the record,

  • Reorganise part of the file (all records after that deleted one, b/2 blocks in average)

    Modifying record

  • Find record using binary search and update as required


Operations on ordered file2

Operations on Ordered File

Alternative ways for more efficient insertion

  • keep some unused space in each block for new records (not good - problem returns when that space is filled up)

  • create and maintain a temporary unordered file called an overflow file.

  • New records are inserted at the end of the overflow file

  • Periodically, the overflow file is sorted and merged with the main file during file reorganization.

  • Searching for a record must involve both files, main and overflow; the cost of searching is thus more expensive but for large main file will be still close to log2(b)

    Alternative way for more efficient deletion

  • Use the technique based on deletion marker, as described earlier


Access properties of simple files

R4 -------

R2 -------

R3 -------

R16 -------

R1 -------

R7 -------

R35 -------

R10 -------

R14 -------

R12 -------

R23 -------

R6 -------

R24 -------

R27 -------

R1 -------

R2 -------

R3 -------

R4 -------

R6 -------

R7 -------

R10 -------

R12 -------

R14 -------

R16 -------

R23 -------

R24 -------

R27 -------

R35 -------

Access Properties of Simple Files

  • Heap (sequential unordered)

  • Ordered (sequential) file

  • Note: in this and the following examples record numbers corresponds to values of ordering field in ascending order


Access properties of simple files1

R4 -------

R2 -------

R3 -------

R16 -------

R1 -------

R7 -------

R35 -------

R10 -------

R14 -------

R12 -------

R23 -------

R6 -------

R24 -------

R27 -------

R4 -------

R2 -------

R3 -------

R16 -------

R1 -------

R7 -------

R35 -------

R10 -------

R14 -------

R12 -------

R23 -------

R6 -------

R24 -------

R27 -------

R15 -------

Access Properties of Simple Files

Insert into Heap file record R15

  • And after insertion


Access properties of simple files2

R1 -------

R2 -------

R3 -------

R4 -------

R6 -------

R7 -------

R10 -------

R12 -------

R14 -------

R16 -------

R23 -------

R24 -------

R27 -------

R35 -------

R1 -------

R2 -------

R3 -------

R4 -------

R6 -------

R7 -------

R10 -------

R12 -------

R14 -------

R15 -------

R16 -------

R23 -------

R24 -------

R27 -------

R35 -------

Access Properties of Simple Files

Insert into Ordered file record R15

  • And after insertion

Notice that all records after R15 have changed their page location or position on the page


Access properties of simple files3

R1 -------

R2 -------

R3 -------

R4 -------

R1 -------

R2 -------

R3 -------

R4 -------

R6 -------

R7 -------

R10 -------

R12 -------

R6 -------

R7 -------

R10 -------

R12 -------

R14 -------

R16 -------

R23 -------

R24 -------

R14 -------

R16 -------

R23 -------

R24 -------

R27 -------

R35 -------

R27 -------

R35 -------

Access Properties of Simple Files

Insert into Ordered file records R15, R9, R17 using overflow file

Main File Overflow File

  • And after insertion

R15 -------

R9 -------

R17 -------

Main File Overflow File

Periodically overflow file is sorted and merged with the main file


Access properties of simple files4

R4 -------

R2 -------

R3 -------

R16 -------

R1 -------

R7 -------

R35 -------

R10 -------

R14 -------

R12 -------

R23 -------

R6 -------

R24 -------

R27 -------

R4 -------

R2 -------

R16 -------

R1 -------

R35 -------

R14 -------

R12 -------

R23 -------

R6 -------

R24 -------

R27 -------

Access Properties of Simple Files

  • Deletions from a Heap: R10, R3, R7:

  • Simple delete:

  • After delete operations


Access properties of simple files5

R4 ------- 1

R2 ------- 1

R3 ------- 0

R16 ------- 1

R4 ------- 1

R2 ------- 1

R3 ------- 1

R16 ------- 1

R1 ------- 1

R7 ------- 0

R35 ------- 1

R10 ------- 0

R1 ------- 1

R7 ------- 1

R35 ------- 1

R10 ------- 1

R14 ------- 1

R12 ------- 1

R23 ------- 1

R6 ------- 1

R14 ------- 1

R12 ------- 1

R23 ------- 1

R6 ------- 1

R24 ------- 1

R27 ------- 1

R24 ------- 1

R27 ------- 1

Access Properties of Simple Files

  • Deletions from a Heap: R10, R3, R7:

  • using deletion marker technique

  • After delete operations

Deletion markers set to ‘0’ and later these records will be physically removed when file is reorganised


Access properties of simple files6

R1 -------

R2 -------

R4 -------

R6 -------

R12 -------

R14 -------

R16 -------

R23 -------

R24 -------

R27 -------

R35 -------

R1 -------

R2 -------

R3 -------

R4 -------

R6 -------

R7 -------

R10 -------

R12 -------

R14 -------

R16 -------

R23 -------

R24 -------

R27 -------

R35 -------

Access Properties of Simple Files

  • Deletions from ordered file: R10, R3, R7:

  • Simple delete:

  • After delete operations


Access properties of simple files7

R1 ------- 1

R2 ------- 1

R3 ------- 0

R4 ------- 1

R1 ------- 1

R2 ------- 1

R3 ------- 1

R4 ------- 1

R6 ------- 1

R7 ------- 0

R10 ------- 0

R12 ------- 1

R6 ------- 1

R7 ------- 1

R10 ------- 1

R12 ------- 1

R14 ------- 1

R16 ------- 1

R23 ------- 1

R24 ------- 1

R14 ------- 1

R16 ------- 1

R23 ------- 1

R24 ------- 1

R27 ------- 1

R35 ------- 1

R27 ------- 1

R35 ------- 1

Access Properties of Simple Files

  • Deletions from ordered file: R10, R3, R7:

  • Using deletion marker technique:

  • After delete operations

Deletion markers set to ‘0’ and later these records will be physicaly removed when file is reorganised


Retrieval and update in heaps

Retrieval and Update In Heaps

Quick summary

  • can only use linear search

  • insertion is fast

  • deletion, update are slow

  • parameter search (e.g. SELECT…WHERE) is slow

  • unconditional search can be fast if

    • records are of fixed length

    • records do not span blocks:

  • j-th record located by position in block j / bfr

  • average time to find a single record = b / 2

    (b = number of blocks)


Retrieval update in sorted files1

Retrieval & Update in Sorted Files

  • Quick summary

  • retrieval on key field is fast - “next” record is nearby

  • any other retrieval either requires a sort, or an index, or is as slow as a heap

  • update, delete, insert are slow (find block, update block, rewrite block)


Fast access for database hashing

FAST ACCESS FOR DATABASE: HASHING

  • Types of hashing: static or dynamic

  • What is the point of hashing?

    • reduce a large address space

    • provide close to direct access

    • provide reasonable performance for all U,I,D,S

  • What is a hash function?

    • properties

    • behaviour

  • Collisions

  • Collision resolution

  • Open addressing

  • Summary


What is hashing and what is it for

What Is Hashing, and What Is It For?

  • “direct” access to block containing the desired record

  • reduce the number of blocks read or written

  • allow for file expansion and contraction with minimal file reorganising

  • permit retrieval on “hashed” fields without re-sorting the file

  • no need to allocate contiguous disk areas

  • if file is small, internal hashing; otherwise external

  • no direct access other than by hashing


A basic example of hashing

A basic example of hashing:

  • There are 25 rows of seats, with 3 seats per row (75 seats total)

  • We have to allocate each person to a row in advance, at random

  • We will hash on their family name so as to find the person’s row number directly, knowing only the name

  • The database is logically a single table

    ROOM (Name, Age, Attention)

    implemented as a blocked, hashed file


The hashing process

The hashing process

  • The hash process is:

    • Loc = 0

    • Until no more characters in YourName

      • Add the alphabetic position of the character to Loc

    • Calculate RowNum = Loc mod 25


Examples hashed names

Examples - Hashed Names

Where is MCWILLIAM?Hash(MCWILLIAM) = Row 20


Name hashing example continued

Name Hashing Example continued

NameRowNameRowNameRowNameRow

Lee22Alex17George7Anne9

West17Rita23Guy3Will6

James23Jodie18Dave7Wilf0

Anna5Jill18Don8Walt6

Anita18Lily8Dixy12Jack0

Jie24Ash3Jon14Lana3

Kenny19Ben21Nina13Olga10

Marie21Kay12May14Fred8

Lois5Peter14Max13Tania18

Best21Paul0Nora23Tom23

Rob10Phil20Cash6Julia3

Lou23Pat11Foot6Leah6

Axel17Ed9Tan10Ling17


The results after 52 arrivals

The results after 52 arrivals


The room as a hashed file

The Room as a Hashed File

  • Each person has a hash key - the name

  • Each person is a record

  • Each row is a hardware block (bucket)

  • Each row number is the address of a bucket

  • Records here are fixed-length(and 3 records per block)

  • The leftover people are collisions (key collisions)

  • They will have to be found a seat by collision resolution


Collision resolution

Collision Resolution

  • Leave an empty seat in each row

    • Under population - blocks 66% full

  • A notice on the end of the row: “extra seat for row N can be found at the rear exit”

    • bucket’s overflow chain points to an overflow page containing the record

  • “Everyone stand up while we reallocate seats”

    • file reorganisation


Collision resolution1

Collision Resolution

Strategy 1 (open addressing):

  • “Nora” is 4th arrival for 23

  • Place new arrival in next higher No block with a vacancy

  • Retrieval - search for “Nora”:

    • Retrieve block 23

    • Read blocks in 23 consecutively. If “Nora” not found try 24…


  • File organisation

    • Disadvantages:

      • May need to read whole file consecutively on some keys

      • Blocks will gradually fill up with out-of-place records

      • Deletions cause either immediate or periodic reorganisation


    Collision resolution2

    Collision Resolution

    Strategy 2:

    • Reserve some rows (buckets) for overflow Blocks 25, 26 and 27 or recalculate hash function for smaller mod, say 20 instead of 25

      • “Julia” is then 4th arrival for block 3

      • Place in overflow block with smaller label and with available space (26 ? and optionally placing a pointer in bucket 3 pointing to 26th).

    • Retrieval - search for “Julia”:

      • Retrieve block 3

      • Read blocks in 3 consecutively. If “Julia” not found, either:

        • search overflow consecutively, or

        • follow pointer to block 26 (chaining)


    File organisation

    • Disadvantages:

      • Overflow gradually fills up giving longer retrieval times

      • Deletions/additions cause periodic reorganisation


    Collision resolution3

    Collision Resolution

    • More formally

    • Open addressing: If location specified by hash address is occupied then the subsequent positions are checked in order until an unused (empty) position is found.

    • Chaining: various overflow locations are kept, a pointer field is added to each record location. A collision is resolved by placing the new record in an unused overflow location and setting the pointer of the occupied hash address location to the address of that overflow location.

    • Multiple hashing: A second hash function is applied if the first results in a collision.


    Performance on hashed files

    Performance on Hashed Files

    • Retrieve (SELECT): very fast if name is known, otherwise hopeless

      • SELECT * FROM ROOM

        WHERE NAME = ‘McWilliam’

      • SELECT * FROM ROOM

        WHERE AGE > 30

    • Update: same

      • UPDATE ROOM SET ATTENTION = ‘low’

        WHERE NAME = ‘McWilliam’

      • UPDATE ROOM SET ATTENTION = ‘high’

        WHERE AGE > 50 OR AGE < 10


    Performance on hashed files1

    Performance on Hashed Files

    • Delete: same as SELECT, UPDATE

      • DELETE FROM ROOM (uses hash - fast)

        WHERE NAME = ‘Nora’

      • DELETE FROM ROOM (can’t use hash - slow)

        WHERE NAME IS LIKE ‘No%’

        Insert: unpredictable

      • INSERT INTO ROOM

        VALUES (‘Smyth’, ‘high’)


    Internal hashing

    Internal Hashing

    • Internal hashing is used as an internal search structure within a program whenever a group of records is accessed exclusively by using the value of one field.

    • Applicable to smaller files

    • Hashed in main memory: fast lookup in store

    • R records, R-length array

    • Hash function transforms key field into subscript array in the range 0 to R - 1

    • hash (Key Value) = Key Value (mod R)

    • subscript is the record address in store


    External hashing

    External Hashing

    • Hashing for disk files is called external hashing.

    • address space is made of buckets, each of which holds multiple records.

    • A bucket is either one disk block or a cluster of contiguous blocks.

    • The hashing function maps a key into a relative bucket number,

    • A table maintained in the file header converts the bucket number into the corresponding disk block address


    External hashing static

    External Hashing (static)

    • The hashing scheme is called static hashing if a fixed number of buckets M is allocated.

    • If a record is to be retrieved with search condition specified for the key values, then the bucket number of the bucket potentially containing that record is determined using the hashing function applied on the key and then that bucket is examined for the containment of the desired record. If record is not in that bucket then further search could be activated in overflow buckets.


    External hashing static1

    External Hashing (static)

    Construction of hashed file

    • Identify size of the file, choose hashing function (according to the anticipated number of buckets) and decide about selection of the collision resolution procedure - for the life of the file

    • Apply hashing function to each inserted record to get the bucket number and place the record in the bucket with that number

    • If bucket is full then apply selected collision resolution procedure

    • If the number of records in overflow buckets is large and/or distribution of records in buckets is highly un-uniform , then reorganise the file using changed hashing function (tuning)


    External hashing static2

    External Hashing (static)

    0

    1

    N-1

    H(key) mod N

    Overflow Page

    key

    H

    Primary buckets


    Problems of static hashing

    Problems of Static Hashing

    • Number of buckets is fixed

    • shrinkage causes wasted space

    • growth causes long overflow chains

    • Solutions:

      • reorganise

      • re-hash

      • use dynamic hashing...


    Extendible hashing

    Extendible Hashing

    • Previously, to insert a new record into a full bucket

      • add overflow page, or

      • reorganise by doubling the bucket allocation and redistributing the records

    • This is a poor solution:

      • entire file is read

      • twice as many pages have to be written

    • Solution: Extendible Hashing

      • add a directory of pointers to buckets

      • double the number of buckets by doubling the directory

      • split only the bucket that has overflowed


    Linear hashing

    LINEAR HASHING

    • It does not have a directory at all

    • Instead, have a “family” of algorithms to manage dynamic expansion and contraction of the file

    • Start with a set number of M buckets 0..M-1 with hashing function mod M

    • Split them in linear order, when more space is needed. The next hashing function is mod 2M and subsequent 3M, 4M etc as required

      • Example: Block capacity is 2 records. Records with values 72, 62, 32 are colliding for hashing function (mod 10), but after application of next hashing function (mod 20) they do not (one bucket contains 72 and 32 and another 62).

    • Combines controlled overflow with new space acquisition


    Linear hashing advantages

    Linear Hashing - Advantages

    Another type of dynamic hashing

    • does not require a directory

    • manages collisions well

    • accommodates insertions and deletions well

    • allows overflow chain length to be traded against average space utilisation

    • uses several hash functions


    Hashing summary

    HASHING SUMMARY

    • Comparison of Simple and Hashed Files

    • Pages in a hashed file are grouped into buckets (1 block or a cluster of contiguous blocks)

    • reduce a large address space

    • provide close to direct access

    • Static hashing has some disadvantages which are addressed in dynamic hashing solutions (extendible, linear)


    File organisation

    • Static hashed files are kept at about 80% occupancy then reorganised. New pages are added when each existing page is about 80% full

    • Hence, time to read entire file is  1.25 non-hashed file

    • Dynamic hashing provides flexibility in usage of file storage space (expansion and contraction)


  • Login