Secondary storage management
Sponsored Links
This presentation is the property of its rightful owner.
1 / 336

Secondary Storage Management PowerPoint PPT Presentation

  • Uploaded on
  • Presentation posted in: General

Secondary Storage Management. Chapter 13. 13.1 The Memory Hierarchy. The Memory Hierarchy. Computer systems have several different components in which data may be stored. Data capacities & access speeds range over at least seven orders of magnitude

Download Presentation

Secondary Storage Management

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Secondary Storage Management

Chapter 13

13.1 The Memory Hierarchy

The Memory Hierarchy

  • Computer systems have several different components in which data may be stored.

  • Data capacities & access speeds range over at least seven orders of magnitude

  • Devices with smallest capacity also offer the fastest access speed

Description of Levels

  • Cache

    • Megabyte or more of Cache storage.

    • On-board cache : On same chip.

    • Level-2 cache : On another chip.

    • Cache data accessed in few nanoseconds.

    • Data moved from main memory to cache when needed by processor

    • Volatile

Description of Levels

2.Main Memory

  • 1 GB or more of main memory.

  • Instruction execution & Data Manipulation - involves information resident in main memory.

  • Time to move data from main memory to the processor or cache is in the 10-100 nanosecond range.

  • Volatile

    3.Secondary Storage

  • Typically a magnetic disk.

  • Capacity upto 1 TB.

  • One machine can have several disk units.

  • Time to transfer a single byte between disk & main memory is around 10 milliseconds.

Description of Levels

4. Tertiary Storage

  • Holds data volumes measured in terabytes.

  • Significantly higher read/write times.

  • Smaller cost per bytes.

  • Retrieval takes seconds or minutes, but capacities in the petabyte range are possible.

Transfer of Data Between Levels

  • Data moves between adjacent levels of the hierarchy.

  • Each level is organized to transfer huge amounts of data to or from the below level

  • Key technique for speeding up database operations is to arrange data so that when one piece of a disk block is needed, it is likely that other data on the same block will also be needed at about the same time.

Volatile & Non Volatile Storage

  • A volatile device “forgets” what is stored in it when the power goes off.

  • Example: Main Memory

  • A nonvolatile device, on the other hand, is expected to keep its contents intact even for long periods when the device is turned off or there is a power failure.

  • Example: Secondary & Tertiary Storage

    Note: No change to the database can be considered final until it

    has migrated to nonvolatile, secondary storage.

Virtual Memory

  • Managed by Operating System.

  • Some memory in main memory & rest on disk.

  • Transfer between the two is in units of disk blocks (pages).

  • Not a level of the memory hierarchy

Thank you!

Secondary storage management

Section 13.2

CS-257 Database System Principles

AvinashAnantharamu (102)



13.2 Disks

13.2.1 Mechanics of Disks

13.2.2 The Disk Controller

13.2.3 Disk Access Characteristics

Structure of a Disk

Mechanics of Disks

Two principal moving pieces of hard drive

1- Head Assembly

2- Disk Assembly

Disk Assembly has 1 or more circular platters

that rotate around a central spindle.

Platters are covered with thin magnetic material

Top View of Disk Surface

Mechanics of Disks

  • Tracks are concentric circles on a platter.

  • Tracks are organized into sectors which are segments of circular platter.

  • Sectors are indivisible as far as errors are concerned.

  • Blocks are logical data transfer units.

Disk Controller

Control the actuator to move head assembly

Selecting the surface from which to read or write

Transfer bits from desired sector to main memory

Possibly, buffering an entire track or more in local memory of the disk

controller, hoping that many sectors of this track will be read soon, and

additional accesses to the disk can be avoided.

Simple Single Processor Computer

Disk Access characteristics

Seek time

Rotational latency

Transfer time

Latency of the disk.

Thank you

13.3 Accelerating Access to Secondary Storage

San Jose State University

Spring 2012

13.3 Accelerating Access to Secondary StorageSection Overview

  • 13.3.1: The I/O Model of Computation

  • 13.3.2: Organizing Data by Cylinders

  • 13.3.3: Using Multiple Disks

  • 13.3.4: Mirroring Disks

  • 13.3.5: Disk Scheduling and the Elevator Algorithm

  • 13.3.6: Prefetching and Large-Scale Buffering

13.3 Introduction

  • Average block access is ~10ms.

  • Disks may be busy.

  • Requests may outpace access delays, leading to infinite scheduling latency.

  • There are various strategies to increase disk throughput.

  • The “I/O Model” is the correct model to determine speed of database operations

13.3 Introduction (Contd.)

  • Actions that improve database access speed:

    • Place blocks closer, within the same cylinder

    • Increase the number of disks

    • Mirror disks

    • Use an improved disk-scheduling algorithm

    • Use prefetching

13.3.1 The I/O Model of Computation

  • If we have a computer running a DBMS that:

    • Is trying to serve a number of users

    • Has 1 processor, 1 disk controller, and 1 disk

    • Each user is accessing different parts of the DB

  • It can be assumed that:

    • Time required for disk access is much larger than access to main memory; and as a result:

    • The number of block accesses is a good approximation of time required by a DB algorithm

13.3.2 Organizing Data by Cylinders

  • It is more efficient to store data that might be accessed together in the same or adjacent cylinder(s).

  • In a relational database, related data should be stored in the same cylinder.

13.3.3 Using Multiple Disks

  • If the disk controller supports the addition of multiple disks and has efficient scheduling, using multiple disks can improve performance significantly

  • By striping a relation across multiple disks, each chunk of data can be retrieved in a parallel fashion, improving performance by up to a factor of n, where n is the total number of disks the data is striped over

13.3.4 Mirroring Disks

  • A drawback of striping data across multiple disks is that you increase your chances of disk failure.

  • To mitigate this risk, some DBMS use a disk mirroring configuration

  • Disk mirroring makes each disk a copy of the other disks, so that if any disk fails, the data is not lost

  • Since all the data is in multiple places, access speedup can be increased by more than n since the disk with the head closest to the requested block can be chose

  • Unfortunately, the writing of disk blocks does not speed up at all. The reason is that the new block must be written to each of the n disks.

13.3.4 Mirroring Disks

13.3.5 Disk Scheduling

  • One way to improve disk throughput is to improve disk scheduling, prioritizing requests such that they are more efficient

    • The elevator algorithm is a simple yet effective disk scheduling algorithm

    • The algorithm makes the heads of a disk oscillate back and forth similar to how an elevator goes up and down

    • The access requests closest to the heads current position are processed first

13.3.5 Disk Scheduling

  • When sweeping outward, the direction of head movement changes only after the largest cylinder request has been processed

  • When sweeping inward, the direction of head movement changes only after the smallest cylinder request has been processed

  • Example:

13.3.6 Prefetching and Large-Scale Buffering

  • In some cases we can anticipate what data will be needed

  • We can take advantage of this by prefetching data from the disk before the DBMS requests it

  • Since the data is already in memory, the DBMS receives it instantly

? Questions ?

Disk Failures

Presented by Timothy Chen

Spring 2013


13.4 Disk Failures13.4.1 Intermittent Failures13.4.2 Organizing Data by Cylinders13.4.3 Stable Storage13.4.4 Error- Handling Capabilities of Stable Storage13.4.5 Recovery from Disk Crashes13.4.6 Mirroring as a Redundancy Technique13.4.7 Parity Blocks13.4.8 An Improving: RAID 513.4.9 Coping With Multiple Disk Crashers

Intermittent Failures

If we try to read the sector but the correct content of that sector is not delivered to the disk controller

Controller will check good and bad sector

If the write is correct: Read is performed

Good sector and bad sector is known by the read operation


Read operation that determine the good or bad status

If, on reading, we find that the checksum is not proper for the data bits, then we know the is an error in reading.

If the checkum is proper, there is still a small chance that the block was not read correctly, but by using many checksum bits we can make the probability of missing a bad read arbitrarily small.

How CheckSum perform

Each sector has some additional bits

Set depending on the values of the data bits stored in each sector

If the data bit in the not proper we know there is an error reading

Odd number of 1: bits have odd parity(01101000)

Even number of 1: bit have even parity (111011100)

Find Error is the it is one bit parity

Stable Storage

Deal with disk error

Sectors are paired and each pair X showing left and right copies as Xl and Xr

It check the parity bit of left and right by subsituting spare sector of Xl and Xr until the good value is returned

Error-Handling Capabilities of Stable Storage

Since it has XL and XR, one of them fail we can still read other one

Chance both of them fail are pretty small

The write Fail, it happened during power outage

Recover Disk Crash

The most serious mode of failure for disks is “head crash” where data permanently destroyed.

The way to recover from crash , we use RAID method

Mirroring as a Redundancy Technique

it is call Raid 1

Just mirror each disk

Raid 1 graph

Parity Block

It often call Raid 4 technical

read block from each of the other disks and modulo-2 sum of each column and get redundant disk

disk 1: 11110000

disk 2: 10101010

disk 3: 00111000

get redundant disk 4(even 1= 0, odd 1 =1)

disk 4: 01100010

Raid 4 graphic

Parity Block- Fail Recovery

It can only recover one disk fail

If it has more than one like two disk

Then it can’t be recover us modulo-2 sum

An Improvement Raid 5

Coping with multiple Disk Crash

For more one disk fail

Either raid 4 and raid 5 can’t be work

So we need raid 6

It is need at least 2 redundant disk

Raid 6


Secondary Storage Management

13.5 Arranging data on disk



CS 257


Fixed-Length Records

Example of Fixed-Length Records

Packing Fixed-Length Records into Blocks

Example of Packing Fixed-Length Records into Blocks

Details of Block header

Arranging Data on Disk

  • A data element such as a tuple or object is represented by a record, which consists of consecutive bytes in some disk block.

Fixed Length Records

  • The Simplest record consists of fixed length fields.

  • The record begins with a header, a fixed-length region

    where information about the record itself is kept.

  • Fixed Length Record header

    1. A pointer to record schema.

    2. The length of the record.

  • This information helps us skip over records

  • without consulting the schema.

    3. A timestamp indicating when the record was created.

  • Also, record was last modified, or last read.

  • This information may be useful for implementing database transactions


CREATE TABLE employee(


address VARCHAR(255),

gender CHAR(1),

birthdate DATE


Packing Fixed Length Records into Blocks

Records are stored in blocks of the disk and moved into main memory when we need to access or update them.

A block header is written first and it is followed by series of blocks.


Along with the header we can pack as many record as we can in one block as shown in the figure and remaining space will be unused

Block header contains following information

Links to one or more other blocks that are part of a network blocks

Information about the role played by this block in such a network

Information about which relation the tuples of this block belong to.

A “directory” giving the offset of each round in the block

Timestamp(s) indicating the time of the block's last modification and / or access

Thank You

Variable Length Data and Records

- AshwinKalbhor

Class ID : 107


Records with Variable Length Fields

Records with Repeating Fields

Variable Format Records

Records that do not fit in a block

Example of a record




birth date

0 30 286 287 297

Records with Variable Length Fields

  • Simple and Effective way to represent variable length records is as follows –

    1. Fixed length fields are kept ahead of the variable length records.

    2. A header is put in front of the of the record.

    3. Record header contains

    • Length of the record

    • Pointers to the beginning of all variable length fields except the first one.


header information

record length

to address


birth date



Record with name and address as variable length field.

Records with repeating fields

Repeating fields simply means fields of the same length L.

All occurrences of Field F grouped together.

Pointer pointing to the first field F is put in the header.

Based on the length L the starting offset of any repeating field can be obtained.

Example of a record with Repeating Fields

other header information

record length

to address

to movie pointers



Movie star record with “movies” as the repeating field.

pointers to movies

Alternative representation

Record is of fixed length

Variable length fields stored on a separate block.

The record itself keeps track of -

1. Pointers to the place where each repeating field begins, and

2. Either how many repetitions there are, or where the repetitions end.

Storing variable length fields separately from the record.

Representing Null Values

Tuples often have fields that may be NULL. The record format of Fig. offers a convenient way to represent NULL values. If a field such as address is null,

then we put a null pointer in the place where the pointer to an address goes. Then, we need no space for an address, except the place for the pointer.

This arrangement can save space on average, even if address is a fixed-length field but frequently has the value NULL

Variable Format Records

  • Records that do not have fixed schema

  • Represented by sequence of tagged fields

  • Each of the tagged fields consist of information

    • Attribute or field name

    • Type of the field

    • Length of the field

    • Value of the field

Variable Format Records

code for name

code for restaurant owned

code for string type

code for string type






Clint Eastwood




Hog’s Breath Inn

Records that do not fit in a block

When the length of a record is greater than block size ,then record is divided and placed into two or more blocks

Portion of the record in each block is referred to as a RECORD FRAGMENT

Record with two or more fragments is called a SPANNED RECORD

Record that do not cross a block boundary is called UNSPANNED RECORD

Spanned Records

  • Spanned records require the following extra header information –

    • A bit indicates whether it is fragment or not

    • A bit indicates whether it is first or last fragment of a record

    • Pointers to the next or previous fragment for the same record

Spanned Records

block header

record header


2 - b


2 - a

record 1

record 3

block 1

block 2

Thank You.

13.8 Record Modifications


Lok Kei Leong ( 108 )


Record Insertion

Record Deletion

Record Update


Insert new records into a relation- records of a relation in no particular order- record of a relation in fixed order (e.g. sorted by primary key)

A pointer to a record from outside the block is a “structured address”

Fixed Order

There is more of a problem when the tuples must be kept in some fixed order, such as sorted by their primary key (e.g., see Section 14.1.1).

If we need to insert a new record, we first locate the appropriate block for that record.

What If The Block is Full?

  • If we need to insert the record in a particular block but the block is full. What should we do?

  • Find room outside the Block

  • There are 2 solutions

  • Find Space on Nearby Block

  • Create an Overflow Block

Insertion (solution 1)

Find space on a “nearby” block

Block B1 has no space

If space available on block B2 move records of B1 to B2

If there are external pointers to records of B1 moved to B2 leave forwarding address in offset table of B1

Insertion (solution 2)

Create an overflow block

Each block B has its header pointer to an overflow block where additional blocks of B can be placed


Slide around the block

Cannot slide records- maintain an available-space list in the block headerto keep track of space available

Avoid dangle or wind up pointing to a new record


  • What about pointer to deleted records ?

  • A tombstone is placed in place of each deleted record

  • A tombstone is a bit placed at first byte of deleted record to indicate the record was deleted ( 0 – Not Deleted 1 – Deleted)

  • A tombstone is permanent


For Fixed-Length Records, there is no effect on the storage system

For variable length records:

associated with insertion and deletion(never create a tombstone for old record)

Longer updated recordcreate more space on its block- sliding records - create an overflow block



Query ExecutionSection 15.1

Sweta Shah

CS257: Database SystemsID: 118


  • Query Processor

  • Query compilation

  • Physical Query Plan Operators

    • Scanning Tables

      • Table Scan

      • Index scan

  • Sorting while scanning tables

  • Model of computation for physical operators

  • Parameters for measuring cost

  • Iterators

Query Processor

  • The Query Processor is a group of components of a DBMS that turns user queries and data-modification commands into a sequence of database operations and executes those operations

  • Query processor is responsible for supplying details regarding how the query is to be executed

The major parts of the query processor

Query compilation

  • Query compilation itself is a multi-step process consisting of :

    • Parsing: in which a parse tree representing query and its structure is constructed

    • Query rewrite: in which the parse tree is converted to aninitial query plan

    • Physical plan generation: where the logical query plan is turned into a physical query plan by selecting algorithms.

Outline of query compilation

Physical Query Plan Operators

  • Physical query plans are built from operators

  • Each of the operators implement one step of the plan.

  • They are particular implementations for one of the operators of relational algebra.

  • They can also be non relational algebra operators like “scan” which scans tables.

Scanning Tables

  • One of the most basic things in a physical query plan.

  • Necessary when we want to perform join or union of a relation with another relation.

Two basic approaches to locating the tuples of a relation R

  • Table-scan

    • Relation R is stored in secondary memory with its tuples arranged in blocks

    • it is possible to get the blocks one by one

    • This operation is called Table Scan

Two basic approaches to locating the tuples of a relation R

  • Index-scan

    • there is an index on any attribute of Relation R

    • Use this index to get all the tuples of R

    • This operation is called Index Scan

Sorting While Scanning Tables

  • Why do we need sorting while scanning?

    • the query could include an ORDER BY clause requiring that a relation be sorted

    • Various algorithms for relational-algebra operations require one or both of their arguments to be sorted relation

    • Sort-scantakes a relation R and a specification of the attributes on which the sort is to be made, and produces R in that sorted order

Model of Computation for Physical Operators

  • Choosing physical plan operators wisely is an essential for a good query processor.

  • Cost for an operation is measured in number of disk i/o operations.

  • If an operator requires the final answer to a query to be written back to the disk, the total cost will depend on the length of the answer and will include the final write back cost to the total cost of the query.

Improvements in cost

  • Major improvements in cost of the physical operators can be achieved by avoiding or reducing the number of disk i/o operations

  • This can be achieved by passing the answer of one operator to the other in the main memory itself without writing it to the disk.

Parameters for Measuring Costs

  • Parameters that affect the performance of a query

    • Buffer space availability in the main memory at the time of execution of the query

    • Size of input and the size of the output generated

    • The size of memory block on the disk and the size in the main memory also affects the performance

Iterators for Implementation of Physical Operators

  • Many physical operators can be implemented as an iterator

  • It is a group of three functions that allows a consumer of the result of the physical operator to get the result one tuple at a time


  • The three functions forming the iterator are:


  • This function starts the process of getting tuples.

  • It initializes any data structures needed to perform the operation



  • This function returns the next tuple in the result

  • Adjusts data structures as necessary to allow subsequent tuples to be obtained

  • If there are no more tuples to return, GetNext returns a special value NotFound



  • This function ends the iteration after all tuples

  • it calls Close on any arguments of the operator

  • Open()

  • Getnext()

  • Close()

Thank You !!!

Query Execution

One-pass algorithm for database operations

Chetan Sharma



  • One-Pass Algorithm

  • One-Pass Algorithm Methods:

  • Tuple-at-a-time, unary operations.

  • Full-relation, unary operations.

  • Full-relation, binary operations.


While many algorithms for operators have been proposed,

they largely fall into three classes:

1. Sorting-based methods (Section 15.4).

2. Hash-based methods (Sections 15.5 and 20.1).

3. Index-based methods (Section 15.6).

One-Pass Algorithm

Reading the data only once from disk.

Usually, they require at least one of the arguments to fit in main memory


These operations do not require an entire relation, or even a large part of it, in memory at once. Thus, we can read a block at a time, use one main memorybuffer, and produce our output.

Ex- selection and projection


A selection or projection being performed on a relation R

Full-relation, unary operations

These one-argument operations require

seeing all or most of the tuples in memory

at once,

so one-pass algorithms are limited to relations that are approximately of size M (the number of main-memory buffers available) or less.

Ex - The grouping operator

-The duplicate-elimination operator.

Full-relation, unary operations

Managing memory for a one-pass duplicate-elimination


A grouping operation gives us zero or more grouping attributes and presumably one or more aggregated attributes. If we create in main memory one entry for each group — that is, for each value of the grouping attributes — then we can scan the tuples of R, one block at a time.

Ex- MIN(a) , MAX(a) , COUNT , AVG(a)

Full-relation, binary operations

All other operations are in this class:

set and bag versions of union,




and products.

Except for bag union, each of these operations requires at least one argument to be limited to size M, if we are to use a one-pass algorithm

Full-relation, binary operations examples

Set Union:

-We read S into M - 1 buffers of main memory and build a search structure where the search key is the entire tuple.

-All these tuples are also copied to the output.

-Read each block of R into the Mth buffer, one at a time.

-For each tuple t of R, see if t is in S, and if not, we copy t to the output. If t is also in S, we skip t.

Set Intersection :

-Read S into M - 1 buffers and build a search structure with full tuples as the search key.

-Read each block of R, and for each tuple t of R, see if t is also in S. If so, copy t to the output, and if not, ignore t.



Nested Loops Joins

Book Section of chapter 15.3

Guide: Prof. Dr. T.Y. LIN

Name: Sanya Valsan

Roll: 120

Topic to be covered

Tuple-Based Nested-Loop Join

An Iterator for Tuple-Based Nested-Loop Join

A Block-Based Nested-Loop Join Algorithm

Analysis of Nested-Loop Join


Nested – loop joins are, in a sense, one and a half passes since in each variation one of the two arguments has its tuples read only once, while the other argument has to be read repeatedly.

Nested-loop joins can be used for relations of any size; it is not necessary that one relation fit in main memory.

15.3.1 Tuple-Based Nested-Loop Join

The simplest variation of nested-loop join has loops that range over individual tuples of the relations involved. In this algorithm, which we call tuple-based nested-loop join, we compute the join as follows




  • For each tuple s in S DO

    For each tuple r in R Do

    if r and s join to make a tuple t THEN

    output t;

    • If we are careless about how to buffer the blocks of relations R and S, then this algorithm could require as many as T(R)T(S) disk I/O’s. There are many situations where this algorithm can be modified to have much lower cost.

      Note: T(R) – number of tuples in relation R.


One improvement looks much more carefully at the way tuples of R and S are divided among blocks, and uses as much of the memory as it can to reduce the number of disk I/O's as we go through the inner loop. We shall consider this block-based version of nested-loop join.

15.3.2 An Iterator for Tuple-Based Nested-Loop Join

  • Open() {

    • R.Open();


    • A:=S.getnext();


      GetNext() {

      Repeat {

      r:= R.Getnext();

      IF(r= Not found) {/* R is exhausted for the current s*/



IF( s= Not found) RETURN Not Found;

/* both R & S are exhausted*/


r:= R.Getnext();



until ( r and s join)

RETURN the join of r and s;


Close() {

R.close ();

S.close ();


15.3.3 A Block-Based Nested-Loop Join Algorithm

We can Improve Nested loop Join by compute R join S.

  • Organizing access to both argument relations by blocks.

  • Using as much main memory as we can to store tuples belonging to the relation S, the relation of the outer loop.

The nested-loop join algorithm

FOR each chunk of M-1 blocks of S DO BEGIN

read these blocks into main-memory buffers;

organize their tuples into a search structure whose

search key is the common attributes of R and S;

FOR each block b of R DO BEGIN

read b into main memory;

FOR each tuple t of b DO BEGIN

find the tuples of S in main memory that

join with t ;

output the join of t with each of these tuples;




15.3.4 Analysis of Nested-Loop Join

Assuming S is the smaller relation, the number of chunks or iterations of outer loop is B(S)/(M - 1). At each iteration, we read hf - 1 blocks of S and

B(R) blocks of R. The number of disk I/O's is thus

B(S)/M-1(M-1+B(R)) or B(S)+B(S)B(R)/M-1

Assuming all of M, B(S), and B(R) are large, but M is the smallest of these, an approximation to the above formula is B(S)B(R)/M. That is, cost is proportional to the product of the sizes of the two relations, divided by the amount of available main memory.


  • B(R) = 1000, B(S) = 500, M = 101

    • Important Aside: 101 buffer blocks is not as unrealistic as it sounds. There may be many queries at the same time, competing for main memory buffers.

  • Outer loop iterates 5 times

  • At each iteration we read M-1 (i.e. 100) blocks of S and all of R (i.e. 1000) blocks.

  • Total time: 5*(100 + 1000) = 5500 I/O’s

  • Question: What if we reversed the roles of R and S?

  • We would iterate 10 times, and in each we would read 100+500 blocks, for a total of 6000 I/O’s.

  • Compare with one-pass join, if it could be done!

  • We would need 1500 disk I/O’s if B(S) x M-1


  • The cost of the nested-loop join is not much greater than the cost of a one-pass join, which is 1500 disk 110's for this example.

  • Nested-loop join is mostly not the most efficient join algorithm.

Summary of the topic

In this topic we have learned about how the nested tuple Loop join are used in database using query execution and what is the process for that.

Any Questions


Two Pass Algorithm Based On Sorting

Section 15.4

CS257 Spring2013


Class ID : 131


Two-Pass Algorithms

Two-Phase, Multiway Merge-Sort

Duplicate Elimination Using Sorting

Grouping and Aggregation Using Sorting

A Sort-Based Union Algorithm

Sort-Based Intersection and Difference

A Simple Sort-Based Join Algorithm

A More Efficient Sort-Based Join

Two-Pass Algorithms

Data from operand relation is read into main memory, processed, written out to disk again, and reread from disk to complete the operation.

we concentrate on

two-pass algorithms because:

a) Two passes are usually enough, even for very large relations.

b) Generalizing to more than two passes is not hard;

15.4.1 Two-Phase, Multiway Merge-Sort

To sort very large relations in two passes.

Phase 1: Repeatedly fill the M buffers with new tuples from R and sort them, using any main-memory sorting algorithm. Write out each sorted sublist to secondary storage.

Phase 2 : Merge the sorted sublists. For this phase to work, there can be at most M — 1 sorted sublists, which limits the size of R. We allocate one input block to each sorted sublist and one block to the output.


Find the smallest key

Move smallest element to first available position of output block.

If output block full -write to disk and reinitialize the same buffer in main memory to hold the next output block.

If this block -exhausted of records, read next block from the same sorted sub­list into the same buffer that was used for the block just exhausted.

If no blocks remain- stop.

15.4.2 Duplicate Elimination Using Sorting

Same as previous…

Instead of sorting on the second pass, -repeatedly select first unconsidered tuple t among all sorted sub lists.

Write one copy of t to the output and eliminate from the input blocks all occurrences of t.

Output - exactly one copy of any tuple in R.

15.4.3 Grouping and Aggregation Using Sorting

Read the tuples of R into memory, M blocks at a time. Sort the tuples in each set of M blocks, using the grouping attributes of L as the sort key. Write each sorted sublist to disk.

Use one main-memory buffer for each sublist, and initially load the first block of each sublist into its buffer.

Repeatedly find the least value of the sort key present among the first available tuples in the buffers.

15.4.4 A Sort-Based Union Algorithm

In the first phase, create sorted sublists from both R and S.

Use one main-memory buffer for each sublist of R and S. Initialize each with the first block from the corresponding sublist.

Repeatedly find the first remaining tuple t among all the buffers

15.4.5 Sort-Based Intersection and Difference

For both set version and bag version, the algorithm is same as that of set-union except that the way we handle the copies of a tuple t at the fronts of the sorted sub lists.

For set intersection -output t if it appears in both R and S.

For bag intersection -output t the minimum of the number of times it appears in R and in S.

For set difference -output t if and only if it appears in R but not in S.

For bag difference-output t the number of times it appears in R minus the number of times it appears in S.

15.4.6 A Simple Sort-Based Join Algorithm

Given relations R(X, Y) and S(Y, Z) to join, and given M blocks of main memory for buffers

Sort R, using 2PMMS, with Y as the sort key

Sort S similarly

Merge the sorted R and S, use only two buffers

15.4.8 A More Efficient Sort-Based Join

  • If we do not have to worry about very large numbers of tuples with a common value for the join attribute(s), then we can save two disk 1/0's per block by combining the second phase of the sorts with the join itself

  • To compute R(X, Y) ►◄ S(Y, Z) using M main-memory buffers

    • Create sorted sublists of size M, using Y as the sort key, for both R and S.

    • Bring the first block of each sublist into a buffer

  • Repeatedly find the least Y-value y among the first available tuples of all the sublists. Identify all the tuples of both relations that have Y-value y. Output the join of all tuples from R with all tuples from S that share this common Y-value

Thank you

Two-Pass Algorithms Based on Hashing

Chapter – 15.5

Cs 257

Id 131 Swapna vemparala



Partitioning Relations by Hashing

A Hash-Based Algorithm for Duplicate Elimination

Hash-Based Grouping and Aggregation

Hash-Based Union, Intersection, and Difference

The Hash-Join Algorithm

Saving Some Disk I /O ’s

Differences between sort-based and corresponding hashbased algorithms


The essential idea behind all these previous algorithms is as follows:

If the data is too big to store in main-memory buffers, hash all the tuples of

the argument or arguments using an appropriate hash key.

For all the common operations, there is a way to select the hash key so all the tuples that need to be considered together when we perform the operation fall into the same bucket.

We then perform the operation by working on one bucket at a time (or on

a pair of buckets with the same hash value, in the case of a binary operation).

In effect, we have reduced the size of the operand(s) by a factor equal to the

number of buckets, which is roughly M.

15.5.1 Partitioning Relations by Hashing

Take a relation R and, using M buffers, partition R into M — 1 buckets of roughly equal size.

assume that h is the hash function, and that h takes complete tuples of R as its argument

associate one buffer with each bucket

The last buffer holds blocks of R , one at a time. Each tuple t in

the block is hashed to bucket h(t) and copied to the appropriate buffer.

If that buffer is full, we write it out to disk, and initialize another block for the same bucket.

At the end, we write out the last block of each bucket if it is not empty.


15.5.2 A Hash-Based Algorithm for Duplicate Elimination

We shall now consider the details of hash-based algorithms for the various operations of relational algebra that might need two-pass algorithms.

First, consider duplicate elimination, that is, the operation S(R). We hash R to M — 1 buckets, two copies of the same tuple t will hash to the same bucket.

Thus, we can examine one bucket at a time, perform <5 on that bucket in isolation, and take as the answer the union of S(Ri), where Riis the portion of R that hashes to the ith bucket.

The one-pass algorithm eliminates duplicates from each Riin turn and write out the resulting unique tuples

This method will work as long as the individual R i’s are sufficiently small to fit in main memory and thus allow a one-pass algorithm.

Since we may assume the hash function h partitions R into equal-sized buckets, each Riwill be approximately B(R)/(M — 1) blocks in size.

If that number of blocks is no larger than M , i.e., B(R) < M(M — 1), then the two-pass, hash-based algorithm will work.

Thus, a conservative estimate (assuming M and M — 1 are essentially the same) is B(R) < M 2, exactly as for the sort-based, two-pass algorithm for 6.

The number of disk I/O ’s is also similar to that of the sort-based algorithm.

We read each block of R once as we hash its tuples, and we write each block of each bucket to disk. We then read each block of each bucket again in theone-pass algorithm that focuses on that bucket.

Thus, the total number of disk I/O ’s is 3B(R).

15.5.3 Hash-Based Grouping and Aggregation

To perform the 7 l ( R ) operation, we again start by hashing all the tuples of R to M — 1 buckets.

However, in order to make sure that all tuples of the same group wind up in the same bucket, we must choose a hash function that depends only on the grouping attributes of the list L.

Having partitioned R into buckets, we can then use the one-pass algorithm for 7 to process each bucket in turn.

For S, we can process each bucket in main memory provided B{R) < M 2.

However, on the second pass, we need only one record per group as we process each bucket

.Thus, even if the size of a bucket is larger than M, we can handle the bucket in one pass provided the records for all the groups in the bucket take no more than M buffers.

As a consequence, if groups are large, then we may actually be able to handle much larger relations R than is indicated by the B(R) < M 2 rule.

On the other hand, if M exceeds the number of groups, then we cannot fill all buckets.

Thus, the actual limitation on the size of R as a function of M is complex, but B(R) < M 2 is a conservative estimate.

Finally, we observe that the number of disk I/O ’s for 7 , as for 8, is 3B(R).

15.5.4 Hash-Based Union, Intersection, and Difference

When the operation is binary, use the same hash function to hash tuples of both arguments. For example, to compute R Us 5, we hash both R and S to M — 1 buckets each, say i? i, -R2, - - • , R m - 1 and S i,5 2, • • • , S m - 1.

We then take the set-union of Riwith Si for all i, and output the result. Notice that if a tuple t appears in both R and S, then for some iwe shall find t in both Riand Si.

Thus, when we take the union of these two buckets, we shall output only one copy of t , and there is no possibility of introducing duplicates into the result.


To take the intersection or difference of R and S, we create the 2(M — 1) buckets exactly as for set-union and apply the appropriate one-pass algorithm to each pair of corresponding buckets.

Notice that all these one-pass algorithms require B(R) -I- B(S) disk I/O ’s.

To this quantity we must add the two disk I/O ’s per block that are necessary to hash the tuples of the two relations and store the buckets on disk, for a total of 3 (B{R) + 5 (5 )) disk I/O ’s.

In order for the algorithms to work, we must be able to take the one-pass union, intersection, or difference of Riand Si, whose sizes will be approximately B(R)/(M - 1) and B(S)/(M - 1), respectively.

Recall that the onepass algorithms for these operations require that the smaller operand occupies at most M — 1 blocks.

Thus, the two-pass, hash-based algorithms require that m in(B(R),B(S)) < M 2, approximately.

15.5.5 The Hash-Join Algorithm

To compute R{X, Y) txS(Y, Z) using a two-pass, hash-based algorithm, we act almost as for the other binary operations

The only difference is that we must use as the hash key just the join attributes,Y.

Then we can be sure that if tuples of R and S join, they will wind up in corresponding buckets Riand Si for some i.

A one-pass join of all pairs of corresponding buckets completes this algorithm, which we call hash-join.

15.5.6 Saving Some Disk I /O ’s

If there is more memory available on the first pass than we need to hold one block per bucket, then we have some opportunities to save disk I/O ’s.

One option is to use several blocks for each bucket, and write them out as a group, in consecutive blocks of disk.

Strictly speaking, this technique doesn’t save disk I/O ’s, but it makes the I/O ’s go faster, since we save seek time and rotational latency when we write.

Effective ,method called hybrid hash-join, works as follows.

In general, suppose we decide that to join R t x S, with S the smaller relation, we need to create k buckets, where k is much less than M, the available memory. When we hash S, we can choose to keep m of the k buckets entirely in main memory, while keeping only one block for each of the other k — m buckets. We can manage to do so provided the expected size of the buckets in memory, plus one block for each of the other buckets, does not exceed M ; that is:

m B ( S ) / k + k — m < M

expected size of a bucket is B ( S )/k, and there are m buckets in memory.

Now, when we read the tuples of the other relation, R, to hash that relation into buckets, we keep in memory:

1. The rnbuckets of 5 that were never written to disk, and

2. One block for each of the k — m buckets of R whose corresponding buckets of 5 were written to disk.

If a tuple t of R hashes to one of the first m buckets, then we immediately join it with all the tuples of the corresponding 5-bucket, as if this were a onepass, hash-join.

It is necessary to organize each of the in-memory buckets of 5 into an efficient search structure to facilitate this join, just as for the one-pass hash-join.

If t hashes to one of the buckets whose corresponding 5-bucket is on disk, then t is sent to the main-memory block for that bucket, and eventually migrates to disk, as for a two-pass, hash-based join.

On the second pass, we join the corresponding buckets of R and 5 as usual.

However, no need to join the pairs of buckets for which the 5-bucket was left in memory; these buckets already been joined

The savings in disk I/O ’s is equal to two for every block of the buckets of 5 that remain in memory, and their corresponding ft-buckets.

Since m / k of the buckets are in memory, the savings is 2(m/k)(B(R) + B(S)).

The intuitive justification is that all but k — m of the main-memory buffers can be used to hold tuples of 5 in main memory, and the more of these tuples, the fewer the disk I/O ’s.

Thus, we want to minimize k, the total number of buckets.

We do so by making each bucket about'as big as can fit in main memory; that is, buckets are of size M, and therefore k = B (S ) /M .

If that is the case, then there is only room for one bucket in the extra main memory; i.e., m — 1.

In fact, we really need to make the buckets slightly smaller than B( S ) /M , or else we shall not quite have room for one full bucket and one block for the other k — 1 buckets in memory at the same time.

Assuming, for simplicity, that k is about B ( S ) / M and m = 1, the savings in disk I/O ’s is 2 M ( B ( R ) + B { S ) ) / B { S ) and the total cost is (3 — 2M / B ( S ) ) (B ( R ) + B(S)).

15.5.7 Summary of Hash-Based Algorithms

Differences between sort-based and corresponding hashbased algorithms

1. Hash-based algorithms for binary operations have a size requirement that depends only on the smaller of two arguments rather than on the sum of the argument sizes, that sort-based algorithms require.

2. Sort-based algorithms sometimes allow us to produce a result in sorted order and take advantage of that sort later.

The result might be used in another sort-based algorithm for a subsequent operator, or it could be the answer to a query that is required to be produced in sorted order.

3. Hash-based algorithms depend on the buckets being of equal size.

there is generally at least a small variation in size, it is not possible to use buckets that

4. In sort-based algorithms, the sorted sublists may be written to consecutive blocks of the disk

5. Moreover, if M is much larger than the number of sorted sublists, then we may read in several consecutive blocks at a time from a sorted sublist, again saving some latency and seek time.

6 . On the other hand, if we can choose the number of buckets to be less than M in a hash-based algorithm, then we can write out several blocks of a bucket at once.

Thank you

15.6 Index Based Algorithms

By: Tomas Tupy (123)


  • Terminology

  • Clustered Indexes

    • Example

  • Non-Clustered Indexes

  • Index Based Selection

  • Joining Using an Index

  • Join Using a Sorted Index

Index-based algorithms are especially useful for the selection operator, but algorithms for join and other binary operators also use indexes to very good advantage.

What is an Index?

A data structure which improves the speed of data retrieval ops on a relation, at the cost of slower writes and the use of more storage space.

Enables sub-linear time lookup.

Data is stored in arbitrary order, while logical ordering is achieved by using the index.

Terminology Recap

B(R)– Number of blocks needed to hold R

T(R) – Number of tuples in R

V(R,a) – Number of distinct values of the column for a in R.

Clustered Relation – Tuples are packed into as few blocks as possible.

Clustered Indexes – Indexes on attribute(s) such that all tuples with a fixed value for the search key appear on a few blocks as possible.

Clustering Indexes

A relation is clustered if its tuples are packed into relatively few blocks.

Clustering indexes are indexes on an attribute or attributes such that all the tuples with a fixed value for the search key of this index appear in as little blocks as possible.

Tuples are stored to match the index order.

A relation that isn’t clustered cannot have a clustering index.

Clustering Indexes

Let R(a,b) be a relation sorted on attribute a.

Let the index on a be a clustering index.

Let a1be a specific value for a.

A clustering index has all tuples with a fixed value packed into minumum # of blocks.

a1 a1

a1 a1a1a1

a1 a1a1a1


All the a1 tuples


  • Pros

    • Faster reads for particular selections

  • Cons

    • Writing to a table with a clustered index can be slower since there might be a need to rearrange data.

    • Only one clustered index possible.

Clustered Index Example

  • Problem: We want to quickly retrieve all orders for a particular customer.

  • How do we do this?

Clustered Index Example

Solution: Create a clustered index on the “CustomerID” column of the Order table.

Now the tuples with the same CustomerID will be physically stored closed to one another on the disk.

Non Clustered Indexes

There can be many per table

Quicker for insert and update operations.

The physical order of tuples is not the same as index order.

Index Based Algorithms

Especially useful for the selection operator.

Join and other binary operators also benefit.

Index-Based Selection

  • No index

    • Without an index on relation R, we have to read all the tuples in order to implement selection oC(R), and see which tuples match our condition C.

    • What is the cost in terms of disk I/O’s to implement oC(R)? (For both clustered and non-clustered relations)

Index-Based Selection

  • No index

    • Answer:

      • B(R) if our relation is clustered

      • Upto T(R) if relations in not-clustered.

Index-Based Selection

Let us consider an index on attribute a where our condition C is a = v.


In this case we just search the index for value v and we get pointers to exactly the tuples we need.

Index-Based Selection

Let’s say for our selection oa=v(R), our index is clustering.

What is the cost in the # of disk I/O’s to retrieve the set oa=v(R)?

Index-Based Selection

  • Answer

    • the average is: B(R) / V(R,a)

  • A few more I/Os:

    • Index might not be in main memory

    • Tuples which a = v might not be block aligned.

    • Even if clustered, might not be packed as tight as possible. (Extra space for insertion)

Index-Based Selection

Non-clustering index for our selection oa=v(R)

What is the cost in the # of disk I/O’s to retrieve the set oa=v(R)?

Index-Based Selection

  • Answer

    • Worst case is: T(R) / V(R,a)

    • This can happen if tuples live in different blocks.

Joining by Using an Index(Algorithm 1)

Consider natural join: R(X,Y) |><| S(Y,Z)

Suppose S has and index on attribute Y.

Start by examining each block of R, and within each block consider each tuple t, where tY is a component of t corresponding to the attribute Y.

Now we use the index to find tuples of S that have tY in their Y component.

These tuples will create the join.

Joining by Using an Index(Algorithm 1) Analysis

Consider R(X,Y) |><| S(Y,Z)

If R is clustered, then we have to read B(R) blocks to get all tuples of R. If R is not clustered then up to T(R) disk I/O’s are required.

For each tuple t of R, we must read an average of T(S) / V(S,Y) tuples of S.

Total: B(R)T(S) / V(S,Y) for clustered index, and T(R)T(S) / V(S,Y) for non-clustered index.

Join Using a Sorted Index

Consider R(X,Y) |><| S(Y,Z)

Data structures such as B-Trees provide the best sorted indexes.

In the best case, if we have sorting indexes on Y for both R and S then we perform only the last step of the simple sort-based join. (Sometimes called zig-zag join)

Join Using a Sorted Index(Zig-zag join)

Consider R(X,Y) |><| S(Y,Z) where we have indexes on Y for both R and S.

Tuples from R with a Y value that does not appear in S never need to be retrieved, and vice-versa…

Index on Y in R

Index on Y in S

Thank You!


Buffer Management

By Snigdha Rao Parvatneni

SJSU ID: 008648978

Roll Number: 124

Course: CS257



Role of Buffer Management

Architecture of Buffer Management

Buffer Management Strategies

Relation Between Physical Operator Selection And Buffer Management



We assume that operators in relations have some main memory buffers, to store the needed data.

It is very rare that these buffers are allocated in advance, to the operator.

Task of assigning main memory buffers to process is given to Buffer Manager.

Buffer manager is responsible for allocating main memory to the process as per the need and minimizing the delays and unsatisfiable requests.

Role of Buffer Manager

Buffer Manager responds to the request of main memory access to disk blocks. Below picture depicts it.

Architecture of Buffer Management

  • There are two broad architectures for a buffer manager:

    • Buffer manager controls main memory directly like in many RDBMS.

    • Buffer manager allocates buffers in virtual memory and let OS decide which buffers should be in main memory and which buffer should be in OS managed disk swap space like in many Object Oriented DBMS and main memory DBMS.


  • Irrespective of approach there is a problem that buffer manager has to limit number of buffers, to fit in available main memory.

    • In case where buffer manager controls main memory directly

      • If requests exceeds available space then buffer manager has to select a buffer to empty by returning its contents to disk.

      • When blocks have not been changed then they are imply erased from main memory. But, when blocks have been changed then they are written back to its place on disk.

    • In case where Buffer manager allocates space in virtual memory

      • Buffer manager has the option of allocating more buffers, which can actually fit into main memory. When all these buffers will be in use then there will be thrashing.

      • It is an operating system problem where many blocks are moved in and out of disk’s swap space. Therefore, system will end up spending most of time in swapping blocks and getting very little work done.


To resolve this problem When DBMS is initialized then the number of buffers is set.

User need not worry about mode of buffering used.

For users there is a fixed size buffer pool, in other words set of buffers are available to query and other database actions.

Buffer Management Strategies

  • Buffer Manager needs to make a critical choice of which block to keep and which block to discard when buffer is needed for newly requested blocks.

  • Then buffer manager uses buffer replacement strategies. Some common strategies are –

    • Least-Recently Used (LRU)

    • First-In-First-Out (FIFO)

    • The Clock Algorithm (Second Chance)

    • System Control

Last-Recently Used (LRU)

  • This rule is to throw out the block which has not been read or written from long time.

  • To do this the Buffer Manager needs to maintain a table which will indicate the last time when block in each buffer was accessed.

  • It is also needed that each database access should make an entry in this table. Significant amount of is involved effort in maintaining this information.

  • Buffers which are not user from long time is less likely to be accessed before than those buffers which have been accessed recently. Hence, It is an effective strategy.

First-In-First-Out (FIFO)

  • In this rule, when buffer is needed then the buffer which has been occupied for longest by same block is emptied and used by new block.

  • To do this Buffer Manager needs to know only the time at which block occupying the buffer was loaded into the buffer.

  • Entry in the table is made when block is read from disk, not every time it is accessed.

  • Involves less maintenance than LRU but it is more prone to mistakes.

The Clock Algorithm

It is an efficient approximation of LRU and is commonly implemented.

Buffers are treated to be arranged in circle where arrow points to one of the buffers. Arrow will rotate clockwise if it needs to find a buffer to place a disk block.

Each buffer has an associated flag with value 0 or 1. Buffers with flag vale 0 are vulnerable to content transfer to disk whereas buffer with flag value 1 are not vulnerable.

Whenever block is read into buffer or contents of buffer are accessed, flag associated with it is set to 1.

Working of Clock’s Algorithm

Whenever buffer is needed for the

block arrow looks for first 0 it

can find in clockwise direction.

Arrow move changes flag value

from 1 to 0.

Block is thrown out of buffer only

when it remains unaccessed i.e.

flag value 0 for the time

between two rotations of the arrow.

First rotation when flag is set from

1 to 0 and second rotation when

arrow comes back to check flag value.

System Control

Query processor and other DBMS components can advice buffer manager to avoid some mistake which occurs with LRU, FIFO or Clock.

Some blocks cannot be moved out of main memory without modifying other blocks pointing to it. Such blocks are called pinned blocks.

Buffer Manager needs to modify buffer replacement strategy, to avoid expelling pinned blocks. That’s why some blocks are remains in main memory even though there is no technical reason for not writing it to the disk.

Relation Between Physical Operator Selection And Buffer Management

  • Physical operator is selected by query optimizer to execute the query. This selection may assume that certain number for the available buffers, for execution of these operators.

  • But as we know that Buffer Manager does not guarantee availability of certain number of buffers when query is executed.

  • Now two question arises

    • Can an algorithm adapt to changes in number of available main memory buffers?

    • When expected number of available buffers are not available and certain blocks are moved to disk which were expected in main memory, then how buffer replacement strategy of Buffer Manager impact the number of I/O performed?


Block based nested loop join – algorithm does not depends upon number of available buffers, M but performance depends.

For each M-1 blocks of outer loop relation S, read blocks in main memory, organize the tuple into search structure whose key is common attribute of R and S.

Now for each block b of R, read b into main memory and for each tuple t of b find tuples in main memory for S that join with t.

Here outer loop number M-1 depends upon average number of buffers available at each iteration and the outer loop uses M-1 buffers and 1 is reserved for relation of inner loop, block of R.

If we pin M-1 block that we use for S in one iteration of outer loop then we cannot loose these buffers during this round. In addition to this more buffers will become available and more than one block of R needs to be kept in the memory. Will it improve the running time?

Cases with LRU

  • Case1

    • When LRU is used as buffer replacement strategy then k buffers will be available to hold blocks of R.

    • R is read in order such that blocks that remains in the buffer at the end of iteration of outer look will be last k blocks of R.

    • For next iteration we will again start from beginning of R. Therefore, k buffers for R needs to be replace.

  • Case2

    • With better implementation of nested loop join when LRU is used visit blocks of R in order that alternates first to last then last to first.

    • In this we save k disk I/O on each iteration except first iteration.

With Other Algorithms

Other algorithms also are impacted by the fact that availability of buffer can vary and by the buffer-replacement strategy used by the buffer manager.

In sort based algorithm when availability of buffers reduces we can change the size of a sub-lists. Major limitation of this is we will be forced to create many sub-lists that we cannot then allocate a buffer for each sub-list in the merging process.

In hash based algorithm when availability of buffers reduces we can reduce the size of buckets, provided bucket then should not become so large that they do not fit into the allotted main memory.


DATABASE SYSTEMS: The Complete Book, Second Edition by Hector Garcia-Molina, Jeffrey D. Ullman & Jennifer Widom

Thank You

15.8 Algorithms using more than two passes

Presented By: Seungbeom Ma (ID 125)

Professor: Dr. T. Y. Lin

Computer Science Department

San Jose State University

Multipass Algorithms

  • Previously , most of algorithms are required two passes.

  • There is a case that we need more than two passes.

  • Case : Data is too big to store in main memory.

    • We have to hash or sort the relation with multipass algorithms.


1. Multipass Sort-Based Algorithm

2.Multipass Hash-Based Algorithm

Multipass sort-based algorithm.

M: Number of Memory Buffers

R: Relation

B(R) : Number of blocks for holding relation.


1. If R fits in M block (B (R) <= M).

2. Reading R into main memory.

3. Sorting R in the main memory with any sorting algorithm.

4. Write the sorted relation to disk.

Multipass sort-based algorithm.


1. If R does not fit into main memory then partitioning the blocks hold R into M groups, which call R1, R2, …, RM

2.Recursively sorting Ri from i =1 to M

3.Once sorting is done, the algorithm merges the M sorted sub-lists.

Performance: Multipass Sort-Based Algorithms

1) Each pass of a sorting algorithm:

1.Reading data from the disk.

2. Sorting data with any sorting algorithms

3. Writing data back to the disk.

2-1) (k)-pass sorting algorithm needs

2k B(R) disk I/O’s

2-2)To calculate (Multipass)-pass sorting algorithm needs

= > A+ B

A: 2(K-1 ) (B(R) + B(S) ) [ disk I/O operation to sort the sublists]

B: B(R) + B(S)[ disk I/O operation to read the sorted the sublists in the final pass]

Total: (2k-1)(B(R)+B(S)) disk I/O’s

Multipass Hash-Based Algorithms

  • 1. Hashing the relations into M-1 buckets, where M is number of memory buffers.

  • 2. Unary case:

  • It applies the operation to each bucket individually.

  • 1.Duplicate elimination (δ) and grouping (γ).

    • 1) Grouping:Min, Max, Count , Sum , AVG , which can group the data in the table

    • 2) Duplicate elimination: Distinct


      If the relation fits in M memory block,

      -> Reading relation into memory and perform the operations.

  • 3. Binary case: It applies the operation to each corresponding pair of buckets.

  • Query operations: union, intersection, difference , and join

    • If either relations fits in M-1 memory blocks,

    • -> Reading that relation into main memory M-1 blocks

    • -> Reading next relation to 1 block at a time into the Mth block

    • Then performing the operations.


  • If Unary and Binary relation does not fit into the main memory buffers.

    • Hashing each relation into M-1 buckets.

    • Recursively performing the operation on each bucket or corresponding pair of buffers.

    • Accumulating the output from each buckets or pair.

Hash-Based Algorithms : Unary Operatiors

Perfermance: Hash-Based Algorithms

R: Realtion.

Operations are like δ and γ

M: Buffers

U(M, k): Number of blocks in largest relation with k-pass hashing algorithm.



1. Assuming that the first step divides relation R into M-1 equal buckets.

2. The buckets for the next pass must be small enough to handle in k-1 passes

3.Since R is divided into M-1 buckets , we need to have (M-1)u(M, k-1).

Sort-Based VS Hash-Based

1. Sort-based can produce output in sorted order. It might be helpful to reduce rotational latency or seek time

2. Hash-based depends on buckets being of equal size. For binary operations, hash-based only limits size of smaller relation. Therefore, hash-based can be faster than sort-based for small size of relation.

Sort-based can produce output in sorted order.

Hash-based depends on buckets being of equal size.


David LeCS257, ID: 126Feb 28, 2013




  • QueryProcessing

  • Outline of Query Compilation

  • Table Scanning

  • CostMeasures

  • Review of Algorithms

    • One-passMethods

    • Nested-LoopJoin

    • Two-pass

      • Sort-based

      • Hash-based

    • Index-based

    • Multi-pass


  • Queryiscompiled. This involves extensive optimizationusingoperations of relationalalgebra.

  • First compiledinto a logicalquery plans, e.g. using expressions of relationalalgebra.

  • Thenconverted to a physicalquery plan such as selectingimplementation for eachoperator, ordering joins and etc.

  • Queryisthenexecuted.


Query Compilation

query plan

Query Execution



Outline of Query Compilation

  • Parsing: A parse tree for the query is constructed.

  • Query Rewrite: The parse tree is converted to an initial query plan and transformed into logical query plan.

  • Physical Plan Generation: Logical plan is converted into physical plan by selecting algorithms and order of executions.

SQL query

Parse query



Select logical plan



logical query

plan tree

Select physical plan

physical query

plan tree

Execute plan

Table Scanning

  • There are two approaches for locating tuples of relation R:

    • Table-scan: Get the blocks one by one.

    • Index-scan: Use index to lead us to all blocks holding R.

  • Sort-scan takes a relation R and sorting specifications and produces R in a sorted order. This can be accomplished with SQL clause ‘ORDER BY’.

  • CostMeasures

  • Estimates of cost are essential for query optimization.

  • It allows us to determine the slow and fast parts of a query plan.

  • Reading many consecutive blocks on a track is extremely important since disk I/O’s are expensive in term of time.


  • Review of Algorithms

  • One-pass Methods

  • Tuple-at-a-time: Selection and projection that do not require an entire relation in memory at once.

  • Full-relation, unary operations. Must see all or most of tuples in memory at once. Uses grouping and duplicate-eliminator operators.

  • Full-relation, binary operations. These include union, intersection, difference, product and join. they largely fall into three classes:

  • 1. Sorting-based methods (Section 15.4).

  • 2. Hash-based methods (Sections 15.5 and 20.1).

  • 3. Index-based methods (Section 15.6).

  • Review of Algorithms

  • Nested-Loop Joins

  • In a sense, it is ‘one-and-a-half’ passes, since one argument has its tuples read only once, while the other will be read repeatedly.

  • Can use relation of any size and does not have to fit all in main memory.

  • Two variations of nested-loop joins:

    • Tuple-based: Simplest form, can be very slow since it takes T(R)*T(S) disk I/O’s if we are joining R(x,y) with S(y,z).

    • Block-based: Organizing access to both argument relations by blocks and use as much main memory as we can to store tuples.

  • Review of Algorithms

  • Two-pass Algorithms

  • Usually enough even for large relations.

  • Based on Sorting:

    • Partition the arguments into memory-sized, sorted sublists.

    • Sorted sublists are then merged appropriately to produce desired results.

  • Based on Hashing:

    • Partition the arguments into buckets.

    • Useful if data is too big to store in memory.

  • Review of Algorithms

  • Two-pass Algorithms

  • Sort-based vs. Hash-based:

    • Hash-based are often superior to sort-based since they require only one of the arguments to be small.

    • Sorted-based works well when there is reason to keep some of the data sorted.

  • Review of Algorithms

  • Index-based Algorithms

  • Index-based joins are excellent when one of the relations is small, and the other has an index on join attributes.

  • Clustering and non-clustering indexes:

    • Clustering index has all tuples with fixed value packed into minimum number of blocks.

    • A clustered relation can have non-clustering indexes.

  • Review of Algorithms

  • Multi-pass Algorithms

  • Two-pass algorithms based on sorting or hashing can usually take three or more passes and will work for larger data sets.

  • Each pass of a sorting algorithm reads all data from disk and writes it out again.

  • Thus, a k-pass sorting algorithm requires 2·k·B(R) disk I/O’s.

Chapter 18

Chapter 18.1 Topics

  • Intro

    • Concurrency Control

    • Scheduler

  • Serializability

  • Schedules

    • Serial and Serializable

Intro: Concurrency Control & Scheduler

  • Concurrently executing transactions can cause inconsistent database state

  • Concurrency Control assures transactions preserve consistency

  • Scheduler:

    • Regulates individual steps of different transactions

    • Takes reads/writes requests from transactions and executes/delays them

Intro: Scheduler

Transaction manager

Transaction requests passed to Scheduler

Scheduler determines execution of requests




Reads and writes



  • How to assure concurrently executing transactions preserve database state correctness?

    • Serializability – schedule transactions as if they were executed one-at-a-time

      • Determine a Schedule


  • Schedule – sequence of important actions performed by transactions

    • Actions: reads and writes

  • Example: Transactions and actions

Serial Schedules

  • All actions of one transactions are followed by all actions of another transaction, and so on.

  • No mixing of actions

  • Depends only on order of transactions

  • Serial Schedules:

    • T1 precedes T2

    • T2precedes T1

Serial Schedule: Example

  • T1 precedes T2

    • Notation: (T1 ,T2)

  • Consistency constraint:

    • A = B

  • Final value:

    • A = B = 250

    • Consistency is preserved

Serializable Schedules

  • Serial schedules preserve consistency

  • Any other schedules that also guarantee consistency?

    • Serializable schedules

  • Definition:

    • A schedule Sis serializable if there’s a serial schedule S’ such that for every initial database state, the effects of S and S’ are the same.

Serializable Schedule: Example

Serializable, but not serial,


T2 acts on A after T1,

but before T1 acts on B

Effect is same as

serial schedule (T1, T2 )

Notation: Transactions and Schedules

  • Transaction: Ti(for example T1, T2,…)

  • Database element: X

  • Actions: read/write

    • rTi (X)= ri(X)

    • wTi(X) = wi(X)

  • Examples

    • Transactions:

      • T1: r1 (A); w1(A); r1(B); w1 (B);

      • T2: r2(A); w2(A); r2(B); w2(B);

    • Schedule:

      • r1 (A); w1 (A); r2 (A); w2 (A); r1 (B); w1 (B); r2 (B); w2 (B);

18.2 Conflict-Serializability

18.2.1 Conflicts

18.2.2 Precedence Graphs and a Test for Conflict-Serializability

18.2.3 Why the Precedence-Graph Test Works

18.2.1 Conflicts

Conflict -a pair of consecutive actions in a schedule such that, if their order is interchanged, the final state produced by the schedule is changed.

18.2.1 Conflicts

  • Non-conflicting situations:

    Assuming Ti and Tj are different transactions, i.e., i ≠ j:

    • ri(X); rj(Y) will never conflict, even if X = Y.

    • ri(X); wj(Y) will not conflict for X ≠ Y.

    • wi(X); rj(Y) will not conflict for X ≠ Y.

    • wi(X); wj(Y) will not conflict for X ≠ Y.

18.2.1 Conflicts

  • Two actions of the same transactions always conflictri(X); wi(Y)

  • Two writes of the same database element by different transactions conflictwi(X); wj(X)

  • A read and a write of the same database element by different transaction conflictri(X); wj(X)wi(X); rj(X)

Conflicting situations:

Three situations where actions may not be swapped:

18.2.1 Conflicts

  • Conclusions:

    Any two actions of different transactions may be swapped unless:

    • They involve the same database element, and

    • At least one is a write

  • The schedules S and S’ are conflict-equivalent, if S can be transformed into S’ by a sequence of non-conflicting swaps of adjacent actions.

  • A schedule is conflict-serializable if it is conflict-equivalent to a serial schedule.

18.2.1 Conflicts

  • Example 18.6

    Conflict-serializable schedule

    S: r1(A); w1(A); r2(A); w2(A); r1(B); w1(B); r2(B); w2(B);

    Above schedule is converted to the serial schedule S’ (T1, T2) through a sequence of swaps.

    r1(A); w1(A); r2(A); w2(A); r1(B); w1(B); r2(B); w2(B);r1(A); w1(A); r2(A); r1(B); w2(A); w1(B); r2(B); w2(B);r1(A); w1(A); r1(B); r2(A); w2(A); w1(B); r2(B); w2(B);r1(A); w1(A); r1(B); r2(A); w1(B); w2(A); r2(B); w2(B);

    S’: r1(A); w1(A); r1(B); w1(B); r2(A); w2(A); r2(B); w2(B);

18.2.2 Precedence Graphs and a Test for Conflict-Serializability

  • Given a schedule S, involving transactions T1 and T2, T1 takes precedence over T2 (T1 <sT2), if there are actions A1of T1 and A2of T2, such that:

    • A1 is ahead of A2 in S,

    • Both A1 and A2 involve the same database element, and

    • At least one of A1 and A2 is a write action

  • We cannot swap the order of A1 and A2.

  • A1 will appear before A2 in any schedule that is conflict-equivalent to S.

  • A conflict-equivalent serial schedule must have T1 before T2.

  • 18.2.2 Precedence Graphs and a Test for Conflict-Serializability

    • Precedence graph:

      • Nodes represent transactions of S

      • Arc from node i to node j, if Ti <S Tj

  • Example 18.7

    Given Schedule

    S: r2(A); r1(B); w2(A); r3(A); w1(B); w3(A); r2(B); w2(B);

    Precedence Graph

    Acyclic graph  Schedule is conflict-serializable

  • T3



    18.2.2 Precedence Graphs and a Test for Conflict-Serializability

    • Example 18.8

      S: r2(A); r1(B); w2(A); r3(A); w1(B); w3(A); r2(B); w2(B);

      Convert S to serial schedule S’ (T1, T2, T3).

      r1(B); r2(A); w2(A); r3(A); w1(B); w3(A); r2(B); w2(B);

      r1(B); r2(A); w2(A); w1(B); r3(A); w3(A); r2(B); w2(B);

      r1(B); r2(A); w1(B); w2(A); r3(A); w3(A); r2(B); w2(B);

      r1(B); w1(B); r2(A); w2(A); r3(A); w3(A); r2(B); w2(B);

      r1(B); w1(B); r2(A); w2(A); r3(A); r2(B); w3(A); w2(B);

      r1(B); w1(B); r2(A); w2(A); r2(B); r3(A); w3(A); w2(B);

      r1(B); w1(B); r2(A); w2(A); r2(B); r3(A); w2(B); w3(A);

      r1(B); w1(B); r2(A); w2(A); r2(B); w2(B); r3(A); w3(A);

      S’: r1(B); w1(B); r2(A); w2(A); r2(B); w2(B); r3(A); w3(A);

    18.2.2 Precedence Graphs and a Test for Conflict-Serializability

    • Example 18.9

      Given Schedule

      S: r2(A); r1(B); w2(A); r2(B); r3(A); w1(B); w3(A); w2(B);

      Precedence Graph

      Cyclicgraph  Schedule is NOT conflict-serializable




    18.2.3 Why the Precedence-Graph Test Works

    • Consider a cycle involving n transactions

      T1 —> T2 ... —> Tn —> T1

    • In the hypothetical serial order, the actions of T1must precede those of T2, which precede those of T3, and so on, up to Tn.

    • But the actions of Tn, which therefore come after those of T1, are also required to precede those of T1. This puts constraints on legal swaps between T1 and Tn.

    • Thus, if there is a cycle in the precedence graph, then the schedule is not conflict-serializable.

    18.3 Enforcing Serializability by locks


    • Enforcing serializability by locks

      • Locks

      • Locking scheduler

      • Two phase locking


    • Maintained on database element to prevent unserializable behavior

    • It works like as follows :

      • A request from transaction

      • Scheduler checks in the lock table to guide the decision

      • Generates a serializable schedule of actions.

    Consistency of transactions

    • Actions and locks must relate each other

      • Transactions can only read & write only if it has a lock on the database elements involved in the transaction.

      • Unlocking an element is compulsory.

    • Legality of schedules

      • No two transactions can acquire the lock on same element without the prior one releasing it.

    Locking scheduler

    • Grants lock requests only if it is in a legal schedule.

    • Lock table stores the information about current locks on the elements.

    • Consider

      • li(X): Transaction Ti requests a lock on database element X

      • ui(X): Transaction Ti releases its lock on database element X

    Locking scheduler (contd.)

    A legal schedule of consistent transactions but unfortunately it is not a serializable but it is legal.

    Locking schedule (contd.)

    The locking scheduler delays requests in order to maintain a consistent database state.

    Two-phase locking(2PL)

    • Guarantees a legal schedule of consistent transactions is conflict-serializable.

    • All lock requests proceed all unlock requests.

    • The growing phase:

      • Obtain all the locks and no unlocks allowed.

    • The shrinking phase:

      • Release all the locks and no locks allowed.

    Working of Two-Phase locking

    • Assures serializability.

    • Two protocols for 2PL:

      • Strict two phase locking : Transaction holds all its write locks till commit / abort.

      • Rigorous two phase locking : Transaction holds all locks till commit / abort.

    • Two phase transactions are ordered in the same order as their first unlocks.

    Instantaneously executes now




    Two Phase Locking

    Every two-phase-locked transaction has a point at which it may be thought to execute instantaneously

    Failure of 2PL.

    2PL fails to provide security against deadlocks.

    T1: l1(A); r1(A); A:=A+100; w1(A); l1(B); u1(A); r1(B); B:=B+100; w1(B); u1(B);

    T2: l2(B); r2(B); B:=B*2; w2(B); l2(A); u2(B); r2(A);A:=A*2; w2(A); u2(A);

    Concurrency ControlArchitecture for a Locking SchedulerSection 18.5


    Overview of Locking Scheduler

    Scheduler That Inserts Lock Actions

    The Lock Table

    Handling Locking and Unlocking Request

    Principles of simple scheduler architecture

    The transactions themselves do not request locks, or cannot be relied upon to do so. It is the job of the scheduler to insert lock actions into the stream of reads, writes, and other actions that access data.

    Transactions do not release locks. Rather, the scheduler releases the locks when the transaction manager tells it that the transaction will commit or abort.

    Scheduler That Inserts Lock Actions into the transactions request stream

    From Transaction



    Scheduler, Part 1




    Scheduler, Part 2


    The scheduler maintains a lock table, which, although it is shown as secondary-storage data, may be partially or completely in main memory

    Actions requested by a transaction are generally transmitted through the scheduler and executed on the database.

    Under some circumstances a transaction is delayed, waiting for a lock, and its requests are not (yet) transmitted to the database.

    The two parts of the scheduler perform

    Part I takes the stream of requests generated by the transactions and inserts appropriate lock actions ahead of all database-access operations, such as read, write, increment, or update.

    Part II takes the sequence of lock and database-access actions passed to it by Part I, and executes each appropriately

    Determine the transaction (T) that action belongs and status of T (delayed or not). If T is not delayed then

    1. Database access action is transmitted to the database and executed

    2. If lock action is received by PartII, it checks the L Table whether lock can be granted or not

    • i> Granted, the L Table is modified to include granted lock

    • ii>Not G. then update L Table about requested lock then PartII delays transaction T

    3. When a T = commits or aborts, PartI is notified by the transaction manager and releases all locks.

    - If any transactions are waiting for locks PartI notifies PartII.

    4. Part II when notified about the lock on some DB element, determines next transaction T’ to get lock to continue.

    The Lock Table

    A relation that associates database elements with locking information about that element

    Implemented with a hash table using database elements as the hash key

    Size is proportional to the number of lock elements only, not to the size of the entire database

    DB element A

    Lock information for A

    Lock table Entry Field

    • Group Mode

      • S means that only shared locks are held.

      • U means that there is one update lock and perhaps one or more shared locks.

      • X means there is one exclusive lock and no other locks.

  • Waiting

    • Waiting bit tells that there is at least one transaction waiting for a lock on A.

  • A list

    • A list describing all those transactions that either currently hold locks on A or are waiting for a lock on A.

  • Handling Lock Requests

    Suppose transaction T requests a lock on A

    If there is no lock table entry for A, then there are no locks on A, so create the entry and grant the lock request

    If the lock table entry for A exists, use the group mode to guide the decision about the lock request

    How to deal with existing Lock

    • 1) If group mode is U (update) or X (exclusive)

    • No other lock can be granted

      • Deny the lock request by T

      • Place an entry on the list saying T requests a lock

      • And Wait? = ‘yes’

    • 2) If group mode is S (shared)

    • Another shared or update lock can be granted

      • Grant request for an S or U lock

      • Create entry for T on the list with Wait? = ‘no’

      • Change group mode to U if the new lock is an update lock

    Handling Unlock Requests

    • Now suppose transaction T unlocks A

      • Delete T’s entry on the list for A

      • If T’s lock is not the same as the group mode, no need to change group mode

      • Otherwise check entire list for new group mode

    Handling Unlock Requests

    • If the value of waiting is “yes" need to grant one or more locks using following approaches

      • First-Come-First-Served:

      • Grant the lock to the longest waiting request.

      • No starvation (waiting forever for lock)

      • Priority to Shared Locks:

      • Grant all S locks waiting, then one U lock.

      • Grant X lock if no others waiting

      • Priority to Upgrading:

      • If there is a U lock waiting to upgrade to an X lock, grant that first.

    18. Concurrency Control

    18.6 Hierarchies of

    Database Elements

    Hierarchies of Database Elements

    • Two problems arises when tree structure in data is encountered – How to provide locking in each case.

      • Hierarchy of lockable elements (This section )

        - Large elements ( relations )

        - Smaller elements within relations ( blocks

        or individual tuples )

        2) Data it-self organized as trees. ( next section 18.7 )

    Locks with Multiple Granularity

    A database element can be a relation, block or a tuple

    Different Database Systems use different size of data base elements for locking purpose


    • Consider a database for a bank

      • Choosing relations as database elements means one lock for an entire relation

      • If we were dealing with a relation having account balances, this kind of lock would be very inflexible and thus provide very little concurrency

      • This means only 1 deposit/withdrawal can be made

      • Better way to lock individual blocks or pages

        such that two accounts in different blocks can be updated simultaneously

    Warning (Intention) Locks

    These protocol helps manage lock at different level of hierarchy

    Database Elements Organized in Hierarchy

    Warning Protocol Rules

    • These involve both ordinary (S and X) and warning (IS and IX) locks

    • Below Rules Followed

      • Begin at the root of hierarchy

      • Request the S or X lock if we are at the desired element

      • If the desired element id further down the hierarchy, place a warning lock (IS if S and IX if X)

      • When the warning lock is granted, we proceed to the child node and repeat the above steps until desired node is reached

        Uses Compatibility matrix

    Compatibility Matrix for Shared, Exclusive and Intention Locks


    • Movie(title, year, length, studioName)

      Transaction T1

      - Select * from Movie where Title = ‘King kong’

      Transaction T2

      - Update Movie set year = 1939 where title=‘ gone with the wind’

    If T2 had updated tuples related to ‘King Kong’ then it would have to wait untill T1 release S lock.

    S and X are not compatible

    Phantoms and Handling Insertions Correctly

    This arises when transactions create new sub elements of lockable elements

    Since we can lock only existing elements the new elements fail to be locked

    Example followed


    • Consider a transaction T3

      • Select sum(length) from moviewhere studioName =‘Disney’

      • This calculates the total length of all tuples having studioName=Disney

      • Thus, T3 acquires IS for relation and S for targeted tuples

      • Now, if another transaction T4 inserts a new tuple for studioName = ‘Disney’, the result of T3 becomes incorrect


    • Not a concurrency problem since the serial order (T3, T4) is maintained

    • Consider below now: T4 writing new tuple while T3 is reading tuples on same relation

      • r3(d1);r3(d2);w4(d3);w4(X);w3(L);w3(X)

      • Want to find total length of all movies under studioName – outcome should be consistent.

      • Above scenario is not consistent. T3 will not get correct sum because lock on new element (d3) was not obtained.


    • This problem is due to the relation has a phantom tuple (the new inserted tuple), which should have been locked but it did not existed when lock was acquired by T3.

    • The occurrence of phantoms can be avoided if all insertion and deletion transactions are treated as write operations (Excelusive lock – X ) on the whole relation.





    Rules for Access to Tree-Structured Data

    Why the Tree Protocol Works


    • In this section, we deal with tree structures that are formed by the link pattern.

      • Database elements are disjoint pieces of data, but the only way to get to a node is through its parent.

      • B-trees are an important example of this sort of data.

      • Traversing a particular path to an element gives us freedom to manage locks differently from the two-phase locking approaches we have seen previously.

    B-tree Details-

    • Basic DS:

      - Keeps records in sorted order, allows searches, sequential access, insertions and deletions in logarithmic time.

      - It is used in databases and file systems.

    • Locking structure:

      - Granularity is at node level. Treating Smaller pieces as elements is not beneficial and entire B-tree as a single element is infeasible!

    Motivation for Tree based locking-

    If we use a standard set of lock modes (shared, update and exclusive) and two phase locking, then concurrent use of the B-tree is almost impossible.

    The reason is that every transaction must begin by locking the root node of the B-tree.

    Any transaction that inserts or deletes could wind up rewriting the root of the B-tree.

    Thus, only one transaction that is not read-only can access the B-tree at any time

    However, in most situations a B-tree node will not be rewritten, even if the transaction inserts or deletes a tuple.

    Thus, as soon as a transaction moves to a child of the root and observes the situation that rules out a rewrite of the root, we would like to release the lock on the root.

    Releasing the lock on the root early will violate two phase locking, so we cannot be sure that the schedule of several transactions accessing the B-tree will be serializable.

    The solution is a specialized protocol for transactions that access tree structured data like B-trees.

    Rules for Access to Tree-Structured Data-

    • The following restrictions on locks form tree protocol.


      • There is only ONE kind of a lock, represented by the form li(X)

      • Transactions are consistent, and schedules must be legal but there is no two-phase locking requirement on transactions.


      • A transaction’s first lock may be at any node of the tree

      • Subsequent locks may only be acquired if the transaction currently has a lock on the parent node.

      • Nodes may be unlocked at any time.

      • A transaction may not relock a node on which it has released a lock, even if it still holds a lock on the node’s parent


    A tree of lockable elements








    3 transactions following the protocol…


    l1 (A);r1(A);

    l1 (B); r1(B);

    l1(C); l1 (C);


    l1(D); r1(D);

    w1 (B); u1(B);





    l2(E) denied






    w2 (B); u2(B);

    w1 (E); u1(E);

    Why Tree Protocol works-

    • The tree protocol implies a serial order on the transactions involved in a schedule.

    • The order of precedence can be defined as-

      Ti<S Tj ;If in a schedule S, the transactions Ti and Tj lock a node in common, and Ti locks the node first.

    • If precedence graph drawn from the precedence relations that we defined above has no cycles, then we claim that any topological order of transactions is an equivalent serial schedule.

    • For Example either ( T1,T2,T3) or (T3,T1,T2) is an equivalent serial schedule the reason for this serial order is that all the nodes are touched in the same order as they are originally scheduled.

    If two transactions lock several elements in common, then they are all locked in same order.

    Precedence graph derived from schedule:




    Example:--4 Path of elements locked by two transactions

    X T locks first


    U locks first


    Y U locks first

    Now Consider an arbitrary set of transactions T1, T2;.. . . Tn,, that obey the tree protocol and lock some of the nodes of a tree according to schedule S.

    First among those that lock, the root. they do also in same order.

    If Ti locks the root before Tj, Then Ti locks every node in common with Tj does. That is Ti<sTj, But not Tj>sTi.


    SECTION 18.8


    What is Timestamping?

    • Scheduler assign each transaction T a unique number, it’s timestamp TS(T).

    • Timestamps must be issued in ascending order, at the time when a transaction first notifies the scheduler that it is beginning

    Timestamp TS(T)

    • Two methods of generating Timestamps.

      • Use the value of system, clock as the timestamp.

      • Use a logical counter that is incremented after a new timestamp has been assigned.

  • Scheduler maintains a table of currently active transactions and their timestamps irrespective of the method used

  • Timestamps for database element X and commit bit

    • RT(X):- The read time of X, which is the highest timestamp of transaction that has read X.

    • WT(X):- The write time of X, which is the highest timestamp of transaction that has write X.

    • C(X):- The commit bit for X, which is true if and only if the most recent transaction to write X has already committed.

    U writes X

    T reads X

    T start

    U start

    Physically Unrealizable Behavior

    • Read too late:

    • A transaction U that started after transaction T, but wrote a value for X before T reads X.

    U reads X

    T writes X

    T start

    U start

    Figure: Transaction T tries to write too late

    Physically Unrealizable Behavior

    • Write too late:

    • A transaction U that started after T, but read X before T got a chance to write X.

    U writes X

    T reads X

    U start

    T start

    U aborts

    T could perform a dirty read if it reads X when shown

    Dirty Read

    • It is possible that after T reads the value of X written by U, transaction U will abort.

    Thomas Write Rule

    • Too late to repair the damage

    U writes X

    T writes X

    U Aborts

    T Start

    U Start

    T Commits

    Rules for Timestamps-Based scheduling

    • The scheduler, in response to a read or write request from a transaction T has the choice of:

      • a) Granting the request,

      • b) Aborting T (if T would violate physical reality) and restarting T with a new timestamp (abort followed by restart is often called rollback), or

      • c) Delaying T and later deciding whether to abort T or to grant the request(if the request is a read, and the read might be dirty)

    Rules for Timestamps-Based scheduling

    • Scheduler receives a request rT(X)

      • If TS(T) ≥ WT(X), the read is physically realizable.

        • If C(X) is true, grant the request, if TS(T) > RT(X), set RT(X) := TS(T); otherwise do not change RT(X).

        • If C(X) is false, delay T until C(X) becomes true or transaction that wrote X aborts.

    • If TS(T) < WT(X), the read is physically unrealizable. Rollback T.

    Rules for Timestamps-Based scheduling (Cont.)

    • Scheduler receives a request WT(X).

      • If TS(T) ≥ RT(X) and TS(T) ≥ WT(X), write is physically realizable and must be performed.

        • Write the new value for X,

        • Set WT(X) := TS(T), and

        • Set C(X) := false.

    • if TS(T) ≥ RT(X) but TS(T) < WT(X), then the write is physically realizable, but there is already a later values in X.

      • If C(X) is true, then the previous writers of X is committed, and ignore the write by T.

      • If C(X) is false, we must delay T.

  • if TS(T) < RT(X), then the write is physically unrealizable, and T must be rolled back.

  • Rules for Timestamps-Based scheduling (Cont.)

    • Scheduler receives a request to commit T. It must find all the database elements X written by T and set C(X) := true. If any transactions are waiting for X to be committed, these transactions are allowed to proceed.

    • Scheduler receives a request to abort T or decides to rollback T, then any transaction that was waiting on an element X that T wrote must repeat its attempt to read or write.

    Three transactions executing under a timestamp-based scheduler

    Timestamps and Locking

    • Generally, timestamping performs better than locking in situations where:

      • Most transactions are read-only.

      • It is rare that concurrent transaction will try to read and write the same element.

  • In high-conflict situation, locking performs better than timestamps

  • The argument for this rule-of-thumb is:

    • Locking will frequently delay transactions as they wait for locks.

    • But if concurrent transactions frequently read and write elements in common, then rollbacks will be frequent in a timestamp scheduler, introducing even more delay than a locking system.

  • Concurrency control by validation

    CS 257 : Database System Principles

    Section 18.9

    At a Glance

    What is Validation?

    Architecture of Validation based Scheduler

    Validation Rules

    Comparison between Concurrency Control Mechanisms

    Validation (p1)

    Another type of Optimistic Concurrency control

    Allows transactions to access data without locks

    Validation Scheduler: Keeps record of what active transactions are doing

    Goes through ‘Validation Phase’ before the transaction starts to write values of database elements

    If there is a physically unrealizable behavior, the transaction is rolled back

    18.9.1 Architecture of Validation based Scheduler (p1)

    Scheduler must be told for each transaction T

    • Read Set , RS(T) - Sets of database elements T reads

    • Write Set , WS(T) - Sets of database elements T writes

    • Three phases of the Validation Scheduler

      • Read

        • Transaction reads from Database all elements in its Read Set.

        • Also computes in its local address space all results its going to write.

      • Validate

        • Validates the transaction by comparing its read and write sets with those of other transactions.

        • If validation fails, transaction is rolled back, else proceeds to write phase.

      • Write

        • Writes to the database its values for the elements in its write set.

    Validation based Scheduler

    • Scheduler has an assumed serial order of the transactions to work with.

    • Maintains three sets

      • START : Set of transactions that have started but not yet completed

        • START (T) – time at which transaction started

      • VAL : Set of transactions that have been validated but not yet finished the writing of phase 3

        • START(T) & VAL(T) – time at which T validated

      • FIN : Set of transactions that completed phase 3

        • START(T), VAL(T), FIN(T) – time at which T finished

    18.9.2 Validation Rules

    T reads X

    U writes X

    • Case 1:

      • U is in VAL or FIN, that is U is validated

      • FIN(U) > START(T) , that is U did not finish before T started

      • RS(T) ∩ WS(U) is not empty (let it contain database element X)

      • Since we don’t know whether or not T got to read U’s value, we must rollback T to avoid a risk that the actions of T and U will not be consistent with the assumed serial order.

    U validated

    U starts

    T validating

    T starts

    18.9.2 Validation Rules

    T writes X

    U writes X

    • Case 2:

      • U is in VAL , i.e. U has successfully validated

      • FIN(U) > VAL(T) , i.e. U did not finish before T entered its validation phase

      • WS(T) ∩ WS(U) is not empty (let database element X be in both write sets)

      • T and U must both write values of X , and if we let T validate, it is possible that it will write X before U does. Since we cannot be sure, we rollback T to make sure it does not violate the assumed serial order in which it follows U.

    U finish

    U validated

    T validating

    Rules for Validating a transaction T

    • Check that RS(T) ∩ WS(U) = ᶲ for any previously validated U that did not finish before T started, i.e., if FIN(U) > START(T).

    • Check that WS(T) ∩ WS(U) = ᶲ for any previously validated U that did not finish before T validated, i.e., if FIN(U) > VAL(T).

      1. RS(T) ∩ WS(U) = ᶲ ; FIN(U) > START(T)

      2. WS(T) ∩ WS(U) = ᶲ ; FIN(U) > VAL(T)

    Example 18.2.9

    4 Transactions T, U,V,W attempt to execute and validate

    U : RS = {B}

    WS = {D}

    W : RS ={A,D}

    WS = {A,C}

    • Read

    • Validate

    • Write


    RS = {A,B}

    WS ={A,C}

    V : RS = {B}

    WS = {D, E}

    Example 18.2.9

    • Validation of U [RS = {B}; WS = {D}]

      • Nothing to check ; Reads {B} – U validates successfully – Writes {D}

    • Validation of T [RS = {A,B}; WS ={A,C}]

      • FIN(U) > START(T) ; RS(T) ∩ WS(U) should be empty  {A,B} ∩ {D} = ᶲ

      • FIN(U) > VAL(T) ; WS(T) ∩ WS(U) should be empty  {A,C} ∩ {D} = ᶲ

    • Validation of V [RS = {B}; WS = {D, E}]

      • FIN(T) > START(V); RS(V) ∩ WS(T) should be empty  {B} ∩ {A,C} = ᶲ

      • FIN(T) > VAL(V) ;WS(V) ∩ WS(T) should be empty  {D,E} ∩ {A,C} = ᶲ

      • FIN(U) > START(V) ;RS(V) ∩ WS(U) should be empty  {B} ∩ {D} = ᶲ

    • Validation of W [RS ={A,D}; WS = {A,C}]

      • FIN(T) > START(W); RS(W) ∩ WS(T) should be empty  {A,D} ∩ {A,C} = {A}

      • FIN(V) > START(W);RS(W) ∩ WS(V) should be empty  {A,D} ∩ {D,E} = {D}

      • FIN(V) > VAL(W);WS(W) ∩ WS(V) should be empty  {A,C} ∩ {D,E} = ᶲ

      • W is not validated , Is rolled back and hence does not write values A and C

    18.9.3 Comparison between Concurrency Control Mechanisms – Storage Utilization

    • Timestamping and validation may use slightly more space than a locking.

    • A potential problem with validation is that the write set for a transaction must be known before the writes occur

    18.9.3 Comparison between Concurrency Mechanisms - Delay

    • The performance of the three methods depends on whether interaction among transactions is high or low.

      (Interaction  the likelihood that a transaction will access an element that is also being accessed by a concurrent transaction)

      • Locking delays transactions but avoids rollbacks, even when interaction is high.

      • Timestamps and validation do not delay transactions, but can cause them to rollback, which is a more serious form of delay and also wastes resources.

    • If interference is low, then neither timestamps nor validation will cause many rollbacks, and is preferable to locking

    Thank You

  • Login