

  1. Part III: Storage and Indexing (Chapters 8-11) Part V: Transactions, Concurrency Control, Scheduling, and Recovery (Chapters 16-18)

  2. Why Be Concerned with Storage and Indexing? • Performance: another major factor in user satisfaction • Depends on • Efficient data structures for data representation • Efficiency of system operations on those structures • Disks contain data files and system files, including dictionary and index files • Disk access: one of the most critical factors in performance.

  3. Why Not Store Everything in Main Memory? • Cost and size • Main memory is volatile: what's the problem? • Typical storage hierarchy: • Factors: access speed, cost per unit, reliability • Cache and main memory (RAM) for currently used data: fast but costly • Flash memory: limited number of writes (and slow), non-volatile, a disk substitute in embedded systems • Disk for the main database (secondary storage). • Tapes for archiving older versions of the data (tertiary storage).

  4. DB Buffer Management in a DBMS • Page requests arrive from higher levels of the DBMS. • The buffer pool in main memory holds disk pages in frames; the choice of frame is dictated by the replacement policy. • CC & Recovery may require additional I/O when a frame is chosen for replacement. Why? [Figure: buffer pool of free frames and disk pages in main memory, between the disk and the higher levels]

  5. Indexes • An index on a file speeds up selections on the search key fields for the index. • Any subset of the fields of a relation can be the search key for an index on the relation. • Search key is not the same as key (minimal set of fields that uniquely identify a record in a relation). • An index contains a collection of data entries, and supports efficient retrieval of all data entries k* with a given key value k. • Given data entry k*, we can find the record with key k quickly. • Classes: dense/sparse index, primary/secondary, clustered/unclustered

  6. Dense vs Sparse Index • Dense index: one index entry per search key value. • Sparse index: index entries for only some of the records. • Every sparse index is clustered! • Sparse indexes are smaller. • Which one is faster? • Which one has less overhead? [Figure: a sparse index on Name and a dense index on Age over the same data file of (name, age, sal) records, e.g. Ashby, Basu, Bristow, Cass, Daniels, Jones, Smith, Tracy]
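To make the sparse-index idea concrete, here is a minimal lookup sketch in Python (the page size, sample records, and helper names are illustrative assumptions, not from the slides): the index keeps one entry per page, so a lookup binary-searches the small index and then scans a single data page.

```python
import bisect

# Hypothetical page size and a small sorted data file, for illustration only.
PAGE_SIZE = 3
records = sorted([("Ashby", 25), ("Basu", 33), ("Bristow", 30),
                  ("Cass", 50), ("Daniels", 22)], key=lambda r: r[0])
pages = [records[i:i + PAGE_SIZE] for i in range(0, len(records), PAGE_SIZE)]

# Sparse index: one (first key on page, page number) entry per data page.
sparse = [(page[0][0], pno) for pno, page in enumerate(pages)]

def lookup(key):
    # Binary-search the index for the last entry with key <= search key,
    # then scan only that one page of the data file.
    i = bisect.bisect_right([k for k, _ in sparse], key) - 1
    if i < 0:
        return None
    for k, v in pages[sparse[i][1]]:
        if k == key:
            return (k, v)
    return None

print(lookup("Cass"))  # ('Cass', 50): one index probe, one page scanned
```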

  7. Clustered vs. Unclustered Index • Suppose that Alternative (2) is used for data entries, and that the data records are stored in a Heap file. • To build a clustered index, first sort the Heap file (with some free space on each page for future inserts). • Overflow pages may be needed for inserts. (Thus, the order of data records is 'close to', but not identical to, the sort order.) [Figure: clustered vs. unclustered; index entries direct the search for data entries in the index file, which in turn point to data records in the data file]

  8. Index Trees • As for any index, 3 alternatives for data entries k*: • Data record with key value k • <k, rid of data record with search key value k> • <k, list of rids of data records with search key k> • Choice is orthogonal to the indexing technique used to locate data entries k*. • Tree-structured indexing techniques support both range searches and equality searches. • ISAM (indexed sequential access method): static structure (data entries reside in leaf pages and overflow pages) • B+ tree: dynamic, adjusts gracefully under inserts and deletes.

  9. ISAM • Non-leaf pages hold index entries of the form <P0, K1, P1, K2, P2, ..., Km, Pm>. • The index file may still be quite large, but we can apply the idea repeatedly, making a tree. • Leaf pages contain data entries; overflow pages hang off the primary leaf pages. [Figure: ISAM tree of non-leaf pages over primary leaf pages with overflow pages]

  10. Example ISAM Tree • Each node can hold 2 entries; no need for `next-leaf-page' pointers. Why? • Sequential allocation of leaf pages. • Insert 23*. • Insert 48*, 41*, 42*. [Figure: ISAM tree with root entry 40; non-leaf entries 20, 33 and 51, 63; leaf pages 10* 15* | 20* 27* | 33* 37* | 40* 46* | 51* 55* | 63* 97*]

  11. After Inserting 23*, 48*, 41*, 42* ... [Figure: same ISAM tree; the index pages (root 40; entries 20, 33 and 51, 63) are unchanged; overflow pages now hang off the primary leaf pages: 23* overflows the 20*, 27* page, and 48*, 41* plus a chained 42* overflow the 40*, 46* page]

  12. B+ Tree: The Most Widely Used Index • Insert/delete at log_F N cost; the tree is kept height-balanced. (F = fanout, N = # leaf pages.) • F is typically >> 2. Why? • Minimum 50% occupancy (except for the root). Each node contains m entries, with d <= m <= 2d. The parameter d is called the order of the tree. • Supports equality and range searches efficiently. [Figure: index entries (direct search) above the leaf level of data entries, the "sequence set"]
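A quick worked calculation of why F >> 2 matters (the concrete numbers are illustrative assumptions): with fanout F over N leaf pages, a search visits roughly ceil(log_F N) non-leaf levels plus one leaf.

```python
import math

# Height estimate for a B+ tree: about ceil(log_F N) non-leaf levels.
def levels(fanout, leaf_pages):
    return math.ceil(math.log(leaf_pages, fanout))

# With F = 133, height 4 already covers 133**4 ~ 313 million leaf pages.
print(levels(133, 312_900_700))  # 4
print(levels(2, 312_900_700))    # 29: a binary fanout would be far deeper
```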

  13. Extendible Hashing • Situation: a bucket (primary page) becomes full. Why not reorganize the file by doubling the # of buckets? • Reading and writing all pages to reorganize a data file is expensive • Idea: use a directory of pointers to buckets; double the # of buckets by doubling the directory, splitting just the bucket that overflowed! • The directory is much smaller than the file, so doubling it is much cheaper. Only one page of data entries is split. No overflow page! • The trick lies in how the hash function is adjusted!

  14. Extendible Hashing: Example • The directory is an array of size 4 (global depth 2); each bucket records its local depth. • To find the bucket for r, take the last `global depth' # bits of h(r); we denote r by h(r). • If h(r) = 5 = binary 101, it is in the bucket pointed to by 01. • Insert: if the bucket is full, split it (allocate a new page, redistribute). • If necessary, double the directory. (When does splitting a bucket not require doubling? We can tell by comparing the global depth with the local depth of the split bucket.) [Figure: directory entries 00, 01, 10, 11 pointing to data pages: bucket A (4*, 12*, 32*, 16*), B (1*, 5*, 21*, 13*), C (10*), D (15*, 7*, 19*), each with local depth 2]

  15. Insert h(r) = 20 (Causes Doubling) --- How do we determine the keys in A and A2? [Figure: before the insert, global depth 2 and bucket A = {32*, 16*, 4*, 12*} with local depth 2; after, global depth 3, bucket A = {32*, 16*} and its `split image' bucket A2 = {4*, 12*, 20*}, both with local depth 3; the directory grows from 00..11 to 000..111]
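A minimal extendible-hashing sketch that reproduces this example (bucket capacity, class names, and using the key itself as the hash value are illustrative assumptions): the directory is indexed by the last `global depth` bits, and only the overflowed bucket is split.

```python
BUCKET_SIZE = 4  # assumed capacity of a primary page

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _index(self, key):
        return key & ((1 << self.global_depth) - 1)  # last global_depth bits

    def insert(self, key):
        b = self.directory[self._index(key)]
        if len(b.keys) < BUCKET_SIZE:
            b.keys.append(key)
            return
        if b.local_depth == self.global_depth:
            # Directory doubling: each new slot points to its old bucket.
            self.directory += self.directory
            self.global_depth += 1
        # Split only the overflowed bucket and redistribute its keys.
        b.local_depth += 1
        image = Bucket(b.local_depth)          # the `split image' bucket
        bit = 1 << (b.local_depth - 1)
        for i, slot in enumerate(self.directory):
            if slot is b and i & bit:
                self.directory[i] = image
        old, b.keys = b.keys, []
        for k in old:
            (image if k & bit else b).keys.append(k)
        self.insert(key)  # retry (may split again if all keys landed together)

h = ExtendibleHash()
for k in [32, 16, 4, 12, 1, 5, 21, 13, 10, 15, 7, 19, 20]:
    h.insert(k)
print(h.global_depth)  # 3 after inserting 20, matching the slide's example
```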

  16. Linear Hashing • This is another dynamic hashing scheme, an alternative to Extendible Hashing. • LH handles the problem of long overflow chains without using a directory, and handles duplicates. • Idea: use a family of hash functions h0, h1, h2, ... • h_i(key) = h(key) mod (2^i * N); N = initial # buckets • h is some hash function (its range is not 0 to N-1) • If N = 2^d0 for some d0, h_i consists of applying h and looking at the last d_i bits, where d_i = d0 + i. • h_{i+1} doubles the range of h_i (similar to directory doubling)
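The hash-function family is easy to state in code; a short sketch (N, d0, and the base hash h are assumptions for illustration):

```python
N = 4        # initial number of buckets; here N = 2**d0 with d0 = 2
d0 = 2

def h(key):  # stand-in base hash; any integer hash function works
    return hash(key)

def h_i(i, key):
    # h_i(key) = h(key) mod (2**i * N); equivalently, the last d0 + i bits.
    return h(key) % (2 ** i * N)

key = 41
# h_{i+1} refines h_i: reducing it mod 2**i * N gives back h_i.
assert h_i(1, key) % (2 ** 0 * N) == h_i(0, key)
print(h_i(0, key), h_i(1, key))
```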

  17. Overview of LH File • In the middle of a round, the buckets already split in this round come first, and Next points to the bucket to be split. • The buckets that existed at the beginning of this round form the range of h_Level. • If h_Level(search key value) is in this range, we must use h_{Level+1}(search key value) to decide if the entry is in a `split image' bucket. • `Split image' buckets are created (through splitting of other buckets) in this round. [Figure: LH file mid-round, showing the split buckets, the Next pointer, the range of h_Level, and the split-image buckets]

  18. Lawyers As recently reported in the Massachusetts Bar Association Lawyers Journal, the following are questions actually asked of witnesses by attorneys during trials: "Now doctor, isn't it true that when a person dies in his sleep, he doesn't know about it until the next morning?" "Were you alone or by yourself?" "Was it you or your younger brother who was killed in the war?" "How far apart were the vehicles at the time of the collision?" Q: "How was your first marriage terminated?" A: "By death." Q: "And by whose death was it terminated?"

  19. Part V: Transactions, Concurrency Control, Scheduling, and Recovery (Chapters 16-18)

  20. Overview • Transactions and ACID properties • Serial execution and serializable execution • Serializability (dependency) graph • Serializability theorem • Conflict equivalence and view equivalence • Properties of schedules: SR, RC, ACA, ST • Schedulers • 3 options to handle the request from TM • optimistic (aggressive) vs pessimistic (conservative) • 2PL and Strict 2PL • Timestamp ordering and Strict TO

  21. Atomicity of Transactions • A transaction might commit after completing all its actions, or it could abort (or be aborted by the DBMS) after executing some actions. • A very important property guaranteed by the DBMS for all transactions is that they are atomic. • Atomicity: a transaction is assumed to execute all its actions in one step, or not to execute any actions at all. • Not easy to achieve. Why?

  22. Consistency, Isolation, Durability • A transaction executed in isolation must preserve DB consistency. • Even if multiple transactions execute concurrently, each should be unaware that other transactions are executing concurrently. • When a transaction completes successfully, the changes it made must persist, even if failures occur afterwards.

  23. Conflicts and Equivalence • When do two operations conflict? • They are issued by different transactions • They operate on the same data object • At least one of them is a write operation • Conflict equivalent • Two executions are conflict equivalent if in both executions, all conflicting operations have the same order.
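The slide's three-part conflict test translates directly into a predicate; a small sketch (the (transaction, action, object) triple encoding is an assumption for illustration):

```python
# Two operations conflict iff: different transactions, same data object,
# and at least one of them is a write.
def conflicts(op1, op2):
    t1, a1, x1 = op1
    t2, a2, x2 = op2
    return t1 != t2 and x1 == x2 and "w" in (a1, a2)

print(conflicts((1, "w", "x"), (2, "r", "x")))  # True: w1(x) vs r2(x)
print(conflicts((1, "r", "x"), (2, "r", "x")))  # False: two reads never conflict
print(conflicts((1, "w", "x"), (1, "r", "x")))  # False: same transaction
```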

  24. Serializability • Correctness criterion • Serializability is the standard correctness criterion for concurrent DB executions • All serializable schedules are equally correct • Scheduling algorithms enforce certain orderings • In a distributed DBMS, variable delays may disturb any particular ordering that is supposed to occur • The serialization graph (dependency graph) shows the dependency relationships among transactions • Serializability Theorem • For a schedule H, if SG(H) is acyclic, then H is serializable.
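The theorem suggests a direct test; here is a sketch (history encoding and helper names are illustrative assumptions): build SG(H) by adding an edge for every ordered pair of conflicting operations, then check for cycles.

```python
from collections import defaultdict
from itertools import combinations

def serialization_graph(history):
    # history: list of (transaction_id, action, object) in execution order
    edges = set()
    for (i, (t1, a1, x1)), (j, (t2, a2, x2)) in combinations(enumerate(history), 2):
        if t1 != t2 and x1 == x2 and "w" in (a1, a2):
            edges.add((t1, t2))  # earlier conflicting op -> later one
    return edges

def has_cycle(edges):
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)
    def dfs(u):  # a gray node reached again means a back edge, i.e. a cycle
        color[u] = GRAY
        for v in adj[u]:
            if color[v] == GRAY or (color[v] == WHITE and dfs(v)):
                return True
        color[u] = BLACK
        return False
    return any(color[u] == WHITE and dfs(u) for u in list(adj))

# w1(x) r2(x) w2(y) r1(y): edges T1->T2 (on x) and T2->T1 (on y), a cycle.
H = [(1, "w", "x"), (2, "r", "x"), (2, "w", "y"), (1, "r", "y")]
g = serialization_graph(H)
print(g, "serializable?", not has_cycle(g))  # not serializable
```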

  25. Properties of Schedules • Recoverability • To ensure that aborting a transaction does not change the semantics of committed transactions: w1(x) r2(x) w2(y) C2 • Is it recoverable? What if T1 aborts? • Recoverable execution depends on commit order • A transaction cannot commit until all the values it read are guaranteed not to be aborted. How to do it? • Delaying commit: T2 cannot commit until T1 commits • Cascaded abort is sometimes necessary. Why? w1(x) r2(x) w2(x) A1

  26. Properties of Schedules • Recoverability • Cascaded abort is sometimes necessary: w1(x) r2(x) w2(x) A1 • Avoiding cascaded aborts • Achieved if every transaction reads only values written by committed transactions • Must delay each r(x) until all transactions that issued a w(x) have either committed or aborted: w1(x) ... C1 r2(x) w2(y) ...

  27. Properties of Schedules • Restoring before images • Implementing transaction abort by simply restoring the before images of all writes is very convenient: w0(x) w1(x) w2(x) A1 A2 • The value of x must be restored to the initial value, not the value written by T1 • Solution: delay w(x) until all transactions that have written x are either committed or aborted • Strictness • Executions that satisfy both requirements • Delay both r(x) and w(x) until all transactions that have written x are either committed or aborted • r1(x) w1(x) w1(y) w2(z) w2(x) C1 --- is it strict?

  28. Relationships among Properties • Recoverability (RC) • H is RC if, whenever Ti reads from Tj and Ci is in H, Ci follows Cj • Avoiding cascaded aborts (ACA) • H is ACA if, whenever Ti reads x from Tj, ri(x) follows Cj • Strictness (ST) • H is ST if, whenever Oi(x) follows wj(x), Oi(x) also follows either Aj or Cj • What is the relationship among ST, ACA, and RC? ST ⊂ ACA ⊂ RC (each class is strictly contained in the next) • What about with SR and serial execution?

  29. Two-Phase Locking (2PL) • Two-Phase Locking Protocol • Each Xact must obtain an S (shared) lock on an object before reading it, and an X (exclusive) lock on an object before writing it. • A transaction cannot request additional locks once it releases any lock. • If an Xact holds an X lock on an object, no other Xact can get a lock (S or X) on that object.
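A minimal 2PL bookkeeping sketch (class and method names are illustrative assumptions; a real lock manager would block rather than return False): the key rule is that once a transaction releases any lock, it enters its shrinking phase and may not acquire new ones.

```python
class TwoPhaseLocking:
    def __init__(self):
        self.locks = {}         # object -> (mode, set of holding transactions)
        self.shrinking = set()  # transactions that have released a lock

    def acquire(self, t, obj, mode):  # mode: "S" or "X"
        assert t not in self.shrinking, "2PL violated: growing after shrinking"
        held_mode, holders = self.locks.get(obj, ("S", set()))
        if holders and holders != {t} and "X" in (mode, held_mode):
            return False  # conflict: caller must block or abort
        self.locks[obj] = ("X" if "X" in (mode, held_mode) else "S",
                           holders | {t})
        return True

    def release(self, t, obj):
        self.shrinking.add(t)  # start of t's shrinking phase
        mode, holders = self.locks[obj]
        holders.discard(t)
        if not holders:
            del self.locks[obj]

lm = TwoPhaseLocking()
print(lm.acquire(1, "x", "S"))  # True
print(lm.acquire(2, "x", "S"))  # True: shared locks are compatible
print(lm.acquire(2, "x", "X"))  # False: T1 also holds an S lock on x
```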

  30. Multiple-Granularity Locks • Why consider it? • A database consists of tables, pages, and tuples (records) • Hard to decide what granularity to lock (tuples vs. pages vs. tables). • Shouldn't have to decide. How? • Data "containers" are nested: a database contains tables, which contain pages, which contain tuples. [Figure: containment hierarchy Database > Tables > Pages > Tuples]

  31. The Phantom Problem • T1 implicitly assumes that it has locked the set of all sailor records with rating = 1. • The assumption only holds if no sailor records are added while T1 is executing! • Why did this problem happen? • The example shows that conflict serializability guarantees serializability only if the set of objects is fixed! • Need some mechanism to enforce this assumption -- index locking and predicate locking

  32. Index Locking • If there is a dense index on the rating field using Alternative (2), T1 should lock the index page containing the data entries with rating = 1. • If there are no records with rating = 1, T1 must lock the index page where such a data entry would be, if it existed! • What if there is no suitable index? • T1 must lock all pages, and lock the file/table to prevent new pages from being added, to ensure that no new records with rating = 1 are added. [Figure: index entries for r = 1 pointing into the data file]

  33. Predicate Locking • Grant a lock on all records that satisfy some logical predicate, e.g. age > 2*salary. • Index locking is a special case of predicate locking, for which an index supports efficient implementation of the predicate lock. • What is the predicate in the sailor example? • rating = 1 • Why aren't predicate locks used in commercial DBMSs? • In general, predicate locking has a significant locking overhead.

  34. Locking in B+ Trees • How can we efficiently lock a particular leaf node? • Don’t confuse this with multiple granularity locking -- How are they different? • One solution: Ignore the tree structure, just lock pages while traversing the tree, following 2PL -- What’s wrong? • This has terrible performance! • Root node (and many higher level nodes) becomes a bottleneck. Why? • Because every tree access begins at the root.

  35. B+ Tree Locking • Higher levels of the tree only direct searches to leaf pages. • For inserts, a node on the path from the root to the modified leaf must be locked (in X mode, of course) only if a split can propagate up to it from the modified leaf. (A similar point holds w.r.t. deletes.) • We can exploit these observations to design efficient locking protocols that guarantee serializability even though they violate 2PL.

  36. A Simple Tree Locking Algorithm • Search: Start at the root and go down; repeatedly, S lock the child, then unlock the parent. • Insert/Delete: Start at the root and go down, obtaining X locks as needed. Once a child is locked, check if it is safe: • If the child is safe, release all locks on its ancestors. • Safe node: a node such that changes will not propagate up beyond it. • When is a node safe for inserts? • When it is not full. • When is a node safe for deletes? • When it is more than half full.
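A runnable sketch of the search side of this protocol, often called lock-crabbing (the toy Node and LockManager classes are assumptions for illustration, not the course's actual API):

```python
import bisect

class Node:
    def __init__(self, keys, children=None):
        self.keys, self.children = keys, children  # children=None => leaf

class LockManager:                  # stub that just logs lock activity
    def s_lock(self, node): print("S-lock ", node.keys)
    def unlock(self, node): print("unlock ", node.keys)

def search(root, key, lm):
    # Crabbing: S-lock the child, then unlock the parent, so at most two
    # nodes are locked at once. This violates 2PL but is safe because
    # higher levels only direct the search.
    lm.s_lock(root)
    node = root
    while node.children is not None:
        child = node.children[bisect.bisect_right(node.keys, key)]
        lm.s_lock(child)
        lm.unlock(node)
        node = child
    return node  # leaf is returned still S-locked; caller unlocks after reading

leaf1, leaf2 = Node([10, 15]), Node([20, 27])
root = Node([20], [leaf1, leaf2])
print(search(root, 15, lm=LockManager()).keys)  # lock root, lock leaf, unlock root
```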

  37. Timestamp Ordering • Idea: Any conflicting operations are executed in their timestamp order • Simple and aggressive • Schedule immediately and reject requests that arrive too late • How do you know a request has arrived too late? • Give each object a read-timestamp (RTS) and a write-timestamp (WTS), give each transaction a timestamp (TS) when it begins

  38. Timestamp Ordering • Timestamp ordering rule: • If Oi(x) and Oj(x) are conflicting operations, Oi(x) is processed before Oj(x) if and only if TS(Ti) < TS(Tj). • Request arriving too late: • Oi(x) arrives after the scheduler has already sent a conflicting operation Oj(x) with TS(Tj) > TS(Ti)

  39. Basic Timestamp Ordering • Ri(x): if TS(Ti) < WTS(x), reject it; otherwise (TS(Ti) >= WTS(x)), schedule it and set RTS(x) to max(RTS(x), TS(Ti)) • Wi(x): if TS(Ti) < RTS(x) or TS(Ti) < WTS(x), reject it; otherwise (TS(Ti) >= RTS(x) and TS(Ti) >= WTS(x)), schedule it and set WTS(x) to max(WTS(x), TS(Ti)) • When restarted, Ti is assigned a new timestamp • Thomas Write Rule: • For wi(x), if TS(Ti) < WTS(x) and TS(Ti) >= RTS(x), then wi(x) can be ignored, rather than being rejected. • Why is it correct? • It ignores an obsolete write
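These rules are mechanical enough to sketch directly (the class shape and return strings are illustrative assumptions): each object keeps an RTS and a WTS, and each request is scheduled, ignored (Thomas Write Rule), or rejected.

```python
class TimestampOrdering:
    def __init__(self):
        self.rts = {}  # object -> largest TS of a transaction that read it
        self.wts = {}  # object -> largest TS of a transaction that wrote it

    def read(self, ts, x):
        if ts < self.wts.get(x, 0):
            return "reject"          # would read a value written "in its future"
        self.rts[x] = max(self.rts.get(x, 0), ts)
        return "schedule"

    def write(self, ts, x, thomas=True):
        if ts < self.rts.get(x, 0):
            return "reject"          # a younger transaction already read old x
        if ts < self.wts.get(x, 0):
            # Thomas Write Rule: the write is obsolete, so skip it rather
            # than rejecting; a write with a larger timestamp already "won".
            return "ignore" if thomas else "reject"
        self.wts[x] = ts
        return "schedule"

to = TimestampOrdering()
print(to.write(2, "x"))  # schedule: WTS(x) becomes 2
print(to.write(1, "x"))  # ignore: obsolete write by an older transaction
print(to.read(1, "x"))   # reject: T1 cannot read T2's "future" value
```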

  40. Exercise: Non-equivalence of 2PL and TO • H1 = r2(x) w3(x) C3 w1(y) C1 r2(y) w2(y) C2 • Is this schedule possible under timestamp ordering? H1 is legal with strict timestamp ordering. What is the equivalent serial schedule? T1 T2 T3. • Is it possible with 2PL? It is not: T2 must release its lock on x for T3, but then acquires a lock on y, a violation of two-phaseness.

  41. Relationship between 2PL and TO • Schedules generated by 2PL and TO • They are all correct (serializable) • They are not the same set: H1 shows that • Is the relationship inclusive? Is S{schedules by 2PL} a subset of S{schedules by TO}? Is S{schedules by TO} a subset of S{schedules by 2PL}? • Consider w3(x) C3 w2(x) C2 r1(x). Is it legal with TO? Is it legal with 2PL? • The two sets of schedules intersect, but neither contains the other; both are subsets of SR. [Figure: Venn diagram of SR containing the overlapping 2PL and TO sets]

  42. Failure and Recovery • Failure and consistency • Transaction failures • System failures • Media failures • Principle of recovery • Redundancy • Database can be protected by ensuring that its correct state can be reconstructed from information stored redundantly in the system • Recovering database – restart operation • Bringing the stable DB to a consistent state by removing effects of uncommitted transactions and applying missing effects of committed transactions.

  43. Recovery and Restart • Types of storage media • Volatile storage: fast, but does not survive system failures • Non-volatile storage • Stable storage: information is never lost (practically) • Recovery • Ideally, the stable DB should contain, for each data item, the last value written by a committed transaction • Practically, the stable DB may contain values written by uncommitted transactions, or may not contain the last committed values. • Why? • In-place updates by uncommitted transactions • Buffering of committed values in the cache

  44. Function of Recovery Manager • Atomicity: • Transactions may abort ("Rollback"). • Durability: • What if the DBMS stops running? (Causes?) • Desired behavior after the system restarts: • T1, T2 & T3 should be durable. • T4 & T5 should be aborted (effects not seen). [Figure: timeline with T1, T2, T3 committed before the crash, and T4, T5 still active at the crash]

  45. Recovery Management • Design rules for the recovery manager • Undo rule: committed values must be saved before being overwritten by uncommitted values in the stable DB • Redo rule: before a transaction commits, the new values it wrote must be in stable storage (DB or log) • Restart activity • Preparation: during normal operation • Actual recovery: after failure • Preparation • Logging • Checkpointing

  46. Cache Manager • Two operations: fetch and flush • Uses a dirty bit to decide on the flushing operation • Flush: if the slot in the cache is not dirty, do nothing; otherwise, copy the value into stable storage • Fetch: select a slot, using a replacement algorithm if the cache is full (and flush if necessary), copy the value into the slot, reset the dirty bit, and update the cache directory • When to flush? • Depends on the recovery strategy of the system • Different recovery algorithms use different strategies • Idempotence of restart • Any sequence of incomplete executions of restart, followed by a complete execution of restart, has the same effect as just one complete execution
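A toy cache-manager sketch of these two operations (the slot table, the dict standing in for stable storage, and the naive replacement policy are assumptions for illustration):

```python
class CacheManager:
    def __init__(self, capacity, stable):
        self.capacity, self.stable = capacity, stable
        self.slots = {}   # name -> [value, dirty_bit]

    def flush(self, name):
        value, dirty = self.slots[name]
        if dirty:                        # if the slot is not dirty, do nothing
            self.stable[name] = value    # copy the value into stable storage
            self.slots[name][1] = False

    def fetch(self, name):
        if name in self.slots:
            return self.slots[name][0]
        if len(self.slots) >= self.capacity:
            victim = next(iter(self.slots))  # stand-in replacement policy
            self.flush(victim)               # flush if necessary
            del self.slots[victim]
        self.slots[name] = [self.stable[name], False]  # reset dirty bit
        return self.slots[name][0]

    def write(self, name, value):
        self.fetch(name)
        self.slots[name] = [value, True]     # update in cache, mark dirty

disk = {"x": 1, "y": 2}
cm = CacheManager(capacity=1, stable=disk)
cm.write("x", 10)
cm.fetch("y")      # evicting x forces a flush of the dirty value
print(disk["x"])   # 10
```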

  47. Handling the Memory Pool • Write to disk: force/no-force • Cache page: steal/no-steal • Force every write to disk? • Poor response time. • But provides durability. • Steal buffer-pool frames from uncommitted Xacts? • If not, poor throughput. • If so, how can we ensure atomicity?

                No Steal    Steal
    Force       Trivial
    No Force                Desired

  48. More on Steal and Force • STEAL (why enforcing Atomicity is hard) • To steal frame F: the current page in F (say P) is written to disk; some Xact holds a lock on P. • What if the Xact with the lock on P aborts? • Must remember the old value of P at steal time (to support UNDOing the write to page P). • NO FORCE (why enforcing Durability is hard) • What if the system crashes before a modified page is written to disk? • Write as little as possible, in a convenient place, at commit time, to support REDOing modifications.

  49. Recovery Algorithms • Undo/redo algorithm • Most complicated of the four recovery algorithms • Flexible in deciding when to flush (no-force) • Maximizes efficiency during normal operation at the expense of less efficient recovery • Comparison with other recovery algorithms • Issues: disk I/O, log space, recovery time • No-redo requires more frequent flushes (force) • An uncommitted transaction is allowed to replace a dirty slot (in-place update), so undo might be necessary • Restart procedure • Process the log forward for redo and backward for undo
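A sketch of the undo/redo restart procedure (the log-record format and the dict standing in for the stable DB are assumptions for illustration): a forward pass redoes the writes of committed transactions, then a backward pass undoes the writes of transactions that never committed, restoring their before images.

```python
def restart(log, stable_db):
    # log: records ("write", txn, obj, before, after) and ("commit", txn)
    committed = {t for op, t, *_ in log if op == "commit"}
    for op, t, *rest in log:               # forward pass: REDO winners
        if op == "write" and t in committed:
            x, before, after = rest
            stable_db[x] = after
    for op, t, *rest in reversed(log):     # backward pass: UNDO losers
        if op == "write" and t not in committed:
            x, before, after = rest
            stable_db[x] = before          # restore the before image
    return stable_db

log = [("write", 1, "x", 0, 5), ("commit", 1),
       ("write", 2, "x", 5, 9)]            # T2 never committed
print(restart(log, {"x": 9}))              # {'x': 5}: T1 redone, T2 undone
```

Running restart again on the result changes nothing, which illustrates the idempotence-of-restart requirement from the Cache Manager slide.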
