
DISTRIBUTED TRANSACTION



  1. DISTRIBUTED TRANSACTION FASILKOM UNIVERSITAS INDONESIA

  2. What is a Transaction? • An atomic unit of database access, which is either completely executed or not executed at all. • It consists of an application-specified sequence of operations, beginning with a begin_transaction primitive and ending with either commit or abort.

  3. E.g. • Transfer $200 from account A in London to account B in Depok:

      begin_transaction
          amntA = lookup amount in account A
          amntB = lookup amount in account B
          if (amntA < $200) abort
          set account A = amntA - $200
          set account B = amntB + $200
      commit
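As a minimal sketch, the same transfer can be written in Python; the accounts dictionary, Abort exception, and transfer function are illustrative stand-ins, not a real database API:

      class Abort(Exception):
          pass

      accounts = {"A": 500, "B": 100}    # assumed starting balances

      def transfer(amount):
          snapshot = dict(accounts)      # begin_transaction: remember the old state
          try:
              amntA = accounts["A"]
              amntB = accounts["B"]
              if amntA < amount:
                  raise Abort("insufficient funds")
              accounts["A"] = amntA - amount
              accounts["B"] = amntB + amount
              # commit: the changes stay in place
          except Abort:
              accounts.update(snapshot)  # abort: restore the old state

      transfer(200)
      print(accounts)                    # {'A': 300, 'B': 300}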

  4. Transaction Properties • Four main properties, the ACID properties: • Atomicity: A transaction must be all or nothing. • Consistency: A transaction takes the system from one consistent state to another consistent state. • Isolation: The results of an incomplete transaction are not allowed to be revealed to other transactions. • Durability: The results of a committed transaction will never be lost, independent of subsequent failures. • Atomicity & durability -> failure tolerance

  5. Failure Tolerance • Atomicity & durability -> failure tolerance • Types of failures: • Transaction-local failures detected by the application (e.g. insufficient funds) • Transaction-local failures not detected by the application (e.g. divide by zero) • System failures affecting volatile storage (e.g. CPU failure) • Media failures (e.g. HD crash) • What is volatile storage? • What is stable storage?

  6. Recovery • Based on redundancy. • For example: 1. Periodically archive the database. 2. Every time a change is made, record the old and new values in a log. 3. If a failure occurs: • If the physical database is undamaged, undo all unreliable changes. • If the database is physically damaged, restore it from the archive and redo the changes.

  7. Logging (1) • Database vs transaction log. • For each change (and for begin_transaction, commit, and abort), write a log record with: • Transaction ID (TID) • Record ID • Type of action • Old value of record • New value of record • Other info, e.g. a pointer to the previous log record of this transaction.
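As a sketch, such a log record might be represented like this in Python (the field names are assumptions mirroring the bullets above):

      from dataclasses import dataclass
      from typing import Any, Optional

      @dataclass
      class LogRecord:
          tid: int                            # Transaction ID (TID)
          record_id: Optional[str]            # Record ID (None for begin/commit/abort)
          action: str                         # 'begin', 'update', 'commit' or 'abort'
          old_value: Any = None               # before-image, used for undo
          new_value: Any = None               # after-image, used for redo
          prev: Optional["LogRecord"] = None  # previous log record of this transaction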

  8. Logging (2) • After a failure we need to undo or redo changes. • Undo and redo must be idempotent as there may be a failure whilst they are executing.
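Undo and redo come out idempotent if they install the logged before- and after-images rather than re-applying deltas; a sketch, reusing the LogRecord fields above:

      def undo(db, rec):
          # Installing the old value twice leaves the database exactly as
          # installing it once does, so undo is idempotent.
          db[rec.record_id] = rec.old_value

      def redo(db, rec):
          db[rec.record_id] = rec.new_value

      # By contrast, a delta such as db[rec.record_id] += 200 would not be
      # idempotent: a crash during recovery followed by a re-run would
      # apply it twice.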

  9. Log Write-ahead Protocol (1) • Before performing any update, at least the undo portion of the log record must be written to stable storage. • Before committing a transaction, all log records must have been fully recorded on stable storage. The commit record is written after these.
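A sketch of the two rules, with a Python list standing in for stable storage (StableLog and force are illustrative names, not a real API):

      class StableLog:
          def __init__(self):
              self.buffer = []                 # volatile log buffer
              self.stable = []                 # stands in for stable storage

          def append(self, rec):
              self.buffer.append(rec)

          def force(self):
              self.stable.extend(self.buffer)  # flush log buffers to "disk"
              self.buffer.clear()

      def update(log, db, rec):
          log.append(rec)
          log.force()                          # rule 1: the undo portion is stable
          db[rec.record_id] = rec.new_value    # ...before the database is updated

      def commit(log, tid):
          log.force()                          # rule 2: all the transaction's records
          log.append(("commit", tid))          # are stable, then the commit record
          log.force()                          # itself is written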

  10. Log Write-ahead Protocol (2) • Reason for the first rule: • If we write the log before changing the database: • log -- change -- crash: the log lets us undo the change (OK) • log -- crash: no change was made; the idempotent undo is harmless (OK) • If we write the log after changing the database: • change -- log -- crash: we can still undo (OK) • change -- crash: no log record, so we can't undo (FAILS)

  11. Checkpointing (1) • How does the recovery manager know which transactions to undo and which to redo after a failure? • Naive approach: • Examine the entire log from the start, looking for begin transaction records: • if a corresponding commit record exists, redo; • if there's an abort, do nothing; and • if neither, undo.

  12. Checkpointing (2) • Alternative: • Every so often: 1) Force all log buffers to disk. 2) Write a checkpoint record to disk containing: a) A list of all active transactions b) The most recent log records for each transaction in a) 3) Force all database buffers to disk - disk is now totally up-to-date. 4) Write address of checkpoint record to fixed ‘restart location’ (had better be atomic).
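A sketch of the four steps, reusing StableLog from the earlier sketch; restart_location is an assumed stand-in for the fixed restart location, and step 2b (the most recent log record of each active transaction) is omitted for brevity:

      restart_location = {}                              # assumed fixed restart location

      def take_checkpoint(log, db_buffers, disk_db, active_tids):
          log.force()                                    # 1) force all log buffers
          log.append(("checkpoint", list(active_tids)))  # 2a) list of active transactions
          log.force()
          disk_db.update(db_buffers)                     # 3) force database buffers:
          db_buffers.clear()                             #    disk is now up-to-date
          restart_location["ckpt"] = len(log.stable) - 1 # 4) one small (atomic) write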

  13. Checkpointing (3) • There are 5 categories of transaction, classified by their position relative to the checkpoint and the crash: • T1: committed before the checkpoint -- leave (no action needed) • T2: started before the checkpoint, committed before the crash -- redo • T3: started before the checkpoint, still active at the crash -- undo • T4: started after the checkpoint, committed before the crash -- redo • T5: started after the checkpoint, still active at the crash -- undo

  14. Recovery (1) • Look for the most recent checkpoint record. • For the transactions active at the checkpoint (and any started after it): • undo all that were still active at the failure • redo all others

  15. Recovery (2) • Keep 2 lists: undo and redo • Initially, undo contains all TIDs in the checkpoint record & redo is empty • 3 passes through the log: • Forwards from the checkpoint to the end: • If we find 'begin_transaction', add the TID to the undo list. • If we find 'commit', transfer the TID from the undo list to the redo list. • If we find 'abort', remove the TID from the undo list. • Backwards from the end to the checkpoint: undo. • Forwards from the checkpoint to the end: redo.
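A sketch of the three passes, assuming log records are tuples of the form ('begin'|'commit'|'abort', tid), ('update', tid, record_id, old, new) or ('checkpoint', [active TIDs]); a full implementation would also follow each undone transaction's records back before the checkpoint, via the pointers kept in the checkpoint record:

      def recover(log, ckpt_index, db):
          undo_list = set(log[ckpt_index][1])   # TIDs in the checkpoint record
          redo_list = set()
          tail = log[ckpt_index + 1:]
          for rec in tail:                      # pass 1: forwards to the end
              action, tid = rec[0], rec[1]
              if action == "begin":
                  undo_list.add(tid)
              elif action == "commit":
                  undo_list.discard(tid)
                  redo_list.add(tid)
              elif action == "abort":
                  undo_list.discard(tid)
          for rec in reversed(tail):            # pass 2: backwards, undoing
              if rec[0] == "update" and rec[1] in undo_list:
                  db[rec[2]] = rec[3]           # install the old value
          for rec in tail:                      # pass 3: forwards, redoing
              if rec[0] == "update" and rec[1] in redo_list:
                  db[rec[2]] = rec[4]           # install the new value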

  16. Commit Protocols • Assume a set of cooperating managers which deal with parts of a transaction. • For atomicity we must ensure that: • at each site, either all actions or none are performed; and • all sites take the same decision on whether to commit or abort.

  17. Two Phase Commit (2PC) Protocol - 1 • One node, the coordinator, has a special role; the others are participants. • The coordinator initiates the 2PC protocol. • If any participant cannot commit, then all sites must abort.

  18. 2PC – 2 • Phase I: • reach a common decision on whether to abort or commit • Phase II: • Implement the decision at all sites

  19. 2PC - 3 • [State-transition diagram of the 2PC coordinator and participant] • States: I = Initial state, U = Undecided, R = Ready to commit, A = Abort, C = Commit • Messages: PM = Prepare Message, RM = Ready Message, AAM = Abort Answer Message, ACM = Abort Command Message, CCM = Commit Command Message • Other transitions: ua = unilateral abort, tm = timeout

  20. 2PC – Phase 1 • Coordinator: • Write prepare record to log • Multicast prepare message and set timeout • Participant: • Wait for prepare message • If we are willing to commit then • force log records to stable storage • write ready record in log • send ready message to coordinator • else • write ABORT in log • send abort answer message to coordinator

  21. 2PC – Phase 2 (1) • Coordinator: • wait for reply messages (ready or abort) or a timeout • If the timeout expires or any message is abort: • write a global abort record in the log • send an abort command message to all participants • else, if all answers were ready: • write a global commit record to the log • send a commit command message to all participants

  22. 2PC – Phase 2 (2) • Participants: • Wait for command message (abort or commit) • write abort or commit in the log • send ack message to coordinator • execute command (may be null) • Coordinator: • wait for ack messages from all participants • write complete in the log
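Both phases can be condensed into a single-process simulation. This is a sketch only: messages are plain function calls, logs are Python lists, and the timeout and message-loss handling of the following slides is left out; all names are illustrative:

      import random

      class Participant:
          def __init__(self, name):
              self.name = name
              self.log = []

          def on_prepare(self):
              # Phase 1: vote. A real participant would first force its log
              # records to stable storage, then write ready (or abort).
              if random.random() < 0.9:          # assumed: usually willing
                  self.log.append("ready")
                  return "ready"
              self.log.append("abort")
              return "abort-answer"

          def on_command(self, decision):
              # Phase 2: record the decision and acknowledge it.
              self.log.append(decision)
              return "ack"

      def run_2pc(participants):
          log = ["prepare"]                                # write prepare record
          votes = [p.on_prepare() for p in participants]   # multicast prepare
          decision = "commit" if all(v == "ready" for v in votes) else "abort"
          log.append("global-" + decision)                 # global commit/abort
          acks = [p.on_command(decision) for p in participants]
          if len(acks) == len(participants):               # all acks received
              log.append("complete")
          return decision, log

      print(run_2pc([Participant("london"), Participant("depok")]))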

  23. 2PC – Site Failures • Resilient to all failures in which no log information is lost. • Site failures: • A participant fails before having written ready to its log: • the timeout expires ---> ABORT • A participant fails after having written ready to its log: • Msg sent -- the others take the decision; this node gets the outcome from the coordinator or other participants after restart • Msg unsent -- the timeout expires ---> ABORT

  24. 2PC – Coordinator Failures • The coordinator fails after writing prepare but before global commit/global abort (global X): • All participants must wait for recovery of the coordinator -> BLOCKING • Recovery of the coordinator involves restarting the protocol from the identities in the prepare log record • Participants must identify duplicate prepare messages • The coordinator fails after having written global X but before writing complete: • On restart, the coordinator must resend the decision, to ensure blocked processes get it. The others must discard duplicates. • The coordinator fails after having written complete: • No action needed

  25. 2PC – Lost Messages • A reply message (ready or abort) from a participant is lost. • Timeout expires -- coordinator ABORTs • A prepare message is lost. • Timeout expires -- coordinator ABORTs • A commit/abort command message is lost. • Timeout in participant -- request repetition of command from the coordinator. • An ack message is lost • Timeout in coordinator -- coordinator resends command

  26. 2PC - Partitions • Everything aborts, as the coordinator can't contact all participants. Participants in a partition without the coordinator may remain blocked, and their resources are retained until the blocked participants are unblocked.

  27. 2PC - Comments • Blocking is a problem if the coordinator or network fails, which reduces availability -> use 3PC. • Unilateral abort: • Any node can abort until it sends ready (site autonomy before the ready state). • Efficiency can be increased by: • Elimination of prepare messages: participants that can commit send RM automatically. • Presumed commit/abort if no information is found in the log. See [CER84] 13.5.1, 2, & 3.

  28. Impossible Termination in 2PC • No operational participant has received the command: the operational participants are in the R state but have received neither the ACM nor the CCM, AND • At least one participant has failed. Unfortunately, the failed participant acted as the coordinator.

  29. Impossible Termination in 2PC • The failed participant might already have performed the action (commit or abort), i.e. it may be in the C or A state. • The operational participants can't know what the failed participant did, and can't take an independent decision. • The problem is solved by 3PC.

  30. 3PC (1) • [State-transition diagram of the 3PC coordinator and participant, including the restart transitions "Restart 1" and "Restart 2"] • New states: PC = Prepared to Commit, BC = Before Commit • New messages: PCM = Prepare to Commit Message, OK = Entered PC state

  31. 3PC (2) • Case study: • See slide 3. • London: Coordinator & Participant 1 • Depok: Participant 2

  32. 3PC (3) • 3PC avoids problems with 2PC: • If any operational participant has received an abort, then all can abort. The failed participant will abort at restart if it hasn't already. [As in 2PC] E.g. Depok fails; London is operational and has received an ACM. • If any participant has received the PCM, then all can commit. The failed participant cannot have aborted unilaterally, because it had answered READY (RM). The failed participant will commit at restart (see "Restart 1"). E.g. London fails; Depok is operational and has received the PCM.

  33. 3PC (4) • If none of the operational participants has received the PCM, i.e. all of the operational participants are in the R state, then 2PC would block. With 3PC we can abort safely, since the failed participant cannot have committed: at most it has received the PCM, so it can abort at restart (see "Restart 2"). E.g. London fails; Depok is operational and has NOT received the PCM (it is in the R state).

  34. 3PC (5) • 3PC guarantees that no failure during the 2nd phase can cause a blocking condition. • Failures during the 3rd phase -> blocking??? • If the coordinator fails in the 3rd phase, elect another coordinator and continue the commit process (since all remaining participants must be in the PC state).

  35. Consistency & Isolation • Consistency & isolation -> concurrency control. • The Lost Update Problem:

      Time   Transaction 1   Transaction 2
      t1     Read X
      t2                     Read X
      t3     Update X
      t4                     Update X    <- T1's update is lost

  36. The Uncommitted Dependency (Temporary Update) Problem

      Time   Transaction 1   Transaction 2
      t1                     Update X
      t2     Read X                      <- T1 reads a temporary, incorrect
      t3                     ABORT          value of X, because Transaction 2
                                            is aborted

  37. The Inconsistent Analysis Problem

      Time   Transaction 1      Transaction 2
      t1     sum := 0
      t2     Read A
      t3     sum := sum + A                    <- before the update by Transaction 2
      t4                        Read A
      t5                        Read B
      t6                        Update A
      t7                        Update B
      t8                        COMMIT
      t9     Read B
      t10    sum := sum + B                    <- after the update by Transaction 2

  38. Concurrent Transactions • If we have concurrent transactions, we must prevent interference. • cf. the lost update problem: • Prevent T2's read (because T1 has seen the value and may update it) [Locking] • Prevent T1's update (because T2 has seen the value) [Locking] • Prevent T2's update (because T1 has already updated the value, so T2's update is based on an obsolete value) [Timestamping] • Have them work independently and resolve difficulties on commit [Optimistic concurrency control]

  39. Serializability • What we need is some notion of correctness. • Serializability is the notion usually used for transactions.

  40. Serial Transactions • Two transactions execute serially if all operations of one precede all operations of the other. E.g.: S1: Ri(x) Wi(x) Ri(y) Rj(x) Wj(y) Rk(y) Wk(x), i.e. S1: Ti Tj Tk; S2: Tk Tj Ti; … • S1 = Schedule 1, S2 = Schedule 2 • All serial schedules are correct, but restrictive of concurrency.

  41. Transaction Conflict • Two operations are in conflict if: • At least one is a write • They both act on the same data • They are issued by different transactions • Which of the following are in conflict? Ri(x) Rj(x) Wi(y) Rk(y) Wj(x)
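A sketch of the conflict test in Python, modeling each operation as an (action, transaction, item) tuple; running it answers the question above:

      def in_conflict(op1, op2):
          (a1, t1, x1), (a2, t2, x2) = op1, op2
          return (x1 == x2               # they act on the same data
                  and t1 != t2           # issued by different transactions
                  and "W" in (a1, a2))   # at least one is a write

      schedule = [("R", "i", "x"), ("R", "j", "x"), ("W", "i", "y"),
                  ("R", "k", "y"), ("W", "j", "x")]
      for m, op1 in enumerate(schedule):
          for op2 in schedule[m + 1:]:
              if in_conflict(op1, op2):
                  print(op1, "conflicts with", op2)
      # Prints exactly two pairs: Ri(x)/Wj(x) and Wi(y)/Rk(y)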

  42. Computationally Equivalent • Two schedules (S1 & S2) are computationally equivalent if: • The same operations are involved (possibly reordered) • For every pair of operations in conflict (Oi & Oj), such that Oi precedes Oj in S1, Oi also precedes Oj in S2.

  43. Serializable Schedule • A schedule is serializable if it is computationally equivalent to a serial schedule. E.g.: Ri(x) Rj(x) Wj(y) Wi(x) (which is not a serial schedule) is computationally equivalent to: Rj(x) Wj(y) Ri(x) Wi(x) (which is a serial schedule: Tj Ti). • The following is NOT a serial schedule, but is it serializable? Ri(x) Rj(x) Wi(y) Rk(y) Wj(x) • The above schedule is computationally equivalent to the serial schedules Ti Tj Tk and Ti Tk Tj.
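On small schedules, serializability can be checked by brute force against every serial ordering, reusing in_conflict and schedule from the sketch under slide 41 (this sketch assumes no operation appears twice; real systems build a precedence graph instead):

      from itertools import permutations

      def equivalent(s1, s2):
          # Computationally equivalent: same operations, and every pair of
          # conflicting operations appears in the same order in both.
          if sorted(s1) != sorted(s2):
              return False
          return all(s2.index(o1) < s2.index(o2)
                     for m, o1 in enumerate(s1)
                     for o2 in s1[m + 1:]
                     if in_conflict(o1, o2))

      def serializable(s):
          tids = sorted({t for (_, t, _) in s})
          for order in permutations(tids):
              serial = [op for t in order for op in s if op[1] == t]
              if equivalent(s, serial):
                  return order                  # an equivalent serial order
          return None

      print(serializable(schedule))             # ('i', 'j', 'k')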

  44. Serializability in Distributed Systems (1) • A local concurrency control mechanism isn't sufficient. E.g.: • Site 1: Ri(x) Wi(x) Rj(x) Wj(x), i.e. Ti < Tj • Site 2: Rj(y) Wj(y) Ri(y) Wi(y), i.e. Tj < Ti

  45. Serializability in Distributed Systems (2) • Let T1…Tn be a set of transactions and E be an execution of these, modeled by schedules S1…Sm on machines 1…m. • Each local schedule S1…Sm is serializable. • Then E is serializable (in the distributed sense) if, for all i and j, all conflicting operations from Ti and Tj have the same order in each of the schedules, i.e. there is a global total ordering for all sites.

  46. Locking (1) • How to implement serializability? Use locking. • Shared/eXclusive (Read/Write) locks: • A transaction T must hold SLock x or XLock x before any Read x. • A transaction T must hold XLock x before any Write x. • A transaction T must issue unLock x after the Read x or Write x is completed.

  47. Locking (2) • A transaction T can upgrade a lock, i.e. issue an XLock x after holding SLock x, as long as T is the only transaction holding SLock x. Otherwise T must wait. • A transaction T can downgrade a lock, i.e. issue an SLock x after holding XLock x.
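A sketch of a lock table enforcing the rules on the last two slides, including the upgrade rule (single-threaded, so a conflicting request returns False for "must wait" rather than blocking; all names are illustrative):

      class LockTable:
          def __init__(self):
              self.locks = {}                   # item -> (mode, set of holder TIDs)

          def slock(self, tid, item):
              mode, holders = self.locks.get(item, ("S", set()))
              if mode == "X" and holders != {tid}:
                  return False                  # exclusively held by another: wait
              self.locks[item] = (mode if holders else "S", holders | {tid})
              return True

          def xlock(self, tid, item):
              mode, holders = self.locks.get(item, ("S", set()))
              if holders - {tid}:
                  return False                  # any other holder: wait
              self.locks[item] = ("X", {tid})   # a new X lock, or an upgrade
              return True                       # of tid's own S lock

          def unlock(self, tid, item):
              _, holders = self.locks.get(item, ("S", set()))
              holders.discard(tid)
              if not holders:
                  self.locks.pop(item, None)

      lt = LockTable()
      print(lt.slock(1, "x"))   # True
      print(lt.slock(2, "x"))   # True: shared locks are compatible
      print(lt.xlock(1, "x"))   # False: T2 also holds SLock x, so T1 must wait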

  48. Locking (3) • E.g. T1: X = X + Y; T2: Y = X + Y • If initially X=20, Y=30 then either: • S1: T1 < T2: X=50, Y=80 • S2: T2 < T1: X=70, Y=50 • Both are serial schedules, thus both are correct.

  49. Locking (4) • However using Shared/eXclusive (Read/Write) locks does NOT guarantee serializability. • If any transaction releases a lock and then acquires another, it may produce incorrect results.

  50. Locking (5)

      Time   T1                       T2
      t1     SLock y
      t2     temp1 = y          (30)
      t3     unLock y
      t4                              SLock x
      t5                              temp2 = x          (20)
      t6                              unLock x
      t7                              XLock y
      t8                              temp3 = y
      t9                              y = temp2 + temp3  (50)
      t10                             unLock y
      t11                             COMMIT
      t12    XLock x
      t13    temp4 = x          (20)
      t14    x = temp4 + temp1  (50)
      t15    unLock x
      t16    COMMIT

      The schedule is NOT serializable!!! So it is NOT correct.
