CORAL network glitch

Andrea Valassi (IT-ES)
IT-ES Persistency Team meeting, 19th November 2010

“Network glitch” overview
• Reported by all experiments in various cases
  • “A transaction is not active” in CORAL server (bug #65597)
  • ORA-24327 “need explicit attach” in ATLAS/CMS (bug #24327)
  • OracleAccess crash after losing session in LHCb (bug #73334)
• What should CORAL do? Many different scenarios
  • e.g. non-serializable R/O transaction: should reconnect and restart it
  • e.g. DDL not committed in update transaction: cannot do anything
• What is CORAL doing now?
  • Correctly reconnecting in some cases (existing useful features)
  • Not doing anything in other cases (missing useful features)
  • Reconnecting in the wrong way in other cases (bugs!)
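The two example scenarios above can be read as a small decision table. The sketch below is only an illustration of that reading: `TransactionSnapshot`, `RecoveryAction` and `chooseRecovery` are hypothetical names, not CORAL types. Treating every case the slide does not mention (serializable R/O, any update transaction) as unrecoverable is a conservative assumption, not something the slide states.

```cpp
// Hypothetical snapshot of the transaction state just before the glitch.
// These names are illustrative and are not part of the CORAL API.
struct TransactionSnapshot {
    bool wasActive;     // was a transaction open before the glitch?
    bool readOnly;      // R/O transaction (true) or update transaction (false)
    bool serializable;  // serializable R/O transaction?
};

enum class RecoveryAction {
    ReconnectOnly,        // restore connection/session, start no transaction
    ReconnectAndRestart,  // restore and restart the transaction
    ReportFailure         // nothing safe can be done; report the error
};

// Decide what to do after a glitch, following the two examples on the slide.
inline RecoveryAction chooseRecovery(const TransactionSnapshot& t) {
    if (!t.wasActive) {
        return RecoveryAction::ReconnectOnly;
    }
    if (t.readOnly && !t.serializable) {
        // Slide example: non-serializable R/O -> reconnect and restart it.
        return RecoveryAction::ReconnectAndRestart;
    }
    // Slide example: uncommitted DDL in an update transaction cannot be
    // recovered. Assumption: all other unlisted cases are handled the same.
    return RecoveryAction::ReportFailure;
}
```

A caller would build the snapshot from the state recorded before the glitch and act on the returned value, rather than always reconnecting.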

General directions
• 1. Catalog the different scenarios
• 2. Prepare tests for each different scenario
  • Using CppUnit…
• 3. Prototype the implementation changes
  • ConnectionSvc and/or plugins?

Connection, session, transaction
• A network glitch causes the loss of several states:
  • The state of the connection
  • The state of the session
  • The state of the transaction
• We must separately keep track of each ‘old’ state
  • And then separately restore each state (only if possible/correct)
• Example: two sessions over a shared connection
  • We must reconnect once, restart two logical sessions, and then restart up to two transactions if possible
• It may be appropriate to restore/refresh the states in the three separate classes
  • Connection, Session, Transaction
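The shared-connection example can be sketched as a record of the ‘old’ states plus a function that works out what has to be restored. All names here (`TxState`, `SessionRecord`, `ConnectionRecord`, `RecoveryPlan`, `planRecovery`) are hypothetical and not CORAL classes. As an assumption, only read-only transactions are counted as restartable.

```cpp
#include <vector>

// Hypothetical bookkeeping of the state before the glitch (not CORAL types).
enum class TxState { None, ReadOnly, Update };

struct SessionRecord {
    bool wasOpen = false;                    // logical session open before the glitch?
    TxState txBeforeGlitch = TxState::None;  // transaction active before the glitch?
};

struct ConnectionRecord {
    bool wasConnected = false;
    std::vector<SessionRecord> sessions;     // logical sessions sharing this connection
};

struct RecoveryPlan {
    int reconnects = 0;
    int sessionRestarts = 0;
    int transactionRestarts = 0;
};

// Work out how many states must be restored at each of the three levels.
inline RecoveryPlan planRecovery(const ConnectionRecord& c) {
    RecoveryPlan plan;
    if (!c.wasConnected) {
        return plan;                         // nothing was lost, nothing to restore
    }
    plan.reconnects = 1;                     // one physical reconnect, however many sessions
    for (const SessionRecord& s : c.sessions) {
        if (!s.wasOpen) continue;
        ++plan.sessionRestarts;
        // Assumption: only read-only transactions can be restarted safely.
        if (s.txBeforeGlitch == TxState::ReadOnly) {
            ++plan.transactionRestarts;
        }
    }
    return plan;
}
```

With two open sessions, one of which had a read-only transaction, this plan gives one reconnect, two session restarts and one transaction restart, which matches the “up to two transactions” wording on the slide.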

Detecting a network glitch
• ‘I am not connected’ does not mean ‘I lost the connection’
• We need a method/mechanism other than just “isConnected” or “isUserSessionActive” or “Transaction::isActive”
  • for instance: connectionWasLost(), sessionWasLost(), transactionWasLost()
• Again: we must keep track of the old state…
  • Example: we should NOT start a new transaction if there was no transaction active before the glitch!
• Add some tests for similar scenarios too…
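The difference between ‘not active’ and ‘lost’ can be illustrated with a small tracker that remembers whether the state was active when the glitch occurred. `LossTracker` is a hypothetical helper, not a CORAL class. One instance per level would back methods such as the proposed connectionWasLost(), sessionWasLost() and transactionWasLost().

```cpp
// Hypothetical helper (not a CORAL class): remembers whether a state was
// active when the glitch hit, so "not active" and "lost" stay distinguishable.
class LossTracker {
public:
    // Normal start: connect, open session, or start transaction.
    void activate()   { m_active = true;  m_lost = false; }

    // Normal end requested by the user: disconnect, close, commit/rollback.
    void deactivate() { m_active = false; m_lost = false; }

    // A network glitch was detected: only an active state can be "lost".
    void onGlitch() {
        m_lost = m_lost || m_active;
        m_active = false;
    }

    bool isActive() const { return m_active; }
    bool wasLost()  const { return m_lost; }

private:
    bool m_active = false;
    bool m_lost = false;
};
```

A transaction tracker that was never activated reports wasLost() == false after a glitch, so the recovery code has no reason to start a new transaction.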

Recovering from the glitch
• In general: refresh instances rather than create new ones
  • Previous CORAL was closing the session and creating a new one
  • This leads to segmentation faults and other problems
  • Better approach: keep existing C++ instances and refresh them
• Add new methods specific to refreshing the states
  • Separately for the three (or more) classes
  • Encapsulate all loops in those methods
  • e.g. for how long should we retry to reconnect?
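A minimal sketch of the “refresh in place” idea, with the retry loop encapsulated in the refresh method. `RefreshableConnection` and `RetryPolicy` are hypothetical names, and the low-level reconnect is passed in as a callback purely for illustration. How long CORAL should really retry is the open question raised on the slide, so the policy values are left to the caller.

```cpp
#include <chrono>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical retry settings; the values are for the caller to choose.
struct RetryPolicy {
    int maxAttempts;                  // how many times to try reconnecting
    std::chrono::milliseconds delay;  // pause between two attempts
};

// Hypothetical connection wrapper (not a CORAL class). The same C++ instance
// survives the glitch; only its internal state is refreshed.
class RefreshableConnection {
public:
    // tryConnect stands in for the real low-level reconnect call.
    explicit RefreshableConnection(std::function<bool()> tryConnect)
        : m_tryConnect(std::move(tryConnect)) {}

    // The whole retry loop lives here, so callers never write their own.
    bool refresh(const RetryPolicy& policy) {
        for (int attempt = 1; attempt <= policy.maxAttempts; ++attempt) {
            if (m_tryConnect()) {
                m_connected = true;
                return true;
            }
            if (attempt < policy.maxAttempts) {
                std::this_thread::sleep_for(policy.delay);
            }
        }
        m_connected = false;
        return false;
    }

    bool isConnected() const { return m_connected; }

private:
    std::function<bool()> m_tryConnect;
    bool m_connected = false;
};
```

Because refresh() updates the existing object, pointers and references held by user code stay valid, which avoids the kind of segmentation fault attributed above to closing the session and creating a new one. Session and transaction classes could expose analogous refresh methods.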

Generic coding conventions
• Long discussions last year and some hints on twiki
  • But not completely formalised (sorry…)
• Please avoid
  • names that are not clear/relevant
  • file names that contain classes with different names
  • e.g. ScopedTransactionStatus in QueryMgr
• Please do
  • keep it simple whenever possible!
  • avoid very general approaches to solve simple specific issues
  • avoid adding classes/headers that are not relevant
