CORAL network glitch
Andrea Valassi (IT-ES)
IT-ES Persistency Team meeting, 19th November 2010
“Network glitch” overview
• Reported by all experiments in various cases
  • “A transaction is not active” in CORAL server (bug #65597)
  • ORA-24327 “need explicit attach” in ATLAS/CMS (bug #24327)
  • OracleAccess crash after losing session in LHCb (bug #73334)
• What should CORAL do? Many different scenarios
  • e.g. non serializable R/O transaction: should reconnect and restart it
  • e.g. DDL not committed in update transaction: cannot do anything
• What is CORAL doing now?
  • Correctly reconnecting in some cases (existing useful features)
  • Not doing anything in other cases (missing useful features)
  • Reconnecting in the wrong way in other cases (bugs!)
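The two example scenarios above can be written down as a small decision function. This is only a sketch: `TransactionSnapshot`, `RecoveryAction` and `chooseRecovery` are hypothetical names, not CORAL classes, and every case not named on the slide is treated conservatively as an assumption.

```cpp
#include <cassert>

// Hypothetical record of the transaction state just before the glitch.
struct TransactionSnapshot {
    bool active;              // was a transaction open before the glitch?
    bool readOnly;            // R/O transaction?
    bool serializable;        // serializable isolation requested?
    bool uncommittedChanges;  // e.g. DDL not yet committed (update transactions only)
};

enum class RecoveryAction { ReconnectOnly, ReconnectAndRestart, CannotRecover };

inline RecoveryAction chooseRecovery(const TransactionSnapshot& t) {
    // A read-only transaction cannot hold uncommitted changes.
    assert(!(t.readOnly && t.uncommittedChanges));

    // No transaction before the glitch: reconnect, but do not start one.
    if (!t.active) return RecoveryAction::ReconnectOnly;

    // Slide example: non-serializable R/O transaction, reconnect and restart it.
    if (t.readOnly && !t.serializable) return RecoveryAction::ReconnectAndRestart;

    // Slide example: uncommitted DDL in an update transaction, nothing can be done.
    if (!t.readOnly && t.uncommittedChanges) return RecoveryAction::CannotRecover;

    // Assumption: all remaining scenarios (e.g. serializable R/O) are not
    // restarted transparently until they have been catalogued and tested.
    return RecoveryAction::CannotRecover;
}
```

Keeping the decision in one function makes it easy to add one test per catalogued scenario.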
General directions
• 1. Catalog the different scenarios
• 2. Prepare tests for each different scenario
  • Using CppUnit…
• 3. Prototype the implementation changes
  • ConnectionSvc and/or plugins?
Connection, session, transaction
• A network glitch causes a loss of many states:
  • The state of the connection
  • The state of the session
  • The state of the transaction
• We must separately keep track of each ‘old’ state
  • And then separately restore each state (only if possible/correct)
• Example: two sessions over a shared connection
  • We must reconnect once, restart two logical sessions, and then restart up to two transactions if possible
• It may be appropriate to restore/refresh the states in the three separate classes
  • Connection, Session, Transaction
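The shared-connection example can be sketched as follows. All types here (`Connection`, `Session`, `Transaction`, `simulateGlitch`, `recover`) are simplified stand-ins invented for illustration, not the CORAL classes of the same name, and the `restartable` flag is an assumed placeholder for whatever rule decides that a restart is correct.

```cpp
#include <cassert>
#include <vector>

struct Connection {
    bool connected = false;
    int reconnectCount = 0;
    void refresh() { connected = true; ++reconnectCount; }
};

struct Transaction {
    bool active = false;
    bool wasActiveBeforeGlitch = false;  // the 'old' state
    bool restartable = false;            // assumed flag, e.g. non-serializable R/O
    // Restart only if a transaction existed before the glitch and a restart is safe.
    bool refresh() {
        if (!wasActiveBeforeGlitch || !restartable) return false;
        active = true;
        return true;
    }
};

struct Session {
    bool open = false;
    bool wasOpenBeforeGlitch = false;    // the 'old' state
    int restartCount = 0;
    Transaction transaction;
    void refresh() {
        if (!wasOpenBeforeGlitch) return;
        open = true;
        ++restartCount;
    }
};

// Simulate the glitch: remember each old state, then drop all current states.
inline void simulateGlitch(Connection& c, std::vector<Session*>& sessions) {
    c.connected = false;
    for (Session* s : sessions) {
        s->wasOpenBeforeGlitch = s->open;
        s->transaction.wasActiveBeforeGlitch = s->transaction.active;
        s->open = false;
        s->transaction.active = false;
    }
}

// Reconnect once, then restore each session and transaction separately.
// Returns the number of transactions that were restarted.
inline int recover(Connection& c, std::vector<Session*>& sessions) {
    if (!c.connected) c.refresh();
    assert(c.connected);
    int restarted = 0;
    for (Session* s : sessions) {
        s->refresh();
        if (s->open && s->transaction.refresh()) ++restarted;
    }
    return restarted;
}
```

With two sessions on one connection, of which only one had a restartable transaction open, `recover` reconnects once, restarts both sessions and restarts a single transaction.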
Detecting a network glitch
• ‘I am not connected’ does not mean ‘I lost the connection’
• We need a separate method/mechanism, beyond just “isConnected”, “isUserSessionActive” or “Transaction::isActive”
  • for instance: connectionWasLost(), sessionWasLost(), transactionWasLost()
• Again: we must keep track of the old state…
• Example: we should NOT start a new transaction if there was no transaction active before the glitch!
• Add some tests for similar scenarios as well…
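A minimal sketch of the difference between ‘not active’ and ‘lost’, using the transaction as the example. `TrackedTransaction` and `onGlitch` are hypothetical names; only `transactionWasLost()` is taken from the proposal above, and its semantics here are an assumption.

```cpp
#include <cassert>

class TrackedTransaction {
public:
    void start() { m_active = true; m_expectedActive = true; }

    void commit() {
        assert(m_active);  // committing requires an active transaction
        m_active = false;
        m_expectedActive = false;
    }

    // Called when a glitch is detected: the current state is dropped,
    // the expected ('old') state is deliberately left untouched.
    void onGlitch() { m_active = false; }

    bool isActive() const { return m_active; }

    // True only if the user had a transaction open and it disappeared.
    bool transactionWasLost() const { return m_expectedActive && !m_active; }

    // Restart only a transaction that was really lost; never start a new one.
    bool restartIfLost() {
        if (!transactionWasLost()) return false;
        m_active = true;
        return true;
    }

private:
    bool m_active = false;          // current state
    bool m_expectedActive = false;  // state the user last asked for
};
```

After a glitch, `isActive()` is false both for an idle object and for one that had an open transaction, but `transactionWasLost()` is true only for the latter.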
Recovering from the glitch
• In general: refresh instances rather than create new ones
  • Previously, CORAL closed the session and created a new one
  • This leads to segmentation faults and other problems
• Better approach: keep existing C++ instances and refresh them
  • Add new methods specific to refreshing the states
  • Separately for the three (or more) classes
• Encapsulate all loops in those methods
  • e.g. for how long should we retry to reconnect?
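One possible shape for a refresh method that keeps the existing instance and hides the retry loop. `RefreshableSession`, `RetryPolicy` and the injected `physicalConnect` callable are assumptions made for this sketch; the real retry limits and delays are an open question on the slide.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical answer to "for how long should we retry?".
struct RetryPolicy {
    int maxAttempts = 3;
    std::chrono::milliseconds delay{0};
};

class RefreshableSession {
public:
    // physicalConnect stands in for the low-level reconnect call.
    explicit RefreshableSession(std::function<bool()> physicalConnect)
        : m_physicalConnect(std::move(physicalConnect)) {}

    void markLost() { m_open = false; }
    bool isOpen() const { return m_open; }
    int attemptsUsed() const { return m_attempts; }

    // Refresh this same instance; callers never see the retry loop.
    bool refresh(const RetryPolicy& policy) {
        assert(policy.maxAttempts > 0);
        m_attempts = 0;
        while (m_attempts < policy.maxAttempts) {
            ++m_attempts;
            if (m_physicalConnect()) {
                m_open = true;
                return true;
            }
            // Wait between attempts, but not after the last one.
            if (m_attempts < policy.maxAttempts)
                std::this_thread::sleep_for(policy.delay);
        }
        return false;
    }

private:
    std::function<bool()> m_physicalConnect;
    bool m_open = false;
    int m_attempts = 0;
};
```

Because the object is refreshed in place, pointers and references held by user code stay valid, which avoids the crashes seen when the session was closed and recreated.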
Generic coding conventions
• Long discussions last year and some hints on twiki
  • But not completely formalised (sorry…)
• Please avoid
  • names that are not clear/relevant
  • file names that contain classes with different names
    • e.g. ScopedTransactionStatus in QueryMgr
• Please do
  • keep it simple whenever possible!
  • avoid very general approaches to solve simple specific issues
  • avoid adding classes/headers that are not relevant