
ORDB Implementation Discussion



  1. ORDB Implementation Discussion

  2. From RDB to ORDB Issues to address when adding OO extensions to a relational DBMS

  3. Layout of Data • Dealing with large data types (ADTs/blobs): special-purpose file space for such data, with special access methods • Large fields in one tuple: • One single tuple may not even fit on one disk page • Must break it into sub-tuples and link them via disk pointers (see the sketch below) • Flexible layout: • constructed types may have flexible-sized sets, e.g., one attribute can be a set of strings • Need to provide meta-data inside each type concerning the layout of fields within the tuple • Insertion/deletion will cause problems when a contiguous layout of 'tuples' is assumed
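The sub-tuple idea can be pictured with a small sketch. This is a minimal illustration, assuming a 4 KB page with 8 bytes reserved per chunk for the disk pointer; PAGE_SIZE, split_tuple, and the chunk-index "pointers" are invented for the example, not any real DBMS's scheme:

```python
# Hypothetical sketch: splitting an oversized tuple into linked sub-tuples.
PAGE_SIZE = 4096  # usable bytes per disk page (assumed)

def split_tuple(payload: bytes, page_size: int = PAGE_SIZE):
    """Break a serialized tuple into page-sized chunks, each carrying a
    'next' pointer (here just the index of the following chunk)."""
    chunk_size = page_size - 8  # reserve 8 bytes for the disk pointer
    chunks = [payload[i:i + chunk_size]
              for i in range(0, len(payload), chunk_size)]
    # Link each chunk to its successor; None marks the last sub-tuple.
    return [(data, idx + 1 if idx + 1 < len(chunks) else None)
            for idx, data in enumerate(chunks)]

subtuples = split_tuple(b"x" * 10000)
print(len(subtuples), "sub-tuples, last pointer:", subtuples[-1][1])
```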

  4. Layout of Data • More layout design choices (clustering on disk): • Lay out complex objects nested and clustered on disk (if nested and not pointer-based) • Where to store objects that are referenced (shared) by possibly several different structures • Many design options for objects that are in a type hierarchy with inheritance • Constructed types such as arrays require novel methods, like chunking an array into (4x4) subarrays for non-contiguous access

  5. Why Identifiers? • Distinguish objects regardless of content and location • Evolution of an object over time • Sharing of objects without copying • Continuity of identity (persistence) • Versions of a single object

  6. Objects/OIDs/Keys • Relational key: RDB: human-meaningful name (mixes data value with identity) • Variable name: PL: gives a name to an object in a program (mixes addressability with identity) • Object identifier: ODB: system-assigned globally unique name (location- and data-independent)

  7. OIDs • System generated • Globally unique • Logical identifier (not physical representation; flexibility in relocation) • Remains valid for lifetime of object (persistent)

  8. OID Support • OID generation: • uniqueness across time and systems • Object handling: • Operations to test equality/identity • Operations to manipulate OIDs for object merging and copying • Avoid dangling references

  9. OID Implementation • By address (physical) • 32 bits; direct fast access like a pointer • By structured address • E.g., page and slot number • Both some physical and logical information • By surrogates • Purely logical oid • Use some algorithm to assure uniqueness • By typed surrogates • Contains both type id and object id • Determine type of object without fetching it
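To make the surrogate variants concrete, here is a hedged sketch of a typed surrogate: a purely logical id that packs a type id next to a unique serial, so the type can be determined without fetching the object. The 16-bit/48-bit field split and the counter-based uniqueness scheme are assumptions for the sketch:

```python
# Illustrative encoding of a "typed surrogate" OID.
import itertools

_serial = itertools.count(1)   # system-wide unique counter (assumed)

def make_oid(type_id: int) -> int:
    # Pack the type id into the high 16 bits, the serial into the low 48.
    return (type_id << 48) | next(_serial)

def oid_type(oid: int) -> int:
    # The type can be read off without fetching the object itself.
    return oid >> 48

oid = make_oid(type_id=7)
assert oid_type(oid) == 7
print(hex(oid))
```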

  10. ADTs • Type representation: size/storage • Type access : import/export • Type manipulation: special methods to serve as filter predicates and join predicates • Special-purpose index structures : efficiency

  11. ADTs • Mechanism to add index support along with ADT: • External storage of index file outside DBMS • Provide “access method interface” a la: • Open(), close(), search(x), retrieve-next() • Plus, statistics on external index • Or, generic ‘template’ index structure • Generalized Search Tree (GiST) – user-extensible • Concurrency/recovery provided
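The access-method interface named above might look roughly like the following sketch. The in-memory sorted list standing in for an external index file, and the Python spellings of the method names, are assumptions:

```python
# A minimal sketch of the generic access-method interface from the slide:
# open(), close(), search(x), retrieve_next().

class ExternalIndex:
    def __init__(self, entries):
        self._entries = sorted(entries)  # pretend this lives outside the DBMS
        self._cursor = None

    def open(self):
        self._cursor = iter(())          # no active scan yet

    def search(self, x):
        # Position a cursor over all entries matching the key x.
        self._cursor = iter(e for e in self._entries if e[0] == x)

    def retrieve_next(self):
        return next(self._cursor, None)  # None signals end of matches

    def close(self):
        self._cursor = None

idx = ExternalIndex([(5, "rid1"), (5, "rid2"), (9, "rid3")])
idx.open()
idx.search(5)
while (hit := idx.retrieve_next()) is not None:
    print(hit)
idx.close()
```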

  12. Query Processing • Query Parsing: • Type checking for methods • Subtyping/Overriding • Query Rewriting: • May translate path expressions into join operators (e.g., a path expression such as e.dept.name can be rewritten as a join between the Emp and Dept extents) • Deal with collection hierarchies (UNION?) • Indices on, or extraction out of, the collection hierarchy

  13. Query Optimization Core • New algebra operators must be designed : • such as nest, unnest, array-ops, values/objects, etc. • Query optimizer must integrate them into optimization process : • New Rewrite rules • New Costing • New Heuristics
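As a hedged sketch of two of the new operators named above, nest and unnest can be illustrated over a relation encoded as a list of dicts; the encoding and the attribute names (dept, name, staff) are invented for the example:

```python
# Illustrative nest/unnest over a relation encoded as a list of dicts.
emp = [
    {"dept": "toys",  "name": "ann"},
    {"dept": "toys",  "name": "bob"},
    {"dept": "books", "name": "cat"},
]

def nest(rel, group_attr, set_attr):
    """Group tuples by group_attr, collecting the rest into a set-valued attribute."""
    groups = {}
    for t in rel:
        rest = {k: v for k, v in t.items() if k != group_attr}
        groups.setdefault(t[group_attr], []).append(rest)
    return [{group_attr: g, set_attr: members} for g, members in groups.items()]

def unnest(rel, set_attr):
    """Flatten a set-valued attribute back into one tuple per member."""
    return [{**{k: v for k, v in t.items() if k != set_attr}, **m}
            for t in rel for m in t[set_attr]]

nested = nest(emp, "dept", "staff")
print(nested)
print(unnest(nested, "staff") == emp)  # round-trips back to the flat relation
```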

  14. Query Optimization Revisited • Existing algebra operators revisited: SELECT • WHERE clause expressions can be expensive • So SELECT pushdown may be a bad heuristic

  15. Selection Condition Rewriting • EXAMPLE: • (tuple.attribute < 50) • Only CPU time (on the fly) • (tuple.location OVERLAPS lake-object) • Possibly complex, CPU-heavy computations • May involve both IO and CPU costs • State of the art: • consider the reduction factor only • Now, we must consider both factors: • Cost factor: dramatic variations • Reduction factor: unrelated to the cost factor

  16. Operator Ordering (diagram: alternative orderings of two operators, op1 and op2)

  17. Ordering of SELECT Operators • Cost factor: now there can be dramatic variations • Reduction factor: orthogonal to the cost factor • We want maximal reduction at minimal cost: Rank(operator) = reduction * (1/cost) • Order operators by decreasing rank, best first (see the sketch below) • High rank (good): low cost and large reduction • Low rank (bad): high cost and small reduction
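As a small illustration of the rank formula, the following sketch orders three hypothetical selection predicates by rank = reduction * (1/cost); the predicate names, reduction factors, and per-tuple costs are invented numbers:

```python
# Rank-based ordering of SELECT operators, per the slide's formula.
predicates = [
    # (name, reduction factor, per-tuple cost)
    ("tuple.attribute < 50",          0.60, 1.0),    # cheap CPU test
    ("tuple.location OVERLAPS lake",  0.90, 500.0),  # expensive spatial test
    ("tuple.flag = true",             0.10, 1.0),
]

def rank(pred):
    _, reduction, cost = pred
    return reduction / cost    # reduction * (1/cost)

# Apply the most "profitable" predicates first: large reduction, small cost.
for name, reduction, cost in sorted(predicates, key=rank, reverse=True):
    print(f"{name:32s} rank={reduction / cost:.4f}")
```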

  18. Access Structures/Indices (on what?) • Indexes that are ADT-specific • Indexes on navigation paths • Indexes on methods, not just on columns • Indexes over collection hierarchies (trade-offs) • Indexes for new WHERE clause expressions: not just =, <, >, but also "overlaps", "similar"

  19. Registering New Index (to Optimizer) • What WHERE conditions it supports • Estimated cost for “matching tuple” (IO/CPU) • Given by index designer (user?) • Monitor statistics; even construct test plans • Estimation of reduction factors/join factors • Register auxiliary function to estimate factor • Provide simple defaults
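One way to picture such a registration is the sketch below; the registry layout, the parameter names, and the rtree_on_location example are entirely invented for illustration:

```python
# Hypothetical registration of a new index with the optimizer: which WHERE
# conditions it supports, a per-match cost estimate from the index
# designer, and an auxiliary reduction-factor estimator.
index_registry = {}

def register_index(name, supports, cost_per_match, reduction_estimator):
    index_registry[name] = {
        "supports": supports,              # WHERE operators handled
        "cost_per_match": cost_per_match,  # IO/CPU estimate per matching tuple
        "reduction": reduction_estimator,  # auxiliary estimator function
    }

register_index(
    name="rtree_on_location",
    supports={"OVERLAPS", "CONTAINS"},
    cost_per_match=12.5,
    reduction_estimator=lambda predicate: 0.1,  # simple default factor
)
print(index_registry["rtree_on_location"]["supports"])
```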

  20. Methods • Use ADT/methods in query specification • Achieve flexibility and extensibility

  21. Methods • Extensibility: dynamic linking of methods defined outside the DB • Flexibility: overriding methods along the type hierarchy • Semantics: • Use of "methods" with implied semantics? • Incorporation of methods into query processing may cause side effects? • Termination may not be guaranteed?

  22. Methods • "Untrusted" methods: • methods may corrupt the server or • modify DB content (side effects) • Handling of "untrusted" methods: • restrict the language; • interpret vs. compile; • run in a separate address space from the DB server

  23. Query Optimization with Methods • Estimation of the "costs" of method predicates • See earlier discussion • Optimization of method execution: • Methods may be very expensive to execute • Idea: • Apply a similar idea as for handling correlated nested subqueries • Recognize repetition and rewrite the physical plan • Provide some level of pre-computation and reuse

  24. Strategies for Method Execution • 1. If called on the same input, cache that one result • 2. If called on a full column, presort the column first (group-by) • 3. Or, precompute the method's result for each possible value in the domain and put it in a hash table; during query processing, look up val → fct(val) in the hash table, or even join with it, instead of recomputing (see the sketch below)
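Strategies 1 and 3 can be sketched as follows; expensive_fct, the tiny domain, and the sample column are stand-ins for an expensive ADT method and real data:

```python
# Strategy 1: memoize an expensive method on repeated inputs.
# Strategy 3: precompute its results over a small domain into a hash table.
from functools import lru_cache

@lru_cache(maxsize=None)        # strategy 1: cache the result per input value
def expensive_fct(val):
    return val * val            # imagine a heavy image/spatial computation

# Strategy 3: precompute fct(val) for every possible domain value...
domain = range(10)
precomputed = {val: expensive_fct(val) for val in domain}

# ...then query processing looks results up (or joins with the table)
# instead of recomputing per tuple.
column = [3, 7, 3, 3, 9]
results = [precomputed[v] for v in column]
print(results)
```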

  25. Query Processing • User-defined methods • User-defined aggregate functions: • E.g., "second largest" or "brightest picture" • Distributive aggregates: • incremental computation

  26. Incremental Computation: Query Processing • For incremental computation of distributive aggregates, provide: • Initialize(): set up the state space • Iterate(): per tuple, update the state • Terminate(): compute the final result based on the state; then clean up the state • For example, "second largest" (see the sketch below): • Initialize(): 2 fields (largest, second largest) • Iterate(): per tuple, compare and update the 2 fields • Terminate(): return the second largest; remove the 2 fields
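A minimal sketch of the three-call protocol for "second largest"; the class shape and method spellings are assumptions, only the Initialize/Iterate/Terminate protocol comes from the slide:

```python
class SecondLargest:
    def initialize(self):
        # State space: two fields for the largest and second-largest seen.
        self.largest = None
        self.second = None

    def iterate(self, value):
        # Per-tuple update of the two fields.
        if self.largest is None or value > self.largest:
            self.largest, self.second = value, self.largest
        elif self.second is None or value > self.second:
            self.second = value

    def terminate(self):
        # Compute the final result, then clean up the state.
        result = self.second
        self.largest = self.second = None
        return result

agg = SecondLargest()
agg.initialize()
for v in [4, 17, 9, 12]:
    agg.iterate(v)
print(agg.terminate())   # 12
```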

  27. Following Disk Pointers? • Complex object structures with object pointers may exist (~ disk pointers) • Navigate complex objects by following pointers • Long-running transactions, as in CAD design, may work with a complex object for a long duration • What to do about "pointers" between subobjects or related objects?

  28. Following Disk Pointers? • Swizzle: • Swizzle = replace OID references by in-memory pointers • Unswizzle = back to disk pointers when flushing to disk (see the sketch below) • Issues: • In-memory table of OIDs and their state • Indicate in each object pointer, via a bit, whether it is swizzled • Different policies for swizzling: • never • on access • attached to object brought in
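A toy sketch of swizzling with the per-pointer bit and the in-memory OID table mentioned above; the Ref class and the dict-based "disk" are illustrative only, not any particular DBMS's scheme:

```python
class Ref:
    """An object reference: either a disk OID or an in-memory pointer."""
    def __init__(self, oid):
        self.swizzled = False    # the per-pointer bit from the slide
        self.target = oid        # OID when unswizzled, object when swizzled

resident = {}                    # in-memory table: OID -> loaded object

def fetch(oid):
    # Stand-in for reading an object from disk into memory.
    return resident.setdefault(oid, {"oid": oid})

def swizzle(ref):
    if not ref.swizzled:
        ref.target = fetch(ref.target)   # follow the OID once, keep a pointer
        ref.swizzled = True

def unswizzle(ref):
    if ref.swizzled:
        ref.target = ref.target["oid"]   # back to the disk pointer
        ref.swizzled = False

r = Ref(oid=42)
swizzle(r);   print(r.target)    # direct in-memory access
unswizzle(r); print(r.target)    # 42, safe to flush to disk
```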

  29. Persistence? • We may want both persistent and transient data • Why ? • Programming language variables • Handle intermediate data • May want to apply queries to transient data

  30. Properties for Persistence? • Orthogonal to types : • Data of any type can be persistent • Transparent to programmer : • Programmer can treat persistent and non-persistent objects the same way • Independent from mass storage: • No explicit read and write to persistent database

  31. Models of Persistence • Different models of persistence for OODB implementations

  32. Models of Persistence • Persistence by type • Persistence by call • Persistence by reachability

  33. Models of Persistence • Parallel type systems: • Persistence by type, e.g., int and dbint • Programmer is responsible for making objects persistent • Programmer must make the decision at object creation time • Allows for user control by "casting" types

  34. Models of Persistence • Persistence by explicit call • Explicit create/delete into/from persistent space • E.g., objects must be placed into "persistent containers" such as relations in order to be kept around • E.g., insert object into collection MyBooks (see the sketch below) • Could be rather dynamic control, without casting • Relatively simple for the DBMS to implement
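As a loose analogy for persistence by explicit call, the sketch below uses Python's shelve module as the "persistent container"; the file name, key, and book record are invented:

```python
# An object is transient by default; it survives only because of the
# explicit insert into a persistent container.
import shelve

book = {"title": "ORDB Internals", "year": 1999}   # transient by default

with shelve.open("mybooks.db") as my_books:        # persistent container
    my_books["book-1"] = book                      # explicit insert call
```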

  35. Models of Persistence • Persistence by reachability: • Use global (or named) variables as roots pointing to objects and structures • Objects referenced by other objects that are reachable by the application are also persistent, by transitivity • No explicit deletes; instead, garbage collection reclaims objects once they are no longer referenced (see the sketch below) • Garbage collection techniques: • mark&sweep: mark all objects reachable from the persistent roots; then delete the others • scavenging: copy all reachable objects from one space to the other; but may suffer in a disk-based environment due to IO overhead and destruction of clustering
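The mark&sweep technique can be sketched as follows; the dict-based object graph and the OID strings are assumptions for illustration:

```python
# Mark everything reachable from the persistent roots, then delete the rest.
store = {                  # OID -> list of OIDs it references
    "root":   ["a", "b"],
    "a":      ["c"],
    "b":      [],
    "c":      [],
    "orphan": ["c"],       # unreachable from the persistent root
}

def mark(roots):
    marked, stack = set(), list(roots)
    while stack:
        oid = stack.pop()
        if oid not in marked:
            marked.add(oid)
            stack.extend(store[oid])   # follow object references
    return marked

def sweep(marked):
    for oid in list(store):
        if oid not in marked:
            del store[oid]             # garbage: no longer referenced

sweep(mark(["root"]))
print(sorted(store))   # ['a', 'b', 'c', 'root']; 'orphan' was collected
```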

  36. Tradeoffs

  37. Summary • A lot of work to get to OO support: from physical database design/layout issues up to logical query optimizer extensions • ORDB: reuses the existing implementation base and incrementally adds new features (but the relation remains the first-class citizen)
