300 likes | 304 Views
A Comparison of Column, Row and Array DBMSs to Process Recursive Queries. Carlos Ordonez University of Houston. Acknowledgments. Michael Stonebraker Visited MIT 2013-2015 Wellington Cabrera; Achyuth Gurram Divesh Srivastava, sabbatical at ATT Labs. Introduction. Classical topic in Datalog
E N D
A Comparison of Column, Row and Array DBMSs to Process Recursive Queries Carlos Ordonez University of Houston
Acknowledgments Michael Stonebraker Visited MIT 2013-2015 Wellington Cabrera; Achyuth Gurram Divesh Srivastava, sabbatical at ATT Labs
Introduction Classical topic in Datalog Recursion defined in ANSI SQL Graph algorithms: paths, reachability, neighborhood analysis Complexity: Cubic, NP-completeness Before: Deductive databases: datalog Harder query optimization than traditional SPJ queries
Directed Graphs Definitions: Directed Graph G=(V,E), maybe cyclic! A vertex in V : i or j and i,j=1..n. and edge (i,j ) has a direction and weight v storage: adjacency matrix E : |E|=N Problems: Transitive closure: vertices j reachable from i Power matrix Ek
Examples V=cities,E=roads. Is there some path from San Diego to NYC?: path from i to j? shortest one? V=employees, E=manager -> employee q1: all employees under i q2: Is j supervised by i? Bill of materials: The well-known part/subpart manufacturing DB: all subparts Y from part X
Technical details Linear recursion; Intuition: R=R*E Inner joins; no negation No GROUP-BY, DISTINCT, HAVING, NOT IN, OUTER JOIN clauses inside R Any SQL query on R is valid Seminaive; recursion depth k: loop with k-1 joins or (rarely) fixpoint
Stonebraker: One size does not fit all! • Storage in a DBMS: • Row: OLTP, point queries, cubes • Column: cube queries, ad-hoc queries • Array: math, science • Other: • Stream: one pass; in-RAM • MMDB: OLTP • Hadoop/noSQL: yawn (but evolving)
DBMS storage elevator storyrow | column | array Row: old, single file, block, B-trees/hash, hash horizontal partitioning Column: new, multiple files, var. size blocks, ordered values, compressed, no row-level index!, hash-segment Array: very different storage; attributes={dimensions|columns}; chunk==subarray; multidimensional; grid index in RAM; still hash but on chunk
Algorithms Semi-naïve (Datalog): classical, general, reasonably efficient (Delta), expressive Direct: very efficient; TC only; in-place update; matrix-based; requires arrays; not good for SQL; not used today! [TKDE 2010, Teradata DBMS]
Optimizations: SPJ Relational algebra + physical operators Join: hash or sort-merge (nested loop does not make sense with E) Projection: push dup elimination & aggreg. Selection: push filter To be explored later Left outer joins External joins Secondary indexes: row-level only?
Join: hash versus sort-mergeGoal: O(N) • Main computation: • Join optimization: • Column: projection={unordered, ordered values} • Row: unordered, ordered versus index • Array: default={ordered, indexed} choice={sparse,dense}
Projection • Duplicate elimination • reachability • binary edges • Aggregation • shortest/longest path • count # paths • length vs weight/cost
Issues with select operator Incorrect to use a predicate involving a join expression column in recursive step Cycles => Infinite recursion Monotonically increasing v, OK to prune Recursion depth k: required in practice
Benchmark with graphsgoal: understand O() and plan • Real • Skewed • Complex structure; sample==different • But Fixed size • Synthetic • Vary n,N • Vary shape • NEW: Cliques!
Conclusions Query optimizations Confirm decades of research: required But impact definitely varies G knowledge helps (catch 22) Benchmark with tuned query processing Column DBMS faster; cliques/skewed degree OK Array DBMS competitive for dense/clique G Row DBMS reasonable
Conclusions • Graph features impacting time and I/O • Density: avg vertex degree; deg(i) skew • Cliques: K • Cycles: deep k • Lessons • Tune storage • Tune join (skewed hash join) • Beware of large cliques and short cycles • Increment recursion depth k gradually
Future work Develop operators for Array DBMS Time complexity based on G Beyond semi-naïve: Direct, logarithmic Different non-path G problems like CC, Neighborhood analysis Beyond paths => clique detection Beyond O(): Query plan cost model
References on RQs • Many Datalog papers • Hadoop: SPARQL gaining interest • SQL much less • IBM • Direct: Agrawal, but not used today • My group: TKDE 2010 paper; Teradata papers; DOLAP 2014: IS journal: revisiting RQs in Columnar DBMS