1 / 30

A Comparison of Column, Row and Array DBMSs to Process Recursive Queries

A Comparison of Column, Row and Array DBMSs to Process Recursive Queries. Carlos Ordonez University of Houston. Acknowledgments. Michael Stonebraker Visited MIT 2013-2015 Wellington Cabrera; Achyuth Gurram Divesh Srivastava, sabbatical at ATT Labs. Introduction. Classical topic in Datalog

dacton
Download Presentation

A Comparison of Column, Row and Array DBMSs to Process Recursive Queries

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A Comparison of Column, Row and Array DBMSs to Process Recursive Queries Carlos Ordonez University of Houston

  2. Acknowledgments Michael Stonebraker Visited MIT 2013-2015 Wellington Cabrera; Achyuth Gurram Divesh Srivastava, sabbatical at ATT Labs

  3. Introduction Classical topic in Datalog Recursion defined in ANSI SQL Graph algorithms: paths, reachability, neighborhood analysis Complexity: Cubic, NP-completeness Before: Deductive databases: datalog Harder query optimization than traditional SPJ queries

  4. Directed Graphs Definitions: Directed Graph G=(V,E), maybe cyclic! A vertex in V : i or j and i,j=1..n. and edge (i,j ) has a direction and weight v storage: adjacency matrix E : |E|=N Problems: Transitive closure: vertices j reachable from i Power matrix Ek

  5. Examples V=cities,E=roads. Is there some path from San Diego to NYC?: path from i to j? shortest one? V=employees, E=manager -> employee q1: all employees under i q2: Is j supervised by i? Bill of materials: The well-known part/subpart manufacturing DB: all subparts Y from part X

  6. Recursive view Rk: recursion depth n=|V|,N=|E|

  7. Transitive Closure: G+=(V,E’)

  8. Power matrix

  9. Technical details Linear recursion; Intuition: R=R*E Inner joins; no negation No GROUP-BY, DISTINCT, HAVING, NOT IN, OUTER JOIN clauses inside R Any SQL query on R is valid Seminaive; recursion depth k: loop with k-1 joins or (rarely) fixpoint

  10. Stonebraker: One size does not fit all! • Storage in a DBMS: • Row: OLTP, point queries, cubes • Column: cube queries, ad-hoc queries • Array: math, science • Other: • Stream: one pass; in-RAM • MMDB: OLTP • Hadoop/noSQL: yawn (but evolving)

  11. DBMS storage elevator storyrow | column | array Row: old, single file, block, B-trees/hash, hash horizontal partitioning Column: new, multiple files, var. size blocks, ordered values, compressed, no row-level index!, hash-segment Array: very different storage; attributes={dimensions|columns}; chunk==subarray; multidimensional; grid index in RAM; still hash but on chunk

  12. Algorithms Semi-naïve (Datalog): classical, general, reasonably efficient (Delta), expressive Direct: very efficient; TC only; in-place update; matrix-based; requires arrays; not good for SQL; not used today! [TKDE 2010, Teradata DBMS]

  13. Semi-naive

  14. Seminaive in SQL

  15. Optimizations: SPJ Relational algebra + physical operators Join: hash or sort-merge (nested loop does not make sense with E) Projection: push dup elimination & aggreg. Selection: push filter To be explored later Left outer joins External joins Secondary indexes: row-level only?

  16. Join: hash versus sort-mergeGoal: O(N) • Main computation: • Join optimization: • Column: projection={unordered, ordered values} • Row: unordered, ordered versus index • Array: default={ordered, indexed} choice={sparse,dense}

  17. Projection • Duplicate elimination • reachability • binary edges • Aggregation • shortest/longest path • count # paths • length vs weight/cost

  18. SelectionReduce |Rd|, correctness

  19. Issues with select operator Incorrect to use a predicate involving a join expression column in recursive step Cycles => Infinite recursion Monotonically increasing v, OK to prune Recursion depth k: required in practice

  20. Benchmark with graphsgoal: understand O() and plan • Real • Skewed • Complex structure; sample==different • But Fixed size • Synthetic • Vary n,N • Vary shape • NEW: Cliques!

  21. Simulating realistic Gsocial nets, Internet

  22. Join (stop=30 mins; array 2 chunk sizes req’d)

  23. ProjectionDuplicate elimination

  24. ProjectionAggregation

  25. Selectionsimple filter i=1

  26. Ultimate benchmark (fair?)tuned storage, best join, aggr

  27. Conclusions Query optimizations Confirm decades of research: required But impact definitely varies G knowledge helps (catch 22) Benchmark with tuned query processing Column DBMS faster; cliques/skewed degree OK Array DBMS competitive for dense/clique G Row DBMS reasonable

  28. Conclusions • Graph features impacting time and I/O • Density: avg vertex degree; deg(i) skew • Cliques: K • Cycles: deep k • Lessons • Tune storage • Tune join (skewed hash join) • Beware of large cliques and short cycles • Increment recursion depth k gradually

  29. Future work Develop operators for Array DBMS Time complexity based on G Beyond semi-naïve: Direct, logarithmic Different non-path G problems like CC, Neighborhood analysis Beyond paths => clique detection Beyond O(): Query plan cost model

  30. References on RQs • Many Datalog papers • Hadoop: SPARQL gaining interest • SQL much less • IBM • Direct: Agrawal, but not used today • My group: TKDE 2010 paper; Teradata papers; DOLAP 2014: IS journal: revisiting RQs in Columnar DBMS

More Related