Building a Database System for Order

BUILDING A DATABASE SYSTEM FOR ORDER
New England Database Seminars, April 2002
Alberto Lerner – ENST Paris
Dennis Shasha – NYU
{lerner,shasha}@cs.nyu.edu

Agenda • Motivation • SQL + Order • Transformations • Conclusion

Motivation: The need for ordered data
• Some queries rely on order
• Examples:
  • Moving averages
  • Top N
  • Rank
• “SQL can handle it.” Can it really?

Motivation: Moving Averages (algorithmically linear)
Sales(month, total)
    SELECT t1.month+1 AS forecastMonth,
           (t1.total + t2.total + t3.total)/3 AS 3MonthMovingAverage
    FROM Sales AS t1, Sales AS t2, Sales AS t3
    WHERE t1.month = t2.month - 1 AND t1.month = t3.month - 2
• Can the optimizer make a 3-way (in general, n-way) join run in linear time?
Ref: Trueblood and Lovett, Data Mining and Statistical Analysis Using SQL, Apress, 2001

Motivation: Top N
Employee(Id, salary)
    SELECT DISTINCT count(*), t1.salary
    FROM Employee AS t1, Employee AS t2
    WHERE t1.salary <= t2.salary
    GROUP BY t1.salary
    HAVING count(*) <= N
• The query counts, for each t1.salary, how many elements of the cross-product have salaries at least as large.
• Will the optimizer see the essential sort-count trick?
Ref: Joe Celko, SQL for Smarties, Morgan Kaufmann, 1995
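
The gap between the self-join idiom and the "sort-count trick" can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names and the sample salaries are made up, and salaries are assumed to be distinct (with duplicates the SQL count behaves differently).

```python
def top_n_by_self_join(salaries, n):
    """Mimic the self-join idiom: keep salary s when at most n salaries are >= s.

    Quadratic: every salary (t1) is compared against every salary (t2).
    Assumes distinct salaries (an assumption made for this sketch).
    """
    result = []
    for s in salaries:                                  # plays the role of t1
        at_least = sum(1 for t in salaries if s <= t)   # matching t2 rows
        if at_least <= n:
            result.append(s)
    return sorted(result, reverse=True)


def top_n_by_sort(salaries, n):
    """The sort-count trick: sort once in descending order, take the first n."""
    return sorted(salaries, reverse=True)[:n]


salaries = [50, 70, 20, 90, 60]   # made-up, distinct values
print(top_n_by_self_join(salaries, 3))
print(top_n_by_sort(salaries, 3))
```

Both functions return the same salaries on this input; the first does O(n²) comparisons, the second one O(n log n) sort.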

Motivation: Problems Extending SQL with Order
• Queries are hard to read
• Cost of execution is often non-linear (would not pass a basic algorithms course)
• Few operators preserve order, so optimization is hard.

SQL + Order: Desirable Features
• Express order-dependent predicates and clauses in a readable, clear way
• Make optimization opportunities explicit (by getting rid of complex idioms, see above)
• Execution in linear (or n log n) time when possible

SQL + Order: Three steps in the solution
• Give SQL a vector-oriented semantics
  • The database is a set of array-tables (“arrables”); variables in queries no longer refer to a single tuple at a time, but to a whole column vector
• Provide new vector-to-vector functions
  • Supporting order-based manipulations of column vectors
• Streaming: new data may need special treatment.

SQL + Order: Moving Averages
Sales(month, total)
    SELECT month, avgs(8, total)
    FROM Sales ASSUMING ORDER month
• avgs: vector-to-vector function, order-dependent and size-preserving
• ASSUMING ORDER: the order to be used by vector-to-vector functions
• Execution (Sales is an arrable):
  • FROM clause: enforces the order given in the ASSUMING clause
  • SELECT clause: for each month, yields the moving average (window size 8) ending at that month. No 8-way join.
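
A minimal Python sketch of what a linear-time avgs could look like, to make the "no 8-way join" point concrete. It is not the system's implementation: the handling of the first few positions (averaging over however many values are available) is an assumption, and the sample data is made up.

```python
from collections import deque


def avgs(window, values):
    """Moving average ending at each position; one pass, size-preserving.

    Assumption for this sketch: positions with fewer than `window`
    predecessors average over the values seen so far.
    """
    out = []
    recent = deque()   # values currently inside the window
    running = 0        # sum of the values in `recent`
    for v in values:
        recent.append(v)
        running += v
        if len(recent) > window:
            running -= recent.popleft()
        out.append(running / len(recent))
    return out


# Sales(month, total), deliberately out of order (made-up data).
sales = [(3, 30), (1, 10), (2, 20), (4, 40)]

# ASSUMING ORDER month: enforce the order first, then apply the vector function.
ordered = sorted(sales, key=lambda row: row[0])
months = [m for m, _ in ordered]
totals = [t for _, t in ordered]
print(list(zip(months, avgs(2, totals))))
```

Each input value is added to and removed from the running sum at most once, so the cost is linear in the length of the column regardless of the window size.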

SQL + Order: Top N
Employee(ID, salary)
    SELECT first(N, salary)
    FROM Employee ASSUMING ORDER salary
• first: vector-to-vector function, order-dependent and non size-preserving
• Execution:
  • FROM clause: orders the arrable by salary
  • SELECT clause: applies first() to the ‘salary’ vector, yielding the first N values of that vector given the order. Could get the top-earning IDs by saying first(N, ID).

SQL + Order: Ranking
SalesReport(salesPerson, territory, total)
    SELECT territory, salesPerson, total, rank(total)
    FROM SalesReport
    WHERE rank(total) < N
• rank: vector-to-vector function, non order-dependent and size-preserving
• Execution:
  • FROM clause: ASSUMING is NOT needed.
  • rank is applied to the ‘total’ vector and maps each position into an integer.

SQL + Order: Vector-to-Vector Functions
• Order-dependent, size-preserving: prev, next, $, [], avgs(*), prds(*), sums(*), deltas(*), ratios(*), reverse, …
• Order-dependent, non size-preserving: drop, first, last
• Non order-dependent, size-preserving: rank, tile
• Non order-dependent, non size-preserving: min, max, avg, count
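
To make the two classification axes concrete, here is a rough Python sketch of a few of these functions over plain lists. The exact semantics are assumptions made for illustration (for example, None as the filler for the first position of prev and deltas, and 0-based ascending ranks); the slides do not define them.

```python
from itertools import accumulate


def prev(v):
    """Order-dependent, size-preserving: value at the previous position."""
    return [None if i == 0 else v[i - 1] for i in range(len(v))]


def sums(v):
    """Order-dependent, size-preserving: running sum."""
    return list(accumulate(v))


def deltas(v):
    """Order-dependent, size-preserving: difference from the previous value."""
    return [None if i == 0 else v[i] - v[i - 1] for i in range(len(v))]


def first(n, v):
    """Order-dependent, non size-preserving: the first n values."""
    return v[:n]


def rank(v):
    """Non order-dependent, size-preserving: 0-based ascending rank per position."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    ranks = [0] * len(v)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks


v = [30, 10, 20]   # made-up column vector
print(prev(v), sums(v), deltas(v), first(2, v), rank(v), max(v))
```

Reordering v changes what prev, sums, deltas, and first return for a given value, while each value keeps its rank and max is unaffected; that difference is what the order-dependent / non order-dependent split captures.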

SQL + Order: Complex queries: Best spread
In a given day, what would be the maximum difference between a buying and selling point of each security?
Ticks(ID, price, tradeDate, timestamp, …)
    SELECT ID, max(price - mins(price))
    FROM Ticks ASSUMING ORDER timestamp
    WHERE tradeDate = '99/99/99'
    GROUP BY ID
[Figure: price curve annotated with max, min, running min, and best spread]
• Execution:
  • For each security, compute the running minimum vector for price and then subtract it from the price vector itself; the result is a vector of spreads.
  • Note that max – min would overstate the spread.
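
The best-spread computation can be traced with a short Python sketch. The helper names and the tick data are invented for illustration; mins is taken to mean the running minimum, as the execution notes describe.

```python
from itertools import accumulate


def mins(values):
    """Running minimum: smallest value seen up to and including each position."""
    return list(accumulate(values, min))


def best_spread(prices):
    """max(price - mins(price)): best sell price minus an earlier (or same) buy price."""
    return max(p - m for p, m in zip(prices, mins(prices)))


def best_spread_by_id(ticks):
    """ticks: (ID, price, timestamp) tuples. Order by timestamp, group by ID."""
    by_id = {}
    for sec_id, price, _ts in sorted(ticks, key=lambda t: t[2]):
        by_id.setdefault(sec_id, []).append(price)
    return {sec_id: best_spread(prices) for sec_id, prices in by_id.items()}


prices = [5, 3, 6, 1, 4]             # made-up prices in timestamp order
print(mins(prices))                  # running minimum vector
print(best_spread(prices))           # buy at 3, sell at 6 (or buy at 1, sell at 4)
print(max(prices) - min(prices))     # overstates: the max (6) occurs before the min (1)
```

On this input the best spread is 3, while max minus min gives 5, because the highest price comes before the lowest one and cannot be a sale of that purchase.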

SQL + Order: Complex queries: Crossing averages part I
When does the 21-day average cross the 5-month average?
Market(ID, closePrice, tradeDate, …)
TradedStocks(ID, Exchange, …)
    INSERT INTO temp FROM
    SELECT ID, tradeDate,
           avgs(21 days, closePrice) AS a21,
           avgs(5 months, closePrice) AS a5,
           prev(avgs(21 days, closePrice)) AS pa21,
           prev(avgs(5 months, closePrice)) AS pa5
    FROM TradedStocks NATURAL JOIN Market ASSUMING ORDER tradeDate
    GROUP BY ID

SQL + Order: Complex queries: Crossing averages part I
• Execution:
  • FROM clause: order-preserving join
  • GROUP BY clause: groups are defined based on the value of the ID column
  • SELECT clause: functions are applied; non-grouped columns become vector fields so that target cardinality is met. This violates first normal form.
[Figure: grouping by ID turns the ID and a non-grouped column (two columns with the same cardinality) into a grouped ID with a vector field]

SQL + Order: Complex queries: Crossing averages part II
Get the result from the resulting non first normal form relation temp
    SELECT ID, tradeDate
    FROM flatten(temp)
    WHERE a21 > a5 AND pa21 <= pa5
• Execution:
  • FROM clause: flatten transforms temp into a first normal form relation (for row r, every vector field in r MUST have the same cardinality). Could have been placed at the end of the previous query.
  • Standard query processing after that.
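
A small Python sketch of what flatten and the crossing predicate do. The row representation (dictionaries with list-valued vector fields), the None filler produced by prev, and the sample numbers are all assumptions made for illustration.

```python
def flatten(nested_rows):
    """Turn rows with vector fields into first normal form rows.

    Each row is a dict; list values are vector fields, other values are scalars.
    All vector fields of a row must have the same cardinality.
    """
    flat = []
    for row in nested_rows:
        vector_cols = {k: v for k, v in row.items() if isinstance(v, list)}
        lengths = {len(v) for v in vector_cols.values()}
        if len(lengths) > 1:
            raise ValueError("vector fields in a row must have the same cardinality")
        n = lengths.pop() if lengths else 1
        for i in range(n):
            flat.append({k: (v[i] if k in vector_cols else v) for k, v in row.items()})
    return flat


def crossings(temp):
    """WHERE a21 > a5 AND pa21 <= pa5, skipping rows with no previous value."""
    return [
        (r["ID"], r["tradeDate"])
        for r in flatten(temp)
        if r["pa21"] is not None and r["pa5"] is not None
        and r["a21"] > r["a5"] and r["pa21"] <= r["pa5"]
    ]


# One group per ID; made-up averages over three trade dates.
temp = [{
    "ID": "X",
    "tradeDate": [1, 2, 3],
    "a21": [1.0, 2.0, 3.0],
    "a5": [2.0, 2.0, 2.0],
    "pa21": [None, 1.0, 2.0],
    "pa5": [None, 2.0, 2.0],
}]
print(crossings(temp))
```

Here the short average rises above the long one on trade date 3, the only row where a21 > a5 while the previous values satisfied pa21 <= pa5.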

SQL + Order: Related Work: Research
• SEQUIN (Seshadri et al.)
  • Sequences are first-class objects
  • Difficult to mix tables and sequences.
• SRQL (Ramakrishnan et al.)
  • Elegant algebra and language
  • No work on transformations.
• SQL-TS (Sadri et al.)
  • Language for finding patterns in sequences
  • But: Not everything is a pattern!

SQL + Order: Related Work: Products
• RISQL (Red Brick)
  • Some vector-to-vector, order-dependent, size-preserving functions
  • Low-hanging fruit approach to language design.
• Analysis Functions (Oracle 9i)
  • Quite complete set of vector-to-vector functions
  • But: Can only be used in the SELECT clause; poor optimization (our preliminary study)
• KSQL (Kx Systems)
  • Arrable extension to SQL but syntactically incompatible.
  • No cost-based optimization.

Transformations: Early sorting + order-preserving operators
    SELECT ts.ID, ts.Exchange, avgs(10, hq.ClosePrice)
    FROM TradedStocks AS ts NATURAL JOIN HistoricQuotes AS hq ASSUMING ORDER hq.TradeDate
    GROUP BY ID
[Figure: four alternative plans built from sort, op, g-by, and avgs operators]
• (1) Sort, then join preserving order
• (2) Preserve existing order
• (3) Join, then sort before grouping
• (4) Join, then sort after grouping


Transformations: UDF evaluation order
Gene(geneId, seq)
    SELECT t1.geneId, t2.geneId, dist(t1.seq, t2.seq)
    FROM Gene AS t1, Gene AS t2
    WHERE dist(t1.seq, t2.seq) < 5 AND posA(t1.seq, t2.seq)
• posA asks whether the sequences have nucleotide A in the same position.
• dist gives the edit distance between two sequences.
[Figure: three plans over the posA and dist predicates]
• Plans (1) and (2) apply posA and dist in opposite orders.
• Plan (3) switches dynamically between (1) and (2) depending on the execution history.


Transformations: Order-preserving joins
    select lineitem.orderid, avgs(10, lineitem.qty), lineitem.lineid
    from order, lineitem assuming order lineid
    where order.date > 45 and order.date < 55
      and lineitem.orderid = order.orderid
• Basic strategy 1: restrict based on date. Create a hash on order. Run through lineitem, performing the join and pulling out the qty.
• Basic strategy 2: arrange for lineitem.orderid to be an index into order. Then restrict order based on date, giving a bit vector. The bit vector, indexed by lineitem.orderid, gives the relevant lineitem rows.
  • The relevant order rows are then fetched using the surviving lineitem.orderid.
• Strategy 2 is often 3-10 times faster.
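
The two strategies can be compared side by side in a Python sketch. The data layout is an assumption made for illustration: order is a list of dates whose position is the orderid, and lineitem rows are (lineid, orderid, qty) tuples already in lineid order. The sketch shows only that both strategies select the same rows in the same order; it makes no claim about relative speed.

```python
def strategy_hash(order_rows, lineitem_rows):
    """Strategy 1: restrict order on date, hash the survivors, probe with lineitem."""
    wanted = {orderid for orderid, date in order_rows if 45 < date < 55}
    # Scanning lineitem in its existing order preserves the lineid order.
    return [row for row in lineitem_rows if row[1] in wanted]


def strategy_bitvector(order_dates, lineitem_rows):
    """Strategy 2: lineitem.orderid is a position in order; filter via a bit vector."""
    keep = [45 < date < 55 for date in order_dates]   # one bit per order row
    return [row for row in lineitem_rows if keep[row[1]]]


# Made-up data: order_dates[i] is the date of the order with orderid i.
order_dates = [40, 50, 52, 60]
order_rows = list(enumerate(order_dates))             # (orderid, date)
lineitem_rows = [(1, 1, 5), (2, 0, 7), (3, 2, 9), (4, 3, 1), (5, 1, 3)]

print(strategy_hash(order_rows, lineitem_rows))
print(strategy_bitvector(order_dates, lineitem_rows))
```

The surviving rows come out in lineid order in both cases, so avgs(10, qty) can be applied directly without a later sort.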

Transformations: Building Blocks
• Order optimization
  • Simmen et al. ’96: push-down sorts over joins, and combining and avoiding sorts
• Order-preserving operators
  • KSQL: joins on vectors
  • Claussen et al. ’00: OP hash-based join
• Push-down aggregating functions
  • Chaudhuri and Shim ’94, Yan and Larson ’94: evaluate aggregation before joins
• UDF evaluation
  • Hellerstein and Stonebraker ’93: evaluate UDFs according to ((output/input) – 1)/cost per tuple
  • Porto et al. ’00: take correlation into account
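
The UDF ordering metric listed above, ((output/input) – 1)/cost per tuple, can be sketched in Python for the posA and dist predicates of the gene query. The selectivities and costs are invented for illustration, and the sketch assumes the predicates are independent (the correlation-aware refinement is not modeled).

```python
def predicate_rank(selectivity, cost_per_tuple):
    """((output/input) - 1) / cost per tuple; selectivity is output/input."""
    return (selectivity - 1) / cost_per_tuple


def order_predicates(preds):
    """Evaluate predicates in ascending rank order (most negative first)."""
    return sorted(preds, key=lambda p: predicate_rank(p["selectivity"], p["cost"]))


def expected_cost_per_tuple(preds):
    """Expected cost of applying preds in the given order to one input tuple."""
    total, surviving = 0.0, 1.0
    for p in preds:
        total += surviving * p["cost"]     # only surviving tuples reach this predicate
        surviving *= p["selectivity"]
    return total


# Hypothetical numbers: posA is cheap, dist (edit distance) is expensive.
preds = [
    {"name": "dist", "selectivity": 0.1, "cost": 100.0},
    {"name": "posA", "selectivity": 0.25, "cost": 1.0},
]
plan = order_predicates(preds)
print([p["name"] for p in plan])
print(expected_cost_per_tuple(plan), expected_cost_per_tuple(plan[::-1]))
```

With these numbers posA is evaluated first even though dist is more selective, because posA filters tuples at a fraction of the cost.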

Conclusion
• An arrable-based approach to ordered databases may look scary (dependency on order, vector-to-vector functions), but it is expressive and fast.
• An SQL extension that includes order is possible and reasonably simple.
• Optimization possibilities are vast.
