Relational Algebra

Relational Algebra • SQL specifies what to retrieve but not how to retrieve it. • Need a process to translate a descriptive statement into a collection of activities. • This is “behind the scenes”, but important to know nonetheless. • If a CS student doesn’t know it – who will?

High level logic involving table (relation) manipulation • SQL Statement • Low level logic involving B-Tree, index, hash table access, etc. • Mid level logic involving loops and comparisons • Result

SQL queries relations and generates a new relation. • Need something that manipulates relations in order to create other relations. • Need to do it efficiently.

Relational Algebra: • a collection of operations (activities) to construct new relations from given relations. • Typical process: SQL  relational algebra operations (or something similar)  mid/low level logic  result

Reference: First few slides of [http://www.cs.wayne.edu/~shiyong/csc6710/slides/kiferComp_348761_ppt11.ppt], though the notation may be intimidating.

Background: • Tuple (n-tuple): collection of n things written as (a1, a2,…, an). • Typically a tuple represents an entity in a relation (a row in a table).

Set operations: • Definition: • 2 relations are Union-compatibleif they have same degree (number of attributes) and the ith attributes of each are defined on same domains. • The first three definitions that follow apply only to Union-compatible relations.

Union • A  B is the set of elements belonging to either A or B. Venn Diagram A  B

SQL Server: SELECT * FROM S WHERE status = 20 UNION SELECT * FROM S WHERE status = 50

Difference (Minus): • A – B is the set of elements belonging to A but not to B. Venn Diagram. A – B

SQL Server uses Except instead of Minus • Supplier table: Create a view defined by that below (all suppliers not in Paris). SELECT * FROM S EXCEPT SELECT * FROM S WHERE City = 'Paris'

Intersection • A  B is the set of elements belonging to both A and B. Venn Diagram A  B

SQL Server: SELECT * FROM S WHERE status > 20 INTERSECT SELECT * FROM S WHERE status < 50

Cartesian Product: • A×B is the set of all elements of the form (x, y) where x belongs to A and y belongs to B.

A = {a, b} B = {1, 2, 3} • A×B = {(a, 1), (a, 2), (a, 3), (b, 1), (b, 2), (b, 3) } • Supplier table: Create a view defined by SELECT * FROM S CROSS JOIN SP

Selection (Restriction) • R where predicate where the predicate is some condition that evaluates to true or false. • A row subset of relation R. • Select * from R where predicate.

Projection • Let R be a relation and X, Y, …, Z be among R’s attributes. • Then R[X, Y, …, Z] is the set of elements from R restricted to attributes from X, Y, … , Z with duplicates removed. • Column subset of R • Select distinct attribute list from R

Join • Cartesian Product followed by a selection • A×B where predicate. • SELECT * FROM S CROSS JOIN SP where S.S# = SP.S# Or • SELECT * FROM S, SP where S.S#=SP.S#

Natural Join: • A join B • Let Ci, … , Cj be attribute names common to A and B. Then A JOIN B is (A ×B)[all attributes with duplicates removed] where A.Ci = B.Ci and : : A.Cj = B.Cj

Outer Join: • Example, Consider S join SP. Suppose S9 is a supplier but S9 supplies no parts, yet. An entry for S9 will NOT appear in S join SP. • S left outerjoin SP will include every row from the left table (S) with NULLS filled in if there’s no matching S# in the SP table.

Inner Join • Conventional join operation but the phrase is used if there’s the potential for confusion with outer join.

Examples select * from S inner join SP on S.S#=SP.S# select * from S left outer join SP on S.S#=SP.S#

Get supplier names for suppliers who supply part P2. ( ( SP where P# = ‘P2) join S) [SName] • Find a row subset (Selection) of SP where P#=‘P2’ • Do a natural join with S • Find a column subset of the result (project on Sname)

Get supplier names for suppliers who supply at least one red part. ( ( ( P where color = ‘Red’) Join SP) [S#] Join S) [Sname] • Find a row subset (Selection) of P where color=‘Red’ • Do a natural join with SP • Project on S# • Do a natural join with S • Project on SName

Get supplier names for suppliers who do not supply part P2. S[Sname] Minus ( ( SP where P# = ‘P2) join S) [Sname] • Project S on SName • Find a row subset of SP where P#=‘P2’ • Join the result of 2 with S • Project the result of 3 on Sname • Calculate the difference between the result of 1 and the result of 2

Division: • Let A be a relation of degree m+nand B be a relation of degree n • Visualize an m+ntuple (entry) of A as a pair (x, y) where • x is an m-tuple (1stm attributes) and y is an n-tuple (last n attributes) • Also suppose the n attributes of B are defined on the same domains as the last n attributes of A. • Then C = A dividedby B is a relation consisting of m-tuplesx where for all y in B there is a pair (x, y) in A.

NOTE: only 5 operations: restriction, projection, product, union, and difference are primitive. • AB = A – (A – B) • A dividedby B = A[x] – (A[x]×B – A)[x].

Example: • SP dividedby B • B is: 1) P1 or 2) P2 and P4 or 3) P1 through P6 • Supplier numbers that supply all parts in B • SP Dividedby B is: 1) S1 and S2 or 2) S1 and S4 or 3) S1

A dividedby B = A[x] – (A[x]×B – A)[x]. • Apply to previous example where A = SP.

Example • Get supplier names for suppliers who supply all parts. ( ( SP dividedby P[P#]) Join S)[Sname] or ( ( SP[S#] - (SP[S#] times P[P#] – SP)[S#]) Join S)[Sname] Recall previous notes in SQL where finding ALL of something required double “not exist” subqueries.

Query Optimization: Introduction to Database Systems by C. J. Date 8thed Select S.name From S, SP Where S.S# = SP.S# And SP.P# = ‘P2’ • Suppose 100 suppliers and 10,000 shipments • Consider (S times SP) where S.S#=SP.S# and SP.P#=’P2’ vs (S times (SP where SP.P#=’P2’) where S.S# = SP.S# • First option creates a VERY large intermediate table.

Project on name Project on name join Where SP.P# = ‘P2 Where SP.P# = ‘P2 join S SP S SP Stages of query optimization: • Slide 4 of [http://www.cs.wayne.edu/~shiyong/csc6710/slides/kiferComp_348761_ppt11.ppt] • Internal representation of a relational algebra expression (e.g. a query tree) (S join (SP where P#=’P2’))[name]((S join SP) where P#=’P2) [name]

Canonical forms • Set of queries, called C, such that for every possible query there is a query in C that is equivalent. • Reason for this? • Get suppliers who supply P2 can be expressed in 8 different ways. Ideally, the efficiency should not depend on the query form.

Examples: • Where p or (q and r)  where (p or q) and (p or r) -- Conjunctive normal form • (A where p) where q  A where (p and q) • (A[attributes]) where p  (A where p)[attributes] • A.F1 > B.F2 and B.F2 = 3  A.F1 > 3

Consider: Get suppliers not in London or Paris. SELECT S#, Sname, Status, City FROM dbo.S WHERE (City <> 'Paris') OR (City <> 'London') • Does this yield the expected result?

NOT (p and q)  NOT p or NOT q (DeMorgan’s laws • NOT (p or q)  NOT p and NOT q (DeMorgan’s laws • If system knew that SP.P# is a foreign key matching a primary key P.P# in P then (SP join P) [S#]  SP[S#]

Query optimizers do not optimize – just try to find “reasonably good” evaluation strategies. Might take longer to find optimal strategy than to do brute force method. • [http://www.cs.wayne.edu/~shiyong/csc6710/slides/kiferComp_348761_ppt11.ppt]

Summaries and References: • “MySQL's query optimization engine isn't always the most efficient when it comes to subqueries.” • often a good idea to convert a subquery to a join • [http://www.databasejournal.com/features/mysql/article.php/3813821/Five-Query-Optimizations-in-MySQL.htm]

Some additional information • [http://avid.cs.umass.edu/courses/645/s2009/lectures/Lec12-QueryOptimizer-x6.pdf]

SQL Server: Query Analysis: • Select a database • Create a query window (right click on the database and select new query) and enter an query • Query  Display Estimated Execution Plan • Query  Include Actual Execution Plan. • When the query is executed there will be an execution plan tab on the result pane.

Some example queries to analyze: • select * from S • select * from S order by S# • select * from S order by status • select * from SP where Qty = 200 • select * from S, SP where S.S#=SP.S# and P#='P2' • select * from S where S# in (select S# from SP where P#='P2') • S2 Replacement View • Any Red part View. • Only Red Parts1 view. • Do some views from the genealogy database (see next slide for family tree)

Mike Sarah Tom Ellen Jim Alice Jeanne Joseph Kathy Brian Ira Daniel John Joanne Tim Jason Linda Sally

Intro to SQL Server Optimization • [http://www.c-sharpcorner.com/UploadFile/john_charles/QueryoptimizationinSQLServer200512112007154303PM/QueryoptimizationinSQLServer2005.aspx]

Some useful terms: • [http://technet.microsoft.com/en-us/library/cc917672.aspx] • Clustered index • A clustered index is organized as a B-tree, where the nonleaf nodes are index pages and the leaf nodes are data pages. • A table can have only 1.

Nonclustered index • A nonclustered index is organized as a B-tree. Unlike a clustered index, a nonclustered index consists of only index pages. The leaf nodes in a nonclustered index are not data pages, but contain row locators for individual rows in the data pages.

Hash match inner join: [http://blogs.msdn.com/craigfr/archive/2006/08/10/687630.aspx].

Left Semi Join • “returns rows from one table that would join with another table without performing a complete join” [http://blogs.msdn.com/craigfr/archive/2006/07/19/671712.aspx] • Left Anti Semi Join • Similar to above but returns those that would NOT join.

Stream Aggregates: [http://blogs.msdn.com/craigfr/archive/2006/09/13/752728.aspx]

Index scan vs index seek: [http://blogs.msdn.com/craigfr/archive/2006/06/26/647852.aspx

Spooling: • Temporary “caching” of data for use by another activity: [http://technet.microsoft.com/en-us/library/ms191221.aspx]

Relational Algebra