SQL Unit 5 Aggregation, GROUP BY, and HAVING

Kirk Scott

5.1 Grouping By One Field • 5.2 Grouping By More than One Field • 5.3 GROUP BY with HAVING • 5.4 More on Nulls

5.1 Grouping By One Field

1. Recall that the term aggregation referred to built-in functions like these: • COUNT, SUM, AVG, MAX, MIN, etc. • The results of such a function are based on the contents of more than one row in a table.

A simple example of the use of such a function would be: • SELECT SUM(salesprice) • FROM Carsale • This would find the sum of the salesprices of all of the cars listed in the Carsale table.

2. Remember also that the records in the Carsale table include the spno, and it is possible to write a query that orders the results of a query by that field: • SELECT * • FROM Carsale • ORDER BY spno

3. What if you would like subtotals of the salesprices for the cars sold by each salesperson? • This would involve finding a SUM, and it would also depend on the spno. • Both of these fields are in the Carsale table. • Here is a query that accomplishes this: • SELECT spno, SUM(salesprice) • FROM Carsale • GROUP BY spno

This query will give the subtotal for each spno in the Carsale table. • There will be only one row for each spno in the results of the query. • In a sense when you GROUP BY, it is like having the keyword DISTINCT in the query.

The aggregate functions ignore nulls, but GROUP BY does not • If any sales records had null spno's, the query results would also include a row where the sum of the salesprices for such records appeared. • However, in calculating the sums, null values for salesprice would still be ignored
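The behavior described above can be tried out with a small sketch. It assumes Python's built-in sqlite3 module and an in-memory database; the sample rows are invented for illustration, and only the table and column names come from the text.

```python
import sqlite3

# Hypothetical sample data; only the names Carsale, spno, custno and
# salesprice come from the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Carsale (spno TEXT, custno TEXT, salesprice INTEGER)")
conn.executemany(
    "INSERT INTO Carsale VALUES (?, ?, ?)",
    [
        ("s1", "c1", 10000),
        ("s1", "c2", 15000),
        ("s2", "c3", 20000),
        ("s2", "c4", None),   # null salesprice: ignored by SUM
        (None, "c5", 5000),   # null spno: forms a group of its own
    ],
)

rows = conn.execute(
    "SELECT spno, SUM(salesprice) FROM Carsale GROUP BY spno"
).fetchall()

# One row per distinct spno value, including the null spno.
subtotals = dict(rows)
print(subtotals)
```

The dictionary is used only to make the result independent of row order; the null spno appears as the Python key None.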

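The keyword GROUP has this in common with the keyword ORDER: • In many database systems the results of this query will come out sorted by the spno values. • The SQL standard does not guarantee this, however, so add an explicit ORDER BY if the order matters.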

4. Here is another example, using COUNT, where the function is applied to * rather than to a single field in the table. • The results will be what you would expect: the count of the number of car sales by each salesperson. • SELECT spno, COUNT(*) • FROM Carsale • GROUP BY spno

Recall that COUNT(*) counts records rather than the values of any one field, so nulls in individual fields do not affect it. • It therefore counts all of the records in each group. • GROUP BY will include in the results a group that counts how many records had a null spno, if there were any such records.
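A minimal sketch of counting per group, again assuming Python's sqlite3 module and invented sample rows. The second query, using COUNT(salesprice), is not in the text; it is added here only as a contrast to show that a count on a single field skips nulls while COUNT(*) does not.

```python
import sqlite3

# Hypothetical sample data for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Carsale (spno TEXT, custno TEXT, salesprice INTEGER)")
conn.executemany(
    "INSERT INTO Carsale VALUES (?, ?, ?)",
    [
        ("s1", "c1", 10000),
        ("s1", "c2", None),
        ("s2", "c3", 20000),
        (None, "c4", 5000),
        (None, "c5", None),
    ],
)

# COUNT(*) counts every record in each group.
counts = dict(conn.execute(
    "SELECT spno, COUNT(*) FROM Carsale GROUP BY spno"
).fetchall())

# COUNT(salesprice) counts only the non-null salesprice values per group.
priced = dict(conn.execute(
    "SELECT spno, COUNT(salesprice) FROM Carsale GROUP BY spno"
).fetchall())

print(counts)
print(priced)
```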

5. It's not necessary to include the GROUP BY field in the query results. • These results may not be very useful, but this query is syntactically OK: • SELECT SUM(salesprice) • FROM Carsale • GROUP BY spno

On the other hand, there are limitations on what fields can be included in the results of a GROUP BY query. • A query like this is wrong: • SELECT spno, custno, SUM(salesprice) • FROM Carsale • GROUP BY spno

The reason is simple. • By definition, there will only be one row per spno in the results of the query. • However, it is possible that there would be more than one custno per salesperson.

It would not be possible to show the multiple custno's belonging to a single spno, so this is not allowed. • It's true that in some cases there may only be one custno for a given spno, but even so, the syntax will not support exceptions like these.

The bottom line is that in a GROUP BY query, the SELECT can include at most the GROUP BY field and aggregate functions such as SUM(salesprice) or COUNT(*).

6. It is possible to use GROUP BY and ORDER BY together in a single query. • This is a simple, practical example. • It illustrates the fact that you can order the results by the aggregate if you want to. • Recall that without ORDER BY the results usually come out ordered by the GROUP BY field. • SELECT spno, COUNT(*) • FROM Carsale • GROUP BY spno • ORDER BY COUNT(*)
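As a sketch of ordering by the aggregate, the following assumes Python's sqlite3 module and invented sample rows in which no two salespeople have the same count, so the resulting order is unambiguous.

```python
import sqlite3

# Hypothetical sample data: s1 has 3 sales, s2 has 1, s3 has 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Carsale (spno TEXT, custno TEXT, salesprice INTEGER)")
conn.executemany(
    "INSERT INTO Carsale VALUES (?, ?, ?)",
    [
        ("s1", "c1", 10000),
        ("s1", "c2", 12000),
        ("s1", "c3", 9000),
        ("s2", "c4", 20000),
        ("s3", "c5", 15000),
        ("s3", "c6", 11000),
    ],
)

# Group by salesperson, then sort the groups by their counts (ascending).
rows = conn.execute(
    "SELECT spno, COUNT(*) FROM Carsale GROUP BY spno ORDER BY COUNT(*)"
).fetchall()

for spno, n in rows:
    print(spno, n)
```

The rows come back sorted by the count, not by spno, which is the point of adding the ORDER BY clause.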

5.2 Grouping By More than One Field

1. It is also possible to GROUP BY more than one field at a time in a query. • For example: • SELECT make, model, SUM(stickerprice) • FROM Car • GROUP BY make, model

This query will give the sum of the stickerprices for every possible combination of make and model. • Each of these combinations will appear only once in the results. • Again, the effect is similar to having the keyword DISTINCT in a query.

The results would also include rows for the three cases where either the make, model, or both fields were null in the original records in the Car table. • No fields other than make and model (and the aggregate) could be included in the select clause. • Also, both make and model are optional in the SELECT, although in most cases the query results would probably be more useful if they were included.
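Grouping on two fields can be sketched as follows, assuming Python's sqlite3 module. The rows are invented for illustration, and one of them has a null model to show that it gets a group of its own.

```python
import sqlite3

# Hypothetical sample data; the names Car, make, model, year and
# stickerprice come from the text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Car (make TEXT, model TEXT, year INTEGER, stickerprice INTEGER)"
)
conn.executemany(
    "INSERT INTO Car VALUES (?, ?, ?, ?)",
    [
        ("Chevrolet", "Malibu", 2006, 20000),
        ("Chevrolet", "Malibu", 2007, 22000),
        ("Chevrolet", "Impala", 2006, 25000),
        ("Ford", "Focus", 2005, 15000),
        ("Ford", None, 2006, 18000),  # null model: a separate group
    ],
)

rows = conn.execute(
    "SELECT make, model, SUM(stickerprice) FROM Car GROUP BY make, model"
).fetchall()

# Key each sum by its (make, model) combination.
totals = {(make, model): total for make, model, total in rows}
print(totals)
```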

2. It is again useful to compare the GROUP BY query with the analogous ORDER BY query: • SELECT make, model • FROM Car • ORDER BY make, model

In this query the primary sort key is make and the secondary sort key is model. • The results of the query will show every combination of make and model that occurs in the Car table sorted first by make, and within make by model. • The corresponding GROUP BY query will show the sums of the stickerprices for every combination of make and model in the table and the results will be given in the same order as the ORDER BY query.

3. Observe that it would also be possible to write queries where the order that the fields are selected is changed. • The sums for the various combinations of make and model wouldn't change, but the orders of the columns and rows in the results would change. • The first example would put the model column before the make column, but the sort order of the rows would be the same as in the previous example.

It is conceivable that someone might want to write a query like this: • SELECT model, make, SUM(stickerprice) • FROM Car • GROUP BY make, model

The second example would put the make column first and the model column second, but the sort order has been changed to sort first by model and then by make. • It seems unlikely that anyone would write the query in this way intentionally, but it is possible that all they're interested in is the sum for each combination of make and model and the sort order doesn't make a difference.

In any case, it's syntactically OK: • SELECT make, model, SUM(stickerprice) • FROM Car • GROUP BY model, make

4. It bears repeating that including a GROUP BY field in the SELECT is optional. • For example, the following example would be OK. • The results will only show the make and sum in each row, but there will be a row for each combination of make and model: • SELECT make, SUM(stickerprice) • FROM Car • GROUP BY make, model
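The effect of leaving a GROUP BY field out of the SELECT can be seen in a small sketch, assuming Python's sqlite3 module and invented sample rows. The same make shows up on more than one row because there is still one row per combination of make and model.

```python
import sqlite3

# Hypothetical sample data for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Car (make TEXT, model TEXT, year INTEGER, stickerprice INTEGER)"
)
conn.executemany(
    "INSERT INTO Car VALUES (?, ?, ?, ?)",
    [
        ("Chevrolet", "Malibu", 2006, 20000),
        ("Chevrolet", "Malibu", 2007, 22000),
        ("Chevrolet", "Impala", 2006, 25000),
        ("Ford", "Focus", 2005, 15000),
    ],
)

# model is grouped on but not selected.
rows = conn.execute(
    "SELECT make, SUM(stickerprice) FROM Car GROUP BY make, model"
).fetchall()

# Sorted in Python only to make the printed order predictable.
for make, total in sorted(rows):
    print(make, total)
```

Chevrolet appears twice, once for each of its models, even though the model name itself is not shown.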

It also bears repeating that it is not possible to include in the SELECT any fields except for the aggregate field and the fields in the GROUP BY. • This is because there may be multiple values for the additional field for each combination of the GROUP BY fields.

For example, this query is wrong: • SELECT make, model, year, SUM(stickerprice) • FROM Car • GROUP BY make, model

5. It is always possible to specify an order for the results of a query in addition to doing GROUP BY. • This example is kind of silly, because it simply accomplishes what could be accomplished by putting the fields in the GROUP BY in the other order. • But it does illustrate how the syntax for ORDER BY will override the ordering that otherwise would be used by GROUP BY: • SELECT make, model, SUM(stickerprice) • FROM Car • GROUP BY model, make • ORDER BY make, model

This example illustrates a more practical use of the syntax. • Notice again that it's possible to use the aggregate function in the ORDER BY: • SELECT make, model, SUM(stickerprice) • FROM Car • GROUP BY make, model • ORDER BY SUM(stickerprice) DESC
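A sketch of sorting grouped results from the largest sum to the smallest, assuming Python's sqlite3 module and invented sample rows whose sums are all different.

```python
import sqlite3

# Hypothetical sample data for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Car (make TEXT, model TEXT, year INTEGER, stickerprice INTEGER)"
)
conn.executemany(
    "INSERT INTO Car VALUES (?, ?, ?, ?)",
    [
        ("Ford", "Focus", 2005, 15000),
        ("Chevrolet", "Malibu", 2006, 20000),
        ("Chevrolet", "Impala", 2006, 25000),
        ("Chevrolet", "Malibu", 2007, 22000),
    ],
)

# DESC puts the combination with the largest total first.
rows = conn.execute(
    "SELECT make, model, SUM(stickerprice) "
    "FROM Car "
    "GROUP BY make, model "
    "ORDER BY SUM(stickerprice) DESC"
).fetchall()

for row in rows:
    print(row)
```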

5.3 GROUP BY with HAVING

1. In a simple query, a WHERE clause causes the SELECT to pick out only certain sets of records in a table based on a condition on the value of an individual field. • This is known as a selection or a restriction. • It might also be called a refinement of the query's results. • A query with a WHERE clause will potentially give as its results a subset of the results that would be returned by the same query without the WHERE clause.

In a query with GROUP BY, the HAVING clause plays a role similar to that of the WHERE clause in a simple query. • In other words, it can be used to restrict the results based on the value of the aggregate function for each group.

For example, this query will show the spno's and the sums of the salesprices of cars that they sold, but only for those salespeople who sold a total of at least 50000 dollars worth of cars overall: • SELECT spno, SUM(salesprice) • FROM Carsale • GROUP BY spno HAVING SUM(salesprice) >= 50000

Here is another straightforward example which will find the salespeople and the counts of the numbers of cars they sold, if they sold more than 4 cars: • SELECT spno, COUNT(*) • FROM Carsale • GROUP BY spno HAVING COUNT(*) > 4
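The first HAVING query above can be sketched with invented data, assuming Python's sqlite3 module. One salesperson is placed exactly at the 50000 threshold to show that >= includes it.

```python
import sqlite3

# Hypothetical sample data: totals are s1 = 55000, s2 = 20000, s3 = 50000.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Carsale (spno TEXT, custno TEXT, salesprice INTEGER)")
conn.executemany(
    "INSERT INTO Carsale VALUES (?, ?, ?)",
    [
        ("s1", "c1", 30000),
        ("s1", "c2", 25000),
        ("s2", "c3", 20000),
        ("s3", "c4", 50000),
    ],
)

# HAVING keeps only the groups whose sum meets the condition.
rows = conn.execute(
    "SELECT spno, SUM(salesprice) "
    "FROM Carsale "
    "GROUP BY spno "
    "HAVING SUM(salesprice) >= 50000"
).fetchall()

big_sellers = dict(rows)
print(big_sellers)
```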

2. For better or worse, the HAVING clause can also be applied to the GROUP BY field or fields. • So, for example, this query is possible. • It will find the sum of the stickerprices for all of the Chevrolets and only the Chevrolets.

There will be only one row in the results: • SELECT make, SUM(stickerprice) • FROM Car • GROUP BY make HAVING make = 'Chevrolet'

There is nothing wrong with the previous example, but the following alternative may be preferable. • It is possible to have both WHERE and GROUP BY in the same query, and it might be helpful to use WHERE instead of HAVING whenever that is possible.

Here is a query that has the same results as the previous one: • SELECT make, SUM(stickerprice) • FROM Car • WHERE make = 'Chevrolet' • GROUP BY make
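The claim that the HAVING and WHERE versions give the same results can be checked with a sketch like this one, assuming Python's sqlite3 module and invented sample rows.

```python
import sqlite3

# Hypothetical sample data for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Car (make TEXT, model TEXT, year INTEGER, stickerprice INTEGER)"
)
conn.executemany(
    "INSERT INTO Car VALUES (?, ?, ?, ?)",
    [
        ("Chevrolet", "Malibu", 2006, 20000),
        ("Chevrolet", "Malibu", 2007, 22000),
        ("Chevrolet", "Impala", 2006, 25000),
        ("Ford", "Focus", 2005, 15000),
    ],
)

# Condition applied to the groups after grouping.
having_rows = conn.execute(
    "SELECT make, SUM(stickerprice) FROM Car "
    "GROUP BY make HAVING make = 'Chevrolet'"
).fetchall()

# Condition applied to the individual records before grouping.
where_rows = conn.execute(
    "SELECT make, SUM(stickerprice) FROM Car "
    "WHERE make = 'Chevrolet' GROUP BY make"
).fetchall()

print(having_rows)
print(where_rows)
```

Both queries return a single row for Chevrolet. The WHERE version discards the other makes before any grouping is done, which is one reason it may be preferable.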

Keep in mind that it is possible to do inequalities on text fields. • This query would find the sums of the stickerprices for all makes whose names appear after Chevrolet in alphabetical order: • SELECT make, SUM(stickerprice) • FROM Car • WHERE make > 'Chevrolet' • GROUP BY make

3. It is possible to have both a condition on a GROUP BY field (a non-aggregate field) and the aggregate field in a query. • Again, it may be helpful to keep them straight by using WHERE for the condition on the GROUP BY field. • You have to use HAVING on the aggregate field in any case.

So, for example, this query will find the makes and the sums of their stickerprices for makes that appear after Chevrolet in alphabetical order, and whose stickerprice sums are greater than or equal to 50000. • Notice that even though the word "and" appears in the verbal description, the keyword AND does not belong in the syntax of a correct query implementing this:

SELECT make, SUM(stickerprice) • FROM Car • WHERE make > 'Chevrolet' • GROUP BY make HAVING SUM(stickerprice) >= 50000
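A sketch of WHERE and HAVING working together, assuming Python's sqlite3 module and invented sample rows. The text comparison here relies on SQLite's default comparison of strings; other systems may treat upper and lower case differently.

```python
import sqlite3

# Hypothetical sample data. Totals per make:
# Audi 60000, Chevrolet 67000, Ford 55000, Toyota 20000.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Car (make TEXT, stickerprice INTEGER)")
conn.executemany(
    "INSERT INTO Car VALUES (?, ?)",
    [
        ("Audi", 60000),
        ("Chevrolet", 40000),
        ("Chevrolet", 27000),
        ("Ford", 30000),
        ("Ford", 25000),
        ("Toyota", 20000),
    ],
)

# WHERE removes Audi and Chevrolet; HAVING then removes Toyota.
rows = conn.execute(
    "SELECT make, SUM(stickerprice) "
    "FROM Car "
    "WHERE make > 'Chevrolet' "
    "GROUP BY make "
    "HAVING SUM(stickerprice) >= 50000"
).fetchall()

result = dict(rows)
print(result)
```

Audi has a large enough sum but fails the WHERE condition, and Toyota passes the WHERE condition but fails the HAVING condition, so only Ford remains.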

4. All of the examples so far have concentrated on conditions on the GROUP BY fields or the aggregate. • As usual, most things in SQL mix and match. • It is also possible to put a WHERE condition on any other field or fields of the table.

For example: • SELECT make, SUM(stickerprice) • FROM Car • WHERE make > 'Chevrolet' • AND year > 2005 • GROUP BY make HAVING SUM(stickerprice) >= 50000
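The following sketch, assuming Python's sqlite3 module and invented sample rows, shows a consequence of this query that is easy to miss: the WHERE conditions remove records before the sums are computed, so the sums cover only cars newer than 2005.

```python
import sqlite3

# Hypothetical sample data for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Car (make TEXT, year INTEGER, stickerprice INTEGER)")
conn.executemany(
    "INSERT INTO Car VALUES (?, ?, ?)",
    [
        ("Chevrolet", 2007, 90000),  # removed by make > 'Chevrolet'
        ("Ford", 2004, 30000),       # removed by year > 2005
        ("Ford", 2006, 30000),
        ("Ford", 2007, 25000),
        ("Honda", 2005, 40000),      # removed by year > 2005
        ("Honda", 2006, 30000),
    ],
)

rows = conn.execute(
    "SELECT make, SUM(stickerprice) "
    "FROM Car "
    "WHERE make > 'Chevrolet' AND year > 2005 "
    "GROUP BY make "
    "HAVING SUM(stickerprice) >= 50000"
).fetchall()

result = dict(rows)
print(result)
```

Honda's cars total 70000 overall, but only 30000 of that is from cars newer than 2005, so Honda does not pass the HAVING condition. Ford's sum is reported as 55000, not 85000, for the same reason.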

5. The ability to mix and match extends to joins. • It is possible to have a join query where the grouping is done on the field of one table, while the aggregate is done on a field of the other table. • Such a query could also include the keyword HAVING as well as other elements of SQL queries unrelated to grouping.

This last example dispenses with HAVING and with WHERE conditions other than the joining condition, in order to clearly illustrate doing a join and GROUP BY together. • SELECT commrate, SUM(salesprice) • FROM Salesperson, Carsale • WHERE Salesperson.spno = Carsale.spno • GROUP BY commrate
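A sketch of the join query, assuming Python's sqlite3 module. The rows are invented, and the name column in Salesperson is a hypothetical addition; spno, commrate and salesprice come from the text.

```python
import sqlite3

# Hypothetical sample data: s1 and s2 share a commission rate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Salesperson (spno TEXT, name TEXT, commrate REAL)")
conn.execute("CREATE TABLE Carsale (spno TEXT, custno TEXT, salesprice INTEGER)")
conn.executemany(
    "INSERT INTO Salesperson VALUES (?, ?, ?)",
    [("s1", "Ann", 0.05), ("s2", "Bob", 0.05), ("s3", "Cy", 0.06)],
)
conn.executemany(
    "INSERT INTO Carsale VALUES (?, ?, ?)",
    [
        ("s1", "c1", 10000),
        ("s2", "c2", 20000),
        ("s3", "c3", 15000),
        ("s3", "c4", 5000),
    ],
)

# Grouping is on a Salesperson field; the aggregate is on a Carsale field.
rows = conn.execute(
    "SELECT commrate, SUM(salesprice) "
    "FROM Salesperson, Carsale "
    "WHERE Salesperson.spno = Carsale.spno "
    "GROUP BY commrate"
).fetchall()

by_rate = dict(rows)
print(by_rate)
```

The sales of s1 and s2 are combined into one row because both salespeople have the same commrate.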
