SQAK: Doing More with Keywords

1 / 50

# SQAK: Doing More with Keywords - PowerPoint PPT Presentation

## SQAK: Doing More with Keywords

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
##### Presentation Transcript

1. SQAK:Doing More with Keywords Sandeep Tata, Guy M. Lohman IBM Almaden Research Center Presented by Alex Zlotnik Seminar in Databases, 236826 azlotnik@tx.technion.ac.il

2. Aggregate Query A query that uses one of these functions: {Count, Average, Sum, Min, Max}

3. SQAK = SQL Aggregates using Keywords

4. Content • The problem • Solution • Research • Experiments • Other Challenges • Power vs. Ease of Use • Overview

5. 3 Courses Professor Section Enrollment Courses Department Section Enrollment Student Section 4 courseid courseid deptid courseid courseid sectionid id id sectionid Join courseid name name sectionid studentid name sectionid studentid name name Join sectionid dept deptid deptid location term deptid term grade grade term instructor Section courseid sectionid term instructor The problem Number of students registered Write SQL for Number of students registered for the course “Seminar in Databases” in the Fall semester in 2009 course “Seminar in Databases” Fall semester in 2009 in less than 3 minutes 1 2

6. The problem 5. SQL: SELECT courses.name, section.term, count(students.id) as count FROM students, enrollment, section, courses WHERE students.id = enrollment.id AND section.classid = enrollment.classid AND courses.courseid = section.courseid AND lower(courses.name) LIKE ’\%seminar in databases\%’ AND lower(section.term) = ’\%fall 2009\%’ GROUP BY courses.name, section.term Number of students registered Write SQL for Number of students registered for the course “Seminar in Databases” in the Fall semester in 2009 course “Seminar in Databases” Fall semester in 2009 in less than 3 minutes

7. Perfect world solution SQAK: “Seminar in Databases” “Fall 2009” number students SQL: SELECT courses.name, section.term, count(students.id) as count FROM students, enrollment, section, courses WHERE students.id = enrollment.id AND section.classid = enrollment.classid AND courses.courseid = section.courseid AND lower(courses.name) LIKE ’\%seminar in databases\%’ AND lower(section.term) = ’\%fall 2009\%’ GROUP BY courses.name, section.term

8. Perfect world solution • Create aggregate queries using simple keywords • Little or no knowledge of the schema is required from the user • No changes required in the database • Any existing database

9. Progress • The need • The goal • Research Yada • Experiments • Other Challenges • Power vs. Ease of Use • Overview

10. Generating SQL - Contents

11. Generating SQL - Parser Parser (keywords) Keyword  {Candidates | Candidate = (Table, Column)} • Matching keyword against schema elements with approximate string matching Example: “students”  Enrollment.studentid “students”  Student.id • Inverted index from all text values to their columns • Example: “Seminar in Databases”  Courses.name • Aggregates: sum, count, avg, min, max Candidate Interpretations = Cross product of each Candidates list “Seminar in Databases” “Fall 2009” number students  {(Course.name, Section.term, count Enrollment.studentid), (Course.name, Section.term, count Student.id )}

12. number student “Cohen” {(count, Student.Id, Student.Name), (count, Student.Name, Student.Name)} number student “Cohen” {(count, Student.Id, Student.Name), (count, Student.Id, Enroll.StudentId)} Generating SQL - Parser Parser (keywords) Initial Filtering - CI with 2 keywords corresponding to the same column - CI with 2 columns that are primary and foreign key

13. SQN Builder (Candidate Interpretations) For every candidate interpretation Build the best matching sub-graph of the schema Enrollment Section Students Courses Generating SQL – SQN Builder “Seminar in Databases” “Fall 2009” number “Alex” (Course.name, Section.term, count Student.name)

14. Generating SQL – Scorer Scorer (SQNs) Find the best SQN and create SQL for it

15. Building Sub-graph (SQN) • Input: Tables as nodes in the directed schema graph • Output: Connected sub-graph covering the tables • Principle: Simplest model Making fewest assumptions, used in other papers too • Attempt #1: Minimal covering sub-graph with directed path between every 2 tables

16. Enrollment Section Section Professor Students Course Building Sub-graph (SQN) • Attempt #1: Minimal covering sub-graph with directed path between every 2 tables • Problem: Many-to-Many relationships

17. Enrollment Section Students Professor Course Weak Reference Department Building Sub-graph (SQN) • Attempt #2: Minimal covering sub-graph with Node Clarity • Node Clarity: The sub-graph doesn’t contain any node with multiple incoming edges

18. Enrollment Section Students Students Course Course Weak Reference Department Building Sub-graph (SQN) • Example:Find the number of students per course • Query: courses count students For each course, list the number of students that are in the same department that offers the course Enrollment Section Students Professor Course Department

19. Building Sub-graph (SQN) • Input: Tables as nodes in the directed schema graph • Output: Minimal sub-graph with Node Clarity covering the tables • Observation: The output is a tree NP Complete Reduction from Exact 3-Cover

20. Building Sub-graph (SQN) Greedy Heuristic Algorithm for finding min SQN CI = {nodes (tables) of keywords} Qagg = Aggregate node SQN = {} Start BFS (non-directed) from the aggregate node, For every step i 1. Qi = Nodes discovered in step i 2. for every node q in CI  Qi 2.1 If NodeClear(q.path  SQN) 2.1.1 SQN  SQN  q.path 2.1.2 CI  CI \ {q} 2.1.3 if (CI = {}) return SQN 3. If no progress was made 3.1 backtrack the added path 4. Stop when all nodes in CI where found or BFS finishes Minimality: By BFS

21. Building Sub-graph (SQN) Greedy Heuristic Algorithm for finding min SQN The algorithm finds minimal SQN Complexity: Without backtracking: O(q2E2) Otherwise, exponential Time limit: Stop the algorithm after fixed time and run without node clarity = approx. Steiner In this case SQAK warns the user that the result might not be accurate

22. Generating SQL – Scorer (reminding) Scorer (SQNs) Find the best SQN and create SQL for it Score(CI, SQN) =

23. Generating SQL Procedure makeSimpleStatement(CI, SQN) • Make SELECT clause from elements in CI • Make FROM clause from nodes in SQN • Make WHERE clause from edges in SQN • Make GROUP BY clause from elements of CI except aggregated node • Add predicates in CI to the WHERE clause • Return statement end procedure

24. Generating SQL • Input: CI, SQN • Output: SQL • 3 types of queries: • Simple: “Seminar in DB” count students • Top1, single level aggregate: “department with max num students” • Top1, double level aggregate: “department student with max avg grade”

25. Generating SQL translateSQN(CI, SQN) 1. if SQN does not have a w-node then 1.1 Return makeSimpleStatement(CI, SQN) end if 2. if SQN has a w-node and a single level aggregate then 2.1 Produce view u = makeSimpleStatement(CI,SQN) 2.2 Remove w-node from u’s SELECT clause and GROUP BY clause 2.3 r = makeSimpleStatement(CI, SQN) 2.4 Add u to r’s FROM clause 2.5 Add join conditions joining all the columns in u to the corresponding ones in r 2.6 return r end if

26. Generating SQL: Top1 single level aggregate • “department with max num students” WITH temp(DEPTID, COURSEID) AS ( SELECT DEPARTMENT.DEPTID, count(COURSES.COURSEID) FROM COURSES, DEPARTMENT WHERE DEPARTMENT.DEPTID = COURSES.DEPTID GROUP BY DEPARTMENT.DEPTID), temp2(COURSEID) AS (SELECT max(COURSEID) FROM temp) SELECT temp.DEPTID, temp.COURSEID FROM temp, temp2 WHERE temp.COURSEID = temp2.COURSEID

27. Generating SQL translateSQN(CI, SQN) … 3. if SQN has a w-node and a double level aggregate then 3.1 Produce view u = makeSimpleStatement (CI,SQN) 3.2 Produce view v = aggregate of u from the second level aggregate term in the CI excluding the w-node in the SELECT and GROUP BY clauses 3.3 Produce r = Join u and v, equation on all the common columns 3.4 Return r end if

28. Generating SQL: Top1 double level aggregate • “department student with max avg grade” WITH temp( DEPTID, ID, GRADE) AS ( SELECT STUDENTS.DEPTID, STUDENTS.ID, avg(ENROLLMENT.GRADE) FROM ENROLLMENT, STUDENTS WHERE STUDENTS.ID = ENROLLMENT.ID GROUP BY STUDENTS.DEPTID , STUDENTS.ID), temp2( DEPTID, GRADE) AS (SELECT DEPTID, max(GRADE) FROM temp GROUP BY DEPTID) SELECT temp.DEPTID, temp.ID, temp.GRADE FROM temp, temp2 WHERE temp.DEPTID = temp2.DEPTID AND temp.GRADE = temp2.GRADE

29. SQAK Expressiveness • No formal definition of rSQL expressiveness – Future work • Queries based on weak reference – cannot be expressed

30. Progress • The need • The goal • Research Yada • Experiments • Other Challenges • Power vs. Ease of Use • Overview

31. Experiments Metrics • Data Precision Irrelevant, either the correct data is retrieved or not • Effectiveness • Savings • Parameters • Cost

32. Experiments - Effectiveness Average grade received by a student named William from the EECS department • SQAK – 93% (14 out of 15) • Steiner – 60% ( 9 out of 15)

33. Experiments - Effectiveness SQAK – 100% Steiner – 87% (13 of 15)

34. TPCH Database

35. Experiments - Savings CS num students VS. SELECT DEPARTMENT.NAME, count(STUDENTS.ID) FROM STUDENTS, DEPARTMENT WHERE DEPARTMENT.DEPTID = STUDENTS.DEPTID AND lower(DEPARTMENT.NAME) LIKE ’%cs%’ GROUP BY DEPARTMENT.NAME

36. Experiments - Savings • Measure: #of schema elements + # of join conditions • M(CS num students) = 1 • M(SQL) = 5 SELECT DEPARTMENT.NAME, count(STUDENTS.ID) FROM STUDENTS, DEPARTMENT WHERE DEPARTMENT.DEPTID = STUDENTS.DEPTID AND lower(DEPARTMENT.NAME) LIKE ’%cs%’ GROUP BY DEPARTMENT.NAME • Saved = 5 – 1 = 4

37. Experiments - Savings • Measure: #of schema elements + # of join conditions • Not taken into account • SQL syntax and correctness • Top1 single and double level constructions

38. Experiments - Savings • Measure: #of schema elements + # of join conditions • Average Savings:

39. Experiments - Parameters •  - Mismatch tolerance threshold Match(keyword to column) <  ? MATCH : NOT_MATCH • f – Mismatch penalty (used by the scorer)

40. Experiments - Parameters • Low sensitivity to mismatch threshold [0.4, 0.8] • In lower thresholds wrong columns were selected for CI • As expected, lower penalty (f=1.5) leads to lower accuracy • Robust: f=2 or 3,  between 0.4-0.8

41. Experiments – More Schemas • Large DB of IT assets of large enterprise • 600 tables, each with several columns • Sample queries provide accuracy as presented • Generating SQL, always less than 1 second • Warehouse DB • 14 tables, star schema • Star schema is easier for SQAK – no backtracking

42. Progress • The need • The goal • Research Yada • Experiments • Other Challenges • Power vs. Ease of Use • Overview

43. Other challenges • Approximate Matching • The user doesn’t know columns names • Proposal: hint list, either on paper or in code, or ontology based normalization • Missing Referential Integrity • Referential constraints not defined by the DBA • Proposal: Use (out of scope) algorithm to discover referential constraints

44. Other challenges • Tied or Close Plans • Several SQNs with (close to) best score • Can occur in similar names, different semantic areas of DB • The user selects the relevant SQL • Future research: Visualizing the interesting SQNs

45. Other challenges • Expressiveness • Users adopt SQAK quickly and pose queries such as age > 18 • SQAK adds an appropriate WHERE clause

46. Power vs. Ease of Use

47. Overview • SQAK - A system to create SQL queries with aggregates from keywords • Useful – No knowledge of schema or changes are required • Expressive, but with limitations • Trade-off between correctness and computability cost. • Execution in many common cases – polynomial, exponential at worst case

48. Personal opinion • The idea is powerful • Experiments on industrial data would emphasize the strengths and the weaknesses • Some of the results were expected (Steiner tree) • Translating CI to SQN is solving the same NP complete problem repeatedly. Caching mechanism would be very beneficiary

49. Questions?

50. Merci