1 / 117

Project Part II

Project Part II. Starting Point: Existing DB Application (Java + SQL) ideally, your system from Part I of the project otherwise, the student enrollment demo app (+ queries) Task Implement a DB library: relational algebra on file system Replace all JDBC statements with calls to your library

august
Download Presentation

Project Part II

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Project Part II • Starting Point: Existing DB Application (Java + SQL) • ideally, your system from Part I of the project • otherwise, the student enrollment demo app (+ queries) • Task • Implement a DB library: relational algebra on file system • Replace all JDBC statements with calls to your library • Goal • Understand the internals of a database system • Understand that there is always an alternative • Brief glimpse on the „Information Systems“ lecture

  2. The Data Management Universe Project Part I Project Part 2

  3. Components of a Database System „Naive“ User Expert User App- Developer DB- admin Application Ad-hoc Query Compiler Management tools DML-Compiler DDL-Compiler Query Optimizer DBMS Runtime Schema TA Management Recovery Storage Manager Logs Indexes DB Catalogue External Storage (e.g., disks)

  4. Application {tuples} SQL Server (Network, App Buffers, Security) Query Processor Metadata Mgmt (Schema, Stats) Transactions (Locking, Logging) Data Manager (Indexes, Records) Storage Manager (Pages) get/put block Storage System (disks, SSDs, SAN, …)

  5. Why use a DBMS? • Avoid redundancy and inconsistency • Query Processor, Transaction Manager, Catalog • Rich (declarative) access to the data • Query Processor • Synchronize concurrent data access • Transaction Manager • Recovery after system failures • Transaction Manager, Storage Layer • Security and privacy • Server, Catalog

  6. Application {tuples} SQL / rel. algebra Server (Network, AppBuffers, Security) Query Processor Query Processor Metadata Mgmt (Schema, Stats) Transactions (Locking, Logging) Indexes, Records Data Manager Manager, Pages Storage Manager block get/put Storage System (disks, SSDs, SAN, …) Project Part II

  7. What does a Database System do? • Input: SQL statement • Output: {tuples} • Translate SQL into a setofget/putreq. to backend storage • Extract, process, transformtuplesfromblocks • Tons ofoptimizations • Efficientalgorithmsfor SQL operators (hashing, sorting) • Layout ofdata on backend storage (clustering, freespace) • Orderingofoperators (small intermediate results) • Semanticrewritingsofqueries • Buffermanagementandcaching • Parallel executionandconcurrency • Outsmart the OS • Partitioningand Replication in distributedsystem • IndexingandMaterialization • Loadandadmissioncontrol • + Security + Durability + ConcurrencyControl + Tools

  8. Database Optimizations • Query Processor (based on statistics) • Efficient algorithms for SQL operators (hashing, sorting) • Ordering of operators (small intermediate results) • Semantic rewritings of queries • Parallel execution and concurrency • Storage Manager • Load and admission control • Layout of data on backend storage (clustering, free space) • Buffer management and caching • Outsmart the OS • Transaction Manager • Load and admission control • Tools (based on statistics) • Partitioning and Replication in distributed system • Indexing and Materialization

  9. DBMS vs. OS Optimizations • Many DBMS tasks are also carried out by OS • Load control • Buffer management • Access to external storage • Scheduling of processes • … • What is the difference? • DBMS has intimate knowledge of workload • DBMS can predict and shape access pattern of a query • DBMS knows the mix of queries (all pre-compiled) • DBMS knows the contention between queries • OS does generic optimizations • Problem: OS overrides DBMS optimizations!

  10. Application {tuples} SQL / rel. algebra Server (Network, AppBuffers, Security) Query Processor Query Processor Metadata Mgmt (Schema, Stats) Transactions (Locking, Logging) Indexes, Records Data Manager Manager, Pages Storage Manager block get/put Storage System (disks, SSDs, SAN, …)

  11. Query Processor SQL {tuples} Parser QGM Rewrite Runtime System Compiler QGM Optimizer QGM++ Plan CodeGen Interpreter

  12. SQL -> Relational Algebra x Rk SQL select A1, ..., An from R1, ..., Rk where P; Relational Algebra  A1, ..., An(P (R1 x ... x Rk )) A1, ..., An P x x R3 R1 R2

  13. Runtime System • Three approaches • A. Compile query into machine code • B. Compile query into relational algebra and interpret that • C. Hybrid: e.g., compile predicates into machine code • What to do? • A: better performance • B: easier debugging, better portability • Project: use Approach B • Query Interpreter • provide implementation for each algebra operator • define interface between operators

  14. Algorithms for Relational Algebra • Table Access • scan (load each page at a time) • index scan (if index available) • Sorting • Two-phase external sorting • Joins • (Block) nested-loops • Index nested-loops • Sort-Merge • Hashing (many variants) • Group-by (~ self-join) • Sorting • Hashing

  15. Two-phase External Sorting • Phase I: Create Runs • Loadallocatedbufferspacewithtuples • Sorttuples in bufferpool • Write sortedtuples (run) todisk • Goto Step 1 (createnextrun) until all tuplesprocessed • Phase II: Merge Runs • Usepriorityheaptomergetuplesfromruns • Special cases • buffer >= N: nomergeneeded • buffer < sqrt(N): multiple mergephasesnecessary • (N sizeoftheinput in pages)

  16. External Sort 97 17 3 5 27 16 2 99 13

  17. External Sort 97 17 3 5 27 16 2 99 13 load 97 17 3

  18. External Sort 97 17 3 5 27 16 2 99 13 sort 3 17 97

  19. External Sort 3 17 97 97 17 3 5 27 16 2 99 13 write run 3 17 97

  20. External Sort 3 17 97 97 17 3 5 27 16 2 99 13 load 5 27 16

  21. External Sort 3 17 97 5 16 27 97 17 3 5 27 16 2 99 13 sort & write 5 16 27

  22. External Sort 3 17 97 5 16 27 97 17 3 5 27 16 2 99 13 load 2 99 13

  23. External Sort 3 17 97 5 16 27 2 13 99 97 17 3 5 27 16 2 99 13 End of Phase 1 2 13 99

  24. External Sort 3 17 97 5 16 27 2 13 99 merge 3 5 2

  25. External Sort 3 17 97 5 16 27 2 13 99 2 merge 3 5 2

  26. External Sort 3 17 97 5 16 27 2 13 99 2 3 merge 3 5 13

  27. External Sort 3 17 97 5 16 27 2 13 99 2 3 5 merge 17 5 13

  28. External Sort 3 17 97 5 16 27 2 13 99 2 3 5 merge 17 16 13

  29. External Sort 3 17 97 5 16 27 2 13 99 2 3 5 13 17 16 13

  30. Multi-way Merge (N = 13; M = 2)

  31. Analysis • N: size of table in pages • M: size of (available) main memory in pages • IO Cost • O(N): if M >= sqrt(N) • 2 * N: if M >= N • 4 * N: if N > M >= sqrt(N) • O(N logM N): if M < sqrt(N) • Base of logarithm: in O notation not relevant, but constants matter • CPU Cost (M >= sqrt(N)) • Phase 1 (create N/M runs of length M): O(N * log2 M) • Phase 2 (merge N tuples with heap): O(N * log2 N/M) • Exercise: Do CPU cost increase/decrease with M?

  32. Sorting Summary • Complexity: N * log(N) theory is right, but • DB people care about CPU and IO complexity • Constants matter! • Buffer allocation matters! Many concurrent queries? • More main memory can hurt performance! • Main memory is large. Do two-way sort because… • Parallelize sorting on different machines • Or many concurrent sorts on same machine • But more than 2-ways very rare in practice • Knuth suggests Replacement Selection • Increases length of runs • But, higher constant for CPU usage • Typically, not used in practice

  33. (Grace) Hash Join

  34. Grace Hash Join

  35. Sorting vs. Hashing • Both techniques can be used for joins, group-by, … • binary and unary matching problems • Same asymptotic complexity: O(N log N) • In both IO and CPU • Hashing has lower constants for CPU complexity • IO behavior is almost identical • Merging (Sort) vs. Partitioning (Hash) • Merging done afterwards; Partitioning done before • Partitioning depends on good statistics to get right • Sorting more robust. Hashing better in average case!

  36. Iterator Model • Plan contains many operators • Implement each operator indepently • Define generic interface for each operator • Each operator implemented by an iterator • Three methods implemented by each iterator • open(): initialize the internal state (e.g., allocate buffer) • char* next(): produce the next result tuple • close(): clean-up (e.g., release buffer) • N.B. Modern DBMS use a Vector Model • next() returns a set of tuples • Why is that better?

  37. Iterator Model at Work Application NLJoin NLJoin scan scan scan R S T

  38. Iterator Model at Work JDBC: execute() Application NLJoin NLJoin scan scan scan R S T

  39. Iterator Model at Work JDBC: execute() Application open() NLJoin NLJoin scan scan scan R S T

  40. Iterator Model at Work JDBC: execute() Application open() NLJoin open() NLJoin open() scan scan scan R S T

  41. Iterator Model at Work JDBC: execute() Application open() NLJoin open() NLJoin scan open/close for each R tuple scan scan R S T

  42. Iterator Model at Work JDBC: execute() Application open() NLJoin NLJoin scan scan open/close for each R,S tuple scan R S T

  43. Iterator Model at Work JDBC: next() Application next() NLJoin next() NLJoin next() scan scan scan R S T

  44. Iterator Model at Work JDBC: next() Application next() NLJoin next() NLJoin r1 scan scan scan R S T

  45. Iterator Model at Work JDBC: next() Application next() NLJoin next() NLJoin r1 scan open() scan scan R S T

  46. Iterator Model at Work JDBC: next() Application next() NLJoin next() NLJoin r1 scan next() scan scan R S T

  47. Iterator Model at Work JDBC: next() Application next() NLJoin next() NLJoin r1 scan s1 scan scan R S T

  48. Iterator Model at Work JDBC: next() Application next() NLJoin next() NLJoin r1 scan next() scan scan R S T

  49. Iterator Model at Work JDBC: next() Application next() NLJoin next() NLJoin r1 scan s2 scan scan R S T

  50. Iterator Model at Work JDBC: next() Application next() NLJoin r1, s2 NLJoin scan scan scan R S T

More Related