1 / 64

Databases 101 & Multimedia Support

Databases 101 & Multimedia Support. Joe Hellerstein Computer Science Division UC Berkeley. Overview for the day. Databases 101 Motivation Relational databases Relational Algebra & SQL Functional units in a DBMS Engine Indexes and Query Optimizers MM Objects in a DBMS ADT support

Download Presentation

Databases 101 & Multimedia Support

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Databases 101 & Multimedia Support Joe Hellerstein Computer Science Division UC Berkeley

  2. Overview for the day • Databases 101 • Motivation • Relational databases • Relational Algebra & SQL • Functional units in a DBMS Engine • Indexes and Query Optimizers • MM Objects in a DBMS • ADT support • GiST & multi-d indexing • Similarity Search and MM indexing

  3. What we’re skipping • Concurrency and Recovery • How DBs ensure your data is safe, despite concurrent access and crashes • Buffer Management • How to manage a main-mem cache, and still guarantee correctness in the face of crashes • Query optimization • how to map a declarative query (SQL) to a query plan (relational algebra + implementations) • Query processing algorithms • sort, hash, join algorithms • Database design • logical design: E-R models & normalization • physical design: indexes, clustering, tuning

  4. More Background • Overview Texts: • Ramakrishnan, Database Management Systems (the cow book) • Silberschatz, Korth, Sudarshan: Database System Concepts (the sailboat book) • O’Neil: Database Principles, Programming, Performance • Ullman & Widom: A 1st Course in Database Systems • Graduate-level Texts • Stonebraker & Hellerstein, eds. Readings in Database Systems (http://redbook.cs.berkeley.edu) • Gray & Reuter: Transaction Processing: Concepts and Techniques. • Ullman: Principles of Database Systems • Faloutsos: Searching Multimedia Databases by Content

  5. Database Management Systems • What more could we want than a file system? • Simple, efficient ad hoc1 queries • concurrency control • recovery • benefits of good data modeling 1ad hoc: formed or used for specific or immediate problems or needs

  6. Describing Data: Data Models • A data modelis a collection of concepts for describing data. • Aschemais a description of a particular collection of data, using the a given data model. • The relational model of datais the most widely used model today. • Main concept: relation, basically a table with rows and columns. • Every relation has a schema, which describes the columns, or fields.

  7. Levels of Abstraction View 1 View 2 View 3 • Many views, single conceptual (logical) schemaand physical schema. • Views describe how users see the data. • Conceptual schema defines logical structure • Physical schema describes the files and indexes used. Conceptual Schema Physical Schema

  8. Example: University Database • Conceptual schema: • Students(sid: string, name: string, login: string, age: integer, gpa:real) • Courses(cid: string, cname:string, credits:integer) • Enrolled(sid:string, cid:string, grade:string) • Physical schema: • Relations stored as unordered files. • Index on first column of Students. • External Schema (View): • Course_info(cid:string,enrollment:integer)

  9. Data Independence • Applications insulated from how data is structured and stored. • Logical data independence: Protection from changes in logical structure of data. • Physical data independence: Protection from changes in physical structure of data. • One of the most important benefits of using a DBMS!

  10. Query Optimization and Execution Relational Operators Files and Access Methods Buffer Management Disk Space Management DB These layers must consider concurrency control and recovery Structure of a DBMS • A typical DBMS has a layered architecture. • The figure does not show the concurrency control and recovery components. • This is one of several possible architectures; each system has its own variations.

  11. Advantages of a DBMS • Data independence • Efficient data access • Data integrity & security • Data administration • Concurrent access, crash recovery • Reduced application development time • So why not use them always? • Can be expensive, complicated to set up and maintain • This cost & complexity must be offset by need • - Often worth it!

  12. Relational Algebra p By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems, and, in effect, increases the mental power of the race. -- Alfred North Whitehead (1861 - 1947)

  13. Relational Query Languages • Query languages: Allow manipulation and retrieval of data from a database. • Relational model supports simple, powerful QLs: • Strong formal foundation based on logic. • Allows for much optimization. • Query Languages != programming languages! • QLs not expected to be “Turing complete”. • QLs not intended to be used for complex calculations. • QLs support easy, efficient access to large data sets.

  14. Formal Relational Query Languages Two mathematical Query Languages form the basis for “real” languages (e.g. SQL), and for implementation: • Relational Algebra: More operational, very useful for representing internal execution plans. (Database byte-code…) • Relational Calculus: Lets users describe what they want, rather than how to compute it. (Non-operational, declarative -- SQL comes from here.)

  15. Preliminaries • A query is applied to relation instances, and the result of a query is also a relation instance. • Schemasof input relations for a query are fixed (but query will run regardless of instance!) • The schema for the resultof a given query is also fixed! Determined by definition of query language constructs. • Languages are closed (can compose queries)

  16. R1 Example Instances • “Sailors” and “Reserves” relations for our examples. • We’ll use positional or named field notation, assume that names of fields in query results are `inherited’ from names of fields in query input relations. S1 S2

  17. Relational Algebra • Basic operations: • Selection ( ) Selects a subset of rows from relation. • Projection ( ) Deletes unwanted columns from relation. • Cross-product( ) Allows us to combine two relations. • Set-difference ( ) Tuples in reln. 1, but not in reln. 2. • Union( ) Tuples in reln. 1 and in reln. 2. • Additional operations: • Intersection, join, division, renaming: Not essential, but (very!) useful.

  18. Projection • Deletes attributes that are not in projection list. • Schema of result: • exactly the fields in the projection list, with the same names that they had in the (only) input relation. • Projection operator has to eliminate duplicates! (Why??) • Note: real systems typically don’t do duplicate elimination unless the user explicitly asks for it. (Why not?)

  19. Selection • Selects rows that satisfy selection condition. • No duplicates in result! • Schema of result: • identical to schema of (only) input relation. • Result relation can be the input for another relational algebra operation! (Operatorcomposition.)

  20. Union, Intersection, Set-Difference • All of these operations take two input relations, which must be union-compatible: • Same number of fields. • `Corresponding’ fields have the same type. • What is the schema of result?

  21. Cross-Product • S1 x R1: Each row of S1 is paired with each row of R1. • Result schema has one field per field of S1 and R1, with field names `inherited’ if possible. • Conflict: Both S1 and R1 have a field called sid. • Renaming operator:

  22. Joins • Condition Join: • Result schema same as that of cross-product. • Fewer tuples than cross-product, might be able to compute more efficiently • Sometimes called a theta-join.

  23. Joins • Equi-Join: Special case: condition c contains only conjunction of equalities. • Result schema similar to cross-product, but only one copy of fields for which equality is specified. • Natural Join: Equijoin on all common fields.

  24. SELECT [DISTINCT] target-list FROMrelation-list WHERE qualification Basic SQL Query • relation-list : A list of relation names • possibly with a range-variable after each name • target-list : A list of attributes of tables in relation-list • qualification : Comparisons combined using AND, OR and NOT. • Comparisons are Attr op const or Attr1 op Attr2, where op is one of • DISTINCT: optional keyword indicating that the answer should not contain duplicates. • Default is that duplicates are not eliminated!

  25. Conceptual Evaluation Strategy • Semantics of an SQL query defined in terms of the following conceptual evaluation strategy: • Compute the cross-product of relation-list. • Discard resulting tuples if they fail qualifications. • Delete attributes that are not in target-list. • If DISTINCT is specified, eliminate duplicate rows. • Probably the least efficient way to compute a query! • An optimizer will find more efficient strategiessame answers.

  26. Example of Conceptual Evaluation SELECT S.sname FROM Sailors S, Reserves R WHERE S.sid=R.sid AND R.bid=103

  27. A Note on Range Variables • Really needed only if the same relation appears twice in the FROM clause. The previous query can also be written as: SELECT S.sname FROM Sailors S, Reserves R WHERE S.sid=R.sid AND bid=103 Some folks suggest using range variables always! SELECT sname FROM Sailors, Reserves WHERE Sailors.sid=Reserves.sid AND bid=103 OR

  28. Find sailors who’ve reserved at least one boat SELECT S.sid FROM Sailors S, Reserves R WHERE S.sid=R.sid • Would adding DISTINCT to this query make a difference? • What is the effect of replacing S.sid by S.sname in the SELECT clause? • Would adding DISTINCT to this variant of the query make a difference?

  29. Expressions and Strings SELECT S.age, age1=S.age-5, 2*S.age AS age2 FROM Sailors S WHERE S.sname LIKE ‘B_%B’ • Arithmetic expressions, string pattern matching. • ASand = are two ways to name fields in result. • LIKE is used for string matching. • `_’ stands for any one character and `%’ stands for 0 or more arbitrary characters.

  30. COUNT (*) COUNT ( [DISTINCT] A) SUM ( [DISTINCT] A) AVG ( [DISTINCT] A) MAX (A) MIN (A) Aggregate Operators • Significant extension of relational algebra. single column SELECT COUNT (*) FROM Sailors S SELECT S.sname FROM Sailors S WHERE S.rating= (SELECT MAX(S2.rating) FROM Sailors S2) SELECT AVG (S.age) FROM Sailors S WHERE S.rating=10 SELECT COUNT (DISTINCT S.rating) FROM Sailors S WHERE S.sname=‘Bob’ SELECT AVG ( DISTINCT S.age) FROM Sailors S WHERE S.rating=10

  31. Database APIs: ODBC/JDBC • special procedures/objects for procedural languages to access DBs • pass SQL strings from language, presents result sets in a language-friendly way • Microsoft’s ODBC becoming C/C++ standard on Windows • Sun’s JDBC a Java equivalent • Supposedly DBMS-neutral • a “driver” traps the calls and translates them into DBMS-specific code • database can be across a network

  32. SQL API in Java (JDBC) Connection con = // connect DriverManager.getConnection(url, ”login", ”pass"); Statement stmt = con.createStatement(); // set up stmt String query = "SELECT COF_NAME, PRICE FROM COFFEES"; ResultSet rs = stmt.executeQuery(query); try { // handle exceptions // loop through result tuples while (rs.next()) { String s = rs.getString("COF_NAME"); Float n = rs.getFloat("PRICE"); System.out.println(s + " " + n); } } catch(SQLException ex) { System.out.println(ex.getMessage () + ex.getSQLState () + ex.getErrorCode ()); }

  33. Query Optimization and Execution Relational Operators Files and Access Methods Buffer Management Disk Space Management DB These layers must consider concurrency control and recovery Structure of a DBMS • A typical DBMS has a layered architecture. • The figure does not show the concurrency control and recovery components. • This is one of several possible architectures; each system has its own variations.

  34. Query Optimization & Processing • Optimizer maps SQL to algebra tree with specific algorithms • access methods and join algorithms • relational operators implemented as iterators • open() • next(possible with condition) • close • processing engine is a pull-based, single-threaded data flow • parallelizes naturally

  35. Index Entries (Direct search) Data Entries ("Sequence set") B+ Tree: The Most Widely Used Index • Insert/delete at log F N cost; keep tree height-balanced. (F = fanout, N = # leaf pages) • Minimum 50% occupancy (except for root). Each node contains d <= m <= 2d entries. The parameter d is called the order of the tree. • Supports equality and range-searches efficiently.

  36. Example B+ Tree • Search begins at root, and key comparisons direct it to a leaf. • Search for 5*, 15*, all data entries >= 24* ... Root 30 24 13 17 39* 22* 24* 27* 38* 3* 5* 19* 20* 29* 33* 34* 2* 7* 14* 16* • Based on the search for 15*, we know it is not in the tree!

  37. MM objects in a Standard RDBMS? create table frames (frameno integer, image BLOB, category integer) • Binary Large Objects (BLOBs) can be stored and fetched • User-level code must provide all logic for BLOBs • Scenario: client (Machine A) requests “thumbnail” images for all frames in DBMS (Machine B) • Inefficient, too hard to express queries.

  38. Solution 2: Object-Relational DBMS • Idea: add OO features to the type system of SQL. I.e. “plain old SQL”, but... • columns can be of new types (ADTs) • user-defined methods on ADTs • columns can be of complex types • reference types and “deref” • inheritance and collection inheritance • old SQL schemas still work! (backwards compatibility) • Relational vendors all moving this way (SQL99).

  39. ADTs: User-Defined Atomic Types • Built-in SQL types (int, float, text, etc.) limited • have simple methods as well (math, LIKE, etc.) • ORDBMS: can define new types (& methods) create type jpeg (internallength = variable, input = jpeg_in, output = jpeg_out); • Not naturally composed of built-in types • new atomic types • Need input & output methods for types • convert from text to internal type and back • we’ll see how to do method definition soon...

  40. An Example Queries in SQL-3 • Clog cereal wants to license an image of Herbert the Worm in front of a sunrise: • The thumbnail method produces a small image • The Sunrise_rank method returns the relative similarity of a pic to a sunrise • The Herbert_rank method returns the relative similarity of a pic to Herbert select F.frameno, thumbnail(F.image), C.lease_price from frames F, categories C where F.category = C.cid order by Sunrise_rank(F.image) + Herbert_rank(F.image);

  41. User-Defined Methods • New ADTs will need methods to manipulate them • e.g. for jpeg: thumbnail, crop, rotate, smooth, etc. • expert user writes these methods in a language like Java or C • register methods with ORDBMS: create function thumbnail(jpeg) returns jpeg as external name ‘/a/b/c/Dinkey.o’ • ORDBMS dynamically links functions into server.

  42. Modifications to support this? • Indexing • Need extensible indexing architecture • Indexing on arbitrary comparison methods (not just < > =) • Must support similarity search • Chunked array support • Optimization • WHERE clause exprs can be expensive • Need to carefully order selections • May want to do join before selection • Teach optimizer about extensible indexes • Query execution • Make sure you trust downloaded code • Cache method invocations

  43. Beyond B-trees • Question: Can B+-trees handle more complicated searches? • A typical example: “gpa > 3.7 and age < 18” • Same thing: “all restaurants in downtown Berkeley” • Even fancier: “all pictures resembling </tmp/sunset.gif>” • (Easy: “all pictures identical to </tmp/sunset.gif>“) • B+-trees exploit data order to do range search • 1-d range search is not always what you want

  44. Search Trees in General • Lots of trees invented for multidimensional data • e.g. R-trees • New tree indexes for other kinds of data • DNA sequences, etc. • Basically “unordered” B+-trees, fancy keys • In some ways, B+-tree is just a “special case” of these trees.

  45. key1 key2 ... Index Nodes (directory) Data Nodes (linked list) Search Trees from 30,000 feet

  46. How to search a disorderly B+-tree? Equality search is identical to traditional! Range search = traversing multiple paths follow all pointers where key range overlaps query range. [32 - 61) [17 - 32) [61 - inf) [-inf - 17) A “Disorderly” B+-tree Traditional 17 32 61 Disorderly

  47. [47 - 80) [17 - 32) [74 - 103) [1 - 17) Another Disorderly B+-Tree • Insert 41??? • keys need not cover all possible values • Search for 77?? • keys may “overlap”

  48. Generalized Search Tree (GiST) • A disorderly B+-tree, with user-defined keys • tree doesn’t interpret keys. • “user” implements keys as OO objects, with methods that guide search, insert, delete, split, etc. • Structure: balanced tree of (p, ptr) pairs • p is a key “predicate” (assertion) • p holds for all data records below ptr • have to have n keys for n pointers (unlike B+-tree)

  49. User-provided Key Methods • Search: • Consistent(E,q): E.p Ù q? (no/maybe) • e.g. E.p = “gpa > 3.7 and age < 18” q = “2.0 < gpa < 3.0 and 15 < age < 18” • Generating new keys after splits: • Union(P): new key that holds for all tuples in P • Clustering: • Penalty(E1,E2): penalty of inserting E2in subtree E1 • PickSplit(P): split P into two groups of entries

  50. Search • General technique: • DFS tree where Consistent is TRUE • More sophisticated • Priority queue to control the next Consistent node to visit • Example: similarity (near-neighbor) search for ranked queries • We’ll return to this idea

More Related