1 / 35

CS 405G: Introduction to Database Systems

CS 405G: Introduction to Database Systems. 21 Storage Chen Qian University of Kentucky. Homeworks. 5 -> 4 To give you more time for the project. Review. Disk management Basic disk (physical) unit: page Basic DB unit: record (row of a table) How to connect page with record? File!.

myrrh
Download Presentation

CS 405G: Introduction to Database Systems

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS 405G: Introduction to Database Systems 21 Storage Chen Qian University of Kentucky

  2. Homeworks • 5 -> 4 • To give you more time for the project Chen Qian @ University of Kentucky

  3. Review • Disk management • Basic disk (physical) unit: page • Basic DB unit: record (row of a table) • How to connect page with record? • File! Chen Qian @ University of Kentucky

  4. A row in a table, when located on disks, is called • A record • Two types of record: • Fixed-length • Variable-length Chen Qian @ University of Kentucky

  5. In an abstract sense, a file is • A set of “records” on a disk • In reality, a file is • A set of disk pages • Each record lives on • A page • Physical Record ID (RID) • A tuple of <page#, slot#> Chen Qian @ University of Kentucky

  6. Files • Higher levels of DBMS operate on records, and files of records. • FILE: A collection of pages, containing a collection of records. • Record = row in a table • Must support: • insert/delete/modify record • fetch a particular record (specified using record id) • scan all records (possibly with some conditions on the records to be retrieved) Chen Qian @ University of Kentucky

  7. Unordered (Heap) Files • Simplest file structure contains records in no particular order. • As file grows and shrinks, disk pages are allocated and de-allocated. • To support record level operations, we must: • keep track of the pages in a file • keep track of free spaceon pages • keep track of the records on a page • There are many alternatives for keeping track of this. • We’ll consider 2 Chen Qian @ University of Kentucky

  8. Data Page Data Page Data Page Full Pages Header Page Data Page Data Page Data Page Pages with Free Space Heap File Implemented as a List • The header page id and Heap file name must be stored someplace. • Database “catalog” • Each page contains 2 `pointers’ plus data. Chen Qian @ University of Kentucky

  9. Data Page 1 Header Page Data Page 2 Data Page N DIRECTORY Heap File Using a Page Directory • DBMS must remember where the first directory page is located. • The entry for a page can include the number of free bytes on the page. • The directory itself is a collection of pages and shown as a linked list. • Much smaller than linked list of all HF pages! Chen Qian @ University of Kentucky

  10. Trade-off • Compared to linked list, the directory based approaches have: • Pros: Fast access to a particular record • Cons: Extra space Chen Qian @ University of Kentucky

  11. Record layout Record = row in a table • Variable-format records • Rare in DBMS—table schema dictates the format • Relevant for semi-structured data such as XML • Focus on fixed-format records • With fixed-length fields only, or • With possible variable-length fields Chen Qian @ University of Kentucky

  12. Record Formats: Fixed Length F3 F4 F1 F2 • All field lengths and offsets are constant • Computed from schema, stored in the system catalog • Finding i’th field done via arithmetic. L3 L4 L1 L2 Address = B+L1+L2 Base address (B) Chen Qian @ University of Kentucky

  13. 0 4 24 28 36 142 Bart(padded with space) 10 2.3 Fixed-length fields • Example: CREATE TABLE Student(SID INT, name CHAR(20), age INT, GPA FLOAT); • Watch out for alignment • May need to pad; reorder columns if that helps • What about NULL? • Add a bitmap at the beginning of the record Chen Qian @ University of Kentucky

  14. Record Formats: Variable Length • Two alternative formats (# fields is fixed): F1 F2 F3 F4 $ $ $ $ Fields Delimited by Special Symbols F1 F2 F3 F4 Array of Field Offsets • Second offers direct access to i’th field, efficient storage • of nulls; extra but small directory overhead. Chen Qian @ University of Kentucky

  15. Trade-off • Compared to the 1st formats, the offset based format has: • Pros: Fast access to a particular field; efficient storage of nulls • Cons: Extra space Chen Qian @ University of Kentucky

  16. Large Object (LOB) fields • Example: CREATE TABLE Student(SID INT, name CHAR(20), age INT, GPA FLOAT, picture BLOB(32000)); • Student records get “de-clustered” • Bad because most queries do not involve picture • Decomposition (automatically done by DBMS and transparent to the user) • Student(SID, name, age, GPA) • StudentPicture(SID, picture) Chen Qian @ University of Kentucky

  17. Page layout How do you organize records in a page? • Fixed length records • Variable length records • NSM (N-ary Storage Model) is used in most commercial DBMS Chen Qian @ University of Kentucky

  18. Page Formats: Fixed Length Records • Record id = <page id, slot #>. In first alternative, moving records for free space management changes rid; may not be acceptable. Slot 1 Slot 1 Slot 2 Slot 2 Free Space . . . . . . Slot N Slot N Slot M . . . N 1 1 1 M 0 M ... 3 2 1 number of slots number of records PACKED UNPACKED, BITMAP Chen Qian @ University of Kentucky

  19. Trade-off • Compared to the packed formats, the unpacked format has: • Pros: No need to move records into vacated slots after delete. No need to change the slot number. • Cons: To insert a record, one needs to find an empty slot, by scanning the bitmap. Chen Qian @ University of Kentucky

  20. 142 Bart 10 2.3 123 Milhouse 10 3.1 857 Lisa 8 4.3 456 Ralph 8 2.3 Variable-Length Records • Store records from the beginning of each page • Use a directory at the end of each page • To locate records and manage free space • Necessary for variable-length records Why store data and directoryat two different ends? Both can grow easily Chen Qian @ University of Kentucky

  21. Options • Reorganize after every update/delete to avoid fragmentation (gaps between records) • Need to rewrite half of the block on average • What if records are fixed-length? • Reorganize after delete • Only need to move one record • Need a pointer to the beginning of free space • Do not reorganize after update • Need a bitmap indicating which slots are in use Chen Qian @ University of Kentucky

  22. System Catalogs • For each relation: • name, file location, file structure (e.g., Heap file) • attribute name and type, for each attribute • index name, for each index • integrity constraints • For each index: • structure (e.g., B+ tree) and search key fields • For each view: • view name and definition • Plus statistics, authorization, buffer pool size, etc. Catalogs are themselves stored as relations! Chen Qian @ University of Kentucky

  23. Indexes • A Heap file allows us to retrieve records: • by specifying the rid, or • by scanning all records sequentially • Sometimes, we want to retrieve records by specifying the values in one or more fields, e.g., • Find all students in the “CS” department • Find all students with a gpa > 3 • Indexes are file structures that enable us to answer such value-based queries efficiently. Chen Qian @ University of Kentucky

  24. database indexing Search Basics • Given a value, locate the record(s) with this value SELECT * FROM R WHERE A = value; SELECT * FROM R, S WHERE R.A = S.B; • Other search criteria, e.g. • Range search SELECT * FROM R WHERE A > value; • Keyword search Chen Qian @ University of Kentucky

  25. Dense and sparse indexes • Dense: one index entry for each search key value • Sparse: one index entry for each block • Records must be clustered according to the search key Chen Qian @ University of Kentucky

  26. Dense versus sparse indexes • Index size • Sparse index is smaller • Requirement on records • Records must be clustered for sparse index • Lookup • Sparse index is smaller and may fit in memory • Dense index can directly tell if a record exists • Update • Easier for sparse index Chen Qian @ University of Kentucky

  27. Primary and secondary indexes • Primary index • Created for the primary key of a table • Records are usually clustered according to the primary key • Can be sparse • Secondary index • Usually dense • SQL • PRIMARY KEY declaration automatically creates a primary index, UNIQUE key automatically creates a secondary index • Additional secondary index can be created on non-key attribute(s)CREATE INDEX StudentGPAIndex ON Student(GPA); Chen Qian @ University of Kentucky

  28. Tree-Structured Indexes: Introduction • Tree-structured indexing techniques support both range selections and equality selections. • ISAM =IndexedSequentialAccessMethod • static structure; early index technology. • B+ tree: dynamic, adjusts gracefully under inserts and deletes. Chen Qian @ University of Kentucky

  29. index entry P K P P K P K m 2 0 1 m 1 2 Index File kN k2 k1 Motivation for Index • ``Find all students with gpa > 3.0’’ • If data file is sorted, do binary search • Cost of binary search in a database can be quite high, Why? • Simple idea: Create an `index’ file. Data File Page N Page 3 Page 1 Page 2 Can do binary search on (smaller) index file! Chen Qian @ University of Kentucky

  30. 100, 200, …, 901 … Index blocks 100, 123, …, 192 200, … 901, …, 996 100, 108,119, 121 123, 129,… 192, 197,… 200, 202,… 901, 907,… 996, 997,… … … … Data blocks ISAM • What if an index is still too big? • Put a another (sparse) index on top of that! • ISAM (Index Sequential Access Method), more or less Example: look up 197 Chen Qian @ University of Kentucky

  31. How many steps? • For a balanced tree of N pages, the num of search steps is O(logN) • Less than a constant times logN Chen Qian @ University of Kentucky

  32. 100, 200, …, 901 107 Overflow block … Index blocks 100, 123, …, 192 200, … 901, …, 996 100, 108,119, 121 123, 129,… 192, 197,… 200, 202,… 901, 907,… 996, 997,… … … … Data blocks Updates with ISAM • Overflow chains and empty data blocks degrade performance • Worst case: most records go into one long chain Example: insert 107 Example: delete 129 Chen Qian @ University of Kentucky

  33. How many steps? • For a chain of N pages, the worst-case num of search steps is N • Much larger than O(logN)! Chen Qian @ University of Kentucky

  34. A Note of Caution • ISAM is an old-fashioned idea • B+-trees are usually better, as we’ll see • But, ISAM is a good place to start to understand the idea of indexing • Upshot • Don’t brag about being an ISAM expert on your resume • Do understand how they work, and tradeoffs with B+-trees Chen Qian @ University of Kentucky

  35. Summary (Contd.) • DBMS vs. OS File Support • DBMS needs features not found in many OSes, e.g., forcing a page to disk, controlling the order of page writes to disk, files spanning disks, ability to control pre-fetching and page replacement policy based on predictable access patterns, etc. • Variable length record format with field offset directory offers support for direct access to ith field and null values. Chen Qian @ University of Kentucky

More Related