
Databases vs. Programmers



1. Databases vs. Programmers
Dmitri Gavrilov, Ph.D.
Principal Engineering Manager, Exchange, Microsoft

2. Why do I need to know this?
• Would you ever need to code a B-tree?
• If you do, you are wasting your employer’s money
  • Unless you work for Oracle, IBM (DB2), SAP (Sybase), or Microsoft (SQL Server)
• Why is it important, then, to know how B-trees work?
• I’ll try to show you.

3. You are Hired!
• You’ve just been hired as a database developer by the UofC IT department (ACS? UCS?)
• Your first project is fixing the database of students.
• What’s wrong with the current implementation?
  • It’s too slow!

4. What have you got?
• Self-service website for students to enter their personal data:
  • Name, email, phone number
  • ID, year, department
• Admin website:
  • Create/edit/delete students
  • Bulk upload data
  • Generate reports
• The backend DB is MS Access 2007

5. What happened?
• Over the summer, a local IT shop named QuickySoft was hired to develop the site.
• The project went well, stayed almost on schedule, tests went fine, everybody was happy.
• On Sept 6, the site melted down:
  • It took over a minute for a student to log on to the site
  • When students attempted to update their data, the page timed out
  • The admin could not upload new data (timeouts)
• The project manager got fired, and you got hired.

6. What do you do?
• Well, that’s clear. The database was crappy.
• You replace MS Access with a real database
  • Say, MS SQL Server.
• Your first task? Database schema design!

7. DB schema design
• That’s what DBAs get paid huge money for
• Table: normalized dataset
  • ID, name, email, phone, department, year
• Primary key: ordering of the dataset
  • ID seems to be a good primary key
  • We’ll discuss how to select a primary key later
• Index: optimizes searching, sorting
  • Name, email, more?
• Tables and indexes are B-trees (see the sketch below)
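
As a rough T-SQL sketch of the schema above (the table, column, and constraint names here are illustrative, not from the talk; in SQL Server the primary key is backed by the clustered index, i.e. the B-tree that stores the rows themselves):

    CREATE TABLE Students (
        ID         INT           NOT NULL,
        Name       NVARCHAR(100) NOT NULL,
        Email      NVARCHAR(100) NOT NULL,
        Phone      NVARCHAR(20)  NULL,
        Department NVARCHAR(50)  NOT NULL,
        [Year]     SMALLINT      NOT NULL,
        -- The clustered primary key: rows are physically stored in ID order.
        CONSTRAINT PK_Students PRIMARY KEY CLUSTERED (ID)
    );

    -- Secondary indexes (separate B-trees) to support lookups and sorting.
    CREATE INDEX IX_Students_Name  ON Students (Name);
    CREATE INDEX IX_Students_Email ON Students (Email);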

8. What makes a real database slow?
• Lack of a required index
  • QuickySoft gurus forgot to add an index on name.
  • Every lookup by name ended up scanning the whole table of students.
  • That was too easy… But the DB is still slow.
• IOPS is king
  • Disk is the slowest computer component
  • Query speed is proportional to the number of IO operations it involves
  • Caching DB pages in memory helps, but not for large DBs
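
To make the missing-index point concrete, a hedged illustration using the Students table sketched earlier: without an index on Name, the lookup below must read every row; with one, the engine descends a single small B-tree.

    -- Without IX_Students_Name, this is a full table scan: every page is read.
    SELECT ID, Email
    FROM Students
    WHERE Name = N'Marina';

    -- With IX_Students_Name in place, the same query becomes a B-tree seek:
    -- a handful of page reads instead of one read per student.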

9. Improving IOPS
• Avoid going to disk, use the DB cache. Buy more RAM.
  • Too easy. Try harder.
• Reduce the number of times we have to move the HD head
• Reduce the distance by which the HD head is moved
• Improve locality of reference!

10. Locality of Reference
• QuickySoft guys wanted to make the DB more compact (for better caching!)
• There’s a lot of repetition among email suffixes, so they decided to normalize them out.
• Email suffixes are stored in a separate table: gmail.com, ucalgary.ca, yahoo.com, hotmail.com, etc.
• For each user, store just the username part (marina), plus the ID of the suffix (25 → cpsc.ucalgary.ca)
• Now, each READ operation needs to get data from two tables (students + suffixes).
  • Two disk reads instead of one.
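
A sketch of that normalization and the cost it hides, assuming Students has been reshaped with hypothetical EmailUser and EmailSuffixID columns (all names illustrative):

    CREATE TABLE EmailSuffixes (
        SuffixID INT           NOT NULL PRIMARY KEY,
        Suffix   NVARCHAR(100) NOT NULL   -- e.g. 'cpsc.ucalgary.ca'
    );

    -- Students now holds only the username part plus a suffix ID, so every
    -- read that needs the full address must touch two B-trees, not one:
    SELECT s.Name,
           s.EmailUser + N'@' + x.Suffix AS Email
    FROM Students AS s
    JOIN EmailSuffixes AS x ON x.SuffixID = s.EmailSuffixID;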

11. Locality of Reference
• The site admin complained that generating reports is super slow. Why?
• Her reports are sorted by name.
• We have an index on name, so we just scan the index and output the records. All’s well?
  • Data in the table is ordered by ID (our primary key)
  • When generating the report, we incur N disk seeks (N is the number of students)
• Let’s select name as the primary key! (see the sketch below)
• Reports are now lightning fast, since we are just reading the file from start to end!
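
In SQL Server terms, that change means re-clustering the table on Name. Since names are not necessarily unique, a non-unique clustered index rather than a literal primary key is the realistic version; a hedged sketch, reusing the illustrative names from earlier:

    -- Drop the clustered PK on ID, cluster on Name instead, and keep ID
    -- unique via a nonclustered constraint. Rows are now stored in Name
    -- order, so a report sorted by name is one sequential scan.
    ALTER TABLE Students DROP CONSTRAINT PK_Students;
    CREATE CLUSTERED INDEX CIX_Students_Name ON Students (Name);
    ALTER TABLE Students ADD CONSTRAINT PK_Students PRIMARY KEY NONCLUSTERED (ID);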

12. Oops?
• You get a call at 11pm.
• The admin is unable to upload a new batch of students. It’s just too slow; the page is timing out.
• Why? The input data contains random names. You lost locality of reference.
  • You have to insert data into random places in the table, causing repeated page splits and B-tree rebalancing. Which means many random disk seeks.
• In the past, you were always appending data to the end of the table, as you assigned auto-incremented IDs to new students.

13. Trade-off Time
• There’s no perfect software.
• You will always have to trade off:
  • Space vs. speed
  • Users vs. admins
  • Flexibility vs. simplicity (power user vs. your mom)
  • Features vs. deadlines
  • Security vs. usability
  • Quality vs. deadlines
• Should we optimize reporting or upload?
  • Reporting usually loses
• Let’s reinstate ID as our primary key (reverted below).
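
Reverting is the mirror image of the earlier switch; again a hedged sketch with the same illustrative names:

    -- Back to ID as the clustered key: new students, with ever-increasing
    -- IDs, append to the last page again. Reports pay N seeks; uploads are fast.
    DROP INDEX CIX_Students_Name ON Students;
    ALTER TABLE Students DROP CONSTRAINT PK_Students;   -- the nonclustered PK
    ALTER TABLE Students ADD CONSTRAINT PK_Students PRIMARY KEY CLUSTERED (ID);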

14. It’s Still Broken
• You get a call at 1am two weeks later
  • We are running out of disk space!
• But weren’t your projections right?
  • You measured the average size of a student record: ~200 KB (including photo)
  • 1000 students should grow the DB by 200 MB
• In reality, 1000 students grow the DB by 400 MB.
  • We’ll run out of disk space tomorrow if you don’t do something!

15. What’s going on?
• The DB is growing because we are inserting new student records
• We always append to the end of the table (in ID order)
  • The last B-tree page fills up
  • We split it in half
  • We continue adding to the last page
• All pages are half-full!
  • At ~50% page utilization, 200 MB of data occupies ~400 MB on disk: 200 MB / 0.5 = 400 MB.

16. What do you do?
• You have to change the page-split strategy to something like 80/20 (see the sketch below).
• In real DBs, that’s called “index density”
• You optimized for sequential inserts, but you hurt random inserts. Another trade-off. Ah well.
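
SQL Server’s closest knob is the index fill factor, applied when an index is built or rebuilt. A hedged sketch: this packs existing pages rather than changing the live split algorithm, but the density trade-off is the same.

    -- Rebuild the clustered index with leaf pages packed to ~80% full.
    -- Denser pages reclaim the space wasted by 50/50 splits; the cost is
    -- that future random inserts hit nearly-full pages and split sooner.
    ALTER INDEX PK_Students ON Students REBUILD WITH (FILLFACTOR = 80);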

17. Data Migration
• OK, you developed your new database and you are proud of it.
• You need to migrate the existing data from the old Access DB to your new shiny SQL DB.
• You want to minimize downtime
  • You’ll be taking the site down to migrate the data

18. How can you migrate faster?
• You have 17 indexes on your table
  • Name, phone, year, dept, …
• Should you create the indexes before populating the table, or after?
  • A good DB engine can usually sort N elements faster than doing N inserts.
  • But creating 17 indexes afterwards means 17 scans over the table.
• Does the DB engine support batch index creation (a single table scan)? Some do.
• Do you trust the input data? Do you need to verify key uniqueness? If so, you need the indexes prior to inserting data.
• Trade-offs again… (one possible load order is sketched below)
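
One hedged version of the “load first, index later” order. The file path, format options, and index names are illustrative, and BULK INSERT’s CSV mode assumes SQL Server 2017 or later:

    -- 1. Load the raw rows with only the clustered PK in place. If you do
    --    NOT trust the input, unique indexes would have to exist before
    --    this step so duplicates are rejected at insert time.
    BULK INSERT Students
    FROM 'C:\migration\students.csv'
    WITH (FORMAT = 'CSV', FIRSTROW = 2);

    -- 2. Build the secondary indexes afterwards: each build is one sort
    --    plus a sequential write, instead of N random B-tree inserts
    --    happening during the load.
    CREATE INDEX IX_Students_Name  ON Students (Name);
    CREATE INDEX IX_Students_Phone ON Students (Phone);
    -- … and so on for the remaining indexes; each one scans the table once.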

19. Many more interesting topics
• Transactions support
• Multithreading/Locking/Serialization
• Query optimization
• Crash resilience/data consistency
• DB cache strategies
• Distributed databases
• …

20. Questions?
