What’s a database? • In essence a database is nothing more than a collection of information that exists over a long period of time. • Databases are empowered by a body of knowledge and technology embodied in specialized software called a database management system, or DBMS. • A DBMS is a powerful tool for creating and managing large amounts of data efficiently and allowing it to persist over long periods of time, safely. • Among the most complex types of software available.
The database [management] system • Allows users to create new databases and specify their schema(logical structure of the data), using a data-definition language. • Gives user the ability to query the data and modify the data, using a query language and data-manipulation language. • Supports intelligent storage of very large amounts of data. • Protects the data from accident or not proper use. • Example: We can require from the DBMS to not allow the insertion of two different employees with the same SIN. • Allows efficient access to the data for queries and modifications. • Example: The use of indexes over a specified field, e.g. on the name field for employees, allows fast response for queries asking a specific name. • Controls access to data from many users at once (concurrency), without allowing “bad” interactions that can corrupt the data accidentally. • Recovers from failures and crashes.
Relational Model • Based on tables: • Today used in most DBMS’s • Oracle, SQL-Server, IBM DB2, Sybase, Microsoft Access etc… • Challengers: Object oriented DB’s (ObjectStore, Objectivity…) • Object-relational:: oo-extension of rel’s
Database Studies • Design of databases. • What kinds of information go into the database? • How is the information structured? • How do data items connect? • Database programming. • How does one express queries on the database? • How does one use other capabilities of a DBMS, such as transactions or constraints, in an application? • How is database programming combined with conventional programming? • Database system implementation. • How does one build a DBMS, including such matters as query processing, transaction processing and organizing storage for efficient access?
Fictitious Megatron 2006 DBMS • Stores relations as Unix files • Students(name, sid, dept) is stored in the file /home/megatron/students as Smith#123#CS Jones#533#EE • Schemas are stored in /home/megatron/schemas e.g. Students#name#STR#id#INT#dept#STR Depts#name#STR#office#str
Megatron sample session mayne$ megatron WELCOME TO MEGATRON 2006 megaSQL% SELECT * FROM Students; Name id dept ---------------------------------- Smith 123 CS Johnson 522 EE megaSQL%
Megatron sample session II megaSQL% SELECT * FROM Students WHERE id >= 500 | HighId.txt; megaSQL% more HighId.txt Jones#522#EE megaSQL% quit THANK YOU FOR USING MEGATRON 2006 mayne$
Megatron Implementation • To execute SELECT * FROM R WHERE <COND> • Read file schema to get attributes of R • Check that the <COND> is semantically valid for R • Read file R, • for each line • check condition • if OK, display • If we pipe the result into a file, say T, then add an entry for T in the file /home/megatron/schemas
Megatron Implementation II • To execute SELECT office FROM Students, Dept WHERE Students.name = 'Smith' AND Students.dept = Depts.name; • Read file schema to get attributes and do semantic check. • If Ok, then, for each tuple s in Students for each tuple d in Depts if s and d satisfy the WHERE condition, display the office value from s
What’s wrong with Megatron? • Tuple layout on disk: no flexibility for DB modifications. • Change CS to ECON and the entire file has to be rewritten. • Search Expensive: no indexes; always read entire relation. • Bruteforce query processing. • Did we need to look at all pairs of studentdept tuples? • No buffer manager: everything comes off of disk all the time. • No concurrency control: several users can modify a file at the same time with unpredictable results. • No reliability: can lose data in a crash or leave operations half done. • Little security: file system protection too coarse.
Architecture of a DBMS • The “cylindrical” component contains not only data, but also metadata, i.e.info about the structure of data. • If the DBMS is relational the metadata includes: • names of relations, • names of attributes of those relations, and • data types for those attributes (e.g., integer or character string). • Often a database maintains indexes for the data. • Indexes are part of the stored data. • A description of which attributes have indexes is part of the metadata.
Storage and BufferManager • The job of the storage manager is • to obtain requested information from the data storage, and • to modify the information to the data storage when requested. • The buffer manager handles main memory. It obtains and returns blocks of data from/to the file manager and stores the blocks temporarily in main memory pages. • E.g. 1 block = 1 page = 4,000 to 16,000 bytes. • One block is the smallest unit of data that is read/written from/to disk.
Query Processor • The query processor handles: queries+modifications to the data. • Its job is to find the best way to carry out a requested operation and, • to issue commands to the storage manager that will carry them out. • E.g. A bank has a DB with two relat.: • Customers (name, ssn, address), • Accounts (accountNo, balance, ssn) • Query: “Find the balances of all accounts of which Sally is the owner.” • SELECT Accounts.balance • FROM Customers, Accounts • WHERE Customers.ssn = Accounts.ssn AND Customers.name = “Sally”
Query Processor (Cont.) • What this query logically says is: • Make the Cartesian product of the tables specified in the FROM-clause, • i.e. associate each tuple of Customers with each tuple of Accounts. • We get a new temporary relation R with longer tuples, • the attributes are renamed so as to include the name of originating relation. (Customer.ssnetc.) • Chose from R only the tuples satisfying the condition in the WHERE clause. • Produce in the answer only the values of attributes in SELECT-clause. • Of course,if we would answer this query as it says the performance would be terrible (step 1). • Supp. we have an index on name of Customer and an index on ssn of Accounts. • Using the index on name of Customer we need usually three disk access. • One more access gets us the tuple for “Sally” • Similarly we need four disk accesses for finding the correspond. account. • If there are several accounts of “Sally” we need a few more accesses.
Transaction Manager • The transaction manager is responsible for the integrity of the system. It must assure that: • several queries running simultaneously do not interfere with each other and that, • the system will not lose data even if there is a power failure. • The transaction manager interacts with: • execution engine, • it may need to delay certain queries or operations in order to avoid conflicts. • storage manager • schemes for protecting the data usually involve storing a log of changes to the data.
What will be covered • Storage Systems: • Physical devices and characteristics, especially disks. • Logical layout of data; data structures, especially, indexedsequential files, Btrees, hashing. • Multidimensional indexes for GIS and OLAP • Query optimization (we concentrate a lot here): • Queryplan generation; algebraic transformations. • Join methods. • Resilience: • Logging. • Authorization and encryption. • Transaction processing: Serialization, deadlocks, locking, timestamping. • Distributed DB's. • OLAP in detail • Data Integration