Mastering Databases: Essential Data Management Insights

Databases A/Prof Alan Fekete University of Sydney

Overview • Data, databases, and database management systems • Querying a database with SQL • Creating a database with a RDBMS • Tradeoffs

Data • Facts about the world that are worth knowing • Eg NGC 450 is at 18.876,-0.861 • Eg The AAT has a 3.9 m reflector • The redshift of NGC 450 was measured with the AAT’s 2dF spectrograph • Data is very valuable • Data can be analysed to produce theories, or to test them • It takes a lot of effort to gather data

Persistence • Values of variables in a running program are stored in the memory (RAM) • They are lost when the program ends, or the machine crashes • Computers have persistent storage (hard disk, flash memory etc) whose values persist • The operating system organizes the persistent storage into files • So store valuable data in files

Sharing • Once someone has collected data and got it into a computer, they can use it for many purposes, and many others can use it • The naïve approach: copy the file around • Or email, or set up an ftp site, etc • This is very bad, if the data might change • More data (more galaxies, more facts about each galaxy!) • Corrections to existing data • Copies can become inconsistent; so share one store instead

A Database • A store of data relevant to some domain, stored in a computer, and shared (between application programs or between users) • Typically, the data is structured • Many facts in each of a few types • Eg many galaxies, each has name, ra, dec, redshift etc

Clients and servers • The database server is the machine where the database is stored • Each user must run some presentation client code on their machine, to display the data • Usually this is just a web browser • There may be other machines in-between • Eg web server, application server • Note that a machine may be a server which receives some requests, and also a client of others as it tries to respond! • We focus on the database server, and the “database client” where the application code is running that understands the data

Purpose-written application code, accessing data stored in files, using knowledge of the structure of the files Use a database management system (DBMS) which manages the data and provides higher-level facilities to access it You may write applications that call the DBMS, or even use it directly, interactively Two approaches This is common in science This is standard in industry, and is becoming more frequent in science

DBMS Features • A query language • Describe the data to retrieve • Can also describe updates etc • A data definition language • Describe the logical structure of the data • Storage management • Control the layout in files, and with facilities to improve performance • Access control

Different Data Models • Each database has a model or schema which tells what sort of data is stored • Eg each galaxy has id, name, ra, dec, etc • Using a DBMS means that the data model must fit the style supported by the DBMS • The most common DBMSs all support the relational style of data model

Relational Data Models • Database consists of tables, each with a name • Each table has many rows, each with exactly the same structure • several attributes, each with a name and type (integer, float, string of length 20, etc) • Each row has a value for each attribute • Rows are distinguished by the values of a primary key (one or more attributes)

Schema: galaxy(gal_id INTEGER, gal_ra FLOAT, gal_decl FLOAT, gtype STRING) observation(obs_id INTEGER, obs_ra FLOAT, obs_decl FLOAT, obs_flux FLOAT, gal_id INTEGER, source STRING) An Instance Example Note use of “foreign key” to connect information in different tables

Views • You can define a view on a database • Just like a table, but the data isn’t inserted or stored explicitly • It is computed when needed from the other tables • A view can be used in queries just like a table

Queries • User can write a query, and run it • Answer is a collection of relevant data • Query may be typed interactively, or written into a program for repeated use • Perhaps with parameters that vary • Answer may be viewed on the screen, imported into a program, printed, or stored back in the database

SQL • A standardized language in which queries can be written against databases in relational DBMSs • But each platform has slight variations • SQL is declarative • Query describes what data is needed • RDBMS has the job of working out how to trawl through the database to compute the answer

SELECT gal_id, gtype FROM galaxy WHERE gal_ra between 13.0 and 16.0 AND gal_decl between 4.0 and 6.0; Select-from-where Answer is:

Remember • FROM tells which table to use • WHERE tells which rows you are interested in • A boolean condition that is evaluated for each row • SELECT says which attributes (columns) you want in the answer Note query is stated SELECT-FROM-WHERE

Limiting the answers • Real scientific databases have tables with many many rows • A query which returns too many is expensive to run • And it’s hard to understand the answer • Especially when exploring the data, get a small sample first • Ask for the top few rows, perhaps with respect to some order on the selected rows • Syntax details are not standard • MySQL has “LIMIT n” clause at end of query

A query can combine data from several tables use the foreign keys to pick values from the related row SELECT observation.obs_id, galaxy.gal_ra, galaxy.gal_decl, galaxy.gtype FROM galaxy JOIN observation ON observation.gal_id = galaxy.gal_id WHERE observation.source=‘A’; Joins From observation From galaxy

Unstructured joins • Sometimes you want to take all combinations of rows from tables, not just the rows related through foreign key • Then filter by a condition SELECT galaxy.gal_id, observation.obs_id FROM galaxy, observation WHERE ABS(galaxy.gal_ra - observation.obs_ra)<0.2 AND ABS(galaxy.gal_decl - observation.obs_decl)<0.3;

Combine the values from a column, among all the rows that satisfy WHERE clause MAX, MIN, SUM, AVG, COUNT COUNT(DISTINCT colname) is usefully different if values are duplicated in the column SELECT MAX(obs_flux) FROM observation WHERE gal_id = 3; Find the largest flux in any observation of galaxy 3 Aggregation

SELECT source, COUNT(DISTINCT obs_id) FROM observation WHERE obs_flux > 10.0 GROUP BY source; For each source, report how many observations that have high flux were made by that source. Grouping • Divide the rows into groups, based on the values in some column(s) • Aggregate the values in some other column(s), for each group • Display the grouping column along with the aggregate

SQL is block-structured A subquery can be used in the FROM clause, instead of an existing table A subquery can be used to produce a value used in the WHERE clause eg in a comparison SELECT obs_id FROM observation WHERE obs_flux = (SELECT MAX(obs_flux) FROM observation); Find the observation with the highest flux Subqueries

Creating a database • Decide on the platform to use • Decide on the schema that will represent your data • Create a database and tables • Load the data • Allow users to access it

Relational DBMS platforms • Enterprise solutions (Oracle, IBM DB2, SQL Server) are very expensive • And require professional administration too • Vendors also offer light-weight cheap/free variants • Free open-source platforms: PostreSQL, MySQL, Cloudscape

Table Design • One table for each type of object in the domain • Columns for each attribute of importance • Make sure there is a primary key • Invent a new identifier for this purpose if necessary • Use foreign keys to represent relationships between objects • Or have a separate “association table” where one object can be related to many, in each direction

Normalisation • It is important that the schema doesn’t allow the same facts to be stored repeatedly in different rows, or different tables • Eg observation(obs_id, obs_ra,…,gal_id, gal_ra, …, gtype, …) would have same gal_ra stored in several rows of observations of this galaxy • This would risk inconsistency when updates occur • There is theory about this, but sensibly defining one table per object will avoid the problems

Declaring tables • Syntax isn’t quite standard • Details of types etc • MySQL variant is CREATE TABLE galaxy ( gal_id INT UNSIGNED, gal_ra FLOAT, gal_decl FLOAT, gtype VARCHAR(3), PRIMARY KEY (gal_id) ) ;

Indices • The performance of queries can vary greatly depending on how the data is stored in the DBMS • Having an index on a column usually speeds up queries that select rows with given values (or ranges of values) in that column • SQL has CREATE INDEX statement

Permissions • Each DBMS can control which users are allowed to run different sorts of statements on each table separately • Typically, for scientific data one allows SELECT access to all users (or all registered users) • INSERT access to a few trusted users • DELETE, UPDATE only for the administrators • Usually have a separate account, so even the admin people can’t do this accidentally while acting as scientists

Transactions • Nasty problems can occur if some users are modifying data while others are looking at it • Or worse, if several modify concurrently • This is serious for enterprise applications, but not common in scientific situations, as updates are rare • A transaction is a whole collection of SQL statements that should be done as a single block • Not interrupted by others, and completely undone unless the whole collection succeeds

Loading data • SQL has INSERT statement which adds rows to a table INSERT INTO galaxy (gal_id, gal_ra, gal_decl, gtype) VALUES (1, 13.1, 7.2, 'S'), (2, 14.7, 5.3, 'C'), (3, 11.2, 4.8, 'S'), (4, 15.3, 4.6, 'S');

Bulk load • If data is already in a file in a well-known format (eg csv), then DBMS has non-standard commands to import it directly into a table • MySQL LOAD DATA LOCAL INFILE “gal.csv” INTO TABLE galaxy

Trade-offs Compared to writing applications against data stored directly in files, using DBMS to manage the data: • Allows users to perform unpredicted (ad-hoc) queries • without being programmers, and without knowing the structure in files • Gives better support as data schema evolves • Existing queries can continue to run • Has more predictable performance • Easier to avoid very slow execution • But you may not be able to get very fast execution • Has better security • But considerable overhead in administration

Mastering Databases: Essential Data Management Insights

Mastering Databases: Essential Data Management Insights

Presentation Transcript

Databases

Databases

Databases

DATABASES

Databases

Databases

Databases

Databases

Databases

Databases

Databases

Databases

Databases

Databases

Databases

Electronic Databases (Research Databases)

Databases

DATABASES

Databases

Databases

Databases

Databases