Using SQL Databases from APL (Dyalog & other)

Using SQL Databases from APL (Dyalog & other) J. Merrill Analytical Software Corp. jamesmerrill@usa.net

Overview • About my APL and SQL experience • This talk emphasizes information about SQL • Concepts you must know • Weaknesses you need to understand • Ideas about SQL database design • Why so much about SQL? • Could a few people learn the details about SQL, and build tools to shield other developers from the issues? • Why you should not hide SQL from APLers • The very successful project that was a failure • APL talking to SQL is not hard to do (any more)

My Background • I started working with APL professionally in 1975, as a user of STSC timesharing. • While working at STSC, I took a course in relational database technology and SQL from a representative of a small company called Relational Software Inc. Later they changed their name to Oracle – I’ve worked with SQL for longer than Oracle Corporation has existed!

SQL Work at STSC • I automated connection of an Oracle database of microcomputer software sales data to the existing (APL-based) accounting system. • I designed a portable interface between STSC’s APL*Plus and SQL on these platforms: • VMS talking to RDB • VMS talking to Oracle • Unix talking to Oracle and implemented the APL in those interfaces.

STSC, Manugistics, and Later • I was a software development consultant for many years with both STSC and (after the name change) Manugistics. • Client projects included migrating an APL-based database of foreign exchange trading data to Microsoft SQL Server. • Since founding Analytical Software Corp. I’ve worked extensively with clients using SQL Server from APL+Win.

SQL Concepts You Must Know • SQL has only four data manipulation verbs • RDBMSs use Client / Server techniques • SQL has the concept of null • All data is in a table (like an APL matrix) • SQL columns are named, not numbered • SQL columns are “strongly typed” • Relationships are data not structure

DML: select insert update delete • SQL’s select is extraordinarily versatile • SQL’s insert and update normally operate on a single row at a time and therefore are quite primitive • If insert and update work on multiple rows, not a single row, they become very powerful (as powerful as select) • SQL’s delete is usually simple but can use some powerful select techniques

SQL Has DML and DDL • DML means data manipulation language • SQL’s four DML verbs are select, insert, update, delete • For defining the database structure (schema), SQL has DDL • DDL means data definition language • DDL is less standardized between products

SQL select is Very Powerful • select supports extremely complex statements that can access data from multiple tables, potentially joining them using every available join technique in a single query • Much of select’s power is based on the ability to define derived tables within a statement, and use those tables as if they held real data – but no data is actually stored to run the query • The syntax is intended let you express intent without specifying a particular processing sequence • Powerful does not mean simple or intuitive

insert & update: SQL’s Weak Spot • When data is in memory (in an application) • insert statements allow you to add one row, giving values for each column in the new row • update statements typically reference a single row by its primary key (though it can change many columns in one update) • Applying a group of changes based on user input typically involves issuing many one-row-each insert or update statements

Multi-Row insert and update • insert supports adding the result of any select to an existing table • update similarly can update many rows of one table from the contents of another table, providing the tables can be joined (usually on the primary key) • The syntax for this is not completely standard and some products may not support this • Get the data on the server if possible!

SQL delete Does What It Says • delete can use the full power of select to determine what row or rows of a table should be deleted • When deleting rows based on a list of primary key values, implementations that limit the length of a SQL statement (or an individual clause within the statement) can require multiple delete statements to delete a large number of rows

Client / Server Concepts • Clients send requests to the server and it responds with the results • Only the DBMS server touches the data • It’s good to have the server do things like compute totals so that results are smaller • The number of back-and-forth interactions is potentially more significant than the amount of data being transferred

SQL is the Language for Server Requests and Responses • Your program sends SQL statements (only) to the server and handles the responses • Some tools hide the fact that everything is done with SQL, by generating SQL for you • SQL is very powerful for reading data (select) and deleting data (delete) • SQL is quite stupidly simple for writing data in memory into a table (insert, update) • The over-the-wire protocol is complex, but is hidden by (most) client libraries

One Program Touches the Data • A single coordinating program, the DBMS server, performs all access (reading and writing) to data files holding the SQL data • Access to data is controlled by the server • The server process can be run on a very powerful (expensive) computer • Conflicting requests from different users can be handled more easily

Let the Server Do the Work • The server should be a better computer than the one running your program • It is optimal if the server can get the job done without sending a lot of data over the wire to your program • Downside: the more you get the server to do the work (by sending SQL), the more you have to worry that things will need to change if you switch to another DBMS

Avoid “Chatty” Programs • Each interaction with the server takes time for the back-and-forth protocol, in addition to whatever actual data is passed • Rather than sending a series of steps to the server one after the other, it’s better to send one larger request • SQL’s simplistic insert & update is an issue • Downside: the more you “bunch up” your requests, the more difficult it is to handle errors

SQL’s Concept Called null • Intended to represent a “missing” or “unknown” value, rather than an “empty” value • If a value is null, it is not equal to any particular value, nor is it not-equal to any particular value. Suppose A is an arbitrary SQL expression: int_column = 5 char_column = ‘hello’ One would think that the result of A or (not A) has to be true but in SQL that’s not the case (!).

null Causes “three valued logic” • APLers think we understand booleans, but then along comes SQL’s notion of null that changes the rules we’ve learned • When nulls are involved, the result of a boolean expression is one of true, false, or null • The result of any expression is null if any value involved in the expression is null • The SQL function coalesce returns the first not-null parameter value passed: coalesce(column, value_if_null)

Use Null Values Sparingly • Allowing nulls sometimes makes sense • Definitely makes sense for date or time • Makes sense for numeric data when 0 is not an appropriate “no information” replacement • Don’t use a “missing value code” in the database, even if you use one in your APL data • Any distinction between NULL and ‘’ (an empty string) is hard to see, justify, or explain to non-programmers (or programmers!). • Sometimes it’s not your decision

All SQL Data is in Tables • There is no other persistent data structure • SQL table values are conceptual scalars • SQL can hold strings (enclosed char vectors) • Otherwise, no nested data (in standard SQL) • If APL had scalar strings, a SQL table would be just like a simple APL matrix • SQL locates rows using data values only • No notion like row number in an APL matrix

SQL Columns • Columns are named, not numbered • Each column has a (scalar) data type • Most data types can support the special SQL value null • Handling null values correctly can be difficult

Column Names • One difference between products is in what column names they allow • Some allow spaces in names, but that makes it necessary to “quote” those column names (so SQL can parse statements using them) • Standard quoting techniques vary (!) • Advice: use simple, not-very-long names • It’s not always your choice

Column Type Catgeories • Numbers • Strings • Dates and Times • Binary • Time Intervals • Boolean

Numbers • Integers of various sizes (1, 2, 4, 8) • Usually no “unsigned integer” • 4- or 8-byte floating point • Fixed width and number of decimals • Money sometimes available • 64-bit integer with 4 implied decimals

Strings • Fixed length with maximum size • Variable length with maximum size • “Unlimited size” often available • Multi-byte data (Unicode) often available

Dates and Times • One or more of date, time, datetime • Concept of NULL is very useful here • Dates are stored as Julian date values • Many support negative Julian dates • Time values are stored as fraction of a day • 0.25 is 6am, 0.5 is noon, 0.75 is 6pm • Datetime values stored as (day# + timefraction) • If only datetime is available, • time can are represented as being “on day 0” • date can be represented as being “at midnight”

Time Intervals • Supposedly standard • Not universally available (SQL Server) • Can be represented with a date or datetime value holding the result of subtracting one date or datetime value from another

Binary • Fixed length with a maximum size • Variable length with a maximum size • “Unlimited size” often available

Boolean • A single bit (!) • Not “fixed or variable length with a maximum size” like string and binary data • Sometimes boolean columns support null (so there are 3 states) • My advice: use a one-wide character column instead

Time for a Break ?

How Relationships Work in SQL • In APL, you can choose to represent related structures by nesting • In SQL, there are only tables and there is no possibility of nesting • Relationships are stored by having a column in one table store a value identifying a related row in another table (or null if there is no related row)

SQL Table Design • All columns in a table should represent information about one entity (or concept) • row = data about one instance • column = same data re many instances • The term “field” is deprecated (but widely used) • Each table needs a “primary key” column whose value can be stored in any other table that has a relationship with rows in this table • SQL-generated primary key values allow relationships to be represented easily and can prevent issues that arise when users think they want to change key values

SQL Relationships • One to many (parent - child) • Many to one (lookup) • Hierarchy • Many to many

Relationships – one to many • Each (one) invoice can have any number (many) of line items • LineItem table has an InvoiceID column holding Invoice table primary key (PK) value • “No line items for invoice N” is easily represented by there being no rows in the LineItem table with InvoiceID = N • An APL application might use nesting, storing a vector or matrix of lineitem data within each Invoice data structure

One to Many = Parent - Child • The “one” table (Invoice) is parent • The “many” table (LineItem) is child • Column holding primary key (PK) of another table is called a foreign key (FK) to (or into) that table • Foreign keys are critical to competent SQL database design

Relationships – many to one • There can be any number (many) of employees in each (one) department • Employee table has a column holding Department table PK value (read as “FK to Department”) • “No employees in department N” is again easy • Not thought of as parent / child; you don’t think of departments as “owning” employees the way invoices “own” line items • An APL application could (but wouldn’t) use nesting, storing a vector of EmployeeID values (but not Employee data) for each department

Relationships – hierarchy • Each employee has a manager; the manager is also an employee • Employee table has a ManagerID column holding FK to the same (Employee) table • Top of hierarchy is represented by null ManagerID value (no manager) • APL could use nesting, but I haven’t

Relationships – many to many • Requires a third table to hold two many-to-one relationships to the other two tables • ClassStudent table records that • each class can have any number of students enrolled in it • each student enrolls in any number of classes • You can’t use nesting for this (without double-storing the information), so APL applications usually do it the same way

More SQL Design Thoughts • Advice: use your product’s “identity” or “sequence” feature to create PK values • When in doubt, define another table • Examples of extra tables that add flexibility • Storing historical data (not just the current value, but changes over time)

Advice: Use DBMS-assigned Values as Primary Keys • Most implementations support having the DBMS assign the value of a particular column to be 1+ the previous value, during insert of a new row • Columns defined this way make excellent PKs that are efficient and easily used as FKs • Columns that are user-visible “primary keys” (e.g. Product Code, Region Code, Department Code) become attributes • Changing them no longer causes pain and trouble

When In (Any) Doubt, Define Another Table • When you see a numeric suffix on a column name, that’s an indication that another table should have been built • FK columns in an Operation table named Doctor1 Doctor2 Doctor3 Doctor4 would be a poor design • What if more than 4 doctors involved? • Better to have a child table with any number (even zero) of doctors for an operation

More Tables Are Often Better • Some cases where using more tables results in a more flexible system: • Address table where both Customer and Employee tables have AddressID column • Address table has column AddressType (FK to AddressType table with descriptions Billing, Shipping, Home, Office…) • Name table where Patient table has NameID • Column Primary in Name table marks the current (primary) name to use for this person

Historical Data • Data that changes over time where history needs to be maintained can be in a child table with a ValidUntil column, and possibly also a ValidSince column • The currently active child table row is marked by a null ValidUntil value • Find the active child table row on date X: (X >= coalesce( ValidSince, X )) ^ (X <= coalesce( ValidUntil, X )) • Code needs to ensure that timespans don’t overlap • Can store ID of active child table row in parent

SQL Joins • Because data values are used to define relationships, SQL needs to be able to combine (join)tables based on the defined FK-to-table relations – and it can do so quite flexibly • Each SQL statement must re-specify the join condition(s); tools to generate SQL code can make this easier to get right • SQL supports different kinds of joins: • Natural join • Left (or right) join • Cross join

Using SQL Databases from APL (Dyalog & other)