About Tables and Datatypes

Introduction • This section is probably the most important in terms of performance of an IQ system • We discuss • Tables • Datatypes • The next section discusses the other vital part of IQ • Indexes

Tables • In IQ, tables do not really exist • Tables are implicit in the IQ Catalog Store meta-data • The concept of a table only comes to the fore in SQL; at all other times IQ is a simple (hah!) Column Store • However, the CREATE TABLE statement does have some interesting “features”

CREATE TABLE • Of all the clauses of the CREATE TABLE command, the following are of interest: • [ GLOBAL | TEMPORARY ] • [ { IN | ON } ] • [ AT location ] • { UNIQUE | PRIMARY KEY | REFERENCES … }

GLOBAL TEMPORARY • In IQ a temporary table can be either TEMPORARY or GLOBAL TEMPORARY • A temporary table only exists for the duration of the transaction that creates it • In a Global Temporary table the schema lasts forever; only the data is destroyed at transaction commit/rollback • All temporary data lives in IQ TEMP STORE

Temp Tables • You may no longer specify the table owner when creating temp tables • If you specify the owner, a permanent (base) table is created in the IQ Store • CREATE TABLE dbo.#my_temp_table (…) creates a permanent table in the IQ Store • DECLARE LOCAL TEMPORARY TABLE dbo.my_temp_table (…) results in a syntax error

IN | ON • In IQ you cannot position objects (tables or indexes) • The reason for the IN (or ON) clause is to allow you to create an ASA table (base or temporary) • An ASA table is created using • ON SYSTEM • System being the IQ Catalog Store • This table will obey all the rules of an ASA (not an IQ) table

AT • The AT clause is used to define a proxy table that maps to a table at a remote location • The remote server name must be defined to IQ • This is not a fast way of accessing data • Example: CREATE TABLE fred AT 'anotherserver;adatabase;;fred'

Constraints • Check Constraints Added • Includes Check Constraints, Unique and Referential Integrity Constraints • Permits constraint modification without recreating a table • Constraints may be named for reuse: UNIQUE, PK, FK, CHECK, IQ UNIQUE • No DEFAULT (expected in IQ 12.7/IQ 15) • New stored procedures for maintenance • sp_iqprintconstraints • sp_iqdropconstraints

Identity Columns • IDENTITY/DEFAULT AUTOINCREMENT Property • Column may be defined as IDENTITY or DEFAULT AUTOINCREMENT • Must be enabled using IDENTITY_INSERT Option • Only one column per table may be defined with this property • New Global Variable @@identity retains last value inserted • New Database Option to Auto-Index Identity Columns • The option IDENTITY_ENFORCE_UNIQUENESS is 'OFF' by default • If ON creates a Unique HG index on Identity Columns • Alter Table supports modifying/adding column as IDENTITY or DEFAULT AUTOINCREMENT

New LOB Datatypes • The char() data type may be defined up to 32K-1 bytes • Same as Sybase IQ varchar() • If defined > 255 bytes only FP, WD and CMP indexes are permitted • Varchar() and char() may now be the same • Certainly they behave identically, except that varchar() is one byte longer (per row) • "Select Into" a Permanent Table now permitted (select into temporary table supported since 12.4.3)

DDL Locks • Concurrent DDL Lock Reduced to Table Level • This was a Database Lock in previous versions • You may perform multiple DDL operations in a database as long as they operate on different tables • "Begin Parallel IQ" to create multiple indexes on one table remains available • (Multi-column) Key length increased from 1024 bytes to 1530 bytes (can still only be composed of 255 columns)

Primary Key/Foreign Key • IQ-M does not enforce Primary Key/Foreign Key relationships – but it will in 12.5 (see following slides) • The optimiser does use the PK/FK relationship for query planning • Only specify this relationship if the relationship does exist • Incorrect specification can result in query plan errors (performance degradation) and possibly errors • ASA does modify a join that is defined as PK/FK to an ANSI NATURAL join – this can cause problems with orphan rows

Key Specification • In a Data Warehouse the production key is not, generally, used as the warehouse key • It is more accepted practice to use a generated key • Make this key an Unsigned INT or BIGINT • This is absolutely the most efficient key datatype in IQ-M

Primary Keys • In IQ-M a Primary Key is an ANSI standard Primary Key • It is UNIQUE • It must not be null • If specified as a table or column constraint then a specialised form of the HG index is created

Foreign Keys • Always generate an HG index on a Foreign Key • If the relationship is 1:1 then define the Foreign Key column as UNIQUE • This will force auto generation of a unique HG index • Again, try to specify join columns as Unsigned INT or BIGINT

Referential Integrity – 12.5 (1) • 12.5 supports Primary Key/Foreign Key referential integrity on loads. • The overhead on loads is minimal. The maximum reduction in load performance that has been seen is under 8% of the total load time. • For RI to work there must be an HG index on both the Primary and Foreign keys, and both the Primary and Foreign keys must be defined at the table level. • This is the requirement (as above) for the Non-Unique Multi-Column Index.

Referential Integrity – 12.5 (2) • The RI checking is accomplished after the sort phase for the foreign key index. • At this point the keys are all in sorted sequence, so we read the Primary Key (PK) HG index (or rather we read the Leaf Nodes of the PK HG index – which is a Unique Index – hence has no G-Array), and we walk the PK index Leaf Nodes. • Because all the data is sorted we only have to walk the Leaf Nodes once for the entire load. • Hence the low overhead for Referential Integrity.
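
The single pass over the sorted keys can be pictured as a merge-style walk. The Python sketch below only illustrates that idea, assuming both key sets are available as sorted lists; `find_orphans` and its arguments are hypothetical names, and this is not IQ's actual implementation.

```python
def find_orphans(sorted_fk_keys, sorted_pk_keys):
    """Return foreign-key values that have no matching primary key.

    Both inputs must already be sorted ascending, and the primary keys
    are unique. The primary-key list is walked once for the whole batch.
    """
    orphans = []
    pk_pos = 0  # current position in the (hypothetical) PK leaf nodes
    for fk in sorted_fk_keys:
        # Advance through the primary keys until we reach or pass fk.
        while pk_pos < len(sorted_pk_keys) and sorted_pk_keys[pk_pos] < fk:
            pk_pos += 1
        # No primary key left, or the next one is larger: fk is an orphan.
        if pk_pos == len(sorted_pk_keys) or sorted_pk_keys[pk_pos] != fk:
            orphans.append(fk)
    return orphans


orphans = find_orphans([1, 1, 3, 4, 7], [1, 2, 3, 5, 7])
```

Because `pk_pos` never moves backwards, the cost is proportional to the number of foreign keys plus the number of primary keys, which is the reason the slide gives for the low overhead.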

A digression on Datatypes • There are some very important issues concerning datatypes • We have discussed the actions of the indexes – there are areas where an index can be forced to run slowly if the datatype is specified wrongly • Always consider the requirements for the datatype • Incorrect datatype specification is as bad as incorrect index selection

Signed vs. Unsigned - 1 • If you don’t need signed data in an int or bigint, use UNSIGNED • This speeds up access to the HNG index, sometimes doubling performance • HNG stores negative data as 1s complement • This means SUM(), AVG() etc. run quickly • But range checks require another set of scans • If we stored as 2s complement then • Range checks would run with 1 scan • But SUM(), AVG() would be slower!!

Signed vs. Unsigned - 2 • Use Unsigned for surrogate keys and join columns • Unsigned data comparisons are quicker (=, !=) • The caveat to this is that Open Client may misinterpret the value if it is too large as it does not understand large unsigned data • Can convert to signed integer, numeric, or decimal if returning data to an Open Client application • This caveat applies to moving data between IQ servers with INSERT FROM LOCATION
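
The Open Client caveat can be illustrated with a small Python sketch. It assumes, purely for illustration, a client that reads the four bytes of an unsigned 32-bit value as a signed 32-bit integer; the helper names are hypothetical and nothing here calls Open Client itself.

```python
import struct

SIGNED_32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def as_signed_32(unsigned_value):
    """Reinterpret the 4 bytes of an unsigned 32-bit integer as signed."""
    raw = struct.pack("<I", unsigned_value)
    return struct.unpack("<i", raw)[0]


def safe_for_signed_client(unsigned_value):
    """True if the value survives a signed 32-bit interpretation unchanged."""
    return unsigned_value <= SIGNED_32_MAX


small_key = as_signed_32(2_000_000_000)  # fits: value is unchanged
large_key = as_signed_32(3_000_000_000)  # too large: comes back negative
```

Values above `SIGNED_32_MAX` are the ones worth converting to a wider signed type, numeric, or decimal before they are returned to such a client.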

Other Datatype Issues • Signed vs. Unsigned does not affect the other indexes to any great degree • But… • The selection of datatypes does • We have already discussed keys but some other areas are worth commenting on…

Long Varchar() - 1 • A long varchar() is defined as a varchar() with a length greater than 255. • Please avoid this if you can • Only FP and WD indexes are allowed • No enumerated indexes or HNG • We have seen a number of customers who use varchar(1024) as Primary Keys • Please DO NOT DO THIS!!

Long Varchar() - 2 • Long varchar() values are stored in 256-byte chunks, so using 4 bytes in a varchar(32000) only uses 256 bytes • By default these 256-byte chunks are memset (set to zeros to improve compression) • There is an upgrade option to memset existing 12.4.0 varchar() – this is worth doing, if you have the time!
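
The chunk arithmetic can be worked through with a short Python sketch. It takes only the 256-byte chunk size from the slide; the function name is hypothetical, and compression and any per-row overhead are deliberately ignored.

```python
CHUNK_BYTES = 256  # chunk size stated on the slide


def long_varchar_storage(used_bytes):
    """Bytes occupied by a long varchar() value before compression.

    The value is rounded up to a whole number of 256-byte chunks.
    """
    if used_bytes < 1:
        raise ValueError("used_bytes must be at least 1")
    chunks = -(-used_bytes // CHUNK_BYTES)  # ceiling division
    return chunks * CHUNK_BYTES


four_byte_value = long_varchar_storage(4)       # one chunk
full_width_value = long_varchar_storage(32000)  # 125 chunks
```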

Char() vs. Varchar() • Always, if you can, use char() • Generally this will improve performance, at the modest cost of storing a small number of extra bytes • Retrieval tests of char() vs. varchar() indicate that varchar() can incur a 2-3% performance hit per column, and we have seen 10% degradation on single columns

Float, Real and Double • Unless you really need them, please do not use • FLOAT • REAL • DOUBLE • They can only have Flat FP indexes – no others • They do not store “exact” values – only approximate • Please try to use • NUMERIC • DECIMAL
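
The difference between approximate and exact storage is easy to see outside the database. In this Python sketch the built-in `float` (an IEEE 754 double) stands in for FLOAT/DOUBLE and `decimal.Decimal` stands in for NUMERIC/DECIMAL; this is an analogy only, and IQ's internal representation may differ.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly.
float_total = 0.1 + 0.2
float_is_exact = (float_total == 0.3)

# Decimal arithmetic keeps the values exactly as written.
decimal_total = Decimal("0.1") + Decimal("0.2")
decimal_is_exact = (decimal_total == Decimal("0.3"))
```

Here `float_is_exact` is `False` and `decimal_is_exact` is `True`, which is why equality tests and sums over approximate types can surprise you.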

NUMERIC and DECIMAL • Numeric and Decimal are aliases of each other • Any numeric or decimal with a precision of less than 12 will be stored as an INT (with conversions) • Any numeric or decimal with a precision of between 12 and 18 will be stored as a BIGINT (with conversions)

Join Columns • You must generate the database schema with the table join columns having the same datatype. • INT, UINT and BIGINT are best, but the column datatypes for each join must be the same • Conversion cost is horrendous

Case and Collation Sequences • In terms of RAW performance the fastest IQ database is one where CASE is set to RESPECT and the collation sequence is BINARY (ISO_bineng) • This is probably not suitable for the general application of the database or warehouse server • CASE set to IGNORE is the next fastest, then changes in the collation sequence • The performance hits can be quite high (around 10-20% - we think!)

String Searches • String searches such as substr(col_name, 1, 3) are really very slow; they rely on FP searches • With low cardinality (1 and 2 byte FP) data the search is faster, but this can still be a restriction • Create a new column which holds the first 3 characters of the col_name column, then search on this • This way there is no function call, so no projection, so the optimiser can use a fast index: LF or HG (or, for a range query, an HNG)

Telephone Numbers • A classic example of the above is the telephone number • +1-301-896-1733 • +1 -> Country Code • 301 -> Area Code • 896 -> Sub Area Code • 1733 -> Local Number • Make this 4 columns (actually 5 - the whole number), then searches use fast indexes
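
A load-time transformation along these lines might look like the Python sketch below. It assumes the number always arrives in the hyphen-separated form shown on the slide; `split_phone` and the column names are hypothetical.

```python
def split_phone(number):
    """Split '+CC-AAA-SSS-LLLL' into the five column values.

    Raises ValueError if the number does not have exactly four
    hyphen-separated parts.
    """
    country, area, sub_area, local = number.split("-")
    return {
        "full_number": number,      # the whole number, kept as a 5th column
        "country_code": country,
        "area_code": area,
        "sub_area_code": sub_area,
        "local_number": local,
    }


columns = split_phone("+1-301-896-1733")
```

A search on the area code then becomes a plain equality test on its own column instead of a substring function applied to the full number.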

Date time • As with telephone numbers, try storing a datetime as a series of columns (or a dimension table) • Try creating columns DD MM YY HH MM SS DoWeek DoYear Quarter etc. • This changes in 12.5 with the DATE, TIME and DTTM indexes

Date vs. Datetime • A slightly better solution to the above can be considered in the light of the 1 and 2 byte FP indexes • Try storing the date part of a datetime as a date and the time part as hh mm ss • So: Datetime -> date_col, hh_col, mm_col, ss_col
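
The split can be done before the load. This Python sketch uses the column names from the slide; the function name is hypothetical and fractional seconds are dropped for simplicity.

```python
from datetime import datetime


def split_datetime(value):
    """Split a datetime into a date column and three small integer columns."""
    return {
        "date_col": value.date(),
        "hh_col": value.hour,
        "mm_col": value.minute,
        "ss_col": value.second,
    }


parts = split_datetime(datetime(2004, 3, 15, 13, 45, 9))
```

The hour, minute and second columns each have at most 24 or 60 distinct values, which is what makes them candidates for the 1-byte FP index mentioned above.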

Loading Dates • There is NO default date or datetime format for loads into IQ • The format must be explicitly set for the load/insert to get the best performance • However some formats are conversion enhanced

Enhanced Conversion formats • DD/MM/YYYY • DD.MM.YYYY • DD-MM-YYYY • HH:NN:SS • HHNNSS • HH:NN:SS.S • HH:NN:SS.SS • HH:NN:SS.SSS • HH:NN:SS.SSSS • HH:NN:SS.SSSSS • YYYY-MM-DD HH:NN:SS • YYYYMMDD HHNNSS • YYYY-MM-DD HH:NN:SS.S • YYYY-MM-DD HH:NN:SS.SS • YYYY-MM-DD HH:NN:SS.SSS • YYYY-MM-DD HH:NN:SS.SSSS • YYYY-MM-DD HH:NN:SS.SSSSS

Date Load • So it is better to use Col1 DATE('YYYY-MM-DD') • than Col1 ASCII(10) • The performance enhancement can be as much as a 100-fold speed-up in loads (for small tables)

UNION • In IQ-M 12.4.3 the UNION clause has very few disadvantages • Generally UNIONs are all processed in parallel • so if you have a low user count they work well • UNION also offers a solution to the problem of deleting aged data • Do not use DISTINCT in the UNION clause, or in the SELECT statement

UNION and Delete • If you are storing a fixed (in time) amount of data, e.g. 6 months • Then every month you delete 1/6th of the data in the table • This is expensive • It is better to split the fact table into 6 one-month tables • At the end of the month you truncate the oldest table • And possibly rename the table sets • Remember that for Multiplex a table rename is DDL and hence can only be done in simplex mode!
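
The monthly rotation can be sketched as a small Python helper that only builds SQL text. The table names (`fact_m1` to `fact_m6`) and function names are hypothetical, and the sketch reuses the truncated table as the newest month rather than renaming tables, which is one possible variant and an assumption of this example.

```python
from collections import deque


def build_union_query(tables):
    """Build one query over all monthly tables (no DISTINCT, per the UNION slide)."""
    return " UNION ALL ".join(f"SELECT * FROM {name}" for name in tables)


def roll_month(tables):
    """Rotate the window in place and return the statement that empties
    the oldest table so it can receive the new month's data."""
    oldest = tables.popleft()
    tables.append(oldest)  # the emptied table becomes the newest month
    return f"TRUNCATE TABLE {oldest}"


window = deque(f"fact_m{i}" for i in range(1, 7))
truncate_stmt = roll_month(window)
union_query = build_union_query(window)
```

Reusing the table avoids the rename step, and with it the DDL restriction noted for Multiplex, at the cost of table names that no longer indicate which month they hold.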

Cartesian Joins • These are expensive: they involve the join of every row in one table to every row in a second table • Table A: 1,000,000 rows • Table B: 100,000 rows • Worktable: 100,000,000,000 rows • Select * from T, R where T.a = 10 (Cartesian) • Select * from T, R where T.a between R.b and T.b (Cartesian) • Select * from T, R where ABS(T.a * R.b) = T.b (Cartesian) • But • Select * from T, R where ABS(T.a * T.b) = R.b (Not Cartesian)
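
The worktable size is simply the product of the two row counts. The Python sketch below reproduces the slide's figure and shows the same effect on a tiny example; the function name is hypothetical.

```python
from itertools import product


def cartesian_row_count(rows_a, rows_b):
    """Rows produced when every row of A is joined to every row of B."""
    return rows_a * rows_b


# The figures from the slide: 1,000,000 x 100,000 rows.
worktable_rows = cartesian_row_count(1_000_000, 100_000)

# The same effect in miniature: 3 rows joined to 2 rows gives 6 pairs.
small_join = list(product(["a1", "a2", "a3"], ["b1", "b2"]))
```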

Cursors • Avoid using cursors • Generally means row based processing • IQ was designed for set based processing • Sometimes they cannot be avoided • If used, make sure to use NO SCROLL cursors • Open With Hold • Allows the cursor to remain open across transactions • If not used, the cursor may be closed when a commit is issued (depends on connectivity type)

Watcom SQL vs. T-SQL • IQ (ASA) is not 100% T-SQL Compatible, but very close • Recommend using Watcom SQL • All system procedures written with it • Many more code examples and more IQ people versed in it • Watcom SQL has some extensions that T-SQL does not: • Dynamic SQL • Better Loop control • Full cursor movement rather than just read next • Batches and procedures must be written in the same dialect • Cannot mix T-SQL with Watcom SQL

Watcom SQL vs. T-SQL • Behavior differences include: • DECLARE CURSOR • GOTO • IF • PRINT • RAISERROR • SET • WHILE (T-SQL) vs. LOOP • Global variables • Variable names • CALL • FOR • ASA requires variables to be declared immediately after a BEGIN

Commit and Rollback • Use transaction control around logical units of work, even read only queries • Should commit before a read/write batch is started to ensure latest version of data is available • Should issue commit and rollback after batch completion to release all query resources • Rollback will free memory resources in use by previous operations • For systems with high number of connected users, freeing memory resources can aid in query performance

Custom Functions • Custom functions can be written in either SQL or Java • Great way to encapsulate business logic for transforming data • Can have a significant performance impact on queries • Functions are executed in the catalog portion of the engine • All result rows may need to be moved to ASA • Can be time consuming for large result sets • Turn on query plans to see what impact the functions have on effective query plans

About things - End
