Building Things : Disks Databases

4 Building Things : DisksDatabases

Introduction • This is a mix and match section • It contains all the information that does not have a home elsewhere in this presentation • It covers • Disk layouts • Disk Farms • Striping on Striping • Database build

Disk Layouts • Basically in an IQ-M system you can have a number of different disk configurations • We consider the following • Simple SCSI (or UDMA-[1]66) Disks • Basic Disk “Farms” (Collections of raw disks) • RAID Systems (NT RAID, Clariion etc.) • Disk Subsystems (EMC, MTI Guardian etc.)

Basic Disk Farms • Here we are considering a number of “dumb” disks • There is no intelligence in the disks – or the controllers • Here is where we use IQ-M striping • Here we have a number of disk drives (on various controllers) • We stripe across a series of disks

Disk Striping Basic Disks 1-dimensional IQ Striping In the 1-dimensional striping we are spreading the data writes (and reads) across a series of disk drives – limiting the overall disk head movement.

Problems with Disk Striping • IQ-M has to make location decisions as to where to write and read the data • It is not as simple as writing to “bit bucket” or a serial list of disk blocks • It is faster than simple 1 disk = 1 device • The performance hit is measured in extra micro seconds per read/write – but with a 10 TBytes database this could be important

In RAID 1 – Mirror, we can consider that Disks A and B are mirrored and Disks C and D This improves read and write performance, as the controller will pick which disk heads are closer to the data required. Also there is an improvement in data security – 2 copies A D A D C B C B RAID – 0,1 Redundant Array of Inexpensive Disks Raid 1 - Mirror In RAID 0 – Stripe, we can spread the reads and writes as per the last slide Raid 0 - Stripe This dramatically improves the read and write performance, but does nothing for the data security

In RAID 0/1 – Stripe and Mirror, we can see that the stripe sets on A and B mirror to C and D This gives security – in the mirroring (copies) of the data, and improved performance due both to the two copies and the stripe sets A D A D C B C B In RAID 5 – Csum/MR, The data is in “stripes” across the disks then a checksum is written to the final disk This gives a small performance improvement for reads, no improvement for writes (in fact it will slow writes down!). The only bonus is security. The system will allow a disk failure in the pack without requiring the 2 to 1 overhead of mirroring systems RAID – 0/1 and 5 Redundant Array of Inexpensive Disks Raid 0/1 – Stripe and Mirror Raid 5 – Checksum Multi-read

Other RAID • There is RAID 2, 3 and 4 • These are variations on the mirroring/striping and checksum theme • There are also proprietary RAID schemes • RAID S The is EMC’s RAID system • RAID M This is the MTI RAID Scheme • We will come on to talk about proprietary schemes in a few slides time

Well is RAID useful ? • Yes it is – mainly ! • If the RAID system handles the data distribution at the hardware level (with RAID controllers) then RAID is both fast and safe • Windows NT (and 2000) can drive RAID through the operating system • This can be slower than IQ-M performing the striping • Watch out for large numbers of reads – and individual read performance slowing down • Also Avoid RAID 5. It is very slow for writes

Disk Subsystems (emc, HDS) • These are disk arrays where we have no idea what the internal organization of the disks is (and nor do we care) • Generally we can consider that we are writing to memory (and usually we are!) • The tuning of the disks is usually best left up to the Hardware Support guys

Striping on Striping • There is one very contentious area in disk organisation • This is 2 dimensional disk striping or striping on stripes • This is driven by using an operating system or RAID unit to provide 1 dimensional striping on a series of disks • Then we apply IQ striping onto the existing striping • 2 Dimensional stripes • Do it right – it’s very fast, do it wrong and it isn’t!

A diagram Operating System Striping AS IQ-M Striping The reason this works is now the individual disks only have a very small amount of data to read and write so we rely on very fast disk -> processor communications If this is done properly it is the fastest disk access for ASIQ-M The secret, if there is one, is to make the block size the same as the “micro” stripe size… But this needs further experimentation

Disk Striping • By default Disk_Striping is set ON for RAW devices and OFF for file system devices • We should be using RAW devices – they are (generally) a lot faster and potentially safer than file system devices • You may want to play with this parameter if you are running devices on disk farms or RAID array systems

File system Storage • If you must have the IQ Store on Operating System File System • OS_File_Cache_Buffering can be set off for Solaris and Windows NT (and 2000) • However the system will slow down in the following areas • IQ Page Size < file system block size • During Loads • Solaris > 4Gb memory • The description in the manual is awesome!

Create Database • OK so we have built the device or devices – about time we considered some of the options to create the database • Of all of the options when issuing a CREATE DATABASE command the following slide details the most important

CASE - 1 • CASE • CASE RESPECT is the fastest • There is a 10-20% hit going to CASE IGNORE • Implications to FP Indexes • Regardless of RESPECT vs. IGNORE all 1-byte and 2-byte FP indexes store all the binary values for the data • So ABC, abc, Abc, Abc are all stored even for CASE IGNORE

CASE - 2 • Remember because we store all bitmaps we can go from 1-byte to 2-byte FP, or 2-byte to flat FP where we might not want to • A solution to this is to set the server in CASE RESPECT (because it is faster) • Then use an ETL tool to rtrim() and ucase() or lcase() all of the incoming character data

CASE - 2 • The HG index stores data in what is called “conditioned” mode. • For a CASE IGNORE database there is only one entry per logical value • ABC = abc = Abc etc. • For a CASE RESPECT database there has to be one entry per value • ABC != abc != Abc etc.

CASE – 3 • For an LF index we hold partially conditioned values • For CASE RESPECT and CASE IGNORE all values have a bit-map • This can be wasteful on space • The reason for having this is two fold • To allow for the recreation of the FP index from the LF • To allow for some rare cases (some group by’s) where we still project values from an LF index

COLLATION • COLLATION • Set to ISO_BINENG, this is the fastest • If you must have a collation sequence this will slow the system down by around 10% for 8 bit character sets and 50% for multi-byte character sets • There are substantial slow downs for all multi-byte character sets

IQ PAGE SIZE - 1 • This is an area of extreme contention • The IQ PAGE size (effectively) determines the the size of the smallest addressable area in memory • NOT ON DISK ! • The disk parameter is the BLOCK SIZE

IQ PAGE SIZE – 2 • The rules for IQ PAGE SIZE (or memory buffer size) are simple • Set to 64K unless…. • Set to 128K when the memory model exceeds around 1 Gbytes per cache, and the number of rows in the FACT table exceeds 100m • Set to 256K when the memory model exceeds 4 Gbytes per cache (this may be a rare case!) • Do not set to 512K – there is a little bug…

BLOCK SIZE • The block size is set automatically when the IQ PAGE SIZE is set • (In 11.x IQ you could set the Max Compression parameter that would vary the number of blocks per page) • In IQ 12, you can vary the BLOCK SIZE but it does not do much except in some extreme cases

So what is BLOCK SIZE? Memory Cache Disk Device A Page When a “page” is written out of memory it is compressed. Only the resulting used “blocks” within the page are written to disk, this set of blocks is called a “chunk” A Chunk (A variable length object)

Writing to Disk Default Page in memory 64Kb – 16 Blocks 10 Blocks used before compression Compress before Disk Write Operation Compressor 4 blocks used after compression Disk Write 1 Chunk on Disk the size of 4 blocks - 16kbytes

Disk_Striping_Packed • The problem with large systems is that we only have 1 freelist • This means that if we want 8 blocks of space we will grab the first > 8 space we can find, which tends to fragment the devices • If Disk_Striping_Packed is ON then we have one freelist for each “number of blocks” available • 1 for 1 block free • 1 for 2 blocks free • Etc. • The trade-off is better space usage against slightly worse localisation.

BLOCKS • So as can be seen the IO operation is performed in blocks • but how many are written at one time depends upon • The usage (how much of the page is used) • The Compressibility (how well the used blocks compress) • We have found the default works the best • That is for a 64K page 16 blocks of 4k

Block Size • The compressor is designed to run in 2Kbyte chunks so • Never make the block size less than 2Kbyte • Always make is a multiple of 2Kbyte • The compressor does not have a huge amount of memory to run in so • If the Page size is big (>128Kbyte) do not set the block size much bigger than 32-64Kbyte

Scoping - 1 • ASIQ-M 12.4.2 used the ASA scoping rules, not the ANSI rules • In ANSI there are strict rules as to where aliases can be used, and what they are • ASA has rules that an extension to the ANSI rules and allow much more freedom • In ASA you can re-use and rename anything, anywhere

Scoping - 2 • An example: Select alpha = 27, beta, gamma From tablename Where alpha between 27 and 29 • In ASE rows are returned that match the where clause i.e. alpha between 27 and 29 • In ASA/ASIQ all table rows are returned • Why ?

Scoping – 3 Select alpha = 27, beta, gamma From tablename Where alpha between 27 and 29 • In ASIQ the following happens • In the select we have defined an alias alpha and assigned a value to it (27) • Then in the where clause we check the value of alpha against the values 27 thru 29 and the condition is true because alpha is 27 • Note:Alpha is defined as an alias here not a column identifier

A Picture ASE In ASE this alpha is a alias Select alpha = 27, beta, gammaFrom tablenameWhere alpha between 27 and 29 In ASE this alpha is a column name In ASIQ this alpha is a alias In ASIQ this alpha is also a alias, because it has been defined ASIQ To make the code “correct” for ASIQ you need to prepend the second alpha with The table name. E.g. tablename.alpha

Building Things - End

Building Things : Disks Databases