1 / 25

BID211: Extending Your Sybase IQ Data Warehouse with Unstructured Data and Binary Large Objects

BID211: Extending Your Sybase IQ Data Warehouse with Unstructured Data and Binary Large Objects. Mark Mumy Principal Systems Consultant mumy@sybase.com August 15-19, 2004. The Enterprise. Unwired. The Enterprise. Unwired. Industry and Cross Platform Solutions. Manage Information. Unwire

ashby
Download Presentation

BID211: Extending Your Sybase IQ Data Warehouse with Unstructured Data and Binary Large Objects

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. BID211: Extending Your Sybase IQ Data Warehouse with Unstructured Data and Binary Large Objects Mark Mumy Principal Systems Consultant mumy@sybase.com August 15-19, 2004

  2. The Enterprise. Unwired.

  3. The Enterprise. Unwired. Industry and Cross Platform Solutions Manage Information Unwire Information Unwire People • Adaptive Server Enterprise • Adaptive Server Anywhere • Sybase IQ • Dynamic Archive • Dynamic ODS • Replication Server • OpenSwitch • Mirror Activator • PowerDesigner • Connectivity Options • EAServer • Industry Warehouse Studio • Unwired Accelerator • Unwired Orchestrator • Unwired Toolkit • Enterprise Portal • Real Time Data Services • SQL Anywhere Studio • M-Business Anywhere • Pylon Family (Mobile Email) • Mobile Sales • XcelleNet Frontline Solutions • PocketBuilder • PowerBuilder Family • AvantGo Sybase Workspace

  4. Target Audience • Information Technology Management • Project Management • Database Personnel • All Levels • System Administrators • All Levels

  5. For This Presentation • Let’s Keep It Interactive • I will entertain questions on a subject • Would be happy to speak to you offline, if desired • No Question is too Basic • Chances are others may have the same question • Open Question Forum at the end • Time Permitting

  6. Agenda In this session, we will be discussing: • What Is Unstructured Data? • Why Should I Place Unstructured Data In Sybase IQ? • Understanding How Sybase IQ Stores The Data • Sybase IQ Limitations for Unstructured Data • Creating, Loading, and Extracting Unstructured Data • New Functions • Sybase IQ 12.6 Functionality • Tuning Sybase IQ for Unstructured Data • Summary • Question And Answer

  7. What Is Unstructured Data? • Freeform binary or textual data that has meaning to a user or an application, but not to the database • Why is it unstructured? • Format is not known to the RDBMS • Cannot be indexed due to the unstructured nature • Large object types • Binary – Generally referred to as a BLOB (Binary Large Object) • Text – Generally referred to as a CLOB (Character Large Object) • Often referred to as a LOB or LOBs

  8. What Is Unstructured Data? • Can be any type of data you desire • Images • Video • Sound • Scanned or electronic documents • Applications • Company specific data format • It has to make sense to you, not the database!

  9. Why Should I Place Unstructured Data In Sybase IQ? • Single location for application data • The application can be written to access all structured and unstructured data using a single interface • The DBAs have control over all data for the application, not just the structured data • Filesystems are generally less efficient at handling large sequential files, causing more disk I/O • In order to achieve performance from filesystems, additional software and training is necessary • Requires tight coordination with system administrators • Application can be taken down for non-database problems • Filesystems going offline • Storage filling up

  10. Why Should I Place Unstructured Data In Sybase IQ? • Generally requires less storage • Filesystems cannot naturally compress the datafiles • IQ will compress the LOB data just as it compresses other data • Each IQ data page is run through a compression algorithm to compress the contents of the page into one or more disk blocks • Filesystems can have as much as 10% overhead • Inode table • Reserved space for the “root” account • Reading from filesystem must be done sequentially • The application can be designed to return portions of the data in parallel • Sybase IQ loads are done in parallel

  11. Understanding How Sybase IQ Stores The Data • Each object is stored in one or more data pages • If an object needs a portion of a page, the entire page is marked as in use for that LOB • Each page is then put through the IQ compression algorithms to squeeze as much storage savings as possible • No other indexing or changes take place on the LOB data

  12. Sybase IQ Limitations for Unstructured Data • There is no database limit on the number of LOB columns that can exist • Limited only by the total number of tables and columns in a database • A table can have an unlimited number of LOB columns • Limited only by the total number of columns allowed on a table • Each LOB columns is virtually unlimited in size • Maximum LOB size is 4GB * page size • Absolute maximum is 4GB * 512KB = 2 PetaBytes • Most RDBMS engines are limited to 2-4 GB per entry

  13. Creating, Loading, and Extracting Unstructured Data • Added new datatype “long binary” to support LOB data • Create a table with LOB support create table blob_data( file_id int primary key ,filename char(64) ,ext char(6) null ,file_size unsigned bigint ,lobcol long binary null ) LOB Column

  14. Creating, Loading, and Extracting Unstructured Data • New syntax added to support loading LOB data • “BINARY FILE()” added to column specification • Data for the binary file location must be delimited • LOB data does not exist in primary data file • Primary data file is a pointer to LOB data being loaded

  15. Creating, Loading, and Extracting Unstructured Data • Load a table with a LOB column set temporary option load_memory_mb=50; go load table blob_data ( file_id ',' ,filename ',‘ ,ext ',‘ ,file_size ',' ,lobcol binary file ( ',' ) ,filler(1) -- change this to filler(2) for Windows Data ) from ‘blob_file.dat’ format ascii preview on quotes off escapes off go commit go LOB Column

  16. Creating, Loading, and Extracting Unstructured Data • Sample blob_file.dat data 1,boston,jpg,/s1/loads/lobs/boston.jpg,1234, 2,map_of_concord,bmp,/s1/loads/maps/concord.bmp,321, 3,zero length test,NULL,,123, 4,null test,NULL,NULL,456, • Notice that there is no LOB data in the file • The IQ load engine will open the secondary files and load them into the appropriate rows • Row 1: /s1/loads/lobs/boston.jpg • Row 2: /s1/loads/maps/concord.bmp • Rows 3 and 4 will be NULL

  17. Creating, Loading, and Extracting Unstructured Data • More LOB loading notes… • If the file cannot be opened for any reason a NULL is loaded instead and no error is thrown • An error will be thrown if the column is created with “NOT NULL” • LOB data can only be loaded from an ASCII primary data file (blob_file.dat) • The ASCII file can be fixed width, however, the secondary file name column MUST have a delimiter after it

  18. Creating, Loading, and Extracting Unstructured Data • Three methods to extracting LOB data • Return the entire contents of the column • Return a portion of the column • Extract the column to disk • To return the entire contents of the column to the client application • select lobcol from my_table where file_id = 1 • To return a portion of the column to the client application • select byte_substr64( lobcol, 1, 1000 ) from my_table where file_id = 1 • Will return bytes 1 through 1000 of the LOB • Extract the column to disk • select bfile ( ‘/tmp/my_lob.dat’, lobcol ) from my_table where file_id = 1 • Will directly write the entire row/column to disk on the IQ server host

  19. New Functions • BYTE_LENGTH64( long binary ) • Returns the total bytes contained in the LOB column • select byte_length64( lobcol ) from my_table where file_id = 1 • BYTE_SUBSTR64( long binary, offset, length) • Returns a portion/substring of the LOB data • select byte_substr64( lobcol, 1, 1000 ) from my_table where file_id = 1 • BFILE( filename, long binary ) • Extract the contents of the LOB column to filename • Each row should have a unique filename • The filename can be any string manipulation • If the filename is non-unique the contents will be overwritten • select bfile( filename + ‘.’ + ext, lobcol ) from my_table where file_id = 1

  20. Sybase IQ 12.6 Functionality • New domain BLOB created for the “long binary” datatype • New domain CLOB created for the “long varchar” datatype • Single- and Multi-byte support added for the CLOB datatype • New load table syntax • Secondary_File_Error option specifies the desired error handling when an error occurred during opening/reading a secondary file • Supported syntax • UPDATE • INSERT..VALUES • INSERT..SELECT • LOAD • DELETE • TRUNCATE • SELECT..INTO • INSERT..LOCATION • SELECT

  21. Tuning Sybase IQ for Unstructured Data • CORE_Options14 • This sets the number of threads used to read an individual lob secondary file. Measurements have been inconsistent due to vast hardware differences and dependencies. A value much larger than 3 has shown to cause system time to be too high. Start with a value of 3 and tune from there. • CORE_Options15 • This sets the number of threads in the lob load team. A value of zero will use 1 thread per cpu, if available. A non-zero value of n sets the team size to n, if available. The lob load threads have to live within the other settings such as max threads per connection, max team size, etc. On a machine with many CPUs, this may need to be set manually.

  22. Tuning Sybase IQ for Unstructured Data • FP_LOB_Workunit_MBSize • This defaults to 100 MB and there is probably no reason to change it. It is the amount of data each thread reads in a single unit of work when loading lob data. • Loading LOB data is very sensitive to I/O and CPUs • Have sustained load rates of over 2 GB per second! • That’s 7 TB an hour • Had 72 CPUs and enough I/O controllers to handle the throughput • Loading LOB data is not memory intensive • A minimally configured IQ server (1-2 GB RAM total) is sufficient to load LOB data • Load_Memory_MB is not used heavily for LOB data

  23. Summary • A fast, compressible way to store unstructured data in an RDBMS • Compression ratio will vary, but is guaranteed to be no larger than the original file without the O/S overhead • Data can be written and read in parallel • Remove the need to rely on resources outside the database administrators and developers • Guaranteed performance since all resources are under the DBA’s control

  24. That’s It Questions?

  25. BID211 Extending Your Sybase IQ Data Warehouse with Unstructured Data and Binary Large Objects Mark Mumy Principal Systems Consultant mumy@sybase.com August 17, 2004

More Related