1 / 26

A System for Fast Low-Cost Access to Very Large Catalogs

A System for Fast Low-Cost Access to Very Large Catalogs. Michael Riggs, Vashundara Puttagunta, Maitreyee Pasad, Konstantinos Kalpakis, and Jeanne Behnke CSEE, UMBC and NASA GSFC. URL: http://www.csee.umbc.edu/~kalpakis/monet. Introduction. Motivation

tess
Download Presentation

A System for Fast Low-Cost Access to Very Large Catalogs

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A System for Fast Low-Cost Access to Very Large Catalogs Michael Riggs, Vashundara Puttagunta, Maitreyee Pasad, Konstantinos Kalpakis, and Jeanne Behnke CSEE, UMBC and NASA GSFC URL: http://www.csee.umbc.edu/~kalpakis/monet

  2. Introduction • Motivation • Many catalogs in astronomy, few old and most of the new ones, contain data for very large number of spatial objects • Fast methods to access these catalogs are needed • Budget constraints limit the amount of $$ allocated to S/W and H/W to support fast access to the catalogs • Thus, low-cost solutions that achieve efficient and effective access to very large catalogs are needed • We demonstrate • how to achieve fast access to catalog data on a low cost desktop for a very large catalog using an Object-Relational Database System (ORDBMS) • how to enhance the functionality of the system by extending the ORDBMS with IDL

  3. Test Catalog Used • We use the USNO-A2.0 catalog (called the Monet catalog here) which has approximately 500 million stars (points) Image of the Monet catalog found at: http:/www.nofs.navy.mil/projects/pmm/universe.html

  4. Schemas Used • We experiment with a sample from the Monet catalog and the following four approaches (schemas/datablades/indexes) • Relational only • Traditional relational datatypes with B-tree indexes (clustered) on dec-RA and RA-dec (B schemas) • Object-Relational • Datatypes from the Geodetic Datablade (created for the EOSDIS project for geo-location) with an R-tree index on geo-location (G schemas) • Datatypes from Shapes2 Informix datablade with R-tree index on points (S schemas) • Datatypes from our custom-build SimpleShape datablade with R-tree index on points built bottom up using Hilbert coding of the point’s coordinates (I schemas)

  5. Types of Queries Tested • Spatial Window(Online catalog access) Find all stars within a user-defined bounding rectangle • Spatial Or-Window (Online catalog access) Find all the stars within at least one of two different bounding rectangles • Spatial Self-Join (Catalog mining) Find all stars that are within a specified box centered on each star • Spatial Multi-Join (Catalog Correlations) • Spatial chain-join Find a spatial region in a “chain-joined” sequence of tables (catalogs) • Spatial Star-join Find a spatial region in a “star-joined” sequence of tables (catalogs)

  6. Our Custom Datablade • The custom-made datablade SimpleShape extends the Informix Dynamic Server 2000 by providing • a fixed length opaque type (UDT) storing 4 small floats (16 bytes total) • full R-tree index support including new bottom-up index building • full B-tree index support based on 2 dimensional Hilbert Curves • the SimplePoint UDT holds the coordinates of a 2 dimensional point • the SimpleBox UDT holds the lower left and upper right SimplePoints of a rectangle to make a box • input/output functions and constructors

  7. Format of the USNO-A2.0 • Each star in the catalog comes packaged as three integers. • (int RA, int DEC, int MAG) • RA = (RA in decimal hours) * 15 * 3600 * 100 • DEC = ((DEC in decimal degrees) + 90) * 3600 * 100 • MAG = combination of field, blue, red values plus flags • The catalog data are stored in a database table with schema (coordinates SimplePoint, magnitude integer) • the actual data in the table are stored in their original (compressed) format as above • user-defined functions (UDFs) change coords and magnitude into meaningful fields

  8. Specializing the SimpleShape to the USNO-A2.0 Catalog • the RA and DEC fields are stored in a SimplePoint point (dec, ra) of type SimplePoint • the data format is not changed from packaged monet data • data conversion takes place in the DBMS when displaying results • the MAG field is stored as a single integer field in DBMS directly • additional utility UDFs are created • Ra() takes a SimplePoint and returns RA in decimal hours • Dec() takes a SimplePoint and returns DEC in decimal degrees • MonetBox() contructs a SimpleBox from RA and DEC in decimal hours and decimal degrees respectively • Blue() takes MAG and returns the blue magnitude • Red() takes MAG and returns the red magnitude • Field() takes MAG and returns the field of the star

  9. Custom Data Loader • Custom High Performance Loader • Built from the Virtual Table Interface (VTI) of Informix Server • Maps binary OS file into database table • Only select queries with NO where clause can be issued on table • Reasons for development • Lack of Informix High Performance Loader for PC platform • Bug in regular table load command for PC platform • USNO-A2.0 is distributed in binary format • absense of string conversions make loading fast

  10. Indexing for the Monet Data • R-tree index on coordinates • built using new bottom-up algorithm and the 2nd order Hilbert values of the points • allows fast window queries on entire catalog • faster build times than B-tree index • takes about 1 day to index the entire catalog • Second order Hilbert values • space filling curve converts two dimensions into one dimension • points are sorted based on their Hilbert value and stored adjacently in R-tree leaf pages • Hilbert value code and R-tree bottom-up index building algorithms are built into SimpleShape software

  11. Sample SQL Statements for the SimpleShape Datablade • Sample SQL statements using the SimpleShape datablade • create table test (product_id int, location SimplePoint) • insert into test values (1, SimplePoint(2.333,5.77)) • insert into test values (2, ‘point(4.123, 8.234)’ ) • create table products (product_id int, location_x int, location_y int) • insert into test select product_id, SimplePoint(location_x, location_y) from products • select * from test where within (location, ‘box(0, 0, 100, 100)’ ) • Sample SQL statements for the USNO-A2.0 specialization • create table monet_data(coords SimplePoint, mag int) • select Ra(coords), Dec(coords), Red(mag), Blue(mag), Field(mag) from monet_data where within (coords, monetbox (-3.22, 22.45, 1.23, 22.78) ) • select (Blue(mag) - Red(mag)) as energy, count(*) as count from monet_data where within (coords, monetbox (-3.22, 22.45, 1.23, 22.78) ) group by energy order by count desc

  12. More Sample SQL Statements • Custom High Performance Loader • create table load_data(value1 int, value2 int) using vti_load(file=‘/tmp/data.bin’); • select * from load_data • insert into real_table select * from load_data • Loading Monet Data • create table monet_all (coords SimplePoint, mag int) • create table monet_load(ra int, dec int, mag int) using vti_load(file=‘/tmp/zone0000.cat’) • insert into monet_all select SimplePoint(dec,ra), mag from monet_load • Indexing Monet Data • create index monet_all_idx on monet_all (coords SIMPLE_OPS) using rtree • Selecting Monet Data • select * from monet_all where within (coords, MonetBox(dec,ra,dec,ra))

  13. System Configuration • The experiments with our custom datablade were done on a system with the following configuration • Hardware (costing less than $2K) • AMD Duron 650MHz Processor • FIC model AZ11 motherboard w/IDE ATA-66 controller • 256MB of PC100 RAM • Two IBM Ultra/100 7200RPM IDE disks model DTLA307045 • Software • Linux Mandrake version 7.0 • Updated Linux Kernel version 2.4.0-test4 • Informix Dynamic Server 2000 version 9.20.UC1 • Custom datablade and data loading software

  14. System Configuration • The experiments with other datablades and the traditional relational solution were done on a system with the following configuration • Hardware • Sun UltraSparc 60 with 512 MB memory • two 8GB and two 4GB SCSI disks • Software • Solaris 2.6 • Informix Dynamic Server 9.14 with the Universal DataServer Option • Geodetic and Shapes2 datablades

  15. Experimental Results Elapsed time to load data from the Monet catalog Elapsed time in creating R-tree indexes for the three datablades

  16. Experimental Results • Loading and Indexing all the 500M stars The numbers for the Relational, Geodetic, and Shapes2 schemas are estimates based on regression from measurements for up to 140K stars. • Beware: Results are not for the faint of heart!

  17. Experimental Results Loading times for the Monet catalog using the SimpleShape datablade

  18. Experimental Results Bulding an R-tree index for the Monet catalog using the SimpleShape datablade Page Reads Elapsed time

  19. Experimental Results • Number of page reads for various queries and schemas for 60K stars • The query window(s) used are 0.0666x0.0666 decimal degrees

  20. Experimental Results • Performance of the custom-datablade (schema) for queries against a table with 60K stars

  21. Efficient storage and retrieval • Indexing and clustering of data • Concurrency control • Security Informix DBMS IDL user functions • Data-analysis routines • Visualization capabilities RPC Calls • Data-analysis routines • Visualization capabilities IDL server IDL Extension for the Informix Server • Powerful data-mining applications can be built by providing an interface between an object-relational DBMS (Informix) and a statistical/scientific-computing package (IDL). • The IDL extension to Informix allows users to tap the data-analysis and visualization capabilities of IDL through Informix. • Implementation is based on UDFs and the RPC mechanism.

  22. Data type Last IDL Data type Cast returned statement returned function from IDL in script by function . Integer $RETURN pointer to int toInt . Double $RETURN pointer to float toFloat . Binary arr $BLOB LO_Handle toBlob . structure $RETURN pointer to ROW toRow of lvarchars 2D Array $RETURN ptr to collection toList of rows The return variable is a double, i.e. float in Informix. So we use toFloat as casting function. Columns of the table that are used as arguments by idl_exec.$1  x ; $2  y $RETURN used to return a double Select toFloat ( idl_exec (ROW( x , y ), "r = double( sqrt ( ($1)^2 + ($2)^2 ) ) ; $RETURN r ” ) ) from Points; IDL script • The IDL interface provides two main functions • idl_execis a simple user defined function that executes the idl-script for each row of the table • idl_aggis a user defined aggregate function. • The functions are used along with casting functions that cast the pointer returned, to appropriate data type. Computing distance from the origin using idl_exec

  23. Clustering using idl_agg:We use the CLUST_WTS function in IDL that computes the weights (the cluster centers) of an mxn array and returns an array of cluster centers. idl_agg, in this case, returns a collection of rows that can be made to look like a virtual table. The return variable is a 2D-array. toList is casting function for the pointer to collection of rows that idl_exec returns The initialization IDL script ; executed once for each group Iteration script; executed once for every row in the group; here it assigns values to the matrix select toList ( idl_agg( ROW(ra, dec), ROW( "count=-1", "count=count+1 &if count eq 0 then y = [$1, $2] else y = [[y], [$1,$2]]", "w = CLUST_WTS(y,N_CLUSTERS=3); $RETURN w ” ) )) :: list(ROW(a float, b float) not null) from monet_temp ; Final script; executed once at the end of each group; here it clusters data using some IDL routines $RETURN used to return a collection of rows of two doubles each. Casts the collection into required list.

  24. Since we have $BLOB, idl_agg returns an LO_Handle to the blob. Hence toBlob is used for casting. Iteration script select toBlob ( idl_agg (ROW( ra, dec, blue, red ) , ROW( 'flag=0&i=long(-1)&temp=dblarr(4,1000)', 'i=i+1&if i ge 1000 then begin if flag eq 0 then begin a=temp & flag=1&end else a=[[a],[temp]] & i=0 & end & temp(*,i) = [$1,$2,$3,$4]', 'if i gt 0 then begin if flag eq 0 then a=temp(*,0:i) else a=[[a], [temp(*,0:i)]] & end & b = MONET_JPEG(a(1,*),a(0,*),a(2,*),a(3,*)); $BLOB b’ ) ) ) from monet_all; Initialization script Final script $BLOB used to return a blob that contains the jpeg image Getting a plot using idl_agg: We use a IDL UDF, MONET_JPEG that creates a plot with 1st two arguments (ra and dec) as the axis and the next two (red & blue) determining the color of the point. MONET_JPEG returns the jpeg of the plot in binary. The binary array is returned as a blob by idl_agg.

More Related