1 / 34

Spatial Indexing

Spatial Indexing . Clive Page. Formats of Raw Data. Radio: Complex visibility for each polarisation at set of points sampling the complex ( u,v ) plane. Infra-red, Optical, Ultra-violet: Images from 1k ×1k to 18k×20k, collected at intervals of a few seconds to several minutes.

huela
Download Presentation

Spatial Indexing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Spatial Indexing Clive Page

  2. Formats of Raw Data • Radio: • Complex visibility for each polarisation at set of points sampling the complex (u,v) plane. • Infra-red, Optical, Ultra-violet: • Images from 1k×1k to 18k×20k, collected at intervals of a few seconds to several minutes. • X-ray, Gamma-ray: • Photon-even lists: get properties (x, y, time, energy) for each detected photon. May get a list of millions of these over an integration period of a few hours.

  3. Formats of Reduced Data • Images • Time-series • Spectra • Catalogues of astronomical objects: • Vital to cross-identify objects from different wavebands, basis for many subsequent data mining investigations. • Problem: tables can be long or wide:

  4. Main Functionality • SELECT objects in a given small patch of sky • In a rectangle – to cover same region as an image • In a circle – to cover a radius around point of interest (also known as search in a cone). • In a polygon – e.g. around extended object. • Spatial JOIN: Cross-match objects from, for example, two wavebands or two epochs. • Principal matching criterion is overlap of error-circles. • Often important to find objects in one table which are NOT matched in the other: need a left outer join.

  5. Functionality (continued) • Self JOIN useful in some applications, for example: • Search for clusters of galaxies • Search for double stars • Find spatial distribution or spatial correlation functions for particular classes of objects • Those observing with adaptive optics need a suitable reference star in or near each field, e.g. might want to find all AGNs near to bright stars to plan an observing campaign.

  6. Current on-line services • Search in a cone – many on-line services • Cross-matching (spatial JOIN) – few current examples: • Astrobrowse (at GSFC) has pre-computed joins for a limited number of important catalogues. • Vizier service (CDS) allows cross-match of user’s own table of positions with list of catalogues. • But results of list of N sources and M catalogues are presented as unmerged list of M*N separate tables. • Skyserver – joins between SDSS and other tables. • Self-join: do not know of any on-line services at present.

  7. Problems handling object catalogues • Positions use spherical-polar coordinates (RA, Dec) • Right Ascension corresponds to geographic longitude • Declination corresponds to geographic latitude • There are singularities at the poles and distortions in the scales everywhere except at the equator. • RA wraps from 24 hours (360 degrees) to zero. • All object positions are imprecise  positions have an error radius. • Distances between points must use a great-circle distance function not cartesian distance. • Two-dimensional indexing is really needed

  8. Indexing Possibilities • Home-brew slicing of the sky • Use B-tree on one spatial axis only • Use 1-d to 2-d mapping function then simple B-tree • Use true spatial index such as R-tree.

  9. Home-brew slicing of sky • Both USNO-B and 2MASS issued as separate files, each covering 0.1 degree in declination, sorted in RA within each strip. • Software from Harvard-Smithsonian Center for Astrophysics (WCSTOOLS) allows fairly efficient access to data stored in these strips. • WFCSTOOLS is used by many astronomical archives around the world, (including LEDAS www.ledas.ac.uk). • Ok for cone-search services, not suitable for spatial joins.

  10. Index one spatial axis only • Widely used, including Vizier collection of ~3000 astronomical tables. • Index on declination (avoids RA wrap-around problem) • Poor (but acceptable) performance on cone-search: • E.g. consider USNO-B: a table of a billion rows • Typical search/join uses a radius of say 3 arc-seconds. • 17% chance of finding a match at a random position. • Index on declination effectively searches a strip 360° x 6 arc-seconds: get around 10,000 rows matching. Need to check all these to find the (0 or 1) true matches. • Conclusion: might gain five orders of magnitude in efficiency by using a true 2-d index.

  11. One-dimensional index: spatial join problem • Very inefficient, and efficiency is needed when joining one large table with another • Join criterion is, typically: • SELECT * FROM cat1,cat2 WHERE ABS(dec1 - dec2) < combinedError AND … • I have not yet found any DBMS with an optimiser which uses an index when confronted with a join criterion of this form (or indeed any other expression for joining on an overlap of error ranges).

  12. Mapping functions (2-d to 1-d) • Cover the space with cells (pixels) and number them. • Create conventional B-tree on resulting set of integers. • Each point in the sky maps to a single integer. • An area maps to a setof integers. • A seductive idea: • If a small spatial area maps to a smallish rangeof integers, then a search over a spatial area might be done by a B-tree search over a small integer range. • Various space-filling curves have been used in the hope that this works in practice, e.g. Z-order index, Hilbert, Peano curves, and many others.

  13. Hilbert Curve

  14. Z-order (bit-interleaved) Mapping Function

  15. Space-filling Curves as Mapping Functions • Excellent performance when searching for single points. • For area searches, the median performance is adequate, but all curves suffer the same drawback: • There exist cases in which nearby points in space are very far apart on the curve (in the Z-order index this corresponds to a high-order bit flipping). • Performance tests confirm this defect: the worst-case performance is so abysmal that the average performance is very poor. • Another problem: simple cartesian grids also unsuited to spherical-polar coordinate searches as there are too many tiny distorted pixels near the poles.

  16. Better Mapping Functions Aim: cover sky uniformly with pixels - integer pixel-codes. • HEALPix - Hierarchical Equal Area iso-Latitude Pixelisation – invented at ESO for COBE. • Pseudo-square pixels, 2-d arrays of pixels suitable for analysis of large-scale spatial structures. • HTM - Hierarchical Triangular Mesh – invented at Johns Hopkins University – now used by several projects. • Triangular pixels. • Hierarchical numbering: high-order bits are a valid HTM index for coarser grid. • Both algorithms perfectly good when searching for points.

  17. Hierarchical Equal Area iso-Latitude Pixelisation (HEALPix)

  18. Hierarchical Triangular Mesh (HTM)

  19. Given table CAT1 with columns: ID1 – primary key RA DEC POSERR MAGNITUDE etc Given table CAT2 with columns: ID2 – primary key RA DEC POSERR FLUX etc Spatial Join using Pixel-code Method

  20. Given table CAT1 with columns: ID1 – primary key RA DEC POSERR MAGNITUDE etc Create table P1: ID1 – foreign key PCODE1 – primary key Note: P1 and P2 have extra rows where error-circle overlaps two or more pixels. Given table CAT2 with columns: ID2 – primary key RA DEC POSERR FLUX etc Create table P2: ID2 – foreign key PCODE2 – primary key Create tables P1, P2 with pixel-code column

  21. Given table CAT1 with columns: ID1 – primary key RA DEC POSERR MAGNITUDE etc Create table P1: ID1 – foreign key PCODE1 – primary key Create table PJOIN by joining P1 and P2 on PCODE1=PCODE2 ID1 ID2 Given table CAT2 with columns: ID2 – primary key RA DEC POSERR FLUX etc Create table P2: ID2 – foreign key PCODE2 – primary key Join P1 and P2 on pixel-codes creating PJOIN

  22. PJOIN table • Join using SELECT DISTINCT to remove any duplicates (two error-circles may each overlap two or more pixels). • Table PJOIN identifies all pixels where error-circles potentially overlap • Circles may or may not actually overlap, may just be nearby in the same pixels. • Next step: create B-tree index on PJOIN(ID1)

  23. Use PJOIN table to match catalogue rows • Three-way join then produces required results, e.g. SELECT cols FROM CAT1, PJOIN, CAT2 WHERE CAT1.ID1=PJOIN.ID1 AND PJOIN.ID2=CAT2.ID2 AND (2 * asin(sqrt(pow(sin((cat1.dec- cat2.dec)/2),2) + cos(cat1.dec) * cos(cat2.dec) * pow(sin((cat1.ra- cat2.ra)/2),2))) <= cat1.poserr+cat2.poserr) ; • This works, and speed appears good, but more testing is needed on a wider range of datasets.

  24. True Spatial Indexing • Hot topic of research in computer science departments for more than 20 years • Very many algorithms have been proposed: • BANG file, BV-tree, Buddy tree, Cell tree, G-tree, GBD-tree, Gridfile, hB-tree, kd-tree, LSD-tree, P-tree, PK-tree, PLOP hashing, Pyramid tree, Q0-tree, Quadtree, R-tree, SKD-tree, SR-tree, SS-tree, TV-tree, UB-tree. • So many alternatives, but none of them has properties as good as the B-tree in one dimension (e.g. compact and efficient, with fairly good worst-case performance). • R-tree one of the earliest structures, one of the best. • R-trees are built into several modern DBMS.

  25. Spatial Options in current DBMS

  26. Using R-trees • Must draw a rectangular box outside each error circle • Boxes get rather extended (along RA axis) near poles • Need a subsequent filter to remove spurious matches where rectangles overlap but error-circles do not. • If the error criterion alters (e.g. user wants 99% probability circle rather than 90%) need to recreate column of boxes, and then recreate the R-tree index. • Solution: always use a box as large as anyone could possible want, subsequent filtering on error-circles is still quite cheap.

  27. R-tree Performance • Postgres: R-tree indexing works as advertised • R-trees are large, creation is slow, e.g. 2 hours for table of 3.5 million rows. • MySQL: latest version allows R-trees on any polygon (but anything above 4 sides is wasteful). • Very verbose external data format • Works as advertised, not yet measured performance. • Informix: Kalpakis et al. (ADASS conference report) loaded part of USNO-A2 and found data load and R-tree creation would have taken 39 days for the entire table of 500 million rows.

  28. Comparison of Pixel-code and R-tree Methods • Advantages • Pcode join seems to be faster (but not yet benchmarked with identical systems). • Takes up less disc space in total. • Can use any DBMS, not just those with an R-tree or other spatial data option. • Disadvantages • Additional tables and indices have to be created. • More complex set of joins. • Needs external code as neither HTM or HEALPix can be expressed as an SQL-callable function (because they return a variable-length array of integers).

  29. Indexing: summary • Indexing on just one spatial axis is simply too inefficient for large tables and cannot support joins. • R-trees are powerful and easy to use, but index creation times are a serious cause for concern. • 2d1d mapping functions such as HTM or HEALPix are more complicated to use but may be faster in some cases. • AstroGrid will continue work in this area.

  30. Idea: HTM as Universal Position Locator? Can choose HTM (or HEALPix) resolution as high as needed • 32-bit words: pixels around 24 arcsec on a side. • More pixels than objects in largest current catalogue • 64-bit words: pixels around 0.4 milliarcsec on a side. • Ample precision for some time to come Maybe every object catalogue should list the UPL of each celestial object, allowing fast and easy searches and joins? • Fine for points in sky, but real objects have finite extent, and some will correspond to a more than one UPL • Some objects move – should pixel-code change with time? • Reference frame also moves

  31. UPLs for extended objects: possible work-arounds • Choose a pixel size large enough to encompass the error region in each case • In some cases this would be very large indeed • Different objects would have different pixel size • Have extra rows in tables for objects crossing pixel boundaries. • Need to alter original tables, not always acceptable. • Add variable-length vector to table to contain list of UPLs • Non-normalised table; not allowed by most DBMS • Introduce separate table to map object IDs to UPLs. • Adds significant complexity to search/join operations

  32. Speeding up Spatial Data Access Use Parallel Hardware such as Beowulf Clusters • We may be able to spread a large table over the disc drives of several nodes in a cluster for faster searches. • Some support for this is built into DBMS such as Oracle, DB2 and SQL Server. There is currently nothing to support it in Open Source products such as MySQL and Postgres. • Possible do-it-yourself approach: split table into separate DBMS instances on separate nodes. Farm out query to all instances, then combine the results. Column-oriented storage • Have tested Sybase-IQ: performance good, but expensive. No multi-dimensional indexing.

  33. Can we do Distributed Searches and Joins? • Object catalogues of interest are located on data archive servers all over the world. Is it feasible to search or join without copying entire tables? • Cone search services are becoming widespread. A few servers allow a single search to be farmed out to many servers, no attempt is made to merge the results, they are just concatenated. Room for improvement here? • An outer join requires every row in table#1 to be present in the output, so there is little point in attempting a join over the network, when it is simpler to copy table#1 to the host holding table#2. • Perhaps need to set up astronomical data warehouse with services to import tables and join them.

More Related