110 likes | 199 Views
SkyQuery is a distributed probabilistic join infrastructure for astronomy, enabling cross-matching of astronomical catalogs in RDBMS with millions of objects. It supports multi-wavelength sky coverage and moving objects, enhancing SQL with custom syntax for coordinate-based joins within spatial constraints. The architecture includes Graywulf server cluster, web interface, job scheduler, and query execution modules for efficient data processing. Future work aims to broaden its application beyond astronomy.
E N D
László Dobos1,2, Tamás Budavári2, Nolan Li2,Alex Szalay2, István Csabai1 dobos@complex.elte.hu , budavari@jhu.edu 1 Eötvös Loránd University, Budapest, HungaryDepartment of PhysicsofComplex Systems 2 The JohnsHopkins University, Baltimore, USADepartment of Physics & Astronomy 24th SSDBM conference, 25-27 June, 2012 ‒Chania, Crete, Greece SkyQuery: distributedprobabilisticjoininfrastructureforastronomy
Crossmatching • Astronomicalcatalogs • in RDBMS • o(100 million) objects • o(1TB – 10TB) DB size • Donebycoordinates • RA, Dec • Astrometricerror • Differentskycoverage • Differentwavelengthrange • Movingobjects etc.
The multiwavelengthsky infrared (2MASS) visible (DSS) ultraviolet (Galex)
Problemdescription • Alldatain RDBMS • runcomputationinsidethedatabase • usemultiple servers and parallelize • must be transparentforusers • Astronomers„script” whattheydo • multiplere-runs, tweakparameters etc. • huge web forms: no-no • Use SQL toformulatetheproblem • functions and languageextensionstosupportastronomy • extra syntaxtodescribethecoordinate-basedprobabilisticjoin • spatialconstraints: celestialregions
Sample SQL query SELECTs.objId, g.objID, t.objID, s.ra, s.dec, g.ra, g.dec, t.ra, t.dec, x.ra, x.decFROMSDSSDR7:GalaxiesAS sCROSS JOIN Galex:GalaxiesAS g CROSS JOIN TwoMASS:ExtendedSourcesAS tXMATCH BAYESIAN AS xMUST s ONPOINT(s.cx, s.cy, s.cz), 0.1MUST g ONPOINT(g.ra, g.dec), 0.2 MAY t ONPOINT(t.ra, t.dec), 0.5HAVING LIMIT 1e3 REGIONCIRCLE J2000 165.7, 0.3, 60 Standard SQL Probabilisticcrossmatch Spatialconstraint
Howtoextend SQL • SQL is declarative • everythingcan be executedthatcanbeexpressed • extensions must be executableinanycase • Queryoptimization is hard • Design languagewitheasyoptimizationin mind • constrainonthelevel of thegrammar • customclausesinstead of complexwhereclauselogic
DistributedSkyQueryArchitecture GraywulfDatabase Server Cluster Internet Remote VirtualObservatory Data Source SQL queries MyDB SkyQueryWeb Interface Job Scheduler XMATCH query ClusterRegistry SDSS × 2MASS = ?
Modules I. • Registry • completedescription of the server cluster • frommachinegrouptodiskvolumes • containsallinfoforoptimaldatabaseallocation • Management tools • allocate, resize, copy, mirror etc. databases • monitor cluster status • Scheduler • co-locationawarequeryexecution • jobsimplementedasworkflows • .Net WorkflowFoundation (WF4) • parallel executionout-of-the-box • extensivelogging, persistence, retrylogic etc.
Modules II. • SQL parsergenerator • supportsgrammarinheritence • easyto add customextensionstoplain SQL • Metadatatools • Tag SQL scriptswithmetadata • Makeitaccessiblefrom web interface • Extractprovenanceinformationfromuserqueries • User web interface • write and submitqueries • accesstoowndatabase (MyDB) • Jobs • Crossmatchworkflow, etc.
Futurework • Currentsystem: focusonastronomy/crossmatch • Implementspatialconstraints • Extendto a genericframework + API • mirroring, sharding of datasets • querypartitioning • limited distributedjoins • transparentaccesstoremotedatasets • smartcaching of remotedata / queryresults