1 / 11

SkyQuery : distributed probabilistic join infrastructure for astronomy

László Dobos 1,2 , Tamás Budavári 2 , Nolan Li 2 , Alex Szalay 2 , István Csabai 1 dobos@ complex.elte.hu , budavari @ jhu.edu 1 Eötvös Loránd University, Budapest, Hungary Department of Physics of Complex Systems

vashon
Download Presentation

SkyQuery : distributed probabilistic join infrastructure for astronomy

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. László Dobos1,2, Tamás Budavári2, Nolan Li2,Alex Szalay2, István Csabai1 dobos@complex.elte.hu , budavari@jhu.edu 1 Eötvös Loránd University, Budapest, HungaryDepartment of PhysicsofComplex Systems 2 The JohnsHopkins University, Baltimore, USADepartment of Physics & Astronomy 24th SSDBM conference, 25-27 June, 2012 ‒Chania, Crete, Greece SkyQuery: distributedprobabilisticjoininfrastructureforastronomy

  2. Crossmatching • Astronomicalcatalogs • in RDBMS • o(100 million) objects • o(1TB – 10TB) DB size • Donebycoordinates • RA, Dec • Astrometricerror • Differentskycoverage • Differentwavelengthrange • Movingobjects etc.

  3. The multiwavelengthsky infrared (2MASS) visible (DSS) ultraviolet (Galex)

  4. Problemdescription • Alldatain RDBMS • runcomputationinsidethedatabase • usemultiple servers and parallelize • must be transparentforusers • Astronomers„script” whattheydo • multiplere-runs, tweakparameters etc. • huge web forms: no-no • Use SQL toformulatetheproblem • functions and languageextensionstosupportastronomy • extra syntaxtodescribethecoordinate-basedprobabilisticjoin • spatialconstraints: celestialregions

  5. Sample SQL query SELECTs.objId, g.objID, t.objID, s.ra, s.dec, g.ra, g.dec, t.ra, t.dec, x.ra, x.decFROMSDSSDR7:GalaxiesAS sCROSS JOIN Galex:GalaxiesAS g CROSS JOIN TwoMASS:ExtendedSourcesAS tXMATCH BAYESIAN AS xMUST s ONPOINT(s.cx, s.cy, s.cz), 0.1MUST g ONPOINT(g.ra, g.dec), 0.2 MAY t ONPOINT(t.ra, t.dec), 0.5HAVING LIMIT 1e3 REGIONCIRCLE J2000 165.7, 0.3, 60 Standard SQL Probabilisticcrossmatch Spatialconstraint

  6. Queryexecution

  7. Howtoextend SQL • SQL is declarative • everythingcan be executedthatcanbeexpressed • extensions must be executableinanycase • Queryoptimization is hard • Design languagewitheasyoptimizationin mind • constrainonthelevel of thegrammar • customclausesinstead of complexwhereclauselogic

  8. DistributedSkyQueryArchitecture GraywulfDatabase Server Cluster Internet Remote VirtualObservatory Data Source SQL queries MyDB SkyQueryWeb Interface Job Scheduler XMATCH query ClusterRegistry SDSS × 2MASS = ?

  9. Modules I. • Registry • completedescription of the server cluster • frommachinegrouptodiskvolumes • containsallinfoforoptimaldatabaseallocation • Management tools • allocate, resize, copy, mirror etc. databases • monitor cluster status • Scheduler • co-locationawarequeryexecution • jobsimplementedasworkflows • .Net WorkflowFoundation (WF4) • parallel executionout-of-the-box • extensivelogging, persistence, retrylogic etc.

  10. Modules II. • SQL parsergenerator • supportsgrammarinheritence • easyto add customextensionstoplain SQL • Metadatatools • Tag SQL scriptswithmetadata • Makeitaccessiblefrom web interface • Extractprovenanceinformationfromuserqueries • User web interface • write and submitqueries • accesstoowndatabase (MyDB) • Jobs • Crossmatchworkflow, etc.

  11. Futurework • Currentsystem: focusonastronomy/crossmatch • Implementspatialconstraints • Extendto a genericframework + API • mirroring, sharding of datasets • querypartitioning • limited distributedjoins • transparentaccesstoremotedatasets • smartcaching of remotedata / queryresults

More Related