On-demand associations using database server clusters




  1. On-demand associations using database server clusters
     László Dobos, Tamás Budavári, Alex Szalay, István Csabai
     Eötvös University / JHU
     IDIES Inaugural Symposium, Baltimore

  2. Cross-match problem in astronomy
     • Astronomical catalogs in the TB range, O(100M) detections per catalog
     • Geographically distributed:
       • reliable, lightweight transfer protocol needed
       • should benefit from co-located datasets
     • Goals:
       • find the same object in every catalog
       • find drop-outs (requires complete description of footprints)
       • on-demand: do it quickly (< 5 min)
     • Matching primarily based on celestial coordinates
       • astrometric error
       • error can vary from object to object
     • Additional match criteria: size, color, etc.

  3. Cross-match problem in astronomy
     • The math:
       • Bayesian model selection [Budavári & Szalay 2008, "Probabilistic Cross-Identification of Astronomical Sources"]
       • First step: cut on distance
       • Including additional match criteria is easy and natural
       • Tested on simulations [Heinis et al. 2009]
     • The problems:
       • one-to-one matching of objects is expensive
       • trigonometric computations
       • IO intensive if the dataset is big: the right subset of the data always has to be kept in memory
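
The distance cut followed by a Bayes factor threshold can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes the two-catalog, small-error form of the Bayes factor, B = 2 / (σ1² + σ2²) · exp(−ψ² / (2 (σ1² + σ2²))), with the separation ψ and the astrometric errors σ1, σ2 in radians. The function names, the 5 arcsec pre-cut, and the choice of arcseconds for the per-catalog errors are assumptions made for the example.

```python
import math

ARCSEC = math.pi / (180.0 * 3600.0)  # one arcsecond in radians


def angular_separation(ra1, dec1, ra2, dec2):
    """Angular distance in radians between two points given in degrees (haversine form)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return 2 * math.asin(min(1.0, math.sqrt(h)))


def bayes_factor(psi, sigma1, sigma2):
    """Assumed two-catalog Bayes factor; psi, sigma1, sigma2 all in radians."""
    s2 = sigma1 ** 2 + sigma2 ** 2
    return 2.0 / s2 * math.exp(-psi ** 2 / (2.0 * s2))


def is_match(ra1, dec1, sigma1_arcsec, ra2, dec2, sigma2_arcsec,
             bf_min=1e3, max_sep_arcsec=5.0):
    """Cheap distance cut first, then the Bayes factor threshold."""
    psi = angular_separation(ra1, dec1, ra2, dec2)
    if psi > max_sep_arcsec * ARCSEC:  # hypothetical pre-cut radius
        return False
    bf = bayes_factor(psi, sigma1_arcsec * ARCSEC, sigma2_arcsec * ARCSEC)
    return bf > bf_min
```

The pre-cut keeps the exponential from being evaluated for pairs that are obviously too far apart, which is the role the "cut on distance" step plays before the full model comparison.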

  4. Hardware and data layout
     • JHU Graywulf cluster:
       • Dell PowerEdge 2950 + Dell PowerVault MD1000, 2 × PERC 5/E RAID controllers
       • 1.2-1.4 GB/sec nominal IO bandwidth, InfiniBand
       • 2 × 4-core Intel Xeon, 8-32 GB RAM
       • 5-20 machines partially assigned to the cross-match engine
     • Catalogs are mirrored on every node
     • User catalogs uploaded to / located at a dedicated node
     • Remote data sources (via various protocols)
     • Queries are partitioned and executed in parallel on every machine

  5. Xmatch definition language
     • A cross-match query:

         SELECT s.objId AS SobjID, s.ra, s.dec, g.ra, g.dec, j_m
         FROM SDSS:PhotoObjAll AS s
         CROSS JOIN GALEX:PhotoObjAll AS g
         XMATCH BAYESIAN AS x
             MUST s ON Point(s.ra, s.dec), 0.1
             MUST g ON Point(g.ra, g.dec), 0.5
             HAVING x.BF > 1e3
         WHERE s.type = 3
             AND s.ra BETWEEN 200 AND 210 AND s.dec BETWEEN -2 AND 2
             AND g.ra BETWEEN 200 AND 210 AND g.dec BETWEEN -2 AND 2

     • A partitioned query:

         SELECT s.ObjID
         FROM SDSS:PhotoObjAll s
         PARTITION ON Ra
         WHERE Ra BETWEEN 200 AND 210 AND Dec BETWEEN -5 AND 5

  6. Query execution 1
     • Parse:
       • proprietary SQL parser written from scratch
       • covers ~80% of SQL Server's SELECT statement grammar
       • extensions can be added easily by changing the BNF grammar
     • Job assignment (to be implemented):
       • determine sets of co-located catalogs using a central registry
       • send part of the cross-match job to a remote service
       • return only the cross-matched result, not the full raw datasets
       • merge result sets at any node
     • Partition:
       • cross-match queries: on right ascension
       • simple queries: on the specified column
       • partitions are determined from a histogram: a histogram query is executed on a subsample to get the metrics
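
Histogram-based partitioning can be illustrated with a short Python sketch. It is only one plausible reading of the slide, not the system's actual logic: a subsample of the partitioning column (right ascension here) is binned, and partition boundaries are placed where the cumulative count crosses equal fractions of the total. The function name `partition_bounds` and the parameters `n_parts` and `n_bins` are hypothetical.

```python
def partition_bounds(sample, lo, hi, n_parts, n_bins=100):
    """Split [lo, hi] into up to n_parts ranges with roughly equal sample counts.

    sample: values of the partitioning column taken from a subsample.
    Returns a list of (start, end) tuples covering [lo, hi] without gaps.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in sample:
        if lo <= v <= hi:
            counts[min(n_bins - 1, int((v - lo) / width))] += 1

    total = sum(counts)
    bounds = [lo]
    running = 0
    # The last bin is skipped because its right edge is hi, which is added below.
    for i, c in enumerate(counts[:-1]):
        running += c
        # Close a partition when the cumulative count reaches the next quantile.
        if total and len(bounds) < n_parts and running >= total * len(bounds) / n_parts:
            bounds.append(lo + (i + 1) * width)
    bounds.append(hi)
    return list(zip(bounds[:-1], bounds[1:]))
```

Each returned range could then be turned into a `WHERE Ra >= start AND Ra < end` filter for one worker node. With a strongly clustered sample the sketch may return fewer than `n_parts` ranges, and with an empty sample it returns the whole interval as a single range.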

  7. Query execution 2
     • Cache:
       • cache remote datasets
       • copy myDB tables to worker nodes
       • can benefit from filters defined in the query
     • Execute:
       • construct T-SQL queries
       • execute T-SQL queries on nodes in parallel
       • automatically retry on failure
     • Merge:
       • merge result sets
       • benefit from clever partitioning: no duplicates
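
The execute, retry, and merge steps can be sketched with Python's standard `concurrent.futures` module. The real system runs T-SQL on SQL Server nodes under Windows Workflow Foundation; here `task` is a hypothetical stand-in for "run one partition's query on a node", and the function names and the retry count are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_retry(task, partition, max_attempts=3):
    """Run task(partition), retrying on any exception up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(partition)
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error


def execute_partitions(task, partitions, max_workers=4, max_attempts=3):
    """Execute all partitions in parallel and concatenate their result rows."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_with_retry, task, p, max_attempts)
                   for p in partitions]
        merged = []
        # Collect in submission order so the merged output is deterministic.
        for future in futures:
            merged.extend(future.result())
    return merged
```

Because the partitions are disjoint ranges, the merge is a plain concatenation and no duplicate elimination is needed, which matches the "no duplicates" point above.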

  8. Applied technologies
     • Relational Database Management System:
       • SQL Server 2008
       • CLR integration with parallel execution support
     • Windows Workflow Foundation:
       • coordinates the complex execution workflow
       • transactions help keep the system consistent
       • parallel execution support
     • SMO (SQL Server Management Objects):
       • easy access to the database schema

  9. Zone algorithm
     • Zone algorithm:
       • Pure T-SQL: can leverage the SQL Server query optimizer
       • Divide the sphere into zones
       • ZoneID: a very simple hash on declination
       • Indexes built on ZoneID and right ascension allow very quick pre-filtering of match candidates
       • parallelizes very well on multi-core machines
     • [Gray, Szalay & Nieto-Santisteban 2006, The Zones Algorithm for Finding Points-Near-a-Point or Cross-Matching Spatial Datasets]
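
The zone idea can be shown outside the database with a small Python sketch. The original is pure T-SQL with indexes on ZoneID and right ascension; here a dictionary of RA-sorted lists plays the role of that index, and `bisect` stands in for the index range scan. The 30 arcsec default zone height, the function names, and the simple RA window formula are assumptions for the example, and wrap-around at RA 0/360 degrees is not handled.

```python
import bisect
import math
from collections import defaultdict


def zone_id(dec, zone_height):
    """Simple hash on declination: index of the horizontal stripe containing dec."""
    return int(math.floor((dec + 90.0) / zone_height))


def separation_deg(ra1, dec1, ra2, dec2):
    """Angular separation in degrees (haversine form); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def zone_match(cat_a, cat_b, radius, zone_height=30.0 / 3600.0):
    """Return sorted (i, j) index pairs with separation <= radius (degrees).

    cat_a, cat_b: lists of (ra, dec) tuples in degrees.
    """
    # Build the "index": zone -> (sorted RA list, rows sorted by RA).
    zones = defaultdict(list)
    for j, (ra, dec) in enumerate(cat_b):
        zones[zone_id(dec, zone_height)].append((ra, dec, j))
    index = {}
    for z, rows in zones.items():
        rows.sort()
        index[z] = ([r[0] for r in rows], rows)

    pairs = []
    for i, (ra, dec) in enumerate(cat_a):
        # The RA window widens toward the poles; adequate for small radii.
        cos_dec = math.cos(math.radians(min(89.999, abs(dec) + radius)))
        ra_window = radius / cos_dec
        z_lo = zone_id(dec - radius, zone_height)
        z_hi = zone_id(dec + radius, zone_height)
        for z in range(z_lo, z_hi + 1):
            if z not in index:
                continue
            ras, rows = index[z]
            start = bisect.bisect_left(ras, ra - ra_window)
            stop = bisect.bisect_right(ras, ra + ra_window)
            # Only the pre-filtered candidates get the exact distance test.
            for ra_b, dec_b, j in rows[start:stop]:
                if separation_deg(ra, dec, ra_b, dec_b) <= radius:
                    pairs.append((i, j))
    return sorted(pairs)
```

The expensive trigonometric test runs only on candidates that survive the zone and RA-range filters, which is what makes the pre-filtering step worthwhile on large catalogs.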

  10. Summary and future work
     • On-demand cross-matching is feasible
     • Parser and partitioning logic built for handling cross-match job descriptions
     • Workflow built for executing partitioned jobs
     • New technologies allow rapid development of complex workflows and high-performance data warehouses
     • Future work:
       • Develop GUI
       • Install and publish system
       • Add support for remote datasets
       • Add support to benefit from co-located datasets
