PS1 PSPS Object Data Manager Design

PSPS Critical Design Review

November 5-6, 2007

IfA

outline
Outline
  • ODM Overview
  • Critical Requirements Driving Design
  • Work Completed
  • Detailed Design
  • Spatial Querying [AS]
  • ODM Prototype [MN]
  • Hardware/Scalability [JV]
  • How Design Meets Requirements
  • WBS and Schedule
  • Issues/Risks

[AS] = Alex, [MN] = Maria, [JV] = Jan

odm overview
ODM Overview

The Object Data Manager will:

  • Provide a scalable data archive for the Pan-STARRS data products
  • Provide query access to the data for Pan-STARRS users
  • Provide detailed usage tracking and logging
odm driving requirements
ODM Driving Requirements
  • Total size: 100 TB
    • 1.5 × 10^11 P2 detections
    • 8.3 × 10^10 P2 cumulative-sky (stack) detections
    • 5.5 × 10^9 celestial objects
  • Nominal daily rate (divide by 3.5 × 365)
    • P2 detections: 120 Million/day
    • Stack detections: 65 Million/day
    • Objects: 4.3 Million/day
  • Cross-Match requirement: 120 Million / 12 hrs ~ 2800 / s
  • DB size requirement:
    • 25 TB / yr
    • ~100 TB by end of PS1 (3.5 yrs)
work completed so far
Work completed so far
  • Built a prototype
  • Scoped and built prototype hardware
  • Generated simulated data
    • 300M SDSS DR5 objects, 1.5B Galactic plane objects
  • Initial Load done – Created 15 TB DB of simulated data
    • Largest astronomical DB in existence today
  • Partitioned the data correctly using Zones algorithm
  • Able to run simple queries on distributed DB
  • Demonstrated critical steps of incremental loading
  • It is fast enough
    • Cross-match > 60k detections/sec
    • Required rate is ~3k/sec
detailed design
Detailed Design
  • Reuse SDSS software as much as possible
  • Data Transformation Layer (DX) – Interface to IPP
  • Data Loading Pipeline (DLP)
  • Data Storage (DS)
    • Schema and Test Queries
    • Database Management System
    • Scalable Data Architecture
    • Hardware
  • Query Manager (QM: CasJobs for prototype)
data transformation layer dx
Data Transformation Layer (DX)
  • Based on SDSS sqlFits2CSV package
    • LINUX/C++ application
    • FITS reader driven off header files
  • Convert IPP FITS files to
    • ASCII CSV format for ingest (initially)
    • SQL Server native binary later (3x faster)
  • Follow the batch and ingest verification procedure described in ICD
    • 4-step batch verification
    • Notification and handling of broken publication cycle
  • Deposit CSV or Binary input files in directory structure
    • Create “ready” file in each batch directory
  • Stage input data on LINUX side as it comes in from IPP
dx subtasks
DX Subtasks

[Work-breakdown diagram. DX subtasks:
  • Initialization: Job, FITS schema, FITS reader, CSV Converter, CSV Writer
  • Batch Ingest: Interface with IPP, Naming convention, Uncompress batch, Read batch, Verify Batch
  • Batch Conversion: CSV Converter, Binary Converter, "batch_ready", Interface with DLP
  • Batch Verification: Verify Manifest, Verify FITS Integrity, Verify FITS Content, Verify FITS Data, Handle Broken Cycle]

dx dlp interface
DX-DLP Interface
  • Directory structure on staging FS (LINUX):
    • Separate directory for each JobID_BatchID
    • Contains a “batch_ready” manifest file
      • Name, #rows and destination table of each file
    • Contains one file per destination table in ODM
      • Objects, Detections, other tables
  • Creation of “batch_ready” file is signal to loader to ingest the batch
  • Batch size and frequency of ingest cycle TBD
detailed design2
Detailed Design
  • Reuse SDSS software as much as possible
  • Data Transformation Layer (DX) – Interface to IPP
  • Data Loading Pipeline (DLP)
  • Data Storage (DS)
    • Schema and Test Queries
    • Database Management System
    • Scalable Data Architecture
    • Hardware
  • Query Manager (QM: CasJobs for prototype)
data loading pipeline dlp
Data Loading Pipeline (DLP)
  • sqlLoader – SDSS data loading pipeline
    • Pseudo-automated workflow system
    • Loads, validates and publishes data
      • From CSV to SQL tables
    • Maintains a log of every step of loading
    • Managed from Load Monitor Web interface
  • Has been used to load every SDSS data release
    • EDR, DR1-6, ~ 15 TB of data altogether
    • Most of it (since DR2) loaded incrementally
    • Kept many data errors from getting into database
      • Duplicate ObjIDs (symptom of other problems)
      • Data corruption (CSV format invaluable in catching this)
sqlloader design
sqlLoader Design
  • Existing functionality
    • Shown for SDSS version
    • Workflow, distributed loading, Load Monitor
  • New functionality
    • Schema changes
    • Workflow changes
    • Incremental loading
      • Cross-match and partitioning
sqlloader workflow
sqlLoader Workflow
  • Loading pipeline is a system of VB and SQL scripts, stored procedures and functions
  • Distributed design achieved with linked servers and SQL Server Agent
  • LOAD stage can be done in parallel by loading into temporary task databases
  • PUBLISH stage writes from task DBs to final DB
  • FINISH stage creates indices and auxiliary (derived) tables
data validation
Data Validation

[Diagram: validation steps]
  • Test Uniqueness of Primary Keys – test the unique key in each table
  • Test Foreign Keys – test for consistency of the keys that link tables
  • Test Cardinalities – test consistency of the numbers of various quantities
  • Test HTM IDs – test the Hierarchical Triangular Mesh IDs used for spatial indexing
  • Test Link Table Consistency – ensure that links are consistent

  • Tests for data integrity and consistency
  • Scrubs data and finds problems in upstream pipelines
  • Most of the validation can be performed within the individual task DB (in parallel)
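As an illustration of the first test, a minimal T-SQL sketch (table and column names are assumed here, not taken from the loader code) that flags duplicate primary keys in a task DB before it is published:

SELECT objID, COUNT(*) AS cnt
FROM   Objects
GROUP  BY objID
HAVING COUNT(*) > 1      -- any rows returned indicate duplicate ObjIDs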
distributed loading
Distributed Loading

[Architecture diagram: a LoadAdmin master server holds the master schema and the Load Monitor; slave load servers each carry a LoadSupport database with a view of the master schema and one or more task DBs; CSV/binary input files are Samba-mounted; Publish and Finish steps write the task data into the publish schema/database.]

schema changes
Schema Changes
  • Schema in task and publish DBs is driven off a list of schema DDL files to execute (xschema.txt)
  • Requires replacing DDL files in schema/sql directory and updating xschema.txt with their names
  • PS1 schema DDL files have already been built
  • Index definitions have also been created
  • Metadata tables will be automatically generated using metadata scripts already in the loader
workflow changes
Workflow Changes

  • Cross-Match and Partition steps will be added to the workflow
  • Cross-match will match detections to objects
  • Partition will horizontally partition data, move it to slice servers, and build DPVs on main

[Workflow diagram, LOAD stage: Export, Check CSVs, Create Task DBs, Build SQL Schema, Validate, XMatch, Partition; followed by PUBLISH.]

matching detections with objects
Matching Detections with Objects
  • Algorithm described fully in prototype section
  • Stored procedures to cross-match detections will be part of the LOAD stage in loader pipeline
  • Vertical partition of Objects table kept on load server for matching with detections
  • Zones cross-match algorithm used to do 1” and 2” matches
  • Detections with no matches saved in Orphans table
xmatch and partition data flow
XMatch and Partition Data Flow

[Data-flow diagram: incoming Detections_In and LinkToObj_In land in the loadsupport DB and are cross-matched (XMatch) against a zoned copy of the Objects table (ObjZoneIndx), producing P2 links and Orphans; results are merged into chunk tables (Detections_chunk, LinkToObj_chunk), pulled to the slice partitions (Detections_m, LinkToObj_m, Objects_m), switched into the partitioned tables, and object updates are propagated to the PS1 main database (Objects, LinkToObj).]

detailed design3
Detailed Design
  • Reuse SDSS software as much as possible
  • Data Transformation Layer (DX) – Interface to IPP
  • Data Loading Pipeline (DLP)
  • Data Storage (DS)
    • Schema and Test Queries
    • Database Management System
    • Scalable Data Architecture
    • Hardware
  • Query Manager (QM: CasJobs for prototype)
data storage test queries
Data Storage – Test Queries
  • Drawn from several sources
    • Initial set of SDSS 20 queries
    • SDSS SkyServer Sample Queries
    • Queries from PS scientists (Monet, Howell, Kaiser, Heasley)
  • Two objectives
    • Find potential holes/issues in schema
    • Serve as test queries
      • Test DBMS integrity
      • Test DBMS performance
  • Loaded into CasJobs (Query Manager) as sample queries for prototype
data storage dbms
Data Storage – DBMS
  • Microsoft SQL Server 2005
    • Relational DBMS with excellent query optimizer
  • Plus
    • Spherical/HTM (C# library + SQL glue)
      • Spatial index (Hierarchical Triangular Mesh)
    • Zones (SQL library)
      • Alternate spatial decomposition with dec zones
    • Many stored procedures and functions
      • From coordinate conversions to neighbor search functions
    • Self-extracting documentation (metadata) and diagnostics
data storage scalable architecture
Data Storage – Scalable Architecture
  • Monolithic database design (a la SDSS) will not do it
  • SQL Server does not have cluster implementation
    • Do it by hand
  • Partitions vs Slices
    • Partitions are file-groups on the same server
      • Parallelize disk accesses on the same machine
    • Slices are data partitions on separate servers
    • We use both!
  • Additional slices can be added for scale-out
  • For PS1, use SQL Server Distributed Partition Views (DPVs)
distributed partitioned views
Distributed Partitioned Views
  • Difference between DPVs and file-group partitioning
    • FG on same database
    • DPVs on separate DBs
    • FGs are for scale-up
    • DPVs are for scale-out
  • Main server has a view of a partitioned table that includes remote partitions (we call them slices to distinguish them from FG partitions)
  • Accomplished with SQL Server’s linked server technology
  • NOT truly parallel, though
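A hedged sketch of what such a view can look like (linked-server, database, and table names are illustrative, not the actual PS1 configuration); each member table carries a CHECK constraint on its objID range so the optimizer can prune slices that cannot contain the requested rows:

-- On the head node: glue the per-slice detection tables together.
-- Each Detections_Sx table has CHECK (objID BETWEEN <lo> AND <hi>).
CREATE VIEW Detections
AS
SELECT * FROM Slice1.PS1_S1.dbo.Detections_S1   -- linked server Slice1
UNION ALL
SELECT * FROM Slice2.PS1_S2.dbo.Detections_S2   -- linked server Slice2
UNION ALL
SELECT * FROM Slice3.PS1_S3.dbo.Detections_S3   -- linked server Slice3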
scalable data architecture
Scalable Data Architecture

[Diagram: the Head node database holds the Objects table and the Detections DPV over Detections_S1, Detections_S2 and Detections_S3; slice servers S1–S3 each hold their local Detections_Sx table together with replicated Objects_Sx tables.]

  • Shared-nothing architecture
  • Detections split across cluster
  • Objects replicated on Head and Slice DBs
  • DPVs of Detections tables on the Headnode DB
  • Queries on Objects stay on head node
  • Queries on detections use only local data on slices
hardware prototype
Hardware - Prototype

[Hardware table for the prototype cluster. Server naming convention: LX = Linux, L = load server, S/Head = DB server, M = MyDB server, W = web server; PS0x = 4-core, PS1x = 8-core. Servers: LX (PS01), W (PS02), S2 (PS03), S3 (PS04), L2/M (PS05), Head (PS11), S1 (PS12), L1 (PS13). Storage building blocks: 10A = 10 × [13 × 750 GB] and 3B = 3 × [12 × 500 GB]. Total space by function: roughly 39 TB for the DB servers, 10 TB for loading, 9 TB for MyDB; database volumes on RAID10, the remainder on RAID5; disk/rack configurations 12D/4W and 14D/3.5W.]

hardware ps1
Hardware – PS1

[Diagram: three copies of the main and slice databases rotate between Live (serving queries), Offline (taking ingest), and Spare roles; after ingest finishes, the offline copy is replicated and the roles are switched, so queries always run against an up-to-date Live copy.]

  • Ping-pong configuration to maintain high availability and query performance
  • 2 copies of each slice and of main (head) node database on fast hardware (hot spares)
  • 3rd spare copy on slow hardware (can be just disk)
  • Updates/ingest on offline copy then switch copies when ingest and replication finished
  • Synchronize second copy while first copy is online
  • Both copies live when no ingest
  • 3x basic config. for PS1
detailed design4
Detailed Design
  • Reuse SDSS software as much as possible
  • Data Transformation Layer (DX) – Interface to IPP
  • Data Loading Pipeline (DLP)
  • Data Storage (DS)
    • Schema and Test Queries
    • Database Management System
    • Scalable Data Architecture
    • Hardware
  • Query Manager (QM: CasJobs for prototype)
query manager
Query Manager
  • Based on SDSS CasJobs
  • Configure to work with distributed database, DPVs
  • Direct links (contexts) to slices can be added later if necessary
  • Segregates quick queries from long ones
  • Saves query results server-side in MyDB
  • Gives users a powerful query workbench
  • Can be scaled out to meet any query load
  • PS1 Sample Queries available to users
  • PS1 Prototype QM demo
odm prototype components
ODM Prototype Components
  • Data Loading Pipeline
  • Data Storage
  • CasJobs
    • Query Manager (QM)
    • Web Based Interface (WBI)
  • Testing
common spatial questions
Common Spatial Questions

Points in region queries

  • Find all objects in this region
  • Find all “good” objects (not in masked areas)
  • Is this point in any of the regions

Region in region

  • Find regions near this region and their area
  • Find all objects with error boxes intersecting region
  • What is the common part of these regions

Various statistical operations

  • Find the object counts over a given region list
  • Cross-match these two catalogs in the region
sky coordinates of points
Sky Coordinates of Points
  • Many different coordinate systems
    • Equatorial, Galactic, Ecliptic, Supergalactic
  • Longitude-latitude constraints
  • Searches often in mix of different coordinate systems
    • gb>40 and dec between 10 and 20
    • Problem: coordinate singularities, transformations
  • How can one describe constraints in an easy, uniform fashion?
  • How can one perform fast database queries in an easy fashion?
    • Fast: indexes
    • Easy: simple query expressions
describing regions
Describing Regions

Spacetime metadata for the VO (Arnold Rots)

  • Includes definitions of
    • Constraint: single small or great circle
    • Convex: intersection of constraints
    • Region: union of convexes
  • Support both angles and Cartesian descriptions
  • Constructors for
    • CIRCLE, RECTANGLE, POLYGON, CONVEX HULL
  • Boolean algebra (INTERSECTION, UNION, DIFF)
  • Proper language to describe the abstract regions
  • Similar to GIS, but much better suited for astronomy
we do spatial 3 ways
We Do Spatial 3 Ways
  • Hierarchical Triangular Mesh (extension to SQL)
    • Uses table valued functions
    • Acts as a new “spatial access method”
  • Zones: fits SQL well
    • Surprisingly simple & good
  • 3D Constraints: a novel idea
    • Algebra on regions, can be implemented in pure SQL
ps1 footprint
PS1 Footprint
  • Using the projection cell definitions as centers for tessellation (T. Budavari)
crossmatch zone approach
CrossMatch: Zone Approach
  • Divide space into declination zones
  • Objects ordered by zoneid, ra (on the sphere need wrap-around margin.)
  • Point searches look in neighboring zones within a ~(ra ± Δ) bounding box
  • All inside the relational engine
  • Avoids “impedance mismatch”
  • Can “batch” comparisons
  • Automatically parallel
  • Details in Maria’s thesis
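A minimal sketch of such a zone search for a single point, assuming a zoned objects table ObjZone(objID, ra, dec, zoneID) clustered on (zoneID, ra); the RA window is widened by alpha = r / cos(dec), and the wrap-around at ra = 0/360 is omitted for brevity:

DECLARE @ra float, @dec float, @r float, @zoneHeight float, @alpha float
SET @ra = 101.287
SET @dec = -16.716
SET @r = 2.0 / 3600.0                 -- 2 arcsec search radius
SET @zoneHeight = 30.0 / 3600.0       -- assumed 30 arcsec zones
SET @alpha = @r / COS(RADIANS(ABS(@dec) + @r))

SELECT o.objID, o.ra, o.[dec]
FROM   ObjZone o
WHERE  o.zoneID BETWEEN FLOOR((@dec - @r + 90.0) / @zoneHeight)
                    AND FLOOR((@dec + @r + 90.0) / @zoneHeight)
  AND  o.ra    BETWEEN @ra  - @alpha AND @ra  + @alpha
  AND  o.[dec] BETWEEN @dec - @r     AND @dec + @r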

[Figure: zone geometry — search radius r, zone height (zoneMax), and the widened ra ± Δ search window.]

indexing using quadtrees
Indexing Using Quadtrees

[Figure: quadtree subdivision — trixel 2 is split into children 20, 21, 22, 23; trixel 22 into 220–223; trixel 23 into 2,3,0 – 2,3,3.]

  • Cover the sky with hierarchical pixels
  • COBE – start with a cube
  • Hierarchical Triangular Mesh (HTM) uses trixels
    • Samet, Fekete
  • Start with an octahedron, and split each triangle into 4 children, down to 20 levels deep
  • Smallest triangles are 0.3”
  • Each trixel has a unique htmID
space filling curve
Space-Filling Curve

[Figure: the HTM labels define a space-filling curve — trixels 100–133 map to nested intervals such as [0.12, 0.13), [0.120, 0.121), [0.121, 0.122), [0.122, 0.123), [0.123, 0.130).]

Triangles correspond to ranges

All points inside the triangle are inside the range.

sql htm extension
SQL HTM Extension
  • Every object has a 20-deep htmID (44 bits)
  • Clustered index on htmID
  • Table-valued functions for spatial joins
    • Given a region definition, routine returns up to 10 ranges of covering triangles
    • Spatial query is mapped to ~10 range queries
  • Current implementation rewritten in C#
  • Excellent performance, little calling overhead
  • Three layers
    • General geometry library
    • HTM kernel
    • IO (parsing + SQL interface)
writing spatial sql
Writing Spatial SQL

-- region description is contained by @area
DECLARE @cover TABLE (htmStart bigint, htmEnd bigint)
INSERT @cover SELECT * FROM dbo.fHtmCover(@area)
--
DECLARE @region TABLE (convexId bigint, x float, y float, z float, c float)
INSERT @region SELECT * FROM dbo.fGetHalfSpaces(@area)
--
SELECT o.ra, o.dec, 1 AS flag, o.objid
FROM (SELECT objID AS objid, cx, cy, cz, ra, [dec]
      FROM Objects q JOIN @cover AS c
        ON q.htmID BETWEEN c.htmStart AND c.htmEnd
     ) AS o
WHERE NOT EXISTS (
      SELECT p.convexId
      FROM @region AS p
      WHERE (o.cx*p.x + o.cy*p.y + o.cz*p.z < p.c)
      GROUP BY p.convexId
      )

status
Status
  • All three libraries extensively tested
  • Zones used for Maria’s thesis, plus various papers
  • New HTM code in production use since July on SDSS
  • Same code also used by STScI HLA, Galex
  • Systematic regression tests developed
  • Footprints computed for all major surveys
  • Complex mask computations done on SDSS
  • Loading: zones used for bulk crossmatch
  • Ad hoc queries: use HTM-based search functions
  • Excellent performance
ps1 psps object data manager design1

PS1 PSPS Object Data Manager Design

PSPS Critical Design Review

November 5-6, 2007

IfA

detail design
Detail Design
  • General Concepts
  • Distributed Database architecture
  • Ingest Workflow
  • Prototype
zones
Zones

[Figure: the sky divided into declination zones (Declination vs. Right Ascension).]

Zones (spatial partitioning and indexing algorithm)

  • Partition and bin the data into declination zones
    • ZoneID = floor ((dec + 90.0) / zoneHeight)
  • Few tricks required to handle spherical geometry
  • Place the data close on disk
    • Cluster Index on ZoneID and RA
  • Fully implemented in SQL
  • Efficient
    • Nearby searches
    • Cross-Match (especially)
  • Fundamental role in addressing the critical requirements
    • Data volume management
    • Association Speed
    • Spatial capabilities
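A minimal T-SQL sketch of the idea (the table and column names are assumptions for illustration): compute each object's zone and cluster on (zoneID, ra) so that nearby objects sit close together on disk.

DECLARE @zoneHeight float
SET @zoneHeight = 30.0 / 3600.0              -- e.g. 30 arcsec zones (assumption)

-- ZoneID = floor((dec + 90.0) / zoneHeight)
SELECT objID, ra, [dec],
       CAST(FLOOR(([dec] + 90.0) / @zoneHeight) AS int) AS zoneID
INTO   ObjZone                               -- hypothetical zoned copy of Objects
FROM   Objects

-- Clustered index places the data in (zoneID, ra) order on disk
CREATE CLUSTERED INDEX ix_ObjZone_zone_ra ON ObjZone (zoneID, ra)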
zoned table
Zoned Table

ZoneID = floor ((dec + 90.0) / zoneHeight)

* ZoneHeight = 8 arcsec in this example

sql crossneighbors
SQL CrossNeighbors

SELECT *
FROM prObj1 z1
JOIN zoneZone ZZ ON ZZ.zoneID1 = z1.zoneID
JOIN prObj2 z2 ON ZZ.zoneID2 = z2.zoneID
WHERE z2.ra  BETWEEN z1.ra  - ZZ.alpha AND z1.ra  + ZZ.alpha
  AND z2.dec BETWEEN z1.dec - @r AND z1.dec + @r
  AND (z1.cx*z2.cx + z1.cy*z2.cy + z1.cz*z2.cz) > cos(radians(@r))

partitions
Partitions
  • SQL Server 2005 introduces technology to handle tables which are partitioned across different disk volumes and managed by a single server.
  • Partitioning makes management and access of large tables and indexes more efficient
    • Enables parallel I/O
    • Reduces the amount of data that needs to be accessed
    • Related tables can be aligned and collocated in the same place speeding up JOINS
partitions1
Partitions
  • 2 key elements
    • Partitioning function
      • Specifies how the table or index is partitioned
    • Partitioning schemas
      • Using a partitioning function, the schema specifies the placement of the partitions on file groups
  • Data can be managed very efficiently using Partition Switching
    • Add a table as a partition to an existing table
    • Switch a partition from one partitioned table to another
    • Reassign a partition to form a single table
  • Main requirement
    • The table must be constrained on the partitioning column
partitions2
Partitions
  • For the PS1 design,
    • Partitions mean File Group Partitions
    • Tables are partitioned into ranges of ObjectID, which correspond to declination ranges.
    • ObjectID boundaries are selected so that each partition has a similar number of objects.
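An illustrative T-SQL sketch of such range partitioning on ObjectID (boundary values, filegroup names, and the table definition are invented for the example; they are not the PS1 values):

-- Partitioning function: ranges of ObjectID (corresponding to declination ranges)
CREATE PARTITION FUNCTION pfObjID (bigint)
AS RANGE RIGHT FOR VALUES (100000000000000000, 200000000000000000)

-- Partitioning scheme: place each range on its own file group
CREATE PARTITION SCHEME psObjID
AS PARTITION pfObjID TO (fg1, fg2, fg3)

-- Table constrained (partitioned) on the partitioning column
CREATE TABLE Objects
( objID bigint NOT NULL PRIMARY KEY,
  ra float, [dec] float )
ON psObjID (objID)

-- Partition switching is then a metadata-only operation, e.g.:
-- ALTER TABLE Objects_staging SWITCH TO Objects PARTITION 2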
distributed partitioned views1
Distributed Partitioned Views
  • Tables participating in the Distributed Partitioned View (DPV) reside in different databases, which may live on different instances or on different (linked) servers
concept slices
Concept: Slices
  • In the PS1 design, the bigger tables will be partitioned across servers
  • To avoid confusion with the File Group Partitioning, we call them “Slices”
  • Data is glued together using Distributed Partitioned Views
  • The ODM will manage slices. Using slices improves system scalability.
  • For PS1 design, tables are sliced into ranges of ObjectID, which correspond to broad declination ranges. Each slice is subdivided into partitions that correspond to narrower declination ranges.
  • ObjectID boundaries are selected so that each slice has a similar number of objects.
detail design outline
Detail Design Outline
  • General Concepts
  • Distributed Database architecture
  • Ingest Workflow
  • Prototype
PS1 Distributed DB system

[Architecture diagram: load servers LoadSupport1..n (coordinated by LoadAdmin) hold per-loader detections, objZoneIndx, Detections_l, LnkToObj_l and orphans_l tables; over linked servers they feed the slice servers P1..Pm, each holding its partitioned [Objects_p], [LnkToObj_p] and [Detections_p] tables plus PartitionsMap and Meta; the PS1 main database holds the full Objects, LnkToObj, Meta and PartitionsMap tables and the Detections partitioned view, and is reached through the Query Manager (QM) and Web Based Interface (WBI). Legend: database, full table, [partitioned table], output table, partitioned view.]

design decisions objid
Design Decisions: ObjID
  • Objects have their positional information encoded in their objID
    • fGetPanObjID (ra, dec, zoneH)
    • ZoneID is the most significant part of the ID
  • It gives scalability, performance, and spatial functionality
  • Object tables are range partitioned according to their object ID
objectid clusters data spatially
ObjectID Clusters Data Spatially

Dec = –16.71611583, RA = 101.287155, ZH = 0.008333
ZID = (Dec + 90) / ZH = 08794.0661
ObjectID = 087941012871550661
ObjectID is unique when objects are separated by >0.0043 arcsec
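A hedged reconstruction of the encoding that reproduces the example above (the exact digit layout is an assumption, not taken from the ODM code): the zone number, the RA in microdegrees, and the object's fractional offset within its zone are concatenated into one bigint.

DECLARE @ra float, @dec float, @zh float, @zid float, @zoneID bigint
SET @ra = 101.287155
SET @dec = -16.71611583
SET @zh = 1.0 / 120.0                            -- 30 arcsec zones
SET @zid = (@dec + 90.0) / @zh                   -- 8794.0661
SET @zoneID = FLOOR(@zid)

SELECT @zoneID * CAST(10000000000000 AS bigint)            -- zone:   08794
     + CAST(ROUND(@ra * 1000000, 0) AS bigint) * 10000     -- RA:     101287155
     + CAST(ROUND((@zid - @zoneID) * 10000, 0) AS bigint)  -- offset: 0661
       AS objID                                            -- = 87941012871550661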

design decisions detectid
Design Decisions: DetectID
  • Detections have their positional information encoded in the detection identifier
    • fGetDetectID (dec, observationID, runningID, zoneH)
    • Primary key (objID, detectionID), to align detections with objects within partitions
    • Provides efficient access to all detections associated to one object
    • Provides efficient access to all detections of nearby objects
detectionid clusters data in zones
DetectionID Clusters Data in Zones

Dec = –16.71611583, ZH = 0.008333
ZID = (Dec + 90) / ZH = 08794.0661
DetectID = 0879410500001234567
ObservationID = 1050000
Running ID = 1234567
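The same idea for detections, again as an assumed layout that reproduces the example above: the zone number, then the ObservationID, then the running ID within the exposure.

DECLARE @zoneID bigint, @obsID bigint, @runningID bigint
SET @zoneID = 8794
SET @obsID = 1050000
SET @runningID = 1234567

SELECT @zoneID * CAST(100000000000000 AS bigint)   -- zone * 10^14
     + @obsID  * 10000000                          -- observation * 10^7
     + @runningID AS detectID                      -- = 879410500001234567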

odm capacity
ODM Capacity

5.3.1.3 The PS1 ODM shall be able to ingest into the ODM a total of
  • 1.5 × 10^11 P2 detections
  • 8.3 × 10^10 cumulative-sky (stack) detections
  • 5.5 × 10^9 celestial objects
together with their linkages.

What goes into the main Server

[Diagram: the PS1 main database holds the full Objects, LnkToObj, Meta and PartitionsMap tables, fed over linked servers from the slices P1..Pm. Legend: database, full table, [partitioned table], output table, distributed partitioned view.]

What goes into slices

[Diagram: each slice Pk holds its partitioned [Objects_pk], [LnkToObj_pk] and [Detections_pk] tables together with copies of PartitionsMap and Meta, while the PS1 main database keeps the full Objects, LnkToObj, Meta and PartitionsMap tables.]

duplication of objects lnktoobj
Duplication of Objects & LnkToObj
  • Objects are distributed across slices
  • Objects, P2ToObj, and StackToObj are duplicated in the slices to parallelize “inserts” & “updates”
  • Detections belong in their object's slice
  • Orphans belong to the slice where their position would allocate them
    • Orphans near slices’ boundaries will need special treatment
  • Objects keep their original object identifier
    • Even though positional refinement might change their zoneID and therefore the most significant part of their identifier
Glue = Distributed Views

[Diagram: the Detections distributed partitioned view in the PS1 main database glues together the per-slice [Detections_p1]..[Detections_pm] tables on the linked slice servers P1..Pm; the main database also holds the full Objects, LnkToObj, Meta and PartitionsMap tables.]

Partitioning in Main Server
  • Main server is partitioned (Objects) and collocated (LnkToObj) by objID
  • Slices are partitioned (Objects) and collocated (LnkToObj) by objID

[Diagram: slice servers P1..Pm are linked to the PS1 main database, which is accessed through the Query Manager (QM) and Web Based Interface (WBI).]

detail design outline1
Detail Design Outline
  • General Concepts
  • Distributed Database architecture
  • Ingest Workflow
  • Prototype
PS1 Distributed DB system

[The distributed-database architecture diagram shown earlier (load servers, slice servers P1..Pm, PS1 main database, Query Manager and Web Based Interface) is repeated here as context for the ingest workflow.]

insert update
“Insert” & “Update”
  • SQL Insert and Update are expensive operations due to logging and re-indexing
  • In the PS1 design, Insert and Update have been re-factored into sequences of:

Merge + Constrain + Switch Partition

  • Frequency
    • f1: daily
    • f2: at least monthly
    • f3: TBD (likely to be every 6 months)
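A hedged T-SQL sketch of the Merge + Constrain + Switch pattern (all table names and the objID range are invented; the switched tables must have identical structure and live on the filegroup of the target partition):

-- 1. Merge: build the replacement partition on the side
SELECT *
INTO   Objects_p2_new
FROM ( SELECT * FROM Objects_p2_current
       UNION ALL
       SELECT * FROM Objects_batch ) AS m

-- 2. Constrain: add the primary key and the range check the partition requires
ALTER TABLE Objects_p2_new
  ADD CONSTRAINT pk_Objects_p2_new PRIMARY KEY (objID),
      CONSTRAINT ck_Objects_p2_new CHECK (objID >= 100000000000000000
                                      AND objID <  200000000000000000)

-- 3. Switch: metadata-only swap of the rebuilt table into the live table
ALTER TABLE Objects SWITCH PARTITION 2 TO Objects_p2_old     -- move old data out
ALTER TABLE Objects_p2_new SWITCH TO Objects PARTITION 2     -- move new data in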
ingest workflow
Ingest Workflow

[Workflow diagram: CSV detections are zoned (DZone) and cross-matched against ObjectsZ at 1" (X(1"), intermediate table DXO_1a); non-matches (NoMatch) are re-matched at 2" (X(2"), DXO_2a) and resolved (Resolve Detect); the outputs are P2PsfFits, P2ToObj links, and Orphans for detections that match nothing.]

ingest @ frequency f1
Ingest @ frequency = f1

[Data-flow diagram: daily (f1) ingest. The LOADER cross-matches new detections against ObjectsZ, producing P2PsfFits, P2ToObj and Orphans; the per-slice tables (Objects_1, P2ToObj_1, P2ToPsfFits_1, Stack*_1, Orphans_1) are shipped to SLICE_1, and the link tables plus metadata (P2ToObj, StackToObj, Objects, Metadata+) go to MAIN.]

updates @ frequency f2
Updates @ frequency = f2

[Data-flow diagram: at frequency f2, accumulated updates to Objects, P2ToObj and StackToObj are pushed from the LOADER and SLICE_1 (Objects_1, P2ToObj_1, P2ToPsfFits_1, Stack*_1, Orphans_1) to MAIN, together with Metadata+.]

Updates @ frequency = f2 (continued)

[Second part of the f2 update data flow, showing the refreshed Objects tables being synchronized between MAIN, SLICE_1 and the LOADER.]

snapshots @ frequency f3
Snapshots @ frequency = f3

[Data-flow diagram: at frequency f3, a Snapshot of Objects, P2ToObj and StackToObj (plus Metadata+) is created on MAIN.]

batch update of a partition
Batch Update of a Partition

[Diagram: batch update of a partition — the existing partitions A1–A3 are merged with the incoming data via SELECT INTO ... WHERE into new tables B1–B3, primary-key indexes are built on them, and each Bx is switched in to replace the corresponding partition (the Merge + Constrain + Switch pattern sketched earlier).]

Scaling-out
  • Apply the ping-pong strategy to satisfy query performance during ingest: 2 × (1 main + m slices)

[Diagram: two copies of the PS1 main database and of every slice P1..Pm, glued together by distributed partitioned views over linked servers and accessed through the Query Manager (QM). Legend: database / duplicate database, full table, [partitioned table], partitioned view / duplicate partitioned view.]

Scaling-out
  • More robustness, fault tolerance, and reliability call for: 3 × (1 main + m slices)

[Diagram: as above, with a third copy of the main database and of every slice.]

slide95

Adding New slices

SQL Server range partitioning capabilities make it easy

  • Recalculate partitioning limits
  • Transfer data to new slices
  • Remove data from slices
  • Define and apply new partitioning schema
  • Add new partitions to main server
  • Apply new partitioning schema to main server
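In terms of the illustrative pfObjID/psObjID partition function and scheme sketched earlier (with an invented boundary value and filegroup name), adding the range for a new slice could look like:

-- Tell the scheme where the next range will live, then split the function
ALTER PARTITION SCHEME psObjID NEXT USED fg_new
ALTER PARTITION FUNCTION pfObjID() SPLIT RANGE (300000000000000000)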
detail design outline2
Detail Design Outline
  • General Concepts
  • Distributed Database architecture
  • Ingest Workflow
  • Prototype
odm ingest performance
ODM Ingest Performance

5.3.1.6 The PS1 ODM shall be able to ingest the data from the IPP at two times the nominal daily arrival rate.*

* The nominal daily data rate from the IPP is defined as the total data volume to be ingested annually by the ODM divided by 365.

  • Nominal daily data rate:
    • 1.5 × 10^11 / 3.5 / 365 = 1.2 × 10^8 P2 detections / day
    • 8.3 × 10^10 / 3.5 / 365 = 6.5 × 10^7 stack detections / day
number of objects
Number of Objects

* “SDSS” includes a mirror of the 11.3 < dec < 30 objects to dec < 0

Total GB of csv loaded data: 300 GB

CSV Bulk insert load: 8 MB/s

Binary Bulk insert: 18-20 MB/s

Creation started: October 15th 2007

Finished: October 29th 2007 (??)

Includes

  • 10 epochs of P2PsfFits detections
  • 1 epoch of Stack detections
size of prototype database
Size of Prototype Database

Table sizes are in billions of rows

size of prototype database1
Size of Prototype Database

Table sizes are in GB

9.6 TB of data in a distributed database

ingest and association times1
Ingest and Association Times

[Table of ingest and association time estimates; entries are labeled as educated guesses or wild guesses.]

total time to i a daily data
Total Time to I/A daily Data

Requirement: Less than 12 hours (more than 2800 detections / s)

Detection Processing Rate: 8600 to 7400 detections / s

Margin on Requirement: 3.1 to 2.6

Using multiple loaders would improve performance

insert time @ slices
Insert Time @ slices

[Table of insert time estimates at the slices (educated guesses).]

total time for insert @ slice
Total Time for Insert @ slice

Daily insert may operate in parallel with daily ingest and association.

Requirement: Less than 12 hours

Margin on Requirement: 2.0

Using more slices will improve insert performance.

summary
Summary
  • Ingest + Association < 4 h using 1 loader (@f1= daily)
    • Scales with the number of servers
    • Current margin on requirement 3.1
    • Room for improvement
  • Detection Insert @ slices (@f1= daily)
    • 6 h with 8 partitions/slice
    • It may happen in parallel with loading
  • Detections Lnks Insert @ main (@f2 < monthly)
    • Unknown
    • 6 h available
  • Objects insert & update @ slices (@f2 < monthly)
    • Unknown
    • 6 hours available
  • Objects update @ main server (@f2 < monthly)
    • Unknown
    • 12 h available. Transfer can be pipelined as soon as objects have been processed
risks
Risks
  • Estimates of Insert & Update at slices could be underestimated
    • Need more empirical evaluation of exercising parallel I/O
  • Estimates and lay out of disk storage could be underestimated
    • Merges and Indexes require 2x the data size
engineering systems to support the database design
Engineering Systems to Support the Database Design
  • Sequential read performance is our life-blood. Virtually all science queries will be I/O-bound.
  • ~70 TB raw data: 5.9 hours for full scan on IBM’s fastest 3.3 GB/s Champagne-budget SAN
    • Need 20 GB/s IO engine just to scan the full data in less than an hour. Can’t touch this on a monolith.
  • Data mining a challenge even with good index coverage
    • ~14 TB worth of indexes: 4-odd times bigger than SDSS DR6.
  • Hopeless if we rely on any bulk network transfers: must do work where the data is
  • Loading/Ingest more cpu-bound, though we still need solid write performance
choosing i o systems
Choosing I/O Systems
  • So killer sequential I/O performance is a key systems design goal. Which gear to use?
    • FC/SAN?
    • Vanilla SATA?
    • SAS?
fibre channel san
Fibre Channel, SAN
  • Expensive but not-so-fast physical links (4 Gbit, 10 Gbit)
  • Expensive switch
  • Potentially very flexible
  • Industrial strength manageability
  • Little control over RAID controller bottlenecks
straight sata
Straight SATA
  • Fast
  • Pretty cheap
  • Not so industrial-strength
SAS
  • Fast: 12 Gbit/s FD building blocks
  • Nice and mature, stable
  • SCSI’s not just for swanky drives anymore: takes SATA drives!
  • So we have a way to use SATA without all the “beige”.
  • Pricey? $4400 for full 15x750GB system ($296/drive == close to Newegg media cost)
sas performance gory details
SAS Performance, Gory Details
  • SAS v. SATA differences
per controller performance
Per-Controller Performance
  • One controller can’t quite accommodate the throughput of an entire storage enclosure.
resulting ps1 prototype i o topology
Resulting PS1 Prototype I/O Topology
  • 1100 MB/s single-threaded sequential reads per server
raid 5 v raid 10
RAID-5 v. RAID-10?
  • Primer, anyone?
  • RAID-5 perhaps feasible with contemporary controllers…
  • …but not a ton of redundancy
  • But after we add enough disks to meet performance goals, we have enough storage to run RAID-10 anyway!
raid 10 performance
RAID-10 Performance
  • 0.5*RAID-0 for single-threaded reads
  • RAID-0 perf for 2-user/2-thread workloads
  • 0.5*RAID-0 writes
backup recovery replication strategies
Backup/Recovery/Replication Strategies
  • No formal backup
    • …except maybe for mydb’s, f(cost*policy)
  • 3-way replication
    • Replication != backup
      • Little or no history (though we might have some point-in-time capabilities via metadata)
      • Replicas can be a bit too cozy: must notice badness before replication propagates it
    • Replicas provide redundancy and load balancing…
    • Fully online: zero time to recover
    • Replicas needed for happy production performance plus ingest, anyway
  • Off-site geoplex
    • Provides continuity if we lose HI (local or trans-Pacific network outage, facilities outage)
    • Could help balance trans-Pacific bandwidth needs (service continental traffic locally)
why no traditional backups
Why No Traditional Backups?
  • Money no object… do traditional backups too!!!
  • Synergy, economy of scale with other collaboration needs (IPP?)… do traditional backups too!!!
  • Not super pricey…
  • …but not very useful relative to a replica for our purposes
    • Time to recover
failure scenarios easy ones
Failure Scenarios (Easy Ones)
  • Zero downtime, little effort:
    • Disks (common)
      • Simple* hotswap
      • Automatic rebuild from hotspare or replacement drive
    • Power supplies (not uncommon)
      • Simple* hotswap
    • Fans (pretty common)
      • Simple* hotswap

* Assuming sufficiently non-beige gear

failure scenarios mostly harmless ones
Failure Scenarios (Mostly Harmless Ones)
  • Some downtime and replica cutover:
    • System board (rare)
    • Memory (rare and usually proactively detected and handled via scheduled maintenance)
    • Disk controller (rare, potentially minimal downtime via cold-spare controller)
    • CPU (not utterly uncommon, can be tough and time consuming to diagnose correctly)
failure scenarios slightly spooky ones
Failure Scenarios (Slightly Spooky Ones)
  • Database mangling by human or pipeline error
    • Gotta catch this before replication propagates it everywhere
    • Need lots of sanity checks before replicating
    • (and so off-the-shelf near-realtime replication tools don’t help us)
    • Need to run replication backwards from older, healthy replicas. Probably less automated than healthy replication.
  • Catastrophic loss of datacenter
    • Okay, we have the geoplex
      • …but we’re dangling by a single copy ‘till recovery is complete
      • …and this may be a while.
      • …but are we still in trouble? Depending on colo scenarios, did we also lose the IPP and flatfile archive?
failure scenarios nasty ones
Failure Scenarios (Nasty Ones)
  • Unrecoverable badness fully replicated before detection
  • Catastrophic loss of datacenter without geoplex
  • Can we ever catch back up with the data rate if we need to start over and rebuild with an ingest campaign? Don’t bet on it!
operating systems dbms
Operating Systems, DBMS?
  • Sql2005 EE x64
    • Why?
    • Why not DB2, Oracle RAC, PostgreSQL, MySQL, <insert your favorite>?
  • (Win2003 EE x64)
  • Why EE? Because it’s there. <indexed DPVs?>
  • Scientific Linux 4.x/5.x, or local favorite
  • Platform rant from JVV available over beers
systems database management
Systems/Database Management
  • Active Directory infrastructure
  • Windows patching tools, practices
  • Linux patching tools, practices
  • Monitoring
  • Staffing requirements
facilities infrastructure projections for ps1
Facilities/Infrastructure Projections for PS1
  • Power/cooling
    • Prototype is 9.2 kW (2.6 Tons AC)
    • PS1: something like 43 kW, 12.1 Tons
  • Rack space
    • Prototype is 69 RU, <2 42U racks (includes 14U of rackmount UPS at JHU)
    • PS1: about 310 RU (9-ish racks)
  • Networking: ~40 Gbit Ethernet ports
  • …plus sundry infrastructure, ideally already in place (domain controllers, monitoring systems, etc.)
how design meets requirements
How Design Meets Requirements
  • Cross-matching detections with objects
    • Zone cross-match part of loading pipeline
    • Already exceeded requirement with prototype
  • Query performance
    • Ping-pong configuration for query during ingest
    • Spatial indexing and distributed queries
    • Query manager can be scaled out as necessary
  • Scalability
    • Shared-nothing architecture
    • Scale out as needed
    • Beyond PS1 we will need truly parallel query plans
wbs development tasks
WBS/Development Tasks

[WBS chart. Development tasks, each estimated at 1–4 person-months (PM): Refine Prototype/Schema, Staging/Transformation, Initial Load, Load/Resolve Detections, Resolve/Synchronize Objects, Create Snapshot, Replication Module, Query Processing, QM/Logging, Redistribute Data, Workflow Systems, Logging, Data Scrubbing (SSIS (?) + C#), Hardware, Documentation, Testing. Total effort: 35 PM. Delivery: 9/2008.]

personnel available
Personnel Available
  • 2 new hires (SW Engineers) 100%
  • Maria 80%
  • Ani 20%
  • Jan 10%
  • Alainna 15%
  • Nolan Li 25%
  • Sam Carliles 25%
  • George Fekete 5%
  • Laszlo Dobos 50% (for 6 months)
issues risks
Issues/Risks
  • Versioning
    • Do we need to preserve snapshots of monthly versions?
    • How will users reproduce queries on subsequent versions?
    • Is it ok that a new version of the sky replaces the previous one every month?
  • Backup/recovery
    • Will we need 3 local copies rather than 2 for safety?
    • Is restoring from offsite copy feasible?
  • Handoff to IfA beyond scope of WBS shown
    • This will involve several PMs
query manager1
Query Manager

[Screenshot: the query page. Callouts: the MyDB table that query results go into; a syntax-check button; the name given to the query job; the context the query is executed in; an option to get the graphical query plan; the query buffer; running a query in quick (1-minute) mode; loading one of the sample queries into the query buffer; submitting the query to the long (8-hour) queue.]

query manager2
Query Manager

[Screenshot: a stored procedure's arguments and its SQL code.]

query manager3
Query Manager

[Screenshot: the MyDB page. MyDB is the default context, but other contexts can be selected; the user can browse DB views, tables, functions and procedures; the space used and total space available are shown; multiple tables can be selected and dropped at once; the table list can be sorted by name, size, or type.]

query manager4
Query Manager

[Screenshot: table details, including the query that created the table.]

query manager5
Query Manager

[Screenshot: the search form — the context to run the search on, the search radius, and the table to hold the results.]