The pan starrs data challenge
Download
1 / 37

The Pan-STARRS Data Challenge - PowerPoint PPT Presentation


  • 250 Views
  • Uploaded on

The Pan-STARRS Data Challenge. Jim Heasley Institute for Astronomy University of Hawaii ICS 624 – 28 March 2011. What is Pan-STARRS?. Pan-STARRS - a new telescope facility 4 smallish (1.8m) telescopes, but with extremely wide field of view

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'The Pan-STARRS Data Challenge' - hazel


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
The pan starrs data challenge

The Pan-STARRS Data Challenge

Jim Heasley

Institute for Astronomy

University of Hawaii

ICS 624 – 28 March 2011


What is pan starrs
What is Pan-STARRS?

  • Pan-STARRS - a new telescope facility

  • 4 smallish (1.8m) telescopes, but with extremely wide field of view

  • Can scan the sky rapidly and repeatedly, and can detect very faint objects

    • Unique time-resolution capability

  • Project led by IfA with help from Air Force, Maui High Performance Computer Center, MIT’s Lincoln Lab.

  • The prototype, PS1, will be operated by an international consortium

ICS624


Pan starrs overview
Pan-STARRS Overview

  • Pan-STARRS observatory specifications

    • Four 1.8m R-C + corrector

    • 7 square degree FOV - 1.4Gpixel cameras

    • Sited in Hawaii

    • A  = 50

    • R ~ 24 in 30 s integration

      • -> 7000 square deg/night

    • All sky + deep field surveys in g,r,i,z,y

  • Time domain astronomy

    • Transient objects

    • Moving objects

    • Variable objects

  • Static sky science

    • Enabled by stacking repeated scans to form a collection of ultra-deep static sky images

ICS624





Front of the wave
Front of the Wave

  • Pan-STARRS is only the first of a new generation of astronomical data programs that will generate such large volumes of data:

    • SkyMapper, southern hemisphere optical

    • VISTA, southern hemisphere IR survey

    • LSST, an all sky survey like Pan-STARRS

  • Eventually, these data sets will be useful for data mining.

ICS624




Ps1 data products
PS1 Data Products

  • Detections—measurements obtained directly from processed image frames

    • Detection catalogs

    • “Stacks” of the sky images source catalogs

    • Difference catalogs

      • High significance (> 5 transient events)

      • Low significance (transients between 3 and 5 )

    • Other Image Stacks (Medium Deep Survey)

  • Objects—aggregates derived from detections

ICS624


What s the challenge
What’s the Challenge?

  • At first blush, this looks pretty much like the Sloan Digital Sky Survey…

  • BUT

    • Size – Over its 3 year mission, PS1 will record over 150 billion detections for approximately 5.5 billion sources

    • Dynamic Nature – new data will be always coming into the database system, for things we’ve seen before or new discoveries

ICS624


Book learning
Book Learning

  • The books on database design tell you to

    • Interview your users to determine what they want to use the database for

    • Determine the most common queries your users are going to ask

    • Organize your data into a normalized logical schema

    • Select a physical schema appropriate to your problem.

ICS624


Real world
Real World

  • The infamous “20 Queries” of Alex Szalay (JHU) in designing the SDSS

  • Normalized schema are good but can result in very big performance penalties

  • Money talks – in the real world you are constrained by a budget and not all physical implementations of your database may be affordable (for one reason or another)!

ICS624


Psps top level requirements
PSPS Top Level Requirements

  • 3.3.01 The PSPS shall be able to ingest a total of 1.5x1011 P2 detections, 8.3x1010 cumulative sky detections, and 5.5 x109 celestial objects together with their linkages.

ICS624


Psps top level requirements1
PSPS Top Level Requirements

  • 3.3.02 The PSPS shall be able to ingest the observational metadata for up to a total of 1.1x1010 observations.

  • 3.3.0.3 The PS1 PSPS shall be capable of archiving up to ~ 100 Terabytes of data.

ICS624


Psps top level requirements2
PSPS Top Level Requirements

  • 3.3.0.4 The PSPS shall archive the PS1 data products.

  • 3.3.0.5 The PSPS shall possess a computer security system to protect potentially vulnerable subsystems from malicious external actions.

ICS624


Psps top level requirements3
PSPS Top Level Requirements

  • 3.3.0.6 The PSPS shall provide end-users access to detections of objects in the Pan-STARRS databases.

  • 3.3.0.7 The PSPS shall provide end-users access to the cumulative stationary sky images generated by the Pan-STARRS.

ICS624


Psps top level requirements4
PSPS Top Level Requirements

  • 3.3.0.8 The PSPS shall provide end-users with metadata required to interpret the observational legacy and processing history of the Pan-STARRS data products.

  • 3.3.0.9 The PSPS shall provide end-users with Pan-STARRS detections of objects in the Solar System for which attributes can be assigned.

ICS624


Psps top level requirements5
PSPS Top Level Requirements

  • 3.3.0.10 The PSPS shall provide end-users with derived Solar System objects deduced from Pan-STARRS attributed observations and observations from other sources.

  • 3.3.0.11 The PSPS shall provide the capability for end-users to construct queries to search the Pan-STARRS data products over space and time to examine magnitudes, colors, and proper motions.

ICS624


Psps top level requirements6
PSPS Top Level Requirements

  • 3.3.0.12 The PSPS shall provide a mass storage system with a reliability requirement of 99.9% (TBR).

  • 3.3.0.13 The PSPS baseline configuration should accommodate future additions of databases (i.e., be expandable).

ICS624


How to approach this challenge
How to Approach This Challenge

  • There are many possible approaches to deal with this data challenge.

  • Shared what?

    • Memory

    • Disk

    • Nothing

  • Not all of these approaches are created equal, either in cost and/or performance (DeWitt & Gray, 1992, “Parallel Database Systems: The Future of High Performance Database Processing”).

ICS624


Conversation with the pan starrs project manager
Conversation with the Pan-STARRS Project Manager

  • Jim: Tom, what are we going to do if the solution proposed by SAIC is more than you can afford?

  • Tom: Jim, I’m sure you’ll think of something!

  • Not long after that, SAIC did give us a hardware/software plan we couldn’t afford. Not long after, Tom resigned from the project to pursue other activities…

ICS624


The saic odm architecture proposal
The SAIC ODM Architecture Proposal

Ingest

Query

RIGHT BRAIN

LEFT BRAIN

Publish

  • Single multi-processor machine

  • High performance storage

    • Objects

    • Staging

    • Ingest detections

  • Clustered small processor machines

  • High capacity storage

    • Published detections

ICS624


The saic odm architecture proposal1
The SAIC ODM Architecture Proposal

Ingest

Query

RIGHT BRAIN

LEFT BRAIN

$

Publish

  • Single multi-processor machine

  • High performance storage

    • Objects

    • Staging

    • Ingest detections

  • Clustered small processor machines

  • High capacity storage

    • Published detections

ICS624


Conversation with the pan starrs project manager1
Conversation with the Pan-STARRS Project Manager

  • The Pan-STARRS project teamed up with Alex Szalay and his database team at JHU as they were the only game in town with real experience building large astronomical databases.

ICS624


Building upon the sdss heritage
Building upon the SDSS Heritage

  • In teaming with the group at JHU we hoped to build upon the experience and software developed for the SDSS.

  • A key question was how could we scale the system to deal with the volume of data expected from PS1 (> 10X SDSS in the first year alone).

  • The second key question, could the system keep up with the data flow.

  • The heritage is more one of philosophy than recycled software, as to deal with the challenges posed by PS1 we’ve had to generate a great deal of new code.

ICS624



Data storage logical schema
Data Storage Logical Schema

ICS624


The object data manager
The Object Data Manager

  • The Object Data Manager (ODM) was considered to be the “long pole” in the development of the PS1 PSPS.

  • Parallel database systems can provide both data redundancy and spreading very large tables that can’t fit on a single machine over multiple storage volumes.

  • For PS1 (and beyond) we need both.

ICS624


Distributed architecture
Distributed Architecture

  • The bigger tables will be spatially partitioned across servers called Slices

  • Using slices improves system scalability

  • Tables are sliced into ranges of ObjectID, which correspond to broad declination ranges

  • ObjectID boundaries are selected so that each slice has a similar number of objects

  • Distributed Partitioned Views “glue” the data together

ICS624


Data storage logical schema1
Data Storage Logical Schema

ICS624


Design decisions objid
Design Decisions: ObjID

  • Objects have their positional information encoded in their objID

    • fGetPanObjID (ra, dec, zoneH)

    • ZoneID is the most significant part of the ID

    • objID is the Primary Key

  • Objects are organized (clustered indexed) so nearby objects in the sky are stored on disk nearby as well

  • It gives good search performance, spatial functionality, and scalability

ICS624


Pan starrs data flow
Pan-STARRS Data Flow

← Behind the Cloud|| User facing services →

Data Valet Workflows

Astronomers

(Data Consumers)

The Pan-STARRS Science Cloud

Data Consumer

Queries & Workflows

WarmSlice DB 1

Data Creators

Load Workflow

Load DB

CSV

Files

Cold Slice DB 1

Image Procesing Pipeline (IPP)

MyDB

Merge Workflow

Flip Workflow

Hot Slice DB 2

MainDB

Distributed View

CSV

Files

Load DB

CASJobs

Query Service

Load Workflow

Telescope

Merge Workflow

Cold Slice DB 2

MainDB Distributed View

Flip Workflow

WarmSlice DB 2

Hot Slice DB 1

Validation Exception

Notification

MyDB

Admin & Load-Merge Machines

Slice Fault Recover Workflow

Data flows in one direction→, except for error recovery

Production Machines

ICS624


Pan-STARRS Data Layout

Image

Pipeline

L1 Data

csv

csv

csv

csv

csv

csv

L2 Data

Load-Merge Nodes

LOAD

Load

Merge 1

Load

Merge 2

Load

Merge 3

Load

Merge 4

Load

Merge 5

Load

Merge 6

COLD

S

1

S

2

S

3

S

4

S

5

S

6

S

7

S

8

S

9

S

10

S

11

S

12

S

13

S

14

S

15

S

16

Slice Nodes

Slice

1

Slice

2

Slice

3

Slice

4

Slice

5

Slice

6

Slice

7

Slice

8

HOT

S

1

S

2

S

3

S

4

S

5

S

6

S

7

S

8

S

9

S

10

S

11

S

12

S

13

S

14

S

15

S

16

WARM

s

16

s

3

s

2

s

5

s

4

s

7

s

6

s

9

s

8

s

11

s

10

s

13

s

12

s

15

s

14

s

1

Head 1

Main

Main

Head 2

ICS624

Head Nodes

Distributed View


The odm infrastructure
The ODM Infrastructure

  • Much of our software development has gone into extending the ingest pipeline developed for SDSS.

  • Unlike SDSS, we don’t have “campaign” loads but a steady from of data from the telescope through the Image Processing Pipeline to the ODM.

  • We have constructed data workflows to deal with both the regular data flow into the ODM as well as anticipated failure modes (lost disk, RAID, and various severer nodes).

ICS624


Pan-STARRS

Object Data Manager Subsystem

System Operation UI

System Health Monitor UI

Query Performance UI

Data Flow

Control Flow

System & Administration Workflows

Orchestrates all cluster changes, such as, data loading, or fault tolerance

Configuration, Health & Performance Monitoring

Cluster deployment and operations

Pan-STARRS Cloud Services for Astronomers

Internal Data Flow and State Logging

Tools for supporting workflow authoring and execution

Loaded Astronomy Databases

~70TB Transfer/Week

Deployed Astronomy Databases

~70TB Storage/Year

Query Manager

Science queries and MyDB for results

Pan-STARRS

Telescope

Image Processing Pipeline

Extracts objects like stars and galaxies from telescope images

~1TB Input/Week

ICS624

36


What next
What Next?

  • Will this approach scale to our needs?

    • PS1 – yes. But, we already see the need for better parallel processing query plans.

    • PS4 – unclear! Even though I’m not from Missouri, “show me!” One year of PS4 produces > data volume than the entire PS1 3 year mission!

  • Column based databases?

  • Cloud computing?

    • How can we test issues like scalability without actually building the system?

    • Does each project really need its own data center?

    • Having these databases “in the cloud” may greatly facilitate data sharing/mining.

ICS624


ad