The world wide telescope astronomy with terabytes l.jpg
This presentation is the property of its rightful owner.
Sponsored Links
1 / 44

The World-Wide Telescope: Astronomy with Terabytes PowerPoint PPT Presentation


  • 66 Views
  • Uploaded on
  • Presentation posted in: General

The World-Wide Telescope: Astronomy with Terabytes. Alex Szalay The Johns Hopkins University Jim Gray Microsoft Research. Outline. Challenges: A New Kind of Science Publishing the Data Data Federation The Virtual Observatory Analyzing Terabytes of Data. Living in an Exponential World.

Download Presentation

The World-Wide Telescope: Astronomy with Terabytes

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


The world wide telescope astronomy with terabytes l.jpg

The World-Wide Telescope: Astronomy with Terabytes

Alex SzalayThe Johns Hopkins University

Jim Gray Microsoft Research


Outline l.jpg

Outline

  • Challenges: A New Kind of Science

  • Publishing the Data

  • Data Federation

  • The Virtual Observatory

  • Analyzing Terabytes of Data


Living in an exponential world l.jpg

Living in an Exponential World

  • Astronomers have a few hundred TB now

    • 1 pixel (byte) / sq arc second ~ 4TB

    • Multi-spectral, temporal, … → 1PB

  • They mine it looking fornew (kinds of) objects or more of interesting ones (quasars), density variations in 400-D space correlations in 400-D space

  • Data doubles every year

  • Data is public after 1 year

  • So, 50% of the data is public

  • Same access for everyone


The challenges l.jpg

The Challenges

Exponential data growth: Distributed collections Soon Petabytes

Data

Collection

Discovery

and Analysis

Publishing

New analysis paradigm: Data federations, Move analysis to data

New publishing paradigm: Scientists are publishers and Curators


Evolving science l.jpg

Evolving Science

  • Thousand years ago: science was empirical

    describing natural phenomena

  • Last few hundred years: theoretical branch

    using models, generalizations

  • Last few decades: a computational branch

    simulating complex phenomena

  • Today: data exploration (eScience)

    synthesizing theory, experiment and computation with advanced data management and statistics


Outline6 l.jpg

Outline

  • Challenges: A New Kind of Science

  • Publishing the Data

  • Data Federation

  • The Virtual Observatory

  • Analyzing Terabytes of Data


Publishing data l.jpg

Roles

Authors

Publishers

Curators

Consumers

Traditional

Scientists

Journals

Libraries

Scientists

Emerging

Collaborations

Project www site

Bigger Archives

Scientists

Publishing Data

  • Exponential growth:

    • Projects last at least 3-5 years

    • Data sent upwards only at the end of the project

    • Data will never be centralized

  • More responsibility on projects

    • Becoming Publishers and Curators

  • Data will reside with projects

    • Analyses must be close to the data


Data access is hitting a wall l.jpg

You can GREP 1 MB in a second

You can GREP 1 GB in a minute

You can GREP 1 TB in 2 days

You can GREP 1 PB in 3 years

Oh!, and 1PB ~4,000 disks

At some point you need indices to limit searchparallel data search and analysis

This is where databases can help

You can FTP 1 MB in 1 sec

You can FTP 1 GB / min (= 1 $/GB)

… 2 days and 1K$

… 3 years and 1M$

Data Access is Hitting a Wall

FTP and GREP are not adequate


Analysis and databases l.jpg

Analysis and Databases

  • Much statistical analysis deals with

    • Creating uniform samples –

    • data filtering

    • Assembling relevant subsets

    • Estimating completeness

    • censoring bad data

    • Counting and building histograms

    • Generating Monte-Carlo subsets

    • Likelihood calculations

    • Hypothesis testing

  • Traditionally these are performed on files

  • Most of these tasks are much better done inside a database

  • Move Mohamed to the mountain, not the mountain to Mohamed.


Smart data l.jpg

Smart Data

  • If there is too much data to move around,

    take the analysis to the data!

  • Do all data manipulations at database

    • Build custom procedures and functions in the database

  • Automatic parallelism guaranteed

  • Easy to build-in custom functionality

    • Databases & Procedures being unified

    • Example temporal and spatial indexing

    • Pixel processing

  • Easy to reorganize the data

    • Multiple views, each optimal for certain analyses

    • Building hierarchical summaries are trivial

  • Scalable to Petabyte datasets

active databases!


Outline11 l.jpg

Outline

  • Challenges: A New Kind of Science

  • Publishing the Data

  • Data Federation

  • The Virtual Observatory

  • Analyzing Terabytes of Data


Making discoveries l.jpg

Making Discoveries

  • Where are discoveries made?

    • At the edges and boundaries

    • Going deeper, collecting more data, using more colors….

  • Metcalfe’s law

    • Utility of computer networks grows as the number of possible connections: O(N2)

  • Federating data

    • Federation of N archives has utility O(N2)

    • Possibilities for new discoveries grow as O(N2)

  • Current sky surveys have proven this

    • Very early discoveries from SDSS, 2MASS, DPOSS


Data federations l.jpg

Data Federations

  • Massive datasets live near their owners:

    • Near the instrument’s software pipeline

    • Near the applications

    • Near data knowledge and curation

    • Super Computer centers become Super Data Centers

  • Each Archive publishes (web) services

    • Schema: documents the data

    • Methods on objects (queries)

  • Scientists get “personalized” extracts

  • Uniform access to multiple Archives

    • A common global schema

Federation


The key web services l.jpg

The Key: Web Services

Your

program

  • Web SERVER:

    • Given a url + parameters

    • Returns a web page (often dynamic)

  • Web SERVICE:

    • Given a XML document (soap msg)

    • Returns an XML document

    • Tools make this look like an RPC.

      • F(x,y,z) returns (u, v, w)

    • Distributed objects for the web.

    • + naming, discovery, security,..

  • Internet-scale distributed computing

Web

Server

http

Web page

Your

program

Web

Service

soap

Data

In your address space

objectin xml

NVO WESIX service: build your object catalog in 5 mins


Skyquery http skyquery net l.jpg

SkyQuery (http://skyquery.net/)

  • Distributed Query tool using a set of web services

  • Many astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England).

  • Feasibility study, built in 6 weeks

    • Tanu Malik (JHU CS grad student)

    • Tamas Budavari (JHU astro postdoc)

    • With help from Szalay, Thakar, Gray

  • Implemented in C# and .NET

  • Allows queries like:

SELECT o.objId, o.r, o.type, t.objId

FROM SDSS:PhotoPrimary o,

TWOMASS:PhotoPrimary t

WHERE XMATCH(o,t)<3.5

AND AREA(181.3,-0.76,6.5)

AND o.type=3 and (o.I - t.m_j)>2

Now: http://openskyquery.net/


Skyquery structure l.jpg

Each SkyNode publishes

Schema Web Service

Database Web Service

Portal is

Plans Query (2 phase)

Integrates answers

Is itself a web service

ImageCutout

SkyQuery

Portal

2MASS

INT

SDSS

FIRST

SkyQuery Structure


Outline17 l.jpg

Outline

  • Challenges: A New Kind of Science

  • Publishing the Data

  • Data Federation

  • The Virtual Observatory

  • Analyzing Terabytes of Data


Why is astronomy special l.jpg

Why Is Astronomy Special?

  • Especially attractive for the wide public

  • It has no commercial value

    • No privacy concerns, freely share results with others

    • Great for experimenting with algorithms

  • It is real and well documented

    • High-dimensional (with confidence intervals)

    • Spatial, temporal

  • Diverse and distributed

    • Many different instruments from many different places and many different times

  • The questions are interesting

  • There is a lot of it (soon petabytes)


The virtual observatory l.jpg

The Virtual Observatory

  • Premise: most data is (or could be online)

  • So, the Internet is the world’s best telescope:

    • It has data on every part of the sky

    • In every measured spectral band: optical, x-ray, radio..

    • As deep as the best instruments (2 years ago).

    • It is up when you are up

    • The “seeing” is always great

    • It’s a smart telescope: links objects and data to literature on them

  • Software became the capital expense

    • Share, standardize, reuse..

  • It has to be SIMPLE


National virtual observatory l.jpg

National Virtual Observatory

  • NSF ITR project, “Building the Framework for the National Virtual Observatory” is a collaboration of 17 funded and 3 unfunded organizations

    • Astronomy data centers

    • National observatories

    • Supercomputer centers

    • University departments

    • Computer science/information technology specialists

  • Natural cohesion with Grid Computing

http://us-vo.org/


International collaboration l.jpg

International Collaboration

  • Similar efforts now in 15 countries:

    • USA, UK, Canada, France, Germany, Italy, Holland, Japan, Australia, India, China, Russia, Hungary, South Korea, ESO, Spain

  • Total awarded funding world-wide is over $60M

  • Active collaboration among projects

    • Standards, common demos

    • International VO roadmap being developed

    • Regular telecons over 10 timezones

  • Formal collaboration

    International Virtual Observatory Alliance (IVOA)


Boundary conditions l.jpg

Boundary Conditions

Standards driven by evolving new technologies

  • Exchange of rich and structured data (XML…)

  • DB connectivity, Web Services, Grid computing

Application to astronomy domain

  • Data dictionaries (UCDs)

  • Data models

  • Protocols

  • Registries and resource/service discovery

  • Provenance, data quality, DATA CURATION!!!!

Boundary conditions

Dealing with the astronomy legacy

  • FITS data format

  • Software systems


Main vo challenges l.jpg

Main VO Challenges

  • How to avoid trying to be everything for everybody?

  • Database connectivity is essential

    • Bring the analysis to the data

  • Core web services, higher level applications on top

  • Use the 90-10 rule:

    • Define the standards and interfaces

    • Build the framework

    • Build the 10% of services that are used by 90%

    • Let the users build the rest from the components

  • Rapidly changing “outside world”

  • Make it simple!!!


Nvo from research to services l.jpg

NVO – from research to services

  • First two years spent on

    • Team building

    • Defining the standards

    • Building prototypes and pilot studies

    • Get feedback from astronomy SW community

  • Third year

    • Define core applications

      • Prototypes to services

    • Build them

    • Document them


First light l.jpg

First Light

  • Jan 2005: the first real applications

  • Taking some of the most common tasks

    • Discovery and data access

    • Analysis and exploration

    • Visualization

  • Immediate future:

    • engage the whole astronomy community


Openskyquery l.jpg

OpenSkyQuery

Cross-match your data with numerous catalogs

OpenSkyQuery allows you to cross-match astronomical catalogs and select subsets of catalogs with a general and powerful query language. You can also import a personal catalog of objects and cross-match it against selected databases.


Spectrum services l.jpg

Spectrum Services

Search, plot, and retrieve SDSS, 2dF, and other spectra

The Spectrum Services web site is dedicated to spectrum related VO services. On this site you will find tools and tutorials on how to access close to 500,000 spectra from the Sloan Digital Sky Survey (SDSS DR1) and the 2 degree Field redshift survey (2dFGRS). The services are open to everyone to publish their own spectra in the same framework. Reading the tutorials on XML Web Services, you can learn how to integrate the 45 GB spectrum and passband database with your programs with few lines of code.


Web enabled source identification with cross matching wesix l.jpg

Web Enabled Source Identification with Cross-Matching (WESIX)

Upload images to SExtractor and cross-correlate the objects found with selected survey catalogs.

This NVO service does source extraction and cross-matching for any astrometric FITS image. The user uploads a FITS image, and the remote service runs the SExtractor software for source extraction. The resulting catalog can be cross-matched with any of several major surveys, and the results returned as a VOTable. The web page also allows use of Aladin or VOPlot to visualize results.


Skyserver l.jpg

SkyServer

  • Sloan Digital Sky Survey: Pixels + Objects

  • About 500 attributes per “object”, 300M objects

  • Spectra for 1M objects

  • Currently 2TB fully public

  • Prototype eScience lab

    • Moving analysis to the data

    • Fast searches: color, spatial

  • Visual tools

    • Join pixels with objects

  • Prototype in data publishing

    • 70 million web hits in 3.5 years

      http://skyserver.sdss.org/


Db loading l.jpg

DB Loading

  • Automated table driven workflow system for loading

    • Included lots of verification code

    • Over 16K lines of SQL code

  • Loading process was extremely painful

    • Lack of systems engineering for the pipelines

    • Poor testing (lots of foreign key mismatch)

    • Detected data bugs even a month ago

    • Most of the time spent on scrubbing data

    • Fixing corrupted files (RAID5 disk errors)

  • Once data is clean, everything loads in 1 week

  • Neighbors calculation took about 10 hours

  • Reorganization of data took about 1 week of experiments in partitioning/layouts


Data delivery l.jpg

Data Delivery

  • Small requests (<100MB)

    • Putting data on the stream

  • Medium requests (<1GB)

    • Use DIME attachments to SOAP messages

  • Large requests (>1GB)

    • Save data in scratch area and use asynch delivery

    • Only practical for large/long queries

  • Iterative requests

    • Save data in temp tables in user space

    • Let user manipulate via web browser

  • Paradox: if we use web browser to submit, users want immediate response from batch-size queries


Queue management l.jpg

Queue Management

  • Need to register batch ‘power users’

  • Query output goes to ‘MyDB’

  • Can be joined with source database

  • Results are materialized from MyDB upon request

  • Users can do:

    • Insert, Drop, Create, Select Into, Functions, Procedures

    • Publish their tables to a group area

  • Data delivery via the CASJobs (C# WS)


Spatial features l.jpg

Spatial Features

  • Precomputed Neighbors

    • All objects within 30”

  • Boundaries, Masks and Outlines

    • 27,000 spatial objects

    • Stored as spatial polygons

      Time Domain:

  • Precomputed Match

    • All objects with 1”, observed at different times

    • Found duplicates due to telescope tracking errors

    • Manual fix, recorded in the database

  • MatchHead

    • The first observation of the linked list used as unique id to chain of observations of the same object


Things can get complex l.jpg

Things Can Get Complex


3 ways to do spatial l.jpg

3 Ways To Do Spatial

  • Hierarchical Triangular Mesh (extension to SQL)

    • Uses table valued stored procedures

    • Acts as a new “spatial access method”

    • Ported to Yukon CLR for a 17x speedup.

  • Zones: fits SQL well

    • Surprisingly simple & good on a fixed scale

  • Constraints: a novel idea

    • Lets us do algebra on regions., implemented in pure SQL

  • Paper:There Goes the Neighborhood: Relational Algebra for Spatial Data Search


Footprint poster child app l.jpg

Footprint: Poster Child App

  • Used as footprint service.

  • Take many footprints

  • Fuzz them (buffer) to make coarser footprintconvex hull of vertices

  • See if two footprints overlap

  • ~20 lines of code +130 lines of logic/comments


Crossmatch zone approach l.jpg

CrossMatch: Zone Approach

  • Divide space into zones

  • Key points by Zone, offset(on the sphere this need wrap-around margin.)

  • Point search look in a few zonesat a limited offset: ra ± Δa bounding box that has 1-π/4 false positives

  • All inside the relational engine

  • Avoids “impedance mismatch”

  • Can “batch” all-all comparisons

  • faster and 60,000x parallel1 hours, not 6 months!

  • This is Maria Nieto Santisteban’s PhD thesis

r

ra-zoneMax

x

√(r2+(ra-zoneMax)2)

cos(radians(zoneMax))

zoneMax

Ra ± x


Slide38 l.jpg

Zones allow 60,000 Parallel Jobs

Partition Parallelism: 3.7 hours

2MASS:USNOBZone:ZoneComparison

MergeAnswer

Build Index

Source Tables

Zoned Tables

2MASS→USNOB

350 Mrec

12 GB

2MASS

471 Mrec

140 GB

0:-1

64 Mrec

2 GB

2MASS

471 Mrec

65 GB

0:0

260 Mrec

9 GB

USNOB

1.1 Brec

233 GB

USNOB

1.1 Brec

106 GB

350 Mrec

12 GB

0:+1

26 Mrec

1 GB

USNOB→2MASS

2 hours

1.2 hour

.5 hour


Slide39 l.jpg

Pipeline Parallelism: 2.5 hours

Or… as fast as we can read USNOB + .5 hours

2MASS:USNOBZone:ZoneComparison

MergeAnswer

Build Index

Source Tables

Zones

2MASS→USNOB

350 Mrec

12 GB

2MASS

471 Mrec

140 GB

0:-1

64 Mrec

2 GB

Next zone

0:0

260 Mrec

9 GB

USNOB

1.1 Brec

233 GB

Next zone

350 Mrec

12 GB

0:+1

26 Mrec

1 GB

USNOB→2MASS

2 hours

.5 hour


Outline40 l.jpg

Outline

  • Challenges: A New Kind of Science

  • Publishing the Data

  • Data Federation

  • The Virtual Observatory

  • Analyzing Terabytes of Data


Next generation data analysis l.jpg

Next-Generation Data Analysis

  • Looking for

    • Needles in haystacks – the Higgs particle

    • Haystacks: Dark matter, Dark energy

  • Needles are easier than haystacks

  • ‘Optimal’ statistics have poor scaling

    • Correlation functions are N2, likelihood techniques N3

    • For large data sets main errors are not statistical

  • As data and computers grow with Moore’s Law, we can only keep up with N logN

  • A way out?

    • Discard notion of optimal (data is fuzzy, answers are approximate)

    • Don’t assume infinite computational resources or memory

  • Requires combination of statistics & computer science


Organization algorithms l.jpg

Organization & Algorithms

  • Use of clever data structures (trees, cubes):

    • Up-front creation cost, but only N logN access cost

    • Large speedup during the analysis

    • Tree-codes for correlations (A. Moore et al 2001)

    • Data Cubes for OLAP (all vendors)

  • Fast, approximate heuristic algorithms

    • No need to be more accurate than cosmic variance

    • Fast CMB analysis by Szapudi et al (2001)

      • N logN instead of N3 => 1 day instead of 10 million years

  • Take cost of computation into account

    • Controlled level of accuracy

    • Best result in a given time, given our computing resources


Trends l.jpg

Trends

CMB Surveys

  • 1990 COBE 1000

  • 2000 Boomerang 10,000

  • 2002 CBI 50,000

  • 2003 WMAP 1 Million

  • 2008 Planck10 Million

Angular Galaxy Surveys

  • 1970 Lick 1M

  • 1990 APM 2M

  • 2005 SDSS200M

  • 2008 VISTA 1000M

  • 2012 LSST 3000M

Time Domain

  • QUEST

  • SDSS Extension survey

  • Dark Energy Camera

  • PanStarrs

  • SNAP…

  • LSST…

Galaxy Redshift Surveys

  • 1986 CfA 3500

  • 1996 LCRS 23000

  • 2003 2dF 250000

  • 2005 SDSS 750000

Petabytes/year by the end of the decade…


Summary l.jpg

Summary

  • Data growing exponentially

  • Publishing so much data requires a new model

  • Multiple challenges for different communities

    • publishing, visualization, statistics, algorithms, educational

  • Information at your fingertips

    • Students see the same data as professional astronomers

  • More data coming: Petabytes/year by 2010

    • Need scalable solutions

    • Move analysis to the data!

  • Same thing happening in all sciences

    • High energy physics, genomics, cancer research,medical imaging, oceanography, remote sensing, …

  • eScience: an emerging new branch of science


  • Login