

eScience -- A Transformed Scientific Method

Jim Gray, eScience Group, Microsoft Research (http://research.microsoft.com/~Gray)
in collaboration with Alex Szalay, Dept. of Physics & Astronomy, Johns Hopkins University (http://www.sdss.jhu.edu/~szalay/)


Talk Goals

Explain eScience (and what I am doing), and recommend that the CSTB foster tools for:

  • data capture (lab info management systems)

  • data curation (schemas, ontologies, provenance)

  • data analysis (workflow, algorithms, databases, data visualization)

  • data+doc publication (active docs, data-doc integration)

  • peer review (editorial services)

  • access (doc + data archives and overlay journals)

  • scholarly communication (wikis for each article and dataset)


eScience: What Is It?

Synthesis of information technology and science.

Science methods are evolving (tools).

Science is being codified/objectified. How do we represent scientific information and knowledge in computers?

Science faces a data deluge. How do we manage and analyze the information?

Scientific communication is changing: publishing data & literature (curation, access, preservation).


Science Paradigms

A thousand years ago: science was empirical, describing natural phenomena.

Last few hundred years: a theoretical branch, using models and generalizations.

Last few decades: a computational branch, simulating complex phenomena.

Today: data exploration (eScience), unifying theory, experiment, and simulation:

  • Data captured by instruments or generated by simulators
  • Processed by software
  • Information/knowledge stored in computers
  • Scientist analyzes databases/files using data management and statistics


X-Info

[Diagram: Experiments & Instruments, Other Archives, Literature, and Simulations each supply facts; questions go into the X-Info store and answers come back out.]

  • The evolution of X-Info and Comp-X for each discipline X

  • How to codify and represent our knowledge

The Generic Problems

  • Data ingest

  • Managing a petabyte

  • Common schema

  • How to organize it

  • How to reorganize it

  • How to share with others

  • Query and Vis tools

  • Building and executing models

  • Integrating data and Literature

  • Documenting experiments

  • Curation and long-term preservation


Experiment Budgets: ¼…½ Software

Software for:

  • Instrument scheduling
  • Instrument control
  • Data gathering
  • Data reduction
  • Database
  • Analysis
  • Modeling
  • Visualization

Millions of lines of code, repeated for experiment after experiment, with not much sharing or learning.

CS can change this: build generic tools

  • Workflow schedulers
  • Databases and libraries
  • Analysis packages
  • Visualizers


Action item: Foster tools and foster tool support.


Project Pyramids

In most disciplines there are a few “giga” projects, several “mega” consortia, and then many small labs. Often some instrument creates the need for a giga- or mega-project:

  • Polar station
  • Accelerator
  • Telescope
  • Remote sensor
  • Genome sequencer
  • Supercomputer

Tier 1, 2, 3 facilities to use the instrument + data.


Pyramid Funding

  • Giga projects need giga funding: Major Research Equipment grants
  • Need projects at all scales
    • computing example: supercomputers + departmental clusters + lab clusters
    • technical + social issues
  • Fully fund giga projects; fund ½ of smaller projects, which get matching funds from other sources
  • “Petascale Computational Systems: Balanced Cyber-Infrastructure in a Data-Centric World,” IEEE Computer, V. 39.1, pp. 110-112, January 2006.


Action item: Invest in tools at all levels.


Need Lab Info Management Systems (LIMSs)

  • Pipeline instrument + simulator data to an archive & publish it to the web.
  • NASA: Level 0 (raw) data → Level 1 (calibrated) → Level 2 (derived)
  • Needs a workflow tool to manage the pipeline
  • Build prototypes.
  • Examples: SDSS, LifeUnderYourFeet, MBARI Shore Side Data System.
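The Level 0 → 1 → 2 pipeline a generic LIMS would manage can be sketched as a tiny workflow. This is a minimal illustration, not NASA's actual processing: the calibration step and its dark-current offset are invented.

```python
# Sketch of the Level 0 -> 1 -> 2 pipeline a generic LIMS would manage.
# Step names and the calibration itself are illustrative, not NASA's.
def calibrate(raw):          # Level 0 -> Level 1
    dark_current = 2.0       # assumed instrument offset
    return [x - dark_current for x in raw]

def derive(calibrated):      # Level 1 -> Level 2: a derived product
    return sum(calibrated) / len(calibrated)

def pipeline(raw):
    """Run every level and keep each product, ready to archive and publish."""
    level1 = calibrate(raw)
    level2 = derive(level1)
    return {"level0": raw, "level1": level1, "level2": level2}

products = pipeline([10.0, 12.0, 14.0])
# products["level2"] == 10.0  (mean of [8.0, 10.0, 12.0])
```

The point of the slide is that every project rebuilds this scaffolding; a workflow tool would let each project supply only the science steps.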


Action item: Foster generic LIMS.



Science Needs Info Management

  • Simulators produce lots of data

  • Experiments produce lots of data

  • Standard practice:

    • each simulation run produces a file

    • each instrument-day produces a file

    • each process step produces a file

    • files have descriptive names

    • files have similar formats (described elsewhere)

  • Projects have millions of files (or soon will)

  • No easy way to manage or analyze the data.
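One small step beyond "files with descriptive names" is parsing those names into a queryable catalog. A minimal sketch, assuming a hypothetical `instrument_date_step.dat` naming convention (real projects would substitute their own):

```python
# Sketch: turning "descriptive file names" into a queryable catalog.
# The naming scheme (instrument_date_step.dat) is hypothetical.
import re
import sqlite3

FNAME = re.compile(r"(?P<instrument>\w+?)_(?P<date>\d{8})_(?P<step>raw|calib|derived)\.dat")

def build_catalog(filenames):
    """Parse descriptive names into a small relational catalog."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE files (name TEXT, instrument TEXT, date TEXT, step TEXT)")
    for name in filenames:
        m = FNAME.match(name)
        if m:  # skip files that do not follow the convention
            db.execute("INSERT INTO files VALUES (?,?,?,?)",
                       (name, m["instrument"], m["date"], m["step"]))
    return db

db = build_catalog(["sdss_20060101_raw.dat",
                    "sdss_20060101_calib.dat",
                    "mbari_20060102_raw.dat"])
raw_count = db.execute("SELECT COUNT(*) FROM files WHERE step='raw'").fetchone()[0]
# raw_count == 2
```

With millions of files this is exactly the step that stops scaling by hand, which is the slide's point.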


Data Analysis

  • Looking for:
    • Needles in haystacks: the Higgs particle
    • Haystacks: dark matter, dark energy
  • Needles are easier than haystacks
  • Global statistics have poor scaling
    • Correlation functions are N², likelihood techniques N³
  • We can only do N log N
  • Must accept approximate answers: new algorithms

  • Requires combination of

    • statistics &

    • computer science
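The scaling point can be made concrete with pair counting, the kernel of a correlation function. A minimal sketch (2-D points, stdlib only): the brute-force version compares every pair, while a grid index only compares neighboring cells, which is the kind of algorithmic change the slide calls for.

```python
# Sketch: why O(N^2) statistics hit a wall, and how spatial indexing helps.
# Counts pairs of points closer than r: brute force vs. a grid index.
import random

def brute_force_pairs(pts, r):
    """O(N^2): compare every pair -- infeasible for large N."""
    r2 = r * r
    return sum(
        1
        for i in range(len(pts))
        for j in range(i + 1, len(pts))
        if (pts[i][0] - pts[j][0]) ** 2 + (pts[i][1] - pts[j][1]) ** 2 < r2
    )

def grid_pairs(pts, r):
    """~O(N): bucket points into cells of side r, compare neighboring cells only."""
    cells = {}
    for p in pts:
        cells.setdefault((int(p[0] // r), int(p[1] // r)), []).append(p)
    r2, count = r * r, 0
    for (cx, cy), bucket in cells.items():
        for dx in (0, 1):
            for dy in (-1, 0, 1):
                if dx == 0 and dy < 0:
                    continue  # visit each unordered cell pair exactly once
                if (dx, dy) == (0, 0):
                    count += brute_force_pairs(bucket, r)
                else:
                    other = cells.get((cx + dx, cy + dy), [])
                    count += sum(
                        1 for a in bucket for b in other
                        if (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 < r2
                    )
    return count

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(500)]
assert brute_force_pairs(pts, 0.05) == grid_pairs(pts, 0.05)  # same answer, far less work
```

Tree and grid indices give exact answers here; the approximate algorithms the slide mentions trade even more accuracy for speed on the truly global statistics.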


Analysis and Databases

  • Much statistical analysis deals with:
    • Creating uniform samples: data filtering

    • Assembling relevant subsets

    • Estimating completeness

    • Censoring bad data

    • Counting and building histograms

    • Generating Monte-Carlo subsets

    • Likelihood calculations

    • Hypothesis testing

  • Traditionally performed on files

  • These tasks better done in structured store with

    • indexing,

    • aggregation,

    • parallelism

    • query, analysis,

    • visualization tools.
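"Counting and building histograms" in a structured store is one declarative query instead of a loop over files. A minimal sketch using SQLite as a stand-in for the project database (table and magnitudes invented for illustration):

```python
# Sketch: censoring bad data and histogramming, pushed into a structured store
# (SQLite here as a stand-in) instead of looping over flat files.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE obs (mag REAL, good INTEGER)")
db.executemany("INSERT INTO obs VALUES (?,?)",
               [(14.2, 1), (15.7, 1), (15.9, 0), (16.4, 1), (17.1, 1)])

# One declarative query: censor bad rows, then 1-magnitude histogram bins.
hist = db.execute("""
    SELECT CAST(mag AS INTEGER) AS bin, COUNT(*) AS n
    FROM obs
    WHERE good = 1          -- censoring bad data
    GROUP BY bin
    ORDER BY bin
""").fetchall()
# hist == [(14, 1), (15, 1), (16, 1), (17, 1)]
```

The same query runs unchanged whether the table holds five rows or five billion, with the indexing and parallelism supplied by the store.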


Data Delivery: Hitting a Wall

FTP and GREP are not adequate:

  • You can GREP 1 MB in a second, and FTP it in a second
  • You can GREP 1 GB in a minute, and FTP it in a minute (~1 $/GB)
  • You can GREP 1 TB in 2 days, and FTP it in 2 days for $1K
  • You can GREP 1 PB in 3 years, and FTP it in 3 years for $1M

Oh, and 1 PB is ~4,000 disks.

At some point you need indices to limit search, and parallel data search and analysis. This is where databases can help.

Accessing Data

  • If there is too much data to move around, take the analysis to the data!
  • Do all data manipulations at the database
    • Build custom procedures and functions in the database
  • Automatic parallelism guaranteed
  • Easy to build in custom functionality
    • Databases & procedures being unified
    • Examples: temporal and spatial indexing, pixel processing
  • Easy to reorganize the data
    • Multiple views, each optimal for certain analyses
    • Building hierarchical summaries is trivial
  • Scalable to petabyte datasets

Active databases!
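"Take the analysis to the data" can be illustrated by registering a custom function inside the database, so the filtering runs next to the rows rather than in client code. A minimal sketch with SQLite; the table, the sources, and the flat-sky separation formula are all illustrative:

```python
# Sketch of "take the analysis to the data": register a custom function
# inside the database so the computation runs next to the rows.
import math
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE src (ra REAL, dec REAL, flux REAL)")
db.executemany("INSERT INTO src VALUES (?,?,?)",
               [(180.0, 0.0, 12.0), (180.001, 0.001, 3.0), (10.0, 45.0, 7.0)])

def ang_dist_deg(ra1, dec1, ra2, dec2):
    """Small-angle separation in degrees (flat-sky approximation)."""
    dra = (ra1 - ra2) * math.cos(math.radians((dec1 + dec2) / 2))
    return math.hypot(dra, dec1 - dec2)

db.create_function("ANG_DIST", 4, ang_dist_deg)

# Spatial filtering now happens inside the query, not in client code:
near = db.execute(
    "SELECT flux FROM src WHERE ANG_DIST(ra, dec, 180.0, 0.0) < 0.01"
).fetchall()
```

In a server-side database the same pattern (stored procedures, user-defined functions) keeps petabytes where they are and ships only the answers.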


Action item: Foster data management, data analysis, and data visualization algorithms & tools.


Let 100 Flowers Bloom

  • Comp-X has some nice tools:
    • Beowulf
    • Condor
    • BOINC
    • Matlab
  • These tools grew from the community
  • It’s HARD to see a common pattern
    • Linux vs. FreeBSD: why was Linux more successful? Community, personality, timing, …?
  • Lesson: let 100 flowers bloom.


Talk Goals (recap).


All Scientific Data Online

  • Many disciplines overlap and use data from other sciences.
  • The Internet can unify all literature and data
  • Go from literature to computation to data and back to literature.
  • Information at your fingertips, for everyone, everywhere
  • Increase scientific information velocity
  • Huge increase in science productivity


Unlocking Peer-Reviewed Literature

  • Agencies and foundations are mandating that research be public domain.
    • NIH ($30 B/y, 40k PIs, …) (see http://www.taxpayeraccess.org/)
    • Wellcome Trust
    • Japan, China, Italy, South Africa, …
    • Public Library of Science
  • Other agencies will follow NIH


How Does the New Library Work?

  • Who pays for storage and access (an unfunded mandate)?
    • It’s cheap: 1 milli-dollar per access
  • But… curation is not cheap:
    • Author/Title/Subject/Citation/…
    • Dublin Core is great, but…
    • NLM has a 6,000-line XSD for documents: http://dtd.nlm.nih.gov/publishing
    • Need to capture document structure from the author
      • Sections, figures, equations, citations, …
      • Automate curation
    • NCBI-PubMedCentral is doing this
      • Preparing for 1M articles/year
    • Automate it!
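Automated curation means extracting structure (title, sections, citations) from the author's file rather than re-keying it. A minimal sketch; the XML shape below is a toy stand-in, not the actual 6,000-line NLM schema:

```python
# Sketch: pulling document structure out of a structured manuscript.
# The element names here are a toy stand-in for the NLM tagset.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<article>
  <title>eScience</title>
  <sec><title>Methods</title><p>See <xref rid="b1"/>.</p></sec>
  <ref id="b1">Gray 2006</ref>
</article>
""")

metadata = {
    "title": doc.findtext("title"),
    "sections": [s.findtext("title") for s in doc.findall("sec")],
    "references": [r.get("id") for r in doc.findall("ref")],
}
# metadata == {"title": "eScience", "sections": ["Methods"], "references": ["b1"]}
```

Once authoring tools emit structure like this, the archive's curation step becomes a mechanical transform instead of manual labor.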


Pub Med Central International

  • “Information at your fingertips”
  • Deployed in the US, China, England, Italy, South Africa, Japan
  • UK PMCI: http://ukpmc.ac.uk/
  • Each site can accept documents
  • Archives are replicated
  • Federated through web services
  • Working to integrate Word/Excel/… with PubMedCentral, e.g. WordML, XSD
  • To be clear: NCBI is doing 99.99% of the work.







Overlay Journals

[Diagram: articles, data sets, title pages, and comments live in data archives; a journal management system and a journal collaboration system sit on top of them.]

  • Articles and data in public archives
  • Journal title page in a public archive
  • All covered by a Creative Commons License
    • permits: copy/distribute
    • requires: attribution
    • http://creativecommons.org/

Action item: Do for other sciences what NLM has done for BIO (Genbank-PubMedCentral, …).



Better Authoring Tools

  • Extend Authoring tools to

    • capture document metadata (NLM tagset)

    • represent documents in standard format

      • WordML (ECMA standard)

    • capture references

    • Make active documents (words and data).

  • Easier for authors

  • Easier for archives



Conference Management Tool

  • Currently a conference peer-review system (~300 conferences)

    • Form committee

    • Accept Manuscripts

    • Declare interest/recuse

    • Review

    • Decide

    • Form program

    • Notify

    • Revise


Publishing Peer Review

Add publishing steps and improve the author-reader experience.

  • Existing peer-review steps:
    • Form committee
    • Accept manuscripts
    • Declare interest/recuse
    • Review
    • Decide
    • Form program
    • Notify
    • Revise
    • Publish
  • New capabilities:
    • Manage versions
    • Capture data
    • Interactive documents
    • Capture workshop presentations and proceedings
    • Capture the classroom: ConferenceXP
    • Moderated discussions of published articles
    • Connect to archives


Why Not a Wiki?

  • Peer review is different
    • It is very structured
    • It is moderated
    • There is a degree of confidentiality
  • A wiki is egalitarian
    • It’s a conversation
    • It’s completely transparent
  • Don’t get me wrong:
    • Wikis are great
    • SharePoints are great
    • But… peer review is different.
    • And, incidentally, review of proposals, projects, … is more like peer review.
  • Let’s have a moderated wiki for the published literature; PLoS ONE is doing this


Action item: Foster new document authoring and publication models and tools.


So… What about Publishing Data?

  • The answer is 42.
  • But…
    • What are the units?
    • How precise? How accurate? 42.5 ± 0.01
    • Show your work: data provenance
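"Show your work" can be made mechanical: hash the inputs and log every processing step alongside the result. A minimal provenance sketch (the step names and toy data are invented):

```python
# Sketch: minimal provenance record -- hash the inputs and log each
# processing step, so "42" can be traced back to how it was computed.
import hashlib
import json

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def run_step(name, func, data, log):
    """Apply one processing step and record what went in and what came out."""
    out = func(data)
    log.append({
        "step": name,
        "input_sha256": sha256(data),
        "output_sha256": sha256(out),
    })
    return out

log = []
raw = b"3 7 11 21"
cleaned = run_step("scrub", lambda d: d.replace(b"  ", b" "), raw, log)
total = run_step("reduce", lambda d: str(sum(map(int, d.split()))).encode(), cleaned, log)

record = json.dumps({"result": total.decode(), "provenance": log}, indent=2)
# total == b"42"; the record states exactly which steps produced it.
```

A reader in 100 years who has the raw data and the step list can recompute the hashes and verify the chain, which is what reproducibility asks for.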


Thought Experiment

  • You have collected some data and want to publish science based on it.
  • How do you publish the data so that others can read it and reproduce your results in 100 years?
    • How do you document the collection process?
    • How do you document the data processing (scrubbing & reducing the data)?
    • Where do you put it?



Objectifying Knowledge

  • This requires agreement about

    • Units: cgs

    • Measurements: who/what/when/where/how

    • CONCEPTS:

      • What’s a planet, star, galaxy,…?

      • What’s a gene, protein, pathway…?

  • Need to objectify science:

    • what are the objects?

    • what are the attributes?

    • What are the methods (in the OO sense)?

  • This is mostly Physics/Bio/Eco/Econ/... But CS can do generic things
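What "objects, attributes, and methods in the OO sense" looks like in practice can be sketched with one record type. The `Star` class below is hypothetical; a real schema would come from the domain's controlled vocabulary, not from CS:

```python
# Sketch: "objectifying" a concept as object + attributes + methods.
# The Star record and its fields are illustrative, not a real schema.
from dataclasses import dataclass

@dataclass
class Star:
    # attributes, with agreed units (here: degrees and magnitudes)
    ra_deg: float
    dec_deg: float
    mag_r: float

    # a method in the OO sense: derived knowledge lives with the object
    def brighter_than(self, limit: float) -> bool:
        return self.mag_r < limit  # smaller magnitude means brighter

s = Star(ra_deg=180.0, dec_deg=0.5, mag_r=14.2)
assert s.brighter_than(15.0)
```

The painful part the next slide warns about is not writing such a class; it is getting domain experts to agree on what the attributes and concepts are.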


Warning! Painful discussions ahead:

  • The “O” word: Ontology
  • The “S” word: Schema
  • The “CV” words: Controlled Vocabulary
  • Domain experts do not agree


The Best Example: Entrez-GenBank (http://www.ncbi.nlm.nih.gov/)

[Diagram: publishers feed PubMed abstracts into PubMed; genome centers feed complete genomes into Entrez Genomes; linked resources include Taxon (phylogeny), MMDB (3-D structure), and nucleotide and protein sequences.]

  • Sequence data deposited with Genbank

  • Literature references Genbank ID

  • BLAST searches Genbank

  • Entrez integrates and searches

    • PubMedCentral

    • PubChem

    • Genbank

    • Proteins, SNP,

    • Structure,..

    • Taxonomy…

    • Many more


Publishing Data

  Roles:        Authors         Publishers        Curators         Consumers
  Traditional:  Scientists      Journals          Libraries        Scientists
  Emerging:     Collaborations  Project www site  Bigger archives  Scientists

  • Exponential growth:

    • Projects last at least 3-5 years

    • Data sent upwards only at the end of the project

    • Data will never be centralized

  • More responsibility on projects

    • Becoming Publishers and Curators

  • Data will reside with projects

    • Analyses must be close to the data


Data Pyramid

  • Very extended distribution of data sets: data on all scales!
  • Most datasets are small and manually maintained (Excel spreadsheets)
  • Total volume is dominated by multi-TB archives
  • But small datasets have real value
  • Most data is born digital: collected via electronic sensors or generated by simulators.


Data Sharing/Publishing

  • What is the business model (reward/career benefit)?
  • Three tiers (power law!):
    (a) big projects
    (b) value-added, refereed products
    (c) ad-hoc data, on-line sensors, images, outreach info
  • We have largely done (a)
  • Need a “Journal for Data” to solve (b)
  • Need a “VO-Flickr” (a simple interface) for (c)
  • Mashups are emerging in science
  • Need an integrated environment for ‘virtual excursions’ for education (C. Wong)


Action item: Foster digital data libraries (not metadata, real data) and their integration with the literature.


Talk Goals (recap).


Backup Slides


Astronomy

  • Help build the world-wide telescope
    • All astronomy data and literature online and cross-indexed
    • Tools to analyze the data
  • Built SkyServer.SDSS.org
  • Built an analysis system
    • MyDB
    • CasJobs (batch jobs)
  • OpenSkyQuery: a federation of ~20 observatories
  • Results:
    • It works and is used every day
    • Spatial extensions in SQL 2005
    • A good example of a Data Grid
    • Good examples of web services


World Wide Telescope / Virtual Observatory (http://www.us-vo.org/, http://www.ivoa.net/)

  • Premise: most data is (or could be) online
  • So the Internet is the world’s best telescope:
    • It has data on every part of the sky
    • In every measured spectral band: optical, x-ray, radio, …
    • As deep as the best instruments (of 2 years ago)
    • It is up when you are up. The “seeing” is always great (no working at night, no clouds, no moons, …)
    • It’s a smart telescope: it links objects and data to the literature on them.


Why Astronomy Data?

[Image strip: the same sky seen by ROSAT (~keV), DSS optical, IRAS 25 µm and 100 µm, 2MASS 2 µm, GB 6 cm, WENSS 92 cm, and NVSS 20 cm.]

  • It has no commercial value

    • No privacy concerns

    • Can freely share results with others

    • Great for experimenting with algorithms

  • It is real and well documented

    • High-dimensional data (with confidence intervals)

    • Spatial data

    • Temporal data

  • Many different instruments from many different places and many different times

  • Federation is a goal

  • There is a lot of it (petabytes)


Time and Spectral Dimensions: The Multiwavelength Crab Nebula

X-ray, optical, infrared, and radio views of the nearby Crab Nebula, now in a state of chaotic expansion after a supernova explosion first sighted in 1054 A.D. by Chinese astronomers.

Slide courtesy of Robert Brunner @ Caltech.


SkyServer.SDSS.org

  • A modern archive
    • Access to the Sloan Digital Sky Survey spectroscopic and optical surveys
    • Raw pixel data lives in file servers
    • Catalog data (derived objects) lives in a database
    • Online query to any and all
  • Also used for education
    • 150 hours of online astronomy
    • Implicitly teaches data analysis
  • Interesting things
    • Spatial data search
    • Client query interface via Java applet
    • Query from Emacs, Python, …
    • Cloned by other surveys (a template design)
    • Web services are the core of it.


SkyServer (SkyServer.SDSS.org)

  • Like the TerraServer, but looking the other way: a picture of ¼ of the universe
  • Sloan Digital Sky Survey data: pixels + data mining
  • About 400 attributes per “object”
  • Spectrograms for 1% of objects



Demo of SkyServer

  • Shows standard web server

  • Pixel/image data

  • Point and click

  • Explore one object

  • Explore sets of objects (data mining)


SkyQuery (http://skyquery.net/)

  • Distributed query tool using a set of web services
  • Many astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England)
  • Has grown from 4 to 15 archives, now becoming an international standard
  • Web service poster child
  • Allows queries like:

    SELECT o.objId, o.r, o.type, t.objId
    FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
    WHERE XMATCH(o, t) < 3.5
      AND AREA(181.3, -0.76, 6.5)
      AND o.type = 3 AND (o.i - t.m_j) > 2
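The XMATCH predicate above pairs objects from two catalogs by sky position. A minimal sketch of the idea (flat-sky approximation and a brute-force loop; the real SkyQuery match works on the sphere with per-survey astrometry, and the catalogs and IDs below are made up):

```python
# Sketch of what XMATCH computes: pair objects from two catalogs that lie
# within 3.5 arcseconds of each other (flat-sky approximation).
import math

ARCSEC = 1.0 / 3600.0  # degrees

def xmatch(cat1, cat2, radius_arcsec=3.5):
    """Brute-force positional cross-match; O(N*M), fine for a sketch."""
    r = radius_arcsec * ARCSEC
    matches = []
    for id1, ra1, dec1 in cat1:
        for id2, ra2, dec2 in cat2:
            dra = (ra1 - ra2) * math.cos(math.radians(dec1))
            if math.hypot(dra, dec1 - dec2) < r:
                matches.append((id1, id2))
    return matches

sdss = [("s1", 181.3000, -0.7600), ("s2", 181.4000, -0.7000)]
twomass = [("t1", 181.3001, -0.7601), ("t2", 185.0000, -0.7000)]
matches = xmatch(sdss, twomass)
# matches == [("s1", "t1")]: those two are ~0.5 arcsec apart; nothing else pairs.
```

At survey scale this loop is exactly the O(N²) problem from the Data Analysis slide, which is why the federated archives do the match with spatial indices inside their databases.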


SkyQuery Structure

[Diagram: the SkyQuery portal federates SkyNodes (2MASS, INT, SDSS, FIRST) plus an ImageCutout service.]

  • Each SkyNode publishes
    • a schema web service
    • a database web service
  • The portal
    • plans the query (2-phase)
    • integrates the answers
    • is itself a web service


SkyServer/SkyQuery Evolution: MyDB and Batch Jobs

  • Problem: need multi-step data analysis (not just a single query).
    Solution: allow personal databases on the portal.
  • Problem: some queries are monsters.
    Solution: “batch schedule” them on the portal and deposit the answer in a personal database.


Ecosystem Sensor Net: LifeUnderYourFeet.Org

  • Small sensor net monitoring soil
  • Sensors feed a database
  • Helping build a system to collect & organize the data
  • Working on data analysis tools
  • A prototype for other LIMS (Laboratory Information Management Systems)


RNA Structural Genomics

  • Goal: predict secondary and tertiary structure from sequence; deduce the tree of life.
  • Technique: analyze sequence variations sharing a common structure across the tree of life
  • Representing structurally aligned sequences is a key challenge
  • Creating a database-driven alignment workbench accessing public and private sequence data


VHA Health Informatics

  • VHA: the largest standardized electronic medical records system in the US
    • 7 million enrollees, 5 million patients
  • Design, populate, and tune a ~20 TB data warehouse and analytics environment
  • Evaluate population health and treatment outcomes; support epidemiological studies
  • Example milestones:
    • 1 billionth vital sign loaded in April ’06
    • 30 minutes to a population-wide obesity analysis (next slide)
    • Discovered seasonality in blood pressure (NEJM, fall ’06)


HDR Vitals-Based Body Mass Index Calculation on the VHA FY04 Population

Source: VHA Corporate Data Warehouse

[Table: patient counts by BMI category (category labels not captured): 23,876 (0.7%); 701,089 (21.6%); 1,177,093 (36.2%); 1,347,098 (41.5%); total patients 3,249,156 (100%).]
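The calculation behind a population-wide analysis like this is simple once the vitals sit in one warehouse: compute BMI per patient and bucket. A minimal sketch using the standard BMI formula and WHO cut-points; the four-patient cohort is invented:

```python
# Sketch: the per-patient calculation behind a population-wide BMI analysis.
# Formula and WHO cut-points are standard; the cohort below is invented.
def bmi(weight_kg: float, height_m: float) -> float:
    return weight_kg / height_m ** 2

def bmi_class(b: float) -> str:
    if b < 18.5:
        return "underweight"
    if b < 25:
        return "normal"
    if b < 30:
        return "overweight"
    return "obese"

# With vitals in one table, the population query is a single pass:
vitals = [(50, 1.80), (70, 1.75), (85, 1.78), (110, 1.70)]  # (kg, m)
counts = {}
for w, h in vitals:
    cls = bmi_class(bmi(w, h))
    counts[cls] = counts.get(cls, 0) + 1
# counts == {"underweight": 1, "normal": 1, "overweight": 1, "obese": 1}
```

Run inside the warehouse over millions of rows, this is the "30 minutes to a population-wide obesity analysis" the previous slide mentions.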

