Sustainable Access to the Records of Science: Biomedicine

3rd International Digital Curation Conference, Washington DC • Tim Hubbard, Wellcome Trust Sanger Institute

Background • Wellcome Trust Sanger Institute • Sequenced 1/3 of the human genome • 1000 m² data centre; ~1 petabyte storage • Wellcome Trust • World's 2nd richest charity ($22 billion) • Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, UK • Wellcome Trust Sanger Institute • European Bioinformatics Institute (c.f. NCBI)

Biomedical data • Old • Mainly small scale experiments around genes, proteins: publication linked to submission of some data types into global repositories (sequence and protein structure) • Small patient collections organised by individual clinicians • New • Large scale biological datasets: genomes, high throughput functional data; images etc. • Increasingly large patient collections: UK biobank; digitalisation of patient records

Topics • Data openness • Data volume • Data federation • Cultural change

The era of sequencing genomes
Organism        Size (Mb)   Genes    Gene density   Type              Completion date
H. influenzae   2           1,700    1/1kb          Bacterium         1995
Yeast           13          6,000    1/2kb          Eukaryotic cell   1996
Nematode        100         18,000   1/6kb          Metazoan          1998
Human           3000        30,000   1/75kb         Mammal            2000/3
Mouse, fish (3), rat, another worm, fly … shotgun drafts in 2001/2

Human genome race won by public project: open access for all

International agreement on data release • “All human genomic sequence information should be freely available and in the public domain in order to encourage research and development and to maximise its benefit to society.” The Bermuda Statement, February 1996 • Assemblies of 1-2 kb are deposited in public databases (GenBank, EBI) every 24 hours • No patents are filed

Evolution of Data Release Policy • Bermuda principles reaffirmed at January 2003 NIH/WT meeting*, leading to new NIH/WT policy to divide funding into two classes: • R01 projects: • Competitive • Release data on publication • “Community Projects” • Non-competitive • Managed • Release data in real time *Nature 421, 875 (2003)

Open data partnerships • Example: The SNP Consortium (TSC): • 12 companies & Wellcome Trust • $3 million membership • >1 million SNPs freely accessible, August 2000 • Why? • The more people analysing a block of data, the more valuable it is. • Internet context of open data: • Creates a “level playing field” • Aids capacity building

Problems to solve • Cultural attitudes towards data sharing • New ways of allocating credit • Adjustment to more competitive environment • Practical issues of data sharing • Standardization of data sets • Engineering to allow distributed data access • Stable infrastructure funding to support data archives

Raw sequence data: the last few years [Figure: growth of raw sequence data; 13 months doubling time; annotations: Technical delay, Human genome]

A new repository: the Trace Archive [Figure: growth of the Trace Archive; 11 months doubling time]
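To make the doubling times quoted on these two slides (13 and 11 months) more concrete, the following is a minimal sketch of the arithmetic, assuming steady exponential growth. The function names are illustrative only and do not come from the talk.

```python
import math


def annual_growth_factor(doubling_months: float) -> float:
    """Factor by which the data volume multiplies over 12 months."""
    return 2 ** (12 / doubling_months)


def months_to_grow(factor: float, doubling_months: float) -> float:
    """Months needed for the volume to grow by `factor`."""
    return doubling_months * math.log2(factor)


# Doubling times taken from the slides; steady growth is an assumption.
for label, months in [("Raw sequence data", 13), ("Trace Archive", 11)]:
    per_year = annual_growth_factor(months)
    years_for_1000x = months_to_grow(1000, months) / 12
    print(f"{label}: x{per_year:.2f} per year, "
          f"x1000 in about {years_for_1000x:.1f} years")
```

Under this assumption, a doubling time just under a year means an archive roughly doubles its storage requirement every year, which is the planning figure that matters for sustaining access.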

How is exponential growth sustained? • Reduction in cost • Improvement in technology • Increase in demand • Increase in volume • This clearly doesn’t work for all technologies. Why computing and genome sequence data? The key is that these are information activities; there is no inherent physical outcome, or constraint.

Thirst for natural variation data • Reference genome • Sequence variation (collecting SNPs) • Sanger ExoSeq project: 35,000 novel rare SNPs identified from exons from 14 human chromosomes in 48 Caucasian individuals. • Cancer Genome Project: Greenman et al. Patterns of somatic mutation in human cancer genomes. Nature 446, 153 (2007). • Haplotypes (genotyping from reference SNPs) • HapMap project • Wellcome Trust Case Control Consortium (WTCCC). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661 (2007) • Copy number variations (CNVs) • Redon et al. Global variation in copy number in the human genome. Nature 444, 444 (2006) • Multiple complete genomes of individuals

DNA sequencing revolution • Genome sequencing costs are falling very fast • ABI 3730: “old” Sanger technology • 80kb per run in ~800bp reads, ~$500/Mb • 454: introduced in 2005 • 100Mb per run in ~250bp reads, ~$100/Mb • Illumina/Solexa: introduced in 2006; ABI SOLiD in 2007 • 1Gb per run in ~35bp reads, ~$5/Mb • Substantial informatics requirements • Raw output of each run is 1 TB (one run every 3 days) • Storage of output after processing (trace format) from 30 Illumina machines: ~200 TB per year
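The per-run and per-megabase figures above can be turned into rough per-genome numbers. The sketch below is a back-of-envelope calculation only: it assumes a 3000 Mb human genome (the size given in the genome table earlier in the talk), 1 kb = 1000 bases, and single-fold (1x) coverage. Real projects sequence each base many times over, so actual costs are a multiple of these figures.

```python
GENOME_MB = 3000  # human genome size in Mb (assumption taken from the genome table)

# name: (bases per run, read length in bp, cost in $ per Mb), as quoted on the slide
PLATFORMS = {
    "ABI 3730": (80_000, 800, 500),
    "454": (100_000_000, 250, 100),
    "Illumina/Solexa": (1_000_000_000, 35, 5),
}


def reads_per_run(bases_per_run: int, read_length: int) -> int:
    """Approximate number of reads produced by one run."""
    return bases_per_run // read_length


def cost_per_fold(cost_per_mb: float, genome_mb: int = GENOME_MB) -> float:
    """Cost of raw sequence equal to one genome length (1x coverage)."""
    return cost_per_mb * genome_mb


def runs_per_fold(bases_per_run: int, genome_mb: int = GENOME_MB) -> float:
    """Number of runs needed to produce one genome length of raw sequence."""
    return genome_mb * 1_000_000 / bases_per_run


for name, (bases, read_len, dollars_per_mb) in PLATFORMS.items():
    print(f"{name}: {reads_per_run(bases, read_len):,} reads/run, "
          f"{runs_per_fold(bases):,.0f} runs and "
          f"${cost_per_fold(dollars_per_mb):,.0f} per 1x genome")
```

The trade-off the slide points to is visible in the numbers: the newer platforms cut cost and run count by orders of magnitude, but deliver the sequence as far more, far shorter reads, which is where the informatics load comes from.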

Changing landscape of human healthcare research • Genome sequencing costs are falling fast • 2000: $1,000,000,000 per genome • 2004: $10,000,000 per genome • 2008: $100,000 per genome • 2012?: $1,000 per genome? • Sequencing expected to displace genotyping as costs drop • 1,000,000-SNP chips already allow whole-genome association studies through genotyping, but will not necessarily identify causal variants. • Companies are already starting to sell personal genome services (23andMe, deCODE) • Future human health research will be increasingly driven by the availability of this data
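The cost-per-genome series on this slide falls 100-fold every four years. As a rough illustration (assuming a smooth exponential decline between the quoted points, which the slide does not claim), the implied halving time can be worked out as follows. The 2012 value is the slide's own projection, not a measurement.

```python
import math

# Cost per genome in dollars, as quoted on the slide; 2012 is a projection.
COST_PER_GENOME = {2000: 1e9, 2004: 1e7, 2008: 1e5, 2012: 1e3}


def halving_time_months(year_a: int, cost_a: float,
                        year_b: int, cost_b: float) -> float:
    """Months for the cost to halve, assuming a steady exponential decline."""
    fold_reduction = cost_a / cost_b
    return 12 * (year_b - year_a) / math.log2(fold_reduction)


years = sorted(COST_PER_GENOME)
for a, b in zip(years, years[1:]):
    months = halving_time_months(a, COST_PER_GENOME[a], b, COST_PER_GENOME[b])
    print(f"{a}-{b}: cost halves roughly every {months:.1f} months")
```

On these figures the cost halves roughly every seven months, noticeably faster than the 11 to 13 month doubling times quoted for raw sequence data volume.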

Future evolution of healthcare [Figure labels: Human Genome Project; Functional annotation; Human re-sequencing; Other genomes; HapMap; Genomic Information (example region around PPARg, chromosome 3, 11,500-13,500 kb); Individual human sequence; Personal Genetic Information (owned by individual), PGI id: 5910322 – 61215923014; Functional variants; Structural context; Biological consequences; Personal Genetic Assessment (Issued: 01 MAR 07, Recommended next check: 28 FEB 10); Medical consultation; Clinical decision]

Data access • Website • Data mining interfaces • Data downloads • Public APIs / Open Source code • Federation technologies

Genome resources deliver more than sequence [Figure: a view of less than 1/1,000,000 of the genome sequence]

• Gene transcripts (Ensembl identifies 22,287 genes) • Evidence • Other features • ~80 different types of information • Under 10% of the data in Ensembl is sequence

Modern day maps: topography…

… plus annotation

Data integration • Complete genomes provide the framework to pull all biological data together such that each piece says something about biology as a whole • Biology is too complex for any organisation to have a monopoly of ideas or data • The more organisations provide data or analysis separately, the harder it becomes for anyone to make use of the results

Distributed Annotation [Figure labels: Database providers; External Contributors; Servers (Sequence, Annotation); Coordinate Synchronisation; Programs; Viewer; Users; html; xml] Hubbard & Birney, Open annotation offers a democratic solution to genome sequencing (2000) Nature, 403, 825. See: http://biodas.org/ Original proposal: Lincoln Stein et al.

DAS v Web • Web model: links • Different Web sites • Different interfaces • No integration • DAS model: • Different DAS sites (several DAS servers feeding one viewer) • Automatic integration • Single interface
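The "automatic integration, single interface" idea can be illustrated with a small sketch: features held by different providers are combined on the client side purely because they share a coordinate system. Everything here (the `Feature` class, the source names and the coordinates) is invented for illustration and is not taken from any real DAS server.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Feature:
    source: str   # which (hypothetical) server supplied the feature
    chrom: str
    start: int
    end: int
    label: str


def overlaps(f: Feature, chrom: str, start: int, end: int) -> bool:
    """True if the feature overlaps the requested region (inclusive coordinates)."""
    return f.chrom == chrom and f.start <= end and f.end >= start


def integrate(sources: Dict[str, List[Feature]],
              chrom: str, start: int, end: int) -> List[Feature]:
    """Merge features from every source that fall in one region, ordered by position."""
    hits = [f for feats in sources.values() for f in feats
            if overlaps(f, chrom, start, end)]
    return sorted(hits, key=lambda f: (f.start, f.end, f.source))


# Invented example data from two independent providers.
SOURCES = {
    "genes": [Feature("genes", "3", 12_000, 12_500, "gene A"),
              Feature("genes", "7", 100, 900, "gene B")],
    "variants": [Feature("variants", "3", 12_250, 12_250, "SNP 1"),
                 Feature("variants", "3", 20_000, 20_000, "SNP 2")],
}

for f in integrate(SOURCES, "3", 11_500, 13_500):
    print(f"{f.chrom}:{f.start}-{f.end}\t{f.source}\t{f.label}")
```

No provider needs to know about any other: agreeing on the reference coordinates is enough for a single viewer to show all of them together.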

DAS in a nutshell • Standardized set of web services • Reference servers (the sequence) • Annotation servers (features: chr:start-end) • Alignment servers (chr:start-end matches chr:start-end) • Identifier based servers (ref item X rather than coordinate)
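As a hedged sketch of what a request to an annotation server might look like: the code below builds a features URL and parses a DAS-style XML reply. The host `das.example.org`, the source name `demo` and the sample XML are all made up, and the URL and element layout reflect a reading of the DAS 1.x conventions (a `features` command taking `segment=chr:start,end`), so they should be checked against the specification at biodas.org before use. No network access is attempted.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def features_url(server: str, source: str, chrom: str, start: int, end: int) -> str:
    """Build a DAS-style features request for one region (URL layout is an assumption)."""
    query = urlencode({"segment": f"{chrom}:{start},{end}"}, safe=":,")
    return f"{server}/das/{source}/features?{query}"


# Hand-written sample reply for illustration only; not output from a real server.
SAMPLE_REPLY = """<DASGFF>
  <GFF version="1.0" href="http://das.example.org/das/demo/features">
    <SEGMENT id="3" start="11500" stop="13500">
      <FEATURE id="f1" label="example feature">
        <TYPE id="gene">gene</TYPE>
        <START>12000</START>
        <END>12500</END>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""


def parse_features(xml_text: str) -> list:
    """Extract (segment, id, type, start, end) records from a DAS-style reply."""
    root = ET.fromstring(xml_text)
    records = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.iter("FEATURE"):
            records.append({
                "segment": segment.get("id"),
                "id": feature.get("id"),
                "type": feature.findtext("TYPE"),
                "start": int(feature.findtext("START")),
                "end": int(feature.findtext("END")),
            })
    return records


print(features_url("http://das.example.org", "demo", "3", 11500, 13500))
print(parse_features(SAMPLE_REPLY))
```

Because every server answers the same kind of request with the same kind of document, a client written once can query reference, annotation and alignment servers alike.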

Split data and presentation • Databases responsible for curating data and serving it as primitive datatypes defined by open standards (high cost) • Different front ends or components of front ends compete for users (development of each low cost) c.f. browsers.

Data Services

Utility of bioinformatics [Figure labels: Scientific impact; Too little bioinformatics; Too many databases; Too diverse interfaces]

The Human Genome Project: Competitive Collaboration [Figure labels: Academics; Company; Open Sequence Data; Competition?; Evaluators; Investors; Funders; Weekly conference calls]

Market properties of openness • “Optimal allocation of resources based on transparency of information” • Evaluation can be continuous and in parallel with research • Researchers can stay more focused on their work, rather than writing heavy grants and reports • Potential innovation and competition in evaluation as data is open to all

Openness makes assessment easier • Researchers are frequently assessed on where their papers are published • Nature and Science have some of the highest impact factors • Crude measure: many papers published in top journals are never cited • Open access publication has the potential to give granting agencies much more information for evaluating research output

Cultural changes • Text for Rfam (RNA family data) now in Wikipedia (think: community annotation) • New generation of scientists will mostly have profiles on Facebook et al. • What will be the effect on science of combining: • open access publishing • open data access • social networking?

What is the problem with government? • Where are open ideas changing policy? Why? • +ve Switching publishing to Open Access publishing model • Less risky than most since self-sustaining • Value protected through peer review • -ve Long-term support for resources that cost money • e.g. science infrastructure • Continuously expanding; mechanisms for merging or eliminating resources of little value are poor, or perceived to be poor • Genuine worries about whether change will affect stability • Potential unknown pernicious incentives • If you change the system from one that currently partly works to one that breaks, we get blamed • Lack of technically aware people in government? • Culture of contracting out and lawyers (self-reinforcing)

Public computer projects • Hospital and Software Development Company • “Competition” based on: • Call for proposals • Contracts • Lawyers • Result: • Closed source solution • Disconnected from users [Figure labels: Open Source Code; Hospital; Doctors; Investors; Treasury; Continuous evaluation by Doctors; Competitive Collaboration; Innovation]

Wellcome Trust commissioned two reports on Open Access publishing • September 2003: “An Economic Analysis of Scientific Research Publishing” • April 2004: “Costs and Business Models in Scientific Research Publishing” • Conclusions: Open access could cut scientific publishing costs by 30 per cent and still provide a viable business model • UK Parliamentary select committee is now investigating the issue of open access

Open access publishing allows anyone to analyse and organise scientific text

New context of data sharing • Internet facilitates easy data sharing • Previously, practical issues were as big an obstacle as anything else • It's our data, our research: we paid for it! • Both funders and public beginning to realise what they have been missing • Complex problems require open models • Some problems will only be solved by collaborative analysis of shared data collections
