Elixir WP2 Data Resources

Presentation Transcript


  1. Elixir WP2 Data Resources

  2. The stuff of ELIXIR: The data resources workpackage is at the heart of the project, dealing with the very stuff for which ELIXIR was conceived.

  3. The data: genes and genomes; transcripts and protein sequences; patterns of gene expression; three-dimensional molecular structures; interactions, pathways and processes; metabolites and drugs. The choreography of molecular activity in the cell.

  4. Benefits: medicine and health; personal care; agriculture; food science; brewing and fermentation; forestry; fishery; environment.

  5. Research Domain (S. Palcy & A. de Daruvar, Université Bordeaux 2, France): Biotechnology; Systems Biology; Computational biology; Pharmacology; Pharmacy; Toxicology; Biomedical informatics; Biostatistics; Chemoinformatics; Consumer goods company, Safety and Environment Assurance Centre; Pharmacognosy; Statistics; Physics; Wastewater treatment.

  6. Databases – expected use

  7. The status quo – user survey

  8. Total European effort: 200 databases; 700 people; 100 institutions. Total investment to date: €308 million. Annual cost: €35 million. About 1/3 of responders report NO costs. 60% of polled databases didn’t respond; it is almost certain that the non-responders are smaller on average. 60 million web hits per month. The EBI reports €22 million per year directly on databases; certainly EBI’s reporting is more complete than other sites.

  9. Comparative cost: The cost of the science supported by the infrastructure is possibly two orders of magnitude higher than that of the proposed ELIXIR infrastructure.
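As a rough check on the scale of these totals, the averages they imply can be worked out directly. The sketch below uses only the figures quoted on the slide; the constant and function names are illustrative, and the averages are indicative only, since about a third of responders reported no costs.

```python
# Figures quoted on the slide (euros, rounded as presented there).
DATABASES = 200
PEOPLE = 700
TOTAL_INVESTMENT_EUR = 308_000_000
ANNUAL_COST_EUR = 35_000_000
EBI_ANNUAL_COST_EUR = 22_000_000


def implied_averages():
    """Return simple averages implied by the slide's totals."""
    return {
        "investment_per_database_eur": TOTAL_INVESTMENT_EUR / DATABASES,
        "annual_cost_per_database_eur": ANNUAL_COST_EUR / DATABASES,
        "annual_cost_per_person_eur": ANNUAL_COST_EUR / PEOPLE,
        "ebi_share_of_annual_cost": EBI_ANNUAL_COST_EUR / ANNUAL_COST_EUR,
    }


if __name__ == "__main__":
    for name, value in implied_averages().items():
        print(f"{name}: {value:,.3f}")
```

On these figures the average is about €1.54 million invested per database to date and €175 thousand per database per year, with the EBI accounting for roughly 63% of the reported annual cost.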

  10. UK 2005 ~ €3.8 Billion

  11. User survey

  12. 531 databases surveyed: 208 responded, 323 did not. Chart panels: Responders; Non-responders. Legend: Dead = no update since 2005.
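The survey counts on this slide can be checked against the earlier statement that 60% of polled databases did not respond. The sketch below is only that arithmetic; the function name is illustrative.

```python
# Survey counts quoted on the slide.
SURVEYED = 531
RESPONDED = 208
NOT_RESPONDED = 323


def response_rates(surveyed, responded, not_responded):
    """Return (response rate, non-response rate) as fractions of those surveyed."""
    if responded + not_responded != surveyed:
        raise ValueError("responders and non-responders must sum to the total surveyed")
    return responded / surveyed, not_responded / surveyed


if __name__ == "__main__":
    rate, non_rate = response_rates(SURVEYED, RESPONDED, NOT_RESPONDED)
    print(f"responded: {rate:.1%}, did not respond: {non_rate:.1%}")
```

The counts are consistent: 208 + 323 = 531, and the non-response rate comes to about 61%, in line with the 60% quoted earlier.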

  13. Databases by country

  14. Security of the databases

  15. Subject matter - keywords assigned

  16. Specialised Molecular Data Resources: Galperin (2005 NAR). In 2007 more than 900 databases; ~30% in Europe; most use core resources as reference data.

  17. User survey

  18. 200 databases, 100 institutions. EBI: 27 databases. Most institutions: 1 database. Chart axes: cumulative databases; N institutions.

  19. Costs per database. Charts: cumulative costs to date; cumulative annual costs. Axes: €K; cumulative €K; N databases.

  20. Gigabytes per database. Chart axes: gigabytes; N databases.

  21. Modalities: A further 20% intend to offer Web Services.

  22. Specialist/General

  23. Data downloadable in their entirety: Yes, 113; Technical limits, 67; Usage restrictions, 28. 32 charge commercial users; 48 restrict reuse; 23 report confidentiality constraints.

  24. Sources of funding National Institutional 25 22 18 12 12 18 16 Some non-European 14 Some commercial 12 European 38 No-formal funding

  25. Cumulative citations per database. Top 30 (in no particular order): CATH, Dali, DSSP, Ensembl, GeneCards, GO, IMGT/HLA, InterPro, MIPS PlantsDB, Pfam, PRINTS, PROSITE, SMR, UniProt, UniProtKB/Swiss-Prot, ArrayExpress, CAZy, CYGD, GOA-UniProt, HGMD, HSSP, MEROPS, miRBase, PDBREPORT, Rfam, SUPERFAMILY, GOLD, Reactome, STRING, BRENDA. Chart axes: cumulative citations; N databases.

  26. Cumulative monthly hits. Chart axes: cumulative hits; N databases.

  27. Unique users. Chart axes: N databases; N users.

  28. Survey: miscellaneous points. Mirrors are the exception. Substantial curation in half of databases (30% do none). 42% consider themselves unique; 58% have comparable partners. Only a fifth (of the 58%) exchange data. 70% collect usage data. Mostly directed at bioinformaticians and biologists/bench scientists. 30-45% of databases admit being incomplete/out-of-date. Users not normally asked to register (<5%). Preferred usage metric = web hits. 106 institutions also host tools.

  29. Scope and nature of database provision: some reflections from the committee
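The "fifth of 58%" figure is a share of a share, so it may help to express it as a fraction of all databases. The sketch below does only that; the names are illustrative, and applying the result to the 208 survey responders is an assumption, since the slide does not state the base for these percentages.

```python
from fractions import Fraction

# Shares quoted on the slide.
HAVE_COMPARABLE_PARTNERS = Fraction(58, 100)  # 58% have comparable partners
EXCHANGE_AMONG_THOSE = Fraction(1, 5)         # "a fifth" of that 58% exchange data


def share_exchanging_data():
    """Share of all databases that exchange data with comparable partners."""
    return HAVE_COMPARABLE_PARTNERS * EXCHANGE_AMONG_THOSE


def estimated_count(n_databases):
    """Approximate number of databases exchanging data, for a caller-supplied base."""
    return round(n_databases * share_exchanging_data())


if __name__ == "__main__":
    share = share_exchanging_data()
    print(f"share of all databases: {float(share):.1%}")
    # Assumption: the percentages refer to the 208 survey responders.
    print(f"approximate count among 208 responders: {estimated_count(208)}")
```

On that reading, only about 12% of all databases exchange data with a comparable partner.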

  30. ELIXIR scope: The committee endorsed ELIXIR’s biomolecular focus and cautioned against over-expansion of that focus; for example, connect to medical data rather than expanding the scope of ELIXIR. Shared ontologies will be crucial to this, and should receive appropriate attention within ELIXIR. Biologically active small molecules are in scope; it is important to have public domain chemical resources available.

  31. Core and non-core distinction: Core databases are complete collections of universal scientific value. Core databases require that the providing institution takes on long-term responsibility. Investigator-led databases: scope and persistence reflect the interests of the research group; not core; funded from appropriate research funds. Examples of core databases: UniProt, EMBL-Bank, MSD, Ensembl and ArrayExpress. Non-core databases can be candidates for core. Mechanism to move databases in and out of core, and to create or discontinue database projects.

  32. Non-core databases: Support databases are built to support the operation of the core databases or to be used in conjunction with them to increase their value. For example, they may provide controlled vocabularies for a range of core databases (say organism names). Investigator-led databases are typically the product of research groups (though they may well be served to external users). Their content reflects the research interests of their provider (e.g., documenting catalytic sites). Specialist databases handle data whose structure cannot easily be represented in the more general database (say immunoglobulins). Derivative/Summarising databases combine and organise data from a range of other databases, such as a non-redundant set of coding sequences.

  33. Consent and confidentiality: Where data can be identified to individuals, access restrictions associated with confidentiality, consent and ethics must be applied. This must not be confused with protectionism.

  34. Data release and publication: Data discussed in a publication should be made available before or at the time of publication. In some domains early publication might actually be a far from complete analysis of a complex data set; even where this is the case, the normal rules should apply for biomolecular data. These norms apply to “conventional”, “hypothesis-driven” research. Projects whose funding is justified by the creation of shared data collections should make their data available as soon as they are useful.

  35. Global perspective: ELIXIR has a European focus, with a global perspective. Typically core resources are embedded in global collaborations and have global data exchange agreements.

  36. Data structure. Diagram labels: EBI; Node; Node; Node; Core; Non-core.

  37. We cannot do everything – we will have to review existing databases and consider proposals for new databases.

  38. New and proposed data resources. Demand: scientific case; user demand; funding agency demand; data generator demand; journal demand; standardisation and connectivity. Appropriateness to ELIXIR: in scope; freely available; community support; arrangements with global peers. Strategic need, e.g., as a European player: Can it be left to another provider? Can ELIXIR be globally competitive? Is persistence assured?

  39. Scale and cost effectiveness. Scale and cost: database size and complexity; cost; staff requirement; data flow rates. Usage: volume of usage; number of users; citations.

  40. Related domains

  41. Large resources in related disciplines. Diagram labels: Core biomolecular resources; Medical data resources; Biodiversity data resources; Chemical data resources; Specialist biomolecular data resource examples; Model organism resource examples. Resources shown: BRENDA; IMGT; Pasteur DBs; SGD; Flybase; MGD; Eumorphia/Phenotypes; Mutants; Mouse Atlas.

  42. Large resources in related disciplines. Diagram labels: Core biomolecular resources; Medical data resources; Biodiversity data resources; Chemical data resources; Specialist biomolecular data resource examples; Model organism resource examples. Resources shown: BRENDA; IMGT; Pasteur DBs; SGD; Flybase; MGD; Eumorphia/Phenotypes; Mutants; Mouse Atlas.

  43. Diagram labels: Medical data resources; Core biomolecular resources.

  44. Principles & Recommendations: Data sharing is the norm in the biomolecular domain. ELIXIR should espouse the strongest possible public domain principles (Question?). Data supported by the infrastructure should be downloadable in their entirety and subject to no restrictions in use and reuse (Question?). Insistence on acknowledgement is acceptable. Prohibiting the distribution of a modified data collection in a form which could be confused with the original is acceptable. Service organisations should exist in a research context. Collaboration and avoidance of duplication in core data archives are essential. Creative competition on services is desirable. Databases produced by research with primary responsibilities only to their research group are not Elixir and should be identified as such (hobby databases). Elixir data resources must connect with their global context. Standardisation and interoperability are crucial.