Who are we? And why do we care?

Data Citation in the Earth and Physical Sciences Sarah Callaghan [sarah.callaghan@stfc.ac.uk] with thanks and acknowledgement to a lot of other people! Developing Data Attribution and Citation Practices and Standards An International Symposium and Workshop August 22-23, 2011.


Presentation Transcript


  1. Data Citation in the Earth and Physical Sciences. Sarah Callaghan [sarah.callaghan@stfc.ac.uk], with thanks and acknowledgement to a lot of other people! Developing Data Attribution and Citation Practices and Standards: An International Symposium and Workshop, August 22-23, 2011

  2. Who are we? And why do we care? ... And what do we know about data? We’re one of the NERC data centres

  3. Some BADC numbers for context
     • Dataset: a collection of files sharing some administrative and/or project heritage.
     • BADC has approximately 150 real datasets (and thousands of virtual datasets).
     • BADC has approximately 200 million files containing thousands of measured or simulated parameters.
     • BADC tries to deploy information systems that describe those data, parameters, projects and files, along with services that allow one to manipulate them …
     • Calendar year 2010: 2800 active users (of 12000 registered) downloaded 64 TB of data in 16 million files from 165 datasets.
     • Less than half of the BADC data consumers are “atmospheric science” users!

  4. What does data mean to us?
     • Data can be anything from:
       • A measurement taken at a single place and time (e.g. water sample, crystal structure, particle collision)
       • Measurements taken at a point over a period of time (e.g. rain gauge measurements, temperature)
       • Measurements taken across an area at multiple times by a static instrument (e.g. meteorological radar, satellite radiometer measurements)
       • Measurements taken over an area and a time by a moving instrument (e.g. ocean traces, air quality measurements taken during an airplane flight, biodiversity measurements)
       • Results from computer models (e.g. climate models, ocean circulation models)
       • Video and images (e.g. cloud camera images, photos and video from flood events, wildlife camera traps)
       • Physical samples (e.g. rock cores, tree ring samples, ice cores)
     [Image: Suber cells and mimosa leaves. Robert Hooke, Micrographia, 1665]

  5. Case Study: CMIP5
     • CMIP5: Fifth Coupled Model Intercomparison Project
     • Global community activity under the auspices of the World Meteorological Organisation (WMO) via the World Climate Research Programme (WCRP)
     • Aim:
       • to address outstanding scientific questions that arose as part of the AR4 process,
       • to improve understanding of climate, and
       • to provide estimates of future climate change that will be useful to those considering its possible consequences.
     • Method: a standard set of model simulations in order to:
       • evaluate how realistic the models are in simulating the recent past,
       • provide projections of future climate change on two time scales, near term (out to about 2035) and long term (out to 2100 and beyond), and
       • understand some of the factors responsible for differences in model projections, including quantifying some key feedbacks such as those involving clouds and the carbon cycle.

  6. [Figure: timeline of IPCC Assessment Reports] FAR: 1990, SAR: 1995, TAR: 2001, AR4: 2007, AR5: 2013

  7. CMIP5 numbers
     • Simulations:
       • ~90,000 years
       • ~60 experiments
       • ~20 modelling centres (from around the world) using
       • ~30 major(*) model configurations
       • ~2 million output “atomic” datasets
       • ~10s of petabytes of output
       • ~2 petabytes of CMIP5 requested output
       • ~1 petabyte of CMIP5 “replicated” output, which will be replicated at a number of sites (including ours).
     • Of the replicated output:
       • ~220 TB decadal
       • ~540 TB long term
       • ~220 TB atmosphere-only
       • ~80 TB of 3-hourly data
       • ~215 TB of ocean 3D monthly data!
       • ~250 TB for the cloud feedbacks!
       • ~10 TB of land-biochemistry (from the long term experiments alone).
     (May 2011: all these data output volumes are probably a factor of two too low!)

  8. CMIP5 and Data Citation CMIP5 will produce a lot of data! It’s an international effort, with everyone involved wanting to ensure proper citation, attribution and location of the data produced. From http://cmip-pcmdi.llnl.gov/cmip5/citation.html?submenuheader=3 : “Digital Object Identifiers will be assigned to various subsets of the CMIP5 multi-model dataset and, when available and as appropriate, users should cite these references in their publications. These DOI’s will provide a traceable record of the analyzed model data, as tangible evidence of their scientific value. Instructions will be forthcoming on how to cite the data using DOI’s.” There are also plans to work with journal publishers to publish data papers about various key model runs and ensembles (more about data publication later!)

  9. Earth Sciences: BADC It is possible to reference our datasets using a specific citation given on the main dataset information page. We’re currently working on assigning DOIs to certain datasets which meet our technical quality standards.

  10. Earth Sciences: Pangaea

  11. Physics and Life Science: ISIS
     • The ISIS pulsed neutron and muon source produces beams of neutrons and muons that allow scientists to study materials at the atomic level using a suite of instruments, often described as ‘super-microscopes’. It supports a national and international community of more than 2000 scientists who use neutrons and muons for research in physics, chemistry, materials science, geology, engineering and biology.
     • ISIS is now issuing DOIs for experiment data to allow easy citation. Principal Investigators will be sent DOIs shortly before their experiment is due to start.
     • DOIs issued by ISIS are in the form: 10.5286/ISIS.E.1234567
     • The recommended format for citation is:
       • Author, A N. et al; (2010): RB123456, STFC ISIS Facility, doi:10.5286/ISIS.E.1234567
     [Image: Identifying materials for hydrogen storage]
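The ISIS citation format above can be assembled mechanically from an experiment record. The following is a minimal sketch, not an ISIS tool: the `IsisExperiment` class, its field names, and the DOI pattern are hypothetical, and the pattern is inferred only from the single example DOI shown on the slide.

```python
import re
from dataclasses import dataclass

# Assumption: pattern inferred from the one example DOI on the slide
# (10.5286/ISIS.E.1234567); the real numbering scheme may differ.
ISIS_DOI_PATTERN = re.compile(r"^10\.5286/ISIS\.E\.\d+$")


@dataclass
class IsisExperiment:
    """Hypothetical record holding the fields used in the citation."""
    lead_author: str  # e.g. "Author, A N."
    year: int         # e.g. 2010
    rb_number: str    # experiment number, e.g. "RB123456"
    doi: str          # e.g. "10.5286/ISIS.E.1234567"


def is_isis_doi(doi: str) -> bool:
    """Return True if the DOI matches the form shown on the slide."""
    return ISIS_DOI_PATTERN.match(doi) is not None


def format_isis_citation(exp: IsisExperiment) -> str:
    """Build a citation string in the recommended format from the slide."""
    if not is_isis_doi(exp.doi):
        raise ValueError(f"Not an ISIS-style DOI: {exp.doi}")
    return (
        f"{exp.lead_author} et al; ({exp.year}): {exp.rb_number}, "
        f"STFC ISIS Facility, doi:{exp.doi}"
    )


if __name__ == "__main__":
    example = IsisExperiment("Author, A N.", 2010, "RB123456",
                             "10.5286/ISIS.E.1234567")
    print(format_isis_citation(example))
```

Keeping the DOI check separate from the formatting means a malformed identifier is rejected before it can end up in a reference list.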

  12. Chemistry: PubChem

  13. Astronomy: Seamless Astronomy and Dataverse
     The Seamless Astronomy Group at the Harvard-Smithsonian Center for Astrophysics brings together astronomers, computer scientists, information scientists, librarians and visualization experts involved in the development of tools and systems to study and enable the next generation of online astronomical research. They are evaluating the Dataverse, an open data archive hosted by Harvard University and managed by the Institute for Quantitative Social Science (IQSS), as a project-based repository for the storage, access, and citation of reduced astronomical data.
     • Dataverse data citation standard:
       • offers proper recognition to authors,
       • provides permanent identification through the use of global, persistent identifiers in place of URLs,
       • uses universal numerical fingerprints (UNFs) to guarantee that future researchers will be able to verify that data retrieved is identical to that used in a publication decades earlier, even if it has changed storage media, operating systems, hardware, and statistical program format.
     • Following is an authentic example of a replication data-set citation (from International Studies Quarterly, King and Zeng, 2007, p.209):
       • Gary King; Langche Zeng, 2006, "Replication Data Set for 'When Can History be Our Guide? The Pitfalls of Counterfactual Inference'" hdl:1902.1/DXRXCFAWPK UNF:3:DaYlT6QSX9r0D50ye+tXpA== Murray Research Archive [distributor]
     http://projects.iq.harvard.edu/seamlessastronomy/
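To make the two identifiers in the example citation concrete, the sketch below pulls the handle and the UNF out of the citation string. It is an illustration only: the function name and regular expressions are hypothetical, the treatment of the UNF payload as base64 is an assumption based on its appearance, and the code does not compute or verify a UNF.

```python
import base64
import re

# The example citation from the slide, split across literals for quoting.
CITATION = (
    'Gary King; Langche Zeng, 2006, "Replication Data Set for '
    "'When Can History be Our Guide? The Pitfalls of Counterfactual "
    "Inference'\" hdl:1902.1/DXRXCFAWPK UNF:3:DaYlT6QSX9r0D50ye+tXpA== "
    "Murray Research Archive [distributor]"
)


def parse_identifiers(citation: str) -> dict:
    """Extract the persistent handle and the UNF from a citation string.

    Hypothetical helper: the patterns are inferred from the single
    example citation and are not an official Dataverse grammar.
    """
    handle = re.search(r"hdl:(\S+)", citation)
    unf = re.search(r"UNF:(\d+):(\S+)", citation)
    if handle is None or unf is None:
        raise ValueError("citation lacks a handle or a UNF")
    # Assumption: the fingerprint payload is base64-encoded bytes.
    digest = base64.b64decode(unf.group(2), validate=True)
    return {
        "handle": handle.group(1),
        "unf_version": int(unf.group(1)),
        "unf_digest_length": len(digest),
    }


if __name__ == "__main__":
    print(parse_identifiers(CITATION))
```

The handle says where the data set lives, while the fingerprint describes what the data contained, which is why the citation carries both.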

  14. (Scientific) Communication through the ages • Science, as a process, requires the exchange of information and ideas. • We can make this exchange face-to-face (conferences, meetings, seminars) or through another medium (text, video, images), or both. • No matter what method we use, we wind up telling each other stories about what we’ve discovered. • Technology has given us new tools, but it’s also provided new challenges http://www.intoon.com/#68559

  15. The Data Deluge
     “the amount of data generated worldwide...is growing by 58% per year; in 2010 the world generated 1250 billion gigabytes of data” (The Digital Universe Decade – Are You Ready? IDC White Paper, May 2010)
     • Journals can’t now communicate everything we need to know about a scientific event, whether that’s an observation, simulation, development of a theory, or any combination of these.
     • Data always has been the foundation of scientific progress – without it, we can’t test any of our assertions.
     • Previously data was hard to capture, but could be (relatively) easily published in image or table format.
     • We need to publish data – but how?

  16. Serving, Citing and Publishing Data
     • Citation forms an important part of the scientific record.
     • We draw a clear distinction between:
       • publishing = making available for consumption (e.g. on the web), and
       • Publishing = publishing after some formal process which adds value for the consumer (e.g. PLoS ONE type review, EGU journal type public review, or more traditional peer review) AND provides commitment to persistence.
     • The three stages shown in the diagram:
       • 0. Serving of data sets: This is what data centres do as our day job – take in data supplied by scientists and make it available to other interested parties.
       • 1. Data set citation (diagram label: Doi:10232/123): This is our first step for this project – formulate and formalise a way of citing data sets. Will provide benefits to our users – and a carrot to get them to provide data to us!
       • 2. Publication of data sets (diagram label: Doi:10232/123ro): This involves the peer review of data sets, and gives the “stamp of approval” associated with traditional journal publications. Can’t be done without effective linking/citing of the data sets.

  17. Final remarks
     • There is obviously a need for data citation, not only for scientists, but also to provide traceability and accountability for the general public (cf. issues surrounding Climategate).
     • There is serious pressure in the Earth and climate sciences to publish data,
       • but there is also a need to ensure proper accreditation.
     • How we communicate scientific findings is changing – data citation is a big part of that.
     http://www.keepcalm-o-matic.co.uk/default.aspx#createposter
