
Dealing with Data: An Introduction

Dealing with Data: An Introduction. Stephen Pinfield University of Nottingham, UK. Overview. The potential of research data The policy context The key challenges The place of data in research e-infrastructure The attitudes and practices of researchers The need for support for researchers


Presentation Transcript


  1. Dealing with Data: An Introduction Stephen Pinfield University of Nottingham, UK

  2. Overview • The potential of research data • The policy context • The key challenges • The place of data in research e-infrastructure • The attitudes and practices of researchers • The need for support for researchers • The need for coordinated initiatives

  3. The Potential “Because digital data are so easily shared and replicated and so recombinable, they present tremendous reuse opportunities, accelerating investigations already under way and taking advantage of past investments in science.” Clifford Lynch* * Clifford Lynch (2008) ‘Big data: How do your data grow?’ Nature 455 (7209), 28-29 http://dx.doi.org/10.1038/455028a

  4. Human Genome • “The Human Genome Project marked a new approach in biomedical research, one in which the scientific community came together to characterize systematically a large domain of important biological knowledge.” Francis Collins* • More than 2,800 researchers at 20 institutions took part in the International Human Genome Sequencing Consortium • Working with advanced instrumentation and large shared datasets * FS Collins et al (2004) ‘Finishing the euchromatic sequence of the human genome’ Nature 431 (7011), 931-945 http://dx.doi.org/10.1038/nature03001

  5. Hubble Telescope • Each day the orbiting observatory generates about 10 gigabytes of data • The Hubble archive sends about 66 gigabytes of data each day to astronomers around the world • Astronomers using Hubble data have published nearly 7,000 scientific papers • About 4,000 astronomers from all over the world have used the telescope • Researchers submit proposals for observations, which are peer-reviewed and, if accepted, the data is collected • About 1000 proposals are submitted each year and 200 selected • “There are more research papers written by ‘second use’ of research data, than by the use initially proposed” Ross Wilkinson* * Ross Wilkinson ‘Current Developments in Australia’ UKRDS conference presentation, February 2009

  6. Swine Flu Worldwide response to the new strain of influenza A H1N1v in 2009 helped by data sharing: • 23 April: same strain of swine influenza affected people in Mexico and USA • 27 April: H1N1 sequence data available in GenBank • 30 April: a number of phylogenetic analyses on the data available Source: Jon Cohen (2009) ‘Flu Researchers Train Sights On Novel Tricks of Novel H1N1’ Science 324 (5929), 870-871 http://dx.doi.org/10.1126/science.324_870

  7. Link with Publications • Example: UK PubMed Central provides: • Access to the full text • Links to related datasets • Potential for text/data mining • Potential for semantic enrichment of articles including data access, e.g. David Shotton* * David Shotton (2009) ‘Semantic publishing: the coming revolution in scientific journal publishing’. Learned Publishing 22 (2) 85–94 http://dx.doi.org/10.1087/2009202

  8. Research Data* • Data generated through: • Scientific experiments • Models or simulations • Observations of specific phenomena at a specific time or location • Making data available: • “There are two essential reasons for making research data publicly-available: first, to make them part of the scholarly record that can be validated and trusted; second, so they can be reused by others in new research.” * Alma Swan and Sheridan Brown To Share or Not to Share: Publication and Quality Assurance of Research Data Outputs Research Information Network, 2009 http://www.rin.ac.uk/data-publication

  9. Data Deluge Growth of the Protein Data Bank Source: Sustaining the digital investment: issues and challenges of economically sustainable digital preservation. Interim report of the Blue Ribbon Task Force on Sustainable Digital Preservation and Access (December 2008) http://brtf.sdsc.edu/biblio/BRTF_Interim_Report.pdf

  10. Growth in Data Use The Wellcome Trust reports that its Sanger Institute website regularly received 15 million hits per week in 2007/2008 (a 25% rise on the previous year)

  11. Policy Context Funder requirements: • OECD: international intergovernmental agency • UK Research Councils: government-supported agency • Wellcome Trust: charity • Nature: publisher • etc

  12. OECD • OECD Principles and Guidelines for Access to Research Data from Public Funding, 2007*: • A set of important principles encouraging improved access to and sharing of research data in order to promote effectiveness and efficiency of scientific research * http://www.oecd.org/dataoecd/9/61/38500813.pdf

  13. UK Research Councils: MRC • Medical Research Council Policy on Data Sharing and Preservation*: • “The MRC expects valuable data arising from MRC-funded research to be made available to the scientific community with as few restrictions as possible. Such data must be shared in a timely and responsible manner.” *http://www.mrc.ac.uk/Ourresearch/Ethicsresearchguidance/Datasharinginitiative/Policy/index.htm

  14. UK Research Councils: NERC • Natural Environment Research Council Data Policy*: • “…science in general and environmental science in particular involves the collection of data, and the subsequent management of these data is implicit in NERC’s mission. While data will indeed be manipulated by the researcher to provide material for publication, data are a resource in their own right. Properly managed and preserved, they can potentially be used and re-used by future researchers, and exploited commercially or educationally. Such further uses, often never envisaged in the first instance, will make an additional contribution to NERC’s objectives.” * http://www.nerc.ac.uk/research/sites/data/policy.asp

  15. Wellcome Trust • Policy on Data Management and Sharing*: • “The Trust considers that the benefits gained from research data will be maximised when they are made widely available to the research community as soon as feasible, so that they can be verified, built upon and used to advance knowledge.” • “…the Trust expects the researchers that it funds to maximise the availability of research data with as few restrictions as possible.” *http://www.wellcome.ac.uk/About-us/Policy/Policy-and-position-statements/WTX035043.htm

  16. Publishers: Nature • Nature authorship policy relating to data*: • “Before submitting the paper, at least one senior member from each collaborating group must take responsibility for their group's contribution. Three major responsibilities are covered: preservation of the original data on which the paper is based, verification that the figures and conclusions accurately reflect the data collected and that manipulations to images are in accordance with Nature journal guidelines…, and minimization of obstacles to sharing materials, data and algorithms through appropriate planning.” * http://www.nature.com/nature/journal/v458/n7242/full/4581078a.html

  17. Key Challenges • Data storage challenges, including technologies and responsibilities • Data often stored locally on different devices • Data is stored in multiple formats and with numerous software dependencies • Often the data is unstructured and therefore inaccessible to others • Raw data may often not be easily shareable; derived data may still require work to make it shareable • It may not be clear who owns the intellectual property rights on data

  18. Key Challenges (2) • Fixed-term funding may mean data may not be permanently available • There are no widely-agreed formal mechanisms for data quality control • Lack of adequate metadata and search tools • Provision of skills for data management and curation is underdeveloped • Inadequate institutional, consortium, national, and subject facilities • Data is often an untapped resource

  19. Data Life-Cycle Management • Whole data life-cycle is important, not just storage • Life-cycle: creation, selection, ingest, storage, metadata creation, retrieval, preservation • Enabling subsequent: access, analysis, synthesis, reuse, transformation Source: Digital Curation Centre Curation Lifecycle Model http://www.dcc.ac.uk/lifecycle-model/#11
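The life-cycle stages listed on this slide can be sketched as an ordered enumeration. This is a minimal illustration only: the names `Stage`, `next_stage` and `remaining_stages` are hypothetical, and the strictly linear ordering is an assumption that simply follows the order of the slide's list, not a statement about how the DCC model is structured.

```python
from enum import IntEnum
from typing import List, Optional


class Stage(IntEnum):
    """Life-cycle stages in the order the slide lists them (hypothetical names)."""
    CREATION = 1
    SELECTION = 2
    INGEST = 3
    STORAGE = 4
    METADATA_CREATION = 5
    RETRIEVAL = 6
    PRESERVATION = 7


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None after preservation."""
    if stage is Stage.PRESERVATION:
        return None
    return Stage(stage + 1)


def remaining_stages(stage: Stage) -> List[Stage]:
    """List every stage still ahead of `stage`, in slide order."""
    return [s for s in Stage if s > stage]
```

A data management plan could use a structure like this to record how far a dataset has progressed, which reflects the slide's point that storage is only one step among several.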

  20. Data Citation and Linking • Data citation standards beginning to develop e.g.: • eBank UK survey: http://www.ukoln.ac.uk/projects/ebank-uk/data-citation/ • Altman and King proposal* • Data linking: • “It’s not just about collecting data, it’s about connecting it. Only then do the interesting patterns start to emerge” (Tim Berners-Lee) * Micah Altman and Gary King (2007) ‘A proposed standard for the scholarly citation of quantitative data’ D-Lib Magazine 13 (3/4) http://dx.doi.org/10.1045/march2007-altman

  21. European Initiative • European Initiative to Facilitate Access to Research Data, March 2009 • Joint initiative of various research libraries and technical information providers from Denmark, France, Germany, the Netherlands, Switzerland, and the UK • Memorandum of Understanding* signed stating: • “Our long-term vision is to support researchers by providing methods for them to locate, identify, and cite research datasets with confidence.” • “In order to achieve its long-term vision, we will establish a not for profit agency that enables organisations to register research datasets and assign persistent identifiers to them.” *http://www.icsti.org/documents/PressReleaseMarch2009-JointDOIforData.pdf

  22. Other Technical Issues • e-IRG DMTF (e-Infrastructure Reflection Group of the European Commission, Data Management Task Force) work 2007-2008 • Interim Report, June 2009*: now out for consultation • Highlights two key technical areas requiring further work: • Metadata quality • Interoperability * http://www.e-irg.eu/index.php?option=com_content&task=view&id=223

  23. Metadata e-IRG report recommendations: • Usage: providing metadata is a major priority • Scope: need for disciplines to agree metadata sets • Provenance: metadata should include or refer to provenance information • Persistence: metadata descriptions need to be persistent • Aggregations: metadata have the potential to describe citable groupings • Standardisation: elements across different fields • Interoperability: metadata needs to be open and offered for harvesting • Quality: researchers need to produce high-quality metadata • Earliness: it is important that metadata production is timely • Availability: the metadata is an essential part of providing a resource or service
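A few of these recommendations (provenance, persistence, interoperability, aggregations) can be illustrated with a small record type and a completeness check. This is a hedged sketch: `DatasetMetadata`, its field names and `missing_recommendations` are all hypothetical, invented for illustration, and do not represent a schema from the e-IRG report.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetMetadata:
    """Hypothetical record; field names are illustrative, not an e-IRG schema."""
    title: str
    discipline: str
    persistent_id: str = ""      # Persistence: a stable identifier for the description
    provenance: str = ""         # Provenance: how or where the data arose
    harvestable: bool = False    # Interoperability: open and offered for harvesting
    member_ids: List[str] = field(default_factory=list)  # Aggregations: citable grouping


def missing_recommendations(record: DatasetMetadata) -> List[str]:
    """Name the checked recommendations that this record does not yet satisfy."""
    gaps = []
    if not record.persistent_id:
        gaps.append("Persistence")
    if not record.provenance:
        gaps.append("Provenance")
    if not record.harvestable:
        gaps.append("Interoperability")
    return gaps
```

Checks like this cover only the mechanical side of the recommendations; agreeing metadata sets within a discipline and judging metadata quality remain community tasks.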

  24. Interoperability e-IRG report recommendations: • Need to actively encourage programmes to support cross-disciplinary access • Support interoperation within multinational and multidisciplinary grids • Prioritise interoperation activities aiming at standardising interfaces and/or protocols • Digital objects deserve infrastructure components in systems design • Underlying proper data management is proper repository setup

  25. Criteria for Interoperability* * Source: Liz Lyon, Simon Coles, Monica Duke, Traugott Koch Scaling Up: Towards a Federation of Crystallography Data Repositories 2008 http://www.ukoln.ac.uk/ukoln/staff/e.j.lyon/reports/Ebank3report.pdf

  26. What to Preserve? • Not all data should be preserved and/or shared • Examples of added value from preservation and sharing: • Data which can be reused and recombined with other data e.g. clinical trials • Data recording unique events e.g. census data • Data which has cumulative value e.g. human genome • There is a need for mechanisms and criteria for selection for preservation as part of data management plans

  27. What to Preserve? UKRDS • Consultation with over 700 researchers in four case study sites (Bristol, Leeds, Leicester, Oxford), April-June 2008 • Respondents estimated about 50% of their data had a useful life of up to 10 years • 26% were seen as having indefinite retention value

  28. NSF Long-Lived Data Identifies three categories of collection: • Research data collections: the products of one or more focused research projects and typically contain data that are subject to limited processing or curation. • Resource or community data collections: serve a single science or engineering community. These digital collections often establish community-level standards. • Reference data collections: intended to serve large segments of the scientific and education community. Characteristic features of this category of digital collections are a broad scope and a diverse set of user communities including scientists, students, and educators from a wide variety of disciplinary, institutional, and geographical settings. Source: Long-Lived Digital Data Collections Enabling Research and Education in the 21st Century National Science Board, NSB-05-40, 2005 http://www.nsf.gov/pubs/2005/nsb0540/start.jsp

  29. Open Access to Data? • Open Access (OA) • “…where the full content is freely, immediately, and permanently available and can be accessed and reused in an unrestricted way.” Stephen Pinfield* • Different scenarios: • Some data will be OA • Some will be ‘delayed OA’, but needs to be shared amongst collaborators earlier • Some will be OA derived/summary data (rather than raw data) • Some data will remain closed access * Stephen Pinfield (2009) ‘Journals and repositories: an evolving relationship?’ Learned Publishing 22 (3), 165-175 http://dx.doi.org/10.1087/2009302

  30. Challenges: Stakeholders • Researchers: need to manage and/or contribute to the data production, curation and reuse process • Funders: need to get value from data and ensure research is of a high quality • Institutions: need to provide services and guidance • Subject communities: need to develop standards, best practice • Data managers: need to provide reliable services for data producers and potential reusers

  31. UKRDS Communities and Headline Processes [diagram] Communities: • Funders • HEIs & Research Institutes (researchers, IT directors, librarians, archivists, other experts) • Public sector users & generators of data • Commercial users & generators of data • Service providers • Other educational institutions • Vendors • Venture capitalists • International links • Journal and data publishers Headline processes: • Work with funders on policy issues and data management planning • Services covering: data management advice, DCC lifecycle adoption and guidance, training in DMPs, tools / discovery development, and accession planning • Provision of conditional data set access • Coordinate capacity planning and help address implications for long-term storage and infrastructure investment • Engage as appropriate to maximise exploitation of financial support for long-term data management capabilities • Engage as appropriate to maximise exploitation of vendor support for long-term data management capabilities • Ensure provision of accession and access procedures • Facilitate provision of persistent citation links

  32. [Diagram. Components: • Data creation / capture / gathering: laboratory experiments, Grids, fieldwork, surveys, media • Data analysis, transformation, mining, modelling • Data curation: databases & databanks • Repositories: institutional, e-prints, subject, data, learning objects • Aggregator services: national, commercial • Presentation services: subject, media-specific, data, commercial portals • Institutional presentation services: portals, Learning Management Systems, u/g, p/g courses, modules • Peer-reviewed publications: journals, conference proceedings • Quality assurance bodies • Research & e-Science workflows • Learning & Teaching workflows • Learning object creation, re-use. Connections: deposit / self-archiving; validation; publication; linking; harvesting metadata; searching, harvesting, embedding; resource discovery, linking, embedding.] Graphic courtesy of Liz Lyon, UKOLN

  33. Context: eInfrastructure • ‘eInfrastructure’ (or ‘cyberinfrastructure’) is a term used to denote a combination of: • large-scale computing systems • data storage/management repositories • advanced instruments • analytical, visualisation and modelling tools • information search and retrieval tools • middleware, including identity and access management systems • collaboration tools • high-performance networks • and people, organisations and processes which support these • These are provided in an integrated way to support research activity and to achieve greater research productivity. The research supported is often highly collaborative within and between institutions.

  34. eInfrastructure Components • Enabling components • Access management • Data/content storage and curation • High-performance and grid computing • High-capacity networks • Collaboration components • Shared data facilities • Shared spaces • Shared tools • Coordination components • Governance/leadership • Skills development • Support Source: Strategic Roadmap for Australian Research Infrastructure, Department of Innovation, Industry, Science and Research, Canberra, 2008. http://www.innovation.gov.au/ScienceAndResearch/Documents/Strategic%20Roadmap%20Aug%202008.pdf Graphic courtesy of Prof Jonathan Hirst

  35. eInfrastructure: Levels • Institutional • Consortial • National • International

  36. Attitudes of Researchers Some barriers to sharing: • ‘Data mining’: ‘What’s mine is mine…What’s yours is mine’! • Lack of explicit career rewards for data sharing • Concern to protect IPR and get value from the data • Ethical issues: personal data may not be shared and some data should not be reused for purposes other than that for which it was collected • Significant extra effort/skills required for sharing

  37. Disciplinary Differences Significant differences across disciplines (as with publication practices): • Different kinds of data collected • Different traditions of data sharing • Different attitudes to data preservation • Different views of the responsibilities of the different players

  38. Willingness to Share UKRDS research (2008) indicated: • 21% of researchers consulted use a national or international facility • Most share data freely amongst research collaborators • 18% share data via a data archive • 43% believe their research could be improved by access to a wider range of data

  39. Incentives for Researchers • Piwowar, Day and Fridsma* suggest a citation advantage associated with sharing data • “…examined the citation history of 85 cancer microarray clinical trial publications with respect to the availability of their data. The 48% of trials with publicly available microarray data received 85% of the aggregate citations. Publicly available data was significantly (p = 0.006) associated with a 69% increase in citations, independently of journal impact factor, date of publication, and author country of origin using linear regression.” • “This correlation between publicly available data and increased literature impact may further motivate investigators to share their detailed research data.” * HA Piwowar, RS Day, DB Fridsma (2007) ‘Sharing Detailed Research Data Is Associated with Increased Citation Rate’. PLoS ONE 2(3): e308. http://dx.doi.org/10.1371/journal.pone.0000308
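As a rough sense of scale for the two shares quoted above, they can be converted into a ratio of mean citations per trial. This is back-of-envelope arithmetic on the quoted percentages only, with a hypothetical helper name; it is not the authors' regression, and the raw ratio is not comparable to the adjusted 69% figure, which controls for journal impact factor, publication date and author country.

```python
def mean_citation_ratio(share_of_trials: float, share_of_citations: float) -> float:
    """Ratio of mean citations per trial: data-sharing trials vs the rest.

    Both arguments are fractions in (0, 1) describing the data-sharing group.
    """
    shared_mean = share_of_citations / share_of_trials
    other_mean = (1 - share_of_citations) / (1 - share_of_trials)
    return shared_mean / other_mean


# 48% of trials shared data and received 85% of the aggregate citations.
ratio = mean_citation_ratio(0.48, 0.85)
print(round(ratio, 2))  # 6.14
```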

  40. Incentives for Researchers • Citation advantage? • Funder requirements • The changing nature of research

  41. What Do Researchers Want?* • Confidence that their data will be permanently stored and remain readily accessible • Confidence that the charges for managing, curating and storing data will be met • Ability to access freely other people’s data preferably on a worldwide basis • Training in data mining and intelligent management and use of data * John Coggins ‘A researchers perspective: the valuing challenge of data’ UKRDS conference presentation, February 2009

  42. Researchers: Attitudes and Skills • “Relatively few researchers have the expertise, resources and inclination to perform themselves all the tasks necessary to make their data not only available, but readily accessible and usable by others.” Alma Swan and Sheridan Brown* *Alma Swan and Sheridan Brown To Share or Not to Share: Publication and Quality Assurance of Research Data Outputs Research Information Network, 2009 http://www.rin.ac.uk/data-publication

  43. Roles for Librarians? Potential to apply ‘traditional’ skills in new areas? • Data validation • Metadata production • Data search and retrieval • Linking data and published outputs • Encouraging interdisciplinary approaches • Curation skills

  44. Co-ordinated Approaches • Subject communities • UK: UKRDS (UK Research Data Service) • Australia: ANDS (Australian National Data Service) • USA: DataNets (Sustainable Digital Data Preservation and Access Network Partners) • European Initiative to Facilitate Access to Research Data

  45. Subject-Based Initiatives: EU • e-IRG DMTF Survey of Existing Data Management Initiatives identifies over 60 major European initiatives across different disciplines • Most have been designed to collect and manage particular data types in a specific discipline • Some are country-specific • Some archives collect raw data, others focus on post-analysis data only, some both • The aim of many archives is to “curate data in order to allow further analysis beyond the original experimental measurement.” • Data is acquired in a number of ways: • In the Natural Sciences data is often acquired directly from scientific instruments • In the Arts and Social Sciences it is usually a combination of research deposit and acquisition of external data sets • In the Health Sciences there is often a tradition of journals requiring data deposit before publishing related articles and it is the responsibility of the author to do this

  46. UKRDS • Joint initiative of top research-led universities’ library and IT directors • Sponsored by the Higher Education Funding Council for England (HEFCE) and the Joint Information Systems Committee (JISC) • National shared services feasibility study • Report: December 2008 • Concludes that a nationally-coordinated approach to research data management is feasible and likely to achieve major savings and wide benefits • Needs ‘buy-in’ from variety of agencies • Now proceeding with initial investigatory/piloting phase

  47. UKRDS Basic Processes [Diagram. Panels: • Research Project Process: research team formulates research concept, prepares proposal (including DMP), funder approves proposal (Y/N), team carries out research project, project report; researchers consult UKRDS about other research & data sources • Research Data Sharing Process: manual & on-line user enquiry; enquiry, discovery and advisory services • UKRDS Services and Administration • UKRDS Management and administration]

  48. ANDS and DataNets • Australian model, co-ordinated and top-down from government: ANDS (Australian National Data Service) with AUS $24m over 3 years • US model, distributed and NSF-funded: 5 large ‘DataNets’ (consortia of universities) to build data stewardship capabilities: $100m over 5 years

  49. ANDS: Components • Developing Frameworks: influencing relevant national policies • Providing Utilities: building and delivering national technical services to support the ‘data commons’ • Seeding the Commons: improving and standardising institutionally supported repositories • Building Capabilities: assisting researchers to align their data management practices with the needs and outputs of ANDS Source: http://www.ands.org.au/

  50. ANDS: Conceptual Architecture Graphic courtesy of ANDS
