
Long-Term Data Preservation: International Collaboration for Data Preservation and Long Term Analysis in High Energy Physics. Jamie.Shiers@cern.ch, WLCG Overview Board, March 2013. Twitter: #DPHEP.


Long-Term Data Preservation

International Collaboration for Data Preservation and

Long Term Analysis in High Energy Physics


WLCG Overview Board, March 2013

Twitter: #DPHEP

Overview


  • Summary of DPHEP Blueprint recommendations

  • Opportunities: collaboration with other disciplines & funding

  • A “2020 vision” and its implementation

DPHEP Blueprint

DPHEP Entities

Implemented via multi-lateral Collaboration Agreement (draft circulated)

DPHEP Entities

The Study Group was chaired by Cristinel Diaconu (CPPM & DESY), who continues in this role

DPHEP Entities

CERN provides the Project Manager for 2013-2015, after which the role may rotate

DPHEP Entities

Broadened to include “influential” names, e.g. from APA, SCIDIP-ES

DPHEP Entities

Representatives of parties to Collaboration Agreement

DPHEP Entities

e.g. EU, NSF, STFC, INFN, …

DPHEP Blueprint Deliverables

Proposed activities of the DPHEP Organization – p85, Blueprint document.

These deliverables are to be met within 2 years of becoming fully operational.

DPHEP Levels

HepMC / Rivet toolkit may play a useful – and sustainable – role here. See DPHEP7

DPHEP Summary

  • There is a lot of knowledge and experience in the existing DPHEP community that can be leveraged for other efforts, e.g. LHC & LEP

  • LHC is clearly of key interest to WLCG OB but we should not forget LEP before it is too late!

  • On-going (small) effort to document current situation and options for moving forward

  • CERNLIB is felt to be a critical factor, but there are many external distributions

Collaboration with others

  • Many other disciplines, ranging from science to arts & humanities, already (very) active

  • Numerous conferences and workshops have been up and running for years

  • We have been accepted with open arms, partly due to the halo effect of the Higgs discovery

  • Concrete discussions on further collaboration and funding are advancing well

  • Not limited to Data Preservation – e.g. SKA!


  • DASPOS is up and running with NSF funding

  • Research Data Alliance – with indirect EU, NSF, AUS and other funding – will play a role

    • Co-chair of RDA WG on DP

  • Clear signs that EU Horizon 2020 will include Data Preservation

    • e-IRG meeting, EIROforum w/s, RDA, …

  • Now is the time to firm up partnerships & prepare for up-coming projects

  • STFC and other UK bodies particularly active in above activities: how can we profit from this?

2020 Vision for LT DP in HEP

  • Long-term: disruptive change(s), e.g. LC era

    • All archived data – e.g. that described in Blueprint, including LHC data – easily findable, fully usable by designated communities with clear (Open) access policies and possibilities to annotate further

    • Best practices, tools and services well run-in, fully documented and sustainable; built in common with other disciplines, based on standards

  • Vision achievable, but we are far from this today

Long-Term Commitment

  • To achieve long-term data preservation, we need long-term commitment(s)

  • By 2035, there will have been:

    • 3-4 updates to the ESPP;

    • 4-5 new DGs;

    • X re-organizations of CERN-IT.

  • We need commitments that outlive all of these!

OAIS Components

  • In the OAIS model, there are the concepts of producer and consumer

  • DASPOS aims to take data produced by e.g. CMS and show that e.g. ATLAS can reproduce a full analysis, using the software, meta-data, documentation etc.

  • This exercise will be started at DPHEP7 (March 21-22) and hopefully repeated regularly – e.g. annually – so that by 2020 the entire process is well understood, documented and repeatable

  • It is proposed that the (Archive) Information Packages are simply XML documents stored in Invenio

  • The exact tool set and feature requirements are still TBD

  • Some tools used on a daily basis (e.g. TWiki!) are not suitable for long-term archives

  • Good opportunity for sharing experiences and best practices with other disciplines / projects, e.g. SCIDIP-ES, APA

APA – “Too Big an Issue for any single organisation – we must work together”
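
As a rough illustration of the proposal above that (Archive) Information Packages could simply be XML documents, the sketch below builds and re-reads a minimal package using only the Python standard library. The element names (aip, descriptive, content, file, fixity), the identifier and the sample payloads are all hypothetical; this is not Invenio's record format or any agreed DPHEP schema, only one possible shape under those assumptions.

```python
import hashlib
import xml.etree.ElementTree as ET


def build_aip(identifier, producer, title, files):
    """Serialise a minimal, hypothetical Archive Information Package as XML.

    `files` maps a file name to its content (bytes). Only the name and a
    SHA-256 fixity value are recorded; the payload itself stays outside the XML.
    """
    root = ET.Element("aip", attrib={"id": identifier})

    # Descriptive information: who produced the package and what it is.
    descriptive = ET.SubElement(root, "descriptive")
    ET.SubElement(descriptive, "title").text = title
    ET.SubElement(descriptive, "producer").text = producer

    # Content information: one entry per file, each with a checksum.
    content = ET.SubElement(root, "content")
    for name, payload in sorted(files.items()):
        entry = ET.SubElement(content, "file", attrib={"name": name})
        fixity = ET.SubElement(entry, "fixity", attrib={"algorithm": "sha256"})
        fixity.text = hashlib.sha256(payload).hexdigest()

    return ET.tostring(root, encoding="unicode")


def read_aip(xml_text):
    """Parse the XML back into a plain dictionary, as a consumer might."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "title": root.findtext("descriptive/title"),
        "producer": root.findtext("descriptive/producer"),
        "files": {f.get("name"): f.findtext("fixity") for f in root.iter("file")},
    }


if __name__ == "__main__":
    package = build_aip(
        "demo-0001",                      # hypothetical identifier
        "CMS",                            # producer, in the OAIS sense
        "Example analysis",               # hypothetical title
        {"ntuple.root": b"abc", "README.txt": b"notes"},
    )
    print(package)
    print(read_aip(package))
```

The producer side writes the package and the consumer side reads it back without any knowledge of how it was made, which mirrors the producer / consumer split in the OAIS model. A real package would also need the software, meta-data and documentation references mentioned above.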

Archival Storage

  • Experience from WLCG and beyond tells us that data loss and corruption will (and does) occur!

    • See WLCG SIRs, Tim Bell’s presentation to DPHEP3

  • But there are things that we can do to mitigate risks and (often) recover, e.g. rule-based systems that apply checksums and other “tests” on a schedule and/or in response to actions

  • What is the current situation at WLCG sites?

  • Can we coordinate / agree suitable actions?

  • Coordinate via HEPiX, IEEE MSST, APA, EUDAT, RDA etc.

  • Collaboration with industry, e.g. IBM-led FP7 project

  • Recovery often performed by experiments by re-replicating data: how will this be done in the long-term?
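
The checksum “tests” mentioned above can be sketched as a simple fixity check: record a checksum for every file at ingest, then re-compute and compare on a schedule. The code below is a minimal sketch using only the Python standard library; the function names, the manifest layout and the demo file names are assumptions made for illustration and do not describe what any WLCG site actually runs.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Checksum a file in chunks so that large files do not fill memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Record a checksum for every file under `root` (done once, at ingest)."""
    return {
        str(p.relative_to(root)): file_checksum(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root, manifest):
    """Re-check every manifest entry and sort it into ok / corrupt / missing."""
    report = {"ok": [], "corrupt": [], "missing": []}
    for name, expected in manifest.items():
        target = root / name
        if not target.is_file():
            report["missing"].append(name)
        elif file_checksum(target) != expected:
            report["corrupt"].append(name)
        else:
            report["ok"].append(name)
    return report


def demo():
    """Simulate one corrupted and one lost file in a throwaway directory."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for i in (1, 2, 3):
            (root / f"run{i}.dat").write_bytes(f"event data {i}".encode())
        manifest = build_manifest(root)

        (root / "run2.dat").write_bytes(b"event dat\x00 2")  # simulated corruption
        (root / "run3.dat").unlink()                          # simulated loss
        return verify(root, manifest)


if __name__ == "__main__":
    print(demo())
```

In a rule-based system, `verify` would be triggered by a scheduler or by an action such as a file being read or migrated, and the "corrupt" and "missing" lists would drive recovery, for example re-replication from another copy.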

DPHEP Level 4

  • Retaining the full potential of the data is the only really interesting option – but it is by far the most difficult!

  • Difficult does not mean impossible – and we can profit from a period of “meta-stability” while we concentrate on this

  • Past experiments typically ported / re-wrote major parts of their offline environment several times over a period of decades

  • This is inevitable for LHC too – we could make this easier, but it will require an initial investment!

  • Collaboration with others who face similar problems could help but much of this we have to solve ourselves

Where to Invest?

Tools and Services, e.g. Invenio

Archival Storage Functionality

Support to the Experiments for DPHEP Level 4

Suggested Topics for DPHEP7

  • “Ingest Issues” (10’)

    • How did you (the experiment) decide what data to save and how to make it discoverable / available? How is it documented, and where are the data / meta-data? What are the access policies and target communities?

    • What tools do you use?

  • “Archive issues”: (10’)

    • How is the archive managed? How are errors detected and handled? What is the experience?

    • What storage system / services are used?

  • “Offline environment issues”: (20’)

    • What have been the key challenges in keeping the offline environment alive? What are the key lessons learned / pitfalls to be avoided? What would you have done differently if long-term preservation had been a goal from the early days of the experiment?

  • DPHEP8: around or during CHEP? TBD in coming weeks… Doodle

Outline for site / experiment talks at DPHEP7, March 21-22, CERN


  • We have outlined the current status of Long-Term Data Preservation in HEP and areas for fruitful collaboration with others

  • Funding, e.g. through EU Horizon 2020, is looking good – we need to invest now to secure this!

  • Much work needs to be done to turn a dream into reality – particularly and critically in the area of future-proof offline environments

  • However, this is expected to result in a cost-saving in the long-term by reducing effort in inevitable migrations

Where to Invest – Summary

Tools and Services, e.g. Invenio: could be solved (2-3 years?)

Archival Storage Functionality: should be solved (i.e. “now”)

Support to the Experiments for DPHEP Level 4: must be solved, but how?
