
Physical Sciences & Engineering

Physical Sciences & Engineering. Chair: Johannes Reetz, MPCDF - Max Planck Society. Rapporteur: Leon du Toit, University of Oslo. Session 1: Data Pilot presentations. Session 2: Data management challenges. Discussion facilitator: Claudio Cacciari, CINECA





Presentation Transcript


  1. Physical Sciences & Engineering Chair: Johannes Reetz, MPCDF - Max Planck Society Rapporteur: Leon du Toit, University of Oslo

  2. Session 1: Data Pilot presentations

  3. Session 2: Data management challenges • Discussion facilitator: Claudio Cacciari, CINECA • Challenge: Data Repository vs. LT data sharing • Objective: discuss common data (management, stewardship) challenges • Expected Outcome: • A series of insights into a variety of approaches and viewpoints among related communities • A set of (new) common needs where EUDAT could play a role

  4. DATA Domains [diagram]: WORKSPACE (temporary/transient) → REGISTERED DATA DOMAIN (register, stage and discover digital objects) → PUBLISHED DATA DOMAIN (linking publications to digital objects)

  5. EUDAT DATA Domains [diagram]: WORKSPACE (temporary/transient) → REGISTERED DATA DOMAIN (register, stage and discover digital objects) → PUBLISHED DATA DOMAIN (linking publications to digital objects)

  6. EUDAT DATA Domains [diagram]: REGISTERED DATA DOMAIN in detail, containing Data Objects and Data Entities; register, stage and discover digital objects

  7. Live Data repository vs. Long Term data sharing • A Data Repository for “live data” • Data gets updated during its life cycle • Metadata and provenance information get updated • Collections get extended • Research collaborations need shared access to live, unregistered data, e.g. a Dropbox variant: is this enough? • An archiving system for LT data sharing of “static data” • Curation, data publication, certification • Can’t we have a single system for all such types of data? • What is needed, what can be managed, what can be afforded?

  8. Live Data repository vs. Long Term data sharing: Sharing & LT preservation • We are looking for ideas, sharing ideas, finding ideas • Sharing raw data • Publication of data: data becomes valuable to other communities after it has been published. The published paper is metadata; when people start reusing the data they collect more metadata that is not available in the paper from the author. This should be fed back into the metadata store, at the risk of having too much • Discoverability is not such a big problem within small communities, which are informed about their own activities; across communities is where the problem lies • Who bears the costs of storage and curation if large data sets are long-term archived? • Curation (selection vs. management) is difficult in the sense that it is censorship (selectivity); who is entitled to do this? Custodian role: a knowledgeable contact person; scalability concerns • A finer-grained definition of the custodian role: stages of responsibility, LT issues, knowledge transfer • We should rely on AI and machine learning agents to help scale this • Problem solving for now: data depositors should specify LT storage parameters, e.g. policy, lifetime, setting the starting point • Data protection, privacy and sensitive information: different legal requirements; respond to changing demands on data creators, e.g. legitimate reasons for processing, qualified open access, managing consent; this implies system design decisions, e.g. for PII data • Our systems should accommodate versioning, data corrections, provenance • Often data do not speak for themselves and only become useful when combined with code; this brings in software maintenance and executables. How does this relate to EUDAT? Where are the lines drawn? • Funders should address who pays for the custodians • Related to the data+code combination: sometimes data capture methods necessitate software to reconstruct the data in order to make it analysable • Client-side software is always relevant; therefore, software maintenance is always present • We need to store sufficient information in order to interpret the data; define different levels; collections should contain pointers to software or other necessary tools • Capturing workflow (provenance) requires the execution tools to be capable of generating the relevant metadata • We should align incentives so scientists have reasons to provide the information needed for useful LT preservation • Mitigate the risk of knowledge loss by gathering as much metadata as possible • Consider the interrelatedness of practices vs. technologies
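The point above that data depositors should specify LT storage parameters at ingest time (policy, lifetime, starting point) can be sketched as a small record type. This is only an illustration; the field names (`retention_years`, `start`, `access_level`) are invented here, not an actual EUDAT policy schema:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class DepositPolicy:
    """Hypothetical LT-preservation parameters fixed by the depositor at ingest."""
    retention_years: int        # how long the data must be preserved
    start: date                 # the agreed starting point of the retention clock
    access_level: str = "open"  # e.g. "open", "embargoed", "qualified"

    def expires(self) -> date:
        # Nominal 365-day years are close enough for an illustration.
        return self.start + timedelta(days=365 * self.retention_years)

policy = DepositPolicy(retention_years=10, start=date(2014, 1, 1))
print(policy.expires())
```

Capturing these intentions explicitly at deposit time is what lets a service provider later answer "what can be managed, what can be afforded" per collection rather than guessing.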

  9. Live Data repository vs. Long Term data sharing: Live Repository (workspace) • We should rely on the users/communities to control the community-specific data management • Domain and problem specificity leads to very heterogeneous data, making a common live data repository difficult to deal with as a service provider • Usage policy • Trying to reduce dimensionality is a goal for them (defining metadata is part of this effort); tools that can help with this would be useful • How to deal with data that needs to be post-processed on ingest? • Users want to access live data via APIs; people mostly want the latest version; discourage people from downloading, so that nobody uses the data files • _really_ large scale is out of scope; we are in the mid scale of data • The service provider should be clear about current capabilities and future plans regarding the scale of data • One could present structured data in useful ways without knowing too much about the domain: viz services. Q: Are these tools already available?
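The "access live data via APIs, people mostly want the latest version" point can be illustrated with a minimal versioned store that clients query for "latest" instead of downloading fixed files. The names here (`LiveStore`, `put`, `get_latest`) are invented for the sketch, not an actual EUDAT or community API:

```python
# A minimal sketch of API-mediated access to live, versioned data:
# every update creates a new version, and clients ask for "latest"
# rather than keeping stale downloaded copies.
class LiveStore:
    def __init__(self):
        self._versions = {}  # object id -> list of (version number, payload)

    def put(self, oid, payload):
        """Store a new version of an object; returns the new version number."""
        versions = self._versions.setdefault(oid, [])
        versions.append((len(versions) + 1, payload))
        return len(versions)

    def get_latest(self, oid):
        """What most clients want: the current state of the object."""
        return self._versions[oid][-1]

    def get(self, oid, version):
        """Older versions stay addressable, supporting corrections/provenance."""
        return self._versions[oid][version - 1]

store = LiveStore()
store.put("obj-1", {"t": 0})
store.put("obj-1", {"t": 1})
print(store.get_latest("obj-1"))  # (2, {'t': 1})
```

Keeping earlier versions addressable is what makes this compatible with the versioning, data-corrections and provenance requirements raised on the previous slide.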

  10. Live Data repository vs. Long Term data sharing • Use APIs to abstract away heterogeneity • Metadata enables solutions here; communities need to provide metadata standards if they want automated solutions; the amount of useful automation is proportional to the quality and standardisation of the metadata • Need a community model for metadata and interfaces; communities need help to develop standards • RDA can help guide the development of standards • The problem is that reaching agreement between large communities to standardise metadata takes a _very_ long time; standards are always evolving, with several versions and no silver bullet; domain-specific metadata standards are also important; we need to give knowledge to the researchers; we should also use existing standards; the creators should make tools to make this easier • Usage metadata: tracking users having agreed to TOUs; support for this? Provide examples of TOUs? Manage access via TOUs? • Are metadata schemas in the registered domain fixed? • In the past people did take a lot of care with data and metadata but it came to nothing; we need to have high requirements for long-term preservation for it to be useful

  11. Session 3: Physical Sciences & Engineering. Live Data repository vs. Long Term data sharing: Results and conclusion

  12. Physical Sciences & Engineering. Live Data repository vs. Long Term data sharing. Long-term preservation aspects • LT preservation for sharing ideas, finding ideas, preserving ideas • Sharing raw data • Upon reusing LT data, more metadata is collected from the author and fed back into the metadata store; risk of an inflation of metadata and annotations • Curation is difficult: risk of censorship (selectivity); who is entitled to this custodian role? The custodian role needs to be defined in detail. Scalability? • We could increasingly rely on AI and machine learning techniques; intelligent agents can perhaps help to reduce the deluge of data and metadata prior to preservation • Data depositors should specify LT preservation parameters (intentions at ingest time): policy, retention time, setting the starting point • Handling sensitive information: it is necessary to log the data provider’s consent to use their data and to cope with the variety of legal requirements; this implies system design decisions • Systems should accommodate versioning, data corrections, provenance • LT-preserved data remains useful only when linked to the preserved code • Collections should contain pointers to software, execution environments and workflows • Capturing workflows (provenance) requires the (workflow) systems to be capable of generating the relevant metadata. Live Repository • We should rely on the users/communities to control the community-specific data management • Users want to access live data via APIs • Need a community model for metadata and interfaces; communities need help to develop standards • RDA can help guide the development of standards
