1 / 13

Data Fabric IG Introduction

Data Fabric IG Introduction. Observations I from Recent Overview. about 50 interviews & about 75 community interactions Data Management and Processing is too time consuming and costly due to organization heterogeneity.

kirima
Download Presentation

Data Fabric IG Introduction

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Fabric IGIntroduction

  2. Observations I from Recent Overview • about 50 interviews & about 75 community interactions • Data Management and Processing is too time consuming and costly due to organization heterogeneity. • Federating data including logical layer information (tracing provenance, understanding creation context, checking identity and integrity, etc.) is too costly. • DM and DP is not ready for Big Data due to the lack of usage of automated procedures incorporating proper data organization mechanisms.

  3. Observations II from Recent Overview • Due to lack of software that is supporting proper data organizations we continue to create legacy data. • Example: a key biologists is spending 75% of his time for data management - a waste of money and human capital. • To a large extent results are not reproducible. • Senior Domain People agree: • need a change data organization and procedures • but risky path and lack data professionals • people hesitate since they miss clear perspectives

  4. Data Fabric Sketch a very rough sketch of the Data production and processing machinery of data-driven science • One Big Question for RDA: • How can we maximally support this machinery • unload researchers, • make science reproducible, • etc.

  5. “Data Fabric” based on Recent Overview often all in file system data one can work with organized data sharable & re-usable data often a lot of copying in file system

  6. One concrete conclusion • current practice: a data collection comes with its own data organization, management and access solutions • future: there is no need for this heterogeneity since DOs can be treated content independent to a certain extent

  7. Until now in RDA • started with a number of WG activities in P1 • DTR, PP, PIT, MD, DFT – the old ones • some people found this urgent and interesting topics • almost at the end and some questions: what now, how does it all fit into landscape, etc.? • they started this DF brainstorming • even more groups started • are there general themes in the data landscape • the whole issue of data publishing/citation/etc. • the whole issue of scientific culture/legal & ethical aspects • our daily data work in the departments – the Data Fabric • may be more

  8. A few Questions arise • what is the scope of RDA’s Data Fabric? • what are the characteristics of RDA’s Data Fabric? (term is used in industry already: efficient computational machine) • what are the components of RDA’s Data Fabric? • what should the DF IG do within RDA and what not?

  9. Scope and Characteristics of RDA’s DF • DF is about • making departments’ data science reproducible • creating the conditions for trust in the anonymous data domain • identifying mechanisms, components and interfaces making data science efficient and cost effective • discussing cross-disciplinary approaches • defining a framework that allows to include new components or component variants in a flexible way • Example: • DF will state necessity of a worldwide available machinery to register & resolve DOs, we will say something about registered attribute types and specify an API • but we will not say how to implement and use such a system

  10. Scope and Characteristics of RDA’s DF • DF is NOT about • prescribing an overarching architecture we need to follow • specifying an implementation of such an architecture • discussing specific technologies and tools • more than discussing the processing machinery (not publication, citation, l & e, etc.) • DF is about highly automated procedures or at least guidance to follow such procedures.

  11. Components of RDA’s DF (just first ideas) • domain of registered data objects (DO) incl. basic organization principles (data, code, knowledge) • domain of registered actors (ORCID, etc.) • domain of trusted repositories for DOs • accepted policy principles (proper organizing mechanisms, self-documenting, certified, etc.) • set of trusted registries (types&concepts, metadata and provenance schemas, metadata instances, repositories, PIDs, policies, etc.) • what about semantics – so important! • much already out there, need to see how this can all fit together and how we can foster software development

  12. DF IG way of acting • DF IG must be an inclusive open platform for interaction • DF IG needs to place the various WGs/IGs on the landscape • DF IG needs to identify barriers across groups • DF IG can work as umbrella to maintain WG results • open position papers will summarize the state of discussions and provoke convergence debates • it will NOT take council’s of TAB’s role

  13. Task of today DF IGBoF • What are DF’s Scope and Characteristics? • What are DF’s components, interfaces, mechanisms? • How should DF act? • Who will chair DF IG?

More Related