
VLDATA Common solution for the (very-)large data challenge


emele

Presentation Transcript


  1. VLDATA: Common solution for the (very-)large data challenge. EINFRA-1, focus on topics (4) & (5)

  2. Objectives
     • An open and generic platform supporting efficient and cost-effective solutions for large-scale distributed data processing, curation, analysis and publication, providing standards-based interfaces and interoperable access to various e-Infrastructures.
     • Evolution of existing solutions, advancing the state of the art and addressing: openness, extensibility, flexibility, interoperability, scalability, efficiency, productivity, security, cost effectiveness.
     • User-community-driven co-design, validated by end users and supporting a new generation of data scientists.
     • Cooperation among technology providers, integrating existing technologies, to simplify the connection between users and resource providers.

  3. (Direct) Impact
     • To the Research Infrastructures
        • Scalability, robustness
        • Participating RIs will operate their Distributed Computing Systems efficiently, processing their large-volume research data and making it available to their end users in a reliable and cost-effective manner that could not be achieved before. This may lead to new ways of organizing science activities and to significant scientific breakthroughs.
     • To the end user: scientist/operator
        • Simplicity and interoperability
        • By providing important functional components (e.g., pilots, single interface, etc.) missing from existing practices, the VLDATA platform will make possible the transparent integration of resources, hiding the complexity from users and extending the scale of the resources RIs can utilize.
     • To funding agencies
        • Cost efficiency
        • Reducing duplicated effort, maximizing the use of EU investment in e-Infrastructures, enlarging the user communities, providing efficient data processing services and integrating state-of-the-art technology will reduce development costs significantly.

  4. Consortium

  5. Make IT simple
     • Simplicity: VLDATA provides an abstraction of the different Resources, which are all made accessible to the end user via the same interfaces.
     • Transparency: Users can specify their Workflows/Pipelines at different levels of abstraction. The platform takes care of the Resource Allocation needed to fulfill the required specifications.
     • Extendibility and flexibility: VLDATA provides an API that allows users to extend the provided functionality by developing new or customized components.
     • Reliability: Quality standards and extensive validation in several scientific domains ensure the readiness-to-use and robustness of VLDATA-based solutions.
     • Scalability: Modular implementation allowing horizontal (amount of connected Resources or Users) and vertical (amount of processed Units) scaling, to adapt VLDATA to the needs of each particular community or Research Infrastructure project.
     • Smart and intelligent: Building on collected experience and monitoring data, algorithms can look for optimized scheduling/searching strategies, including automated decision making based on usage traces and expectations.
     • Cost-effective: Building on existing well-established solutions and incrementally extending them to address new challenges with an evolving, validated common solution, avoiding unnecessary duplicated effort.
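
The "Simplicity" and "Transparency" points above can be illustrated with a minimal sketch. It is not VLDATA code: the names `Resource`, `GridSite`, `CloudPool`, `Task` and `allocate`, and the first-fit allocation rule, are all assumptions made up for this example. It only shows how different resource types might sit behind one interface while the platform, not the user, chooses where a task runs.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical unit of work; only its name and core count matter here."""
    name: str
    cpu_cores: int


class Resource(ABC):
    """Hypothetical common interface shared by every resource type."""

    def __init__(self, name: str, free_cores: int) -> None:
        self.name = name
        self.free_cores = free_cores

    def can_run(self, task: Task) -> bool:
        # A resource accepts a task only if it has enough free cores.
        return self.free_cores >= task.cpu_cores

    def submit(self, task: Task) -> str:
        # Reserve the cores, then return a job identifier for the task.
        self.free_cores -= task.cpu_cores
        return self._job_id(task)

    @abstractmethod
    def _job_id(self, task: Task) -> str:
        """Build a resource-specific job identifier."""


class GridSite(Resource):
    def _job_id(self, task: Task) -> str:
        return f"grid://{self.name}/{task.name}"


class CloudPool(Resource):
    def _job_id(self, task: Task) -> str:
        return f"cloud://{self.name}/{task.name}"


def allocate(task: Task, resources: list[Resource]) -> str:
    """First-fit allocation (an assumed policy): use the first resource that fits."""
    for resource in resources:
        if resource.can_run(task):
            return resource.submit(task)
    raise RuntimeError(f"no resource can run task {task.name!r}")


if __name__ == "__main__":
    pool = [GridSite("site-a", 4), CloudPool("pool-b", 16)]
    # The caller never says which kind of resource to use.
    print(allocate(Task("reco", 8), pool))
    print(allocate(Task("sim", 2), pool))
```

A new resource type (for example a volunteer or HPC backend) would be added by subclassing `Resource`, which is one way to read the "Extendibility" point; a smarter policy could replace the first-fit loop in `allocate` without changing the callers.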

  6. VLDATA
     • Advanced Modules: Compute & Data Management, Quality & Security Requirements
     • Framework: System Logging, Configuration, Accounting, Monitoring
     • Basic Modules: File Catalog, Resource Status, Request Management, Workload Management
     • Resource Interfaces: Grids, Clouds, Clusters; Computing, Storage; Public, Private, Volunteer
     • User Interfaces: Portals, Command Line, REST, APIs

  7. Organization

  8. Development Area (WP 1-5)
     • Main Partners:
        • AMC, Cardiff, CPPM, CYFRONET, DESY, MTA SZTAKI, UAB, UB, Westminster
     • Working Cycle:
        • Requirement Analysis -> Design -> Development -> Integration -> Quality Control
     • Work Plan:
        • Year 0: Prototype
        • Year 1: Scaling + Integration
        • Year 2: Catalogs + Quality
        • Year 3: Virtualization + Security
        • Year 4: New Challenges + Consolidation
     • Sustainability
        • Open Source Distributed Data Processing Collaboration: Open DISData Association
        • 4-year decreasing budget for Development Area

  9. Validation Area (WP 6)
     • Each participating RI produces and validates its own solution using the common framework and tools
     • Main Partners:
        • LHCb, Belle II, BES III, Pierre Auger Observatory, EISCAT_3D, Astrophysics, Computational Chemistry, Molecular Structure simulation, Seismology
        • (TBC) IceCube, COMPASS, NA62, CTA
        • SMEs
     • Working Cycle:
        • Design -> Integration -> Validation
     • Transversal activities: sharing experience, training, tools, etc.

  10. Exploitation Area (WP 7-9)
     • Target: sustainability
     • Main partners:
        • CNRS, UvA, EGI, ASCAMM, ETL, Bull
     • From “DIRAC Consortium” towards “Open Distributed Data Processing Collaboration”
     • Work Packages:
        • Communication
        • Training
        • Sustainability

  11. Outputs & Inputs [diagram; labels: Policy Makers, Experts, Other Projects, This Consortium, Resource Providers, WP1, Existing Products, WP2, 3, 4, 5, WP6, Other RIs, WP7, 8, 9]

  12. Working model • User community driven co-development (Rapid Application Development): • Open, iterative, incremental and parallel, requirement-driven development process

  13. Abstract The proposed project aims to produce and validate common solutions for the processing, curation, analysis and publication of very large scientific data generated by European and worldwide scientific Research Infrastructures (RIs). The number of RIs in Europe and beyond expected to collect yearly multi-Petabyte data samples increases exponentially, and they will soon be reaching the Exa scale. Existing solutions must evolve in order to cope with large-scale distributed data processing. The VLDATA platform will provide standards-based interoperable access to various types of resources: Grid, Cloud, Volunteer, HPC, etc. (funded with different models, capex or opex, and coming from the public or private sector), and software tools/services running on top to support global data science. Various RIs from different scientific domains, from physics to life sciences and chemistry, will validate the VLDATA platform by implementing solutions for their concrete use cases, achieving at the same time a significant optimization in efficiency and cost, and ensuring that no aspect of the challenge is ignored. The complete life cycle of the data will be addressed, as well as interoperability between different scientific domains and e-Infrastructures. The project gathers experts with complementary backgrounds from the Technology and the Resource Provider worlds who, collaborating with other relevant external experts and those from the participating RIs, will: (1) analyse the requirements of each of the RIs, (2) provide the VLDATA generic platform, starting from current solutions and following an incremental iterative development model, (3) design, prototype and implement the Distributed Computing Systems for each of the project's participating RIs, and (4) make the resulting VLDATA platform available to other RIs with similar needs.
To reach other RIs, VLDATA will promote standard interfaces and tools, define appropriate quality assurance mechanisms and provide dissemination and training events, aiming to be sustained in the long run by contributions from new RIs benefitting from the VLDATA platform.
