
Breakout Plan:


colucci

Presentation Transcript


  1. Breakout Plan:
     • 9:00 – 9:30 am Intro to the Breakout
       • Quick introductions
       • Review of ANL findings
       • Plan for subgroup division and charge
       • Reporting template discussions
     • 9:30 am – 10:45 am Break into subgroups
       • Fill in breakthrough slide (slide 1)
     • 10:45 am Begin slide on science impact (slide 2)
     • 11:45 am Give completed slides to breakout leads
     • 12:00 – 12:30 pm Get lunch and come back
     • 12:30 pm – 1:00 pm Finalize slide presentation
     • 1:00 pm – 2:30 pm Co-leads present report-back slides

  2. Facilities Integration and AI Ecosystem
     Co-lead: Michael E. Papka
     Co-lead: Inder Monga
     Co-lead: James J. Hack
     Technical Writer: Scott Jones

  3. List of breakout participants
     • Kalyan Perumalla
     • Carlos Soto
     • Suzy Tichenor
     • Torre Wenaus
     • Julia White
     • Sean Wilkinson
     • Da Yan
     • Junqi Yin
     • Inder Monga
     • Bobby Sumpter
     • Natalia Vasileva
     • Jay Bardhan
     • Arthur Bland
     • Jim Brandt
     • Guojing Cong
     • George Fann
     • James Hack
     • Sean Hearne
     • Scott Jones
     • Andrew Kail
     • Doug Kothe
     • Ralph Kube
     • Michael Matheson
     • Veronica Melesse Vergara
     • Bronson Messer
     • Michael Papka

  4. Multi-facility integration and streaming data
     • Specific capabilities that need development
     • Automated flows of data in and out of the HPC environment (e.g., streaming, batch processing, …)
     • Integrated instruments and experimental facilities need to be impedance matched
     • Role of edge processing needs to be considered in overall multi-facility workflows
     • On-demand access needed for real-time feedback and control, and for characterization of instruments (e.g., computational steering)
     • Multi-facility scheduling across all resources that allows real-time processing of streaming data
     • Federated identity/instrument capabilities are an essential part of the solution
     • How much of the HPC system do you need?
     • Optimization is minimization of time to solution
     • e.g., synthesis of materials needs faster turn-around; characterization of materials can live with a longer turn-around time
     • Use AI methodologies to decide where generation/manipulation of data needs to happen (edge, HPC environment, in the cloud…)
     • Multi-parameter optimization problem including real costs
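The placement question on this slide (edge, HPC, or cloud) can be illustrated with a toy calculation. The sketch below is not an AI method and not anything proposed by the breakout group; it is a plain exhaustive comparison that estimates time to solution as transfer time plus queue wait plus compute time, then optionally adds a weighted dollar cost. The `Site` fields, the three example sites, and every number are hypothetical and chosen only to show the trade-off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """Hypothetical description of one place where data could be processed."""
    name: str
    bandwidth_gbps: float       # link speed from the instrument, gigabits/s
    queue_wait_s: float         # expected wait before processing starts
    throughput_gb_per_s: float  # processing rate, gigabytes/s
    cost_per_hour: float        # charge while computing, arbitrary currency


def compute_seconds(site: Site, data_gb: float) -> float:
    return data_gb / site.throughput_gb_per_s


def time_to_solution_s(site: Site, data_gb: float) -> float:
    """Transfer time + queue wait + compute time, in seconds."""
    transfer = data_gb * 8 / site.bandwidth_gbps  # GB -> gigabits
    return transfer + site.queue_wait_s + compute_seconds(site, data_gb)


def objective(site: Site, data_gb: float, cost_weight: float) -> float:
    """Time to solution plus a weighted cost term (seconds per currency unit)."""
    cost = site.cost_per_hour * compute_seconds(site, data_gb) / 3600
    return time_to_solution_s(site, data_gb) + cost_weight * cost


def choose_site(sites, data_gb: float, cost_weight: float = 0.0) -> Site:
    """Pick the site with the smallest objective value."""
    return min(sites, key=lambda s: objective(s, data_gb, cost_weight))


# Illustrative values only; none of these come from the slides.
SITES = [
    Site("edge", bandwidth_gbps=100, queue_wait_s=0,
         throughput_gb_per_s=0.5, cost_per_hour=0),
    Site("hpc", bandwidth_gbps=10, queue_wait_s=600,
         throughput_gb_per_s=50, cost_per_hour=200),
    Site("cloud", bandwidth_gbps=1, queue_wait_s=30,
         throughput_gb_per_s=5, cost_per_hour=50),
]
```

With these made-up numbers, a 10 GB burst is best handled at the edge, while a 10 TB dataset favors the HPC system unless cost is weighted very heavily. A learned policy, as the slide suggests, would replace the fixed estimates with predictions from observed transfers and queue behavior.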

  5. Cross-Facility optimization of applications
     • AI to improve, and serve as an adjunct to, simulation
     • Grand challenge
     • Identifying appropriate data from science investigations across facilities exploring similar phenomenology, methods, and computational workflows, to train and optimize using AI techniques
     • e.g., insufficient data to optimize things like AMR techniques, where capturing data on simulations using AMR across multiple simulations, across labs, etc., may have value
     • Leverage other investments in development of trained AI models (see model management)
     • e.g., with appropriate metadata, leverage community work
     • Finding and sharing datasets for training (e.g., metadata challenges, access time challenges, …)
     • e.g., if I need to optimize a DFT simulation, can I get data from a broad range of DFT simulations, with the appropriate metadata, to collectively and thoroughly characterize simulation capabilities…

  6. Model Management
     • Community use and sharing of large AI models
     • How can the community exploit models developed by others?
     • Proprietary data may introduce other challenges (e.g., policy space)
     • Training models for the community: provenance, what data was used to train, etc.
     • Sharing of models: what are the mechanisms, and what is the responsibility of facilities for supporting/maintaining these capabilities?
     • Metadata standards relevant to modeling frameworks need development
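As a concrete picture of what provenance metadata for a shared model might carry, here is a minimal sketch. The slide notes that metadata standards still need development, so the `ModelRecord` field names and the `REQUIRED` list below are assumptions made up for illustration, not an existing standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ModelRecord:
    """Hypothetical provenance record published alongside a shared model."""
    name: str
    version: str
    framework: str            # e.g. the training framework and its version
    training_datasets: list   # identifiers of the data used for training
    weights_sha256: str       # checksum tying the record to one weights file
    license: str = ""
    restrictions: list = field(default_factory=list)  # e.g. proprietary data


# Fields that must be non-empty before a model is shared (an assumption).
REQUIRED = ("name", "version", "framework", "training_datasets",
            "weights_sha256")


def fingerprint(weights: bytes) -> str:
    """SHA-256 hex digest of the serialized weights."""
    return hashlib.sha256(weights).hexdigest()


def missing_provenance(record: ModelRecord) -> list:
    """Names of required fields that are empty, in declaration order."""
    values = asdict(record)
    return [key for key in REQUIRED if not values[key]]


def to_json(record: ModelRecord) -> str:
    """Serialize the record so it can travel with the model."""
    return json.dumps(asdict(record), sort_keys=True)
```

A facility-side check could refuse to publish a model while `missing_provenance` returns anything, which is one simple way to make "what data was used to train?" answerable later.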

  7. AI environment (software and libraries)
     • Need for a scalable environment
     • Scale up from local machine to HPC machine seamlessly
     • Benchmarks/tests needed to ensure that the scaled-up version is correct?
     • SC-wide maintained repository of AI software packages with metadata on architecture, scale, etc.
     • Proper abstractions so the user is protected from software stack variability, other packages used, and scalability changes of libraries
     • Abstraction layer helps the computing facilities vet and protect the underlying software infrastructure, so the user is not using unverified software packages (see cybersecurity)
     • Use of testbeds for new software and hardware
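The question of testing whether a scaled-up version is correct can be sketched as a comparison between a reference result from a small, trusted run and the result of the distributed run, within a tolerance. The example below uses a mean computed serially and then from per-worker partial sums as a stand-in workload; the function names and the tolerance are assumptions for illustration, and real AI workloads would typically compare losses or metrics with looser, statistically motivated tolerances.

```python
import math


def reference_mean(values):
    """Trusted single-process result."""
    return sum(values) / len(values)


def sharded_mean(values, n_workers):
    """Same quantity from per-worker partial sums, mimicking a scaled-up run."""
    shards = [values[i::n_workers] for i in range(n_workers)]
    partials = [(sum(shard), len(shard)) for shard in shards if shard]
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count


def agrees(reference, scaled, rel_tol=1e-9):
    """True if the scaled result matches the reference within rel_tol.

    Exact equality is too strict: floating-point sums depend on the order
    of operations, which changes with the number of workers.
    """
    return math.isclose(reference, scaled, rel_tol=rel_tol)
```

For example, `agrees(reference_mean(data), sharded_mean(data, 8))` could run as a gate each time a workflow moves from a local machine to a larger allocation.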

  8. Optimize the operation of facilities
     • Using AI for facility operations optimization
     • Managing and organizing monitoring data coming from the facility, along with appropriate metadata; are we gathering the right data?
     • Characterizing applications and understanding their ‘fingerprint’ to optimize how the application is configured for the particular facility (e.g., architectural awareness)
     • Development of APIs that will enable feedback to systems and users
     • Data from multiple facilities to optimize end-to-end scientific workflows that might use multiple resources across administrative domains
     • Policy, access, data sharing formats, etc.

  9. Cybersecurity
     • Identifying bad actors on the machine: anomaly detection on the captured/sensed data using AI techniques
     • Real-time action/response to identified problems
     • Identifying malicious code
     • Policies that allow data to be used across facilities in a way that preserves agreements and restrictions
     • Appropriate cybersecurity controls to allow easy streaming of data in and out of the facility, at high performance?
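To make the anomaly detection bullet concrete, here is a minimal statistical baseline, assuming the monitored signal is a simple per-interval count (for example, failed logins per hour, a hypothetical choice). It uses a robust z-score built from the median and the median absolute deviation rather than a trained AI model; the slide itself does not specify a method, and a production system would likely use richer features and learned detectors.

```python
import statistics


def robust_z_scores(values):
    """Distance of each value from the median, in MAD-based units.

    The factor 1.4826 makes the median absolute deviation (MAD) comparable
    to a standard deviation for normally distributed data.
    """
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    scale = 1.4826 * mad
    if scale == 0:
        # No spread in at least half of the data: scoring is not meaningful.
        return [0.0 for _ in values]
    return [abs(v - median) / scale for v in values]


def flag_anomalies(values, threshold=3.5):
    """Indices whose robust z-score exceeds the threshold."""
    scores = robust_z_scores(values)
    return [i for i, score in enumerate(scores) if score > threshold]
```

Calling `flag_anomalies([3, 4, 2, 5, 3, 4, 120, 3])` flags only index 6, the interval with 120 events. The median-based scale keeps a single extreme value from hiding itself by inflating the spread, which is the main reason to prefer it over a mean and standard deviation here.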
