
Integrated Data Analysis and Visualization



Presentation Transcript


  1. Integrated Data Analysis and Visualization Group 5 Report DOE Data Management Workshop May 24-26, 2004 Chicago, IL

  2. Group 5 Participants

  3. Outline • What is integrated data analysis and visualization? Why do we care? • Data Complexity & Implications • Applications-driven capabilities • Technology gaps to address these capabilities • General Recommendations • Conclusions

  4. The Curse of Ultrascale Computation and High-throughput Experimentation Computational and experimental advances enable capturing of complex natural phenomena on a scale not possible just a few years ago. With this opportunity comes a new problem: the petabyte quantities of produced data. As a result, answers to fundamental questions about the nature of the universe remain largely hidden in these data. How do we enable scientists to perform analyses and visualizations of these raw data to extract knowledge?

  5. Tony’s Scenario: Data Select → Data Access → Correlate → Render → Display. Example request: Select (density, pressure) From astro-data Where (step=101) and (x-velocity>Y); Sample (density, pressure); Run analysis; Run viz filter; Visualize scatter plot. The request passes down a layered stack: the Scientific Process Automation Layer (workflow design & execution), the Data Mining & Analysis Layer (analysis and VIZ tools that select data, take samples, and read/write named data buffers), the Storage Efficient Access Layer (get variables by names and ranges, use bitmap (condition) via Bitmap Index Selection, PVFS, Parallel HDF), and finally the hardware, OS, and MSS (HPSS).
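
As a rough illustration of the scenario above, the following is a minimal sketch, in plain Python with NumPy and matplotlib, of the select → sample → analyze → visualize flow. The array names, the threshold Y, and the synthetic data are hypothetical stand-ins for the bitmap-index and parallel-I/O machinery of the real layered stack.

```python
# Hypothetical sketch of Tony's scenario: select, sample, analyze, visualize.
# Variable names, thresholds, and data are illustrative, not from the slide deck.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Stand-in for "astro-data" at step=101: density, pressure, x-velocity per cell.
density = rng.lognormal(mean=0.0, sigma=1.0, size=1_000_000)
pressure = density ** 1.4 * rng.normal(1.0, 0.05, size=density.size)
x_velocity = rng.normal(0.0, 1.0, size=density.size)

# "Where (x-velocity > Y)": in the real stack this would be a bitmap-index lookup.
y = 1.0
mask = x_velocity > y

# "Sample (density, pressure)": take a modest random sample of the selected cells.
idx = rng.choice(np.flatnonzero(mask), size=5_000, replace=False)
d_sample, p_sample = density[idx], pressure[idx]

# "Run analysis": e.g., a simple correlation between the two sampled quantities.
corr = np.corrcoef(d_sample, p_sample)[0, 1]
print(f"density/pressure correlation on sample: {corr:.3f}")

# "Run viz filter; Visualize scatter plot".
plt.scatter(d_sample, p_sample, s=2, alpha=0.3)
plt.xlabel("density"); plt.ylabel("pressure")
plt.title("Sampled cells with x-velocity > Y")
plt.show()
```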

  6. Integration Must Happen at Multiple Levels • To enable end-to-end system performance, the 80-20 rule, and novel discoveries, integration must happen: • Between and within data flow levels: Workflows ↔ Analysis & Viz ↔ Access & Movement • Across geographically distributed resources • Across multiple data scales and resolutions

  7. Challenge of Data Massiveness: Drinking from the Firehose • Climate • Now: 20-40 TB per simulated year • 5 yrs: 100 TB/yr → 5-10 PB/yr • Fusion • Now: 100 MBytes/15 min • 5 yrs: 1000 MBytes/2 min with realtime comparison against the running experiment, 500 Mbits/sec guaranteed (QoS) • High Energy Physics • Now: 1-10 PB data stored, Gigabit networks • 5 yrs: 100 PB data, 100 Gbits/sec networks • Chemistry (Combustion and Nanostructures) • Now: 10-30 TB data • 5 yrs: 30-100 TB data, 10 Gbits/sec multicast • Astrophysics • Now and in 5 yrs: Can soak up anything you build! (John Shalf’s stats, LBL)

  8. Most of this Data will NEVER Be Touched with the current trends in technology • The amount of data stored online quadruples every 18 months, while processing power ‘only’ doubles every 18 months. • Unless the number of processors increases unrealistically rapidly, most of this data will never be touched. • Storage device capacity doubles every 9 months, while memory capacity doubles every 18 months (Moore’s law). • Even if these growth rates eventually converge, memory latency is, and will remain, the rate-limiting step in data-intensive computations. • Operating systems struggle to handle files larger than a few GBs. • OS constraints and memory capacity determine data set file size and fragmentation.
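
To make the divergence concrete, here is a back-of-the-envelope sketch using only the doubling/quadrupling periods quoted above, with a five-year horizon matching the slide's framing:

```python
# Compound growth from the periods quoted on the slide:
# data online x4 per 18 months, compute x2 per 18 months,
# storage device capacity x2 per 9 months.
def growth(factor_per_period: float, period_months: float, horizon_months: float) -> float:
    """Total growth multiple over the horizon."""
    return factor_per_period ** (horizon_months / period_months)

horizon = 60  # five years, in months
data_online = growth(4, 18, horizon)   # roughly 100x
processing  = growth(2, 18, horizon)   # roughly 10x
storage_dev = growth(2, 9, horizon)    # roughly 100x

print(f"data online grows   ~{data_online:5.0f}x in 5 years")
print(f"processing grows    ~{processing:5.0f}x in 5 years")
print(f"storage devices grow ~{storage_dev:5.0f}x in 5 years")
print(f"gap (data/compute)  ~{data_online / processing:5.0f}x")
```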

  9. Challenge of Breaking the Algorithmic Complexity Bottleneck MS Data Rates: 100’s GB → 10’s TB/day (2004) → 1.0’s PB/day (2008). Algorithmic complexity: calculate means O(n); calculate FFT O(n log(n)); calculate SVD O(r • c); clustering algorithms O(n^2). For illustration, the chart assumes 10^-12 sec calculation time per data point:

  Data size, n | O(n)       | O(n log n) | O(n^2)
  100 B        | 10^-10 sec | 10^-10 sec | 10^-8 sec
  10 KB        | 10^-8 sec  | 10^-8 sec  | 10^-4 sec
  1 MB         | 10^-6 sec  | 10^-5 sec  | 1 sec
  100 MB       | 10^-4 sec  | 10^-3 sec  | 3 hrs
  10 GB        | 10^-2 sec  | 0.1 sec    | 3 yrs
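
The table entries follow directly from the stated 10^-12 sec per data point; a minimal sketch (plain Python, labels are illustrative) that reproduces them:

```python
# Reproduce the complexity table: time = ops(n) * 1e-12 seconds per operation.
import math

COST_PER_OP = 1e-12  # seconds per data point, as assumed on the slide

def seconds(n: int, complexity: str) -> float:
    ops = {"n": n, "n log n": n * math.log2(n), "n^2": n * n}[complexity]
    return ops * COST_PER_OP

for label, n in [("100 B", 100), ("10 KB", 10**4), ("1 MB", 10**6),
                 ("100 MB", 10**8), ("10 GB", 10**10)]:
    row = ", ".join(f"{c}: {seconds(n, c):.1e} s" for c in ("n", "n log n", "n^2"))
    print(f"{label:>7} -> {row}")
# 10 GB at O(n^2) gives 1e8 seconds, i.e. roughly 3 years.
```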

  10. Massive Data Sets are Naturally Distributed BUT Effectively Immovable (Skillicorn, 2001) • Bandwidth is increasing, but not at the same rate as stored data • There are some parts of the world with high available bandwidth, BUT there are enough bottlenecks that high effective bandwidth is unachievable across heterogeneous networks • Latency for transmission at global distances is significant • Most of this latency is time-of-flight and so will not be reduced by technology • Data has a property similar to inertia: • It is cheap to store and cheap to keep moving, but the transitions between these two states are expensive in time and hardware • Legal and political restrictions • Social restrictions • Data owners may allow access to data, but only while retaining control of it • Should we move computations to the data, rather than data to the computations? • Should we cache the data close to analysis and viz.? • Should we be smarter about reducing the size of the data while retaining the same or richer information content?

  11. Challenge of High Dimensionality, Multi-Scale and Multi-Resolution The experiment paradigm is changing to statistically capture the complexity. We will get maximum value when we explore three or more dimensions/scales in a single experiment. Example dimensions: cells & tissues, treatments, genetic manipulations, phenotypes, time, genetics, environments, populations (from G. Michaels, PNNL). But multi-scale and multi-resolution analysis and visualization are in their infancy!

  12. Know Our Limits & Be Smart: Obligations are Two-Sided (CS and Apps) Can we browse a petabyte of data? To see 1 percent of a petabyte at 10 megabytes per second takes 35 eight-hour days, so it is not humanly possible to browse a petabyte of data (human bandwidth overload). Analysis of the full context must select views or reduce to quantities of interest, in addition to fast rendering of data, rather than push more views past the user. Visualization scalability comes through guidance by analysis of the full context: as data grows from megabytes through gigabytes and terabytes to petabytes, move from no analysis, through region selection, to analysis-driven summarization. • Ultrascale Simulations: Must be smart about which probe combinations to see! • Physical Experiments: Must be smart about probe placement!
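
The 35-day figure follows from simple arithmetic; a quick check (a sketch, nothing slide-specific):

```python
# Check the slide's arithmetic: browse 1% of a petabyte at 10 MB/s.
petabyte = 1e15          # bytes
rate = 10e6              # bytes per second
seconds = 0.01 * petabyte / rate
workdays = seconds / (8 * 3600)
print(f"{workdays:.1f} eight-hour days")  # ~34.7, i.e. roughly 35
```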

  13. Frame of Context Differences Suggest Needs in Hardware and Software for Analysis and Visualization Arguably, visualization can be the most critical step of a simulation experiment, but it has to be done in full context. “I hear and I forget. I see and I believe. I do Visual Analysis and I understand.” —Confucius (551-479 BC). [Figure: frames of context for the major steps of the space-time simulation scientific discovery process (simulation, analysis, storage, visualization), from G. Ostrouchov, ORNL. Need hardware and software for full-context analysis and visualization.]

  14. But Tony Still Has a Dream – Internet “Plug-ins” for Ultrascale Computing! [Figure: ParaView and ASPECT]

  15. From Dreams to Achievable Application-driven Capabilities The first step in the Group 5 discussion

  16. Capability #1: IDL-like SCALABLE, open source environment (J. Blondin) • High performance-enabling technologies: • Parallel analysis and viz. algorithms (e.g. pVTK, pMatlab, parallel-R) (3—2) • Portable implementation on HPC platforms (2—1) • Hardware accelerated implementations (GPUs, FPGAs) (2—1) • Parallel I/O libraries coupled with analysis and viz (ROMIO+pVTK, pNetCDF+Parallel-R) (2—1) • Information visualization (3—2) • Interoperability-enabling technologies: • Component architectures (CCA) (3—2) • Core data models and data structures unification (3—2) • Structural, semantic and syntactic mediation (3—2) • Scripting environments: • IDL/Matlab-like high-level programming languages (3) • Optimized (parallel, accelerated) functions (core libraries) (3) • Simulation interfaces (3) • Visualization interfaces (information, statistical and scientific visualization) (3—2)
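
As a flavour of what the scalable, IDL-like environment implies, here is a minimal sketch of a parallel analysis function written against mpi4py and NumPy (run under, e.g., mpiexec -n 4). The data and the statistic are illustrative; environments such as pMatlab or parallel-R hide this boilerplate behind high-level functions.

```python
# Minimal parallel analysis sketch: each rank reduces its own data slice,
# then a global mean is assembled with an MPI allreduce.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Stand-in for this rank's slice of a large dataset.
local = np.random.default_rng(rank).normal(size=1_000_000)

# Per-rank partial sums, combined across all ranks.
local_stats = np.array([local.sum(), float(local.size)])
global_stats = np.zeros_like(local_stats)
comm.Allreduce(local_stats, global_stats, op=MPI.SUM)

mean = global_stats[0] / global_stats[1]
if rank == 0:
    print(f"global mean over {int(global_stats[1])} samples: {mean:.6f}")
```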

  17. Capability #2: Domain-specific libraries and tools • Same technologies as for Capability #1, plus more • Novel algorithms for domain-specific analysis and visualization: • Feature extraction/selection/tracking (e.g., ICA for climate) (3—2) • New types of data (e.g. trees, networks) (3—2) • Interpolation & transformation (3—2) • Multi-scale/hierarchical features correlation (3) • Novel data models if necessary (3)

  18. Capability #3: “Plug and Play” Analysis and Visualization Environments • Community-specific data model(s) (3) • Standardization that still provides efficiency and flexibility • Community-specific common APIs (3) • Unified yet extensible data structures • Common component architectures (3—2)
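
To make the "common API + common data model" idea concrete, here is a minimal sketch of what a plug-and-play analysis/visualization component interface could look like in Python. The class and method names are hypothetical, not a proposed standard.

```python
# Hypothetical plug-and-play interface: every component consumes and produces
# the same community data model, so components compose regardless of author.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict
import numpy as np


@dataclass
class Dataset:
    """Toy stand-in for a community-agreed data model."""
    variables: Dict[str, np.ndarray]
    metadata: Dict[str, str] = field(default_factory=dict)


class Component(ABC):
    """Common API: one entry point, data model in, data model out."""
    @abstractmethod
    def execute(self, data: Dataset) -> Dataset: ...


class ThresholdFilter(Component):
    """Example component: keep only cells where a variable exceeds a threshold."""
    def __init__(self, var: str, threshold: float):
        self.var, self.threshold = var, threshold

    def execute(self, data: Dataset) -> Dataset:
        mask = data.variables[self.var] > self.threshold
        out = {k: v[mask] for k, v in data.variables.items()}
        return Dataset(out, {**data.metadata, "filter": f"{self.var}>{self.threshold}"})


def run_pipeline(data: Dataset, *components: Component) -> Dataset:
    """Components written by different groups chain into one pipeline."""
    for c in components:
        data = c.execute(data)
    return data
```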

  19. “i”-ntegration vs. “I”-ntegration Component integration strategies should be assessed both within a single higher-level application and between multiple higher-level applications. • “i”-integration within the same application: • Same set of data structures => brute-force check (at worst) • Same language, execution and control model • Scripting languages (TCL, Python, R) • Run on the same cluster of machines • “I”-integration across multiple applications: • Different data structures (unknown for future apps) • Different execution & control models • Different programming languages • Run on different hosts • Data format transformations: File → App, App_X → App_X, App_X → App_Y, App → File

  20. Capability #4: Feature (region) detection, extraction, tracking • Efficient and effective data indexing that (3-2): • Supports unstructured data in files (e.g., bitmap indexing extension to AMR) • Supports heterogeneous/non-scalar data (e.g., vector fields, protein sequence, protein function, pathway, network) • Supports on-demand derived data (e.g. F(X)/G(Y)<5: entropy as a function of indexed density and pressure) • Information visualization (3—2)
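
A minimal sketch of the bitmap-indexing idea behind this capability: precompute one bitmap per value bin so that range conditions become fast bitwise operations. Bin edges and variable names here are illustrative; production bitmap indexes add compression, multi-attribute queries, and exact handling of the boundary bin.

```python
# Minimal bitmap index sketch: one bitmap per value bin; a range query
# becomes a bitwise OR over the precomputed bitmaps.
import numpy as np

class BitmapIndex:
    def __init__(self, values: np.ndarray, bin_edges: np.ndarray):
        self.bin_edges = bin_edges
        bins = np.digitize(values, bin_edges)                # bin id per element
        self.bitmaps = [bins == b for b in range(len(bin_edges) + 1)]

    def greater_than(self, threshold: float) -> np.ndarray:
        """Conservative selection: OR together all bins entirely above threshold."""
        first = int(np.searchsorted(self.bin_edges, threshold, side="right"))
        mask = np.zeros_like(self.bitmaps[0])
        for b in self.bitmaps[first + 1:]:
            mask |= b
        return mask

# Usage: index x-velocity once, then reuse the bitmaps for many queries.
rng = np.random.default_rng(0)
x_velocity = rng.normal(size=1_000_000)
index = BitmapIndex(x_velocity, bin_edges=np.linspace(-4, 4, 65))
selected = index.greater_than(1.0)
print(f"selected {selected.sum()} of {x_velocity.size} cells (boundary bin excluded)")
```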

  21. Other Capabilities • Remote, collaborative & interactive analysis and visualization (3-2): • Network-aware analysis and viz • Novel means of hiding latency (e.g. caching via LoCi, view-dependent isosurfaces) • Sensitivity & uncertainty quantification (3) • Streaming analysis & viz (3) (see the sketch below): • Approximate multi-resolution algorithms • Data transformations on streaming data • Annotation & provenance of analysis and visualization results • System-, analysis & viz-, and data-level metadata • Verification and validation • Comparative analysis and visualization • Cross-cutting capabilities: • Integration of analysis and visualization with workflows • Integration of analysis and visualization with databases (e.g., query-based)
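
As one small example of streaming analysis, a single-pass running mean/variance (Welford's algorithm) maintains summary statistics without ever holding the full stream in memory. This is a generic sketch, not tied to any specific project listed above.

```python
# Single-pass (streaming) mean and variance via Welford's algorithm:
# statistics are updated per value, so the full stream never fits in memory.
class RunningStats:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


# Usage: feed values as they stream in from a simulation or instrument.
stats = RunningStats()
for value in (0.9, 1.1, 1.4, 0.7, 1.0):   # stand-in for a data stream
    stats.update(value)
print(f"n={stats.n} mean={stats.mean:.3f} var={stats.variance:.4f}")
```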

  22. General Recommendations • Encourage open source software • Move out mature technologies (??) • Encourage/force data model(s) & API standardization efforts? • Do not expect scientists to develop their domain-specific components; rather, fund collaborative CS & Apps teams → this will ensure more robust and reusable solutions and take the burden of CS tasks off scientists

  23. Conclusions • Integration must occur at multiple levels • Integration is more easily achievable within a community than across communities • Community-based data model(s) and APIs are required for “Plug & Play”
