
Efficient access and query, data integration. Group 4. Group coordinators: Alok Choudhary, Rob Ross



  1. Efficient access and query, data integration
     • Group 4
     • Group coordinators: Alok Choudhary, Rob Ross

  2. Parallel and Random I/O
     • I/O stacks
       • High-level I/O libraries (PnetCDF, HDF5, SILO)
       • I/O middleware (MPI-IO)
       • Parallel file systems (Lustre, GPFS, PVFS)
       • Other shared file systems (CXFS, GFS, Panasas, QFS)
       • Solutions may exist
       • Performance/scalability are “ok”
       • Will these scale to next-generation systems (e.g., BG/L, Red Storm)?
     • Random I/O
       • Query metadata for optimizing seemingly random accesses
     • Research and development
       • Scale! Not just an engineering problem.
       • DB-like query operations (more later)
       • Recognizing and/or passing on access pattern information, then acting on it
         • Related to metadata issues
       • Execution of application code at the I/O server (active disks)
       • (User) metadata as file system constructs
     • Hardening and packaging
       • Large FC configurations
       • Fault tolerance
       • System support
     • Deployment and maintenance
       • Low-bandwidth, serial applications in good shape
       • High-bandwidth, embarrassingly parallel, task farming
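As an illustration of "recognizing access pattern information, then acting on it", the sketch below shows one simple thing an I/O layer could do once it knows the full list of byte ranges an application wants: sort them and merge neighbours so that many small, seemingly random reads become a few larger ones. This is not taken from the slides; the function name `coalesce`, the `max_gap` parameter, and the sample request list are all assumptions made for the example.

```python
def coalesce(requests, max_gap=0):
    """Merge (offset, length) byte ranges that touch, overlap, or lie
    within max_gap bytes of each other. Returns sorted (offset, length)."""
    merged = []
    for offset, length in sorted(requests):
        end = offset + length
        if merged and offset <= merged[-1][1] + max_gap:
            # Close enough to the previous range: extend it instead of
            # issuing a separate read.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]


# Hypothetical request list, in the order the application issued it.
requests = [(4096, 512), (0, 1024), (1024, 1024), (4700, 100), (9000, 10)]

print(coalesce(requests))                # only touching ranges are merged
print(coalesce(requests, max_gap=4096))  # also read through small holes
```

A larger `max_gap` trades extra bytes read for fewer requests, which is the same trade-off made by data-sieving style optimizations in I/O middleware.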

  3. Parallel and Random I/O
     • Gaps with priority
       • Scaling of the parallel I/O stack
         • Both scaling of the number of clients, and
         • Scaling of the size of the file system (number of files/objects)
       • APIs for passing more information to the system
         • Already present to some extent in MPI-IO and some PFSs, but not adequate; support is also needed in the high-level I/O libraries
       • Management of large-scale storage
       • Fault tolerance
       • Autonomic (self-managing, etc.) storage
       • Connecting PFSs to hierarchical storage systems efficiently

  4. Large-Scale, Feature-Based Queries
     • Lots of dimensions
       • Existing indexing techniques aren’t particularly good for this
       • Not worth building an index at all in some instances
     • Research and development
       • Parallel update problem with existing representations
       • When to linear scan, streaming
       • Hardware-assisted searching (e.g., Netezza, NexQL, Seisint)
     • Hardening and packaging
       • Bitmapped indexing, in some use
     • Deployment and maintenance
       • Relational DBs
       • Object DBs
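For readers unfamiliar with bitmapped indexing, here is a minimal sketch of the idea, using Python integers as uncompressed bitmaps. It is an illustration only, not the implementation any of the packages named on the slide use; the column names (`phase`, `region`) and their values are made up for the example.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i
    holds that value."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return dict(index)


def rows_of(bitmap):
    """Decode a bitmap into a sorted list of row numbers."""
    rows, row = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Two hypothetical attribute columns over the same five rows.
phase = ["gas", "liquid", "gas", "solid", "gas"]
region = ["core", "core", "edge", "edge", "core"]

phase_idx = build_bitmap_index(phase)
region_idx = build_bitmap_index(region)

# A multi-attribute query becomes bitwise AND/OR over per-value bitmaps.
gas_in_core = phase_idx["gas"] & region_idx["core"]
print(rows_of(gas_in_core))
```

Because each attribute is indexed independently and combined at query time, this approach copes with many dimensions better than a single multi-dimensional tree. Adding a row touches one bitmap per attribute, which gives a feel for why concurrent index updates are a concern.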

  5. Large-Scale, Feature-Based Queries
     • Gaps with priorities
       • Scalability of techniques, such as indexing, as a solution to this problem
       • Support for runtime feature extraction
       • Concurrent update (addition) to indices
         • Only for some groups

  6. Query Processing over Files
     • DB-like operations on files
       • Structured data files such as HDF5, PnetCDF, SILO
       • Alternative APIs, file format independent
         • Java database objects, ODMG
     • Research and development
       • What should the API look like?
       • Protocols for accessing databases in distributed environments with arbitrary backends (e.g., GGF DAIS group)
     • Hardening and packaging
       • Ad-hoc Query package (LLNL work)
         • Range queries over SILO mesh data
       • ROOT (HEP community)
         • Operates on files in internal file format
     • Deployment and maintenance
       • Nothing
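To make "range queries over mesh data" concrete, the sketch below shows one common way such a query can avoid reading the whole file: keep a min/max summary per block as metadata and skip any block whose summary rules out a match. This is an assumption-laden illustration, not the LLNL package's method; the `Block` class, the function names, and the `pressure` values are hypothetical, and in-memory lists stand in for chunks of a file.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One chunk of a variable plus the min/max summary kept as metadata."""
    start: int      # index of the chunk's first element in the full array
    values: list
    lo: float
    hi: float


def make_blocks(data, block_size):
    """Split data into fixed-size chunks and record each chunk's min/max."""
    blocks = []
    for start in range(0, len(data), block_size):
        chunk = data[start:start + block_size]
        blocks.append(Block(start, chunk, min(chunk), max(chunk)))
    return blocks


def range_query(blocks, lo, hi):
    """Return (indices of elements with lo <= value <= hi, blocks read)."""
    hits, blocks_read = [], 0
    for block in blocks:
        if block.hi < lo or block.lo > hi:
            continue  # summary proves nothing here can match; skip the read
        blocks_read += 1
        for offset, value in enumerate(block.values):
            if lo <= value <= hi:
                hits.append(block.start + offset)
    return hits, blocks_read


# Hypothetical per-cell values of one mesh variable.
pressure = [1.0, 1.2, 0.9, 1.1, 5.0, 5.5, 4.8, 5.2, 1.0, 6.1, 0.8, 1.3]
blocks = make_blocks(pressure, block_size=4)

print(range_query(blocks, 4.5, 6.0))
```

In this run the first block is skipped outright, while the third is read even though it contains no match, because its summary range overlaps the query. How well this works depends heavily on how values cluster within blocks.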

  7. Query Processing over Files
     • Gaps with priorities
       • Determining the API for this query processing
         • What capabilities are needed from this API?
       • Implementing this API for common file formats
         • Appropriate underlying optimizations may impact all of the I/O stack (e.g., query optimizations, cache management)
       • Extensible, parallel runtime for aiding in the use of this API, constructing queries, etc.
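The slides leave the shape of the query API open. Purely as a thought experiment, the sketch below shows what a file-format-independent interface might look like: one abstract record view with a `select` operation, implemented once per format. Everything here is an assumption (the `RecordSource` class, its methods, and the sample data), and CSV and JSON Lines stand in for scientific formats such as HDF5 or PnetCDF only because they can be read with the standard library.

```python
import csv
import io
import json
from abc import ABC, abstractmethod


class RecordSource(ABC):
    """Hypothetical format-independent view: each backend yields dicts."""

    @abstractmethod
    def records(self):
        """Yield one dict per record in the underlying file."""

    def select(self, predicate, fields):
        """Yield the requested fields of each record matching predicate."""
        for rec in self.records():
            if predicate(rec):
                yield {name: rec[name] for name in fields}


class CsvSource(RecordSource):
    def __init__(self, text):
        self.text = text

    def records(self):
        for row in csv.DictReader(io.StringIO(self.text)):
            yield {"id": int(row["id"]), "temp": float(row["temp"])}


class JsonLinesSource(RecordSource):
    def __init__(self, text):
        self.text = text

    def records(self):
        for line in self.text.splitlines():
            if line.strip():
                yield json.loads(line)


# Made-up sample data in two different formats.
csv_data = "id,temp\n1,290.5\n2,310.0\n3,305.2\n"
json_data = '{"id": 4, "temp": 299.0}\n{"id": 5, "temp": 320.1}\n'


def hot(rec):
    return rec["temp"] > 300.0


# The same query runs unchanged against either backend.
for source in (CsvSource(csv_data), JsonLinesSource(json_data)):
    print(list(source.select(hot, ["id"])))
```

An opaque Python predicate like `hot` cannot be pushed down into the I/O stack. An API meant to enable the optimizations mentioned on the slide would more likely need a declarative query description that lower layers can inspect.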

  8. Data Integration
     • Digital libraries, federations, and warehousing
     • Research and development
       • Tools for aiding in creation of warehouses, ontology creation
       • Fine-grained access control
       • Security in federated/distributed environments (pharma, etc.)
         • Applies even to the queries, not just the data itself
     • Hardening and packaging
       • Digital libraries (SRB)
       • Many one-off instances of domain-specific integrations
     • Deployment and maintenance
       • DiscoveryLink (IBM), other commercial packages: framework for doing data integration with their DB offerings
       • Linking similar (R)DBs together isn’t too difficult

  9. Data Integration
     • Gaps with priorities
       • Converging on a language for describing metadata for communities
       • Tools to support wrapping and integrating complex data
         • From arbitrary sources (free text, mesh data, etc.), including files
         • For this domain (community exists looking at bio domain)
       • Provenance
       • Security
         • Cross-domain access and authentication
         • Encryption of both queries and data
         • Authentication of data sources

  10. The End
