
Artemis: Integrating Scientific Data on the Grid



Presentation Transcript


  1. Artemis: Integrating Scientific Data on the Grid Rattapoom Tuchinda Snehal Thakkar Yolanda Gil Ewa Deelman

  2. Outline • Motivation • Data integration needs in scientific applications • Distributed computing in grids • Problem statement • Artemis architecture • Evaluation • Related Work • Conclusions and future work

  3. Scientific Data Integration • Large-scale, cross-disciplinary scientific data collection, storage, and analysis exacerbate heterogeneity and dynamics • National Virtual Observatory (NVO) • Earth System Grid (ESG)

  4. Grid Computing [Foster & Kesselman 04] • Grids provide middleware services for distributed computing: • Seamless integration and management of resources – OGSA • Job submission and execution management – Condor • Resource availability & performance – Monitoring and Discovery Service (MDS) • Data replication for robustness and efficiency – Replica Location Service (RLS) • Descriptions of data sources – Metadata Catalog Service (MCS) • [Figure, from Kesselman 04: registries (R) for discovery, resource managers (RM) for access, and policy and security services] • Registries organize services of interest to a community • Security & policy must underlie access & management decisions • Many sources of data, services, computation • Resource management is needed to ensure progress & arbitrate competing demands • Exploration & analysis may involve complex, multi-step workflows • Data integration activities may require access to, & exploration/analysis of, data at many locations

  5. Scientific Data Storage and Access • Data sources are very heterogeneous • Data results from various instruments, disciplines, and types of analyses • Wide variety of data storage systems (files, DBs, servers, etc.) • Data sources are highly distributed • Data stored in different locations on the grid • Data is replicated in multiple locations • Data sources are highly dynamic • Data grows continuously, new data models are routine • New data sources regularly appear • Data sources may become unavailable sporadically • Data available at unprecedented scale • Petabytes very soon • These challenges stand in the way of scientific progress in many disciplines

  6. Data Storage and Access in Grids • Data described with metadata attributes • Attribute names may not be consistent across different sources • Metadata descriptions often stored separately from the data itself • Metadata Catalog Service (MCS) [Moore et al 01, Singh et al 03] • Stores descriptive metadata and allows users to query based on desired attributes • Addresses heterogeneity of data source implementations and access

  7. Sample Query • Search constraints: keywords = "atmospheric data" or "climate data" or "climate model"; model type = "CCSM" or "PCM"; period = 2001 • Search results (files, collections, or views): /CCSM2/b20.007/atm, /PCM/B06.62/atm, /PCM/B06.20/atm, /PCM/B06.21/atm

  8. Problem Statement • Users should have seamless single-point access • Should not have to formulate a different query for each source • Should not have to manage the unavailability of data sources • Users need assistance formulating the queries • Data models may have different attribute names and representations (even from the same source) • New data models/metadata attributes are created all the time • [Figure: queries q1, q2, q3 against catalogs MCS1, MCS2, MCS3 backed by DB1, DB2, DB3, with differing attribute names (stime, etime, descr, sub, starttime, endtime); one source is labeled "currently unavailable"]

  9. Artemis • A mixed-initiative data integration system that: • Shields users from diversity in attribute representations • Assists users in formulating queries step by step • Manages the access and availability of dynamic collections of data sources • Integrates and extends various AI techniques: • Data Integration • Ontology • Dialogue wizards

  10. Approach • [Figure: an ontology relates the concept Time (Start time, End time) to source attribute names such as stime/starttime and etime/endtime; a Query Formulation Wizard and a Query Mediator sit in front of Metadata Catalog1, Metadata Catalog2, and Metadata Catalog3, each with its own data source; attribute labels shown include stime, etime, description, subject, starttime, endtime] • Example query: Find files with Start time > 500000 ^ End time < 600000
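
The attribute mapping on this slide can be illustrated with a minimal Python sketch. It is not the Artemis implementation: the catalog identifiers (`catalog1`, `catalog2`, `catalog3`), the assignment of `stime`/`etime` and `starttime`/`endtime` to particular catalogs, and the function `translate` are assumptions made only for illustration. The sketch rewrites a query stated in ontology terms into each catalog's own attribute names and skips catalogs that have no mapping for a term.

```python
# Hypothetical mapping from ontology terms to per-catalog attribute names.
# Which catalog uses which name is an assumption, not taken from the slides.
ONTOLOGY_MAPPINGS = {
    "Start time": {"catalog1": "stime", "catalog3": "starttime"},
    "End time": {"catalog1": "etime", "catalog3": "endtime"},
}


def translate(constraints, catalog):
    """Rewrite (term, operator, value) constraints into a catalog's attribute names.

    Returns None when the catalog has no mapping for one of the terms,
    meaning the query cannot be expressed against that catalog.
    """
    translated = []
    for term, operator, value in constraints:
        attribute = ONTOLOGY_MAPPINGS.get(term, {}).get(catalog)
        if attribute is None:
            return None
        translated.append((attribute, operator, value))
    return translated


# The example query from the slide, expressed in ontology terms.
query = [("Start time", ">", 500000), ("End time", "<", 600000)]

catalog1_query = translate(query, "catalog1")
catalog2_query = translate(query, "catalog2")
catalog3_query = translate(query, "catalog3")
```

Keeping the mappings in one table means a domain expert can add a new catalog by adding entries, without touching the query the user wrote.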

  11. Artemis Architecture • [Figure: architecture diagram with components MCS Wizard, Dynamic Model Generator, Prometheus Query Mediator, Ontology, and three Metadata Catalog Service / Data Source pairs; labels on the diagram: Entity selection, Filters, Models, Model Mappings]

  12. MCS Wizard • Based on the Agent Wizard [Tuchinda 2003] • Domain experts create mappings between ontologies and metadata attributes • Users can then pick the ontology and the mappings relevant to their domain • Guides the user through available operations and filters consistent with the models of the data

  13. Prometheus Query Mediator • Data integration system from earlier research [Thakkar et al. 2004] [Knoblock et al. 2003] • Provides unified query interface to a wide variety of data sources • Relational model • Requires pre-defined domain model relating sources to domain relations • Extended in Artemis to support: • Source relations: Various MCSs • Domain relations • File, View, Collection • Dynamic domain model based on availability of data sources

  14. Dynamic Model Generation • Generate mediator model dynamically by querying MCSs • Convert object oriented model of MCSs to relational model of the mediator • Handles dynamic nature of data by generating new domain models at query time • Intuitive idea • Query MCSs one at a time for all possible attributes of different objects • Create domain relation for each object type with all possible attributes • Create rules defining each MCS as data source • Relate various data sources to domain relations

  15. Dynamic Model Generator (Cont’d) • Example
      • MCS 1: File1(starttime, endtime, frequency), File2(starttime, endtime, frequency, amplitude)
      • MCS 2: File3(starttime, endtime, lat, lon, temp), File4(starttime, endtime, lat, lon, windspeed)
      • Domain relation:
        File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name)
      • Source relations:
        MCS1File(starttime, endtime, frequency, amplitude, name)
        MCS2File(starttime, endtime, lat, lon, temp, windspeed, name)
      • Domain rules:
        File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :- MCS1File(starttime, endtime, frequency, amplitude, name) ^ (lat = '') ^ (lon = '') ^ (temp = '') ^ (windspeed = '')
        File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :- MCS2File(starttime, endtime, lat, lon, temp, windspeed, name) ^ (frequency = '') ^ (amplitude = '')
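
The model-generation steps from the two slides above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Artemis code: the function name `generate_model` and the dictionary layout describing each MCS are hypothetical, and a real generator would obtain the attribute lists by querying the catalogs rather than from a literal.

```python
def generate_model(catalogs):
    """Build a domain relation, source relations, and domain rules.

    catalogs maps an MCS name to {object name: [attribute names]}.
    Hypothetical helper; the data layout is an assumption for illustration.
    """
    domain_attrs = []
    source_relations = {}
    for mcs, objects in catalogs.items():
        source_attrs = []
        for attrs in objects.values():
            for attr in attrs:
                # Union of attributes, keeping first-seen order.
                if attr not in source_attrs:
                    source_attrs.append(attr)
                if attr not in domain_attrs:
                    domain_attrs.append(attr)
        source_relations[mcs + "File"] = source_attrs + ["name"]
    domain_attrs.append("name")

    head = "File(" + ", ".join(domain_attrs) + ")"
    rules = []
    for relation, attrs in source_relations.items():
        # Attributes a source cannot supply are bound to the empty string.
        missing = [a for a in domain_attrs if a not in attrs]
        body = [relation + "(" + ", ".join(attrs) + ")"]
        body += ["(%s = '')" % a for a in missing]
        rules.append(head + " :- " + " ^ ".join(body))
    return domain_attrs, source_relations, rules


# Attribute lists copied from the slide's example.
catalogs = {
    "MCS1": {
        "File1": ["starttime", "endtime", "frequency"],
        "File2": ["starttime", "endtime", "frequency", "amplitude"],
    },
    "MCS2": {
        "File3": ["starttime", "endtime", "lat", "lon", "temp"],
        "File4": ["starttime", "endtime", "lat", "lon", "windspeed"],
    },
}

domain_attrs, source_relations, rules = generate_model(catalogs)
for rule in rules:
    print(rule)
```

Because the model is rebuilt from whatever the catalogs report, a catalog that is added or becomes unavailable changes only the input dictionary, not the generator.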

  16. Query Processing • When Prometheus receives a query, it determines which MCSs are relevant • Relevant MCSs are determined by comparing the constraints of the query with the constraints of the MCSs • MCSs that do not satisfy the constraints of the query are not used • For example, if the query asks for files that contain data for some lat and lon, MCS1 is not queried

  17. Query Processing: Example • Suppose the user forms the following query with the MCS Wizard:
        Q(name) :- File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) ^ (lat > 33) ^ (lat < 34) ^ (lon < -118) ^ (lon > -119) ^ (starttime > 50000) ^ (endtime < 60000)
      • The Prometheus mediator would generate a datalog program with the query and the domain rules:
        File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :- MCS1File(starttime, endtime, frequency, amplitude, name) ^ (lat = '') ^ (lon = '') ^ (temp = '') ^ (windspeed = '')
        File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :- MCS2File(starttime, endtime, lat, lon, temp, windspeed, name) ^ (frequency = '') ^ (amplitude = '')

  18. Query Processing: Example • Suppose the user forms the following query with the MCS Wizard:
        Q(name) :- File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) ^ (lat > 33) ^ (lat < 34) ^ (lon < -118) ^ (lon > -119) ^ (starttime > 50000) ^ (endtime < 60000)
      • The Prometheus mediator would generate a datalog program with the query and the domain rules:
        File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :- MCS1File(starttime, endtime, frequency, amplitude, name) ^ (lat = '') ^ (lon = '') ^ (temp = '') ^ (windspeed = '')
        File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :- MCS2File(starttime, endtime, lat, lon, temp, windspeed, name) ^ (frequency = '') ^ (amplitude = '')
      • The mediator determines that the constraints (lat = '') and (lon = '') in the first rule are incompatible with the order constraints on lat and lon in the query, so only MCS2 is queried
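
The pruning step in this example can be approximated with a short Python sketch. It is a deliberate simplification and not the Prometheus algorithm: instead of reasoning over datalog constraints, it keeps a source only if the source supplies every attribute the query constrains. The names `SOURCE_ATTRS` and `relevant_sources` are hypothetical.

```python
# Attributes each source relation can supply, taken from the example rules.
SOURCE_ATTRS = {
    "MCS1File": {"starttime", "endtime", "frequency", "amplitude", "name"},
    "MCS2File": {"starttime", "endtime", "lat", "lon", "temp", "windspeed", "name"},
}


def relevant_sources(constraints, source_attrs):
    """Return the sources able to satisfy every constrained attribute.

    A source that lacks an attribute binds it to the empty string in its
    domain rule, so a comparison such as lat > 33 can never hold for it.
    """
    constrained = {attribute for attribute, _, _ in constraints}
    return [
        source
        for source, attrs in source_attrs.items()
        if constrained <= attrs
    ]


# The constraints of the example query Q(name).
query_constraints = [
    ("lat", ">", 33), ("lat", "<", 34),
    ("lon", "<", -118), ("lon", ">", -119),
    ("starttime", ">", 50000), ("endtime", "<", 60000),
]

sources_to_query = relevant_sources(query_constraints, SOURCE_ATTRS)
print(sources_to_query)
```

Under these assumptions only `MCS2File` remains, which matches the slide's conclusion that only MCS2 is queried.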

  19. Artemis: Top level Selection

  20. Artemis: Filtering

  21. Evaluation • Enabled users to query 12 different MCSs • Covering information from three different applications • LIGO, ESG, and Geo-spatial data warehouse • Covering 17,000 different files • Metadata consisted of about 300 different attributes • Simulated addition of metadata to MCSs and failure of several MCSs while system was running

  22. Related Work • MCS [Singh et al 03] • Organizes metadata about objects on the data grid • Object-oriented schema to support user-defined metadata attributes • Difficult for users to keep track of diverse attribute names • No semantic information is attached to the attributes • Agent Wizard [Tuchinda et al. 2003] • Interactive application that guides the user by dividing complex tasks into a series of simpler question-answering tasks • Challenge is to model a complex task as a set of simpler subtasks • Prometheus Mediator [Thakkar et al. 2004] • Data integration system that can efficiently integrate data from a wide variety of data sources • Key restriction is that the relational schema for data sources and domain must be known in advance

  23. Related Work (Cont’d) • myGrid [Wroe 2003] • Models data sources as semantic web services • Integration of data sources is represented as a workflow • Requires that data sources have a fixed schema and associated semantics • Model-based mediator system for scientific data management [Ludascher 2003] • Data sources provide semantic information regarding their data • The provided information is used to generate a domain model for a mediator system • Assumes that semantic information is provided by the data sources of interest

  24. Conclusions • Contributions: • Mixed-initiative approach to help scientists query objects on the data grid • Isolate users from heterogeneity of data sources • Manage distributed dynamic data • Future Work: • Algorithm to determine when to dynamically generate domain model • Better support for specifying model mappings • Artemis available as a grid service • More extensive testing and usability studies

  25. ?
