Integration of Biological Data: Current Challenges and Systems

Integration of Biological Sources: Current Systems and Challenges Ahead (SIGMOD Record, Vol. 33, No. 3, September 2004). Thomas Hernandez & Subbarao Kambhampati, Dept. of Computer Science and Engineering, Arizona State University

Introduction • Traditionally, biologists integrated biological data manually. However, the growing volume of data, its variety of formats, and its wide distribution over the internet make manual integration practically infeasible, so automated integration is needed. • This need is reinforced by the characteristics of the biological sources, described next.

Characteristics of Biological Sources • Variety of data. The data stored typically covers several biological and genomic research fields (e.g. gene expression and sequences, disease characteristics, molecular structures, microarray data). Not only can the quantity of data in a source be large, but each individual record can itself be extremely large (e.g. DNA sequences, 3D protein structures). • Heterogeneous representations. Several sources containing similar data can represent it very differently. This representational heterogeneity includes structural (i.e. schema), naming, semantic (the same concept under different terms, or the same term for different concepts), and content (different data for the same semantic object) differences.

Characteristics of Biological Sources • Autonomous operations. Sources are free to modify their design and/or schema, and to remove or modify data, without prior public notification. Nearly all sources are web-based and therefore depend on network traffic and overall availability. The data is dynamic. • Different interfaces and querying capabilities.

Integration Approaches in Existing Systems • Existing systems can be classified first by data model, i.e. the design assumptions the integration system makes about the syntactic nature of the data exported by the sources. 1. Text data model. Sources are viewed as exporting mainly text, and integration means supporting keyword/text search across the sources. 2. Structured data model. Sources are viewed as exporting more structured data, which is either warehoused or accessed on demand from the sources. 3. Linked records model. Sources are viewed as exporting linked sets of browsable records, and integration means supporting effective navigation across sources.

Integration Approaches in Existing Systems • The majority of systems use the (semi-)structured or linked records models; these systems are discussed in more detail later. • They follow three types of approach: 1. Warehouse integration. It materializes the data from multiple sources into a local warehouse and executes all queries on the warehouse instead of the actual sources. It emphasizes data translation, whereas mediator-based integration emphasizes query translation. Pros: less dependency on the network, more efficient query optimization, and the ability for users to filter, validate, modify, and annotate the data obtained from the sources. Cons: data can become outdated, so frequent updates are needed.
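The warehouse idea (translate the data once, then query only the local copy) can be sketched as follows. This is a minimal illustration, not any particular system's design: the two sources, their field names, the `gene` table, and all record values are hypothetical toy data, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical records from two sources with different field names (toy data).
SOURCE_A = [{"acc": "A1", "gene_name": "geneX", "organism": "human"}]
SOURCE_B = [{"id": "B7", "symbol": "geneY", "species": "human"}]

def translate_a(rec):
    # Data translation: map source A's fields onto the warehouse schema.
    return (rec["acc"], rec["gene_name"], rec["organism"], "A")

def translate_b(rec):
    # Data translation: map source B's fields onto the warehouse schema.
    return (rec["id"], rec["symbol"], rec["species"], "B")

def build_warehouse():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene (accession TEXT, symbol TEXT, species TEXT, "
        "source TEXT, note TEXT)"
    )
    rows = [translate_a(r) for r in SOURCE_A] + [translate_b(r) for r in SOURCE_B]
    conn.executemany(
        "INSERT INTO gene (accession, symbol, species, source) VALUES (?, ?, ?, ?)",
        rows,
    )
    return conn

conn = build_warehouse()

# Because the data is local, a user can annotate it ...
conn.execute("UPDATE gene SET note = ? WHERE symbol = ?", ("checked by curator", "geneY"))

# ... and queries run against the warehouse, never against the sources.
symbols = [
    row[0]
    for row in conn.execute(
        "SELECT symbol FROM gene WHERE species = 'human' ORDER BY symbol"
    )
]
note = conn.execute("SELECT note FROM gene WHERE symbol = 'geneY'").fetchone()[0]
```

The annotation step also shows the drawback: once the local copy is edited, it no longer matches the sources, and a refresh has to reconcile the two.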

Integration Approaches in Existing Systems 2. Mediator-based integration. It concentrates on query translation. A mediator reformulates, at runtime, a query on a single mediated schema into queries on the local schemas of the underlying data sources. The mapping between the source descriptions and the mediated schema is crucial for this translation. There are two main approaches for establishing the mapping between each source schema and the global schema: global-as-view (GAV) and local-as-view (LAV). In GAV, the mediator relations are written directly in terms of the source relations. In LAV, every source relation is defined over the relations and the schema of the mediator. LAV is preferred for large-scale integration, while GAV is appropriate when the set of sources being integrated is known and stable.
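A GAV-style mapping can be illustrated with a small sketch. Everything here is an assumption made for the example: the mediated relation `protein(id, name, organism)`, the two source relations, their attribute names, and the toy rows. The point is only that, under GAV, the mediated relation is defined directly over the sources, so a query on it is reformulated by renaming attributes per source. LAV, which needs query rewriting using views, is not shown.

```python
# Hypothetical source relations with their own local schemas (toy data).
SOURCE_X = [{"ac": "S1", "de": "kinase A", "os": "mouse"}]
SOURCE_Y = [{"pid": "R9", "title": "kinase B", "species": "human"}]

# GAV mapping: the mediated relation protein(id, name, organism) is defined
# as the union of one view per source, each given as mediated -> local names.
GAV_PROTEIN = [
    (SOURCE_X, {"id": "ac", "name": "de", "organism": "os"}),
    (SOURCE_Y, {"id": "pid", "name": "title", "organism": "species"}),
]

def query_protein(**conditions):
    """Answer an equality-selection query posed on the mediated schema."""
    results = []
    for source_rows, mapping in GAV_PROTEIN:
        # Query translation: rewrite conditions into the source's attribute names.
        local = {mapping[attr]: value for attr, value in conditions.items()}
        for row in source_rows:
            if all(row[k] == v for k, v in local.items()):
                # Return the answer in terms of the mediated schema.
                results.append({attr: row[src] for attr, src in mapping.items()})
    return results

human_proteins = query_protein(organism="human")
```

Adding a third source under this scheme means editing `GAV_PROTEIN` itself, which is why GAV suits a known, stable set of sources.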

Integration Approaches in Existing Systems 3. Navigation-based integration. It is motivated by the fact that an increasing number of web sources require users to browse manually through several web pages and data sources to obtain the desired information. The specific paths followed essentially constitute workflows in which the output of one source is redirected to the input of the next source until the requested information is reached.
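Such a path can be sketched as a chain of lookups in which each step consumes the previous step's output. The link tables, identifiers, and helper names below are hypothetical and stand in for real sources that would be reached over the web.

```python
# Hypothetical link tables standing in for two browsable sources (toy data).
GENE_TO_PROTEIN = {"geneX": ["protA", "protB"]}
PROTEIN_TO_STRUCTURE = {"protA": ["struct1"], "protB": []}

def lookup(table):
    """Build a workflow step that follows the links of one source."""
    def step(keys):
        found = []
        for key in keys:
            found.extend(table.get(key, []))
        return found
    return step

def run_path(path, start):
    """Feed the output of each step into the next until the path ends."""
    current = start
    for step in path:
        current = step(current)
    return current

# A navigation path: gene -> proteins -> structures.
path = [lookup(GENE_TO_PROTEIN), lookup(PROTEIN_TO_STRUCTURE)]
structures = run_path(path, ["geneX"])
```

Storing `path` as data is what would let a user define a preferred execution path once and reuse it later.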

Integration Approaches in Existing Systems • Besides the data model, systems can also be classified by: 1. Aim of integration: portal or query oriented; 2. Source model: complementary (horizontal) or vertical (sources overlap and results require aggregation); 3. User model: low expertise, high expertise in query languages, or interactive query formulation; 4. Level of transparency: users choose the sources, or the choice of sources is hard-wired.


Sequence Retrieval System (SRS) • SRS first parses flat files that contain structured text with field names. It then creates and stores an index for each field and uses these local indexes at query time to retrieve relevant entries. Although extensive indexes are kept locally for the query processor, SRS is not a warehouse system, because the actual data is neither modified nor stored locally. The other main feature of SRS is that it keeps track of cross-references between sources. It uses its own parsing component to identify links that exist between entries in different sources during parsing and indexing. These links are then used to suggest more results to a user after a query has been processed. http://srs.embl-heidelberg.de:8000/srs5/
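The parse, index, and cross-reference steps can be sketched as below. This is only an illustration of the idea, not SRS's parser or file grammar: the two-letter line codes (`ID`, `DE`, `DR`), the `//` record terminator, the database name `OTHERDB`, and all entries are assumptions made for the example.

```python
from collections import defaultdict

# A toy flat file: one "CODE   value" line per field, "//" ends an entry.
FLAT_FILE = """\
ID   E1
DE   kinase
DR   OTHERDB; X5
//
ID   E2
DE   kinase inhibitor
//
"""

def parse_entries(text):
    """Split the flat file into entries, each a dict of field -> list of values."""
    entries, current = [], defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):
            entries.append(dict(current))
            current = defaultdict(list)
        elif line.strip():
            field, value = line[:2], line[5:].strip()
            current[field].append(value)
    return entries

def build_indexes(entries):
    """Build one inverted index per field, plus cross-reference links."""
    indexes = defaultdict(lambda: defaultdict(set))
    links = defaultdict(set)
    for entry in entries:
        entry_id = entry["ID"][0]
        for field, values in entry.items():
            for value in values:
                for token in value.lower().replace(";", " ").split():
                    indexes[field][token].add(entry_id)
        # DR lines are treated as links to entries in another source.
        for ref in entry.get("DR", []):
            database, target = [part.strip() for part in ref.split(";")]
            links[entry_id].add((database, target))
    return indexes, links

entries = parse_entries(FLAT_FILE)
indexes, links = build_indexes(entries)

# Query time: only the index is consulted, then links suggest related entries.
hits = sorted(indexes["DE"]["kinase"])
suggestions = sorted(links["E1"])
```

Only the index and the link table are kept in this sketch; the entries themselves would still be fetched from the original files, which matches the point that the data is not warehoused.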

BioKleisli • BioKleisli is a mediator-based integration system. The mediator on top of the underlying sources relies mainly on a high-level query language (CPL, more expressive than SQL) to query across several sources. Queries are decomposed into sub-queries, and source-specific wrappers map the sub-queries to specific heterogeneous sources, which are accessed through predefined atomic query functions. • BioKleisli does not use any global molecular biology schema or ontology. • It is aimed at horizontal integration: a query attribute is usually bound to an attribute in a single predetermined source, and there is essentially no content overlap.

TAMBIS • TAMBIS is a mediator-based and ontology-driven integration system. • Query processing pipeline: GUI (concepts defined in a global schema) → source-independent GRAIL query → query internal form → source-dependent CPL query execution plan → execution using BioKleisli's existing function library to access the sources.

TAMBIS • The TAMBIS domain ontology mainly serves to ease the user's task of formulating a query, rather than to support schema mapping between sources.

DiscoveryLink • DiscoveryLink is also a mediator-based integration system. Applications typically connect to DiscoveryLink and submit an SQL query on the global schema, without necessarily being aware of the underlying sources. Underneath, a federated database query processor communicates with source-specific wrappers to determine the optimal plan for a given query. • The wrappers have two roles: they translate the source data models, and they provide source-specific information about query capabilities that helps the optimizer determine which parts of a query can be submitted to each source.
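The second wrapper role can be illustrated with a toy sketch in which a wrapper declares which attributes its source can filter on, and the query processor pushes down only those predicates and evaluates the rest itself. The `Wrapper` class, `federated_select`, and the sample rows are hypothetical names and data for this example, not DiscoveryLink's actual interfaces, and no real cost model is included.

```python
class Wrapper:
    """Hypothetical wrapper: exposes source rows and declares query capabilities."""

    def __init__(self, name, rows, filterable):
        self.name = name
        self.rows = rows
        self.filterable = set(filterable)  # attributes the source can filter on

    def split(self, predicates):
        """Separate predicates the source can evaluate from those it cannot."""
        pushed = {a: v for a, v in predicates.items() if a in self.filterable}
        residual = {a: v for a, v in predicates.items() if a not in self.filterable}
        return pushed, residual

    def fetch(self, pushed):
        """Evaluate the pushed-down predicates at the source."""
        return [r for r in self.rows if all(r[a] == v for a, v in pushed.items())]

def federated_select(wrapper, predicates):
    """Push down what the source supports, then filter the rest in the mediator."""
    pushed, residual = wrapper.split(predicates)
    transferred = wrapper.fetch(pushed)
    result = [r for r in transferred if all(r[a] == v for a, v in residual.items())]
    return result, len(transferred)

compounds = Wrapper(
    "compounds",
    [
        {"id": 1, "target": "kinase", "active": True},
        {"id": 2, "target": "kinase", "active": False},
        {"id": 3, "target": "protease", "active": True},
    ],
    filterable=["target"],
)

result, transferred = federated_select(compounds, {"target": "kinase", "active": True})
```

The number of rows transferred is a crude stand-in for the cost information an optimizer could use when comparing plans: the more predicates a source can evaluate itself, the less data has to cross the network.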

Other Existing Systems • BACIIS is an end-user product developed following a mediator-based approach combined with extensive use of a knowledge base (KB). The KB contains a domain ontology, which serves as a global schema, and maps the database schemas to that ontology. • BioNavigator is a commercially available navigation-based integration system. Users can define their preferred execution path for a query and reuse it later. • GUS is a warehouse-based integration system.

Discussion • As mentioned earlier, warehouse-based approaches have two clear advantages. First, they simplify query optimization and processing by storing the data locally according to a single global schema. Second, because the data is stored locally, they enable users to add their own annotations to stored data and to specify filtering conditions to clean it. • However, it is still unclear how this user-friendly feature can be achieved efficiently, and more specifically how the data could be validated or modified effectively without human intervention and extensive domain expertise. Furthermore, data warehousing faces the major problem of handling updates in the sources, and an even bigger challenge once the data has been modified and annotated locally and therefore differs from the data in the sources.

Discussion • Although GAV and LAV were introduced earlier for the mediator-based approach, no mediator-based biological integration system implements them so far. Wrapper-oriented approaches are still relatively new. • Much like TAMBIS and BioKleisli, most current systems address only horizontal integration and do not consider potential overlap between sources. DiscoveryLink attempts to solve the problem of selecting between several potential sources by using the information provided by wrappers to estimate querying costs, but it does not take overlap and coverage into account in optimization and source selection.

Reference Thomas Hernandez & Subbarao Kambhampati. Integration of Biological Sources: Current Systems and Challenges Ahead. Sigmod Record, Vol. 33, No. 3, September 2004.
