
(Almost) Hands-Off Information Integration for the Life Sciences


Presentation Transcript


  1. (Almost) Hands-Off Information Integration for the Life Sciences
  Ulf Leser, Felix Naumann
  Presented by: Darshit Parekh

  2. Table of Contents
  • Introduction
    - Introduction to Life Sciences
    - Data Integration in Life Sciences
  • ALADIN
    - Features of ALADIN
    - System Architecture for ALADIN
    - System Components Description
  • A Case Study on Protein Data
    - Comparison of ALADIN with existing technologies
    - Advantages, Challenges and Bottlenecks in ALADIN
  • Summary
    - Demo of the COLUMBA Project

  3. 1. Introduction to Life Sciences
  • Life science is the broad, in-depth study of living things: plants, animals and other organisms.
  • It helps to explain how living things relate to each other and to their surroundings.
  • More specifically, it includes fields such as:
    - Agrotechnology
    - Animal Science
    - Bio-Engineering
    - Bioinformatics and Biocomputing
    - Cell Biology
    - Neuroscience

  4. 2. Data Integration in Life Sciences
  • Data integration in the life sciences has been a topic of intensive research.
  • It is one of the areas where a large number of complex databases are freely available.
  • Research in this area is important because it drives advances in medical technology and, in turn, human health and wellness.
  • The data required for this kind of research and analysis is widely scattered over many heterogeneous and autonomous data sources.
  • Life science databases have a number of traits that must be considered when designing a data integration system for them.
  • A life science database typically stores only one primary type of object, described by rich and sometimes deeply nested annotations.
  • Consider the example of Swiss-Prot, which is essentially a protein database.

  5. 2. Data Integration in Life Sciences Contd….
  • The "hubs" of the life science data world provide links to a large number of databases, pointing from the primary objects (proteins) to further information such as protein structure, publications, taxonomic information, the gene encoding the protein, related diseases, known mutations, etc.
  • Internally, such a link is stored as a pair (target database, accession number).
  • Presented as hyperlinks on web pages, these links help the end user retrieve useful information from the protein database.
  • With a large number of databases, identifying duplicates is an important task that must be performed carefully.
  • Data integration in the life sciences involves either manual data curation or a schema mapping and integration approach.

  6. 2. Data Integration in Life Sciences Contd….
  • Manual Data Curation
  • Projects that achieve a high standard of quality in the integrated data through manual curation are data focused.
  • The curation work is performed by experienced professionals.
  • Swiss-Prot integrates data on protein sequences from journal publications, submissions, personal communications and other databases.
  • Data-focused projects are typically managed by domain experts such as biologists.
  • Very little database theory or technology is used in this case; data is kept in a text-like manner.
  • Even if detailed schemata are developed, they cannot be used to query the database and obtain results.

  7. 2. Data Integration in Life Sciences Contd….
  • Schema Focused
  • Projects of this type make use of database technology and are maintained by computer scientists, database analysts, database programmers, etc.
  • These projects aim at providing integration middleware rather than building concrete databases.
  • Techniques such as schema integration, schema mapping and mediator-based query rewriting are used.
  • Examples: TAMBIS, OPM.
  • Some sort of wrapper is required for query processing, along with detailed semantic mappings between the heterogeneous source schemata and a global mediated schema.
  • The mappings must be specified in a special language, which makes the work of domain experts very difficult.

  8. 2. Data Integration in Life Sciences Contd….
  • Data-focused projects are very successful in the biological community, but this success comes at a price.
  • Schema-focused projects are hardly used in real life science projects and did not receive the attention they should have.
  • The major reason for this failure lies in the fact that they are schema centric.
  • Schema mapping and integration also leave biologists at a loss, as they are not used to database technologies.

  9. 3. Introduction to ALADIN
  • ALADIN (Almost Automatic Data Integration) is a novel combination of data and text mining, schema matching and duplicate detection that achieves a high level of automation.
  • It reveals previously unseen relationships between objects, thus directly supporting the discovery-based work of life science researchers.
  • ALADIN makes two major contributions: first, it is a knowledge resource for life science research; second, it poses challenges and identifies bottlenecks for database research.
  • Its novel features include automatic integration with minimal information loss while also taking information quality into account.
  • The proposed technique improves on both the data-focused and the schema-focused approaches.

  10. 4. Features of ALADIN
  • ALADIN's architecture consists of several components that together allow automatic integration of diverse data sources into a global, materialized repository of biological objects and links between them.
  • The databases that ALADIN helps to integrate contain data that is semi-structured and text centric.
  • ALADIN uses a relational database as its basis.
  • It can integrate any data source for which a relational representation exists or can be generated.
  • ALADIN can integrate data in XML files and flat files using appropriate parsers.
  • Integration in ALADIN does not depend on predefined integrity constraints structuring a schema; instead it uses techniques from schema matching and data and text mining to detect the relation containing the primary objects of each data source, to infer relationships between relations and objects within one source, and to infer relationships between objects in different sources.

  11. 4. Features of ALADIN Contd…
  • The system does not rely on any particular schema for its data sources.
  • Generic parsers are used, such as generic XML-to-relational mapping tools.
  • ALADIN integrates data sources in a five-step process:
  • First step: import the data source into relational format.
  • Second step: from the relational representation, find the relation that represents the primary objects within the data source.
  • Third step: detect the fields containing annotations for the primary objects. Existing integrity constraints are used where available; otherwise they are guessed from data analysis.
  • Fourth step: search for links between the objects of the primary relations of different data sources. Links are generated based on the similarity of text fields.
  • Fifth step: detect duplicates across the different data sources and remove them.
  • Once the data is imported, the process is almost entirely automatic.

  12. 4. Features of ALADIN Contd…
  • ALADIN also supports structured queries, detects and flags duplicate objects, and adds a wealth of additional links between objects that are undiscoverable when looking at each database in isolation.
  • ALADIN can be browsed readily without any schema information.
  • ALADIN is useful in scenarios where an explorative approach is necessary.

  13. 5. System Architecture for ALADIN

  14. 5. System Architecture for ALADIN
  The Main Components of the Architecture
  1. Data Import
  • The data source needs to be imported into the relational database system.
  • In cases where no downloadable import tool exists, this is where ALADIN requires human work.
  • This situation is rare; most of the time parsers are readily available.
  • Schema design or re-design is not required.
  2. Discovery of Primary Objects
  • Primary objects are identified in the primary relation.
  • Primary relations contain data about the main objects of interest in the source, such as "proteins" or "diseases".

  15. 5. System Architecture for ALADIN
  • The primary relations store a primary key, but no information about foreign keys is available.
  3. Discovery of Secondary Objects
  • Secondary objects carry additional information about primary objects.
  • Cardinalities of relationships are determined in this step (see the sketch after this slide).
  • At the end of this step, the internal structure of the newly imported data source is known.
  • Errors can occur in this step while identifying relationships.
  • These errors can be minimized in ALADIN by introducing performance measures.
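
  A minimal SQL sketch of how such a cardinality could be estimated from the data, assuming two hypothetical relations protein and feature in which feature.protein_id references protein:

    -- Hedged sketch: if every protein_id occurs at most once in feature,
    -- each protein has at most one feature (1:1); otherwise 1:N.
    -- Table and column names are hypothetical.
    SELECT CASE WHEN MAX(cnt) = 1 THEN '1:1' ELSE '1:N' END AS cardinality
    FROM (
      SELECT protein_id, COUNT(*) AS cnt
      FROM feature
      GROUP BY protein_id
    ) AS per_protein;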

  16. 5. System Architecture for ALADIN
  4. Link Discovery
  • In this step, we search for attributes that are cross-references to objects in other data sources.
  • Cross-references always point to primary objects in other data sources, as these are the attributes with stable public IDs.
  • The output of the second step (the primary relations) is a necessary input for determining all possible link targets.
  • Knowing the primary relations restricts the search space, avoiding the comparison of all pairs of attributes from all sources that would otherwise be required in theory.

  17. 5. System Architecture for ALADIN
  5. Duplicate Detection
  • In this step, a search is initiated for a special kind of "link" between primary objects in different data sources that represent the same real-world object.
  • Such duplicate links are established if two objects are sufficiently similar according to some similarity measure.
  • Knowledge of duplicates enhances the user's browsing and querying experience.
  6. Browse, Search and Query Engine
  • Once the data is integrated into the system, there are three modes of accessing it.
  • Browsing displays objects and the different kinds of links that users can follow.

  18. 5. System Architecture for ALADIN
  • Search allows users to perform a full-text search on all stored data, as well as a focused search restricted to certain partitions of the data, such as a certain data source or a particular field.
  • Querying allows full SQL queries on the schemata as imported.
  • Appropriate graphical user interfaces are provided for these operations.
  7. Metadata Repository
  • It contains known and discovered schemata, information about primary and secondary relations, statistical metadata and sample data to improve discovery efficiency.
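
  The paper does not spell out the repository's schema; as a rough, purely illustrative sketch, such metadata could be kept in hypothetical tables like these:

    -- Hypothetical metadata tables (illustrative only, not ALADIN's actual schema)
    CREATE TABLE primary_relation (
      source_name   VARCHAR(64),   -- imported data source, e.g. 'SwissProt'
      relation_name VARCHAR(64),   -- detected primary relation
      accession_col VARCHAR(64)    -- attribute holding the stable public ID
    );

    CREATE TABLE discovered_link (
      source_a   VARCHAR(64),
      object_a   VARCHAR(64),      -- accession of the linking object
      source_b   VARCHAR(64),
      object_b   VARCHAR(64),      -- accession of the linked object
      link_kind  VARCHAR(16),      -- 'explicit', 'implicit' or 'duplicate'
      similarity REAL              -- score used when the link was inferred
    );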

  19. Integration Steps in ALADIN

  20. 6. System Components Description
  1. Data Import
  • Read the data source into the relational database.
  • Integrity constraints are not necessary at this point.
  • Some databases, such as Swiss-Prot and the Gene Ontology, provide direct relational dump files.
  • For text-based exports, readily downloadable parsers are available; examples are the BioSQL and BioPerl packages, which are able to read Swiss-Prot and GenBank databases.
  • Some databases provide parsers with their export files, such as the OpenMMS parser for the Protein Structure Database.
  • Databases exported as XML files can be parsed using a generic XML shredder.
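
  As an illustration of what a generic XML shredder might produce, a hypothetical node table (not prescribed by the paper) could look like this:

    -- Hypothetical generic shredding target: each XML element becomes a row,
    -- with parent_id encoding the nesting structure.
    CREATE TABLE xml_node (
      node_id   INTEGER PRIMARY KEY,
      parent_id INTEGER,            -- NULL for the document root
      tag       VARCHAR(128),       -- element or attribute name
      value     TEXT                -- text content, if any
    );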

  21. 6. System Components Description
  2. Discovery of Primary Relations
  • Primary objects are discovered without the help of hand-written parsers.
  • Heuristic rules, applied to the schema and the actual data, are used to determine the primary relation.
  • The rules are derived from previous experience with data integration.
  • SQL queries are run on each attribute (see the sketch after this slide).
  • The attribute values found are alphanumeric in nature and are called accession numbers.
  • Foreign-key relationships and cardinalities are inferred.
  • The detected primary relation and the set of relationships are input to the next steps.
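
  A minimal sketch of such a per-attribute check, assuming a hypothetical relation entry with a candidate column acc (PostgreSQL-style regular expressions): values should be unique, of constant length and contain both letters and digits.

    -- Hedged sketch of the heuristic checks (hypothetical relation/column names)
    SELECT COUNT(*)                    AS rows_total,
           COUNT(DISTINCT acc)         AS rows_distinct,     -- equal to rows_total => unique
           COUNT(DISTINCT LENGTH(acc)) AS distinct_lengths,  -- 1 => constant length
           SUM(CASE WHEN acc ~ '[A-Za-z]' AND acc ~ '[0-9]'
                    THEN 1 ELSE 0 END) AS alphanumeric_rows  -- mixed letters and digits
    FROM entry;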

  22. 6. System Components Description
  3. Discovery of Secondary Relations
  • Objects in one relation need to be connected to objects in the others.
  • This step determines the descriptions and annotations that are displayed together with the primary object in the web interface.
  • The paths from the primary relation to each of the other relations of the data source are computed using the transitivity of relationships.
  • The paths are stored in the metadata repository.
  4. Link Discovery
  • Links are either explicit or implicit.
  • Explicit links are cross-references in life science databases.
  • Cross-references are stored as accession numbers.

  23. 6. System Components Description
  • E.g. "ENSG00000042753" or "Uniprot:P11140".
  • String-matching techniques are needed (see the sketch after this slide).
  • Many relationships are not explicitly stored.
  • Implicit relationships are discovered by searching for similar data values in the other data sources.
  • Three types of comparison are taken into consideration:
  • First, DNA, RNA or protein sequences are compared to each other.
  • Second, attributes containing longer text strings, such as textual descriptions, are analyzed using information retrieval and text mining.
  • Third, the use of standard vocabularies across the data sources is exploited.
  • The discovered links are stored in the metadata repository to avoid repeated discovery and computation at query time.
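
  A hedged SQL sketch of explicit link discovery by string matching, assuming hypothetical tables source_a_entry (with a cross-reference column dbxref) and source_b_entry (a primary relation with accession numbers):

    -- Match cross-reference strings such as 'Uniprot:P11140' against the
    -- accession numbers of another source (hypothetical table/column names).
    SELECT a.id        AS source_object,
           b.accession AS target_object
    FROM source_a_entry AS a
    JOIN source_b_entry AS b
      ON a.dbxref = b.accession                    -- plain accession
      OR a.dbxref LIKE '%:' || b.accession;        -- prefixed form 'db:accession'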

  24. 7. A Case Study on Protein Data
  • The design decisions of ALADIN were made based on past experience drawn from integration projects in this domain.
  • The paper discusses the most recent of these, the COLUMBA project.
  • COLUMBA is an integrated, relational data warehouse that annotates protein structures taken from the Protein Data Bank (PDB).
  • The data describes the following properties of a protein:
    - Classification of structures
    - Protein sequence and sequence features
    - Functional annotation of proteins
    - Participation in metabolic pathways
  • The extraction and transformation from the initial data source schema into the target schema is currently hard-coded, which requires a lot of effort.

  25. 7. A Case Study on Protein Data
  • Understanding the schemata is very difficult, as they are poorly documented and often use structures that are hard to grasp by looking at the schema alone.
  • The transformation requires operations that are not expressible in current schema mapping languages, so SQL and Java are used instead.
  • COLUMBA annotates protein structures from the Protein Data Bank and also includes the protein fold classification databases SCOP and CATH.
  • Further functional and taxonomic annotation is available from Swiss-Prot, the Gene Ontology (GO) and the NCBI Taxonomy.

  26. 7. A Case Study on Protein Data
  Part of the BioSQL schema. Arrows indicate candidates for primary relations and cross-references.

  27. 7. A Case Study on Protein Data
  • In the complete schema there are three tables with an in-degree above five.
  • BioEntry: used to store the primary objects.

  28. 7. A Case Study on Protein Data
  • OntologyTerm: used to store functional descriptions.
  • SeqFeature: stores a meta-representation of sequence features.

  29. 7. A Case Study on Protein Data
  • BioEntry has an accession-number candidate whose values are mixed characters and integers and all have the same length.
  • The other fields in BioEntry are either non-unique (e.g. taxon_id), contain no letters (e.g. bioentry_id), or have varying length (e.g. name).
  • This table therefore qualifies in ALADIN as the primary relation.
  • Primary and foreign keys are determined by analyzing the scope of the different attributes storing surrogate keys (see the sketch after this slide).
  • In the next step, COLUMBA connects protein structures to annotations using existing cross-references or by matching sequence similarity.
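
  A rough sketch of how the scope of a surrogate-key attribute could be analyzed, assuming the BioSQL tables bioentry and seqfeature with a bioentry_id column in each: an inclusion dependency between the two columns suggests a foreign key.

    -- If this query returns 0, every seqfeature.bioentry_id also occurs in
    -- bioentry, suggesting a foreign key from seqfeature to bioentry.
    SELECT COUNT(*) AS dangling_values
    FROM seqfeature sf
    LEFT JOIN bioentry be ON sf.bioentry_id = be.bioentry_id
    WHERE be.bioentry_id IS NULL;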

  30. 7. A Case Study on Protein Data
  • The BioSQL schema contains several attributes whose values are excellent candidates for finding implicit links:
    - OntologyTerm.term_definition, linking to biological ontologies.
    - BioEntry.description, linking to disease- or gene-focused databases.
    - Biosequence.biosequence_str, containing the actual DNA or protein sequence.
  • Duplicate detection is an important step here, as protein structures from the PDB are available in three different flavors:
    - The original PDB files
    - A cleansed version available as dump files
    - A cleansed version available with a parser
  • The PDB accession number is available in all three versions, so removing redundancy is easy in this case.
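
  A minimal sketch of that duplicate check, assuming the three imports ended up in hypothetical tables pdb_original, pdb_dump and pdb_parsed, each with a pdb_accession column:

    -- Objects sharing the same PDB accession across the three imports are
    -- flagged as duplicates (hypothetical table names).
    SELECT o.pdb_accession
    FROM pdb_original o
    JOIN pdb_dump     d ON d.pdb_accession = o.pdb_accession
    JOIN pdb_parsed   p ON p.pdb_accession = o.pdb_accession;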

  31. 7. A Case Study on Protein Data
  • SQL example: fetch the accessions of all sequences from Swiss-Prot:

    SELECT DISTINCT bioentry.accession
    FROM bioentry
    JOIN biodatabase USING (biodatabase_id)
    WHERE biodatabase.name = 'swiss';   -- or 'Swiss-Prot'

  • SQL example: how many unique entries are there in GenBank?

    SELECT COUNT(DISTINCT bioentry.accession)
    FROM bioentry
    JOIN biodatabase USING (biodatabase_id)
    WHERE biodatabase.name = 'genbank';

  32. 8. Some Related Work
  • DiscoveryLink
  • OPM
  • TAMBIS
  • These systems make use of schema information rather than the data.
  • SRS: the primary and secondary relations need to be specified explicitly in the parsers.
  • GenMapper
  • BioMediator
  • The project closest to this proposal is the Revere project.

  33. Comparison of ALADIN with Existing Technologies

  34. 9. Challenges and Bottlenecks in ALADIN
  • The ALADIN system is a true challenge in terms of the size, number and complexity of the data sources to be integrated.
  • Incorrectly identified primary or secondary relations lead to incorrect targets for link discovery.
  • Incorrect links in turn reduce the precision of duplicate detection.
  • The issue of performance is not addressed in the paper.
  • Integrating new data sources into the existing ones hurts efficiency, as it involves a lot of computation, sorting and schema matching; achieving the desired results takes a long time.
  • Another important problem is that of data changes: when the data in a source changes, all links need to be recomputed, which involves a lot of overhead.

  35. 10. Summary
  • The ALADIN architecture and framework is a novel proposal for data integration in the life sciences.
  • The design is almost automatic, using text mining, schema matching, data mining and information retrieval.
  • ALADIN offers clear added value to the biological user compared to the current data landscape.
  • It enables structured queries crossing several databases.
  • The system suggests many new relationships interlinking all areas of the life sciences and offers ranked search capabilities across databases for users who want Google-style information retrieval.
  • Consider a query such as: get the genes of a certain species on a certain chromosome that are connected to a disease via a protein whose function is known. For each object type in the query several potential data sources exist, and this system takes all of them into account, a feature not supported by other integration technologies (see the sketch after this slide).
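
  A hedged SQL sketch of such a cross-database query, assuming hypothetical integrated tables gene and protein plus link tables produced by ALADIN (the real schema is whatever was imported):

    -- Genes of a given species on a given chromosome linked to a disease via
    -- a protein with a known function (all table/column names hypothetical).
    SELECT DISTINCT g.gene_name
    FROM gene g
    JOIN gene_protein_link    gp ON gp.gene_id    = g.gene_id
    JOIN protein p               ON p.protein_id  = gp.protein_id
    JOIN protein_disease_link pd ON pd.protein_id = p.protein_id
    WHERE g.species    = 'Homo sapiens'
      AND g.chromosome = '7'
      AND p.function  IS NOT NULL;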

  36. 11. DEMO OF COLUMBA PROJECT
