Bioinformatics Data Representation and Integration

Bioinformatics Data Representation and Integration By Ngozi Oleleh

Table of Contents • Introduction to Bioinformatics • Proteins and Sequences • Bioinformatics Tools • Biological Databases • BLAST Functions • Bioindexing • Conclusion

What Is Bioinformatics? • Bioinformatics is the use of computers to study and manage biological information • It can be viewed as the integration of computer science and biology to support the study of biological data, which has proven to be very extensive • The role of computer science in this interdisciplinary field is to store the data (in databases) for later analysis with biological tools • The field includes, but is not limited to, the study of genes, DNA sequences, and protein structures

Proteins and Sequences • Biological proteins are made up of 20 amino acids • Alanine - Ala - A • Arginine - Arg - R • Asparagine - Asn - N • Aspartic acid - Asp - D • Cysteine - Cys - C • Glutamine - Gln - Q • Glutamic acid - Glu - E • Glycine - Gly - G • Histidine - His - H • Isoleucine - Ile - I • Leucine - Leu - L • Lysine - Lys - K • Methionine - Met - M • Phenylalanine - Phe - F • Proline - Pro - P • Serine - Ser - S • Threonine - Thr - T • Tryptophan - Trp - W • Tyrosine - Tyr - Y • Valine - Val - V
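The code table above maps directly onto a lookup structure. The following is a minimal Python sketch, not part of the original slides; the function names `to_one_letter` and `is_protein_sequence` are hypothetical helpers chosen for illustration.

```python
# Three-letter to one-letter codes for the 20 amino acids listed above.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
VALID_RESIDUES = frozenset(THREE_TO_ONE.values())


def to_one_letter(three_letter_codes):
    """Convert a list such as ["Thr", "Ser", "Pro"] into the string "TSP"."""
    return "".join(THREE_TO_ONE[code.capitalize()] for code in three_letter_codes)


def is_protein_sequence(sequence):
    """True if the string is non-empty and uses only the 20 one-letter codes."""
    return bool(sequence) and set(sequence.upper()) <= VALID_RESIDUES
```

A check like `is_protein_sequence` is the kind of validation one might run before submitting a query sequence to a search tool.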

Proteins and Sequences • Combinations of these amino acids make up protein sequences and structures • The PDB database contains numerous protein structures, many of which can be related to one another by sequence alignment or fold recognition • Bioinformatics studies the differences and similarities of these protein structures based on sequence similarity • A sequence is a combination of amino acids • These sequences carry biological data that can be used to infer information about families of proteins

Bioinformatics Tools • Mage: displays individual protein structures • RasMol: displays 3D protein structures • LALIGN: pairwise sequence alignment • ClustalW: multiple sequence alignment • AMMP: molecular modeling • Sequence alignment tools: FASTA and BLAST (BLAST is examined in detail)

Biological Databases • There are over 5000 public biological databases • These databases contain genomic, proteomic and microarray data • This data consists of gene sequences or the amino acid sequences of proteins • Biological databases have become very useful to scientists. They are important in understanding and explaining a host of biological phenomena, from the structure of biomolecules and their interactions, to the whole metabolism of organisms, to the evolution of species.

This knowledge helps in the fight against diseases, assists in the development of medications, and helps uncover basic relationships among species in the history of life. • Biological knowledge is distributed among many different general and specialized databases, which sometimes makes it difficult to ensure the consistency of information. • Biological databases cross-reference other databases by accession number as one way of linking related knowledge together.

Bioinformatics databases can be grouped into two groups: generalized databases and specialized databases • Generalized databases • Primary sequence databases (EMBL, GenBank, DDBJ) • Protein sequence databases (Swiss-Prot, UniProt, UniRef) • Carbohydrate databases (CarbBank) • 3D structure databases (PDB, EBI-MSD, NDB)

Specialized Databases • Specialized sequence databases • Genome databases • Specialized protein sequence databases • Specialized structure databases • Microarray databases • The main focus here is on the generalized databases

Primary Sequence Databases • EMBL (European Molecular Biology Laboratory nucleotide sequence database at EBI, Hinxton, UK) • GenBank (at the National Center for Biotechnology Information, NCBI, Bethesda, MD, USA) • DDBJ (DNA Data Bank of Japan at CIB, Mishima, Japan)

Protein Sequence Databases • SWISS-PROT (Swiss Institute of Bioinformatics, SIB, Geneva, CH) • TrEMBL (Translated EMBL: computer-annotated protein sequence database at EBI, UK) • PIR-PSD (PIR-International Protein Sequence Database, annotated protein database by PIR, MIPS and JIPID at NBRF, Georgetown University, USA) • UniProt (combined data from Swiss-Prot, TrEMBL and PIR) • UniRef (UniProt NREF (Non-redundant REFerence) database at EBI, UK) • IPI (International Protein Index; human, rat and mouse proteome database at EBI, UK)

Other Databases • Carbohydrate databases • CarbBank (former complex carbohydrate structure database) • 3D structure databases • PDB (Protein Data Bank, curated by RCSB, USA) • EBI-MSD (Macromolecular Structure Database at EBI, UK) • NDB (Nucleic Acid Database at Rutgers, The State University of New Jersey, USA)

BLAST • BLAST is a heuristic algorithm for detecting sequence similarity and is optimized for speed • It is suitable for large-scale analysis • BLAST matches a query sequence against positions in the database sequences
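The idea of matching a query against positions in database sequences can be illustrated with a toy word-seeding step in Python. This is only a sketch of the first stage under simplifying assumptions: it finds exact word matches, whereas real BLAST also considers similar words scored with a substitution matrix, extends hits into alignments, and computes statistics. The sequences and function names below are hypothetical.

```python
from collections import defaultdict


def build_word_index(database, word_size=3):
    """Map every word of length word_size to the (seq_id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, sequence in database.items():
        for offset in range(len(sequence) - word_size + 1):
            index[sequence[offset:offset + word_size]].append((seq_id, offset))
    return index


def find_seed_hits(query, index, word_size=3):
    """Return (query_offset, seq_id, db_offset) for every exact word match."""
    hits = []
    for q_offset in range(len(query) - word_size + 1):
        word = query[q_offset:q_offset + word_size]
        for seq_id, db_offset in index.get(word, []):
            hits.append((q_offset, seq_id, db_offset))
    return hits


# Hypothetical toy database; offsets are 0-based.
database = {"seqA": "MKTSPDVDL", "seqB": "GGGGGG"}
index = build_word_index(database)
hits = find_seed_hits("TSPDV", index)
```

Indexing the database once and looking up each query word is what makes the seeding step fast compared with aligning the query against every sequence in full.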

Quick Diversion: BLAST Example • Query sequence: TSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQ

Sequences producing significant alignments:                       Score (bits)  E value
pdb|2FXP|A Chain A, Solution Structure Of The Sars-Coronaviru...  82.4          3e-17
pdb|2BEZ|F Chain F, Structure Of A Proteolitically Resistant ...  81.6          5e-17
pdb|1WNC|A Chain A, Crystal Structure Of The Sars-Cov Spike P...  77.8          7e-16
pdb|1WYY|A Chain A, Post-Fusion Hairpin Conformation Of The S...  76.6          1e-15
pdb|2BEQ|D Chain D, Structure Of A Proteolytically Resistant ...  69.7          2e-13
pdb|1ZVA|A Chain A, A Structure-Based Mechanism Of Sars Virus...  68.6          5e-13
pdb|1ZV7|A Chain A, A Structure-Based Mechanism Of Sars Virus...  65.9          3e-12
pdb|1ZV8|B Chain B, A Structure-Based Mechanism Of Sars Virus...  65.5          4e-12
pdb|1WDG|A Chain A, Crystal Structure Of Mhv Spike Protein Fu...  25.4          4.7
pdb|2A11|A Chain A, Crystal Structure Of Nuclease Domain Of R...  24.3          9.1

BLAST Functions in Databases • BLAST is one of the most heavily used data analysis tools available, so large-scale data analysis needs to support BLAST functions • BLAST support is achieved by defining a set of user-defined functions that return BLAST results as a table • Many databases support BLAST functions • The two major BLAST functions are BLAST_MATCH and BLAST_ALIGN

The BLAST Functions
function BLASTP_MATCH (
    query_seq              CLOB,
    seqdb_cursor           REF CURSOR,
    subsequence_from       NUMBER   default 1,
    subsequence_to         NUMBER   default -1,
    filter_low_complexity  BOOLEAN  default false,
    mask_lower_case        BOOLEAN  default false,
    sub_matrix             VARCHAR2 default 'BLOSUM62',
    expect_value           NUMBER   default 10,
    open_gap_cost          NUMBER   default 11,
    extend_gap_cost        NUMBER   default 1,
    word_size              NUMBER   default 3,
    x_dropoff              NUMBER   default 15,
    final_x_dropoff        NUMBER   default 25)
return table of row (t_seq_id VARCHAR2, score NUMBER, expect NUMBER);

Parameter Description • query_seq: The query sequence to search. A sequence is just lines of sequence data; blank lines are not allowed in the middle of bare sequence input. • seqdb_cursor: The cursor parameter supplied by the user when calling the function. Each row it returns should have two columns: the sequence identifier and the sequence string. • subsequence_from: Start position of the region of the query sequence to be used for the search. The default is 1. • subsequence_to: End position of the region of the query sequence to be used for the search. If -1 is specified, the sequence length is used. The default is -1. • filter_low_complexity: TRUE or FALSE. If TRUE, the search masks off segments of the query sequence that have low compositional complexity. Filtering can eliminate statistically significant but biologically uninteresting regions, leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences. Filtering is applied only to the query sequence. The default is FALSE. • mask_lower_case: TRUE or FALSE. If TRUE, you can give the query sequence in upper case and mark the areas to be filtered out in lower case, which customizes what is filtered from the sequence. The default is FALSE.

• sub_matrix: Specifies the substitution matrix used to assign a score for aligning any possible pair of residues. The options are PAM30, PAM70, BLOSUM80, BLOSUM62, and BLOSUM45. The default is BLOSUM62. • expect_value: The statistical significance threshold for reporting matches against database sequences. The default value is 10. Specifying 0 invokes default behavior. • open_gap_cost: The cost of opening a gap. The default value is 11. Specifying 0 invokes default behavior. • extend_gap_cost: The cost of extending a gap. The default value is 1. Specifying 0 invokes default behavior. • word_size: The word size used for dividing the query sequence into subsequences during the search. The default value is 3. Specifying 0 invokes default behavior. • x_dropoff: Dropoff for BLAST extensions, in bits. The default value is 15. Specifying 0 invokes default behavior. • final_x_dropoff: The final X dropoff value for gapped alignments, in bits. The default value is 25. Specifying 0 invokes default behavior. • t_seq_id: The sequence identifier of the returned match. • score: The score of the returned match. • expect: The expect value of the returned match.
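Two conventions in the parameter list are easy to misread: passing 0 selects the documented default, and a subsequence_to of -1 means "to the end of the sequence". The Python sketch below mimics both rules outside the database. It is an illustration only, not the database's implementation; the helper names are hypothetical, and treating the region as 1-based and inclusive is an assumption.

```python
# Documented defaults from the parameter list above.
DEFAULTS = {
    "expect_value": 10, "open_gap_cost": 11, "extend_gap_cost": 1,
    "word_size": 3, "x_dropoff": 15, "final_x_dropoff": 25,
}


def resolve_parameters(**supplied):
    """Replace any parameter passed as 0 (or omitted) with its documented default."""
    unknown = set(supplied) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {name: supplied.get(name) or default for name, default in DEFAULTS.items()}


def select_region(query_seq, subsequence_from=1, subsequence_to=-1):
    """Return the region of the query used for the search (assumed 1-based, inclusive)."""
    if subsequence_to == -1:  # -1 means "up to the end of the sequence"
        subsequence_to = len(query_seq)
    return query_seq[subsequence_from - 1:subsequence_to]
```

This is why the example queries in this deck can pass 0 for the gap costs, word size and dropoff values and still run with sensible settings.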

How the Whole System Works • Sequences that need to be searched are inserted into a query table
INSERT INTO query_db VALUES ('1', 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGT');

How Does It Work?
SELECT t_seq_id, score, expect AS evalue
FROM TABLE(
  BLASTP_MATCH (
    (SELECT sequence FROM query_db),                  -- query_sequence
    CURSOR(SELECT seq_id, seq_data
           FROM swissprot
           WHERE organism = 'Homo sapiens (Human)'),  -- seqdb_cursor
    1,           -- subsequence_from
    -1,          -- subsequence_to
    0,           -- filter_low_complexity
    0,           -- mask_lower_case
    'BLOSUM62',  -- sub_matrix
    10,          -- expect_value
    0,           -- open_gap_cost
    0,           -- extend_gap_cost
    0,           -- word_size
    0,           -- x_dropoff
    0)           -- final_x_dropoff
) t
WHERE t.score > 25;

The Search Procedure
SELECT t.t_seq_id, t.score, t.expect, p.name
FROM PROT_DB p, TABLE(
  BLASTP_MATCH (
    (SELECT sequence FROM query_db WHERE sequence_id = '2'),
    CURSOR(SELECT seq_id, sequence FROM PROT_DB),
    1,
    -1,
    0,
    0,
    'BLOSUM62',
    10,
    0,
    0,
    0,
    0,
    0)
) t
WHERE t.t_seq_id = p.seq_id AND t.score > 25
ORDER BY t.expect;

Output Results
SEQ_ID        SCORE      EVALUE
-------- ---------- ----------
P31946          205 5.8977E-18
Q04917          198 3.8228E-17
P31947          169 8.8130E-14
P27348          198 3.8228E-17
P58107           49 7.24297332
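To make the effect of the `WHERE t.score > 25` and `ORDER BY t.expect` clauses concrete, here is a small Python sketch that applies the same filter and ordering to the rows shown above. The helper name `filter_and_rank` is hypothetical; only the row values come from the slide.

```python
# (seq_id, score, expect) rows copied from the output above.
rows = [
    ("P31946", 205, 5.8977e-18),
    ("Q04917", 198, 3.8228e-17),
    ("P31947", 169, 8.8130e-14),
    ("P27348", 198, 3.8228e-17),
    ("P58107", 49, 7.24297332),
]


def filter_and_rank(results, min_score=25):
    """Keep rows scoring above min_score and order them by ascending expect value."""
    kept = [row for row in results if row[1] > min_score]
    return sorted(kept, key=lambda row: row[2])  # stable: ties keep input order


ranked_ids = [seq_id for seq_id, _, _ in filter_and_rank(rows)]
```

A lower expect value means a match is less likely to have arisen by chance, so ordering by it puts the most significant hits first.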

The Databases and Why • The ability to perform genome-wide and cross-genome data analysis can reduce the time required for new biological discoveries • Because traditional databases are not built to support location datatypes, researchers have had to find ways for these databases to manage biological information so that it can be queried with a modern database system • This research has led to a concept called bioindexing

Bioindexing • An index in this context is a way of providing a mapping between information entities • In a traditional database, an index is an auxiliary structure that speeds up data retrieval by mapping a record key to the physical disk addresses of the records containing that key • Bioindexing provides functionality similar to a database index but also facilitates data integration • Biological features are generally attached to locations, and locations are also the basis for maps (a map here is an association of features with a sequence alignment), alignments (relationships between two genomic sequence segments), and other complex relationships
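The traditional notion of an index described above, a mapping from a record key to where the matching records live, can be sketched in a few lines of Python. List positions stand in for physical disk addresses here, and the records are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (accession, description).
records = [
    ("P31946", "record one"),
    ("Q04917", "record two"),
    ("P31946", "record three"),
]


def build_index(rows, key_column=0):
    """Map each key to the positions (standing in for disk addresses) of its rows."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[key_column]].append(position)
    return dict(index)


index = build_index(records)
# Lookup goes through the index instead of scanning every record.
matches = [records[position] for position in index.get("P31946", [])]
```

Bioindexing generalizes this mapping idea from plain keys to locations and to connections between data sources.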

The BLAST Database and Bioindexing • Bioindexing is essentially an infrastructure for representing and managing biological knowledge in a large-scale database system using index constructs • Bioindexing uses a “location” datatype and “BLAST joins” to handle and query large amounts of data efficiently • In short, bioindexing is a scheme for connecting and querying information in modern database systems through the use of indexes

Types of Indexing • Intrinsic indexing: concerns indexable bioinformatics datatypes. It permits both the representation and the management of biological mappings • Extrinsic indexing: an efficient way of integrating data from heterogeneous sources such as relational tables, XML files, standard sequence formats, and other sources • Extrinsic indexing concerns the functions and algorithms used to access and connect this information, even when it is not stored locally

Location (How It Is Represented) • Without a proper abstraction, users have to write their own code to handle location operations • A location consists of a sequence identifier and an interval range • Integer intervals are modeled as a [lower, upper] structure • Identifiers are character strings or accession numbers that denote a particular sequence; the interval range is a pair of positive integers that denotes a sub-range within that sequence

Complexity (WHERE Clauses) Without a Location Datatype • EST sequences need to be grouped over consecutive overlapping EST fragments
SELECT DISTINCT A.id, A.lower, B.upper
FROM ESTs AS A, ESTs AS B
WHERE A.unigene_clusterid = B.unigene_clusterid
  AND A.lower < B.upper
  AND NOT EXISTS
    (SELECT *
     FROM ESTs AS C
     WHERE C.unigene_clusterid = A.unigene_clusterid
       AND A.lower < C.lower AND C.lower < B.upper
       AND NOT EXISTS
         (SELECT * FROM ESTs AS D
          WHERE D.unigene_clusterid = A.unigene_clusterid
            AND D.lower < C.lower AND C.lower <= D.upper))
  AND NOT EXISTS
    (SELECT *
     FROM ESTs AS E
     WHERE E.unigene_clusterid = A.unigene_clusterid
       AND ((E.lower < A.lower AND A.lower <= E.upper) OR
            (E.lower < B.upper AND B.upper < E.upper)))
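For comparison, the grouping that the query above expresses with nested NOT EXISTS clauses, merging consecutive overlapping EST fragments within each UniGene cluster, is a short procedure once intervals are treated as first-class values. The Python sketch below is an illustration, not a translation of the SQL: it assumes closed intervals that overlap when they share at least one position, so its boundary handling may differ from the query's mix of strict and non-strict comparisons. The function name and sample data are hypothetical.

```python
from collections import defaultdict


def merge_overlapping(ests):
    """Merge overlapping (lower, upper) intervals within each cluster.

    ests: iterable of (cluster_id, lower, upper) tuples.
    Returns {cluster_id: [(lower, upper), ...]} with merged, sorted intervals.
    """
    by_cluster = defaultdict(list)
    for cluster_id, lower, upper in ests:
        by_cluster[cluster_id].append((lower, upper))

    merged = {}
    for cluster_id, intervals in by_cluster.items():
        intervals.sort()
        groups = [list(intervals[0])]
        for lower, upper in intervals[1:]:
            if lower <= groups[-1][1]:          # overlaps the current group
                groups[-1][1] = max(groups[-1][1], upper)
            else:                               # gap: start a new group
                groups.append([lower, upper])
        merged[cluster_id] = [tuple(group) for group in groups]
    return merged


# Hypothetical EST fragments.
ests = [("Hs.1", 1, 100), ("Hs.1", 50, 150), ("Hs.1", 300, 400), ("Hs.2", 10, 20)]
result = merge_overlapping(ests)
```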

Location Datatype • A straightforward representation of a location is a sequence identifier as a character string plus the location interval as a (start, end) pair of integers • Other representations are possible, such as integer codes for sequence identifiers and/or a (start, length) interval representation • Most databases use the sequence identifier with a (start, end) pair of integers because of its simplicity

Simplicity Using the Location Datatype: Creation and Insertion
CREATE TABLE features ( location loc, description text);
-- The Prader-Willi/Angelman syndrome region on chromosome 15
INSERT INTO features VALUES ( 'NG_002690[1..755217]', 'Prader-Willi/Angelman syndrome region' );
INSERT INTO features VALUES ( 'NG_002690[1..174707]', 'AC090602.16' );
INSERT INTO features VALUES ( 'NG_002690[174707..324834]', 'AC124312.5' );
INSERT INTO features VALUES ( 'NG_002690[324835..478258]', 'AC124303.5' );
INSERT INTO features VALUES ( 'NG_002690[478259..606120]', 'AC100774.2' );
INSERT INTO features VALUES ( 'NG_002690[606121..755217]', 'AC124997.4' );
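The location literals used in these INSERT statements follow the pattern identifier[lower..upper]. As a rough illustration of what a location datatype has to do with such a literal, the Python sketch below parses it into a sequence identifier and an integer interval and tests two locations for overlap. The class and function names are hypothetical, and the validation rules (positive, ordered bounds) are assumptions based on the description of locations earlier in the deck.

```python
import re
from typing import NamedTuple


class Location(NamedTuple):
    seq_id: str
    lower: int
    upper: int


_LOCATION_RE = re.compile(r"^(?P<seq_id>[^\[\]]+)\[(?P<lower>\d+)\.\.(?P<upper>\d+)\]$")


def parse_location(text):
    """Parse a literal such as 'NG_002690[1..755217]' into a Location."""
    match = _LOCATION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a location literal: {text!r}")
    lower, upper = int(match["lower"]), int(match["upper"])
    if lower < 1 or lower > upper:
        raise ValueError(f"invalid interval in {text!r}")
    return Location(match["seq_id"], lower, upper)


def overlaps(a, b):
    """True if both locations are on the same sequence and share a position."""
    return a.seq_id == b.seq_id and a.lower <= b.upper and b.lower <= a.upper
```

With such a type in place, a condition like "same sequence and intersecting range" becomes a single operator call instead of several comparisons on separate columns.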

The introduction of a location datatype not only provides a natural and intuitive way to represent biological information, but also boosts system performance • A further performance increase can be achieved by supporting a location index scheme • Support for indexing schemes in traditional relational database systems is very limited and inflexible • It is restricted to a few well-known index structures, such as B+-tree, hash and R-tree, which can be used only with a limited set of native datatypes for (in)equality and range queries

The location datatype supports a set of operations and functions • A major proportion of these functions are related to interval operations • More than 30 interval operations are defined, including Allen's interval logic [15] (which includes after, before, contains, during, equals, overlaps, overlapped by, finishes, finished by, meets, met by, starts and started by) • Optimization information (such as ordering, commutativity or negation) is also provided to permit optimization of important operations like merge-join, hash-join or general theta-join
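The thirteen Allen relations named above are mutually exclusive, so any pair of intervals falls into exactly one of them. The Python sketch below classifies a pair under the usual continuous-interval definitions, assuming each interval has lower < upper and that "meets" means the end of one equals the start of the other. The function name is hypothetical, and a genomic location type may define adjacency differently for integer positions.

```python
def allen_relation(a, b):
    """Name the Allen relation of interval a to interval b.

    Each interval is a (lower, upper) pair with lower < upper.
    """
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    if a_hi < b_lo:
        return "before"
    if a_lo > b_hi:
        return "after"
    if a_hi == b_lo:
        return "meets"
    if a_lo == b_hi:
        return "met by"
    if (a_lo, a_hi) == (b_lo, b_hi):
        return "equals"
    if a_lo == b_lo:
        return "starts" if a_hi < b_hi else "started by"
    if a_hi == b_hi:
        return "finishes" if a_lo > b_lo else "finished by"
    if b_lo < a_lo and a_hi < b_hi:
        return "during"
    if a_lo < b_lo and b_hi < a_hi:
        return "contains"
    # Only partial overlaps with four distinct endpoints remain.
    return "overlaps" if a_lo < b_lo else "overlapped by"
```

For example, `allen_relation((1, 5), (3, 9))` yields "overlaps", while swapping the arguments yields "overlapped by".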

Why the Location Datatype Is Needed • Here is a simple example to demonstrate the power of location datatype support. It shows a session that painfully attempts to locate alternatively spliced exon intervals that intersect with known homology intervals and to associate them with known protein features from the Pfam and Swiss-Prot databases.

Complexity Without Locations
CREATE TABLE alt_splice_homology_map AS
  SELECT o.*, d.swiss_id, d.query_start, d.query_end,
         -- aliases assumed here: the second query refers to o.hit_end
         d.hit_start+(o.seq_start-d.query_start)/3 AS hit_start,
         d.hit_start+(o.seq_end-d.query_start)/3 AS hit_end
  FROM alt_splice_exon_obs o, alt_splice_homology d
  WHERE o.ug_id = d.ug_id
    AND o.seq_start > d.query_start
    AND o.seq_start < d.query_end
    AND d.e_value < 0.01
  GROUP BY o.ug_id, o.seq_start;

SELECT o.*, f.type, f.start, f.end
FROM alt_splice_homology_map o, swiss_feature f
WHERE o.swiss_id = f.swiss_id
  AND o.hit_end >= f.start
  AND o.hit_end <= f.end;

Simplicity Using Locations
CREATE TABLE alt_splice_homology_map AS
  SELECT o.*, d.location,
         range_start(d.query)+(o.location-range_start(d.hit))/3
  FROM alt_splice_exon_obs o, alt_splice_homology d
  WHERE o.location @ d.location   -- contained
    AND d.e_value < 0.01
  GROUP BY o;

SELECT o.*, f.type, f.location
FROM alt_splice_homology_map o, swiss_feature f
WHERE o.location &< f.location;   -- left overlap

Location Support • Supporting location indexing in a traditional database implies the need to support interval indexing • However, interval indexing is not supported in traditional databases, and standard join operations cannot handle intervals efficiently; this has led to extensive research on interval indexing • This is where GiST comes in

GiST • GiST (Generalized Search Tree) is an efficient solution to the problem of ineffective interval indexing in traditional databases • GiST is a balanced search tree in which keys are maintained hierarchically. A search key may be any arbitrary predicate, but that predicate must hold true for all data stored below the key • GiST searches the tree depth-first: if the query predicate is consistent with a given search key, the search continues into the subtree below that key; otherwise that subtree is skipped

GiST Implementation • GiST is implemented using bounding intervals that cover both the range of identifier integers (id_lower, id_upper) and the range of intervals in the subtree (lower, upper) • Under the GiST architecture, interval predicates such as left, right, overlap, overleft, overright, contains, contained and equal are all supported
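The search behaviour described for GiST, descending only into subtrees whose bounding key is consistent with the query, can be sketched in Python with a hand-built two-level tree. This is a simplified illustration under stated assumptions: it covers a single sequence (so the identifier bounds id_lower and id_upper are omitted), supports only the overlap predicate, and leaves out insertion, node splitting and balancing. The `Node` class and function names are hypothetical; the intervals reuse the clone ranges from the features table.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    lower: int   # bounding interval covering everything stored below this node
    upper: int
    children: list = field(default_factory=list)  # inner node: child Nodes
    entries: list = field(default_factory=list)   # leaf node: (lower, upper, payload)


def consistent(node, q_lower, q_upper):
    """An overlap query can only match below a node whose bounding interval it overlaps."""
    return node.lower <= q_upper and q_lower <= node.upper


def search(node, q_lower, q_upper, visited=None):
    """Depth-first overlap search that skips subtrees ruled out by their bounding interval."""
    if visited is not None:
        visited.append((node.lower, node.upper))
    results = [payload for lo, hi, payload in node.entries
               if lo <= q_upper and q_lower <= hi]
    for child in node.children:
        if consistent(child, q_lower, q_upper):
            results.extend(search(child, q_lower, q_upper, visited))
    return results


left = Node(1, 324834, entries=[(1, 174707, "AC090602.16"),
                                (174707, 324834, "AC124312.5")])
right = Node(324835, 755217, entries=[(324835, 478258, "AC124303.5"),
                                      (478259, 606120, "AC100774.2"),
                                      (606121, 755217, "AC124997.4")])
root = Node(1, 755217, children=[left, right])

visited = []
found = search(root, 500000, 500100, visited)
```

In this run the left subtree is never visited because its bounding interval cannot overlap the query range, which is the source of the speed-up over scanning every entry.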

What the GiST Location Index Does

Conclusion • Bioinformatics databases are being modeled and queried using functions (as seen in Oracle and IBM DB2) • An efficient way of modeling these databases is bioindexing (as seen in the PostgreSQL database) • Using an index structure as in bioindexing, where locations are indexed in a tree structure searched depth-first, leads to less complexity • This location index structure leads to faster searching of the databases • Speed is very important in bioinformatics • Using a GiST architecture leads to less complex queries and a more confined search space for query information

References • The Index as a First-Class Construct in Relational Database Systems. D. Stott Parker, Edwin Mach • Algorithms and Databases in Bioinformatics: Towards a Proteomic Ontology. Mario Cannataro, Pietro Hiram Guzzi, Tommaso Mazza, Giuseppe Tradigo and Pierangelo Veltri • Oracle® Data Mining • Mobile Access to Biological Databases on the Internet. Pentti Riikonen, Jorma Boberg, Tapio Salakoski, and Mauno Vihinen • Utilizing Multiple Bioinformatics Information Sources: An XML Database Approach. Raymond K. Wong, William M. Shui • Support for BioIndexing in BLASTgres. Ruey-Lung Hsiao, D. Stott Parker, and Hung-chih Yang
