Index-based approach to similarity search in protein and nucleotide databases

David Hoksza, Tomáš Skopal
Charles University in Prague, Department of Software Engineering, Czech Republic

Presentation Outline
• Biological background
• Protein and nucleotide databases
• Current methods
  • dynamic programming
  • heuristic approach
• Index-based approach
• Experiments
DATESO 2007

Terminology
• DNA (deoxyribonucleic acid)
  • sequence of nucleotides (A, C, G, T)
  • double helix
• RNA (ribonucleic acid)
  • single-stranded sequence of nucleotides (A, C, G, U)
  • messenger RNA (mRNA)
  • transfer RNA (tRNA)
  • ribosomal RNA (rRNA)
  • …
• Proteins
  • molecules translated from mRNA in ribosomes
  • sequence of amino acids (20 AAs)
  • each amino acid is coded by a codon (triplet of nucleotides) according to the genetic code
• Central dogma: DNA → (transcription) → RNA → (translation) → protein

Protein Similarity
• Interaction of proteins determines biological functions
• The function of a protein is derived from its three-dimensional structure
  • similar proteins (many common amino acids in “appropriate” places) have similar structure
  • → similar proteins have similar functions
  • similar proteins have a common ancestor
• Determining a protein sequence
  • → finding similar proteins
  • → getting a clue to the function

Protein and Nucleotide Databases
• Protein databases
  • finding similar proteins
  • even among different species
• Nucleotide databases
  • finding similarities in non-coding (not transcribed) parts
  • finding whether a sequence was already described
  • checking whether a given segment was sequenced correctly
• Prominent databases
  • GenBank
  • EMBL (European Molecular Biology Laboratory Data)
  • DDBJ (DNA Data Bank of Japan)
  • UniProt
    • Swissprot + trEMBL (translated EMBL) + PIR (Protein Information Resource)
  • (slide labels whose attachment is not recoverable from the transcript: “not moderated”, “moderated”)

Databases Growth
[Figure: growth of the databases; the chart is not preserved in the transcript]

Similarity Search
• Similarity = alignment of 2 sequences
  • “correspondence” between 2 sequences
• Standard methods for finding alignments
  • dot matrix method
  • dynamic programming
  • heuristic approach
• Example alignment:
    N P H G I - I - M G L - A E
    - - H G - A - L - G L L - E

Similarity Measures
• A measure has to be defined
• Distances for measuring alignments of strings
  • Hamming distance
    • sequences of equal length
    • number of non-identical positions
  • Levenshtein (edit) distance
    • minimal number of editing operations (insert/update/delete) needed to convert one sequence into the other
  • Weighted edit distance
    • takes into account the probability of updating one letter to the other
    • distance matrix
      • biologically correct
      • PAM, BLOSUM, …
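The two unweighted distances on this slide can be sketched in a few lines of Python. This is only an illustration of the definitions; the function names `hamming` and `levenshtein` are chosen here and do not come from the slides.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimal number of insert/update/delete operations turning a into b."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]                          # deleting the first i letters of a
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,            # delete x
                            curr[j - 1] + 1,        # insert y
                            prev[j - 1] + (x != y)))  # update (free if equal)
        prev = curr
    return prev[-1]
```

For example, `hamming("HGLGL", "HGIGL")` counts one differing position, and `levenshtein("kitten", "sitting")` needs three operations. The weighted edit distance replaces the unit costs with entries of a matrix such as PAM or BLOSUM.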

Dynamic Programming – Global Alignment
• Global alignment
  • aligning whole sequences
  • weighted edit distance
• Needleman-Wunsch
  • optimal alignment between 2 sequences a and b
  • distance matrix δ
  • gap cost σ
  • s_{i,j} … value of the optimal alignment of the prefixes of a and b of lengths i and j
  • s_{0,j} = j·σ, s_{i,0} = i·σ
  • s_{i,j} = max( s_{i,j-1} + σ [adding gap to a], s_{i-1,j} + σ [adding gap to b], s_{i-1,j-1} + δ(a_i, b_j) [align a_i and b_j] )
  • s_{|a|,|b|} … value of the optimal alignment
  • time complexity O(|a||b|)
• Example (BLOSUM 62, gap cost -1), total score 20:
     N  P  H  G  I  I  M  G  L  A  E
     -  -  H  G  -  -  L  G  L  -  -
    -1 -1 +8 +6 -1 -1 +2 +6 +4 -1 -1
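A minimal Python sketch of the Needleman-Wunsch recurrence, run on the slide's example. The names `SUB`, `delta` and `needleman_wunsch` are chosen here for illustration. `SUB` holds only the BLOSUM 62 entries needed for these two sequences; apart from the five scores shown on the slide they were transcribed by hand and should be checked against an authoritative copy of the matrix.

```python
# Partial BLOSUM 62 table (assumption: hand-transcribed, only the pairs needed
# for a = NPHGIIMGLAE and b = HGLGL). SUB[y][x] scores letter x of a against
# letter y of b.
_COLS = "NPHGIMLAE"
SUB = {
    "H": dict(zip(_COLS, [1, -2, 8, -2, -3, -2, -3, -2, 0])),
    "G": dict(zip(_COLS, [0, -2, -2, 6, -4, -3, -4, 0, -2])),
    "L": dict(zip(_COLS, [-3, -3, -3, -4, 2, 2, 4, -1, -3])),
}


def delta(x: str, y: str) -> int:
    """Substitution score of x (from a) aligned with y (from b)."""
    return SUB[y][x]


def needleman_wunsch(a: str, b: str, gap: int, score=delta) -> int:
    """Value of the optimal global alignment of a and b (linear gap cost)."""
    s = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        s[i][0] = i * gap                   # prefix of a aligned with gaps only
    for j in range(1, len(b) + 1):
        s[0][j] = j * gap                   # prefix of b aligned with gaps only
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s[i][j] = max(
                s[i][j - 1] + gap,                               # gap in a
                s[i - 1][j] + gap,                               # gap in b
                s[i - 1][j - 1] + score(a[i - 1], b[j - 1]),     # align a_i, b_j
            )
    return s[len(a)][len(b)]
```

With these inputs, `needleman_wunsch("NPHGIIMGLAE", "HGLGL", gap=-1)` evaluates to 20, the total shown in the slide's example. The table has |a|·|b| cells, each filled in constant time, which is where the O(|a||b|) cost comes from.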

Dynamic Programming – Local Alignment
• Local alignment
  • best global alignment over all pairs of subsequences of a and b
• Smith-Waterman
  • modification of Needleman-Wunsch
  • allows a “free ride” from the start by incorporating a zero value into the maximum
  • s_{0,j} = 0, s_{i,0} = 0
  • max(s_{i,j}) … value of the optimal alignment
  • gap extending with cost of σ
• Example (BLOSUM 62, gap cost -11), local score 16:
     N  P  H  G  I  I  M  G  L  A  E
           H  G  L
          +8 +6 +2

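Smith-Waterman, BLOSUM62, open gap -11, extend gap -11
(columns: a = NPHGIIMGLAE; rows: b = HGLGL, as in the alignment examples; the maximum, 16, is the local alignment score)

           N   P   H   G   I   I   M   G   L   A   E
       0   0   0   0   0   0   0   0   0   0   0   0
   H   0   1   0   8   0   0   0   0   0   0   0   0
   G   0   0   0   0  14   3   0   0   6   0   0   0
   L   0   0   0   0   3  16   5   2   0  10   0   0
   G   0   0   0   0   6   5  12   2   8   0  10   0
   L   0   0   0   0   0   8   7  14   3  12   1   7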
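The matrix on this slide can be reproduced with a short Python sketch of Smith-Waterman. It assumes a linear gap cost, which matches the slide because opening and extending a gap both cost -11. The names `SUB`, `delta` and `smith_waterman` are chosen here. `SUB` is a hand-transcribed excerpt of BLOSUM 62 covering only the letter pairs of the example, so treat its values as an assumption to verify.

```python
# Partial BLOSUM 62 table (assumption: hand-transcribed, only the pairs needed
# for a = NPHGIIMGLAE and b = HGLGL). SUB[y][x] scores x from a against y from b.
_COLS = "NPHGIMLAE"
SUB = {
    "H": dict(zip(_COLS, [1, -2, 8, -2, -3, -2, -3, -2, 0])),
    "G": dict(zip(_COLS, [0, -2, -2, 6, -4, -3, -4, 0, -2])),
    "L": dict(zip(_COLS, [-3, -3, -3, -4, 2, 2, 4, -1, -3])),
}


def delta(x: str, y: str) -> int:
    """Substitution score of x (from a) aligned with y (from b)."""
    return SUB[y][x]


def smith_waterman(a: str, b: str, gap: int, score=delta) -> list[list[int]]:
    """Local-alignment matrix with rows indexed by b and columns by a."""
    h = [[0] * (len(a) + 1) for _ in range(len(b) + 1)]   # zero borders
    for i in range(1, len(b) + 1):
        for j in range(1, len(a) + 1):
            h[i][j] = max(
                0,                                            # "free ride": restart
                h[i - 1][j - 1] + score(a[j - 1], b[i - 1]),  # align the two letters
                h[i - 1][j] + gap,                            # gap in a
                h[i][j - 1] + gap,                            # gap in b
            )
    return h


matrix = smith_waterman("NPHGIIMGLAE", "HGLGL", gap=-11)
best = max(max(row) for row in matrix)   # value of the best local alignment
```

Here `best` is 16, reached in the row of the first L under the first I, which corresponds to aligning HGI with HGL (+8 +6 +2).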

Heuristic approach
• O(|a||b|) is expensive
  • → heuristic approach
• BLAST (Basic Local Alignment Search Tool)
  • Remove low-complexity regions
  • Generate all n-grams from the query sequence
  • Compute the similarity between every sequence of length n and each n-gram from the previous step
  • Filter out sequences with similarity lower than a cut-off score
  • Exact match of the remaining (high-scoring) sequences (organized in a search tree) with the DB
  • Connect matched high-scoring sequences within a given distance by gapped alignment and extend them → high-scoring pairs (HSPs)
  • HSPs with a score under a given threshold are excluded
  • Remaining sequences are aligned by the Smith-Waterman algorithm with the original query sequence
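The first BLAST stages listed above (n-gram generation, neighbourhood filtering by a cut-off score, exact matching against the database) can be sketched as follows. This is a toy illustration, not the BLAST implementation: `toy_score` is a hypothetical stand-in for a real substitution matrix, a brute-force enumeration replaces the search tree, and all function names are chosen here.

```python
from itertools import product


def ngrams(seq: str, n: int = 3) -> list[str]:
    """All overlapping words of length n in seq."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def toy_score(x: str, y: str) -> int:
    """Hypothetical letter score standing in for a BLOSUM/PAM lookup."""
    return 5 if x == y else -1


def neighbourhood(word: str, alphabet: str, threshold: int, score=toy_score) -> list[str]:
    """Words of the same length whose similarity to `word` reaches the cut-off."""
    result = []
    for cand in product(alphabet, repeat=len(word)):   # brute force over all words
        if sum(score(x, y) for x, y in zip(word, cand)) >= threshold:
            result.append("".join(cand))
    return result


def seed_hits(query: str, db_seq: str, alphabet: str, threshold: int, n: int = 3):
    """Positions in db_seq that exactly match some high-scoring word."""
    words = {w for q in ngrams(query, n)
             for w in neighbourhood(q, alphabet, threshold)}
    return [(i, db_seq[i:i + n])
            for i in range(len(db_seq) - n + 1)
            if db_seq[i:i + n] in words]
```

For instance, `ngrams("NPHGI")` yields the three words NPH, PHG and HGI. The hits returned by `seed_hits` are the seeds that the later stages would join and extend into HSPs.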

Statistical Relevance
• What is the probability that an alignment happened by chance?
• Using statistics (distribution function) of ungapped local alignment
  • applied to gapped alignment (empirically tested)
• E-value … expected number of alignments with score at least S between sequences of lengths m and n
  • E = K·m·n·e^{-λS}
  • K, λ … depend on the distance matrix
• Taking the size of the database N and the length of the query into account
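The E-value described on this slide is commonly computed with the Karlin-Altschul formula E = K·m·n·e^(-λS). The sketch below only evaluates that formula; the function name `e_value` is chosen here, and the values of K and λ in the usage lines are placeholders for illustration, not parameters taken from the presentation.

```python
import math


def e_value(score: float, m: int, n: int, K: float, lam: float) -> float:
    """Expected number of chance alignments scoring at least `score`.

    m and n are the lengths of the two sequences (or query length and total
    database length); K and lam depend on the scoring matrix and gap costs.
    """
    return K * m * n * math.exp(-lam * score)


# Placeholder parameters (assumption, for illustration only).
K_DEMO, LAMBDA_DEMO = 0.1, 0.3
weak = e_value(16, m=5, n=1_041_000, K=K_DEMO, lam=LAMBDA_DEMO)
strong = e_value(60, m=5, n=1_041_000, K=K_DEMO, lam=LAMBDA_DEMO)
```

A higher score gives an exponentially smaller E-value, so `strong` is far below `weak`. A longer query or a larger database raises the E-value linearly, which is how the database size enters the measure.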

Metric Access Methods (MAM)
• Given a metric, MAMs are used to organize objects
  • only promising groups of objects have to be searched while querying
• MAMs use the metric function as a “black box”
  • (→ local alignment can be used)
• Examples
  • M-tree, PM-tree, LAESA, vp-tree, GNAT, D-Index, …
• Metric (∀ O_i, O_j, O_k ∈ U)
  • reflexivity: d(O_i, O_j) = 0 ⇔ O_i = O_j
  • positivity: d(O_i, O_j) > 0 ⇔ O_i ≠ O_j
  • symmetry: d(O_i, O_j) = d(O_j, O_i)
  • triangular inequality: d(O_i, O_j) + d(O_j, O_k) ≥ d(O_i, O_k)
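Whether a candidate distance satisfies the metric postulates can be tested empirically on a sample of objects. The Python sketch below does this by brute force. The function name `metric_violations` is chosen here, and an empty result only shows that no violation occurs on the sample, not that the function is a metric on the whole universe U.

```python
from itertools import product


def metric_violations(objects, d):
    """List the metric postulates that d violates on the given sample."""
    violations = []
    for x, y in product(objects, repeat=2):
        if (d(x, y) == 0) != (x == y):          # reflexivity
            violations.append(("reflexivity", x, y))
        if x != y and d(x, y) < 0:              # positivity for distinct objects
            violations.append(("positivity", x, y))
        if d(x, y) != d(y, x):                  # symmetry
            violations.append(("symmetry", x, y))
    for x, y, z in product(objects, repeat=3):
        if d(x, y) + d(y, z) < d(x, z):         # triangular inequality
            violations.append(("triangle", x, y, z))
    return violations


sample = [0, 1, 2]
ok = metric_violations(sample, lambda x, y: abs(x - y))       # a true metric
bad = metric_violations(sample, lambda x, y: (x - y) ** 2)    # only a semi-metric
```

On this sample `ok` is empty, while `bad` reports the triplet (0, 1, 2): the squared difference gives 1 + 1 < 4, so only the triangular inequality fails.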

Creating a metric
• Which distance function to use?
• Smith-Waterman
  • doesn’t take sequence length into account
  • no statistical relevance
• E-value with SW
  • takes statistical relevance, query length and database length into account
  • standard in biological databases
  • problems
    • reflexivity
      • same sequences → E-value = 0
    • symmetry
    • triangular inequality

TriGen algorithm
• Turning a semi-metric (metric without the triangular inequality) into a metric
  • applying triangle-generating (TG) modifiers (functions) to the original distance function
  • every concave similarity-preserving (SP) modifier is a TG modifier
  • increasing intrinsic dimensionality → decreasing index efficiency
• TG-error tolerance θ
  • TG-error: proportion of non-triangular triplets among the sampled distance triplets
  • θ = 0 → exact search; θ > 0 → approximate search
  • tradeoff between correctness and efficiency
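The idea can be illustrated with a simplified Python sketch: measure the TG-error on sampled distance triplets and apply increasingly concave modifiers until the error drops to the tolerance. The sketch makes several assumptions: distances are normalized to [0, 1]; the modifier family x^(1/(1+w)) is just one example of a concave similarity-preserving function; the linear scan over a fixed list of weights stands in for the search strategy of the actual algorithm. All names are chosen here.

```python
from itertools import combinations


def is_triangular(x: float, y: float, z: float) -> bool:
    """True if three distances can form a triangle."""
    a, b, c = sorted((x, y, z))
    return a + b >= c


def tg_error(triplets, modifier=lambda x: x) -> float:
    """Fraction of triplets that are non-triangular after applying the modifier."""
    triplets = list(triplets)
    bad = sum(not is_triangular(*(modifier(v) for v in t)) for t in triplets)
    return bad / len(triplets)


def fp_modifier(w: float):
    """Concave, increasing modifier on [0, 1]; w = 0 is the identity."""
    return lambda x: x ** (1.0 / (1.0 + w))


def pick_weight(triplets, tolerance: float, weights=(0, 0.25, 0.5, 1, 2, 4, 8)):
    """Smallest listed weight whose TG-error is within the tolerance, else None."""
    for w in weights:
        if tg_error(triplets, fp_modifier(w)) <= tolerance:
            return w
    return None


def sample_triplets(objects, d):
    """Distance triplets for all 3-element subsets of the sample."""
    return [(d(x, y), d(y, z), d(x, z)) for x, y, z in combinations(objects, 3)]


demo = [(0.1, 0.1, 0.9), (0.5, 0.6, 0.7)]   # first triplet is non-triangular
```

For `demo`, the unmodified TG-error is 0.5. `pick_weight(demo, 0.0)` has to go up to weight 4 before both triplets become triangular, whereas `pick_weight(demo, 0.5)` accepts the unmodified distance. A smaller weight keeps the distances more spread out, which relates to the slide's point that stronger modification raises the intrinsic dimensionality and lowers index efficiency.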

Experimental Results
• Swissprot
  • subset of 3000 sequences (1,041,000 amino acids)
  • average sequence length 335
  • maximal sequence length limited to 1000
    • only 3% of sequences are longer → special treatment
• Testing of
  • distance computations
  • computational costs
    • number of letter comparisons
  • TG error, real error

BLAST computational costs estimation
• finding neighbouring sequences
  • 54 sequences (empirically)
  • → 81784 distance computations (comparisons)
  • → 245352 computational operations (3-grams)
• average number of neighbouring words for query sequences: 54
  • → search tree of height 6
  • → 6 · 1,041,000 · 3 = 18,738,000 computational operations for every search

Experiments – E-value
[Figure: experimental results for the E-value distance; the plots are not preserved in the transcript]

Experiments – Error tolerance
[Figure: two plots with “TriGen error tolerance” on the horizontal axis; the plots are not preserved in the transcript]

Conclusion
• We have analyzed
  • standard current methods used for searching protein and nucleotide databases
• We have implemented
  • indexing of protein sequences by MAMs
  • can be used for nucleotide sequences as well
• Experimental results
  • have shown that using MAMs without search space modification does not yield a significant advantage over sequential scan

References
[1] Skopal T., Pokorný J., Snášel V.: Nearest Neighbours Search using the PM-tree, DASFAA 2005, Beijing, China
[2] Skopal T.: On Fast Non-Metric Similarity Search by Metric Access Methods, EDBT 2006, Munich, Germany
