Indexing Biological Sequence Data

Doctoral Seminar by Mihail R. Halachev
Supervisor: Dr. N. Shiri
Dept. of Computer Science and Software Engineering, Concordia University
11/29/2004

Outline
• Introduction: From DNA to sequence data
• Basic tasks over biological sequence data
• Search techniques
• Indexing techniques for sequence data
• Applicability to bioinformatics
• Suffix Trees
• Conclusion
• Future Work

From DNA to sequence data representation
• The 2 strands are complementary: A pairs with T, and C pairs with G.
• A DNA segment can therefore be encoded using the bases from only one of the strands: S = AGTACG, Σ = {A, C, G, T}
Source: National Health Museum
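
Because the strands are complementary, either one can be derived from the other. A minimal Python sketch of that derivation follows; the function names are illustrative and do not come from the slides.

```python
# Watson-Crick pairing: A <-> T, C <-> G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(strand: str) -> str:
    """Base-by-base complement of a DNA strand over {A, C, G, T}."""
    return strand.translate(COMPLEMENT)

def reverse_complement(strand: str) -> str:
    """The opposite strand written in its own reading direction."""
    return complement(strand)[::-1]

if __name__ == "__main__":
    s = "AGTACG"  # the example sequence S from the slide
    print(complement(s))
    print(reverse_complement(s))
```

Storing only S loses no information, since applying `complement` twice returns the original strand.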

From mRNA to sequence data representation
• Each codon specifies a single amino acid.
• S = ATGLRS*, |Σ’| = 20
Source: Wikipedia
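
The codon-to-amino-acid mapping can be sketched as a table lookup. The table below is only a small excerpt of the standard genetic code, and the mRNA string is an assumed input chosen so that it yields the slide's example S = ATGLRS*; neither appears in the original slides.

```python
# Excerpt of the standard genetic code (one-letter amino acid codes, "*" = stop).
# Only the codons needed for this illustration are listed.
CODON_TABLE = {
    "GCU": "A", "ACU": "T", "GGU": "G", "CUU": "L", "CGU": "R", "UCU": "S",
    "AUG": "M", "UAA": "*", "UAG": "*", "UGA": "*",
}

def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[i:i + 3]]  # KeyError for codons outside the excerpt
        protein.append(amino)
        if amino == "*":
            break
    return "".join(protein)

if __name__ == "__main__":
    print(translate("GCUACUGGUCUUCGUUCUUAA"))
```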

Basic tasks over biological data
• From a biological point of view:
  • Given a novel DNA sequence, search the primary biological DBs for similar (already known) sequences.
    • Similarity (Alignment)
    • Homology
  • Compare a novel protein sequence to secondary protein DBs containing motifs, signatures, protein domains, etc.
    • Approximation of the biochemical function of the query protein
• From a computational point of view: both tasks are essentially searching.

Search techniques for biological sequence data (BLAST, Clustal W)
Basic Local Alignment Search Tool (BLAST) [Altschul ‘90, ‘97]
The NCBI BLAST family of programs includes:
• blastp - an amino acid query against a protein DB
• blastn - a nucleotide query against a nucleotide DB
• blastx - a nucleotide query (in all reading frames) against a protein DB
• tblastn - a protein query against a nucleotide DB (in all reading frames)
• tblastx - the six-frame translations of a nucleotide query against the six-frame translations of a nucleotide DB

How does BLAST work?
• The BLAST algorithm is a heuristic search method that seeks words of length W that score at least T when aligned with the query and scored with a substitution matrix.
• Words in the database that score T or greater are extended in both directions in an attempt to find an alignment that forms an HSP (high-scoring segment pair) with a score of at least S or an E value lower than the specified threshold.
• The value of the T parameter is a trade-off between the speed and the sensitivity of the search.
• Local pairwise alignment
Source: National Center for Biotechnology Information (NCBI)
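
The seed-and-extend idea can be sketched in a few lines of Python. This is a deliberate simplification and not the BLAST implementation: seeds are exact word matches instead of neighborhood words scored against a substitution matrix, extension is ungapped, and the scoring values and the `xdrop` cutoff are assumptions made for illustration.

```python
def seed_hits(query, subject, w):
    """Yield (i, j) such that query[i:i+w] == subject[j:j+w] (exact seeds)."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    for j in range(len(subject) - w + 1):
        for i in words.get(subject[j:j + w], ()):
            yield i, j

def extend(query, subject, i, j, w, match=1, mismatch=-1, xdrop=2):
    """Ungapped extension of a seed in both directions.

    Returns (score, query_start, subject_start, length) of the best segment pair.
    """
    score = best = w * match
    right = k = 0
    while i + w + k < len(query) and j + w + k < len(subject):
        score += match if query[i + w + k] == subject[j + w + k] else mismatch
        k += 1
        if score > best:
            best, right = score, k
        elif best - score > xdrop:   # give up once the score drops too far
            break
    score = best                     # continue leftwards from the best right end
    left = k = 0
    while i - k - 1 >= 0 and j - k - 1 >= 0:
        score += match if query[i - k - 1] == subject[j - k - 1] else mismatch
        k += 1
        if score > best:
            best, left = score, k
        elif best - score > xdrop:
            break
    return best, i - left, j - left, w + left + right

def find_hsps(query, subject, w, min_score):
    """All distinct extended seeds whose score reaches min_score (the role of S)."""
    hsps = {extend(query, subject, i, j, w) for i, j in seed_hits(query, subject, w)}
    return sorted(h for h in hsps if h[0] >= min_score)

if __name__ == "__main__":
    print(find_hsps("GATTACA", "CCGATTACACC", w=4, min_score=6))
```

A larger `w` produces fewer seeds and therefore a faster but less sensitive search, which mirrors the speed/sensitivity trade-off noted for T.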

BLAST Case Study [Hunt ‘01]
Hardware: SUN Enterprise 450, 2 GB RAM, 4 processors, Solaris 7
Software: BLAST (with default parameter settings)
Data: 3 human chromosomes (294 Mbp, 10% of human genome), data on local disks
Queries: 99 query sequences (predicted human genes), with lengths between 429 and 5999 bp
Results: 6559 hits, average 66 hits per query
Time: 62 hours

BLAST Observations
• “BLAST: - performs serial scan of the DB; - is CPU intensive; - its usefulness depends on the biologists being able to provide appropriate search parameters values.” [Hunt ‘01]
• “Filtering approaches, like BLAST, are only suitable for high similarity matching, but often low similarities are biologically significant.” [Navarro ‘00a]

Clustal W [Thompson ‘94]
• Dynamic Programming alignment method
• Based on global multiple alignment
• Input: set of N sequences
• Output: the optimal alignment of N sequences
• Improved sensitivity (may find similar sequences which BLAST may omit)
• 50-100 times slower than BLAST
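
To give a feel for the dynamic programming involved, here is a sketch of global alignment scoring for just two sequences (the Needleman-Wunsch recurrence). It is not Clustal W itself, which aligns N sequences progressively; the scoring values are assumptions chosen for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of a and b (linear gap penalty)."""
    # prev[j] = best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * gap]  # a[:i] aligned against the empty prefix of b
        for j, cb in enumerate(b, 1):
            diagonal = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(diagonal, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

if __name__ == "__main__":
    print(global_alignment_score("AGTACG", "AGTCG"))
```

The table has (|a| + 1) x (|b| + 1) cells, so even the pairwise case is quadratic, which helps explain why DP-based methods are slower than heuristic filtering.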

Motivation for Indexing?
• “Many of these biological datasets are growing at exponential rates – for example, the sizes of the sequence datasets in GenBank have been doubling every sixteen months.” [Tata ‘04]
• “As there is a rapid rise in both the volume of data and the demand for searches by researchers investigating functional genomics, it is worth investigating the possibility of accelerating these searches using indexes.” [Hunt ‘01]

Indexing Techniques for Sequence Data
• Q-grams [Navarro ‘98]
• String B-Tree [Ferragina ‘99]
• Multi-D Index [Jagadish ‘00]
• Suffix Tree [Weiner ‘73, McCreight ‘76] [Ukkonen ‘95] [Hunt ‘01, Giegerich ‘03, Tata ‘04]

Q-grams -- Construction
• Input: T is a text over Σ, |T| = n, |Σ| = σ
• Pick an integer q with 0 < q < n, say q = 4 (a good heuristic is q ≈ log_σ n)
• Each substring of T of length q is called a “q-gram” and is stored in the index table (in lexical order) with a list of pointers to the positions (or blocks) in T where this q-gram occurs
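
The construction can be sketched as follows. The function names are illustrative, positions are 1-based to match the slides' T[i] notation, and the table is kept in memory as a sorted list rather than on disk.

```python
import math

def suggest_q(n: int, sigma: int) -> int:
    """Heuristic from the slide: q is roughly log_sigma(n)."""
    return max(1, round(math.log(n) / math.log(sigma)))

def build_qgram_index(text: str, q: int):
    """Return the index table: (q-gram, [1-based positions]) in lexical order."""
    table = {}
    for pos in range(len(text) - q + 1):
        table.setdefault(text[pos:pos + q], []).append(pos + 1)
    return sorted(table.items())

if __name__ == "__main__":
    for qgram, positions in build_qgram_index("abracadabra", 3):
        print(qgram, positions)
    print(suggest_q(3_000_000_000, 4))  # a genome-sized text over {A, C, G, T}
```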

Q-grams -- Searching
• For a pattern P, |P| = m, find all approximate occurrences P’ of P in T whose error ratio is at most λ
• λ = k / m, where k is the edit distance of P’ to P
• Knowing m and the desired λ, compute k
• Split P into k + 1 disjoint pieces
• For each of the k + 1 pieces, search the index table (binary search)
• The set of candidate matches is the union of all occurrences
• Verify each candidate by neighborhood search
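
The filtering step rests on a pigeonhole argument: k errors can touch at most k of the k + 1 pieces, so at least one piece must occur unchanged in any approximate occurrence. The sketch below follows the steps above under several assumptions of its own: positions are 0-based, the text is padded with a sentinel so that short pieces at the very end are still indexed, a piece is looked up as a prefix of the indexed q-grams, and verification uses a standard edit-distance DP over a small window.

```python
from bisect import bisect_left

def build_qgram_index(text, q):
    """Sorted (q-gram, [0-based positions]) table over a sentinel-padded text."""
    padded = text + "\0" * (q - 1)
    table = {}
    for pos in range(len(text)):
        table.setdefault(padded[pos:pos + q], []).append(pos)
    return sorted(table.items())

def split_pieces(pattern, k):
    """Split pattern into k + 1 disjoint pieces; returns (offset, piece) pairs."""
    m, parts = len(pattern), k + 1
    bounds = [i * m // parts for i in range(parts + 1)]
    return [(bounds[i], pattern[bounds[i]:bounds[i + 1]]) for i in range(parts)]

def piece_positions(index, piece, q):
    """Positions of all q-grams that start with the piece (binary search)."""
    probe = piece[:q]
    keys = [gram for gram, _ in index]
    i = bisect_left(keys, probe)
    found = []
    while i < len(keys) and keys[i].startswith(probe):
        found.extend(index[i][1])
        i += 1
    return found

def min_edit_distance_in(pattern, window):
    """Smallest edit distance between pattern and any substring of window."""
    prev = list(range(len(pattern) + 1))
    best = prev[-1]
    for ch in window:
        cur = [0]  # an occurrence may start at any window position
        for i, pc in enumerate(pattern, 1):
            cur.append(min(prev[i - 1] + (pc != ch), prev[i] + 1, cur[i - 1] + 1))
        best = min(best, cur[-1])
        prev = cur
    return best

def approx_search(text, pattern, k, q):
    """Estimated start positions of occurrences of pattern with at most k errors."""
    index = build_qgram_index(text, q)
    m, hits = len(pattern), set()
    for offset, piece in split_pieces(pattern, k):
        for pos in piece_positions(index, piece, q):
            start = max(0, pos - offset - k)
            end = min(len(text), pos - offset + m + k)
            if min_edit_distance_in(pattern, text[start:end]) <= k:
                hits.add(max(0, pos - offset))
    return sorted(hits)

if __name__ == "__main__":
    print(approx_search("the cat sat on a mat", "cat", k=1, q=2))
```

For P = con and k = 1, `split_pieces` yields the pieces c and on, the same split used in the example slide.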

Q-grams -- Example
• T = [text shown in the slide figure]
• Set q = 3
• Index Table: [table shown in the slide figure]

Q-grams -- Example
• Search for P = con, k = 1 (i.e. allow only one error); split P into k + 1 pieces: P1 = c and P2 = on
• Candidate matches:
  P1 = c : 25, 7, 1, 17, 23
  P2 = on : 10, 18
• Verification (1 error allowed): con ? bat, con ? can, con ? car, con ? com, con ? con, con ? ctc, con ? ioc, con ? omb, con ? ont, con ? tar
• Answer: T[25], T[1], T[17]+T[9], T[17]

Indexing Techniques for Sequence Data
• Q-grams [Navarro ‘98]
• String B-Tree [Ferragina ‘99]
• Multi-D Index [Jagadish ‘00]
• Suffix Tree [Weiner ‘73, McCreight ‘76] [Ukkonen ‘95] [Hunt ‘01, Giegerich ‘03, Tata ‘04]

String B-Tree -- Construction
• Input: a set of words, S = {aid, atom, attenuate, car, patent, zoo, atlas}
Step 1. Store S consecutively on disk.
Step 2. Sort lexicographically each suffix of each word.
Step 3. Create leaf nodes. Each node contains pointers to the sorted suffixes.
Step 4. Propagate LMP and RMP from each node up, until the root is constructed.
Lexicographic order: “aid” : S[1], “ar” : S[21], “as” : S[38], “ate” : S[16], ….., “uate” : S[15], “zoo” : S[31]
Index levels from the slide figure (pointers into S, nodes separated by |):
Root: 1 8 28 31
Level 2: 1 10 20 8 | 28 39 29 31
Level 3: 1 16 25 10 | 20 27 13 8 | 28 7 32 39 | 29 12 36 31
Leaves: 1 21 38 16 | 25 35 5 10 | 20 3 18 27 | 13 2 37 8 | 28 14 33 7 | 32 24 22 39 | 29 17 26 12 | 36 6 11 15 31

String B-Tree -- Construction
• Searching using this index structure is inefficient, because the keys are external and multiple I/Os are required to fetch them.

String B-Tree -- Construction
• Each node is implemented as a modified Patricia Trie.
[figure: Patricia trie for the node with pointers 1, 16, 25, 10, i.e. the keys aid, ate, atent, attenuate]

String B-Tree -- Searching
• Find all occurrences of P = te in S
Start at root: t > n and t < z, so branch right
Child node: t ≥ t and t < z, so branch right
Child node: te ≥ te and te < tl, so branch left
Leaf node: P = te found at: S[17,18], S[26,27], S[12,13]
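
The logical content of the leaf level, and the answer to the query P = te, can be reproduced with a flat sorted list of suffix pointers. This sketch is not a String B-Tree: it has no disk blocks, no internal nodes and no Patricia tries, and it compares full suffix strings in memory. It assumes the words are laid out with one separator position between them, which is consistent with the pointers shown on the slides.

```python
from bisect import bisect_left

WORDS = ["aid", "atom", "attenuate", "car", "patent", "zoo", "atlas"]

def sorted_suffix_pointers(words):
    """Return [(suffix, 1-based pointer)] sorted lexicographically by suffix."""
    entries, start = [], 1
    for word in words:
        for i in range(len(word)):
            entries.append((word[i:], start + i))
        start += len(word) + 1  # assumed: one separator slot after each word
    return sorted(entries)

def find_occurrences(entries, pattern):
    """Pointers of all suffixes that start with pattern (binary search)."""
    keys = [suffix for suffix, _ in entries]
    lo = bisect_left(keys, pattern)
    hi = bisect_left(keys, pattern + "\uffff")  # just past every string with this prefix
    return sorted(pointer for _, pointer in entries[lo:hi])

if __name__ == "__main__":
    entries = sorted_suffix_pointers(WORDS)
    print([pointer for _, pointer in entries])
    print(find_occurrences(entries, "te"))
```

The real structure adds the B-tree levels and Patricia tries on top of this ordering so that the search needs only a few disk accesses.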

Indexing Techniques for Sequence Data
• Q-grams [Navarro ‘98]
• String B-Tree [Ferragina ‘99]
• Multi-D Index [Jagadish ‘00]
• Suffix Tree [Weiner ‘73, McCreight ‘76] [Ukkonen ‘95] [Hunt ‘01, Giegerich ‘03, Tata ‘04]

Multi-D Index -- Construction
• Input: a set of pairs of strings (not necessarily of the same length)
Step 1. Store the pairs of strings (properly separated) consecutively on disk.

Multi-D Index -- Construction
Step 2. Create index leaf nodes, storing pointers to the separating symbols.
Step 3. Construct internal nodes until the root is constructed. R-trees and MBR computation are used for building up the index.
[figure: two minimum bounding rectangles, MBR1 and MBR2]

Multi-D Index -- Construction
• Searching using this index structure is inefficient, because the keys are external and multiple I/Os are required to fetch them.
• At each node, for each dimension, create an ‘Elided Trie’. E-tries are very similar to Patricia Tries.
• For searches, use the E-Tries in a similar manner as the Patricia Tries (during the downward traversal of the index tree).

Multi-D Index -- Construction

Multi-D Index -- Searching
Prefix search: Q1 = (abc*, makk*)
Start at the root E-Tries
repeat {
  x-dim: abc* can only be in the left MBR
  y-dim: makk* can be in both MBRs
  Compute the intersection: examine only the left MBR
  …..
} until a leaf index node is reached
Step k (leaf page) { // compute candidates
  x-dim: string pair @ 0, string pair @ 10
  y-dim: string pair @ 10, string pair @ 20
  Answer to query = the intersection
}
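
The leaf-page step, where candidates are computed per dimension and then intersected, can be sketched as below. The string pairs are hypothetical (the slides show only their offsets 0, 10 and 20), and a linear scan stands in for the E-trie lookups.

```python
# Hypothetical leaf page: offset on disk -> (x string, y string).
LEAF_PAGE = {
    0: ("abcd", "lima"),
    10: ("abce", "makkah"),
    20: ("xyz", "makkum"),
}

def prefix_query(page, x_prefix, y_prefix):
    """Offsets of the pairs matching both prefixes: intersect the per-dimension candidates."""
    x_candidates = {off for off, (x, _) in page.items() if x.startswith(x_prefix)}
    y_candidates = {off for off, (_, y) in page.items() if y.startswith(y_prefix)}
    return sorted(x_candidates & y_candidates)

if __name__ == "__main__":
    print(prefix_query(LEAF_PAGE, "abc", "makk"))
```

With this data the x-dimension yields the pairs at 0 and 10, the y-dimension yields those at 10 and 20, and only the pair at offset 10 satisfies the query.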

Indexing Techniques for Sequence Data
• Q-grams [Navarro ‘98]
• String B-Tree [Ferragina ‘99]
• Multi-D Index [Jagadish ‘00]
• Suffix Tree [Weiner ‘73, McCreight ‘76] [Ukkonen ‘95] [Hunt ‘01, Giegerich ‘03, Tata ‘04]

Suffix Tree [Gusfield ‘97]
• A Suffix Tree for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m.
• Each internal node (except the root) has at least 2 children and each edge is labeled with a nonempty substring of S.
• No 2 edges out of a node can have edge-labels beginning with the same character.
• The key feature of the Suffix Tree is that for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i, i.e., S[i..m].

Suffix Tree
• Input: string S = xabxa; add $ at the end so that no suffix of S is a prefix of another suffix.
[figure: Suffix Tree for S = xabxa$, with leaves numbered 1 to 6]

Suffix Tree -- Searching
• Find all occurrences of P = xa in S
[figure: the Suffix Tree for S = xabxa$ with the search path for P]
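
A small sketch of the definition and of the search: the tree is built naively by inserting one suffix at a time, which takes O(m^2) time and is not one of the linear-time algorithms cited in these slides. The class and function names are illustrative. Matching P from the root and collecting the leaf numbers below the point where P ends gives all occurrences.

```python
class Node:
    def __init__(self):
        self.children = {}  # first character of edge label -> (label, child node)
        self.leaf = None    # 1-based start position of the suffix, for leaves only

def build_suffix_tree(s):
    """Naive construction; s must end with a terminator that occurs nowhere else."""
    root = Node()
    for start in range(len(s)):
        node, rest = root, s[start:]
        while True:
            entry = node.children.get(rest[0])
            if entry is None:                      # no edge starts with rest[0]: new leaf
                leaf = Node()
                leaf.leaf = start + 1
                node.children[rest[0]] = (rest, leaf)
                break
            label, child = entry
            k = 0                                  # length of the common prefix
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):                     # mismatch inside the edge: split it
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[rest[0]] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]

    return root

def find_occurrences(root, pattern):
    """1-based start positions of all occurrences of pattern."""
    node, rest = root, pattern
    while rest:
        entry = node.children.get(rest[0])
        if entry is None:
            return []
        label, child = entry
        n = min(len(label), len(rest))
        if label[:n] != rest[:n]:
            return []
        node, rest = child, rest[n:]
    found, stack = [], [node]
    while stack:                                   # collect every leaf below the match
        current = stack.pop()
        if current.leaf is not None:
            found.append(current.leaf)
        stack.extend(child for _, child in current.children.values())
    return sorted(found)

if __name__ == "__main__":
    tree = build_suffix_tree("xabxa$")
    print(find_occurrences(tree, "xa"))
```

The terminator guarantees that every suffix ends at its own leaf, which is why the tree for xabxa$ has exactly six leaves.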

Generalized Suffix Tree
• An ST can be built for more than one string.
[figure: generalized suffix tree for two strings S1 and S2; each leaf is labeled i,j for the suffix starting at position i of string Sj]

Desired Properties of the Indexing Technique
• Relatively fast construction and a reasonable amount of storage consumption (persistently stored);
• Allows huge sequences to be indexed;
• Supports versatile queries over data;
• Plus: supports bioinformatics applications!

Applicability for Sequence Biological Data

Suffix Trees: A closer look
• Suffix Trees are well known in the biological sequence processing field
• Recent advances in Suffix Tree construction algorithms
• Suffix Trees provide support for answering versatile biological questions

Suffix Tree (ST) Applications
• REPuter [Kurtz ‘99]: The REPuter program family provides state of the art software solutions to compute and visualize repeats in whole genomes or chromosomes.
• MUMmer [Delcher ‘99, ‘02, ‘04]: MUMmer is a system for rapidly aligning entire genomes, whether in complete or draft form. The NUCmer program aligns contigs from a shotgun sequencing project to another set of contigs or a genome.

ST Construction Algorithms History
• [Weiner ‘73] First linear-time algorithm to build a Suffix Tree (called Position Tree).
• [McCreight ‘76] A more space-efficient solution.
• [Ukkonen ‘95] Presents a variation of [McCreight ‘76] that is much easier to understand, to prove bounds for, and to implement.
• All these algorithms are in-memory algorithms. In practice, the sequences to be indexed are too large to fit in memory, and the corresponding ST is ≈ 10x bigger.

Advances in ST Construction Algorithms
• [Hunt ‘01] Abandons the use of suffix links (the algorithm is no longer linear) and presents the idea of partitioning to reduce the number of disk I/Os.
• [Giegerich ‘03] Proposes a space-efficient representation of the ST.
• [Tata ‘04] Extends ideas in [Hunt ‘01] and [Giegerich ‘03] and focuses on the development of an efficient buffering strategy.
• [Tata ‘04] builds an ST on the entire human genome (approx. 3 Gbp) in 30 hours, using a single-processor machine; even for the in-memory case, [Tata ‘04], which is O(m^2), performs better than [Ukkonen ‘95], which is O(m).

Versatile Biological Support by ST
• Exact search (with or without wild cards)
• Approximate search
• [Longest] Common substring/subsequence of 2 (or more) strings
• Recognizing DNA contamination
• Alignment
• [Shortest] Superstring of 2 (or more) strings
• Shotgun sequencing and sequence assembly
• Finding repeats in a single sequence
• Compressing DNA strings to study the information content of a string or to discriminate between exons and introns in eukaryotic DNA
• ….
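
As one concrete instance from this list, the longest common substring of two strings can be found by indexing both strings together. The sketch below uses a naively sorted suffix array of the concatenated strings in place of a generalized suffix tree; this is a stand-in chosen for brevity, not the method of the cited papers. The two separator characters are assumed not to occur in the inputs.

```python
SEP_A, SEP_B = "\x00", "\x01"  # assumed absent from both input strings

def longest_common_substring(a: str, b: str) -> str:
    """Longest string that occurs in both a and b (empty string if none)."""
    text = a + SEP_A + b + SEP_B
    # Naive suffix array: sort suffix start positions by the suffix they denote.
    order = sorted(range(len(text)), key=lambda i: text[i:])
    best = ""
    for x, y in zip(order, order[1:]):
        if (x < len(a)) == (y < len(a)):
            continue  # both suffixes come from the same input string
        k = 0
        while text[x + k] == text[y + k] and text[x + k] not in (SEP_A, SEP_B):
            k += 1
        if k > len(best):
            best = text[x:x + k]
    return best

if __name__ == "__main__":
    print(longest_common_substring("GATTACA", "TTACGG"))
```

Only neighbouring suffixes need to be compared, because the pair of suffixes from different strings with the longest common prefix always has such a pair between them in sorted order.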

Suffix Tree Representations
• Suffix Array [Manber ‘93, Myers ‘94, Baeza-Yates ‘00]
• LC-tries [Anderson ‘95]
• Suffix Binary Search Tree [Irving ‘03]
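
Of these representations, the suffix array is the simplest to sketch: it is the list of suffix start positions in lexicographic order, and a pattern is located by binary search. The construction below simply sorts the suffixes and is meant only to illustrate the idea; the function names are illustrative.

```python
def build_suffix_array(s: str):
    """0-based start positions of all suffixes of s, in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def find_occurrences(s: str, sa, pattern: str):
    """1-based start positions of pattern in s, via two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                        # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:                        # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(i + 1 for i in sa[first:lo])

if __name__ == "__main__":
    s = "xabxa$"
    sa = build_suffix_array(s)
    print(sa)
    print(find_occurrences(s, sa, "xa"))
```

The array stores one integer per suffix and no tree nodes, which is the space advantage over a suffix tree.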

Conclusion
• BLAST Case Study
• Observations on existing searching techniques
• Alternative indexing techniques for sequence data and their possible application for biological sequence data
• Suffix Trees

Future Work
Suffix Tree Construction
• Further improvements of the [Tata ‘04] algorithm (time/space)
• Combining two (or more) Suffix Trees
• Suffix Tree maintenance
Suffix Tree Usage
• Most of the widely known ST-based algorithms rely on suffix links. How will the algorithms that use STs change in the absence of suffix links?
• Potential of ST for mining biodata
Alternative Index Data Structures
“Families of reiterated sequences account for about one third of the human genome.” [McConkey ‘93]

References
[Altschul ‘90] S.F. Altschul et al. “Basic local alignment search tool”. J. Mol. Biol., 215:403-10, 1990.
[Altschul ‘97] S.F. Altschul, T.L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman. “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs”. Nucleic Acids Research, 25:3389-3402, 1997.
[Anderson ‘95] A. Andersson and S. Nilsson. “Efficient implementation of suffix trees”. Softw. Pract. Exp., 25(2):129-141, 1995.
[Baeza-Yates ‘00] R. Baeza-Yates and G. Navarro. “A Hybrid Indexing Method for Approximate String Matching”. Journal of Discrete Algorithms, 2000.
[Delcher ‘99] A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg. “Alignment of Whole Genomes”. Nucleic Acids Research, 27:2369-2376, 1999.
[Ferragina ‘99] P. Ferragina and R. Grossi. “The string B-tree: a new data structure for string search in external memory and its applications”. Journal of the ACM, 46(2):236-280, 1999.
[Giegerich ‘03] R. Giegerich, S. Kurtz, and J. Stoye. “Efficient implementation of lazy suffix trees”. Softw. Pract. Exper., 33:1035-1049, 2003.
[Gusfield ‘97] D. Gusfield. “Algorithms on strings, trees and sequences: computer science and computational biology”. Cambridge University Press, 1997.
[Hunt ‘01] E. Hunt, M.P. Atkinson, and R.W. Irving. “A Database Index to Large Biological Sequences”. In VLDB J., 7(3):139-148, 2001.
[Irving ‘03] R.W. Irving and L. Love. “The Suffix Binary Search Tree and Suffix AVL Tree”. Journal of Discrete Algorithms, 1:387-408, 2003.
[Jagadish ‘00] H.V. Jagadish, N. Koudas, and D. Srivastava. “On effective multi-dimensional indexing for strings”. In ACM SIGMOD Conference on Management of Data, pages 403-414, 2000.
[Kurtz ‘99] S. Kurtz and C. Schleiermacher. “REPuter: fast computation of maximal repeats in complete genomes”. Bioinformatics, pages 426-427, 1999.
[Manber ‘93] U. Manber and G. Myers. “Suffix arrays: a new method for on-line string searches”. SIAM J. Comput., 22(5):935-948, 1993.
[McConkey ‘93] E. McConkey. “Human Genetics: The Molecular Revolution”. Jones and Bartlett, Boston, MA, 1993.
[McCreight ‘76] E.M. McCreight. “A Space-economical Suffix Tree Construction Algorithm”. J. ACM, 23(2):262-272, 1976.
[Myers ‘94] E.W. Myers. “A sublinear algorithm for approximate key word searching”. Algorithmica, 12(4/5):345-374, 1994.
[Navarro ‘98] G. Navarro and R. Baeza-Yates. “A practical q-gram index for text retrieval allowing errors”. CLEI Electronic Journal, 1(2), 1998.
[Navarro ‘00a] G. Navarro. “A Guided Tour to Approximate String Matching”. ACM Computing Surveys, 33(1):31-88, 2000.
[Navarro ‘00b] G. Navarro, E. Sutinen, J. Tanninen, and J. Tarhio. “Indexing Text with Approximate q-grams”. In CPM2000, LNCS 1848, pages 350-365, 2000.
[Tata ‘04] S. Tata, R.A. Hankins, and J. Patel. “Practical Suffix Tree Construction”. In Proc. of the 30th VLDB, 2004.
[Thompson ‘94] J.D. Thompson, D.G. Higgins, and T.J. Gibson. “CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice”. Nucleic Acids Research, 22(22):4673-4680, 1994.
[Ukkonen ‘95] E. Ukkonen. “On-line construction of suffix-trees”. Algorithmica, 14:249-260, 1995.
[Weiner ‘73] P. Weiner. “Linear Pattern Matching Algorithms”. In Proc. of the 14th Annual Symposium on Switching and Automata Theory, 1973.
