Introduction to Bioinformatics: Lecture III Genome Assembly and String Matching

Jarek Meller, Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC
JM - http://folding.chmcc.org

Outline of the lecture
• Physical mapping problem and the resulting computational challenges
• Ordering clone libraries: from the consecutive ones problem to global optimization methods
• Applications of exact string matching methods
• Towards the shortest superstring problem and the shotgun assembly problem

Literature watch

Aloy et al., “Structure-Based Assembly of Protein Complexes in Yeast”, Science 303. A way of getting acquainted with protein pathways and their intersection with structural studies.

Assembling physical maps of a genome

[Figure: markers placed along the DNA]

Physical mapping problem: create a set of markers (e.g. stretches of DNA that hybridize to a given probe) and locate them in the genome of interest. With a sufficiently dense and ordered set of markers, any newly sequenced DNA fragment that is long enough to cover at least one marker can be mapped to a rough location on the genome. One of the early goals of the Human Genome Project was to select and map a set of STS markers such that there would be at least one STS in each 100 kb stretch of the genome.

Physical mapping and the problem of ordering clone libraries with STS markers

[Figure: STSs 1-5 along the DNA, with clones 1-4]

Definition: A clone library is a set of short DNA fragments, called clones, that originated in a stretch of the studied DNA.
Definition: A sequence tagged site (STS) is a DNA substring that occurs only once in the DNA of interest. One may think of STSs as a set of indices to which new DNA sequences can be referenced.
Problem: What is the minimum length of the STSs that could (at least in principle) provide the requested coverage for the human genome?
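One way to start on the problem above is a pure counting argument, sketched here in Python. The genome size of 3 x 10^9 bp is an assumption (it is not stated on the slides), and the helper name `min_length_for` is hypothetical. The bound ignores repeats and the non-random composition of real DNA, so it is only a lower limit on the STS length, not a practical recommendation.

```python
def min_length_for(count: int, alphabet_size: int = 4) -> int:
    """Smallest L such that alphabet_size**L >= count (counting bound only)."""
    length = 0
    while alphabet_size ** length < count:
        length += 1
    return length

GENOME_SIZE = 3_000_000_000  # assumed human genome size in bp (not from the slides)
SPACING = 100_000            # one STS per 100 kb, as stated in the lecture

# Number of markers needed for the requested coverage.
markers_needed = GENOME_SIZE // SPACING

# Shortest length that gives each marker a distinct sequence.
distinct_labels = min_length_for(markers_needed)

# Shortest length for which there are at least as many possible words
# as positions in the genome, a rough requirement for single occurrence.
unique_in_genome = min_length_for(GENOME_SIZE)

print(markers_needed, distinct_labels, unique_in_genome)
```

The two bounds differ because distinct labels only have to differ from each other, whereas an STS must also avoid matching anywhere else in the genome.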

The problem of ordering clone libraries with STS markers can be cast (and solved) as the consecutive ones problem

[Figure: STSs 1-5 along the DNA, with clones 1-4]

The true locations of the STSs and clones are not known. However, for each clone the list of STSs hybridizing to it is given. Our task is to reconstruct the original order of the STSs (and thus order the clone library) given this data. Assuming that the STS probes are unique and that there are no hybridization errors, the problem can be cast as the consecutive ones problem and solved efficiently using CS techniques (PQ-tree algorithm, Booth and Lueker, 1976).

The consecutive ones problem and its solution

For a binary hybridization matrix, find a permutation of its columns such that in each row all ones are located in a block of consecutive entries.

[Figure: STSs 1-5 and clones 1-4 along the DNA, with the corresponding hybridization matrix (rows: clones, columns: STSs)]
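The definition can be illustrated with a brute-force search over column permutations. This is a minimal sketch, not the PQ-tree algorithm: it tries all orderings and is feasible only for a handful of probes. The matrix below is hypothetical data, and the function names are chosen for illustration.

```python
from itertools import permutations

def ones_are_consecutive(row):
    """True if all 1s in the row form a single contiguous block."""
    positions = [i for i, value in enumerate(row) if value == 1]
    return not positions or positions[-1] - positions[0] + 1 == len(positions)

def find_c1p_order(matrix):
    """Return a column order giving every row consecutive ones, or None."""
    n_cols = len(matrix[0])
    for order in permutations(range(n_cols)):
        if all(ones_are_consecutive([row[c] for c in order]) for row in matrix):
            return order
    return None

# Hypothetical data: rows are clones, columns are STS probes in scrambled order.
hybridization = [
    [1, 0, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
]

order = find_c1p_order(hybridization)
print("column order:", order)
for row in hybridization:
    print([row[c] for c in order])
```

Note that a valid order is generally not unique: reversing it, for example, gives another valid order.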

Fortunately errors make life more interesting …

In the presence of experimental errors, the problem becomes a global optimization problem (see Pevzner, Chapter 3).

[Figure: STSs 1-5 and clones 1-4 along the DNA, with the corresponding hybridization matrix (rows: clones, columns: STSs)]

Heuristic solutions may still provide good probe ordering

The number of “gaps” (blocks of zeros in rows) in the hybridization matrix may be used as a cost function, since hybridization errors typically split blocks of ones (false negatives) or split a gap into two gaps (false positives).

The problem of finding a permutation that minimizes the number of gaps can be cast as a Traveling Salesman Problem (TSP), in which the cities are the columns of the hybridization matrix (plus an additional column of zeros) and the distance between two cities is the number of positions in which the two columns differ (Hamming distance).

Thus, an efficient algorithm is unlikely in the general case (unless P=NP), and heuristic solutions are sought that provide a good probe ordering, at least in most cases (e.g. Alizadeh et al., 1995).

Problem: Does the correct order of the STSs in the example from the previous slide give the shortest cycle for the corresponding TSP?
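The TSP formulation can be sketched on a small example. The matrix below is hypothetical, and exhaustive search stands in for a real TSP heuristic, so this only works for a few probes. The sketch also checks a useful identity: because the tour starts and ends at the all-zero column, its length equals twice the total number of blocks of ones, so a shorter tour means less fragmented rows.

```python
from itertools import permutations

def hamming(col_a, col_b):
    """Number of positions in which two columns differ."""
    return sum(a != b for a, b in zip(col_a, col_b))

def blocks_of_ones(row):
    """Number of maximal runs of 1s in a row."""
    return sum(1 for i, v in enumerate(row) if v == 1 and (i == 0 or row[i - 1] == 0))

def reorder(matrix, order):
    """Apply a column permutation to every row."""
    return [[row[c] for c in order] for row in matrix]

def tour_length(matrix, order):
    """Length of the cycle: zero column -> columns in `order` -> zero column."""
    zero = [0] * len(matrix)
    cols = [zero] + [list(col) for col in zip(*reorder(matrix, order))] + [zero]
    return sum(hamming(a, b) for a, b in zip(cols, cols[1:]))

def best_order(matrix):
    """Exhaustive search over column orders; feasible only for a few probes."""
    return min(permutations(range(len(matrix[0]))),
               key=lambda order: tour_length(matrix, order))

# Hypothetical noisy data: rows are clones, columns are STS probes.
noisy = [
    [1, 0, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],  # this row rules out a perfect consecutive-ones order
]

order = best_order(noisy)
total_blocks = sum(blocks_of_ones(row) for row in reorder(noisy, order))
print("order:", order, "tour length:", tour_length(noisy, order),
      "blocks of ones:", total_blocks)
```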

Map location of anonymous DNA as a string matching problem

A sufficiently long string of anonymous yet sequenced DNA can be placed on the physical map by finding which STSs are contained in this sequence. Due to the size of the problem, efficiency is very important: millions of STSs are available at present, and their total length is typically much larger than the length of the DNA sequence to be mapped.

Assuming no sequencing errors, the problem can be cast as the exact set matching problem and solved efficiently using, for example, suffix trees. Generalized suffix trees or inexact string matching methods need to be used when some errors are allowed.
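As a small illustration of exact set matching, the sketch below stores the STSs in a keyword trie and scans the anonymous sequence from every start position. This is a simpler stand-in for the suffix-tree approach named above: its running time grows with the sequence length times the longest STS, rather than linearly. The STS names and sequences are hypothetical.

```python
def build_keyword_trie(patterns):
    """Nested-dict trie; the key None marks the end of a pattern and stores its name."""
    root = {}
    for name, pattern in patterns.items():
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node[None] = name
    return root

def find_sts_hits(sequence, trie):
    """Return (start, name) pairs for every pattern occurrence, by start position."""
    hits = []
    for start in range(len(sequence)):
        node = trie
        pos = start
        # Walk down the trie for as long as the sequence keeps matching.
        while pos < len(sequence) and sequence[pos] in node:
            node = node[sequence[pos]]
            pos += 1
            if None in node:
                hits.append((start, node[None]))
    return hits

# Hypothetical STS set and anonymous fragment (0-based positions are reported).
sts_set = {"STS_A": "GATTACA", "STS_B": "CCGT", "STS_C": "TTTT"}
fragment = "AAGATTACAGGCCGTA"

hits = find_sts_hits(fragment, build_keyword_trie(sts_set))
print(hits)
```

The STSs found, together with their known map positions, would then give the rough location of the fragment.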

Strings, sequences and string operations

String exact matching problem

Solving the exact matching problem: conceptual simplicity vs. computational complexity

Computationally efficient and elegant solutions

The idea of the suffix tree method

A string with m characters has m suffixes, which can be represented as m leaves of a rooted directed tree. Consider for example T = cabca.

[Figure: suffix tree for T = cabca with the terminal character $; leaves are numbered 1-5 by the start position of the suffix]

For simplicity, one leaf, due to the terminal character $, is not included.

Problem: What is the reason for adding the terminal character?
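A short experiment may help with the problem above. The sketch lists the suffixes of T = cabca and looks for suffixes that are proper prefixes of other suffixes; such a suffix would end in the middle of a path rather than at a leaf of its own. The helper names are hypothetical, and positions are 1-based as in the figure.

```python
def suffixes(text):
    """Map each 1-based start position to the suffix beginning there."""
    return {i + 1: text[i:] for i in range(len(text))}

def prefix_pairs(text):
    """Pairs (i, j) where the suffix at i is a proper prefix of the suffix at j."""
    sufs = suffixes(text)
    return [(i, j) for i, s in sufs.items() for j, t in sufs.items()
            if i != j and t.startswith(s)]

without_terminal = prefix_pairs("cabca")
with_terminal = prefix_pairs("cabca$")

print(suffixes("cabca"))
print("without $:", without_terminal)
print("with $:   ", with_terminal)
```

Without the terminal character, the suffixes ca and a are prefixes of cabca and abca; with $ appended, no suffix is a prefix of another, so every suffix can end at its own leaf.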

Why does it work?

A substring of a string is a prefix of a suffix of that string. For example, the substring P = ab is a prefix of the suffix abca in T = cabca. Thus, if P occurs in T, there is a leaf in the suffix tree whose path label (read from the root) starts with P.

[Figure: suffix tree for T = cabca, as on the previous slide]

Problem: As a related problem, consider motif search as implemented in PROSITE. Explain how the finite automata formalism is used for motif search.
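The argument above can be tried out with an uncompressed suffix trie, sketched below for T = cabca. This is not a true suffix tree: it stores one node per character, so it needs quadratic space in the worst case, whereas a suffix tree compresses unbranched paths. The search logic is the same, though: follow P from the root, then collect the leaves below. Function names are hypothetical.

```python
def build_suffix_trie(text, terminal="$"):
    """Insert every suffix of text + terminal; key None stores the 1-based start."""
    text += terminal
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[None] = start + 1
    return root

def occurrences(trie, pattern):
    """Sorted 1-based start positions of pattern in the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []          # P is not a prefix of any suffix
        node = node[ch]
    found, stack = [], [node]
    while stack:               # every leaf below this node is an occurrence
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)

trie = build_suffix_trie("cabca")
print(occurrences(trie, "ab"))
print(occurrences(trie, "ca"))
print(occurrences(trie, "cc"))
```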

General idea: ordered fingerprints and the notion of closeness between DNA fragments

Hierarchical sequencing: physical maps, clone libraries and shotgun

Definition: The algorithmic problem of shotgun sequence assembly is to deduce the sequence of the DNA string from a set of sequenced and partially overlapping short substrings derived from that string.

Analogy to physical map assembly: the DNA sequence of a substring may be viewed as a precise ordered fingerprint (in analogy to STSs), and the suffix-prefix match determines whether two substrings would be assembled together.

In general, the shortest superstring problem (find the shortest string that contains each string from a given set of strings as a substring) is NP-hard, and heuristics are being developed to address the problem.
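One simple heuristic of this kind is greedy merging: repeatedly join the two fragments with the longest suffix-prefix overlap. The sketch below is an illustration under that assumption only. It is not guaranteed to find the shortest superstring, it ignores sequencing errors and reverse complements, and it is not claimed to be what any particular assembler does. The reads are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(fragments):
    """Greedy heuristic: merge the pair with the largest overlap until one string is left."""
    frags = list(dict.fromkeys(fragments))  # drop duplicates, keep order
    # Drop fragments that are already contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""

# Hypothetical error-free reads.
reads = ["ATTAGAC", "GACCTG", "CTGCA", "TAGACC"]
assembled = greedy_superstring(reads)
print(assembled, len(assembled))
```

The pairwise overlap computation here is quadratic in the number of reads; finding suffix-prefix matches efficiently is one place where the string matching methods from earlier in the lecture come in.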

Get the relevant sequences to compare them: conservation and differences

Problem → Algorithms → Programs
Sequencing → Fragment assembly problem → The Shortest Superstring Problem → Phrap (Green, 1994)
Gene finding → Hidden Markov Models, pattern recognition methods → GenScan (Burge & Karlin, 1997)
Sequence comparison → pairwise and multiple sequence alignments → dynamic programming, heuristic methods → BLAST (Altschul et al., 1990)
