1 / 26

Exhaustive search

Exhaustive search. CS 498 SS Saurabh Sinha. To define a motif, lets say we know where the motif starts in the sequence The motif start positions in their sequences can be represented as s = ( s 1 , s 2 , s 3 ,…, s t ). A new motif model. Given s = (s 1 , … s t ) and DNA

sherine
Download Presentation

Exhaustive search

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Exhaustive search CS 498 SS Saurabh Sinha

  2. To define a motif, lets say we know where the motif starts in the sequence The motif start positions in their sequences can be represented as s = (s1,s2,s3,…,st) A new motif model

  3. Given s = (s1, … st)and DNA Construct profile matrix Score(s, DNA) = a G g t a c T t C c A t a c g t a c g t T A g t a c g t C c A t C c g t a c g G _________________ A3 0 1 0 31 1 0 C24 0 0 14 0 0 G 0 14 0 0 0 31 T 0 0 0 51 0 14 _________________ Consensus a c g t a c g t Score3+4+4+5+3+4+3+4=30 Scoring Motifs l t

  4. Good profile matrices • Goal is to find the starting positions s=(s1,…st)to maximize the Score(s, DNA) of the resulting profile matrix • This is the “motif finding problem”

  5. A different formulation • Hamming distance between two strings v and w is dH(v,w) = number of mismatches between v and w • Given an array of starting positions s=(s1,…st), we define dH(v, s) = ∑idH(v,si) • Define: TotalDist(v, DNA) = mins dH(v,s) • Computing TotalDist is easy • find closest string to v in each input sequence

  6. The median string problem • Find v that minimizes TotalDist(v) • A double minimization (mins, minv) • Equivalent to motif finding problem • Show this

  7. Naïve time complexity • Motif finding problem: Consider every (s1,…st): O((n-l+1)t) • Median string problem: Consider every l-mer: O(4l). Relatively fast ! • Common form of both expressions: Find a vector of L variables, each variable can take k values: O(kL)

  8. An algorithm to enumerate ! • Want to generate all strings in {1,2,3,4}L NEXTLEAF(a, L, k) for i := L to 1 if ai < k ai := ai+1 return a ai := 1 return a ALLLEAVES(L, k) a := (1,..,1) while true output a a := NEXTLEAF(a,L,k) if a = (1,..,1) return 11…11 11…12 11…13 11…14 . . 44…44 Increment the least significant digit; and “carry over” to next position if necessary

  9. “Seach Tree” for enumeration -- Order of steps 4 1 2 3 1- 2- 3- 4- 1 4 2 3 11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

  10. Visiting all vertices in tree • Not just the leaves, but the internal nodes also • PreOrder traversal of a tree • PreOrder(node): • Visit node • PreOrder(left child of node) • PreOrder(right child of node) • This is a recursive method • Can rewrite without recursion

  11. Visit the Next Vertex • NextVertex(a,i,L,k) // a : the array of digits • ifi < L // i : prefix length • a i+1 1 // L: max length • return ( a,i+1) // k : max digit value • else • forjL to 1 • ifaj < k • ajaj +1 • return( a,j ) • return(a,0)

  12. In words • If at an internal node, just go one level deeper. • If at a leaf node, • go to next leaf • if moved to a non-sibling in this process, jump up

  13. Bypassing • What if we wish to skip an entire subtree (at some internal node) during the tree traversal ? BYPASS(a, i, L, k) for j := i to 1 if aj < k aj := aj+1 return (a,j) return (a,0)

  14. Brute Force Solution for the Motif finding problem • BruteForceMotifSearchAgain(DNA, t, n, l) • s  (1,1,…, 1) • bestScoreScore(s,DNA) • while forever • s NextLeaf (s, t, n-l+1) • if (Score(s,DNA) > bestScore) • bestScore Score(s, DNA) • bestMotif  (s1,s2 , . . . , st) • returnbestMotif O(l(n-l+1)t)

  15. Can We Do Better? • Sets of s=(s1, s2, …,st) may have a weak profile for the first i positions (s1, s2, …,si) • Every row of alignment may add at most l to Score • Optimism: if all subsequent (t-i) positions (si+1, …st) add (t – i ) * ltoScore(s,i,DNA) • If Score(s,i,DNA) + (t – i) * l < BestScore, it makes no sense to search in vertices of the current subtree • Use ByPass() • “Branch and bound” strategy • This saves us from looking at (n – l + 1)t-ileaves

  16. Pseudocode for Branch and Bound Motif Search • BranchAndBoundMotifSearch(DNA,t,n,l) • s (1,…,1) • bestScore0 • i 1 • whilei > 0 • ifi < t • optimisticScoreScore(s,i, DNA) +(t – i ) * l • ifoptimisticScore< bestScore • (s, i) Bypass(s,i, n-l +1) • else • (s, i) NextVertex(s,i, n-l+1) • else • ifScore(s,DNA) > bestScore • bestScore Score(s) • bestMotif (s1, s2, s3, …, st) • (s,i) NextVertex(s,i,t,n-l+ 1) • return bestMotif

  17. The median string problem • Enumerate 4l strings v • For each v, compute TotalDist(v, DNA) • This requires linear scan of DNA, i.e., O(nt) • Overall: O(nt4l) • Improvement by branch and bound ? • During enumeration of l-mers, suppose we are at some prefix v’, and find that TotalDist(v’,DNA) > BestDistanceSoFar. • Why enumerate further ?

  18. BranchAndBoundMedianStringSearch(DNA,t,n,l ) s (1,…,1) bestDistance ∞ i 1 whilei > 0 ifi < l prefix string corresponding to the first i nucleotides of s optimisticDistance TotalDistance(prefix,DNA) ifoptimisticDistance >bestDistance (s, i )  Bypass(s,i, l, 4) else (s, i ) NextVertex(s,i, l, 4) else word nucleotide string corresponding to s if TotalDistance(s,DNA) < bestDistance bestDistanceTotalDistance(word, DNA) bestWordword (s,i )  NextVertex(s,i,l, 4) return bestWord Bounded Median String Search

  19. Greedy Algorithms

  20. A greedy approach to the motif finding problem • Given t sequences of length n each, to find a profile matrix of length l. • Enumerative approach O(l nt) • Impractical • Instead consider a more practical algorithm called “GREEDYMOTIFSEARCH”

  21. Greedy Motif Search • Find two closest l-mers in sequences 1 and 2 and form 2 x lalignment matrix with Score(s,2,DNA) • At each of the following t-2 iterations, finds a “best” l-mer in sequence i from the perspective of the already constructed (i-1) x l alignment matrix for the first (i-1) sequences • In other words, it finds an l-mer in sequence i maximizing Score(s,i,DNA) under the assumption that the first (i-1)l-mers have been already chosen • Sacrifices optimal solution for speed: in fact the bulk of the time is actually spent locating the first 2 l-mers

  22. Greedy Motif Search pseudocode • GREEDYMOTIFSEARCH (DNA, t, n, l) • bestMotif := (1,…,1) • s := (1,…,1) • for s1=1 to n-l+1 for s2 = 1 to n-l+1 if (Score(s,2,DNA) > Score(bestMotif,2,DNA) bestMotif1 := s1 bestMotif2 := s2 • s1 := bestMotif1; s2 := bestMotif2 • for i = 3 to t for si = 1 to n-l+1 if (Score(s,i,DNA) > Score(bestMotif,i,DNA) bestMotifi := si si := bestMotifi • Return bestMotif

  23. A digression • Score of a profile matrix looks only at the “majority” base in each column, not at the entire distribution • The issue of non-uniform “background” frequencies of bases in the genome • A better “score” of a profile matrix ?

  24. Information Content • First convert a “profile matrix” to a “position weight matrix” or PWM • Convert frequencies to probabilities • PWM W: Wk = frequency of base  at position k • q = frequency of base  by chance • Information content of W:

  25. Information Content • If Wk is always equal to q, i.e., if W is similar to random sequence, information content of W is 0. • If W is different from q, information content is high.

  26. Greedy Motif Search • Can be trivially modified to use “Information Content” as the score • At each step, instead of choosing the top (1) partial motif, keep the top k partial motifs • “Beam search” • Use statistical criteria to evaluate significance of Information Content • The program “CONSENSUS” from Stormo lab.

More Related