
Comp. Genomics

Comp. Genomics. Recitation 14 Exam preparation Biological networks. Exercise. A large PPI network G was generated using high throughput technologies. A smaller network H is known in a different organism.

olesia


Presentation Transcript


  1. Comp. Genomics Recitation 14 Exam preparation Biological networks

  2. Exercise • A large PPI network G was generated using high-throughput technologies. • A smaller network H is known in a different organism. • Assume that there exists an efficient algorithm that determines whether G contains a sub-network of size ≥ k that is isomorphic to H.

  3. Exercise • Two graphs (PPI networks) are said to be isomorphic if there is a bijection f between their vertex sets such that f(u) is adjacent to f(v) iff u is adjacent to v.

  4. Exercise • Show that the same algorithm can solve the following problem in polynomial time: • CLIQUE: Given a graph G’ and an integer k, is there a clique of size ≥ k in G’?

  5. Solution • Given a graph G’ and a number k, we create another graph H’ with k vertices in which there is an edge between every two vertices (a complete graph). • This takes polynomial time. • We run the original algorithm on (G’,H’) and return its answer.
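The reduction on this slide can be sketched in a few lines of Python. The efficient sub-network algorithm is only assumed to exist, so `induced_subgraph_oracle` below is a hypothetical brute-force stand-in (exponential time) used purely to make the sketch runnable; the function names and the edge representation (a set of `frozenset` pairs) are illustrative choices, not part of the original exercise. Since any clique of size ≥ k contains a clique of size exactly k, asking for a copy of the complete graph on k vertices is enough.

```python
from itertools import combinations, permutations


def induced_subgraph_oracle(g_vertices, g_edges, h_vertices, h_edges):
    """Hypothetical stand-in for the assumed efficient algorithm.

    Brute force: tries every injective map f from H into G and checks
    that f(u)~f(v) in G iff u~v in H.
    """
    h = list(h_vertices)
    for image in permutations(g_vertices, len(h)):
        f = dict(zip(h, image))
        if all(
            (frozenset((f[u], f[v])) in g_edges) == (frozenset((u, v)) in h_edges)
            for u, v in combinations(h, 2)
        ):
            return True
    return False


def has_clique(g_vertices, g_edges, k, oracle=induced_subgraph_oracle):
    """CLIQUE via the oracle: build the complete graph H' on k vertices."""
    h_vertices = range(k)
    h_edges = {frozenset(e) for e in combinations(h_vertices, 2)}  # O(k^2) edges
    return oracle(g_vertices, g_edges, h_vertices, h_edges)


# Usage: a triangle {1, 2, 3} with a pendant vertex 4.
vertices = [1, 2, 3, 4]
edges = {frozenset(e) for e in [(1, 2), (2, 3), (1, 3), (3, 4)]}
print(has_clique(vertices, edges, 3), has_clique(vertices, edges, 4))
```

Building H’ costs O(k²) time, so the reduction itself is polynomial; all remaining work is delegated to the oracle.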

  6. Exercise • Show that the algorithm from the previous question can also solve the following problem: • Input: • A set of elements X={x1,x2,…,xn} • A distance function d(xi,xj)=1 if xi and xj are “close”, and 0 otherwise • Output: Can the set be divided into at most k clusters such that all the element pairs in every cluster are close?

  7. Solution • Build a graph G in which an edge (xi,xj) means d(xi,xj)=1 • Use the previous algorithm to find a clique of maximal size (decision problem → optimization problem) • Find the clique and remove it from the graph • Repeat at most k times. If the result is the empty graph, answer ‘Yes’. Otherwise answer ‘No’.

  8. Moed B 26.2.2010 You are given a set of strings S1,S2,…,Sk of length C each, and each string Si is associated with a positive score Bi. Si appears in an alignment if there is a sequence of gapless matches in the alignment that contains Si. We subtract Bi from the score of an alignment for every appearance of Si, including overlapping appearances. Describe a global alignment algorithm.

  9. Question True or false: The following algorithm is a global alignment algorithm for the problem: For every cell [i,j] in the DP matrix we save the number of consecutive matches that the optimal alignment between x1,…,xi and y1,…,yj has made since the last gap. If this value is ≥ C, we check every Si and subtract Bi as needed.

  10. Solution The suggested algorithm does not work. Counter-example: S[G,G]=10, S[A,A]=1, indel=-1, S1=AAAG, B1=-100

  11. Solution

  12. Solution Alignment found (score 1): AAA _G / A AAG_ • Optimal alignment (score 10): _AAAG / A_AAG

  13. Question • True or false: The algorithm that worked for positive bonuses will work here too: Add terms of the following form to the recursive update rule: -I_Si·Bi + ∑_{k=0..3} S[i-k,j-k], where I_Si is 1 if the nucleotides i-3,…,i and j-3,…,j are the seed Si and ∞ otherwise. The last component is the normal score for matching 4 nucleotides.

  14. Solution • It will not work here. • Since -I_Si·Bi ≤ 0, and since the option of four consecutive matches without this term is also considered, the algorithm will never use the new update rule. • The score that the algorithm computes will not be consistent with the scoring scheme.

  15. Question What is the correct algorithm? Divide every cell of the DP matrix into C+1 cells. The cell M[i,j,k] holds the score of the optimal alignment between x1,…,xi and y1,…,yj that ends with k consecutive matches.

  16. Solution

  17. Solution Correctness: Assume that we have the correct values for all cells M[i’,j’,k’] that precede M[i,j,k] and we want to compute the score at the cell M[i,j,k]. If k<C, then we are not creating a sequence of C matches, and therefore by the inductive assumption and the defined operations M[i,j,k] will contain the optimal score.

  18. Solution If k≥C and the last C characters are not in {Si}, we are done for the same reasons. If k≥C and the last C characters are in {Si}, then there are two options. First option: the optimal alignment contains the seed. Since we are checking the cells M[i-1,j-1,k-1] and M[i-1,j-1,k], we will obtain the score of the optimal alignment.

  19. Solution If k≥C and the last C characters are in {Si}, then the other option is: the seed is not in the optimal alignment. Since the alignment between [i-C+1,…,i] and [j-C+1,…,j] does not contain C consecutive matches but ends with a match, its prefix, which aligns [i-C+1,…,i-1] with [j-C+1,…,j-1], must end with 0,1,…,C-2 matches. Hence the optimal alignment between [1,…,i-1] and [1,…,j-1] ends with 0,1,…,C-2 matches. By the inductive assumption, we have the optimal score for all these alignments, and the update rule tests them all.
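The update rule itself (slide 16) did not survive in this transcript, so the following Python sketch is one possible reading of the M[i,j,k] table, written under stated assumptions rather than a reproduction of the original slide. Assumptions: a "match" is a column with two identical characters; k = C stands for "C or more consecutive matches"; `seeds` maps each Si to a positive penalty Bi that is subtracted; the function name, the `score` callback, and the mismatch score used in the usage example are all hypothetical.

```python
NEG_INF = float("-inf")


def align_with_seed_penalties(x, y, score, indel, seeds, C):
    """Global alignment score where every seed occurrence inside a run of
    gapless matches costs seeds[seed]. Assumes C >= 1 and len(seed) == C."""
    n, m = len(x), len(y)
    # M[i][j][k]: best score for x[:i] vs y[:j] ending with a run of k
    # identical-character matches (k == C means "at least C").
    M = [[[NEG_INF] * (C + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    M[0][0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            cell = M[i][j]
            if i > 0:  # gap in y resets the run
                cell[0] = max(cell[0], max(M[i - 1][j]) + indel)
            if j > 0:  # gap in x resets the run
                cell[0] = max(cell[0], max(M[i][j - 1]) + indel)
            if i > 0 and j > 0:
                prev = M[i - 1][j - 1]
                s = score(x[i - 1], y[j - 1])
                if x[i - 1] != y[j - 1]:  # mismatch resets the run
                    cell[0] = max(cell[0], max(prev) + s)
                else:
                    for k in range(1, C):  # run grows from k-1 to k
                        cell[k] = prev[k - 1] + s
                    # run reaches or stays at >= C: charge a seed ending here
                    window = x[i - C:i] if i >= C else ""
                    cell[C] = max(prev[C - 1], prev[C]) + s - seeds.get(window, 0)
    return max(M[n][m])


# Usage with the counter-example values (mismatch score -1 is an assumption).
def score(a, b):
    return {("G", "G"): 10, ("A", "A"): 1}.get((a, b), -1)


print(align_with_seed_penalties("AAAG", "AAAG", score, -1, {"AAAG": 100}, 4))
```

Because all seeds have length C, at most one seed can end at any position of a run, so checking the last C characters whenever the run length is at least C accounts for overlapping appearances as well.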

  20. Moed B 26.2.2010 • An inverted-repeat is an appearance of some sequence and its inverse in a string, without overlapping. For example, in the string abcdelmnedcblmnknm there is an inverted repeat of size 4, because bcde and its inverse appear in it, and do not overlap. The sequence lmn appears twice but in the same order and therefore it does not constitute an inverted repeat. The sequence mnk appears twice but the two appearances overlap, and therefore it does not constitute an inverted repeat either. Describe a linear time algorithm for finding the longest inverted repeat in a string.

  21. Solution • Build a suffix tree for S and S^R • S = abcdelmnedcblmnknm: i = red start index in S = 2 • S^R = mnknmlbcdenmledcba: j = green start index in S^R = 7 = |S|-|REP|-(green start index in S)+2 • So there is no overlap if • i+|REP|-1 < |S|-|REP|-j+2

  22. Solution • Each node is marked if it has children from both S and S^R. MAX ← 0 • The postorder search will proceed as follows: • If v does not have marked descendants and v is marked, compare the indices of the leaves with minimal indices in S and S^R.

  23. Solution • The postorder search will proceed as follows: • If v does not have marked descendants and v is marked, compare the indices of the leaves with minimal indices in S and S^R. • The indices give the maximal repeat with no overlap. If it is > MAX, update MAX.

  24. Solution • The midpoint between the end of the first appearance and the start of the inverted appearance is (i+|REP|-1 + |S|-|REP|-j+2)/2 = (|S|+1+i-j)/2 • The length of the non-overlapping string is (|S|+1+i-j)/2 - i
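The index arithmetic on these slides can be checked with a small brute-force reference in Python. This is deliberately not the linear-time suffix-tree algorithm: it compares every start i in S with every start j in S^R, and is meant only to illustrate the conversion "start of the inverted copy in S = |S|-|REP|-j+2" and the no-overlap test "i+|REP|-1 < |S|-|REP|-j+2" (all indices 1-based, as on the slides). The function name and return format are illustrative assumptions.

```python
def longest_inverted_repeat(s):
    """Return (length, start in S, start of the inverted copy in S), 1-based.

    Brute force over all pairs (i, j); polynomial but far from linear time.
    """
    n = len(s)
    r = s[::-1]  # S^R
    best = (0, None, None)
    for i in range(1, n + 1):        # start of the repeat in S
        for j in range(1, n + 1):    # start of the same text in S^R
            # longest common prefix of S[i..] and S^R[j..]
            length = 0
            while (i + length <= n and j + length <= n
                   and s[i + length - 1] == r[j + length - 1]):
                length += 1
            # shorten until the two copies no longer overlap in S
            while length > 0 and not (i + length - 1 < n - length - j + 2):
                length -= 1
            if length > best[0]:
                best = (length, i, n - length - j + 2)
    return best


print(longest_inverted_repeat("abcdelmnedcblmnknm"))
```

On the example string this reports a repeat of length 4 starting at position 2 (bcde) whose inverse starts at position 9 (edcb), in agreement with i = 2 and j = 7 on the slide.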

  25. Exercise from homework • We want to align a gene x to a genome y • x appears in y starting at position i • We want to align x, or part of it, to y such that neither x nor any part of it is aligned to itself

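  26. Solution [Figure: alignment matrix of x against y, with positions i and i+|x| marked on y; one path is labeled "Not good!" and two are labeled "OK".]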

  27. Solution • Global or local alignment? Local, because alignments of parts of x are also acceptable solutions, and we need to find the highest-scoring solution • Which solutions do we need to exclude? Every solution in which some x[j] is aligned to y[i+j]

  28. Solution [Figure: alignment matrices of x against y, with positions i and i+|x| marked on y.]

  29. Solution • Can we set the diagonal to -∞ instead? No, this will disregard solutions that cross the diagonal. [Figure: alignment matrix of x against y with a path crossing the diagonal between i and i+|x|.]
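The figures for this solution are lost, so the following Python sketch shows only one approach that is consistent with the text above; it is an assumption, not necessarily the construction drawn on the slides. The idea: run ordinary local alignment, but forbid the diagonal (match/mismatch) move into any cell where x[j] would be paired with its own copy in y, while still allowing gap moves through those cells so that paths may cross the diagonal. Indices are 0-based here, `p` is the 0-based start of x inside y, and the function name and scoring values are hypothetical.

```python
def local_align_excluding_self(x, y, p, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score in which x[a] is never paired with y[p + a].

    Pass p=None to disable the restriction (plain local alignment).
    """
    n, m = len(x), len(y)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for a in range(1, n + 1):
        for b in range(1, m + 1):
            # gap moves are always allowed, so paths can cross the diagonal
            candidates = [0, H[a - 1][b] + gap, H[a][b - 1] + gap]
            pairs_x_with_itself = p is not None and (b - 1) == p + (a - 1)
            if not pairs_x_with_itself:
                s = match if x[a - 1] == y[b - 1] else mismatch
                candidates.append(H[a - 1][b - 1] + s)
            H[a][b] = max(candidates)
            best = max(best, H[a][b])
    return best


# Usage: x = ACGT occurs in y at 0-based offset 2.
print(local_align_excluding_self("ACGT", "TTACGTTT", None))  # unrestricted
print(local_align_excluding_self("ACGT", "TTACGTTT", 2))     # self-pairing forbidden
```

Setting whole cells on the diagonal to -∞ would also block the gap moves, which is exactly the problem the slide points out.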

  30. Another exam question • Question from exam: Given k strings, denote by l(i) the length of the longest common (contiguous) substring of at least i of the input strings. Compute l(2),…,l(k). This is the k-common substring problem, for all possible values of k.

  31. Solution • We have seen that for a specific k the problem can be solved in time linear in the sum of input string lengths

  32. Solution

  33. Solution [Figure: tree with node counters; values shown: 1, 2, 3, 0, 0, 0, 0.]

  34. Solution • Claim: After the update procedure is completed, every node contains exactly the number of distinct strings in its subtree. Proof: Induction on node height. Base: Let v be a node whose children are all leaves, so all of them are direct children of v. They appear consecutively in the DFS, and their counts are summed at their LCA, which is v.

  35. Solution

  36. Solution • Step: Let v be a node of height i>1. In all the subtrees that are rooted in its descendants, duplications are counted correctly. It remains to see what happens to duplications in subtrees of different descendants.

  37. Solution [Figure: node v with descendants w and z; counters shown: 1, 1.]

  38. Solution • Duplications in different subtrees are counted in v. Therefore all the duplications are counted correctly. • How do we use the information that we computed in order to solve the question? • Traverse the suffix tree, for every node v with j distinct strings update l(j) if v’s string depth is larger than l(j)

  39. Solution • Are we done? Not yet: traverse the array l from the last entry to the first and overwrite l(i-1) with l(i) if l(i) is larger
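The last two steps can be sketched as follows in Python, assuming the suffix-tree traversal has already produced, for every internal node, its string depth and the number of distinct input strings in its subtree. The function name and the input format (a list of `(string_depth, distinct_count)` pairs) are hypothetical; building the generalized suffix tree and computing the counts is not shown.

```python
def common_substring_lengths(nodes, k):
    """Compute l(2), ..., l(k) from (string_depth, distinct_count) pairs."""
    l = [0] * (k + 1)  # l[j] for j = 0..k; entries 0 and 1 are unused
    # Step 1: a node covering exactly j distinct strings is a candidate for l(j).
    for depth, count in nodes:
        if count >= 2 and depth > l[count]:
            l[count] = depth
    # Step 2: a substring common to i strings is also common to i-1 strings,
    # so propagate larger values from the last entry toward the first.
    for i in range(k, 2, -1):
        if l[i] > l[i - 1]:
            l[i - 1] = l[i]
    return {j: l[j] for j in range(2, k + 1)}


# Usage: one node of depth 1 shared by exactly 2 strings,
# one node of depth 4 shared by all 3 strings.
print(common_substring_lengths([(1, 2), (4, 3)], 3))
```

Without the second pass, l(2) in this example would stay at 1 even though a substring of length 4 is shared by three (and therefore by at least two) of the strings.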

  40. Another homework exercise • Longest Common Prefix: Given a set of k strings of length n each, give an algorithm that finds the longest common prefix for every pair of strings. The total time should be O(kn + p) where p is the number of pairs of strings having a common prefix of length > 0.

  41. Trie • Each string is represented by exactly one path • Example set: { aeef, ad, bbfe, bbfg, c } [Figure: trie for the example set, with root branches a, b and c.]

  42. Solution • We can construct a trie from the set of strings in the question. This will take O(k·n). • Can we then find the LCA for every pair of leaves? • We can, but it will take O(k²) and the total running time will be O(k·n+k²)

  43. Solution • A simple trick: We will add each leaf to a group of sequences that start with the same nucleotide • Comparison of leaves in the generated groups will take O(p)
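A minimal Python sketch of the trie idea follows. It reports only pairs whose common prefix is non-empty, because pairs that separate at the root are never compared; this plays the role of grouping by first character. Instead of LCA queries it pairs up strings at the trie node where they diverge, which is a simplification chosen for brevity: the list concatenation here is not tuned to meet the exact O(kn + p) bound. The function name and the dictionary-based trie are illustrative assumptions.

```python
def pairwise_lcp(strings):
    """Map (a, b) with a < b to the LCP length of strings[a] and strings[b],
    for every pair whose LCP is > 0."""
    END = None  # key under which a node stores indices of strings ending there
    root = {}
    for idx, s in enumerate(strings):  # build the trie: O(total length)
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node.setdefault(END, []).append(idx)

    result = {}

    def collect(node, depth):
        # one group per child subtree, plus one per string ending at this node
        groups = [[i] for i in node.get(END, [])]
        for ch, child in node.items():
            if ch is not END:
                groups.append(collect(child, depth + 1))
        seen = []
        for group in groups:
            if depth > 0:  # strings from different groups diverge exactly here
                for a in seen:
                    for b in group:
                        result[(min(a, b), max(a, b))] = depth
            seen.extend(group)
        return seen

    collect(root, 0)
    return result


# Usage on the example set from the trie slide.
print(pairwise_lcp(["aeef", "ad", "bbfe", "bbfg", "c"]))
```

The recursion depth equals the string length n, so for long strings an explicit stack would be needed in place of the recursive helper.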
