Computer Science Background for Biologists
What is an algorithm? • A well-defined computational procedure that takes some values as input and produces some value as output. • We are interested in the correctness and efficiency of computer algorithms. • We seek to extract clean, well-defined problems from the typically messy "real" problem to gain insight into it.
Example of an algorithm: sorting • Input: A sequence of n numbers (a1, a2, …, an). • Output: A permutation (a'1, a'2, …, a'n) of the input sequence such that a'1 ≤ a'2 ≤ … ≤ a'n.
Exact String Matching • Input: A text string T, where |T| = n, and a pattern string P, where |P| = m. • Output: An index i such that T[i+k-1] = P[k] for all 1 ≤ k ≤ m, i.e. showing that P occurs as a substring of T starting at position i. • [Figure: text T and pattern P]
Exact String Matching • Brute force search algorithm
for i = 1 to n-m+1 do
    j = 1;
    while (j <= m) and (T[i+j-1] == P[j]) do   (test j <= m first so P[m+1] is never read)
        j = j+1;
    if (j > m) then print "pattern at position ", i;
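The brute force pseudocode can be sketched in Python roughly as follows. This is a minimal illustration, not the slides' own code: the function name `brute_force_search` is an assumption, Python strings are 0-based internally, and positions are reported 1-based to match the slide's convention.

```python
def brute_force_search(text: str, pattern: str) -> list[int]:
    """Return every 1-based position at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    positions = []
    for i in range(n - m + 1):          # 0-based candidate start in text
        j = 0
        # Test the bound first so pattern[j] is never read past the end.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:                      # all m characters matched
            positions.append(i + 1)     # report 1-based, as on the slide
    return positions


print(brute_force_search("abracadabra", "abra"))
```

If the pattern is longer than the text, the outer loop does not run and the result is an empty list.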
Algorithm Efficiency • Time efficiency of algorithms • Space efficiency of algorithms
Machine Independent Analysis • We assume that every basic operation takes constant time. • Example basic operations: addition, subtraction, multiplication, memory access. • Time efficiency of an algorithm is the number of basic operations it performs. • We do not distinguish between the basic operations.
Time efficiency • In fact, we will not worry about the exact values, but will look at "broad classes" of values. • Let there be n inputs. • If one algorithm needs n basic operations and another needs 2n, we consider them to be in the same efficiency category. • However, we do distinguish between exp(n), n, and log(n).
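Example: Time Complexity • The brute force string matching algorithm might use only about n steps if we are lucky (every alignment fails at its first comparison). • It might need about n*m steps if we are unlucky (almost every alignment matches nearly the whole pattern before failing).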
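To make the lucky and unlucky cases concrete, the sketch below counts the character comparisons performed by a brute force search. The function name `count_comparisons` and the test strings are illustrative assumptions, not taken from the slides.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Count character comparisons made by a brute force string search."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1            # one comparison of text[i+j] with pattern[j]
            if text[i + j] != pattern[j]:
                break                   # mismatch: move on to the next alignment
            j += 1
    return comparisons


# Lucky: every alignment fails immediately, about n comparisons.
lucky = count_comparisons("b" * 20, "a" * 5)
# Unlucky: every alignment matches 4 characters, then fails on the 5th.
unlucky = count_comparisons("a" * 20, "aaaab")
print(lucky, unlucky)
```

With n = 20 and m = 5 there are n-m+1 = 16 alignments, so the lucky case costs 16 comparisons and the unlucky case costs 16 * 5 = 80.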
Order of Increase • [Figure: growth of exp(n), n, and log n as n increases] • We care about how quickly the running time of our algorithms grows as the input size increases.
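As a rough numerical stand-in for comparing these growth rates, the following sketch tabulates log n, n, and exp(n) for a few input sizes. The helper name `growth_table` and the chosen sizes are assumptions made for illustration.

```python
import math


def growth_table(sizes):
    """Return one (n, log n, exp(n)) row for each input size n."""
    return [(n, math.log(n), math.exp(n)) for n in sizes]


for n, log_n, exp_n in growth_table([1, 10, 20, 40]):
    print(f"n = {n:>2}   log n = {log_n:5.2f}   exp(n) = {exp_n:.3e}")
```

Even at these small sizes exp(n) dwarfs n, while log n barely moves.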
Function Orders • A function f(n) is O(g(n)) if f(n) grows no faster than g(n). • Formally, f(n) is O(g(n)) if there exist a number n0 and a positive constant c such that for all n ≥ n0, 0 ≤ f(n) ≤ c·g(n). • If the limit of f(n)/g(n) as n → ∞ exists and is finite, then f(n) is O(g(n)).
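The formal definition can be explored numerically. The sketch below checks the inequality 0 ≤ f(n) ≤ c·g(n) over a finite range of n for a proposed pair (c, n0). The helper name `holds_on_range` is hypothetical, and a finite check can only refute a proposed pair on the tested range; it never proves the bound for all n.

```python
def holds_on_range(f, g, c, n0, n_max):
    """Check 0 <= f(n) <= c*g(n) for every integer n with n0 <= n <= n_max.

    Passing this check is evidence only, not a proof of the bound for all n.
    """
    return all(0 <= f(n) <= c * g(n) for n in range(n0, n_max + 1))


f = lambda n: 3 * n + 5      # example function: 3n + 5
g = lambda n: n              # candidate bound: n

print(holds_on_range(f, g, c=4, n0=5, n_max=1000))   # 3n + 5 <= 4n once n >= 5
print(holds_on_range(f, g, c=4, n0=1, n_max=1000))   # fails at n = 1: 8 > 4
```

Here 3n + 5 is O(n) with witnesses c = 4 and n0 = 5, since 3n + 5 ≤ 4n exactly when n ≥ 5.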
Implication of Big oh notation • Big oh notation gives an upper bound on the number of steps that an algorithm takes in the worst case. • Suppose we know that our algorithm uses O(f(n)) basic steps on any input of size n. Then, for all sufficiently large n, it terminates after executing at most a constant times f(n) basic steps. • A basic step takes constant time on a machine. • Hence our algorithm terminates within a constant times f(n) units of time, for all large n.
Algorithm Complexity • Thus the brute force string matching algorithm is O(mn), which is quadratic when m grows in proportion to n. • A quadratic-time algorithm is usually fast enough for small problems, but not for big ones. • An exponential-time algorithm can only be fast enough for tiny problems.
Any improvement on brute force search? • Some of these comparisons are wasted work! • By being more clever, we can reduce the worst case running time to O(n+m). • Example: Knuth-Morris-Pratt string matching.
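One common textbook formulation of Knuth-Morris-Pratt is sketched below; it is not necessarily the version the slides had in mind. The names `failure_table` and `kmp_search` are assumptions, positions are reported 1-based, and returning an empty list for an empty pattern is a choice made for this sketch.

```python
def failure_table(pattern: str) -> list[int]:
    """fail[j] = length of the longest proper prefix of pattern[:j+1]
    that is also a suffix of pattern[:j+1]."""
    fail = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]             # fall back to a shorter border
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    return fail


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return every 1-based position at which pattern occurs in text."""
    if not pattern:
        return []                       # sketch-specific choice for empty patterns
    fail = failure_table(pattern)
    positions = []
    k = 0                               # pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]             # shift the pattern; never move back in text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            positions.append(i - k + 2) # 1-based start of this match
            k = fail[k - 1]             # continue, allowing overlapping matches
    return positions


print(kmp_search("abracadabra", "abra"))
```

The text index i only ever moves forward, and each fallback step undoes an earlier successful match, which is why the total work is O(n+m) rather than O(nm).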
NP, NP-hard, NP-complete Problems • A (decision) problem is in the class NP if a proposed solution can be verified in polynomial time. • A problem is NP-hard if an algorithm for solving it can be translated, in polynomial time, into one for solving any problem in NP. • NP-hard therefore means "at least as hard as any problem in NP". • NP-complete: the problem is both in NP and NP-hard.
NP-Completeness • Unfortunately, for many problems there is no known polynomial-time algorithm. • Even worse, most of these problems can be proven NP-complete, meaning that no polynomial-time algorithm exists unless P = NP, which is widely believed to be false. • In practice we fall back on heuristics and approximation algorithms.
Shortest Common Superstring • Input: A set S = {s1, s2, …, sm} of text strings over some alphabet Σ. • Output: The shortest possible string T such that each si is a substring of T. • This problem arises in DNA sequencing.
Shortest common superstring • The problem is NP-hard (its decision version is NP-complete). • Can you suggest an algorithm to find the shortest common superstring? • Greedy heuristic: approximates the optimal solution.
Greedy Heuristic • Always merge the two strings with the longest overlap. • Put the combined string back. • Repeat until only one string remains. • GREEDY is conjectured to find a superstring of length at most twice optimal; the proven guarantees are weaker (a factor of 4, later improved to 3.5).
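The greedy steps above can be sketched as follows. This is a naive illustration under stated assumptions: the names `overlap` and `greedy_superstring` are hypothetical, duplicates and strings contained in another input are dropped first, ties between equal overlaps go to the first pair found, and all overlaps are recomputed from scratch in every round.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_superstring(strings: list[str]) -> str:
    """Repeatedly merge the pair with the longest overlap (greedy heuristic)."""
    # Drop duplicates and any string already contained in another one.
    unique = list(dict.fromkeys(strings))
    pool = [s for s in unique if not any(s != t and s in t for t in unique)]
    while len(pool) > 1:
        best = (-1, 0, 1)               # (overlap length, index of a, index of b)
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = pool[i] + pool[j][k:]  # glue b onto a, sharing the overlap
        pool = [s for idx, s in enumerate(pool) if idx not in (i, j)]
        pool.append(merged)             # put the combined string back
    return pool[0] if pool else ""


print(greedy_superstring(["ATTAGA", "TAGACC", "GACCAT"]))
```

On these three fragments the first merge joins ATTAGA and TAGACC (overlap 4), and the second appends the tail of GACCAT, giving ATTAGACCAT.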
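Time complexity of the greedy heuristic • We assume n strings, each of length k. • There are n-1 merge rounds. • Each round makes O(n^2) string comparisons if overlaps are recomputed from scratch. • Each string comparison takes O(k^2) steps with a naive overlap check. • As a rough estimate this gives O(n^3 k^2) steps in total; merged strings grow longer than k, so later comparisons can cost more.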