Computer Science Background for Biologists

What is an algorithm? • A well-defined computational procedure that takes some values as input and produces some value as output. • We are interested in the correctness and efficiency of computer algorithms. • We seek to extract clean, well-defined problems from the typically messy “real” problem to gain insight into it.

Example of an algorithm: sorting • Input: A sequence of n numbers (a1, a2, …, an). • Output: A permutation (a'1, a'2, …, a'n) of the input sequence such that a'1 ≤ a'2 ≤ … ≤ a'n.
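The input/output specification above can be made concrete with a short sketch. Insertion sort is used here purely as an illustration; the slides do not name a particular sorting algorithm, and the function name `insertion_sort` is an assumption.

```python
def insertion_sort(values):
    """Return a new list with the elements of `values` in non-decreasing order."""
    result = list(values)  # copy, so the input sequence is left untouched
    for i in range(1, len(result)):
        key = result[i]
        j = i - 1
        # Shift larger elements one slot to the right to make room for key.
        while j >= 0 and result[j] > key:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = key
    return result


if __name__ == "__main__":
    print(insertion_sort([31, 41, 59, 26, 41, 58]))
```

The returned list is a permutation of the input that satisfies the ordering condition in the output specification.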

Exact String Matching • Input: A text string T, where |T| = n, and a pattern string P, where |P| = m. • Output: An index i such that T[i+k-1] = P[k] for all 1 ≤ k ≤ m, i.e. a position at which P occurs as a substring of T.

Exact String Matching • Brute force search algorithm:
    for i = 1 to n-m+1 do
        j = 1;
        while (j <= m) and (T[i+j-1] == P[j]) do
            j = j+1;
        if (j > m) then print “pattern at position ”, i;
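A runnable version of the brute-force search might look like the sketch below. The function name `brute_force_match` is an assumption, and positions are reported 1-based to match the pseudocode. Note that the bound on j is tested before the character comparison, so the pattern is never read past its end.

```python
def brute_force_match(text, pattern):
    """Return all 1-based positions at which `pattern` occurs in `text`."""
    n, m = len(text), len(pattern)
    positions = []
    for i in range(n - m + 1):  # 0-based start of the candidate window
        j = 0
        # Check the bound on j first so pattern[j] is never out of range.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            positions.append(i + 1)  # report 1-based, as in the pseudocode
    return positions


if __name__ == "__main__":
    print(brute_force_match("abracadabra", "abra"))
```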

Algorithm Efficiency • Time efficiency of algorithms • Space efficiency of algorithms

Machine Independent Analysis • We assume that every basic operation takes constant time. • Example basic operations: addition, subtraction, multiplication, memory access. • The time efficiency of an algorithm is the number of basic operations it performs. • We do not distinguish between the basic operations.

Time efficiency • In fact, we will not worry about the exact values, but will look at “broad classes” of values. • Let there be n inputs. • If one algorithm needs n basic operations and another needs 2n basic operations, we will consider them to be in the same efficiency category. • However, we distinguish between exp(n), n, and log(n).

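Example: Time Complexity • The brute force string matching algorithm might use only about n steps if we are lucky (every candidate position mismatches on its first character). • We might need about n*m steps if we are unlucky (every candidate position matches until its last character).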
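The gap between the lucky and unlucky cases can be observed by counting character comparisons directly. This is a sketch under stated assumptions: the helper `count_comparisons` and the two test inputs are illustrative choices, not taken from the slides.

```python
def count_comparisons(text, pattern):
    """Count the character comparisons made by brute-force matching."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[i + j] != pattern[j]:
                break  # mismatch: move on to the next window
    return comparisons


if __name__ == "__main__":
    n, m = 20, 5
    # Lucky: every window mismatches on its first character.
    print(count_comparisons("a" * n, "b" * m))
    # Unlucky: every window matches until its last character.
    print(count_comparisons("a" * n, "a" * (m - 1) + "b"))
```

With n = 20 and m = 5 the lucky input costs n-m+1 = 16 comparisons, while the unlucky input costs (n-m+1)*m = 80.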

Order of Increase • We care about how quickly the running time of our algorithms grows as the input size increases. • (Figure: growth curves of exp(n), n, and log n.)

Function Orders • A function f(n) is O(g(n)) if f(n) grows no faster than g(n). • Formally, f(n) is O(g(n)) if there exist a number n0 and a positive constant c such that for all n ≥ n0, 0 ≤ f(n) ≤ c·g(n). • If the limit of f(n)/g(n) as n → ∞ exists and is finite, then f(n) is O(g(n)).
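As a small worked example of the definition, take f(n) = 3n + 5 and g(n) = n. These functions, and the constants c = 4 and n0 = 5, are chosen only for illustration and do not come from the slides.

```latex
\begin{align*}
f(n) &= 3n + 5, \qquad g(n) = n \\
3n + 5 &\le 3n + n = 4n \quad \text{for all } n \ge 5
  \;\Rightarrow\; c = 4,\ n_0 = 5 \\
\lim_{n \to \infty} \frac{3n + 5}{n} &= 3 < \infty
\end{align*}
```

Both the explicit constants and the limit test show that 3n + 5 is O(n). By contrast, exp(n)/n grows without bound, so exp(n) is not O(n).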

Implication of Big-O notation • Big-O notation gives an upper bound on the number of steps that an algorithm takes in the worst case. • Suppose we know that our algorithm uses O(f(n)) basic steps on any n inputs. Then, for sufficiently large n, it will terminate after executing at most a constant times f(n) basic steps. • We know that a basic step takes constant time on a machine. • Hence, our algorithm will terminate within a constant times f(n) units of time, for all large n.

Algorithm Complexity • Thus the brute force string matching algorithm is O(mn), which is quadratic when m is proportional to n. • A quadratic-time algorithm is usually fast enough for small problems, but not for big ones. • An exponential-time algorithm can only be fast enough for tiny problems.

Can we improve on brute force search? • Some of these comparisons are wasted work! • By being more clever, we can reduce the worst-case running time to O(n+m). • Example: Knuth-Morris-Pratt string matching.
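A compact sketch of Knuth-Morris-Pratt matching is given below. It follows the common textbook formulation with a failure (prefix) table; the function names `build_failure` and `kmp_match` are assumptions, and positions are reported 1-based for consistency with the brute-force pseudocode.

```python
def build_failure(pattern):
    """failure[j] = length of the longest proper prefix of pattern[:j+1]
    that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = failure[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        failure[j] = k
    return failure


def kmp_match(text, pattern):
    """Return all 1-based positions at which `pattern` occurs in `text`."""
    if not pattern:
        return []
    failure = build_failure(pattern)
    positions = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        # On a mismatch, fall back in the pattern instead of re-reading text.
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            positions.append(i - k + 2)  # 0-based end index -> 1-based start
            k = failure[k - 1]
    return positions
```

The text index never moves backwards, which is where the O(n+m) bound comes from: O(m) to build the table and O(n) to scan the text.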

NP, NP-hard, NP-complete Problems • A problem is in the class NP if a proposed solution to it can be verified in polynomial time. • A problem is NP-hard if an algorithm for solving it can be translated, in polynomial time, into one for solving any problem in NP. • NP-hard therefore means “at least as hard as any problem in NP.” • NP-complete: the problem is both in NP and NP-hard.

NP-Completeness • Unfortunately, for many problems, there is no known polynomial-time algorithm. • Even worse, many of these problems can be proven NP-complete, meaning that no polynomial-time algorithm exists for them unless P = NP. • In practice we fall back on heuristics and approximation algorithms.

Shortest Common Superstring • Input: A set S = {s1, s2, …, sm} of text strings on some alphabet Σ. • Output: The shortest possible string T such that each si is a substring of T. • This problem arises in DNA sequencing.
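The output condition is easy to check for a given candidate, even though finding the shortest T is hard. The sketch below verifies a candidate superstring; the function name `is_common_superstring` and the example fragments are assumptions made for illustration.

```python
def is_common_superstring(candidate, strings):
    """Return True if every string in `strings` is a substring of `candidate`."""
    return all(s in candidate for s in strings)


if __name__ == "__main__":
    fragments = ["ACGT", "GTAC", "TACG"]
    print(is_common_superstring("ACGTACG", fragments))
    print(is_common_superstring("ACGTAC", fragments))
```

Each substring test takes polynomial time, which is the kind of check that places the decision version of the problem (is there a common superstring of length at most some bound?) in NP.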

Shortest common superstring

Shortest common superstring • Finding the shortest common superstring is NP-hard (its decision version is NP-complete). • Can you suggest an algorithm to find the shortest common superstring? • Greedy heuristic: approximates the optimal solution.

Greedy Heuristic • We always merge the two strings with the longest overlap. • Put the combined string back. • Repeat until only one string remains. • GREEDY is conjectured to find a superstring of length at most twice the optimal; the guarantees that have actually been proven are weaker constant factors.
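The greedy steps above can be sketched as follows. This is a naive illustration, not an optimized implementation: the names `overlap` and `greedy_superstring` are assumptions, ties between equal overlaps are broken by list order, and strings contained in other strings are dropped up front (a preprocessing choice the slides do not mention).

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_superstring(strings):
    """Greedily merge the pair with the longest overlap until one string is left."""
    items = list(dict.fromkeys(strings))  # drop exact duplicates, keep order
    # Drop strings already contained in another string.
    items = [s for s in items if not any(s != t and s in t for t in items)]
    while len(items) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left, index of right)
        for i, a in enumerate(items):
            for j, b in enumerate(items):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = items[i] + items[j][k:]
        # Put the combined string back in place of the two merged strings.
        items = [s for idx, s in enumerate(items) if idx not in (i, j)]
        items.append(merged)
    return items[0] if items else ""


if __name__ == "__main__":
    print(greedy_superstring(["ACGT", "GTAC", "TACG"]))
```

The result is always a common superstring of the inputs, but it is not guaranteed to be the shortest one.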

Time complexity of the greedy heuristic • We assume n strings, each of length k. • n−1 rounds (each merge reduces the number of strings by one). • O(n²) string comparisons per round. • Each string comparison takes about k² steps when overlaps are computed naively.
