
Faster finds from Gallo to Google

Faster finds from Gallo to Google. Presented to the Niagara University Bioinformatics Seminar by Dr. Laurence Boxer, Department of Computer and Information Sciences. Applications to string search problems from: L. Boxer and R. Miller, Coarse Grained Gather and Scatter Operations with Applications, Journal of Parallel and Distributed Computing, 64 (2004), 1297-1320.


Presentation Transcript


  1. Faster finds from Gallo to Google. Presented to the Niagara University Bioinformatics Seminar. Dr. Laurence Boxer, Department of Computer and Information Sciences. Applications to string search problems from: L. Boxer and R. Miller, Coarse Grained Gather and Scatter Operations with Applications, Journal of Parallel and Distributed Computing, 64 (2004), 1297-1320.

  2. The Problem: Examples using case-insensitive exact matches. Given two character strings, a “pattern” and a “text” (with the text typically much larger than the pattern), find all matching copies of the pattern in the text.
P: agtacagtac
T: actaactagtacagtacagtacaactgtccatccg
Output: the two (overlapping) occurrences of the pattern within T (the highlighting of the original slide is lost in this transcript).
P: Gallo
T: If Professor Gallo serves many gallons of home-brewed wine to students who do dastardly deeds in the hallowed DePaul hallways, how many will go to the gallows? Better they should have a singalong…. He used a lame pickup line: “Is this little gal lonely?”
Output: the same text with the case-insensitive occurrences of “gallo” marked: in “Gallo”, “gallons”, and “gallows”.
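A brute-force version of this exact-match search is easy to state in code. The sketch below (in Python, used here purely for illustration; the function name is mine, and this is not the optimal algorithm discussed later) reports every 0-based start position of a case-insensitive occurrence, overlaps included:

```python
def find_exact(pattern, text):
    """All 0-based start positions of case-insensitive occurrences of
    pattern in text, overlaps included. Brute force: O(m*n) worst case."""
    p, t = pattern.lower(), text.lower()
    m = len(p)
    return [i for i in range(len(t) - m + 1) if t[i:i + m] == p]

# The DNA example above has two overlapping occurrences:
print(find_exact("agtacagtac", "actaactagtacagtacagtacaactgtccatccg"))  # [7, 12]
```

Note that the second occurrence starts at position 12, inside the first occurrence (positions 7-16), so a correct matcher must not skip past a match once it is found.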

  3. Additional “finds” when a small number of errors (mismatch, insert, delete) are permitted.
P: Gallo
T: If Professor Gallo serves many gallons of home-brewed wine to students who do dastardly deeds in the hallowed DePaul hallways, how many will go to the gallows? Better they should have a singalong…. He used a lame pickup line: “Is this little gal lonely?”
Output: besides the exact matches, three approximate matches:
• “hallo” in “hallowed”: 1 character mismatch (“h” for “g”)
• “galon” in “singalong”: must insert one “l” for a perfect match
• “gal lo” in “gal lonely”: must delete one space for a perfect match
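Matching with up to k such errors can be done with the classic dynamic-programming approach; the sketch below is a minimal illustration of that idea (names and structure are my own, not the method analyzed in this talk). It returns the 0-based end positions in the text at which the pattern matches with at most k mismatches, insertions, or deletions:

```python
def find_approx(pattern, text, k):
    """End positions j such that pattern matches a substring of text
    ending at j with at most k edits. O(m*n) time, O(m) space."""
    p, t = pattern.lower(), text.lower()
    m = len(p)
    col = list(range(m + 1))              # current DP column; col[0] stays 0
    ends = []
    for j, c in enumerate(t):
        new = [0]                         # a match may start at any text position
        for i in range(1, m + 1):
            cost = 0 if p[i - 1] == c else 1
            new.append(min(col[i] + 1,          # deletion
                           new[i - 1] + 1,      # insertion
                           col[i - 1] + cost))  # match / mismatch
        col = new
        if col[m] <= k:                   # pattern matches ending here
            ends.append(j)
    return ends
```

For example, with one allowed error the pattern “gallo” is found inside “hallowed” (ending at index 4), but with zero allowed errors it is not.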

  4. Analysis of algorithms • Seek to estimate the running time T(n) of an algorithm when applied to a data set of size n. • T(n) = Θ(f(n)) if, for large n, T(n) is approximately proportional to f(n). • T(n) = O(f(n)) if, for large n, T(n) is at most something that is Θ(f(n)). • Emphasis is on large n; for small n, even an inefficient algorithm may finish in acceptable time.

  5. Example: Sequential Sorting Algorithms

  6. Previous state of knowledge for exact string matching (algorithms for sequential computers) • Using absolute-value notation for the number of characters in a string, suppose |T| = n, |P| = m, where 1 ≤ m ≤ n (usually, m << n). • Therefore, the input size is Θ(m+n); since n ≤ m+n ≤ n+n = 2n, the input size is Θ(n). • In the worst case, all the input must be considered (otherwise, we may miss a match). There exist Θ(n)-time solutions for sequential computers, which are therefore optimal in the worst case. • However, n may be so large that Θ(n) time is unacceptable. • Speedup may come from using sequential algorithms highly likely to run faster than their worst-case time (topic of another talk). • We may use parallel computers to get faster results (topic of today’s talk).
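One well-known Θ(m+n)-time (hence Θ(n)-time, since m ≤ n) sequential solution is the Knuth-Morris-Pratt algorithm. The slide does not name a specific algorithm, so this is one representative choice, sketched compactly:

```python
def kmp_search(pattern, text):
    """Knuth-Morris-Pratt: all 0-based occurrence starts in Theta(m + n)
    worst-case time; the text is scanned left to right exactly once."""
    p, t = pattern.lower(), text.lower()
    m = len(p)
    fail = [0] * m                 # fail[i]: longest proper border of p[:i+1]
    k = 0
    for i in range(1, m):
        while k and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    out, k = [], 0
    for j, c in enumerate(t):
        while k and c != p[k]:
            k = fail[k - 1]
        if c == p[k]:
            k += 1
        if k == m:                 # full match ending at j
            out.append(j - m + 1)
            k = fail[k - 1]        # keep going: overlapping matches allowed
    return out
```

The failure table lets the scan resume after a mismatch without ever re-reading text characters, which is what makes the total work linear in n.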

  7. Parallel vs. sequential computers • Ideally, a parallel computer with q processors should solve a problem in 1/q-th of the time that a sequential computer requires. • Thus, if T(n) is the time for a sequential computer to solve a given problem, then we want the parallel computer to use Θ(T(n)/q) time. • But achieving this level of speedup may be difficult or impossible, because time is required to exchange data among processors. • The time required for standard data-exchange operations depends on the configuration of the processors.

  8. Examples of parallel architectures, with times to broadcast a unit of data • Linear array: q − 1 = Θ(q) steps to send a unit of data from the leftmost to the rightmost processor. • Mesh: 1. The source processor’s row (a linear array) broadcasts across the row. 2. In parallel, each column (a linear array) broadcasts down its column. For a √q × √q mesh this takes Θ(√q) steps.

  9. Example: tree. In the 1st step, the root broadcasts to each of its “children”; in subsequent steps, in parallel, the nodes at a given level that have just received the datum broadcast to their children. Thus, the time is proportional to the number of levels, Θ(log q).
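Both broadcast bounds can be checked with a small simulation (my own illustration, not part of the talk). The helper below counts the parallel steps of a level-by-level broadcast over any rooted tree given as a child map; a chain models the linear array (q − 1 steps), while a balanced binary tree gives Θ(log q) steps:

```python
def broadcast_steps(children, root=0):
    """Parallel steps until every node of the rooted tree holds the datum,
    assuming each node forwards to all of its children in one step."""
    frontier, steps = [root], 0
    while True:
        nxt = [c for node in frontier for c in children.get(node, [])]
        if not nxt:
            return steps
        steps += 1
        frontier = nxt

chain = {0: [1], 1: [2], 2: [3]}           # linear array, q = 4
btree = {0: [1, 2], 1: [3, 4], 2: [5, 6]}  # complete binary tree, q = 7
print(broadcast_steps(chain), broadcast_steps(btree))  # 3 2
```

With q = 7 the tree already beats a 7-node chain (2 steps vs. 6), and the gap widens as Θ(log q) vs. Θ(q).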

  10. Communications problems for string matching • The data is distributed (in segments of consecutive characters) among the processors, so occurrences of matches may be broken across processors. Hence we want to share a copy of the 1st m−1 characters of T in each processor with the processor containing the previous segment of T. • It would also be useful to have a copy of P in each processor.

  11. For the exact matching problem. [Figure: T distributed across processors in segments (“… who”, “lows …”, …), each processor also holding a copy of P: Gallo.] Suppose we take the following steps: • Each processor gets a copy of all of P. • Each processor gets the 1st m−1 characters of T initially stored in the processor with the next segment of T. • Then, in parallel, each processor can run an optimal sequential algorithm on its portion of the data in Θ(n/q + m) time.

  12. So, how do we perform these data movements efficiently? • Keys: efficient gather and scatter operations • Gather: given a unit of data in each processor, get a copy of each of these values into one processor.

  13. Scatter: return gathered items to their original processors (typically after modification by a sequential algorithm)

  14. How to gather/scatter efficiently (q = # of processors) • If not already known, identify a minimal spanning tree (MST) rooted at the processor to which the data is to be gathered. This is done as follows: 1. The root sends a message to each neighbor. 2. Each non-root processor waits for a message; the first message to arrive identifies the processor’s parent. Upon receipt, it sends a message to each neighbor identifying the sender’s parent. 3. Each processor receives the messages described above; if A receives a message from B identifying A as the parent of B, then A knows B is A’s child. Advanced techniques show this takes O(q) time. • Performing the gather: in parallel, each processor sends data to its parent processor in the MST until each value reaches the root processor. This takes Θ(q) time; thus, a gather operation takes Θ(q) time. • To scatter efficiently: reverse the direction of data flow of a gather operation: Θ(q) time.
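The gather itself can be simulated sequentially. In the sketch below (an illustration under the stated model: one unit of data forwarded per processor per step; distinct values assumed), parent[i] is processor i's parent in the spanning tree and the root is marked with −1; for a chain of q processors the gather finishes in q − 1 = Θ(q) steps:

```python
def gather(parent, values):
    """Simulate a tree gather: every processor's value ends at the root.
    Each non-root processor forwards one held value to its parent per step.
    Values must be distinct. Returns (sorted values at root, parallel steps)."""
    q = len(parent)
    root = parent.index(-1)
    buffers = [{values[i]} for i in range(q)]
    steps = 0
    while len(buffers[root]) < q:
        steps += 1
        moves = []
        for i in range(q):
            if i != root and buffers[i]:
                moves.append((parent[i], buffers[i].pop()))
        for dest, v in moves:       # apply all of this step's sends at once
            buffers[dest].add(v)
    return sorted(buffers[root]), steps

print(gather([-1, 0, 1, 2], [10, 11, 12, 13]))  # chain: 3 steps for q = 4
```

A scatter is the same data flow run in reverse, so it has the same Θ(q) step count.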

  15. Getting a complete copy of P to each processor, assuming m ≤ n/q (P small enough to fit in one processor) • Gather a dummy record from each processor to one processor: Θ(q) time. • Gather P to this processor, pipelining the data flow if more than one character of P is stored in any processor. Time: Θ(m+q) = Θ(max{m,q}). • For each character of P, tag each dummy record with the character and scatter, pipelining. Pipelining reduces the time from the Θ(mq) one might expect (m separate scatters of Θ(q) time apiece) to Θ(md+q) = Θ(max{md,q}) (m scatters that overlap in time), where d (the degree bound) is the maximum number of processors neighboring any given processor (1 ≤ d ≤ q − 1). • Total time: Θ(md+q) = Θ(max{md,q}). If both md ≤ n/q and q ≤ n/q, the total time is O(n/q).

  16. Getting each processor the m−1 characters of T that follow the processor’s last character of T (case 1): Suppose the processors holding consecutive segments of T are adjacent (this is possible for linear arrays, meshes using a snake-like order for the processors, and hypercubes; not for trees, etc.). Then: • In parallel, each odd-numbered processor gets the 1st m−1 characters of T that are stored in the adjacent processor holding the next segment. This takes Θ(m) time via direct communication (since these processors are adjacent). • Similarly, in parallel, each even-numbered processor gets the 1st m−1 characters of T that are stored in the adjacent processor holding the next segment. This takes Θ(m) time via direct communication. • Thus, the total time for this process is Θ(m).
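A sequential simulation of this two-phase exchange (my own sketch; in the parallel setting the two phases keep the communicating pairs disjoint, while here they are simply two loop passes):

```python
def add_overlaps(segments, m):
    """Append the first m-1 characters of the next segment to each segment.
    Phase 1 serves odd-indexed receivers, phase 2 even-indexed ones."""
    q = len(segments)
    out = list(segments)
    for phase in (1, 0):                 # odd receivers first, then even
        for i in range(phase, q - 1, 2):
            out[i] = segments[i] + segments[i + 1][:m - 1]
    return out

print(add_overlaps(["abcd", "efgh", "ijkl"], 3))
# ['abcdef', 'efghij', 'ijkl']
```

The last processor has no successor, so its segment is left unchanged; every other segment grows by exactly m − 1 characters, matching the Θ(m) communication volume per processor.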

  17. Getting each processor the m−1 characters of T that follow the processor’s last character of T (case 2): Suppose the processors holding consecutive segments of T are not adjacent. Then: • In parallel, each processor copies its 1st m−1 characters of T, tagged with the index of the processor holding the previous segment. This takes Θ(m) time. • Sort these (m−1)q = Θ(mq) data values by their processor-index tags so that each ends up in the processor holding the previous segment. This takes the time of a parallel sort of Θ(mq) values, which depends on the architecture. • Thus, the total time for this task is dominated by the sort step.

  18. Thus, we have the following algorithm for the exact string pattern matching problem on a coarse grained parallel computer with q processors: 0) T is distributed among the processors in segments of n/q characters apiece. 1) Distribute to each processor a copy of all of P, as described above, in Θ(md+q) = Θ(max{md,q}) time. If both q ≤ n/q (coarse grained parallel computer) and md ≤ n/q, this is O(n/q). 2) Distribute to each processor a copy of the 1st m−1 characters of the next segment of T. This takes Θ(m) time if processors with consecutive segments are adjacent; the time of the sort-based method above otherwise. 3) Each processor runs an optimal sequential algorithm on its n/q + m − 1 characters of T in Θ(n/q + m) time. This reduces to Θ(n/q), since m = O(n/q).
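The whole algorithm is easy to simulate sequentially. The sketch below (illustrative only; the simulated processors run a brute-force local search rather than an optimal Θ(n/q + m) one) splits T into q segments, extends each by its m − 1 overlap characters, searches each piece locally, and translates local hits to global positions. No match is missed or double-counted, because each occurrence starts inside exactly one processor's own segment:

```python
def parallel_exact_match(pattern, text, q):
    """Simulate the coarse-grained algorithm: segment + overlap + local search."""
    p, t = pattern.lower(), text.lower()
    n, m = len(t), len(p)
    size = -(-n // q)                    # ceil(n / q) characters per processor
    matches = []
    for proc in range(q):                # each iteration = one processor's work
        lo = proc * size
        piece = t[lo: lo + size + m - 1]  # own segment plus m-1 overlap chars
        for i in range(len(piece) - m + 1):
            if piece[i:i + m] == p:
                matches.append(lo + i)   # local -> global start position
    return matches
```

Because a piece has at most size + m − 1 characters, every reported start lies within the processor's own size-character segment, which is exactly the partitioning argument the slides rely on.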

  19. Thus, we get optimal worst-case running time Θ(n/q) under the following conditions: • If processors with consecutive segments of T are adjacent: when q ≤ n/q (equivalently, q ≤ √n) and md ≤ n/q; i.e., when max{md, q} ≤ n/q. • If processors with consecutive segments of T are not adjacent: we need the stronger restriction that the sort step of the previous slide also finishes in O(n/q) time.
