Efficient algorithms for locating maximum average consecutive substrings
Download
1 / 37

Efficient Algorithms for Locating Maximum Average Consecutive Substrings - PowerPoint PPT Presentation


  • 92 Views
  • Uploaded on

Efficient Algorithms for Locating Maximum Average Consecutive Substrings. Jie Zheng Department of Computer Science UC, Riverside. Outline. Problem Definition Applications to Molecular Biology Two Existing Algorithms Open Problems. Definition of Problem.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Efficient Algorithms for Locating Maximum Average Consecutive Substrings' - tamarr


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Efficient algorithms for locating maximum average consecutive substrings

Efficient Algorithms for Locating Maximum Average Consecutive Substrings

Jie Zheng

Department of Computer Science

UC, Riverside


Outline
Outline

  • Problem Definition

  • Applications to Molecular Biology

  • Two Existing Algorithms

  • Open Problems


Definition of problem
Definition of Problem

  • Given a sequence of real numbers,

    A = <a1, a2, …, an>, and a positive integer

    L≤ n, the goal is to find a consecutive substring of A of length at least L such that the average of the numbers in the subsequence is maximized.


Applications in biology
Applications in Biology

  • Locating GC-rich Regions

  • Post-Processing Sequence Alignments

  • Annotating Multiple Sequence Alignments

  • Computing Ungapped Local Alignments with Length Constraints


Two existing algorithms
Two Existing Algorithms

  • An O(nlogL)-time Algorithm

    (Yaw Ling Lin, Tao Jiang, Kun-Mao Chao, 2001)

  • A Linear Time Algorithm for Binary Strings

    (Hsueh-I Lu, 2002)


The o nlogl time algorithm yaw ling lin tao jiang kun mao chao 2001
The O(nlogL)-time Algorithm(Yaw-Ling Lin, Tao Jiang, Kun-Mao Chao,2001)

Basic Scheme:

  • Finding good partner of each element,

    i.e. for element ai, locate aj, such that the segment <ai,…, aj> has maximum average among all substrings starting from ai.

  • Choose the <ai,…, aj> with the maximum average among the n candidates.


Important concepts
Important Concepts

  • Right-Skew Sequence

    A sequence A = <a1, a2, …, an> is right-skew if and only if the average of any prefix

    < a1, a2, …, aj> is always less than or equal to the average of the remaining suffix subsequence <aj+1, aj+2,…, an>.


Important concept
Important Concept

  • Decreasingly Right-Skew Partition

    A partition A=A1A2…Ak is decreasingly right-skew if each segment Ai of the partition is right-skew and μ(Ai) > μ(Aj) for any i < j.


Big picture of right skew partition
Big Picture of Right-Skew Partition

A

B

C

D

  • Intuition:

  • If A is chosen, B must also be

  • If C is not chosen, D can not be, either.


Lemma 7 huang the maximum average substring can not be longer than 2l 1
Lemma 7(Huang): The Maximum Average Substring Can not be longer than 2L-1

  • Proof If C is the maximum average substring with length ≥2L, let C= AB, where |A|≥L and |B|≥L, then the average of A or B is no less than that of C.

    Say μ(A) > μ(B), then μ(A) > μ(AB)


Main idea of the o nlogl algorithm
Main Idea of the O(nlogL)Algorithm

  • Compute the decreasingly right-skew partition in O(n) time.

  • Finding the good partner for each index costs O(logL) time.


Compute the decreasingly right skew partition
Compute the decreasingly right-skew partition

  • Lemma 5: Every real sequence A=<a1,a2,…,an> has a unique decreasingly right-skew partition.

  • Lemma 6: All right-skew pointers for a length n sequence can be computed in O(n) amortized time.


Compute the right skew pointers
Compute the right-skew pointers

4

9

5

30

4

9

5

30

4

9

5

30

4

9

5

30


Find good partner in o log l
Find good partner in O(logL)

  • Lemma 9(Bitonic): Let P be a real sequence, and A1A2…Am the decreasingly right-skew partition of a sequence A. Suppose that

    μ(PA1…Ak) = max{μ(PA1…Ai)|0≤i≤m}

    Then μ(PA1…Ai) > μ(Ai+1) if and only if i≥k.


What does lemma 9 tell us
What does Lemma 9 tell us?

  • Locating good partner can be done with binary search!

    To find k so that

    μ(PA1…Ak) = max{μ(PA1…Ai)|0 ≤ i ≤ m}

    We guess i and make it closer to k:

  • μ(PA1…Ai) >μ(Ai+1) implies i ≥ k

  • μ(PA1…Ai) ≤μ(Ai+1) implies i < k



Date structure for binary search
Date Structure for Binary Search

logL Pointer-Jumping Tables

  • j (k) denotes the right end-point of the kth right-skew segment .

  • p(0)[i] = p[i], where p[i] is right-skew pointer for i, p(k+1) [i] = min{p(k) [p(k) [i]+1], n}. 1 k  logL

  • The precomputation of the jumping tables takes at most O(nlogL) time.


The time complexity
The Time Complexity

  • Totally n phases

  • Each phase costs O(logL)

  • Overall: O(nlogL)-time


Crying Out

for A Linear Time Algorithm!!


A linear time algorithm for binary strings hsueh i lu 2002
A Linear-Time Algorithm for Binary Strings(Hsueh-I Lu, 2002)

  • Build upon the previous algorithm

  • Improvements:

    - Considering an upper bound on the number of right-skew segments

    - Working simultaneously on the right-skew partitions of forward and reverse strings

    - Utilizing Properties of Binary Strings


Basic scheme
Basic Scheme

Let B = log3n and b = (loglogn)3

  • Choose O(n/ logn) indices i of S such that if g(i)-i  B holds for any of such i, then g(i) can be found in O(logn) time.

  • Choose O(n/ loglogn) indices i of S such that if B  g(i) – i  b holds for any of such i, then g(i) can be found in O(loglogn) time.

  • Find g(i) for all indices i such that g(i) – i  b.


Denotations
Denotations

  • A right-skew decompostion of any substring S [p, q] is a nonempty set of i indices i1,i2,…, ilso that S[i1,i2], S[i2, i3],…, S[il-1, il] are decreasingly right-skew partition of S.

  • Let DS (i, j) denote the right-skew decomposition of S[i, j]

  • If P = {p1, p2 ,…, pk, pk+1}, where p1< p2<…<pk+1, then


An intuitive observation
An Intuitive Observation

  • Right-skew pointers cannot cross

A

B

C

  • By definition of right-skew segment: μ(A)  μ(B) μ(C)

  • Thus μ(A+B)  μ(C).

  • 2. By definition of decreasingly right-skew partition:

  • μ(A+B) > μ(C). Contradiction.



Lemma 3 from the big picture
Lemma 3: from the big picture

  • If j DS (P), then DS (j, n)  DS (P).

    Lemma 3 tells us that if j belongs to the right-skew decomposition of some set of indices, then its good partner will also be. Thus, we only need to search for its good partner among a limited number of indices.


Lemma 4 d s i j o j i 2 3 it holds for binary strings only
Lemma 4: |DS(i, j)| = O((j - i)2/3)(It holds for binary strings only)

  • Define: A right-skew substring determined by DS(i, j) is the undividable right-skew segment.

  • A right-skew substring S[p, q] is long (short) if q - p l1/3 (q - p < l1/3)

  • Prove lemma 4 by showing that the number of long and short right-skew substrings for a binary string is O((j - i)2/3).


Phase 1 g i i b g r j j b
Phase 1: g(i) - i B; gR(j) - j B

Define:

Pshort = {p | p mod B  0 and 0  p < n}  {n}

  • We have |DS (Pshort)| = O (n/logn)

  • In this phase, we take care of index i such that

    i and g(i), i.e. good partner of i, are both in

    DS (Pshort)  DR (Pshort)


Phase 2 l b g i i l b l b g r j j l b
Phase 2: L + b < g(i) – i L+B L + b < gR(j) – j L+B

Define:

Ptiny = {p | p mod b  0 and 0  p < n}  {n}

  • We have |DS (Ptiny)| = O (n/loglogn)

  • In this phase, we take care of index i such that

    i and g(i), i.e. good partner of i, are both in

    DS (Ptiny)  DR (Ptiny)


Phase 3 g i i l b g r j j l b
Phase 3: g(i)-i L+b, gR(j)-j  L+b

  • We set up a table M whose (x,y) entry contains the index z, such that:

    If C is a binary string of L+b bits, x is the number of ‘1’ in the first L bits of C; y is the binary string consisting of the last b bits of C; z is the good partner of index 0 in C.

    Because b is relatively small, the number of possible value for x and y is linear

  • Looking up the table M, we can cope with the left-over case in O(n)-time.


Open problems
Open Problems:

How to extend the linear time algorithm for binary strings to arbitrary strings.


Interested
INTERESTED?

Contact:

Jie Zheng

Department of Computer Science

Surge Building # 350

UC, Riverside

E-mail: [email protected]


ad