Linear space alignment
Download
1 / 30

Linear-Space Alignment - PowerPoint PPT Presentation


  • 75 Views
  • Uploaded on

Linear-Space Alignment. Subsequences and Substrings. Definition A string x’ is a substring of a string x, if x = ux’v for some prefix string u and suffix string v (similarly, x’ = x i …x j , for some 1  i  j  |x|) A string x’ is a subsequence of a string x

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Linear-Space Alignment' - rana-hanson


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

Subsequences and substrings
Subsequences and Substrings

Definition A string x’ is a substring of a string x,

if x = ux’v for some prefix string u and suffix string v

(similarly, x’ = xi…xj, for some 1  i  j  |x|)

A string x’ is a subsequence of a string x

if x’ can be obtained from x by deleting 0 or more letters

(x’ = xi1…xik, for some 1  i1  …  ik  |x|)

Note: a substring is always a subsequence

Example:x = abracadabra

y = cadabr; substring

z = brcdbr; subseqence, not substring


Hirschberg s algortihm
Hirschberg’s algortihm

Given a set of strings x, y,…, a common subsequence is a string u that is a subsequence of all strings x, y, …

  • Longest common subsequence

    • Given strings x = x1 x2 … xM, y = y1 y2 … yN,

    • Find longest common subsequence u = u1 … uk

  • Algorithm:

    F(i – 1, j)

    • F(i, j) = max F(i, j – 1)

      F(i – 1, j – 1) + [1, if xi = yj; 0 otherwise]

    • Ptr(i, j) = (same as in N-W)

    • Termination: trace back from Ptr(M, N), and prepend a letter to u whenever

      • Ptr(i, j) = DIAG and F(i – 1, j – 1) < F(i, j)

  • Hirschberg’s original algorithm solves this in linear space


  • Introduction compute optimal score

    F(i,j)

    Introduction: Compute optimal score

    It is easy to compute F(M, N) in linear space

    Allocate ( column[1] )

    Allocate ( column[2] )

    For i = 1….M

    If i > 1, then:

    Free( column[ i – 2 ] )

    Allocate( column[ i ] )

    For j = 1…N

    F(i, j) = …


    Linear space alignment1
    Linear-space alignment

    To compute both the optimal score and the optimal alignment:

    Divide & Conquer approach:

    Notation:

    xr, yr: reverse of x, y

    E.g. x = accgg;

    xr = ggcca

    Fr(i, j): optimal score of aligning xr1…xri & yr1…yrj

    same as aligning xM-i+1…xM & yN-j+1…yN


    Linear space alignment2
    Linear-space alignment

    Lemma: (assume M is even)

    F(M, N) = maxk=0…N( F(M/2, k) + Fr(M/2, N – k) )

    M/2

    x

    F(M/2, k)

    Fr(M/2, N – k)

    y

    k*

    Example:

    ACC-GGTGCCCAGGACTG--CAT

    ACCAGGTG----GGACTGGGCAG

    k* = 8


    Linear space alignment3
    Linear-space alignment

    • Now, using 2 columns of space, we can compute

      for k = 1…M, F(M/2, k), Fr(M/2, N – k)

      PLUS the backpointers

    x1

    xM/2

    xM

    x1

    xM/2+1

    xM

    y1

    y1

    yN

    yN


    Linear space alignment4
    Linear-space alignment

    • Now, we can find k* maximizing F(M/2, k) + Fr(M/2, N-k)

    • Also, we can trace the path exiting column M/2 from k*

    0 1 …… M/2

    M/2+1 …… M M+1

    k*

    k*+1


    Linear space alignment5
    Linear-space alignment

    • Iterate this procedure to the left and right!

    k*

    N-k*

    M/2

    M/2


    Linear space alignment6
    Linear-space alignment

    Hirschberg’s Linear-space algorithm:

    MEMALIGN(l, l’, r, r’): (aligns xl…xl’ with yr…yr’)

    • Let h = (l’-l)/2

    • Find (in Time O((l’ – l)  (r’ – r)), Space O(r’ – r))

      the optimal path, Lh, entering column h – 1, exiting column h

      Let k1 = pos’n at column h – 2 where Lh enters

      k2 = pos’n at column h + 1 where Lh exits

      3. MEMALIGN(l, h – 2, r, k1)

      4. Output Lh

    • MEMALIGN(h + 1, l’, k2, r’)

      Top level call: MEMALIGN(1, M, 1, N)


    Linear space alignment7
    Linear-space alignment

    Time, Space analysis of Hirschberg’s algorithm:

    To compute optimal path at middle column,

    For box of size M  N,

    Space: 2N

    Time: cMN, for some constant c

    Then, left, right calls cost c( M/2  k* + M/2  (N – k*) ) = cMN/2

    All recursive calls cost

    Total Time: cMN + cMN/2 + cMN/4 + ….. = 2cMN = O(MN)

    Total Space: O(N) for computation,

    O(N + M) to store the optimal alignment


    Heuristic local alignerers

    Heuristic Local Alignerers

    • The basic indexing & extension technique

    • Indexing: techniques to improve sensitivity

      • Pairs of Words, Patterns

    • Systems for local alignment


    Indexing based local alignment
    Indexing-based local alignment

    ……

    Dictionary:

    All words of length k (~10)

    Alignment initiated between words of alignment score  T

    (typically T = k)

    Alignment:

    Ungapped extensions until score

    below statistical threshold

    Output:

    All local alignments with score

    > statistical threshold

    query

    ……

    scan

    DB

    query


    Indexing based local alignment extensions
    Indexing-based local alignment—Extensions

    A C G A A G T A A G G T C C A G T

    Gapped extensions until threshold

    • Extensions with gaps until score < C below best score so far

      Output:

      GTAAGGTCCAGT

      GTTAGGTC-AGT

    C T G A T C C T G G A T T G C G A


    Sensitivity speed tradeoff
    Sensitivity-Speed Tradeoff

    X%

    Sens.

    Speed

    Kent WJ, Genome Research 2002


    Sensitivity speed tradeoff1
    Sensitivity-Speed Tradeoff

    Methods to improve sensitivity/speed

    • Using pairs of words

    • Using inexact words

    • Patterns—non consecutive positions

    ……ATAACGGACGACTGATTACACTGATTCTTAC……

    ……GGCACGGACCAGTGACTACTCTGATTCCCAG……

    ……ATAACGGACGACTGATTACACTGATTCTTAC……

    ……GGCGCCGACGAGTGATTACACAGATTGCCAG……

    TTTGATTACACAGAT

    T G TT CAC G


    Measured improvement
    Measured improvement

    Kent WJ, Genome Research 2002


    Non consecutive words patterns
    Non-consecutive words—Patterns

    Patterns increase the likelihood of at least one match within a long conserved region

    Consecutive Positions

    Non-Consecutive Positions

    6 common

    5 common

    7 common

    3 common

    On a 100-long 70% conserved region:

    ConsecutiveNon-consecutive

    Expected # hits: 1.07 0.97

    Prob[at least one hit]: 0.30 0.47


    Advantage of patterns
    Advantage of Patterns

    11 positions

    11 positions

    10 positions


    Multiple patterns
    Multiple patterns

    TTTGATTACACAGAT

    T G TT CAC G

    T G T C CAG

    TTGATT A G

    • K patterns

      • Takes K times longer to scan

      • Patterns can complement one another

    • Computational problem:

      • Given: a model (prob distribution) for homology between two regions

      • Find: best set of K patterns that maximizes Prob(at least one match)

    How long does it take

    to search the query?

    Buhler et al. RECOMB 2003

    Sun & Buhler RECOMB 2004


    Variants of blast
    Variants of BLAST

    • NCBI BLAST: search the universe http://www.ncbi.nlm.nih.gov/BLAST/

    • MEGABLAST: http://genopole.toulouse.inra.fr/blast/megablast.html

      • Optimized to align very similar sequences

        • Works best when k = 4i  16

        • Linear gap penalty

    • WU-BLAST: (Wash U BLAST) http://blast.wustl.edu/

      • Very good optimizations

      • Good set of features & command line arguments

    • BLAT http://genome.ucsc.edu/cgi-bin/hgBlat

      • Faster, less sensitive than BLAST

      • Good for aligning huge numbers of queries

    • CHAOS http://www.cs.berkeley.edu/~brudno/chaos

      • Uses inexact k-mers, sensitive

    • PatternHunter http://www.bioinformaticssolutions.com/products/ph/index.php

      • Uses patterns instead of k-mers

    • BlastZ http://www.psc.edu/general/software/packages/blastz/

      • Uses patterns, good for finding genes

    • Typhon http://typhon.stanford.edu

      • Uses multiple alignments to improve sensitivity/speed tradeoff


    Example
    Example

    Query:gattacaccccgattacaccccgattaca (29 letters) [2 mins]

    Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences) 1,726,556 sequences; 8,074,398,388 total letters

    >gi|28570323|gb|AC108906.9|Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence Length = 144487 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus

    Query: 4 tacaccccgattacaccccga 24

    ||||||| |||||||||||||

    Sbjct: 125138 tacacccagattacaccccga 125158

    Score = 34.2 bits (17),

    Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus

    Query: 4 tacaccccgattacaccccga 24

    ||||||| |||||||||||||

    Sbjct: 125104 tacacccagattacaccccga 125124

    >gi|28173089|gb|AC104321.7| Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence, complete sequence Length = 139823 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus

    Query: 4 tacaccccgattacaccccga 24

    ||||||| |||||||||||||

    Sbjct: 3891 tacacccagattacaccccga 3911


    Example1
    Example

    Query: Human atoh enhancer, 179 letters [1.5 min]

    Result: 57 blast hits

    • gi|7677270|gb|AF218259.1|AF218259 Homo sapiens ATOH1 enhanc... 355 1e-95

    • gi|22779500|gb|AC091158.11| Mus musculus Strain C57BL6/J ch... 264 4e-68

    • gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhanc... 256 9e-66

    • gi|28875397|gb|AF467292.1| Gallus gallus CATH1 (CATH1) gene... 78 5e-12

    • gi|27550980|emb|AL807792.6| Zebrafish DNA sequence from clo... 54 7e-05

    • gi|22002129|gb|AC092389.4| Oryza sativa chromosome 10 BAC O... 44 0.068

    • gi|22094122|ref|NM_013676.1| Mus musculus suppressor of Ty ... 42 0.27

    • gi|13938031|gb|BC007132.1| Mus musculus, Similar to suppres... 42 0.27

      gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhancer sequence Length = 1517

      Score = 256 bits (129), Expect = 9e-66 Identities = 167/177 (94%),

      Gaps = 2/177 (1%) Strand = Plus / Plus

      Query: 3 tgacaatagagggtctggcagaggctcctggccgcggtgcggagcgtctggagcggagca 62

      ||||||||||||| ||||||||||||||||||| ||||||||||||||||||||||||||

      Sbjct: 1144 tgacaatagaggggctggcagaggctcctggccccggtgcggagcgtctggagcggagca 1203

      Query: 63 cgcgctgtcagctggtgagcgcactctcctttcaggcagctccccggggagctgtgcggc 122

      |||||||||||||||||||||||||| ||||||||| |||||||||||||||| |||||

      Sbjct: 1204 cgcgctgtcagctggtgagcgcactc-gctttcaggccgctccccggggagctgagcggc 1262

      Query: 123 cacatttaacaccatcatcacccctccccggcctcctcaacctcggcctcctcctcg 179

      ||||||||||||| || ||| |||||||||||||||||||| |||||||||||||||

      Sbjct: 1263 cacatttaacaccgtcgtca-ccctccccggcctcctcaacatcggcctcctcctcg 1318

    http://www.ncbi.nlm.nih.gov/BLAST/


    Hidden markov models

    1

    2

    2

    1

    1

    1

    1

    2

    2

    2

    2

    K

    x1

    K

    K

    K

    K

    x2

    x3

    xK

    Hidden Markov Models


    Outline for our next topic
    Outline for our next topic

    • Hidden Markov models – the theory

    • Probabilistic interpretation of alignments using HMMs

      Later in the course:

    • Applications of HMMs to biological sequence modeling and discovery of features such as genes


    Example the dishonest casino
    Example: The Dishonest Casino

    A casino has two dice:

    • Fair die

      P(1) = P(2) = P(3) = P(5) = P(6) = 1/6

    • Loaded die

      P(1) = P(2) = P(3) = P(5) = 1/10

      P(6) = 1/2

      Casino player switches back-&-forth between fair and loaded die once every 20 turns

      Game:

    • You bet $1

    • You roll (always with a fair die)

    • Casino player rolls (maybe with fair die, maybe with loaded die)

    • Highest number wins $2


    Question 1 evaluation
    Question # 1 – Evaluation

    GIVEN

    A sequence of rolls by the casino player

    1245526462146146136136661664661636616366163616515615115146123562344

    QUESTION

    How likely is this sequence, given our model of how the casino works?

    This is the EVALUATION problem in HMMs

    Prob = 1.3 x 10-35


    Question 2 decoding
    Question # 2 – Decoding

    GIVEN

    A sequence of rolls by the casino player

    1245526462146146136136661664661636616366163616515615115146123562344

    QUESTION

    What portion of the sequence was generated with the fair die, and what portion with the loaded die?

    This is the DECODING question in HMMs

    FAIR

    LOADED

    FAIR


    Question 3 learning
    Question # 3 – Learning

    GIVEN

    A sequence of rolls by the casino player

    1245526462146146136136661664661636616366163616515615115146123562344

    QUESTION

    How “loaded” is the loaded die? How “fair” is the fair die? How often does the casino player change from fair to loaded, and back?

    This is the LEARNING question in HMMs

    Prob(6) = 64%


    The dishonest casino model
    The dishonest casino model

    0.05

    0.95

    0.95

    FAIR

    LOADED

    P(1|F) = 1/6

    P(2|F) = 1/6

    P(3|F) = 1/6

    P(4|F) = 1/6

    P(5|F) = 1/6

    P(6|F) = 1/6

    P(1|L) = 1/10

    P(2|L) = 1/10

    P(3|L) = 1/10

    P(4|L) = 1/10

    P(5|L) = 1/10

    P(6|L) = 1/2

    0.05


    ad