Dynamic Programming (cont’d)

CS 498 SS. Saurabh Sinha.

jihan

Presentation Transcript


  1. Dynamic Programming (cont’d) CS 498 SS Saurabh Sinha

  2. Previous lecture cont’d

  3. Affine Gap Penalties • In nature, a series of k indels often comes as a single event rather than as a series of k single-nucleotide events:
       This is more likely (one gap of length 2):
         ATA__GC
         ATATTGC
       This is less likely (two gaps of length 1):
         ATAG_GC
         AT_GTGC
     • Normal scoring would give the same score to both alignments

  4. Accounting for Gaps • Gap: a contiguous sequence of spaces in one of the rows • The score for a gap of length x is $-(\rho + \sigma x)$, where $\rho > 0$ is the gap opening penalty (the penalty for introducing a gap) and $\sigma$ is the gap extension penalty • $\rho$ is large relative to $\sigma$, because extending an existing gap should add only a small penalty
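
A minimal sketch of the gap score above, applied to the two alignments from the Affine Gap Penalties slide. The function names and the values ρ = 4, σ = 1 are assumptions chosen for illustration; they do not come from the lecture.

```python
from itertools import groupby


def gap_lengths(row, gap_char="_"):
    """Lengths of the maximal runs of gap characters in one alignment row."""
    return [len(list(run)) for ch, run in groupby(row) if ch == gap_char]


def affine_gap_penalty(alignment, rho, sigma):
    """Sum of -(rho + sigma * x) over every gap of length x in both rows."""
    return sum(-(rho + sigma * x) for row in alignment for x in gap_lengths(row))


# Illustrative parameters (assumed): opening penalty 4, extension penalty 1.
one_long_gap = ("ATA__GC", "ATATTGC")
two_short_gaps = ("ATAG_GC", "AT_GTGC")
print(affine_gap_penalty(one_long_gap, rho=4, sigma=1))    # one gap of length 2
print(affine_gap_penalty(two_short_gaps, rho=4, sigma=1))  # two gaps of length 1
```

With these parameters the single long gap is penalized less than the two separate gaps, which is the behaviour the affine model is meant to produce.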

  5. Affine gap penalty in DP • When computing $s_{i,j}$, we need to look at $s_{i,j-1}, s_{i,j-2}, s_{i,j-3}, \ldots$ and $s_{i-1,j}, s_{i-2,j}, \ldots$ • Each cell needs $O(n)$ time for its update • There are $O(n^2)$ cells • Therefore, this is an $O(n^3)$ algorithm • We can still do this in $O(n^2)$ time

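  6. Affine Gap Penalty Recurrences • Three scores are kept per cell: $s^{\downarrow}_{i,j}$ (alignment ends with a gap in w), $s^{\rightarrow}_{i,j}$ (alignment ends with a gap in v), and $s_{i,j}$ (the middle level)

$$
s^{\downarrow}_{i,j} = \max
\begin{cases}
s^{\downarrow}_{i-1,j} - \sigma & \text{continue gap in } w \text{ (deletion)} \\
s_{i-1,j} - (\rho + \sigma) & \text{start gap in } w \text{ (deletion): from middle}
\end{cases}
$$

$$
s^{\rightarrow}_{i,j} = \max
\begin{cases}
s^{\rightarrow}_{i,j-1} - \sigma & \text{continue gap in } v \text{ (insertion)} \\
s_{i,j-1} - (\rho + \sigma) & \text{start gap in } v \text{ (insertion): from middle}
\end{cases}
$$

$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + \delta(v_i, w_j) & \text{match or mismatch} \\
s^{\downarrow}_{i,j} & \text{end deletion: from top} \\
s^{\rightarrow}_{i,j} & \text{end insertion: from bottom}
\end{cases}
$$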
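
The three-level recurrence can be sketched as a quadratic-time dynamic program. This is only an illustration under assumptions: the function name, the simple match/mismatch scoring used for δ, and the use of −∞ for unreachable cells are choices made here, not taken from the slides.

```python
NEG_INF = float("-inf")


def affine_alignment_score(v, w, rho, sigma, match=1, mismatch=-1):
    """Global alignment score where a gap of length x costs rho + sigma * x."""
    n, m = len(v), len(w)
    # lower: alignment ends in a gap in w (deletion)
    # upper: alignment ends in a gap in v (insertion)
    # middle: best score over all three ways of ending
    lower = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    upper = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    middle = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    middle[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0:  # continue a deletion, or start one from the middle level
                lower[i][j] = max(lower[i - 1][j] - sigma,
                                  middle[i - 1][j] - (rho + sigma))
            if j > 0:  # continue an insertion, or start one from the middle level
                upper[i][j] = max(upper[i][j - 1] - sigma,
                                  middle[i][j - 1] - (rho + sigma))
            diagonal = NEG_INF
            if i > 0 and j > 0:  # match or mismatch
                delta = match if v[i - 1] == w[j - 1] else mismatch
                diagonal = middle[i - 1][j - 1] + delta
            middle[i][j] = max(diagonal, lower[i][j], upper[i][j])
    return middle[n][m]


print(affine_alignment_score("ATAGC", "ATATTGC", rho=2, sigma=1))
```

Each cell is updated in constant time from a fixed number of neighbours, so the total work is proportional to the number of cells rather than cubic.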

  7. Reading assignment: Section 6.10 (J & P), Multiple Alignment

  8. Gene Prediction

  9. Gene Prediction: Computational Challenge • Gene: A sequence of nucleotides coding for protein • Gene Prediction Problem: Determine the beginning and end positions of genes in a genome

  10. The Genetic Code SOURCE: http://www.bioscience.org/atlases/genecode/genecode.htm

  11. Codons • In 1961 Sydney Brenner and Francis Crick discovered frameshift mutations • Systematically deleted nucleotides from DNA • Single and double deletions dramatically altered protein product • Effects of triple deletions were minor • Conclusion: every triplet of nucleotides, each codon, codes for exactly one amino acid in a protein

  12. Great Discovery Provoking Wrong Assumption • In 1964, Charles Yanofsky and Sydney Brenner proved collinearity in the order of codons with respect to amino acids in proteins • As a result, it was incorrectly assumed that the triplets encoding amino acid sequences form contiguous strips of information

  13. Exons and Introns • In eukaryotes, the gene is a combination of coding segments (exons) that are interrupted by non-coding segments (introns) • This makes computational gene prediction in eukaryotes even more difficult • Prokaryotes don’t have introns - Genes in prokaryotes are continuous

  14. Central Dogma and Splicing [Figure: a gene with exon1, intron1, exon2, intron2, exon3 going through transcription, splicing, and translation; exon = coding, intron = non-coding. Credit: Batzoglou]

  15. Gene prediction • More difficult in eukaryotes than in prokaryotes (due to introns) • In the human genome, only ~3% of the DNA sequence is genes • There is a lot of “junk” DNA between genes, and even inside genes (between exons) • Gene prediction must deal with this

  16. Gene prediction: broadly speaking • Statistical approaches: look for features that appear frequently in genes and infrequently elsewhere • Similarity-based approaches: a newly sequenced gene may be similar to a known gene • Even this is not so simple: the exon structures may differ between otherwise similar genes

  17. Statistical approaches

  18. Splicing Signals • Exons are interspersed with introns, which typically begin with GT and end with AG

  19. Splice site detection [Figure: donor site profile, 5’ to 3’, showing the percentage of each nucleotide by position. From lectures by Serafim Batzoglou (Stanford)]

  20. Consensus splice sites Donor: 7.9 bits Acceptor: 9.4 bits

  21. Splicing and gene prediction • Can splice site profiles alone be used to predict genes? • Limited scope: too many false predictions

  22. Open Reading Frames (ORFs) • Detect potential coding regions by looking at ORFs • A region of length n comprises n/3 codons • Stop codons break the genome into segments between consecutive stop codons • The subsegments of these that start from the start codon (ATG) are ORFs [Figure: a genomic sequence with an open reading frame running from ATG to TGA]

  23. ORFs • 6 reading frames in any given sequence • 6 ways to map the DNA sequence to codon sequence (+1,+2,+3,-1,-2,-3) • 3 on either strand • Look at all 6 reading frames for ORFs
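
A minimal sketch of scanning all six reading frames for ORFs, assuming an uppercase DNA string over A, C, G, T. The function names, the tuple layout of the result, and the `min_codons` threshold are illustrative choices, not part of the lecture. Coordinates for the negative frames refer to the reverse-complement strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=1):
    """Return (frame, start, end, sequence) for each ORF in the six frames.

    frame is +1, +2, +3 on the given strand and -1, -2, -3 on the reverse
    complement; start and end are 0-based, end-exclusive, on the strand read.
    An ORF here runs from the first ATG after a stop codon to the next stop.
    """
    orfs = []
    for strand, seq in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            start = None
            for pos in range(offset, len(seq) - 2, 3):
                codon = seq[pos:pos + 3]
                if start is None and codon == START_CODON:
                    start = pos
                elif start is not None and codon in STOP_CODONS:
                    # number of codons before the stop codon
                    if (pos - start) // 3 >= min_codons:
                        orfs.append((strand * (offset + 1), start, pos + 3,
                                     seq[start:pos + 3]))
                    start = None
    return orfs


print(find_orfs("ATGAAATGA"))
```

Raising `min_codons` gives the length-threshold filter that is commonly applied to ORF scans.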

  24. Long vs. Short ORFs • A long open reading frame may be a gene • At random, we should expect one stop codon every 64/3 ≈ 21 codons (3 of the 64 codons are stop codons) • However, genes are usually much longer than this • A basic approach is to scan for ORFs whose length exceeds a certain threshold • This is naïve because some genes (e.g. some neural and immune system genes) are relatively short
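
The arithmetic behind the length threshold can be checked with a few lines. This sketch assumes a simplified random model in which all 64 codons are equally likely and independent; the function names are hypothetical.

```python
STOP_FRACTION = 3 / 64  # 3 of the 64 codons are stop codons


def expected_codons_between_stops():
    """Mean spacing of stop codons under the uniform random model."""
    return 1 / STOP_FRACTION


def prob_no_stop(k):
    """Probability that k consecutive random codons contain no stop codon."""
    return (1 - STOP_FRACTION) ** k


print(expected_codons_between_stops())  # 64/3, about 21
print(prob_no_stop(100))                # under 1% for a run of 100 codons
```

Under this model a stop-free run of 100 codons arises by chance less than 1% of the time, which is why long ORFs are treated as gene candidates.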

  25. Codon usage • In a given sequence (e.g., an ORF), compute frequency distribution of codons (64 element array): codon usage array • Codon usage array for coding sequences is different from that for non-coding sequences • If the codon usage array for an ORF is much more similar to that of coding sequences than to that of non-coding sequences, the ORF could be a gene

  26. Codon usage • Codons coding for “Arg” in human: • CGU: 37%, CGC: 38%, CGA: 7%, CGG: 10%, AGA: 5%, AGG: 3% • In a coding sequence, codon CGC is 12 times more likely than codon AGG • An ORF preferring CGC over AGG is likely to be a gene

  27. Codon Usage in Human Genome

  28. Codon usage • One way to test whether an ORF is a gene is to compute Pr(ORF sequence under a coding sequence model), Pr(ORF sequence under a non-coding model), and the ratio of the two • These methods work best in prokaryotes • The exon-intron trouble is not handled yet • Hidden Markov models combine codon usage ideas and splice site ideas in one model • We’ll see more of this in the second half of the course
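
A hedged sketch of the ratio test: codon usage arrays are estimated from caller-supplied training sequences, and an ORF is scored by the log of the likelihood ratio. Treating codons as independent, the pseudocount, and all function names are assumptions made for this illustration; the training strings in the example are toy inputs, not real data.

```python
from itertools import product
from math import log

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]  # all 64 codons


def codons_of(seq):
    """Split a sequence into consecutive codons, dropping a trailing partial one."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def codon_usage(sequences, pseudocount=1.0):
    """64-element codon frequency table estimated from training sequences."""
    counts = {c: pseudocount for c in CODONS}
    for seq in sequences:
        for codon in codons_of(seq):
            if codon in counts:
                counts[codon] += 1
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def log_likelihood_ratio(orf, coding_usage, noncoding_usage):
    """log Pr(orf | coding) - log Pr(orf | non-coding), codons independent."""
    return sum(log(coding_usage[c] / noncoding_usage[c])
               for c in codons_of(orf) if c in coding_usage)


# Toy training data, purely illustrative.
coding = codon_usage(["CGCCGCCGCCGC"])
noncoding = codon_usage(["AGGAGGAGGAGG"])
print(log_likelihood_ratio("CGCCGC", coding, noncoding) > 0)
```

A positive score means the ORF looks more like the coding training set; a real application would estimate both tables from annotated genomic data.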

  29. Promoter Structure in Prokaryotes (E. coli) • Transcription starts at offset 0 • Pribnow Box (-10) • Gilbert Box (-30) • Ribosomal Binding Site (+10)

  30. Ribosomal Binding Site

  31. Statistical approaches: summary • Splicing sites • Codon usage • Promoter motifs, such as -10 element, -30 element • Ribosome binding site

  32. Similarity based approaches • Some genomes may be very well-studied, with many genes having been experimentally verified. • Closely-related organisms may have similar genes • Unknown genes in one species may be compared to genes in some closely-related species

  33. The basic approach • Given a protein sequence, and a genomic sequence, find a set of substrings of the genomic sequence whose concatenation best fits the protein sequence • First cut: Find fragments in the genomic sequence that match portions of the protein sequence (local alignment) • Then find the “optimal” subset of non-overlapping fragments

  34. Exon chaining • Each of the fragments of the genomic sequence that somewhat match the protein (locally) is a putative exon • The “goodness” of the match is the “weight” assigned to this putative exon • Thus, we have a set of weighted intervals (l,r,w): for a fragment from l to r, with weight w representing how well it matches (a portion of) the protein

  35. Exon Chaining Problem • Input: A set of weighted intervals (l,r,w) • Output: A maximum weight chain of non-overlapping intervals from this set

  36. Exon Chaining Problem: Graph Representation • Edge from every $l_i$ to $r_i$ • Edge between every two successive vertices • This problem can be solved with dynamic programming in $O(n)$ time

  37. Assumptions • No two intervals have a common boundary point, so the $(l_i, r_i)$ define 2n distinct points if there are n intervals

  38. Exon Chaining Algorithm
      ExonChaining(G, n)   // graph, number of intervals
        for i ← 0 to 2n    // s_0 = 0 is the base case
          s_i ← 0
        for i ← 1 to 2n
          if vertex v_i in G corresponds to the right end of an interval I
            j ← index of the vertex for the left end of interval I
            w ← weight of interval I
            s_i ← max { s_j + w, s_{i-1} }
          else
            s_i ← s_{i-1}
        return s_{2n}
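
A runnable sketch of the exon chaining algorithm, taking the weighted intervals directly instead of a prebuilt graph. The function name and the example intervals are made up for illustration. It assumes, as in the lecture, that all 2n endpoints are distinct; the sorting step used here makes this version O(n log n), with the scan itself linear.

```python
def exon_chaining(intervals):
    """Maximum total weight of a chain of non-overlapping (l, r, w) intervals."""
    # Build the 2n endpoints in left-to-right order, remembering for each
    # whether it is a right end and which interval it belongs to.
    points = sorted(
        [(l, False, k) for k, (l, r, w) in enumerate(intervals)]
        + [(r, True, k) for k, (l, r, w) in enumerate(intervals)]
    )
    left_index = {}                # interval index -> position of its left end
    s = [0] * (len(points) + 1)    # s[0] = 0 is the base case
    for i, (_, is_right, k) in enumerate(points, start=1):
        if is_right:
            j = left_index[k]
            weight = intervals[k][2]
            s[i] = max(s[j] + weight, s[i - 1])  # use interval k, or skip it
        else:
            left_index[k] = i
            s[i] = s[i - 1]
    return s[-1]


# Hypothetical example intervals (l, r, w).
example = [(1, 5, 5), (2, 3, 3), (4, 8, 6), (6, 12, 10),
           (9, 10, 1), (11, 15, 7), (13, 14, 0), (16, 18, 4)]
print(exon_chaining(example))
```

To recover the chain itself rather than only its weight, one would additionally store, for each right endpoint, which of the two options in the `max` was taken and trace back from the end.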

  39. Not very helpful • A chain is a set of non-overlapping exons in order (left to right) • But the matching protein portions may not be in the same order !

  40. Spliced Alignment • Begins by selecting either all putative exons between potential acceptor and donor sites, or all substrings similar to the target protein (as in the Exon Chaining Problem) • This set is then filtered so as to retain all true exons, along with some false ones • Then find the chain of exons that maximizes the sequence similarity to the target protein sequence

  41. Spliced Alignment Problem: Formulation • Input: Genomic sequence G, target sequence T, and a set of candidate exons (blocks) B • Output: A chain of exons Γ such that the global alignment score between Γ* and T is maximized, where Γ* is the concatenation of all exons from chain Γ

  42. The DAG • Vertices: One vertex for each block in B • Directed edge connecting non-overlapping blocks • Label of vertex = string of block it represents • A path through the DAG spells out the string obtained by concatenating that particular chain of blocks • Weight of a path is the score of the optimal alignment between the string it spells out and the target sequence

  43. Dynamic programming • Genomic sequence $G = g_1 g_2 \ldots g_n$ • Target sequence $T = t_1 t_2 \ldots t_m$ • As usual, we want to find the optimal alignment score of the i-prefix of G and the j-prefix of T • Problem is, there are many i-prefixes possible (since multiple blocks may include position i)

  44. Idea • Find the optimal alignment score of the i-prefix of G and the j-prefix of T, assuming that this alignment uses a particular block B at position i • Call this score S(i, j, B); it is defined for every block B that includes i

  45. Recurrence • If i is not the starting vertex of block B:

$$
S(i, j, B) = \max
\begin{cases}
S(i-1, j, B) - \text{indel penalty} \\
S(i, j-1, B) - \text{indel penalty} \\
S(i-1, j-1, B) + \delta(g_i, t_j)
\end{cases}
$$

      • If i is the starting vertex of block B:

$$
S(i, j, B) = \max
\begin{cases}
S(i, j-1, B) - \text{indel penalty} \\
\max_{\text{all blocks } B' \text{ preceding block } B} S(\mathrm{end}(B'), j, B') - \text{indel penalty} \\
\max_{\text{all blocks } B' \text{ preceding block } B} S(\mathrm{end}(B'), j-1, B') + \delta(g_i, t_j)
\end{cases}
$$
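
A compact sketch of this recurrence, under several assumptions that are not in the slides: blocks are given as inclusive 0-based `(start, end)` positions in G, block B' precedes B when `end(B') < start(B)`, δ is a simple match/mismatch score, and a block with no predecessor starts from the score of aligning nothing to the j-prefix of T (j indels). The function name and the example strings are hypothetical.

```python
def spliced_alignment(G, T, blocks, indel=1, match=1, mismatch=-1):
    """Best global alignment score between T and the concatenation of a
    chain of non-overlapping blocks of G."""
    m = len(T)
    # Assumed base case: no exon used yet, j-prefix of T aligned to gaps only.
    empty = [-j * indel for j in range(m + 1)]
    last_row = {}  # block -> [S(end(B), j, B) for j in 0..m]
    # Sorting by end guarantees every preceding block is finished first.
    for block in sorted(blocks, key=lambda b: b[1]):
        start, end = block
        # Best score available just before entering this block, for each j.
        prev = list(empty)
        for other, row in last_row.items():
            if other[1] < start:
                prev = [max(a, b) for a, b in zip(prev, row)]
        for i in range(start, end + 1):
            cur = [prev[0] - indel]
            for j in range(1, m + 1):
                delta = match if G[i] == T[j - 1] else mismatch
                cur.append(max(cur[j - 1] - indel,     # gap in the exon chain
                               prev[j] - indel,        # gap in T
                               prev[j - 1] + delta))   # match or mismatch
            prev = cur
        last_row[block] = prev
    return max((row[m] for row in last_row.values()), default=empty[m])


# Hypothetical example: the best chain skips the middle block.
print(spliced_alignment("AAACCCTTTGGG", "AAAGGG", [(0, 2), (3, 5), (9, 11)]))
```

Only the final row of each block is stored, since that is all a later block needs from its predecessors; recovering the chain itself would require keeping backpointers.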
