Alignments and Comparative Genomics


  1. Alignments and Comparative Genomics

  2. Welcome to CS374! Today: • Serafim: Alignments and Comparative Genomics • Omkar: Administrivia

  3. Biology in One Slide – Twentieth Century …and today …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…

  4. Complete DNA Sequences: nearly 200 complete genomes have been sequenced

  5. Evolution

  6. Evolution at the DNA level • SEQUENCE EDITS (Mutation, Deletion): …ACGGTGCAGTTACCA… becomes …AC----CAGTCCACCA… • REARRANGEMENTS: Inversion, Translocation, Duplication

  7. Evolutionary Rates [Figure labels: next generation; OK, OK, OK, X, X, Still OK?]

  8. Sequence conservation implies function • Alignment is the key to • Finding important regions • Determining function • Uncovering the evolutionary forces

  9. Sequence Alignment
     AGGCTATCACCTGACCTCCAGGCCGATGCCC
     TAGCTATCACGACCGCGGTCGATTTGCCCGAC
     aligned as:
     -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
     TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
     Definition: Given two strings x = x1x2…xM, y = y1y2…yN, an alignment is an assignment of gaps to positions 0,…, M in x, and 0,…, N in y, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence

  10. What is a good alignment? • Alignment: The “best” way to match the letters of one sequence with those of the other. How do we define “best”? • Alignment: A hypothesis that the two sequences come from a common ancestor through sequence edits • Parsimonious explanation: Find the minimum number of edits that transform one sequence into the other

  11. Scoring Function • Sequence edits, starting from AGGCCTC: • Mutations: AGGACTC • Insertions: AGGGCCTC • Deletions: AGG.CTC Scoring Function: Match: +m Mismatch: -s Gap: -d Score F = (# matches) × m - (# mismatches) × s - (# gaps) × d
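The scoring function above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `alignment_score` and the default values m = s = d = 1 are assumptions, and every gap position is charged separately (a linear gap penalty).

```python
def alignment_score(row_x, row_y, m=1, s=1, d=1):
    """Score a gapped pairwise alignment given as two equal-length rows.

    m, s, d are the match reward, mismatch penalty and gap penalty
    (hypothetical defaults chosen for illustration).
    """
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            score -= d          # one letter faces a gap
        elif a == b:
            score += m          # match
        else:
            score -= s          # mismatch
    return score


print(alignment_score("AGTA", "A-TA"))  # 3 matches and 1 gap
```

With these defaults the call scores three matches and one gap, which is F = 3 × 1 - 1 × 1 = 2.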

  12. How do we compute the best alignment? Too many possible alignments: O(2^(M+N))
      AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
      AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

  13. Dynamic Programming • Given two sequences x = x1……xM and y = y1……yN • Let F(i, j) = Score of best alignment of x1……xi to y1……yj • Then, F(M, N) = Score of best alignment Idea: • Compute F(i, j) for all i and j • Do this by using F(i-1, j), F(i, j-1), F(i-1, j-1)

  14. Dynamic Programming (cont’d) Notice three possible cases:
      1. xi aligns to yj
           x1……xi-1 xi
           y1……yj-1 yj
         F(i, j) = F(i-1, j-1) + m, if xi = yj; F(i-1, j-1) - s, if not
      2. xi aligns to a gap
           x1……xi-1 xi
           y1……yj   -
         F(i, j) = F(i-1, j) - d
      3. yj aligns to a gap
           x1……xi   -
           y1……yj-1 yj
         F(i, j) = F(i, j-1) - d

  15. Dynamic Programming (cont’d) • How do we know which case is correct? Inductive assumption: F(i, j-1), F(i-1, j), F(i-1, j-1) are optimal. Then,
      F(i, j) = max { F(i-1, j-1) + s(xi, yj),  F(i-1, j) - d,  F(i, j-1) - d }
      where s(xi, yj) = m, if xi = yj; -s, if not.
      [Figure: cell (i, j) is filled in from cells (i-1, j-1), (i-1, j) and (i, j-1)]

  16. Example x = AGTA, y = ATA; match score m = 1, mismatch score -s = -1, gap score -d = -1 (that is, s = 1 and d = 1) [Table: F(i, j) for i = 0…4, j = 0…3] Optimal Alignment: F(4, 3) = 2
      AGTA
      A-TA

  17. The Needleman-Wunsch Algorithm
      • Initialization.
        • F(0, 0) = 0
        • F(0, j) = - j × d
        • F(i, 0) = - i × d
      • Main Iteration. Filling in partial alignments
        • For each i = 1……M, for each j = 1……N:
          F(i, j) = max { F(i-1, j) - d [case 1],  F(i, j-1) - d [case 2],  F(i-1, j-1) + s(xi, yj) [case 3] }
          Ptr(i, j) = UP if [case 1]; LEFT if [case 2]; DIAG if [case 3]
      • Termination. F(M, N) is the optimal score, and from Ptr(M, N) we can trace back the optimal alignment
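The three steps above (initialization, main iteration, termination with traceback) can be sketched as follows. This is an illustrative Python version, not code from the lecture: the function name `needleman_wunsch`, the defaults m = s = d = 1, and the tie-breaking order (DIAG before UP before LEFT) are all assumptions.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Global alignment of x and y; returns (score, aligned_x, aligned_y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]

    # Initialization: a prefix aligned against nothing costs one gap per letter.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"

    # Main iteration: fill in partial alignments.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            diag = F[i - 1][j - 1] + (m if x[i - 1] == y[j - 1] else -s)
            up = F[i - 1][j] - d        # x[i-1] aligned to a gap
            left = F[i][j - 1] - d      # y[j-1] aligned to a gap
            best = max(diag, up, left)
            F[i][j] = best
            # Tie-breaking order is an arbitrary choice for this sketch.
            ptr[i][j] = "DIAG" if best == diag else ("UP" if best == up else "LEFT")

    # Termination: trace pointers back from (M, N) to (0, 0).
    ax, ay, i, j = [], [], M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "UP":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("AGTA", "ATA"))
```

On the example x = AGTA, y = ATA this reproduces the optimal score 2 and the alignment AGTA over A-TA. Time and memory are both O(MN), which is what makes the approach impractical at genome scale.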

  18. Alignment on a Large Scale • Given a newly sequenced organism, • Which subregions align with other organisms? • Potential genes • Other biological characteristics • Assume we use Dynamic Programming: • Our newly sequenced mammal: 3 × 10^9 letters • The entire genomic database: 10^10 - 10^11 letters

  19. Index-based Local Alignment Main idea: • Construct a dictionary of all the words in the query • Initiate a local alignment for each word match between query and DB Running Time: • Theoretical worst case: O(MN) • Fast in practice
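The dictionary idea above can be sketched as a word index over the query followed by a single scan of the database. This is a minimal illustration: the names `build_word_index` and `find_seeds` are hypothetical, only exact word matches are reported, and the database string is an assumption made up for the demonstration.

```python
from collections import defaultdict


def build_word_index(query, k):
    """Map every length-k word of the query to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def find_seeds(query, db, k):
    """Return (query_pos, db_pos) for every exact k-letter word match."""
    index = build_word_index(query, k)
    seeds = []
    for j in range(len(db) - k + 1):          # one pass over the database
        for i in index.get(db[j:j + k], ()):  # dictionary lookup per word
            seeds.append((i, j))
    return seeds


query = "ACGAAGTAAGGTCCAGT"
db = "AGCGTTAGGTCCTTCCC"   # assumed database string for illustration
print(find_seeds(query, db, 4))
```

Each seed is a candidate starting point for a local alignment. The scan costs time proportional to the database length plus the number of word hits, which is why the method is fast in practice even though the worst case remains O(MN).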

  20. Index-based Local Alignment: BLAST • Dictionary: All words of length k (~11) • Alignment initiated between exact-matching words (more generally, between words of alignment score ≥ T) • Alignment: Ungapped extensions until score falls below a statistical threshold • Output: All local alignments with score > statistical threshold [Figure: the query's words are looked up while scanning the DB]

  21. Index-based Local Alignment: BLAST Example: k = 4, T = 4 [Figure: dot matrix with the query A C G A A G T A A G G T C C A G T on one axis and the DB letters C C C T T C C T G G A T T G C G A on the other] The matching word GGTC initiates an alignment. Extension to the left and right with no gaps until alignment falls < 50%. Output:
      GTAAGGTCC
      GTTAGGTCC
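Ungapped extension from a seed can be sketched as below. Note the substitution: instead of the slide's "falls < 50%" stopping rule, this sketch uses a simple X-drop rule (stop once the running score falls more than `x_drop` below the best score seen, then keep the best-scoring extent). The function names, the +1/-1 scores, `x_drop = 2`, and the database string are all assumptions for illustration.

```python
def extend_one_side(q, d, qi, di, step, match=1, mismatch=-1, x_drop=2):
    """Walk from (qi, di) in direction step (+1 or -1) without gaps.

    Returns the number of letters to keep: the extent with the best score
    seen before the score dropped more than x_drop below that best.
    """
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(q) and 0 <= di < len(d):
        score += match if q[qi] == d[di] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif score < best - x_drop:
            break
        qi += step
        di += step
    return best_len


def ungapped_extend(q, d, q_start, d_start, k, **scoring):
    """Extend a length-k seed at (q_start, d_start) to the left and right."""
    left = extend_one_side(q, d, q_start - 1, d_start - 1, -1, **scoring)
    right = extend_one_side(q, d, q_start + k, d_start + k, +1, **scoring)
    return (q[q_start - left:q_start + k + right],
            d[d_start - left:d_start + k + right])


query = "ACGAAGTAAGGTCCAGT"
db = "AGCGTTAGGTCCTTCCC"        # assumed database string containing GGTC
print(ungapped_extend(query, db, 9, 7, 4))   # seed GGTC at query 9, db 7
```

With these assumed parameters the seed GGTC grows into the pair GTAAGGTCC / GTTAGGTCC, the same segments shown as output above. A different stopping rule or threshold could yield a shorter or longer extension.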

  22. Gapped BLAST [Figure: dot matrix with the query A C G A A G T A A G G T C C A G T on one axis and the DB letters C T G A T C C T G G A T T G C G A on the other] Added features: • Pairs of words can initiate alignment • Nearby alignments are merged • Extensions with gaps until score < T below best score so far Output:
      GTAAGGTCCAGT
      GTTAGGTC-AGT

  23. Example
      Query: gattacaccccgattacaccccgattaca (29 letters) [2 mins]
      Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences) 1,726,556 sequences; 8,074,398,388 total letters

      >gi|28570323|gb|AC108906.9| Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence
      Length = 144487

      Score = 34.2 bits (17), Expect = 4.5
      Identities = 20/21 (95%)
      Strand = Plus / Plus
      Query: 4      tacaccccgattacaccccga 24
                    ||||||| |||||||||||||
      Sbjct: 125138 tacacccagattacaccccga 125158

      Score = 34.2 bits (17), Expect = 4.5
      Identities = 20/21 (95%)
      Strand = Plus / Plus
      Query: 4      tacaccccgattacaccccga 24
                    ||||||| |||||||||||||
      Sbjct: 125104 tacacccagattacaccccga 125124

      >gi|28173089|gb|AC104321.7| Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence, complete sequence
      Length = 139823

      Score = 34.2 bits (17), Expect = 4.5
      Identities = 20/21 (95%)
      Strand = Plus / Plus
      Query: 4    tacaccccgattacaccccga 24
                  ||||||| |||||||||||||
      Sbjct: 3891 tacacccagattacaccccga 3911

      http://www.ncbi.nlm.nih.gov

  24. Efficient global alignment

  25. Global alignment with the chaining approach • Find local alignments • Chain them into a rough global map • Align regions in-between
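The chaining step can be sketched as a highest-scoring chain of local alignments that are consistent in both sequences. This is a simple quadratic-time illustration under assumed conventions, not the sparse dynamic programming used in practice: each anchor is a hypothetical tuple (x_start, x_end, y_start, y_end, score) with half-open, non-empty intervals, and the anchor values below are made up.

```python
def chain_anchors(anchors):
    """Return the highest-scoring chain of anchors, in order.

    An anchor a may precede b only if a ends before b starts in BOTH
    sequences, so the chain is a monotone path through the DP matrix.
    """
    anchors = sorted(anchors)                 # by x_start
    n = len(anchors)
    if n == 0:
        return []
    best = [a[4] for a in anchors]            # best chain score ending at i
    prev = [-1] * n
    for i in range(n):
        for j in range(i):
            a, b = anchors[j], anchors[i]
            if a[1] <= b[0] and a[3] <= b[2] and best[j] + b[4] > best[i]:
                best[i] = best[j] + b[4]
                prev[i] = j
    # Walk back from the best final anchor.
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(anchors[i])
        i = prev[i]
    return chain[::-1]


anchors = [(0, 10, 0, 10, 10), (12, 20, 30, 40, 8),
           (15, 25, 12, 22, 9), (30, 40, 25, 35, 7)]   # made-up anchors
print(chain_anchors(anchors))
```

Here the second anchor is left out because it is inconsistent with the later ones in y. The gaps between consecutive chained anchors are the regions that still need to be aligned.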

  26. LAGAN: 1. FIND Local Alignments • Find Local Alignments • Chain Local Alignments • Restricted DP Mike Brudno, Chuong B Do, et al.

  27. LAGAN: 2. CHAIN Local Alignments • Find Local Alignments • Chain Local Alignments • Restricted DP Mike Brudno, Chuong B Do, et al.

  28. LAGAN: 3. Restricted DP • Find Local Alignments • Chain Local Alignments • Restricted DP Mike Brudno, Chuong B Do, et al.

  29. Restricted DP (cont’d) • What if a box is too large? • Recursive application of LAGAN, more sensitive word search

  30. Multiple Alignment

  31. Scoring Function: Sum Of Pairs Definition: Induced pairwise alignment: a pairwise alignment induced by the multiple alignment. Example:
      x: AC-GCGG-C
      y: AC-GC-GAG
      z: GCCGC-GAG
      Induces:
      x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
      y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG

  32. Sum Of Pairs (cont’d) • The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments: S(m) = Σ_{k<l} s(m^k, m^l), where s(m^k, m^l) is the score of the induced alignment (k, l)
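The sum-of-pairs definition can be sketched directly: project the multiple alignment onto each pair of rows, drop columns where both rows have a gap, and add up the pairwise scores. The function names and the scoring values m = s = d = 1 are assumptions for illustration only.

```python
from itertools import combinations


def induced_pair(row_a, row_b):
    """Pairwise alignment induced by two rows: drop gap-vs-gap columns."""
    cols = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in cols), "".join(b for _, b in cols)


def pair_score(row_a, row_b, m=1, s=1, d=1):
    """Match +m, mismatch -s, gap -d per column (assumed linear scoring)."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score -= d
        elif a == b:
            score += m
        else:
            score -= s
    return score


def sum_of_pairs(rows, **scoring):
    """S(m): sum of the induced pairwise scores over all pairs k < l."""
    return sum(pair_score(*induced_pair(a, b), **scoring)
               for a, b in combinations(rows, 2))


rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]   # x, y, z
print(induced_pair(rows[0], rows[1]))
print(sum_of_pairs(rows))
```

For the three-sequence example x, y, z, the induced (x, y) alignment is ACGCGG-C over ACGC-GAG, and under the assumed unit scores the three pairwise scores are 2, -1 and 4.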

  33. Dynamic Programming for Multiple Alignment [Figure: DP lattice with axes x, y, z]
      AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
      AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

  34. Progressive Alignment • Multiple Alignment is NP-complete • Most used heuristic: Progressive Alignment Algorithm: Until all sequences are aligned: • Align two (multi-)sequences to each other, and treat the result as a new sequence Example: aligning AACTGTA with AATGTC gives
      AACTGTA
      AA-TGTC
      with “letters” (AA), (AA), (C-), (TT), (GG), (TT), (AC) Running Time: O(N × L^2), where N: #seqs, L: length of a seq
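The "treat the result as a new sequence" step can be sketched by turning an alignment into a list of column "letters" and scoring one column against another. This is only a fragment of progressive alignment, written under assumptions: the names `columns` and `column_pair_score` are hypothetical, and the column score used here is a sum over all cross pairs with unit match, mismatch and gap values.

```python
def columns(rows):
    """Turn aligned rows into a sequence of column 'letters' (tuples)."""
    return list(zip(*rows))


def column_pair_score(col_a, col_b, m=1, s=1, d=1):
    """Score aligning two columns: sum over every cross pair of letters."""
    total = 0
    for a in col_a:
        for b in col_b:
            if a == "-" and b == "-":
                continue            # gap vs gap contributes nothing
            if a == "-" or b == "-":
                total -= d
            elif a == b:
                total += m
            else:
                total -= s
    return total


profile = columns(["AACTGTA", "AA-TGTC"])
print(profile)
print(column_pair_score(("C", "-"), ("C",)))
```

The first print shows the seven column letters from the example, starting with ('A', 'A') and ending with ('A', 'C'). Running the usual pairwise DP over such column letters, with a column score in place of s(xi, yj), is one way to align two (multi-)sequences.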

  35. MLAGAN: Progressive Alignment Given N sequences, phylogenetic tree [Figure: tree over Human, Baboon, Mouse, Rat] Align pairwise, in order of the tree (LAGAN) • With needed generalizations for multi-anchoring & scoring edit distance

  36. Evolution at the DNA level • SEQUENCE EDITS (Mutation, Deletion): …ACGGTGCAGTTACCA… becomes …AC----CAGTCCACCA… • REARRANGEMENTS: Inversion, Translocation, Duplication

  37. Local & Global Alignment [Figure: the same sequence pair shown twice, once aligned globally and once locally]
      AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
      AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

  38. Glocal Alignment Problem Find least cost transformation of one sequence into another using shuffle operations: • Sequence edits • Inversions • Translocations • Duplications • Combinations of above
      AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
      AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

  39. SLAGAN: 1. Find Local Alignments • Find Local Alignments • Build Rough Homology Map • Globally Align Consistent Parts

  40. SLAGAN: 2. Build Homology Map • Find Local Alignments • Build Rough Homology Map • Globally Align Consistent Parts

  41. SLAGAN: 2. Build Homology Map Chain using Sparse Dynamic Programming • Penalties: a) regular b) translocation c) inversion d) inverted translocation

  42. SLAGAN: 2. Build Homology Map • Find Local Alignments • Build Rough Homology Map • Globally Align Consistent Parts

  43. SLAGAN: 3. Global Alignment • Find Local Alignments • Build Rough Homology Map • Globally Align Consistent Parts

  44. SLAGAN Example: Chromosome 20 • Human Chromosome 20 versus Mouse Chromosome 2 • 270 Segments of conserved synteny • 70 Inversions

  45. SLAGAN example: HOX cluster • 10 paralogous genes • Conserved order in Human/Mouse/Rat

  46. SLAGAN example: HOX cluster • 10 paralogous genes • Conserved order in Human/Mouse/Rat

  47. Whole-genome alignment with SLAGAN • Two-step Shuffle • Shuffle for large-scale synteny map • Shuffle each syntenic region for microrearrangements

  48. The ENCODE Project

  49. ENCODE regions shuffled Hum/Rat Hum/Mus