Today’s Topics Computer Science Enabled by Computing: Decoding the Human Genome Upcoming Review for Final Exam
Enabled by Computers • Things we now take for granted: possible only because of computing • Several examples (most mentioned before) • Modern Camera Zoom Lens • Certain Space Missions: e.g., “slingshot” paths • Medical Imaging • CAT scans (Nobel Prize!) • Other imaging procedures: PET, MRI, … • Designing and Manufacturing a modern Computer • Communications (error checking, compression, …) • Decoding the Human Genome
The Human Genome • Each cell contains • Nucleus • The human Nucleus contains • 24 distinct Chromosomes (22 autosomes plus X and Y) • Chromosomes (composed of DNA), collectively include • 20-25 thousand Genes • E.g., Chromosome 5 includes 5923 genes • Chromosomes composed of, collectively, • 3.5 Gbp (3,500,000,000 base pairs) • Good Diagram of DNA • http://www.accessexcellence.org/RC/VL/GG/dna2.html
The Human Genome • Makeup: The Double Helix - DNA • 3.5 Gbp • (how big a number can an int hold?) • Bases denoted by letters A, C, G, T • Adenine, Cytosine, Guanine, Thymine • Each strand of DNA (in each of our cells) is approx. 6 feet long! • (packed into a volume approx. 0.0004 inches across) • Letters printed as a string 1 mm apart would be almost 1900 miles long • Good Diagram of DNA • http://www.accessexcellence.org/RC/VL/GG/dna2.html
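The slide's question about int can be checked directly. The sketch below assumes Java, as in the rest of the course examples; the class and method names are illustrative only.

```java
// Sketch: can a Java int count 3.5 billion base pairs?
class GenomeSize {
    // Base-pair count from the slide; the L suffix makes this a long literal.
    static final long BASE_PAIRS = 3_500_000_000L;

    // True if the value lies within the range of a 32-bit signed int.
    static boolean fitsInInt(long value) {
        return value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println("Largest int:  " + Integer.MAX_VALUE);
        System.out.println("Base pairs:   " + BASE_PAIRS);
        System.out.println("Fits in int?  " + fitsInInt(BASE_PAIRS));
        System.out.println("Largest long: " + Long.MAX_VALUE);
    }
}
```

An int tops out at 2,147,483,647, so a position counter for the whole genome needs a long.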
How to Read (Sequence) DNA? • Look at the following strings • Assume we didn’t know the alphabet • Can we reconstruct the alphabet from these fragments? A AB ABCDE ABCDEF BCDEF CDEFGH FGHIJ GHIJK GHIJKL IJKLMN KLMNO LMNOP MNOPQR OPQRST PQRST QRSTU STUVWX UVWXY UVWXYZ VWXYZ YZ Z • If we assume each letter is used only once, can match on a single character: ABCDEF + FGHIJ yields ABCDEFGHIJ • If uncertain whether letters repeat, may require a longer overlap: IJKLMN + MNOPQR yields IJKLMNOPQR • Can reconstruct the complete alphabet from the fragments
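The two merges on this slide can be sketched in Java. This is a minimal illustration, not the method a real assembler uses; the names `overlap`, `merge` and the `minOverlap` parameter are assumptions introduced here.

```java
// Sketch: merge two fragments when the end of one matches the start of the other.
class FragmentMerge {
    // Length of the longest suffix of left that is also a prefix of right.
    static int overlap(String left, String right) {
        int max = Math.min(left.length(), right.length());
        for (int len = max; len > 0; len--) {
            if (left.endsWith(right.substring(0, len))) {
                return len;
            }
        }
        return 0;
    }

    // Returns the merged string, or null if the overlap is shorter than minOverlap.
    static String merge(String left, String right, int minOverlap) {
        int len = overlap(left, right);
        if (len < minOverlap) {
            return null;
        }
        return left + right.substring(len);
    }

    public static void main(String[] args) {
        // One shared letter is enough if every letter occurs only once.
        System.out.println(merge("ABCDEF", "FGHIJ", 1));
        // Demanding two shared letters is safer when letters might repeat.
        System.out.println(merge("IJKLMN", "MNOPQR", 2));
        // The first pair is rejected under the stricter rule.
        System.out.println(merge("ABCDEF", "FGHIJ", 2));
    }
}
```

Raising `minOverlap` trades fewer false merges for more fragments left unjoined.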
Reconstruction from DNA fragments • Problem is more difficult • Only 4 characters: A C G T • All kinds of repetition in the sequence • Need larger overlap – how large? • Depends on kind of repetition we find • Look at example with a sequence much longer than alphabet • Fragments shown come from chopping up three identical copies of the sequence • Breaks at “random” points
Reconstruction from DNA sequence • Look at the following fragments (from 3 originals) AAGATGGTTCATTCT ACGGGCGGTGTTGGAGCAGA AGAGCT AGGTATATTGAGGAAG ATTGT CAAGTAAAAGGA CATTGTCAAGTAAAAG CCAACTAGTCAGCACTAC CCAACTAGTCAGCACTACAT CGGGCGGTGTTGGAGC CTGCAATTTCTG GAAGGTATAT GACTTGGGTA GCTCTGCAATTTCTG GCTGGGG GCTGGGGA GCTGGGGACGGGCGGTGT TAGTCAGCACTA TCTGCAATTTCTGCCAAC TGAGGAAGAAGA TGAGGAAGAAGATGGTTCA TGGAGCAGAGC TGGTTCATTCTGACTTGGGTA TGTCAAGTAAAAGGAAGGTATAT TTCTGACTTGGGTA • Identify overlaps to reconstruct longer pieces, e.g. TCTGCAATTTCTGCCAACTAGTCAGCACTACAT and AGGTATATTGAGGAAGAAGATGGTTCA • Eventually can get the original sequence GCTGGGGACGGGCGGTGTTGGAGCAGAGCTCTGCAATTTCTGCCAACTAGTCAGCACTACATTGTCAAGTAAAAGGAAGGTATATTGAGGAAGAAGATGGTTCATTCTGACTTGGGTA
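A quick sanity check on the slide's answer: every fragment should occur somewhere in the reconstructed sequence. The sketch below only verifies the result and does not perform the assembly itself. The class and method names are illustrative, and the sequence is written as two halves joined with `+` only to keep the lines short.

```java
// Sketch: verify that each fragment from the slide occurs in the reconstructed sequence.
class FragmentCheck {
    static final String SEQUENCE =
        "GCTGGGGACGGGCGGTGTTGGAGCAGAGCTCTGCAATTTCTGCCAACTAGTCAGCACTA"
        + "CATTGTCAAGTAAAAGGAAGGTATATTGAGGAAGAAGATGGTTCATTCTGACTTGGGTA";

    static final String[] FRAGMENTS = {
        "AAGATGGTTCATTCT", "ACGGGCGGTGTTGGAGCAGA", "AGAGCT", "AGGTATATTGAGGAAG",
        "ATTGT", "CAAGTAAAAGGA", "CATTGTCAAGTAAAAG", "CCAACTAGTCAGCACTAC",
        "CCAACTAGTCAGCACTACAT", "CGGGCGGTGTTGGAGC", "CTGCAATTTCTG", "GAAGGTATAT",
        "GACTTGGGTA", "GCTCTGCAATTTCTG", "GCTGGGG", "GCTGGGGA", "GCTGGGGACGGGCGGTGT",
        "TAGTCAGCACTA", "TCTGCAATTTCTGCCAAC", "TGAGGAAGAAGA", "TGAGGAAGAAGATGGTTCA",
        "TGGAGCAGAGC", "TGGTTCATTCTGACTTGGGTA", "TGTCAAGTAAAAGGAAGGTATAT",
        "TTCTGACTTGGGTA"
    };

    // Counts the fragments that do not occur anywhere in the sequence.
    static int countMissing(String sequence, String[] fragments) {
        int missing = 0;
        for (String fragment : fragments) {
            if (sequence.indexOf(fragment) < 0) {
                missing++;
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        System.out.println(SEQUENCE);
        // Print each fragment indented to the position of its first occurrence.
        for (String fragment : FRAGMENTS) {
            int pos = SEQUENCE.indexOf(fragment);
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < pos; i++) {
                line.append(' ');
            }
            line.append(pos < 0 ? "NOT FOUND: " + fragment : fragment);
            System.out.println(line);
        }
        System.out.println("Missing fragments: " + countMissing(SEQUENCE, FRAGMENTS));
    }
}
```

The indented printout makes the overlaps between neighboring fragments visible at a glance.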
The Real World • Have looked at toy problems: back to reality • String lengths are huge (3 * 10^9) • Why the obsession with fragments? • If we can sequence (read) a fragment, why not just do the whole thing? • Automatic sequencers are available • Limited to lengths on the order of 1000 bases from an end • (Can sequence a whole strand if short enough) • Thus the use of the Shotgun Method of Sequencing
Shotgun Sequencing • Strand much too long for automatic sequencing • Randomly cut them into small pieces (~5 Kbp) • Make many identical copies of these strands • Each of these small pieces is sequenced to produce reads • What’s left is a Data Processing Problem • Need to reconstruct original DNA strand by matching ends • If random reads match nicely, work can be completed • There may be problems!
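The random cutting step can be simulated on a toy strand. This is a sketch under simplifying assumptions: a short string stands in for the strand, the cut points are uniformly random, and the names `Shotgun` and `shear` are made up for illustration. A real shearing process is not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch: cut copies of a strand at random points, as in shotgun sequencing.
class Shotgun {
    // Cuts one copy of the strand at up to `cuts` random interior positions
    // and returns the pieces in order. The strand needs at least 2 characters.
    static List<String> shear(String strand, int cuts, Random rng) {
        boolean[] cutAt = new boolean[strand.length()];
        for (int i = 0; i < cuts; i++) {
            // Interior positions only (1 .. length-1); a repeated pick has no extra effect.
            cutAt[1 + rng.nextInt(strand.length() - 1)] = true;
        }
        List<String> pieces = new ArrayList<>();
        int start = 0;
        for (int pos = 1; pos < strand.length(); pos++) {
            if (cutAt[pos]) {
                pieces.add(strand.substring(start, pos));
                start = pos;
            }
        }
        pieces.add(strand.substring(start));
        return pieces;
    }

    public static void main(String[] args) {
        String strand = "GCTGGGGACGGGCGGTGTTGGAGCAGAGCTCTGCAATTTCTG";
        Random rng = new Random();
        // Three identical copies, each broken at different random points.
        for (int copy = 1; copy <= 3; copy++) {
            System.out.println("Copy " + copy + ": " + shear(strand, 4, rng));
        }
    }
}
```

Because each copy breaks at different points, pieces from one copy overlap the break points of another, which is what makes reassembly possible.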
Shotgun Problems • Gaps • Due to random nature of shearing strands, there may be gaps in the sequence • (Maybe all pieces broke at same place) • May need to repeat the process for some sub-areas to fill gaps • Repeats • Long repeats may make matching ambiguous • Need extra long fragments with ends sequenced • Can tell how many repeats “fit” in • Also can bridge gaps that have resisted sequencing • Sequencing Errors • Automatic sequencing is error prone – need multiple passes
The Computations Required • Appears to be a Simple String Matching Problem • Remember the int indexOf(String) method: String a, b; ... /* input or compute data */ int pos = a.indexOf(b); • pos tells where in a the string b first occurs (-1 if b does not occur in a) • Combined with use of String substring(int, int), can check for overlap in the ends of strings • Effectively “slide” ends over each other for a match
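The "slide the ends over each other" idea can be sketched with exactly the two methods named on the slide. This is a naive version for illustration; the names `EndOverlap` and `overlap` are assumptions, and the test strings are fragments from the earlier slide.

```java
// Sketch: find the end overlap of two reads using substring and indexOf.
class EndOverlap {
    // Returns the length of the longest tail of a that is also the head of b.
    static int overlap(String a, String b) {
        int max = Math.min(a.length(), b.length());
        for (int len = max; len >= 1; len--) {
            String tail = a.substring(a.length() - len, a.length());
            // indexOf returning 0 means b begins with tail.
            // (b.startsWith(tail) is equivalent and avoids scanning the rest of b.)
            if (b.indexOf(tail) == 0) {
                return len;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // Plain containment: where does one fragment sit inside another?
        String big = "CCAACTAGTCAGCACTACAT";
        String small = "TAGTCAGCACTA";
        System.out.println("pos = " + big.indexOf(small));

        // End overlap: tail of a against head of b.
        String a = "AGGTATATTGAGGAAG";
        String b = "TGAGGAAGAAGATGGTTCA";
        int len = overlap(a, b);
        System.out.println("overlap = " + len);
        System.out.println("merged  = " + a + b.substring(len));
    }
}
```

Trying the longest tail first means the function reports the largest overlap, at the cost of up to one substring comparison per possible length.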
The Computations Required • Seems simple enough in principle, but… • Large numbers involved make task daunting • E.g., must compare each read to every other read • For N reads, involves N^2 compares • Wouldn’t seem bad except when we calculate N • N = 3*10^9 / 10^3 = 3*10^6 (divide by approx. size of read) • N^2 is ~ 9*10^12 compares • That’s only for 1 times coverage (need more!) • Each compare also involves up to L^2 char compares! (where L is the length of a read)
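The arithmetic on this slide, worked out in Java. The figures are the slide's round numbers; the constant and method names are illustrative. `Math.multiplyExact` is used so that an overflow would raise an exception rather than silently wrap around.

```java
// Sketch: how many comparisons does the naive all-pairs approach need?
class CompareCount {
    static final long GENOME_LENGTH = 3_000_000_000L; // 3 * 10^9, from the slide
    static final long READ_LENGTH = 1_000L;           // 10^3, approximate read size

    // Number of reads at 1x coverage.
    static long reads() {
        return GENOME_LENGTH / READ_LENGTH;
    }

    // Every read compared against every read.
    static long readCompares() {
        return Math.multiplyExact(reads(), reads());
    }

    // Worst case if each read compare costs READ_LENGTH^2 character compares.
    static long worstCaseCharCompares() {
        long perPair = Math.multiplyExact(READ_LENGTH, READ_LENGTH);
        return Math.multiplyExact(readCompares(), perPair);
    }

    public static void main(String[] args) {
        System.out.println("Reads (N):            " + reads());
        System.out.println("Read compares (N^2):  " + readCompares());
        System.out.println("Worst-case char cmps: " + worstCaseCharCompares());
        System.out.println("Largest long:         " + Long.MAX_VALUE);
    }
}
```

The worst-case total, 9 * 10^18, only just fits in a long, and that is still at 1 times coverage.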
The Computations Required • Previous analysis is naïve • + Can do better by grouping things • Like matching “words” rather than “letters” • - Other problems not considered make things much more complex • Whole process is not in an error-free environment • Strings that agree at, say, 99% of positions may have to be considered a match • Many good computer scientists and mathematicians involved
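Both ideas on this slide can be sketched briefly: index the reads by fixed-length "words" so that only reads sharing a word are compared, and score a candidate match by the fraction of agreeing positions instead of demanding exact equality. The word length, the names `WordIndex`, `build` and `identity`, and the use of a hash map are all assumptions made for this illustration, not a description of what the genome projects actually ran.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: match on "words" instead of letters, and tolerate a few mismatches.
class WordIndex {
    // Maps every k-letter word to the indices of the reads that contain it.
    static Map<String, List<Integer>> build(List<String> reads, int k) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int r = 0; r < reads.size(); r++) {
            String read = reads.get(r);
            for (int i = 0; i + k <= read.length(); i++) {
                String word = read.substring(i, i + k);
                List<Integer> hits = index.computeIfAbsent(word, w -> new ArrayList<>());
                // Record each read at most once per word.
                if (hits.isEmpty() || hits.get(hits.size() - 1) != r) {
                    hits.add(r);
                }
            }
        }
        return index;
    }

    // Fraction of positions at which two equal-length strings agree.
    static double identity(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("strings must have equal length");
        }
        if (a.isEmpty()) {
            return 1.0;
        }
        int same = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) == b.charAt(i)) {
                same++;
            }
        }
        return (double) same / a.length();
    }

    public static void main(String[] args) {
        List<String> reads = Arrays.asList(
            "AGGTATATTGAGGAAG", "TGAGGAAGAAGATGGTTCA", "GCTGGGG");
        Map<String, List<Integer>> index = build(reads, 8);
        // Only reads that share a word need a detailed comparison.
        System.out.println("Reads containing TGAGGAAG: " + index.get("TGAGGAAG"));
        // One mismatch in ten positions.
        System.out.println("Identity: " + identity("ACGTACGTAC", "ACGTACGTAA"));
    }
}
```

With an index like this, a read is compared only against reads that share at least one word with it, rather than against all N reads.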
Interesting Competition • BAC to BAC Sequencing • Public Human Genome Project (1988 - ) • Many cooperating laboratories, worldwide • Started much earlier than competition • Started with more primitive technologies • Top down approach using bacterial artificial chromosomes (BACs) • Builds framework (scaffolding) first • Then fills in details
Interesting Competition • Whole Genome Shotgun Sequencing • Celera Genomics (private: Craig Venter, Eugene Myers) • Later start (1998 - ), “finished” at same time • Benefited from much improved technology • Sequencers much better • Longer strands, better accuracy • Faster computers • Shotgun from the top down • Use three sizes of fragments (1 Mbp, 50 Kbp, 10 Kbp) • Can use longer pieces to deal with repeats • Everything done in parallel.
Interesting Competition • Whole Genome Shotgun method appears to have won • Much controversy at first • Hybrid methods • Job just beginning! • Need to find out what in the genome affects what in practice • Much is labeled “junk” DNA because it doesn’t seem to affect anything. • Is that the last word?