
Zipf’s monkeys

Zipf’s monkeys: observations from real and random genomes. Environmental genomics: when an organism dies, it decomposes and the DNA in its cells degenerates into smaller and smaller fragments. Given a collection of DNA fragments (i.e. reads), figure out which organisms they came from.

zola

Presentation Transcript


  1. Zipf’s monkeys: Observations from real and random genomes

  2. Environmental genomics • When an organism dies, it decomposes and the DNA in its cells degenerates into smaller and smaller fragments • Given a collection of DNA fragments (i.e. reads), figure out which organisms they came from

  3–6. The data AGTCGATGCAGTCAGCATACGATCAGACTGCAGCTTATATCGCATCGCGCATGATTACTACTGCGCGATCAGCATCATATACGACTACGGCAGATCATCATCGCGCATCAATCAGTG… (sequence built up over four slides)

  7–13. The data [animation: the sequence is drawn as one long line, then as many lines, which are progressively broken into fragments of varying length] How can we reconstruct the original genomes?

  14. Approaches • Jigsaw puzzle • Find common subsequences • Align overlapping regions • Statistics • Compute histograms of oligonucleotides (n-grams) • Match to distributions for known organisms • Use rare polymers to select anchor points (BLAST-like)
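
The statistical approach above can be sketched as follows. This is a minimal illustration, not the method used in the talk: the function names, the Euclidean distance, and the default n = 4 are all assumptions chosen for brevity.

```python
from collections import Counter
import math


def ngram_profile(seq, n):
    """Relative frequency of every overlapping n-gram in seq."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def profile_distance(p, q):
    """Euclidean distance between two n-gram profiles (an assumed metric)."""
    grams = set(p) | set(q)
    return math.sqrt(sum((p.get(g, 0.0) - q.get(g, 0.0)) ** 2 for g in grams))


def closest_reference(read, references, n=4):
    """Name of the reference sequence whose n-gram profile is nearest to the read's."""
    read_profile = ngram_profile(read, n)
    return min(
        references,
        key=lambda name: profile_distance(read_profile, ngram_profile(references[name], n)),
    )
```

With real data, `references` would map organism names to known genome sequences; short reads give noisy profiles, so small n is usually the practical choice.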

  15. Compression distance • Conjecture: a lossless, dictionary-based sequence compressor built for a genome compresses one of its own subsequences better than would the compressor built for another genome • (Normalized) universal compression distance: $\mathrm{UCD}(x,y) = \dfrac{\max[\,C(xy) - C(x),\; C(yx) - C(y)\,]}{\max[\,C(x),\; C(y)\,]}$
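
A small sketch of the distance, assuming C(s) denotes the compressed length of s in bytes and using zlib (DEFLATE) as a stand-in for the dictionary-based compressor; the talk does not say which compressor was used. The sequences `x` and `y` are synthetic random DNA generated only for illustration.

```python
import random
import zlib


def C(data: bytes) -> int:
    """Compressed length in bytes (zlib is an assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ucd(x: bytes, y: bytes) -> float:
    """Normalized compression distance as defined on the slide."""
    cx, cy = C(x), C(y)
    return max(C(x + y) - cx, C(y + x) - cy) / max(cx, cy)


# Synthetic example data: two unrelated random sequences.
rng = random.Random(0)
x = "".join(rng.choice("ACGT") for _ in range(2000)).encode()
y = "".join(rng.choice("ACGT") for _ in range(2000)).encode()

print("UCD(x, x) =", round(ucd(x, x), 3))  # a sequence against itself
print("UCD(x, y) =", round(ucd(x, y), 3))  # two unrelated sequences
```

A sequence should sit much closer to itself than to an unrelated one, because the second copy is encoded almost entirely as back-references to the first.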

  16. CM clustering • Compression Maximization • Adopt compression into a kind of EM clustering • Partition reads randomly into [say] two groups • For each read, compute compression distance to each group (à la leave-one-out) • Reassign read to closest group • Iterate until some stopping criterion • Apply recursively to each group
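
The loop described above might look like the sketch below. Several things here are assumptions rather than the authors' implementation: zlib stands in for the compressor, the "distance" to a group is simplified to the extra compressed bytes needed to append the read to that group, iteration stops when no read moves (or after `max_iter` passes), and the recursive step is left out.

```python
import random
import zlib


def compressed_size(reads):
    """Compressed length of the reads concatenated in order."""
    return len(zlib.compress("".join(reads).encode(), 9))


def cost_to_join(read, group):
    """Extra compressed bytes needed to append `read` to `group`."""
    return compressed_size(group + [read]) - compressed_size(group)


def cm_cluster(reads, k=2, max_iter=20, seed=0):
    """Assign each read to one of k groups by iterative reassignment."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in reads]  # random initial partition
    for _ in range(max_iter):
        moved = 0
        for i, read in enumerate(reads):
            # Leave-one-out: rebuild every group without read i.
            groups = [
                [r for j, r in enumerate(reads) if labels[j] == g and j != i]
                for g in range(k)
            ]
            best = min(range(k), key=lambda g: cost_to_join(read, groups[g]))
            if best != labels[i]:
                labels[i] = best
                moved += 1
        if moved == 0:  # assumed stopping criterion: a full pass with no changes
            break
    return labels
```

Each pass recompresses every group once per read, so this form is only practical for small read sets; an incremental compressor would be needed at scale.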

  17. Experiment • groupA: DG2 NM1 MR2 DE3 AD5 AF1 DE1 AF3 DG4 AF5 CA1 MR4 CA3 DE4 CA5 NM4 CS2 AD2 CS4 MR3 • groupB: AF2 DE2 AD4 CA4 DE5 DG1 AD1 NM3 AF4 DG5 MR1 AD3 CS5 CA2 MR5 CS3 NM2 DG3 CS1 NM5

  18. Experiment: result • groupA: AD1 AD2 AD3 AD4 AD5 AF1 AF2 AF3 AF4 AF5 CA1 CA2 NM1 NM2 NM3 CS1 CS2 CS3 CS4 CS4 • groupB: DE1 DE2 DE3 DE4 DE5 DG1 DG2 DG3 DG4 DG5 MR1 MR2 MR3 MR4 MR5 CA3 CA4 CA5 NM4 NM5 • stop when µCD > 70

  19. Reassembly • Can the LZ trie be used to reassemble reads into genomes? • The LZ trie is a regular grammar of the set of reads • A long phrase is an extension of a shorter phrase • The start of one read is the end of another • The part of a long phrase that is the suffix after a shorter phrase (i.e. the difference between the short phrase and the long one) is the prefix of another phrase
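
To make the phrase structure concrete, here is a minimal LZ78-style parse that builds the trie. It assumes "LZ trie" means the LZ78 phrase dictionary; the function name and the `(parent_id, symbol) -> child_id` representation are illustrative choices.

```python
def lz78_trie(seq):
    """Parse seq into LZ78 phrases.

    Returns (phrases, trie), where trie maps (parent_id, symbol) -> child_id
    and node 0 is the root (the empty phrase).
    """
    trie = {}
    phrases = []
    node, start = 0, 0
    for i, sym in enumerate(seq):
        key = (node, sym)
        if key in trie:
            node = trie[key]             # still inside a known phrase: extend it
        else:
            trie[key] = len(trie) + 1    # new phrase = known phrase + one symbol
            phrases.append(seq[start:i + 1])
            node, start = 0, i + 1       # restart from the root
    if start < len(seq):
        phrases.append(seq[start:])      # trailing phrase already in the trie
    return phrases, trie


phrases, trie = lz78_trie("ABABABA")
print(phrases)
```

Every phrase is an earlier phrase extended by one symbol, which is the "long phrase is an extension of a shorter phrase" property the slide relies on.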

  20. Along the way … • While setting up the initial experiments, we started to ponder things that might go wrong • Different genomes might share many common subsequences, which would confound the clustering result • SNPs and missing fragments might thwart compression • The compression model might take too long to become useful (paucity of data) • What is the underlying principle being leveraged?

  21. Information theory • A linear sequence of symbols intended for communication exhibits a balance between randomness and regularity • If a sequence is entirely random, it is noise • If a sequence is entirely predictable, it is redundant • Patterns provide means for recognition (interpretation) and irregularities provide for novelty (information) • Compression attempts to minimize redundancy
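
The two extremes can be illustrated with a general-purpose compressor. This sketch assumes zlib as the compressor and uses synthetic sequences; the names `noise` and `repeat` are illustrative.

```python
import random
import zlib


def compression_ratio(seq: str) -> float:
    """Compressed size divided by raw size (smaller means more redundancy)."""
    raw = seq.encode()
    return len(zlib.compress(raw, 9)) / len(raw)


rng = random.Random(0)
noise = "".join(rng.choice("ACGT") for _ in range(10_000))  # no regularity
repeat = "ACGT" * 2_500                                     # fully predictable

print("random    :", round(compression_ratio(noise), 3))
print("repetitive:", round(compression_ratio(repeat), 3))
```

The random sequence cannot go much below 2 bits per symbol (a ratio of about 0.25 for one-byte characters), while the repetitive one collapses to almost nothing.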

  22. Information theory • Human languages exhibit non-uniform distributions over letters, phonemes, words, etc

  23. Brown Corpus word frequencies

  24. DNA primary sequences • Four nucleotide symbols: A, C, G, T • Much of a genome codes nothing, and the rest is genes • A gene is copied (transcription) off the genome, and the copy is used to build a protein (translation) • Three consecutive nucleotides form a codon, which codes for a specific amino acid • A sequence of amino acids (residues) constitutes a protein • Proteins are where structure definitely exists

  25. DNA primary sequences • $4^3 = 64$ possible codons • 20 possible amino acids • Many amino acids have more than one codon
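
The counts above can be checked directly against the standard genetic code. The compact 64-character string below is the usual one-letter encoding of the standard table with bases ordered T, C, A, G and `*` marking stop codons; the variable names are illustrative.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

codons_for = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    codons_for[aa].append(codon)

print(len(CODON_TABLE), "codons")
print(len(set(AMINO) - {"*"}), "amino acids")
print("stop codons:", sorted(codons_for["*"]))
print("leucine codons:", sorted(codons_for["L"]))
```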

  26. Genomic regularities • Most genes start with ATG and end with a stop codon (TAG, TAA, and TGA most frequent) • TATA-box in regulatory region (for binding) • GC-rich regions (for stability) But • Frequency of individual nucleotides or residues is not so interesting (no syntax) • Tertiary structure of proteins is The Thing: the interactions of amino acid residues are paramount

  27. Genomic regularities • Do genomes have sequential syntactic structures?

  28. Codon frequencies in real DNA

  29. 4-gram frequencies in real DNA

  30. 5-gram frequencies in real DNA

  31. 6-gram frequencies in real DNA

  32. 6-gram probabilities in real DNA

  33. Problems from paucity of data • Takes time for an LZ compression trie to become saturated with characteristic phrases • Experimental data somewhat small, thus interesting sequences may not manifest quickly enough • Prime the trie by prepending some random DNA to the data prior to computing CD • How much? How about a million?
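
The priming idea could be sketched as below, under stated assumptions: zlib stands in for the LZ compressor, and `random_dna` and `primed_cost` are hypothetical helpers. One caveat specific to this stand-in is that DEFLATE only looks back 32 KiB, so with zlib a primer of a million bases would be mostly out of reach; a compressor with a larger dictionary would be needed to test that size.

```python
import random
import zlib


def random_dna(length, seed=None):
    """Uniformly random sequence over A, C, G, T."""
    rng = random.Random(seed)
    return "".join(rng.choices("ACGT", k=length))


def primed_cost(read, corpus, primer):
    """Extra compressed bytes for `read` once the compressor has seen primer + corpus."""
    context = (primer + corpus).encode()
    with_read = len(zlib.compress(context + read.encode(), 9))
    without_read = len(zlib.compress(context, 9))
    return with_read - without_read


primer = random_dna(5_000, seed=1)   # assumed primer size, far below a million
read = random_dna(2_000, seed=2)
print("cost of read after priming:", primed_cost(read, "", primer))
```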

  34. bigram frequency in random DNA

  35. codon frequency in random DNA

  36. 10-gram frequency in random DNA

  37. 4-gram frequency in random DNA

  38. 5-gram frequency in random DNA

  39. 5-gram frequency in random DNA

  40. 7-gram frequency in random DNA

  41. 8-gram frequency in random DNA

  42. 9-gram frequency in random DNA

  43. Miller’s monkey • 19th century – Vilfredo Pareto showed that power-law distributions abound in social, scientific, economic and geophysical data • 1949 – G.K. Zipf argued that power-law distributions are an interesting linguistic phenomenon • 1957 – G.A. Miller argued that the effect results from the random placement of spaces, and that a monkey at a typewriter would produce ‘language’ with a Zipfian distribution • 1968 – David Howes argued that Miller’s proof is flawed • 2004 – Michael Mitzenmacher demonstrated the connection between power-law distributions and log-normal distributions
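
Miller's thought experiment is easy to simulate. The sketch below assumes a five-letter keyboard and a 20% chance of hitting the space bar; both parameters, and the function names, are arbitrary illustrative choices.

```python
import random
from collections import Counter


def monkey_text(n_keys, letters="abcde", p_space=0.2, seed=0):
    """Random keystrokes: a space with probability p_space, else a uniform letter."""
    rng = random.Random(seed)
    keys = [
        " " if rng.random() < p_space else rng.choice(letters)
        for _ in range(n_keys)
    ]
    return "".join(keys)


def rank_frequency(text):
    """Word counts sorted from most to least frequent."""
    counts = Counter(text.split())
    return [count for _, count in counts.most_common()]


freqs = rank_frequency(monkey_text(100_000))
for rank in (1, 10, 100, 1000):
    if rank <= len(freqs):
        print(rank, freqs[rank - 1])
```

Plotting `freqs` against rank on log-log axes should give a roughly straight, stepped line, which is the Zipf-like behaviour Miller attributed to chance alone.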

  44. conclusion • Probably nothing!
