
Looking at Whole Genomes: Frequency of Occurrence of Oligonucleotides

Looking at Whole Genomes: Frequency of Occurrence of Oligonucleotides. Lecture I Winter School on Modern Biophysics National Taiwan University December 16-18, 2002 HC Lee Dept Physics & Dept Life Science National Central University. The Book of Life. Growth of sequenced genome data

Presentation Transcript


  1. Looking at Whole Genomes: Frequency of Occurrence of Oligonucleotides Lecture I Winter School on Modern Biophysics National Taiwan University December 16-18, 2002 HC Lee Dept Physics & Dept Life Science National Central University

  2. The Book of Life

  3. Growth of sequenced genome data • Genome data exploded after 1995 (GenBank: as of 2002 January 13) [Figure: millions of sequences vs. year] CBL@NCU

  4. The Human Genome • 3 billion bps • Human has 23 pairs of chromosomes, of 24 distinct types (22 autosomes, X and Y) • First draft of the human genome published February 15-16, 2001

  5. First working draft of Human Genome • Sequencing of first working draft of Human Genome published in February 2001 • Nature, 409, February 15, 860-921 (2001) • Science, 291, February 16, 1304-1351 (2001)

  6. Genome – Book of Life written in four letters • DNA – a polymer of nucleotides • Nucleotide – backbone + base • Four types of bases: A, C, G, T (the four letters) • Gene – coded sequence of bases • Genome – set of all genes; set of all chromosomes • Chromosome – packaged pair of DNA strands with double-helix structure

  7. Central Dogma • Genome (DNA): genetic information (genes) • Transcription (by RNA polymerase) and translation (by ribosomes): convert genes (nucleotide sequences) into proteins (amino acid sequences) • Proteins: expression and function

  8. New way to do Life Science Research • in vivo (in the living organism) • in vitro (in the test tube) • in silico (in the computer)

  9. Frequency of occurrence of oligonucleotides A simple first look at whole genomes

  10. Oligo (or k-mer) Frequency • Oligonucleotide (oligo): short sequence of k nucleotides (k ~ 2-30); a k-mer • There are 4^k different kinds of k-mers • Frequencies of occurrence of all k-mers in a sequence can be obtained by reading with a “sliding window” • Complete set of k-mer frequencies characterizes a DNA sequence • Very fast to compute; time scales linearly with sequence length • For multiple sequences, scales with the number of sequences • Related to alignment

  11. Counting k-mers with Sliding Window • Each time the window reads an oligo, increment its count, e.g. N(GTTACCC) = N(GTTACCC) + 1 • Sum over all N(oligo) = sequence length (for a circular sequence) • Sequence is represented by the set {N(oligo) | all oligos} • Or: for each k, sequence is represented by a 4^k-component vector
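
The sliding-window count described on slides 10 and 11 can be sketched in a few lines of Python. This is only an illustration, not the lecturer's code: the function name `count_kmers` and the `circular` flag are hypothetical, and the toy sequence is made up.

```python
from collections import Counter

def count_kmers(seq: str, k: int, circular: bool = True) -> Counter:
    """Count every k-mer read by a window of width k that slides one base at a time."""
    seq = seq.upper()
    if circular:
        # Wrap the first k-1 bases around so that every base starts one window.
        extended = seq + seq[:k - 1]
        n_windows = len(seq)
    else:
        extended = seq
        n_windows = max(len(seq) - k + 1, 0)
    counts = Counter()
    for i in range(n_windows):
        counts[extended[i:i + k]] += 1   # N(oligo) = N(oligo) + 1
    return counts

# Toy circular sequence (hypothetical data): two copies of GTTACCC.
counts = count_kmers("GTTACCCGTTACCC", k=7)
print(counts["GTTACCC"])        # occurrences of one 7-mer
print(sum(counts.values()))     # equals the sequence length for a circular sequence
```

The loop touches each base once, which is why the cost grows linearly with sequence length.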

  12. Frequency distribution of 6-mers [Figure: number of oligos vs. frequency of oligo] More about this in Lecture II

  13. “Portraits” of microbial genomes

  14. Making a portrait • Divide a rectangle into 2^k by 2^k cells, each cell corresponding to one of the 4^k different kinds of k-mers • Write in each cell the frequency of the k-mer • Color-code ranges of frequencies
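
A minimal sketch of the portrait grid follows. The slide does not say how k-mers are assigned to cells, so the recursive quadrant layout below (A top-left, C top-right, G bottom-left, T bottom-right) is an assumption, and the names `QUADRANT`, `kmer_cell` and `portrait` are hypothetical.

```python
from typing import List, Tuple

# ASSUMED layout: each base selects a quadrant as (row bit, column bit).
QUADRANT = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}

def kmer_cell(kmer: str) -> Tuple[int, int]:
    """Map a k-mer to a (row, column) cell of the 2^k by 2^k grid."""
    row = col = 0
    for base in kmer:
        r, c = QUADRANT[base]
        row = (row << 1) | r   # each base refines the row by one bit
        col = (col << 1) | c   # and the column by one bit
    return row, col

def portrait(seq: str, k: int) -> List[List[int]]:
    """Fill the grid with k-mer counts from a circular sliding window."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    circ = seq + seq[:k - 1]
    for i in range(len(seq)):
        kmer = circ[i:i + k]
        if all(base in QUADRANT for base in kmer):   # skip ambiguous bases such as N
            r, c = kmer_cell(kmer)
            grid[r][c] += 1
    return grid

grid = portrait("ACGTACGGTTAC", k=2)   # toy sequence, hypothetical data
for line in grid:
    print(line)
```

Color-coding is left out; each count would simply be binned into a frequency range and drawn with the color for that range.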

  15. Mycoplasma genitalium Length 0.58 Mb G+C content 32% Bacteria, Firmicutes Pathogen from the human urogenital tract

  16. Mycoplasma pneumoniae Length 0.816 Mb G+C content 40% Bacteria Firmicutes Parasite of the human respiratory tract.

  17. Borrelia burgdorferi Length 0.911 Mb G+C content 30% Bacteria, Spirochaetales Causative agent of Lyme disease (neurologic complications, arthritis)

  18. Rhizobium sp. NGR234 Length 0.53 Mb G+C content 59% Bacteria Proteobacteria Representative bacterium that fixes nitrogen in symbiosis with many plants.

  19. Aquifex aeolicus Length 1.55 Mb G+C content 40% Bacteria, Aquificales Earliest diverging, and most thermophilic bacteria known. Can grow on hydrogen, oxygen, carbon dioxide.

  20. Haemophilus influenzae Length 1.83 Mb G+C content 38% Bacteria, Proteobacteria Blood-loving bacterium; despite its name, not the cause of influenza (a viral disease).

  21. Methanococcus jannaschii Length 1.66 Mb G+C content 31% Archaea Euryarchaeota Anaerobic, Methane-producing hyperthermophile; grows at > 200 atm and an optimum temp. of 85 degrees C. Note: fractals

  22. Helicobacter pylori Length 1.67 Mb G+C content 40% Bacteria, Proteobacteria Acid-loving causative agent of chronic gastric diseases. Note: fractals

  23. Archaeoglobus fulgidus Length 2.18 Mb G+C content 49% Archaea, Euryarchaeota Hyperthermophilic sulphur-reducer; causes havoc by souring oil wells.

  24. Synechocystis sp. PCC6803 Length 3.587 Mb G+C content 48% Bacteria, Cyanobacteria Unicellular cyanobacterium widely used for study of the oxygen-producing photosynthesis mechanism. Exceptionally wide distribution of frequency of occurrence of short oligos.

  25. Phylogeny based on alignment of homologous sequences

  26. Molecular Evolution & Phylogeny • Organism represented by its genome • A universal ancestor is believed to have existed • Random mutation of DNA sequence leads to divergence and new species • Pressure from fitness causes conservation of sequence

  27. Phylogeny & Sequence similarity • Because fitness pressure conserves functional sequences, if the rate of change induced by mutation is assumed constant, then the dissimilarity between two homologous sequences indicates the time elapsed since they diverged. Hence sequence similarity can be used to study phylogeny. • E.g. phylogeny based on 16S/18S rRNA

  28. Sequence Alignment • Most important method for studying sequence homology • Example – alignment of two sequences a and b
      Seq a: TACCATCGCAAACAT--GG (length 17 b)
             x||||x|x|||x-|x--x|
      Seq b: AACCACCACAAG-ACCTCG (length 18 b)
      • Consensus length 19: 10 matches (|), 6 mismatches (x), 1 single gap (-, SG), 1 extended gap (--, EG) • Score = matches – (number of gaps, SG + EG) × P – (length of EG – 1) × PE, where P is the penalty for opening a gap and PE the penalty for each extra position of an extended gap • Score = 10 – 2 – 1 = 7 • Similarity = matches / total length = 10/19 = 53%
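
The arithmetic of this example can be checked with a short script. It is a sketch under stated assumptions: the gapped rows are reconstructed so that they reproduce the slide's counts (10 matches, 6 mismatches, one single gap, one two-position gap), both penalties are taken to be 1 because that reproduces the quoted score of 7, and the function name `score_alignment` is hypothetical.

```python
def score_alignment(a: str, b: str, gap_open: float = 1.0, gap_extend: float = 1.0):
    """Score two gapped rows of equal length; '-' marks a gap position."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = gaps = extensions = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            if in_gap:
                extensions += 1      # extra position of an extended gap
            else:
                gaps += 1            # a new gap opens (single or extended)
            in_gap = True
        else:
            in_gap = False
            if x == y:
                matches += 1
            else:
                mismatches += 1
    score = matches - gaps * gap_open - extensions * gap_extend
    similarity = matches / len(a)
    return matches, mismatches, gaps, extensions, score, similarity

row_a = "TACCATCGCAAACAT--GG"   # 17 bases plus a two-position gap
row_b = "AACCACCACAAG-ACCTCG"   # 18 bases plus a single gap
result = score_alignment(row_a, row_b)
print(result[:5], round(result[5], 3))   # (10, 6, 2, 1, 7.0) 0.526
```

Mismatches contribute nothing to the score here, which mirrors the slide's formula; many practical scoring schemes penalize them as well.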

  29. Sequence Alignment (II) • Result intuitive, evolution based • Widely used in sequence analysis – homology search, phylogeny, etc. • Parameter dependent – many alignments possible (Needleman-Wunsch algorithm) • Applies to DNA and protein sequences • Good software available, e.g. BLAST, GCG • Fast for length < 2000 • Computationally expensive for long and remotely related sequences; optimal multiple alignment is NP-hard
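
To make the parameter dependence concrete, here is a minimal sketch of Needleman-Wunsch global alignment that returns only the optimal score. The scoring values (match +1, mismatch 0, linear gap penalty -1) are assumptions chosen to resemble the previous slide, not values given in the lecture, and the function name is hypothetical.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = 0, gap: int = -1) -> int:
    """Optimal global alignment score by dynamic programming, O(len(a) * len(b))."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr[j] = max(diag, up, left)
        prev = curr
    return prev[-1]

print(needleman_wunsch_score("ACGT", "AGT"))              # one gap, three matches
print(needleman_wunsch_score("AC", "AG"))                 # mismatch tolerated
print(needleman_wunsch_score("AC", "AG", mismatch=-5))    # gaps now preferred
```

The last two calls align the same pair of sequences; changing only the mismatch penalty changes which alignment is optimal. The table has len(a) × len(b) entries, so cost grows quadratically with sequence length.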

  30. The Ribosome • E.g. phylogeny based on 16S/18S rRNA • 16S (Prokaryotes): 1550 bases; 18S (Eukaryotes): 1800 bases • Ribosomal enzyme • Carries out translation (the step that follows transcription) • Among the most ancient and best conserved biological machines • In genome of EVERY organism • Two subunits: 30S + 50S • 30S (small subunit): 16S/18S + 20 proteins • Translates mRNA

  31. “Cartoon” of 16S rRNA [Figure: regions labeled Head, Body, Platform]

  32. E. coli 16S rRNA secondary structure [Figure: regions labeled Head, Platform, Body, 3'm]

  33. 16S rRNA alignment tree • 35 organisms: 19 bacteria, 9 archaea, 7 eukarya [Figure: tree with branches labeled Bacteria, Archaea and Eukarya; labeled taxa include E. coli, Bacillus, Aquifex, Herpetosiphon, Thermotoga, Methanococcus, Archaeoglobus, mouse, Homo sapiens and C. elegans]

  34. Phylogeny based on frequency of k-mers

  35. Sequence distance based on Oligo Frequency
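
The formula on this slide did not survive in the transcript. Purely as an illustration of the idea, and not as the lecturer's definition, the sketch below uses one common choice: half the sum of absolute differences between normalized k-mer frequency profiles, which is 0 for identical profiles and 1 for profiles with no k-mer in common. The names `kmer_profile` and `profile_distance` are hypothetical.

```python
from collections import Counter
from typing import Dict

def kmer_profile(seq: str, k: int) -> Dict[str, float]:
    """Relative frequency of each k-mer in a linear sequence."""
    if len(seq) < k:
        raise ValueError("sequence shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def profile_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """ASSUMED distance: half the L1 difference between two frequency profiles."""
    kmers = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in kmers)

# Toy sequences (hypothetical data).
p = kmer_profile("ACGTACGTAC", 3)
q = kmer_profile("ACGTTCGTAC", 3)
print(profile_distance(p, p))   # identical profiles
print(profile_distance(p, q))   # one substitution changes a few 3-mers
```

A matrix of such pairwise distances is the kind of input that distance-based tree-building methods take.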

  36. 16S/18S rRNA k-mer tree as function of k [Figure: trees with branches labeled Bacteria, Archaea, Eukarya]

  37. Oligo Frequency and sequence alignment distances correlated • If sequences evolve ONLY by uncorrelated single mutations, then S = X^n (because the chance of any one base not changing is X, so the chance that all n bases of an oligo are unchanged is X^n) • X – alignment similarity • S – oligo frequency similarity • n – oligo length • In practice, there is more than single mutation, e.g. extended gaps. Then S = X^(kn) with k < 1. Empirically, k = 2/3.
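
The relation S = X^n for uncorrelated single mutations can be checked numerically. The sketch below rests on assumptions: S is approximated by the fraction of length-n windows that are identical at the same position in both sequences, a positional proxy rather than the lecture's own definition of oligo frequency similarity, and the sequence length, mutation rate and seed are arbitrary.

```python
import random

def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Substitute each base independently with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

def similarities(a: str, b: str, n: int):
    """Return X (fraction of unchanged bases) and S (fraction of unchanged n-mer windows)."""
    x = sum(p == q for p, q in zip(a, b)) / len(a)
    windows = len(a) - n + 1
    s = sum(a[i:i + n] == b[i:i + n] for i in range(windows)) / windows
    return x, s

rng = random.Random(0)                     # fixed seed, arbitrary choice
original = "".join(rng.choice("ACGT") for _ in range(50000))
mutated = mutate(original, 0.05, rng)      # substitutions only, no gaps
x, s = similarities(original, mutated, n=9)
print(x, s, x ** 9)                        # s should lie close to x ** 9
```

Because only substitutions are simulated, the exponent stays near n; insertions and deletions shift windows out of register, which is where the k < 1 correction on the slide comes in.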

  38. Simulated Random Mutations • Oligo length = 9 • S = X^9 [Figure: log S (oligo) vs. log X (alignment)]

  39. Extended Gaps I

  40. Extended Gaps II

  41. Simulated Random Mutations with Extended Gaps • h = 4, ng = 3, kth = 0.625 • Oligo length = 9 • S = X^6.3 [Figure: log S (oligo) vs. log X (alignment)]

  42. Tree of Life (35 organisms) • h = 5, ng = 2.5, kth = 0.8, kex = 0.66 • Oligo length = 9 [Figure: log S (oligo) vs. log X (alignment)]

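  43. Oligo frequency [Figure: tree with branches labeled Eukarya, Archaea, Bacteria; Aquifex and Thermotoga marked]

  44. Alignment [Figure: tree with Aquifex and Thermotoga marked]

  45. Comparison of 16S/18S rRNA Trees of Life (35 organisms) • Similar topology • Differences in detail [Figure: overlaid trees with branches labeled Bacteria, Eukarya, Archaea; Aquifex and Thermotoga marked. Black: oligo frequency. Red: sequence alignment]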

  46. Oligo method is Robust • Three tests (Bacteria and Archaea) • Random truncation of 16S rRNA to 800 to 1200 bases • Random inversion of 16S rRNA (splice, reverse order and reconnect) • Random concatenation of 23S, 16S and 5S rRNA sequences
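
The three perturbations listed on this slide are easy to sketch. The code below only illustrates how such test sequences might be generated; the function names, the random toy sequences and the seed are hypothetical, and nothing here reproduces the lecturer's actual procedure.

```python
import random
from typing import Sequence

def random_truncation(seq: str, rng: random.Random,
                      min_len: int = 800, max_len: int = 1200) -> str:
    """Keep a random contiguous piece of 800 to 1200 bases."""
    max_len = min(max_len, len(seq))
    min_len = min(min_len, max_len)
    length = rng.randint(min_len, max_len)
    start = rng.randint(0, len(seq) - length)
    return seq[start:start + length]

def random_inversion(seq: str, rng: random.Random) -> str:
    """Splice out a random segment, reverse its order and reconnect it."""
    i, j = sorted(rng.sample(range(len(seq) + 1), 2))
    return seq[:i] + seq[i:j][::-1] + seq[j:]

def random_concatenation(parts: Sequence[str], rng: random.Random) -> str:
    """Join several sequences (e.g. 23S, 16S and 5S rRNA) in random order."""
    shuffled = list(parts)
    rng.shuffle(shuffled)
    return "".join(shuffled)

rng = random.Random(1)                                           # arbitrary seed
rrna_16s = "".join(rng.choice("ACGT") for _ in range(1550))      # toy stand-in, not real rRNA
rrna_23s = "".join(rng.choice("ACGT") for _ in range(2900))      # toy stand-in
rrna_5s = "".join(rng.choice("ACGT") for _ in range(120))        # toy stand-in

truncated = random_truncation(rrna_16s, rng)
inverted = random_inversion(rrna_16s, rng)
joined = random_concatenation([rrna_23s, rrna_16s, rrna_5s], rng)
print(len(truncated), len(inverted), len(joined))
```

Inversion and concatenation leave almost all k-mers intact apart from those spanning the junctions or lying inside the reversed segment, which gives some intuition for why a frequency-based method could tolerate them.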

  47. 16S rRNA, truncated [Figure: alignment tree and oligo tree, each with scale bar 0.1; labeled taxa include Aquifex, Thermotoga, Sulfolobus and Aeropyrum; remaining taxa are marked by single-letter codes]

  48. 16S rRNA, truncated [Figure: alignment tree and oligo tree with Aquifex (A) and Thermotoga (H) marked]
