MICROBIAL GENOME ANNOTATION
Loren Hauser, Miriam Land, Yun-Juan Chang, Frank Larimer, Doug Hyatt, Cynthia Jeffries
NEB Educational Support http://www.neb.com/nebecomm/course_support.asp?
Why study Computational Biology and Bioinformatics?
• DNA sequencing output is growing faster than Moore’s law!
• 1 Illumina sequencing machine = 0.5 Tbp/week
• There are hundreds of these and thousands of other sequencing machines around the world.
• New sequencing technology will conceivably allow sequencing a human genome for less than $1K in less than 1 day!
Why study Medical Bioinformatics?
• In the near future, most cancer diagnostics will involve DNA or RNA sequencing!
• In the near future, every baby born in the developed world will have their genome sequenced. Protecting privacy and your doctor's ability to use that information are the only real impediments!
• Hospitals are using DNA sequencing to track antibiotic-resistant bacterial infections.
DOE Undergraduate Research in Microbial Genome Analysis and Functional Genomics http://www.jgi.doe.gov/education
Why Study Microbial Genomes?
• Large biological mass (50% of total)
• Photosynthetic (Prochlorococcus)
• Fix N2 gas to NH3 (Rhodopseudomonas)
• NH3 to NO2 (Nitrosomonas)
• Bioremediation (Shewanella, Burkholderia)
• Pathogens, BW (Yersinia pestis - plague)
• Food production (Lactobacillus)
• CH4 production (Methanosarcina)
• H2 production (Rhodopseudomonas)
Example of Current Microbial Genome Projects
• UC Davis: FDA-funded 100K bacterial genomes project associated with food.
• 100K genomes / 5 years = 20K genomes/year; 20K genomes/year / 200 days/year = 100 genomes/day!
Web Resources and Contact Information
• http://genome.ornl.gov/microbial/
• http://www.jgi.doe.gov/
• http://genome.jgi-psf.org/
• http://www.jcvi.org/
• http://www.ncbi.nlm.nih.gov/
• http://www.sanger.ac.uk/
• http://www.ebi.ac.uk/
• ftp://ftp.lsd.ornl.gov/pub/JGI
  • Artemis-ready files for each scaffold (feature table plus FASTA sequence file)
• Contact: landml@ornl.gov; hauserlj@ornl.gov
Sequenced Microbial Genomes (as of Sept 6, 2012)
• Archaeal genomes: 159 finished; 218 in progress
• Bacterial genomes: 3363 finished; 11831 in progress
• Environmental communities: > 50,000 samples (see MG-RAST)
• http://www.expasy.ch/alinks.html
• http://www.genomesonline.org
• http://metagenomics.anl.gov/
Published Genomes
• Nitrosomonas europaea - J. Bac. 185(9):2759-2773 (2003)
• Prochlorococcus MED4 & MIT9313 - Nature 424:1042-1047 (2003)
• Synechococcus WH8102 - Nature 424:1037-1042 (2003)
• Rhodopseudomonas palustris - Nat. Biotech. 22(1):55-61 (2004)
• Yersinia pseudotuberculosis - PNAS 101(22):13826-31 (2004)
• Nitrobacter winogradskyi - Appl. Envir. Micro. 72(3):2050-63 (2006)
• Nitrosococcus oceani - Appl. Envir. Micro. 72(9):6299-315 (2006)
• Burkholderia xenovorans - PNAS 103(42):15280-7 (2006)
• Thiomicrospira crunogena - PLoS Biology 4(12):e383 (2006)
• Nitrosomonas eutropha C91 - Env. Micro. 9(12):2993-3007 (2007)
• Sulfuromonas denitrificans - Appl. Envir. Micro. 74(4):1145-56 (2008)
• Nitrosospira multiformis - Appl. Envir. Micro. 74(11):3559-72 (2008)
• Nitrobacter hamburgensis - Appl. Envir. Micro. 74(9):2852-63 (2008)
• Saccharophagus degradans - PLoS Genetics 4(5):e1000087 (2008)
• R. palustris, 5-strain comparison - PNAS 105(47):18543-8 (2008)
• L. rubarum and L. ferrodiazotrophum - Appl. Envir. Micro. (in press)
Basic Annotation Impacts
• Design of oligonucleotide arrays
• Design & prioritize protein expression constructs
• Design & prioritize gene knockouts
• Assessment of overall metabolic capacity
• Database for proteomics
• Allows visualization of whole genome
Additional Analysis Impacts
• Revised functional assignments based on domain fusions, functional clustering, phylogenetic profile
• Regulatory motif discovery
• Operon and regulon discovery
• Regulatory and protein association network discovery
Microbial Annotation Genome Pipeline (flowchart; box labels listed in the order they appear on the slide)
• Scaffolds or contigs
• Simple repeats
• Prodigal
• Complex repeats
• Model correction
• tRNAs
• Final gene list
• rRNA, misc_RNAs
• InterPro
• PRIAM
• BLAST
• COGs
• TMHMM
• SignalP
• GC content, GC skew
• Function call
• Web pages
• Feature table
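The "GC content, GC skew" box of the pipeline can be illustrated with a minimal Python sketch. The slide does not say how the pipeline computes these values; the function names, the window and step sizes, and the common (G - C) / (G + C) definition of skew are assumptions made for illustration only.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous A/C/G/T bases of seq."""
    seq = seq.upper()
    g, c, a, t = (seq.count(base) for base in "GCAT")
    total = g + c + a + t
    return (g + c) / total if total else 0.0


def gc_skew(seq: str, window: int = 1000, step: int = 1000):
    """Return a list of (window_start, skew) pairs, skew = (G - C) / (G + C).

    window and step are illustrative defaults, not values from the slides.
    A sequence shorter than one window is treated as a single window.
    """
    seq = seq.upper()
    result = []
    for start in range(0, max(len(seq) - window + 1, 1), step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        skew = (g - c) / (g + c) if (g + c) else 0.0
        result.append((start, skew))
    return result
```

Plotting the skew values cumulatively along a finished chromosome is a common way to look for the replication origin and terminus, which is one reason such a track is useful on genome web pages.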
Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm)
• Unsupervised: Automatically learns the statistical properties of the genome.
• Indifferent to GC content: Prodigal performs well irrespective of the GC content of the organism.
• Draft: Prodigal can train on multiple sequences, then analyze individual draft sequences.
• Open source: Prodigal is freely available under the GPL.
• Reference: Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010 Mar 8;11(1):119. (Highly Accessed)
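As a rough illustration of running Prodigal on a draft assembly, the Python sketch below builds a command line and, optionally, executes it. The flags shown (-i, -o, -f, -a, -p, -t) are given as understood from the Prodigal 2.x command-line help and should be checked against `prodigal -h` for the installed version; the helper names and file names are hypothetical.

```python
import shutil
import subprocess


def prodigal_command(fasta, gff_out, proteins_out, training_file=None, meta=False):
    """Build an argv list for Prodigal (flags assumed from the 2.x help text).

    -i input FASTA, -o gene coordinates, -f output format,
    -a protein translations, -p procedure (single or meta),
    -t training file (written if absent, otherwise read and reused).
    """
    if meta and training_file is not None:
        # Assumption: a training file only makes sense in single-genome mode.
        raise ValueError("training_file cannot be combined with meta mode")
    cmd = ["prodigal", "-i", fasta, "-o", gff_out, "-f", "gff", "-a", proteins_out]
    cmd += ["-p", "meta" if meta else "single"]
    if training_file is not None:
        cmd += ["-t", training_file]
    return cmd


def run_prodigal(cmd):
    """Run the command, failing early if Prodigal is not installed."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("prodigal executable not found on PATH")
    subprocess.run(cmd, check=True)
```

For a draft genome, one plausible workflow is to call this once on all contigs with a training_file path so the training file is written, then reuse the same training_file when analyzing individual contigs.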
G+C Frame Plot Training
• Takes all ORFs above a specified length in the genome.
• Examines the G+C bias in each frame position of these ORFs.
• Runs a dynamic programming algorithm that uses the G+C frame bias as its coding score to predict genes.
• Takes those predicted genes and gathers dicodon usage statistics.
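The first two training steps can be sketched in Python as follows. This is a simplified illustration, not Prodigal's implementation: it scans only the forward strand, defines an ORF as a stop-to-stop open frame, and uses a hypothetical min_len default.

```python
STOPS = {"TAA", "TAG", "TGA"}


def forward_orfs(seq, min_len=90):
    """Stop-to-stop open frames of at least min_len bases, forward strand only.

    A frame still open at the end of the sequence is ignored in this sketch.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                if i - start >= min_len:
                    orfs.append(seq[start:i])
                start = i + 3
    return orfs


def gc_by_codon_position(orfs):
    """G+C fraction at codon positions 1, 2 and 3, pooled over all ORFs."""
    gc = [0, 0, 0]
    total = [0, 0, 0]
    for orf in orfs:
        orf = orf.upper()
        usable = len(orf) - len(orf) % 3  # drop a trailing partial codon
        for i in range(usable):
            base = orf[i]
            if base in "ACGT":
                total[i % 3] += 1
                if base in "GC":
                    gc[i % 3] += 1
    return [gc[k] / total[k] if total[k] else 0.0 for k in range(3)]
```

In real coding sequence the three positions usually show clearly different G+C fractions, and that difference is the frame bias the training step uses as a first coding score.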
Gene Prediction
• Dicodon usage coding score.
• Length factor added to coding score (GC-content-dependent).
• Coding/noncoding thresholds sharpened (starts downstream of starts with higher coding scores get penalized by the difference).
• Dynamic programming to put genes together.
• Bonuses for operon distances, larger bonus for -1/-4 overlaps.
• Same-strand overlap allowed (up to 60 bases).
• Opposite-strand overlap -->3'r 5'f<- allowed (up to 250 bases).
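The "dynamic programming to put genes together" step can be sketched in Python as a highest-scoring chain of candidate genes. This is a simplified illustration under several assumptions: candidate scores are taken as given, operon-distance bonuses are not modeled, the 250-base limit is applied to any opposite-strand pair regardless of orientation, and only neighbouring genes in the chain are checked against each other. The Gene type and function names are hypothetical.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    start: int    # leftmost coordinate, 1-based inclusive
    end: int      # rightmost coordinate, 1-based inclusive
    strand: str   # "+" or "-"
    score: float  # coding plus start score, assumed precomputed


def compatible(left, right, max_same=60, max_opp=250):
    """Can `right` follow `left` in a chain? Limits are taken from the slide."""
    overlap = left.end - right.start + 1
    if overlap <= 0:
        return True
    limit = max_same if left.strand == right.strand else max_opp
    # Reject nested genes; allow only a bounded overlap at the ends.
    return overlap <= limit and left.start < right.start and left.end < right.end


def best_chain(genes):
    """Return the highest-scoring chain of mutually compatible neighbours."""
    order = sorted((g for g in genes if g.score > 0), key=lambda g: (g.end, g.start))
    best = []  # best[i] = (chain score ending at order[i], previous index)
    for i, gene in enumerate(order):
        top, prev = 0.0, None
        for j in range(i):
            if compatible(order[j], gene) and best[j][0] > top:
                top, prev = best[j][0], j
        best.append((gene.score + top, prev))
    if not best:
        return []
    i = max(range(len(best)), key=lambda k: best[k][0])
    chain = []
    while i is not None:
        chain.append(order[i])
        i = best[i][1]
    return chain[::-1]
```

The quadratic inner loop keeps the sketch short; a production gene finder would restrict the search to nearby candidates.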
Start Site Scoring: Shine-Dalgarno Motif
• Examines initially predicted genes and gathers statistics on the starts (RBS motifs, ATG vs GTG vs TTG frequency).
• Moves starts based on these discoveries.
• Gathers statistics on the new set of starts and repeats this process until convergence (5-10 iterations).
• RBS motifs are based on the AGGAGG sequence: 3-6 base motifs, with one mismatch allowed in motifs of 5 bases or longer (e.g. GGTGG or AGCAG).
• Runs a final dynamic programming pass with the start scoring function.
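The RBS motif rule (3-6 base pieces of AGGAGG, one mismatch allowed at 5 bases or longer) can be sketched in Python as a scan of the region upstream of a candidate start. The ranking of hits (more matching bases first, then fewer mismatches) and the function names are assumptions for illustration; Prodigal's actual motif scoring and spacer weighting are not reproduced here.

```python
SD = "AGGAGG"


def sd_motifs():
    """All distinct 3-6 base substrings of AGGAGG, longest first."""
    motifs = [SD[i:i + n] for n in range(6, 2, -1) for i in range(len(SD) - n + 1)]
    return list(dict.fromkeys(motifs))  # drop duplicates, keep order


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_rbs(upstream):
    """Find the best Shine-Dalgarno-like hit in `upstream`.

    `upstream` is the sequence immediately 5' of the start codon (start codon
    excluded). Returns (matched_text, motif, mismatch_count, spacer) or None,
    where spacer is the number of bases between the hit and the start codon.
    """
    upstream = upstream.upper()
    best = None
    for motif in sd_motifs():
        allowed = 1 if len(motif) >= 5 else 0
        for pos in range(len(upstream) - len(motif) + 1):
            window = upstream[pos:pos + len(motif)]
            mm = mismatches(window, motif)
            if mm > allowed:
                continue
            spacer = len(upstream) - (pos + len(motif))
            key = (len(motif) - mm, -mm)  # assumed ranking, see lead-in
            if best is None or key > best[0]:
                best = (key, window, motif, mm, spacer)
    return None if best is None else best[1:]
```

For example, the slide's GGTGG is found as a one-mismatch hit to GGAGG, while a region with no AGGAGG-like piece returns None, which is the situation where other motifs would be examined.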
Start Site Scoring: Other Motifs
• If Shine-Dalgarno scoring is strong, use it. This accounts for ~85% of genomes.
• If Shine-Dalgarno scoring is weak, look for other motifs.
• If a strong scoring motif is found, use it (example: GGTG in A. pernix).
• If no strong scoring motif is found, use the highest score of all found motifs (example: in Crenarchaea, transcription (Tc) and translation (Tl) start sites are the same, but internal operon genes use weak Shine-Dalgarno motifs).
Frequency distance distributions Salgado et al. PNAS (2000) 97:6652 Fig. 2
Frequency distance distributions Salgado et al. PNAS (2000) 97:6652 Fig. 3b
Branched Chain Amino Acid Transporter family – Rhodopseudomonas palustris
Transporter Gene Loss in Yersinia pestis
• 36 genes involved in transport in Y. pseudotuberculosis (YPSE) are nonfunctional in Y. pestis (YPES)
• 13 lost due to frameshifts
• 11 lost due to deletions
• 6 lost due to IS element insertions
• 4 (2 pairs) lost due to recombination causing deletions and frameshifts
• 2 lost due to premature stop codons