
Metagenomic dataset preprocessing – data reduction

Metagenomic dataset preprocessing – data reduction. Konstantinos Mavrommatis KMavrommatis@lbl.gov. Complexity. Acid Mine Drainage. Sargasso Sea. Termite Hindgut. Cow rumen. Soil.


Presentation Transcript


  1. Metagenomic dataset preprocessing – data reduction. Konstantinos Mavrommatis, KMavrommatis@lbl.gov

  2. Complexity
     • The total metagenome is the result of a cell community.
     • Cells belong to different organisms, ranging from strains to domains.
     • Who is there? (Phylogenetic content)
     • What does it do? (Functional content)
     • Why is it there? (Comparative study)
     [Figure: example environments placed along a species complexity axis (1, 10, 100, 1000, 10000): Acid Mine Drainage, Sargasso Sea, Termite Hindgut, Cow rumen, Soil]

  3. Dataset processing
     [Workflow diagram, boxes as listed: Sample preparation; High throughput sequencing; Assemble reads; Analysis; Feature prediction; ?; QC; Functional annotation and comparative analysis; Binning]

  4. Dataset processing (v 3.0a)
     • Submitted files (fasta/fastq): assembled contigs, 454 reads, Illumina reads.
     • File QC: check character set and contig name; remove trailing Ns.
     • Trimming: Q=20 in one branch, Q=13 in another.
     • Low complexity: size of 80 bp.
     • Dereplication: prefix = 5, identity 95%.
     • Clustering: 100% identity.
     • Output: fasta file for gene calling.
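
The File QC step above (check character set and contig name, remove trailing Ns) could be sketched roughly as follows. This is a minimal illustration, not the pipeline's code: the allowed character set (IUPAC nucleotide codes), the name pattern, and the function name `qc_record` are assumptions.

```python
import re

# Assumption: the allowed character set is the IUPAC nucleotide alphabet.
ALLOWED = set("ACGTURYSWKMBDHVN")
# Assumption: names may contain letters, digits, and a few punctuation marks.
NAME_RE = re.compile(r"^[A-Za-z0-9_.:|-]+$")


def qc_record(name, seq):
    """Validate one record and return (name, sequence without trailing Ns).

    Raises ValueError when the name or the character set fails the check.
    """
    if not NAME_RE.match(name):
        raise ValueError(f"invalid sequence name: {name!r}")
    seq = seq.upper()
    bad = set(seq) - ALLOWED
    if bad:
        raise ValueError(f"unexpected characters in {name}: {sorted(bad)}")
    # Only trailing Ns are removed; internal Ns are left in place.
    return name, seq.rstrip("N")
```

For example, `qc_record("contig_1", "acgtnnACGTNNN")` returns the sequence upper-cased with only the three trailing Ns removed.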

  5. Dataset processing: Feature prediction pipeline (v 3.0a)
     • Input: fasta file for gene calling (unassembled reads + assembled contigs).
     • CRISPR detection: crt / pilercr.
     • RNA detection: tRNAscan / hmmer / Blast / (isolates: Rfam).
     • CDS detection: isolates: prodigal; metagenomes: varies.
     • Conflict resolution.
     • Concatenation of all results; creation of final output file.
     • Output: file for IMG.

  6. Dataset processing: Quality trimming
     • Remove low-quality sequence from the ends of the reads.
     • 454 datasets: lucy.
     • Illumina: longest high quality string.
     Courtesy Alex Copeland. http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
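
One way to read "longest high quality string" for Illumina reads is to keep the longest contiguous run of bases whose quality score meets a threshold. The sketch below illustrates that reading only. The function names are hypothetical, and the default threshold of 20 is borrowed from the Q values on the pipeline overview slide, which does not say which value applies to which platform.

```python
def longest_high_quality_span(quals, min_q=20):
    """Return (start, end) of the longest run with every score >= min_q.

    `end` is exclusive. Ties go to the earliest run. (0, 0) means no base
    passed the threshold.
    """
    best = (0, 0)
    start = None  # start index of the current passing run, if any
    for i, q in enumerate(quals):
        if q >= min_q:
            if start is None:
                start = i
            if i + 1 - start > best[1] - best[0]:
                best = (start, i + 1)
        else:
            start = None
    return best


def trim_read(seq, quals, min_q=20):
    """Trim a read and its quality scores to the longest high-quality span."""
    start, end = longest_high_quality_span(quals, min_q)
    return seq[start:end], quals[start:end]
```

With scores `[5, 30, 30, 10, 25, 25, 25, 2]` and `min_q=20`, the run at positions 4 to 6 is kept because it is longer than the run at positions 1 to 2.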

  7. Dataset processing: Low complexity filter
     Examples of low-complexity sequence: tatatatatatatatatat, aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
     • Using dust (NCBI).
     • Remove sequences with fewer than 80 informative bases.
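
To make the "informative bases" idea concrete, here is a simplified sketch that masks windows with many repeated triplets and counts what remains. It is written in the spirit of DUST but does not reproduce NCBI's dust: the window size (64), the score threshold (2.0), the window-level masking rule, and all function names are assumptions. Only the 80-base cutoff comes from the slide.

```python
from collections import Counter


def triplet_score(window):
    """Repeat score of a window: pairs of identical triplets per triplet."""
    n = len(window) - 2  # number of overlapping triplets
    if n < 2:
        return 0.0
    counts = Counter(window[i:i + 3] for i in range(n))
    return sum(c * (c - 1) / 2 for c in counts.values()) / (n - 1)


def informative_bases(seq, window=64, threshold=2.0):
    """Count bases that are not N and not inside a low-complexity window."""
    seq = seq.upper()
    masked = [False] * len(seq)
    # Slide the window one base at a time so every position is covered.
    for start in range(max(len(seq) - window + 1, 1)):
        chunk = seq[start:start + window]
        if triplet_score(chunk) > threshold:
            for i in range(start, start + len(chunk)):
                masked[i] = True
    return sum(1 for base, m in zip(seq, masked) if not m and base != "N")


def passes_low_complexity_filter(seq, min_informative=80):
    """Keep a sequence only if it has at least `min_informative` bases left."""
    return informative_bases(seq) >= min_informative
```

Under these assumptions, homopolymer and dinucleotide repeats such as the two examples on the slide are fully masked, so reads made of them are removed regardless of their length.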

  8. Dataset processing: Dereplication

  9. Dataset processing: Sequence dereplication
     [Figure: example reads atcccat, atc-cat, atcccat, atcccat, atcccat and gctacat, gctncat, gctacat; labeled "Not dereplicated": gctacat]
     • Using uclust.
     • 95% identity (global alignment).
     • Identical prefix (5 nt).
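
The two criteria above (identical 5 nt prefix, 95% identity) can be illustrated with a small greedy dereplication sketch. This is not uclust: identity is approximated here as one minus the edit distance divided by the longer length, candidates are processed longest first, and every function name is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = cur
    return prev[-1]


def identity(a, b):
    """Approximate global identity: 1 - edit distance / longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def dereplicate(seqs, prefix_len=5, min_identity=0.95):
    """Keep one representative per group of near-identical sequences.

    A sequence is dropped only if an already kept sequence shares its first
    `prefix_len` bases and reaches `min_identity`.
    """
    reps_by_prefix = {}
    kept = []
    for seq in sorted(seqs, key=len, reverse=True):
        bucket = reps_by_prefix.setdefault(seq[:prefix_len], [])
        if any(identity(seq, rep) >= min_identity for rep in bucket):
            continue
        bucket.append(seq)
        kept.append(seq)
    return kept
```

Bucketing by prefix means a read is only compared against representatives that start with the same 5 nt, so two reads that differ in their first bases are both kept even when the rest is identical.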

  10. Dataset processing: Evaluation of processing tools
     • Because of their small size, quality problems, and large number, unassembled sequences need to be processed with efficient pipelines.
     • Simulated datasets:
        • Using sequences extracted from finished genomes (perfect sequences).
        • Using reads that were used to assemble finished genomes (real errors).
     • Evaluation and development of new tools/wrappers.
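
The first kind of simulated dataset, error-free fragments extracted from a finished genome, can be sketched as below. This is an illustrative assumption about how such reads might be drawn (uniform start positions, fixed read length, forward strand only). The function name and parameters are hypothetical and not taken from the talk.

```python
import random


def simulate_perfect_reads(genome, read_len, n_reads, seed=0):
    """Sample error-free reads of fixed length from a genome string.

    Returns a list of (start, read) tuples so that each read can be traced
    back to its true position when evaluating a tool.
    """
    if read_len > len(genome):
        raise ValueError("read length exceeds genome length")
    rng = random.Random(seed)  # seeded for reproducible datasets
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append((start, genome[start:start + read_len]))
    return reads
```

Setting `read_len` to 74 or 115 would mimic the Illumina read lengths named on the later slides. Keeping the start coordinate makes it possible to compare predictions on a read against the annotation of the source genome.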

  11. Dataset processing: Feature prediction
     Available methods:
     • Ab initio: Metagene, MetaGeneMark, FragGeneScan, Prodigal.
     • Similarity based: Blastx, USEARCH.
     [Figure labels: isolate; metagenome; MISSED, CORRECT, WRONG, NEW]

  12. Trimming

  13. 454 Ti (no errors)

  14. 454 Ti (with errors)

  15. Illumina 115 bp

  16. Illumina 74 bp

  17. Contigs
     [Figure labels: frameshift; Wrong prediction]

  18. Why annotate unassembled reads?
     • Additional information about functions and phylogeny.
     • More accurate statistics based on unassembled + assembled.
     [Figure labels: Assembled only; Unassembled + assembled + real metagenome]

  19. Processing time (metagenomes)

  20. Processing time (isolates)

  21. Thank you for your attention
