A generic and modular platform for automated sequence processing and annotation

Presentation Transcript


  1. A generic and modular platform for automated sequence processing and annotation
     Arthur Gruber, Instituto de Ciências Biomédicas, Universidade de São Paulo (AG-ICB-USP)

  2. Sequence processing and annotation
     • Analyzing and processing sequencing reads is a tedious and error-prone job
     • It is a multistep process
     • All sequences are submitted to the same processing steps
     • Sequences processed by a given step are the input for the next one
     • The steps require different programs
     • The answer is an integrated system: a PIPELINE

  3. Problem: how to build pipelines
     • Writing scripts for new pipelines requires good programming knowledge
     • Once created, most pipelines are difficult to change and customize
     • Many programs must be used: Phred, Cross_match, Phrap, CAP3, Blast, HMMer, InterproScan, TMHMM, etc.

  4. Problem: how to build pipelines
     • Each program needs a specific environment to work (e.g. directories with specific names)
     • Each program produces output in different ways and formats
     • Integrating programs is a hard task

  5. Solution: creating an environment to build pipelines
     Requirements:
     • Abstract the environment of each program
     • Abstract the output format
     • Easily specify the "coupling" of different programs
     • Document how the pipeline was built
     • Easy to inspect and monitor
     • Easy to store (e.g. in a database)
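
The coupling idea in slides 2 and 5 (every sequence passes through the same steps, and the output of one step feeds the next) can be illustrated with a minimal Python sketch. This is not EGene's code, which is written in Perl; the `Read` record, the two toy components and `run_pipeline` are hypothetical names chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Read:
    """Hypothetical sequence record; 'log' documents which steps touched it."""
    name: str
    bases: str
    log: List[str] = field(default_factory=list)


# A component takes a list of reads and returns the list for the next step.
Component = Callable[[List[Read]], List[Read]]


def uppercase(reads: List[Read]) -> List[Read]:
    """Toy component: normalize bases to upper case."""
    for r in reads:
        r.bases = r.bases.upper()
        r.log.append("uppercase")
    return reads


def min_length(n: int) -> Component:
    """Toy component factory: keep only reads with at least n bases."""
    def step(reads: List[Read]) -> List[Read]:
        kept = [r for r in reads if len(r.bases) >= n]
        for r in kept:
            r.log.append(f"min_length({n})")
        return kept
    return step


def run_pipeline(reads: List[Read], components: List[Component]) -> List[Read]:
    """Feed the output of each component into the next one, in order."""
    for component in components:
        reads = component(reads)
    return reads


if __name__ == "__main__":
    result = run_pipeline([Read("a", "acgt"), Read("b", "ac")],
                          [uppercase, min_length(3)])
    for r in result:
        print(r.name, r.bases, r.log)
```

Because the pipeline is just an ordered list of components, changing it means editing that list rather than rewriting a script, and the per-read log records how each result was produced.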

  6. EGene
     Aims and characteristics:
     • To develop a platform for pipeline construction that is simple to use and configure
     • Big sequencing centers already have sophisticated pipelines, but many are unpublished and/or not publicly available
     • Those pipelines are too complex for small and mid-sized labs
     • The platform should be generic: useful for any sequencing project
     • The platform should provide components for the most common tasks
     • New components should be easy to develop

  7. EGene: a generic platform for pipeline construction
     Characteristics:
     • Written in Perl
     • Modular
     • Easy to build specific components to interact with third-party programs
     • EGene components can be integrated to fulfill user-specific needs
     • CoEd: a graphical configuration editor written in Java, with a user-friendly interface

  15. Sequence processing pipeline: the Eimeria ORESTES project
     (flow diagram; each step is listed with the program that performs it)
     • Input chromatogram files
     • Base calling and quality assignment: Phred
     • Primer screening and masking: Cross_Match
     • Vector masking and screening: Cross_Match
     • Quality filtering: Filter-quality.pl
     • End trimming: Trim-ends.pl
     • Size filtering: Filter-size
     • Mitochondrial sequence filtering: Cross_Match
     • Plastid sequence filtering: Cross_Match
     • Ribosomal sequence filtering: Cross_Match
     • Repetitive sequence filtering: Cross_Match
     • Bacterial sequence filtering: Blast
     • Chicken sequence filtering: Blast
     • Human sequence filtering: Blast
     • Assembly: CAP3
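
As a rough illustration of what the quality filtering, end trimming and size filtering steps do, here is a hedged Python sketch operating on a base string and its per-base Phred-style quality values. It does not reproduce Filter-quality.pl, Trim-ends.pl or Filter-size; the function names and every threshold (`min_q`, `min_fraction`, `min_len`) are assumptions made for the example.

```python
from typing import List, Tuple


def passes_quality(quals: List[int], min_q: int = 20,
                   min_fraction: float = 0.8) -> bool:
    """Accept a read if enough of its bases reach the quality cutoff.

    Thresholds are illustrative assumptions, not EGene defaults.
    """
    if not quals:
        return False
    good = sum(1 for q in quals if q >= min_q)
    return good / len(quals) >= min_fraction


def trim_ends(bases: str, quals: List[int],
              min_q: int = 20) -> Tuple[str, List[int]]:
    """Drop low-quality bases from both ends, keeping the inner region."""
    start, end = 0, len(quals)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return bases[start:end], quals[start:end]


def passes_size(bases: str, min_len: int = 100) -> bool:
    """Accept a read only if it is at least min_len bases long."""
    return len(bases) >= min_len


if __name__ == "__main__":
    bases, quals = trim_ends("ACGTAC", [5, 30, 30, 30, 10, 8])
    print(bases, quals, passes_size(bases, min_len=3))
```

Trimming before the size filter matters: a read can be long enough as sequenced yet too short once its low-quality ends are removed.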

  16. Sequence processing and graphical report

  17. How to get EGene
     Internet site: http://www.coccidia.icb.usp.br/egene
     - EGene is distributed under the GNU General Public License
     - EGene is Open Source

  19. Recent developments
     • Incorporation of forks
     • Enhancement of the data model: incorporation of annotation evidence
     • Development of annotation components
     • Evidence-based annotation

  20. Genome annotation
     • Annotation is the process of adding information to a DNA sequence
     • The information usually has a DNA coordinate
     • Features could be repeats, genes, promoters, protein domains, etc.
     • Features can be cross-referenced to other databases (e.g. Pfam/PubMed)
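
The notion of a feature described in slide 20 (a piece of information anchored to DNA coordinates, optionally cross-referenced to other databases) maps naturally onto a small record type. The sketch below is illustrative only; the `Feature` fields and the `features_overlapping` helper are hypothetical and do not reflect EGene's internal data model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Feature:
    """Hypothetical annotation feature with 1-based, inclusive coordinates."""
    seq_id: str
    kind: str                      # e.g. "gene", "repeat", "protein_domain"
    start: int
    end: int
    strand: str = "+"
    db_xrefs: Tuple[str, ...] = () # e.g. ("Pfam:PF00069",)


def features_overlapping(features: List[Feature],
                         start: int, end: int) -> List[Feature]:
    """Return the features that overlap the closed interval [start, end]."""
    return [f for f in features if f.start <= end and f.end >= start]


if __name__ == "__main__":
    feats = [
        Feature("contig1", "gene", 100, 900),
        Feature("contig1", "protein_domain", 250, 400,
                db_xrefs=("Pfam:PF00069",)),
        Feature("contig1", "repeat", 1500, 1600),
    ]
    for f in features_overlapping(feats, 300, 1000):
        print(f.kind, f.start, f.end, f.db_xrefs)
```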

  22. Annotation file
     A typical annotation file contains:
     • A header with information about the sequence, organism, authors, references and comments
     • A feature table containing the sequence features and their coordinates

  23. Feature table format
     • Flatfile format
     • Format definition available at http://www.ncbi.nlm.nih.gov/projects/collab/FT/
     • Covers DDBJ/EMBL/GenBank
     • Defines all accepted annotation terms and their hierarchy

  24. Incorporating annotation
     • EGene's data model was enriched to incorporate annotation information into the representation of the sequences
     • All collected data is converted into a proprietary XML format
     • The XML can be easily converted into different annotation formats: Feature Table, GFF3, etc.
     • We provide some converters, and new ones can be easily implemented
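
To give a feel for what such a converter involves, the following Python sketch turns a small XML document into GFF3 lines. EGene's XML schema is not shown in the slides, so the element and attribute names used here (`sequence`, `feature`, `type`, `start`, `end`, `strand`, `id`) are invented for the example; only the GFF3 side (the `##gff-version 3` header and nine tab-separated columns) follows the published format.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout, assumed only for this example.
HYPOTHETICAL_XML = """<sequence id="contig1">
  <feature type="gene" start="100" end="900" strand="+" id="gene1"/>
  <feature type="tRNA" start="1200" end="1272" strand="-" id="trna1"/>
</sequence>"""


def xml_to_gff3(xml_text: str, source: str = "example") -> str:
    """Convert the hypothetical XML above into GFF3 text."""
    root = ET.fromstring(xml_text)
    seq_id = root.get("id")
    lines = ["##gff-version 3"]
    for feat in root.findall("feature"):
        columns = [
            seq_id,                    # column 1: seqid
            source,                    # column 2: source
            feat.get("type"),          # column 3: feature type
            feat.get("start"),         # column 4: start (1-based)
            feat.get("end"),           # column 5: end (inclusive)
            ".",                       # column 6: score (not set here)
            feat.get("strand", "."),   # column 7: strand
            ".",                       # column 8: phase (not set here)
            f"ID={feat.get('id')}",    # column 9: attributes
        ]
        lines.append("\t".join(columns))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(xml_to_gff3(HYPOTHETICAL_XML), end="")
```

Keeping one neutral internal representation and writing small converters per output format is what makes adding a new format cheap.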

  25. Annotation components
     A comprehensive set of annotation components has been implemented:
     • ORF finding and translation
     • Tandem repeat finding: TRF, String, mREPS
     • tRNA finding: tRNAscan-SE
     • Gene prediction: Genscan, GlimmerM, GlimmerHMM, Twinscan, Phat, ESTscan, SNAP
     • Motif finding: HMMer against Pfam, RPS-BLAST, InterproScan
     • Similarity search: BLAST
     • EST mapping: Sim4, Exonerate

  26. Annotation components (continued)
     • Transmembrane domain finding: TMHMM, Phobius
     • Signal peptide: SignalP, Phobius
     • GPI anchor: DGPI
     • GO mapping and quantification
     • Orthology assignment and quantification: COG/KOG
     • Pathway mapping: KEGG
     • Annotation visualization with GBrowse: web inspection
     • Annotation report generation: feature table, GFF3
     • Web site generation: HTML/PHP
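
As an example of the simplest item in the component list, ORF finding and translation, here is a minimal Python sketch. It is not EGene's component: it scans only the three forward-strand frames, uses the standard genetic code, and treats an ORF as the stretch from the first ATG to the next in-frame stop codon. The function names and the `min_codons` parameter are assumptions for illustration.

```python
from itertools import product
from typing import List, Tuple

# Standard genetic code, codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
          "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(_BASES, repeat=3), _AMINO)}
STOPS = {"TAA", "TAG", "TGA"}


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def find_orfs(seq: str, min_codons: int = 1) -> List[Tuple[int, int, str]]:
    """Return (start, end, protein) for forward-strand ORFs.

    Coordinates are 1-based and inclusive; 'end' includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open an ORF at the first ATG
            elif start is not None and codon in STOPS:
                protein = translate(seq[start:i])
                if len(protein) >= min_codons:
                    orfs.append((start + 1, i + 3, protein))
                start = None                   # close it at the stop codon
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATGGTAACC"))
```

A production component would also scan the reverse complement and support alternative genetic codes; both are omitted here to keep the sketch short.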

  27. EGene generates annotation files that can be inspected with standard annotation editors (Artemis, Apollo, etc.)

  28. EGene's annotation
     EGene can generate annotation in different formats:
     • XML: local use, easy to feed into a database management system
     • Feature table
       - Convenient for manual curation in Artemis
       - Ready for submission to public databases
     • GFF3
       - Current annotation interchange format
       - Manual curation/visualization in Artemis, Apollo and the GMOD Genome Browser
       - Compliant with Sequence Ontology terms

  29. EGene performs GO term mapping and constructs web pages for inspection
