
Functional and structural genomics using PEDANT

Functional and structural genomics using PEDANT. Yang-Ming Institute of Biotechnology, Bioinformatics Program, 林千涵. Introduction. With the growth of biological sequence data, a system is needed that can store and retrieve tens of gigabytes of data, backed by a mature database management system and good visualization tools.


Presentation Transcript


  1. Functional and structural genomics using PEDANT • Yang-Ming Institute of Biotechnology, Bioinformatics Program • 林千涵

  2. Introduction • With the growth of biological sequence data, a system is needed that can store and retrieve tens of gigabytes of data, backed by a mature database management system and good visualization tools • Shift from case-oriented sequence analysis to automated large-scale genome annotation

  3. Introduction-PEDANT • Differences among existing genome analysis programs • protein-oriented vs. DNA-oriented analysis • interactive work vs. command-line operation • bioinformatics methods applied • user interface • convenience features, project management and data editors • fidelity of the results produced • Benchmarks may vary depending on the chosen balance between sensitivity and selectivity of the analyses • PEDANT (Protein Extraction, Description, and ANalysis Tool) became available in mid-1997 (using FASTA for similarity search) • a workhorse for general bioinformatics research • a common framework for a number of genome analysis projects • a complete database of automatically annotated genomes • a tool for routine analysis of large amounts of genomic contigs and ESTs

  4. System Architecture • Overview • database module: storing, modifying and accessing data • processing module: bioinformatics computations • user interface: web based communication

  5. System Architecture-Cont. • Data access • primary tables: store raw data (e.g. DNA and protein sequences) and raw program results (e.g. BLAST output) • secondary tables: parsed program results • simplified schema • Operation in command-line mode • applying bioinformatics methods to sequences • parsing data tables • querying the resulting databases • Web interface • No static HTML pages required • DNA and protein viewers access the SQL tables directly • Implementation and system requirements • Perl 5, with C++ for the graphical viewers • Performance • parallel capabilities
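
The primary/secondary table split on slide 5 can be sketched roughly as follows. This is only an illustration using Python's built-in `sqlite3` module, not PEDANT's actual schema or code: the table names (`raw_result`, `parsed_hit`), the column names, the helper functions, and the toy tab-separated "raw output" format are all assumptions made for the example.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one primary and one secondary table."""
    conn = sqlite3.connect(":memory:")
    # Primary table (hypothetical): raw program output stored verbatim.
    conn.execute(
        "CREATE TABLE raw_result ("
        "seq_id TEXT, method TEXT, output TEXT, "
        "PRIMARY KEY (seq_id, method))"
    )
    # Secondary table (hypothetical): one row per parsed hit.
    conn.execute(
        "CREATE TABLE parsed_hit ("
        "seq_id TEXT, method TEXT, hit_id TEXT, score REAL)"
    )
    return conn


def store_raw(conn, seq_id, method, output):
    """Store (or overwrite) the raw output of one method for one sequence."""
    conn.execute(
        "INSERT OR REPLACE INTO raw_result VALUES (?, ?, ?)",
        (seq_id, method, output),
    )


def parse_raw(conn, method):
    """Rebuild the secondary table for one method from the stored raw output."""
    conn.execute("DELETE FROM parsed_hit WHERE method = ?", (method,))
    rows = conn.execute(
        "SELECT seq_id, output FROM raw_result WHERE method = ?", (method,)
    ).fetchall()
    for seq_id, output in rows:
        for line in output.splitlines():
            if not line.strip():
                continue
            # Toy format assumed here: "<hit_id>\t<score>" per line.
            hit_id, score = line.split("\t")
            conn.execute(
                "INSERT INTO parsed_hit VALUES (?, ?, ?, ?)",
                (seq_id, method, hit_id, float(score)),
            )


def best_hits(conn, method, min_score):
    """Query the secondary table for hits at or above a score cutoff."""
    return conn.execute(
        "SELECT seq_id, hit_id, score FROM parsed_hit "
        "WHERE method = ? AND score >= ? ORDER BY score DESC",
        (method, min_score),
    ).fetchall()
```

Keeping the raw output in its own table means the parsed table can be dropped and rebuilt whenever the parser changes, without rerunning the underlying analysis.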

  6. Schema

  7. Bioinformatics Method • Overview of the PEDANT processing pipeline • identification of coding regions and analysis of various genetic elements • homology search • detection of protein motifs, prediction of secondary structure and other protein features, and sensitive fold recognition • automatic assignment to pre-defined functional categories • Prediction of genes and other genetic elements • Table 1 • choose one of 15 genetic codes • http://www.ncbi.nlm.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c • Functional and structural categories • similarity search: PSI-BLAST (Position-Specific Iterated BLAST) • special datasets: MIPS, COG, PROSITE, PFAM and BLOCKS • significant matches to PIR: annotations, keywords, enzyme classification and superfamily information • for sequences with a significant relationship to PDB entries, secondary structure information: STRIDE (upper case), PREDATOR (lower case) • low-complexity regions, membrane regions, coiled coils and signal peptides • comparison against SCOP using IMPALA
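
To illustrate why the choice of genetic code matters during gene prediction, here is a minimal translation sketch. It is not PEDANT code; the function names are hypothetical, and only two of the NCBI translation tables are included (table 1, the standard code, and table 4, in which TGA encodes tryptophan instead of a stop). The amino-acid strings follow the NCBI convention of listing codons in T, C, A, G order.

```python
from itertools import product

BASES = "TCAG"  # NCBI ordering of the bases within each codon position

# Amino acids for the 64 codons TTT, TTC, TTA, TTG, TCT, ... GGG.
# Only two tables are shown here; '*' marks a stop codon.
CODES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    4: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
}


def codon_table(code_id):
    """Build a codon -> amino acid mapping for the chosen table."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    return dict(zip(codons, CODES[code_id]))


def translate(dna, code_id=1):
    """Translate a DNA string in frame 0, stopping at the first stop codon.

    Codons containing characters other than T, C, A, G become 'X'.
    """
    table = codon_table(code_id)
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = table.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

With the same input, `translate("ATGTGGTGATTT", 1)` stops at TGA, while `translate("ATGTGGTGATTT", 4)` reads through it, so the predicted ORF boundaries differ between the two codes.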

  8. Table 1

  9. Bioinformatics Method-Cont. • Yeast biological role categories • the first system of biological role categories was developed for E. coli • MIPS: advanced hierarchical functional catalogue (yeast) • Multidimensionality: protein:gene is many-to-many (M:M) • automated assignment to MIPS categories is a first approximation, to be refined by manual annotation • Distribution of ORFs • Visualization • an integrated, hypertext-linked protein report with calculated parameters and sequences, as a reference for further manual annotation • Protein report page

  10. Distribution of ORFs

  11. Protein report page

  12. Bioinformatics Method-Cont.2 • Automatic versus manual annotation • Problem of error propagation • erroneous annotation caused by human error and spurious similarity hits • mitigated by filtering algorithms and domain structure analysis? • quality improved through manual review by human experts! • Manual annotation • Catalogue-independent • Flexibility: an entry is first placed in a higher-level category and later moved to finer categories • 528 categories: 20 main categories and 6 levels • confidence levels: "reject", "low", "medium", "high"; the default is "auto" • Data release management • new release data can be intelligently merged with the existing data pool • manual annotation is transferred between successive data releases • "manual" field: "yes" or "no", initially "no" by default • example: a PFAM domain identified in an ORF of a new release has "manual: no" and "conf: auto"
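
The "manual" and "conf" fields on slide 12 suggest a simple record structure. The sketch below shows one possible merge policy for a new data release: manually curated records are kept, and automatic records are replaced by those of the new release. The slides do not spell out PEDANT's actual merge rules, so this policy, the `Annotation` class, and `merge_release` are assumptions for illustration only.

```python
from dataclasses import dataclass

# Confidence values named on the slide; "auto" is the default.
CONF_LEVELS = ("auto", "reject", "low", "medium", "high")


@dataclass(frozen=True)
class Annotation:
    orf_id: str
    feature: str        # e.g. a domain or functional category identifier
    manual: str = "no"  # "yes" once a human annotator has reviewed it
    conf: str = "auto"

    def __post_init__(self):
        if self.manual not in ("yes", "no"):
            raise ValueError(f"manual must be 'yes' or 'no': {self.manual!r}")
        if self.conf not in CONF_LEVELS:
            raise ValueError(f"unknown confidence level: {self.conf!r}")


def merge_release(existing, new_release):
    """Merge a new release into the existing pool (assumed policy).

    Manual annotations survive unchanged; automatic annotations from the
    old pool are dropped and replaced by the new release, except where a
    manual annotation already covers the same (orf_id, feature) pair.
    """
    kept = [a for a in existing if a.manual == "yes"]
    curated = {(a.orf_id, a.feature) for a in kept}
    fresh = [a for a in new_release if (a.orf_id, a.feature) not in curated]
    return kept + fresh
```

A real system would also have to handle the cases on the following slide (fused genes or contigs, changed gene boundaries, new genes), which this sketch does not attempt.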

  13. Manual annotation transfer • Two genes fuse into one contig • Two contigs fuse into one • A gene boundary changes • A new gene appears

  14. The PEDANT Genome Database • Annotation of publicly available completely sequenced and unfinished genomes • Genomes annotated by MIPS • Completely sequenced and published genomic sequences • Unfinished and/or unpublished genomic sequences • gene prediction by ORPHEUS, allowing large overlaps between ORFs • PEDANT as a structural genomics resource: 0.3 million proteins • class-based approach, cost-saving • (i) non-redundant protein sequence database • (ii) PSI-BLAST search with SCOP sequences against (i), saving the resulting profiles • (iii) construct a SCOP profile library using IMPALA • (iv) IMPALA search with each genomic sequence against the SCOP library • same procedure for the non-redundant PDB sequence database • performance of IMPALA • Cross-genome comparison • treat each genome as an individual contig: create cross-genome datasets without any modification • 44 genomes

  15. Performance of IMPALA

  16. Applications • Arabidopsis thaliana chromosome IV • 3744 predicted protein-coding genes • roughly 30% are known proteins or strongly similar to known proteins • multicellular organisms have a higher proportion of all-alpha and a lower proportion of mixed alpha/beta structural domains than unicellular species • Assembled human transcripts • human UniGene subjected to PEDANT analysis, comprising over 75,000 contigs • the resulting MySQL database is close to 8 GB • acceptable query times show the suitability of PEDANT for large-scale EST sequencing projects • Analysis of the GroEL substrates • GroEL: a common E. coli chaperonin • structural motif common to 52 substrates that rely on GroEL for folding in vivo: two or more alpha/beta domains containing buried beta-sheets with large hydrophobic surfaces, which aggregate easily

  17. Classification of predicted genes • Classification by the degree of homology to functionally characterized proteins based on BLAST scores
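
Score-based classification of this kind amounts to binning each gene by its best BLAST score. The slide does not give the actual cutoffs or class names, so in the sketch below the thresholds are supplied by the caller, and the function name, the example labels, and the example cutoff values are all made up for illustration.

```python
def classify_by_score(score, thresholds, default="no similarity"):
    """Return the label of the highest threshold that the score reaches.

    thresholds: iterable of (min_score, label) pairs in any order.
    default: label returned when the score is below every threshold.
    """
    for min_score, label in sorted(thresholds, reverse=True):
        if score >= min_score:
            return label
    return default


# Hypothetical cutoffs, chosen only to demonstrate the function.
EXAMPLE_THRESHOLDS = [
    (200.0, "strong similarity"),
    (80.0, "similarity"),
    (40.0, "weak similarity"),
]
```

Passing the thresholds in as data keeps the decision rule separate from the cutoffs, which makes it easy to compare different sensitivity/selectivity trade-offs.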

  18. Summary and Outlook • PEDANT is a useful tool for genome annotation and bioinformatics research • It supports automated and manual assignment of gene products to functional and structural categories • extensive hyperlinked protein reports and advanced viewers • Outlook • better decision rules need to be employed • manual annotation of predicted genetic elements (e.g. LTRs) • support for the Oracle RDBMS • automatic gene prediction pipeline for higher eukaryotes • interactive capabilities
