Functional and structural genomics using PEDANT

林千涵, Bioinformatics Program, Institute of Biotechnology, Yang-Ming University

Introduction
• With the growing volume of biological sequence data, the field needs a system able to store and retrieve tens of gigabytes of data, a mature database management system, and good visualization tools
• Shift from case-oriented sequence analysis to automated large-scale genome annotation

Introduction-PEDANT
• Differences among existing genome analysis programs
  • protein-oriented vs. DNA-oriented analysis
  • interactive work vs. command-line operation
  • bioinformatics methods applied
  • user interface
  • convenience features, project management and data editors
  • fidelity of the results produced
• Benchmarks may vary depending on the chosen balance between sensitivity and selectivity of the analyses
• PEDANT (Protein Extraction, Description, and ANalysis Tool) became available in mid-1997 (using FASTA for similarity searches)
  • a workhorse for general bioinformatics research
  • a common framework for a number of genome analysis projects
  • a complete database of automatically annotated genomes
  • a tool for routine analysis of large amounts of genomic contigs and ESTs

System Architecture
• Overview
  • database module: storing, modifying and accessing data
  • processing module: bioinformatics computations
  • user interface: web-based communication

System Architecture-Cont.
• Data access
  • primary tables: store raw data (e.g. DNA and protein sequences) and unparsed program results (e.g. BLAST output)
  • secondary tables: parsed program results
  • simplified schema
• Operation in command-line mode
  • applying bioinformatics methods to sequences
  • parsing data tables
  • querying the resulting databases
• Web interface
  • no static HTML pages required
  • DNA and protein viewers access the SQL tables directly
• Implementation and system requirements
  • Perl 5, and C++ for the graphical viewers
• Performance
  • parallel capabilities
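The split between primary and secondary tables can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the table names (`blast_raw`, `blast_hits`), the columns, the toy tab-separated "program output", and the use of Python's built-in `sqlite3` as a stand-in for the SQL server. It is not the actual PEDANT schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one primary and one secondary table.

    Table and column names are hypothetical, not the real PEDANT schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE blast_raw (      -- primary table: unparsed program output
            orf_id TEXT PRIMARY KEY,
            output TEXT NOT NULL
        );
        CREATE TABLE blast_hits (     -- secondary table: parsed results
            orf_id  TEXT NOT NULL,
            subject TEXT NOT NULL,
            evalue  REAL NOT NULL
        );
        """
    )
    return conn


def parse_hits(raw_output):
    """Parse a toy hit list with one 'subject<TAB>e-value' pair per line."""
    hits = []
    for line in raw_output.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        subject, evalue = line.split("\t")
        hits.append((subject, float(evalue)))
    return hits


def store_result(conn, orf_id, raw_output):
    """Store the raw output, then fill the secondary table from the parsed hits."""
    conn.execute("INSERT INTO blast_raw VALUES (?, ?)", (orf_id, raw_output))
    conn.executemany(
        "INSERT INTO blast_hits VALUES (?, ?, ?)",
        [(orf_id, subject, evalue) for subject, evalue in parse_hits(raw_output)],
    )
    conn.commit()


def significant_hits(conn, max_evalue):
    """Query the secondary table only; the raw output is never re-parsed."""
    cur = conn.execute(
        "SELECT orf_id, subject FROM blast_hits WHERE evalue <= ? ORDER BY evalue",
        (max_evalue,),
    )
    return cur.fetchall()


conn = build_demo_db()
store_result(conn, "orf0001", "# toy output\nP12345\t1e-50\nQ99999\t0.5\n")
print(significant_hits(conn, 1e-5))
```

Keeping the raw output in its own table means a parser can be changed and re-run later without repeating the underlying computation, while queries and viewers only touch the parsed table.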

Schema

Bioinformatics Method
• Overview of the PEDANT processing pipeline
  • identification of coding regions and various genetic elements
  • homology search
  • detection of protein motifs, prediction of secondary structure and other protein features, and sensitive fold recognition
  • automatic assignment of proteins to pre-defined functional categories
• Prediction of genes and other genetic elements
  • Table 1
  • choose one of 15 genetic codes
  • http://www.ncbi.nlm.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c
• Functional and structural categories
  • similarity search: PSI-BLAST (Position-Specific Iterated BLAST)
  • special datasets: MIPS, COG, PROSITE, PFAM and BLOCKS
  • for significant matches in PIR: annotations, keywords, enzyme classification and superfamily information
  • for proteins significantly related to PDB entries, secondary structure information: STRIDE (upper case), PREDATOR (lower case)
  • low-complexity regions, membrane regions, coiled coils and signal peptides
  • comparison against SCOP using IMPALA for structural classification
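Why the choice of genetic code matters for gene prediction can be shown with a small sketch. It translates the same DNA under NCBI translation table 1 (standard) and table 4 (Mycoplasma/Spiroplasma and some mitochondrial genomes), where TGA encodes tryptophan instead of a stop. The function names are illustrative and only two of the tables are included; this is not how PEDANT itself is implemented.

```python
from itertools import product

BASES = "TCAG"

# NCBI translation tables written as 64-letter strings; codons are ordered
# TTT, TTC, TTA, TTG, TCT, ... (first base varies slowest, base order T, C, A, G).
CODE_TABLES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    4: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
}


def codon_map(table_id):
    """Build a codon -> amino acid dictionary for one translation table."""
    amino_acids = CODE_TABLES[table_id]
    codons = ("".join(triplet) for triplet in product(BASES, repeat=3))
    return dict(zip(codons, amino_acids))


def translate(dna, table_id=1):
    """Translate DNA in frame 0; unknown codons become 'X', stops become '*'."""
    table = codon_map(table_id)
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(table.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


print(translate("ATGTGATAA", table_id=1))  # M**
print(translate("ATGTGATAA", table_id=4))  # MW*
```

Under the standard code the reading frame stops after the first codon, while under table 4 it continues through TGA, so the predicted ORF boundaries differ.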

Table 1

Bioinformatics Method-Cont.
• Yeast biological role categories
  • first system of biological role categories: E. coli
  • MIPS: advanced hierarchical functional catalogue (yeast)
  • multidimensionality: the protein:gene relationship is many-to-many (M:M)
  • automated assignment to MIPS categories is a first approximation, to be refined by manual annotation
• Distribution of ORFs
• Visualization
  • an integrated, hypertext-linked protein report with calculated parameters and sequences, serving as a reference for further manual annotation
  • Protein report page
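A hierarchical catalogue with many-to-many assignments might be modelled roughly as follows. This sketch assumes dotted category codes whose parents are code prefixes, and it assumes the many-to-many relationship holds between gene products and categories. The codes, labels, gene names and the `Assignments` class are placeholders, not the actual MIPS catalogue or PEDANT data model.

```python
from collections import defaultdict

# Placeholder catalogue: a parent category is the dotted prefix of its child.
CATALOGUE = {
    "01": "metabolism",
    "01.01": "amino acid metabolism",
    "02": "energy",
}


def lineage(code):
    """Return the code with all its parent codes, e.g. '01.01' -> ['01', '01.01']."""
    parts = code.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


class Assignments:
    """Many-to-many mapping between gene products and catalogue categories."""

    def __init__(self):
        self.by_gene = defaultdict(set)      # gene -> directly assigned codes
        self.by_category = defaultdict(set)  # code -> genes at or below it

    def assign(self, gene, code):
        if code not in CATALOGUE:
            raise KeyError(f"unknown category code: {code}")
        self.by_gene[gene].add(code)
        # A gene in a fine category also counts for every parent category.
        for parent_or_self in lineage(code):
            self.by_category[parent_or_self].add(gene)

    def genes_in(self, code):
        return sorted(self.by_category.get(code, set()))


catalogue = Assignments()
catalogue.assign("geneA", "01.01")  # one gene, several categories
catalogue.assign("geneA", "02")
catalogue.assign("geneB", "01")     # coarse assignment, can be refined later
print(catalogue.genes_in("01"))
```

Allowing an assignment at any level of the hierarchy fits a workflow where an automatic first approximation is later refined by hand.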

Distribution of ORFs

Protein report page

Bioinformatics Method-Cont.2
• Automatic versus manual annotation
  • problem of error propagation
  • erroneous annotations arise from human error and spurious similarity hits
  • mitigation with filtering algorithms and domain structure?
  • quality improvement through manual review by human experts!
• Manual annotation
  • catalogue independent
  • flexibility: place a protein in a higher-level category first and move it to finer categories in a later step
  • 528 categories: 20 main categories and 6 levels
  • confidence levels: “reject”, “low”, “medium”, “high”; the default is “auto”
• Data release management
  • new release data can be intelligently merged with the existing data pool
  • manual annotation is transferred between subsequent data releases
  • “manual” field: “yes” or “no”; the initial default is “no”
  • example: a PFAM domain identified in an ORF of a new release has “manual: no” and “conf: auto”
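The interplay of the “manual” and “conf” fields during a release merge can be sketched as below. The merge policy (manual entries survive if their ORF still exists, automatic entries are replaced), the `Annotation` class, the `(orf_id, attribute)` keys, and all identifiers and values in the demo are assumptions for illustration only, not PEDANT's actual merge logic.

```python
from dataclasses import dataclass

CONF_LEVELS = ("auto", "reject", "low", "medium", "high")


@dataclass(frozen=True)
class Annotation:
    value: str
    manual: str = "no"   # "yes" once a human expert has edited the entry
    conf: str = "auto"   # default confidence for automatic results

    def __post_init__(self):
        if self.manual not in ("yes", "no"):
            raise ValueError(f"manual must be 'yes' or 'no', got {self.manual!r}")
        if self.conf not in CONF_LEVELS:
            raise ValueError(f"unknown confidence level: {self.conf!r}")


def merge_release(old, new):
    """Merge two releases keyed by (orf_id, attribute).

    Assumed policy: automatic entries are taken from the new release; manual
    entries from the old release are carried over when their ORF still exists.
    """
    new_orfs = {orf_id for orf_id, _ in new}
    merged = dict(new)
    for (orf_id, attribute), annotation in old.items():
        if annotation.manual == "yes" and orf_id in new_orfs:
            merged[(orf_id, attribute)] = annotation
    return merged


# Placeholder identifiers and values, for demonstration only.
old_release = {
    ("orf1", "function"): Annotation("protein kinase", manual="yes", conf="high"),
    ("orf1", "pfam"): Annotation("domain_A"),
}
new_release = {
    ("orf1", "function"): Annotation("hypothetical protein"),
    ("orf1", "pfam"): Annotation("domain_B"),
    ("orf2", "pfam"): Annotation("domain_C"),  # new ORF: manual "no", conf "auto"
}

for key, annotation in sorted(merge_release(old_release, new_release).items()):
    print(key, annotation)
```

A real merge also has to map ORFs between releases when identifiers or coordinates change, which this sketch leaves out.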

Manual annotation transfer
• Two genes fuse into one contig
• Two contigs fuse into one
• Gene boundaries change
• A new gene appears

The PEDANT Genome Database
• Annotation of publicly available completely sequenced and unfinished genomes
  • genomes annotated by MIPS
  • completely sequenced and published genomic sequences
  • unfinished and/or unpublished genomic sequences
  • gene prediction by ORPHEUS, allowing large overlaps between ORFs
• PEDANT as a structural genomics resource: 0.3M proteins
  • class-based approach, cost-saving
  • (i) non-redundant protein sequence databases
  • (ii) PSI-BLAST search with SCOP against (i), saving the resulting profiles
  • (iii) construct a SCOP profile library using IMPALA
  • (iv) IMPALA search with each genomic sequence against the SCOP library
  • same procedure for the nr PDB sequence database
  • performance of IMPALA
• Cross-genome comparison
  • treat each genome as an individual contig: create cross-genome datasets without any modification
  • 44 genomes

Performance of IMPALA

Applications
• Arabidopsis thaliana chromosome IV
  • 3744 predicted protein-coding genes
  • roughly 30% are known proteins or strongly similar to known proteins
  • multicellular organisms have a higher ratio of all-alpha and a lower ratio of mixed alpha/beta structural domains than unicellular species
• Assembled human transcripts
  • human UniGene subjected to PEDANT analysis: over 75000 contigs
  • this MySQL database is close to 8 GB
  • acceptable query times show the suitability of PEDANT for large-scale EST sequencing projects
• Analysis of the GroEL substrates
  • GroEL: a common E. coli chaperonin
  • structural motif common to 52 substrates relying on GroEL for folding in vivo: two or more alpha/beta domains involving buried beta-sheets with large hydrophobic surfaces, which aggregate easily

Classification of predicted genes
• Classification by the degree of homology to functionally characterized proteins, based on BLAST scores
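A score-based classification of this kind could look like the following sketch. The class names, the two thresholds and the function signature are hypothetical values chosen for illustration; the slide does not give the actual cut-offs or the exact set of classes.

```python
# Hypothetical thresholds on the best BLAST score; not taken from the source.
STRONG_SIMILARITY_SCORE = 200.0
WEAK_SIMILARITY_SCORE = 50.0


def classify_gene(best_score, hit_is_characterized):
    """Assign a predicted gene to a homology class from its best database hit.

    best_score: score of the best hit, or None when no hit was found.
    hit_is_characterized: True if the best hit is a functionally
    characterized protein.
    """
    if best_score is None or best_score < WEAK_SIMILARITY_SCORE:
        return "no similarity"
    if not hit_is_characterized:
        return "similar to uncharacterized protein"
    if best_score >= STRONG_SIMILARITY_SCORE:
        return "strong similarity to known protein"
    return "similarity to known protein"


# Toy input: (gene, best score, whether the best hit is characterized).
demo_hits = [
    ("gene1", 350.0, True),
    ("gene2", 80.0, True),
    ("gene3", 120.0, False),
    ("gene4", None, False),
]
for gene, score, characterized in demo_hits:
    print(gene, classify_gene(score, characterized))
```

Fixed thresholds like these trade sensitivity against selectivity, so any real cut-offs would need to be calibrated against manually reviewed annotations.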

Summary and Outlook
• PEDANT is a useful tool for genome annotation and bioinformatics research
• It supports automated and manual assignment of gene products to functional and structural categories
• Extensive hyperlinked protein reports and advanced viewers
• Outlook
  • better decision rules need to be employed
  • manual annotation of predicted genetic elements (e.g. LTRs)
  • support for the Oracle RDBMS
  • automatic gene prediction pipeline for higher eukaryotes
  • interactive capabilities
