From Transcriptome to Secretome: Bioinformatics Protocol and Tool Development for EST Sequence Analysis, Annotation and Identification of Secretory Proteins Dr. Jack Min
Extra-Readings • EST_review.pdf • EST_piper.pdf • OrfPredictor.pdf • TargetIdentifier.pdf
Project Objectives Identifying secretory enzymes from 15 fungal species by high throughput EST sequencing. Evaluation of the potential applications of identified enzymes in industrial processing and in environmental remediation.
What is an EST? EST: Expressed Sequence Tag. A single-pass (single-run) sequence read from a cDNA clone, usually from one end. The sequences generated in our fungal genomics project are ESTs.
Bioinformatics Objectives To control EST sequence quality: remove vector contaminants and low quality portions. To assemble ESTs into contigs/singletons to improve the sequence accuracy and predict the unique gene number. To functionally annotate ESTs and their assemblies: functionality, predicted protein sequences, functional domains, signal peptide.
Bioinformatics Objectives To identify potential target full-length ESTs and their assembled contigs. Targets: secretory enzymes. To map ESTs to the genome – to elucidate its gene structure. To make the data accessible to biologists.
Bioinformatics Subsystems & Databases • GO Annotation (AmiGO) • CloneTrack • Internal EST Annotation • Public EST Annotation • Genome Annotation (EST to Genome & Gene Prediction) • MicroArray (BASE) • TargetFinding • Secretome Analysis • Re-sequenced Clone Annotation • Passport
Sequence Quality Control and Assembling: Pipeline 1. Steps shown, in order: sequence chromatograms; Phred; Lucy; BLASTN (E. coli/vector/plasmid DB); bvpFinder (remove contaminants); Phrap; singletons/contigs.
Bioinformatics Protocols and Tools Pipeline 1: sequence quality control and assembling (Figure 1). Sequence chromatogram: a file generated by a sequencing machine. Phred: a base-calling program that reports each base and its quality score. Lucy: a program that trims low-quality sequence and removes vector contaminants. Phrap: an assembler that merges overlapping ESTs into a consensus sequence.
Sequence Data Analysis • Phred (Dr. Green): reads DNA sequencer trace data, calls bases, assigns quality values, and writes the base calls and quality values to output files. • q = -10*log10(p), where q is the quality value and p is the estimated error probability for a base call.
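The quality formula is easy to check numerically. The following is a minimal sketch of the conversion in both directions; the function names are illustrative and are not part of Phred itself.

```python
import math

def phred_quality(p):
    """Quality value q for an estimated base-call error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

def error_probability(q):
    """Inverse of the formula: error probability p for a quality value q."""
    return 10 ** (-q / 10)

if __name__ == "__main__":
    # An error probability of 1 in 100 corresponds to q = 20,
    # and q = 30 corresponds to 1 error in 1000 base calls.
    print(phred_quality(0.01))
    print(error_probability(30))
```

Each increase of 10 in q corresponds to a tenfold drop in the estimated error probability.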
Sequence Data Analysis • Lucy (Dr. Chou): a utility that prepares raw DNA sequence fragments for sequence assembly. • computes good quality regions; • removes vector contaminants; • chops off splice sites (vector and adaptor) from the sequence; • removes shorter insert sequences; • produces the cleaned sequence file for good quality regions, and a companion quality file.
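The idea of keeping only a good-quality region can be illustrated with a deliberately simple stand-in. This is not Lucy's actual algorithm: it just keeps the longest run of bases whose quality is at or above a threshold. The function names and the default threshold `min_q=20` are assumptions made for illustration.

```python
def longest_good_region(quals, min_q=20):
    """Return (start, end) of the longest run with every quality >= min_q.

    The interval is half-open, so seq[start:end] is the kept region.
    Returns (0, 0) when no base passes the threshold.
    """
    best_start, best_end = 0, 0
    run_start = None
    for i, q in enumerate(quals):
        if q >= min_q:
            if run_start is None:
                run_start = i
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            run_start = None
    return best_start, best_end

def trim(seq, quals, min_q=20):
    """Return the trimmed sequence and its companion quality values."""
    start, end = longest_good_region(quals, min_q)
    return seq[start:end], quals[start:end]

if __name__ == "__main__":
    seq = "ACGTACGT"
    quals = [5, 8, 25, 30, 32, 10, 28, 29]
    print(trim(seq, quals))
```

Returning the sequence together with its quality values mirrors the two cleaned output files described above.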
Sequence Data Analysis • bvpFinder: our in-house program to identify contaminants. • Phrap (Dr. Green): a program for assembling DNA sequences. • Input: sequence file and quality file from Lucy. • Output: singletons, contigs, and ace file.
Data obtained after running Pipeline 1 • Total sequence number; • Low quality sequence number; • Short sequence number; • Vector and other contaminant number; • Good sequence number; • Average good sequence length; • Singleton number; • Contig number.
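The read-level counts in this list can be tallied as sketched below. The status labels (`good`, `low_quality`, `short`, `contaminant`) and the input layout are hypothetical, chosen only for illustration. Singleton and contig numbers come from the assembler output and are not computed here.

```python
from collections import Counter

def summarize_reads(reads):
    """Summarize reads given as (status, length) pairs.

    status is one of the hypothetical labels:
    'good', 'low_quality', 'short', 'contaminant'.
    """
    counts = Counter(status for status, _ in reads)
    good_lengths = [length for status, length in reads if status == "good"]
    average = sum(good_lengths) / len(good_lengths) if good_lengths else 0.0
    return {
        "total": len(reads),
        "low_quality": counts["low_quality"],
        "short": counts["short"],
        "contaminant": counts["contaminant"],
        "good": counts["good"],
        "average_good_length": average,
    }

if __name__ == "__main__":
    # Made-up example reads, not project data.
    example = [("good", 500), ("good", 700), ("short", 40),
               ("low_quality", 0), ("contaminant", 300)]
    print(summarize_reads(example))
```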
Sequence Annotation: Pipeline 2. Input: sequence file (ESTs/assemblies). Steps shown: BLASTX (hit / no hit); TargetIdentifier/Annotator (function known / function unknown); OrfPredictor; CDD, SignalP, TMHMM, Phobius, GO.
Pipeline 2: sequence annotation. BLASTX: NCBI BLAST search of translated DNA queries against the NCBI non-redundant protein database. TargetIdentifier/Annotator: an in-house tool that predicts whether a query cDNA sequence is full length and assigns functional annotation based on BLASTX. Available as a public web server. OrfPredictor: an in-house program that predicts the open reading frame of an anonymous cDNA sequence. Available as a public web server. SignalP: a program that predicts whether a protein contains a signal peptide. TMHMM: transmembrane domain prediction. Phobius: signal peptide and transmembrane domain prediction. GO: gene ontology mapping.
An Example of BLASTX Output

Query= Asn_00765_MB111_076.ab1 CHROMAT_FILE: Asn_00765_MB111_076.ab1 PHD_FILE: Asn_00765_MB111_076.ab1.phd.1 CHEM: Unknown DYE: Unknown ET TIME: Mon Sep 23 16:14:14 2002
         (483 letters)

Database: FungalTargetDB
           28,671 sequences; 10,818,323 total letters

Searching..................................................done

                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

gi|17544913|ref|NP_518315.1| PROBABLE ZINC_DEPENDENT ALCOHOL DEH...    38   1E-20

>gi|17544913|ref|NP_518315.1| PROBABLE ZINC_DEPENDENT ALCOHOL DEHYDROGENASE OXIDOREDUCTASE PROTEIN [Ralstonia solanacearum]
          Length = 345

 Score = 38.1 bits (87), Expect = 1E-20
 Identities = 17/39 (43%), Positives = 25/39 (63%)
 Frame = +1

Query: 367 MRAVDFRGPYKVADEERPVPRIQDAGDIVVSVTYTALWG 483
           M+A+ + GP K++ E +P P +Q  GD VV VT T + G
Sbjct: 1   MKALVYEGPGKISLENKPKPELQAPGDAVVRVTLTTICG 39

Jack Min, Concordia University, Montreal, QC
Algorithm for Identifying Full-length Genes
[Figure: query cDNAs drawn 5’ to 3’ and aligned against a hit (subject), with the positions d1 and d2 marked.]
Legend: * stop codon before start codon; ☺ start codon; ▲ stop codon after start codon.
Rule:
• If (d2 - d1) < 10: full-length; then check the 3’ stop codon.
• Else if 10 <= (d2 - d1) < 50: short (#) full-length; then check the 3’ stop codon.
• Else: partial.
Categories shown: full-length, completely sequenced; full-length, not completely sequenced; ambiguous; partial.
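The threshold rule on (d2 - d1) can be written as a small function. This is a sketch of that one rule only, not the TargetIdentifier implementation. Treating the 3’ stop-codon check as the test for "completely sequenced" is an assumption, and the function name and the `has_3prime_stop` parameter are illustrative.

```python
def classify_full_length(d1, d2, has_3prime_stop):
    """Classify a query from the offsets d1 and d2 marked in the figure.

    Thresholds 10 and 50 are taken from the slide. Mapping the 3' stop-codon
    check onto 'completely sequenced' is an assumption of this sketch.
    """
    diff = d2 - d1
    if diff < 10:
        label = "full-length"
    elif diff < 50:
        label = "short full-length"
    else:
        return "partial"
    if has_3prime_stop:
        return label + ", completely sequenced"
    return label + ", not completely sequenced"

if __name__ == "__main__":
    print(classify_full_length(5, 10, True))
    print(classify_full_length(0, 30, False))
    print(classify_full_length(0, 50, True))
```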
[Flowchart: decision tree for identifying full-length genes. Decision nodes, each with yes/no branches: in-frame start codon; 5’ stop codon; (d1 + d3/3 - d2) ? 0; (d2 - d1) < 10; (d2 - d1) < 50. Outcomes: full-length; short full-length; possible full-length; ambiguous; partial.]
ORF-Predictor
[Figure: ten cDNA cases, (A) to (J), each drawn from 5’ to an "AAA 3’" end. Legend: stop codon at 5’ end; start (initiation) codon; internal start codon; stop codon at 3’ end; sequenced portion; unsequenced or truncated portion.]
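One common way to pick an ORF in an anonymous cDNA is to take the longest ATG-to-stop stretch over all six reading frames. The sketch below illustrates that idea only and is not claimed to be OrfPredictor's method. Unlike the cases in the figure, it ignores ORFs that are truncated at either end. All names are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def longest_orf(seq):
    """Return (frame, start, end, orf) for the longest ATG..stop ORF, or None.

    frame is '+1'..'+3' or '-1'..'-3'; start and end are 0-based, half-open
    coordinates on the strand that was scanned.
    """
    seq = seq.upper()
    best = None
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            start = None
            for i in range(offset, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens an ORF
                elif start is not None and codon in STOP_CODONS:
                    orf = s[start:i + 3]
                    if best is None or len(orf) > len(best[3]):
                        best = (strand + str(offset + 1), start, i + 3, orf)
                    start = None  # look for the next ORF in this frame
    return best

if __name__ == "__main__":
    print(longest_orf("CCATGAAATTTTAGGG"))
```

A tool for ESTs would also need to handle the truncated cases, where the start codon or the stop codon lies outside the sequenced portion.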
ESTs Mapping to Genome: Pipeline 3. Inputs: ESTs/assemblies and genomic sequences. Steps shown: SIM4 (mapped / unmapped); SIM2GFF; Generic Genome Browser.
Bioinformatics Protocols and Tools Pipeline 3: mapping to genome. SIM4: a program to map cDNAs to genomic DNA sequences. SIM2GFF: our in-house tool used to convert SIM4 output into GFF format and to integrate functional annotation. Generic Genome Browser: a web browser for displaying genome annotation.
Distribution of E-values in BLASTX of the assembly data set. The E-value shown for each EST assembly is that of its top BLASTX hit against the NCBI non-redundant protein database, with a cutoff E-value of 1E-5.
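Applying the cutoff amounts to keeping only assemblies whose top-hit E-value does not exceed 1E-5. The sketch below assumes the top hits have already been parsed into a dictionary. The function name, the input layout, and the choice to treat the cutoff as inclusive are assumptions.

```python
def significant_top_hits(top_hit_evalues, cutoff=1e-5):
    """Keep assemblies whose top-hit E-value is at or below the cutoff.

    top_hit_evalues maps an assembly identifier to the E-value of its
    top BLASTX hit. The inclusive comparison is an assumption.
    """
    return {name: e for name, e in top_hit_evalues.items() if e <= cutoff}

if __name__ == "__main__":
    # Made-up identifiers and E-values, not project data.
    example = {"contig_a": 1e-20, "contig_b": 0.003, "contig_c": 1e-5}
    print(significant_top_hits(example))
```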
(c) GO: Molecular function (total 2177). Binding (46%, 1000); Catalytic activity (61%, 1331).
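The percentages follow from dividing each category count by the total of 2177, as this short check shows. The function name is illustrative. The two categories sum to more than 100%, presumably because one sequence can carry more than one GO term.

```python
def percent_of_total(count, total):
    """Percentage of the total, rounded to the nearest whole number."""
    return round(100 * count / total)

if __name__ == "__main__":
    total = 2177
    for category, count in (("Binding", 1000), ("Catalytic activity", 1331)):
        print(category, percent_of_total(count, total))
```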
Types of Alternative Splicing Events • Alternative donor site • Alternative acceptor site • Exon skipping • Mutually exclusive exons • Retained intron