
Improved techniques for the identification of pseudogenes



  1. Improved techniques for the identification of pseudogenes L. Coin and R. Durbin

  2. Introduction • What is a pseudogene? • A sequence originally derived from a functional gene. • No longer translated into a functional protein product. • ~20,000 human pseudogenes. • Two types • Unprocessed pseudogene. • Arises from gene duplication in the genome. • Subsequently lost its function. • Rapid degeneration observed in prokaryotes (not in eukaryotes). • Processed pseudogene • Arises from reverse transcription (no introns). • Role • A regulatory role has been observed for human pseudogenes.

  3. Introduction • Mis-annotation problem • Current approaches • Presence of stop codons and frameshift mutations. • But half of human pseudogenes have no detectable frameshifts or internal stop codons. • Ratio of non-synonymous to synonymous substitutions (dN/dS). • Neither is enough to find pseudogenes accurately.
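The first of the current approaches, looking for disablements, can be illustrated with a minimal sketch. The function name `disablements` and its return format are hypothetical. It assumes the input already starts in frame, and it uses "length not a multiple of three" only as a crude stand-in for frameshift detection, which in practice needs an alignment against the parent gene.

```python
# Stop codons of the standard genetic code (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}

def disablements(cds):
    """Report simple disablement evidence for a candidate coding sequence.

    Assumption: `cds` starts at the first base of the first codon.
    """
    seq = cds.upper().replace("-", "")  # drop alignment gaps, if any
    # Split into complete codons; a trailing partial codon is ignored here.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    # A terminal stop codon is normal, so only internal stops are counted.
    internal = codons[:-1] if codons and codons[-1] in STOP_CODONS else codons
    stops = [3 * i for i, codon in enumerate(internal) if codon in STOP_CODONS]
    return {
        "internal_stops": stops,  # 0-based base offsets of internal stop codons
        "length_not_multiple_of_3": len(seq) % 3 != 0,
    }

print(disablements("ATGTAAAAATAA"))
```

A sequence that passes both checks is not necessarily functional, which is the limitation this slide points out.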

  4. Algorithm • New approach • Look at pattern of substitution in conserved protein domains • Algorithm • Input • Alignment A • Unrooted tree T • Profile HMM D • Output • Score for each leaf-node which represents the belief that the node is a pseudogene.

  5. Algorithm • Alignment (ClustalW) • Xn. : row corresponding to leaf-node n. • X.i : i-th column. • A\Xn. : Alignment A excluding Xn. • Tree (Neighbor joining tree) • mj : j-th match column of profile HMM. • pn : parent node of n. • bn : branch from pn to n. • T\bn : Tree T excluding bn.

  6. Algorithm • Assumption • Null model : protein domain evolution on the whole tree • Test whether • The final branch to the query node evolved under an alternative model • Score for branch b is the log-odds ratio of either • Neutral (non-coding) DNA model (Pnuc(b)) versus null Pfam domain model (Pdom(b)), or • Protein coding model (Pprot(b)) versus null Pfam domain model (Pdom(b)).

  7. Algorithm • PSILC score • Cnuc = {Pnuc(bn), Pdom(T\bn)} : neutral DNA on bn, otherwise domain encoding. • Cprot = {Pprot(bn), Pdom(T\bn)} : protein coding on bn, otherwise domain encoding. • Cdom = {Pdom(bn), Pdom(T\bn)} : domain encoding on all of T, including bn.
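The score formulas themselves are not preserved in this transcript. A hedged reconstruction, assuming each score is the log-odds ratio described on slide 6, written in the row notation of slide 5 and using the score names that appear on slide 12, would be:

```latex
\begin{align*}
\mathrm{PSILC}_{nuc/dom}(n)  &= \log \frac{P(X_{n\cdot} \mid A \setminus X_{n\cdot},\, T,\, C_{nuc})}
                                          {P(X_{n\cdot} \mid A \setminus X_{n\cdot},\, T,\, C_{dom})} \\
\mathrm{PSILC}_{prot/dom}(n) &= \log \frac{P(X_{n\cdot} \mid A \setminus X_{n\cdot},\, T,\, C_{prot})}
                                          {P(X_{n\cdot} \mid A \setminus X_{n\cdot},\, T,\, C_{dom})}
\end{align*}
```

Under this reading, a positive score means the query row is better explained by the alternative model on the final branch than by domain-encoding evolution on the whole tree.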

  8. Algorithm • Assumption • Each entry Xni in the row Xn. is conditionally independent of the other entries Xni' (i'≠i), given the other rows of the alignment A\Xn., tree T and constraint Ck. • (3) assumes that Xni is conditionally independent of all other columns in the alignment, given X.i\Xni, tree T and constraint Ck. • (4) uses the tree property that a leaf-node is conditionally independent of all other nodes in the tree given its parent.
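The numbered equations referred to here are not preserved in this transcript. As a sketch only, the three independence assumptions combine into a factorization of roughly the following shape, where a ranges over the possible states of the parent node pn at column i. The symbol p_{n,i} for the parent state at column i is introduced here for illustration, and the equation numbering of the original slides is not reproduced.

```latex
\begin{align*}
P(X_{n\cdot} \mid A \setminus X_{n\cdot},\, T,\, C_k)
  &= \prod_i P(X_{ni} \mid X_{\cdot i} \setminus X_{ni},\, T,\, C_k) \\
  &= \prod_i \sum_a P(X_{ni} \mid p_{n,i} = a,\, b_n,\, C_k)\;
                     P(p_{n,i} = a \mid X_{\cdot i} \setminus X_{ni},\, T \setminus b_n,\, C_k)
\end{align*}
```

The first line uses the independence of row entries and of columns; the second uses the independence of a leaf from the rest of the tree given its parent.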

  9. Algorithm • From (4) • Calculate the frequency distribution at the parent node given the constraints on all branches excluding the branch to the query node. • For each possible base at the parent node, calculate the transition probability to the child node assuming the appropriate evolutionary constraints on the branch to the child node. • For the first calculation above, construct a new tree by • Re-rooting the tree at the parent node pn. • Removing the branch to the node n. • In (6) • First term is the likelihood of the reduced alignment conditional on each possible base at the root of T\bn. • Second term is the prior probability at the root given the evolutionary constraints.
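The two calculations on this slide can be sketched in a few lines. This is a toy nucleotide-level illustration, not the authors' implementation: the parent-node distributions are taken as given rather than computed from the re-rooted tree, one transition table is shared by all columns, and `simple_transition` is a hypothetical stand-in for a real evolutionary model.

```python
import math

BASES = "ACGT"

def simple_transition(stay):
    """Hypothetical toy model: keep the base with probability `stay`,
    otherwise move to one of the other three bases uniformly."""
    move = (1.0 - stay) / 3.0
    return {a: {b: (stay if a == b else move) for b in BASES} for a in BASES}

def column_likelihood(parent_dist, transition, child_base):
    """Sum over parent bases a of P(parent = a) * P(a -> child base)."""
    return sum(parent_dist[a] * transition[a][child_base] for a in BASES)

def branch_log_odds(parent_dists, query_row, trans_alt, trans_null):
    """Sum per-column log-likelihood ratios, treating columns as independent."""
    score = 0.0
    for dist, base in zip(parent_dists, query_row):
        alt = column_likelihood(dist, trans_alt, base)
        null = column_likelihood(dist, trans_null, base)
        score += math.log(alt / null)
    return score

# Parent-node distributions for two columns (made-up illustrative values).
parent_dists = [
    {"A": 0.9, "C": 0.05, "G": 0.03, "T": 0.02},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
relaxed = simple_transition(0.7)       # weaker constraint on the final branch
constrained = simple_transition(0.95)  # stronger constraint on the final branch

print(branch_log_odds(parent_dists, "AC", relaxed, constrained))
print(branch_log_odds(parent_dists, "GT", relaxed, constrained))
```

In this toy setup a query row that agrees with the likely parent bases scores below zero, while a row that departs from them scores above zero, which is the direction a log-odds score against the constrained null would be expected to move.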

  10. Algorithm • Prior distribution at root • Use the equilibrium distribution of the rate matrix. • Observed frequencies in the alignment for nucleotide and amino acid models. • Emission frequency distribution of the match state for the domain model. • Transition probability between different bases given different evolutionary constraints. • Pk(t) : matrix of transition probabilities over time t under evolutionary model Pk. • Q : rate matrix • r : rate parameter • For amino acid models, : database estimates. • For nucleotide models, • : Parameterized model (e.g. HKY model, introduce parameter ). • : uniform distribution. • Free parameter f : trade-off between frequencies in the equilibrium distribution resulting from pressure to mutate from (f=1) and pressure to mutate toward (f=0) a particular base. • Calculate the values of r and f that maximize the likelihood of alignment A given the tree.
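As an illustration of how a transition matrix can be obtained from a rate matrix, the sketch below assumes the usual continuous-time Markov form P(t) = exp(r·Q·t), which the slide does not state explicitly. It builds an HKY-style Q, where `kappa` is the standard HKY transition/transversion parameter and is an assumption about the symbol missing from the slide. The free parameter f is not modelled, and all function names are hypothetical.

```python
BASES = "ACGT"
# Transitions are purine<->purine (A, G) or pyrimidine<->pyrimidine (C, T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def hky_rate_matrix(pi, kappa):
    """HKY-style rates: i -> j occurs at rate pi[j], times kappa for transitions."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            if a != b:
                q[i][j] = pi[b] * (kappa if (a, b) in TRANSITIONS else 1.0)
        q[i][i] = -sum(q[i])  # diagonal makes each row sum to zero
    return q

def matmul(x, y):
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(m, squarings=10, terms=12):
    """Matrix exponential by scaling and squaring with a truncated Taylor series."""
    n = len(m)
    scaled = [[v / (2 ** squarings) for v in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term_k = term_(k-1) * scaled / k
        term = [[v / k for v in row] for row in matmul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result

def transition_matrix(q, r, t):
    """P(t) = exp(r * Q * t), assuming a continuous-time Markov model."""
    return expm([[r * t * v for v in row] for row in q])

pi = {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3}  # made-up equilibrium frequencies
p = transition_matrix(hky_rate_matrix(pi, kappa=2.0), r=1.0, t=0.5)
for base, row in zip(BASES, p):
    print(base, [round(v, 4) for v in row])
```

Q is left unnormalized here, so the rate parameter r absorbs the overall scale. For long times each row of P(t) approaches the equilibrium distribution, which is why that distribution is a natural prior at the root.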

  11. Algorithm • Directionality of the calculation • The score on an alignment of two transcripts x1, x2 is not symmetric. • If base X1i is more likely than X2i at a particular match state, but the two are equally likely under the protein model, the score for x2 being a pseudogene is higher than the score for x1. • dN/dS does not have this property.

  12. Results • Test Data • 598 coding transcripts. • 97 pseudogenes. • Only applicable when a Pfam domain can be aligned. • 68%/61% of coding transcripts/pseudogenes. • PSILCprot/dom out-performs all other methods. • PSILCnuc/dom was expected to out-perform PSILCprot/dom, but it does not. • The method somehow penalizes DNA evolution relative to protein evolution.

  13. Discussion • Applicable to large-scale analysis • Quality check on gene annotation databases to identify potential pseudogenes. • A scan of various genomes for pseudogenes. • Analysis of functional DNA constraints on pseudogenes. • Future work • Infer loss of constraint along an entire clade of a tree. • Score mutations to predict the potential loss of functionality from a SNP. • Potential problem • Genes under positive selection will be misclassified as pseudogenes.
