Position weight matrix (PWM), perceptron, and their applications
Xuhua Xia
http://dambe.bio.uottawa.ca
Sequence Annotation
Objective of sequence annotation: Given a nucleotide or amino acid sequence, find its biological function using bioinformatics tools.
Approaches:
Sequences flanking the initiation codon of 508 CDSs:
N: Number of sequences, i.e. 508
L: Sequence length, i.e., 13
i: A, C, G or T
j: Site index, i.e., 1, 2, ..., 13
Site-specific frequencies: Pij
Non-site-specific (global) frequencies: Pi
S = ACGGTACCACGTT
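The definitions above can be sketched in code. This is a minimal illustration, assuming the common log-odds form PWM_ij = log2(Pij / Pi) and a simple pseudocount scheme; the function names, the `alpha` parameter, and the four toy sequences are hypothetical and stand in for the 508 real 13-mers.

```python
import math

def build_pwm(seqs, alpha=0.01, alphabet="ACGT"):
    """Build a log-odds PWM from equal-length aligned sequences.

    Returns (pwm, background), where pwm[j][base] is the weight of
    `base` at site j and background[base] is the global frequency Pi.
    Assumes every letter of the alphabet occurs at least once overall.
    """
    n = len(seqs)              # N: number of sequences
    length = len(seqs[0])      # L: sequence length
    total = n * length
    # Non-site-specific (global) frequencies Pi, pooled over all sites.
    background = {b: sum(s.count(b) for s in seqs) / total for b in alphabet}
    pwm = []
    for j in range(length):
        column = [s[j] for s in seqs]
        row = {}
        for b in alphabet:
            # Assumed pseudocount scheme: Pij = (count + alpha * Pi) / (N + alpha)
            p_ij = (column.count(b) + alpha * background[b]) / (n + alpha)
            row[b] = math.log2(p_ij / background[b])
        pwm.append(row)
    return pwm, background

def pwm_score(pwm, seq):
    """PWMS of a sequence: the sum of the site-specific weights."""
    return sum(pwm[j][base] for j, base in enumerate(seq))

# Hypothetical toy alignment (4 sequences of length 4).
toy_seqs = ["ACGT", "ACGA", "ACTT", "GCGT"]
pwm, background = build_pwm(toy_seqs)
consensus_score = pwm_score(pwm, "ACGT")
unrelated_score = pwm_score(pwm, "TGAC")
```

A sequence resembling the aligned set (here `ACGT`) gets a large positive score, while one using bases never seen at those sites (here `TGAC`) gets a strongly negative score.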
Figure 5-1. Illustration of scanning the 5’-end of the NCF4 gene (30 bases upstream of the initiation codon ATG and 27 bases downstream of ATG). The highest peak, with PWMS = 12.3897, corresponds to the 13-mer with 5 bases flanking the ATG on each side. PWMS computed with α = 0.01.
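The scanning in Figure 5-1 amounts to sliding a window of the PWM's width along the sequence and scoring each window. A minimal sketch follows, using a hypothetical 3-site PWM that favours ATG and a hypothetical 7-base sequence rather than the real 13-site matrix and the NCF4 sequence; `scan` is an illustrative name.

```python
def scan(pwm, seq):
    """Score every window of len(pwm) along seq.

    Returns a list of (start_index, score) tuples, one per window.
    """
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = sum(pwm[j][base] for j, base in enumerate(window))
        hits.append((start, score))
    return hits

# Hypothetical weights: +2 for the favoured base at each site, -1 otherwise.
toy_pwm = [
    {"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 2.0},
    {"A": -1.0, "C": -1.0, "G": 2.0, "T": -1.0},
]

hits = scan(toy_pwm, "CCATGCC")
# The highest peak marks the best-matching window.
best_start, best_score = max(hits, key=lambda h: h[1])
```

Here the peak falls at index 2, where the window is exactly `ATG`; in a real scan the peak position is read off the same way.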
Table 3. Site-specific frequencies and position weight matrix (PWM) for 275 5′ ss. The consensus sequence (UAAAG|GUAUGUU UAAUU) can be obtained from those large site-specific PWM entries, with the most important sites in bold italics. The χ² test is performed for each site against the background frequencies (A = 0.3279, C = 0.1915, G = 0.2043, and U = 0.2763). The nucleotide sites are labeled with the five exon nucleotides as −5 to −1 and the 12 intron nucleotides as 1 to 12. The PWM is nearly identical when the introns in the 5′ UTR were excluded.
Ma and Xia 2011
Table 4. Site-specific frequencies and position weight matrix (PWM) for 278 3’ ss. The consensus sequence (UUUUUUUUAYAG|GCUUC) can be obtained from those large site-specific PWM entries, with the most important sites in bold italics. The χ² test is performed for each site against the expected background frequencies. The sites are labeled with the first exon site as 1.
Ma and Xia 2011
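The per-site χ² test mentioned in the captions of Tables 3 and 4 compares the observed nucleotide counts at one site with the counts expected from the background frequencies. The sketch below computes only the test statistic; the background frequencies are those quoted for Table 3, while the observed counts are hypothetical (a strongly conserved G site among 275 sequences) and the function name is illustrative.

```python
def chi_square_site(counts, background):
    """Chi-square goodness-of-fit statistic for one site.

    counts: observed count per nucleotide at the site.
    background: expected frequency per nucleotide (sums to 1).
    """
    n = sum(counts.values())
    stat = 0.0
    for base, p in background.items():
        expected = n * p
        stat += (counts.get(base, 0) - expected) ** 2 / expected
    return stat

# Background frequencies quoted in the Table 3 caption.
background = {"A": 0.3279, "C": 0.1915, "G": 0.2043, "U": 0.2763}

# Hypothetical observed counts at a highly conserved site (N = 275).
counts = {"A": 5, "C": 3, "G": 262, "U": 5}

stat = chi_square_site(counts, background)

# Critical value of chi-square with 3 degrees of freedom at the 0.05 level.
CRITICAL_3DF_05 = 7.815
site_is_informative = stat > CRITICAL_3DF_05
```

With four nucleotides there are 3 degrees of freedom, so a statistic above 7.815 indicates that the site's composition departs from the background at the 0.05 level.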
Table 6. Position weight matrix scores (PWMS, as a proxy for splicing strength) are significantly smaller for splice sites from intron-containing genes (ICGs) whose transcripts failed to recruit U1 snRNPs (NRG for non-recruiting group) than for those from ICGs whose transcripts bind well to U1 snRNPs (RG for recruiting group). The pattern is consistent for both 5' ss and 3' ss, based on two-sample t-tests assuming equal variances. Mann-Whitney tests yield the same conclusion.
Highly expressed genes should have high splicing efficiency.
Predictions:
(1) Highly transcribed genes should, on average, have introns with greater splicing efficiency.
(2) Lowly transcribed genes should have greater variance in splicing efficiency than highly transcribed genes.
In lowly expressed genes, splice sites could drift toward low efficiency.
Table 5-3. The weighting matrix (W) for the fictitious example with two sequences of length 4 in each group, initialized with values of 1. The first row designates sites 1-4.
For amino acid sequences, the matrix would be 20 by 4.
What is the score?
A WSi,j = 0 means either there is no data on that cell or the cell has no discriminant power
A large amount of data is needed to avoid overfitting.
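To make the weighting matrix concrete, here is a minimal sketch of perceptron training on two groups of sequences, with W initialised to 1 as in Table 5-3. It uses a generic perceptron update (add 1 to the cells used by a misclassified group-1 sequence, subtract 1 for a misclassified group-2 sequence), which is an assumption and may differ in detail from the rule used in the lecture; the sequences and function names are hypothetical.

```python
def perceptron_score(w, seq):
    """Sum the weights of the cells (base, site) that the sequence uses."""
    return sum(w[base][j] for j, base in enumerate(seq))

def train_perceptron(pos, neg, alphabet="ACGT", max_epochs=100):
    """Train W so that pos sequences score > 0 and neg sequences score < 0."""
    length = len(pos[0])
    # W is a 4-by-L matrix for nucleotides, initialised with values of 1.
    w = {b: [1.0] * length for b in alphabet}
    for _ in range(max_epochs):
        errors = 0
        for seq in pos:
            if perceptron_score(w, seq) <= 0:      # misclassified group-1 sequence
                for j, base in enumerate(seq):
                    w[base][j] += 1.0
                errors += 1
        for seq in neg:
            if perceptron_score(w, seq) >= 0:      # misclassified group-2 sequence
                for j, base in enumerate(seq):
                    w[base][j] -= 1.0
                errors += 1
        if errors == 0:                            # every sequence classified correctly
            break
    return w

# Hypothetical data: two sequences of length 4 in each group.
pos = ["ACGT", "ACGA"]
neg = ["TTAC", "GTAC"]
w = train_perceptron(pos, neg)
```

With so few sequences the matrix separates the training data after a single pass of updates, which illustrates why small data sets invite overfitting: the weights fit these four sequences exactly but say little about unseen ones.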
Population: women aged 40+
A woman has a chance of 0.01 of getting breast cancer.
80% of those with breast cancer will get positive mammographies.
10% of those without breast cancer will also get a positive mammography.
What is the probability that a woman with a positive mammography actually has breast cancer?
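The question can be answered with Bayes' theorem. A minimal sketch, using the three numbers given above; the function and parameter names are illustrative.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) by Bayes' theorem."""
    true_pos = sensitivity * prior                    # P(positive and condition)
    false_pos = false_positive_rate * (1.0 - prior)   # P(positive and no condition)
    return true_pos / (true_pos + false_pos)

# Prior 0.01, 80% true-positive rate, 10% false-positive rate.
p_cancer_given_positive = posterior(prior=0.01, sensitivity=0.80, false_positive_rate=0.10)
print(round(p_cancer_given_positive, 4))  # 0.0748
```

So only about 7.5% of women with a positive mammography actually have breast cancer, because false positives from the much larger healthy group outnumber the true positives.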
Many more diagnostic tools are needed and their predictions are combined to reach a better joint prediction.
Population: All 300mers
Probability that a 300mer is from a gene: 0.02
95% of those 300mers from a gene will get positive scores.
15% of those 300mers from non-genes or pseudogenes will also get a positive score.
What is the probability that a 300mer with a positive score is from a real gene?
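The same Bayes calculation applies here. Writing G for "the 300mer is from a real gene" and + for "positive score" (notation introduced only for this worked example):

```latex
\[
P(G \mid +) = \frac{P(+ \mid G)\,P(G)}{P(+ \mid G)\,P(G) + P(+ \mid \bar{G})\,P(\bar{G})}
            = \frac{0.95 \times 0.02}{0.95 \times 0.02 + 0.15 \times 0.98}
            = \frac{0.019}{0.166} \approx 0.114
\]
```

Only about 11% of positively scored 300mers come from a real gene, so a single predictor with these error rates is not sufficient on its own.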