Position weight matrix (PWM), Perceptron and their applications
Xuhua Xia
http://dambe.bio.uottawa.ca
Sequence Annotation
Objective of sequence annotation: Given a nucleotide or amino acid sequence, find its biological function by bioinformatics tools.
Approaches:
Sequences flanking the initiation codon of 508 CDSs:
1234567890123
A4GALT ATACCATGTCCAA
ACO2 ACAAAATGGCGCC
ACR GGAGTATGGTTGA
ADM2 CCGCCATGGCCCG
.... .....
N: Number of sequences, i.e. 508
L: Sequence length, i.e., 13
i: A, C, G or T
j: Site index, i.e., 1, 2, ..., 13
Site-specific frequencies: Pij
Non-site-specific (global) frequencies: Pi
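The definitions above can be turned into a small sketch. The slide text does not preserve the PWM formula itself, so this assumes the common log-odds form PWM_ij = log2(p_ij / p_i), with a pseudocount `alpha` added to the counts so that an unobserved nucleotide does not give log2(0). The function name `build_pwm` and the way the pseudocount is applied are assumptions of this sketch, and the four sequences shown on the slide stand in for the full set of 508.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def build_pwm(seqs, alpha=0.01):
    """Return (pwm, background), where pwm[j][i] = log2(p_ij / p_i).

    alpha is a pseudocount (an assumption of this sketch) that keeps
    unobserved nucleotides from producing log2(0).
    """
    n, length = len(seqs), len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length L")

    # Global frequencies p_i: counts pooled over all N sequences and L sites.
    pooled = Counter("".join(seqs))
    background = {i: (pooled[i] + alpha) / (n * length + 4 * alpha)
                  for i in ALPHABET}

    pwm = []
    for j in range(length):
        column = Counter(s[j] for s in seqs)  # nucleotide counts at site j
        pwm.append({
            i: math.log2(((column[i] + alpha) / (n + 4 * alpha)) / background[i])
            for i in ALPHABET
        })
    return pwm, background

# The four example sequences from the slide (not the full data set).
seqs = ["ATACCATGTCCAA", "ACAAAATGGCGCC", "GGAGTATGGTTGA", "CCGCCATGGCCCG"]
pwm, background = build_pwm(seqs)
```

With these four sequences, sites 6 to 8 (the ATG) receive large positive entries for A, T and G respectively and negative entries for every other nucleotide.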
S = ACGGTACCACGTT
RCCAUGG
12345678901234567890123456789012345678901234567890123456789012345678901234567890
GGACUGGCUGGGCGAGACUCUCCACCUGCUCCCUGGGACCAUCGCCCACCAUGGCUGUGGCCCAGCAGCUGCGGGCCGAG
-------------
 -------------
  -------------
Figure 5-1. Illustration of scanning the 5’-end of the NCF4 gene (30 bases upstream of the initiation codon ATG and 27 bases downstream of ATG). The highest peak, with PWMS = 12.3897, corresponds to the 13-mer with 5 bases flanking the ATG on each side. PWMS computed with = 0.01.
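The scan illustrated in the figure can be sketched as a sliding window: every window as wide as the PWM is scored by summing the matrix entries for the nucleotide observed at each site, and the window with the highest score is reported. The function name `scan` and the three-site `toy_pwm` are hypothetical and chosen only to keep the example short. They are not the 13-site matrix behind Figure 5-1.

```python
def scan(seq, pwm):
    """Score every window of len(pwm) bases; return (start, window, score)."""
    width = len(pwm)
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # PWMS of the window: sum of PWM entries, one per site.
        score = sum(pwm[j][base] for j, base in enumerate(window))
        hits.append((start + 1, window, score))  # 1-based start position
    return hits

# Hypothetical toy PWM that rewards A, T, G at sites 1, 2, 3.
toy_pwm = [
    {"A": 2.0, "C": -2.0, "G": -2.0, "T": -2.0},
    {"A": -2.0, "C": -2.0, "G": -2.0, "T": 2.0},
    {"A": -2.0, "C": -2.0, "G": 2.0, "T": -2.0},
]

hits = scan("CCACCAUGGCU", toy_pwm)
best = max(hits, key=lambda h: h[2])
print(best)
```

Plotting the score against the start position for a real 13-site PWM would give the kind of peak profile the caption describes.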
Table 3: Site-specific frequencies and position weight matrix (PWM) for 275 5′ ss. The consensus sequence (UAAAG|GUAUGUUUAAUU) can be obtained from those large site-specific PWM entries, with the most important sites in bold italics. The χ² test is performed for each site against the background frequencies (A = 0.3279, C = 0.1915, G = 0.2043, and U = 0.2763). The nucleotide sites are labeled with the five exon nucleotides as −5 to −1 and the 12 intron nucleotides as 1 to 12. The PWM is nearly identical when the introns in 5′ UTR were excluded.
Ma and Xia 2011
Table 4. Site-specific frequencies and position weight matrix (PWM) for 278 3′ ss. The consensus sequence (UUUUUUUUAYAG|GCUUC) can be obtained from those large site-specific PWM entries, with the most important sites in bold italics. The χ² test is performed for each site against the expected background frequencies. The sites are labeled with the first exon site as 1.
Ma and Xia 2011
Table 6. Position weight matrix scores (PWMS, as a proxy for splicing strength) are significantly smaller for splice sites from intron-containing genes (ICGs) whose transcripts failed to recruit U1 snRNPs (NRG, the non-recruiting group) than for those from ICGs whose transcripts bind well to U1 snRNPs (RG, the recruiting group). The pattern is consistent for both 5′ ss and 3′ ss, based on two-sample t-tests assuming equal variances. Mann-Whitney tests yield the same conclusion.
Highly expressed genes should have high splicing efficiency.
Predictions:
(1) Highly transcribed genes should, on average, have introns with greater splicing efficiency.
(2) Lowly transcribed genes should have greater variance in splicing efficiency than highly transcribed genes.
Splice sites of lowly expressed genes could drift toward low efficiency.
POS1 ACGT
POS2 GCGC
NEG1 AGCT
NEG2 GGCC
Table 5-3. The weighting matrix (W) for the fictitious example with two sequences of length 4 in each group, initialized with values of 1. The first row designates sites 1-4.
For amino acid sequences, the matrix would be 20 by 4.
What is the score for TAAA?
WSi,j = 0 means either that there are no data for that cell or that the cell has no discriminant power.
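A minimal sketch of perceptron training on the fictitious POS/NEG example follows. The slides give the initial weights (all 1) but not the update rule, so this assumes the standard perceptron update: a sequence is scored by summing the weights of its nucleotide at each site, a POS sequence scoring at or below 0 has 1 added to each of its cells, and a NEG sequence scoring at or above 0 has 1 subtracted. The names `score` and `train_perceptron` and the threshold of 0 are assumptions of this sketch.

```python
ALPHABET = "ACGT"

def score(seq, w):
    """Sum the weights of the nucleotide observed at each site."""
    return sum(w[base][site] for site, base in enumerate(seq))

def train_perceptron(pos, neg, length, init=1, max_epochs=100):
    """Train a 4-by-length weight matrix with the standard perceptron rule."""
    w = {base: [init] * length for base in ALPHABET}
    for _ in range(max_epochs):
        errors = 0
        for seqs, label in ((pos, +1), (neg, -1)):
            for seq in seqs:
                # Misclassified if the score is on the wrong side of 0 (or at 0).
                if label * score(seq, w) <= 0:
                    errors += 1
                    for site, base in enumerate(seq):
                        w[base][site] += label
        if errors == 0:
            return w
    raise RuntimeError("no convergence; data may not be linearly separable")

pos = ["ACGT", "GCGC"]
neg = ["AGCT", "GGCC"]
w = train_perceptron(pos, neg, length=4)
for base in ALPHABET:
    print(base, w[base])
```

Under these assumptions the cells that separate the two groups are at sites 2 and 3 (C then G in POS, G then C in NEG), and a sequence such as TAAA, none of whose cells occur in the training data, is scored only from its untouched initial weights.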
1234567890
P1 ACGUAUACGU
P2 ACGUCUACGU
P3 ACGUGUACGU
P4 ACGUUAACGU
P5 ACGUUCACGU
P6 ACGUUGACGU
N1 ACGUAAACGU
N2 ACGUACACGU
N3 ACGUAGACGU
N4 ACGUCAACGU
N5 ACGUCCACGU
N6 ACGUCGACGU
N7 ACGUGAACGU
N8 ACGUGCACGU
N9 ACGUGGACGU
N10 ACGUUUACGU
A large amount of data is needed to avoid the problem of overfitting.
Population: women aged 40+
A woman has a chance of 0.01 of getting breast cancer.
80% of those with breast cancer will get positive mammographies.
10% of those without breast cancer will also get a positive mammography.
What is the probability that a woman with a positive mammography actually has breast cancer?
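Priors: P(cancer) = 0.01; P(no cancer) = 0.99
Likelihoods: P(+ | cancer) = 0.8; P(- | cancer) = 0.2; P(+ | no cancer) = 0.1; P(- | no cancer) = 0.9
Joint probabilities: P(cancer, +) = 0.008; P(cancer, -) = 0.002; P(no cancer, +) = 0.099; P(no cancer, -) = 0.891
Posterior: P(cancer | +) = 0.008/(0.008+0.099) = 0.075; P(no cancer | +) = 0.925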
Many more diagnostic tools are needed and their predictions are combined to reach a better joint prediction.
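The mammography calculation, and the remark about combining tools, can be sketched as follows. The function name `posterior` is hypothetical. The second call assumes a second test with the same characteristics whose result is conditionally independent of the first given the true disease status. That assumption is made only for illustration and is not stated on the slide.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) by Bayes' theorem."""
    true_pos = prior * sensitivity                  # P(condition, +)
    false_pos = (1 - prior) * false_positive_rate   # P(no condition, +)
    return true_pos / (true_pos + false_pos)

# One positive mammography: prior 0.01, 80% sensitivity, 10% false positives.
p_one = posterior(0.01, 0.8, 0.1)

# A second, independent positive test: the first posterior becomes the prior.
p_two = posterior(p_one, 0.8, 0.1)

print(f"{p_one:.3f} {p_two:.3f}")
```

One positive result gives a posterior of about 0.075. Under the independence assumption, a second positive result raises it to roughly 0.39, which is the sense in which combining predictions gives a better joint prediction.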
Population: All 300mers
Probability that a 300mer is from a gene: 0.02
95% of those 300mers from a gene will get positive scores.
15% of those 300mers from non-genes or pseudogenes will also get a positive score.
What is the probability that a 300mer with a positive score is from a real gene?
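Priors: P(gene) = 0.02; P(non-gene) = 0.98
Likelihoods: P(+ | gene) = 0.95; P(- | gene) = 0.05; P(+ | non-gene) = 0.15; P(- | non-gene) = 0.85
Joint probabilities: P(gene, +) = 0.019; P(gene, -) = 0.001; P(non-gene, +) = 0.147; P(non-gene, -) = 0.833
Posterior: P(gene | +) = 0.019/(0.019+0.147) = 0.114; P(non-gene | +) = 0.886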
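The same numbers can be read as counts, which makes the low posterior easier to see. The sketch below assumes a hypothetical pool of 1,000,000 300mers and uses exact fractions so that the counts come out as whole numbers. The function name `confusion_counts` and the pool size are assumptions of this sketch.

```python
from fractions import Fraction

def confusion_counts(n, prior, sensitivity, false_positive_rate):
    """Expected counts of true/false positives and negatives among n items."""
    prior, sens, fpr = (Fraction(x) for x in (prior, sensitivity, false_positive_rate))
    genes = n * prior          # 300mers that really come from a gene
    non_genes = n - genes      # 300mers from non-genes or pseudogenes
    return {
        "TP": genes * sens,
        "FN": genes * (1 - sens),
        "FP": non_genes * fpr,
        "TN": non_genes * (1 - fpr),
    }

# Rates are passed as strings so Fraction parses them exactly.
counts = confusion_counts(1_000_000, "0.02", "0.95", "0.15")
ppv = counts["TP"] / (counts["TP"] + counts["FP"])  # P(gene | positive score)
print({k: int(v) for k, v in counts.items()}, float(ppv))
```

Of the 166,000 positive scores in this hypothetical pool, only 19,000 come from real genes, so the posterior stays near 0.114 even though the test catches 95% of genes.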