Position weight matrix (PWM), Perceptron and their applications

Xuhua Xia

http://dambe.bio.uottawa.ca

- Objective of sequence annotation: given a nucleotide or amino acid sequence, infer its biological function using bioinformatics tools.
- Approaches:
  - Homology search: annotation based on known, well-annotated genes in databases, using BLAST and FASTA
  - Gene prediction: annotation based on known gene structure, using position weight matrix, perceptron, HMM

- Position weight matrix (PWM): also called position-specific scoring matrix (PSSM)
- Used in
  - Characterizing sequence motifs
    - Eukaryotic translation initiation consensus
    - Splice sites
    - Branchpoint sites
    - Shine-Dalgarno sequences
  - Database searches (PHI-BLAST, PSI-BLAST and RPS-BLAST)

- Characterizing sequence motifs

Sequences flanking the initiation codon of 508 CDSs:

1234567890123

A4GALT ATACCATGTCCAA

ACO2 ACAAAATGGCGCC

ACR GGAGTATGGTTGA

ADM2 CCGCCATGGCCCG

.... .....

N: Number of sequences, i.e. 508

L: Sequence length, i.e., 13

i: A, C, G or T

j: Site index, i.e., 1, 2, ..., 13

Site-specific frequencies: Pij

Non-site-specific (global) frequencies: Pi

- Do the sites differ in their nucleotide/amino acid distributions?
  - No: all sites have the same nucleotide/amino acid distribution
  - Yes: different sites have different nucleotide/amino acid distributions
- Related terms:
  - Observation: S
  - Likelihood: probability of observing S given a model (a hypothesis)
  - Odds ratio: L_Yes/L_No
  - Log-odds: log(L_Yes/L_No)

S = ACGGTACCACGTT

- Two major purposes of PWM
  - To characterize the sequence pattern (the motif)
  - To facilitate the computation of log-odds (the PWM score, or PWMS), e.g., computing the PWMS for ATACCATGTCCAA
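The quantities defined above (site-specific frequencies, background frequencies, log-odds) can be put together in a few lines. The following is a minimal sketch, not the author's implementation. It assumes a base-2 logarithm, a pseudocount of 0.01 added to every count so that unobserved nucleotides do not produce log(0), and background frequencies pooled from the aligned sequences themselves. Only the four sequences listed above are used instead of all 508, and the function names are illustrative.

```python
import math
from collections import Counter

BASES = "ACGT"

# The four aligned 13-mers shown above (the full data set has 508).
SEQS = [
    "ATACCATGTCCAA",  # A4GALT
    "ACAAAATGGCGCC",  # ACO2
    "GGAGTATGGTTGA",  # ACR
    "CCGCCATGGCCCG",  # ADM2
]


def site_frequencies(seqs, pseudocount=0.01):
    """Site-specific frequencies p_ij, one dict per site j (pseudocount is an assumption)."""
    denom = len(seqs) + pseudocount * len(BASES)
    freqs = []
    for j in range(len(seqs[0])):
        column = Counter(s[j] for s in seqs)
        freqs.append({b: (column[b] + pseudocount) / denom for b in BASES})
    return freqs


def background_frequencies(seqs):
    """Global frequencies p_i, pooled over all sites (an assumption of this sketch)."""
    pooled = Counter("".join(seqs))
    total = sum(pooled[b] for b in BASES)
    return {b: pooled[b] / total for b in BASES}


def build_pwm(seqs, pseudocount=0.01):
    """PWM entry for nucleotide i at site j: log2(p_ij / p_i)."""
    p_site = site_frequencies(seqs, pseudocount)
    p_bg = background_frequencies(seqs)
    return [{b: math.log2(col[b] / p_bg[b]) for b in BASES} for col in p_site]


def pwm_score(pwm, seq):
    """PWMS: sum of the PWM entries selected by the sequence, one per site."""
    return sum(pwm[j][b] for j, b in enumerate(seq))


pwm = build_pwm(SEQS)
print("PWMS of ATACCATGTCCAA:", round(pwm_score(pwm, "ATACCATGTCCAA"), 4))
print("PWMS of ACGGTACCACGTT:", round(pwm_score(pwm, "ACGGTACCACGTT"), 4))
```

With only four training sequences the numbers are purely illustrative. The point is that a sequence resembling the aligned set scores higher than S = ACGGTACCACGTT, which lacks the ATG at sites 6 to 8.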

RCCAUGG

12345678901234567890123456789012345678901234567890123456789012345678901234567890

GGACUGGCUGGGCGAGACUCUCCACCUGCUCCCUGGGACCAUCGCCCACCAUGGCUGUGGCCCAGCAGCUGCGGGCCGAG

Figure 5-1. Illustration of scanning the 5′ end of the NCF4 gene (30 bases upstream of the initiation codon ATG and 27 bases downstream of ATG). The highest peak, with PWMS = 12.3897, corresponds to the 13-mer with 5 bases flanking the ATG on each side. PWMS computed with = 0.01.
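The scan described in the caption slides a 13-base window along the sequence and computes a PWMS for each window. Below is a minimal sketch under stated assumptions: the PWM is built from only the four example sequences shown earlier (not the 508 used for the figure), with a pseudocount of 0.01 and pooled background frequencies, so the scores will not reproduce the 12.3897 in the caption. The helper names `build_pwm` and `scan` are illustrative.

```python
import math
from collections import Counter

BASES = "ACGT"
TRAIN = ["ATACCATGTCCAA", "ACAAAATGGCGCC", "GGAGTATGGTTGA", "CCGCCATGGCCCG"]
# The 80-base NCF4 5' region shown above (RNA alphabet).
NCF4_5P = ("GGACUGGCUGGGCGAGACUCUCCACCUGCUCCCUGGGACC"
           "AUCGCCCACCAUGGCUGUGGCCCAGCAGCUGCGGGCCGAG")


def build_pwm(seqs, pseudocount=0.01):
    """Toy PWM: log2(site frequency / pooled background frequency)."""
    pooled = Counter("".join(seqs))
    total = sum(pooled[b] for b in BASES)
    background = {b: pooled[b] / total for b in BASES}
    denom = len(seqs) + pseudocount * len(BASES)
    pwm = []
    for j in range(len(seqs[0])):
        column = Counter(s[j] for s in seqs)
        pwm.append({b: math.log2(((column[b] + pseudocount) / denom) / background[b])
                    for b in BASES})
    return pwm


def scan(pwm, seq):
    """Return (0-based start, PWMS) for every window of PWM width along seq."""
    width = len(pwm)
    dna = seq.upper().replace("U", "T")  # the toy PWM uses the DNA alphabet
    return [(start, sum(pwm[j][dna[start + j]] for j in range(width)))
            for start in range(len(dna) - width + 1)]


hits = scan(build_pwm(TRAIN), NCF4_5P)
best_start, best_score = max(hits, key=lambda hit: hit[1])
print(len(hits), "windows scanned")
print("best window starts at position", best_start + 1, "with PWMS", round(best_score, 4))
```

An 80-base sequence yields 68 overlapping 13-mers. The window that places the AUG at sites 6 to 8 starts at position 46 of the sequence as printed.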

Table 3: Site-specific frequencies and position weight matrix (PWM) for 275 5′ ss. The consensus sequence (UAAAG ∣GUAUGUU UAAUU) can be obtained from those large site-specific PWM entries, with the most important sites in bold italics. The χ² test is performed for each site against the background frequencies (A = 0.3279, C = 0.1915, G = 0.2043, and U = 0.2763). The nucleotide sites are labeled with the five exon nucleotides as −5 to −1 and the 12 intron nucleotides as 1 to 12. The PWM is nearly identical when the introns in 5′ UTR were excluded.

Ma and Xia 2011

Table 4. Site-specific frequencies and position weight matrix (PWM) for 278 3′ ss. The consensus sequence (UUUUUUUUAYAG|GCUUC) can be obtained from those large site-specific PWM entries, with the most important sites in bold italics. The χ² test is performed for each site against the expected background frequencies. The sites are labeled with the first exon site as 1.

Ma and Xia 2011
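Both table captions mention a per-site χ² test against background frequencies. A minimal sketch of that goodness-of-fit calculation follows. The background frequencies are those quoted in the Table 3 caption; the observed counts are hypothetical (the table values are not reproduced here), and the function name is illustrative.

```python
# Background frequencies quoted in the Table 3 caption.
BACKGROUND = {"A": 0.3279, "C": 0.1915, "G": 0.2043, "U": 0.2763}

# Critical value of the chi-square distribution with 3 degrees of freedom at the 0.05 level.
CRITICAL_3DF_005 = 7.815


def chi_square(observed, background=BACKGROUND):
    """Goodness-of-fit statistic: sum over nucleotides of (O - E)^2 / E."""
    n = sum(observed.values())
    return sum((observed[b] - n * p) ** 2 / (n * p) for b, p in background.items())


# Hypothetical counts at one site of 275 aligned splice sites (NOT taken from the table).
hypothetical_counts = {"A": 5, "C": 3, "G": 262, "U": 5}
stat = chi_square(hypothetical_counts)
print("chi-square =", round(stat, 2), "significant:", stat > CRITICAL_3DF_005)
```

A site dominated by one nucleotide, as in this made-up example, departs strongly from the background and gives a large statistic; a site whose counts match the background gives a statistic near 0.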

Table 6. Position weight matrix scores (PWMS, as a proxy for splicing strength) are significantly lower for splice sites from intron-containing genes (ICGs) whose transcripts failed to recruit U1 snRNPs (NRG, the non-recruiting group) than for those from ICGs whose transcripts bind well to U1 snRNPs (RG, the recruiting group). The pattern is consistent for both 5′ ss and 3′ ss, based on two-sample t-tests assuming equal variances. Mann-Whitney tests yield the same conclusion.

Highly expressed genes should have high splicing efficiency.

Predictions:
- (1) Highly transcribed genes should, on average, have introns with greater splicing efficiency
- (2) Lowly transcribed genes should have greater variance in splicing efficiency than highly transcribed genes

The splice sites of lowly expressed genes could drift toward low efficiency.

- Expected PWMS is 0 when there is no site-specific difference in nucleotide frequency distribution
- What does a strongly negative PWMS mean?
  - 5′ ss:
    - HAC1: -8.8291
    - HFM1: -7.3825
    - HOP2: -7.8898
  - 3′ ss:
    - HAC1: -4.4039
    - REC102: -3.4464

- The perceptron, one of the simplest artificial neural networks, was invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt (Rosenblatt, 1958).
- The perceptron has been used in bioinformatics research since the 1980s:
  - The identification of translational initiation sites in E. coli (Stormo et al., 1982a).
  - Characterizing the ATP/GTP-binding motif (Hirst and Sternberg, 1991).
- More recent publications use multi-layer perceptrons, which are more complicated than what we cover here.

- Positive sequences
  - POS1 ACGT
  - POS2 GCGC
- Negative sequences
  - NEG1 AGCT
  - NEG2 GGCC
- Objective: Find a scoring matrix that can distinguish between the two groups (positive and negative) of sequences

Table 5-3. The weighting matrix (W) for the fictitious example with two sequences of length 4 in each group, initialized with values of 1. The first row designates sites 1-4.

For amino acid sequences, the matrix would be 20 by 4.

What is the score for TAAA?

A WSi,j = 0 means either there is no data on that cell or the cell has no discriminant power
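The fictitious example can be worked through in code. This is a minimal sketch, not necessarily the procedure used in the slides: it assumes the score of a sequence is the sum of the weight-matrix entries it selects, a decision threshold of 0, and the classic perceptron update with a step of 1 (increase the selected cells when a positive sequence scores too low, decrease them when a negative sequence scores too high). The matrix starts with all entries equal to 1, as in Table 5-3. The function names are illustrative.

```python
BASES = "ACGT"
POSITIVES = ["ACGT", "GCGC"]
NEGATIVES = ["AGCT", "GGCC"]


def init_weights(length, value=1):
    """Weight matrix W: one row per nucleotide, one column per site."""
    return {b: [value] * length for b in BASES}


def score(weights, seq):
    """Sum of the W entries selected by the sequence, one per site."""
    return sum(weights[b][j] for j, b in enumerate(seq))


def train(positives, negatives, max_epochs=100):
    """Assumed update rule: +1 on misclassified positives, -1 on misclassified negatives."""
    weights = init_weights(len(positives[0]))
    for _ in range(max_epochs):
        changed = False
        for seq in positives:
            if score(weights, seq) <= 0:
                for j, b in enumerate(seq):
                    weights[b][j] += 1
                changed = True
        for seq in negatives:
            if score(weights, seq) >= 0:
                for j, b in enumerate(seq):
                    weights[b][j] -= 1
                changed = True
        if not changed:  # every sequence is on the correct side of the threshold
            break
    return weights


# With the initial all-ones matrix, every 4-mer (TAAA included) scores 4.
print("initial score of TAAA:", score(init_weights(4), "TAAA"))

trained = train(POSITIVES, NEGATIVES)
for name, seq in [("POS1", "ACGT"), ("POS2", "GCGC"), ("NEG1", "AGCT"), ("NEG2", "GGCC")]:
    print(name, seq, score(trained, seq))
```

Under these assumptions, training stops once both positive sequences score above 0 and both negative sequences score below 0. Cells never touched by any training sequence keep their initial value, so their contribution to the score of a new sequence such as TAAA carries no information about the two groups.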

1234567890

P1 ACGUAUACGU

P2 ACGUCUACGU

P3 ACGUGUACGU

P4 ACGUUAACGU

P5 ACGUUCACGU

P6 ACGUUGACGU

N1 ACGUAAACGU

N2 ACGUACACGU

N3 ACGUAGACGU

N4 ACGUCAACGU

N5 ACGUCCACGU

N6 ACGUCGACGU

N7 ACGUGAACGU

N8 ACGUGCACGU

N9 ACGUGGACGU

N10 ACGUUUACGU

A large amount of data is needed to avoid overfitting.

- Objective: given a molecular sequence, find its biological function (preferably in terms of gene ontology).
  - Cellular localization
  - Biological processes the gene (its product) participates in
  - The biological reaction

- Related terms:
  - Motif: e.g., RccAUGG
  - Fingerprint: a set of aligned sequences from which a position weight matrix or the like can be constructed to predict the motif effectively

- Gene/Motif prediction methods
  - Position weight matrix
  - Perceptrons
  - Supervised learning
  - Hidden Markov Models (HMMs)
  - Neural networks (e.g., self-organizing map or SOM)

Population: women aged 40+

A woman has a chance of 0.01 of getting breast cancer.

80% of those with breast cancer will get positive mammographies.

10% of those without breast cancer will also get a positive mammography.

What is the probability that a woman with a positive mammography actually has breast cancer?

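- Priors: P(cancer) = 0.01; P(no cancer) = 0.99
- P(positive | cancer) = 0.8; P(negative | cancer) = 0.2
- P(positive | no cancer) = 0.1; P(negative | no cancer) = 0.9
- Joint probabilities: P(cancer, positive) = 0.008; P(cancer, negative) = 0.002; P(no cancer, positive) = 0.099; P(no cancer, negative) = 0.891
- Posterior: P(cancer | positive) = 0.008/(0.008+0.099) = 0.075; P(no cancer | positive) = 0.925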

Many more diagnostic tools are needed and their predictions are combined to reach a better joint prediction.
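The mammography calculation is an application of Bayes' theorem and can be checked with a few lines. This is a minimal sketch; the function and parameter names are illustrative.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """P(condition | positive test) from the prior and the two test error rates."""
    joint_condition_positive = prior * true_positive_rate
    joint_no_condition_positive = (1 - prior) * false_positive_rate
    return joint_condition_positive / (joint_condition_positive + joint_no_condition_positive)


# Prior 0.01; 80% of women with cancer and 10% of women without cancer test positive.
p_cancer_given_positive = posterior(0.01, 0.8, 0.1)
print(round(p_cancer_given_positive, 3))  # 0.075
```

Because the prior is so low, false positives from the large cancer-free group outnumber true positives, which is why a single positive result leaves the posterior below 8%.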

Population: All 300mers

Probability that a 300mer is from a gene: 0.02

95% of those 300mers from a gene will get positive scores.

15% of those 300mers from non-genes or pseudogenes will also get a positive score.

What is the probability that a 300mer with a positive score is from a real gene?

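- Priors: P(gene) = 0.02; P(non-gene) = 0.98
- P(positive | gene) = 0.95; P(negative | gene) = 0.05
- P(positive | non-gene) = 0.15; P(negative | non-gene) = 0.85
- Joint probabilities: P(gene, positive) = 0.019; P(gene, negative) = 0.001; P(non-gene, positive) = 0.147; P(non-gene, negative) = 0.833
- Posterior: P(gene | positive) = 0.019/(0.019+0.147) = 0.114; P(non-gene | positive) = 0.886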
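The gene-prediction numbers follow the same Bayesian calculation, and the earlier remark about combining several predictions can be illustrated by applying the update twice. This is a minimal sketch under a strong assumption: the second predictor is hypothetical, has the same error rates as the first, and is conditionally independent of it given the true class. The function name is illustrative.

```python
def update(prior, true_positive_rate, false_positive_rate):
    """One Bayesian update: P(gene | positive score) given the current prior."""
    joint_gene = prior * true_positive_rate
    joint_non_gene = (1 - prior) * false_positive_rate
    return joint_gene / (joint_gene + joint_non_gene)


# First predictor: prior 0.02, 95% true-positive rate, 15% false-positive rate.
after_one = update(0.02, 0.95, 0.15)

# Hypothetical second, conditionally independent predictor with the same rates:
# the posterior from the first positive score becomes the prior for the second.
after_two = update(after_one, 0.95, 0.15)

print("after one positive score: ", round(after_one, 3))
print("after two positive scores:", round(after_two, 3))
```

Under the independence assumption, a second positive score raises the posterior substantially. Predictors that share the same signal are not independent, and the gain would then be smaller.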