
Welcome to - PowerPoint PPT Presentation





PowerPoint Slideshow about 'Welcome to' - JasminFlorian


Presentation Transcript
Slide 1

Welcome to Introduction to Bioinformatics
Monday, 11 October

  • Characteristics of PSSMs

    • How to make a PSSM

    • Uncertainty and information

    • How to score a sequence

  • Problem sets (BLAST, Modeling)


Slide 2

Scenario 1

Prediction of regulatory site


Slide 3

N2 fixation in cyanobacteria

[Figure labels: heterocysts, sucrose, N2, CO2, O2]

Matveyev and Elhai (unpublished)


Slide 4

Differentiation in cyanobacteria: What does NtcA bind to?

[Figure labels: GTA…(8)…TAC, …(20-24)…TAnnnT, mRNA]

Herrero et al. (2001) J Bacteriol 183:411-425


Slide 5

Differentiation in cyanobacteria

Sequence upstream from hetQ

ttctatgagaatataaaattttccttaagtttct
aaaaccgaccattctgatgaataagtccggtttt
tgctttttcgctttatttatctatatttccaagt
ggggtgacaactatcttgccaatattgtcgttat
gaaaaaatctGTAacatgagaTACacaatagcatttatatttgcttTAgtaTctctctcttgggtggg

GTA…(8)…TAC: NtcA binding site

…(20-24)…TAnnnT: Promoter


Slide 6

Differentiation in cyanobacteria: Integration of signals through HetR

[Diagram: -N → NtcA → ??? (HetQ) → HetR (master regulator) → genes needed for differentiation. Other signals feeding into HetR: position in cell cycle, level of PatS, level of HetN.]

Stockholm


Slide 7

Scenario 1: The aftermath

  • Did you go for it?

YES

  • Did it bind NtcA?

YES

  • Did killing the site prevent heterocysts?

NO

Stockholm


Slide 8

Scenario 1: The aftermath

  • Did you go for it?

YES

  • Did it bind NtcA?

YES

  • Did killing the site prevent heterocysts?

NO

  • Fame and fortune?

NO

  • Reasonable paper?

YES


Slide 9

Scenario 1: The aftermath

If hetQ isn’t the golden link, then what is?

[Diagram: -N → NtcA → ??? → HetR → genes needed for differentiation]

  • Gene preceded by NtcA-binding site

  • Blocking NtcA-binding affects gene expression

  • Gene product required for hetR expression


Slide 10

Scenario 1: The aftermath

If hetQ isn’t the golden link, then what is?

[Diagram: -N → NtcA → ??? → HetR → genes needed for differentiation]

  • Gene preceded by NtcA-binding site

How to find?

  • Search for GTA…(N8)…TAC…(N20-24)…TA…T?

    • Thousands of candidate hits

    • Regexps may also “overfit” the model: be too strict and miss real binding sites
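As a rough illustration of the regular-expression search posed here, the sketch below scans a sequence for GTA, 8 arbitrary bases, TAC, 20 to 24 arbitrary bases, then TA, 3 arbitrary bases, and T. It assumes the final TA…T is to be read as TAnnnT, as on the earlier slide. The name `find_ntca_sites` is made up for this example, and only the forward strand is scanned.

```python
import re

# Pattern from the slide: GTA-(N8)-TAC-(N20-24)-TA-(N3)-T.
# Assumption: the trailing "TA…T" means TAnnnT; N is any of A, C, G, T.
NTCA_SITE = re.compile(r"GTA[ACGT]{8}TAC[ACGT]{20,24}TA[ACGT]{3}T", re.IGNORECASE)


def find_ntca_sites(sequence):
    """Return (start, matched_text) for each non-overlapping forward-strand hit."""
    return [(m.start(), m.group().upper()) for m in NTCA_SITE.finditer(sequence)]


# Last line of the hetQ upstream region shown on the earlier slide.
upstream = "gaaaaaatctGTAacatgagaTACacaatagcatttatatttgcttTAgtaTctctctcttgggtggg"

for start, site in find_ntca_sites(upstream):
    print(start, site)
```

A pattern like this accepts or rejects a window outright, which is why it can return thousands of hits on a genome and still miss a real site that departs from the consensus at a single position.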


Slide 11

Position-specific scoring matrices: A better way

Table 1: Examples of position-specific scoring matrices from sequence alignment

A. Sequence alignment [a]

   A T T T A G T A T C A A A A A T A A C A A T T C
   G T T C T G T A A C A A A G A C T A C A A A A C
   A T T T T G T A G C T A C T T A T A C T A T T T
   A A G C T G T A A C A A A A T C T A C C A A A T
   C A T T T G T A C A G T C T G T T A C C T T T A


Slide 12

Position-specific scoring matrices: A better way

A. Sequence alignment [a]

   A T T T A G T A T C A A A A A T A A C A A T T C
   G T T C T G T A A C A A A G A C T A C A A A A C
   A T T T T G T A G C T A C T T A T A C T A T T T
   A A G C T G T A A C A A A A T C T A C C A A A T
   C A T T T G T A C A G T C T G T T A C C T T T A

B. Table of occurrences [a]

A  3 2 0 0 1 0 0 5 2 1 3 4 3 2 2 1 1 5 0 2 4 2 2 1
C  1 0 0 2 0 0 0 0 1 4 0 0 2 0 0 2 0 0 5 2 0 0 0 2
G  1 0 1 0 0 5 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0
T  0 3 4 3 4 0 5 0 1 0 1 1 0 2 2 2 4 0 0 1 1 3 3 2


Slide 13

Position-specific scoring matrices: A better way

[Figure: an NtcA sequence alignment yields an NtcA-based PSSM, which is scanned along the Anabaena genome in search of the unknown link (???) to HetR. Windows are flagged "Good match to NtcA-site"; one is shown, CTCCGTAAAC CTCTAAC...]

Anabaena genome

TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG
AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT
TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC
TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC
GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC
CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA
TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA
AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG
AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA
TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG
CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC
GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG
GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT
CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA
ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA
TATGAGGCAA TCACAGCATC AGGTGACCTT AGTATCTATT
CTCGGGAGCG CACGGCTCTA AAGAGGCCCA TATCCAGGCA
CCTTTAGATG CAAGAAGGAG GAAACAGCTC GAAATCCCTG
AGGCCGGAGG GTCAAGAACT CTCCACCGGC GGCAGCGGCC
CCCCGGCCTA AGGCTGCCTG TGCTATAAAT ACGCGGCCCA
TTCCCTGGGC TCGGCGGGAC AGATAACATG AATGTGCCCT


Slide 14

Position-specific scoring matrices: A better way

Table 1: Examples of position-specific scoring matrices from sequence alignment

A. Sequence alignment [a]

   A T T T A G T A T C A A A A A T A A C A A T T C
   G T T C T G T A A C A A A G A C T A C A A A A C
   A T T T T G T A G C T A C T T A T A C T A T T T
   A A G C T G T A A C A A A A T C T A C C A A A T
   C A T T T G T A C A G T C T G T T A C C T T T A


Slide 15

Position-specific scoring matrices: A better way

A. Sequence alignment [a]

   A T T T A G T A T C A A A A A T A A C A A T T C
   G T T C T G T A A C A A A G A C T A C A A A A C
   A T T T T G T A G C T A C T T A T A C T A T T T
   A A G C T G T A A C A A A A T C T A C C A A A T
   C A T T T G T A C A G T C T G T T A C C T T T A

B. Table of occurrences [a]

A  3 2 0 0 1 0 0 5 2 1 3 4 3 2 2 1 1 5 0 2 4 2 2 1
C  1 0 0 2 0 0 0 0 1 4 0 0 2 0 0 2 0 0 5 2 0 0 0 2
G  1 0 1 0 0 5 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0
T  0 3 4 3 4 0 5 0 1 0 1 1 0 2 2 2 4 0 0 1 1 3 3 2


Slide 16

Position-specific scoring matrices: A better way

B. Table of occurrences [a]

A  0     1     0     0     5     2     1     3     4     3
C  2     0     0     0     0     1     4     0     0     2
G  0     0     5     0     0     1     0     1     0     0
T  3     4     0     5     0     1     0     1     1     0

C. Position-specific scoring matrix (B = 0) [b]

A  0     .20   0     0     1.0   .40   .20   .60   .80   .60
C  .40   0     0     0     0     .20   .80   0     0     .40
G  0     0     1.0   0     0     .20   0     .20   0     0
T  .60   .80   0     1.0   0     .20   0     .20   .20   0
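The step from alignment to table of occurrences to PSSM (B = 0) can be sketched as below. This is a minimal illustration, not the course's own code: the function names `count_matrix` and `frequency_matrix` are invented here, and the five sequences are the rows of Table 1A.

```python
from collections import Counter

# The five aligned 24-nt sites from Table 1A.
ALIGNMENT = [
    "ATTTAGTATCAAAAATAACAATTC",
    "GTTCTGTAACAAAGACTACAAAAC",
    "ATTTTGTAGCTACTTATACTATTT",
    "AAGCTGTAACAAAATCTACCAAAT",
    "CATTTGTACAGTCTGTTACCTTTA",
]
BASES = "ACGT"


def count_matrix(alignment):
    """Table of occurrences: one {base: count} dict per alignment column."""
    matrix = []
    for column in zip(*alignment):  # transpose rows into columns
        tally = Counter(column)
        matrix.append({base: tally[base] for base in BASES})
    return matrix


def frequency_matrix(counts):
    """PSSM with B = 0: each count divided by the number of sequences."""
    return [
        {base: n / sum(column.values()) for base, n in column.items()}
        for column in counts
    ]


counts = count_matrix(ALIGNMENT)
pssm = frequency_matrix(counts)
print(counts[5])  # column 6, where every sequence has G
print(pssm[3])    # column 4, the first column shown in Table 1C
```

Columns are indexed from 0 in the code, so `counts[5]` is column 6 of the table.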


Slide 17

Position-specific scoring matrices: A better way

Table 2: Scoring a sequence with a PSSM

urt-71               T     A     G     T     A     T     C     A     A     A
Score                .6    .2    1     1     1     .2    .8    .6    .8    .6
w/ pseudocounts [b]  .51   .24   .75   .79   .79   .24   .61   .51   .65   .51
Normalized [b]       1.6   .75   4.2   2.5   2.5   .75   3.4   1.6   2.0   1.6

Score = .60 * .20 * 1.0 * …
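The product in the last line can be written out as a short sketch. It assumes the ten-column PSSM of Table 1C (B = 0), stored one dictionary per column; the function name `score` and the second test window are illustrative only.

```python
from math import prod

# PSSM (B = 0) for the ten columns of Table 1C, one dict per column.
PSSM = [
    {"A": 0.0, "C": 0.4, "G": 0.0, "T": 0.6},
    {"A": 0.2, "C": 0.0, "G": 0.0, "T": 0.8},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    {"A": 0.2, "C": 0.8, "G": 0.0, "T": 0.0},
    {"A": 0.6, "C": 0.0, "G": 0.2, "T": 0.2},
    {"A": 0.8, "C": 0.0, "G": 0.0, "T": 0.2},
    {"A": 0.6, "C": 0.4, "G": 0.0, "T": 0.0},
]


def score(window):
    """Multiply the PSSM entry for each base of a window of matching length."""
    if len(window) != len(PSSM):
        raise ValueError("window length must equal the number of PSSM columns")
    return prod(column[base] for column, base in zip(PSSM, window.upper()))


print(score("TAGTATCAAA"))  # the urt-71 window from Table 2
print(score("AAGTATCAAA"))  # one unseen base in column 1 zeroes the product
```

The second call shows the weakness of B = 0: a single base never observed at a position drives the whole score to zero, however good the rest of the window is.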


Slide 18

Position-specific scoring matrices: Introduction of pseudocounts

A. Sequence alignment [a]

   A T T T A G T A T C A A A A A T A A C A A T T C
   G T T C T G T A A C A A A G A C T A C A A A A C
   A T T T T G T A G C T A C T T A T A C T A T T T
   A A G C T G T A A C A A A A T C T A C C A A A T
   C A T T T G T A C A G T C T G T T A C C T T T A

A?

$q_{G,6}$ = 5 real counts

$p_G$ = ? pseudocounts


Slide 19

Position-specific scoring matrices: Introduction of pseudocounts

Score(position, nucleotide) = (q + p) / (N + B)

q = real counts of the nucleotide at that position; N = number of sequences in the alignment

p = pseudocounts = B * (overall frequency of nucleotide)

[A] = 0.32   [T] = 0.32   [C] = 0.18   [G] = 0.18

B = Total number of pseudocounts = Square root (N) ? or = 0.1 ?
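A minimal sketch of the pseudocount formula, using the background frequencies given on the slide and N = 5 for the five-sequence alignment. The function name `pssm_entry` and its parameter names are invented for this example; both candidate values of B are tried.

```python
from math import sqrt

# Overall nucleotide frequencies from the slide.
BACKGROUND = {"A": 0.32, "T": 0.32, "C": 0.18, "G": 0.18}


def pssm_entry(q, base, n_seqs, total_pseudo):
    """(q + p) / (N + B), where p = B * background frequency of the base."""
    p = total_pseudo * BACKGROUND[base]
    return (q + p) / (n_seqs + total_pseudo)


N = 5  # sequences in the alignment

# G at column 6: five real counts.
print(round(pssm_entry(5, "G", N, sqrt(N)), 2))  # B = sqrt(N)
print(round(pssm_entry(5, "G", N, 0.1), 2))      # B = 0.1

# A at column 6: zero real counts, but no longer a zero score.
print(round(pssm_entry(0, "A", N, sqrt(N)), 3))
print(round(pssm_entry(0, "A", N, 0.1), 3))
```

A larger B pulls every entry toward the background frequency; a small B such as 0.1 leaves the observed frequencies almost untouched while still removing the zeros.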


Slide 20

Position-specific scoring matrices: Introduction of pseudocounts

C. Position-specific scoring matrix (B = 0) [b]

A  0     .20   0     0     1.0   .40   .20
C  .40   0     0     0     0     .20   .80
G  0     0     1.0   0     0     .20   0
T  .60   .80   0     1.0   0     .20   0

D. Position-specific scoring matrix (B = √N = 2.2) [c]

A  .099  .24   .099  .099  .79   .38   .24
C  .33   .056  .056  .056  .056  .19   .61
G  .056  .056  .75   .056  .056  .19   .056
T  .51   .65   .099  .79   .099  .24   .099


Slide 21

Position-specific scoring matrices: Normalization

A. Sequence alignment [a]

   A T T T A G T A T C A A A A A T A A C A A T T C
   G T T C T G T A A C A A A G A C T A C A A A A C
   A T T T T G T A G C T A C T T A T A C T A T T T
   A A G C T G T A A C A A A A T C T A C C A A A T
   C A T T T G T A C A G T C T G T T A C C T T T A

B. Table of occurrences [a]

A  3 2 0 0 1 0 0 5 2 1 3 4 3 2 2 1 1 5 0 2 4 2 2 1
C  1 0 0 2 0 0 0 0 1 4 0 0 2 0 0 2 0 0 5 2 0 0 0 2
G  1 0 1 0 0 5 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0
T  0 3 4 3 4 0 5 0 1 0 1 1 0 2 2 2 4 0 0 1 1 3 3 2

How to account for similarity due to similar base composition?

Compare Score(PSSM) / Score(background frequency)

0.79 / 0.32 ≈ 2.5
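The normalization is a single division per position; a small sketch follows, assuming the same background frequencies as on the pseudocount slide. The name `normalize` is illustrative.

```python
# Overall nucleotide frequencies from the pseudocount slide.
BACKGROUND = {"A": 0.32, "T": 0.32, "C": 0.18, "G": 0.18}


def normalize(pssm_score, base):
    """Ratio of the PSSM score to the chance of seeing the base anyway."""
    return pssm_score / BACKGROUND[base]


# An A scoring 0.79 versus a G scoring 0.75 (both with pseudocounts).
print(round(normalize(0.79, "A"), 2))
print(round(normalize(0.75, "G"), 2))
```

A ratio above 1 means the base is enriched at that position relative to the genome as a whole. Because G is rare in the background, a G scoring 0.75 ends up with a higher ratio than an A scoring 0.79.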


Slide 22

Position-specific scoring matrices: Log odds form

E. Position-specific scoring matrix (B = 0.1) [c]

A  .006  .20   .006  .006  .99   .40   .20   .59
C  .40   .004  .004  .004  .004  .20   .79   .004
G  .004  .004  .98   .004  .004  .20   .004  .20
T  .59   .79   .006  .99   .006  .20   .006  .20

F. Position-specific scoring matrix: Log-odds form (B = 0.1) [c,d]

A  2.2   0.7   2.2   2.2   0.0   0.4   0.7   0.2
C  0.4   2.5   2.5   2.5   2.5   0.7   0.1   2.5
G  2.5   2.5   0.0   2.5   2.5   0.7   2.5   0.7
T  0.2   0.1   2.2   0.0   2.2   0.7   2.2   0.7

Log odds = -log(score)

Score * score * score … → log + log + log …
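The point of the log form is that a product of scores becomes a sum. The sketch below assumes base-10 logarithms, which is an inference: it is the base that reproduces the entries of Table F (for example, 0.006 gives about 2.2). The list `window_scores` holds the Table E entries for the first eight bases of the urt-71 window, T A G T A T C A.

```python
from math import isclose, log10, prod


def log_form(p):
    """Negative base-10 log of a PSSM entry (base 10 is an assumption)."""
    return -log10(p)


# Table E (B = 0.1) entries for T, A, G, T, A, T, C, A in columns 1 to 8.
window_scores = [0.59, 0.20, 0.98, 0.99, 0.99, 0.20, 0.79, 0.59]

product_score = prod(window_scores)
summed_logs = sum(log_form(p) for p in window_scores)

print(product_score)
print(summed_logs)
print(isclose(summed_logs, -log10(product_score)))
```

With this sign convention a smaller total means a better match, and sums avoid the underflow that long products of small probabilities can cause.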


Slide 23

Position-specific scoring matrices: Expand training set through orthologs

Table 3: Training set including sequences from two Nostocs [a]

71-devB CATTACTCCTTCAATCCCTCGCCCCTCATTTGTACAGTCTGTTACCTTTACCTGAAACAGATGAATGTAGAATTTA

Np-devB CCTTGACATTCATTCCCCCATCTCCCCATCTGTAGGCTCTGTTACGTTTTCGCGTCACAGATAAATGTAGAATTCA

71-glnA AGGTTAATATTACCTGTAATCCAGACGTTCTGTAACAAAGACTACAAAACTGTCTAATGTTTAGAATCTACGATAT

Np-glnA AGGTTAATATAACCTGATAATCCAGATATCTGTAACATAAGCTACAAAATCCGCTAATGTCTACTATTTAAGATAT

71-hetC GTTATTGTTAGGTTGCTATCGGAAAAAATCTGTAACATGAGATACACAATAGCATTTATATTTGCTTTAGTATCTC

71-nirA TATTAAACTTACGCATTAATACGAGAATTTTGTAGCTACTTATACTATTTTACCTGAGATCCCGACATAACCTTAG

Np-nirA CATCCATTTTCAGCAATTTTACTAAAAAATCGTAACAATTTATACGATTTTAACAGAAATCTCGTCTTAAGTTATG

71-ntcB ATTAATGAAATTTGTGTTAATTGCCAAAGCTGTAACAAAATCTACCAAATTGGGGAGCAAAATCAGCTAACTTAAT

Np-ntcB TTATACAAATGTAAATCACAGGAAAATTACTGTAACTAACTATACTAAATTGCGGAGAATAAACCGTTAACTTAGT

71-urt ATTAATTTTTATTTAAAGGAATTAGAATTTAGTATCAAAAATAACAATTCAATGGTTAAATATCAAACTAATATCA

Np-urt TTATTCTTCTGTAACAAAAATCAGGCGTTTGGTATCCAAGATAACTTTTTACTAGTAAACTATCGCACTATCATCA


Slide 24

Position-specific scoring matrices: Decrease complexity through info analysis

Uncertainty: $H_c = -\sum_i p_{ic} \log_2 p_{ic}$


Slide 25

Position-specific scoring matrices: Decrease complexity through info analysis

Uncertainty: $H_c = -\sum_i p_{ic} \log_2 p_{ic}$

$H_1 = -\left[\tfrac{4}{11}\log_2\tfrac{4}{11} + \tfrac{3}{11}\log_2\tfrac{3}{11} + \tfrac{1}{11}\log_2\tfrac{1}{11} + \tfrac{3}{11}\log_2\tfrac{3}{11}\right] = 1.87$

$H_{31} = -\left[\tfrac{1}{11}\log_2\tfrac{1}{11} + \tfrac{1}{11}\log_2\tfrac{1}{11} + \tfrac{1}{11}\log_2\tfrac{1}{11} + \tfrac{8}{11}\log_2\tfrac{8}{11}\right] = 1.28$

Information content $= \sum_c (H_{\max} - H_c)$ (summed over all columns)
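The two worked values can be checked with a few lines of code. This sketch takes the per-column nucleotide counts as input (4, 3, 1, 3 for column 1 and 1, 1, 1, 8 for column 31 of the 11-sequence training set) and assumes H_max = 2 bits, the uncertainty of four equally likely nucleotides. The function names are illustrative.

```python
from math import log2


def uncertainty(counts):
    """Shannon uncertainty H_c in bits for one column's nucleotide counts."""
    total = sum(counts)
    # Zero counts contribute nothing (p * log2(p) tends to 0 as p tends to 0).
    return -sum((n / total) * log2(n / total) for n in counts if n > 0)


def information_content(columns, h_max=2.0):
    """Sum of (H_max - H_c) over all columns; h_max = log2(4) for DNA."""
    return sum(h_max - uncertainty(counts) for counts in columns)


column_1 = [4, 3, 1, 3]
column_31 = [1, 1, 1, 8]

print(round(uncertainty(column_1), 2))
print(round(uncertainty(column_31), 2))
print(round(information_content([column_1, column_31]), 2))
```

Columns with low uncertainty carry the most information, so they are the natural ones to keep when reducing the PSSM to its most informative positions.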

