slide1

Welcome to Introduction to Bioinformatics
Monday, 11 October

  • Characteristics of PSSMs
    • How to make a PSSM
    • Uncertainty and information
    • How to score a sequence

Problem sets (Blast, Modeling)

slide2

Scenario 1

Prediction of regulatory site

slide3

N2 fixation in cyanobacteria

[Figure labels: heterocysts, sucrose, N2, CO2, O2]

Matveyev and Elhai (unpublished)

slide4

mRNA

GTA…(8)…TAC

…(20-24)…TAnnnT

Differentiation in cyanobacteria: What does NtcA bind to?

Herrero et al (2001) J Bacteriol 183:411-425

slide5

Differentiation in cyanobacteria

Sequence upstream from hetQ

ttctatgagaatataaaattttccttaagtttct

aaaaccgaccattctgatgaataagtccggtttt

tgctttttcgctttatttatctatatttccaagt

ggggtgacaactatcttgccaatattgtcgttat

gaaaaaatctGTAacatgagaTACacaatagcatttatatttgcttTAgtaTctctctcttgggtggg

NtcA binding site: GTA…(8)…TAC

Promoter: …(20-24)…TAnnnT

slide6

Differentiation in cyanobacteria: Integration of signals through HetR

HetQ

-N

NtcA

???

Genes needed for differentiation

Position in cell cycle

HetR

Level of PatS

Level of HetN

Master regulator

Stockholm

slide7

Scenario 1: The aftermath

  • Did you go for it?

YES

  • Did it bind NtcA?

YES

  • Did killing the site prevent heterocysts?

NO

Stockholm

slide8

Scenario 1: The aftermath

  • Did you go for it?

YES

  • Did it bind NtcA?

YES

  • Did killing the site prevent heterocysts?

NO

  • Fame and fortune?

NO

  • Reasonable paper?

YES

slide9

Scenario 1: The aftermath

If hetQ isn’t the golden link, then what is?

-N

NtcA

???

Genes needed for differentiation

HetR

  • Gene preceded by NtcA-binding site
  • Blocking NtcA-binding affects gene expression
  • Gene product required for hetR expression
slide10

Scenario 1: The aftermath

If hetQ isn’t the golden link, then what is?

-N

NtcA

???

Genes needed for differentiation

HetR

  • Gene preceded by NtcA-binding site

How to find?

  • Search for GTA…(N8)…TAC…(N20-24)…TA…(N3)…T?
  • Problem: thousands of candidate hits
  • Regexps may also “overfit” the model: they can be too strict and miss real binding sites
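
The pattern search proposed on this slide can be sketched with a regular expression. This is a minimal illustration, not the method used in the course: the names `NTCA_PROMOTER`, `HETQ_UPSTREAM` and `find_sites` are hypothetical, the final element is read as TAnnnT (as on the earlier slides), and the test sequence is the last line of the hetQ upstream region shown earlier.

```python
import re

# Assumed reading of the slide's pattern:
# GTA, 8 arbitrary bases, TAC, 20-24 arbitrary bases, then TAnnnT.
NTCA_PROMOTER = re.compile(r"GTA[ACGT]{8}TAC[ACGT]{20,24}TA[ACGT]{3}T")

# Last line of the hetQ upstream sequence from the earlier slide.
HETQ_UPSTREAM = (
    "gaaaaaatctGTAacatgagaTACacaatagcatttatatttgcttTAgtaTctctctcttgggtggg"
)

def find_sites(sequence):
    """Return (start, matched_text) for each non-overlapping match."""
    upper = sequence.upper()
    return [(m.start(), m.group()) for m in NTCA_PROMOTER.finditer(upper)]

hits = find_sites(HETQ_UPSTREAM)
for start, text in hits:
    print(start, text)
```

A pattern like this accepts or rejects a site outright, which is why it can return thousands of hits on a genome yet still miss a real site that deviates at a single position.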
slide11

Position-specific scoring matrices: A better way

Table 1: Examples of position-specific scoring matrices from sequence alignment

A. Sequence alignment [a]

ATTTAGTATCAAAAATAACAATTC
GTTCTGTAACAAAGACTACAAAAC
ATTTTGTAGCTACTTATACTATTT
AAGCTGTAACAAAATCTACCAAAT
CATTTGTACAGTCTGTTACCTTTA

slide12

Position-specific scoring matrices: A better way

A. Sequence alignment [a]

ATTTAGTATCAAAAATAACAATTC
GTTCTGTAACAAAGACTACAAAAC
ATTTTGTAGCTACTTATACTATTT
AAGCTGTAACAAAATCTACCAAAT
CATTTGTACAGTCTGTTACCTTTA

B. Table of occurrences [a]

A  3 2 0 0 1 0 0 5 2 1 3 4 3 2 2 1 1 5 0 2 4 2 2 1
C  1 0 0 2 0 0 0 0 1 4 0 0 2 0 0 2 0 0 5 2 0 0 0 2
G  1 0 1 0 0 5 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0
T  0 3 4 3 4 0 5 0 1 0 1 1 0 2 2 2 4 0 0 1 1 3 3 2
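
The step from the alignment to the table of occurrences is a per-column tally. A minimal sketch follows; the five sequences are the rows of the alignment on this slide, while the names `ALIGNMENT` and `count_table` are made up for illustration.

```python
# The five aligned sites from part A of the table (24 columns each).
ALIGNMENT = [
    "ATTTAGTATCAAAAATAACAATTC",
    "GTTCTGTAACAAAGACTACAAAAC",
    "ATTTTGTAGCTACTTATACTATTT",
    "AAGCTGTAACAAAATCTACCAAAT",
    "CATTTGTACAGTCTGTTACCTTTA",
]

def count_table(alignment):
    """Count how often each nucleotide occurs in each alignment column."""
    width = len(alignment[0])
    counts = {base: [0] * width for base in "ACGT"}
    for seq in alignment:
        for col, base in enumerate(seq):
            counts[base][col] += 1
    return counts

counts = count_table(ALIGNMENT)
for base in "ACGT":
    print(base, *counts[base])
```

Every column should sum to the number of sequences (5 here), which is a quick check that the alignment was entered correctly.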

slide13

NtcA

???

HetR

TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG

AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT

TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC

TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC

GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC

CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA

TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA

AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG

AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA

TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG

CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC

GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG

GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT

CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA

ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA

TATGAGGCAA TCACAGCATC AGGTGACCTT AGTATCTATT

CTCGGGAGCG CACGGCTCTA AAGAGGCCCA TATCCAGGCA

CCTTTAGATG CAAGAAGGAG GAAACAGCTC GAAATCCCTG

AGGCCGGAGG GTCAAGAACT CTCCACCGGC GGCAGCGGCC

CCCCGGCCTA AGGCTGCCTG TGCTATAAAT ACGCGGCCCA

TTCCCTGGGC TCGGCGGGAC AGATAACATG AATGTGCCCT

Position-specific scoring matrices: A better way

[Figure: an NtcA-based PSSM is scanned along the Anabaena genome sequence above; four windows are labeled "Good match to NtcA-site", for example CTCCGTAAAC CTCTAAC...]
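
Scanning a genome with a PSSM amounts to sliding a window along the sequence and scoring each position. The sketch below assumes the ten-column frequency matrix listed as Table 1C (B = 0) in this deck and multiplies the per-column values; the stand-in "genome" is just one aligned sequence from Table 1A, and `PSSM`, `WIDTH` and `scan` are hypothetical names.

```python
from math import prod

# Frequencies from Table 1C (B = 0): one list per nucleotide, ten columns.
PSSM = {
    "A": [0.0, 0.2, 0.0, 0.0, 1.0, 0.4, 0.2, 0.6, 0.8, 0.6],
    "C": [0.4, 0.0, 0.0, 0.0, 0.0, 0.2, 0.8, 0.0, 0.0, 0.4],
    "G": [0.0, 0.0, 1.0, 0.0, 0.0, 0.2, 0.0, 0.2, 0.0, 0.0],
    "T": [0.6, 0.8, 0.0, 1.0, 0.0, 0.2, 0.0, 0.2, 0.2, 0.0],
}
WIDTH = 10

def scan(sequence):
    """Score every window of the sequence; return (offset, score) pairs."""
    results = []
    for offset in range(len(sequence) - WIDTH + 1):
        window = sequence[offset:offset + WIDTH]
        score = prod(PSSM[base][col] for col, base in enumerate(window))
        results.append((offset, score))
    return results

# Stand-in for a genome: the first aligned sequence from Table 1A.
genome = "ATTTAGTATCAAAAATAACAATTC"
scores = scan(genome)
best_offset, best_score = max(scores, key=lambda pair: pair[1])
print(best_offset, genome[best_offset:best_offset + WIDTH], best_score)
```

With B = 0, any window that contains a nucleotide never seen in the training set at that column scores exactly zero, so only one window of this toy sequence gets a nonzero score.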

slide16

Position-specific scoring matrices: A better way

B. Table of occurrences [a]

A  0 1 0 0 5 2 1 3 4 3
C  2 0 0 0 0 1 4 0 0 2
G  0 0 5 0 0 1 0 1 0 0
T  3 4 0 5 0 1 0 1 1 0

C. Position-specific scoring matrix (B = 0) [b]

A  0    .20  0    0    1.0  .40  .20  .60  .80  .60
C  .40  0    0    0    0    .20  .80  0    0    .40
G  0    0    1.0  0    0    .20  0    .20  0    0
T  .60  .80  0    1.0  0    .20  0    .20  .20  0
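
Going from the table of occurrences to the B = 0 matrix is a division of each count by the number of aligned sequences. A minimal sketch, using the ten columns shown on this slide; `COUNTS`, `N_SEQUENCES` and `to_frequencies` are illustrative names only.

```python
# Ten columns of the occurrence table shown on this slide.
COUNTS = {
    "A": [0, 1, 0, 0, 5, 2, 1, 3, 4, 3],
    "C": [2, 0, 0, 0, 0, 1, 4, 0, 0, 2],
    "G": [0, 0, 5, 0, 0, 1, 0, 1, 0, 0],
    "T": [3, 4, 0, 5, 0, 1, 0, 1, 1, 0],
}
N_SEQUENCES = 5  # number of sequences in the alignment

def to_frequencies(counts, n):
    """Convert raw counts to per-column frequencies (no pseudocounts)."""
    return {base: [c / n for c in row] for base, row in counts.items()}

pssm = to_frequencies(COUNTS, N_SEQUENCES)
for base in "ACGT":
    print(base, " ".join(f"{value:.2f}" for value in pssm[base]))
```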

slide17

Position-specific scoring matrices: A better way

Table 2: Scoring a sequence with a PSSM

urt-71            T    A    G    T    A    T    C    A    A    A
Score             .6   .2   1    1    1    .2   .8   .6   .8   .6
w/ ps’counts [b]  .51  .24  .75  .79  .79  .24  .61  .51  .65  .51
Normal’d [b]      1.6  .75  4.2  2.5  2.5  .75  3.4  1.6  2.0  1.6

Score = .60 * .20 * 1.0 * …
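
The product in the last line can be worked through in code. This sketch assumes the B = 0 matrix from Table 1C and the urt-71 site from Table 2; `PSSM` and `score_site` are hypothetical names.

```python
from math import prod

# B = 0 frequencies from Table 1C, ten columns.
PSSM = {
    "A": [0.0, 0.2, 0.0, 0.0, 1.0, 0.4, 0.2, 0.6, 0.8, 0.6],
    "C": [0.4, 0.0, 0.0, 0.0, 0.0, 0.2, 0.8, 0.0, 0.0, 0.4],
    "G": [0.0, 0.0, 1.0, 0.0, 0.0, 0.2, 0.0, 0.2, 0.0, 0.0],
    "T": [0.6, 0.8, 0.0, 1.0, 0.0, 0.2, 0.0, 0.2, 0.2, 0.0],
}

def score_site(site):
    """Multiply the matrix entries selected by each base of the site."""
    return prod(PSSM[base][col] for col, base in enumerate(site))

urt_71 = "TAGTATCAAA"
print(score_site(urt_71))        # product of the "Score" row of Table 2
print(score_site("TAATATCAAA"))  # one unseen base (A in column 3) gives 0
```

The second call shows the weakness of B = 0: a single base absent from the training set zeroes the whole score, which is what pseudocounts are meant to soften.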

slide18

Position-specific scoring matrices: Introduction of pseudocounts

A. Sequence alignment [a]

ATTTAGTATCAAAAATAACAATTC
GTTCTGTAACAAAGACTACAAAAC
ATTTTGTAGCTACTTATACTATTT
AAGCTGTAACAAAATCTACCAAAT
CATTTGTACAGTCTGTTACCTTTA

A?

q_G,6 = 5 real counts

p_G = ? pseudocounts

slide19

Position-specific scoring matrices: Introduction of pseudocounts

Score(position, nucleotide) = (q + p) / (N + B)

p = pseudocounts = B * (overall frequency of nucleotide)

[A] = 0.32, [T] = 0.32, [C] = 0.18, [G] = 0.18

B = Total number of pseudocounts

= Square root (N) ?

or = 0.1 ?
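
The formula above can be checked numerically against the tables in this deck. In this sketch q is the real count at a position, N = 5 is taken as the number of aligned sequences, and both candidate values of B are tried; the function name `pssm_entry` is an assumption made for illustration.

```python
from math import sqrt

# Overall nucleotide frequencies quoted on this slide.
BACKGROUND = {"A": 0.32, "T": 0.32, "C": 0.18, "G": 0.18}

def pssm_entry(q, base, n, b):
    """(q + p) / (N + B), with p = B * background frequency of the base."""
    p = b * BACKGROUND[base]
    return (q + p) / (n + b)

n = 5  # assumed: number of sequences in the alignment

# B = sqrt(N): T seen in 3 of 5 sequences, and G never seen.
print(round(pssm_entry(3, "T", n, sqrt(n)), 2))
print(round(pssm_entry(0, "G", n, sqrt(n)), 3))

# B = 0.1: an unseen A gets a much smaller, but still nonzero, value.
print(round(pssm_entry(0, "A", n, 0.1), 3))
```

A larger B pulls every entry toward the background frequencies, while a small B such as 0.1 stays close to the raw frequencies and mainly removes the zeros.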

slide20

Position-specific scoring matrices: Introduction of pseudocounts

C. Position-specific scoring matrix (B = 0) [b]

A  0    .20  0    0    1.0  .40  .20
C  .40  0    0    0    0    .20  .80
G  0    0    1.0  0    0    .20  0
T  .60  .80  0    1.0  0    .20  0

D. Position-specific scoring matrix (B = √N = 2.2) [c]

A  .099  .24   .099  .099  .79   .38  .24
C  .33   .056  .056  .056  .056  .19  .61
G  .056  .056  .75   .056  .056  .19  .056
T  .51   .65   .099  .79   .099  .24  .099

slide21

Position-specific scoring matrices: Normalization

A. Sequence alignment [a]

ATTTAGTATCAAAAATAACAATTC
GTTCTGTAACAAAGACTACAAAAC
ATTTTGTAGCTACTTATACTATTT
AAGCTGTAACAAAATCTACCAAAT
CATTTGTACAGTCTGTTACCTTTA

B. Table of occurrences [a]

A  3 2 0 0 1 0 0 5 2 1 3 4 3 2 2 1 1 5 0 2 4 2 2 1
C  1 0 0 2 0 0 0 0 1 4 0 0 2 0 0 2 0 0 5 2 0 0 0 2
G  1 0 1 0 0 5 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0
T  0 3 4 3 4 0 5 0 1 0 1 1 0 2 2 2 4 0 0 1 1 3 3 2

How to account for similarity due to similar base composition?

Compare Score_PSSM / Score_background frequency

0.79 / 0.32 = 2.5
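
Normalization divides each PSSM entry by the background frequency of the nucleotide it refers to. The sketch below applies this to the "w/ ps’counts" row of Table 2, assuming the background frequencies given with the pseudocount formula; `normalize` is a hypothetical helper.

```python
# Overall nucleotide frequencies quoted with the pseudocount formula.
BACKGROUND = {"A": 0.32, "T": 0.32, "C": 0.18, "G": 0.18}

def normalize(score, base):
    """Ratio of the PSSM entry to the background frequency of the base."""
    return score / BACKGROUND[base]

# urt-71 site and its pseudocount-adjusted scores, copied from Table 2.
site = "TAGTATCAAA"
with_pseudocounts = [0.51, 0.24, 0.75, 0.79, 0.79, 0.24, 0.61, 0.51, 0.65, 0.51]

normalized = [normalize(s, b) for s, b in zip(with_pseudocounts, site)]
print(" ".join(f"{value:.2f}" for value in normalized))
```

Values above 1 mean the nucleotide is more common at that position of the site than in the genome overall; values below 1 mean it is rarer.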

slide22

Position-specific scoring matrices: Log odds form

E. Position-specific scoring matrix (B = 0.1) [c]

A  .006  .20   .006  .006  .99   .40  .20   .59
C  .40   .004  .004  .004  .004  .20  .79   .004
G  .004  .004  .98   .004  .004  .20  .004  .20
T  .59   .79   .006  .99   .006  .20  .006  .20

F. Position-specific scoring matrix: Log-odds form (B = 0.1) [c,d]

A  2.2  0.7  2.2  2.2  0.0  0.4  0.7  0.2
C  0.4  2.5  2.5  2.5  2.5  0.7  0.1  2.5
G  2.5  2.5  0.0  2.5  2.5  0.7  2.5  0.7
T  0.2  0.1  2.2  0.0  2.2  0.7  2.2  0.7

Log odds = -log(score)

Score * score * score …  →  log + log + log …
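
The point of the log form is that a product of scores becomes a sum of logs. The sketch below assumes a base-10 logarithm, which is consistent with the values in table F (for example, -log10(0.20) is about 0.7); the list `column_scores` holds the table E entries picked out by the first eight bases of the urt-71 site (T A G T A T C A).

```python
from math import log10, prod

def neg_log(score):
    """Negative base-10 log of a PSSM entry (base 10 is an assumption)."""
    return -log10(score)

# Table E entries selected by T, A, G, T, A, T, C, A in columns 1 to 8.
column_scores = [0.59, 0.20, 0.98, 0.99, 0.99, 0.20, 0.79, 0.59]

product_score = prod(column_scores)
summed_logs = sum(neg_log(s) for s in column_scores)

print(product_score)
print(summed_logs, -log10(product_score))  # the two values agree
```

Because of the minus sign, a smaller sum corresponds to a better match in this form, and sums avoid the very small numbers that long products produce.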

slide23

Position-specific scoring matrices: Expand training set through orthologs

Table 3: Training set including sequences from two Nostocs [a]

71-devB CATTACTCCTTCAATCCCTCGCCCCTCATTTGTACAGTCTGTTACCTTTACCTGAAACAGATGAATGTAGAATTTA

Np-devB CCTTGACATTCATTCCCCCATCTCCCCATCTGTAGGCTCTGTTACGTTTTCGCGTCACAGATAAATGTAGAATTCA

71-glnA AGGTTAATATTACCTGTAATCCAGACGTTCTGTAACAAAGACTACAAAACTGTCTAATGTTTAGAATCTACGATAT

Np-glnA AGGTTAATATAACCTGATAATCCAGATATCTGTAACATAAGCTACAAAATCCGCTAATGTCTACTATTTAAGATAT

71-hetC GTTATTGTTAGGTTGCTATCGGAAAAAATCTGTAACATGAGATACACAATAGCATTTATATTTGCTTTAGTATCTC

71-nirA TATTAAACTTACGCATTAATACGAGAATTTTGTAGCTACTTATACTATTTTACCTGAGATCCCGACATAACCTTAG

Np-nirA CATCCATTTTCAGCAATTTTACTAAAAAATCGTAACAATTTATACGATTTTAACAGAAATCTCGTCTTAAGTTATG

71-ntcB ATTAATGAAATTTGTGTTAATTGCCAAAGCTGTAACAAAATCTACCAAATTGGGGAGCAAAATCAGCTAACTTAAT

Np-ntcB TTATACAAATGTAAATCACAGGAAAATTACTGTAACTAACTATACTAAATTGCGGAGAATAAACCGTTAACTTAGT

71-urt ATTAATTTTTATTTAAAGGAATTAGAATTTAGTATCAAAAATAACAATTCAATGGTTAAATATCAAACTAATATCA

Np-urt TTATTCTTCTGTAACAAAAATCAGGCGTTTGGTATCCAAGATAACTTTTTACTAGTAAACTATCGCACTATCATCA

slide24

Position-specific scoring matrices: Decrease complexity through info analysis

Uncertainty (H_c) = - Sum over i [p_ic log2(p_ic)]

slide25

Position-specific scoring matrices: Decrease complexity through info analysis

Uncertainty (H_c) = - Sum over i [p_ic log2(p_ic)]

H_1 = -{[4/11 log2(4/11)] + [3/11 log2(3/11)] + [1/11 log2(1/11)] + [3/11 log2(3/11)]}

= 1.87

H_31 = -{[1/11 log2(1/11)] + [1/11 log2(1/11)] + [1/11 log2(1/11)] + [8/11 log2(8/11)]}

= 1.28

Information content = Sum (H_max – H_c) (summed over all columns)
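
The two worked values can be reproduced from the Table 3 training set. In this sketch each column is written as a string of the 11 nucleotides found at that position (read top to bottom from Table 3), p_ic is taken as the fraction of nucleotide i in column c, and H_max is assumed to be log2(4) = 2 bits for DNA; the function names are illustrative.

```python
from collections import Counter
from math import log2

def uncertainty(column):
    """H_c = -sum over nucleotides of p * log2(p) for one alignment column."""
    n = len(column)
    return -sum((k / n) * log2(k / n) for k in Counter(column).values())

def information(column, h_max=2.0):
    """H_max - H_c; h_max = log2(4) = 2 bits is assumed for four nucleotides."""
    return h_max - uncertainty(column)

# Columns 1 and 31 of Table 3, read top to bottom (11 sequences).
column_1 = "CCAAGTCATAT"
column_31 = "TTTTTTCTTAG"

print(round(uncertainty(column_1), 2), round(information(column_1), 2))
print(round(uncertainty(column_31), 2), round(information(column_31), 2))
```

Column 31 is dominated by T, so its uncertainty is lower and its information content higher than column 1, which suggests which columns are worth keeping in a reduced PSSM.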