Computational molecular biology biochem 218 biomedical informatics 231 http biochem218 stanford edu
Download
1 / 74

Computational Molecular Biology Biochem 218 BioMedical Informatics 231 biochem218.stanford - PowerPoint PPT Presentation


  • 74 Views
  • Uploaded on

Computational Molecular Biology Biochem 218 – BioMedical Informatics 231 http://biochem218.stanford.edu/. Discovering Transcription Factor Binding Sites in Co-Regulated Genes. Doug Brutlag Professor Emeritus Biochemistry & Medicine (by courtesy). Motivation. MicroArray analysis of

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Computational Molecular Biology Biochem 218 BioMedical Informatics 231 biochem218.stanford' - jerry


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Computational molecular biology biochem 218 biomedical informatics 231 http biochem218 stanford edu

Computational Molecular BiologyBiochem 218 – BioMedical Informatics 231http://biochem218.stanford.edu/

Discovering Transcription FactorBinding Sites in Co-Regulated Genes

Doug Brutlag

Professor Emeritus

Biochemistry & Medicine (by courtesy)


Motivation
Motivation

MicroArray analysis of

whole genome gene expression

Clustering of genes based on

their expression pattern

Searching for conserved sequence

motifs regulating the expression



Human gene expression signatures

T Cells Signaling

DNA Damage

Fibroblast Stimulation

B Cells Signaling

CMV Infection

Anoxia

Polio Infection

Monocytes Signaling IL4

Hormone

Human Gene Expression Signatures


Finding transcription factor binding sites

Pho 5

Pho 8

Pho 81

Pho 84

Pho …

Transcription Start

Finding Transcription Factor Binding Sites

Upstream RegionsCo-expressed

Genes

GATGGCTGCACCACGTGTATGC...ACGATGTCTCGC

CACATCGCATCACGTGACCAGT...GACATGGACGGC

GCCTCGCACGTGGTGGTACAGT...AACATGACTAAA

TCTCGTTAGGACCATCACGTGA...ACAATGAGAGCG

CGCTAGCCCACGTGGATCTTGA...AGAATGACTGGC


Finding transcription factor binding sites1
Finding Transcription Factor Binding Sites

Upstream RegionsCo-expressed

Genes

GATGGCTGCACCACGTGTATGC...ACGATGTCTCGC

CACATCGCATCACGTGACCAGT...GACATGGACGGC

GCCTCGCACGTGGTGGTACAGT...AACATGACTAAA

TCTCGTTAGGACCATCACGTGA...ACAATGAGAGCG

CGCTAGCCCACGTGGATCTTGT...AGAATGGCCTAT


Finding transcription factor binding sites2
Finding Transcription Factor Binding Sites

Upstream RegionsCo-expressed

Genes

ATGGCTGCACCACGTTTATGC...ACGATGTCTCGC

CACATCGCATCACGTGACCAGT...GACATGGACGGC

GCCTCGCACGTGGTGGTACAGT...AACATGACTAAA

TTAGGACCATCACGTGA...ACAATGAGAGCG

CGCTAGCCCACGTTGATCTTGT...AGAATGGCCTAT

Pho4 binding


Three algorithms
Three Algorithms

  • BioProspector

    • Presented in 2000

    • Extends Gibb’s sampling (stochastic method)

    • For any cluster of sequences

  • MDScan

    • Deterministic approach

    • Enumerative

    • Very fast

    • For sequences with some ranking information

  • MotifCut and MotifScan

    • Graph-based

    • Does not use PSSMs

    • Novel and sensitive


Representing ambiguous dna motifs

Consensus motif: CACAAAA

Degenerate motif: CRCAAAW

A/T

A/G

Representing Ambiguous DNA Motifs

  • Sequence Patterns (Regular expressions)

  • IUPAC nomenclatures for DNA ambiguities


Weight matrix for transcription factor binding sites

Frequency weight Matrix

Alignment Matrix

Weight Matrix for Transcription Factor Binding Sites

A DNA Motif as a position specific frequency weight matrix

Sites

ATGGCATG

AGGGTGCG

ATCGCATG

TTGCCACG

ATGGTATT

ATTGCACG

AGGGCGTT

ATGACATG

ATGGCATG

ACTGGATG


Weight matrix with consensus sequence logotype with degenerate consensus
Weight Matrix with Consensus Sequence & Logotype with Degenerate Consensus

Weight Matrix or Position Specific Scoring Matrix

TTWHYCGGHY


Bioprospector initialization
BioProspector Initialization Degenerate Consensus

Gather together upstream regulatory regions


Bioprospector initialization1

a Degenerate Consensus 1

a2

a3

a4

ak

BioProspector Initialization

Actual Location of Regulatory Motifs is Unknown


Bioprospector initialization2

a Degenerate Consensus 1'

a2'

a3'

a4'

ak'

Initial Motif

BioProspector Initialization

Randomly initialize the beginning motif


Bioprospector iterative update

Motif Without Degenerate Consensus

a1' Segment

BioProspector Iterative Update

Take out one sequence at a time with its segment

a1'

a2'

a3'

a4'

ak'


Bioprospector iterative update1

Motif Without Degenerate Consensus

a1' Segment

BioProspector Iterative Update

Score each segment with the current motif

Sequence 1

Segment (1-6): 1.5

a2'

a3'

a4'

ak'


Bioprospector iterative update2

Motif Without Degenerate Consensus

a1' Segment

BioProspector Iterative Update

Score each segment with the current motif

Sequence 1

Segment (2-7): 3

a2'

a3'

a4'

ak'


Bioprospector iterative update3

Motif Without Degenerate Consensus

a1' Segment

BioProspector Iterative Update

Score each segment with the current motif

Sequence 1

Segment (3-8): 2.7

a2'

a3'

a4'

ak'


Bioprospector iterative update4

Motif Without Degenerate Consensus

a1' Segment

BioProspector Iterative Update

Score each segment with the current motif

Sequence 1

Segment (4-9): 9.0

a2'

a3'

a4'

ak'


Bioprospector iterative update5

Motif Without Degenerate Consensus

a1' Segment

BioProspector Iterative Update

Score each segment with the current motif

Sequence 1

Segment (5-10): 3.2

a2'

a3'

a4'

ak'


Bioprospector iterative update6

Motif Without Degenerate Consensus

a1' Segment

BioProspector Iterative Update

Score each segment with the current motif

Sequence 1

Segment (6-11): 27.1

a2'

a3'

a4'

ak'


Bioprospector iterative update7

Motif Without Degenerate Consensus

a1' Segment

BioProspector Iterative Update

Score each segment with the current motif

Sequence 1

Segment (7-12): 11.2

a2'

a3'

a4'

ak'


Bioprospector iterative update8

Motif Without Degenerate Consensus

a1' Segment

BioProspector Iterative Update

Score each segment with the current motif

Sequence 1

Segment (8-13): 2.9

a2'

a3'

a4'

ak'


Bioprospector iterative update9

Motif Without Degenerate Consensus

a1' Segment

BioProspector Iterative Update

Score each segment with the current motif

Sequence 1

Segment (9-14): 9.1

a2'

a3'

a4'

ak'


Bioprospector iterative update10

a Degenerate Consensus 1"

Candidate Motif

Motif Without

a1' Segment

BioProspector Iterative Update

Score sequence 1 in all possible alignments

a2'

a3'

a4'

ak'


Bioprospector iterative update11

a Degenerate Consensus 1"

Motif Without

a2' Segment

BioProspector Iterative Update

Repeat the process until convergence

a2'

a3'

a4'

ak'


Challenges for bioprospector http bioprospector stanford edu
Challenges for BioProspector Degenerate Consensus http://bioprospector.stanford.edu/

  • Variable (0-n) motif sites per sequence

  • Motif enriched only in upstream sequences, not in the whole genome

  • Some motifs could have two conserved blocks separated by a variable length gap

  • Motifs are not highly conserved (~50%)

  • Some motifs show a palindromic symmetry

  • Assign motifs a measure of statistical significance


Thresholds allow for variable motif copies

Keep Degenerate Consensus

TH

Sample

TL

Discard

Thresholds Allow forVariable Motif Copies

  • Sequences that do not have the motif

  • Sequences with multiple copies of motif


Bioprospector finds motif with two blocks
BioProspector Finds Motif With Two Blocks Degenerate Consensus

Two-block motifs:

GACACATTACCTATGCTGGCCCTACGACCTCTCGC

CACAATTACCACCA TGGCGTGATCTCAGACACGGACGGC

GCCTCGATTACCGTGGTA TGGCTAGTTCTCAAACCTGACTAAA

TCTCGTTAGATTACCACCCATGGCCGTATCGAGAGCG

CGCTAGCCATTACCGAT TGGCGTTCTCGAGAATTGCCTAT


Bioprospector finds motifs with two blocks

97.9 Degenerate Consensus

Sample

Block2 start

Sequence

blk1

block2

Min Gap

22.4

26.5

30.1

18.9

Max Gap

BioProspector Finds MotifsWith Two Blocks

Two-block motifs


Bioprospector finds motif with inverse complementary blocks

GCGTAA Degenerate Consensus

AATGCG

BioProspector Finds Motif With Inverse Complementary Blocks

Two-block motifs

Palindrome motifs:


Bioprospector results b subtilis two block promoter
BioProspector Results: Degenerate Consensus B. subtilis two-block promoter

  • B. subtilis transcription best studied

  • 136 σA-dependent promoter sequences [-100, 15]

  • Look for w1 = w2 = 5, gap[15, 20] two-block motif

  • Correctly identified motif [TTGACA, TATAAT]

    and 70% of all the sites

  • Occasionally predicted two promoters

“Correct” site Second site

abrB TTGACG TACAAT

veg TTGACA TATAAT

f105 TTTACA TACAAT


Bioprospector web server http bioprospector stanford edu
BioProspector Web Server: Degenerate Consensus http://bioprospector.stanford.edu/


Bioprospector web server http bioprospector stanford edu1
BioProspector Web Server: Degenerate Consensus http://bioprospector.stanford.edu/


Compare prospector http compareprospector stanford edu
Compare Prospector Degenerate Consensus http://compareprospector.stanford.edu/

Liu et al, 2004, Genome Res 14(3): 451-458.


Compare prospector http compareprospector stanford edu1

Motif Degenerate Consensus

Regions conserved between

two species

1 kb

Liu et al, 2004, Genome Res 14(3): 451-458

Compare Prospectorhttp://compareprospector.stanford.edu/


Compare prospector http compareprospector stanford edu2

Gene 1 Degenerate Consensus

Gene 2

Gene 3

Gene 4

Gene 5

Gene n

Tch

Biased sampling:

Initial iterations: Tch

Later iterations: Tcl

Tcl

Tch

Tcl

Liu et al, 2004, Genome Res 14(3): 451-8,

Compare Prospectorhttp://compareprospector.stanford.edu/


Compare prospector http compareprospector stanford edu3
Compare Prospector Degenerate Consensus http://compareprospector.stanford.edu/

(Liu Y et al, Nucleic Acids Res 32:W204-7)


Compare prospector http compareprospector stanford edu4
Compare Prospector Degenerate Consensus http://compareprospector.stanford.edu/

(Liu Y et al, Nucleic Acids Res 32:W204-7)


Yeast rap1 sequences
Yeast Rap1 Sequences Degenerate Consensus

  • Chromatin immunoprecipitation + microarray (ChIP-on-chip, ChIP-array, IP) experiment


Yeast rap1 sequences1

Cross link protein-DNA interaction Degenerate Consensus

Yeast Rap1 Sequences

  • Chromatin immunoprecipitation + microarray (ChIP-on-chip, ChIP-array, IP) experiment


Yeast rap1 sequences2

Cross link protein-DNA interaction Degenerate Consensus

Shear DNA

Yeast Rap1 Sequences

  • Chromatin immunoprecipitation + microarray (ChIP-on-chip, ChIP-array, IP) experiment


Yeast rap1 sequences3

Immunoprecipitation Degenerate Consensus

Yeast Rap1 Sequences

  • Chromatin immunoprecipitation + microarray (ChIP-on-chip, ChIP-array, IP) experiment


Yeast rap1 sequences4

PCR amplify and label DNA Degenerate Consensus

Yeast Rap1 Sequences

  • Chromatin immunoprecipitation + microarray (ChIP-on-chip, ChIP-array, IP) experiment


Yeast rap1 sequences5

Hybridize with microarray and measure reading Degenerate Consensus

Yeast Rap1 Sequences

  • Chromatin immunoprecipitation + microarray (ChIP-on-chip, ChIP-array, IP) experiment


Yeast rap1 sequences6

Cross link protein-DNA interaction Degenerate Consensus

Shear DNA

Immunoprecipitation

Purify DNA

Purify DNA

PCR amplify and label DNA

Hybridize with microarray and measure reading

Yeast Rap1 Sequences

  • Chromatin immunoprecipitation + microarray (ChIP-on-chip, ChIP-array, IP) experiment


Chromatin immune precipitation
Chromatin Immune Precipitation Degenerate Consensus


Yeast rap1 sequences7
Yeast Rap1 Sequences Degenerate Consensus

  • Chromatin immunoprecipitation + microarray (ChIP-on-chip, ChIP-array, IP) experiment

  • Rap1 IP Enriched 727 DNA fragments

    • 45% are intergenic

    • Average length 1-2 KB

    • Some are false positives

    • Some have multiple Rap1 sites


Useful insights
Useful Insights Degenerate Consensus

  • In ChIP-array experiments, highly enriched sequences are usually the real targets

  • Transcription factor binding sites occurs more abundantly in these real targets

  • Search TF sites from high-confidence sequences first before examine the rest sequences?

Motif Discovery Scan (MDscan)


Mdscan algorithm define m matches

m Degenerate Consensus -matches for an 8-mer

MDscan Algorithm:Define m-matches

For a given w-mer and any other random w-mer

TGTAACGT 8-mer

TGTAACGT matched 8

AGTAACGT matched 7

TGCAACAT matched 6

TGACACGG matched 5

AATAACAG matched 4

Pick a reasonable m, e.g. in yeast


Mdscan algorithm finding candidate motifs
MDscan Algorithm: Degenerate Consensus Finding candidate motifs

m-matches

Seed 1

Top

Seqs

All IP enriched sequences


Mdscan algorithm finding candidate motifs1
MDscan Algorithm: Degenerate Consensus Finding candidate motifs

m-matches

Seed 2

Top

Seqs

All IP enriched sequences


Mdscan algorithm finding candidate motifs2
MDscan Algorithm: Degenerate Consensus Finding candidate motifs

m-matches

Seed 3

Top

Seqs

All IP enriched sequences


Mdscan algorithm scanning sequences with top motifs

Specificity Degenerate Consensus

(unlikely in genome)

Motif Signal

Abundance

Conserved

Positions

MDscan Algorithm:Scanning sequences with top motifs

  • Keep 30-50 top scoring candidate motifs:


Mdscan algorithm scanning sequences with top motifs1

Specificity Degenerate Consensus

(unlikely in genome)

Motif Signal

Abundance

Conserved

Positions

MDscan Algorithm:Scanning sequences with top motifs

  • Keep 30-50 top scoring candidate motifs:

  • Scan the rest of the sequences with the candidate motifs


Mdscan algorithm finding all motif instances
MDscan Algorithm: Degenerate Consensus Finding All Motif Instances

m-matches

Seed 3

Top

Seqs

All IP enriched sequences


Mdscan algorithm refine the motifs
MDscan Algorithm: Degenerate Consensus Refine the motifs

m-matches

Seed 3

Top

Seqs

X

X

X

All IP enriched sequences


Mdscan simulation

GACTCCCA Degenerate Consensus

GATTGCCT

GGCTACCT

GACTACCA

GAGTACCA

GACTATCT

GAGTACCA

GGCTCCCA

GACTCCCA

GACTCCGA

GGGAACCA

GCTTCCAA

GACTACCA

CAGTACGA

GGCTAGCA

GACTGCCG

GACTACCA

GACTCCCG

W8S1

More

Conserved

W8S3

Less

Conserved

MDscan Simulation

  • Nine motif matrix models with 3 widths and 3 degeneracy


Mdscan simulation1

Higher confidence Degenerate Consensus

Motif more abundant

MDscan Simulation

Each test set:

  • 100 sequences of 600 bases from yeast intergenic

  • Motif segments generated and inserted according to the following abundance:


Mdscan simulation2

3600 tests Degenerate Consensus

MDscan Simulation

  • 100 tests for

    3 widths

    3 strengths

    4 abundances


Mdscan simulation3

3600 tests Degenerate Consensus

MDscan Simulation

  • 100 tests for

    3 widths

    3 degeneracy

    4 abundance

    3 X Consensus

  • MDscan speed 14 X BioProspector

    27 X AlignACE


Mdscan simulation accuracy w 8
MDscan Simulation Accuracy Degenerate Consensus w = 8


Mdscan simulation accuracy w 12
MDscan Simulation Accuracy Degenerate Consensus w = 12


Mdscan simulation accuracy w 16
MDscan Simulation Accuracy Degenerate Consensus w = 16


Mdscan biological tests
MDscan Biological Tests Degenerate Consensus

  • Gal4 & Ste12 [Ren et al. Science 2000]

    • Gal4: galactose metabolism

    • Ste12: responds to mating pheromones


Mdscan biological tests1
MDscan Biological Tests Degenerate Consensus

  • SBF & MBF [Iyer et al.Nature 2001]

    • SBF: Swi4 + Swi6 budding, membrane, cell wall biosynthesis

    • MBF: Mbp1 + Swi6 DNA replication and repair


Mdscan biological tests2
MDscan Biological Tests Degenerate Consensus

  • Rap1 [Lieb et al.Nature Genetics 2001]

    • Repressor activator

    • 37% pol II events in exponentially growing cells


Tamo tools for the analysis of motifs http fraenkel mit edu tamo
TAMO: Tools for the Analysis of Motifs Degenerate Consensus http://fraenkel.mit.edu/TAMO/


Webmotifs http fraenkel mit edu webmotifs
WebMotifs Degenerate Consensus http://fraenkel.mit.edu/webmotifs/


Webmotifs http fraenkel mit edu webmotifs1
WebMotifs Degenerate Consensus http://fraenkel.mit.edu/webmotifs/


Melina comparing motifs http melina1 hgc jp
Melina: Comparing Motifs Degenerate Consensus http://melina1.hgc.jp/


Melina comparing motifs http melina1 hgc jp1
Melina: Comparing Motifs Degenerate Consensus http://melina1.hgc.jp/


Single microarray determination of transcription factor motifs
Single Microarray Determination of Transcription Factor Motifs

One microarray experiment, no clustering needed

Basic idea: more affected

sequences may contain more

motif TF sites

Induced

Expression log ratio

Genes

Repressed


Summary
Summary Motifs

  • BioProspector is stochastic

  • BioProspector can get trapped in local maxima

  • BioProspector must be run multiple times to discover the true globally optimal motif

  • BioProspector is slow

  • MDScan is deterministic

  • MDScan always gives the same answer with the same data

  • MDScan is fast

  • MDScan uses rank order data to accelerate the search process and to allow it to be deterministic

  • MDScan is fast enough to search intergenic regions from entire genomes.

  • MDScan is not as sensitive as BioProspector


ad