Transcription regulation transcription factor motif finding
Download
1 / 75

Transcription Regulation Transcription Factor Motif Finding - PowerPoint PPT Presentation


  • 166 Views
  • Uploaded on

Transcription Regulation Transcription Factor Motif Finding. Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520. Outline. Biology of transcription regulation and challenges of computational motif finding Scan for known TF motif sites TRASFAC and JASPAR, Sequence Logo De novo method

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Transcription Regulation Transcription Factor Motif Finding' - aoife


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Transcription regulation transcription factor motif finding

Transcription RegulationTranscription Factor Motif Finding

Xiaole Shirley Liu

STAT115, STAT215, BIO298, BIST520


Outline
Outline

  • Biology of transcription regulation and challenges of computational motif finding

  • Scan for known TF motif sites

    • TRASFAC and JASPAR, Sequence Logo

  • De novo method

    • Regular expression enumeration: w-mer enumerate

    • Position weight matrix update: EM and Gibbs

  • Motif finding in different organisms

    • Motif clusters and conservation


Imagine a chef

Restaurant Dinner

Home Lunch

Certain recipes used to

make certain dishes

Imagine a Chef



Each cell is like a chef1

Adult Liver

Infant Skin

Glucose, Oxygen,

Amino Acid

Fat, Alcohol

Nicotine

Certain genes expressed to

make certain proteins

Healthy

Skin Cell

State

Disease

Liver Cell

State

Each Cell Is Like a Chef


Understanding a genome
Understanding a Genome

Get the complete sequence

(encoded cook book)

Observe gene expressions

at different cell states

(meals prepared at different situations)

Decode gene regulation

(decode the book, understand the rules)


Information in dna

Coding region 2%

What is to be made

Milk->Yogurt

Egg->Omelet

Fish->Sushi

Flour->Cake

Beef->Burger

Information in DNA

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC

ATTTACCACATCGCATCACAGTTCAGGACTAGACACGGACG

GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA

TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG

CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT


Information in dna1

Morning

Morning

Butter

Japanese Restaurant

5 Oz

Butter

9 Oz

Information in DNA

Non-coding region 98%

Regulation: When, Where,

Amount, Other Conditions, etc

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC

ATTTACCACATCGCATCACTACGACGGACTAGACACGGACG

GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA

TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG

CGCTAGGTCATCCCAGATCTTGTTCGAATCGCGAATTGCCT

Coding region 2%

Milk->Yogurt

Egg->Omelet

Fish->Sushi

Flour->Cake

Beef->Burger


Measure gene expression
Measure Gene Expression

  • Microarray or SAGE detects the expression of every gene at a certain cell state

  • Clustering find genes that are co-expressed (potentially share regulation)


Decode gene regulation

Scrambled Egg

Bacon

Cereal

Hash Brown

Orange Juice

Decode Gene Regulation

Look at genes always expressed together:

Upstream RegionsCo-expressed

Genes

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATATTTACCACCAGTTCAGACACGGACGGC

GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA

TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG

CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT

STAT115, 04/01/2008


Decode gene regulation1
Decode Gene Regulation

Look at genes always expressed together:

Upstream RegionsCo-expressed

Genes

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATATTTACCACCAGTTCAGACACGGACGGC

GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA

TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG

CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT

Scrambled Egg

Bacon

Cereal

Hash Brown

Orange Juice

STAT115, 04/01/2008


Decode gene regulation2

Morning

Decode Gene Regulation

Look at genes always expressed together:

Upstream RegionsCo-expressed

Genes

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATATTTACCACCAGTTCAGACACGGACGGC

GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA

TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG

CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT

Scrambled Egg

Bacon

Cereal

Hash Brown

Orange Juice

STAT115, 04/01/2008


Biology of transcription regulation

Hemoglobin Beta

atttgctt ttcact gcaacct

aactccagt

Hemoglobin Zeta

gcaacct

actca

Hemoglobin Alpha

gcaacct

Hemoglobin Gamma

ccagcgccg

gcaacct

Transcription Factor (TF)

TF Binding Motif

gcaacct

Biology of Transcription Regulation

...acatttgcttctgacacaactgtgttcactagcaacctca...aacagacaccATGGTGCACCTGACTCCTGAGGAGAAGTCT...

...agcaggcccaactccagtgcagctgcaacctgcccactcc...ggcagcgcacATGTCTCTGACCAAGACTGAGAGTGCCGTC...

...cgctcgcgggccggcactcttctggtccccacagactcag...gatacccaccgATGGTGCTGTCTCCTGCCGACAAGACCAA...

...gccccgccagcgccgctaccgccctgcccccgggcgagcg...gatgcgcgagtATGGTGCTGTCTCCTGCCGACAAGACCAA...

Motif can only be computational discovered when

there are enough cases for machine learning


Computational motif finding
Computational Motif Finding

  • Input data:

    • Upstream sequences of gene expression profile cluster

    • 20-800 sequences, each 300-5000 bps long

  • Output: enriched sequence patterns (motifs)

  • Ultimate goals:

    • Which TFs are involved and their binding motifs and effects (enhance / repress gene expression)?

    • Which genes are regulated by this TF, why is there disease when a TF goes wrong?

    • Are there binding partner / competitor for a TF?


Challenges where what the signal

Water

Water

Water

Water

Water

Challenges: Where/what the signal

The motif should be abundant

GAAATATGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATATTTACCACCAAATAAGACACGGACGGC

GCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAA

TCTCGTATTTACCATATTAAATACCCACATCGAGAGCG

CGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT


Challenges where what the signal1

Coconut

Coconut

Coconut

Coconut

Coconut

Challenges: Where/what the signal

The motif should be abundant

And Abundant with significance

GAAATATGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATATTTACCACCAAATAAGACACGGACGGC

GCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAA

TCTCGTATTTACCATATTAAATACCCACATCGAGAGCG

CGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT


Challenges double stranded dna

|||||||||||||||||||||||||||||

GTGTAGCGTACCATTTATGGTCAAGTCTG

|||||||||||||||||||||||||||||

AGAGTCCATTTAGTCAGTATGATGGGTGT

Challenges: Double stranded DNA

Motif appears in both

strands

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC

TCTCAGGTAAATCAGTCATACTACCCACATCGAGAGCG


Challenges base substitutions
Challenges: Base substitutions

Sequences do not have to match the motif

perfectly, base substitutions are allowed

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATATGTACCACCAGTTCAGACACGGACGGC

GCCTCGATTTGCCGTGGTACAGTTCAAACCTGACTAAA

TCTCGTTAGGACCATATTTATCACCCACATCGAGAGCG

CGCTAGCCAATTACCGATCTTGTTCGAGAATTGCCTAT


Challenges variable motif copies
Challenges: Variable motif copies

Some sequences do not have the motif

Some have multiple copies of the motif

GATGATGCCTCGGACGGATATGCCCTACGACCTCTCGC

CACATCGCAATGCAGCAATGCGTTCAGACACGGACGGC

TCATGCTAATGCCAGTCATGCTACATGCATCGAGAGCG

GCCTCTAGCTAGGCCGGTGAACATCAGACCTGACTAAA

CGCAATATAGCATTAGCAGACAGACGAGAATTGCCTAT


Challenges variable motif copies1

Sushi

Fish

Fish

Fish

Hand Roll

Fish

Fish

Fish

Fish

Sashimi

Tempura

Sake

Challenges: Variable motif copies

Some sequences do not have the motif

Some have multiple copies of the motif

GATGATGCCTCGGACGGATATGCCCTACGACCTCTCGC

CACATCGCAATGCAGCAATGCGTTCAGACACGGACGGC

TCATGCTAATGCCAGTCATGCTACATGCATCGAGAGCG

GCCTCTAGCTAGGCCGGTGAACATCAGACCTGACTAAA

CGCAATATAGCATTAGCAGACAGACGAGAATTGCCTAT


Challenges two block motifs

Coconut

Milk

or palindromic patterns

AATGCG

GCGTAA

Challenges: Two-block motifs

Some motifs have two parts

GACACATTTACCTATGCTGGCCCTACGACCTCTCGC

CACAATTTACCACCA TGGCGTGATCTCAGACACGGACGGC

GCCTCGATTTACCGTGGTATGGCTAGTTCTCAAACCTGACTAAA

TCTCGTTAGATTTACCACCCATGGCCGTATCGAGAGCG

CGCTAGCCATTTACCGAT TGGCGTTCTCGAGAATTGCCTAT


Scan for known tf motif sites
Scan for Known TF Motif Sites

  • Experimental TF sites: TRANSFAC, JASPAR

  • Motif representation:

    • Regular expression: Consensus CACAAAA

      binary decision Degenerate CRCAAAW

      IUPAC

A/G

A/T


Scan for known tf motif sites1

Segment ATGCAGCT score =

p(generate ATGCAGCT from motif matrix)

p(generate ATGCAGCT from background)

p0A  p0T  p0G  p0C  p0A  p0G  p0C  p0T

Scan for Known TF Motif Sites

  • Experimental TF sites: TRANSFAC, JASPAR

  • Motif representation:

    • Regular expression: Consensus CACAAAA

      binary decision Degenerate CRCAAAW

    • Position weight matrix (PWM): need score cutoff

Motif Matrix

Pos 12345678

ATGGCATG

AGGGTGCG

ATCGCATG

TTGCCACG

ATGGTATT

ATTGCACG

AGGGCGTT

ATGACATG

ATGGCATG

ACTGGATG

Sites


Iupac for dna

A adenosine

C cytidine

G guanine

T thymidine

U uridine

R G A (purine)

Y T C (pyrimidine)

K G T (keto)

M A C (amino)

S G C (strong)

W A T (weak)

B C G T (not A)

D A G T (not C)

H A C T (not G)

V A C G (not T)

N A C G T (any)

IUPAC for DNA


Protein binding microarrays
Protein Binding Microarrays

  • In vitro protein-DNA interactions

  • Better capture motifs


Jaspar
JASPAR

  • User defined cutoff to scan for a particular motif


A word on sequence logo
A Word on Sequence Logo

  • SeqLogo consists of stacks of symbols, one stack for each position in the sequence

  • The overall height of the stack indicates the sequence conservation at that position

  • The height of symbols within the stack indicates the relative frequency of nucleic acid at that position

ATGGCATG

AGGGTGCG

ATCGCATG

TTGCCACG

ATGGTATT

ATTGCACG

AGGGCGTT

ATGACATG

ATGGCATG

ACTGGATG


Scan known tf motifs
Scan Known TF Motifs

  • Drawbacks:

    • Limited number of motifs

    • Limited number of sites to represent each motif

      • Low sensitivity and specificity

    • Poor description of motif

      • Binding site borders not clear

      • Binding site many mismatches

    • Many motifs look very similar

      • E.g. GC-rich motif, E-box (CACGTG)


De novo sequence motif finding
De novo Sequence Motif Finding

  • Goal: look for common sequence patterns enriched in the input data (compared to the genome background)

  • Regular expression enumeration

    • Pattern driven approach

    • Enumerate patterns, check significance in dataset

    • Oligonucleotide analysis, MobyDick

  • Position weight matrix update

    • Data driven approach, use data to refine motifs

    • Consensus, EM & Gibbs sampling

    • Motif score and Markov background


Regular expression enumeration

Expected occurrence of w in the data

pw from

genome background

size of sequence data

Regular Expression Enumeration

  • Oligonucleotide Analysis: check over-representation for every w-mer:

    • Expected w occurrence in data

      • Consider genome sequence + current data size

    • Observed w occurrence in data

    • Over-represented w is potential TF binding motif

Observed occurrence of win the data


Mobydick
MobyDick

  • A sequence data and a dictionary of motif words

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC

ATGCTTCACATCGCATCACCAGTTCAGGATAGACACGGACG

GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA

TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG

CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT

D = {A, C, G, T}

Pw = {0.22, 0.28, 0.28, 0.22}


Mobydick1

A

C

G

T

A

AA

AC

AG

AT

C

CA

CC

CG

CT

G

GA

GC

GG

GT

T

TA

TC

TG

TT

MobyDick

  • A sequence data and a dictionary of motif words

  • Check over-representation of every word-pair

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC

ATGCTTCACATCGCATCACCAGTTCAGGATAGACACGGACG

GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA

TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG

CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT

D = {A, C, G, T}

Pw = {0.28, 0.22, 0.22, 0.28}


Mobydick2

A

C

G

T

A

AA

AC

AG

AT

C

CA

CC

CG

CT

G

GA

GC

GG

GT

T

TA

TC

TG

TT

MobyDick

  • A sequence data and a dictionary of motif words

  • Check over-representation of every word-pair

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC

ATGCTTCACATCGCATCACCAGTTCAGGATAGACACGGACG

GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA

TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG

CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT

D = {A, C, G, T}

Pw = {0.28, 0.28, 0.22, 0.22}

D = {A,C,G,T,AA,GA,TA,GG}

Pw = {?}


Mobydick3
MobyDick

  • D = {A,C,G,T,AA,GA,TA,GG}

  • Seq: AAGATAA

  • Possible partitions:

    A A G A T A A pA pA pG pA pT pA pA

    AA G A T A A pAA pG pA pT pA pA

    AA GA T A A pAA pGA pT pA pA

    AA GA TA A pAA pGA pTA pA

    A A GA T AA pAA pGA pT pAA

  • Assign probabilities as to maximize total probability of generating the sequence


Mobydick4

A

C

G

T

A

AA

AC

AG

AT

C

CA

CC

CG

CT

G

GA

GC

GG

GT

T

TA

TC

TG

TT

MobyDick

  • A sequence data and a dictionary of motif words

  • Check over-representation of every word-pair

  • Reassign word probability and consider every new word-pair to build even longer words

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC

ATGCTTCACATCGCATCACCAGTTCAGGATAGACACGGACG

GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA

TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG

CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT

D = {A, C, G, T}

Pw = {0.28, 0.28, 0.22, 0.22}

D = {A,C,G,T,AA,GA,TA,GG}

Pw = {?}


Regular expression enumeration1
Regular Expression Enumeration

  • RE Enumeration Derivatives:

    • oligo-analysis, spaced dyads w1.ns.w2

    • IUPAC alphabet

    • Markov background (later)

    • 2-bit encoding, fast index access

    • Enumerate limited RE patterns known for a TF protein structure or interaction theme

  • Exhaustive, guaranteed to find global optimum, and can find multiple motifs

  • Not as flexible with base substitutions, long list of similar good motifs, and limited with motif width


Consensus

Seq1

Seq2

Good Motifs

CACGTGC GTCAGTC

CACGTTC GTCAGTC

Bad Motifs

CACGTGC GTCAGTC

GTGACATTGGAAAT

Consensus

  • Starting from the 1st sequence, add one sequence at a time to look for the best motifs obtained with the additional sequence


Consensus1

Seq3

Good Motifs

CACGTGC GTCAGTC

CACGTTC GTCAGTC

CTCGTGC GACAGTC

Bad Motifs

CACGTGC GTCAGTC

CACGTTC GTCAGTC

TTCAAAGAGACTCA

Consensus

  • Starting from the 1st sequence, add one sequence at a time to look for the best motifs obtained with the additional sequence

Remaining good motifs


Consensus2
Consensus

  • Starting from the 1st sequence, add one sequence at a time to look for the best motifs obtained with the additional sequence

  • G Stormo, algorithm runs very fast

  • Sequence order plays a big role in performance

    • First two sequences better contain the motif

    • Sites stop accumulating at the first bad sequence

    • Newer version allowing [0-n] is much slower


Expectation maximization and gibbs sampling model
Expectation Maximization and Gibbs Sampling Model

  • Objects:

    • Seq: sequence data to search for motif

    • 0: non-motif (genome background) probability

    • : motif probability matrix parameter

    • : motif site locations

  • Problem: P(, | seq, 0)

  • Approach: alternately estimate

    •  by P( | , seq, 0)

    •  by P( | , seq, 0)

    • EM and Gibbs differ in the estimation methods


Expectation maximization

E step:  | , seq, 0

TTGACGACTGCACGT

TTGAC p1

TGACG p2

GACGA p3

ACGAC p4

CGACT p5

GACTG p6

ACTGC p7

CTGCA p8

...

P1 = likelihood ratio =

P(TTGAC| )

P(TTGAC| 0)

Expectation Maximization

p0T  p0T  p0G  p0A p0C

= 0.3  0.3  0.2  0.3  0.2


Expectation maximization1

E step:  | , seq, 0

TTGACGACTGCACGT

TTGAC p1

TGACG p2

GACGA p3

ACGAC p4

CGACT p5

GACTG p6

ACTGC p7

CTGCA p8

...

M step:  | , seq, 0

p1 TTGAC

p2 TGACG

p3 GACGA

p4 ACGAC

...

Scale ACGT at each position,  reflects weighted average of 

Expectation Maximization


Em derivatives
EM Derivatives

  • First EM motif finder (C Lawrence)

    • Deterministic algorithm, guarantee local optimum

  • MEME (TL Bailey)

    • Prior probability allows 0-n site / sequence

    • Parallel running multiple

      EM with different seed

    • User friendly results


Gibbs sampling
Gibbs Sampling

  • Stochastic process, although still may need multiple initializations

    • Sample  from P( | , seq, 0)

    • Sample  from P( | , seq, 0)

  • Collapsed form:

    •  estimatedwith counts, not sampling from Dirichlet

    • Sample site from one seq based on sites from other seqs

  • Converged motif matrix  and converged motif sites  represent stationary distribution of a Markov Chain


Gibbs Sampler

nA1 + sA

nA1 + sA +nC1 + sC +nG1 + sG +nT1 + sT

 estimated with counts

pA1 =

1

11

2

21

31

3

4

41

51

5

Initial 1

  • Randomly initialize a probability matrix


Gibbs sampler

1 Without

11Segment

Gibbs Sampler

  • Take out one sequence with its sites from current motif

11

21

31

41

51


Gibbs sampler1

1 Without

11Segment

Gibbs Sampler

  • Score each possible segment of this sequence

Sequence 1

Segment (1-8)

21

31

41

51


Gibbs Sampler

1 Without

11Segment

  • Score each possible segment of this sequence

Sequence 1

Segment (2-9)

21

31

41

51


Segment score

Motif Matrix

Pos 12345678

ATGGCATG

AGGGTGCG

ATCGCATG

TTGCCACG

ATGGTATT

ATTGCACG

AGGGCGTT

ATGACATG

ATGGCATG

ACTGGATG

Segment ATGCAGCT score =

p(generate ATGCAGCT from motif matrix)

p(generate ATGCAGCT from background)

p0A  p0T  p0G  p0C  p0A  p0G  p0C  p0T

Sites

Segment Score

  • Use current motif matrix to score a segment


Scoring segments
Scoring Segments

Motif 1 2 3 4 5 bg

A 0.4 0.1 0.3 0.4 0.2 0.3

T 0.2 0.5 0.1 0.2 0.2 0.3

G 0.2 0.2 0.2 0.3 0.4 0.2

C 0.2 0.2 0.4 0.1 0.2 0.2

Ignore pseudo counts for now…

Sequence: TTCCATATTAATCAGATTCCG… score

TAATC …

AATCA 0.4/0.3 x 0.1/0.3 x 0.1/0.3 x 0.1/0.2 x 0.2/0.3 = 0.049383

ATCAG 0.4/0.3 x 0.5/0.3 x 0.4/0.2 x 0.4/0.3 x 0.4/0.2 = 11.85185

TCAGA 0.2/0.3 x 0.2/0.3 x 0.3/0.3 x 0.3/0.2 x 0.2/0.3 = 0.444444

CAGAT …


Gibbs sampler2

12

Modified 1

 estimated with counts

Gibbs Sampler

  • Sample site from one seq based on sites from other seqs

21

31

41

51


How to sample
How to Sample?

  • Rand(subtotal) = X

  • Find the first position with subtotal larger than X


Gibbs sampler3

1 Without

21Segment

Gibbs Sampler

  • Repeat the process until motif converges

21

12

31

41

51


Gibbs sampler intuition
Gibbs Sampler Intuition

  • Beginning:

    • Randomly initialized motif

    • No preference towards any segment


Gibbs sampler intuition1
Gibbs Sampler Intuition

  • Motif appears:

    • Motif should have enriched signal (more sites)

    • By chance some correct sites come to alignment

    • Sites bias motif to attract other similar sites


Gibbs sampler intuition2
Gibbs Sampler Intuition

  • Motif converges:

    • All sites come to alignment

    • Motif totally biased to sample sites every time


Gibbs Sampler

1

2

3

4

5

1i

2i

3i

4i

5i

  • Column shift

  • Metropolis algorithm:

    • Propose * as  shifted 1 column to left or right

    • Calculate motif score u() and u(*)

    • Accept * with prob = min(1, u(*) / u())


Gibbs sampling derivatives
Gibbs Sampling Derivatives

  • Gibbs Motif Sampler (JS Liu)

    • Add prior probability to allow 0-n site / seq

    • Sample motif positions to consider

  • AlignACE (F Roth)

    • Look for motifs from both strands

    • Mask out one motif to find more different motifs

  • BioProspector (XS Liu)

    • Use background model with Markov dependencies

    • Sampling with threshold (0-n sites / seq), new scoring function

    • Can find two-block motifs with variable gap


Scoring motifs

Motif Matrix

Pos 12345678

ATGGCATG

AGGGTGCG

ATCGCATG

TTGCCACG

ATGGTATT

ATTGCACG

AGGGCGTT

ATGACATG

ATGGCATG

ACTGGATG

Segment ATGCAGCT score =

p(generate ATGCAGCT from motif matrix)

p(generate ATGCAGCT from background)

p0A  p0T  p0G  p0C  p0A  p0G  p0C  p0T

Sites

Scoring Motifs

  • Information Content (also known as relative entropy)

    • Suppose you have x aligned segments for the motif

    • pb(s1 from mtf) / pb(s1 from bg) *

      pb(s2 from mtf) / pb(s2 from bg) *…

      pb(sx from mtf) / pb(sx from bg)


Scoring motifs1

Motif Matrix

Pos 12345678

ATGGCATG

AGGGTGCG

ATCGCATG

TTGCCACG

ATGGTATT

ATTGCACG

AGGGCGTT

ATGACATG

ATGGCATG

ACTGGATG

Segment ATGCAGCT score =

p(generate ATGCAGCT from motif matrix)

p(generate ATGCAGCT from background)

p0A  p0T  p0G  p0C  p0A  p0G  p0C  p0T

Sites

Scoring Motifs

  • Information Content (also known as relative entropy)

    • Suppose you have x aligned segments for the motif

    • pb(s1 from mtf) / pb(s1 from bg) *

      pb(s2 from mtf) / pb(s2 from bg) *…

      pb(sx from mtf) / pb(sx from bg)


Scoring motifs2
Scoring Motifs

pb(s1 from mtf) / pb(s1 from bg) *

pb(s2 from mtf) / pb(s2 from bg) *…

pb(sx from mtf) / pb(sx from bg)

= (pA1/pA0)A1 (pT1/pT0)T1 (pT2/pT0)T2 (pG2/pG0)G2 (pC2/pC0)C2…

Take log of this:

= A1 log (pA1/pA0) + T1 log (pT1/pT0) +

T2 log (pT2/pT0) + G2 log (pG2/pG0) + …

Divide by the number of segments (if all the motifs have same number of segments)

= pA1 log (pA1/pA0) + pT1 log (pT1/pT0) + pT2 log (pT2/pT0)…

Pos 12345678

ATGGCATG

AGGGTGCG

ATCGCATG

TTGCCACG

ATGGTATT

ATTGCACG

AGGGCGTT

ATGACATG

ATGGCATG

ACTGGATG


Scoring motifs3

=

Motif Conservedness: How likely to see the current aligned segments from this motif model

Bad

AGGCA

ATCCC

GCGCA

CGGTA

TGCCA

ATGGT

TTGAA

Good

ATGCA

ATGCC

ATGCA

ATGCA

TTGCA

ATGGA

ATGCA

Scoring Motifs

  • Original function: Information Content


Scoring motifs4

=

Motif Specificity:

How likely to see the current aligned segments from background

Scoring Motifs

  • Original function: Information Content

Good

AGTCC

AGTCC

AGTCC

AGTCC

AGTCC

AGTCC

AGTCC

Bad

ATAAA

ATAAA

ATAAA

ATAAA

ATAAA

ATAAA

ATAAA


Scoring motifs5

=

Scoring Motifs

  • Original function: Information Content

    Which is better?

    (data = 8 seqs)

Motif 1

AGGCTAAC

AGGCTAAC

Motif 2

AGGCTAAC

AGGCTACC

AGGCTAAC

AGCCTAAC

AGGCCAAC

AGGCTAAC

TGGCTAAC

AGGCTTAC

AGGCTAAC

AGGGTAAC


Scoring motifs6

Specific (unlikely in genome background)

Motif Signal

Abundant

Positions

Conserved

Scoring Motifs

  • Motif scoring function:

  • Prefer: conserved motifs with many sites, but are not often seen in the genome background


Markov background increases motif specificity

Prefers motif segments enriched only in data, but not so likely to occur in the background

Segment ATGTA score =

p(generate ATGTA from )

p(generate ATGTA from 0)

3rd order Markov dependency

p( )

Markov Background Increases Motif Specificity

TCAGC = .25  .25  .25  .25  .25

.3  .18  .16  .22  .24

ATATA = .25  .25  .25  .25  .25

.3  .41  .38  .42  .30


Position weight matrix update
Position Weight Matrix Update likely to occur in the background

  • Advantage

    • Can look for motifs of any widths

    • Flexible with base substitutions

  • Disadvantage:

    • EM and Gibbs sampling: no guaranteed convergence time

    • No guaranteed global optimum


Motif finding in bacteria
Motif Finding in Bacteria likely to occur in the background

  • Promoter sequences are short (200-300 bp)

  • Motif are usually long (10-20 bases)

    • Some have two blocks with a gap, some are palindromes

    • Long motifs are usually very degenerate

  • Single microarray experiment sometimes already provides enough information to search for TF motifs


Motif finding in lower eukaryotes
Motif Finding in Lower Eukaryotes likely to occur in the background

  • Upstream sequences longer (500-1000 bp), with some simple repeats

  • Motif width varies (5 – 17 bases)

  • Expression clusters provide decent input sequences quality for TF motif finding

  • Motif combination and redundancy appears, although single motifs are usually significant enough for identification


Yeast promoter architecture
Yeast Promoter Architecture likely to occur in the background

  • Co-occurring regulators suggest physical interaction between the regulators


Motif finding in higher eukaryotes
Motif Finding in Higher Eukaryotes likely to occur in the background

  • Upstream sequences very long (3KB-20KB) with repeats, TF motif could appear downstream

  • Motifs can be short or long (6-20 bases), and appear in combination and clusters

  • Gene expression cluster not good enough input

  • Need:

    • Comparative Genomics: phastcons score

    • Motif modules: motif clusters

    • ChIP-chip/seq


Yeast regulatory sequence conservation
Yeast Regulatory Sequence Conservation likely to occur in the background


Ucsc phastcons conservation
UCSC PhastCons Conservation likely to occur in the background

  • Functional regulatory sequences are under stronger evolutionary constraint

  • Align orthologous sequences together

  • PhastCons conservation score (0 – 1) for each nucleotide in the genome can be downloaded from UCSC


Conserved motif clusters
Conserved Motif Clusters likely to occur in the background

  • First find conserved regions in the genome

  • Then look for repeated transcription factors (TF) binding sites

  • They form transcription factor modules


Summary
Summary likely to occur in the background

  • Biology and challenge of transcription regulation

  • Scan for known TF motif sites: TRANSFAC & JASPAR

  • De novo method

    • Regular expression enumeration

      • Oligonucleotide analysis

      • MobyDick: build long motifs from short ones

    • Position weight matrix update

      • CONSENSUS (sequence order)

      • EM (iterate , ;  ~ weighted  average)

      • Gibbs Sampler (sample , ; Markov chain convergence)

      • Motif score and Markov background

  • Motif cluster and motif conservation


ad