Transcription regulation transcription factor motif finding
This presentation is the property of its rightful owner.
Sponsored Links
1 / 75

Transcription Regulation Transcription Factor Motif Finding PowerPoint PPT Presentation


  • 122 Views
  • Uploaded on
  • Presentation posted in: General

Transcription Regulation Transcription Factor Motif Finding. Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520. Outline. Biology of transcription regulation and challenges of computational motif finding Scan for known TF motif sites TRASFAC and JASPAR, Sequence Logo De novo method

Download Presentation

Transcription Regulation Transcription Factor Motif Finding

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Transcription regulation transcription factor motif finding

Transcription RegulationTranscription Factor Motif Finding

Xiaole Shirley Liu

STAT115, STAT215, BIO298, BIST520


Outline

Outline

  • Biology of transcription regulation and challenges of computational motif finding

  • Scan for known TF motif sites

    • TRASFAC and JASPAR, Sequence Logo

  • De novo method

    • Regular expression enumeration: w-mer enumerate

    • Position weight matrix update: EM and Gibbs

  • Motif finding in different organisms

    • Motif clusters and conservation


Imagine a chef

Restaurant Dinner

Home Lunch

Certain recipes used to

make certain dishes

Imagine a Chef


Each cell is like a chef

Each Cell Is Like a Chef


Each cell is like a chef1

Adult Liver

Infant Skin

Glucose, Oxygen,

Amino Acid

Fat, Alcohol

Nicotine

Certain genes expressed to

make certain proteins

Healthy

Skin Cell

State

Disease

Liver Cell

State

Each Cell Is Like a Chef


Understanding a genome

Understanding a Genome

Get the complete sequence

(encoded cook book)

Observe gene expressions

at different cell states

(meals prepared at different situations)

Decode gene regulation

(decode the book, understand the rules)


Information in dna

Coding region 2%

What is to be made

Milk->Yogurt

Egg->Omelet

Fish->Sushi

Flour->Cake

Beef->Burger

Information in DNA

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC

ATTTACCACATCGCATCACAGTTCAGGACTAGACACGGACG

GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA

TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG

CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT


Information in dna1

Morning

Morning

Butter

Japanese Restaurant

5 Oz

Butter

9 Oz

Information in DNA

Non-coding region 98%

Regulation: When, Where,

Amount, Other Conditions, etc

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC

ATTTACCACATCGCATCACTACGACGGACTAGACACGGACG

GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA

TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG

CGCTAGGTCATCCCAGATCTTGTTCGAATCGCGAATTGCCT

Coding region 2%

Milk->Yogurt

Egg->Omelet

Fish->Sushi

Flour->Cake

Beef->Burger


Measure gene expression

Measure Gene Expression

  • Microarray or SAGE detects the expression of every gene at a certain cell state

  • Clustering find genes that are co-expressed (potentially share regulation)


Decode gene regulation

Scrambled Egg

Bacon

Cereal

Hash Brown

Orange Juice

Decode Gene Regulation

Look at genes always expressed together:

Upstream RegionsCo-expressed

Genes

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATATTTACCACCAGTTCAGACACGGACGGC

GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA

TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG

CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT

STAT115, 04/01/2008


Decode gene regulation1

Decode Gene Regulation

Look at genes always expressed together:

Upstream RegionsCo-expressed

Genes

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATATTTACCACCAGTTCAGACACGGACGGC

GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA

TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG

CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT

Scrambled Egg

Bacon

Cereal

Hash Brown

Orange Juice

STAT115, 04/01/2008


Decode gene regulation2

Morning

Decode Gene Regulation

Look at genes always expressed together:

Upstream RegionsCo-expressed

Genes

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATATTTACCACCAGTTCAGACACGGACGGC

GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA

TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG

CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT

Scrambled Egg

Bacon

Cereal

Hash Brown

Orange Juice

STAT115, 04/01/2008


Biology of transcription regulation

Hemoglobin Beta

atttgctt ttcact gcaacct

aactccagt

Hemoglobin Zeta

gcaacct

actca

Hemoglobin Alpha

gcaacct

Hemoglobin Gamma

ccagcgccg

gcaacct

Transcription Factor (TF)

TF Binding Motif

gcaacct

Biology of Transcription Regulation

...acatttgcttctgacacaactgtgttcactagcaacctca...aacagacaccATGGTGCACCTGACTCCTGAGGAGAAGTCT...

...agcaggcccaactccagtgcagctgcaacctgcccactcc...ggcagcgcacATGTCTCTGACCAAGACTGAGAGTGCCGTC...

...cgctcgcgggccggcactcttctggtccccacagactcag...gatacccaccgATGGTGCTGTCTCCTGCCGACAAGACCAA...

...gccccgccagcgccgctaccgccctgcccccgggcgagcg...gatgcgcgagtATGGTGCTGTCTCCTGCCGACAAGACCAA...

Motif can only be computational discovered when

there are enough cases for machine learning


Computational motif finding

Computational Motif Finding

  • Input data:

    • Upstream sequences of gene expression profile cluster

    • 20-800 sequences, each 300-5000 bps long

  • Output: enriched sequence patterns (motifs)

  • Ultimate goals:

    • Which TFs are involved and their binding motifs and effects (enhance / repress gene expression)?

    • Which genes are regulated by this TF, why is there disease when a TF goes wrong?

    • Are there binding partner / competitor for a TF?


Challenges where what the signal

Water

Water

Water

Water

Water

Challenges: Where/what the signal

The motif should be abundant

GAAATATGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATATTTACCACCAAATAAGACACGGACGGC

GCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAA

TCTCGTATTTACCATATTAAATACCCACATCGAGAGCG

CGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT


Challenges where what the signal1

Coconut

Coconut

Coconut

Coconut

Coconut

Challenges: Where/what the signal

The motif should be abundant

And Abundant with significance

GAAATATGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATATTTACCACCAAATAAGACACGGACGGC

GCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAA

TCTCGTATTTACCATATTAAATACCCACATCGAGAGCG

CGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT


Challenges double stranded dna

|||||||||||||||||||||||||||||

GTGTAGCGTACCATTTATGGTCAAGTCTG

|||||||||||||||||||||||||||||

AGAGTCCATTTAGTCAGTATGATGGGTGT

Challenges: Double stranded DNA

Motif appears in both

strands

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC

TCTCAGGTAAATCAGTCATACTACCCACATCGAGAGCG


Challenges base substitutions

Challenges: Base substitutions

Sequences do not have to match the motif

perfectly, base substitutions are allowed

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATATGTACCACCAGTTCAGACACGGACGGC

GCCTCGATTTGCCGTGGTACAGTTCAAACCTGACTAAA

TCTCGTTAGGACCATATTTATCACCCACATCGAGAGCG

CGCTAGCCAATTACCGATCTTGTTCGAGAATTGCCTAT


Challenges variable motif copies

Challenges: Variable motif copies

Some sequences do not have the motif

Some have multiple copies of the motif

GATGATGCCTCGGACGGATATGCCCTACGACCTCTCGC

CACATCGCAATGCAGCAATGCGTTCAGACACGGACGGC

TCATGCTAATGCCAGTCATGCTACATGCATCGAGAGCG

GCCTCTAGCTAGGCCGGTGAACATCAGACCTGACTAAA

CGCAATATAGCATTAGCAGACAGACGAGAATTGCCTAT


Challenges variable motif copies1

Sushi

Fish

Fish

Fish

Hand Roll

Fish

Fish

Fish

Fish

Sashimi

Tempura

Sake

Challenges: Variable motif copies

Some sequences do not have the motif

Some have multiple copies of the motif

GATGATGCCTCGGACGGATATGCCCTACGACCTCTCGC

CACATCGCAATGCAGCAATGCGTTCAGACACGGACGGC

TCATGCTAATGCCAGTCATGCTACATGCATCGAGAGCG

GCCTCTAGCTAGGCCGGTGAACATCAGACCTGACTAAA

CGCAATATAGCATTAGCAGACAGACGAGAATTGCCTAT


Challenges two block motifs

Coconut

Milk

or palindromic patterns

AATGCG

GCGTAA

Challenges: Two-block motifs

Some motifs have two parts

GACACATTTACCTATGCTGGCCCTACGACCTCTCGC

CACAATTTACCACCA TGGCGTGATCTCAGACACGGACGGC

GCCTCGATTTACCGTGGTATGGCTAGTTCTCAAACCTGACTAAA

TCTCGTTAGATTTACCACCCATGGCCGTATCGAGAGCG

CGCTAGCCATTTACCGAT TGGCGTTCTCGAGAATTGCCTAT


Scan for known tf motif sites

Scan for Known TF Motif Sites

  • Experimental TF sites: TRANSFAC, JASPAR

  • Motif representation:

    • Regular expression: Consensus CACAAAA

      binary decisionDegenerate CRCAAAW

      IUPAC

A/G

A/T


Scan for known tf motif sites1

Segment ATGCAGCT score =

p(generate ATGCAGCT from motif matrix)

p(generate ATGCAGCT from background)

p0A  p0T  p0G  p0C  p0A  p0G  p0C  p0T

Scan for Known TF Motif Sites

  • Experimental TF sites: TRANSFAC, JASPAR

  • Motif representation:

    • Regular expression: Consensus CACAAAA

      binary decision Degenerate CRCAAAW

    • Position weight matrix (PWM): need score cutoff

Motif Matrix

Pos 12345678

ATGGCATG

AGGGTGCG

ATCGCATG

TTGCCACG

ATGGTATT

ATTGCACG

AGGGCGTT

ATGACATG

ATGGCATG

ACTGGATG

Sites


Iupac for dna

A adenosine

Ccytidine

Gguanine

T thymidine

U uridine

R G A (purine)

Y T C (pyrimidine)

K G T (keto)

M A C (amino)

S G C (strong)

W A T (weak)

B C G T (not A)

D A G T (not C)

H A C T (not G)

VA C G (not T)

NA C G T (any)

IUPAC for DNA


Protein binding microarrays

Protein Binding Microarrays

  • In vitro protein-DNA interactions

  • Better capture motifs


Jaspar

JASPAR

  • User defined cutoff to scan for a particular motif


A word on sequence logo

A Word on Sequence Logo

  • SeqLogo consists of stacks of symbols, one stack for each position in the sequence

  • The overall height of the stack indicates the sequence conservation at that position

  • The height of symbols within the stack indicates the relative frequency of nucleic acid at that position

ATGGCATG

AGGGTGCG

ATCGCATG

TTGCCACG

ATGGTATT

ATTGCACG

AGGGCGTT

ATGACATG

ATGGCATG

ACTGGATG


Scan known tf motifs

Scan Known TF Motifs

  • Drawbacks:

    • Limited number of motifs

    • Limited number of sites to represent each motif

      • Low sensitivity and specificity

    • Poor description of motif

      • Binding site borders not clear

      • Binding site many mismatches

    • Many motifs look very similar

      • E.g. GC-rich motif, E-box (CACGTG)


De novo sequence motif finding

De novo Sequence Motif Finding

  • Goal: look for common sequence patterns enriched in the input data (compared to the genome background)

  • Regular expression enumeration

    • Pattern driven approach

    • Enumerate patterns, check significance in dataset

    • Oligonucleotide analysis, MobyDick

  • Position weight matrix update

    • Data driven approach, use data to refine motifs

    • Consensus, EM & Gibbs sampling

    • Motif score and Markov background


Regular expression enumeration

Expected occurrence of w in the data

pw from

genome background

size of sequence data

Regular Expression Enumeration

  • Oligonucleotide Analysis: check over-representation for every w-mer:

    • Expected w occurrence in data

      • Consider genome sequence + current data size

    • Observed w occurrence in data

    • Over-represented w is potential TF binding motif

Observed occurrence of win the data


Mobydick

MobyDick

  • A sequence data and a dictionary of motif words

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC

ATGCTTCACATCGCATCACCAGTTCAGGATAGACACGGACG

GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA

TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG

CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT

D = {A, C, G, T}

Pw = {0.22, 0.28, 0.28, 0.22}


Mobydick1

A

C

G

T

A

AA

AC

AG

AT

C

CA

CC

CG

CT

G

GA

GC

GG

GT

T

TA

TC

TG

TT

MobyDick

  • A sequence data and a dictionary of motif words

  • Check over-representation of every word-pair

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC

ATGCTTCACATCGCATCACCAGTTCAGGATAGACACGGACG

GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA

TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG

CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT

D = {A, C, G, T}

Pw = {0.28, 0.22, 0.22, 0.28}


Mobydick2

A

C

G

T

A

AA

AC

AG

AT

C

CA

CC

CG

CT

G

GA

GC

GG

GT

T

TA

TC

TG

TT

MobyDick

  • A sequence data and a dictionary of motif words

  • Check over-representation of every word-pair

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC

ATGCTTCACATCGCATCACCAGTTCAGGATAGACACGGACG

GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA

TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG

CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT

D = {A, C, G, T}

Pw = {0.28, 0.28, 0.22, 0.22}

D = {A,C,G,T,AA,GA,TA,GG}

Pw = {?}


Mobydick3

MobyDick

  • D = {A,C,G,T,AA,GA,TA,GG}

  • Seq: AAGATAA

  • Possible partitions:

    A A G A T A ApA pA pG pA pT pA pA

    AA G A T A ApAA pG pA pT pA pA

    AA GA T A ApAA pGA pT pA pA

    AA GA TA ApAA pGA pTA pA

    A A GA T AApAA pGA pT pAA

  • Assign probabilities as to maximize total probability of generating the sequence


Mobydick4

A

C

G

T

A

AA

AC

AG

AT

C

CA

CC

CG

CT

G

GA

GC

GG

GT

T

TA

TC

TG

TT

MobyDick

  • A sequence data and a dictionary of motif words

  • Check over-representation of every word-pair

  • Reassign word probability and consider every new word-pair to build even longer words

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC

ATGCTTCACATCGCATCACCAGTTCAGGATAGACACGGACG

GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA

TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG

CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT

D = {A, C, G, T}

Pw = {0.28, 0.28, 0.22, 0.22}

D = {A,C,G,T,AA,GA,TA,GG}

Pw = {?}


Regular expression enumeration1

Regular Expression Enumeration

  • RE Enumeration Derivatives:

    • oligo-analysis, spaced dyads w1.ns.w2

    • IUPAC alphabet

    • Markov background (later)

    • 2-bit encoding, fast index access

    • Enumerate limited RE patterns known for a TF protein structure or interaction theme

  • Exhaustive, guaranteed to find global optimum, and can find multiple motifs

  • Not as flexible with base substitutions, long list of similar good motifs, and limited with motif width


Consensus

Seq1

Seq2

Good Motifs

CACGTGC GTCAGTC

CACGTTC GTCAGTC

Bad Motifs

CACGTGC GTCAGTC

GTGACATTGGAAAT

Consensus

  • Starting from the 1st sequence, add one sequence at a time to look for the best motifs obtained with the additional sequence


Consensus1

Seq3

Good Motifs

CACGTGC GTCAGTC

CACGTTC GTCAGTC

CTCGTGC GACAGTC

Bad Motifs

CACGTGC GTCAGTC

CACGTTC GTCAGTC

TTCAAAGAGACTCA

Consensus

  • Starting from the 1st sequence, add one sequence at a time to look for the best motifs obtained with the additional sequence

Remaining good motifs


Consensus2

Consensus

  • Starting from the 1st sequence, add one sequence at a time to look for the best motifs obtained with the additional sequence

  • G Stormo, algorithm runs very fast

  • Sequence order plays a big role in performance

    • First two sequences better contain the motif

    • Sites stop accumulating at the first bad sequence

    • Newer version allowing [0-n] is much slower


Expectation maximization and gibbs sampling model

Expectation Maximization and Gibbs Sampling Model

  • Objects:

    • Seq: sequence data to search for motif

    • 0: non-motif (genome background) probability

    • : motif probability matrix parameter

    • : motif site locations

  • Problem: P(, | seq, 0)

  • Approach: alternately estimate

    •  by P( | , seq, 0)

    •  by P( | , seq, 0)

    • EM and Gibbs differ in the estimation methods


Expectation maximization

E step:  | , seq, 0

TTGACGACTGCACGT

TTGACp1

TGACGp2

GACGAp3

ACGACp4

CGACTp5

GACTGp6

ACTGCp7

CTGCAp8

...

P1 = likelihood ratio =

P(TTGAC| )

P(TTGAC| 0)

Expectation Maximization

p0T  p0T  p0G  p0A p0C

= 0.3  0.3  0.2  0.3  0.2


Expectation maximization1

E step:  | , seq, 0

TTGACGACTGCACGT

TTGACp1

TGACGp2

GACGAp3

ACGACp4

CGACTp5

GACTGp6

ACTGCp7

CTGCAp8

...

M step:  | , seq, 0

p1 TTGAC

p2 TGACG

p3 GACGA

p4 ACGAC

...

Scale ACGT at each position,  reflects weighted average of 

Expectation Maximization


Em derivatives

EM Derivatives

  • First EM motif finder (C Lawrence)

    • Deterministic algorithm, guarantee local optimum

  • MEME (TL Bailey)

    • Prior probability allows 0-n site / sequence

    • Parallel running multiple

      EM with different seed

    • User friendly results


Gibbs sampling

Gibbs Sampling

  • Stochastic process, although still may need multiple initializations

    • Sample  from P( | , seq, 0)

    • Sample  from P( | , seq, 0)

  • Collapsed form:

    •  estimatedwith counts, not sampling from Dirichlet

    • Sample site from one seq based on sites from other seqs

  • Converged motif matrix  and converged motif sites  represent stationary distribution of a Markov Chain


Transcription regulation transcription factor motif finding

Gibbs Sampler

nA1 + sA

nA1 + sA +nC1 + sC +nG1 + sG +nT1 + sT

 estimated with counts

pA1 =

1

11

2

21

31

3

4

41

51

5

Initial 1

  • Randomly initialize a probability matrix


Gibbs sampler

1 Without

11Segment

Gibbs Sampler

  • Take out one sequence with its sites from current motif

11

21

31

41

51


Gibbs sampler1

1 Without

11Segment

Gibbs Sampler

  • Score each possible segment of this sequence

Sequence 1

Segment (1-8)

21

31

41

51


Transcription regulation transcription factor motif finding

Gibbs Sampler

1 Without

11Segment

  • Score each possible segment of this sequence

Sequence 1

Segment (2-9)

21

31

41

51


Segment score

Motif Matrix

Pos 12345678

ATGGCATG

AGGGTGCG

ATCGCATG

TTGCCACG

ATGGTATT

ATTGCACG

AGGGCGTT

ATGACATG

ATGGCATG

ACTGGATG

Segment ATGCAGCT score =

p(generate ATGCAGCT from motif matrix)

p(generate ATGCAGCT from background)

p0A  p0T  p0G  p0C  p0A  p0G  p0C  p0T

Sites

Segment Score

  • Use current motif matrix to score a segment


Scoring segments

Scoring Segments

Motif12345bg

A0.40.10.30.40.20.3

T0.20.50.10.20.20.3

G0.20.20.20.30.40.2

C0.20.20.40.10.20.2

Ignore pseudo counts for now…

Sequence: TTCCATATTAATCAGATTCCG… score

TAATC…

AATCA0.4/0.3 x 0.1/0.3 x 0.1/0.3 x 0.1/0.2 x 0.2/0.3 = 0.049383

ATCAG0.4/0.3 x 0.5/0.3 x 0.4/0.2 x 0.4/0.3 x 0.4/0.2 = 11.85185

TCAGA0.2/0.3 x 0.2/0.3 x 0.3/0.3 x 0.3/0.2 x 0.2/0.3 = 0.444444

CAGAT…


Gibbs sampler2

12

Modified 1

 estimated with counts

Gibbs Sampler

  • Sample site from one seq based on sites from other seqs

21

31

41

51


How to sample

How to Sample?

  • Rand(subtotal) = X

  • Find the first position with subtotal larger than X


Gibbs sampler3

1 Without

21Segment

Gibbs Sampler

  • Repeat the process until motif converges

21

12

31

41

51


Gibbs sampler intuition

Gibbs Sampler Intuition

  • Beginning:

    • Randomly initialized motif

    • No preference towards any segment


Gibbs sampler intuition1

Gibbs Sampler Intuition

  • Motif appears:

    • Motif should have enriched signal (more sites)

    • By chance some correct sites come to alignment

    • Sites bias motif to attract other similar sites


Gibbs sampler intuition2

Gibbs Sampler Intuition

  • Motif converges:

    • All sites come to alignment

    • Motif totally biased to sample sites every time


Transcription regulation transcription factor motif finding

Gibbs Sampler

1

2

3

4

5

1i

2i

3i

4i

5i

  • Column shift

  • Metropolis algorithm:

    • Propose * as  shifted 1 column to left or right

    • Calculate motif score u() and u(*)

    • Accept * with prob = min(1, u(*) / u())


Gibbs sampling derivatives

Gibbs Sampling Derivatives

  • Gibbs Motif Sampler (JS Liu)

    • Add prior probability to allow 0-n site / seq

    • Sample motif positions to consider

  • AlignACE (F Roth)

    • Look for motifs from both strands

    • Mask out one motif to find more different motifs

  • BioProspector (XS Liu)

    • Use background model with Markov dependencies

    • Sampling with threshold (0-n sites / seq), new scoring function

    • Can find two-block motifs with variable gap


Scoring motifs

Motif Matrix

Pos 12345678

ATGGCATG

AGGGTGCG

ATCGCATG

TTGCCACG

ATGGTATT

ATTGCACG

AGGGCGTT

ATGACATG

ATGGCATG

ACTGGATG

Segment ATGCAGCT score =

p(generate ATGCAGCT from motif matrix)

p(generate ATGCAGCT from background)

p0A  p0T  p0G  p0C  p0A  p0G  p0C  p0T

Sites

Scoring Motifs

  • Information Content (also known as relative entropy)

    • Suppose you have x aligned segments for the motif

    • pb(s1 from mtf) / pb(s1 from bg) *

      pb(s2 from mtf) / pb(s2 from bg) *…

      pb(sx from mtf) / pb(sx from bg)


Scoring motifs1

Motif Matrix

Pos 12345678

ATGGCATG

AGGGTGCG

ATCGCATG

TTGCCACG

ATGGTATT

ATTGCACG

AGGGCGTT

ATGACATG

ATGGCATG

ACTGGATG

Segment ATGCAGCT score =

p(generate ATGCAGCT from motif matrix)

p(generate ATGCAGCT from background)

p0A  p0T  p0G  p0C  p0A  p0G  p0C  p0T

Sites

Scoring Motifs

  • Information Content (also known as relative entropy)

    • Suppose you have x aligned segments for the motif

    • pb(s1 from mtf) / pb(s1 from bg) *

      pb(s2 from mtf) / pb(s2 from bg) *…

      pb(sx from mtf) / pb(sx from bg)


Scoring motifs2

Scoring Motifs

pb(s1 from mtf) / pb(s1 from bg) *

pb(s2 from mtf) / pb(s2 from bg) *…

pb(sx from mtf) / pb(sx from bg)

= (pA1/pA0)A1 (pT1/pT0)T1 (pT2/pT0)T2 (pG2/pG0)G2 (pC2/pC0)C2…

Take log of this:

= A1 log (pA1/pA0) + T1 log (pT1/pT0) +

T2 log (pT2/pT0) + G2 log (pG2/pG0) + …

Divide by the number of segments (if all the motifs have same number of segments)

= pA1 log (pA1/pA0) + pT1 log (pT1/pT0) + pT2 log (pT2/pT0)…

Pos 12345678

ATGGCATG

AGGGTGCG

ATCGCATG

TTGCCACG

ATGGTATT

ATTGCACG

AGGGCGTT

ATGACATG

ATGGCATG

ACTGGATG


Scoring motifs3

=

Motif Conservedness: How likely to see the current aligned segments from this motif model

Bad

AGGCA

ATCCC

GCGCA

CGGTA

TGCCA

ATGGT

TTGAA

Good

ATGCA

ATGCC

ATGCA

ATGCA

TTGCA

ATGGA

ATGCA

Scoring Motifs

  • Original function: Information Content


Scoring motifs4

=

Motif Specificity:

How likely to see the current aligned segments from background

Scoring Motifs

  • Original function: Information Content

Good

AGTCC

AGTCC

AGTCC

AGTCC

AGTCC

AGTCC

AGTCC

Bad

ATAAA

ATAAA

ATAAA

ATAAA

ATAAA

ATAAA

ATAAA


Scoring motifs5

=

Scoring Motifs

  • Original function: Information Content

    Which is better?

    (data = 8 seqs)

Motif 1

AGGCTAAC

AGGCTAAC

Motif 2

AGGCTAAC

AGGCTACC

AGGCTAAC

AGCCTAAC

AGGCCAAC

AGGCTAAC

TGGCTAAC

AGGCTTAC

AGGCTAAC

AGGGTAAC


Scoring motifs6

Specific (unlikely in genome background)

Motif Signal

Abundant

Positions

Conserved

Scoring Motifs

  • Motif scoring function:

  • Prefer: conserved motifs with many sites, but are not often seen in the genome background


Markov background increases motif specificity

Prefers motif segments enriched only in data, but not so likely to occur in the background

Segment ATGTA score =

p(generate ATGTA from )

p(generate ATGTA from 0)

3rd order Markov dependency

p( )

Markov Background Increases Motif Specificity

TCAGC = .25  .25  .25  .25  .25

.3  .18  .16  .22  .24

ATATA = .25  .25  .25  .25  .25

.3  .41  .38  .42  .30


Position weight matrix update

Position Weight Matrix Update

  • Advantage

    • Can look for motifs of any widths

    • Flexible with base substitutions

  • Disadvantage:

    • EM and Gibbs sampling: no guaranteed convergence time

    • No guaranteed global optimum


Motif finding in bacteria

Motif Finding in Bacteria

  • Promoter sequences are short (200-300 bp)

  • Motif are usually long (10-20 bases)

    • Some have two blocks with a gap, some are palindromes

    • Long motifs are usually very degenerate

  • Single microarray experiment sometimes already provides enough information to search for TF motifs


Motif finding in lower eukaryotes

Motif Finding in Lower Eukaryotes

  • Upstream sequences longer (500-1000 bp), with some simple repeats

  • Motif width varies (5 – 17 bases)

  • Expression clusters provide decent input sequences quality for TF motif finding

  • Motif combination and redundancy appears, although single motifs are usually significant enough for identification


Yeast promoter architecture

Yeast Promoter Architecture

  • Co-occurring regulators suggest physical interaction between the regulators


Motif finding in higher eukaryotes

Motif Finding in Higher Eukaryotes

  • Upstream sequences very long (3KB-20KB) with repeats, TF motif could appear downstream

  • Motifs can be short or long (6-20 bases), and appear in combination and clusters

  • Gene expression cluster not good enough input

  • Need:

    • Comparative Genomics: phastcons score

    • Motif modules: motif clusters

    • ChIP-chip/seq


Yeast regulatory sequence conservation

Yeast Regulatory Sequence Conservation


Ucsc phastcons conservation

UCSC PhastCons Conservation

  • Functional regulatory sequences are under stronger evolutionary constraint

  • Align orthologous sequences together

  • PhastCons conservation score (0 – 1) for each nucleotide in the genome can be downloaded from UCSC


Conserved motif clusters

Conserved Motif Clusters

  • First find conserved regions in the genome

  • Then look for repeated transcription factors (TF) binding sites

  • They form transcription factor modules


Summary

Summary

  • Biology and challenge of transcription regulation

  • Scan for known TF motif sites: TRANSFAC & JASPAR

  • De novo method

    • Regular expression enumeration

      • Oligonucleotide analysis

      • MobyDick: build long motifs from short ones

    • Position weight matrix update

      • CONSENSUS (sequence order)

      • EM (iterate , ;  ~ weighted  average)

      • Gibbs Sampler (sample , ; Markov chain convergence)

      • Motif score and Markov background

  • Motif cluster and motif conservation


  • Login