


Comparative Motif Finding

CS 374 – Lecture 23

Mayukh Bhaowal


Reference Papers

  • Xiaohui Xie, Jun Lu, E. J. Kulbokas, Todd R. Golub, Vamsi Mootha, Kerstin Lindblad-Toh, Eric S. Lander, Manolis Kellis, “Systematic discovery of regulatory motifs in human promoters and 3’UTRs by comparison of several mammals”, Nature, 2005

  • Mathieu Blanchette and Martin Tompa, “Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting”, Genome Res. 2002 12: 739-748


What is a Motif?

  • A motif is a nucleotide sequence pattern that has biological significance.

  • Regulatory motifs are short DNA fragments involved in regulating gene expression.


Motif Logos

  • The height of each letter represents the probability of finding that nucleotide at that position in the motif
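A logo is drawn from per-position nucleotide frequencies. The sketch below shows one way to compute them from a set of aligned motif instances. The function name `position_frequencies` and the example sites are made up for illustration and do not come from the slides.

```python
from collections import Counter


def position_frequencies(sites):
    """Return one dict per motif position mapping nucleotide -> relative frequency."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    columns = []
    for i in range(length):
        # Count how often each base appears in column i of the aligned sites.
        counts = Counter(site[i] for site in sites)
        columns.append({base: counts.get(base, 0) / len(sites) for base in "ACGT"})
    return columns


# Hypothetical aligned instances of a single motif (illustrative data only).
example_sites = ["TGACCTTG", "TGACCTTG", "TGACCTCG", "TCACCTTG"]
freqs = position_frequencies(example_sites)
```

In this toy set the first position is always T, so `freqs[0]["T"]` is 1.0, while the second position is G in three of the four sites.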


Why is it difficult to find them?

1. Short fragments

2. Degenerate

3. Unpredictable

Motifs can occur on either strand.
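Because a motif can sit on either strand, a search has to test both the motif and its reverse complement. The following is a minimal sketch for exact matches only, so it ignores degeneracy. The helper names and the example sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_on_both_strands(sequence, motif):
    """Return (position, strand) pairs for exact matches.

    Positions are 0-based on the forward strand. A palindromic motif
    is reported on both strands at the same position.
    """
    rc = reverse_complement(motif)
    k = len(motif)
    hits = []
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window == motif:
            hits.append((i, "+"))
        if window == rc:
            hits.append((i, "-"))
    return hits


# Illustrative sequence: the motif at position 2, its reverse complement at 12.
hits = find_on_both_strands("AATGACCTTGCCCAAGGTCATT", "TGACCTTG")
```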


Promoter

  • In genetics, a promoter is a DNA sequence that enables a gene to be transcribed. The promoter is recognized by RNA polymerase, which then initiates transcription.


3’ UTR

  • The three prime untranslated region (3' UTR) is a particular section of messenger RNA (mRNA).

  • An mRNA codes for a protein through translation, but it also contains regions that are not translated. In eukaryotes these are the 5' untranslated region, the 3' untranslated region, the cap and the poly(A) tail.

Image source : http://en.wikipedia.org/wiki/Image:MRNA_structure.png


What the paper proposes

  • What? Discovering the regulatory motifs in human promoters and 3’ UTRs.

  • How? By comparing the corresponding sequences of several mammals, hence the name comparative motif finding.

  • Which mammals? Human, mouse, rat, dog.



Methods

  • Chose 17,700 well-annotated genes from the RefSeq database.

  • Promoters = 4 kb centered at the transcriptional start site (noncoding sequence only)

  • 3’ UTRs = based on the annotation of the reference mRNA

  • Intronic sequences as a control (last two introns from each gene)
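The promoter definition above amounts to a fixed window around the transcriptional start site. Here is a minimal sketch of that coordinate arithmetic, assuming 0-based half-open coordinates. The function name and parameters are hypothetical, and the sketch does not remove coding sequence, which the slide says is excluded.

```python
def promoter_window(tss, window=4000, chrom_length=None):
    """Return (start, end) of a window centered on the TSS.

    Coordinates are 0-based and half-open. The window is clipped at the
    chromosome start and, if chrom_length is given, at the chromosome end.
    """
    half = window // 2
    start = max(0, tss - half)
    end = tss + half
    if chrom_length is not None:
        end = min(end, chrom_length)
    return start, end


region = promoter_window(10_000)
```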


Motif Conservation Score

  • A motif is said to be conserved when an exact match is found in all 4 species.

  • Conservation = conserved occurrences / all occurrences

  • $\mathrm{MCS} = \dfrac{\text{observed conservation} - \text{random conservation}}{\text{standard deviation}}$
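The conservation definition can be sketched as a count over aligned sequences. The sketch assumes a toy, ungapped alignment in which the same index refers to the same aligned column in every species. The function name, the species keys and the sequences are invented for illustration.

```python
def conservation_rate(aligned, motif, reference="human"):
    """Count motif occurrences in the reference and how many are conserved.

    An occurrence is conserved when every species in `aligned` has an exact
    match at the same aligned position. Returns (conserved, total, rate).
    """
    ref = aligned[reference]
    k = len(motif)
    total = conserved = 0
    for i in range(len(ref) - k + 1):
        if ref[i:i + k] != motif:
            continue
        total += 1
        if all(seq[i:i + k] == motif for seq in aligned.values()):
            conserved += 1
    rate = conserved / total if total else 0.0
    return conserved, total, rate


# Toy ungapped alignment (made-up data): the second occurrence is lost in rat.
toy_alignment = {
    "human": "TGACCTTGAAATGACCTTG",
    "mouse": "TGACCTTGAAATGACCTTG",
    "rat":   "TGACCTTGAGATGACGTTG",
    "dog":   "TGACCTTGAAATGACCTTG",
}
result = conservation_rate(toy_alignment, "TGACCTTG")
```

Here the motif occurs twice in the human sequence and only the first occurrence is matched exactly in all four species, so the conservation rate is 0.5.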


Known highly conserved motif

  • ERRα [TGACCTTG]

  • Of the 434 occurrences of the ERRα motif in human promoter regions, 162 are conserved across all 4 species.

  • Conservation rate = 37%

  • A random 8-mer motif shows a conservation rate of only 6.8%
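The ERRα numbers can be turned into a score as follows. The slides do not say how the standard deviation is obtained, so this sketch assumes a binomial model in which each of the 434 occurrences is conserved independently with the random rate. The function name is hypothetical.

```python
import math


def motif_conservation_score(conserved, total, random_rate):
    """Z-score of the conserved count under an assumed binomial null model."""
    expected = total * random_rate
    # Assumption: binomial standard deviation of the conserved count.
    sd = math.sqrt(total * random_rate * (1 - random_rate))
    return (conserved - expected) / sd


rate = 162 / 434                                      # observed conservation rate
mcs = motif_conservation_score(162, 434, 0.068)       # random 8-mer rate = 6.8%
```

Under this assumption the observed rate rounds to 37% and the score comes out at roughly 25 standard deviations, far above the MCS > 6 cutoff used for the discovered motifs.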


Results: Promoter Region

  • 174 highly conserved motifs (MCS > 6)

  • 59 are strong matches to known motifs and 10 are weaker matches.

  • 105 potential new regulatory motifs


Approaches to explore biological significance

  • So how can we tell whether a discovered motif is biologically significant?

    1. tissue specificity

    2. positional bias


Tissue Specificity

  • Tissue specificity of expression for genes containing discovered motifs

  • Expression data for 75 tissues

  • 59 of the 69 known motifs and 53 of the 105 new motifs show tissue specificity


Position Bias

  • Motifs show position bias

  • Conserved motifs show strong position bias

  • Preferential occurrence within 100 bases of the TSS
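One simple way to quantify position bias is the fraction of motif occurrences that fall close to the TSS. The sketch below assumes positions are given relative to the TSS (negative values upstream). The function name and the example positions are invented for illustration.

```python
def fraction_near_tss(positions, radius=100):
    """Fraction of occurrences within `radius` bases of the TSS.

    `positions` are offsets relative to the TSS, which sits at 0.
    """
    if not positions:
        return 0.0
    near = sum(1 for p in positions if abs(p) <= radius)
    return near / len(positions)


# Made-up offsets of one motif's occurrences relative to the TSS.
example_positions = [-1500, -80, -35, 10, 60, 900, -95, 1800]
bias = fraction_near_tss(example_positions)
```

In this toy set, 5 of the 8 occurrences lie within 100 bases of the TSS. A motif with no position bias would put only about 5% of its occurrences in that 200-base window of a 4 kb promoter.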


Results: motifs in 3’ UTRs

  • 106 conserved motifs found in 3’ UTRs (MCS > 6)

  • 3’ UTR motifs had not been studied systematically before

  • Comparison of the discovered motifs to a large collection of previously known motifs is therefore not possible

  • Two unique properties

    • Strand specificity

    • Bias towards 8-mers


Property 1: strand specificity

Xie, X. et al., Nature, 2005


Property 2: bias towards 8-mers

Xie, X. et al., Nature, 2005


Digression: miRNA

  • Single-stranded RNA

  • Transcribed from DNA but not translated into protein

  • Many mature miRNAs start with a U followed by a 7-base “seed” complementary to a site in the 3’ UTR of target mRNAs.

  • Thus many of the target sites are 8-mers

Image: a microRNA that regulates insulin secretion, from an NYU study published in Nature.
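The seed relationship can be illustrated by deriving the 8-mer that a miRNA would be expected to pair with. The sketch takes the first 8 bases of the mature miRNA (the leading U plus the 7-base seed) and returns their reverse complement in DNA letters. The function name is hypothetical, and the let-7a sequence is used only as a familiar example. It does not appear in the slides.

```python
RNA_TO_DNA_COMPLEMENT = str.maketrans("ACGU", "TGCA")


def target_8mer(mirna):
    """Return the DNA 8-mer complementary to the first 8 bases of a mature miRNA."""
    first8 = mirna[:8].upper()
    # Complement each base, then reverse to read the target site 5' to 3'.
    return first8.translate(RNA_TO_DNA_COMPLEMENT)[::-1]


let7a = "UGAGGUAGUAGGUUGUAUAGUU"   # mature let-7a, used here as an example
site = target_8mer(let7a)
```

Because the miRNA starts with U, the resulting 8-mer ends in A. A strand-specific 8-mer motif of this kind in 3’ UTRs is what the slides' two properties describe.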


Inference

  • Thus we can infer that many of the conserved 8-mer motifs act as binding sites for miRNAs

  • Leads to the rediscovery of 52% of the existing miRNA genes

  • Leads to the discovery of 129 new miRNA genes



Problem Definition (why?)

  • A major challenge of current genomics is to understand how gene expression is regulated.

  • An important step towards this understanding is the capability to identify regulatory elements.


What?

  • Phylogenetic footprinting is

    1. a method for the discovery of regulatory elements

    2. in orthologous regulatory regions

    3. from multiple species.


Image source: http://www.biorecipes.com/Orthologues/code.html


Main idea

  • Because of selective pressure, coding sequences evolve at a slower rate than non-coding sequences

  • A mutation in a coding sequence can alter the whole function of the encoded protein

  • A mutation in a non-coding sequence (a regulatory element, RE) may only change how strongly a gene is expressed


Phylogenetic Footprinting

  • Study orthologous non-coding DNA from species that are related (phylogenetic tree)

    Differentiation:

  • Tree

  • Find one motif in many species

Well conserved = possible Regulatory Element


Formalization

Given:

  • phylogenetic tree T,

  • set of orthologous sequences at leaves of T,

  • length k of motif

  • threshold d

    Problem:

  • Find each set S of k-mers, one k-mer from each leaf, such that the “parsimony” score of S in T is at most d.
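To make the “parsimony” score concrete, the sketch below computes it for one fixed set S: the minimum number of substitutions needed on the tree when each leaf carries its chosen k-mer. It treats each motif column independently with a small dynamic program over the four bases. The tree encoding (a leaf name or a tuple of subtrees), the function name and the tree shape are assumptions made for illustration.

```python
def parsimony_score(tree, kmers):
    """Minimum number of substitutions explaining the leaf k-mers on the tree.

    tree  -- a leaf name (str) or a tuple of subtrees
    kmers -- dict mapping each leaf name to its chosen k-mer
    """
    k = len(next(iter(kmers.values())))
    inf = float("inf")

    def column_costs(node, col):
        # costs[b] = fewest substitutions in this subtree if the node has base b.
        if isinstance(node, str):
            return {b: (0 if kmers[node][col] == b else inf) for b in "ACGT"}
        costs = {b: 0 for b in "ACGT"}
        for child in node:
            child_costs = column_costs(child, col)
            for b in "ACGT":
                costs[b] += min(child_costs[c] + (b != c) for c in "ACGT")
        return costs

    return sum(min(column_costs(tree, col).values()) for col in range(k))


# Assumed tree shape for five species (not given in the slides).
example_tree = (("Human", "Chimp"), "Rabbit", ("Mouse", "Rat"))
chosen = {"Human": "ACGT", "Chimp": "ACGT", "Rabbit": "ACGT",
          "Mouse": "ACGT", "Rat": "ACGG"}
score = parsimony_score(example_tree, chosen)
```

With four leaves carrying ACGT and one carrying ACGG, a single substitution on the branch to the odd leaf explains the data, so the set would be accepted for any threshold d ≥ 1.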


Small Example

AGTCGTACGTGAC... (Human)

AGTAGACGTGCCG... (Chimp)

ACGTGAGATACGT... (Rabbit)

GAACGGAGTACGT... (Mouse)

TCGTGACGGTGAT... (Rat)

Size of motif sought: k = 4


Solution

AGTCGTACGTGAC...

AGTAGACGTGCCG...

ACGTGAGATACGT...

GAACGGAGTACGT...

TCGTGACGGTGAT...

ACGT

ACGT

ACGT

ACGG

Parsimony score: 1 mutation


An Exhaustive Algorithm

$W_u[s]$ = best parsimony score for the subtree rooted at node u, if u is labeled with string s.

AGTCGTACGTG

ACGGGACGTGC

ACGTGAGATAC

GAACGGAGTAC

TCGTGACGGTG

[Figure: the phylogenetic tree with a table of $W_u[s]$ values at each node, showing the entries for ACGG and ACGT. Node tables contain pairs such as ACGG: +, ACGT: 0; ACGG: 1, ACGT: 0; ACGG: 2, ACGT: 1; ACGG: 1, ACGT: 1; ACGG: 0, ACGT: 2; ACGG: 0, ACGT: +, where "+" presumably stands for +∞.]


Simple Recurrence

$$W_u[s] = \sum_{v:\ \text{child of } u} \min_{t} \left( W_v[t] + h(s, t) \right)$$

(h(s, t) = number of substitutions between s and t)

In words:

The score of a k-mer at a node is the sum, over the node's children, of each child's best parsimony score for that k-mer
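The recurrence can be sketched as a bottom-up table computation. This sketch assumes the sum runs over the children v of u, the minimum runs over all k-mers t, and h is the Hamming distance. A leaf gets score 0 for every k-mer that occurs in its sequence and infinity otherwise. The tree shape, the tree encoding and the function names are assumptions, and each sequence is assumed to contain at least k bases. The sequences are the ones from the small example.

```python
from itertools import product


def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))


def best_parsimony(tree, sequences, k):
    """Return (best score, root k-mer achieving it) over all k-mer choices."""
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    inf = float("inf")

    def table(node):
        if isinstance(node, str):
            # Leaf: only k-mers that occur in the sequence are allowed.
            seq = sequences[node]
            present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
            return {s: (0 if s in present else inf) for s in all_kmers}
        w = {s: 0 for s in all_kmers}
        for child in node:
            wv = table(child)
            finite = [(t, c) for t, c in wv.items() if c < inf]
            for s in all_kmers:
                # Best label t for this child, given label s at the parent.
                w[s] += min(c + hamming(s, t) for t, c in finite)
        return w

    root = table(tree)
    best = min(root, key=root.get)
    return root[best], best


# Assumed tree shape; sequences taken from the small example.
species_tree = (("Human", "Chimp"), "Rabbit", ("Mouse", "Rat"))
seqs = {
    "Human": "AGTCGTACGTGAC",
    "Chimp": "AGTAGACGTGCCG",
    "Rabbit": "ACGTGAGATACGT",
    "Mouse": "GAACGGAGTACGT",
    "Rat": "TCGTGACGGTGAT",
}
best_score, root_kmer = best_parsimony(species_tree, seqs, 4)
```

On these five sequences the sketch finds a best score of 1 with ACGT at the root, which agrees with the parsimony score of 1 mutation reported for the small example. The sketch returns only the best score. Reporting every set S with score at most d, as the problem statement asks, would additionally require tracing back through the tables.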


Running Time

$$W_u[s] = \sum_{v:\ \text{child of } u} \min_{t} \left( W_v[t] + h(s, t) \right)$$

$O(k \, 4^{2k})$ time per node

Total time $O(n \, k \, (4^{2k} + l))$

where n = number of species, k = motif length, l = average sequence length


Results

  • Metallothionein Gene Family

  • Insulin Gene Family

  • C-myc promoter


Metallothionein Gene Family

  • Large number of promoter sequences

  • Large number of RE

  • Binding sites occur within 300 bp of the start codon

  • 590 bp of sequence located upstream of start codon

  • Conserved elements of lengths 7,8,9,10 (K values)

  • Identified 12 motifs of which 4 have been confirmed

Analysis


Insulin Gene Family

  • two rodents and a pig (two gene copies each)

  • motifs with 0 mutations, K=8

  • motifs with 1 mutation, K=9,10

  • 4 conserved motifs identified

  • Several binding sites missed as they contain very few mutations

Analysis


C-myc Promoter

  • 7 species analyzed

  • Contains members from diverse animal phyla (fishes, birds, mammals, batrachians)

  • 4 of the 9 predictions are known binding sites

  • Most located in 120 bp promoter region

Analysis


Drawbacks

  • Some binding sites do not have significant matches in most other species

  • Some binding sites are well conserved only over sequences shorter than those FootPrinter looked at

T3R


Drawbacks cont’d

  • Deletions/Insertions

  • Failure to meet statistical significance

  • Some TFs bind as dimers where the binding site may consist of 2 conserved regions, separated by a few variable nucleotides


Thank You