slide1
Download
Skip this Video
Download Presentation
MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Loading in 2 Seconds...

play fullscreen
1 / 23

MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers - PowerPoint PPT Presentation


  • 81 Views
  • Uploaded on

MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers. Pengyu Hong 10/06/2005. mRNA transcript. Binding sites. Regulators. Genes. Motivation. Understand transcriptional regulation. Gene X. TF. Model transcriptional regulatory networks. Motivation.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers' - morton


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
slide1

MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers

Pengyu Hong

10/06/2005

motivation

mRNA transcript

Binding sites

Regulators

Genes

Motivation
  • Understand transcriptional regulation

Gene X

TF

  • Model transcriptional regulatory networks
motivation1
Motivation

Previous works on motif finding

  • AlignACE (Hughes et al 2000)
  • ANN-Spec (Workman et al 2000)
  • BioProspector (Liu et al 2001)
  • Consensus (Hertz et al 1999)
  • Gibbs Motif Sampler (Lawrence et al 1993)
  • LogicMotif (Keles et al 2004)
  • MDScan (Liu et al 2002)
  • MEME (Bailey and Elkan 1995)
  • Motif Regressor (Colon et al 2003)
  • … …
motivation2

A A C A T C C G

• • •

• • •

Motivation

A widely used model – Motif Weight Matrix

(Stormo et al 1982)

1 2 3 4 5 6 7 8

A 0.19 1.11 -0.17 1.65 -2.65 -2.66 -1.98 0.92

C -0.14 -0.49 1.89 -1.81 1.70 2.32 2.14 -2.07

G -1.39 0.25 -1.22 -1.07 -2.07 -2.07 -2.07 1.13

T 0.86 -1.39 -2.65 -2.65 0.41 -2.65 -1.16 -1.80

Score of the site =

= 10.84

vs. threshold

+

A sequence is a target if it contains a binding site (score > threshold).

Computational << Molecular

motivation3
Motivation

Non-linear binding effects, e.g., different binding modes.

• • • CACCCATACAT • • •

Mode 1

Preferred binding

• • • CATCCGTACAT • • •

Mode 2

• • • CA C/T CC A/G TACAT • • •

• • • CACCCGTACAT • • •

Mode 3

Non-preferred binding

• • • CATCCATACAT • • •

Mode 4

modeling
Modeling

Model a TF-DNA binding classifier as an ensemble model.

ensemble model

weight

base classifier

modeling1

qm(Si)

hm(Si)

Modeling

The mth base classifier

Sequence scoring function:

fm(sik) is a site scoring function (weight matrix + threshold).

The scoring function considers

(a) the number of matching sites

(b) the degree of matching

training boosting

(a) Decide the number of base classifiers.

(b) Learn the parameters of each base classifier and its weight.

Training – Boosting

Modify the confidence-rated boosting (CRB) algorithm (Schapire et al. 1999) to train ensemble models

why boosting

Margin of training samples

Generalization error

Training error

Why Boosting?

Booting is a Newton-like technique that iteratively adds base classifiers to minimize the upper bound on the training error.

(Schapire et al. 1998)

challenges
Challenges

•Positive sequences – targets of a TF

•Negative sequences

  • Sequences are labeled, but not the sites in the sequences.
  • Cannot be well separated by the weight matrix model (linear).
  • Number of negative sequences >> number of positive sequences.
boosting
Boosting

Initialization

•Positive

  • Total weight of the positive samples == Total weight of the negative samples.
  • Since the motif must be an enriched pattern in the positive sequences, use Motif Regressor to find a seed motif matrix W0.

•Negative

boosting1
Boosting

Train a base classifier (BC)

•Positive

•Negative

  • Use the seed matrix W0 +to initialize the mth base classifier qm() and let m=1.
  • Refine m and the parameters of qm() to minimize

where yi is the label of Si and dim is the weight of Si in the mth round.

BC 1

  • Negative information is explicitly used to train qm() and m.
boosting2
Boosting

Adjust sample weights and gives higher weights to previously misclassified samples.

•Positive

•Negative

  • yi is the label of Si
  • dim is the weight of Si in the mth round.
  • dim+1 is the new weight of Si.

BC 1

boosting3
Boosting

Add a new base classifier

•Positive

•Negative

BC 1

BC 2

boosting4
Boosting

Add a new base classifier

•Positive

•Negative

Decision boundary

boosting5
Boosting

Adjust sample weights again

•Positive

•Negative

Decision boundary

boosting6
Boosting

Add one more base classifier

•Positive

•Negative

BC 3

boosting7
Boosting

Add one more base classifier

•Positive

•Negative

Decision boundary

boosting8
Boosting

•Positive

Stop if the result is perfect or the performance on the internal validation sequences drops.

•Negative

Decision boundary

results
Results

Data: ChIP-chip data of Saccharomyces cerevisiae (Lee et al. 2002 )

  • Positive sequences
    • p-value < 0.001
    • Number of positive sequences  25.
  • Negative sequences
    • p-value  0.05 & ratio  1

Got 40 TFs.

results1
Results

Leave-one-out test results

Boosted models vs. Seed weight matrices

Vertical axis: Improvements on specificity

Horizontal axis: TFs

results2
Results

Capture Position-Correlation

+

RAP1

0

Weight Matrix

Base classifier 1

Base classifier 2

Base classifier 3

Boosting

results3
Results

Capture Position-Correlation

REB1

Weight Matrix

Base classifier 1

Base classifier 2

Boosting

ad