
A discriminative method for protein remote homology detection based on N-Gram

Reporter: Xie Sifa

Mentor: Zou Quan


Outline

Introduction

Method

Improving Recall and Precision

Conclusion


Introduction


Protein homology detection

detect 10%~30%

protein structure

Remote homology detection

...ATTATCCGACGGCCGCCT...

...TCATCTGCACGGCCTCAC...

Similarity<25%

-- Fundamentals of Bioinformatics (《生物信息学基础》), Sun Xiao, Lu Zuhong, Xie Jianming


Process

Data Set

Feature Extraction

Classify


Data Set

Benchmark (Liao and Noble, 2003)

4352 proteins, 54 families, similarity < 10^-25

[Figure: train/test split for family i. Train set: same superfamily, different family. Test set: same family, different family.]


Ngram

1-gram: 20

2-gram: 400

3-gram: 8000

(20^n possible n-grams over the 20 amino acids)
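The feature counts above follow from the 20-letter amino acid alphabet. A minimal sketch of n-gram counting is given here for illustration; the function names `ngram_vocabulary` and `ngram_counts` are hypothetical, and using raw counts (rather than normalized frequencies) is an assumption, since the slides do not say which was used.

```python
from itertools import product

# The 20 standard amino acids, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_vocabulary(n):
    """Return all 20**n possible n-grams, in a fixed lexicographic order."""
    return ["".join(letters) for letters in product(AMINO_ACIDS, repeat=n)]


def ngram_counts(sequence, n):
    """Count overlapping n-grams of a protein sequence.

    Returns a list of length 20**n aligned with ngram_vocabulary(n).
    Windows containing a non-standard letter are ignored (an assumption).
    """
    index = {gram: i for i, gram in enumerate(ngram_vocabulary(n))}
    counts = [0] * len(index)
    for start in range(len(sequence) - n + 1):
        position = index.get(sequence[start:start + n])
        if position is not None:
            counts[position] += 1
    return counts


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, len(ngram_counts("ACDA", n)))
```

Concatenating the 1-gram, 2-gram, and 3-gram vectors would give one fixed-length feature vector per protein, regardless of sequence length.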

"A Closer Look at Skip-gram Modelling"

-- David Guthrie, Ben Allison et al.

Skip-Ngram:

Sentence: "I hit the tennis ball"

3-grams: "I hit the", "hit the tennis", "the tennis ball"

Skip-gram: "hit the ball" !!!

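The skip-gram idea on this slide can be sketched as follows. This is an illustrative implementation assuming the usual k-skip-n-gram reading (an n-gram may skip at most k tokens in total); the function name `k_skip_ngrams` is hypothetical and the slides do not state which k was used.

```python
from itertools import combinations


def k_skip_ngrams(tokens, n, k):
    """Return all n-token subsequences that skip at most k tokens in total.

    With k = 0 this reduces to ordinary contiguous n-grams.
    """
    grams = []
    for positions in combinations(range(len(tokens)), n):
        # Tokens skipped between the first and last chosen position.
        skipped = positions[-1] - positions[0] - (n - 1)
        if skipped <= k:
            grams.append(tuple(tokens[i] for i in positions))
    return grams


if __name__ == "__main__":
    words = "I hit the tennis ball".split()
    for gram in k_skip_ngrams(words, n=3, k=1):
        print(" ".join(gram))
```

The same function works on a protein sequence by passing `list(sequence)` as the tokens. Enumerating all position combinations is simple but slow on long sequences; a windowed loop would be preferable in practice.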

Random Forest

Ensemble !!!


Result

ROC score: the area under the ROC curve

ROC50 score: the area under the ROC curve up to the first 50 false positives
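For readers unfamiliar with the truncated ROC measure, here is a small sketch of how such a score can be computed from ranked predictions. The function name `roc_n_score` and the normalization (dividing by the number of false positives considered times the number of positives) are assumptions based on the common definition, not taken from the slides.

```python
def roc_n_score(scores, labels, n=50):
    """Area under the ROC curve up to the first n false positives.

    scores: classifier outputs, higher means more likely positive.
    labels: truthy for positive examples, falsy for negative ones.
    The result is normalized to [0, 1]. Ties in score are broken by
    input order (a simplification).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    num_pos = sum(1 for _, label in ranked if label)
    num_neg = len(ranked) - num_pos
    n_eff = min(n, num_neg)  # cannot see more false positives than exist
    if num_pos == 0 or n_eff == 0:
        raise ValueError("need at least one positive and one negative example")

    true_pos_seen = 0
    false_pos_seen = 0
    area = 0
    for _, label in ranked:
        if label:
            true_pos_seen += 1
        else:
            false_pos_seen += 1
            # Each false positive adds the true positives ranked above it.
            area += true_pos_seen
            if false_pos_seen == n_eff:
                break
    return area / (n_eff * num_pos)


if __name__ == "__main__":
    print(roc_n_score([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

When `n` is at least the number of negatives, the value coincides with the ordinary area under the ROC curve.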



Improving Recall and Precision

Unbalanced data set

Trade-off
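For reference, the trade-off is usually stated with the standard definitions below, where TP, FP, and FN are the numbers of true positives, false positives, and false negatives at a given decision threshold. Reading the "F value" on later slides as the balanced F1 measure is an assumption.

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
```

Lowering the threshold raises recall and tends to lower precision, which matters most when positives are rare.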


Improving Recall and Precision

One family one threshold


Improving Recall and Precision

Train set scores, sorted, with class labels (+ positive, - negative):

0.98+, 0.95+, 0.93+, 0.92+, 0.90-, 0.87-, 0.85+, 0.84-, 0.81+, 0.79+, 0.77-, 0.75-, 0.73-, 0.69+, 0.65-, 0.62-, 0.58-, 0.55-, 0.53-

F value:

0.88, 0.85, 0.82, 0.79, 0.78, 0.76, 0.75, 0.72, 0.70, 0.68, 0.67, 0.63, 0.60, 0.57, 0.56, 0.54, 0.51, 0.49, 0.48

0.79

[Figure labels: New train, New test, F value. Note on the figure: "no value but position!"]
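The "one family one threshold" idea can be sketched as choosing, for each family, the score cutoff that maximizes the F value on that family's training predictions. The sketch below assumes the F value is F1 and uses the labelled training scores transcribed from this slide as sample data; the function name `best_f1_threshold` is hypothetical, and the slide's "position" variant is not reproduced here.

```python
def best_f1_threshold(scores, labels):
    """Pick the score cutoff that maximizes F1 on labelled predictions.

    An example is predicted positive when its score >= the cutoff.
    Returns (f1, threshold, precision, recall).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(1 for _, label in ranked if label)
    best = (0.0, None, 0.0, 0.0)
    true_pos = 0
    for k, (score, label) in enumerate(ranked, start=1):
        if label:
            true_pos += 1
        if true_pos == 0:
            continue  # F1 is undefined or zero without any true positive
        precision = true_pos / k
        recall = true_pos / total_pos
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[0]:
            best = (f1, score, precision, recall)
    return best


# Scores and labels as transcribed from the slide (True = "+").
TRAIN_SCORES = [0.98, 0.95, 0.93, 0.92, 0.90, 0.87, 0.85, 0.84, 0.81, 0.79,
                0.77, 0.75, 0.73, 0.69, 0.65, 0.62, 0.58, 0.55, 0.53]
TRAIN_LABELS = [True, True, True, True, False, False, True, False, True, True,
                False, False, False, True, False, False, False, False, False]

if __name__ == "__main__":
    f1, threshold, precision, recall = best_f1_threshold(TRAIN_SCORES, TRAIN_LABELS)
    print(f"threshold={threshold} F1={f1:.3f} P={precision:.3f} R={recall:.3f}")
```

Running one such search per family gives each family its own cutoff, instead of a single global threshold that may suit balanced families but not sparse ones.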



Conclusion

1. The N-gram model is successfully used to detect protein remote homology.

The results on the benchmark are satisfactory.

2. A novel method is proposed to improve the recall and precision of positive samples. This method yields values of 0.86752 and 0.56470 for mean recall and mean precision, respectively.

