Weka solution for the 2004 KDD Cup Protein Homology Prediction task

Bernhard Pfahringer

Weka Group, University of Waikato, New Zealand


The problem

  • Detect homologous protein sequences

  • 153 training sequences * ~1000 sequences ==> 145751 pairs, each classified as match or not

  • Very skewed: only 1296 matches (< 1%!)

  • BUT: excellent attributes
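
The skew follows directly from the two counts on the slide. The short check below recomputes it and also shows how high a trivial "always no match" predictor would score on plain accuracy; that baseline is an illustration added here, not something from the talk.

```python
# Counts taken from the slide above.
total_pairs = 145751
matches = 1296

# Fraction of pairs that are homologous matches.
match_rate = matches / total_pairs

# Accuracy of a hypothetical classifier that always predicts "no match"
# (illustrative baseline only, not part of the original solution).
baseline_accuracy = 1 - match_rate

print(f"match rate: {match_rate:.4%}")
print(f"always-'no match' accuracy: {baseline_accuracy:.4%}")
```

With so few positives, accuracy alone says little, which is consistent with the ranking-based criteria used for the competition.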


The attributes


Algorithms doing well

  • 2-fold cross-validation, looking only at predictive accuracy:

    • Linear SVM (with a logistic model on the output for better probability estimates; Platt 1999)

    • 10 AdaBoosted unpruned decision trees

    • Random rules (similar to RandomForest; ECML 2004 rule learning workshop)
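
To illustrate the "logistic model on the output" idea, here is a minimal sketch that maps raw SVM decision values to probabilities by fitting a one-dimensional logistic regression. It is a simplification: it uses plain gradient descent rather than the fitting procedure from Platt (1999) or whatever Weka does internally, and the function name, learning rate, epoch count, and toy scores are all assumptions made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_on_scores(scores, labels, lr=0.1, epochs=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on log loss.

    Hypothetical helper: a simplified stand-in for Platt-style calibration.
    """
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for f, y in zip(scores, labels):
            err = sigmoid(a * f + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * f
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Toy SVM decision values and true labels (made up for illustration).
scores = [-2.1, -1.3, -0.4, 0.2, 0.9, 1.7]
labels = [0, 0, 0, 1, 0, 1]

a, b = fit_logistic_on_scores(scores, labels)
probs = [sigmoid(a * f + b) for f in scores]
print([round(p, 3) for p in probs])
```

The fitted mapping is monotone in the SVM score, so it leaves the ranking unchanged and only affects criteria that depend on absolute probability values.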


Performance criteria

  • Top1: fraction of blocks with a homologous sequence ranked first (max)

  • RMSE: root mean squared error (min)

  • RKL: average rank of the lowest ranked homologous sequence (min)

  • APR: average of the average precision in each block (max)

  • Only RMSE depends on absolute values; for all other criteria a good ranking is sufficient
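
The four criteria can be sketched in a few lines. This is one reading of the slide's definitions, not the official evaluation code: it assumes each block is a list of (predicted probability, label) pairs, that every block contains at least one match, that RMSE is pooled over all pairs, and that ties keep their input order. All function names and the toy data are hypothetical.

```python
import math

def rank_block(block):
    """Sort one block's (score, label) pairs by descending score."""
    return sorted(block, key=lambda pair: pair[0], reverse=True)

def top1(blocks):
    """Fraction of blocks whose top-ranked pair is a match."""
    hits = sum(1 for b in blocks if rank_block(b)[0][1] == 1)
    return hits / len(blocks)

def rmse(blocks):
    """Root mean squared error between scores and 0/1 labels, pooled over all pairs."""
    errors = [(s - y) ** 2 for b in blocks for s, y in b]
    return math.sqrt(sum(errors) / len(errors))

def rkl(blocks):
    """Average rank (1-based) of the lowest-ranked match in each block."""
    last_ranks = []
    for b in blocks:
        ranked = rank_block(b)
        last_ranks.append(max(i for i, (_, y) in enumerate(ranked, start=1) if y == 1))
    return sum(last_ranks) / len(last_ranks)

def apr(blocks):
    """Mean over blocks of the average precision within each block."""
    block_aps = []
    for b in blocks:
        hits, precisions = 0, []
        for i, (_, y) in enumerate(rank_block(b), start=1):
            if y == 1:
                hits += 1
                precisions.append(hits / i)  # precision at each match position
        block_aps.append(sum(precisions) / len(precisions))
    return sum(block_aps) / len(block_aps)

# Two toy blocks of (score, label) pairs, made up for illustration.
blocks = [
    [(0.9, 1), (0.2, 0), (0.6, 1), (0.1, 0)],
    [(0.8, 0), (0.7, 1), (0.3, 0)],
]
print(top1(blocks), rmse(blocks), rkl(blocks), apr(blocks))
```

Note that `top1`, `rkl`, and `apr` only look at the order within a block, while `rmse` uses the scores themselves, which matches the last bullet above.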


Unique Solution

  • Voted ensemble of three classifiers:

    • Linear SVM + logistic model on output

    • 10 AdaBoosted unpruned J48 trees

    • 10^5 random rules

  • Non-standard voting:

    • If SVM and RandomRules agree ==>

      • Average their probabilities

    • ELSE

      • Use Booster as tie-breaker

  • Lucky (first on Proteins, 18th on Physics)
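
The voting rule can be sketched as a small function. Two details are not stated on the slide and are assumptions here: "agree" is taken to mean that both probabilities fall on the same side of a 0.5 threshold, and when the booster breaks a tie the sketch returns the probability of whichever model the booster sides with. The function and parameter names are hypothetical.

```python
def vote(p_svm, p_rules, p_boost, threshold=0.5):
    """Combine three match probabilities with the non-standard voting rule.

    Assumptions (not from the slide): agreement is judged at `threshold`,
    and on disagreement the output is the probability of the model that
    the booster agrees with.
    """
    svm_match = p_svm >= threshold
    rules_match = p_rules >= threshold

    if svm_match == rules_match:
        # SVM and random rules agree: average their probabilities.
        return (p_svm + p_rules) / 2

    # They disagree: the boosted trees act as tie-breaker.
    boost_match = p_boost >= threshold
    return p_svm if svm_match == boost_match else p_rules

print(vote(0.75, 0.5, 0.0))     # agreement: booster is ignored
print(vote(0.75, 0.25, 0.125))  # disagreement: booster sides with random rules
print(vote(0.75, 0.25, 0.875))  # disagreement: booster sides with the SVM
```

Under this reading the booster never contributes its own probability; it only decides which of the other two estimates is used.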


Ensemble performance


Attribute ranks


What I should have done

  • Optimize separately

  • Bagging for better probability estimates

  • More data engineering (e.g. PCA, …)

  • View it as an outlier detection problem

  • Utilize block structure

  • ?


(Standard) Lessons

  • Data engineering (good attributes) essential

  • Ensembles are more robust

  • Weka is not just an educational tool

    • [at least some parts scale well]

  • Java/open source DM tools are competitive

  • But: Weka could be improved considerably (volunteers and/or sponsors, get in touch :-)


Finally

  • A big “THANK YOU” to the organizers of the KDD Cup 2004!

