Presentation Transcript
slide2

Analyze/StripMiner™

  • Analyze/StripMiner™ Overview
  • To obtain an idiot’s guide type “analyze > readme.txt”
  • Standard Analyze Scripts
  • Predicting on Blind Data
  • PLS (Please Listen to Svante Wold)
  • LOO, BOO and n-Fold Cross-Validation Error Measures
  • Albumin Data Set and Feature Selection
  • Bio-Informatics
slide3

Analyze/StripMiner™

  • Modeling
    • ANN (Neural Networks)
    • SVM (Support Vector Machines)
    • PLS (Partial-Least Squares)
    • GA-based regression clustering
    • PCA regression
    • Local Learning
    • Outlier Detection (GAMOL)
  • Data Processing
    • Interface with RECON
    • Different Scaling Modes
    • Outlier detection/data cleansing
  • Visualization
    • Correlation Plots
    • 2-D Sensitivity Plots
    • Outlier Visualization Plots
    • Different Scaling Options
    • Cluster Ranking Plots
    • Standard ROC curves
    • Continuous ROC curves
  • Learning Modes
    • Bootstrapping
    • Bagging
    • Boosting
    • Leave-one-out cross-validation
  • Code Specifics
    • Tight Classic C-code (< 15000 lines)
    • Script-Based Shell Program
    • Runs on all Platforms
    • Ultra Fast
    • Use: TransScan – GE - KODAK
    • Doppler broadening
    • Macro-Economics Analysis
  • Feature Selection
    • Sensitivity Analysis
    • Genetic Algorithms
    • Correlation GA (GAFEAT)
    • Method specific

DDASSL

slide4

Analyze/StripMiner™ Coding Philosophy

  • Standard C code that compiles on all platforms
  • WINDOWS™ and Linux platforms
  • Supporting visualizations use Java and/or gnuplot
  • Flexible GUI with sample problems and demos
  • Fastest code possible with efficient memory requirements
  • Long history of code use with variety of users for troubleshooting
  • Flexible code based on scripts and operators
  • Operates on a numeric standard data mining format file
slide5

Practical Tips for PCA

  • NIPALS algorithm assumes the features are zero centered
  • It is standard practice to do a Mahalanobis scaling of the data
  • PCA regression does not consider the response data
  • The t’s are called the scores
  • It is common practice to drop 4-sigma outlier features (if there are many features)
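
The first tips above can be illustrated with a minimal Python sketch. It is not StripMiner code. It assumes that "Mahalanobis scaling" here means centering each feature and dividing by its standard deviation, and the function names `mahalanobis_scale` and `nipals_first_component` are made up for this illustration. The response is never used, which matches the remark that PCA regression does not consider the response data.

```python
import math

def mahalanobis_scale(rows):
    """Center each column to mean 0 and scale it to unit sample standard deviation.

    Assumption: this is what the slide means by Mahalanobis scaling.
    """
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
            for c, m in zip(cols, means)]
    # Constant columns (std == 0) become all zeros instead of dividing by zero.
    return [[(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
            for row in rows]

def nipals_first_component(X, max_iter=500, tol=1e-12):
    """Return (scores t, loadings p) of the first principal component.

    X must already be zero centered, as the NIPALS tip requires.
    """
    n_cols = len(X[0])
    # Start from the column with the largest sum of squares.
    start = max(range(n_cols), key=lambda j: sum(row[j] ** 2 for row in X))
    t = [row[start] for row in X]
    p = [0.0] * n_cols
    for _ in range(max_iter):
        tt = sum(v * v for v in t)
        # Loadings: project the columns of X onto the current scores.
        p = [sum(row[j] * ti for row, ti in zip(X, t)) / tt for j in range(n_cols)]
        norm = math.sqrt(sum(v * v for v in p))
        p = [v / norm for v in p]
        # Scores: project the rows of X onto the normalized loadings.
        t_new = [sum(a * b for a, b in zip(row, p)) for row in X]
        diff = sum((a - b) ** 2 for a, b in zip(t_new, t))
        t = t_new
        if diff < tol:
            break
    return t, p

if __name__ == "__main__":
    data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]]
    scores, loadings = nipals_first_component(mahalanobis_scale(data))
    print("scores (the t's):", scores)
    print("loadings:", loadings)
```

Further components would be obtained by subtracting the outer product of `t` and `p` from `X` and repeating the loop on the residual.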
slide6

StripMiner Script Examples

  • PCA visualization (pca.bat)
  • Pharma-plot (pharma.bat)
  • Prediction for iris with PCA (iris.bat)
  • Bootstrap prediction for iris (iris_boo.bat)
  • Predicting with an external test set example (iris_ext.bat)
  • PLS and ROC curve for iris problem (roc.bat)
  • Leave-One-Out PLS for HIV (loo_hiv.bat)
  • Feature selection for HIV (prune.bat)
  • Starplots (star.bat)
slide7

File Flow for pca.bat Script

num_eg.txt

stats.txt

la_sscala.txt

iris.txt.txt.txt.txt

  • num_eg.txt contains the number of principal components (2-10)
  • Usually the data are first Mahalanobis scaled (option #-3: “PLS scaling”, data only)
slide8

File Flow for pharma.bat script

num_eg.txt

stats.txt

la_sscala.txt

dmatrix.txt

a.txt

pharmaplot

• num_eg.txt has to contain a 4 for a pharmaplot

• use pharmaplot.m for visualization in MATLAB

• adjust color setting threshold in pharmaplot.m

slide9

File Flow For iris.bat Script: Predicting Class

stats.txt

la_sscala.txt

a.txt

cmatrix.txt

dmatrix.txt

resultss.xxx

resultss.ttt

results.xxx

results.ttt

num_eg.txt

  • Don’t use 0 as the random seed in the splitting routine (a seed of 0 preserves the original order)
  • The test set is really only for validation purposes (answer is known)
  • Note: descaling from PLS uses la_sscala.txt file
  • Notice the q2, Q2, and RMSE error measures
slide10

File Flow for iris_boo.bat Script:

Bootstrap Validation for Estimating Prediction Confidence

stats.txt

la_sscala.txt

a.txt

resultss.xxx

resultss.ttt

results.ttt

num_eg.txt

  • We use bootstrap cross-validation (e.g., leave 7 out 100 times)
  • Use MATLAB script dos_mbotw results.ttt to display results for the test set
  • Use MATLAB script dos_mbotw resultss.xxx to display results for the training set
  • Notice the q2, Q2, and RMSE error measures
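
The "leave 7 out 100 times" scheme can be sketched in Python as repeated random hold-out splits. This is an illustration only, not the splitting routine used by the scripts. The name `leave_k_out_splits` and its parameters are hypothetical, and the seed handling follows Python's `random` module rather than StripMiner's.

```python
import random

def leave_k_out_splits(n_samples, k=7, n_repeats=100, seed=1):
    """Yield (train_indices, test_indices) pairs for repeated random hold-out.

    Each repeat draws k distinct samples as the test set and trains on the rest.
    """
    rng = random.Random(seed)  # fixed seed makes the splits reproducible
    indices = list(range(n_samples))
    for _ in range(n_repeats):
        test = sorted(rng.sample(indices, k))
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

if __name__ == "__main__":
    # The iris data set has 150 samples.
    counts = [0] * 150
    for train, test in leave_k_out_splits(150, k=7, n_repeats=100, seed=3):
        for i in test:
            counts[i] += 1
    print("times each sample was held out (first 10):", counts[:10])
```

Collecting the predictions made for each sample over all repeats in which it was held out gives a spread of values, and that spread is what allows a prediction confidence to be estimated.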
slide11

Error Measure Criteria

For training set we use:

- RMSE: root mean square error for training set

- r2: squared correlation coefficient for training set

- R2: PRESS R2

For validation/test set we use:

- RMSE: root mean square error for validation set

- q2: 1 – r2, with r2 computed on the validation/test set

- Q2: PRESS/SD
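
A small Python sketch of these measures follows. It assumes that PRESS is the sum of squared prediction errors and that SD is the sum of squared deviations of the observed response from its mean. The slide does not spell out either definition, so treat both as assumptions. The function names are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error."""
    n = len(y_true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n)

def r2_corr(y_true, y_pred):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(y_true)
    mt = sum(y_true) / n
    mp = sum(y_pred) / n
    cov = sum((a - mt) * (b - mp) for a, b in zip(y_true, y_pred))
    vt = sum((a - mt) ** 2 for a in y_true)
    vp = sum((b - mp) ** 2 for b in y_pred)
    return cov * cov / (vt * vp)

def q2_small(y_true, y_pred):
    """q2 = 1 - r2 on the validation/test set (0 is best)."""
    return 1.0 - r2_corr(y_true, y_pred)

def q2_big(y_true, y_pred):
    """Q2 = PRESS / SD under the assumptions stated above (0 is best)."""
    mean = sum(y_true) / len(y_true)
    press = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    sd = sum((a - mean) ** 2 for a in y_true)
    return press / sd

if __name__ == "__main__":
    observed = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.1, 1.9, 3.2, 3.8]
    print("RMSE:", rmse(observed, predicted))
    print("q2:", q2_small(observed, predicted))
    print("Q2:", q2_big(observed, predicted))
```

With these definitions q2 only penalizes lack of correlation, while Q2 also penalizes a systematic offset or a wrong scale in the predictions, so Q2 is never smaller than would be expected from q2 alone when predictions are biased.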

slide12

Script for Scaling with an External Test Set

  • 3305 scatterplot (Java)
  • -3305 scatterplot (gnuplot)
  • 3313 errorplot (Java)
  • -3313 errorplot (gnuplot)
slide14

Docking Ligands is a Nonlinear Problem

DDASSL

Drug Design and Semi-Supervised Learning

slide15

Feature Selection (data strip mining)

PLS, K-PLS, SVM, ANN

Fuzzy Expert System Rules

GA or Sensitivity Analysis to select descriptors

slide16

Script for ALBUMIN_LOO.BAT: PLS-LOO Validation for Albumin Data

cmatrix.ori

dmatrix.ori

num_eg.txt

stats.txt

la_sscala.txt

a.txt

results.xxx

results.ttt

sel_lbls.txt

bbmatrixx.txt

bbmatrixxx.txt

  • PLS-LOO stands for leave-one-out PLS cross-validation
  • Training set is in cmatrix.ori and external validation set in dmatrix.ori
  • External validation set has –999 or 0 in the activity field
  • Note that we create generic labels and that there is a test set
  • Notice the dropping of non-changing features and 4-sigma outliers
  • Notice the acrobatics for displaying metrics (visualize with dos_mbotw)
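
Leave-one-out cross-validation itself is easy to sketch in Python. To keep the example short, a one-variable least-squares line stands in for PLS. That substitution, and the names `fit_line` and `loo_predictions`, are assumptions made for illustration and are not part of the script.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, my - slope * mx

def loo_predictions(xs, ys):
    """Predict each sample from a model trained on all the other samples."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]  # leave sample i out
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        preds.append(slope * xs[i] + intercept)
    return preds

if __name__ == "__main__":
    x = [0.0, 1.0, 2.0, 3.0, 4.0]
    y = [1.1, 2.9, 5.2, 6.8, 9.1]
    for observed, predicted in zip(y, loo_predictions(x, y)):
        print(observed, round(predicted, 3))
```

The left-out predictions are then scored against the observed activities with the validation error measures (q2, Q2, RMSE).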
slide19

PLS Feature Selection Script For Albumin Data

aa.pat

bbmatrixx.txt

sel_lbls.txt

select.txt

sel_lbls.txt

aa.pat

aa.tes

bbmatrixx.txt

bbmatrixxx.txt

  • Do several iterative prunings, typically leave 7 out 100 times
  • Use different seeds
  • Example numbers of selected features: 400, 300, 200, 150, 120, 100, 80, 60, 50, 45, …
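
The pruning schedule can be sketched in Python as a loop that repeatedly rescores the surviving features and keeps the best ones. The callback `score_features` is a hypothetical placeholder. In the workflow described here the scores would be sensitivities from bootstrapped PLS runs with different seeds, which this sketch does not implement.

```python
def iterative_prune(feature_names, score_features, schedule):
    """Shrink the feature set step by step along a schedule of target sizes.

    score_features(selected) must return a dict mapping each selected
    feature name to a score, where a higher score means more important.
    """
    selected = list(feature_names)
    for target in schedule:
        if target >= len(selected):
            continue  # nothing to prune at this step
        scores = score_features(selected)  # rescore on the current subset
        ranked = sorted(selected, key=lambda name: scores[name], reverse=True)
        selected = ranked[:target]
    return selected

if __name__ == "__main__":
    names = ["f%d" % i for i in range(500)]

    def toy_scores(selected):
        # Toy stand-in for a sensitivity analysis: score = feature index.
        return {name: int(name[1:]) for name in selected}

    kept = iterative_prune(names, toy_scores, [400, 300, 200, 150, 120, 100])
    print(len(kept), "features kept")
```

Rescoring after every step matters because the importance of a feature can change once correlated features have been removed, which is one reason to prune gradually instead of jumping straight to the final count.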
slide21

STARPLOT.BAT: Starplot for Selected Features for Albumin

sel_lbls.txt

aa.pat

bbmatrixxx.txt

sel_lbls.txt

starplot.txt

starplot

  • First generate bbmatrixxx.txt, which contains all sensitivities for (e.g.) 30 bootstraps, using PLS bootstrap option 33
  • Generate starplot.txt from bbmatrixxx.txt using option 3320
  • Use the MATLAB routine starplot.m (operates on starplot.txt and sel_lbls.txt)