Rule Extraction From Trained Neural Networks

Rule Extraction From Trained Neural Networks Brian Hudson University of Portsmouth, UK

Advantages • High accuracy • Robust • Noisy data • Disadvantages • Lack of comprehensibilty Artificial Neural Networks

Trepan • A method for extracting a decision tree from an artificial neural network (Craven, 1996). • The tree is built by expanding nodes in a best first manner, producing an unbalanced tree. • The splitting tests at the nodes are m-of-n tests • e.g. 2-of-{x1, ¬x2, x3}, where the xi are Boolean conditions • The network is used as an oracle to answer queries during the learning process.

Splitting Tests • Start with a set of candidate tests • binary tests on each value for nominal features • binary tests on thresholds for real-valued features • Find optimal splitting test by a beam search, initializing beam with candidate test maximizing the information gain.

Splitting Tests • To each m-of-n test in the beam and each candidate test, apply two operators: • m-of-(n+1) • e.g. 2-of-{x1, x2} => 2-of-{x1, x2, x3} • (m+1)-of-(n+1) • e.g. 2-of-{x1, x2} => 3-of-{x1, x2, x3} • Admit new tests to the beam if they increase the information gainand differ significantly(chi-squared) from existing tests.

Data Modelling • The amount of training data reaching each node decreases with depth of tree. • TREPAN creates new training cases by sampling the distributions of the training data • empirical distributions for nominal inputs • kernel density estimates for continuous inputs • Apply oracle (i.e. neural network) to new training cases to assign output values.

Application to Bioinformatics Prediction of Splice Junction sites in Eukaryotic DNA

Splice Junction Sites

Consensus Sequences • Donor -3 -2 -1 +1 +2 +3 +4 +5 +6 C/G A G | G T A/G A G T • Acceptor -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 1 C/T C/T C/T C/T C/T C/T C/T C/T C/T C/T A G |G

EBI Dataset • Clean dataset generated at EBI (Thanaraj, 1999) • Donors • training set: 567 positive, 943 negative • test set: 229 positive, 373 negative • Acceptors • training set: 637 positive, 468 negative • test set: 273 positive, 213 negative

Results

3 of {-2=A, -1=G, +3=A, +4=A, +5=G} Negative 43:533 Positive 869:74 TREPAN Donor Tree Yes No C/G A G | G T A/G A G T

C5 Donor Tree (extract) p5=G p3=C or p3=T => NEGATIVE p3=A p2=G => POSITIVE p2=A p4=A or p4=G => POSITIVE p4=C or p4=T => NEGATIVE p2=C p4=A => POSITIVE else => NEGATIVE p2=T p6=A or p6=G => NEGATIVE p6=C or p6=T => POSITIVE p3=G p4=T => NEGATIVE p4=C p6=T => POSITIVE else => NEGATIVE

1 of {-3=G, -5=G} NEGATIVE {-3=A} 2 of {+1!=G, -5=G} NEGATIVE POSITIVE NEGATIVE Trepan Acceptor Tree C/T … C/T A G| G

Application to Chemoinformatics Learning general rules Conformational Analysis QSAR dataset

Oprea Dataset • 137 diverse compounds • Classification • 62 leads, 75 drugs • 14 descriptors (from Cerius-2) • MW, MR, AlogP • Ndonor, Nacceptor, Nrotbond • Number of Lipinski violations • T.I. Oprea, A.M. Davis, S.J. Teague & P.D. Leeson, “Is there a difference between Leads & Drugs? A Historical Perspective”, J. Chem. Inf. & Comput. Sci., 41, 1308-1315, (2001).

C5 tree MW <= 380 [ Mode: lead ] Rule of 5 Violations = 0 [ Mode: lead ] Hbond acceptor <= 2 [ Mode: lead ] => lead Hbond acceptor > 2 [ Mode: drug ] => drug Rule of 5 Violations > 0 [ Mode: lead ] => lead MW > 380 [ Mode: drug ] => drug

1 of { MW<296, MR<85 } Lead 52:3 MW<454 Unclassified 12:49 Drug 1:20 Trepan Oprea Tree

Conformational Analysis • 300 conformations from • 5ns MD simulation of rosiglitazone • Classified by length of long axis into • Extended – distance > 10A • Folded – distance < 10A • 8 torsion angles • In house data.

Rosiglitazone • Agonist of PPAR gamma Nuclear Receptor • Regulates HDL/LDL and triglycerides • Active ingredient of Avandia for Type II Diabetes

Distances

C5 tree T5 <= 269 [ Mode: extended ] T5 <= 52 [ Mode: extended ] T7 <= 185 [ Mode: extended ] => extended T7 > 185 [ Mode: folded ] T6 <= 75 [ Mode: folded ] => folded T6 > 75 [ Mode: extended ] T5 <= 41 [ Mode: folded ] T8 <= 249 [ Mode: folded ] => folded T8 > 249 [ Mode: extended ] => extended T5 > 41 [ Mode: extended ] => extended T5 > 52 [ Mode: extended ] T6 <= 73 [ Mode: extended ] T8 <= 242 [ Mode: extended ] T5 <= 7 [ Mode: extended ] T8 <= 22 [ Mode: extended ] => extended T8 > 22 [ Mode: folded ] => folded T5 > 7 [ Mode: extended ] => extended T8 > 242 [ Mode: extended ] => extended T6 > 73 [ Mode: extended ] => extended T5 > 269 [ Mode: folded ] => folded

T5 < 180 Extended 133:0 2 of { T7<181, T2>172} Unclassified 2:5 Folded 0:161 Trepan Conformation Tree

Ferreira Dataset • “typical” QSAR dataset • 48 HIV-1 Protease inhibitors • Activity as pIC50 • Low pIC50 < 8.0 • High pIC50 > 8.0 • 14 descriptors (mostly topological) • R. Kiralj and M.M.C. Ferreira, “A-priori Molecular Descriptors in QSAR : a case of HIV-1 protease inhibitors I. The Chemometric Approach”, J. Mol. Graph. & Modell. 21, 435-448, (2003)

Original Results • PLS model • Activity determined by • X9,X11,X10,X13 • R2 = 0.91, Q2=0.85, Ncomps=3

C5 tree X11 <= 2.5 [ Mode: low ] X13 <= 16.7 [ Mode: low ] => low X13 > 16.7 [ Mode: high ] => high X11 > 2.5 [ Mode: high ] => high

1 of { X13<16.1, X9<3.4 } High 1:24 X1<552 X6<0.04 Low 17:1 Low 4:1 High 0:1 Trepan Ferreira Tree

Accuracy

Conclusions • Reasonable Accuracy • Comprehensible Rules

Acknowledgements • David Whitley. • Tony Browne. • Martyn Ford. • BBSRC grant reference BIO/12005.

Rule Extraction From Trained Neural Networks

Rule Extraction From Trained Neural Networks

Presentation Transcript

Neural Networks

Neural Networks

Neural Networks

Neural Networks

NEURAL NETWORKS

Neural Networks

Neural Networks

Neural Networks

Neural Networks

Neural Networks

Neural Networks

Neural Networks

Neural networks

Neural Networks

From Neurons to Neural Networks

Neural Networks

Neural Networks

Neural Networks

Neural Networks

Text Extraction from an Image Using Neural Networks

Neural Networks

Neural Networks