
Presentation Transcript


  1. Experiment Databases: A novel methodology for experimental research in machine learning and data mining. Hendrik Blockeel, Katholieke Universiteit Leuven

  2. Overview • Motivation • Disadvantages of current practice • How to improve this • Experiment databases • Setting up an experiment database • Querying an experiment database • An illustration • Conclusions

  3. Motivation

  4. Motivation • Much research in machine learning / data mining involves experimental evaluation • Interpreting experimental results is more difficult than it may seem • Generalizability is not always clear • Reproducibility is often low • Moreover, experiments are not reusable • Experimental hypothesis changes -> need new experiments • Can we improve this?

  5. 1) Generalizability • In a typical experiment, a few specific implementations of algorithms, with specific parameter settings, are compared on a few datasets, and then general conclusions are drawn • How generalizable are these results really?

  6. The curse of dimensionality: very sparse evidence! • High-dimensional algorithm parameter space AP x high-dimensional dataset characteristics space DC = very high-dimensional space AP x DC • A few runs on a few datasets = very sparse evidence!

  7. Evidence suggests that overly general conclusions are often drawn • E.g., Perlich et al.: different relative performance of techniques depending on the size of the dataset • Perlich, Provost, Simonoff, Tree induction vs. logistic regression: a learning curve analysis, JMLR Dec. 2003 • How many papers explicitly take this size into account? • We should strive for better coverage of the whole AP x DC space

  8. 2) Reproducibility • Experiments are often performed: • with a specific implementation (“version 1.3.23”) of some algorithm... • with certain parameter settings... • sometimes on specific versions of datasets... • Unless all this information is logged, experiments are not repeatable • The information is definitely not given in most papers! • Many techniques are way too complex to give a full description anyway

  9. 3) Reusability • A typical story: • I want to measure the effect of parameter “minsupport” on the performance of Apriori • I find some effect, indeed • But I suddenly wonder whether this effect is different for different values of the “sparseness” of the data • I have to set up a new batch of experiments to do this! (takes time and effort... => I probably won’t do it) • If only I had recorded the “sparseness” feature from the beginning...

  10. An improved methodology • We argue here in favour of an improved experimental methodology: • Perform many more experiments • Better coverage of AP x DC space • Store results in an “experiment database” • Better reproducibility • Query / mine that database for patterns • More advanced analysis possible • This approach has some characteristics of “inductive databases”: inductive queries, constraint-based mining

  11. Experiment databases

  12. Better coverage of AP-DC • To obtain better coverage of AP-DC space: • DC: Generate many datasets with varying properties • Synthetic dataset generator needed • Should generate datasets with widely varying (random?) characteristics (challenging!) • AP: Generate many instantiations of a single algorithm • Widely varying (random?) parameter values • Run random algorithm instantiation on random dataset, record results

  13. Why random values? Illustration in 2-D • Assume we want to determine the effect of p1 and p2 on accuracy, and we measure accuracy at the following points • (Figure: four ways of placing measurement points in the (p1, p2) plane, with captions:) • effect of p2 is not measured • effect of p2 cannot be distinguished from effect of p1 • p1 and p2 vary independently: p1 disturbs the effect of p2, and v.v. • p1 and p2 vary independently: #points = exp(D)

  14. Better reproducibility • To obtain better reproducibility: • Store all dataset characteristics and algorithm parameters in the database • Ideally: store datasets themselves! • Might not be feasible; instead: store the generator’s parameter values (incl. random seed), allowing regeneration of exact dataset • Storing only dataset characteristics should still allow good, not perfect, reproducibility
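
  A minimal sketch, in SQL, of how the generator settings mentioned above could be stored so that a dataset can be regenerated exactly; the table and column names (dataset_generation, generator_seed, ...) are illustrative assumptions, not part of the original proposal.

     -- Hypothetical table recording how each synthetic dataset was generated.
     -- Re-running the generator with the same parameters and seed recreates it.
     CREATE TABLE dataset_generation (
         dataset_id     INTEGER PRIMARY KEY,
         generator_name TEXT,     -- which synthetic data generator was used
         generator_seed INTEGER,  -- random seed, needed for exact regeneration
         n_examples     INTEGER,
         n_attributes   INTEGER,
         noise_level    REAL      -- example of a varied dataset characteristic
     );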

  15. Better reusability • To obtain better reusability: • Store as many parameters and performance metrics as possible, even if not needed immediately • E.g., store runtime, accuracy, TP and FP rate, complexity of resulting model, ... • When a new hypothesis arises, relevant data to test the hypothesis can be extracted from the existing database (no new experiments needed) • Stimulates more thorough analysis of results

  16. Setup of an experiment database • The ExpDB is filled with results from random instantiations of algorithms, on random datasets • Algorithm parameters and dataset properties are recorded • Performance criteria are measured and stored • These experiments cover the whole AP x DC space • Workflow (figure): choose algorithm (e.g., CART, C4.5, Ripper, ...) -> choose parameters (e.g., leaf size > 2, heuristic = gain, ...) -> generate dataset (e.g., #examples = 1000, #attr = 20, ...) -> run -> store algorithm parameters, dataset properties, and results

  17. Setup of an experiment database • When experimenting with 1 learner, e.g., C4.5, each row records the algorithm parameters, dataset characteristics, and performance of one run:

     Algorithm parameters | Dataset characteristics | Performance
     MLS  heur  ...       | Ex    Attr  Compl  ...  | TP   FP  RT  ...
     2    gain  ...       | 1000  20    17     ...  | 350  65  17  ...
     2    gr    ...       | 5000  25    12     ...  | 300  13  21  ...
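
  The single-learner table above could be declared in SQL roughly as follows; this is only a sketch, and the table name c45_experiments and the spelled-out column names are assumptions standing in for the abbreviations (MLS, Ex, Compl, ...) used on the slide.

     -- Hypothetical single-learner (C4.5) experiment table: one row per run.
     CREATE TABLE c45_experiments (
         run_id             INTEGER PRIMARY KEY,
         -- algorithm parameters
         min_leaf_size      INTEGER,   -- MLS
         heuristic          TEXT,      -- heur, e.g. 'gain' or 'gainratio'
         -- dataset characteristics
         n_examples         INTEGER,   -- Ex
         n_attributes       INTEGER,   -- Attr
         concept_complexity REAL,      -- Compl
         -- performance measurements
         true_positives     INTEGER,   -- TP
         false_positives    INTEGER,   -- FP
         runtime_sec        REAL       -- RT
     );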

  18. Setup of an experiment database • When experimenting with multiple learners: • More complicated setting: for instance, different learners have different parameters • We won’t consider this for now • Sketched layout: one shared ExpDB table plus one parameter-instantiation table per learner:

     ExpDB:         Alg.  Inst.  PI     Ex    Attr  Compl  ...  TP    FP  RT  ...
                    DT    C4.5   C45-1  1000  20    17     ...  1000  20  17  ...
                    DT    CART   CA-1   2000  50    12     ...  1000  20  17  ...

     C4.5ParInst:   PI     MLS  heur  ...
                    C45-1  2    gain  ...

     CART-ParInst:  PI    BS   heur  ...
                    CA-1  yes  Gini  ...
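
  One way the layout above could be written down as a relational schema; a sketch under the assumption that PI acts as a foreign key from the shared ExpDB table into the per-learner parameter tables, with illustrative names throughout.

     -- Hypothetical multi-learner schema: shared experiment table plus one
     -- parameter-instantiation table per algorithm.
     CREATE TABLE ExpDB (
         run_id       INTEGER PRIMARY KEY,
         alg_family   TEXT,     -- e.g. 'DT' (decision trees)
         alg_instance TEXT,     -- e.g. 'C4.5', 'CART'
         PI           TEXT,     -- parameter instantiation id, e.g. 'C45-1', 'CA-1'
         n_examples   INTEGER,
         n_attributes INTEGER,
         concept_complexity REAL,
         tp INTEGER, fp INTEGER, runtime_sec REAL
     );

     CREATE TABLE C45ParInst (
         PI            TEXT PRIMARY KEY,  -- e.g. 'C45-1'
         min_leaf_size INTEGER,           -- MLS
         heuristic     TEXT               -- e.g. 'gain'
     );

     CREATE TABLE CARTParInst (
         PI            TEXT PRIMARY KEY,  -- e.g. 'CA-1'
         binary_splits TEXT,              -- BS: 'yes' / 'no'
         heuristic     TEXT               -- e.g. 'Gini'
     );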

  19. Querying the ExpDB • Once the experiment database is filled, experimental hypotheses can be tested by querying the database of existing experiments (instead of setting up new experiments) • Much less effort for the human user • Simple queries may already allow quite thorough analysis • But would work even better if appropriate query languages were available • SQL not ideal (we will see some examples) • Look at inductive query languages, from IDB community

  20. Experimental questions and hypotheses • Example questions: • What is the effect of parameter X on runtime? • What is the effect of the number of examples in the dataset on TP and FP? • Do X and Y interact with respect to their effect on runtime? • I.e., is the effect of X on RT different for different values of Y? • For what kind of datasets does a larger value of parameter X lead to improved performance? • Which parameter is most important to tune (has the biggest effect on predictive accuracy)? • ...

  21. Investigating a simple effect • The effect of #Items on Runtime, for frequent itemset algorithms • Plot every run (figure: scatter plot of Runtime vs. NItems):
     SELECT NItems, Runtime FROM ExpDB SORT BY NItems
  • Or plot the average runtime per value of NItems (figure: smoother Runtime vs. NItems curve):
     SELECT NItems, AVG(Runtime) FROM ExpDB GROUP BY NItems SORT BY NItems

  22. Investigating a simple effect • Note: setting all parameters randomly creates more variance in the results • In the classical approach, these other parameters would simply be kept constant • This leads to clearer, but possibly less generalisable results • This can be simulated easily in the ExpDB setting (figure: Runtime vs. NItems for a fixed MinSupport):
     SELECT NItems, Runtime FROM ExpDB WHERE MinSupport=0.05 SORT BY NItems
  • + : the condition is explicit in the query • - : we use only a part of the ExpDB • Hence, the ExpDB needs to contain many experiments

  23. Investigating interaction of effects • E.g., does the effect of NItems on Runtime change with MinSupport and NTrans? (figure: one Runtime vs. NItems curve per setting, from a=0.01 to a=0.1)
     FOR a = 0.01, 0.02, 0.05, 0.1 DO
       FOR b = 10^3, 10^4, 10^5, 10^6, 10^7 DO
         PLOT SELECT NItems, Runtime FROM ExpDB
              WHERE MinSupport = $a AND $b <= NTrans AND NTrans < 10*$b
              SORT BY NItems
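
  Instead of the plotting loop above, the same cells could be aggregated in one standard-SQL query by bucketing NTrans into decades; the ExpDB, NItems, NTrans, MinSupport and Runtime names are taken from the slides, but the query itself (and the availability of LOG10/FLOOR, which varies per SQL dialect) is only an illustrative sketch.

     -- Average runtime per (MinSupport, NTrans decade, NItems) cell, so the
     -- interaction can be inspected without an explicit plotting loop.
     SELECT MinSupport,
            FLOOR(LOG10(NTrans)) AS ntrans_decade,  -- 3 means 10^3 <= NTrans < 10^4
            NItems,
            AVG(Runtime)         AS avg_runtime
     FROM ExpDB
     WHERE MinSupport IN (0.01, 0.02, 0.05, 0.1)
     GROUP BY MinSupport, FLOOR(LOG10(NTrans)), NItems
     ORDER BY MinSupport, ntrans_decade, NItems;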

  24. Direct questions instead of repeated hypothesis testing (“true” data mining) • What is the algorithm parameter that has the strongest influence on the runtime of my learning system?
     SELECT ParName, Var(A)/Avg(V) as Effect
     FROM AlgorithmParameters,
          (SELECT $ParName, Var(Runtime) as V, Avg(Runtime) as A
           FROM ExpDB GROUP BY $ParName)
     GROUP BY ParName
     SORT BY Effect
  • Not (easily) expressible in standard SQL! (pivoting: possible by hardcoding all attribute names in the query, but not very readable or reusable)
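
  The “hardcoding” workaround mentioned above could look roughly like the following standard-SQL sketch, with one branch written out per parameter (here MinSupport and Order, the two parameters of the next slide's example); this is an assumption about what such a query would look like, not the original ExpDB query, and the VARIANCE aggregate may be spelled differently depending on the SQL dialect.

     -- Hardcoded pivot: one UNION ALL branch per algorithm parameter.
     -- Inner branches: variance (V) and average (A) of Runtime per parameter value.
     -- Outer query: variance of the averages divided by the average variance.
     SELECT ParName, VARIANCE(A) / AVG(V) AS Effect
     FROM (
         SELECT 'MinSupport' AS ParName,
                VARIANCE(Runtime) AS V, AVG(Runtime) AS A
         FROM ExpDB GROUP BY MinSupport
         UNION ALL
         SELECT 'Order' AS ParName,
                VARIANCE(Runtime) AS V, AVG(Runtime) AS A
         FROM ExpDB GROUP BY "Order"
         -- ... one branch per remaining parameter ...
     ) AS per_value_stats
     GROUP BY ParName
     ORDER BY Effect DESC;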

  25. (Worked example of the previous query, for parameters “MinSupport” and “Order”)

     AlgorithmParameters:   ParName
                            “MinSupport”
                            “Order”

     Per-parameter subqueries:
       SELECT Minsupport, Var(Runtime) as V, Avg(Runtime) as A FROM ExpDB GROUP BY Minsupport
       SELECT Order, Var(Runtime) as V, Avg(Runtime) as A FROM ExpDB GROUP BY Order

     Their results, combined into one table:
       ParName        /      V     A
       “MinSupport”   0.01   8602  543
       “MinSupport”   0.02   360   100
       “MinSupport”   0.05   20    12
       “MinSupport”   0.10   0.3   0.8
       “Order”        BF     802   230
       “Order”        DF     315   120

     Outer query (from the previous slide):
       SELECT ParName, Var(A)/Avg(V) as Effect FROM ... GROUP BY ParName SORT BY Effect

     Result:
       ParName        Effect
       “Minsupport”   19
       “Order”        7.3

  26. The IDB point of view • “Inductive databases” (Imielinski & Mannila): • Treat patterns as regular objects, which can be stored in the DB and queried • Data mining = querying for patterns • Advantages: • Efficiency: stored patterns can quickly be retrieved • Flexibility: the user can specify exactly which patterns she is looking for (constraint-based mining) • Requires the development of “inductive query languages” (e.g., Meo et al.: MINE RULE operator; De Raedt et al.: querying for patterns in sequences) • Querying / mining the ExpDB is very similar • Need suitable query languages

  27. A comparison
     Classical approach:
     1) Experiments are goal-oriented: need to perform new experiments when new research questions pop up
     2) Experiments seem more convincing than they are: conditions under which results are valid are unclear
     3) Relatively simple analysis of results: mostly repeated hypothesis testing, rather than direct questions
     4) Low reusability and reproducibility
     ExpDB approach:
     1) Experiments are general-purpose: no new experiments needed when new research questions pop up
     2) Experiments seem as convincing as they are: conditions under which results are valid are explicit in the query
     3) Sophisticated analysis of results: direct questions possible, given suitable inductive query languages
     4) Better reusability and reproducibility

  28. Illustration: MITI ExpDB

  29. Illustration: MITI ExpDB • ExpDB-like approach applied to “Multi-instance decision trees” (Blockeel et al., ICML 2005) • Novel algorithm proposed (MITI) for learning decision trees from multi-instance data • 1 example = classified bag of instances • Bag is positive iff at least one instance is positive • Learn to classify individual instances as pos/neg • Many parameters influence MITI’s behaviour • ICML paper compares MITI to other approaches and measures effect of some parameters, on a few real/synthetic datasets • Do those results generalize?

  30. The ICML paper • MITI parameters: • Best-first vs. depth-first tree building • Heuristics for best-first node expansion • Heuristics for deciding the best test • “Threshold” above which a leaf predicts positive • Weights for larger / smaller bags • ICML paper conclusions: • Best-first is crucial • Heuristics for the best test seem to matter less • (1/bagsize) weights improve performance

  31. The MITI ExpDB approach • We ran a few thousand random experiments • Synthetic datasets • Various target concepts • Various parameters • Stored results in a table • Used WEKA to mine that table • Visualisation, trees, rules, ... • Note: less direct method of finding patterns (less goal-oriented than SQL-like queries just mentioned)

  32. MITI ExpDB Conclusions • This approach yielded some additional information: • Decision tree explicitly indicates most influential parameters for predictive accuracy • Positive influence of weights was not confirmed on our synthetic datasets • Tentative explanation: our synthetic datasets have a constant proportion of truly positive instances in positive bags • Still, also with this approach, careful interpretation is important... • E.g., dataset characteristics may influence accuracy but also “default accuracy” (majority class prediction)

  33. Example decision tree • (Figure: decision tree learned on the experiment metadata, splitting first on Node-expansion (depthfirst vs. bestfirst), then on Avg-bagsize, then on Nr-attr) • Indicates that the algorithm parameter “Node-expansion” has the largest effect on accuracy; the dataset characteristics “average bagsize” and “number of attributes” are 2nd and 3rd

  34. ExpDB-related conclusions • Obtained histograms, decision trees, rule sets, ... • These provide interesting information • Still, more focused querying/mining (specialized language) desirable • Interactively asking for / zooming in on / refining specific effects that may or may not be there • The “inductive querying” approach would be really useful here

  35. Conclusions

  36. Summary • ExpDB approach • Is more efficient • The same set of experiments is reusable and reused • Is more precise and trustworthy • Conditions under which the conclusions hold are explicitly stated • Yields better documented experiments • Precise information on all experiments is kept, experiments are reproducible • Allows more sophisticated analysis of results • Interaction of effects, true data mining capacity • Note: interesting for meta-learning!

  37. The challenges... • Good dataset generators necessary (WIP) • Generating truly varying datasets is not easy • Could start from real-life datasets (build variations) • Descriptions of datasets and algorithms (WIP) • Vary/record as many possibly relevant properties as possible • Multi-algorithm ExpDB • Suitable inductive query languages

  38. Acknowledgements • Thanks to ... • Saso Dzeroski, Carlos Soares, Ashwin Srinivasan (discussions & suggestions) • Joaquin Vanschoren (MITI ExpDB) • Anton Dries, Joaquin Vanschoren (implementation of the DS generator) • Reviewers and audience of KDID 2005 • Paper: • H. Blockeel, “Experiment Databases: A novel methodology for experimental research”, KDID workshop at ECML/PKDD 2005
