
Implementation of GUHA decision trees – initial remarks



  1. Implementation of GUHA decision trees – initial remarks
  Martin Ralbovský, KIZI FIS VŠE, 6.12.2007

  2. The GUHA method
  • Provides a general framework for retrieving interesting information from data
  • Strong foundations in logic and statistics
  • One of the main principles of the method is to provide “everything interesting” to the user

  3. Decision trees
  • One of the best-known classification methods
  • Several well-known algorithms exist for constructing decision trees (ID3, C4.5, …)
  • Algorithm outline: iterate through the attributes; in each step choose the best attribute for branching and make a node from it (a sketch of this greedy step follows below)
  • Only the best decision tree appears in the output
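
  A minimal sketch of that greedy step, written in the same C#-flavoured style as the pseudocode on slide 8; Attribute, Node and InformationGain are illustrative placeholders, not types or functions from any concrete library:

  using System.Collections.Generic;

  // Illustrative sketch of one greedy branching step (ID3/C4.5 style).
  // Attribute, Node and InformationGain are hypothetical placeholders.
  Attribute ChooseBestAttribute(Node node, IEnumerable<Attribute> candidates)
  {
      Attribute best = null;
      double bestScore = double.NegativeInfinity;
      foreach (Attribute a in candidates)
      {
          double score = InformationGain(node, a);  // entropy-based criterion
          if (score > bestScore) { best = a; bestScore = score; }
      }
      return best;  // the classic algorithms branch only on this single winner
  }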

  4. Making decision trees GUHA (Petr Berka)
  • A decision tree can be viewed as a GUHA verification/hypothesis
  • But there is only one tree in the output
  • Modification of the initial algorithm – the ETree procedure
  • We do not branch according to the best attribute, but according to the n best attributes (see the sketch below)
  • In each iteration, nodes suitable for branching are selected from the existing trees and branched
  • Only sound decision trees go to the output
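
  A hedged sketch of the ETree variant of the same step: the n best-scoring attributes are kept instead of a single winner, and each one spawns its own tree variant. OrderByDescending/Take are standard LINQ; Score is again a placeholder, standing in for the χ² ordering criterion on the next slide:

  using System.Collections.Generic;
  using System.Linq;

  // ETree-style step: keep the n best attributes instead of the single best.
  IEnumerable<Attribute> ChooseBranchingAttributes(Node node,
      IEnumerable<Attribute> candidates, int n)
  {
      return candidates
          .OrderByDescending(a => Score(node, a))  // e.g. the χ² criterion of slide 5
          .Take(n);                                // each attribute yields a tree variant
  }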

  5. ETree parameters (Petr Berka)
  • Criterion for attribute ordering – χ²
  Trees:
  • Maximal tree depth (parameter)
  • Allow only full-length trees
  • Number of attributes for branching
  Branching:
  • Minimal node frequency
  • Minimal node purity
  • Stopping criterion for branching (frequency, purity, frequency OR purity)
  Sound trees:
  • Confusion matrix, F-measure + any 4ft-quantifier in Ferda
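
  To keep the parameter groups straight, they could be collected in a settings object along these lines; this is an illustrative sketch with hypothetical names, not Ferda's actual box interface (the defaults mirror Experiment 1 later in the deck):

  // Illustrative grouping of the ETree parameters above (hypothetical names).
  enum StoppingCriterion { Frequency, Purity, FrequencyOrPurity }

  class ETreeSettings
  {
      // Trees
      public int MaximalTreeDepth = 3;
      public bool OnlyFullLengthTrees = false;
      public int AttributesForBranching = 4;   // the "n best" of the previous slide
      // Branching
      public int MinimalNodeFrequency = 61;    // absolute row count, e.g. 1% of the data
      public double MinimalNodePurity = 0.8;
      public StoppingCriterion StopWhen = StoppingCriterion.FrequencyOrPurity;
  }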

  6. How to branch I
  Attribute1 = {A,B,C}, Attribute2 = {1,2}
  (diagram: branching variants a) and b) on the categories A, B, C of Attribute1)

  7. How to branch II
  Attribute1 = {A,B,C}, Attribute2 = {1,2}
  (diagram: variants a) and b) of further branching the A/B/C nodes on the categories 1, 2 of Attribute2)

  8. Pseudocode algorithm
  Stack<Tree> stack = new Stack<Tree>();       // LIFO stack of candidate trees
  stack.Push(MakeSeedTree());                  // start from the one-node seed tree
  while (stack.Count > 0)                      // loop until no candidate trees remain
  {
      Tree processTree = stack.Pop();
      foreach (Node n in NodesForBranching(processTree))
      {
          stack.Push(CreateTree(processTree, n));  // new tree branched at node n
      }
      if (QualityTree(processTree))
      {
          PutToOutput(processTree);            // only sound trees go to the output
      }
  }
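
  A note on the design, as read from the pseudocode above rather than from Ferda's internals: the LIFO stack makes the enumeration depth-first, so each popped tree is first expanded at all of its branchable nodes and only then tested for quality, and the stack grows with the depth of the search rather than with the number of trees per level.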

  9. Implementation in Ferda
  Instead of creating a new DM tool, the modularity of Ferda was used:
  • Data preparation boxes
  • 4ft-quantifiers can be used to measure the quality of trees
  • MiningProcessor (bit string generation engine) usage
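
  A speculative sketch of why the bit string engine helps (how MiningProcessor actually exposes its strings is an assumption here): a node's frequency and purity reduce to bitwise conjunctions and one-counts over category bit strings. BitArray is standard .NET; the rest is illustrative.

  using System.Collections;

  static class NodeStats
  {
      static int CountOnes(BitArray bits)
      {
          int ones = 0;
          for (int i = 0; i < bits.Count; i++) if (bits[i]) ones++;
          return ones;
      }

      // nodePath: conjunction of the category bit strings on the path to the node
      // classBits: bit string of the classified category
      static void Stats(BitArray nodePath, BitArray classBits,
          out int frequency, out double purity)
      {
          frequency = CountOnes(nodePath);   // checked against minimal node frequency
          BitArray hits = ((BitArray)nodePath.Clone()).And(classBits);
          purity = frequency == 0 ? 0.0 : (double)CountOnes(hits) / frequency;
      }
  }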

  10. ETree task example
  (screenshot: the ETree task box connected to 4ft-quantifiers and the existing data preparation boxes)

  11. Output + settings example

  12. Experiment 1 – Barbora
  Barbora bank, ca. 6,100 clients; classification of client status from:
  • Loan amount
  • Client district
  • Loan duration
  • Client salary
  Number of attributes for branching = 4
  Minimal node purity = 0.8
  Minimal node frequency = 61 (1% of the data)

  13. Results – Barbora
  Performance: 36 verifications per second (VPS)

  14. Experiment 2: Forest tree cover
  UCI KDD dataset for classification (10K sample); classification of tree cover based on characteristics:
  • Wilderness area
  • Elevation
  • Slope
  • Horizontal + vertical distance to hydrology
  • Horizontal distance to fire points
  Number of attributes for branching: 1, 3, 5
  Minimal node purity: 0.8
  Minimal node frequency: 100 (1% of the dataset)

  15. Results – Forest tree cover
  Attributes for branching: 1 – performance: 39 VPS
  Attributes for branching: 3 – performance: 86 VPS
  Attributes for branching: 5 – performance: 71 VPS

  16. Experiment 3: Forest tree cover
  • Construction of trees for the whole dataset (ca. 600K rows)
  • Does increasing the number of attributes for branching result in better trees?
  • Tree depth = 3; the other parameters are the same as in Experiment 2
  Number of attributes for branching = 1
  • Best hypothesis: 0.30; 6 VPS (bit strings in cache)
  Number of attributes for branching = 4
  • Best hypothesis: 0.52; 2 VPS (bit strings in cache)

  17. Verifications: 4FT vs. ETree
  On tasks with similar data table length:
  • 4FT (in Ferda): approx. 5,000 VPS
  • ETree: about 70 VPS
  The ETree verification is far more complicated:
  • In addition to computing the quantifier, χ² is counted for each node suitable for branching (see the sketch below)
  • Hard operations (sums) instead of easy operations (conjunctions, …)
  • Not only verification of a tree, but also construction of the trees derived from it
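
  To make the extra cost concrete, here is the textbook χ² statistic for one 2×2 contingency table (a generic sketch, not Ferda's code); something of this shape, plus the underlying sums, runs for every node suitable for branching, on top of the quantifier itself:

  // Generic chi-squared statistic for a 2x2 contingency table
  // with cell counts a, b, c, d (textbook formula, not Ferda's code).
  static double ChiSquared(long a, long b, long c, long d)
  {
      long n = a + b + c + d;
      double denominator = (double)(a + b) * (c + d) * (a + c) * (b + d);
      if (denominator == 0) return 0.0;
      double diff = (double)a * d - (double)b * c;
      return n * diff * diff / denominator;
  }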

  18. Further work
  • How new/known is the method?
  • Boxes for attribute selection criteria
  • Classification box
  • Better result browsing + result reduction
  • Optimization
  • Elective classification – each tree has a vote (Petr Berka)
  • Experiments with various data sources
  • Decision trees from fuzzy attributes
  • Better estimation of the relevant question count
