Parallel-DT
Parallel Decision Tree Induction
Team members: Eremia Bogdan, Minca Andrei, Sorici Alexandru
Outline
• Introduction (what are decision trees)
• Serial version
• Parallel approach
• Results and conclusions
• Future work
Introduction
• Classification is an important data mining problem
• A classification problem:
  • has an input dataset called the training dataset
  • consists of a number of examples, each with a number of attributes
• Objective: use the training set to build a model of the class label based on the other attributes, then use that model to classify new data not in the training set
Introduction (2)
• Decision Trees (DT) are probably the most popular algorithm for classification problems
• They achieve reasonable accuracy and are relatively inexpensive to compute
• Most current implementations, such as C4.5 (used in our tests), build on the ID3 algorithm developed by Quinlan
Introduction (3)
• Why parallelize decision tree induction?
  • Data mining datasets tend to be very large
  • Using only a small sample as training data => loss of classification accuracy
  • Therefore there is a need for computationally efficient, scalable algorithms
• DT construction is naturally concurrent: once a node is generated, its child nodes can be generated concurrently
Serial Version
• Based on the original C4.5 source code, written in C
• The tree induction process is recursive and relies on an information gain heuristic that determines the attribute on which to branch at each node (a sketch of this computation follows below)
• Dataset: poker-hand, with nominal attributes and 1,000,000 examples
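To make the heuristic concrete, below is a minimal C sketch of entropy-based information gain for a nominal attribute, in the spirit of ID3/C4.5. The names and count layout (NUM_CLASSES, counts, info_gain) are illustrative assumptions, not taken from the actual C4.5 source.

    /* Sketch: entropy-based information gain for a nominal attribute.
       NUM_CLASSES and the counts layout are illustrative, not C4.5's. */
    #include <math.h>
    #include <stdio.h>

    #define NUM_CLASSES 2

    /* Entropy of a class distribution over `total` examples. */
    static double entropy(const int counts[NUM_CLASSES], int total)
    {
        double h = 0.0;
        for (int c = 0; c < NUM_CLASSES; c++)
            if (counts[c] > 0) {
                double p = (double)counts[c] / total;
                h -= p * log2(p);
            }
        return h;
    }

    /* Gain of splitting `total` examples on an attribute with `num_vals`
       values, where counts[v][c] = examples having value v and class c. */
    static double info_gain(int num_vals, int counts[][NUM_CLASSES],
                            const int parent[NUM_CLASSES], int total)
    {
        double remainder = 0.0;
        for (int v = 0; v < num_vals; v++) {
            int n = 0;
            for (int c = 0; c < NUM_CLASSES; c++)
                n += counts[v][c];
            if (n > 0)
                remainder += (double)n / total * entropy(counts[v], n);
        }
        return entropy(parent, total) - remainder;
    }

    int main(void)
    {
        int parent[NUM_CLASSES] = { 9, 5 };   /* class totals at this node */
        int counts[2][NUM_CLASSES] = { { 6, 1 }, { 3, 4 } };
        printf("gain = %.3f\n", info_gain(2, counts, parent, 14));
        return 0;
    }

The serial algorithm evaluates this gain for every candidate attribute at every node, which is exactly the loop the parallel version targets.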
Serial version (performance)
• Tests performed on the Solaris VM with Sun Studio 12
• Time measurements taken for the relevant functions
• The Group function spends a lot of time just moving data around
Parallel approach
• Synchronous tree construction approach (sketched below)
  • All processors collaborate to expand a node of the decision tree
  • They simultaneously compute class distribution information at the current node
  • They simultaneously compute the entropy gain of each attribute and select the best one for child node expansion
• Possible problems
  • Synchronization is necessary when choosing the best attribute
  • Load imbalance: one attribute may have more values than others
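Below is a minimal, self-contained sketch of that synchronous step, assuming OpenMP: threads split the attributes among themselves, each keeps a local winner, and a critical section reduces them to the single best attribute. attr_gain and NUM_ATTRS are dummy stand-ins for the real gain computation, not C4.5 routines.

    #include <omp.h>
    #include <stdio.h>

    #define NUM_ATTRS 10

    /* Dummy stand-in for the per-attribute information-gain computation. */
    static double attr_gain(int attr) { return (attr * 37 % 101) / 100.0; }

    int best_attribute(void)
    {
        int best = -1;
        double best_gain = -1.0;
        #pragma omp parallel
        {
            int local_best = -1;
            double local_gain = -1.0;
            /* Every thread scores its share of the attributes. */
            #pragma omp for nowait
            for (int a = 0; a < NUM_ATTRS; a++) {
                double g = attr_gain(a);
                if (g > local_gain) { local_gain = g; local_best = a; }
            }
            /* The synchronization noted above: per-thread winners are
               reduced to a single global best attribute. */
            #pragma omp critical
            if (local_gain > best_gain ||
                (local_gain == best_gain && local_best < best)) {
                best_gain = local_gain;
                best = local_best;
            }
        }
        return best;
    }

    int main(void)
    {
        printf("best attribute: %d\n", best_attribute());
        return 0;
    }

The critical section runs once per thread rather than once per attribute, so its cost stays small; the tie-break on the attribute index keeps the result deterministic regardless of thread count.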
Parallel approach (2)
• Partitioned tree construction approach (sketched below)
  • Different processors work on different parts of the classification tree
  • When node n is expanded into k child nodes n1, n2, …, nk, the processor group Pn is divided into k subgroups
  • Each processor subgroup is then responsible for a subtree
• Problems
  • Load imbalance, because the tree structure can be very uneven
• The closest OpenMP construct to this approach is #pragma omp task
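As a sketch of what that could look like, the toy program below uses #pragma omp task to grow each subtree as an independent unit of work. The Node structure, the fixed BRANCHING factor, and choose_split are hypothetical placeholders, not C4.5 data structures; in the real algorithm each task would also carry its partition of the training examples.

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define BRANCHING 3   /* placeholder: number of values per attribute */

    typedef struct Node {
        int attr;                          /* attribute this node splits on */
        struct Node *children[BRANCHING];
    } Node;

    /* Placeholder for the attribute-selection heuristic. */
    static int choose_split(int depth) { return depth; }

    static Node *build_tree(int depth)
    {
        Node *n = malloc(sizeof *n);       /* freeing omitted for brevity */
        n->attr = choose_split(depth);
        for (int v = 0; v < BRANCHING; v++) {
            if (depth == 0) { n->children[v] = NULL; continue; }
            /* Each child subtree becomes a task, so different threads
               end up growing different parts of the tree. */
            #pragma omp task firstprivate(n, v, depth)
            n->children[v] = build_tree(depth - 1);
        }
        #pragma omp taskwait               /* join this node's subtrees */
        return n;
    }

    int main(void)
    {
        Node *root;
        #pragma omp parallel
        #pragma omp single                 /* one thread seeds the recursion */
        root = build_tree(4);
        printf("root splits on attribute %d\n", root->attr);
        return 0;
    }

Letting the runtime's task scheduler hand subtrees to idle threads, instead of statically dividing the processor group, is also one way to absorb the load imbalance caused by uneven trees.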
Results
• We implemented the synchronous tree construction approach
  • Working in a shared memory model using OpenMP
  • Parallelization of the attribute entropy gain computation with the #pragma omp parallel for construct
• We only tested on a dataset with nominal attributes
Results (2)
• Unfortunately, data movement takes more time than the gain computation
• We could not parallelize the data movement because of too many synchronization issues => no speed-up
Conclusions and future work
• The original C4.5 code by Quinlan was not written to be easily parallelized
• Processor data access has to be rethought
• Parallelize the routines handling continuous attributes, because they have a lot more work to do
• Implement the partitioned tree approach using OpenMP task constructs