
SPRINT : A Scalable Parallel Classifier for Data Mining

John Shafer, Rakesh Agrawal, Manish Mehta


Presentation Transcript


  1. SPRINT : A Scalable Parallel Classifier for Data Mining • John Shafer, Rakesh Agrawal, Manish Mehta

  2. PATHWAY • Terms • Partition Algorithm • Data Structures • Performing Split • Serial SPRINT • Parallel SPRINT • Results

  3. Terms • Training Data Set • Attributes : Categorical and Continuous • Class Label

  4. Partition Algorithm

     Partition( Data S ) {
         if all points in S are in the same class
             return
         for each attribute A
             evaluate split on attribute A
         find best split
         partition S into S1 and S2
         call Partition( S1 )
         call Partition( S2 )
     }
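The recursion on this slide can be sketched in Python. This is a minimal illustration, not the authors' implementation: it assumes numeric attributes only, scores candidate splits with the weighted gini index, and falls back to a majority-class leaf when no attribute can separate the records. The names `gini`, `best_split`, and `partition` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (score, attribute index, threshold) of the best numeric split, or None."""
    best = None
    n = len(rows)
    for a in range(len(rows[0])):
        values = sorted(set(r[a] for r in rows))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[a] < t]
            right = [y for r, y in zip(rows, labels) if r[a] >= t]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, a, t)
    return best


def partition(rows, labels):
    """Recursively build a decision tree as nested dicts."""
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}          # all points are in the same class
    split = best_split(rows, labels)
    if split is None:                        # assumption: majority vote when unsplittable
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    _, a, t = split
    s1 = [(r, y) for r, y in zip(rows, labels) if r[a] < t]
    s2 = [(r, y) for r, y in zip(rows, labels) if r[a] >= t]
    return {
        "attr": a,
        "threshold": t,
        "left": partition([r for r, _ in s1], [y for _, y in s1]),
        "right": partition([r for r, _ in s2], [y for _, y in s2]),
    }
```

With six single-attribute records (ages 23, 17, 43, 68, 32, 20 labelled High, High, High, Low, Low, High), the root split found by this sketch is at threshold 27.5.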

  5. Data Structures • Attribute Lists • Histograms : Continuous and Categorical

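  6. Finding Split Point • $\mathrm{Gini}(S) = 1 - \sum_j p_j^2$, where $p_j$ is the relative frequency of class $j$ in $S$ • $\mathrm{GiniIndex}(S) = \frac{n_1}{n}\,\mathrm{Gini}(S_1) + \frac{n_2}{n}\,\mathrm{Gini}(S_2)$, where the split divides the $n$ records of $S$ into $S_1$ and $S_2$ with $n_1$ and $n_2$ records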
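The two formulas on this slide can be checked numerically with a short Python sketch. The function names `gini` and `gini_index` are chosen here for illustration and do not come from the presentation.

```python
from collections import Counter


def gini(labels):
    """Gini(S) = 1 - sum of squared class frequencies in S."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_index(s1, s2):
    """Weighted gini of a binary split of S into S1 and S2."""
    n1, n2 = len(s1), len(s2)
    n = n1 + n2
    return gini(s1) * n1 / n + gini(s2) * n2 / n


# A perfectly mixed two-class set has impurity 0.5; a pure set has impurity 0.
print(gini(["High", "High", "Low", "Low"]))   # 0.5
print(gini(["High", "High", "High"]))         # 0.0
```

A lower index means purer children, so the split with the smallest value is preferred.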

  7. Split on Continuous Attributes • Threshold value : Cabove and Cbelow • Sorted Once and Sequential Scan • Deallocation of Cabove and Cbelow
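A rough Python sketch of the sequential scan described on this slide: the attribute list is sorted once, Cbelow starts empty, Cabove starts with every record, and each step moves one record from Cabove to Cbelow before scoring the candidate threshold. The tuple layout `(value, class_label, rid)` and the choice of midpoints as thresholds are assumptions made for illustration.

```python
from collections import Counter


def gini_from_counts(counts):
    """Gini impurity computed from a class histogram."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_continuous_split(attribute_list):
    """attribute_list: (value, class_label, rid) tuples, already sorted by value."""
    n = len(attribute_list)
    c_below = Counter()                                         # records before the cursor
    c_above = Counter(label for _, label, _ in attribute_list)  # records at or after it
    best_threshold, best_score = None, None
    for i, (value, label, _rid) in enumerate(attribute_list):
        # Advance the cursor past this record.
        c_below[label] += 1
        c_above[label] -= 1
        if i + 1 == n:
            break                      # nothing left above the cursor
        next_value = attribute_list[i + 1][0]
        if next_value == value:
            continue                   # cannot split between equal values
        n_below = i + 1
        score = (n_below / n) * gini_from_counts(c_below) \
            + ((n - n_below) / n) * gini_from_counts(c_above)
        if best_score is None or score < best_score:
            best_threshold, best_score = (value + next_value) / 2, score
    return best_threshold, best_score


ages = [(17, "High", 1), (20, "High", 5), (23, "High", 0),
        (32, "Low", 4), (43, "High", 2), (68, "Low", 3)]
threshold, score = best_continuous_split(ages)
print(threshold)   # 27.5
```

Only the two histograms are needed during the scan, which is why they can be released as soon as the attribute has been evaluated.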

  8. Split on Categorical Attributes • Create Count-Matrix • All subsets of attribute values as possible split points • Compute Gini Index • Gini from Count Matrix only • Memory deallocation
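The count-matrix approach on this slide can be sketched as follows. The sketch builds the matrix in one pass over the attribute list and then scores every proper, non-empty subset of values using the matrix alone. Exhaustive enumeration grows exponentially with the number of distinct values, so this is only meant for small domains. All function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_count_matrix(attribute_list):
    """Map each attribute value to a histogram of class labels."""
    matrix = defaultdict(Counter)
    for value, label, _rid in attribute_list:
        matrix[value][label] += 1
    return matrix


def gini_from_counts(counts):
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_categorical_split(matrix):
    """Return (subset of values sent to one child, gini index of that split)."""
    values = sorted(matrix)
    total = Counter()
    for v in values:
        total.update(matrix[v])
    n = sum(total.values())
    best_subset, best_score = None, None
    # Each split is visited twice (subset and complement); harmless for a sketch.
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            left = Counter()
            for v in subset:
                left.update(matrix[v])
            right = total - left
            n_left = sum(left.values())
            score = (n_left / n) * gini_from_counts(left) \
                + ((n - n_left) / n) * gini_from_counts(right)
            if best_score is None or score < best_score:
                best_subset, best_score = set(subset), score
    return best_subset, best_score


car_types = [("family", "High", 0), ("sports", "High", 1), ("sports", "High", 2),
             ("family", "Low", 3), ("truck", "Low", 4), ("family", "High", 5)]
subset, score = best_categorical_split(build_count_matrix(car_types))
print(subset)   # {'truck'}
```

Once the matrix is built, the attribute list itself is no longer consulted, which matches the "Gini from Count Matrix only" bullet.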

  9. Perform Split and Partitioning • Select splitting attribute and splitting value • Create two child nodes and divide data on RIDs • Optimization using Hashing <RID,child-ptr> • Optimization depending on number of RIDs • Partitioned Hashing for large hash-table • Create new histogram and count-matrix of children
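One way to picture the RID-based partitioning on this slide, as a hedged Python sketch: the splitting attribute's list is scanned first to fill a hash table mapping each RID to its child, and every other attribute list is then divided by probing that table. Scanning each list in order keeps the children's lists in their original sort order, so no re-sorting is needed. The dictionary layout and the `goes_left` predicate are assumptions for illustration.

```python
def split_attribute_lists(attribute_lists, split_attr, goes_left):
    """Divide every attribute list between two children.

    attribute_lists: dict of attribute name -> list of (value, class_label, rid).
    split_attr:      name of the attribute chosen for the split.
    goes_left:       predicate on the splitting attribute's value.
    """
    # Pass 1: scan the splitting attribute and record where each RID goes.
    rid_to_child = {}
    for value, _label, rid in attribute_lists[split_attr]:
        rid_to_child[rid] = "left" if goes_left(value) else "right"

    # Pass 2: probe the hash table to route the entries of every list.
    children = {"left": {}, "right": {}}
    for name, entries in attribute_lists.items():
        children["left"][name] = []
        children["right"][name] = []
        for entry in entries:
            child = rid_to_child[entry[2]]
            children[child][name].append(entry)
    return children["left"], children["right"]


lists = {
    "age": [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)],
    "car_type": [("family", "High", 0), ("sports", "High", 1), ("sports", "High", 2),
                 ("family", "Low", 3), ("truck", "Low", 4), ("family", "High", 5)],
}
left, right = split_attribute_lists(lists, "age", lambda age: age < 27.5)
print([rid for _, _, rid in left["car_type"]])   # [0, 1, 5]
```

When the table is too large to hold in memory, the slide's "Partitioned Hashing" bullet suggests processing it in pieces; that refinement is not shown here.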

  10. Parallel SPRINT • Environment : Shared nothing • Data placement and workload balancing • Parallel computation of categorical attribute lists

  11. Repartition of Continuous Attributes • Global Sort • Equal re-partitioning • Relation between Cabove and Cbelow and processor number • Parallel computation of split index
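The relation between the histograms and the processor number can be illustrated with a small single-process simulation. After the global sort, each processor holds a contiguous section of the attribute list; before scanning, its Cbelow must already count the records held by lower-numbered processors, and its Cabove must count its own section plus those of higher-numbered processors. The function name and the `(value, class_label)` layout are assumptions, and real message passing between processors is not modelled.

```python
from collections import Counter


def initial_histograms(sections):
    """sections[i] is the sorted, contiguous slice held by processor i.

    Returns one (c_below, c_above) pair per processor, giving the state of
    the histograms before that processor begins its local scan.
    """
    local_counts = [Counter(label for _, label in section) for section in sections]
    below = Counter()
    above = sum(local_counts, Counter())
    result = []
    for counts in local_counts:
        result.append((Counter(below), Counter(above)))
        below = below + counts      # this section is "below" for the next processor
        above = above - counts      # and no longer "above" for it
    return result


sections = [
    [(17, "High"), (20, "High")],   # processor 0
    [(23, "High"), (32, "Low")],    # processor 1
    [(43, "High"), (68, "Low")],    # processor 2
]
for rank, (c_below, c_above) in enumerate(initial_histograms(sections)):
    print(rank, dict(c_below), dict(c_above))
```

With the histograms initialised this way, each processor can evaluate the candidate splits in its own section independently, and the best local results are then compared to pick the global split.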

  12. Split point for Categorical Attributes • Create global matrix at coordinator • Compute split-index
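Building the global matrix amounts to summing the processors' local count matrices entry by entry. The sketch below simulates the coordinator's side in plain Python; how the local matrices actually reach the coordinator is not modelled, and `merge_count_matrices` is a hypothetical name.

```python
from collections import Counter


def merge_count_matrices(local_matrices):
    """Sum per-processor count matrices into one global matrix.

    Each matrix maps an attribute value to a Counter of class labels.
    """
    global_matrix = {}
    for matrix in local_matrices:
        for value, class_counts in matrix.items():
            global_matrix.setdefault(value, Counter()).update(class_counts)
    return global_matrix


local_matrices = [
    {"family": Counter(High=1), "sports": Counter(High=2)},          # processor 0
    {"family": Counter(High=1, Low=1), "truck": Counter(Low=1)},     # processor 1
]
global_matrix = merge_count_matrices(local_matrices)
print(dict(global_matrix["family"]))   # {'High': 2, 'Low': 1}
```

The split index for the categorical attribute can then be computed from the global matrix exactly as in the serial case.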

  13. Partitioning • Collect RIDs of splitting attributes from processors • Exchange RIDs

  14. Figure: split on Age < 27.5 • Nodes labelled 0, 1, 2

  15. Figure: Attribute List with histograms Cbelow and Cabove (class columns H, L) • Cursor at Position 0 • Cursor at Position 3

  16. Figure: Attribute List and Count Matrix • Class columns: H, L • Rows: family, sport, truck

  17. Breakdown of Response Time

  18. Scaleup of SPRINT

  19. Speedup of SPRINT

  20. Sizeup of SPRINT

  21. Example: Decision Tree (figure) • Tests: Age < 25, CarType = sports • Leaf labels: High, High, Low
