SPRINT : A Scalable Parallel Classifier for Data Mining

John Shafer, Rakesh Agrawal, Manish Mehta


PATHWAY

  • Terms

  • Partition Algorithm

  • Data Structures

  • Performing Split

  • Serial SPRINT

  • Parallel SPRINT

  • Results


Terms

  • Training Data Set

  • Attributes : Categorical and Continuous

  • Class Label


Partition Algorithm

Partition( Data S ) {
    if all points in S are in the same class
        return
    for each attribute A
        evaluate splits on attribute A
    use the best split found to partition S into S1 and S2
    call Partition( S1 )
    call Partition( S2 )
}
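The recursion above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the record layout (a dict of attributes plus a label), the nested-dict tree, and the helper `midpoint_split` are assumptions made for the example. `midpoint_split` is only a placeholder for the "evaluate splits / find best split" step; SPRINT itself scores candidate splits with the gini index.

```python
def partition(records, choose_split):
    """Grow a binary tree over records = [(attrs_dict, label), ...] (non-empty)."""
    labels = [label for _, label in records]
    if len(set(labels)) <= 1:
        # All points in S are in the same class: stop recursing.
        return {"leaf": labels[0]}
    split = choose_split(records)
    if split is None:
        # No attribute separates the records; fall back to the majority class.
        return {"leaf": max(set(labels), key=labels.count)}
    attr, threshold = split
    s1 = [r for r in records if r[0][attr] < threshold]
    s2 = [r for r in records if r[0][attr] >= threshold]
    return {
        "test": (attr, threshold),
        "left": partition(s1, choose_split),
        "right": partition(s2, choose_split),
    }


def midpoint_split(records):
    """Hypothetical placeholder splitter: midpoint of the first non-constant attribute."""
    for attr in records[0][0]:
        values = [attrs[attr] for attrs, _ in records]
        if min(values) < max(values):
            return attr, (min(values) + max(values)) / 2
    return None


# Toy data, invented for this example.
tree = partition([({"age": 20}, "High"), ({"age": 60}, "Low")], midpoint_split)
```

Passing the splitter as a parameter keeps the recursive skeleton separate from the split-selection rule, which is where SPRINT's data structures do their work.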


Data Structures

  • Attribute Lists

  • Histograms : Continuous and Categorical
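A rough sketch of how attribute lists might be built, assuming each entry holds an attribute value, the class label, and the record id (RID), and that lists for continuous attributes are sorted by value once up front. The function name `build_attribute_lists` and the dict-based row layout are choices made for this example only.

```python
def build_attribute_lists(rows, labels, continuous):
    """Return {attribute: [(value, class_label, rid), ...]} for a training set.

    rows       -- list of dicts, one per record
    labels     -- class label of each record, same order as rows
    continuous -- set of attribute names treated as continuous
    """
    lists = {}
    for attr in rows[0]:
        entries = [
            (row[attr], label, rid)
            for rid, (row, label) in enumerate(zip(rows, labels))
        ]
        if attr in continuous:
            # Continuous lists are sorted by value once, at creation time.
            entries.sort(key=lambda entry: entry[0])
        lists[attr] = entries
    return lists


# Toy data, invented for this example.
rows = [{"age": 30, "car": "family"}, {"age": 20, "car": "sport"}]
lists = build_attribute_lists(rows, ["L", "H"], continuous={"age"})
```

Because every entry carries its RID, the lists for different attributes can be reordered independently and still be matched back to the same record.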


Finding Split Point

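Gini(S) = 1 - Sum( Pj*Pj ), where Pj is the relative frequency of class j in S

Gini Index(S) = Gini(S1)*n1/n + Gini(S2)*n2/n, where the split divides the n records of S into S1 (n1 records) and S2 (n2 records)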
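The two formulas translate directly into code. The sketch below assumes a data set is represented simply as a list of class labels; the function names are chosen for the example.

```python
from collections import Counter


def gini(labels):
    """Gini(S) = 1 - Sum(Pj * Pj) over the classes present in S."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_index(s1, s2):
    """Weighted gini of a split of S into S1 and S2."""
    n = len(s1) + len(s2)
    return gini(s1) * len(s1) / n + gini(s2) * len(s2) / n
```

For instance, a set with two H and two L records has gini 0.5, and a split that separates the classes perfectly has a gini index of 0; lower values indicate better splits.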


Split on Continuous Attributes

  • Threshold value : Cabove and Cbelow

  • Sorted Once and Sequential Scan

  • Deallocation of Cabove and Cbelow
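The sequential scan can be sketched as follows, assuming the attribute list is already sorted and that candidate thresholds are midpoints between consecutive distinct values. `Cbelow` and `Cabove` are kept as class-count histograms and updated one record at a time, so the list is read once. Function names and the toy data are assumptions for this example.

```python
from collections import Counter


def gini_from_counts(counts):
    """Gini of a class histogram {label: count}."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_continuous_split(attribute_list):
    """attribute_list: [(value, label, rid), ...] sorted by value.

    Returns (gini index, threshold) of the best split "value < threshold".
    """
    n = len(attribute_list)
    c_below = Counter()                                         # nothing scanned yet
    c_above = Counter(label for _, label, _ in attribute_list)  # everything
    best = (float("inf"), None)
    for i in range(n - 1):
        value, label, _ = attribute_list[i]
        c_below[label] += 1          # move one record from above to below
        c_above[label] -= 1
        next_value = attribute_list[i + 1][0]
        if value == next_value:
            continue                 # no threshold fits between equal values
        n_below = i + 1
        score = (gini_from_counts(c_below) * n_below / n
                 + gini_from_counts(c_above) * (n - n_below) / n)
        if score < best[0]:
            best = (score, (value + next_value) / 2)
    return best


# Toy list, invented for this example.
ages = [(17, "H", 1), (20, "H", 5), (23, "H", 0),
        (32, "L", 4), (43, "H", 2), (68, "L", 3)]
score, threshold = best_continuous_split(ages)
```

On this toy list the scan picks the threshold between 23 and 32. Only the two histograms and the best candidate seen so far are kept in memory, so they can be discarded as soon as the scan finishes.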


Split on Categorical Attributes

  • Create Count-Matrix

  • All subsets of attribute values as possible split point

  • Compute Gini Index

  • Gini from Count Matrix only

  • Memory deallocation
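A sketch of the categorical case, assuming the count matrix maps each attribute value to a class histogram and that every proper, non-empty subset of values is tried as the "left" side of the split. The names below are invented for the example. The exhaustive enumeration is exponential in the number of distinct values and visits each split twice (a subset and its complement), so it only suits attributes with few values.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_matrix(attribute_list):
    """Build {attribute value: Counter({class label: count})} in one scan."""
    matrix = defaultdict(Counter)
    for value, label, _ in attribute_list:
        matrix[value][label] += 1
    return matrix


def gini_from_counts(counts):
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_categorical_split(matrix):
    """Return (gini index, subset of values) using only the count matrix."""
    values = sorted(matrix)
    total = sum(matrix.values(), Counter())
    n = sum(total.values())
    best = (float("inf"), None)
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            inside = sum((matrix[v] for v in subset), Counter())
            outside = total - inside
            n_in = sum(inside.values())
            score = (gini_from_counts(inside) * n_in / n
                     + gini_from_counts(outside) * (n - n_in) / n)
            if score < best[0]:
                best = (score, set(subset))
    return best


# Toy list, invented for this example.
cars = [("family", "H", 0), ("sport", "H", 1), ("sport", "H", 2),
        ("family", "L", 3), ("truck", "L", 4), ("family", "H", 5)]
score, subset = best_categorical_split(count_matrix(cars))
```

Note that the attribute list is read only once, to fill the matrix; every candidate subset is then scored from the matrix alone.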


Perform Split and Partitioning

  • Select splitting attribute and splitting value

  • Create two child nodes and divide data on RIDs

  • Optimization using Hashing <RID,child-ptr>

  • Optimization depending on number of RIDs

  • Partitioned Hashing for large hash-table

  • Create new histogram and count-matrix of children
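The RID-based division can be sketched with an ordinary dict standing in for the hash table. The function name, the `goes_left` predicate, and the 0/1 child numbering are assumptions for this example; the partitioned-hashing fallback for tables that do not fit in memory is not shown.

```python
def split_attribute_lists(attribute_lists, split_attr, goes_left):
    """Divide every attribute list between two children.

    attribute_lists -- {attr: [(value, label, rid), ...]}
    split_attr      -- name of the winning splitting attribute
    goes_left       -- predicate on a value of split_attr (the winning test)
    """
    # Pass 1: scan the splitting attribute's list and record rid -> child.
    child_of = {
        rid: 0 if goes_left(value) else 1
        for value, _, rid in attribute_lists[split_attr]
    }
    # Pass 2: route the entries of every list by probing the table with the RID.
    # Relative order is preserved, so sorted lists stay sorted in the children.
    left, right = {}, {}
    for attr, entries in attribute_lists.items():
        left[attr] = [e for e in entries if child_of[e[2]] == 0]
        right[attr] = [e for e in entries if child_of[e[2]] == 1]
    return left, right


# Toy lists, invented for this example.
lists = {
    "age": [(17, "H", 1), (23, "H", 0), (32, "L", 2)],
    "car": [("family", "H", 0), ("sport", "H", 1), ("truck", "L", 2)],
}
left, right = split_attribute_lists(lists, "age", lambda v: v < 27.5)
```

Since the children inherit the parent's ordering, continuous attribute lists never need to be sorted again after the initial sort.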


Parallel SPRINT

  • Environment : Shared nothing

  • Data placement and workload balancing

  • Parallel computation of categorical attribute lists


Repartition of Continuous Attributes

  • Global Sort

  • Equal re-partitioning

  • Relation between Cabove and Cbelow and processor number

  • Parallel computation of split index
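To illustrate how `Cbelow` and `Cabove` relate to the processor number, here is a single-process simulation, assuming the globally sorted list is cut into equal contiguous sections, one per processor. Each processor's scan must then start with `Cbelow` covering all lower-ranked sections and `Cabove` covering its own section plus all higher-ranked ones. The function names are invented for the example, and plain Python lists stand in for real inter-processor communication.

```python
from collections import Counter


def equal_sections(sorted_list, num_procs):
    """Cut a globally sorted attribute list into equal contiguous sections."""
    size = -(-len(sorted_list) // num_procs)  # ceiling division
    return [sorted_list[i * size:(i + 1) * size] for i in range(num_procs)]


def initial_histograms(sections):
    """Return [(Cbelow, Cabove), ...], one starting pair per processor."""
    local = [Counter(label for _, label, _ in s) for s in sections]
    c_below = Counter()
    c_above = sum(local, Counter())   # class counts over all sections
    result = []
    for counts in local:
        result.append((c_below.copy(), c_above.copy()))
        c_below = c_below + counts    # the next processor starts after this section
        c_above = c_above - counts
    return result


# Toy list, invented for this example.
ages = [(17, "H", 1), (20, "H", 5), (23, "H", 0),
        (32, "L", 4), (43, "H", 2), (68, "L", 3)]
starts = initial_histograms(equal_sections(ages, 3))
```

With these starting histograms each processor can scan its own section independently and report its best local candidate; the lowest gini index over all processors then gives the global split for the attribute.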


Split point for Categorical Attributes

  • Create global matrix at coordinator

  • Compute split-index
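Building the global matrix amounts to adding the local count matrices cell by cell. A minimal single-process sketch, assuming each processor has already produced a matrix of the form {value: Counter({class: count})} from its own section; the function name is invented for the example and no real message passing is shown.

```python
from collections import Counter, defaultdict


def merge_count_matrices(local_matrices):
    """Sum per-processor count matrices into one global matrix."""
    global_matrix = defaultdict(Counter)
    for matrix in local_matrices:
        for value, counts in matrix.items():
            global_matrix[value].update(counts)  # Counter.update adds counts
    return global_matrix


# Toy local matrices, invented for this example.
local = [
    {"family": Counter(H=1), "sport": Counter(H=1)},
    {"family": Counter(H=1, L=1), "truck": Counter(L=1)},
]
merged = merge_count_matrices(local)
```

The coordinator can then compute the split index from the merged matrix exactly as in the serial case.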


Partitioning

  • Collect RIDs of splitting attributes from processors

  • Exchange RIDs
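The RID exchange can be pictured with a small single-process simulation. It assumes each processor holds one section of the splitting attribute's list, builds a local RID-to-child map from it, and then receives every other processor's map so that it can route its sections of the remaining lists. The function name and the 0/1 child numbering are assumptions, and a dict merge stands in for the actual all-to-all communication.

```python
def exchange_rids(split_sections, goes_left):
    """Simulate collecting and exchanging RIDs across processors.

    split_sections -- one section of the splitting attribute's list per processor
    goes_left      -- predicate on a value of the splitting attribute
    Returns one full rid -> child map per processor.
    """
    # Each processor scans only its own section.
    local_maps = [
        {rid: 0 if goes_left(value) else 1 for value, _, rid in section}
        for section in split_sections
    ]
    # Simulated exchange: every processor ends up with the union of all maps.
    merged = {}
    for local in local_maps:
        merged.update(local)
    return [dict(merged) for _ in split_sections]


# Toy sections, invented for this example.
sections = [[(17, "H", 1), (23, "H", 0)], [(32, "L", 2), (43, "H", 3)]]
maps = exchange_rids(sections, lambda v: v < 27.5)
```

After the exchange, every processor can split its local sections of the non-splitting attribute lists without further communication.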


Figure: evaluating the split Age < 27.5 on a continuous attribute list, showing the Cbelow and Cabove class histograms (classes H and L) at cursor positions 0 and 3.


Figure: count matrix built from a categorical attribute list, with class counts H and L for the attribute values family, sport, and truck.






Figure: example decision tree with the tests Age < 25 and CarType = sports and leaves labeled High, High, and Low.

