SPRINT : A Scalable Parallel Classifier for Data Mining

John Shafer, Rakesh Agrawal, Manish Mehta


PATHWAY

  • Terms

  • Partition Algorithm

  • Data Structures

  • Performing Split

  • Serial SPRINT

  • Parallel SPRINT

  • Results


Terms

  • Training Data Set

  • Attributes : Categorical and Continuous

  • Class Label


Partition Algorithm

Partition( Data S ) {
    if all points in S are in the same class
        return
    for each attribute A
        evaluate splits on attribute A
    use the best split found to partition S into S1 and S2
    call Partition( S1 )
    call Partition( S2 )
}
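
The recursion above can be sketched in Python. This is a deliberately naive illustration, not the SPRINT implementation: it assumes numeric attributes only, re-sorts the values at every node (the cost SPRINT's pre-sorted attribute lists are meant to avoid), and uses hypothetical names (`gini`, `best_split`, `partition`) and made-up toy records.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty collection of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Try every (attribute, midpoint threshold); return the lowest weighted gini."""
    best = None  # (score, attribute index, threshold)
    n = len(rows)
    for a in range(len(rows[0])):
        values = sorted(set(r[a] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[a] < t]
            right = [y for r, y in zip(rows, labels) if r[a] >= t]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, a, t)
    return best

def partition(rows, labels):
    """Grow a binary tree, stopping when a node holds a single class."""
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    split = best_split(rows, labels)
    if split is None:  # identical rows with mixed labels: fall back to majority
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    _, a, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[a] < t]
    right = [(r, y) for r, y in zip(rows, labels) if r[a] >= t]
    return {"attr": a, "threshold": t,
            "left": partition(*zip(*left)),
            "right": partition(*zip(*right))}

# Toy data (an assumption for illustration): one attribute (age), classes H / L.
rows = [(23,), (17,), (43,), (68,), (32,), (20,)]
labels = ["H", "H", "H", "L", "L", "H"]
tree = partition(rows, labels)
print(tree)
```

On these toy records the root test comes out as age < 27.5, with a pure "H" leaf on the left.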


Data Structures

  • Attribute Lists

  • Histograms : Continuous and Categorical


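Finding Split Point

Gini(S) = 1 - Sum( Pj*Pj ), where Pj is the relative frequency of class j in S

Gini Index(S) = Gini(S1)*n1/n + Gini(S2)*n2/n, where the split divides the n records of S into S1 (n1 records) and S2 (n2 records)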
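
The two formulas above translate directly into code. The following is a minimal sketch; the function names `gini` and `gini_index` are chosen here for illustration and do not come from the slides.

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum_j Pj^2, with Pj the relative frequency of class j."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(s1, s2):
    """Weighted gini of a binary split of S into s1 and s2."""
    n = len(s1) + len(s2)
    return len(s1) / n * gini(s1) + len(s2) / n * gini(s2)

print(gini(["H", "H", "L", "L"]))          # evenly mixed two-class node
print(gini_index(["H", "H"], ["L", "L"]))  # split that separates the classes
```

A lower index means purer children, so the split with the smallest value is preferred.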


Split on Continuous Attributes

  • Threshold value : Cabove and Cbelow

  • Sorted Once and Sequential Scan

  • Deallocation of Cabove and Cbelow
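
The sequential scan over a pre-sorted attribute list can be sketched as follows. This is an illustration under stated assumptions: entries are `(value, class_label, rid)` tuples, candidate thresholds are midpoints between consecutive distinct values, and the function names and toy records are hypothetical.

```python
from collections import Counter

def gini_from_counts(counts):
    """Gini computed from a class histogram (class -> count)."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_continuous_split(attr_list):
    """attr_list must already be sorted by value.
    Returns (threshold, gini index) for the best test 'value < threshold'."""
    c_below = Counter()                                  # records already scanned
    c_above = Counter(label for _, label, _ in attr_list)  # records not yet scanned
    n = len(attr_list)
    best = (None, float("inf"))
    for i in range(n - 1):
        value, label, _rid = attr_list[i]
        c_below[label] += 1          # move the record across the cursor
        c_above[label] -= 1
        next_value = attr_list[i + 1][0]
        if next_value == value:
            continue                 # no threshold fits between equal values
        n_below = i + 1
        score = (n_below / n) * gini_from_counts(c_below) \
              + ((n - n_below) / n) * gini_from_counts(c_above)
        if score < best[1]:
            best = ((value + next_value) / 2, score)
    return best

# Toy attribute list for "age" (illustrative data, sorted once up front).
age_list = [(17, "H", 1), (20, "H", 5), (23, "H", 0),
            (32, "L", 4), (43, "H", 2), (68, "L", 3)]
threshold, score = best_continuous_split(age_list)
print(threshold, score)
```

Because only the two histograms are updated per record, one pass over the list evaluates every candidate threshold.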


Split on Categorical Attributes

  • Create Count-Matrix

  • All subsets of attribute values as possible split point

  • Compute Gini Index

  • Gini from Count Matrix only

  • Memory deallocation
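
As a sketch of these steps, the code below builds a count matrix from a categorical attribute list and scores every proper subset of values using the matrix alone. The names and toy records are assumptions made for illustration, and the exhaustive enumeration is exponential in the number of distinct values, so it is only practical for small domains.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_count_matrix(attr_list):
    """attr_list entries are (value, class_label, rid). Returns value -> class counts."""
    matrix = defaultdict(Counter)
    for value, label, _rid in attr_list:
        matrix[value][label] += 1
    return matrix

def gini_from_counts(counts):
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_categorical_split(matrix):
    """Score each subset test 'value in subset' from the count matrix only."""
    values = sorted(matrix)
    total = Counter()
    for row in matrix.values():
        total.update(row)
    n = sum(total.values())
    best = (None, float("inf"))
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            left = Counter()
            for v in subset:
                left.update(matrix[v])
            right = total - left          # class counts of the remaining values
            n_left = sum(left.values())
            score = n_left / n * gini_from_counts(left) \
                  + (n - n_left) / n * gini_from_counts(right)
            if score < best[1]:
                best = (set(subset), score)
    return best

# Toy attribute list for "car type" (illustrative data).
car_list = [("family", "H", 0), ("sports", "H", 1), ("sports", "H", 2),
            ("family", "L", 3), ("truck", "L", 4), ("family", "H", 5)]
matrix = build_count_matrix(car_list)
subset, score = best_categorical_split(matrix)
print(subset, score)
```

Each subset and its complement describe the same split, so half of the enumeration is redundant; the sketch keeps it for simplicity.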


Perform Split and Partitioning

  • Select splitting attribute and splitting value

  • Create two child nodes and divide data on RIDs

  • Optimization using Hashing <RID,child-ptr>

  • Optimization depending on number of RIDs

  • Partitioned Hashing for large hash-table

  • Create new histogram and count-matrix of children
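
The hashing step can be sketched like this: scan the splitting attribute's list once to record which child each RID goes to, then probe that table while scanning every other list. The dictionary stands in for the `<RID, child-ptr>` hash table; the function name, the `goes_left` predicate, and the toy records are hypothetical, and the partitioned-hashing variant for tables that do not fit in memory is not shown.

```python
def split_attribute_lists(attr_lists, split_attr, goes_left):
    """attr_lists: name -> list of (value, class_label, rid).
    goes_left: predicate on the splitting attribute's value.
    Returns (left_lists, right_lists) with the same keys as attr_lists."""
    # Pass 1: the splitting attribute decides the child of every RID.
    child_of = {}
    for value, _label, rid in attr_lists[split_attr]:
        child_of[rid] = 0 if goes_left(value) else 1

    # Pass 2: route every entry of every list by probing the table.
    # Appending in scan order keeps sorted lists sorted in both children.
    children = ({}, {})
    for name, entries in attr_lists.items():
        children[0][name] = []
        children[1][name] = []
        for entry in entries:
            children[child_of[entry[2]]][name].append(entry)
    return children

# Toy attribute lists (illustrative data); the age list is sorted by value.
lists = {
    "age": [(17, "H", 1), (20, "H", 5), (23, "H", 0),
            (32, "L", 4), (43, "H", 2), (68, "L", 3)],
    "car": [("family", "H", 0), ("sports", "H", 1), ("sports", "H", 2),
            ("family", "L", 3), ("truck", "L", 4), ("family", "H", 5)],
}
left, right = split_attribute_lists(lists, "age", lambda age: age < 27.5)
print(left["car"])
print(right["car"])
```

Since order is preserved, the children's continuous lists never need to be sorted again.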


Parallel SPRINT

  • Environment : Shared nothing

  • Data placement and workload balancing

  • Parallel computation of categorical attribute lists


Repartition of Continuous Attributes

  • Global Sort

  • Equal re-partitioning

  • Relation between Cabove and Cbelow and processor number

  • Parallel computation of split index
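
One way to read the relation between the histograms and the processor number: after the global sort, each processor holds a contiguous section of the list, so its Cbelow has to start with the class counts of all lower-ranked sections and its Cabove with everything else. The single-process simulation below sketches that idea under several assumptions: processors are modeled as list slices, values are distinct, the candidate test is "value <= v" at each record, and all names and records are hypothetical.

```python
from collections import Counter

def gini_from_counts(counts):
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def initial_histograms(sections):
    """sections[p] is processor p's slice of the globally sorted list.
    Returns one (c_below, c_above) pair per processor."""
    local = [Counter(label for _, label, _ in s) for s in sections]
    total = sum(local, Counter())
    result, below = [], Counter()
    for counts in local:
        result.append((below.copy(), total - below))
        below.update(counts)      # lower-ranked sections seen so far
    return result

def local_best_split(section, c_below, c_above, n_total):
    """Scan one section; return (value, gini index) of the best 'value <= v' test."""
    c_below, c_above = c_below.copy(), c_above.copy()
    best = (None, float("inf"))
    for value, label, _rid in section:
        c_below[label] += 1
        c_above[label] -= 1
        n_below = sum(c_below.values())
        if n_below == n_total:
            break                 # nothing left above: not a real split
        score = n_below / n_total * gini_from_counts(c_below) \
              + (n_total - n_below) / n_total * gini_from_counts(c_above)
        if score < best[1]:
            best = (value, score)
    return best

# Two simulated processors, three records each (illustrative data).
sections = [[(17, "H", 1), (20, "H", 5), (23, "H", 0)],
            [(32, "L", 4), (43, "H", 2), (68, "L", 3)]]
hists = initial_histograms(sections)
n_total = sum(len(s) for s in sections)
local_bests = [local_best_split(s, b, a, n_total)
               for s, (b, a) in zip(sections, hists)]
global_best = min(local_bests, key=lambda pair: pair[1])
print(global_best)
```

Each simulated processor scans only its own section, and the global split is the minimum over the local results.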


Split Point for Categorical Attributes

  • Create global matrix at coordinator

  • Compute split-index
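
Building the global matrix amounts to summing the local count matrices cell by cell. The sketch below simulates this in a single process; the message exchange between processors and the coordinator is not modeled, and the function name and sample matrices are assumptions.

```python
from collections import Counter

def merge_count_matrices(local_matrices):
    """Sum per-processor count matrices (value -> class counts) into one."""
    global_matrix = {}
    for matrix in local_matrices:
        for value, counts in matrix.items():
            global_matrix.setdefault(value, Counter()).update(counts)
    return global_matrix

# Local matrices from two simulated processors (illustrative data).
p0 = {"family": Counter({"H": 1}), "sports": Counter({"H": 2})}
p1 = {"family": Counter({"H": 1, "L": 1}), "truck": Counter({"L": 1})}
global_matrix = merge_count_matrices([p0, p1])
print(global_matrix)
```

The coordinator can then score candidate subsets on the merged matrix exactly as in the serial case.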


Partitioning

  • Collect RIDs of the splitting attribute from all processors

  • Exchange RIDs


Figure: example split on Age < 27.5 (labels 0, 1, 2)


Figure: attribute list with class histograms Cbelow and Cabove (classes H, L) at cursor positions 0 and 3


Figure: attribute list and count matrix (classes H, L) for a categorical attribute with values family, sport, truck


Breakdown of Response Time


Scaleup of SPRINT


Speedup of SPRINT


Sizeup of SPRINT


Example: Decision Tree

Figure: decision tree with test nodes Age < 25 and CarType = sports and leaves High, High, Low

