# SPRINT: A Scalable Parallel Classifier for Data Mining

SPRINT: A Scalable Parallel Classifier for Data Mining, by John Shafer, Rakesh Agrawal, and Manish Mehta. Pathway: Terms, Partition Algorithm, Data Structures, Performing Split, Serial SPRINT, Parallel SPRINT, Results. Terms: Training Data Set; Attributes (Categorical and Continuous); Class Label.

Presentation Transcript

John Shafer, Rakesh Agrawal, Manish Mehta

• Terms

• Partition Algorithm

• Data Structures

• Performing Split

• Serial SPRINT

• Parallel SPRINT

• Results

• Training Data Set

• Attributes : Categorical and Continuous

• Class Label

    Partition( Data S ) {
        if all points in S are in the same class
            return
        for each attribute A
            evaluate splits on attribute A
        find best split
        partition S into S1 and S2
        call Partition( S1 )
        call Partition( S2 )
    }
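The recursion above can be sketched in Python. This is a minimal illustration under stated assumptions, not the SPRINT implementation: it handles only numeric attributes, searches thresholds by brute force instead of using pre-sorted attribute lists, and expects a non-empty training set. The names `gini`, `best_split`, `partition`, and `classify` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Brute-force search for the (attribute index, threshold) with the
    lowest weighted gini. Returns None if no attribute has two values."""
    n, best = len(rows), None
    for a in range(len(rows[0])):
        values = sorted({row[a] for row in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between consecutive distinct values
            left = [y for row, y in zip(rows, labels) if row[a] < t]
            right = [y for row, y in zip(rows, labels) if row[a] >= t]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, a, t)
    return None if best is None else (best[1], best[2])

def partition(rows, labels):
    """Recursive Partition(S): stop when S is pure, else split and recurse."""
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    split = best_split(rows, labels)
    if split is None:  # identical rows with mixed labels: fall back to majority
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    a, t = split
    s1 = [(r, y) for r, y in zip(rows, labels) if r[a] < t]
    s2 = [(r, y) for r, y in zip(rows, labels) if r[a] >= t]
    return {
        "attr": a,
        "threshold": t,
        "left": partition([r for r, _ in s1], [y for _, y in s1]),
        "right": partition([r for r, _ in s2], [y for _, y in s2]),
    }

def classify(tree, row):
    """Walk the tree built by partition() down to a leaf label."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["attr"]] < tree["threshold"] else tree["right"]
    return tree["leaf"]

# Illustrative toy data (one numeric attribute: age).
rows = [(23,), (17,), (43,), (68,), (32,), (20,)]
labels = ["High", "High", "High", "Low", "Low", "High"]
tree = partition(rows, labels)
```

Because the recursion only stops at pure nodes, the resulting tree reproduces every training label on this toy set.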

• Attribute Lists

• Histograms : Continuous and Categorical

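$\mathrm{Gini}(S) = 1 - \sum_j p_j^2$, where $p_j$ is the relative frequency of class $j$ in $S$

$\mathrm{Gini\ Index}(S) = \mathrm{Gini}(S_1)\,\frac{n_1}{n} + \mathrm{Gini}(S_2)\,\frac{n_2}{n}$, where $S_1$ and $S_2$ hold $n_1$ and $n_2$ of the $n$ records in $S$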
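As a worked example of the two formulas, assume a hypothetical node $S$ with 6 records, 4 of class High and 2 of class Low, split into $S_1$ (3 High, 0 Low) and $S_2$ (1 High, 2 Low). These counts are illustrative and are not taken from the slides.

$$
\begin{aligned}
\mathrm{Gini}(S) &= 1 - \left(\tfrac{4}{6}\right)^2 - \left(\tfrac{2}{6}\right)^2 = \tfrac{4}{9} \approx 0.444 \\
\mathrm{Gini}(S_1) &= 1 - 1^2 - 0^2 = 0 \\
\mathrm{Gini}(S_2) &= 1 - \left(\tfrac{1}{3}\right)^2 - \left(\tfrac{2}{3}\right)^2 = \tfrac{4}{9} \\
\mathrm{Gini\ Index}(S) &= 0 \cdot \tfrac{3}{6} + \tfrac{4}{9} \cdot \tfrac{3}{6} = \tfrac{2}{9} \approx 0.222
\end{aligned}
$$

The split lowers the index from about 0.444 to about 0.222; among candidate splits, the one with the lowest index is preferred.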

• Threshold value : Cabove and Cbelow

• Sorted Once and Sequential Scan

• Deallocation of Cabove and Cbelow
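The continuous-attribute scan can be sketched as follows. This is a minimal sketch, assuming the attribute list is a Python list of `(value, class_label, rid)` tuples that has already been sorted by value, and that candidate thresholds are midpoints between consecutive distinct values. The function names are hypothetical; `c_below` and `c_above` play the role of the Cbelow and Cabove histograms.

```python
from collections import Counter

def gini_from_counts(counts):
    """Gini impurity computed from a class histogram (label -> count)."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_continuous_split(attribute_list):
    """One sequential scan over a pre-sorted attribute list.

    Returns (threshold, gini_index) for the best split "value < threshold",
    or (None, inf) if the list has fewer than two distinct values.
    """
    n = len(attribute_list)
    c_below = Counter()                                       # nothing seen yet
    c_above = Counter(lbl for _, lbl, _ in attribute_list)    # everything ahead
    best_threshold, best_gini = None, float("inf")
    for i in range(n - 1):
        value, label, _rid = attribute_list[i]
        # Move the current record from "above" to "below" the cursor.
        c_below[label] += 1
        c_above[label] -= 1
        next_value = attribute_list[i + 1][0]
        if value == next_value:
            continue  # no threshold can separate equal values
        n_below = i + 1
        g = (n_below / n) * gini_from_counts(c_below) \
            + ((n - n_below) / n) * gini_from_counts(c_above)
        if g < best_gini:
            best_threshold, best_gini = (value + next_value) / 2, g
    return best_threshold, best_gini

# Illustrative toy attribute list for "age", sorted by value.
age_list = [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)]
threshold, index = best_continuous_split(age_list)
```

Only the two histograms are updated per record, so the scan needs a single pass and no re-sorting at each node.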

• Create Count-Matrix

• All subsets of attribute values as possible split point

• Compute Gini Index

• Gini from Count Matrix only

• Memory deallocation
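For a categorical attribute, the count matrix and the subset search might look like the sketch below. It assumes the same hypothetical `(value, class_label, rid)` attribute-list layout, and it enumerates every two-way grouping of values exhaustively, which is only practical for attributes with few distinct values. All names are illustrative.

```python
from collections import Counter, defaultdict
from itertools import combinations

def gini_from_counts(counts):
    """Gini impurity computed from a class histogram (label -> count)."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def build_count_matrix(attribute_list):
    """Count matrix: attribute value -> class histogram."""
    matrix = defaultdict(Counter)
    for value, label, _rid in attribute_list:
        matrix[value][label] += 1
    return matrix

def best_categorical_split(matrix):
    """Try every split of the values into two non-empty groups, using
    only the count matrix. Returns (left_subset, gini_index)."""
    values = sorted(matrix)
    best_subset, best_gini = None, float("inf")
    if len(values) < 2:
        return best_subset, best_gini
    total = Counter()
    for row in matrix.values():
        total.update(row)
    n = sum(total.values())
    # Pin the first value to the left group so that a subset and its
    # complement are not both evaluated.
    first, rest = values[0], values[1:]
    for k in range(len(rest)):  # k < len(rest) keeps the subset proper
        for extra in combinations(rest, k):
            subset = (first,) + extra
            left = Counter()
            for v in subset:
                left.update(matrix[v])
            right = total - left
            n_left = sum(left.values())
            g = (n_left / n) * gini_from_counts(left) \
                + ((n - n_left) / n) * gini_from_counts(right)
            if g < best_gini:
                best_subset, best_gini = subset, g
    return best_subset, best_gini

# Illustrative toy attribute list for "car type".
car_list = [("family", "High", 0), ("sports", "High", 1), ("sports", "High", 2),
            ("family", "Low", 3), ("truck", "Low", 4), ("family", "High", 5)]
matrix = build_count_matrix(car_list)
subset, index = best_categorical_split(matrix)
```

Once the matrix is built, the attribute list itself is no longer needed to score a subset, which is the point of computing the Gini index from the count matrix only.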

• Select splitting attribute and splitting value

• Create two child nodes and divide data on RIDs

• Optimization using Hashing <RID,child-ptr>

• Optimization depending on number of RIDs

• Partitioned Hashing for large hash-table

• Create new histogram and count-matrix of children
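The split step can be sketched as a hash-table probe keyed on RID. This is a simplified in-memory sketch: the dictionary `rid_to_child` stands in for the `<RID, child-ptr>` hash table, and the partitioned-hashing variant for tables that do not fit in memory is not shown. The function and parameter names are hypothetical.

```python
def split_attribute_lists(attribute_lists, split_attr, goes_left):
    """Divide every attribute list of a node between its two children.

    attribute_lists: dict mapping attribute name -> list of (value, label, rid)
    split_attr:      name of the attribute chosen for the split
    goes_left:       predicate on that attribute's value
    Returns (left_lists, right_lists), two dicts with the same keys.
    """
    # Pass 1: scan the splitting attribute's list and record each RID's child.
    rid_to_child = {}
    for value, _label, rid in attribute_lists[split_attr]:
        rid_to_child[rid] = 0 if goes_left(value) else 1

    # Pass 2: scan every list and route each entry by probing the table.
    # Appending in scan order keeps each child list in its original order,
    # so pre-sorted continuous lists stay sorted.
    children = ({}, {})
    for name, entries in attribute_lists.items():
        parts = ([], [])
        for entry in entries:
            parts[rid_to_child[entry[2]]].append(entry)
        children[0][name], children[1][name] = parts
    return children

# Illustrative toy lists for one node.
lists = {
    "age": [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)],
    "car_type": [("family", "High", 0), ("sports", "High", 1), ("sports", "High", 2),
                 ("family", "Low", 3), ("truck", "Low", 4), ("family", "High", 5)],
}
left, right = split_attribute_lists(lists, "age", lambda age: age < 27.5)
```

The histograms and count matrices for each child would then be rebuilt from the child's own lists before the next round of split evaluation.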

• Environment : Shared nothing

• Data placement and workload balancing

• Parallel computation of categorical attribute lists

• Global Sort

• Equal re-partitioning

• Relation between Cabove and Cbelow and processor number

• Parallel computation of split index

• Create global matrix at coordinator

• Compute split-index

• Collect RIDs of splitting attributes from processors

• Exchange RIDs
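The relation between Cabove, Cbelow, and processor number can be illustrated with a single-process simulation. The sketch assumes each processor holds one contiguous section of the globally sorted attribute list, so a processor's starting Cbelow must count all records held by lower-numbered processors, and its starting Cabove must count its own section plus all higher-numbered ones. No real message passing is performed, and `initial_histograms` is a hypothetical name.

```python
from collections import Counter

def initial_histograms(sections):
    """Starting (c_below, c_above) histograms for each processor.

    sections[p] is processor p's contiguous slice of a globally sorted
    attribute list, given as (value, class_label, rid) tuples.
    """
    # Each processor would compute its local class counts and exchange them;
    # here the exchange is simulated by building the full list of counts.
    local = [Counter(label for _, label, _ in section) for section in sections]
    total = Counter()
    for counts in local:
        total.update(counts)

    result = []
    below = Counter()  # running sum over lower-numbered processors
    for counts in local:
        above = total - below  # own section plus all higher-numbered sections
        result.append((Counter(below), above))
        below = below + counts
    return result

# Illustrative toy data: six sorted records divided equally over two processors.
sections = [
    [(17, "High", 1), (20, "High", 5), (23, "High", 0)],   # processor 0
    [(32, "Low", 4), (43, "High", 2), (68, "Low", 3)],     # processor 1
]
histograms = initial_histograms(sections)
```

With these starting values, each processor can scan its own section independently and evaluate the same candidate splits a serial scan would see at those positions.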

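[Figure: class histograms Cbelow and Cabove (columns H and L) shown at position 0 and at position 3 of an attribute list]

[Figure: attribute list and its count matrix, with columns H and L and rows family, sport, truck]

[Figure: decision tree with tests Age < 25 and CarType=sports and leaves High, High, Low]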