SPRINT : A Scalable Parallel Classifier for Data Mining


John Shafer, Rakesh Agrawal, Manish Mehta

- Terms
- Partition Algorithm
- Data Structures
- Performing Split
- Serial SPRINT
- Parallel SPRINT
- Results

- Training Data Set
- Attributes : Categorical and Continuous
- Class Label

Partition( Data S ) {
    if all points in S are in the same class
        return
    for each attribute A
        evaluate split on attribute A
    find best split
    partition S into S1 and S2
    call Partition( S1 )
    call Partition( S2 )
}
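
The Partition procedure above can be sketched in Python as follows. This is a naive illustration, not SPRINT itself: it handles only continuous attributes, rescans the rows for every candidate threshold instead of using attribute lists, and scores splits with the gini index. The function names, the tuple-based tree representation, and the `"class"` key are assumptions made for the example.

```python
from collections import Counter

def gini(labels):
    """gini(S) = 1 - sum of squared class frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, attrs):
    """Return (score, attribute, threshold) with the lowest weighted gini, or None."""
    best = None
    for a in attrs:
        values = sorted({r[a] for r in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r["class"] for r in rows if r[a] < t]
            right = [r["class"] for r in rows if r[a] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, a, t)
    return best

def partition(rows, attrs):
    """Recursively build a tree: a leaf is a class label, a node is (attr, t, left, right)."""
    labels = [r["class"] for r in rows]
    if len(set(labels)) == 1:
        return labels[0]                      # all points in the same class
    split = best_split(rows, attrs)
    if split is None:                         # no attribute separates the rows
        return Counter(labels).most_common(1)[0][0]
    _, a, t = split
    s1 = [r for r in rows if r[a] < t]
    s2 = [r for r in rows if r[a] >= t]
    return (a, t, partition(s1, attrs), partition(s2, attrs))

# Illustrative toy data (made up for this sketch).
rows = [{"age": a, "class": c} for a, c in
        [(23, "H"), (17, "H"), (43, "H"), (68, "L"), (32, "L"), (20, "H")]]
tree = partition(rows, ["age"])
```

On this toy data the root test comes out as age < 27.5, with a pure "H" leaf on the left.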

- Attribute Lists
- Histograms : Continuous and Categorical
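
As a rough sketch of the attribute-list idea, the snippet below builds one list per attribute, where each entry carries the attribute value, the class label, and the record id (RID). Lists for continuous attributes are sorted by value once, up front. The `Entry` type, the function name, and the toy records are assumptions for illustration only.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    value: object  # attribute value
    label: str     # class label
    rid: int       # record id: index of the training record

def build_attribute_lists(records, attributes, continuous):
    """Return {attribute: [Entry, ...]}; continuous lists are sorted by value."""
    lists = {}
    for attr in attributes:
        entries = [Entry(rec[attr], rec["class"], rid)
                   for rid, rec in enumerate(records)]
        if attr in continuous:
            entries.sort(key=lambda e: e.value)  # sorted once, never re-sorted
        lists[attr] = entries
    return lists

# Illustrative toy records (made up for this sketch).
records = [
    {"age": 23, "car_type": "family", "class": "H"},
    {"age": 17, "car_type": "sports", "class": "H"},
    {"age": 43, "car_type": "sports", "class": "H"},
    {"age": 68, "car_type": "family", "class": "L"},
    {"age": 32, "car_type": "truck",  "class": "L"},
    {"age": 20, "car_type": "family", "class": "H"},
]
lists = build_attribute_lists(records, ["age", "car_type"], continuous={"age"})
```

Because every entry keeps its RID, the lists for different attributes can later be split consistently even though they are ordered differently.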

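Gini(S) = 1 – Sum over classes j of ( Pj*Pj ), where Pj is the relative frequency of class j in S

Gini Index(S) = Gini(S1)*n1/n + Gini(S2)*n2/n, where the split divides the n records of S into S1 (n1 records) and S2 (n2 records)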
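
A small worked example of the two formulas, computed from per-class counts. The function names are chosen for this sketch, and the counts are made up.

```python
def gini(counts):
    """Gini(S) = 1 - sum(Pj * Pj), with Pj = counts[j] / n."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_index(counts1, counts2):
    """Weighted gini of a binary split: Gini(S1)*n1/n + Gini(S2)*n2/n."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return gini(counts1) * n1 / n + gini(counts2) * n2 / n

# S1 holds 3 records of one class; S2 holds 1 of that class and 2 of another.
# Gini(S1) = 0, Gini(S2) = 1 - (1/9 + 4/9) = 4/9, so the index is (3/6)*(4/9) = 2/9.
index = gini_index([3, 0], [1, 2])
```

A lower index means purer partitions, so the split with the smallest value is preferred.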

- Threshold value : Cabove and Cbelow
- Sorted Once and Sequential Scan
- Deallocation of Cabove and Cbelow
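
The sequential scan over a sorted continuous attribute list can be sketched as below. Cbelow starts empty and Cabove starts with the class counts of the whole list; each step moves one record from Cabove to Cbelow and scores the split at the midpoint between the current and next value. Taking the midpoint as the candidate threshold and the helper names are assumptions of this sketch.

```python
from collections import Counter

def gini(hist):
    """Gini of a class histogram (Counter of label -> count)."""
    n = sum(hist.values())
    return 1.0 - sum((c / n) ** 2 for c in hist.values())

def best_continuous_split(attr_list):
    """attr_list: [(value, label), ...] already sorted by value.

    Returns (best gini index, threshold)."""
    n = len(attr_list)
    c_below = Counter()
    c_above = Counter(label for _, label in attr_list)
    best = (float("inf"), None)
    for i in range(n - 1):
        value, label = attr_list[i]
        c_below[label] += 1   # the record at position i moves below the split
        c_above[label] -= 1
        next_value = attr_list[i + 1][0]
        if next_value == value:
            continue          # no threshold fits between equal values
        n_below = i + 1
        score = (n_below * gini(c_below) + (n - n_below) * gini(c_above)) / n
        if score < best[0]:
            best = (score, (value + next_value) / 2)
    return best

# Illustrative toy list (made up), sorted by age.
ages = [(17, "H"), (20, "H"), (23, "H"), (32, "L"), (43, "H"), (68, "L")]
score, threshold = best_continuous_split(ages)
```

Only the two histograms are kept in memory during the scan, which is why the list needs to be sorted just once.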

- Create Count-Matrix
- All subsets of attribute values as possible split points
- Compute Gini Index
- Gini from Count Matrix only
- Memory deallocation
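
For a categorical attribute, the steps above might look like the following sketch: one scan builds the count matrix (value by class), and every proper non-empty subset of values is then scored from the matrix alone. Exhaustive enumeration is only practical for small value sets; the function names and toy data are assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations

def gini(hist):
    n = sum(hist.values())
    return 1.0 - sum((c / n) ** 2 for c in hist.values())

def count_matrix(attr_list):
    """attr_list: [(value, label), ...] -> {value: Counter(label -> count)}."""
    matrix = defaultdict(Counter)
    for value, label in attr_list:
        matrix[value][label] += 1
    return matrix

def best_categorical_split(attr_list):
    """Return (best gini index, subset of values sent to the first child)."""
    matrix = count_matrix(attr_list)
    values = sorted(matrix)
    total = Counter()
    for row in matrix.values():
        total.update(row)
    n = sum(total.values())
    best = (float("inf"), None)
    for size in range(1, len(values)):            # proper, non-empty subsets
        for subset in combinations(values, size):
            inside = Counter()
            for v in subset:
                inside.update(matrix[v])
            outside = total - inside              # counts for the complement
            n_in = sum(inside.values())
            score = (n_in * gini(inside) + (n - n_in) * gini(outside)) / n
            if score < best[0]:
                best = (score, set(subset))
    return best

# Illustrative toy list (made up).
cars = [("family", "H"), ("sports", "H"), ("sports", "H"),
        ("family", "L"), ("truck", "L"), ("family", "H")]
score, subset = best_categorical_split(cars)
```

Once the matrix exists, the attribute list is not read again: all subset scores come from the matrix rows.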

- Select splitting attribute and splitting value
- Create two child nodes and divide data on RIDs
- Optimization using Hashing <RID,child-ptr>
- Optimization depending on number of RIDs
- Partitioned Hashing for large hash-table
- Create new histogram and count-matrix of children
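
A minimal sketch of the split step, assuming attribute lists of `(value, label, rid)` tuples: the list of the splitting attribute is scanned first to fill a hash table from RID to child, and every list is then routed by probing that table. A Python dict stands in for the hash table; the function name and the 0/1 child encoding are assumptions.

```python
def split_attribute_lists(attr_lists, split_attr, goes_left):
    """attr_lists: {attr: [(value, label, rid), ...]}.

    goes_left: predicate on a value of split_attr.
    Returns (left_lists, right_lists) with the same layout as attr_lists."""
    # Step 1: scan the splitting attribute's list and record rid -> child.
    rid_to_child = {}
    for value, _, rid in attr_lists[split_attr]:
        rid_to_child[rid] = 0 if goes_left(value) else 1

    # Step 2: scan every list and route each entry by probing the table.
    # Appending in scan order keeps continuous lists sorted in both children.
    children = ({}, {})
    for attr, entries in attr_lists.items():
        children[0][attr] = []
        children[1][attr] = []
        for entry in entries:
            children[rid_to_child[entry[2]]][attr].append(entry)
    return children

# Illustrative toy lists (made up); "age" is sorted by value.
age = [(17, "H", 1), (20, "H", 5), (23, "H", 0),
       (32, "L", 4), (43, "H", 2), (68, "L", 3)]
car = [("family", "H", 0), ("sports", "H", 1), ("sports", "H", 2),
       ("family", "L", 3), ("truck", "L", 4), ("family", "H", 5)]
left, right = split_attribute_lists({"age": age, "car_type": car},
                                    "age", lambda v: v < 27.5)
```

If the table does not fit in memory, the same two steps can be repeated over portions of the RIDs, which is the role of the partitioned hashing mentioned above.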

- Environment : Shared nothing
- Data placement and workload balancing
- Parallel computation of categorical attribute lists

- Global Sort
- Equal re-partitioning
- Relation between Cabove and Cbelow and processor number
- Parallel computation of split index
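
The relation between the histograms and the processor number can be illustrated with a single-process simulation. It assumes each processor holds one contiguous section of the globally sorted list, that Cbelow on a processor starts with the class counts of all lower-ranked sections, and that Cabove starts with the counts of the local section plus all higher-ranked ones. The function name and data are made up for the sketch.

```python
from collections import Counter

def initial_histograms(sections):
    """sections[p]: [(value, label), ...] held by processor p, sorted across p.

    Returns, per processor, the (Cbelow, Cabove) pair it starts its scan with."""
    local = [Counter(label for _, label in s) for s in sections]
    below = Counter()
    above = sum(local, Counter())      # class counts of the whole list
    result = []
    for counts in local:
        result.append((Counter(below), Counter(above)))
        below += counts                # this section lies below the next processor
        above -= counts
    return result

# Two simulated processors, three records each (toy data).
sections = [
    [(17, "H"), (20, "H"), (23, "H")],
    [(32, "L"), (43, "H"), (68, "L")],
]
hists = initial_histograms(sections)
```

With these starting histograms each processor can scan its own section independently, and the best local candidates are then compared to pick the overall split.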

- Create global matrix at coordinator
- Compute split-index
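
Building the global matrix amounts to summing the local count matrices entry by entry. The sketch below simulates the coordinator in one process; the function name and the dict-of-Counter layout are assumptions.

```python
from collections import Counter

def merge_count_matrices(local_matrices):
    """Sum per-processor count matrices {value: Counter(label -> count)}."""
    global_matrix = {}
    for matrix in local_matrices:
        for value, counts in matrix.items():
            global_matrix.setdefault(value, Counter()).update(counts)
    return global_matrix

# Local matrices from two simulated processors (toy data).
p0 = {"family": Counter(H=1), "sports": Counter(H=2)}
p1 = {"family": Counter(H=1, L=1), "truck": Counter(L=1)}
global_matrix = merge_count_matrices([p0, p1])
```

The split index is then computed from `global_matrix` exactly as in the serial case.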

- Collect RIDs of the splitting attribute from processors
- Exchange RIDs
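
A single-process simulation of the RID exchange, under the assumption that each processor first decides the child for the RIDs in its own section of the splitting attribute's list and that the exchange leaves every processor with the complete RID-to-child mapping. The function name and 0/1 child encoding are hypothetical.

```python
def gather_rid_assignments(split_sections, goes_left):
    """split_sections[p]: [(value, label, rid), ...] on processor p.

    Returns the full rid -> child mapping every processor holds afterwards."""
    # Each processor decides the child for the RIDs in its own section.
    local_maps = [
        {rid: (0 if goes_left(value) else 1) for value, _, rid in section}
        for section in split_sections
    ]
    # The all-to-all exchange is simulated by merging the local maps.
    full = {}
    for local in local_maps:
        full.update(local)
    return full

# Sections of the splitting attribute's list on two simulated processors.
sections = [
    [(17, "H", 1), (20, "H", 5), (23, "H", 0)],
    [(32, "L", 4), (43, "H", 2), (68, "L", 3)],
]
rid_to_child = gather_rid_assignments(sections, lambda v: v < 27.5)
```

With the full mapping available locally, each processor can split its sections of the other attribute lists without further communication.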

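[Figure: split test Age < 27.5, with nodes numbered 0, 1, 2]

[Figure: evaluating splits on a continuous attribute list; Cbelow and Cabove histograms (class columns H and L) shown at position 0 and position 3]

[Figure: count matrix built from a categorical attribute list; class columns H and L, rows family, sport, truck]

[Figure: Breakdown of Response Time]

[Figure: Scaleup of SPRINT]

[Figure: Speedup of SPRINT]

[Figure: Sizeup of SPRINT]

Example: Decision Tree

[Figure: decision tree with tests Age < 25 and CarType = sports; leaves High, High, Low]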