SPRINT : A Scalable Parallel Classifier for Data Mining


John Shafer, Rakesh Agrawal, Manish Mehta

PATHWAY
  • Terms
  • Partition Algorithm
  • Data Structures
  • Performing Split
  • Serial SPRINT
  • Parallel SPRINT
  • Results
Terms
  • Training Data Set
  • Attributes : Categorical and Continuous
  • Class Label
Partition Algorithm

Partition( Data S ) {
    if all points in S are in the same class
        return
    for each attribute A
        evaluate split on attribute A
    find best split
    partition S into S1 and S2
    call Partition( S1 )
    call Partition( S2 )
}
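The recursion above can be sketched in Python. This is a minimal illustration under stated assumptions: attributes are numeric only, splits are evaluated by brute force over midpoints rather than with SPRINT's attribute lists, and the names `gini`, `best_split`, `partition` and the dict-based tree are hypothetical choices made for this example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows):
    """Return (score, attribute index, threshold) with the lowest weighted gini,
    or None if no attribute has two distinct values."""
    best = None
    n = len(rows)
    for a in range(len(rows[0][0])):
        values = sorted({x[a] for x, _ in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between consecutive values
            left = [y for x, y in rows if x[a] < t]
            right = [y for x, y in rows if x[a] >= t]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, a, t)
    return best

def partition(rows):
    """rows: list of (attribute_tuple, class_label). Returns a label or a nested dict."""
    labels = [y for _, y in rows]
    if len(set(labels)) == 1:
        return labels[0]  # all points in S are in the same class
    split = best_split(rows)
    if split is None:  # identical attribute values: fall back to the majority class
        return Counter(labels).most_common(1)[0][0]
    _, a, t = split
    s1 = [r for r in rows if r[0][a] < t]
    s2 = [r for r in rows if r[0][a] >= t]
    return {"attr": a, "threshold": t, "left": partition(s1), "right": partition(s2)}

# Illustrative data: a single numeric attribute and a High/Low class label.
rows = [((23,), "High"), ((17,), "High"), ((43,), "Low"), ((68,), "Low")]
tree = partition(rows)
```

Each threshold lies strictly between two distinct values, so both children are non-empty and the recursion terminates.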

Data Structures
  • Attribute Lists
  • Histograms : Continuous and Categorical
Finding Split Point

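Gini(S) = 1 - Sum_j( Pj*Pj ), where Pj is the relative frequency of class j in S

Gini Index(S) = Gini(S1)*n1/n + Gini(S2)*n2/n, where the split divides the n records of S into S1 (n1 records) and S2 (n2 records)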
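As a worked example of the two formulas, the sketch below computes them from per-class record counts, which is the information a class histogram holds. The function names and the sample counts are assumptions made for illustration.

```python
def gini(counts):
    """Gini(S) = 1 - sum_j Pj^2, computed from per-class record counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(counts1, counts2):
    """Weighted gini of a binary split into S1 and S2."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return gini(counts1) * n1 / n + gini(counts2) * n2 / n

# Six records with classes (High, Low):
# S1 holds 3 High and 1 Low, S2 holds 0 High and 2 Low.
print(gini([3, 1]))                # 0.375
print(gini_split([3, 1], [0, 2]))  # 0.25
```

A lower split index means purer children, so the candidate with the smallest value is chosen.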

Split on Continuous Attributes
  • Threshold value : Cabove and Cbelow
  • Sorted Once and Sequential Scan
  • Deallocation of Cabove and Cbelow
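The sequential scan can be sketched as follows, assuming an attribute list of `(value, class_label, rid)` tuples that is already sorted by value and candidate thresholds at midpoints between consecutive distinct values. `c_below` and `c_above` play the role of the Cbelow and Cabove histograms; the function name and sample data are illustrative.

```python
from collections import Counter

def gini(counts, n):
    """Gini impurity from a class histogram holding n records."""
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

def best_continuous_split(attr_list):
    """attr_list: (value, class_label, rid) tuples, already sorted by value.
    Returns (threshold, split index) for the best split, or (None, inf)."""
    n = len(attr_list)
    c_below = Counter()                                    # records already scanned
    c_above = Counter(label for _, label, _ in attr_list)  # records not yet scanned
    best_score, best_threshold = float("inf"), None
    for i in range(n - 1):
        value, label, _ = attr_list[i]
        c_below[label] += 1  # move the current record from "above" to "below"
        c_above[label] -= 1
        next_value = attr_list[i + 1][0]
        if value == next_value:
            continue  # no threshold can separate equal values
        n_below, n_above = i + 1, n - (i + 1)
        score = (n_below / n) * gini(c_below, n_below) + (n_above / n) * gini(c_above, n_above)
        if score < best_score:
            best_score, best_threshold = score, (value + next_value) / 2
    return best_threshold, best_score

# Illustrative Age list with a High/Low class label.
age_list = [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)]
threshold, score = best_continuous_split(age_list)  # threshold is 27.5
```

Because the list is sorted once up front, each node needs only a single pass and two small histograms per attribute.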
Split on Categorical Attributes
  • Create Count-Matrix
  • All subsets of attribute values as possible split point
  • Compute Gini Index
  • Gini from Count Matrix only
  • Memory deallocation
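A minimal sketch of the categorical case, assuming exhaustive enumeration of value subsets (practical only for small attribute domains). The count matrix is built in one scan of the attribute list, and every split index is then computed from the matrix alone. Function names and sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_count_matrix(attr_list):
    """attr_list: (value, class_label, rid) tuples. Returns {value: {class: count}}."""
    matrix = defaultdict(lambda: defaultdict(int))
    for value, label, _ in attr_list:
        matrix[value][label] += 1
    return matrix

def gini(counts):
    """Gini impurity from a {class: count} histogram."""
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

def best_categorical_split(matrix):
    """Try every proper, non-empty subset of values as the 'inside' branch."""
    values = sorted(matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    best_score, best_subset = float("inf"), None
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            inside, outside = defaultdict(int), defaultdict(int)
            for v in values:
                target = inside if v in subset else outside
                for label, c in matrix[v].items():
                    target[label] += c
            n_in, n_out = sum(inside.values()), sum(outside.values())
            score = n_in / total * gini(inside) + n_out / total * gini(outside)
            if score < best_score:
                best_score, best_subset = score, set(subset)
    return best_subset, best_score

# Illustrative CarType list with a High/Low class label.
car_list = [("family", "High", 0), ("sports", "High", 1), ("sports", "High", 2),
            ("family", "Low", 3), ("truck", "Low", 4), ("family", "High", 5)]
subset, score = best_categorical_split(build_count_matrix(car_list))
```

The enumeration visits each subset and its complement, which describe the same split; a tighter version could skip half of them.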
Perform Split and Partitioning
  • Select splitting attribute and splitting value
  • Create two child nodes and divide data on RIDs
  • Optimization using Hashing <RID,child-ptr>
  • Optimization depending on number of RIDs
  • Partitioned Hashing for large hash-table
  • Create new histogram and count-matrix of children
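The hashing step can be sketched like this, assuming every attribute list stores `(value, class_label, rid)` tuples and the whole RID table fits in memory (the partitioned-hashing case is not covered). The function name, the `goes_left` predicate, and the sample lists are assumptions for illustration.

```python
def split_attribute_lists(attr_lists, split_attr, goes_left):
    """attr_lists: {attribute name: [(value, class_label, rid), ...]}.
    Returns (left_lists, right_lists) with the same layout."""
    # Pass 1: scan the splitting attribute's list and record each RID's child.
    rid_to_child = {rid: ("left" if goes_left(value) else "right")
                    for value, _, rid in attr_lists[split_attr]}
    # Pass 2: probe the table for every list. Appending in scan order
    # preserves each list's existing sort order in both children.
    children = {"left": {}, "right": {}}
    for name, entries in attr_lists.items():
        for child in children.values():
            child[name] = []
        for entry in entries:
            children[rid_to_child[entry[2]]][name].append(entry)
    return children["left"], children["right"]

# Illustrative lists: Age is sorted by value, CarType is in RID order.
lists = {
    "age": [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)],
    "car": [("family", "High", 0), ("sports", "High", 1), ("sports", "High", 2),
            ("family", "Low", 3), ("truck", "Low", 4), ("family", "High", 5)],
}
left, right = split_attribute_lists(lists, "age", lambda age: age < 27.5)
```

Since child lists stay sorted, no re-sorting is needed at lower levels of the tree.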
Parallel SPRINT
  • Environment : Shared nothing
  • Data placement and workload balancing
  • Parallel computation of categorical attribute lists
Repartition of Continuous Attributes
  • Global Sort
  • Equal re-partitioning
  • Relation between Cabove and Cbelow and processor number
  • Parallel computation of split index
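The relation between the histograms and the processor number can be illustrated with a single-process simulation. The assumption here is that processor p holds the p-th contiguous section of the globally sorted list, so its Cbelow starts with the class counts of all lower-ranked sections and its Cabove starts with the counts of its own section plus all higher-ranked ones. The function name and the plain-list stand-in for message passing are hypothetical.

```python
from collections import Counter

def initial_histograms(sections):
    """sections[p]: processor p's contiguous slice of a globally sorted
    attribute list of (value, class_label, rid) tuples.
    Returns one (c_below, c_above) pair per processor."""
    # Each processor first counts the classes in its own section.
    local = [Counter(label for _, label, _ in sec) for sec in sections]
    result = []
    for p in range(len(sections)):
        c_below = sum(local[:p], Counter())  # sections of lower rank
        c_above = sum(local[p:], Counter())  # local section and higher ranks
        result.append((c_below, c_above))
    return result

# Two simulated processors, three records each.
sections = [
    [(17, "High", 1), (20, "High", 5), (23, "High", 0)],
    [(32, "Low", 4), (43, "High", 2), (68, "Low", 3)],
]
histograms = initial_histograms(sections)
```

With these starting histograms, each processor can scan its section independently and report its best local split; the smallest split index across processors wins.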
Split point for Categorical Attributes
  • Create global matrix at coordinator
  • Compute split-index
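Building the global matrix amounts to an element-wise sum of the per-processor count matrices. The sketch below simulates the coordinator in a single process; the function name and the `{value: Counter}` layout are assumptions, and real message passing is left out.

```python
from collections import Counter

def merge_count_matrices(local_matrices):
    """local_matrices: one {attribute_value: Counter(class -> count)} per processor.
    Returns the global count matrix in the same layout."""
    global_matrix = {}
    for matrix in local_matrices:
        for value, class_counts in matrix.items():
            # Counter.update adds counts instead of overwriting them.
            global_matrix.setdefault(value, Counter()).update(class_counts)
    return global_matrix

# Local matrices from two simulated processors.
p0 = {"family": Counter(High=2), "sports": Counter(High=1)}
p1 = {"family": Counter(Low=1), "sports": Counter(High=1), "truck": Counter(Low=1)}
global_matrix = merge_count_matrices([p0, p1])
```

Once merged, the split index for the attribute is computed from the global matrix exactly as in the serial case.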
Partitioning
  • Collect RIDs of splitting attributes from processors
  • Exchange RIDs
[Figure: attribute list with its Cbelow and Cabove class histograms (columns H and L), shown at Position 0 and at Position 3]
[Figure: attribute list and its count matrix, with rows family, sport, truck and class columns H and L]
Example: Decision Tree

[Figure: decision tree with test nodes Age < 25 and CarType=sports, and leaves High, High, Low]