sliq and sprint for disk resident data n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
SLIQ and SPRINT for disk resident data PowerPoint Presentation
Download Presentation
SLIQ and SPRINT for disk resident data

Loading in 2 Seconds...

play fullscreen
1 / 25

SLIQ and SPRINT for disk resident data - PowerPoint PPT Presentation


  • 157 Views
  • Uploaded on

SLIQ and SPRINT for disk resident data. SLIQ. SLIQ is a decision tree classifier that can handle both numerical and categorical attributes Builds compact and accurate trees Uses a pre-sorting technique in the tree growing phase Suitable for classification of large disk-resident datasets.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'SLIQ and SPRINT for disk resident data' - zed


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
slide2
SLIQ
  • SLIQ is a decision tree classifier that can handle both numerical and categorical attributes
  • Builds compact and accurate trees
  • Uses a pre-sorting technique in the tree growing phase
  • Suitable for classification of large disk-resident datasets.
issues
Issues
  • There are two major, criticalperformance, issues in the tree-growth phase:
    • How to find splitpoints
    • How to partition the data
  • The well-knowndecision treeclassifiers:
    • Grow trees depth-first
    • Repeatedly sort the data at every node
  • SLIQ:
    • Replace this repeated sorting with one-time sort
    • Use new a data structure call class-list
    • Class-list must remain memory resident at all time
sliq attribute lists
SLIQ - Attribute Lists

These are projections on (rid, attribute).

sliq histograms
SLIQ - Histograms

N1

age25 ?

age30 ?

Evaluate each split, using GINI or Entropy.

...

sliq histograms1
SLIQ - Histograms

N1

age25

age30

Evaluate each split, using GINI or Entropy.

...

sliq histograms2
SLIQ - Histograms

N1

salary20

salary30

Evaluate each split, using GINI or Entropy.

...

sliq histograms3
SLIQ - Histograms

N1

Married

Single

Evaluate each split, using GINI or Entropy.

sliq histograms4

N1

salary60

N2

N3

SLIQ - Histograms

N1

N2

N1

age25 ?

N2

Evaluate each split, using GINI or Entropy.

...

sliq histograms5

N1

salary60

N2

N3

SLIQ - Histograms

N1

N2

N1

age25

N2

Evaluate each split, using GINI or Entropy.

...

sliq pseudocode
SLIQ - Pseudocode
  • Split evaluation:

EvaluateSplits()

for each numeric attribute A do

for each value v in the attribute list do

find the corresponding entry in the class list, and hence the corresponding class and the leaf node Ni

update the class histogram in leaf Ni

compute splitting score for test (A ≤ v) for Ni

for each categorical attribute A do

for each leaf of the tree do

find subset of A with best split

sliq pseudocode1
SLIQ - Pseudocode
  • Updating the class list

UpdateLabels()

for each split leaf Nido

Let A be the split attribute for Ni.

for each (rid,v)in the attribute list for Ado

find the corresponding entry in the class list e (using the rid)

if the leaf referenced by e is Ni then

find the new leaf Nj to which (rid,v)belongs

(by applying the splitting test)

update the leaf pointer for e to Nj

sliq bottleneck
SLIQ - bottleneck
  • Class-list must remain memory resident at all time!
    • Although not a big problem with today's memories, still there might be cases where this is a bottleneck.
  • So, what can we do when the class-list doesn't fit in main memory?
    • SPRINT is a solution...
sprint
SPRINT

The main data structures used in SPRINT are:

Attribute lists and Class histograms

sprint histograms
SPRINT - Histograms

age25

age30

Evaluate each split, using GINI or Entropy.

...

sprint histograms1
SPRINT - Histograms

salary20

salary30

Evaluate each split, using GINI or Entropy.

...

sprint histograms2
SPRINT - Histograms

Married

Single

Evaluate each split, using GINI or Entropy.

sprint performing best split
SPRINT - Performing Best Split
  • Once the best split point has been found for a node, we execute the split by creating child nodes.
  • Requires splitting the node’s lists for every attribute into two.
  • Partitioning the attribute list of the winning attribute (salary) is easy.
    • We scan the list, apply the split test, and move the records to two new attribute lists - one for each new child.
sprint performing best split1
SPRINT - Performing Best Split
  • Unfortunately, for the remaining attribute lists of the node (age and marital), we have no test that we can apply to the attribute values to decide how to divide the records.
  • Solution: use the rids.
    • As we partition the list of the splitting attribute (i.e. salary), we insert the rids of each record into a probe structure (hash table), noting to which child the record was moved.
  • Once we have collected all the rids, we scan the lists of the remaining attributes and probe the hash table with the rid of each record.
    • The retrieved information tells us with which child to place the record.
sprint performing best split2
SPRINT - Performing Best Split
  • If the hash-table is too large for the memory, splitting is done in more than one step.
    • The attribute list for the splitting attribute is partitioned up to the attribute record for which the hash table will fit in memory;
    • Portions of attribute lists of non-splitting attributes are partitioned; and the process is repeated for the remainder of the attribute list of the splitting attribute.