SLIQ: A Fast Scalable Classifier for Data Mining

Manish Mehta, Rakesh Agrawal, Jorma Rissanen

1996.

Presentation by: Vladan Radosavljevic


Outline

  • Introduction

  • Motivation

  • SLIQ Algorithm

    • Building tree

    • Pruning

    • Example

  • Results

  • Conclusion


Introduction

  • Most classification algorithms are designed for memory-resident data, which limits their suitability for mining large datasets

  • Solution: build a scalable classifier, SLIQ

  • SLIQ stands for Supervised Learning In Quest; Quest was the data mining project at IBM


Motivation

  • Recall (ID3, C4.5, CART):


Motivation

  • Why classic decision trees do not scale:

  • The complexity lies in determining the best split for each attribute

  • For numerical attributes, the cost of evaluating splits is dominated by sorting the values at each node

  • For categorical attributes, the cost of evaluating splits is dominated by searching for the best subset of values

  • Pruning

    • cross-validation is not applicable to large datasets

    • splitting the data into a training set and a test set leaves open how large each part should be and how the data should be distributed between them


Motivation

  • Improve scalability of tree classifiers

  • Previous proposals:

    • Sampling data at each node

    • Discretization of numerical attributes

    • Partitioning the input data and building a tree for each partition

    • All of these methods reduce classification accuracy

  • SLIQ: improve learning time without loss in accuracy


SLIQ

  • Key features:

    • Tree classifier, handling both numerical and categorical attributes

    • Presorts numerical attributes once, before the tree is built

    • Breadth first growing strategy

    • Goodness test – Gini index

    • Inexpensive tree pruning algorithm based on Minimum Description Length (MDL)
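
The Gini index used as the goodness test can be sketched as follows. This is a minimal illustration, assuming the usual definition (one minus the sum of squared class frequencies) and a size-weighted average over the two sides of a binary split; the function names `gini` and `gini_split` and the sample labels are made up for this example.

```python
from collections import Counter


def gini(labels):
    """Gini index of a collection of class labels: 1 - sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_split(left, right):
    """Size-weighted Gini index of a binary split; lower values mean purer partitions."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


# Illustrative labels only (not taken from the presentation).
mixed = gini(["G", "G", "B", "B"])               # evenly mixed node
pure_split = gini_split(["G", "G"], ["B", "B"])  # split that separates the classes
print(mixed, pure_split)
```

A split that separates the classes perfectly scores 0, which is why the tree builder looks for the candidate split with the lowest weighted index.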


SLIQ - Algorithm

  • Eliminate the need to sort the data at each node

  • Create a sorted list for each numerical attribute

  • Create a class list
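
The two data structures named above might be built as in the sketch below. It assumes each attribute list stores (value, record index) pairs and the class list stores a class label plus the leaf the record currently belongs to; the helper `build_lists`, the root leaf id 0, and the sample records are hypothetical.

```python
def build_lists(records, numeric_attrs, label_key):
    """Build one presorted list per numeric attribute, plus a class list.

    Each attribute list holds (value, record_index) pairs sorted by value.
    The class list holds one mutable [class_label, leaf_id] entry per record;
    every record starts in the root leaf, called 0 here (an assumption).
    """
    class_list = [[rec[label_key], 0] for rec in records]
    attr_lists = {
        attr: sorted((rec[attr], idx) for idx, rec in enumerate(records))
        for attr in numeric_attrs
    }
    return attr_lists, class_list


# Made-up training records for illustration.
records = [
    {"age": 30, "salary": 65, "cls": "G"},
    {"age": 23, "salary": 15, "cls": "B"},
    {"age": 40, "salary": 75, "cls": "G"},
]
attr_lists, class_list = build_lists(records, ["age", "salary"], "cls")
print(attr_lists["age"])
print(class_list)
```

Sorting happens once here, up front; later steps only scan these lists and never re-sort them.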


SLIQ - Algorithm

  • Example:


SLIQ - Algorithm

  • Split evaluation:
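
One way to picture split evaluation on a presorted list is the sketch below. It is an illustration under stated assumptions, not the slide's original content: a single pass over one attribute list keeps per-leaf class histograms for the records already scanned and those still ahead, and scores each candidate test `value <= threshold` with the weighted Gini index. Using the attribute value itself as the threshold, and the names `evaluate_numeric_splits` and `gini_from_counts`, are choices made for this example.

```python
from collections import Counter, defaultdict
from itertools import groupby


def gini_from_counts(counts):
    """Gini index computed from a class -> count histogram."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def evaluate_numeric_splits(attr_list, class_list):
    """Scan one presorted attribute list once; return leaf_id -> (best_score, threshold).

    attr_list: (value, record_index) pairs sorted by value.
    class_list: record_index -> (class_label, leaf_id).
    """
    above = defaultdict(Counter)  # per leaf: classes of records not yet scanned
    below = defaultdict(Counter)  # per leaf: classes of records already scanned
    for label, leaf in class_list:
        above[leaf][label] += 1

    best = {}
    # Group equal values so a threshold is only scored once all ties are counted.
    for value, group in groupby(attr_list, key=lambda pair: pair[0]):
        touched = set()
        for _, idx in group:
            label, leaf = class_list[idx]
            below[leaf][label] += 1
            above[leaf][label] -= 1
            touched.add(leaf)
        for leaf in touched:
            n_above = sum(above[leaf].values())
            if n_above == 0:
                continue  # everything falls on one side: not a real split
            n_below = sum(below[leaf].values())
            n = n_below + n_above
            score = (n_below / n) * gini_from_counts(below[leaf]) \
                + (n_above / n) * gini_from_counts(above[leaf])
            if leaf not in best or score < best[leaf][0]:
                best[leaf] = (score, value)
    return best


# Made-up data: four records, all still in the root leaf 0.
class_list = [("G", 0), ("B", 0), ("G", 0), ("B", 0)]
attr_list = [(15, 1), (40, 3), (60, 0), (75, 2)]
best = evaluate_numeric_splits(attr_list, class_list)
print(best)
```

Because the histograms are kept per leaf, the same pass evaluates candidate splits for every leaf of the current tree level, which fits the breadth-first growing strategy.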


SLIQ - Algorithm

  • Example:


SLIQ - Algorithm

  • Update class list:
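
A possible shape for the class list update is sketched here, assuming the chosen test for a leaf is `value <= threshold` on a numerical attribute and that the children receive fresh leaf ids. The function `update_class_list`, the `splits` mapping, and the sample data are hypothetical.

```python
def update_class_list(attr_list, class_list, splits):
    """Reassign records to child leaves by scanning the split attribute's list.

    attr_list: (value, record_index) pairs for the attribute used in the splits.
    class_list: mutable [class_label, leaf_id] entries, updated in place.
    splits: leaf_id -> (threshold, left_child_id, right_child_id) for every
            leaf that splits on this attribute; child ids must be new ids.
    """
    for value, idx in attr_list:
        leaf = class_list[idx][1]
        if leaf in splits:
            threshold, left_child, right_child = splits[leaf]
            class_list[idx][1] = left_child if value <= threshold else right_child


# Made-up data: root leaf 0 splits at threshold 40 into leaves 1 and 2.
class_list = [["G", 0], ["B", 0], ["G", 0], ["B", 0]]
attr_list = [(15, 1), (40, 3), (60, 0), (75, 2)]
update_class_list(attr_list, class_list, {0: (40, 1, 2)})
print(class_list)
```

Only the leaf ids in the class list change; the attribute lists stay in their presorted order and are reused at the next tree level.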


SLIQ - Algorithm

  • Example:


SLIQ - Algorithm

  • For categorical attributes with large cardinality (above a threshold), the best split is computed greedily; otherwise all possible subsets are evaluated

  • When a node becomes pure, stop splitting it, then condense the attribute lists by discarding the examples that belong to the pure node

  • SLIQ scales to large datasets with no loss in accuracy: the splits evaluated with or without pre-sorting are identical
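
The choice between exhaustive and greedy subset search for categorical attributes could look like the sketch below. It assumes the test has the form `value in subset`, scores subsets with the weighted Gini index, and grows the greedy subset one value at a time until the score stops improving; the parameter `max_exhaustive` and its default of 10 are assumptions for illustration, as are the function names and sample counts.

```python
from collections import Counter
from itertools import combinations


def gini(counts):
    """Gini index computed from a class -> count histogram."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_score(subset, value_counts):
    """Weighted Gini index of the test 'value in subset'."""
    left, right = Counter(), Counter()
    for value, counts in value_counts.items():
        (left if value in subset else right).update(counts)
    n_left, n_right = sum(left.values()), sum(right.values())
    n = n_left + n_right
    return (n_left / n) * gini(left) + (n_right / n) * gini(right)


def best_subset(value_counts, max_exhaustive=10):
    """Pick a subset of attribute values for a binary split.

    value_counts: attribute value -> Counter of class labels.
    max_exhaustive: hypothetical cardinality threshold for trying all subsets.
    """
    values = sorted(value_counts)
    if len(values) < 2:
        return frozenset()  # nothing to split on
    if len(values) <= max_exhaustive:
        # Small cardinality: score every non-empty proper subset.
        candidates = (frozenset(combo)
                      for size in range(1, len(values))
                      for combo in combinations(values, size))
        return min(candidates, key=lambda s: split_score(s, value_counts))
    # Large cardinality: add the single best value until no improvement.
    chosen, best = frozenset(), float("inf")
    while len(chosen) < len(values) - 1:
        score, value = min((split_score(chosen | {v}, value_counts), v)
                           for v in values if v not in chosen)
        if score >= best:
            break
        chosen, best = chosen | {value}, score
    return chosen


# Made-up class histograms for three attribute values.
value_counts = {"a": Counter(G=3), "b": Counter(B=2), "c": Counter(G=2)}
print(best_subset(value_counts))                    # exhaustive search
print(best_subset(value_counts, max_exhaustive=0))  # forces the greedy path
```

The exhaustive branch scores each subset and its complement separately, which is redundant but keeps the sketch short; the greedy branch trades optimality for a number of evaluations that grows quadratically rather than exponentially with the cardinality.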


SLIQ - Pruning

  • Post pruning algorithm based on Minimum Description Length principle

  • Find a model that minimizes:

    Cost(M,D) = Cost(D|M) + Cost(M)

    Cost(M) - cost of encoding the model M

    Cost(D|M) - cost of encoding the data D given the model M


SLIQ - Pruning

  • Cost of the data: classification error

  • Cost of the model:

    • Encoding the tree: number of bits

    • Encoding the splits:

      • numerical attribute - constant (empirically 1)

      • categorical attribute - depends on cardinality

  • MDL pruning evaluates the code length at each node to decide whether to prune one or both children or to leave the node intact


SLIQ - Pruning

  • Three pruning strategies:

    • Full – prune both children and convert the node to a leaf

    • Partial – convert the node to a leaf, prune only the left child, prune only the right child, or leave the node intact

    • Hybrid – apply the Full method first, then the remaining Partial options (prune left, prune right, or leave intact)
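
The Full strategy can be illustrated with a bottom-up pass that compares two code lengths at each internal node. This is a sketch under assumptions, not the authors' exact encoding: one bit per node for the tree structure (`NODE_BITS`), a split cost of 1 for numerical tests as mentioned on the previous slide, and the number of misclassified training records as the data cost. The `Node` class and the example tree are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

NODE_BITS = 1.0  # assumed: one bit records whether a node is a leaf or has two children


@dataclass
class Node:
    errors: int               # training records misclassified if this node were a leaf
    split_cost: float = 1.0   # cost of encoding the test (assumed 1 for numerical attributes)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def prune_full(node):
    """Prune bottom-up; return the code length of the (possibly pruned) subtree."""
    cost_as_leaf = NODE_BITS + node.errors
    if node.left is None or node.right is None:
        return cost_as_leaf
    cost_as_subtree = (NODE_BITS + node.split_cost
                       + prune_full(node.left) + prune_full(node.right))
    if cost_as_leaf <= cost_as_subtree:
        node.left = node.right = None  # convert the node to a leaf
        return cost_as_leaf
    return cost_as_subtree


# Made-up tree: the right subtree's split does not reduce errors (2 + 2 = 4).
tree = Node(errors=10,
            left=Node(errors=0),
            right=Node(errors=4, left=Node(errors=2), right=Node(errors=2)))
total_cost = prune_full(tree)
print(total_cost, tree.right.left is None)
```

In this example the right subtree is collapsed because keeping its split costs more bits than it saves in errors, while the root split is kept because it removes most of the errors.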


Results

  • SLIQ was tested on the datasets:


Results

  • Pruning strategy comparison:


Results

  • Accuracy:


Results

  • Scalability:


Conclusion

  • SLIQ proves to be a fast, low-cost and scalable classifier that builds accurate trees

  • In empirical tests against other tree-based classifiers, SLIQ achieves comparable accuracy while producing smaller decision trees

  • Open question on scalability: memory can become a problem as the number of attributes or the number of classes increases


References

[1] M. Mehta, R. Agrawal and J. Rissanen, "SLIQ: A Fast Scalable Classifier for Data Mining", in Proceedings of the 5th International Conference on Extending Database Technology, Avignon, France, Mar. 1996.


THANK YOU!
