SLIQ: A Fast Scalable Classifier for Data Mining

Manish Mehta, Rakesh Agrawal, Jorma Rissanen

1996.

Presentation by: Vladan Radosavljevic

Outline
  • Introduction
  • Motivation
  • SLIQ Algorithm
    • Building tree
    • Pruning
    • Example
  • Results
  • Conclusion
Introduction
  • Most classification algorithms are designed for memory-resident data, which limits their suitability for mining large datasets
  • Solution – build a scalable classifier: SLIQ
  • SLIQ stands for Supervised Learning In Quest; Quest was the data mining project at IBM
Motivation
  • Recall (ID3, C4.5, CART):
Motivation
  • NON SCALABLE DECISION TREES:
  • The complexity lies in determining the best split for each attribute
  • The cost of evaluating splits for numerical attributes is dominated by the cost of sorting values at each node
  • The cost of evaluating splits for categorical attributes is dominated by the cost of searching for the best subset
  • Pruning
    • cross-validation is impractical for large datasets
    • splitting the data into a training set and a test set leaves open how to choose their sizes and distributions
Motivation
  • Improve scalability of tree classifiers
  • Previous proposals:
    • Sampling data at each node
    • Discretization of numerical attributes
    • Partitioning the input data and building a tree for each partition
    • All of these methods lose accuracy!
  • SLIQ – improves learning time without loss of accuracy!
SLIQ
  • Key features:
    • Tree classifier, handling both numerical and categorical attributes
    • Presorts numerical attributes once, before the tree is built
    • Breadth first growing strategy
    • Goodness test – Gini index
    • Inexpensive tree pruning algorithm based on Minimum Description Length (MDL)
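
As a rough illustration of the goodness test listed above: the Gini index of a set with class proportions p_j is 1 - sum(p_j^2), and a binary split is scored by the size-weighted average over its two sides (lower is better). The function names here are made up for this sketch.

```python
from collections import Counter


def gini(labels):
    """Gini index of a collection of class labels: 1 - sum_j p_j^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_split(left, right):
    """Size-weighted Gini index of a binary split; lower is better."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
```

A split that separates the classes perfectly scores 0, while a split that leaves both sides evenly mixed between two classes scores 0.5.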
SLIQ - Algorithm
  • Eliminate the need to sort the data at each node
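  • Create a sorted attribute list for each numerical attribute; each entry holds an attribute value and an index into the class list
  • Create a class list; each entry holds a class label and a reference to a leaf node of the tree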
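
A minimal sketch of the two data structures, assuming records arrive as Python dicts and that the root leaf is called "N1". The function name, the dict layout, and the leaf naming are illustrative choices, not taken from the slides.

```python
def build_lists(records, labels, numeric_attrs):
    """Build one presorted attribute list per numeric attribute plus a class list.

    records:       list of dicts, e.g. {"age": 30, "salary": 65}
    labels:        class label of each record, same order as records
    numeric_attrs: names of the numeric attributes to presort
    """
    # Class list: entry i is [class label, current leaf]; every record starts at the root.
    class_list = [[label, "N1"] for label in labels]

    # Attribute lists: (value, record index) pairs, sorted once by value.
    attr_lists = {
        attr: sorted((rec[attr], i) for i, rec in enumerate(records))
        for attr in numeric_attrs
    }
    return attr_lists, class_list
```

Only the class list is updated while the tree grows; the attribute lists are sorted once and then only read.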
SLIQ - Algorithm
  • Split evaluation:
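
Split evaluation for one numeric attribute can be sketched as a single scan over its presorted list, keeping per-leaf class histograms for the records below and above the current value. This is a simplified sketch: the test has the form "value <= threshold" with an observed value as threshold, and all names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import groupby


def gini_from_counts(counts):
    """Gini index computed from a mapping of class label to count."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_numeric_splits(attr_list, class_list):
    """Scan one presorted attribute list once; return {leaf: (gini, threshold)}.

    attr_list:  sorted list of (value, record index)
    class_list: list of [class label, leaf] indexed by record index
    """
    above = defaultdict(Counter)          # records not yet scanned, per leaf
    for label, leaf in class_list:
        above[leaf][label] += 1
    below = defaultdict(Counter)          # records already scanned, per leaf
    best = {}

    # Process equal values together so a threshold never separates ties.
    for value, group in groupby(attr_list, key=lambda entry: entry[0]):
        touched = set()
        for _, idx in group:
            label, leaf = class_list[idx]
            below[leaf][label] += 1
            above[leaf][label] -= 1
            touched.add(leaf)
        for leaf in touched:
            n_below = sum(below[leaf].values())
            n_above = sum(above[leaf].values())
            if n_above == 0:
                continue                  # everything on one side: not a split
            score = (n_below * gini_from_counts(below[leaf])
                     + n_above * gini_from_counts(above[leaf])) / (n_below + n_above)
            if leaf not in best or score < best[leaf][0]:
                best[leaf] = (score, value)
    return best
```

Because the histograms are keyed by leaf, one pass over an attribute list scores candidate splits for every leaf of the current tree level at once, which fits the breadth-first growth strategy.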
SLIQ - Algorithm
  • Update class list:
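
A sketch of the class-list update, assuming attribute lists of (value, record index) pairs and class-list entries of the form [class label, leaf]. The `splits` mapping and the child names are hypothetical.

```python
def update_class_list(attr_list, class_list, splits):
    """Move records to child leaves after splitting on one attribute.

    attr_list:  list of (value, record index) for the attribute used in the splits
    class_list: list of [class label, leaf]; modified in place
    splits:     {leaf: (threshold, left_child, right_child)} for leaves split on
                this attribute, using the test "value <= threshold"
    """
    for value, idx in attr_list:
        leaf = class_list[idx][1]
        if leaf in splits:
            threshold, left_child, right_child = splits[leaf]
            class_list[idx][1] = left_child if value <= threshold else right_child
```

Leaves that were split on a different attribute are left untouched by this call and are handled when that attribute's list is scanned.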
SLIQ - Algorithm
  • For categorical attributes whose cardinality exceeds a threshold, the best split is computed greedily; otherwise all possible subsets are evaluated
  • When a node becomes pure, stop splitting it, then condense the attribute lists by discarding the examples that belong to the pure node
  • SLIQ scales to large datasets with no loss in accuracy, because the splits evaluated with or without pre-sorting are identical
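
The greedy search for large-cardinality categorical attributes might look like the sketch below: start from an empty subset and keep adding the category that lowers the Gini index of the split most, stopping when no addition improves it. The cardinality threshold itself is not shown, and the function names and input layout are assumptions.

```python
from collections import Counter


def gini_from_counts(counts):
    """Gini index computed from a mapping of class label to count."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def greedy_subset(value_class_counts):
    """Greedily pick a subset S for the test "attribute in S".

    value_class_counts: {category: Counter(class label -> count)}
    Returns (subset, gini); gini is inf if no valid split exists.
    """
    total = Counter()
    for counts in value_class_counts.values():
        total.update(counts)

    subset, inside, best = set(), Counter(), float("inf")
    while True:
        candidate = None
        for value, counts in value_class_counts.items():
            if value in subset:
                continue
            left = inside + counts
            right = total - left
            n_left, n_right = sum(left.values()), sum(right.values())
            if n_right == 0:
                continue                  # would put every record on one side
            score = (n_left * gini_from_counts(left)
                     + n_right * gini_from_counts(right)) / (n_left + n_right)
            if score < best:
                best, candidate = score, value
        if candidate is None:             # no addition improves the split
            return subset, best
        subset.add(candidate)
        inside = inside + value_class_counts[candidate]
```

Each round scans the remaining categories once, so the work grows roughly quadratically with the number of categories instead of exponentially as in the exhaustive subset search.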
SLIQ - Pruning
  • Post pruning algorithm based on Minimum Description Length principle
  • Find a model that minimizes:

Cost(M,D) = Cost(D|M) + Cost(M)

Cost(M) - cost of the model

Cost(D|M) - cost of encoding the data D if model M is given

SLIQ - Pruning
  • Cost of the data: classification error
  • Cost of the model:
    • Encoding the tree: number of bits
    • Encoding the splits:
      • numerical attribute - constant (empirically 1)
      • categorical attribute - depends on cardinality
  • MDL pruning evaluates the code length at each node to decide whether to prune one or both children or leave the node intact
SLIQ - Pruning
  • Three pruning strategies:
    • Full – prune both children and convert the node into a leaf
    • Partial – convert the node into a leaf, prune only the left child, prune only the right child, or leave the node intact
    • Hybrid – apply the Full method first, then the Partial method (prune the left child, prune the right child, or leave the node intact)
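
A minimal sketch of the Full strategy as a bottom-up pass that compares two code lengths at each internal node: encoding it as a leaf versus encoding it with its test and both subtrees. It assumes 1 bit to encode the node type, the error count as the cost of the data, and a per-node `split_cost` supplied by the caller. The `Node` class and `NODE_BITS` are hypothetical names, and pruning on ties is an arbitrary choice.

```python
from dataclasses import dataclass
from typing import Optional

NODE_BITS = 1.0  # assumed cost of encoding whether a node is a leaf or internal


@dataclass
class Node:
    errors: int                      # records misclassified if this node were a leaf
    split_cost: float = 1.0          # bits to encode the test at this node
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def prune_full(node):
    """Prune bottom-up in place; return the code length of the resulting subtree."""
    leaf_cost = NODE_BITS + node.errors
    if node.left is None or node.right is None:
        return leaf_cost

    subtree_cost = (NODE_BITS + node.split_cost
                    + prune_full(node.left) + prune_full(node.right))
    if leaf_cost <= subtree_cost:
        node.left = node.right = None    # cheaper to encode as a leaf
        return leaf_cost
    return subtree_cost
```

For example, a node with 2 errors whose two leaf children each have 1 error costs 3 as a leaf and 6 as a subtree, so it is collapsed. A node with 10 errors whose children classify perfectly costs 11 as a leaf and 4 as a subtree, so it is kept.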
Results
  • SLIQ was tested on the datasets:
Results
  • Pruning strategy comparison:
Results
  • Accuracy:
Results
  • Scalability:
Conclusion
  • SLIQ proves to be a fast, low-cost and scalable classifier that builds accurate trees
  • In empirical tests comparing SLIQ with other tree-based classifiers, SLIQ achieves comparable accuracy while producing smaller decision trees
  • Scalability remains an open question: memory can become a problem as the number of attributes or the number of classes increases
References

[1] M. Mehta, R. Agrawal and J. Rissanen, "SLIQ: A Fast Scalable Classifier for Data Mining", in Proceedings of the 5th International Conference on Extending Database Technology, Avignon, France, Mar. 1996.
