Handling Numeric Attributes in Hoeffding Trees

Bernhard Pfahringer, Geoff Holmes and Richard Kirkby


Overview

  • Hoeffding trees are excellent for classification tasks on data streams.

  • Handling numeric attributes well is crucial to performance in conventional decision trees (for example, C4.5 -> C4.8)

  • Does handling numeric attributes matter for streamed data?

  • We implement a range of methods and empirically evaluate their accuracy and costs.

Data Streams - reminder

  • Idea is that data is being provided from a continuous source:

    • Examples processed one at a time (inspected once)

    • Memory is limited (!)

    • Model construction must scale (at most N log N in the number of examples)

    • Be ready to predict at any time

  • As memory is limited, this has implications for any numeric-handling method you might construct

  • We only consider methods that work as the tree is built

Main assumptions/limitations

  • Assume a stationary concept, i.e. no concept drift or change

    • may seem very limiting, but …

  • Three-way trade-off:

    • memory

    • speed

    • accuracy

  • Used only artificial data sources

Hoeffding Trees

  • Introduced by Domingos and Hulten (VFDT)

  • “Extension” of decision trees to streams

  • HT Algorithm:

    • Init tree T to root node

    • For each example from stream

      • Find leaf L for this example

      • Update counts in L with the example's attribute values and compute the split function (e.g. information gain, IG) for each attribute

      • If IG(best attr) – IG(next best attr) > ε (the Hoeffding bound), then split L on the best attribute (see the sketch below)
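
A minimal sketch of this split test in Python. It assumes the split criterion is information gain, whose range R is log2 of the number of classes; the function names and the numbers in the example are illustrative, not taken from the paper:

import math

def hoeffding_bound(value_range, delta, n):
    # epsilon = sqrt(R^2 * ln(1/delta) / (2n)): with probability 1 - delta the
    # observed mean of n samples is within epsilon of the true mean.
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(gain_best, gain_second_best, num_classes, n, delta=1e-7):
    # Split when the best attribute's gain beats the runner-up by more than epsilon.
    # For information gain the range R is log2(num_classes).
    eps = hoeffding_bound(math.log2(num_classes), delta, n)
    return (gain_best - gain_second_best) > eps

# After 5000 examples of a 2-class problem, a gain difference of 0.05 is enough
# to split at delta = 1e-7 (epsilon is roughly 0.04 here).
print(should_split(0.30, 0.25, num_classes=2, n=5000))  # True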

Active leaf data structure

  • For each class value:

    • for each nominal attribute:

      • for each possible value:

        • keep sum of counts/weights

    • for each numeric attribute:

      • keep sufficient stats to approximate the distribution

        • various possibilities; here we assume a normal distribution, so estimate/record n, mean, variance, plus min/max (see the sketch below)
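
A minimal sketch of such a leaf structure in Python. The class names are hypothetical, and the incremental mean/variance update is standard Welford-style bookkeeping chosen here for illustration, not code from the paper:

from collections import defaultdict

class GaussianEstimator:
    # Sufficient statistics for one numeric attribute under one class:
    # weight n, mean, variance (via Welford-style updates), plus min and max.
    def __init__(self):
        self.n = 0.0
        self.mean = 0.0
        self.m2 = 0.0                      # sum of squared deviations from the mean
        self.min = float("inf")
        self.max = float("-inf")

    def add(self, value, weight=1.0):
        self.n += weight
        delta = value - self.mean
        self.mean += weight * delta / self.n
        self.m2 += weight * delta * (value - self.mean)
        self.min = min(self.min, value)
        self.max = max(self.max, value)

    def variance(self):
        return self.m2 / (self.n - 1.0) if self.n > 1.0 else 0.0

class ActiveLeaf:
    # Per-class statistics: value counts for nominal attributes,
    # Gaussian estimators for numeric attributes.
    def __init__(self):
        self.nominal = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
        self.numeric = defaultdict(lambda: defaultdict(GaussianEstimator))

    def update(self, nominal_attrs, numeric_attrs, label, weight=1.0):
        for attr, value in nominal_attrs.items():
            self.nominal[label][attr][value] += weight   # sum of counts/weights
        for attr, value in numeric_attrs.items():
            self.numeric[label][attr].add(value, weight)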

Numeric Handling Methods

  • VFDT (VFML – Hulten & Domingos, 2003)

    • Summarize the numeric distribution with a histogram of at most N bins (default 1000)

    • Bin boundaries determined by first N unique values seen in the stream.

    • Issues: the method is sensitive to data order, and a good N must be chosen for each problem (a simplified sketch of this binning follows this list)

  • Exhaustive Binary Tree (BINTREE – Gama et al, 2003)

    • Closest implementation of a batch method

    • Incrementally update a binary tree as data is observed

    • Issues: high memory cost, high cost of the split search, sensitivity to data order
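
A simplified illustration of the VFML-style binning in Python. This is not the VFML code itself: it ignores re-indexing subtleties during the boundary-collection phase and any memory accounting:

import bisect
from collections import defaultdict

class FirstNBins:
    # The first max_bins distinct values seen become the bin boundaries;
    # later values only increment the per-class count of the bin they fall into.
    def __init__(self, max_bins=1000):
        self.max_bins = max_bins
        self.boundaries = []                                          # sorted distinct values
        self.class_counts = defaultdict(lambda: defaultdict(float))   # bin index -> class -> weight

    def add(self, value, label, weight=1.0):
        if len(self.boundaries) < self.max_bins and value not in self.boundaries:
            bisect.insort(self.boundaries, value)          # still collecting boundaries
        idx = bisect.bisect_left(self.boundaries, value)   # bin index for this value
        self.class_counts[idx][label] += weight

    def candidate_splits(self):
        # Each boundary is a candidate split point for the information-gain search.
        return list(self.boundaries)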

Numeric Handling Methods

  • Quantile Summaries (GK – Greenwald and Khanna, 2001)

    • Motivation comes from the VLDB (very large databases) community

    • Maintain a sample of values (quantiles) plus the range of possible ranks each sampled value can take, stored as tuples; a simplified sketch follows this list

    • Extremely space efficient

    • Issues: a maximum number of tuples per summary must be chosen
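
A simplified sketch of a Greenwald-Khanna style summary in Python. It is illustrative only: it keeps a target error eps rather than a fixed tuple budget, and omits the band-based compression of the full algorithm:

import math

class GKSummary:
    # (value, g, delta) tuples kept sorted by value; g and delta bound
    # each stored value's possible rank.
    def __init__(self, eps=0.01):
        self.eps = eps
        self.n = 0
        self.tuples = []                  # list of (value, g, delta)

    def add(self, value):
        self.n += 1
        i = 0
        while i < len(self.tuples) and self.tuples[i][0] < value:
            i += 1
        # a new minimum or maximum gets an exact rank (delta = 0)
        delta = 0 if (i == 0 or i == len(self.tuples)) else math.floor(2 * self.eps * self.n)
        self.tuples.insert(i, (value, 1, delta))
        if self.n % max(1, int(1.0 / (2.0 * self.eps))) == 0:
            self._compress()

    def _compress(self):
        # Merge neighbouring tuples whose combined uncertainty stays within 2*eps*n.
        limit = 2.0 * self.eps * self.n
        i = len(self.tuples) - 2
        while i >= 1:                     # keep the stored minimum intact
            v1, g1, d1 = self.tuples[i]
            v2, g2, d2 = self.tuples[i + 1]
            if g1 + g2 + d2 <= limit:
                self.tuples[i + 1] = (v2, g1 + g2, d2)
                del self.tuples[i]
            i -= 1

    def rank_bounds(self, stored_value):
        # Possible rank range [r_min, r_max] of a value kept in the summary.
        r_min = 0
        for v, g, d in self.tuples:
            r_min += g
            if v == stored_value:
                return r_min, r_min + d
        return None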

Numeric Handling Methods

  • Gaussian Approximation (GAUSS)

    • Assume values conform to Normal Distribution

    • Maintain five numbers (weight, mean, variance, min, max)

    • Note: not sensitive to data order

    • Incrementally updateable

    • Using the per-class min/max information, split the overall range into N equal parts

    • For each candidate split point, use the five numbers per class to compute the approximate class distribution on either side

      • Use the above to compute the IG of that split (see the sketch below)
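
A minimal sketch of this evaluation in Python, under simplifying assumptions: the attribute range is passed in rather than derived from the per-class min/max, and the function and variable names are illustrative, not from the paper:

import math

def normal_cdf(x, mean, std):
    # P(X <= x) for a normal distribution, via the error function.
    if std == 0.0:
        return 1.0 if x >= mean else 0.0
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def entropy(weights):
    total = sum(weights)
    if total <= 0.0:
        return 0.0
    return -sum((w / total) * math.log2(w / total) for w in weights if w > 0)

def best_gauss_split(class_stats, lo, hi, n_parts=10):
    # class_stats: {label: (weight, mean, variance)} for one numeric attribute.
    # Evaluates n_parts equally spaced split points between lo and hi; the class
    # distribution on each side is estimated from the per-class Gaussian CDF.
    weights = [w for w, _, _ in class_stats.values()]
    base, grand = entropy(weights), sum(weights)
    best_split, best_gain = None, -1.0
    for i in range(1, n_parts + 1):
        split = lo + i * (hi - lo) / (n_parts + 1)
        left = [w * normal_cdf(split, m, math.sqrt(v)) for w, m, v in class_stats.values()]
        right = [w - l for w, l in zip(weights, left)]
        remainder = (sum(left) / grand) * entropy(left) + (sum(right) / grand) * entropy(right)
        if base - remainder > best_gain:
            best_split, best_gain = split, base - remainder
    return best_split, best_gain

# Hypothetical example: two classes well separated on this attribute;
# the best split lands near 5 with a gain close to one bit.
print(best_gauss_split({"a": (100.0, 2.0, 1.0), "b": (100.0, 8.0, 1.5)}, lo=0.0, hi=10.0))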

Gaussian approximation – 2 class problem

Gaussian approximation – 3 class problem

Gaussian approximation – 4 class problem

Empirical Evaluation

  • Use each numeric handling method (8 in total) to build a Hoeffding Tree (HTMC)

  • Vary parameters of some methods (VFML10, VFML100, VFML1000; BT; GK100, GK1000; GAUSS10, GAUSS100)

  • Train models for 10 hours – then test on one million (holdout) examples

  • Define three application scenarios

    • Sensor network (100K memory limit)

    • Handheld (32MB)

    • Server (400MB)

Data generators

  • Random tree (Domingos&Hulten):

    • (RTS) 10 numeric and 10 nominal attributes (5 values each), 2 classes, leaves start at level 3, maximum level 5; plus a version with 10% noise added (RTSN)

    • (RTC) 50 numeric and 50 nominal attributes (5 values each), 2 classes, leaves start at level 5, maximum level 10; plus a version with 10% noise added (RTCN)

  • Random RBF (Kirkby):

    • (RRBFS) 10 numeric attributes, 100 centers, 2 classes

    • (RRBFC) 50 numeric attributes, 1000 centers, 2 classes (a simplified RBF-generator sketch follows this list)

  • Waveform (Aha):

    • (Wave21) 21 noisy numeric attributes; (Wave40) adds 19 irrelevant numeric attributes; 3 classes

  • GenF1-GenF10 (Agrawal et al.):

    • hypothetical loan applications, 10 different rules over 6 numeric + 3 nominal attributes, 5% noise, 2 classes
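
A simplified illustration of a random-RBF-style stream in Python. This is not the MOA generator; the center placement, weights and spreads below are arbitrary assumptions for illustration:

import random

def make_rbf_stream(n_numeric=10, n_centers=100, n_classes=2, seed=1):
    # Fixed random centers each carry a class label, a weight and a spread;
    # every example is a Gaussian-perturbed copy of a randomly chosen center.
    rng = random.Random(seed)
    centers = [{
        "point": [rng.random() for _ in range(n_numeric)],
        "label": rng.randrange(n_classes),
        "weight": rng.random(),
        "std": rng.random() * 0.1,
    } for _ in range(n_centers)]
    weights = [c["weight"] for c in centers]
    while True:
        c = rng.choices(centers, weights=weights)[0]
        x = [v + rng.gauss(0.0, c["std"]) for v in c["point"]]
        yield x, c["label"]

# Draw a few examples from an RRBFS-like configuration
# (10 numeric attributes, 100 centers, 2 classes).
stream = make_rbf_stream()
for _ in range(3):
    x, y = next(stream)
    print(y, [round(v, 3) for v in x[:3]])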

Tree Measurements

  • Accuracy (% correct)

  • Number of training examples processed in 10 hours (in millions)

  • Number of active leaves (in hundreds)

  • Number of inactive leaves (in hundreds)

  • Total nodes (in hundreds)

  • Tree depth

  • Training speed (% of generation speed)

  • Prediction speed (% of generation speed)

Sensor Network (100K memory limit)

Handheld Environment (32MB memory limit)

Server Environment (400MB memory limit)

Overall results - comments

  • VFML10 is superior on average in all environments, followed closely by GAUSS10

  • GK methods are generally competitive

  • BINTREE is only competitive in a server setting

  • Default setting of 1000 for VFML is a poor choice

  • Crude binning frees up memory, which leads to faster growth and better trees (more room to grow)

  • Higher values of N for GAUSS lead to very deep trees (deeper than the number of attributes), suggesting repeated splitting on the same attribute (too fine-grained)

Remarks – sensor network environment

  • The number of training examples processed is low because learning stops when the last active leaf is deactivated (memory management freezes nodes; with few examples, leaves have a low probability of splitting)

  • Most accurate methods: VFML10, GAUSS10

Remarks – Handheld Environment

  • Generates smaller trees (than server) and can therefore process more examples

Remarks – Server Environment

VFML10 vs GAUSS10 – Closer Analysis

  • Recall VFML10 is superior on average

  • Sensor (avg 87.7 vs 86.2)

    • GAUSS10 superior on 10

    • VFML10 superior on 6 (2 no difference)

  • Handheld (avg 91.5 vs 91.4)

    • GAUSS10 superior on 4

    • VFML10 superior on 8 (6 no difference)

  • Server (avg 91.4 vs 91.2)

    • GAUSS10 superior on 6

    • VFML10 superior on 6 (6 no difference)

Data order

Conclusion

  • We have presented a method for handling numeric attributes in data streams that performs well in empirical studies

  • The methods employing the most approximation were superior – they allow greater growth when memory is limited.

  • On a dataset by dataset analysis there is not much to choose between VFML10 and GAUSS10

  • Gains made in handling numeric variables come at a cost in terms of training and prediction speed – the cost is high in some environments

All algorithms available

  • https://sourceforge.net/projects/moa-datastream

  • All methods and an environment for experimental evaluation of data streams are available from the above URL; the system is called Massive Online Analysis (MOA)
