Handling Numeric Attributes in Hoeffding Trees

Bernhard Pfahringer, Geoff Holmes and Richard Kirkby

Overview
  • Hoeffding trees are excellent for classification tasks on data streams.
  • Handling numeric attributes well is crucial to performance in conventional decision trees (for example, C4.5 -> C4.8)
  • Does handling numeric attributes matter for streamed data?
  • We implement a range of methods and empirically evaluate their accuracy and costs.

Data Streams - reminder
  • The idea is that data arrives from a continuous source:
    • Examples are processed one at a time (inspected only once)
    • Memory is limited (!)
    • Model construction must scale (at most N log N in the number of examples)
    • Be ready to predict at any time
  • Because memory is limited, this has implications for any numeric-handling method you might construct
  • We only consider methods that work as the tree is built

Main assumptions/limitations
  • Assume a stationary concept, i.e. no concept drift or change
    • may seem very limiting, but …
  • Three-way trade-off:
    • memory
    • speed
    • accuracy
  • Used only artificial data sources

Hoeffding Trees
  • Introduced by Domingos and Hulten (VFDT)
  • “Extension” of decision trees to streams
  • HT Algorithm:
    • Initialize tree T to a single root node
    • For each example from the stream:
      • Find the leaf L this example reaches
      • Update the counts in L with the example's attribute values and compute the split function (e.g. information gain, IG) for each attribute
      • If IG(best attribute) – IG(next best attribute) > ε (the Hoeffding bound), split L on the best attribute (see the sketch below)
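
A minimal Python sketch of the loop above, assuming hypothetical helpers (tree.find_leaf, leaf.update_statistics, leaf.info_gains) rather than the actual VFDT code; epsilon is the Hoeffding bound for information gain, whose range is log2(#classes).

import math

def hoeffding_bound(value_range, delta, n):
    # With probability 1 - delta, the true mean of a random variable with the
    # given range is within epsilon of the mean observed over n examples.
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def train(tree, stream, delta=1e-7):
    for example in stream:
        leaf = tree.find_leaf(example)                     # find leaf L for the example
        leaf.update_statistics(example)                    # update counts in L
        gains = sorted(leaf.info_gains(), reverse=True)    # [(IG, attribute), ...]
        if len(gains) >= 2 and leaf.num_examples > 0:
            eps = hoeffding_bound(math.log2(tree.num_classes), delta, leaf.num_examples)
            if gains[0][0] - gains[1][0] > eps:
                tree.split(leaf, gains[0][1])              # split L on the best attribute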

Active leaf data structure
  • For each class value:
    • for each nominal attribute:
      • for each possible value:
        • keep sum of counts/weights
    • for each numeric attribute:
      • keep sufficient statistics to approximate the distribution
      • various possibilities; here we assume a normal distribution, so we record n, mean, variance, plus min/max (see the sketch below)
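
A minimal Python sketch of such a leaf, assuming the normal-distribution option: nominal attributes keep per-class value counts, numeric attributes keep incrementally updated n, mean, variance (Welford's method) plus min/max. All names are illustrative.

from collections import defaultdict

class GaussianStats:
    # n, mean, variance (via Welford's method), min and max for one (class, attribute) pair
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.min, self.max = float("inf"), float("-inf")

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        self.min, self.max = min(self.min, x), max(self.max, x)

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

class ActiveLeaf:
    def __init__(self):
        # nominal_counts[class][attribute][value] -> summed weight
        self.nominal_counts = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
        # numeric_stats[class][attribute] -> GaussianStats
        self.numeric_stats = defaultdict(lambda: defaultdict(GaussianStats))

    def update(self, example, label, weight=1.0):
        for attr, value in example.items():
            if isinstance(value, (int, float)):
                self.numeric_stats[label][attr].update(value)
            else:
                self.nominal_counts[label][attr][value] += weight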

Numeric Handling Methods
  • VFDT (VFML – Hulten & Domingos, 2003)
    • Summarize the numeric distribution with a histogram of at most N bins (default 1000)
    • Bin boundaries are determined by the first N unique values seen in the stream (see the sketch after this list)
    • Issues: the method is sensitive to data order, and a good N must be chosen per problem
  • Exhaustive Binary Tree (BINTREE – Gama et al, 2003)
    • Closest implementation of a batch method
    • Incrementally update a binary tree as data is observed
    • Issues: high memory cost, high cost of split search, data order
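
A minimal Python sketch of the VFML-style binning described above (illustrative, not the original VFML code): the first N distinct values seen become bin boundaries, and later values only increment the per-class count of an existing bin.

import bisect
from collections import defaultdict

class VFMLBins:
    def __init__(self, max_bins=1000):
        self.max_bins = max_bins
        self.boundaries = []      # first max_bins unique values, kept sorted
        self.counts = defaultdict(lambda: defaultdict(float))  # boundary -> class -> weight

    def update(self, value, label, weight=1.0):
        if len(self.boundaries) < self.max_bins and value not in self.boundaries:
            bisect.insort(self.boundaries, value)
        # count the value against the largest boundary that does not exceed it
        idx = max(bisect.bisect_right(self.boundaries, value) - 1, 0)
        self.counts[self.boundaries[idx]][label] += weight

Because the boundaries depend on which values arrive first, the resulting histogram (and hence the candidate split points) is sensitive to data order, which is exactly the issue noted above.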

Numeric Handling Methods
  • Quantile Summaries (GK – Greenwald and Khanna, 2001)
    • Motivation comes from the database (VLDB) community
    • Maintain a sample of values (quantiles) plus the range of possible ranks each sampled value can take (stored as tuples)
    • Extremely space efficient
    • Issues: a maximum number of tuples per summary must be chosen (a simplified sketch follows below)
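
A simplified Python sketch of a Greenwald-Khanna style summary (the version used in the paper caps the number of tuples per summary; the epsilon accuracy parameter and all names here are illustrative). Each tuple stores a value, g (the gap in minimum rank to the previous tuple) and delta (the uncertainty in its rank).

import math

class GKSummary:
    def __init__(self, epsilon=0.01):
        self.eps = epsilon
        self.n = 0
        self.tuples = []          # [value, g, delta], kept sorted by value

    def insert(self, v):
        i = 0
        while i < len(self.tuples) and self.tuples[i][0] < v:
            i += 1
        # a new minimum or maximum is known exactly (delta = 0)
        delta = 0 if i == 0 or i == len(self.tuples) else math.floor(2 * self.eps * self.n)
        self.tuples.insert(i, [v, 1, delta])
        self.n += 1
        if self.n % max(int(1 / (2 * self.eps)), 1) == 0:
            self._compress()

    def _compress(self):
        # merge adjacent tuples whose combined rank uncertainty stays within 2*eps*n
        threshold = 2 * self.eps * self.n
        i = len(self.tuples) - 2
        while i >= 1:
            if self.tuples[i][1] + self.tuples[i + 1][1] + self.tuples[i + 1][2] <= threshold:
                self.tuples[i + 1][1] += self.tuples[i][1]
                del self.tuples[i]
            i -= 1

    def quantile(self, q):
        # return a value whose rank is within eps*n of q*n (assumes n > 0)
        target = q * self.n
        rmin, prev = 0, self.tuples[0][0]
        for v, g, d in self.tuples:
            rmin += g
            if rmin + d > target + self.eps * self.n:
                return prev
            prev = v
        return self.tuples[-1][0]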

Numeric Handling Methods
  • Gaussian Approximation (GAUSS)
    • Assume values conform to a normal distribution
    • Maintain five numbers per class (weight, mean, variance, min, max)
    • Note: not sensitive to data order
    • Incrementally updateable
    • Using the per-class min/max information, split the range into N equal parts
    • For each part, use the five numbers per class to compute the approximate class distribution
      • Use the above to compute the IG of that split (see the sketch below)
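
A minimal Python sketch of the GAUSS split evaluation described above (illustrative names, not the authors' exact code): the per-class normal approximation gives an estimated class distribution on each side of every candidate split point, from which the information gain is computed.

import math

def entropy(counts):
    total = sum(counts)
    if total <= 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def normal_cdf(x, mean, std):
    if std == 0:
        return 1.0 if x >= mean else 0.0
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def best_gauss_split(class_stats, n_parts=10):
    # class_stats: {class: (weight, mean, variance, min, max)} for one numeric attribute
    lo = min(s[3] for s in class_stats.values())
    hi = max(s[4] for s in class_stats.values())
    pre_counts = [s[0] for s in class_stats.values()]
    total, pre_entropy = sum(pre_counts), entropy(pre_counts)

    best_gain, best_point = float("-inf"), None
    for i in range(1, n_parts):                      # N - 1 candidate split points
        split = lo + i * (hi - lo) / n_parts
        left = [w * normal_cdf(split, m, math.sqrt(v)) for w, m, v, _, _ in class_stats.values()]
        right = [w - l for (w, _, _, _, _), l in zip(class_stats.values(), left)]
        post = (sum(left) / total) * entropy(left) + (sum(right) / total) * entropy(right)
        gain = pre_entropy - post
        if gain > best_gain:
            best_gain, best_point = gain, split
    return best_point, best_gain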

Gaussian approximation – 2 class problem

Gaussian approximation – 3 class problem

Gaussian approximation – 4 class problem

Empirical Evaluation
  • Use each numeric handling method (8 in total) to build a Hoeffding Tree (HTMC)
  • Vary parameters of some methods (VFML10,100,1000; BT; GK100,1000; GAUSS10,100)
  • Train models for 10 hours – then test on one million (holdout) examples
  • Define three application scenarios
    • Sensor network (100K memory limit)
    • Handheld (32MB)
    • Server (400MB)

Data generators
  • Random tree (Domingos & Hulten):
    • (RTS) 10 numeric, 10 nominal with 5 values, 2 classes, leaves start at level 3, max level 5, plus a version with 10% noise added (RTSN)
    • (RTC) 50 numeric, 50 nominal with 5 values, 2 classes, leaves start at level 5, max level 10, plus a version with 10% noise added (RTCN)
  • Random RBF (Kirkby):
    • (RRBFS) 10 numeric, 100 centers, 2 classes
    • (RRBFC) 50 numeric, 1000 centers, 2 classes
  • Waveform (Aha):
    • (Wave21) 21 noisy numeric; (Wave40) adds 19 irrelevant numeric; 3 classes
  • GenF1-GenF10 (Agrawal et al.):
    • hypothetical loan applications, 10 different rules over 6 numeric + 3 nominal attributes, 5% noise, 2 classes

Tree Measurements
  • Accuracy (% correct)
  • Number of training examples processed in 10 hours (in millions)
  • Number of active leaves (in hundreds)
  • Number of inactive leaves (in hundreds)
  • Total nodes (in hundreds)
  • Tree depth
  • Training speed (% of generation speed)
  • Prediction speed (% of generation speed)

Sensor Network (100K memory limit)

Handheld Environment (32MB memory limit)

Server Environment (400MB memory limit)

Overall results - comments
  • VFML10 is superior on average in all environments, followed closely by GAUSS10
  • GK methods are generally competitive
  • BINTREE is only competitive in a server setting
  • The default setting of 1000 bins for VFML is a poor choice
  • Crude binning consumes less memory, which leaves more room for tree growth and so yields faster growth and better trees
  • Higher values of N for GAUSS lead to very deep trees (deeper than the number of attributes), suggesting repeated splitting on the same attribute (too fine-grained)

Remarks – sensor network environment
  • The number of training examples processed is low because learning stops when the last active leaf is deactivated (memory management freezes nodes; with few examples per leaf, the probability of splitting is low)
  • The most accurate methods are VFML10 and GAUSS10

Remarks – Handheld Environment
  • Generates smaller trees (than the server environment) and can therefore process more examples

Remarks – Server Environment

VFML10 vs GAUSS10 – Closer Analysis
  • Recall VFML10 is superior on average
  • Sensor (avg 87.7 vs 86.2)
    • GAUSS10 superior on 10
    • VFML10 superior on 6 (2 no difference)
  • Handheld (avg 91.5 vs 91.4)
    • GAUSS10 superior on 4
    • VFML10 superior on 8 (6 no difference)
  • Server (avg 91.4 vs 91.2)
    • GAUSS10 superior on 6
    • VFML10 superior on 6 (6 no difference)

Data order

Conclusion
  • We have presented a method for handling numeric attributes in data streams that performs well in empirical studies
  • The methods employing the most approximation were superior – they allow greater growth when memory is limited.
  • On a dataset by dataset analysis there is not much to choose between VFML10 and GAUSS10
  • Gains made in handling numeric variables come at a cost in terms of training and prediction speed – the cost is high in some environments

All algorithms available
  • https://sourceforge.net/projects/moa-datastream
  • All methods, and an environment for the experimental evaluation of data streams, are available from the above URL – the system is called Massive Online Analysis (MOA)
