Mining High-Speed Data Streams Pedro Domingos, Geoff Hulten Sixth ACM SIGKDD International Conference, 2000 Presented by: Afsoon Yousefi
Outline • Introduction • Hoeffding Trees • The VFDT System • Performance Study • Conclusion • Qs & As
Introduction • In today’s information society, extracting knowledge from data has become an important task for many people; we live in an age of knowledge revolution. • Many organizations maintain very large databases that grow at a rate of several million records per day. • This brings both opportunities and challenges. • The main limited resources in knowledge discovery systems are: • Time • Memory • Sample size
Introduction—cont. • Traditional systems: • Only a small amount of data is available • Use only a fraction of the available computational power • Current systems: • The bottleneck is time and memory • Use only a fraction of the available data samples • Try to mine databases that do not fit in main memory • Available algorithms are either: • Efficient, but with no guarantee of producing a model similar to the one learned in batch mode: • They may never recover from an unfavorable set of early examples • They are sensitive to example ordering • Or able to produce the same model as the batch version, but inefficiently: • They are slower than the batch algorithm
Introduction—cont. • Requirements of algorithms that overcome these problems: • Operate continuously and indefinitely • Incorporate examples as they arrive, never losing potentially valuable information • Build a model using at most one scan of the data • Use only a fixed amount of main memory • Require small constant time per record • Make a usable model available at any point in time • Produce a model equivalent to the one obtained by an ordinary database mining algorithm • When the data-generating process changes over time, keep the model up-to-date at all times
Introduction—cont. • Such requirements are fulfilled by incremental learning methods, also known as: • Online methods • Successive methods • Sequential methods
Hoeffding Trees • Classic decision tree learners: • CART, ID3, C4.5 • Require all examples to be simultaneously in main memory. • Disk-based decision tree learners: • SLIQ, SPRINT • Examples are stored on disk. • Expensive when learning complex trees or mining very large datasets. • Hoeffding trees instead consider only a subset of the training examples to find the best attribute at each node: • Suitable for extremely large datasets. • Read each example at most once. • Directly mine online data sources. • Build complex trees at acceptable computational cost.
Hoeffding Trees—cont. • Given a set of N examples of the form (x, y): • y: a discrete class label • x: a vector of d attributes (symbolic or numeric) • Goal: produce a model y = f(x) that will predict the classes y of future examples x with high accuracy.
Hoeffding Trees—cont. • Given a stream of examples: • Use the first ones to choose the root test. • Pass the succeeding ones down to the corresponding leaves. • Pick the best attributes there. • …and so on, recursively. • How many examples are necessary at each node? • The Hoeffding bound • Also known as the additive Chernoff bound • A statistical result
Hoeffding Trees—cont. • Hoeffding bound: • G(Xi): the heuristic measure used to choose test attributes • C4.5: information gain • CART: Gini index • Assume G is to be maximized • After seeing n examples: • Xa: the attribute with the highest observed G • Xb: the second-best attribute • ΔG = G(Xa) − G(Xb): the difference between them • δ: the probability of choosing the wrong attribute • The Hoeffding bound guarantees that Xa is the correct choice with probability 1 − δ if: • ΔG > ε, where ε = √(R² ln(1/δ) / 2n), R is the range of G (e.g., log₂(number of classes) for information gain), and n examples have been seen at this node
Hoeffding Trees—cont. • Hoeffding bound: • If ΔG > ε: • Xa is the best attribute with probability 1 − δ • A node needs to accumulate examples from the stream until ε becomes smaller than ΔG • The bound is independent of the probability distribution generating the observations • This makes it more conservative than distribution-dependent bounds • (The split test is sketched in code below.)
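The split test is small enough to show directly. A minimal Python sketch (the function names are mine, not the paper's), assuming G is information gain so that its range R is log₂ of the number of classes:

```python
import math

def hoeffding_epsilon(R, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2 * n))."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def ready_to_split(delta_G, R, delta, n):
    """Split once the observed gap G(Xa) - G(Xb) exceeds epsilon; otherwise
    keep accumulating examples, which shrinks epsilon as O(1/sqrt(n))."""
    return delta_G > hoeffding_epsilon(R, delta, n)
```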
Hoeffding Tree algorithm • Inputs: • S: a sequence of examples • X: a set of discrete attributes • G(·): a split evaluation function • δ: the desired probability of choosing the wrong attribute at any given node • Output: • HT: a decision tree
Hoeffding Tree algorithm—cont. • Procedure HoeffdingTree(S, X, G, δ) • Let HT be a tree with a single leaf l1 (the root). • Let X1 = X. • Let l1 predict the most frequent class in S. • For each class yk • For each value xij of each attribute Xi ∈ X • Let nijk(l1) = 0.
Hoeffding Tree algorithm—cont. • For each example (x, y) in S • Sort (x, y) into a leaf l using HT. • For each xij in x and yk = y such that Xi ∈ Xl • Increment nijk(l). • Label l with the majority class among the examples seen at l. • Compute Gl(Xi) for each attribute Xi ∈ Xl. • Let Xa be the attribute with the highest Gl. • Let Xb be the attribute with the second-highest Gl. • Compute ε. • If Gl(Xa) − Gl(Xb) > ε, then • Replace l by an internal node that splits on Xa. • For each branch of the split • Add a new leaf lm, and let Xm = Xl − {Xa}. • Let lm predict the most frequent class seen at l. • For each class yk and each value xij of each attribute Xi ∈ Xm • Let nijk(lm) = 0. • Return HT. • (A runnable sketch of this procedure follows.)
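The following is a minimal, runnable Python sketch of the procedure above, assuming discrete attributes and information gain as G. All names (HoeffdingTree, Leaf, Split) are mine, and paper details such as the null attribute X∅ are omitted for brevity; this is an illustration, not the authors' implementation.

```python
import math
import random
from collections import defaultdict

def entropy(class_counts):
    """Entropy of a {class: count} dictionary."""
    total = sum(class_counts.values())
    return -sum(c / total * math.log2(c / total)
                for c in class_counts.values() if c) if total else 0.0

def hoeffding_epsilon(R, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2 * n))."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

class Leaf:
    def __init__(self, attrs):
        self.attrs = set(attrs)            # attributes still available here
        self.n = 0                         # examples seen at this leaf
        self.class_counts = defaultdict(int)
        # n_ijk counts: counts[attribute][value][class]
        self.counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def add(self, x, y):
        self.n += 1
        self.class_counts[y] += 1
        for a in self.attrs:
            self.counts[a][x[a]][y] += 1

    def gain(self, a):
        """Information gain of attribute a, playing the role of G."""
        split_entropy = sum(sum(c.values()) / self.n * entropy(c)
                            for c in self.counts[a].values())
        return entropy(self.class_counts) - split_entropy

    def majority(self):
        return max(self.class_counts, key=self.class_counts.get, default=None)

class Split:
    def __init__(self, attr, children):
        self.attr, self.children = attr, children

class HoeffdingTree:
    def __init__(self, values, delta=1e-4):
        self.values = values               # {attribute: list of its values}
        self.delta = delta
        self.root = Leaf(values)

    def learn(self, x, y):
        # Sort (x, y) into a leaf, remembering the parent for a later split.
        node, parent, branch = self.root, None, None
        while isinstance(node, Split):
            parent, branch = node, x[node.attr]
            node = node.children[branch]
        node.add(x, y)
        if len(node.class_counts) < 2 or not node.attrs:
            return                         # pure leaf, or no attributes left
        ranked = sorted((node.gain(a), a) for a in node.attrs)
        g_a, x_a = ranked[-1]
        g_b = ranked[-2][0] if len(ranked) > 1 else 0.0
        R = math.log2(max(2, len(node.class_counts)))  # range of info gain
        if g_a - g_b > hoeffding_epsilon(R, self.delta, node.n):
            # Replace the leaf by an internal node that splits on x_a.
            children = {v: Leaf(node.attrs - {x_a}) for v in self.values[x_a]}
            split = Split(x_a, children)
            if parent is None:
                self.root = split
            else:
                parent.children[branch] = split

    def predict(self, x):
        node = self.root
        while isinstance(node, Split):
            node = node.children[x[node.attr]]
        return node.majority()

# Toy usage: the class depends only on 'color', so the tree should split on it.
values = {'color': ['red', 'green', 'blue'], 'shape': ['circle', 'square']}
tree = HoeffdingTree(values)
for _ in range(5000):
    x = {a: random.choice(vs) for a, vs in values.items()}
    tree.learn(x, 'pos' if x['color'] == 'red' else 'neg')
print(tree.predict({'color': 'red', 'shape': 'circle'}))   # expect 'pos'
```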
Hoeffding Trees—cont. • p: the leaf probability, i.e., the probability that an example reaches a leaf at any given level (assume this is constant). • HTδ: the tree produced by the Hoeffding tree algorithm with desired δ, given an infinite sequence of examples S. • DT*: the decision tree induced by choosing at each node the attribute with the true greatest G. • Δi(DT1, DT2): the intensional disagreement between two decision trees: Δi(DT1, DT2) = Σx P(x) I[Path1(x) ≠ Path2(x)] • P(x): the probability that the attribute vector x will be observed. • I(·): the indicator function (1 if its argument is true, 0 otherwise). • THEOREM: E[Δi(HTδ, DT*)] ≤ δ/p
Hoeffding Trees—cont. • Suppose the best and second-best attributes differ by 10% of the range of G. • According to ε = √(R² ln(1/δ) / 2n): • achieving one value of δ requires 380 examples • a δ roughly a thousand times smaller requires only 345 more examples • An exponential improvement in δ can be obtained with a linear increase in the number of examples. • (Checked numerically below.)
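These figures can be checked against the bound. In the script below (mine), the specific δ values are back-solved from the formula under the 10%-of-range assumption, since they did not survive on the slide:

```python
import math

# With delta_G = 0.1 * R, the split condition delta_G > epsilon rearranges to
# n = ln(1/delta) / (2 * 0.1**2) = 50 * ln(1/delta).
def n_required(delta, gap_fraction=0.1):
    return math.log(1.0 / delta) / (2.0 * gap_fraction ** 2)

print(round(n_required(5e-4)))   # -> 380
print(round(n_required(5e-7)))   # -> 725, i.e. only 345 more examples
# Shrinking delta 1000-fold costs ~345 extra examples: delta improves
# exponentially while n grows only linearly.
```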
The VFDT System • Very Fast Decision Tree learner (VFDT). • A decision-tree learning system based on the Hoeffding tree algorithm. • Uses either information gain or the Gini index as the attribute evaluation measure G. • Includes a number of refinements to the Hoeffding tree algorithm: • Ties. • G computation. • Memory. • Poor attributes. • Initialization. • Rescans.
The VFDT System—cont. • Ties • Two or more attributes may have very similar G's. • Potentially many examples would be required to decide between them with high confidence. • Yet it makes little difference which attribute is chosen. • So if ΔG < ε but ε < τ (a user-specified tie threshold): split on the current best attribute. • (Sketched below.)
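A sketch of how the tie rule changes the split test (the function name and the default value of τ are mine; the paper specifies the condition, not this code):

```python
def should_split(delta_G, epsilon, tau=0.05):
    """Usual Hoeffding test, plus VFDT's tie break: once epsilon has shrunk
    below the user threshold tau, the top candidates are effectively tied,
    so split on the current best attribute instead of waiting longer."""
    return delta_G > epsilon or epsilon < tau
```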
The VFDT System—cont. • G computation • The most significant part of the time cost per example is recomputing G. • Computing G for every new example is inefficient. • Instead, a user-specified number n_min of new examples must accumulate at a leaf before G is recomputed. • (Sketched below.)
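A sketch of the n_min refinement (the counter logic, names, and default value are mine):

```python
class SplitCheckSchedule:
    """Recompute G (and run the split test) only once every n_min new
    examples at a leaf; in between, only the counts n_ijk are updated."""
    def __init__(self, n_min=200):
        self.n_min = n_min
        self.seen_since_check = 0

    def example_arrived(self):
        self.seen_since_check += 1
        if self.seen_since_check >= self.n_min:
            self.seen_since_check = 0
            return True   # time to recompute G at this leaf
        return False
```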
The VFDT System—cont. • Memory • VFDT's memory use is dominated by the memory required to keep the counts for all growing leaves. • If the maximum available memory is reached, VFDT deactivates the least promising leaves. • The least promising leaves are those with the lowest values of p_l · e_l, where p_l is the probability that an arbitrary example falls into leaf l and e_l is the observed error rate at l. • (Sketched below.)
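A sketch of the deactivation policy, assuming each leaf can report its p_l and e_l (the function is mine):

```python
def leaves_to_deactivate(leaves, budget):
    """leaves: list of (p_l, e_l, leaf) triples. Ranks leaves by the promise
    measure p_l * e_l and returns all but the `budget` most promising ones,
    which are the candidates for deactivation when memory runs out."""
    ranked = sorted(leaves, key=lambda t: t[0] * t[1])  # least promising first
    return ranked[:max(0, len(ranked) - budget)]
```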
The VFDT System—cont. • Poor attributes • VFDT's memory usage is also minimized by dropping, early on, attributes that do not look promising. • As soon as the difference between an attribute's G and the best attribute's G exceeds ε, the attribute can be dropped. • The memory used to store the corresponding counts can then be freed. • (Sketched below.)
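And a sketch of the attribute-dropping test (names mine):

```python
def droppable_attributes(gains, epsilon):
    """gains: {attribute: observed G at this leaf}. An attribute whose G
    trails the best attribute's by more than epsilon is, with high
    probability, not going to be the split choice, so its counts can be
    freed."""
    g_best = max(gains.values())
    return [a for a, g in gains.items() if g_best - g > epsilon]
```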
The VFDT System—cont. • Initialization • VFDT can be initialized with the tree produced by a conventional RAM-based learner on a small subset of the data. • The tree can be input either as-is or over-pruned. • This gives VFDT a “head start”.
The VFDT System—cont. • Rescans • VFDT can rescan previously seen examples. • Rescanning can be activated if: • The data arrives slowly enough that there is time for it, or • The dataset is finite and small enough that rescanning is feasible.
Synthetic Data Study • Comparing VFDT with C4.5 release 8. • Both systems were restricted to using the same amount of RAM. • VFDT used information gain as the G function. • 14 concepts were used, all with 2 classes and 100 attributes. • The target trees were generated as follows: for each level after the first 3, a fraction of the nodes was replaced by leaves and the rest became splits on a random attribute; at a depth of 18, all remaining nodes were replaced with leaves; each leaf was randomly assigned a class. • A stream of training examples was then generated by sampling uniformly from the instance space and assigning classes according to the target tree. • Various levels of class and attribute noise were added.
Synthetic Data Study—cont. • Accuracy as a function of the number of training examples.
Synthetic Data Study—cont. • Tree size as a function of the number of training examples.
Synthetic Data Study—cont. • Accuracy as a function of the noise level. • 4 runs on the same concept (C4.5: 100k examples; VFDT: 20 million examples).
Lesion Study • Effect of initializing VFDT with C4.5, with and without over-pruning.
Web Data • Applying VFDT to mining the stream of Web page requests from the whole University of Washington main campus. • To mine 1.6 million examples: • VFDT took 1540 seconds to do one pass over the training data. • 983 of those seconds were spent reading data from disk. • C4.5 took 24 hours to mine the same 1.6 million examples.
Web Data—cont. • Performance on Web data
Conclusion • Hoeffding trees: • A method for learning online from high-volume data streams. • Learns in very small constant time per example. • Guarantees high similarity to the corresponding batch-learned trees. • VFDT: • A high-performance data mining system based on Hoeffding trees. • Effective at taking advantage of massive numbers of examples.
Qs & As • Name 4 requirements that algorithms must meet to overcome the problems of currently available disk-based algorithms (any four of the following): • Operate continuously and indefinitely. • Incorporate examples as they arrive, never losing potentially valuable information. • Build a model using at most one scan of the data. • Use only a fixed amount of main memory. • Require small constant time per record. • Make a usable model available at any point in time. • Produce a model equivalent to the one obtained by an ordinary database mining algorithm. • When the data-generating process changes over time, keep the model up-to-date.
Qs & As • What are the benefits of considering only a subset of the training examples to find the best attribute? • It works for extremely large datasets. • Each example is read at most once. • Online data sources can be mined directly. • Complex trees can be built at acceptable computational cost.
Qs & As • How does VFDT's tie refinement to the Hoeffding tree algorithm work? • Two or more attributes may have very similar G's. • Potentially many examples would be required to decide between them with high confidence. • Since it makes little difference which attribute is chosen, VFDT splits on the current best attribute as soon as ε < τ.