
Selective Block Minimization for Faster Convergence of Limited Memory Large-scale Linear Models


Presentation Transcript


  1. Selective Block Minimization for Faster Convergence of Limited Memory Large-scale Linear Models
Kai-Wei Chang and Dan Roth

Motivation
• Linear classifiers handle large-scale data well, but a key challenge for large-scale classification is dealing with data sets that cannot fit in memory.
• Examples include spam filtering, classifying query logs, and classifying web data and images.
• Recent public challenges are also large scale: the spam filtering data in the Pascal challenge is 20GB, and the ImageNet challenge is ~130GB.
• Existing methods for learning from large-scale data:
  • Data smaller than memory: efficient training methods are well developed.
  • Data beyond disk size: distributed solutions.
• When the data cannot fit into memory:
  • A batch learner suffers from disk swapping.
  • An online learner requires a large number of iterations to converge → a large amount of I/O.
  • Block minimization [Yu et al. 2010] requires many block loads because it treats all training examples uniformly.
• Our challenge: efficiently handle data larger than memory capacity on one machine.

Training Time Analysis
• Training time = running time in memory (training) + time spent accessing data on disk.
• Two orthogonal directions for reducing disk access:
  • Apply compression to lessen loading time.
  • Learn better from the data in memory, e.g., by better utilizing memory during learning.

Contributions
• The Selective Block Minimization (SBM) algorithm selects informative examples and caches them in memory. It has the following properties:
  • It significantly improves the I/O cost: on the spam filtering data, SBM obtains an accurate model with a single pass over the data set, while the current best method requires 10 rounds. As a result, SBM efficiently gives a stable model.
  • It maintains good convergence properties: a method that selectively caches data and treats samples non-uniformly can usually only converge to an approximate solution, but SBM provably converges linearly to the global optimum on the entire data, regardless of the cache selection strategy used.
  • It has been implemented in a branch of LIBLINEAR.

Linear SVM
• The objective is the standard L2-regularized linear SVM; block minimization and SBM solve it in its dual form.

Block Minimization [Yu et al. 2010]
• Split the data into blocks and store them in compressed files.
• At each step, load one data block and train on it by solving a sub-problem.
• The drawback: the method spends the same time and memory on important and unimportant samples.

Selective Block Minimization: The Algorithm
• Intuition: focus only on informative samples. If we had an oracle of support vectors, we would only need to train the model on the support vectors.
• At each step, the algorithm forms a working set from the new data block Bj loaded from disk together with the informative samples currently cached in memory, and solves the sub-problem restricted to this working set.
• Sub-problems can be solved by a linear classification package such as LIBLINEAR.
• Finally, it updates the model and selects the samples to cache for the next step from the union of the current cache and Bj.
• The approach is related to selective sampling and to the shrinking strategy.

Selecting Cached Data
• Samples that are close to the separator are usually important, so these are the samples to cache.
• Define a scoring function and keep the samples with the higher scores in the cache → disk-level shrinking (see the sketch below).
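As a rough illustration of the loop above, the following Python sketch trains on each new block together with the cached samples and then refreshes the cache, under simplifying assumptions: blocks are dense NumPy arrays, load_block reads one block from disk, and solve_subproblem stands in for any solver of the restricted problem (for example, a LIBLINEAR-style dual solver). The function names and the margin-based scoring rule are illustrative, not the exact implementation from the paper.

```python
import numpy as np

def sbm_train(block_paths, load_block, solve_subproblem,
              n_features, cache_size, n_outer_iters=1):
    """Illustrative Selective Block Minimization loop.

    load_block(path)          -> (X, y) for one block stored on disk
    solve_subproblem(X, y, w) -> weight vector after optimizing the
                                 sub-problem restricted to (X, y)
    """
    w = np.zeros(n_features)
    cache_X = np.empty((0, n_features))
    cache_y = np.empty(0)

    for _ in range(n_outer_iters):           # a single pass is often enough
        for path in block_paths:
            # Load a new block from disk and join it with the cached samples.
            X_new, y_new = load_block(path)
            X = np.vstack([X_new, cache_X])
            y = np.concatenate([y_new, cache_y])

            # Solve the sub-problem on block + cache (e.g. a LIBLINEAR-style
            # dual solver), warm-started from the current model.
            w = solve_subproblem(X, y, w)

            # Disk-level shrinking: samples with a small (or negative)
            # functional margin are close to the separator and likely
            # support vectors, so keep them in the cache.
            margin = y * (X @ w)
            keep = np.argsort(margin)[:cache_size]
            cache_X, cache_y = X[keep], y[keep]

    return w
```

With n_outer_iters=1, the same loop corresponds to the single-pass streaming setting described next.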
Learning From Stream Data
• We can treat a large data set as a stream and process the data in an online manner.
• Most online learners cannot obtain a good model with only a single pass over the data, because they apply a simple update rule to one sample at a time.
• Applying SBM with only one outer iteration, i.e., considering a data block at a time, can do better.
• The dual variables of instances that are removed from memory need not be stored → more data can be kept in memory.
• The resulting model is more stable and can adjust to the available resources.

Experiment Settings
• Training is restricted to use at most 2GB of memory.
• Compared methods:
  • Online methods: VW, Perceptron, MIRA, CW.
  • Block minimization framework [Yu et al. 2010]: BMD, BM-PEGASOS.
  • Proposed methods: SBM with various settings.

Training Time and Testing Accuracy
• (Figures on the poster compare training time and testing accuracy of the methods above.)

Experiments on Streaming Data
• (Figures on the poster report results in the single-pass streaming setting.)

Implementation Issues
• Storing cached samples in memory requires a careful design to avoid duplicating data.
• Dealing with a large vector of dual variables α:
  • All the variables must be stored, even when the corresponding instances are not in memory, and α can be very large when the number of instances is large.
  • If α is sparse (the number of support vectors is much smaller than the number of instances), the nonzero entries can be stored in a hash table (see the sketch below).
  • Otherwise, the unused entries can be stored on disk.
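The following sketch illustrates the hash-table idea for the dual variables discussed under Implementation Issues: only nonzero α_i are kept, keyed by a global instance id, so instances evicted from memory whose α_i is zero take no space. The class and its interface are hypothetical, not LIBLINEAR's actual data structure.

```python
class SparseDuals:
    """Hash-table storage for dual variables: keep only nonzero alpha_i,
    keyed by a global instance id (illustrative sketch)."""

    def __init__(self, tol=1e-12):
        self.tol = tol
        self._alpha = {}                  # instance id -> nonzero alpha_i

    def get(self, idx):
        # Instances never touched, or whose alpha returned to zero, cost nothing.
        return self._alpha.get(idx, 0.0)

    def set(self, idx, value):
        if abs(value) > self.tol:
            self._alpha[idx] = value
        else:
            self._alpha.pop(idx, None)    # keep the table sparse

    def n_support_vectors(self):
        return len(self._alpha)
```

When α is not sparse, the same interface could instead page rarely used entries out to disk, as suggested above.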

This research is sponsored by the DARPA Machine Reading Program and the Bootstrapping Learning Program.
The code to regenerate the experiments can be downloaded at: http://cogcomp.cs.illinois.edu/page/publication_view/660