
Progressive Sampling


Presentation Transcript


  1. Progressive Sampling. Instance Selection and Construction for Data Mining, Ch. 9. F. Provost, D. Jensen, and T. Oates. 2001.5.16, 신수용

  2. Introduction • Increasing the amount of data leads to greater computational cost • Progressive sampling attempts to maximize accuracy as efficiently as possible, starting with a small sample and using progressively larger ones until model accuracy no longer improves. • A central component of progressive sampling is a sampling schedule S = {n0, n1, …, nk} • ni: the size of the i-th sample
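As an illustration of the basic loop (a minimal sketch, not the chapter's code: `train_and_score`, `has_converged`, and the schedule values are hypothetical placeholders):

```python
import random

def progressive_sampling(data, schedule, train_and_score, has_converged):
    """Train on progressively larger samples until accuracy stops improving.

    data            -- the full data set (a list of instances)
    schedule        -- increasing sample sizes, e.g. [100, 200, 400, 800]
    train_and_score -- hypothetical callback: sample -> (model, accuracy)
    has_converged   -- hypothetical callback: accuracy history -> bool
    """
    history = []                  # (sample size, accuracy) pairs observed so far
    model = None
    for n in schedule:
        sample = random.sample(data, min(n, len(data)))
        model, acc = train_and_score(sample)
        history.append((n, acc))
        if has_converged(history):    # accuracy no longer improving
            break
    return model, history
```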

  3. Three fundamental questions for progressive sampling • What is an efficient sampling schedule? • How can convergence be detected effectively and efficiently? • As sampling progresses, can the schedule be adapted to be more efficient?

  4. Learning curves • A learning curve depicts the relationship between sample size and model accuracy

  5. Def. 1. • Given a data set, a sampling procedure, and an induction algorithm, nmin is the size of the smallest sufficient training set. Models built from smaller training sets have lower accuracy than models built from training sets of size nmin, and models built from larger training sets have no higher accuracy.
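Writing acc(n) for the expected accuracy of a model built from a training set of size n (notation assumed here, not the chapter's), the definition can be paraphrased as follows:

```latex
% n_min is the smallest training-set size that already attains the best
% achievable accuracy; by minimality, smaller sets do strictly worse, and
% by the condition below, larger sets do no better.
\[
  n_{\min} \;=\; \min \bigl\{\, n \;:\; \operatorname{acc}(n) \ge \operatorname{acc}(m) \ \text{for all } m \,\bigr\}
\]
```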

  6. Progressive Sampling

  7. Determining an efficient schedule • Static sampling • Estimates a sufficient sample size without progressive sampling, based on a subsample’s statistical similarity to the entire sample • Arithmetic sampling (John & Langley 1996) • Increases the sample size by a fixed increment at each step • (Drawback) if nmin is a large multiple of the increment, the approach requires many runs of the underlying induction algorithm
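For illustration, an arithmetic schedule adds a fixed increment at each step; this is a sketch with assumed parameter names, not the chapter's code:

```python
def arithmetic_schedule(n0, delta, n_max):
    """Sample sizes n0, n0 + delta, n0 + 2*delta, ... capped at the data set size."""
    sizes = []
    n = n0
    while n < n_max:
        sizes.append(n)
        n += delta
    sizes.append(n_max)   # finish with the full data set
    return sizes

# e.g. arithmetic_schedule(100, 100, 1000) -> [100, 200, ..., 900, 1000]
# If nmin is many increments away from n0, the learner is run many times.
```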

  8. Determining an efficient schedule • Geometric sampling • Multiplies the sample size by a constant factor a at each step • Escapes the limitations of arithmetic sampling: the number of runs grows only logarithmically with nmin
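By contrast, a geometric schedule multiplies the sample size by a constant factor at each step; again a sketch with assumed parameter names:

```python
def geometric_schedule(n0, a, n_max):
    """Sample sizes n0, a*n0, a^2*n0, ... capped at the data set size."""
    sizes = []
    n = n0
    while n < n_max:
        sizes.append(n)
        n = int(n * a)
    sizes.append(n_max)   # final run on the full data set
    return sizes

# e.g. geometric_schedule(100, 2, 100000) -> [100, 200, 400, ..., 51200, 100000]
```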

  9. Asymptotic optimality of geometric sampling • For induction algorithms whose time complexity f(n) is polynomial and no better than O(n), if convergence can also be detected in O(f(n)) time, then geometric progressive sampling is asymptotically optimal among progressive sampling methods in terms of run time.
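The intuition can be seen for the special case f(n) = n^c with c >= 1 (an illustrative choice, not the chapter's proof): the total cost of a geometric schedule is dominated by its largest run, which is within a constant factor of a single run at size nmin.

```latex
% Cost of the geometric schedule n_0, a n_0, a^2 n_0, \dots, a^b n_0,
% where a^b n_0 is the first size at or above n_min, for a learner with
% run time f(n) = n^c, c >= 1, and constant factor a > 1:
\[
  \sum_{i=0}^{b} \bigl(a^{i} n_0\bigr)^{c}
  \;=\; n_0^{c}\,\frac{a^{c(b+1)} - 1}{a^{c} - 1}
  \;=\; O\!\bigl((a^{b} n_0)^{c}\bigr)
  \;=\; O\!\bigl(f(n_{\min})\bigr)
\]
% The whole schedule therefore costs only a constant factor more than a
% single run on a sample of size n_min.
```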

  10. Optimality with Respect to Expectations of Convergence • In many cases there may be no prior information about the likelihood of convergence occurring at any given n. • However, since in many cases nmin << N, it is more reasonable to assume a distribution concentrated at small sizes (roughly log-normal). • Identifying the optimal schedule by dynamic programming requires O(N^2) space and O(N^3) time.
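The chapter's dynamic program is not reproduced here; the sketch below only illustrates the general idea under a simplified cost model: a run at size n costs f(n) and is paid only if convergence was not detected at the previous scheduled size, with `cdf(n)` an assumed prior probability that nmin <= n. All names and the cost model itself are assumptions for illustration.

```python
def optimal_schedule(sizes, f, cdf):
    """Pick the expected-cost-minimizing schedule from candidate sizes (sketch).

    sizes -- candidate sample sizes in increasing order; the last entry is N
    f     -- f(n): cost of one learning run on a sample of size n
    cdf   -- cdf(n): assumed prior probability that nmin <= n
    """
    M = len(sizes)
    cost = [0.0] * M     # cost[j]: least expected cost of a schedule ending at sizes[j]
    prev = [None] * M    # back-pointer to the preceding scheduled size (None = first step)
    for j in range(M):
        cost[j] = f(sizes[j])          # sizes[j] taken as the first, unconditional run
        for i in range(j):
            # Run sizes[j] only if convergence was not detected at sizes[i].
            c = cost[i] + f(sizes[j]) * (1.0 - cdf(sizes[i]))
            if c < cost[j]:
                cost[j], prev[j] = c, i
    # A valid schedule must end with the full data set, i.e. the last candidate.
    schedule, j = [], M - 1
    while j is not None:
        schedule.append(sizes[j])
        j = prev[j]
    return list(reversed(schedule)), cost[M - 1]
```

This simplified version runs in time quadratic in the number of candidate sizes; the chapter's formulation, to which the O(N^2) space and O(N^3) time bounds on the slide refer, differs in its details.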

  11. Comparison of cost • The costs for three different schedules

  12. Comparison of cost • Dynamic programming with various f(n) given a uniform prior. • Note that the optimal schedule depends on f(n)

  13. Comparison of cost • Dynamic programming with various f(n) given a log-normal prior.

  14. Detecting convergence • Linear regression with local sampling (LRLS) • Begins at the latest scheduled sample size ni and samples l additional points in the local neighborhood of ni. • These points are used to estimate a linear regression line, whose slope is compared to zero. • If the slope is sufficiently close to zero, convergence is detected. • LRLS takes advantage of a common property of learning curves: once convergence is reached, the curve becomes nearly flat.
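A rough sketch of the LRLS idea follows; the neighborhood width, tolerance, and the `estimate_accuracy` callback are assumptions, not the authors' implementation.

```python
def lrls_converged(n_i, estimate_accuracy, l=10, radius=0.05, tol=1e-4):
    """Linear regression with local sampling (sketch).

    n_i               -- the latest scheduled sample size
    estimate_accuracy -- hypothetical callback: sample size -> estimated accuracy
    l                 -- number of local points sampled around n_i
    radius            -- half-width of the neighborhood, as a fraction of n_i
    tol               -- declare convergence if |slope| < tol

    Assumes n_i is large enough that the neighborhood contains distinct sizes.
    """
    # Sample l sizes spread evenly across the local neighborhood of n_i.
    low, high = int(n_i * (1 - radius)), int(n_i * (1 + radius))
    xs = [low + k * (high - low) // (l - 1) for k in range(l)]
    ys = [estimate_accuracy(n) for n in xs]

    # Ordinary least-squares slope of accuracy versus sample size.
    mx, my = sum(xs) / l, sum(ys) / l
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)

    return abs(slope) < tol   # a nearly flat learning curve signals convergence
```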

  15. Empirical Comparison
