batch online learning

Batch learning (i.i.d.) and transductive online learning [Littlestone89]
Adam Kalai, Sham Kakade
Toyota Technological Institute (TTI)

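Batch learning vs. online learning

Common to both: a family of functions F (e.g. halfspaces) over X; ERM = "best on data".

Batch learning, agnostic model [Kearns, Schapire, Sellie94]:
• Arbitrary distribution over $X \times \{-,+\}$.
• Alg. H receives $(x_1,y_1),\dots,(x_n,y_n)$ drawn from the distribution and outputs $h \in F$.
• Def. H learns F if, for every distribution, $E[\mathrm{err}(h)] \le \min_{f \in F} \mathrm{err}(f) + n^{-c}$ and H runs in time poly(n).

Online learning:
• $(x_1,y_1),\dots,(x_n,y_n) \in X \times \{-,+\}$ arrive one at a time.
• [Figure: points in X labeled – and +; the first example x1 is revealed and the learner holds hypothesis h1.]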

Batch learning vs. online learning (figure, step 2)
[Figure: points in X labeled – and +; the examples x1 and x2 are revealed and the online learner holds hypothesis h2. The batch side is unchanged: hypothesis h, ERM = "best on data", family of functions F (e.g. halfspaces).]

Batch learning vs. online learning (figure, step 3)
[Figure: points in X labeled – and +; the examples x1, x2 and x3 are revealed and the online learner holds hypothesis h3. The batch side is unchanged: hypothesis h, ERM = "best on data", family of functions F (e.g. halfspaces).]
Goal: $\mathrm{err}(\mathrm{alg}) \le \min_{f \in F} \mathrm{err}(f) + {}$ [additive term lost in extraction]

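Batch learning vs. transductive online learning [Ben-David, Kushilevitz, Mansour95]

Common to both: a family of functions F (e.g. halfspaces) over X; ERM = "best on data".

Batch learning: arbitrary distribution over $X \times \{-,+\}$; the learner outputs a single hypothesis h.

Transductive online learning:
• The unlabeled points $\{x_1,x_2,\dots,x_n\}$ are given to Alg. H in advance; the labeled sequence $(x_1,y_1),\dots,(x_n,y_n) \in X \times \{-,+\}$ is arbitrary.
• Given $(x_1,y_1),\dots,(x_{i-1},y_{i-1})$, Alg. H outputs $h_i \in F$ ("proper" learning: output $h^{(i)} \in F$).
• Analogous definition: H learns F if, $\forall (x_1,y_1),\dots,(x_n,y_n)$: $E[\mathrm{err}(H)] \le \min_{f \in F} \mathrm{err}(f) + n^{-c}$ and H runs in time poly(n).
• Goal: $\mathrm{err}(\mathrm{alg}) \le \min_{f \in F} \mathrm{err}(f) + {}$ [additive term lost in extraction]

[Figure: points in X labeled – and +, with the unlabeled points shown as dots; x1, x2, x3 revealed with hypotheses h1, h2, h3; the two settings are marked "equivalent".]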

Our results

HERM = Hallucination + ERM

Theorem 1. In the online transductive setting, HERM requires one ERM computation per sample.

Theorem 2. These are equivalent for proper learning:
• F is agnostically learnable
• ERM agnostically learns F (ERM can be done efficiently and VC(F) is finite)
• F is online transductively learnable
• HERM online transductively learns F

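Online ERM algorithm (sucks)

Choose $h_i \in F$ with minimal errors on $(x_1,y_1),\dots,(x_{i-1},y_{i-1})$:
$h_i = \arg\min_{f \in F} |\{\, j < i : f(x_j) \ne y_j \,\}|$

Counterexample: $X = \{(0,0)\}$, $F = \{-,+\}^X$.
x1 = (0,0)   y1 = –   h1(x) = +
x2 = (0,0)   y2 = +   h2(x) = –
x3 = (0,0)   y3 = –   h3(x) = +
x4 = (0,0)   y4 = +   h4(x) = –
…
ERM is wrong on every example, while either constant function is wrong on only half of them.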
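
The counterexample can be replayed in a few lines. This is a minimal sketch, not code from the talk: the names `errors`, `erm` and `online_erm_mistakes` are made up for illustration, labels are encoded as -1/+1, and ties are assumed to be broken in favour of the "+" function so that h1 = + as on the slide.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[int, int]
Example = Tuple[Point, int]          # (x, y) with y in {-1, +1}
Hypothesis = Callable[[Point], int]


def errors(f: Hypothesis, data: Sequence[Example]) -> int:
    """Number of examples in `data` that f labels incorrectly."""
    return sum(1 for x, y in data if f(x) != y)


def erm(hypotheses: Sequence[Hypothesis], data: Sequence[Example]) -> Hypothesis:
    """Brute-force ERM over a finite class; min() keeps the first minimiser,
    so ties go to the hypothesis listed first (an assumption of this sketch)."""
    return min(hypotheses, key=lambda f: errors(f, data))


def online_erm_mistakes(hypotheses: Sequence[Hypothesis],
                        sequence: Sequence[Example]) -> int:
    """Predict y_i with h_i = ERM on the first i-1 examples; count mistakes."""
    mistakes = 0
    for i, (x, y) in enumerate(sequence):
        h_i = erm(hypotheses, sequence[:i])
        if h_i(x) != y:
            mistakes += 1
    return mistakes


# F = {-,+}^X with X = {(0,0)}: just the two constant functions.
F: List[Hypothesis] = [lambda x: +1, lambda x: -1]

# Labels alternate -, +, -, + ... on the single point (0,0).
n = 8
seq: List[Example] = [((0, 0), -1 if i % 2 == 0 else +1) for i in range(n)]

erm_mistakes = online_erm_mistakes(F, seq)
best_fixed = min(errors(f, seq) for f in F)
print(erm_mistakes, best_fixed)  # 8 4
```

With n = 8 the online ERM learner errs on all 8 rounds while the best fixed function errs on 4, which is the gap the slide illustrates.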

Online ERM algorithm

Choose $h_i \in F$ with minimal errors on $(x_1,y_1),\dots,(x_{i-1},y_{i-1})$:
$h_i = \arg\min_{f \in F} |\{\, j < i : f(x_j) \ne y_j \,\}|$

Online "stability" lemma [KVempala01]:
$\mathrm{err}(\mathrm{ERM}) \le \min_{f \in F} \mathrm{err}(f) + \sum_{i \in \{1,\dots,n\}} [h_i \ne h_{i+1}]$

Proof: by induction on n = #examples. Easy!
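
One way the induction might go is sketched below. The slide gives no details, so this is an assumed reconstruction of the standard argument: err is taken to count mistakes rather than rates, $\ell_i(h)$ is notation introduced here for the 0/1 loss of h on example i, and $h_{i+1}$ denotes the ERM hypothesis on the first i examples.

```latex
% Notation assumed for this sketch: \ell_i(h) = 1 if h(x_i) \ne y_i, and 0 otherwise.

% Step 1, by induction on n: predicting with the *next* ERM hypothesis
% is at least as good as the best fixed f.
\[
  \sum_{i=1}^{n} \ell_i(h_{i+1}) \;\le\; \min_{f \in F} \sum_{i=1}^{n} \ell_i(f).
\]
% Inductive step: use the claim for n-1 with f = h_{n+1}, then add \ell_n(h_{n+1})
% to both sides; h_{n+1} minimises the total loss on all n examples.
\[
  \sum_{i=1}^{n} \ell_i(h_{i+1})
  \;\le\; \sum_{i=1}^{n-1} \ell_i(h_{n+1}) + \ell_n(h_{n+1})
  \;=\; \min_{f \in F} \sum_{i=1}^{n} \ell_i(f).
\]

% Step 2: h_i and h_{i+1} can disagree on example i only if they differ.
\[
  \ell_i(h_i) \;\le\; \ell_i(h_{i+1}) + [h_i \ne h_{i+1}].
\]

% Summing Step 2 over i and applying Step 1:
\[
  \mathrm{err}(\mathrm{ERM}) = \sum_{i=1}^{n} \ell_i(h_i)
  \;\le\; \min_{f \in F} \mathrm{err}(f) + \sum_{i=1}^{n} [h_i \ne h_{i+1}].
\]
```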

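Online HERM algorithm

Inputs: the unlabeled set $\{x_1,x_2,\dots,x_n\}$, int R.
• For each x in the set, hallucinate $r_x^+$ copies of (x,+) and $r_x^-$ copies of (x,–), with $r_x^+$ and $r_x^-$ random from {1,2,…,R}.
• Choose $h_i \in F$ that minimizes errors on hallucinated data + $(x_1,y_1),\dots,(x_{i-1},y_{i-1})$.

Stability: $\forall i$, $\Pr_{r_{x_i}^+,\, r_{x_i}^-}[h_i \ne h_{i+1}] \le R^{-1}$

[Figure: the $r_{x_i}^+$ hallucinated copies $(x_i,+),(x_i,+),\dots,(x_i,+)$ next to the new example; photo of James Hannan.]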
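
The two steps above can be sketched for a small finite class. This is an illustrative sketch under stated assumptions, not the authors' implementation: ERM is done by brute force, the counts r are assumed to be drawn uniformly and independently from {1,…,R} once before any label is seen, and the names `hallucinate` and `herm_predictions` are hypothetical.

```python
import random
from typing import Callable, Hashable, List, Sequence, Tuple

Label = int                              # +1 or -1
Example = Tuple[Hashable, Label]
Hypothesis = Callable[[Hashable], Label]


def errors(f: Hypothesis, data: Sequence[Example]) -> int:
    """Number of examples in `data` that f labels incorrectly."""
    return sum(1 for x, y in data if f(x) != y)


def hallucinate(xs: Sequence[Hashable], R: int, rng: random.Random) -> List[Example]:
    """For each distinct x, add r_plus copies of (x,+) and r_minus copies of (x,-)."""
    fake: List[Example] = []
    for x in dict.fromkeys(xs):          # distinct points, first-seen order
        r_plus = rng.randint(1, R)       # assumed uniform on {1,...,R}
        r_minus = rng.randint(1, R)
        fake += [(x, +1)] * r_plus + [(x, -1)] * r_minus
    return fake


def herm_predictions(hypotheses: Sequence[Hypothesis],
                     sequence: Sequence[Example],
                     R: int, seed: int = 0) -> List[Label]:
    """Transductive online loop: the x's are known up front, labels arrive one by one."""
    rng = random.Random(seed)
    fake = hallucinate([x for x, _ in sequence], R, rng)
    preds: List[Label] = []
    for i, (x, _) in enumerate(sequence):
        data = fake + list(sequence[:i])
        # One ERM computation per example (brute force here; ties go to the first f).
        h_i = min(hypotheses, key=lambda f: errors(f, data))
        preds.append(h_i(x))
    return preds


# The alternating sequence on the single point (0,0) that defeats plain online ERM.
F: List[Hypothesis] = [lambda x: +1, lambda x: -1]
seq: List[Example] = [((0, 0), -1 if i % 2 == 0 else +1) for i in range(10)]
preds = herm_predictions(F, seq, R=5, seed=1)
mistakes = sum(p != y for p, (_, y) in zip(preds, seq))
print(mistakes)
```

On this particular sequence, and with this sketch's tie-breaking, the learner keeps flipping only when the two hallucinated counts happen to be equal, which has probability 1/R under the uniform assumption; otherwise it settles on one constant function and makes n/2 mistakes.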

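Online HERM algorithm (Theorem 1)

Inputs: the unlabeled set $\{x_1,x_2,\dots,x_n\}$, int R.
• For each x in the set, hallucinate $r_x^+$ copies of (x,+) and $r_x^-$ copies of (x,–), with $r_x^+$ and $r_x^-$ random from {1,2,…,R}.
• Choose $h_i \in F$ that minimizes errors on hallucinated data + $(x_1,y_1),\dots,(x_{i-1},y_{i-1})$.

Stability: $\forall i$, $\Pr_{r_{x_i}^+,\, r_{x_i}^-}[h_i \ne h_{i+1}] \le R^{-1}$

Theorem 1. For $R = n^{1/4}$: [bound lost in extraction; its two terms are labeled "online 'stability' lemma" and "hallucination cost"]. It requires one ERM computation per example.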

Being more adaptive (shifting bounds)

$(x_1,y_1),\dots,(x_i,y_i),\dots,(x_{i+W},y_{i+W}),\dots,(x_n,y_n)$
[Figure: a window marked over examples i through i+W.]

Related work

• Inequivalence of batch and online learning in noiseless setting [Blum90, Balcan06]
    • ERM black box is noiseless
    • For computational reasons!
• Inefficient alg. for online trans. learning [Ben-David, Kushilevitz, Mansour95]:
    • List all $\le (n+1)^{VC(F)}$ labelings (Sauer's lemma)
    • Run weighted majority [Littlestone, Warmuth92]
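
The inefficient approach in the last bullet can be illustrated on a toy class. This is a hedged sketch: the class of 1-D thresholds (VC dimension 1) is an assumption chosen to keep the enumeration tiny, the deterministic weighted-majority rule with penalty `beta` is used where a randomized variant might be preferred for agnostic guarantees, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Labeling = Tuple[int, ...]   # one label in {-1, +1} per point, in sequence order


def threshold_labelings(xs: Sequence[float]) -> List[Labeling]:
    """All distinct labelings of xs by thresholds f_t(x) = +1 iff x >= t."""
    cuts = sorted(set(xs)) + [math.inf]
    return sorted({tuple(+1 if x >= t else -1 for x in xs) for t in cuts})


def weighted_majority(experts: Sequence[Labeling], ys: Sequence[int],
                      beta: float = 0.5) -> int:
    """Deterministic weighted majority over labelings; returns the mistake count."""
    weights = [1.0] * len(experts)
    mistakes = 0
    for i, y in enumerate(ys):
        vote = sum(w * e[i] for w, e in zip(weights, experts))
        pred = +1 if vote >= 0 else -1
        if pred != y:
            mistakes += 1
        # Multiply the weight of every expert that was wrong on round i by beta.
        weights = [w * beta if e[i] != y else w for w, e in zip(weights, experts)]
    return mistakes


# Transductive setting: the points are known in advance, the labels are not.
xs = [3.0, 1.0, 4.0, 1.5, 9.0, 2.6, 5.0, 3.5, 8.0, 7.0]
ys = [+1 if x >= 3.0 else -1 for x in xs]
ys[4] = -ys[4]                      # one noisy label, so no threshold is perfect

experts = threshold_labelings(xs)   # n + 1 labelings for n distinct points
wm_mistakes = weighted_majority(experts, ys)
best_expert = min(sum(a != y for a, y in zip(e, ys)) for e in experts)
print(len(experts), wm_mistakes, best_expert)
```

For thresholds the list has only n + 1 entries, but for a class of VC dimension d it can grow to about $(n+1)^d$, which is why the slide calls this approach inefficient.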

Conclusions

• Alg. for removing the i.i.d. assumption, efficiently, using unlabeled data
• Interesting way to use unlabeled data online, reminiscent of bootstrap/bagging
• Adaptive version: can do well on every window
• Find "right" algorithm/analysis
