
# batch online learning - PowerPoint PPT Presentation

Adam Kalai, Sham Kakade. Toyota Technological Institute (TTI).


## PowerPoint Slideshow about 'batch online learning' - sawyer

Presentation Transcript

### batch online learning

transductive, i.i.d. [Littlestone89]

Toyota Technological Institute (TTI)

Batch learning vs. Online learning

Family of functions F (e.g. halfspaces)

Batch learning, agnostic model [Kearns, Schapire, Sellie94]: arbitrary dist. over $X \times \{-,+\}$

$(x_1,y_1),\dots,(x_n,y_n) \in X \times \{-,+\}$

Alg. H: $(x_1,y_1),\dots,(x_n,y_n) \mapsto h \in F$

ERM = “best on data”

Def. H learns F if, $\forall$ dist.:

$$E[\mathrm{err}(h)] \le \min_{f \in F} \mathrm{err}(f) + n^{-c}$$

and H runs in time poly(n)

[Figure: points in X labeled +, with the first online example $x_1$ and hypothesis $h_1$]
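ERM, “best on data”, can be illustrated with a minimal sketch. The toy class of 1-D thresholds (a 1-D analogue of halfspaces), the sample, and the names `threshold`, `empirical_errors` and `erm` are assumptions made for illustration; they are not from the slides.

```python
from typing import Callable, List, Tuple

Example = Tuple[float, int]  # (x, y) with y in {-1, +1}

def threshold(t: float) -> Callable[[float], int]:
    """Hypothetical toy hypothesis: predict +1 iff x >= t."""
    return lambda x: 1 if x >= t else -1

def empirical_errors(h: Callable[[float], int], data: List[Example]) -> int:
    """Number of examples in data that h labels incorrectly."""
    return sum(1 for x, y in data if h(x) != y)

def erm(hypotheses, data):
    """Return a hypothesis with the fewest errors on data (ties: first in the list)."""
    return min(hypotheses, key=lambda h: empirical_errors(h, data))

# A noisy sample: no threshold fits it perfectly (the agnostic case).
data = [(0.1, -1), (0.4, -1), (0.6, 1), (0.9, 1), (0.3, 1)]
F = [threshold(t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]

best = erm(F, data)
best_errors = empirical_errors(best, data)
print(best_errors)  # 1
```

Brute-force search over a finite list stands in for the ERM black box here; for a real class such as halfspaces, ERM would be a separate optimization routine.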

Batch learning vs. Online learning

[Figure: points in X labeled +, with online examples $x_1, x_2$ and hypothesis $h_2$]

Batch learning vs. Online learning

[Figure: points in X labeled +, with online examples $x_1, x_2, x_3$ and hypothesis $h_3$]

Goal: $\mathrm{err}(\text{alg}) \le \min_{f \in F} \mathrm{err}(f) + \dots$

Batch learning vs. Transductive online learning [Ben-David, Kushilevitz, Mansour95]

Analogous definition: Alg. H is given $\{x_1,x_2,\dots,x_n\}$ up front and, in round $i$, maps $(x_1,y_1),\dots,(x_{i-1},y_{i-1})$ to $h_i \in F$.

H learns F if, $\forall (x_1,y_1),\dots,(x_n,y_n)$:

$$E[\mathrm{err}(H)] \le \min_{f \in F} \mathrm{err}(f) + n^{-c}$$

and H runs in time poly(n)

Batch learning and transductive online learning: equivalent

“proper” learning: output $h_i \in F$

[Figure: points in X labeled +, with online examples $x_1, x_2, x_3$ and hypotheses $h_1, h_2, h_3$]

Our results

HERM = Hallucination + ERM

Theorem 1. In online trans. setting, …

HERM requires one ERM computation per sample.

Theorem 2. These are equivalent for proper learning:

• F is agnostically learnable
• ERM agnostically learns F (ERM can be done efficiently and VC(F) is finite)
• F is online transductively learnable
• HERM online transductively learns F

Online ERM algorithm

(sucks)

Choose $h_i \in F$ with minimal errors on $(x_1,y_1),\dots,(x_{i-1},y_{i-1})$:

$$h_i = \arg\min_{f \in F} |\{\, j < i \mid f(x_j) \ne y_j \,\}|$$

$F = \{-,+\}$, $X = \{(0,0)\}$

$x_1 = (0,0)$, $y_1 = -$, $h_1(x) = +$

$x_2 = (0,0)$, $y_2 = +$, $h_2(x) = -$

$x_3 = (0,0)$, $y_3 = -$, $h_3(x) = +$

$x_4 = (0,0)$, $y_4 = +$, $h_4(x) = -$
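The failure of online ERM on this example can be replayed with a short sketch. It assumes the class of two constant functions on a single point, as on the slide, and that ties are broken toward +; the function name `online_erm_mistakes` is hypothetical.

```python
def online_erm_mistakes(labels, tie_break=+1):
    """Online ERM over the two constant functions {-, +} on a single point.

    labels: sequence of -1/+1 revealed one at a time.
    tie_break: label predicted when both constants have equal past errors
               (an assumption; the slide's h1 and h3 correspond to +1).
    """
    mistakes = 0
    seen = {+1: 0, -1: 0}  # how often each label has appeared so far
    for y in labels:
        # The constant with fewer past errors is the one matching the majority label.
        if seen[+1] > seen[-1]:
            pred = +1
        elif seen[-1] > seen[+1]:
            pred = -1
        else:
            pred = tie_break
        mistakes += int(pred != y)
        seen[y] += 1
    return mistakes

labels = [-1, +1, -1, +1]  # y1..y4 from the slide
erm_mistakes = online_erm_mistakes(labels)
best_fixed = min(labels.count(+1), labels.count(-1))  # errors of the best constant
print(erm_mistakes, best_fixed)  # 4 2
```

ERM errs on every round here, while the best fixed function in F errs on only half of them.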

Online ERM algorithm

Online “stability” lemma [KVempala01]:

$$\mathrm{err}(\mathrm{ERM}) \le \min_{f \in F} \mathrm{err}(f) + \sum_{i \in \{1,\dots,n\}} [h_i \ne h_{i+1}]$$

Proof by induction on n = #examples (easy!)
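The stability lemma can be checked numerically on a small example. This sketch assumes err counts mistakes, reads $[h_i \ne h_{i+1}]$ as an indicator that the ERM choice changed, and uses a hypothetical finite class of integer thresholds with random labels; none of these specifics come from the slides.

```python
import random

def erm_index(F, data):
    """Index of a hypothesis in F with the fewest errors on data (ties: lowest index)."""
    return min(range(len(F)), key=lambda k: sum(F[k](x) != y for x, y in data))

def check_stability_lemma(F, data):
    """Return (online ERM mistakes, best fixed mistakes, number of hypothesis switches)."""
    n = len(data)
    # leaders[i] is ERM on the first i examples, i.e. h_{i+1} in the slide's indexing.
    leaders = [erm_index(F, data[:i]) for i in range(n + 1)]
    erm_errors = sum(F[leaders[i]](x) != y for i, (x, y) in enumerate(data))
    best_errors = min(sum(f(x) != y for x, y in data) for f in F)
    switches = sum(leaders[i] != leaders[i + 1] for i in range(n))
    return erm_errors, best_errors, switches

# Hypothetical toy class: thresholds on the integers 0..4.
F = [lambda x, t=t: 1 if x >= t else -1 for t in range(6)]
rng = random.Random(0)
data = [(rng.randrange(5), rng.choice([-1, 1])) for _ in range(30)]

erm_errors, best_errors, switches = check_stability_lemma(F, data)
print(erm_errors <= best_errors + switches)  # True
```

The inequality holds for any sequence and any tie-breaking rule, so the check is a sanity test of the sketch rather than evidence for the lemma.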

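Online HERM algorithm

Inputs: the set $\{x_1,x_2,\dots,x_n\}$, int R

For each x in the set, hallucinate $r_x^+$ copies of $(x,+)$ & $r_x^-$ copies of $(x,-)$, with $r_x^+, r_x^-$ random from $\{1,2,\dots,R\}$

Choose $h_i \in F$ that minimizes errors on hallucinated data + $(x_1,y_1),\dots,(x_{i-1},y_{i-1})$

Stability: $\forall i$, $\Pr_{r_{x_i}^-,\, r_{x_i}^+}[h_i \ne h_{i+1}] \le R^{-1}$

[Figure: $\underbrace{(x_i,+),(x_i,+),\dots,(x_i,+)}_{r_{x_i}^+}$, $(x_i,+)$]

James Hannan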
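The HERM steps can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: ERM is done by brute force over a finite list `F`, the copy counts are drawn independently and uniformly, and the name `herm_predictions` is hypothetical.

```python
import random

def herm_predictions(F, xs, ys, R, rng):
    """Sketch of HERM: hallucinate fake labeled examples once, then run ERM each round.

    F: finite list of hypotheses (callables returning -1/+1).
    xs: all inputs, known in advance (the transductive setting).
    ys: true labels, revealed one round at a time.
    R: upper limit on the number of hallucinated copies per label.
    """
    hallucinated = []
    for x in dict.fromkeys(xs):        # distinct inputs, in first-seen order
        r_plus = rng.randint(1, R)     # randint includes both endpoints
        r_minus = rng.randint(1, R)
        hallucinated += [(x, +1)] * r_plus + [(x, -1)] * r_minus

    preds = []
    for i, x in enumerate(xs):
        seen = hallucinated + list(zip(xs[:i], ys[:i]))
        # One (brute-force) ERM computation per example; ties go to the first h in F.
        h = min(F, key=lambda f: sum(f(u) != v for u, v in seen))
        preds.append(h(x))
    return preds

F = [lambda x: +1, lambda x: -1]       # the two constant functions
xs = [(0, 0)] * 4
ys = [-1, +1, -1, +1]

# With R = 1 the hallucinated data is balanced, so HERM behaves like plain ERM.
preds_r1 = herm_predictions(F, xs, ys, R=1, rng=random.Random(0))
print(preds_r1)  # [1, -1, 1, -1]

# With larger R the random imbalance can keep the chosen hypothesis from flipping.
preds = herm_predictions(F, xs, ys, R=10, rng=random.Random(0))
```

The hallucinated examples are drawn once, before the first round, and reused in every ERM call.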

Online HERM algorithm

Theorem 1. For $R = n^{1/4}$: …

It requires one ERM computation per example.

Online “stability” lemma, hallucination cost

$(x_1,y_1),\dots,\underbrace{(x_i,y_i),\dots,(x_{i+W},y_{i+W})}_{\text{window}},\dots,(x_n,y_n)$

Related work
• Inequivalence of batch and online learning in noiseless setting [Blum90, Balcan06]
  • ERM black box is noiseless
  • For computational reasons!
• Inefficient alg. for online trans. learning [Ben-David, Kushilevitz, Mansour95]:
  • List all $\le (n+1)^{VC(F)}$ labelings (Sauer’s lemma)
  • Run weighted majority [Littlestone, Warmuth92]
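The inefficient transductive algorithm (list the labelings, then run weighted majority) can be sketched on a toy class. The threshold class, the label sequence, the penalty `beta = 0.5`, and the function names are assumptions for illustration; the sketch uses the deterministic variant of weighted majority.

```python
import math

def realized_labelings(F, xs):
    """Distinct label vectors that F induces on the known points xs."""
    return sorted({tuple(f(x) for x in xs) for f in F})

def weighted_majority(experts, ys, beta=0.5):
    """Deterministic weighted majority; experts[k][i] is expert k's label for round i."""
    weights = [1.0] * len(experts)
    mistakes = 0
    for i, y in enumerate(ys):
        plus = sum(w for w, e in zip(weights, experts) if e[i] == 1)
        minus = sum(w for w, e in zip(weights, experts) if e[i] == -1)
        pred = 1 if plus >= minus else -1
        mistakes += int(pred != y)
        # Shrink the weight of every expert that was wrong this round.
        weights = [w * beta if e[i] != y else w for w, e in zip(weights, experts)]
    return mistakes

xs = list(range(8))
F = [lambda x, t=t: 1 if x >= t else -1 for t in range(9)]  # thresholds, VC dimension 1
experts = realized_labelings(F, xs)

ys = [-1, -1, 1, -1, 1, 1, 1, 1]  # a threshold labeling with one flipped label
mistakes = weighted_majority(experts, ys)
best = min(sum(a != y for a, y in zip(e, ys)) for e in experts)
print(len(experts), best)  # 9 1
```

The number of experts is at most $(n+1)^{VC(F)}$, which is polynomial in n but can be far too large to enumerate in practice; that is why the slide calls this approach inefficient.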

Conclusions
• Alg. for removing i.i.d. assumption, efficiently, using unlabeled data
• Interesting way to use unlabeled data online, reminiscent of bootstrap/bagging
• Adaptive version: can do well on every window
• Find “right” algorithm/analysis