Online Learning (and Other Cool Stuff)

Your guide: Avrim Blum
Carnegie Mellon University
[Machine Learning Summer School 2012]

Itinerary
Stop 1: Minimizing regret and combining advice
Randomized Weighted Majority / Multiplicative Weights algorithm
Connections to game theory
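The core of Stop 1 can be sketched in a few lines. Below is a minimal, illustrative implementation of Randomized Weighted Majority (the function names and the penalty parameter eta are mine, not from the talk): experts that make mistakes have their weights multiplied down, and the algorithm follows a randomly chosen expert with probability proportional to weight.

```python
import random

def rwm_pick(weights):
    # Choose an expert with probability proportional to its weight.
    total = sum(weights)
    r = random.random() * total
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(weights) - 1

def rwm_update(weights, mistakes, eta=0.5):
    # Multiplicative update: an expert that erred loses a (1 - eta) factor.
    return [w * (1.0 - eta) ** m for w, m in zip(weights, mistakes)]

# Toy run with two experts: expert 0 is always right, expert 1 always wrong.
weights = [1.0, 1.0]
for _ in range(10):
    weights = rwm_update(weights, [0, 1])
# Expert 1's weight has shrunk to (1/2)^10, so predictions now follow
# expert 0 almost always.
```

The standard analysis shows the expected number of mistakes is within roughly a (1+eta) factor of the best expert's, plus an additive O(log(n)/eta) term for n experts.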
Powerful tools for learning: Kernels and Similarity Functions
w · x = (x(1) + x(2) − x(5) + x(9)) · x.
Kernel should be pos. semidefinite (PSD)
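Positive semidefiniteness means every Gram matrix [K(xi, xj)] has nonnegative eigenvalues, which is easy to sanity-check numerically. A small sketch using the quadratic kernel K(x, y) = (x·y)², on made-up random data:

```python
import numpy as np

def quad_kernel(x, y):
    # K(x, y) = (x . y)^2: the implicit feature map is all degree-2 monomials.
    return float(np.dot(x, y)) ** 2

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))  # five arbitrary points in R^3
G = np.array([[quad_kernel(a, b) for b in X] for a in X])  # Gram matrix
eigs = np.linalg.eigvalsh(G)
# For a valid (PSD) kernel, all eigenvalues are >= 0 up to numerical noise.
```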
[Figure: in the original (x1, x2) space the X's and O's are not linearly separable; after mapping to (z1, z2, z3) space they are.]
Assume ||F(x)|| ≤ 1.
Moreover, such separators generalize well if the margin is good.
But there is a little bit of a disconnect...
E_{y~D}[K(x,y) | l(y)=l(x)] ≥ E_{y~D}[K(x,y) | l(y)≠l(x)] + γ
(average similarity to points of the same label vs. average similarity to points of the opposite label, with gap γ)
"most x are on average more similar to points y of their own type than to points y of the other type"
Note: it’s possible to satisfy this and not be PSD.
How can we use it?
At least a 1−ε probability mass of x satisfy:
E_{y~D}[K(x,y) | l(y)=l(x)] ≥ E_{y~D}[K(x,y) | l(y)≠l(x)] + γ
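This condition is easy to estimate from a labeled sample: for each x, compare its average similarity to same-label points against opposite-label points. A sketch on synthetic two-blob data with a cosine similarity (the data, the similarity function, and the gap value 0.1 are illustrative choices, not from the talk):

```python
import numpy as np

def goodness_margins(K, X, labels):
    # For each x: (avg similarity to same-label y) - (avg similarity to
    # opposite-label y). The (eps, gamma) condition asks that at least a
    # 1 - eps fraction of points have this margin >= gamma.
    margins = []
    for i in range(len(X)):
        same = [K(X[i], X[j]) for j in range(len(X))
                if j != i and labels[j] == labels[i]]
        diff = [K(X[i], X[j]) for j in range(len(X))
                if labels[j] != labels[i]]
        margins.append(np.mean(same) - np.mean(diff))
    return np.array(margins)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+2, 0.3, (20, 2)),   # positive blob
               rng.normal(-2, 0.3, (20, 2))])  # negative blob
labels = np.array([+1] * 20 + [-1] * 20)
K = lambda x, y: float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
m = goodness_margins(K, X, labels)
eps_hat = np.mean(m < 0.1)  # fraction of points failing gamma = 0.1
```

On well-separated blobs like these, every point passes the gap test, so the empirical ε is 0.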
[Figure: positives split into two clumps, plus one clump of negatives]
Avg similarity to negs is ½, but to pos is only ½·1 + ½·(−½) = ¼.
E_y[K(x,y) | l(y)=l(x), R(y)] ≥ E_y[K(x,y) | l(y)≠l(x), R(y)] + γ
(conditioning on a "reasonable" set R of points y — which could be unlabeled)
If K is (ε,γ,τ)-good, then can learn to error ε' = O(ε) with O((1/(γ²ε')) log(n)) labeled examples.
Algorithm
F(x) = [K1(x,y1), …, Kr(x,y1), …, K1(x,yn), …, Kr(x,yn)].
Guarantee: Whp the induced distribution F(P) in R^{nr} has a separator of error ≤ ε + δ at L1 margin Ω(γ).
Sample complexity is roughly: O((1/(εγ²)) log(nr))
Only increases by log(r) factor!
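A sketch of the mapping itself, with r = 2 hypothetical similarity functions and n = 4 random landmarks (all names and choices here are illustrative). A sparse, L1-regularized linear separator trained over these features is what picks out the useful Ki's, which is where the mere log(r) overhead comes from.

```python
import numpy as np

def landmark_features(x, landmarks, sims):
    # F(x) = [K1(x,y1), ..., Kr(x,y1), ..., K1(x,yn), ..., Kr(x,yn)]
    return np.array([K(x, y) for y in landmarks for K in sims])

sims = [
    lambda x, y: float(x @ y),                          # linear similarity
    lambda x, y: float(np.exp(-np.sum((x - y) ** 2))),  # Gaussian similarity
]
rng = np.random.default_rng(2)
landmarks = list(rng.standard_normal((4, 3)))  # n = 4 landmark points in R^3
x = rng.standard_normal(3)
F = landmark_features(x, landmarks, sims)      # lives in R^(n*r) = R^8
```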
Learning with Multiple Similarity Functions
Itinerary
Distributed PAC Learning
Maria-Florina Balcan (Georgia Tech)
Avrim Blum (CMU)
Shai Fine (IBM)
Yishay Mansour (Tel Aviv)
[In COLT 2012]
Distributed Learning
Many ML problems today involve massive amounts of data distributed across multiple locations.
The distributed PAC learning model
The distributed PAC learning model
k players; player i draws data from its own distribution Di, and the overall distribution is D = (D1 + D2 + … + Dk)/k.
Goal: learn good h over D, using as little communication as possible.
The distributed PAC learning model
Interesting special case to think about: k = 2.
[Figure: two players, with the data split between them]
The distributed PAC learning model
Assume learning a class C of VC-dimension d.
Some simple baselines. [viewing k << d]
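The most naive baseline is pure centralization: every player forwards its sample to one site, which runs any consistent/ERM learner. The sketch below uses 1-D threshold functions as a stand-in hypothesis class (the class, sample sizes, and threshold 0.5 are illustrative, not from the talk); its point is only that this costs O(k·m) examples of communication.

```python
import numpy as np

def central_learn(parts):
    # Baseline: every player ships its whole sample to the center, which
    # finds a minimum-error hypothesis; here C = 1-D thresholds with
    # h_t(x) = +1 iff x >= t (an illustrative stand-in class).
    X = np.concatenate([x for x, _ in parts])
    y = np.concatenate([l for _, l in parts])
    best_t, best_err = None, None
    for t in X:
        err = np.mean(np.where(X >= t, 1, -1) != y)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

# k = 3 players, each holding 30 points labeled by the same threshold 0.5.
rng = np.random.default_rng(0)
parts = []
for _ in range(3):
    x = rng.uniform(0, 1, 30)
    parts.append((x, np.where(x >= 0.5, 1, -1)))
t = central_learn(parts)
# Total communication: O(k * m) examples, with no attempt to save bandwidth.
```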
The distributed PAC learning model
Dependence on 1/ε
Had linear dependence in d and 1/ε, or in M with no dependence on 1/ε.
Recap of Adaboost
Key points: misclassified examples have their weights scaled up each round; the final hypothesis is a weighted vote of the weak hypotheses ht.
Distributed Adaboost
[Figure: players i and j hold samples Si, Sj with local weights wi,t, wj,t; each round the weak hypothesis ht is shared with all players.]
(ht may do better on some than others)
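The key communication saving is that example weights never leave their owners: the center only needs enough information to train ht, then broadcasts it, and players reweight locally. A much-simplified sketch (this toy version lets the weak learner see all the data pooled and uses coordinate stumps for readability; the real protocol sends only a small weighted sample per round):

```python
import numpy as np

def stump_fit(X, y, w):
    # Weak learner: best single-coordinate stump s * sign(x_j - 0.5),
    # a toy choice suited to 0/1 features.
    best = None
    for j in range(X.shape[1]):
        for s in (+1, -1):
            pred = s * np.where(X[:, j] >= 0.5, 1, -1)
            err = np.sum(w[pred != y]) / np.sum(w)
            if best is None or err < best[0]:
                best = (err, j, s)
    return best

def distributed_boost(parts, rounds=3):
    # parts: list of (X_i, y_i) held by the k players; weights stay local.
    ws = [np.ones(len(y_i)) for _, y_i in parts]
    H = []
    for _ in range(rounds):
        X = np.vstack([X_i for X_i, _ in parts])       # toy simplification:
        y = np.concatenate([y_i for _, y_i in parts])  # center sees all data
        w = np.concatenate(ws)
        err, j, s = stump_fit(X, y, w)
        err = min(max(err, 1e-9), 1 - 1e-9)
        alpha = 0.5 * np.log((1 - err) / err)
        H.append((alpha, j, s))
        # Broadcast (alpha, j, s); each player reweights its own points.
        for i, (X_i, y_i) in enumerate(parts):
            pred = s * np.where(X_i[:, j] >= 0.5, 1, -1)
            ws[i] = ws[i] * np.exp(-alpha * y_i * pred)
    return H

def boost_predict(H, X):
    score = sum(a * s * np.where(X[:, j] >= 0.5, 1, -1) for a, j, s in H)
    return np.where(score >= 0, 1, -1)

# Toy target: the label is just coordinate 0; two players split the sample.
rng = np.random.default_rng(4)
X = rng.integers(0, 2, (20, 4))
y = np.where(X[:, 0] >= 0.5, 1, -1)
parts = [(X[:10], y[:10]), (X[10:], y[10:])]
H = distributed_boost(parts)
```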
Distributed Adaboost
Final result:
Agnostic learning
Recent result of [Balcan-Hanneke] gives a robust halving algorithm that can be implemented in the distributed setting.
Can we do better for specific classes of interest?
E.g., conjunctions over {0,1}^d: f(x) = x2 ∧ x5 ∧ x9 ∧ x15.
Positive examples:
1101111011010111
1111110111001110
1100110011001111
1100110011000110
Only O(k) examples sent. O(kd) bits.
General principle: can learn any intersection-closed class (well-defined "tightest wrapper" around the positives) this way.
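For conjunctions, the "tightest wrapper" around a set of positives is just their bitwise AND: a variable survives iff it is 1 in every positive example. So each player can send a single d-bit summary of its positives, and the center ANDs the k summaries. A sketch using the example vectors above, split across two hypothetical players:

```python
def tightest_conjunction(positives):
    # Bitwise AND of the positives: exactly the variables that are 1 in
    # every positive example can still be in the conjunction.
    out = positives[0]
    for p in positives[1:]:
        out = [a & b for a, b in zip(out, p)]
    return out

def bits(s):
    return [int(c) for c in s]

# The slide's positive examples, split across two players.
player1 = [bits("1101111011010111"), bits("1111110111001110")]
player2 = [bits("1100110011001111"), bits("1100110011000110")]

# Each player sends one d-bit summary; the center ANDs the k summaries.
h = tightest_conjunction([tightest_conjunction(player1),
                          tightest_conjunction(player2)])
# h keeps only variables set in all four positives; in particular the
# target's variables x2, x5, x9, x15 all survive.
```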
Interesting class: parity functions
Examples x ∈ {0,1}^d. f(x) = x · vf mod 2, for unknown vf.
[Figure: from a sample S, the learner outputs h ∈ C; on a new x it should predict f(x).]
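A consistent parity can be found by Gaussian elimination over GF(2): each labeled example (x, f(x)) is one linear equation x · v = f(x) (mod 2) in the unknown vector v. A self-contained sketch (helper names are mine, not from the talk):

```python
import numpy as np

def solve_gf2(A, b):
    # Gaussian elimination mod 2: returns some v with A v = b (mod 2),
    # assuming the system is consistent; free variables are set to 0.
    A = A.copy() % 2
    b = b.copy() % 2
    n, d = A.shape
    pivots, r = [], 0
    for c in range(d):
        hits = [i for i in range(r, n) if A[i, c]]
        if not hits:
            continue
        A[[r, hits[0]]] = A[[hits[0], r]]   # swap a pivot row into place
        b[r], b[hits[0]] = b[hits[0]], b[r]
        for i in range(n):
            if i != r and A[i, c]:
                A[i] ^= A[r]                # clear column c elsewhere
                b[i] ^= b[r]
        pivots.append(c)
        r += 1
    v = np.zeros(d, dtype=int)
    for i, c in enumerate(pivots):
        v[c] = b[i]
    return v

# Recover a parity consistent with examples generated by a hidden vf.
rng = np.random.default_rng(5)
vf = np.array([1, 0, 1, 0, 0, 1])
X = rng.integers(0, 2, (12, 6))
b = (X @ vf) % 2
v = solve_gf2(X, b)
# v agrees with every example: (X @ v) % 2 equals b.
```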
Linear Separators
Linear separators over near-uniform D over B^d.
Can one do better?
Linear Separators
Thm: Over any non-concentrated D [density bounded by c · uniform], can achieve #vectors communicated of O((d log d)^{1/2}) rather than O(d) (for constant k, ε).
Algorithm:
Proof idea:
Conclusions and Open Questions
As we move to large distributed datasets, communication becomes increasingly crucial.
Open questions: