Distributed Machine Learning: Communication, Efficiency, and Privacy
Avrim Blum, Carnegie Mellon University
Joint work with Maria-Florina Balcan, Shai Fine, and Yishay Mansour
[RaviKannan60] Happy birthday, Ravi!
And thank you for many enjoyable years working together on challenging problems where machine learning meets high-dimensional geometry.
This talk: algorithms for machine learning in a distributed, cloud-computing context.
Related to Ravi's interest in algorithms for cloud computing.
For full details see [Balcan-Blum-Fine-Mansour COLT'12].
What is Machine Learning about?
Typical ML problems:
- Given a sample of images, classified as male or female, learn a rule to classify new images.
- Given a set of protein sequences, labeled by function, learn a rule to predict the functions of new proteins.
Many ML problems today involve massive amounts of data distributed across multiple locations.
Distributed Learning: Scenarios
Two natural high-level scenarios:
The distributed PAC learning model
[Figure: + and − labeled examples drawn from a distribution.]
The distributed PAC learning model
k players; player i draws its data from a local distribution Di. The combined distribution is D = (D1 + D2 + … + Dk)/k.
Goal: learn a good rule over the combined D.
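Since D is the uniform mixture (D1 + … + Dk)/k, drawing one example from the combined distribution amounts to picking a player uniformly at random and then sampling from that player's local distribution. A minimal sketch (the `samplers` callables are hypothetical stand-ins for the players' local data sources):

```python
import random

def sample_combined(samplers, rng=random):
    """Draw one example from D = (D1 + ... + Dk)/k:
    choose a player uniformly at random, then draw from its local Di."""
    return rng.choice(samplers)()
```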
The distributed PAC learning model
Interesting special case to think about: k = 2.
[Figure: two players, each holding its own set of + and − labeled examples.]
The distributed PAC learning model
Some simple baselines. E.g., the Perceptron algorithm learns linear separators of margin γ with mistake bound O(1/γ²).
These baselines had linear dependence on d and 1/ε, or on M with no dependence on 1/ε. [ε = final error rate, M = mistake bound of the online algorithm]
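The Perceptron baseline can be sketched as follows; on data separable with margin γ it makes at most O(1/γ²) mistakes, regardless of the sample size (a minimal single-machine sketch, not the distributed protocol itself):

```python
import numpy as np

def perceptron(X, y, max_passes=100):
    """Classic Perceptron. On data separable with margin gamma it makes
    at most O(1/gamma^2) mistakes, regardless of the sample size."""
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(max_passes):
        clean_pass = True
        for x, yx in zip(X, y):             # yx in {-1, +1}
            if yx * np.dot(w, x) <= 0:      # mistake -> additive update
                w = w + yx * x
                mistakes += 1
                clean_pass = False
        if clean_pass:                      # consistent with all examples
            break
    return w, mistakes
```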
Distributed boosting
Idea: run a boosting algorithm such as AdaBoost across the players — the center broadcasts each weak hypothesis, and each player reweights its own examples locally, so only small weighted samples cross the network each round.
Final result: learn to error ε using only O(log(1/ε)) rounds of communication — exponentially better dependence on 1/ε than the baselines.
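One way to picture the boosting idea: the center fits a weak learner each round and broadcasts it, while the example weights stay with the players. The sketch below pools all data at the center for simplicity — in the actual protocol only a small weighted sample per player would cross the network — and uses decision stumps as a stand-in weak learner:

```python
import numpy as np

def weak_learn(X, y, w):
    """Weak learner: best decision stump (threshold on one coordinate)."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.sign(X[:, j] - t + 1e-12)
                err = w[pred != y].sum()
                if err < best_err:
                    best, best_err = (j, t, s), err
    return best, best_err

def stump_predict(h, X):
    j, t, s = h
    return s * np.sign(X[:, j] - t + 1e-12)

def distributed_adaboost(parts, rounds=5):
    """parts: list of (X_i, y_i) blocks held by the k players, y in {-1,+1}.
    Weights live with each player; only the stump and alpha are 'broadcast'."""
    n_total = sum(len(yi) for _, yi in parts)
    weights = [np.ones(len(yi)) / n_total for _, yi in parts]
    ensemble = []
    for _ in range(rounds):
        # center sees the pooled weighted data (real protocol: small samples)
        X = np.vstack([Xi for Xi, _ in parts])
        y = np.concatenate([yi for _, yi in parts])
        w = np.concatenate(weights)
        h, err = weak_learn(X, y, w / w.sum())
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        ensemble.append((h, alpha))
        # each player reweights its own examples locally
        for i, (Xi, yi) in enumerate(parts):
            weights[i] = weights[i] * np.exp(-alpha * yi * stump_predict(h, Xi))
    return lambda X: np.sign(sum(a * stump_predict(h, X) for h, a in ensemble))
```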
Agnostic learning (no perfect h)
[Balcan-Hanneke] give a robust halving algorithm that can be implemented in the distributed setting.
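For intuition, here is the standard (non-robust, realizable-case) halving scheme that a robust variant builds on: predict by majority vote over the surviving hypotheses and discard those that are inconsistent, so each mistake at least halves the class, giving at most log₂|H| mistakes:

```python
def halving_predict(hypotheses, x):
    """Predict by majority vote over the remaining consistent hypotheses."""
    votes = sum(h(x) for h in hypotheses)
    return 1 if 2 * votes >= len(hypotheses) else 0

def halving_update(hypotheses, x, label):
    """Discard every hypothesis inconsistent with the observed label."""
    return [h for h in hypotheses if h(x) == label]
```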
Can we do better for specific classes of functions?
Interesting class: parity functions
Examples x ∈ {0,1}^d. f(x) = x · v_f mod 2, for unknown v_f.
(a) Can be properly PAC-learned by solving the linear system over GF(2). [Gaussian elimination]
(b) Can be learned in the reliable-useful model of Rivest-Sloan'88: from a sample S, compute a vector v_h; on a new x, either output f(x) or say "??".
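Learning a parity function from labeled examples reduces to solving a linear system over GF(2); a minimal Gaussian-elimination sketch:

```python
import numpy as np

def learn_parity(X, y):
    """Recover v with (x . v) % 2 == y for all examples, via Gaussian
    elimination over GF(2). Returns None if the system is inconsistent."""
    A = np.array(X, dtype=np.uint8) % 2
    b = np.array(y, dtype=np.uint8) % 2
    n, d = A.shape
    M = np.hstack([A, b.reshape(-1, 1)])    # augmented matrix [A | b]
    row, pivots = 0, []
    for col in range(d):
        sel = next((r for r in range(row, n) if M[r, col]), None)
        if sel is None:
            continue                         # no pivot in this column
        M[[row, sel]] = M[[sel, row]]        # swap pivot row into place
        for r in range(n):
            if r != row and M[r, col]:
                M[r] ^= M[row]               # eliminate (XOR = mod-2 add)
        pivots.append(col)
        row += 1
    if any(M[r, d] for r in range(row, n)):  # 0 = 1 row -> inconsistent
        return None
    v = np.zeros(d, dtype=np.uint8)
    for i, col in enumerate(pivots):
        v[col] = M[i, d]
    return v
```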
Interesting class: parity functions
[Figure: each player i learns a pair of rules g_i, h_i from its own data D_i, and the players exchange them.]
Linear separators through the origin. (Can assume points lie on the unit sphere.)
Can one do better?
Idea: use the margin version of the Perceptron algorithm [update until f(x)(w · x) ≥ 1 for all x] and run it round-robin among the players.
Get similar savings for general distributions?
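A minimal sketch of the round-robin idea, assuming each player holds its own (X, y) block and only the weight vector w crosses the network:

```python
import numpy as np

def margin_perceptron_round_robin(parts, passes=200):
    """Margin Perceptron run round-robin over k players' data blocks:
    each player updates w on its own points whenever f(x)(w . x) < 1,
    then hands w to the next player; only w crosses the network."""
    d = parts[0][0].shape[1]
    w = np.zeros(d)
    for _ in range(passes):
        updated = False
        for X, y in parts:                  # round-robin over the players
            for x, yx in zip(X, y):
                if yx * np.dot(w, x) < 1:   # margin violation -> update
                    w = w + yx * x
                    updated = True
        if not updated:                     # every point has margin >= 1
            break
    return w
```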
Natural also to consider privacy in this setting.
S_1 ∼ D_1, S_2 ∼ D_2, …, S_k ∼ D_k
Differential privacy: for all sequences of interactions σ, and any neighboring samples S_i, S_i' differing in one example,
e^(-ε) ≤ Pr(A(S_i) = σ) / Pr(A(S_i') = σ) ≤ e^ε
(probability over the randomness in A; for small ε, e^(-ε) ≈ 1 − ε and e^ε ≈ 1 + ε).
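A tiny illustration of the ε-differential-privacy ratio bound, via the classic randomized-response primitive (an illustration, not the protocol from the talk): each bit is reported truthfully with probability e^ε/(1+e^ε), which makes the two conditional output distributions differ by exactly a factor of e^ε:

```python
import math
import random

def randomized_response(bit, eps, rng=random.random):
    """Report the private bit truthfully with prob e^eps/(1+e^eps),
    flipped otherwise: a basic eps-differentially-private primitive.
    Pr(out=b | bit=1) / Pr(out=b | bit=0) is exactly e^eps or e^-eps."""
    p_truth = math.exp(eps) / (1.0 + math.exp(eps))
    return bit if rng() < p_truth else 1 - bit
```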
Another notion that is natural to consider in this setting: privacy relative to a "ghost sample".
Player i runs the protocol on its actual sample S_i ∼ D_i; compare against a fresh "ghost sample" S_i' ∼ D_i drawn from the same distribution:
Pr_{S_i, S_i'}[ ∀σ, Pr(A(S_i) = σ) / Pr(A(S_i') = σ) ∈ 1 ± ε ] ≥ 1 − δ.
Can get algorithms with this guarantee.
As we move to large distributed datasets, communication issues become important.
Quite a number of open questions.