Distributed Machine Learning: Communication, Efficiency, and Privacy

Avrim Blum

Carnegie Mellon University

Joint work with Maria-Florina Balcan, Shai Fine, and Yishay Mansour

[RaviKannan60]

Happy birthday Ravi!

And thank you for many enjoyable years working together on challenging problems where machine learning meets high-dimensional geometry.


This talk

Algorithms for machine learning in a distributed, cloud-computing context.

Related to Ravi’s interest in algorithms for cloud computing.

For full details see [Balcan-B-Fine-Mansour COLT’12]


Machine Learning

What is Machine Learning about?

  • Making useful, accurate generalizations or predictions from data.

  • Given access to sample of some population, classified in some way, want to learn some rule that will have high accuracy over population as a whole.

Typical ML problems:

Given sample of images, classified as male or female, learn a rule to classify new images.


Given set of protein sequences, labeled by function, learn rule to predict functions of new proteins.


Distributed Learning

Many ML problems today involve massive amounts of data distributed across multiple locations.

  • Click data

  • Customer data

  • Scientific data

  • Each location has only a piece of the overall data pie.

  • In order to learn over the combined D, holders will need to communicate.

  • Classic ML question: how much data is needed to learn a given type of function well?

  • These settings bring up a new question: how much communication?

  • Plus issues like privacy, etc.

  • That is the focus of this talk.


Distributed Learning: Scenarios

Two natural high-level scenarios:

  • Each location has data from same distribution.

    • So each could in principle learn on its own.

    • But want to use limited communication to speed up – ideally to the centralized learning rate. [Dekel, Gilad-Bachrach, Shamir, Xiao]

  • Overall distribution arbitrarily partitioned.

    • Learning without communication is impossible.

    • This will be our focus here.


The distributed PAC learning model

  • Goal is to learn unknown function f ∈ C given labeled data from some prob. distribution D.

  • However, D is arbitrarily partitioned among k entities (players) 1, 2, …, k. [k=2 is interesting]

  • Players can sample (x, f(x)) from their own Di.

D = (D1 + D2 + … + Dk)/k

Goal: learn good rule over combined D.

[Figure: players 1, 2, …, k hold distributions D1, D2, …, Dk.]
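As a minimal sketch of what sampling from the combined D means under the definition D = (D1 + D2 + … + Dk)/k: pick a player uniformly at random, then let that player draw a labeled example from its own Di. The names `make_player` and `sample_from_D`, the one-dimensional data, and the threshold target are illustrative assumptions, not part of the talk.

```python
import random

def make_player(center, target):
    """Hypothetical player: draws x near `center` and labels it with the shared target f."""
    def draw(rng):
        x = center + rng.uniform(-1.0, 1.0)
        return x, target(x)
    return draw

def sample_from_D(players, rng):
    """Draw from D = (D1 + ... + Dk)/k: choose a player uniformly, then sample its Di."""
    return rng.choice(players)(rng)

# Assumed toy target: a threshold at 0 on the real line.
target = lambda x: 1 if x >= 0 else -1

# Two players with disjoint supports, so neither sees the whole of D on its own.
players = [make_player(-5.0, target), make_player(5.0, target)]
rng = random.Random(0)
draws = [sample_from_D(players, rng) for _ in range(1000)]
```

In this toy setup one player only ever sees negative examples and the other only positives, so neither could learn the threshold without communicating.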


The distributed PAC learning model

Interesting special case to think about:

  • k=2.

  • One has the positives and one has the negatives.

  • How much communication to learn, e.g., a good linear separator?

  • In general, view k as small compared to sample size needed for learning.


The distributed PAC learning model

Some simple baselines.

  • Baseline #1: based on fact that can learn any class of VC-dim d to error ε from O((d/ε) log(1/ε)) samples.

    • Each player sends 1/k fraction of this to player 1.

    • Player 1 finds good rule h over sample. Sends h to others.

    • Total: 1 round, O((d/ε) log(1/ε)) examples sent.

  • Baseline #2: Suppose function class has an online algorithm A with mistake-bound M.

    E.g., Perceptron algorithm learns linear separators of margin γ with mistake-bound O(1/γ²).

    • Player 1 runs A, broadcasts current hypothesis.

    • If any player has a counterexample, sends to player 1. Player 1 updates, re-broadcasts.

    • At most M examples and rules communicated.
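Baseline #2 can be sketched in a few lines, assuming the online algorithm A is the Perceptron and the data are linearly separable through the origin. The function names and the toy data are hypothetical; the simulation runs in one process and simply counts updates, which upper-bounds the number of counterexamples that would be sent to player 1.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def find_counterexample(w, data):
    """Return the first local (x, y) that the broadcast hypothesis w gets wrong, else None."""
    for x, y in data:
        if y * dot(w, x) <= 0:
            return x, y
    return None

def distributed_perceptron(players, dim):
    """Player 1 keeps w; each broadcast is answered by at most one counterexample.
    Terminates only if the pooled data are linearly separable through the origin."""
    w = [0.0] * dim
    updates = 0  # each update corresponds to one example communicated (or held by player 1)
    while True:
        for data in players:
            ce = find_counterexample(w, data)
            if ce is not None:
                x, y = ce
                w = [wi + y * xi for wi, xi in zip(w, x)]  # standard Perceptron update
                updates += 1
                break  # re-broadcast the new hypothesis
        else:
            return w, updates  # no player has a counterexample

# Toy data (assumption): one player holds the positives, the other the negatives.
positives = [([1.0, 0.5], 1), ([0.5, 1.0], 1), ([2.0, 1.0], 1)]
negatives = [([-1.0, -0.5], -1), ([-0.5, -1.0], -1), ([-1.0, -2.0], -1)]
w, updates = distributed_perceptron([positives, negatives], dim=2)
```

The number of updates is bounded by the Perceptron mistake bound, which is why communication here does not depend on the target error rate.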


Dependence on 1/ε

Had linear dependence in d and 1/ε, or M and no dependence on 1/ε. [ε = final error rate]

  • Can you get O(d log(1/ε)) examples of communication?

  • Yes.

    Distributed boosting


Distributed Boosting

Idea:

  • Run baseline #1 for ε = ¼. [everyone sends a small amount of data to player 1, enough to learn to error ¼]

  • Get initial rule h1, send to others.

  • Players then reweight their Di to focus on regions where h1 did poorly.

  • Repeat.

  • Distributed implementation of the AdaBoost algorithm.

  • Some additional low-order communication needed too (players send current performance level to #1, so it can request more data from players where h is doing badly).

  • Key point: each round uses only O(d) samples and lowers error multiplicatively.

Final result:

  • O(d) examples of communication per round + low-order extra bits.

  • O(log(1/ε)) rounds of communication.

  • So, O(d log(1/ε)) examples of communication in total, plus low-order extra info.
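To get a feel for the gap between the two bounds, here is a rough back-of-the-envelope comparison. It assumes all hidden constants are 1 and that each boosting round halves the error; both are illustrative assumptions, since the slides only state the asymptotic forms.

```python
import math

def baseline1_examples(d, eps):
    """Baseline #1: about (d/eps) * log(1/eps) examples, constant factor assumed to be 1."""
    return (d / eps) * math.log(1 / eps)

def boosting_examples(d, eps):
    """Distributed boosting: about d examples per round, with an assumed halving of error per round."""
    rounds = math.ceil(math.log2(1 / eps))
    return d * rounds

d, eps = 100, 0.001
ratio = baseline1_examples(d, eps) / boosting_examples(d, eps)
```

Under these assumptions the baseline sends several hundred times more examples than boosting for d = 100 and ε = 0.001, and the gap widens as ε shrinks.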


Agnostic learning (no perfect h)

[Balcan-Hanneke] give robust halving alg that can be implemented in distributed setting.

  • Based on analysis of a generalized active learning model.

  • Algorithms especially suited to distributed setting.

  • Get error 2·OPT(C) + ε using total of only O(k log|C| log(1/ε)) examples.

  • Not computationally efficient, but says logarithmic dependence possible in principle.



Interesting class: parity functions

Examples x ∈ {0,1}^d. f(x) = x·vf mod 2, for unknown vf.

  • Interesting for k=2.

  • Classic communication LB for determining if two subspaces intersect.

  • Implies Ω(d²) bits LB to output good v.

  • What if allow rules that “look different”?

  • Parity has interesting property that:

    (a) Can be PAC-learned. [Given dataset S of size O(d/ε), just solve the linear system to get a vector vh]

    (b) Can be learned in reliable-useful model of Rivest-Sloan’88. [if x in subspace spanned by S, predict accordingly, else say “??”]

  • Algorithm:

    • Each player i PAC-learns over Di to get parity function gi. Also R-U learns to get rule hi. Sends gi to other player.

    • Player i uses rule: “if hi predicts, use it; else use gj of the other player (j = 3−i).”

    • Can one extend to k=3?
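The two learners and the combined rule can be sketched with Gaussian elimination over GF(2). In this sketch, examples are encoded as integer bitmasks, and every function name, along with the toy target and samples, is an assumption made for illustration.

```python
def parity(z):
    """Parity of the set bits of z, i.e. a dot product mod 2 once operands are ANDed."""
    return bin(z).count("1") & 1

def reduce_vec(basis, x, y=0):
    """XOR out basis vectors by leading bit; returns residual vector and accumulated label."""
    while x:
        p = x.bit_length() - 1
        if p not in basis:
            break  # leading bit is not a pivot, so x is outside the span
        bx, by = basis[p]
        x ^= bx
        y ^= by
    return x, y

def build_basis(sample):
    """Gaussian elimination over GF(2) on labeled examples (x, f(x))."""
    basis = {}
    for x, y in sample:
        x, y = reduce_vec(basis, x, y)
        if x:
            basis[x.bit_length() - 1] = (x, y)
    return basis

def ru_predict(basis, x):
    """Reliable-useful rule h: predict only if x lies in the span of the sample, else None ('??')."""
    r, y = reduce_vec(basis, x)
    return y if r == 0 else None

def consistent_parity(basis):
    """PAC rule g: solve the linear system for one vector v consistent with the sample."""
    v = 0
    for p in sorted(basis):  # lower pivots first, free bits left at 0
        bx, by = basis[p]
        v |= (by ^ parity(bx & v)) << p
    return v

def combined_rule(h_basis, g_other, x):
    """Player i's rule: use own h if it predicts, else the other player's g."""
    guess = ru_predict(h_basis, x)
    return guess if guess is not None else parity(x & g_other)

# Toy instance (assumption): d = 4, target vector 0b1011.
v_f = 0b1011
f = lambda x: parity(x & v_f)
S1 = [(x, f(x)) for x in (0b0001, 0b0010, 0b0011)]  # player 1's sample
S2 = [(x, f(x)) for x in (0b0100, 0b1000)]          # player 2's sample
h1 = build_basis(S1)
g2 = consistent_parity(build_basis(S2))
```

Only the d-bit vector g crosses the wire in each direction, and the resulting rule is not itself a single parity vector, which is how it sidesteps the lower bound for outputting a good v.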


Linear Separators

Linear separators thru origin. (can assume pts on sphere)

  • Say we have a near-uniform prob. distrib. D over the sphere S^d.

  • VC-bound, margin bound, Perceptron mistake-bound all give O(d) examples needed to learn, so O(d) examples of communication using baselines (for constant k, ε).

Can one do better?

Idea: Use margin-version of Perceptron alg [update until f(x)(w·x) ≥ 1 for all x] and run round-robin.

  • So long as examples xi of player i and xj of player j are reasonably orthogonal, updates of player j don’t mess too much with data of player i.

    • Few updates ⇒ no damage.

    • Many updates ⇒ lots of progress!

  • If overall distrib. D is near uniform [density bounded by c·unif], then total communication (for constant k, ε) is O((d log d)^(1/2)) rather than O(d).

    Get similar savings for general distributions?
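A minimal sketch of the round-robin idea, assuming the pooled data are separable through the origin so that the margin Perceptron terminates. Each player updates w locally until all of its own examples satisfy f(x)(w·x) ≥ 1, then hands w to the next player. The function names, the hand-off counter, and the toy data are illustrative assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def local_margin_updates(w, data):
    """Margin Perceptron on one player's data: update until y*(w.x) >= 1 for all local x."""
    updates = 0
    changed = True
    while changed:
        changed = False
        for x, y in data:
            if y * dot(w, x) < 1:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                updates += 1
                changed = True
    return w, updates

def round_robin(players, dim):
    """Pass w around the players until a full cycle needs no update."""
    w = [0.0] * dim
    handoffs = 0  # number of times a hypothesis is sent to the next player
    quiet = 0     # consecutive players whose data already satisfy the margin
    i = 0
    while quiet < len(players):
        w, updates = local_margin_updates(w, players[i])
        quiet = quiet + 1 if updates == 0 else 1
        if quiet < len(players):
            handoffs += 1
            i = (i + 1) % len(players)
    return w, handoffs

# Toy data on the unit circle (assumption): player 1 has positives, player 2 negatives.
p1 = [([1.0, 0.0], 1), ([0.8, 0.6], 1)]
p2 = [([-1.0, 0.0], -1), ([-0.6, -0.8], -1)]
w, handoffs = round_robin([p1, p2], dim=2)
```

Communication is measured in hand-offs of w rather than in examples, so many local updates between hand-offs cost nothing extra.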


Preserving Privacy of Data

Natural also to consider privacy in this setting.

  • Data elements could be patient records, customer records, click data.

  • Want to preserve privacy of individuals involved.

  • Compelling notion of differential privacy: if replace any one record with fake record, nobody else can tell. [Dwork, Nissim, …]

For all sequences of interactions σ (with Si′ equal to Si with one record replaced),

e^(−ε) ≤ Pr(A(Si) = σ) / Pr(A(Si′) = σ) ≤ e^ε

[probability over randomness in A; e^(−ε) ≈ 1−ε and e^ε ≈ 1+ε]

  • A number of algorithms have been developed for differentially-private learning in centralized setting.

  • Can ask how to maintain without increasing communication overhead.

[Figure: each player i holds a sample Si ~ Di.]
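As a concrete instance of the ratio condition above, here is randomized response on a single bit, a standard textbook mechanism that is not taken from the talk. The dataset is one record, the neighboring dataset flips it, and the check confirms the output-probability ratio stays within [e^(−ε), e^ε]. Function names are illustrative.

```python
import math

def rr_prob(output_bit, true_bit, eps):
    """Probability that randomized response outputs `output_bit` when the record is `true_bit`.
    The true bit is reported with probability e^eps / (1 + e^eps)."""
    p_truth = math.exp(eps) / (1.0 + math.exp(eps))
    return p_truth if output_bit == true_bit else 1.0 - p_truth

def ratio_range(eps):
    """Min and max of Pr(A(S)=o) / Pr(A(S')=o) over outputs o, for S = {0} and S' = {1}."""
    ratios = [rr_prob(o, 0, eps) / rr_prob(o, 1, eps) for o in (0, 1)]
    return min(ratios), max(ratios)

eps = 0.1
lo, hi = ratio_range(eps)
```

For this mechanism the two ratios are exactly e^ε and e^(−ε), so the bound is tight; for small ε they are close to 1+ε and 1−ε, matching the annotation on the slide.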


Preserving Privacy of Data

Another notion that is natural to consider in this setting.

  • A kind of privacy for data holder.

  • View distrib Di as non-sensitive (statistical info about population of people who are sick in city i).

  • But the sample Si ~ Di is sensitive (actual patients).

  • Want to reveal no more info about Si than is inherent in Di.

  • Compare the protocol run on the actual sample Si with the protocol run on a “ghost sample” S′i ~ Di:

Pr_{Si,S′i}[∀σ, Pr(A(Si) = σ) / Pr(A(S′i) = σ) ∈ 1 ± ε] ≥ 1 − δ.

Can get algorithms with this guarantee.


Conclusions

As we move to large distributed datasets, communication issues become important.

  • Rather than only ask “how much data is needed to learn well”, also ask “how much communication do we need?”

  • Also issues like privacy become more critical.

    Quite a number of open questions.

