New Theoretical Frameworks for Machine Learning. Maria-Florina Balcan. Thesis Proposal. 05/15/2007.
Thanks to My Committee: Avrim Blum, Manuel Blum, Tom Mitchell, Yishay Mansour, Santosh Vempala.
The Goal of the Thesis: New Theoretical Frameworks for Modern Machine Learning Paradigms.
New Theoretical Frameworks for Modern Machine Learning Paradigms
Connections between Machine Learning Theory and Algorithmic Game Theory
New Frameworks for Modern Learning Paradigms
Modern Learning Paradigms
Incorporating Unlabeled Data in the Learning Process
Kernel based Learning
Qualitative gap between theory and practice
Semi-supervised Learning
Unified theoretical treatment is lacking
Active Learning
Our Contributions
Our Contributions
Semi-supervised learning
A theory of learning with general similarity functions
 a unified PAC framework
Active Learning
Extensions to clustering
 new positive theoretical results
With Avrim and Santosh
New Frameworks for Modern Learning Paradigms
Modern Learning Paradigms
Incorporating Unlabeled Data in the Learning Process
Kernel, Similarity based Learning and Clustering
Qualitative gap between theory and practice
Unified theoretical treatment is lacking
Our Contributions
Our Contributions
Semi-supervised learning
A theory of learning with general similarity functions
 a unified PAC framework
Active Learning
Extensions to clustering
 new positive theoretical results
With Avrim and Santosh
Machine Learning Theory and Algorithmic Game Theory
Brief Overview of Our Results
Mechanism Design, ML, and Pricing Problems
Generic framework for reducing problems of incentive-compatible mechanism design to standard algorithmic questions.
[Balcan-Blum-Hartline-Mansour, FOCS 2005, JCSS 2007]
Approximation Algorithms for Item Pricing.
[Balcan-Blum, EC 2006]
New Theoretical Frameworks for Modern Machine Learning Paradigms
Incorporating Unlabeled Data in the Learning Process
Semi-supervised learning (SSL)
- An Augmented PAC model for SSL [Balcan-Blum, COLT 2005; book chapter, “Semi-Supervised Learning”, 2006]
Active Learning (AL)
- Generic agnostic AL procedure [Balcan-Beygelzimer-Langford, ICML 2006]
- Margin based AL of linear separators [Balcan-Broder-Zhang, COLT 2007]
Kernel, Similarity based learning and Clustering
- Connections between kernels, margins and feature selection [Balcan-Blum-Vempala, MLJ 2006]
- A general theory of learning with similarity functions [Balcan-Blum, ICML 2006]
- Extensions to Clustering [Balcan-Blum-Vempala, work in progress]
Part I, Incorporating Unlabeled Data in the Learning Process
Semi-Supervised Learning
A unified PAC-style framework
[Balcan-Blum, COLT 2005; book chapter, “Semi-Supervised Learning”, 2006]
Standard Supervised Learning Setting
Sample Complexity
Hot topic in recent years in Machine Learning.
Scattered Theoretical Results…
Extends PAC naturally to fit SSL.
Can generically analyze:
Key Insight
Unlabeled data is useful if we have beliefs not only about the form of the target, but also about its relationship with the underlying distribution.
Different algorithms are based on different assumptions about how data should behave.
Challenge – how to capture many of the assumptions typically used.
Example of “typical” assumption: Margins
The separator goes through low-density regions of the space / large margin.
[Figure: + and − points; panels “SVM” (labeled data only) and “Transductive SVM”.]
Another Example: Self-consistency
Agreement between two parts: co-training [BM98].
- examples contain two sufficient sets of features, $x = \langle x_1, x_2 \rangle$
- the belief is that the two parts of the example are consistent, i.e. $\exists\, c_1, c_2$ such that $c_1(x_1) = c_2(x_2) = c^*(x)$
For example, if we want to classify web pages: $x = \langle x_1, x_2 \rangle$, where x is link info & text info, $x_1$ is text info and $x_2$ is link info.
[Figure: web page example with the page text “Prof. Avrim Blum” and the link text “My Advisor”.]
$S_u = \{x_i\}$: unlabeled examples drawn i.i.d. from D.
$S_l = \{(x_i, y_i)\}$: labeled examples drawn i.i.d. from D and labeled by some target concept $c^*$.
PAC model talks of learning a class C under (known or unknown) distribution D.
We extend the PAC model to capture these (and more) uses of unlabeled data.
Proposed Model, Main Idea (1)
Augment the notion of a concept class C with a notion of compatibility $\chi$ between a concept and the data distribution.
“learn C” becomes “learn $(C, \chi)$” (i.e. learn class C under compatibility notion $\chi$)
Express relationships that one hopes the target function and underlying distribution will possess.
Idea: use unlabeled data & the belief that the target is compatible to reduce C down to just {the highly compatible functions in C}.
Idea: use unlabeled data & our belief to reduce size(C) down to size(highly compatible functions in C) in our sample complexity bounds.
Need to be able to analyze how much unlabeled data is needed to uniformly estimate compatibilities well.
Require that the degree of compatibility be something that can be estimated from a finite sample.
Margins, Compatibility
Margins: the belief is that there should exist a large margin separator.
[Figure: + and − points with a highly compatible separator.]
Incompatibility of h and D (unlabeled error rate of h): the probability mass within distance $\gamma$ of h.
Can be written as an expectation over individual examples, $\chi(h,D) = E_{x \in D}[\chi(h,x)]$, where:
$\chi(h,x) = 0$ if $dist(x,h) \le \gamma$
$\chi(h,x) = 1$ if $dist(x,h) \ge \gamma$
If we do not want to commit to $\gamma$ in advance, define $\chi(h,x)$ to be a smooth function of $dist(x,h)$.
Illegal notion of compatibility: the largest $\gamma$ s.t. D has probability mass exactly zero within distance $\gamma$ of h.
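The margin-based compatibility can be estimated from a finite unlabeled sample because it is an average of per-example values. The following is a minimal sketch for a linear separator, not the talk's own code. The function names, the toy data, and in particular the exponential form of the smooth variant are assumptions made for illustration; the slide's own smooth example did not survive in the text.

```python
import math

def distance_to_separator(x, w, b=0.0):
    """Euclidean distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def hard_margin_chi(x, w, gamma, b=0.0):
    """Per-example compatibility: 0 inside the margin, 1 at distance >= gamma."""
    return 0.0 if distance_to_separator(x, w, b) < gamma else 1.0

def smooth_margin_chi(x, w, sigma=1.0, b=0.0):
    """Hypothetical smooth variant: 0 on the separator, approaching 1 far away."""
    return 1.0 - math.exp(-distance_to_separator(x, w, b) / sigma)

def estimated_unlabeled_error(chi_values):
    """Empirical incompatibility: 1 minus the average per-example compatibility."""
    return 1.0 - sum(chi_values) / len(chi_values)

# Toy unlabeled sample and a candidate separator x1 = 0 (illustrative values).
unlabeled = [(2.0, 0.0), (-2.0, 1.0), (0.1, 3.0), (-3.0, -1.0)]
w_candidate = (1.0, 0.0)
chi_values = [hard_margin_chi(x, w_candidate, gamma=0.5) for x in unlabeled]
err_unl = estimated_unlabeled_error(chi_values)
```

Only the third point lies within distance 0.5 of the separator, so the estimate is the fraction of the sample that falls inside the margin.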
Co-training: examples come as pairs $\langle x_1, x_2 \rangle$ and the goal is to learn a pair of functions $\langle h_1, h_2 \rangle$.
Hope is that the two parts of the example are consistent.
Legal (and natural) notion of compatibility:
- the compatibility of $\langle h_1, h_2 \rangle$ and D:
- can be written as an expectation over examples:
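One natural reading of this compatibility notion is the probability that the two functions agree on the two views of an example, which can be estimated as an average over unlabeled pairs. The sketch below assumes that reading; the function name, the keyword-based toy hypotheses, and the example pages are all hypothetical.

```python
def agreement_compatibility(h1, h2, pairs):
    """Fraction of unlabeled pairs (x1, x2) on which h1(x1) equals h2(x2)."""
    agreements = [1.0 if h1(x1) == h2(x2) else 0.0 for x1, x2 in pairs]
    return sum(agreements) / len(agreements)

# Toy hypotheses over the two views of a web page (illustrative only).
def h_text(text):
    return "prof" in text      # view x1: words on the page

def h_link(link):
    return "advisor" in link   # view x2: words on links pointing to the page

unlabeled_pairs = [
    ("prof avrim blum", "my advisor"),   # both say yes
    ("course notes", "lecture 3"),       # both say no
    ("prof page", "homepage"),           # the two views disagree
]
compat = agreement_compatibility(h_text, h_link, unlabeled_pairs)
```

No labels are used anywhere: the estimate depends only on how often the pair of hypotheses is self-consistent on unlabeled data.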
As in PAC, can discuss algorithmic and sample complexity issues.
Sample Complexity issues that we can address:
- Ability of unlabeled data to reduce # of labeled examples needed:
Finite Hypothesis Spaces, Doubly Realizable Case
ALG: pick a compatible concept that agrees with the labeled sample.
$C_{D,\chi}(\epsilon) = \{h \in C : err_{unl}(h) \le \epsilon\}$
Bound the # of labeled examples as a measure of the helpfulness of D with respect to $\chi$.
Examples of results: Sample Complexity, Uniform Convergence Bounds
Finite Hypothesis Spaces, Doubly Realizable Case
ALG: pick a compatible concept that agrees with the labeled sample.
$C_{D,\chi}(\epsilon) = \{h \in C : err_{unl}(h) \le \epsilon\}$
[Figure: + and − points with the highly compatible separators marked.]
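The effect of shrinking C to its highly compatible subset can be illustrated numerically. The sketch below assumes the standard realizable-case bound for a finite class, (1/ε)(ln N + ln(2/δ)). The exact theorem from the slide is not in the extracted text, so the constants, the helper name, and the class sizes are assumptions.

```python
import math

def finite_class_bound(num_hypotheses, epsilon, delta):
    """Assumed bound form: (1/epsilon) * (ln N + ln(2/delta)), rounded up."""
    return math.ceil((math.log(num_hypotheses) + math.log(2.0 / delta)) / epsilon)

epsilon, delta = 0.1, 0.05
size_C = 2 ** 20          # hypothetical size of the full class C
size_compatible = 2 ** 4  # hypothetical size of the highly compatible subset

# Without unlabeled data, the labeled sample must handle all of C.
labeled_without_unlabeled = finite_class_bound(size_C, epsilon, delta)

# With enough unlabeled data to estimate compatibility over all of C,
# the labeled sample only needs to handle the compatible subset.
unlabeled_needed = finite_class_bound(size_C, epsilon, delta)
labeled_with_unlabeled = finite_class_bound(size_compatible, epsilon, delta)
```

Under these assumptions the dependence on ln|C| moves from the labeled sample to the unlabeled sample, which is the sense in which unlabeled data reduces the number of labeled examples needed.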
Sample Complexity Subtleties
Uniform Convergence Bounds
Depends both on the complexity of C and on the complexity of $\chi$
Distr. dependent measure of complexity
Cover bounds much better than Uniform Convergence bounds.
Ways in which unlabeled data can help
Subsequent Work, E.g.:
P. Bartlett, D. Rosenberg, AISTATS 2007
J. Shawe-Taylor et al., Neurocomputing 2007
Idea: use unlabeled data to generate poly # of candidate hyps s.t. at least one is weakly-useful (uses Outlier Removal Lemma). Plug into [BM98].
Modern Learning Paradigms: Our Contributions
Modern Learning Paradigms
Incorporating Unlabeled Data in the Learning Process
Semi-supervised learning (SSL)
- An Augmented PAC model for SSL [Balcan-Blum, COLT 2005] [Balcan-Blum, book chapter, “Semi-Supervised Learning”, 2006]
Active Learning (AL)
- Generic agnostic AL procedure [Balcan-Beygelzimer-Langford, ICML 2006]
- Margin based AL of linear separators [Balcan-Broder-Zhang, COLT 2007]
Kernel, Similarity based learning and Clustering
- Connections between kernels, margins and feature selection [Balcan-Blum-Vempala, MLJ 2006]
- A general theory of learning with similarity functions [Balcan-Blum, ICML 2006]
- Extensions to Clustering [Balcan-Blum-Vempala, work in progress]
for Learning
[Balcan-Blum, ICML 2006]
Extensions to Clustering
(With Avrim and Santosh, work in progress)
Kernels have become a powerful tool in ML.
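Kernels: if there is a margin $\gamma$ in $\phi$-space, only need about $1/\gamma^2$ examples to learn well.
[Figure: examples $\phi(x)$ and a separator w in $\phi$-space.]
Our Work: analyze more general similarity functions.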
We provide: a characterization of good similarity functions for a learning problem that:
1) Talks in terms of natural direct properties.
2) If K satisfies these properties for our given problem, then this has implications for learning.
3) Is broad: includes the usual notion of “good kernel” (induces a large margin separator in $\phi$-space).
[Figure: regions A, B and C of + and − points.]
A First Attempt: Definition satisfying properties (1) and (2)
Let P be a distribution over labeled examples (x, l(x)).
$E_{y \sim P}[K(x,y) \mid l(y)=l(x)] \ge E_{y \sim P}[K(x,y) \mid l(y) \ne l(x)] + \gamma$
Note: this might not be a legal kernel.
How to use it?
Algorithm
Guarantee: with probability $\ge 1-\delta$, error $\le \epsilon + \delta$.
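The algorithm text itself is missing from the slide, but the definition suggests a simple rule: compare the average similarity of x to a sample of positive examples with its average similarity to a sample of negative examples. The sketch below illustrates that rule under this assumption. The function name, the dot-product similarity, and the toy points are hypothetical.

```python
def average_similarity_classifier(K, positives, negatives):
    """Build a predictor that labels x by whichever sample it is more similar to on average."""
    def predict(x):
        pos_score = sum(K(x, y) for y in positives) / len(positives)
        neg_score = sum(K(x, z) for z in negatives) / len(negatives)
        return 1 if pos_score >= neg_score else -1
    return predict

def dot_similarity(x, y):
    """Toy similarity in [-1, 1] for unit vectors; need not be a legal kernel in general."""
    return sum(xi * yi for xi, yi in zip(x, y))

# Toy samples of labeled unit vectors (illustrative only).
positive_sample = [(1.0, 0.0), (0.6, 0.8)]
negative_sample = [(-1.0, 0.0), (-0.6, -0.8)]
predict = average_similarity_classifier(dot_similarity, positive_sample, negative_sample)

label_a = predict((0.8, 0.6))    # closer on average to the positive sample
label_b = predict((-0.6, 0.8))   # closer on average to the negative sample
```

The rule uses K only through pairwise evaluations, so it applies to any bounded similarity function, whether or not it is positive semidefinite.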
Proof
A First Attempt: Not Broad Enough
$E_{y \sim P}[K(x,y) \mid l(y)=l(x)] \ge E_{y \sim P}[K(x,y) \mid l(y) \ne l(x)] + \gamma$
[Figure: + and − points; some positives are more similar to negs than to typical pos. A second panel marks a region R.]
Idea: would work if we didn’t pick y’s from top-left.
Broaden to say: OK if $\exists$ non-negligible region R s.t. most x are on average more similar to $y \in R$ of same label than to $y \in R$ of other label.
$E_{y \sim P}[w(y)K(x,y) \mid l(y)=l(x)] \ge E_{y \sim P}[w(y)K(x,y) \mid l(y) \ne l(x)] + \gamma$
Algorithm
$F(x) = [K(x,y_1), \ldots, K(x,y_d), K(x,z_1), \ldots, K(x,z_d)]$.
Guarantee: with probability $\ge 1-\delta$, there exists a linear separator of error $\le \epsilon + \delta$ at margin $\gamma/4$.
($w = [w(y_1), \ldots, w(y_d), -w(z_1), \ldots, -w(z_d)]$)
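The mapping F turns a similarity function into explicit features, after which any linear learner can be applied. The sketch below assumes the y's are sampled positive examples and the z's are sampled negative examples, and that the separator negates the weights on the z coordinates. Both are assumptions, since the slide's algorithm text is missing. All names and data are illustrative.

```python
def similarity_features(K, landmarks_y, landmarks_z):
    """Return F with F(x) = [K(x,y_1..y_d), K(x,z_1..z_d)] as a list of floats."""
    def F(x):
        return [K(x, y) for y in landmarks_y] + [K(x, z) for z in landmarks_z]
    return F

def separator_weights(w, landmarks_y, landmarks_z):
    """Assumed separator: +w(y_i) on the y coordinates, -w(z_j) on the z coordinates."""
    return [w(y) for y in landmarks_y] + [-w(z) for z in landmarks_z]

def linear_score(weights, features):
    """Inner product of the weight vector with the feature vector."""
    return sum(wi * fi for wi, fi in zip(weights, features))

def dot_similarity(x, y):
    """Toy similarity in [-1, 1] for unit vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))

landmarks_y = [(1.0, 0.0), (0.6, 0.8)]      # assumed positive landmarks
landmarks_z = [(-1.0, 0.0), (-0.6, -0.8)]   # assumed negative landmarks
F = similarity_features(dot_similarity, landmarks_y, landmarks_z)
weights = separator_weights(lambda y: 1.0, landmarks_y, landmarks_z)

score_pos = linear_score(weights, F((0.8, 0.6)))
score_neg = linear_score(weights, F((-0.8, -0.6)))
```

In practice the weighting function w is unknown, so the weight vector would be learned by a standard linear-separator algorithm run on the mapped examples F(x). The fixed weights here only show that a separator of this form exists on the toy data.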
Implications
K arbitrary sim. function that is an $(\epsilon,\gamma)$-good sim. function gives an $(\epsilon+\delta, \gamma/4)$-good kernel function (a legal kernel).
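Main Definition: $K:(x,y) \to [-1,1]$ is an $(\epsilon,\gamma)$-good similarity for P if there exists a weighting function $w(y) \in [0,1]$ such that at least a $1-\epsilon$ probability mass of x satisfy:
$E_{y \sim P}[w(y)K(x,y) \mid l(y)=l(x)] \ge E_{y \sim P}[w(y)K(x,y) \mid l(y) \ne l(x)] + \gamma$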
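For a fixed candidate weighting function, the main definition can be checked empirically on a labeled sample by replacing both expectations with sample averages. The sketch below does this. It is an illustration only: the function name and toy data are hypothetical, each x is included in its own same-label average, and the sample is assumed to contain both labels.

```python
def fraction_satisfying(K, w, labeled_sample, gamma):
    """Fraction of sample points x whose weighted same-label average similarity
    exceeds the weighted other-label average similarity by at least gamma."""
    satisfied = 0
    for x, label_x in labeled_sample:
        same = [w(y) * K(x, y) for y, label_y in labeled_sample if label_y == label_x]
        other = [w(y) * K(x, y) for y, label_y in labeled_sample if label_y != label_x]
        if sum(same) / len(same) >= sum(other) / len(other) + gamma:
            satisfied += 1
    return satisfied / len(labeled_sample)

def dot_similarity(x, y):
    """Toy similarity in [-1, 1] for unit vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))

labeled_sample = [
    ((1.0, 0.0), 1), ((0.6, 0.8), 1),
    ((-1.0, 0.0), -1), ((-0.6, -0.8), -1),
]
fraction_good = fraction_satisfying(dot_similarity, lambda y: 1.0, labeled_sample, gamma=0.5)
empirical_epsilon = 1.0 - fraction_good
```

The quantity `empirical_epsilon` plays the role of ε for this particular w and γ; the definition only requires that some weighting function exists, so a poor choice of w can understate how good K is.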
Theorem
Our proofs incurred some penalty:
$\epsilon' = \epsilon + \epsilon_{extra}$, $\gamma' = \gamma^3 \epsilon_{extra}$.
Nati Srebro (COLT 2007) has improved the bounds.
Sample complexity is roughly
Learning with Multiple Similarity Functions
Algorithm
$F(x) = [K_1(x,y_1), \ldots, K_r(x,y_d), K_1(x,z_1), \ldots, K_r(x,z_d)]$.
Guarantee: The induced distribution F(P) in $R^{2dr}$ has a separator of error $\le \epsilon + \delta$ at margin at least
Mugizi has proposed on this
Clustering via Similarity Functions
(Work in Progress, with Avrim and Santosh)
Consider the following setting:
[documents,
web pages]
[topic]
People have traditionally considered mixture models here.
Can we say something in our setting?
For all clusters C, C’, for all A in C, A’ in C’:
A and A’ are not both more attracted to each other than to their own clusters.
K(x,y) is attraction between x and y
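For a small instance, the property can be checked by brute force over all pairs of subsets. The sketch below assumes that the attraction between two sets is the average pairwise K, that the attraction of a subset A to its own cluster is its average attraction to the rest of that cluster, and that K is a symmetric matrix. These readings, the function names, and the toy matrices are assumptions. The search is exponential in the cluster sizes.

```python
from itertools import combinations

def avg_attraction(K, A, B):
    """Average pairwise attraction between index sets A and B."""
    return sum(K[a][b] for a in A for b in B) / (len(A) * len(B))

def proper_subsets(cluster):
    """All nonempty proper subsets of a cluster, as lists of indices."""
    for size in range(1, len(cluster)):
        for subset in combinations(cluster, size):
            yield list(subset)

def violates_stability(K, clusters):
    """True if some A in C and A2 in C2 are both more attracted to each other
    than to the remainder of their own clusters."""
    for i, C in enumerate(clusters):
        for j, C2 in enumerate(clusters):
            if i >= j:
                continue
            for A in proper_subsets(C):
                rest_A = [p for p in C if p not in A]
                for A2 in proper_subsets(C2):
                    rest_A2 = [p for p in C2 if p not in A2]
                    cross = avg_attraction(K, A, A2)
                    if (cross > avg_attraction(K, A, rest_A)
                            and cross > avg_attraction(K, A2, rest_A2)):
                        return True
    return False

clusters = [[0, 1], [2, 3]]
K_good = [[1.0, 0.9, 0.1, 0.1],
          [0.9, 1.0, 0.1, 0.1],
          [0.1, 0.1, 1.0, 0.9],
          [0.1, 0.1, 0.9, 1.0]]
K_bad = [row[:] for row in K_good]
K_bad[0][2] = K_bad[2][0] = 0.95   # points 0 and 2 attract each other across clusters

good_ok = not violates_stability(K_good, clusters)
bad_violates = violates_stability(K_bad, clusters)
```

The brute-force check is only practical for tiny inputs.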
[Figure: example clusters by topic: sports (volleyball, soccer, gymnastics) and fashion (Dolce & Gabbana, Coco Chanel).]
Modern Learning Paradigms: Future Work
Modern Learning Paradigms
Incorporating Unlabeled Data in the Learning Process
Kernel, Similarity based learning and Clustering
Active Learning
- Margin based AL of linear separators: extend the analysis to a more general class of distributions, e.g. log-concave.
Learning with Sim. Functions
- Alternative/tighter definitions and connections.
Clustering via Sim. Functions
- Can we get an efficient alg. for the stability of large subsets property?
- Interactive Feedback
MLA and Algorithmic Game Theory, Future Work
Mechanism Design, ML, and Pricing Problems
Revenue maximization in comb. auctions with general preferences.
Extend BBHM’05 to the limited supply setting.
Approximation algorithms for the case of pricing below cost.
Summer 07
 Revenue Maximization in General Comb. Auctions, limited and unlimited supply.
Fall 07
Spring 08
Wrap-up; writing; job search!
Thank you!