Sampling Lower Bounds via Information Theory

  1. Sampling Lower Bounds via Information Theory. Ziv Bar-Yossef, IBM Almaden.

  2. Standard Approach to Hardness of Approximation
  Hardness of approximation for f: X^n → Y is derived from the hardness of a decision "promise problem".
  "Promise problem": disjoint subsets A, B ⊆ X^n such that:
  • ∀ a ∈ A, b ∈ B, f(a) is "far" from f(b).
  • Given x ∈ A ∪ B, decide whether x ∈ A.

  3. The "Election Problem"
  • Input: a sequence x of n votes to k parties.
  • Want an estimate μ̂ of the vote distribution μ_x s.t. ||μ̂ − μ_x|| < ε.
  • How big a poll should we conduct?
  [Figure: vote distribution μ_x for n = 18, k = 6: 7/18, 1/18, 4/18, 3/18, 2/18, 1/18.]
  • ∀ S ⊆ [k], it is easy to decide between A = { x | μ_x(S) ≥ ½ + ε } and B = { x | μ_x(S) ≤ ½ − ε }.
  • Hardness comes from the abundance of such decision problems → the poll has to be of size Ω(k).
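
  For contrast with the lower bound, here is a minimal sketch (Python, hypothetical names) of the obvious upper-bound strategy: poll voters uniformly at random and report the empirical distribution; the bounds cited on slide 13 say how large such a poll must be.

```python
import random
from collections import Counter

def poll_estimate(votes, poll_size, rng=random.Random(0)):
    """Estimate the vote distribution over the parties from a uniform poll.

    votes: list of party labels; poll_size: number of sampled voters.
    Returns a dict mapping party -> estimated fraction of the vote.
    """
    sample = [rng.choice(votes) for _ in range(poll_size)]  # i.i.d. uniformly chosen voters
    counts = Counter(sample)
    return {party: counts[party] / poll_size for party in counts}

# Toy run: the slide's example with n = 18 votes and k = 6 parties.
votes = [0]*7 + [1]*1 + [2]*4 + [3]*3 + [4]*2 + [5]*1
print(poll_estimate(votes, poll_size=500))
```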

  4. Similarity Hardness vs. Abundance Hardness
  • Similarity hardness: hardness of a single decision "promise problem".
  • Abundance hardness: abundance of decision "promise problems".
  Each translates into hardness of approximation for f: X^n → Y.
  In this talk: a lower bound technique that captures both types of hardness in the context of sampling algorithms.

  5. Why Sampling?
  The algorithm makes a small number of queries to the input data set.
  • Queries can be chosen randomly.
  • Output is typically approximate.
  • Sub-linear time & space.

  6. Some Examples
  Statistics:
  • Statistical decision and estimation
  • Statistical learning
  • …
  CS:
  • PAC and machine learning
  • Property testing
  • Sub-linear time approximation algorithms
  • Extractors and dispersers
  • …

  7. Query Complexity
  Query complexity of a function f: the number of queries required to approximate f.
  Examples:
  • High query complexity: parity, # of distinct elements.
  • Low query complexity: mean in [0,1], median.
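
  To see why the mean has low query complexity, here is a minimal sketch (assumed helper name): by Hoeffding's inequality, O(log(1/δ)/ε²) uniform i.i.d. queries suffice regardless of the input length n, whereas parity changes if any single entry is flipped and therefore needs all n queries.

```python
import math
import random

def estimate_mean(x, eps, delta, rng=random.Random(0)):
    """Estimate the mean of x (values in [0,1]) to within eps with probability >= 1 - delta.

    Hoeffding's inequality: q = ceil(ln(2/delta) / (2*eps^2)) i.i.d. uniform queries suffice,
    independent of len(x).
    """
    q = math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))
    sample = [x[rng.randrange(len(x))] for _ in range(q)]  # i.i.d. uniform query positions
    return sum(sample) / q

x = [random.Random(1).random() for _ in range(10**5)]
print(estimate_mean(x, eps=0.05, delta=0.01), sum(x) / len(x))
```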

  8. Our Main Result
  • A technique for obtaining lower bounds on the query complexity of approximating functions.
  • Template for obtaining specific lower bounds.
  • Arbitrary domain and range; all types of approximation.
  • Usable for wide classes of functions with symmetry properties.
  • Outperforms previous techniques for functions with "abundance hardness".
  • Matches previous techniques for functions with "similarity hardness".

  9. Previous Work
  Statistics:
  • Cramér-Rao inequality
  • VC dimension
  • Optimality of the sequential probability ratio test
  CS:
  • Lower bounds via the Hellinger distance [B., Kumar, Sivakumar 01]
  • Specific lower bounds [Canetti, Even, Goldreich 95], [Radhakrishnan, Ta-Shma 96], [Dagum, Karp, Luby, Ross 95], [Schulman, Vazirani 99], [Charikar, Chaudhuri, Motwani, Narasayya 00]
  None of these addresses abundance hardness!

  10. Multi-Way Reduction from a Binary Promise Problem
  [Figure: f: X^n → Y maps pairwise "disjoint" inputs a, b, c to pairwise far values f(a), f(b), f(c).]
  Binary promise problem: given x ∈ { a, b }, decide whether x = a or x = b.
  Multi-way version: given x from a set of pairwise "disjoint" inputs { a, b, c, … }, decide whether x = a or x = b or x = c, etc.
  Either version can be solved by any sampling algorithm approximating f.

  11. Main Result
  The lower bound "recipe" for f: X^n → Y, a function with an appropriate symmetry property:
  • Identify a set S = { x1,…,xm } of "pairwise disjoint" inputs.
  • Calculate the "dissimilarity" D(x1,…,xm) among x1,…,xm (D(·,…,·) is a distance measure taking values in [0, log m]).
  Theorem: Any algorithm approximating f requires q queries, where q = Ω(log m / D(x1,…,xm)).
  This expresses a tradeoff between "similarity hardness" (small dissimilarity D) and "abundance hardness" (many disjoint inputs m).

  12. Measure of Dissimilarity
  πi: the distribution of the value of a uniformly chosen entry of xi.
  Then D(x1,…,xm) = JS(π1,…,πm), the Jensen-Shannon divergence among π1,…,πm (defined on slide 23).
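
  A minimal sketch (hypothetical names) of this measure: form each πi as the distribution of a uniformly chosen entry of xi, then compute their Jensen-Shannon divergence using the formula from slide 23, here in bits so that D lies in [0, log2 m].

```python
import math
from collections import Counter

def entry_distribution(x):
    """pi_i: distribution of the value of a uniformly chosen entry of x."""
    counts = Counter(x)
    return {v: c / len(x) for v, c in counts.items()}

def kl(p, q):
    """KL divergence KL(p || q) in bits; assumes supp(p) is contained in supp(q)."""
    return sum(pv * math.log(pv / q[v], 2) for v, pv in p.items() if pv > 0)

def dissimilarity(inputs):
    """D(x_1,...,x_m) = JS(pi_1,...,pi_m) = (1/m) * sum_i KL(pi_i || mean of the pi_i)."""
    dists = [entry_distribution(x) for x in inputs]
    support = set().union(*dists)
    mean = {v: sum(d.get(v, 0.0) for d in dists) / len(dists) for v in support}
    return sum(kl(d, mean) for d in dists) / len(dists)

# Near-identical inputs have tiny dissimilarity; disjoint-support inputs reach D = log2 m.
print(dissimilarity(["aab", "abb"]), dissimilarity(["aaa", "bbb"]))
```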

  13. Application I: The Election Problem
  Previous bounds on the query complexity:
  • Ω(1/ε^2) [BKS01]
  • Ω(k) [Batu et al. 00]
  • O(k/ε^2) [BKS01]
  Theorem [This paper]: the query complexity is Ω(k/ε^2).

  14. Combinatorial Designs
  t-design: a family of pairwise "far apart" subsets of [k] (the figure shows B1, B2, B3 ⊆ [k]).
  Proposition: For all k and for all t ≥ 12, there exists a t-design of size m = 2^Ω(k).
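
  The precise design property appeared only in the slide's figure. The sketch below assumes it asks for (k/2)-size sets with pairwise set difference |Bi \ Bj| ≥ k/t, which is consistent with how the design is used on slide 17, and collects such sets greedily from random candidates; this is the standard probabilistic route by which exponentially many such sets can be shown to exist.

```python
import random

def random_design(k, t, m, rng=random.Random(0), max_tries=10000):
    """Greedily collect m subsets of [k], each of size k//2, with pairwise
    |B_i \\ B_j| >= k/t (assumed design property).  A random (k/2)-subset
    violates the constraint with probability 2^(-Omega(k)), which is why
    exponentially many sets exist for large k."""
    design = []
    for _ in range(max_tries):
        if len(design) == m:
            break
        b = frozenset(rng.sample(range(k), k // 2))
        if all(len(b - c) >= k / t and len(c - b) >= k / t for c in design):
            design.append(b)
    return design

print(len(random_design(k=60, t=12, m=20)))
```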

  15. Proof of the Lower Bound
  Step 1: Identify a set S of pairwise disjoint inputs.
  B1,…,Bm ⊆ [k]: a t-design of size m = 2^Ω(k).
  S = { x1,…,xm }, where xi is a vote sequence that puts a ½ + ε fraction of the votes on the parties in Bi and the remaining ½ − ε fraction on [k] \ Bi.
  Step 2: Dissimilarity calculation: D(x1,…,xm) = O(ε^2).
  By the main theorem, the number of queries is at least Ω(k/ε^2).

  16. Application II: Low Rank Matrix Approximation
  Exact low rank approximation:
  • Given an m × n real matrix M and k ≤ m, n, find the m × n matrix M_k of rank k for which ||M − M_k||_F is minimized.
  • Solution: SVD. Requires querying all of M.
  Approximate low rank approximation (LRM_k):
  • Get a rank-k matrix A s.t. ||M − A||_F ≤ ||M − M_k||_F + ε·||M||_F.
  Theorem [This paper]: Computing LRM_k requires Ω(m + n) queries.
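
  For reference, a minimal numpy sketch of the exact, full-access solution: M_k is obtained by truncating the SVD of M.

```python
import numpy as np

def best_rank_k(M, k):
    """M_k: the rank-k matrix minimizing ||M - M_k||_F, via the truncated SVD.
    Requires reading all of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
M = rng.standard_normal((50, 30))
Mk = best_rank_k(M, k=5)
print(np.linalg.matrix_rank(Mk), np.linalg.norm(M - Mk, 'fro'))
```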

  17. Proof of the Lower Bound
  Step 1: Identify a set S of pairwise disjoint inputs.
  B1,…,Bt ⊆ [2k]: a combinatorial design of size t = 2^Ω(k).
  S = { M1,…,Mt }, where Mi is all-zero except for the diagonal, whose first 2k entries form the characteristic vector of Bi. [Figure: the 2k × 2k top-left block of Mi, with Bi on its diagonal.]
  • Mi is of rank k ⇒ (Mi)_k = Mi.
  • ||Mi||_F = k^{1/2}.
  • ||Mi − Mj||_F ≥ (|Bi \ Bj|)^{1/2} ≥ (k/12)^{1/2} ≥ ε·(||Mi||_F + ||Mj||_F).
  Step 2: Dissimilarity calculation: D(M1,…,Mt) = 2k/m.
  By the main theorem, the number of queries is at least Ω(m).
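
  A small numpy sketch (toy sizes, hypothetical helper) of this construction, checking the three stated properties on two example sets of size k in [2k]:

```python
import numpy as np

def diag_from_set(B, dim):
    """M_i: all-zero except the diagonal, which is the characteristic vector of B_i."""
    M = np.zeros((dim, dim))
    for j in B:
        M[j, j] = 1.0
    return M

k = 8
B1, B2 = {0, 1, 2, 3, 4, 5, 6, 7}, {4, 5, 6, 7, 8, 9, 10, 11}   # two sets of size k in [2k]
M1, M2 = diag_from_set(B1, 2 * k), diag_from_set(B2, 2 * k)

print(np.linalg.matrix_rank(M1))                 # k, so (M_i)_k = M_i
print(np.linalg.norm(M1, 'fro'), np.sqrt(k))     # ||M_i||_F = k^(1/2)
print(np.linalg.norm(M1 - M2, 'fro') ** 2,       # = |B_i symmetric-difference B_j|
      len(B1 - B2) + len(B2 - B1))               #   >= |B_i \ B_j|
```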

  18. Low Rank Matrix Approximation (cont.)
  Theorem [Frieze, Kannan, Vempala 98]: By querying an s × s submatrix of M chosen using any distributions which "approximate" the row and column weight distributions of M, one can solve LRM_k with s = O(k^4/ε^3).
  Theorem [This paper]: Solving LRM_k by querying an s × s submatrix of M chosen even according to the exact row and column weight distributions of M requires s = Ω(k/ε^2).
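
  A minimal sketch of what "querying an s × s submatrix chosen according to the row and column weight distributions" means. The weights here are taken as squared Euclidean norms, the usual choice in this line of work; treat that as an assumption, since the slide does not pin the weights down. This illustrates only the sampling scheme, not the low-rank reconstruction itself.

```python
import numpy as np

def weighted_submatrix(M, s, rng=np.random.default_rng(0)):
    """Sample an s x s submatrix of M: rows with probability proportional to their
    squared norms, columns likewise (the 'row and column weight distributions')."""
    row_p = np.linalg.norm(M, axis=1) ** 2
    row_p /= row_p.sum()
    col_p = np.linalg.norm(M, axis=0) ** 2
    col_p /= col_p.sum()
    rows = rng.choice(M.shape[0], size=s, replace=True, p=row_p)
    cols = rng.choice(M.shape[1], size=s, replace=True, p=col_p)
    return M[np.ix_(rows, cols)], rows, cols

M = np.random.default_rng(1).standard_normal((100, 80))
S, rows, cols = weighted_submatrix(M, s=10)
print(S.shape, rows, cols)
```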

  19. Oblivious Sampling
  • Query positions are independent of the given input.
  • The algorithm has a fixed query distribution π on [n]^q.
  • i.i.d. queries: the queries are independent and identically distributed, i.e., π = μ^q, where μ is a distribution on [n].
  Phase 1: choose query positions i1,…,iq.
  Phase 2: query xi1,…,xiq.

  20. Main Theorem: Outline of the Proof
  Adaptive sampling is reduced (for functions with symmetry properties) to oblivious sampling with i.i.d. queries, which is viewed as statistical classification, for which lower bounds are proved via information theory.

  21. Statistical Classification
  A black box holds one of the distributions π1,…,πm; the classifier receives q i.i.d. samples from the black box and outputs an index i ∈ [m].
  • π1,…,πm are distributions on Z.
  • The classifier is required to identify the black box's distribution correctly with probability ≥ 1 − δ.
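
  A minimal sketch of one such classifier (maximum likelihood over the m known distributions); the lower bound of course applies to any classifier, this is only to make the setting concrete.

```python
import math
import random

def classify(samples, dists):
    """Given q i.i.d. samples from one of the known distributions pi_1,...,pi_m
    (dicts value -> probability), output the index maximizing the log-likelihood."""
    def loglik(pi):
        return sum(math.log(pi.get(z, 1e-12)) for z in samples)  # small floor avoids log(0)
    return max(range(len(dists)), key=lambda i: loglik(dists[i]))

pi1 = {'a': 0.6, 'b': 0.4}
pi2 = {'a': 0.4, 'b': 0.6}
rng = random.Random(0)
samples = rng.choices(list(pi1), weights=list(pi1.values()), k=200)  # black box holds pi_1
print(classify(samples, [pi1, pi2]))   # prints 0 with high probability
```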

  22. From Sampling to Classification
  • T: an oblivious algorithm with query distribution π = μ^q that approximates f: X^n → Y.
  • μ_x: the joint distribution of a query and its answer when T runs on input x (a distribution on [n] × X).
  • S = { x1,…,xm }: a set of pairwise disjoint inputs.
  The black box holds μ_{xi} for some i; the classifier feeds the q i.i.d. samples to T and decides i iff T's output ∈ A(xi).

  23. Jensen-Shannon Divergence [Lin 91]
  • KL divergence between distributions μ, ν on Z: KL(μ || ν) = Σ_z μ(z) log (μ(z)/ν(z)).
  • Jensen-Shannon divergence among distributions π1,…,πm on Z: JS(π1,…,πm) = (1/m) Σ_i KL(πi || π̄), where π̄ = (1/m) Σ_i πi.
  [Figure: distributions π1,…,π8 surrounding their average π̄.]

  24. Main Result
  Theorem [Classification lower bound]: Any δ-error classifier for π1,…,πm requires q queries, where
  q ≥ ((1 − δ)·log m − 1) / JS(π1,…,πm).
  Corollary [Query complexity lower bound]: For any oblivious algorithm with query distribution π = μ^q that (ε,δ)-approximates f, and for any set S = { x1,…,xm } of "pairwise disjoint" inputs, the number of queries q is at least
  ((1 − δ)·log m − 1) / JS(μ_{x1},…,μ_{xm}).

  25. Outline of the Proof
  Lemma 1 [Classification error lower bound]: any classifier with error δ satisfies I(V; Z1,…,Zq) ≥ (1 − δ)·log m − 1, where V is the (uniformly random) index of the black box's distribution and Z1,…,Zq are the samples. Proof: by Fano's inequality.
  Lemma 2 [Decomposition of Jensen-Shannon]: I(V; Z1,…,Zq) ≤ q · JS(π1,…,πm). Proof: by subadditivity of entropy and conditional independence.
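
  A sketch of how the two lemmas combine, under the reading of the lemmas given above; the identity between the one-sample mutual information and the Jensen-Shannon divergence is the standard fact behind Lemma 2.

```latex
% V: uniform index in [m]; given V = i, the samples Z_1,\dots,Z_q are i.i.d. from \pi_i.
\begin{align*}
(1-\delta)\log m - 1
   &\le I(V; Z_1,\dots,Z_q)              && \text{Lemma 1 (Fano's inequality)}\\
   &\le \sum_{j=1}^{q} I(V; Z_j)         && \text{Lemma 2 (subadditivity, cond.\ independence)}\\
   &= q \cdot \mathrm{JS}(\pi_1,\dots,\pi_m)
      && \text{since } I(V;Z_j) = H(\bar{\pi}) - \tfrac{1}{m}\sum\nolimits_i H(\pi_i)
         = \mathrm{JS}(\pi_1,\dots,\pi_m),
\end{align*}
% which rearranges to q >= ((1-\delta) log m - 1) / JS(pi_1,...,pi_m), the bound on slide 24.
```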

  26. Conclusions
  • A general lower bound technique for the query complexity:
    • Template for obtaining specific bounds
    • Works for wide classes of functions
    • Captures both "similarity hardness" and "abundance hardness"
  • Applications:
    • The "Election Problem"
    • Low rank matrix approximation
    • Matrix reconstruction
  • Also proved:
    • A lower bound technique for the expected query complexity
    • Tightly captures similarity hardness but not abundance hardness
  • Open problems:
    • Tight bounds for low rank matrix approximation
    • Better lower bounds on the expected query complexity
    • Lower bounds for non-symmetric functions

  27. Simulation of Adaptive Sampling by Oblivious Sampling
  Definition: f: X^n → Y is symmetric if ∀x and ∀σ ∈ Sn, f(σ(x)) = f(x). f is ε-symmetric if ∀x, ∀σ, A(σ(x)) = A(x), where A(x) is the set of acceptable approximate answers on x.
  Lemma [BKS01]: Any q-query algorithm approximating an ε-symmetric f can be simulated by a q-query oblivious algorithm whose queries are uniform without replacement.
  Corollary: If q < n/2, it can be simulated by a 2q-query oblivious algorithm whose queries are uniform with replacement.

  28. Simulation Lemma: Outline of the Proof
  • T: a q-query sampling algorithm approximating f.
  • WLOG, T never queries the same location twice.
  Simulation:
  • Pick a random permutation σ.
  • Run T on σ(x).
  • By ε-symmetry, the output is likely to be in A(σ(x)) = A(x).
  • The queries to x are uniform without replacement.
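
  A minimal sketch of this simulation (hypothetical interface: the adaptive algorithm T is a function taking a query oracle), illustrating why the positions of x actually read end up uniform without replacement.

```python
import random

def simulate_oblivious(T, x, rng=random.Random(0)):
    """Simulate adaptive algorithm T on input x by running it on a random permutation of x.
    Because sigma is a uniformly random permutation and T never repeats a query, the
    positions of x that get read are uniform without replacement."""
    sigma = list(range(len(x)))
    rng.shuffle(sigma)                      # pick a random permutation sigma
    queried = []
    def oracle(i):                          # T's query to position i of sigma(x) ...
        queried.append(sigma[i])            # ... is a query to position sigma[i] of x
        return x[sigma[i]]
    return T(oracle, len(x)), queried

# Toy adaptive algorithm: reads the first 3 positions of its (permuted) input and sums them.
T = lambda oracle, n: sum(oracle(i) for i in range(3))
print(simulate_oblivious(T, list(range(10))))
```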

  29. Extensions
  Definitions:
  • f is (g,ε)-symmetric if ∀x, ∀σ, ∀y ∈ A(σ(x)), g(σ, y) ∈ A(x).
  • A function f on m × n matrices is ε-row-symmetric if, for all matrices M and for all row-permutation matrices Π, A(Π·M) = A(M).
  Similarly: ε-column-symmetry, and (g,ε)-row- and column-symmetry.
  We prove that similar simulations hold for all of the above.
