Lower Bounds for Property Testing

Luca Trevisan, U.C. Berkeley. Joint work with Andrej Bogdanov and Kenji Obata.

Sub-linear Time Algorithms • Want to design algorithms that run in less than linear time (and so cannot read the entire input) • Must be probabilistic and approximate • For optimization problems: • Compute a numerical approximation of the optimum cost (and an implicit representation of an approximate solution?) • For decision problems: • What is approximation for decision problems?

(Graph) Property Testing • Testing a property P with accuracy ε in the adjacency matrix representation: • Given a graph G that has property P, accept with probability > 3/4 • Given a graph G that is ε-far from property P, accept with probability < 1/4 • ε-far = must change an ε fraction of the adjacency matrix to get property P (add/remove > εn² edges)

Example [GGR,AK] • Testing bipartiteness of a given graph G • Pick (1/ε) polylog(1/ε) vertices and check whether they induce a bipartite graph; if so accept, otherwise reject • If G is bipartite then the algorithm accepts with probability 1 • If G is ε-far from bipartite, then whp the algorithm discovers an odd cycle (non-trivial to prove) • Running time: O((1/ε²) polylog(1/ε)) • We will discuss a matching lower bound if time allows

Paleontologist’s approach

Bounded Degree Graphs • Testing a property P with accuracy ε in the adjacency list representation: • Given a graph G that has property P, accept with probability > 3/4 • Given a graph G that is ε-far from property P, accept with probability < 1/4 • ε-far = must change an ε fraction of the adjacency list entries to get property P (add/remove > εdn edges)

Bipartiteness [GR] • Testing bipartiteness • Repeat polylog n times: • Start at a random vertex and take sqrt(n) random walks of length polylog n; if two of them combine to form an odd cycle, reject • If no repetition rejects, accept • Analysis: • In a graph where a constant fraction of the edges must be removed to make it bipartite, the algorithm finds an odd cycle whp

Matching Lower Bound [GR] • Define two distributions of graphs: • Gfar: a random Hamiltonian circuit plus a random matching (whp 1/100-far from bipartite) • Gbip: a random Hamiltonian circuit plus a random matching, conditioned on making the graph bipartite • Gfar and Gbip are indistinguishable to algorithms of query complexity o(sqrt(n))
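The two distributions can be sampled as in the sketch below, assuming n is even and allowing parallel edges for simplicity (an assumption of this sketch, not something the slide specifies). On an even Hamiltonian circuit the bipartition is forced by position parity, so conditioning the matching on bipartiteness amounts to matching even positions to odd positions. Function names are hypothetical.

```python
import random


def hamiltonian_cycle_edges(order):
    """Edges of the cycle visiting vertices in the given order."""
    n = len(order)
    return [(order[i], order[(i + 1) % n]) for i in range(n)]


def sample_gfar_bounded_degree(n, rng):
    """Random Hamiltonian circuit plus a uniformly random perfect matching."""
    order = list(range(n))
    rng.shuffle(order)
    partners = list(range(n))
    rng.shuffle(partners)
    matching = [(partners[i], partners[i + 1]) for i in range(0, n, 2)]
    return hamiltonian_cycle_edges(order) + matching


def sample_gbip_bounded_degree(n, rng):
    """Same, but the matching only joins even cycle positions to odd ones."""
    order = list(range(n))
    rng.shuffle(order)
    even_side = order[0::2]
    odd_side = order[1::2]
    rng.shuffle(odd_side)
    matching = list(zip(even_side, odd_side))
    side = {v: i % 2 for i, v in enumerate(order)}   # the forced bipartition
    return hamiltonian_cycle_edges(order) + matching, side
```

Both samplers return graphs in which every vertex has degree 3 (counting parallel edges), so degrees alone give a tester no information.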

Sub-linear Time Approximation • Minimum spanning tree • Given a connected weighted graph of degree d with weights in the range {1,…,w}, can approximate the MST weight within (1+ε) in time about O(dw/ε²) [Chazelle, Rubinfeld, T] • Max SAT • Given a CNF formula where every variable occurs at most d times, can approximate the Max SAT optimum within .618, presumably also 2/3, in O(d) time [work in progress, hopefully will get 3/4 - δ]

Sublinear Time Approximation • Problems restricted to dense instances: • Max CUT and other graph problems can be approximated within (1+ε) in graphs with at least αn² edges in time 2^poly(1/(εα)) [GGR] • Max 3SAT can be approximated within (1+ε) in instances with at least αn³ clauses in time 2^poly(1/(εα)), and similar results hold for other satisfiability problems [AFKK]

General Goals • When looking for polynomial-time algorithms: • Several algorithmic techniques of general applicability • A general technique to “prove” impossibility (NP-completeness) • For sublinear-time algorithms: • General algorithmic techniques? • Impossibility results?

Testing 3-Colorability • Easy in adjacency matrix representation • NP-hard in adjacency list representation • Only for small enough ε • Can find a 3-coloring good for 80% of the edges in a 3-colorable graph using SDP • NP-hard to find a 3-coloring good for a 98% (?) fraction of the edges • Non-tight, and conditional, lower bound for query complexity

Other problems • The query complexity of the following problems is equivalent to the query complexity of testing 3-colorability • Testing satisfiability of a 3SAT instance • Every variable occurs in O(1) clauses, “adjacency list” representation • Approximating max cut, vertex cover, independent set, . . . in bounded-degree graphs • Approximating Max SAT, Max 2SAT, . . . • Lower bound of sqrt(n) for all problems • Nothing better except with complexity assumptions

Our Results • For one-sided error algorithms: • Ω(n) query complexity to distinguish 3-colorable graphs from graphs that are (1/3 - δ)-far • Lower bound applies to testing problems that are solvable in polynomial time • For two-sided error algorithms: • For some ε, Ω(n) query complexity to distinguish 3-colorable graphs from graphs that are ε-far

Additional Results • Unconditionally, algorithms running in time o(n) cannot: • Approximate Max 3SAT better than 7/8 • Approximate Max Cut in bounded-degree graphs better than 16/17 • . . . • Håstad '97 proved that the above problems are NP-hard

The 3-Coloring Lower Bound • Consider first one-sided error algorithms • It is enough to find a graph G that is (1/3 - δ)-far from 3-colorable, but such that every subgraph on fewer than αn vertices is 3-colorable • (for every δ there is an α such that . . .) • Then an algorithm of query complexity < αn either accepts G (which is wrong) or rejects some 3-colorable graph (which means the algorithm does not have one-sided error)

The Graph • Pick a graph of degree O(1/δ²) at random (take the union of that many random matchings) • Then it is (1/3 - δ)-far whp • But, for some α, whp every subgraph induced by k < αn vertices contains < 1.5k edges • In a minimal non-3-colorable graph, every vertex has degree at least 3 • Hence every subgraph induced by < αn vertices is 3-colorable [Erdős]
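The last three bullets combine into a short counting argument, spelled out here for convenience. The symbol H is introduced only for this sketch: it denotes a minimal non-3-colorable subgraph on k < αn vertices, assumed to exist for the sake of contradiction.

```latex
\begin{align*}
\deg_H(v) \ge 3 \ \text{ for all } v \in V(H)
  \quad&\Longrightarrow\quad
  |E(H)| \;=\; \tfrac12 \sum_{v \in V(H)} \deg_H(v) \;\ge\; \tfrac32\,k, \\
|V(H)| = k < \alpha n
  \quad&\Longrightarrow\quad
  |E(H)| \;\le\; \bigl|E\bigl(G[V(H)]\bigr)\bigr| \;<\; \tfrac32\,k
  \quad \text{(whp, by the sparsity bullet).}
\end{align*}
```

The two bounds contradict each other, so no such H exists and every subgraph on fewer than αn vertices is 3-colorable.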

Explicit Construction • Can the previous construction be derandomized? • For constants d, ε, α, and for every sufficiently large n, we can explicitly construct a graph on n vertices, with max degree d, that is ε-far from 3-colorable and such that every subset of αn vertices induces a 3-colorable subgraph

Explicit Construction • We construct a 3SAT formula such that, for constants k, ε′, α′: • Every variable occurs k times • No assignment satisfies more than a 1 - ε′ fraction of the clauses • Every α′ fraction of the clauses is satisfiable • Then we use a (slightly new) reduction from 3SAT to 3-Coloring

The Formula • Fix a degree-d expander graph G=(V,E) such that for every cut (S,V-S) at least min{|S|,|V-S|} edges cross the cut (d = 14 is enough) • Have two variables x_uv and x_vu for each edge (u,v) • For every vertex v have the (3SAT equivalent of) the constraint • Σ_u x_uv = 1 + Σ_w x_vw
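To see why these constraints cannot all hold at once, sum them over all vertices. This sketch assumes that both sums in the constraint range over the neighbors of v and that the arithmetic is over the integers with 0/1 variables; the slide does not state either point explicitly.

```latex
\begin{align*}
\sum_{v \in V} \sum_{u : (u,v) \in E} x_{uv}
  &= \sum_{v \in V} \Bigl( 1 + \sum_{w : (v,w) \in E} x_{vw} \Bigr)
   = |V| + \sum_{v \in V} \sum_{w : (v,w) \in E} x_{vw} .
\end{align*}
```

Each double sum counts every variable exactly once (one variable per ordered pair of adjacent vertices), so both equal the same total T, and the identity would read T = |V| + T, which is impossible. Read as a flow condition, each vertex asks for one more unit flowing in than out, which cannot hold at every vertex simultaneously.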

Structure of the Analysis • Impossible to satisfy more than a fraction 1/(d+1) of the constraints • Can always satisfy half of the constraints • define an auxiliary network • show that the auxiliary network has no small cut because of expansion • then there is a large flow • use the large flow to find an assignment for a subset of the constraints

Flow Argument • Want to satisfy the constraints corresponding to vertices in C, with |C| < |V|/2 • Construct a flow network with a new source s, a sink t obtained by collapsing V-C, and the vertices in C [Figure: flow network with nodes s, C, and t = V-C]

Flow Argument • Every cut has size at least |C| • There is a 0/1 flow of value at least |C| • Interpreted as an assignment, it satisfies all constraints in C [Figure: a cut splitting C into A and C-A, labeled "|A| edges" and "|C-A| edges", with source s and sink t]

Two-Sided Error Algorithms • Need to define two distributions of graphs Gcol and Gfar such that • Graphs in Gcol are (almost) always 3-colorable • Graphs in Gfar are (almost) always far from 3-colorable • To an algorithm of bounded query complexity, Gcol and Gfar look (almost) the same

Main Step • Define two distributions Dsat and Dfar of instances of E3LIN-2 (systems over GF(2) with 3 variables per equation) • Systems in Dsat are always satisfiable • Systems in Dfar are (almost) always (1/2 - δ)-far from satisfiable • To an algorithm of bounded query complexity, Dsat and Dfar look the same • We get Gcol and Gfar using a reduction from approximate E3LIN-2 to approximate 3-coloring

E3LIN-2 • X1 + X3 + X10 = 0 mod 2 • X2 + X3 + X4 = 1 mod 2 • X1 + X2 + X9 = 0 mod 2 • . . .

Main Building Block • We show that for every c there is an α such that there exists a left-hand side with • n variables, cn equations, 3 variables per equation, every variable occurs in 3c equations • every αn equations are linearly independent • Pick the left-hand side at random • repeat 3c times: pick at random a set of n/3 disjoint triples of variables • Explicit construction?

Distributions • The left-hand side is always as before • In Dsat, we pick a random assignment to the variables, and set the right-hand side consistently • always satisfiable • In Dfar, we pick the right-hand side uniformly at random • With high probability, (1/2 - O(1/sqrt(c)))-far
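A minimal sketch of sampling from the two distributions, assuming n is divisible by 3. The left-hand side follows the random construction from the previous slide (3c rounds, each a random partition of the variables into n/3 triples); the sketch does not check the linear-independence property, which the slides claim holds for a random choice. Function names are hypothetical.

```python
import random


def random_left_hand_side(n, c, rng):
    """3c rounds; each round partitions the n variables into n/3 disjoint triples."""
    equations = []
    for _ in range(3 * c):
        perm = list(range(n))
        rng.shuffle(perm)
        equations.extend(tuple(perm[i:i + 3]) for i in range(0, n, 3))
    return equations                    # c*n equations in total


def sample_d_sat(lhs, n, rng):
    """Right-hand side consistent with a hidden uniformly random assignment."""
    assignment = [rng.randrange(2) for _ in range(n)]
    rhs = [(assignment[i] + assignment[j] + assignment[k]) % 2 for i, j, k in lhs]
    return rhs, assignment


def sample_d_far(lhs, rng):
    """Right-hand side chosen uniformly at random, independent of any assignment."""
    return [rng.randrange(2) for _ in lhs]


def satisfied_fraction(lhs, rhs, assignment):
    """Fraction of equations x_i + x_j + x_k = b (mod 2) that `assignment` satisfies."""
    good = sum((assignment[i] + assignment[j] + assignment[k]) % 2 == b
               for (i, j, k), b in zip(lhs, rhs))
    return good / len(lhs)
```

For a Dsat instance, `satisfied_fraction` with the hidden assignment is always 1. For a Dfar instance, the slide's claim is that whp no assignment satisfies much more than half of the equations.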

Indistinguishability • The two distributions differ only in the right-hand side • In Dfar, uniformly distributed • In Dsat, αn-wise independent • Linear independence implies statistical independence • They look the same to an algorithm that sees fewer than αn equations
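The step "linear independence implies statistical independence" can be made explicit as follows. The notation is introduced for this sketch only: S is the set of equations the algorithm sees, with |S| < αn, and A_S is the |S| × n matrix over GF(2) whose rows are the left-hand sides of those equations.

```latex
\begin{align*}
\text{rows of } A_S \text{ linearly independent}
  \;&\Longrightarrow\; x \mapsto A_S x \ \text{ is onto } \mathrm{GF}(2)^{|S|},
      \text{ with every fiber of size } 2^{\,n-|S|}, \\
\Pr_{x \in \mathrm{GF}(2)^n}\bigl[ A_S x = b \bigr]
  \;&=\; \frac{2^{\,n-|S|}}{2^{\,n}} \;=\; 2^{-|S|}
  \qquad \text{for every } b \in \mathrm{GF}(2)^{|S|} .
\end{align*}
```

So under Dsat the right-hand sides of the seen equations are uniform, exactly as under Dfar.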

Conclusion of the Argument • No algorithm of “query complexity” o(n) can distinguish satisfiable instances of E3LIN-2 from instances that are (1/2 - δ)-far from satisfiable • For some ε, no algorithm of query complexity o(n) can distinguish 3-colorable graphs from graphs that are ε-far from 3-colorable • No algorithm of query complexity o(n) can approximate Max 3SAT better than 7/8 . . .

Open Questions • Show that distinguishing 3-colorable graphs from (1/3 - δ)-far graphs requires query complexity Ω(n) • we can only prove it for one-sided error • Show that approximating Max SAT better than 3/4 and Max CUT better than 1/2 requires query complexity Ω(n) • we only know Ω(sqrt(n)) [implicit in GR] • would “explain” why we need SDP

Back to Dense Graphs • Recall the Alon-Krivelevich bipartiteness test for the adjacency matrix representation: • pick (1/ε) polylog(1/ε) vertices and look at the induced subgraph • if we see an odd cycle reject, otherwise accept • Running time (1/ε²) polylog(1/ε) • We prove: • Ω(1/ε²) for non-adaptive algorithms • Ω(1/ε^1.5) for adaptive algorithms

Two Distributions • Gfar: every edge exists with probability ε • whp it is ε/3-far from bipartite • Gbip: pick a random partition, then every edge that crosses the partition exists with probability 2ε • Thm1: they look the same to non-adaptive algorithms making o(1/ε²) queries • Thm2: they look the same to adaptive algorithms making o(1/ε^1.5) queries
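The two dense-graph distributions can be sampled as in this sketch. It assumes that "random partition" means each vertex independently picks a side uniformly at random; the function names are hypothetical.

```python
import itertools
import random


def sample_gfar_dense(n, eps, rng):
    """Every pair {u, v} is an edge independently with probability eps."""
    return {(u, v) for u, v in itertools.combinations(range(n), 2)
            if rng.random() < eps}


def sample_gbip_dense(n, eps, rng):
    """Random 2-partition; each crossing pair is an edge with probability 2*eps."""
    side = [rng.randrange(2) for _ in range(n)]
    edges = {(u, v) for u, v in itertools.combinations(range(n), 2)
             if side[u] != side[v] and rng.random() < 2 * eps}
    return edges, side
```

About half of all pairs cross a random partition, so both distributions have expected edge density close to ε. This is why edge counts alone do not separate them and a tester has to look for odd cycles.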

Proof of a Weaker Statement • Thm1 (weaker): a non-adaptive algorithm making q = o(1/ε²) queries in Gfar is unlikely to see an odd cycle • Proof: • a non-adaptive algorithm asks about some subgraph with q edges • There are at most about q^(t/2) cycles of length t, and each one exists with probability ε^t, so the expected number of cycles of length t is about ε^t q^(t/2), exponentially small in t • Summing over all t, it is still unlikely that there is a cycle
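The final summation can be written out as follows, taking the slide's count of about q^(t/2) cycles of length t as given. The shorthand x = ε·sqrt(q) is introduced for this sketch; q = o(1/ε²) means x = o(1).

```latex
\begin{align*}
\mathbb{E}\bigl[\#\text{cycles seen}\bigr]
  \;\lesssim\; \sum_{t \ge 3} q^{t/2}\,\varepsilon^{t}
  \;=\; \sum_{t \ge 3} x^{t}
  \;=\; \frac{x^{3}}{1 - x}
  \;\xrightarrow[x \to 0]{}\; 0,
  \qquad x = \varepsilon \sqrt{q} .
\end{align*}
```

By Markov's inequality, the probability that the queried pairs contain any cycle at all is then o(1).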

Proof of a Weaker Statement • Thm2 (weaker): an adaptive algorithm making q = o(1/ε^1.5) queries in Gfar is unlikely to see an odd cycle • Proof: • the algorithm sees an edge only about once in 1/ε queries • the algorithm sees a cycle only after querying a pair that it already sees as connected • It takes 1/ε^0.5 edges to have 1/ε pairs of connected vertices • It takes 1/ε^1.5 queries to have that many edges

Some more open questions • In the adjacency matrix representation, most interesting problems are solvable in constant time (depending only on ε) • For some problems (e.g. testing triangle-freeness) the analysis uses Szemerédi's regularity lemma, and the constant is hyper-exponential in 1/ε • Lower bound (1/ε)^(log 1/ε), and only for one-sided error • Alternative analysis / stronger lower bounds?
