
Some Recent Progress in Combinatorial Statistics



Presentation Transcript


  1. Some Recent Progress in Combinatorial Statistics Elchanan Mossel, UC Berkeley + Weizmann Institute http://www.stat.berkeley.edu/~mossel

  2. What is Combinatorial Statistics? • "Combinatorial Statistics" deals with inference of discrete parameters such as graphs, trees, or partitions, • with the goal of providing algorithms with guarantees on sampling complexity, computational complexity, and success probability. • Alternatively: sampling or complexity lower bounds.

  3. Example 1: Graphical Model reconstruction

  4. Markov random fields / Graphical Models • A common model for stochastic networks: a bounded-degree graph G = (V, E) with weight functions ψ_C : Σ^|C| → R≥0 for every clique C. [Figure: a four-node example graph with edges (1, 2), (1, 3), (2, 3), (3, 4).]

  5. (1, 3)( , ) (1, 2)( , ) (2, 3)( , ) (3, 4)( , ) Markov Random Fields / Graphical Models • A common model for stochastic networks bounded degree graph G = (V, E) weight functions C:| C|R≥0 for every edge clique C 1 nodes v are assigned values av in alphabet 2 distribution over statesV given by 3 Pr[] ~ CC(au, u 2 C) 4 where the product goes over all Cliques C

  6. Reconstruction task for Markov random fields • Suppose we can obtain independent samples from the Markov random field • Given observed data at the nodes, is it possible to reconstruct the model (network)? • Important: Do we see data at all of the nodes or only a subset?

  7. Reconstruction problem • Problem: Given k independent samples of σ = (σ_1, …, σ_n) at all the nodes, find the graph G. (Given activity at all the nodes, can the network be reconstructed?) • Restrict attention to graphs with max degree d. A structure estimator is a map from the k observed samples to a graph on n nodes of max degree d. • Questions: 1. How many samples k are required (asymptotically) to reconstruct MRFs with n nodes and max degree d with probability approaching 1, i.e., so that the estimated graph equals G with probability tending to 1? 2. Want an efficient algorithm for reconstruction.

  8. Related work • Tree Markov fields can be reconstructed efficiently (even with hidden nodes) [Erdös, Steel, Szekely, Warnow '99], [Daskalakis, Mossel, Roch '06]. • PAC setup: [Abbeel, Koller, Ng '06] produce a factorized distribution that is εn-close in Kullback-Leibler divergence to the true distribution. No guarantee to reconstruct the correct graph; running time and sampling complexity are n^O(d). • A more restricted problem is studied by [Wainwright, Ravikumar, Lafferty '06]: restricted to the Ising model, sample complexity of order d^5 log n, and a difficult-to-verify convergence condition; the technique is based on L1 regularization. Moreover, it works for graphs, not for graphical models (clique potentials are not allowed). • Subsequent to our results, [Santhanam, Wainwright '08] determine the information-theoretic sampling complexity, and [Wainwright, Ravikumar, Lafferty '08] get sampling of order d log n (restricted to Ising models; still no checkable guarantee for convergence).

  9. Related work

  10. Reconstructing General Networks - New Results • Theorem (Bresler-M-Sly-08; lower bound on sample complexity): In order to recover G of max degree d, at least c·d·log n samples are needed, for some constant c. Proof follows by "counting the number of networks"; information-theoretic lower bounds. More formally: given any prior distribution which is uniform over degree-d graphs (no restrictions on the potentials), recovering the correct graph with probability ≥ 2^(-n) requires at least c·d·log n samples. • Theorem (Bresler-M-Sly-08; asymptotically optimal algorithm): If the distribution is "non-degenerate", c·d·log n samples suffice to reconstruct the model with probability ≥ 1 - 1/n^100, for some (other) constant c. Running time is n^O(d). (The sampling complexity is tight up to a constant factor; whether the running time is optimal is unknown.)

  11. Intuition Behind Algorithms • Observation: Knowing the graph is the same as knowing the neighborhoods, and a neighborhood is determined by the Markov property. The same intuition is behind the work of Abbeel et al. • "Algorithm": Step 1. Compute empirical probabilities, which are concentrated around the true probabilities. Step 2. For each node, simply test the Markov property of each candidate neighborhood. • Main challenge: Show that non-degeneracy ⇒ the algorithm works. (A schematic sketch follows below.)
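The following Python sketch illustrates the neighborhood-testing idea only; it is not the Bresler-Mossel-Sly procedure, and the conditional-independence check, the threshold `tol`, and all names are hypothetical:

```python
import itertools
from collections import Counter

def empirical_conditionals(samples, v, cond_set):
    """Empirical distribution of sample[v] given the values on cond_set.
    Each sample maps node -> observed value (dict or list)."""
    counts = {}
    for s in samples:
        key = tuple(s[u] for u in cond_set)
        counts.setdefault(key, Counter())[s[v]] += 1
    return {key: {a: c / sum(ctr.values()) for a, c in ctr.items()}
            for key, ctr in counts.items()}

def looks_markov(samples, v, candidate, other, tol=0.1):
    """Crude check: does conditioning on one extra node `other` change the
    empirical conditional law of v given `candidate` by more than tol?"""
    base = empirical_conditionals(samples, v, candidate)
    extended = empirical_conditionals(samples, v, candidate + (other,))
    for key, dist in extended.items():
        dist_base = base.get(key[:-1], {})
        if any(abs(p - dist_base.get(a, 0.0)) > tol for a, p in dist.items()):
            return False
    return True

def candidate_neighborhood(samples, v, nodes, d):
    """Smallest candidate set passing the Markov test against every node
    outside it (brute force over sets of size <= d, hence roughly n^O(d))."""
    for size in range(d + 1):
        for cand in itertools.combinations([u for u in nodes if u != v], size):
            rest = [u for u in nodes if u != v and u not in cand]
            if all(looks_markov(samples, v, cand, u) for u in rest):
                return cand
    return None
```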

  12. (1, 3)( , ) (1, 2)( , ) (2, 3)( , ) (3, 4)( , ) Example 2: Reconstruction of MRF with Hidden nodes • In many applications only some of the nodes can be observed visible nodesWV Markov random field over visible nodes is 1 2 W = (w : w W) 3 • Is reconstruction still possible? • What does “reconstruction” even mean? 4

  13. Reconstruction versus distinguishing • It is easy to construct many models that lead to the same distribution (statistical unidentifiability). • Assuming "this is not a problem", are there computational obstacles to reconstruction? • In particular: how hard is it to distinguish statistically different models?

  14. Distinguishing problems • Let M1, M2 be two models with hidden nodes. • PROBLEM 1: Can you tell if M1 and M2 are statistically close or far apart (on the visible nodes)? • PROBLEM 2: Assuming M1 and M2 are statistically far apart and given access to samples from one of them, can you tell where the samples came from?
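For intuition only, here is a naive Python sketch (invented for illustration, not from the talk) of a plug-in estimate of statistical distance between two sample sets; it is only sensible when the visible state space is tiny, and the hardness results on the next slides say nothing like this can be made to work efficiently in general:

```python
from collections import Counter

def plug_in_tv(samples1, samples2):
    """Naive plug-in estimate of the statistical (total variation) distance
    between the distributions generating two sample sets.  Each sample must
    be hashable, e.g. a tuple of visible-node values."""
    c1, c2 = Counter(samples1), Counter(samples2)
    n1, n2 = len(samples1), len(samples2)
    support = set(c1) | set(c2)
    return 0.5 * sum(abs(c1[x] / n1 - c2[x] / n2) for x in support)
```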

  15. Hardness result with hidden nodes • In Bogdanov-M-Vadhan-08: Problems 1 and 2 are intractable (in the worst case) unless NP = RP • Conversely, if NP = RP then distinguishing (and other forms of reconstruction) are achievable

  16. A possible objection • The “hard” models M1, M2 describe distributions that are not efficiently samplable • But if nature is efficient, we never need to worry about such distributions!

  17. Distinguishing problem for samplable distributions • PROBLEM 3: If M1 and M2 are statistically far apart and given access to samples from one of them, can you tell where the samples came from, assuming M1 and M2 are efficiently samplable? • Theorem: Problem 3 is intractable unless computational zero knowledge is trivial. • We don't know if this is tight.

  18. Example 3: Consensus Ranking and the Mallows Model • Problem: Given a set of rankings {π_1, π_2, …, π_N}, find the consensus ranking (or central ranking), i.e. the permutation minimizing the total distance d to the π_i, where d is a distance on the set of permutations of n objects. The most natural choice is d_K, the Kendall distance.
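A small Python sketch (illustrative only; the position-vector representation and names are my own) of the Kendall distance and of the consensus-ranking objective, solved here by brute-force enumeration rather than by any of the algorithms surveyed below:

```python
import itertools

def kendall_distance(pi, sigma):
    """Number of pairs of items ordered differently by the two rankings.
    Rankings are position vectors: pi[x] is the position of item x."""
    n = len(pi)
    return sum(1 for x, y in itertools.combinations(range(n), 2)
               if (pi[x] - pi[y]) * (sigma[x] - sigma[y]) < 0)

def consensus_by_enumeration(rankings):
    """Brute-force consensus (central) ranking: the permutation minimizing
    the sum of Kendall distances to the inputs.  Feasible only for very
    small n; the algorithms on the following slides aim to do better."""
    n = len(rankings[0])
    return min(itertools.permutations(range(n)),
               key=lambda pi: sum(kendall_distance(pi, r) for r in rankings))
```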

  19. The Mallows Model • Exponential family model in β: P_β(π | π_0) = Z(β)^(-1) exp(-β d_K(π, π_0)). • β = 0: uniform distribution over permutations. • β > 0: π_0 is the unique mode of P_{β,π_0}. • Theorem [Meila, Phadnis, Patterson, Bilmes 07]: Consensus ranking (i.e. ML estimation of π_0 for constant β) can be solved exactly by a branch and bound (B&B) algorithm. • The B&B algorithm can take (super-)exponential time in some cases; it seems to perform well on simulated data.
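A brute-force Python sketch of the distribution just defined (illustrative only; it reuses `kendall_distance` from the sketch above and enumerates all of S_n, so it is nothing like an efficient procedure):

```python
import itertools
from math import exp

def mallows_pmf(beta, pi0, n):
    """Brute-force Mallows distribution on S_n: P_beta(pi | pi0) is
    proportional to exp(-beta * d_K(pi, pi0)).  Exponential in n; demo only."""
    perms = list(itertools.permutations(range(n)))
    weights = [exp(-beta * kendall_distance(pi, pi0)) for pi in perms]
    Z = sum(weights)
    return {pi: w / Z for pi, w in zip(perms, weights)}

# beta = 0 gives the uniform distribution; beta > 0 concentrates mass on pi0.
pmf = mallows_pmf(beta=1.0, pi0=tuple(range(4)), n=4)
```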

  20. Related work • ML estimation: [Fligner & Verducci 86] introduce the generalized Mallows model and ML estimation; [Fligner & Verducci 88] (FV) give a heuristic for π_0 estimation. • Consensus ranking: [Cohen, Schapire, Singer 99] greedy algorithm (CSS), plus an improvement by finding strongly connected components, plus missing data (not all π_i rank all n items). • [Ailon, Newman, Charikar 05] randomized algorithm with a guaranteed 11/7-factor approximation (ANC). • [Mathieu, 07] (1+ε)-approximation, time O(n^6/ε + 2^(2^O(1/ε))). • [Davenport, Kalagnanan 03] heuristics based on edge-disjoint cycles. • [Conitzer, D, K 05] exact algorithm based on integer programming, better bounds.

  21. Efficient Sorting of the Mallows Model • [Braverman-Mossel-09]: Given r independent samples from the Mallows model, find the ML solution exactly(!) in time n^b, where b = 1 + O((β r)^(-1)) and r is the number of samples, with high probability (say ≥ 1 - n^(-100)).

  22. Sorting the Mallows Model • [Braverman-M-09]: The optimal order can be found in polynomial time and O(n log n) queries. • Proof ingredient 1: "statistical properties" of the generated permutations π_i in terms of the original order π_0. With high probability: Σ_x |π_i(x) - π_0(x)| = O(n) and max_x |π_i(x) - π_0(x)| = O(log n). • Additional ingredient: a dynamic programming algorithm to find π given a starting point in which each element is at most k away from its true position, with running time O(n 2^(6k)).
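As a hedged illustration of how the displacement bounds can be used, here is a naive Python starting-point heuristic (my own simplification for illustration; it is not the authors' dynamic program):

```python
def average_position_order(samples):
    """Rank items by their average position across the sampled permutations.
    The displacement bounds above suggest this places every item within a
    small window of its true position, which a local dynamic program could
    then correct.  Each sample is a position vector: pi[x] is the position
    of item x."""
    n = len(samples[0])
    avg = [sum(pi[x] for pi in samples) / len(samples) for x in range(n)]
    return sorted(range(n), key=lambda x: avg[x])
```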

  23. Example 4: MCMC Diagnostics • Sampling from a joint probability distribution π on Ω: construct a Markov chain X_0, X_1, … which converges to π. • Applications: multidimensional integration, Bayesian inference (sampling from "posterior distributions" of the model), computational physics, computational biology, vision, image segmentation.

  24. Convergence Diagnostics • A method to determine t such that P^t(X_0, ·) is "sufficiently close" to π. - MCMC in Practice, Gilks, Richardson & Spiegelhalter • Are we there yet?
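For a chain whose transition matrix is small enough to write down explicitly, the quantity in question can be computed directly; the following Python/NumPy sketch (illustrative only, names my own) makes "sufficiently close" concrete as total variation distance:

```python
import numpy as np

def tv_to_stationarity(P, x0, t, pi):
    """Total variation distance between P^t(x0, .) and pi for a chain given
    explicitly as a transition matrix (tiny state spaces only, which is
    exactly the regime diagnostics try to avoid needing)."""
    dist = np.zeros(P.shape[0])
    dist[x0] = 1.0
    dist = dist @ np.linalg.matrix_power(P, t)
    return 0.5 * np.abs(dist - pi).sum()

def tau_quarter_from(P, x0, pi, t_max=10_000):
    """Smallest t with TV distance <= 1/4 from the start x0, by direct search;
    a per-start version of the tau(1/4) used on the following slides (there,
    tau(1/4) is taken over the worst starting state)."""
    for t in range(t_max + 1):
        if tv_to_stationarity(P, x0, t, pi) <= 0.25:
            return t
    return None
```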

  25. Convergence Diagnostics in the Literature • Visual inspection of functions of samples. - MCMC in Practice, Gilks, Richardson & Spiegelhalter • [Raftery-Lewis]: Bounding the variance of estimates of quantiles of functions of parameters. • [Gelman-Rubin]: Convergence achieved if the separate empirical distributions are approximately the same as the combined distribution. • [Cowles-Carlin]: Survey of 13 diagnostics designed to identify specific types of convergence failure; recommend combinations of strategies - running parallel chains, computing autocorrelations.
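The Gelman-Rubin criterion mentioned above compares within-chain and between-chain variability of a scalar summary across parallel chains; here is a sketch of the standard potential scale reduction factor in Python (not taken from the slides; names are my own):

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor R-hat for a scalar statistic.
    `chains` is an (m, t) array: m parallel chains, t samples each.
    Values close to 1 are consistent with the chains having mixed; large
    values signal that the separate empirical distributions still differ
    from the combined one."""
    chains = np.asarray(chains, dtype=float)
    m, t = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # within-chain variance
    B = t * chain_means.var(ddof=1)         # between-chain variance
    var_hat = (t - 1) / t * W + B / t
    return np.sqrt(var_hat / W)
```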

  26. Diagnostic Algorithm • A diagnostic algorithm A: • Input: a Markov chain P on Ω ⊆ {0,1}^n and a time t such that either τ(1/4) < t or τ(1/4) > 100t. • Output: "Yes" if τ(1/4) < t, "No" if τ(1/4) > 100t.

  27. Results - The General Case • Theorem [Bhatnagar-Bogdanov-M-09]: Given • a Markov chain P on Ω ⊆ {0,1}^n, and • a time t such that either τ(1/4) < t or τ(1/4) > 100t, • it is a PSPACE-complete problem to distinguish the two.

  28. PSPACE-complete problems • [Crasmaru-Tromp] Ladder problems in Go - does black have a winning strategy? • PSPACE-hard problems are "much harder" than NP-hard problems.

  29. Results - Polynomial Mixing • Theorem 2 [Bhatnagar-Bogdanov-M-09]: Given • a Markov chain P on Ω ⊆ {0,1}^n with a promise that τ(1/4) < poly(n), and • a time t such that either τ(1/4) < t or τ(1/4) > 100t, • it is as hard as any coNP problem to distinguish the two. [Illustration: 3-CNF formulas, e.g. (x0 ∨ x3 ∨ x7) ∧ (x2 ∨ x7 ∨ xn) ∧ … corresponding to τ(1/4) < t, versus (x6 ∨ x3 ∨ x2) ∧ (x0 ∨ x7 ∨ x1) ∧ … corresponding to τ(1/4) > 100t.]

  30. Results - Mixing from a Given State • Theorem 3 [Bhatnagar-Bogdanov-M-09]: Given • a Markov chain P on Ω ⊆ {0,1}^n such that τ(1/4) < poly(n), • a starting state X_0, and • a time t such that either τ_{X_0}(1/4) < t or τ_{X_0}(1/4) > 100t, • a diagnostic algorithm can be used to distinguish whether two efficiently samplable distributions are close or far in statistical distance, a problem that is complete for "Statistical Zero Knowledge" [Sahai-Vadhan]. [Figure: two circuits C1 and C2 on random inputs r_1, …, r_n and r'_1, …, r'_n, each producing outputs a_1, …, a_m.]

  31. Comb. Stat. vs. Common ML Approaches • Some features of Combinatorial Statistics in comparison to ML: • +: Uses explicit, well-defined models. • +: Explicit non-asymptotic running-time / sampling bounds. • -: Often not "statistically efficient". • -: Inputs must be generated from the correct distribution, or close to it (but this is often NP-hard otherwise, so it holds for all methods). • +/-: Need to understand the combinatorial structure of each new problem, versus applying general statistical techniques where the structure is hidden in programming/design etc.

  32. Conclusions and Questions • [Cowles-Carlin] survey 13 diagnostics designed to identify specific types of convergence failure; each can fail to detect the type of failure it was designed for. • [Arora-Barak] Efficient algorithms are not believed to exist for PSPACE-complete, coNP-complete or SZK-complete problems. • Diagnostic algorithms do not exist for large classes of MCMC algorithms, like Gibbs samplers, unless there are efficient algorithms for PSPACE, coNP or SZK. • For what classes of samplers are convergence diagnostics possible? • Hardness of testing convergence for more restricted classes of samplers.

  33. Collaborators • Nayantara Bhatnagar (Berkeley) • Guy Bresler (Berkeley) • Mark Braverman (Microsoft) • Andrej Bogdanov (Hong Kong) • Allan Sly (Berkeley) • Salil Vadhan (Harvard)

  34. Thanks !!
