Fast Probabilistic Modeling for Combinatorial Optimization

Scott Davies • School of Computer Science, Carnegie Mellon University (and Justsystem Pittsburgh Research Center) • Joint work with Shumeet Baluja

Combinatorial Optimization • Maximize “evaluation function” f(x) • input: fixed-length bitstring x = (x1, x2, …, xn) • output: real value • x might represent: • job shop schedules • TSP tours • discretized numeric values • etc. • Our focus: “Black Box” optimization • No domain-dependent heuristics • [Figure: bitstring x = 1001001... fed into black box f(x), which outputs 37.4]

Commonly Used Approaches • Hill-climbing, simulated annealing • Generate candidate solutions “neighboring” single current working solution (e.g. differing by one bit) • Typically make no attempt to model how particular bits affect solution quality • Genetic algorithms • Attempt to implicitly capture dependency of solution quality on bit values by maintaining a population of candidate solutions • Use crossover and mutation operators on population members to generate new candidate solutions

Using Explicit Probabilistic Models • Maintain an explicit probability distribution P from which we generate new candidate solutions • Initialize P to uniform distribution • Until termination criteria met: • Stochastically generate K candidate solutions from P • Evaluate them • Update P to make it more likely to generate solutions “similar” to the “good” solutions • Return best bitstring ever evaluated • Several different choices for what sorts of P to use and how to update it after candidate solution evaluation

Population-Based Incremental Learning • Population-Based Incremental Learning (PBIL) [Baluja, 1995] • Maintains a vector of probabilities: one independent probability P(xi=1) for each bit xi (each initialized to 0.5) • Until termination criteria met: • Generate a population of K bitstrings from P. Evaluate them. • For each of the best M bitstrings of these K, adjust P to make that bitstring more likely: $P(x_i = 1) \leftarrow (1 - \alpha)\,P(x_i = 1) + \alpha$ if xi is set to 1, or $P(x_i = 1) \leftarrow (1 - \alpha)\,P(x_i = 1)$ if xi is set to 0, where α is the learning rate • Also sometimes uses “mutation” and/or pushes P away from bad solutions • Often works very well (compared to hillclimbing and GAs) despite its independence assumption
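
The PBIL loop on this slide can be sketched in Python. This is a minimal sketch rather than the authors' implementation: the function name `pbil`, the OneMax test function `one_max`, the parameter name `alpha` for the learning rate and all default values are assumptions chosen for illustration, and the optional mutation and negative updates are left out.

```python
import random

def pbil(f, n, iterations=200, k=50, m=1, alpha=0.15, seed=0):
    """Maximize f over n-bit lists with a basic PBIL loop (no mutation)."""
    rng = random.Random(seed)
    p = [0.5] * n  # p[i] = P(x_i = 1); every bit is modeled independently
    best_x, best_val = None, float("-inf")
    for _ in range(iterations):
        # Generate K bitstrings from the current probability vector.
        population = [[1 if rng.random() < p[i] else 0 for i in range(n)]
                      for _ in range(k)]
        scored = sorted(((f(x), x) for x in population),
                        key=lambda t: t[0], reverse=True)
        if scored[0][0] > best_val:
            best_val, best_x = scored[0]
        # Shift P toward each of the best M bitstrings.
        for _, x in scored[:m]:
            for i in range(n):
                p[i] = (1.0 - alpha) * p[i] + alpha * x[i]
    return best_x, best_val, p

def one_max(x):
    """Illustrative evaluation function: the number of 1 bits."""
    return sum(x)

best_x, best_val, p = pbil(one_max, 20)
print(best_val, [round(q, 2) for q in p])
```

Writing the update as `(1 - alpha) * p[i] + alpha * x[i]` covers both cases at once: the bit value 1 adds `alpha`, the bit value 0 adds nothing.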

Modeling Inter-Bit Dependencies • MIMIC [De Bonet, et al., 1997]: sort bits into a Markov chain in which each bit’s distribution is conditioned on the value of the previous bit in the chain • e.g.: P(x1, x2, …, xn) = P(x3)*P(x6|x3)*P(x2|x6)*… • Generate chain from best N% of bitstrings evaluated so far; generate new bitstrings from chain; repeat • More general framework: Bayesian networks

Bayesian Networks • [Figure: example network over bits x3, x12, x6, x2, x17, x9] • P(x1,…,xn) = P(x3)*P(x12)*P(x2|x3,x12)*P(x6|x3)*... • Specified by: • Network structure (must be a DAG) • Probability distribution for each variable (bit) given each possible combination of its parents’ values

Optimization with Bayesian Networks • Two operations we need to perform during optimization: • Given a network, generate bitstrings from the probability distribution represented by that network • Trivial with Bayesian networks • Given a set of “good” bitstrings, find the network most likely to have generated it • Given a fixed network structure, trivial to fill in the probability tables optimally • But what structure should we use?
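
Generating bitstrings from a network is usually done by sampling each bit after its parents (ancestral sampling). The sketch below shows one way to do that, assuming a tiny hypothetical three-bit network; the dictionaries `parents` and `cpt`, their numbers, and all function names are invented for illustration and do not come from the talk.

```python
import random

# Hypothetical structure: x0 -> x2 <- x1 (illustrative only).
parents = {0: (), 1: (), 2: (0, 1)}
# cpt[i][parent_values] = P(x_i = 1 | parents take those values)
cpt = {
    0: {(): 0.7},
    1: {(): 0.2},
    2: {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9},
}

def topological_order(parents):
    """Order the nodes so that every node comes after all of its parents."""
    order, placed = [], set()
    while len(order) < len(parents):
        progressed = False
        for node in sorted(parents):
            if node not in placed and all(p in placed for p in parents[node]):
                order.append(node)
                placed.add(node)
                progressed = True
        if not progressed:
            raise ValueError("structure contains a cycle, so it is not a DAG")
    return order

def sample(parents, cpt, rng):
    """Draw one bitstring by sampling each bit given its sampled parents."""
    x = {}
    for node in topological_order(parents):
        key = tuple(x[p] for p in parents[node])
        x[node] = 1 if rng.random() < cpt[node][key] else 0
    return [x[i] for i in sorted(x)]

def joint_probability(x, parents, cpt):
    """P(x) as the product of each bit's conditional probability."""
    prob = 1.0
    for node in parents:
        p_one = cpt[node][tuple(x[p] for p in parents[node])]
        prob *= p_one if x[node] == 1 else 1.0 - p_one
    return prob

rng = random.Random(0)
print(sample(parents, cpt, rng), joint_probability([1, 0, 1], parents, cpt))
```

The cycle check in `topological_order` is there because the factorization only defines a distribution when the structure is a DAG.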

Learning Network Structures from Data • Problem statement: given a dataset D and a set of allowable Bayesian networks, find the network B with the maximum posterior probability: $P(B \mid D) = \frac{P(D \mid B)\,P(B)}{P(D)}$

Learning Network Structures from Data • Maximizing log P(D|B) reduces to maximizing $-\sum_{i=1}^{n} H'(x_i \mid \Pi_i)$, where • $\Pi_i$ is the set of xi’s parents in B • H’(…) is the average entropy of an empirical distribution exhibited in D
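
The reduction from log-likelihood to conditional entropies can be worked out in a few lines. The derivation below is a sketch under two assumptions that the slide does not state explicitly: the datapoints in D are unweighted, and each probability table is filled in with its maximum-likelihood estimate, so that $P(x_i \mid \pi_i)$ equals the empirical conditional frequency $P'(x_i \mid \pi_i)$ observed in D. Here $\Pi_i$ denotes the parents of $x_i$ and $\pi_i$ one particular assignment of values to them.

```latex
\begin{align*}
\log P(D \mid B)
  &= \sum_{d \in D} \sum_{i=1}^{n}
     \log P\bigl(x_i^{(d)} \mid \Pi_i^{(d)}\bigr) \\
  &= |D| \sum_{i=1}^{n} \sum_{x_i,\, \pi_i}
     P'(x_i, \pi_i) \, \log P'(x_i \mid \pi_i) \\
  &= -\,|D| \sum_{i=1}^{n} H'(x_i \mid \Pi_i)
\end{align*}
```

Since |D| is fixed, maximizing the likelihood over structures is the same as maximizing $-\sum_i H'(x_i \mid \Pi_i)$, that is, minimizing the summed conditional entropies.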

Single-Parent Tree-Shaped Networks • [Figure: example tree over bits x17, x22, x5, x2, x13, x19, x6, x14, x8] • Let’s allow each bit to be conditioned on at most one other bit. • Adding an arc from xj to xi increases network score by H(xi) - H(xi|xj) (xj’s “information gain” with xi) = H(xj) - H(xj|xi) (not necessarily obvious, but true) = I(xi, xj), the mutual information between xi and xj
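
The symmetry of the information gain can be checked numerically. The following sketch computes empirical entropies from a small made-up dataset; the function names and the toy data are assumptions for illustration only.

```python
import math
from collections import Counter

def entropy(counts):
    """Entropy in bits of the empirical distribution given by a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)

def conditional_entropy(data, i, j):
    """H(x_i | x_j) = H(x_i, x_j) - H(x_j), estimated from data."""
    joint = entropy(Counter((x[i], x[j]) for x in data))
    return joint - entropy(Counter(x[j] for x in data))

def mutual_information(data, i, j):
    """I(x_i, x_j) = H(x_i) + H(x_j) - H(x_i, x_j), estimated from data."""
    h_i = entropy(Counter(x[i] for x in data))
    h_j = entropy(Counter(x[j] for x in data))
    h_ij = entropy(Counter((x[i], x[j]) for x in data))
    return h_i + h_j - h_ij

# Toy data: bit 1 copies bit 0, bit 2 is independent of both.
data = [[0, 0, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1]]
h0 = entropy(Counter(x[0] for x in data))
h1 = entropy(Counter(x[1] for x in data))
print(h0 - conditional_entropy(data, 0, 1),   # gain from arc x1 -> x0
      h1 - conditional_entropy(data, 1, 0),   # gain from arc x0 -> x1
      mutual_information(data, 0, 1))         # all three agree
```

Both gains equal the mutual information because each expands to H(xi) + H(xj) - H(xi, xj).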

Optimal Single-Parent Tree-Shaped Networks • To find optimal single-parent tree-shaped network, just find maximum spanning tree using I(xi, xj) as the weight for the edge between xi and xj. [Chow and Liu, 1968]. • Can be done in O(n^2) time (assuming D has already been reduced to sufficient statistics)
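
One way to build such a tree is Prim's algorithm on the dense matrix of pairwise mutual information, which takes O(n^2) steps once the weights are available. The sketch below is illustrative only: it recomputes the weights from raw data instead of from precomputed sufficient statistics, and the names `chow_liu_tree` and `mutual_information` as well as the choice of bit 0 as root are assumptions.

```python
import math
from collections import Counter

def mutual_information(data, i, j):
    """Empirical I(x_i, x_j) in bits."""
    n = len(data)
    c_ij = Counter((x[i], x[j]) for x in data)
    c_i = Counter(x[i] for x in data)
    c_j = Counter(x[j] for x in data)
    return sum((c / n) * math.log2(c * n / (c_i[a] * c_j[b]))
               for (a, b), c in c_ij.items())

def chow_liu_tree(data, root=0):
    """Return parent[i] for every bit (None for the root)."""
    n_bits = len(data[0])
    weight = [[mutual_information(data, i, j) if i != j else 0.0
               for j in range(n_bits)] for i in range(n_bits)]
    parent = [None] * n_bits
    in_tree = [False] * n_bits
    in_tree[root] = True
    best_gain = list(weight[root])   # best edge weight into the tree so far
    best_link = [root] * n_bits      # tree node that provides that edge
    for _ in range(n_bits - 1):
        # Add the outside node with the largest mutual information
        # to any node already in the tree.
        nxt = max((j for j in range(n_bits) if not in_tree[j]),
                  key=lambda j: best_gain[j])
        in_tree[nxt] = True
        parent[nxt] = best_link[nxt]
        for j in range(n_bits):
            if not in_tree[j] and weight[nxt][j] > best_gain[j]:
                best_gain[j] = weight[nxt][j]
                best_link[j] = nxt
    return parent

# Toy data: bit 1 copies bit 0, bit 3 copies bit 2.
data = [[0, 0, 0, 0], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 1, 1]]
print(chow_liu_tree(data))
```

Because mutual information is symmetric, the choice of root changes only the direction of the arcs, not the total score of the tree.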

Optimal Dependency Trees for Combinatorial Optimization [Baluja & Davies, 1997] • Start with a dataset D initialized from the uniform distribution • Until termination criteria met: • Build optimal dependency tree T with which to model D. • Generate K bitstrings from probability distribution represented by T. Evaluate them. • Add best M bitstrings to D after decaying the weight of all datapoints already in D by a factor α between 0 and 1. • Return best bitstring ever evaluated. • Running time: O(K*n + M*n^2) per iteration
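
The dataset update in this loop (decay the old weights, then add the best M new bitstrings) can be sketched as follows. The representation of D as a list of (weight, bitstring) pairs, the weight 1.0 given to new datapoints, and the names `update_dataset` and `weighted_pair_counts` are all assumptions made for illustration; an implementation that keeps only pairwise counts as sufficient statistics would decay those counts directly.

```python
def update_dataset(dataset, candidates, values, m, alpha):
    """Decay existing weights by alpha, then append the best m candidates."""
    decayed = [(w * alpha, x) for w, x in dataset]
    ranked = sorted(zip(values, candidates), key=lambda t: t[0], reverse=True)
    return decayed + [(1.0, x) for _, x in ranked[:m]]

def weighted_pair_counts(dataset, i, j):
    """Weighted counts of (x_i, x_j) value pairs for one pair of bits."""
    counts = {}
    for w, x in dataset:
        key = (x[i], x[j])
        counts[key] = counts.get(key, 0.0) + w
    return counts

old = [(1.0, [0, 0])]
new = update_dataset(old, [[1, 1], [0, 1], [1, 0]], [2.0, 1.0, 1.0],
                     m=1, alpha=0.9)
print(new, weighted_pair_counts(new, 0, 1))
```

Counts like these, normalized by the total weight, are what the entropy and mutual-information estimates would be computed from.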

Graph-Coloring Example • Noisy graph-coloring example • For each edge connecting vertices of different colors, add +1 to evaluation function with probability 0.5
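
A noisy evaluation function of this kind might look like the sketch below. The slide does not say how colors are encoded in the bitstring, so the fixed-width binary encoding (`bits_per_vertex` bits per vertex), the function name and the example graph are assumptions.

```python
import random

def noisy_coloring_score(bits, edges, bits_per_vertex, rng, p_count=0.5):
    """Count properly colored edges, each counted only with probability p_count."""
    def color(v):
        chunk = bits[v * bits_per_vertex:(v + 1) * bits_per_vertex]
        return int("".join(map(str, chunk)), 2)

    score = 0
    for u, v in edges:
        # An edge contributes only if its endpoints differ in color,
        # and even then only with probability p_count.
        if color(u) != color(v) and rng.random() < p_count:
            score += 1
    return score

triangle = [(0, 1), (1, 2), (0, 2)]
rng = random.Random(0)
# Vertex colors 0, 1, 2 encoded with two bits each.
print(noisy_coloring_score([0, 0, 0, 1, 1, 0], triangle, 2, rng))
```

Because of the noise, evaluating the same bitstring twice can return different values, which is what makes the problem hard for methods that trust a single evaluation.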

Optimal Dependency Tree Results • Tested following optimization algorithms on variety of small problems, up to 256 bits long: • Optimization with optimal dependency trees • Optimization with chain-shaped networks (à la MIMIC) • PBIL • Hillclimbing • Genetic algorithms • General trend: • Hillclimbing & GAs did relatively poorly • Among probabilistic methods, PBIL < Chains < Trees: more accurate models are better

Modeling Higher-Order Dependencies • The maximum spanning tree algorithm gives us the optimal Bayesian Network in which each node has at most one parent. • What about finding the best network in which each node has at most K parents for K>1? • NP-complete problem! [Chickering, et al., 1995] • However, can use search heuristics to look for “good” network structures (e.g., [Heckerman, et al., 1995]), e.g. hillclimbing.

Bayesian Network-Based Combinatorial Optimization • Initialize D with C bitstrings from uniform distribution, and Bayesian network B to empty network containing no edges • Until termination criteria met: • Perform steepest-ascent hillclimbing from B to find locally optimal network B’. Repeat until no changes increase score: • Evaluate how each possible edge addition, removal or reversal would affect penalized log-likelihood score • Perform change that maximizes increase in score. • Set B to B’. • Generate and evaluate K bitstrings from B. • Decay weight of datapoints in D by α. • Add best M of the K recently generated datapoints to D. • Return best bitstring ever evaluated.

Bayesian Networks Results • Does better than Tree-based optimization algorithm on some toy problems • Roughly the same as Tree-based algorithm on others • Significantly more computation than Tree-based algorithm, however, despite efficiency hacks • Why not much better results? • Too much emphasis on exploitation rather than exploration? • Steepest-ascent hillclimbing over network structures not good enough, particularly when starting from old networks?

Using Probabilistic Models for Intelligent Restarts • Tree-based algorithm’s O(n^2) execution time per iteration very expensive for large problems • Even more so for algorithm based on more complicated Bayesian networks • One possible approach: use probabilistic models to select good starting points for faster optimization algorithms, e.g. hillclimbing or PBIL

COMIT • Combining Optimizers with Mutual Information Trees [Baluja and Davies, 1998] • Initialize dataset D with bitstrings drawn from uniform distribution • Until termination criteria met: • Build optimal dependency tree T with which to model D. • Use T to stochastically generate K bitstrings. Evaluate them. • Execute a fast-search procedure, initialized with the best solutions generated from T. • Replace up to M of the worst bitstrings in D with the best bitstrings found during the fast-search run just executed. • Return best bitstring ever evaluated

COMIT with Hill-climbing • Use single best bitstring generated by tree as starting point of a stochastic hillclimbing algorithm that allows at most PATIENCE moves to points of equal value before giving up • |D| kept at 1000; M set to 100. • We compare COMIT vs.: • Hillclimbing with restarts from bitstrings chosen randomly from uniform distribution • Hillclimbing with restarts from best bitstring out of K chosen randomly from uniform distribution
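
A stochastic hillclimber with a patience limit could be sketched as below. The slide gives only the PATIENCE rule, so the single-bit-flip neighborhood, the reset of the counter after an improvement, the evaluation budget `max_evals` and the function name are assumptions.

```python
import random

def hillclimb(f, start, patience, max_evals, rng):
    """Flip random bits; accept improvements and up to `patience` equal moves."""
    current = list(start)
    current_val = f(current)
    evals, equal_moves = 1, 0
    while evals < max_evals and equal_moves < patience:
        i = rng.randrange(len(current))
        current[i] ^= 1                  # propose a single-bit flip
        val = f(current)
        evals += 1
        if val > current_val:
            current_val, equal_moves = val, 0   # assumed: counter resets
        elif val == current_val:
            equal_moves += 1             # sideways move, accepted
        else:
            current[i] ^= 1              # worse: undo the flip
    return current, current_val, evals

x, val, evals = hillclimb(sum, [0] * 10, patience=5, max_evals=1000,
                          rng=random.Random(1))
print(x, val, evals)
```

In COMIT the `start` argument would be the best bitstring generated by the tree rather than a uniformly random one.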

COMIT w/Hillclimbing: Example of Behavior • TSP domain: 100 cities, 700 bits • [Figure: tour length (×10^3) versus evaluation number (×10^3), shown for the hillclimber and for COMIT]

COMIT w/Hillclimbing: Results • Each number is average over at least 25 runs • Each algorithm given 200,000 evaluation function calls • Highlighted: better than each non-COMIT hillclimber with confidence > 95% • AHCxxx: pick best of xxx randomly generated starting points before hillclimbing • COMITxxx: pick best of xxx starting points generated by tree before hillclimbing

COMIT with PBIL • Generate K samples from T; evaluate them • Initialize PBIL’s P vector according to the unconditional distributions contained in the best C% of these examples • PBIL run terminated after 5000 evaluations without improvement • |D| kept at 1000; M set to 100 (as before) • PBIL parameters: α = 0.15; update P with single best example out of 50 each iteration; some “mutation” of P used as well
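
Initializing the P vector from the best C% of the tree's samples amounts to taking per-bit frequencies over those samples. A minimal sketch, in which the function name `init_p_vector` and the rounding of the cutoff upward are assumptions:

```python
import math

def init_p_vector(samples, values, c_percent):
    """Per-bit frequency of 1s among the best c_percent of the samples."""
    ranked = sorted(zip(values, samples), key=lambda t: t[0], reverse=True)
    keep = max(1, math.ceil(len(ranked) * c_percent / 100))
    top = [x for _, x in ranked[:keep]]
    n = len(top[0])
    return [sum(x[i] for x in top) / len(top) for i in range(n)]

samples = [[1, 1], [1, 0], [0, 0], [0, 1]]
values = [4.0, 3.0, 2.0, 1.0]
print(init_p_vector(samples, values, 50))
```

Only the marginal frequencies carry over, so any dependencies the tree captured reach PBIL solely through which samples ended up in the top C%.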

COMIT w/PBIL: Results • Each number is average over at least 50 runs • Each algorithm given 600,000 evaluation function calls • Highlighted: better than opposing algorithm(s) with confidence > 95% • PBIL+R: PBIL with restart after 5000 evaluations without improvement (P reinitialized to 0.5)

Conclusions & Future Work • COMIT makes probabilistic modeling applicable to much larger problems • COMIT led to significant improvements over baseline search algorithm in most problems tested • Future work • Applying COMIT to more interesting problems • Using COMIT to combine results of multiple search algorithms • Incorporating domain knowledge

Future Work • Making algorithm based on complex Bayesian Networks more practical • Combine w/simpler search algorithms, à la COMIT? • Applying COMIT to more interesting problems and other baseline search algorithms • WALKSAT? • Using more sophisticated probabilistic models • Using COMIT to combine results of multiple search algorithms
