Fast probabilistic modeling for combinatorial optimization

Fast Probabilistic Modeling for Combinatorial Optimization

Scott Davies

School of Computer Science

Carnegie Mellon University

(and Justsystem Pittsburgh Research Center)

Joint work with Shumeet Baluja

Combinatorial Optimization

  • Maximize “evaluation function” f(x)

    • input: fixed-length bitstring x = (x1, x2, …, xn)

    • output: real value

  • x might represent:

    • job shop schedules

    • TSP tours

    • discretized numeric values

    • etc.

  • Our focus: “Black Box” optimization

    • No domain-dependent heuristics




Commonly Used Approaches

  • Hill-climbing, simulated annealing

    • Generate candidate solutions “neighboring” single current working solution (e.g. differing by one bit)

    • Typically make no attempt to model how particular bits affect solution quality

  • Genetic algorithms

    • Attempt to implicitly capture dependency of solution quality on bit values by maintaining a population of candidate solutions

    • Use crossover and mutation operators on population members to generate new candidate solutions

Using Explicit Probabilistic Models

  • Maintain an explicit probability distribution P from which we generate new candidate solutions

    • Initialize P to uniform distribution

    • Until termination criteria met:

      • Stochastically generate K candidate solutions from P

      • Evaluate them

      • Update P to make it more likely to generate solutions “similar” to the “good” solutions

    • Return best bitstring ever evaluated

  • Several different choices for what sorts of P to use and how to update it after candidate solution evaluation

Population-Based Incremental Learning

  • Population-Based Incremental Learning (PBIL) [Baluja, 1995]

    • Maintains a vector of probabilities: one independent probability P(xi=1) for each bit xi. (Each initialized to 0.5)

    • Until termination criteria met:

      • Generate a population of K bitstrings from P. Evaluate them.

      • For each of the best M bitstrings of these K, adjust P to make that bitstring more likely:

      • Also sometimes uses “mutation” and/or pushes P away from bad solutions

    • Often works very well (compared to hillclimbing and GAs) despite its independence assumption

      P(xi=1) ← (1 − a)·P(xi=1) + a    if xi is set to 1 in that bitstring

      P(xi=1) ← (1 − a)·P(xi=1)        if xi is set to 0
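A minimal runnable sketch of this update rule (function names and the learning-rate value a are ours, not from the talk):

```python
import random

def pbil_update(p, good, a=0.1):
    """Shift each P(xi=1) toward the bits of one good solution:
    P(xi=1) <- (1 - a)*P(xi=1) + a  if xi = 1 in the good bitstring,
    P(xi=1) <- (1 - a)*P(xi=1)      if xi = 0."""
    return [(1 - a) * pi + a * bit for pi, bit in zip(p, good)]

def sample(p):
    """Draw one candidate bitstring from the independent-bit model."""
    return [1 if random.random() < pi else 0 for pi in p]

p = [0.5, 0.5, 0.5]            # each P(xi=1) initialized to 0.5
p = pbil_update(p, [1, 0, 1])  # adjust toward one good bitstring
```

Repeated over many generations, the probability vector drifts toward the bit patterns of the good solutions it is updated with.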


Modeling Inter-Bit Dependencies

  • MIMIC [De Bonet, et al., 1997]: sort bits into a Markov chain in which each bit’s distribution is conditioned on the value of the previous bit in the chain

    • e.g.: P(x1, x2, …, xn) = P(x3)*P(x6|x3)* P(x2|x6)*…

    • Generate chain from best N% of bitstrings evaluated so far; generate new bitstrings from chain; repeat

  • More general framework: Bayesian networks

Bayesian Networks

P(x1,…,xn) = P(x3)*P(x12)*P(x2|x3,x12)*P(x6|x3)*...

  • Specified by:

    • Network structure (must be a DAG)

    • Probability distribution for each variable (bit) given each possible combination of its parents’ values

Optimization with Bayesian Networks

  • Two operations we need to perform during optimization:

    • Given a network, generate bitstrings from the probability distribution represented by that network

      • Trivial with Bayesian networks

    • Given a set of “good” bitstrings, find the network most likely to have generated them

      • Given a fixed network structure, trivial to fill in the probability tables optimally

      • But what structure should we use?
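Sampling is trivial because each bit can be drawn once its parents are fixed (ancestral sampling). A sketch with a made-up three-bit network in the style of the factorization above; the structure and probabilities are illustrative, not from the talk:

```python
import random

# Toy network: P(x1, x2, x3) = P(x3) * P(x1|x3) * P(x2|x3)
parents = {3: [], 1: [3], 2: [3]}
# cpt[i] maps a tuple of parent values to P(xi = 1)
cpt = {
    3: {(): 0.7},
    1: {(0,): 0.2, (1,): 0.9},
    2: {(0,): 0.5, (1,): 0.1},
}

def ancestral_sample(parents, cpt, order):
    """Generate one bitstring: sample each bit after its parents,
    so 'order' must be a topological order of the DAG."""
    x = {}
    for i in order:
        pvals = tuple(x[p] for p in parents[i])
        x[i] = 1 if random.random() < cpt[i][pvals] else 0
    return x

bits = ancestral_sample(parents, cpt, order=[3, 1, 2])
```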

Learning Network Structures from Data

  • Problem statement: given a dataset D and a set of allowable Bayesian networks, find the network B maximizing the posterior probability P(B|D) ∝ P(D|B)·P(B)

Learning Network Structures from Data

Maximizing log P(D|B) reduces to maximizing −Σi H'(xi | Πi), where:

  • Πi is the set of xi’s parents in B

  • H'(xi | Πi) is the conditional entropy of xi given its parents, under the empirical distribution exhibited in D

Single-Parent Tree-Shaped Networks

  • Let’s allow each bit to be conditioned on at most one other bit.

  • Adding an arc from xj to xi increases network score by

    H(xi) - H(xi|xj) (xj’s “information gain” with xi)

    = H(xj) - H(xj|xi) (not necessarily obvious, but true)

    = I(xi, xj), the mutual information between xi and xj
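The mutual information can be estimated directly from bit co-occurrence counts in the dataset; a sketch (the function name is ours):

```python
from math import log2

def mutual_information(data, i, j):
    """Empirical mutual information I(xi, xj) between bit columns i and j."""
    n = len(data)
    def p(pred):
        """Fraction of datapoints satisfying a predicate."""
        return sum(1 for row in data if pred(row)) / n
    total = 0.0
    for a in (0, 1):
        for b in (0, 1):
            pab = p(lambda r: r[i] == a and r[j] == b)
            if pab > 0:
                pa = p(lambda r: r[i] == a)
                pb = p(lambda r: r[j] == b)
                total += pab * log2(pab / (pa * pb))
    return total
```

Two perfectly correlated bits give 1 bit of mutual information; independent bits give 0.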

Optimal Single-Parent Tree-Shaped Networks

  • To find optimal single-parent tree-shaped network, just find maximum spanning tree using I(xi, xj) as the weight for the edge between xi and xj. [Chow and Liu, 1968].

  • Can be done in O(n²) time (assuming D has already been reduced to sufficient statistics)
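A sketch of the Chow-Liu construction using Prim’s algorithm (names are ours; for clarity this recomputes each I(xi, xj) on demand rather than caching the table of sufficient statistics that the O(n²) bound assumes):

```python
from math import log2

def mi(data, i, j):
    """Empirical mutual information I(xi, xj) between bit columns i and j."""
    n = len(data)
    total = 0.0
    for a in (0, 1):
        for b in (0, 1):
            pab = sum(1 for r in data if r[i] == a and r[j] == b) / n
            pa = sum(1 for r in data if r[i] == a) / n
            pb = sum(1 for r in data if r[j] == b) / n
            if pab > 0:
                total += pab * log2(pab / (pa * pb))
    return total

def chow_liu_parents(data, n):
    """Grow a maximum spanning tree over the n bits with I(xi, xj) as edge
    weights (Prim's algorithm); returns {bit: parent}, root mapped to None."""
    parent = {0: None}                        # arbitrary choice of root
    while len(parent) < n:
        i, j = max(((i, j) for i in parent for j in range(n) if j not in parent),
                   key=lambda e: mi(data, e[0], e[1]))
        parent[j] = i
    return parent
```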

Optimal Dependency Trees for Combinatorial Optimization

[Baluja & Davies, 1997]

  • Start with a dataset D initialized from the uniform distribution

  • Until termination criteria met:

    • Build optimal dependency tree T with which to model D.

    • Generate K bitstrings from probability distribution represented by T. Evaluate them.

    • Add best M bitstrings to D after decaying the weight of all datapoints already in D by a factor a between 0 and 1.

  • Return best bitstring ever evaluated.

  • Running time: O(K·n + M·n²) per iteration
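Putting the pieces together, one plausible rendering of this loop (our function names; Laplace smoothing added so sampling never divides by zero; the demo objective simply counts 1 bits):

```python
import random
from math import log2

def wcount(data, w, pred):
    """Total weight of datapoints satisfying a predicate."""
    return sum(wi for row, wi in zip(data, w) if pred(row))

def weighted_mi(data, w, i, j):
    """I(xi, xj) under the weighted empirical distribution over D."""
    tot = sum(w)
    total = 0.0
    for a in (0, 1):
        for b in (0, 1):
            pab = wcount(data, w, lambda r: r[i] == a and r[j] == b) / tot
            pa = wcount(data, w, lambda r: r[i] == a) / tot
            pb = wcount(data, w, lambda r: r[j] == b) / tot
            if pab > 0:
                total += pab * log2(pab / (pa * pb))
    return total

def build_tree(data, w, n):
    """Optimal dependency tree: maximum spanning tree with MI edge weights."""
    parent = {0: None}
    while len(parent) < n:
        i, j = max(((i, j) for i in parent for j in range(n) if j not in parent),
                   key=lambda e: weighted_mi(data, w, e[0], e[1]))
        parent[j] = i
    return parent

def cond_prob(data, w, i, par, v):
    """Smoothed P(xi = 1 | x_par = v); par is None for the root."""
    num = wcount(data, w, lambda r: r[i] == 1 and (par is None or r[par] == v))
    den = wcount(data, w, lambda r: par is None or r[par] == v)
    return (num + 1.0) / (den + 2.0)          # Laplace smoothing

def sample_tree(parent, data, w):
    """Generate one bitstring from the tree, parents before children."""
    x, pending = {}, set(parent)
    while pending:
        for i in sorted(pending):
            if parent[i] is None or parent[i] in x:
                par = parent[i]
                v = x[par] if par is not None else 0
                x[i] = 1 if random.random() < cond_prob(data, w, i, par, v) else 0
                pending.discard(i)
    return tuple(x[i] for i in range(len(parent)))

def tree_optimize(f, n, iters=12, K=25, M=8, decay=0.9):
    """The slide's loop: fit a tree to D, sample K strings, decay the old
    datapoint weights, add the best M new strings to D."""
    data = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(K)]
    w = [1.0] * K
    best = max(data, key=f)
    for _ in range(iters):
        parent = build_tree(data, w, n)
        samples = [sample_tree(parent, data, w) for _ in range(K)]
        best = max(samples + [best], key=f)
        w = [wi * decay for wi in w]                      # decay old datapoints
        data += sorted(samples, key=f, reverse=True)[:M]  # keep best M
        w += [1.0] * M
    return best

random.seed(1)
best = tree_optimize(lambda x: sum(x), n=8)   # demo: maximize the number of 1s
```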

Graph-Coloring Example

  • Noisy Graph-Coloring example

    • For each edge connecting two vertices of different colors, add +1 to the evaluation function with probability 0.5
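A sketch of this noisy evaluation function on a made-up instance (a triangle graph, two colors encoded as one bit per vertex):

```python
import random

def noisy_coloring_score(colors, edges):
    """Noisy graph-coloring evaluation: each edge whose endpoints have
    different colors contributes +1 with probability 0.5."""
    return sum(1 for u, v in edges
               if colors[u] != colors[v] and random.random() < 0.5)

# Toy instance (ours): a triangle with coloring [0, 1, 0].
edges = [(0, 1), (1, 2), (0, 2)]
score = noisy_coloring_score([0, 1, 0], edges)   # 2 bichromatic edges
```

The noise means repeated evaluations of the same coloring return different values, which is exactly what makes the problem a stress test for model-based optimizers.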

Optimal Dependency Tree Results

  • Tested the following optimization algorithms on a variety of small problems, up to 256 bits long:

    • Optimization with optimal dependency trees

    • Optimization with chain-shaped networks (à la MIMIC)

    • PBIL

    • Hillclimbing

    • Genetic algorithms

  • General trend:

    • Hillclimbing & GAs did relatively poorly

    • Among probabilistic methods, PBIL < Chains < Trees: more accurate models are better

Modeling Higher-Order Dependencies

  • The maximum spanning tree algorithm gives us the optimal Bayesian Network in which each node has at most one parent.

  • What about finding the best network in which each node has at most K parents for K>1?

    • NP-complete problem! [Chickering, et al., 1995]

    • However, can use search heuristics, e.g. hillclimbing, to look for “good” network structures [Heckerman, et al., 1995]

Bayesian Network-Based Combinatorial Optimization

  • Initialize D with C bitstrings from the uniform distribution, and Bayesian network B to the empty network (no edges)

  • Until termination criteria met:

    • Perform steepest-ascent hillclimbing from B to find locally optimal network B’. Repeat until no changes increase score:

      • Evaluate how each possible edge addition, deletion, or reversal would affect the penalized log-likelihood score

      • Perform change that maximizes increase in score.

    • Set B to B’.

    • Generate and evaluate K bitstrings from B.

    • Decay weight of datapoints in D by a.

    • Add best M of the K recently generated datapoints to D.

  • Return best bitstring ever evaluated.

Bayesian Networks Results

  • Does better than the Tree-based optimization algorithm on some toy problems

  • Roughly the same as the Tree-based algorithm on others

  • Requires significantly more computation than the Tree-based algorithm, however, despite efficiency hacks

  • Why aren’t the results much better?

    • Too much emphasis on exploitation rather than exploration?

    • Steepest-ascent hillclimbing over network structures not good enough, particularly when starting from old networks?

Using Probabilistic Models for Intelligent Restarts

  • Tree-based algorithm’s O(n²) execution time per iteration very expensive for large problems

  • Even more so for algorithm based on more complicated Bayesian networks

  • One possible approach: use probabilistic models to select good starting points for faster optimization algorithms, e.g. hillclimbing or PBIL


  • Combining Optimizers with Mutual Information Trees [Baluja and Davies, 1998]

    • Initialize dataset D with bitstrings drawn from uniform distribution

    • Until termination criteria met:

      • Build optimal dependency tree T with which to model D.

      • Use T to stochastically generate K bitstrings. Evaluate them.

      • Execute a fast-search procedure, initialized with the best solutions generated from T.

      • Replace up to M of the worst bitstrings in D with the best bitstrings found during the fast-search run just executed.

    • Return best bitstring ever evaluated
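A schematic of the COMIT loop with the model learner and fast-search procedure left as parameters. The demo below plugs in an independent-bit model and a greedy bit-flip hillclimber as stand-ins; the talk’s actual components are the dependency tree and the search procedures on the following slides:

```python
import random

def comit(f, n, fit, sample, fast_search, iters=10, K=50, M=10, pop=100):
    """COMIT loop from the slide: model D, sample K strings, run a fast
    search from the best sample, fold its best results back into D."""
    D = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(pop)]
    best = max(D, key=f)
    for _ in range(iters):
        model = fit(D)
        samples = [sample(model) for _ in range(K)]
        start = max(samples, key=f)            # best model-sampled solution
        found = fast_search(f, start)          # bitstrings seen during search
        best = max(found + [start, best], key=f)
        D.sort(key=f)                          # worst datapoints first
        D[:M] = sorted(found, key=f, reverse=True)[:M]  # replace up to M worst
    return best

# Stand-ins for the demo (not the talk's actual components):
def fit(D):
    """Independent-bit model: per-bit frequency of 1s in the dataset."""
    n = len(D[0])
    return [sum(r[i] for r in D) / len(D) for i in range(n)]

def sample(p):
    return tuple(1 if random.random() < pi else 0 for pi in p)

def hillclimb(f, x, steps=150):
    """Greedy single-bit-flip hillclimbing; returns the points it accepted."""
    accepted = [x]
    for _ in range(steps):
        i = random.randrange(len(x))
        y = x[:i] + (1 - x[i],) + x[i + 1:]
        if f(y) >= f(x):
            x = y
            accepted.append(x)
    return accepted
```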

COMIT with Hill-climbing

  • Use single best bitstring generated by tree as starting point of a stochastic hillclimbing algorithm that allows at most PATIENCE moves to points of equal value before giving up

  • |D| kept at 1000; M set to 100.

  • We compare COMIT vs.:

    • Hillclimbing with restarts from bitstrings chosen randomly from uniform distribution

    • Hillclimbing with restarts from best bitstring out of K chosen randomly from uniform distribution
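One reading of the PATIENCE stopping rule, sketched as code (here every non-improving step counts toward the patience budget; the talk does not spell out the exact bookkeeping):

```python
import random

def patient_hillclimb(f, x, patience=200):
    """Stochastic bit-flip hillclimbing that gives up after 'patience'
    consecutive steps without an improvement."""
    stale = 0
    while stale < patience:
        i = random.randrange(len(x))
        y = x[:i] + (1 - x[i],) + x[i + 1:]
        if f(y) > f(x):
            x, stale = y, 0         # improvement: reset the counter
        else:
            if f(y) == f(x):
                x = y               # take equal-valued moves...
            stale += 1              # ...but count them toward giving up
    return x
```

Equal-valued moves let the search drift across plateaus; the patience bound keeps it from drifting forever.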

COMIT w/Hillclimbing: Example of Behavior

  • TSP domain: 100 cities, 700 bits


[Plot: tour length (×10³) vs. evaluation number (×10³)]

COMIT w/Hillclimbing: Results

  • Each number is average over at least 25 runs

  • Each algorithm given 200,000 evaluation function calls

  • Highlighted: better than each non-COMIT hillclimber with confidence > 95%

  • AHCxxx: pick best of xxx randomly generated starting points before hillclimbing

  • COMITxxx: pick best of xxx starting points generated by tree before hillclimbing

COMIT with PBIL

  • Generate K samples from T; evaluate them

  • Initialize PBIL’s P vector according to the unconditional distributions contained in the best C% of these examples

  • PBIL run terminated after 5000 evaluations without improvement

  • |D| kept at 1000; M set to 100 (as before)

  • PBIL parameters: a = 0.15; update P with the single best example out of 50 each iteration; some “mutation” of P used as well

COMIT w/PBIL: Results

  • Each number is average over at least 50 runs

  • Each algorithm given 600,000 evaluation function calls

  • Highlighted: better than opposing algorithm(s) with confidence > 95%

  • PBIL+R: PBIL with restart after 5000 evaluations without improvement (P reinitialized to 0.5)

Conclusions & Future Work

  • COMIT makes probabilistic modeling applicable to much larger problems

  • COMIT led to significant improvements over baseline search algorithm in most problems tested

  • Future work

    • Applying COMIT to more interesting problems

    • Using COMIT to combine results of multiple search algorithms

    • Incorporating domain knowledge

Future Work

  • Making algorithm based on complex Bayesian Networks more practical

    • Combine w/simpler search algorithms, à la COMIT?

  • Applying COMIT to more interesting problems and other baseline search algorithms

    • WALKSAT?

  • Using more sophisticated probabilistic models

  • Using COMIT to combine results of multiple search algorithms