Fast Probabilistic Modeling for Combinatorial Optimization

Scott Davies

School of Computer Science

Carnegie Mellon University

(and Justsystem Pittsburgh Research Center)

Joint work with Shumeet Baluja



Combinatorial Optimization

  • Maximize “evaluation function” f(x)

    • input: fixed-length bitstring x = (x1, x2, …,xn)

    • output: real value

  • x might represent:

    • job shop schedules

    • TSP tours

    • discretized numeric values

    • etc.

  • Our focus: “Black Box” optimization

    • No domain-dependent heuristics

[Diagram: a bitstring x = 1001001... is fed to the black box f(x), which returns a real value, e.g. 37.4]



Commonly Used Approaches

  • Hill-climbing, simulated annealing

    • Generate candidate solutions “neighboring” single current working solution (e.g. differing by one bit)

    • Typically make no attempt to model how particular bits affect solution quality

  • Genetic algorithms

    • Attempt to implicitly capture dependency of solution quality on bit values by maintaining a population of candidate solutions

    • Use crossover and mutation operators on population members to generate new candidate solutions



Using Explicit Probabilistic Models

  • Maintain an explicit probability distribution P from which we generate new candidate solutions

    • Initialize P to uniform distribution

    • Until termination criteria met:

      • Stochastically generate K candidate solutions from P

      • Evaluate them

      • Update P to make it more likely to generate solutions “similar” to the “good” solutions

    • Return best bitstring ever evaluated

  • Several different choices for what sorts of P to use and how to update it after candidate solution evaluation



Population-Based Incremental Learning

  • Population-Based Incremental Learning (PBIL) [Baluja, 1995]

    • Maintains a vector of probabilities: one independent probability P(xi=1) for each bit xi. (Each initialized to 0.5)

    • Until termination criteria met:

      • Generate a population of K bitstrings from P. Evaluate them.

      • For each of the best M bitstrings of these K, adjust P to make that bitstring more likely:

      • Also sometimes uses “mutation” and/or pushes P away from bad solutions

    • Often works very well (compared to hillclimbing and GAs) despite its independence assumption

P(xi=1) ← (1 − α)·P(xi=1) + α   if xi is set to 1

P(xi=1) ← (1 − α)·P(xi=1)   if xi is set to 0

(α is PBIL’s learning rate)
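PBIL as described above fits in a few lines of Python. This is a minimal sketch, not the authors' implementation: parameter names such as `lr`, and the OneMax-style test function used below, are illustrative choices of ours.

```python
import random

def pbil(f, n, K=50, M=2, lr=0.1, iters=60, seed=None):
    """Minimal PBIL sketch: one independent probability P(xi=1) per bit."""
    rng = random.Random(seed)
    p = [0.5] * n                                # each P(xi=1) starts at 0.5
    best_x, best_f = None, float("-inf")
    for _ in range(iters):
        # generate a population of K bitstrings from P and evaluate them
        pop = [[1 if rng.random() < p[i] else 0 for i in range(n)]
               for _ in range(K)]
        scored = sorted(((f(x), x) for x in pop), reverse=True)
        if scored[0][0] > best_f:
            best_f, best_x = scored[0]
        # for each of the best M bitstrings, adjust P toward that bitstring:
        #   P(xi=1) <- (1-lr)*P(xi=1) + lr   if xi = 1
        #   P(xi=1) <- (1-lr)*P(xi=1)        if xi = 0
        for _, x in scored[:M]:
            for i in range(n):
                p[i] = (1 - lr) * p[i] + lr * x[i]
    return best_x, best_f
```

On a simple function such as counting ones, the probability vector quickly drifts toward the all-ones string, which is the behavior the independence assumption is good at.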



Modeling Inter-Bit Dependencies

  • MIMIC [De Bonet, et al., 1997]: sort the bits into a Markov chain in which each bit’s distribution is conditioned on the value of the previous bit in the chain

    • e.g.: P(x1, x2, …, xn) = P(x3)*P(x6|x3)* P(x2|x6)*…

    • Generate chain from best N% of bitstrings evaluated so far; generate new bitstrings from chain; repeat

  • More general framework: Bayesian networks
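A greedy chain construction in the spirit of MIMIC can be sketched as follows; the tie-breaking and direction conventions here are our simplifications, not De Bonet et al.'s exact procedure. It starts with the bit of lowest empirical entropy, then repeatedly appends the unused bit with the lowest conditional entropy given the previous bit.

```python
from math import log2

def entropy(col):
    """Entropy of a single bit column's empirical distribution."""
    p = sum(col) / len(col)
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

def cond_entropy(a, b):
    """Empirical H(a | b) for two equal-length bit columns."""
    h = 0.0
    for v in (0, 1):
        rows = [ai for ai, bi in zip(a, b) if bi == v]
        if rows:
            h += len(rows) / len(a) * entropy(rows)
    return h

def mimic_chain(D):
    """Greedy chain order: lowest-entropy bit first, then repeatedly the
    unused bit with lowest conditional entropy given the previous bit."""
    n = len(D[0])
    cols = [[row[i] for row in D] for i in range(n)]
    order = [min(range(n), key=lambda i: entropy(cols[i]))]
    while len(order) < n:
        prev, rest = order[-1], [i for i in range(n) if i not in order]
        order.append(min(rest, key=lambda i: cond_entropy(cols[i], cols[prev])))
    return order
```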


Bayesian Networks

[Diagram: example Bayesian network over bits x3, x12, x6, x2, x17, x9]

P(x1,…,xn) = P(x3)*P(x12)*P(x2|x3,x12)*P(x6|x3)*...

  • Specified by:

    • Network structure (must be a DAG)

    • Probability distribution for each variable (bit) given each possible combination of its parents’ values



Optimization with Bayesian Networks

  • Two operations we need to perform during optimization:

    • Given a network, generate bitstrings from the probability distribution represented by that network

      • Trivial with Bayesian networks

    • Given a set of “good” bitstrings, find the network most likely to have generated them

      • Given a fixed network structure, trivial to fill in the probability tables optimally

      • But what structure should we use?
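Both operations really are trivial, as a short sketch shows (the data layouts and function names here are our illustrative choices): filling in the tables for a fixed structure is just counting, and generation is ancestral sampling in topological order.

```python
import random
from collections import defaultdict

def fit_cpt(D, parents):
    """Fill in each bit's table P(xi=1 | parent values) by counting."""
    cpt = {}
    for i, ps in enumerate(parents):
        counts = defaultdict(lambda: [0, 0])      # parent values -> [n, n_ones]
        for row in D:
            key = tuple(row[p] for p in ps)
            counts[key][0] += 1
            counts[key][1] += row[i]
        cpt[i] = {k: ones / n for k, (n, ones) in counts.items()}
    return cpt

def sample_bits(order, parents, cpt, rng=random):
    """Ancestral sampling: visit bits in topological order and draw each
    bit given the already-sampled values of its parents."""
    x = {}
    for i in order:
        p1 = cpt[i][tuple(x[p] for p in parents[i])]
        x[i] = 1 if rng.random() < p1 else 0
    return [x[i] for i in sorted(x)]
```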



Learning Network Structures from Data

  • Problem statement: given a dataset D and a set of allowable Bayesian networks, find the network B with the maximum posterior probability:

    B* = argmax_B P(B|D) = argmax_B P(D|B)·P(B)

Learning Network Structures from Data (cont.)

Maximizing log P(D|B) reduces to maximizing

−Σi H′(xi | Πi)

where

  • Πi is the set of xi’s parents in B

  • H′(xi | Πi) is the conditional entropy of xi given its parents under the empirical distribution exhibited in D
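The score can be computed directly from counts over the dataset; a small illustration (helper names are ours):

```python
from math import log2
from collections import Counter

def cond_entropy(D, i, parents):
    """Empirical conditional entropy H'(xi | parents) over bit-rows D."""
    n = len(D)
    joint = Counter((tuple(r[p] for p in parents), r[i]) for r in D)
    marg = Counter(tuple(r[p] for p in parents) for r in D)
    # H(X|Y) = -sum_{x,y} p(x,y) * log2 p(x|y)
    return -sum(c / n * log2(c / marg[k]) for (k, _), c in joint.items())

def structure_score(D, parents):
    """The quantity to maximize: -sum_i H'(xi | Pi_i); higher is better."""
    return -sum(cond_entropy(D, i, ps) for i, ps in enumerate(parents))
```

On a dataset where x1 simply copies x0, conditioning x1 on x0 removes a full bit of entropy, so the scored network prefers that arc.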


Single-Parent Tree-Shaped Networks

[Diagram: example dependency tree over bits x17, x22, x5, x2, x13, x19, x6, x14, x8]

  • Let’s allow each bit to be conditioned on at most one other bit.

  • Adding an arc from xj to xi increases network score by

    H(xi) - H(xi|xj) (xj’s “information gain” with xi)

    = H(xj) - H(xj|xi) (not necessarily obvious, but true)

    = I(xi, xj), the mutual information between xi and xj
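Empirical mutual information between two bit columns can be computed directly from joint counts; a small sketch (the function name is ours):

```python
from math import log2
from collections import Counter

def mutual_information(a, b):
    """Empirical I(a; b) = sum_{x,y} p(x,y) * log2(p(x,y) / (p(x)p(y)))
    for two equal-length bit columns."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(c / n * log2(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())
```

Identical columns give one full bit of information; independent columns give zero; and, as the slide notes, the quantity is symmetric in its arguments.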



Optimal Single-Parent Tree-Shaped Networks

  • To find optimal single-parent tree-shaped network, just find maximum spanning tree using I(xi, xj) as the weight for the edge between xi and xj. [Chow and Liu, 1968].

  • Can be done in O(n²) time (assuming D has already been reduced to sufficient statistics)
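A sketch of the Chow–Liu construction: Prim's algorithm for the maximum spanning tree, using mutual information as edge weights. The O(n²) pairwise table dominates the cost. The mutual-information helper is repeated here (under our names) so the snippet is self-contained.

```python
from math import log2
from collections import Counter

def mutual_info(a, b):
    """Empirical mutual information between two bit columns."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(c / n * log2(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())

def chow_liu_tree(D):
    """Maximum spanning tree over pairwise mutual information (Prim's
    algorithm).  Returns parent[i] for each bit; bit 0 is the root."""
    n = len(D[0])
    cols = [[row[i] for row in D] for i in range(n)]
    I = [[mutual_info(cols[i], cols[j]) for j in range(n)]
         for i in range(n)]
    parent = [None] * n
    in_tree = [0]
    while len(in_tree) < n:
        # best edge from a tree node j to a non-tree node i
        j, i = max(((j, i) for j in in_tree
                    for i in range(n) if parent[i] is None and i != 0),
                   key=lambda ji: I[ji[0]][ji[1]])
        parent[i] = j
        in_tree.append(i)
    return parent
```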



Optimal Dependency Trees for Combinatorial Optimization

[Baluja & Davies, 1997]

  • Start with a dataset D initialized from the uniform distribution

  • Until termination criteria met:

    • Build optimal dependency tree T with which to model D.

    • Generate K bitstrings from probability distribution represented by T. Evaluate them.

    • Add best M bitstrings to D after decaying the weight of all datapoints already in D by a factor α between 0 and 1.

  • Return best bitstring ever evaluated.

  • Running time: O(K*n + M*n²) per iteration



Graph-Coloring Example

  • Noisy Graph-Coloring example

    • For each edge connecting vertices of different colors, add +1 to the evaluation function with probability 0.5
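The noisy evaluation function is easy to state in code; this sketch uses a two-color encoding (one bit per vertex) and an edge-list representation, both our illustrative choices:

```python
import random

def noisy_coloring_value(bits, edges, rng=random):
    """Noisy 2-coloring evaluation: one bit per vertex.  Each edge whose
    endpoints have different colors adds +1 with probability 0.5."""
    return sum(1 for u, v in edges
               if bits[u] != bits[v] and rng.random() < 0.5)
```

Edges whose endpoints share a color can never contribute, so the noise only blurs the reward for properly colored edges.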



Optimal Dependency Tree Results

  • Tested the following optimization algorithms on a variety of small problems, up to 256 bits long:

    • Optimization with optimal dependency trees

    • Optimization with chain-shaped networks (à la MIMIC)

    • PBIL

    • Hillclimbing

    • Genetic algorithms

  • General trend:

    • Hillclimbing & GAs did relatively poorly

    • Among probabilistic methods, PBIL < Chains < Trees: more accurate models are better



Modeling Higher-Order Dependencies

  • The maximum spanning tree algorithm gives us the optimal Bayesian Network in which each node has at most one parent.

  • What about finding the best network in which each node has at most K parents for K>1?

    • NP-complete problem! [Chickering, et al., 1995]

    • However, search heuristics such as hillclimbing can be used to look for “good” network structures (e.g., [Heckerman, et al., 1995]).



Bayesian Network-Based Combinatorial Optimization

  • Initialize D with C bitstrings from uniform distribution, and Bayesian network B to empty network containing no edges

  • Until termination criteria met:

    • Perform steepest-ascent hillclimbing from B to find locally optimal network B’. Repeat until no changes increase score:

      • Evaluate how each possible edge addition, deletion, or reversal would affect the penalized log-likelihood score

      • Perform change that maximizes increase in score.

    • Set B to B’.

    • Generate and evaluate K bitstrings from B.

    • Decay weight of datapoints in D by α.

    • Add best M of the K recently generated datapoints to D.

  • Return best bitstring ever evaluated.



Bayesian Networks Results

  • Does better than the Tree-based optimization algorithm on some toy problems

  • Roughly the same as the Tree-based algorithm on others

  • However, requires significantly more computation than the Tree-based algorithm, despite efficiency hacks

  • Why aren’t the results much better?

    • Too much emphasis on exploitation rather than exploration?

    • Steepest-ascent hillclimbing over network structures not good enough, particularly when starting from old networks?



Using Probabilistic Models for Intelligent Restarts

  • Tree-based algorithm’s O(n²) execution time per iteration very expensive for large problems

  • Even more so for algorithm based on more complicated Bayesian networks

  • One possible approach: use probabilistic models to select good starting points for faster optimization algorithms, e.g. hillclimbing or PBIL



COMIT

  • Combining Optimizers with Mutual Information Trees [Baluja and Davies, 1998]

    • Initialize dataset D with bitstrings drawn from uniform distribution

    • Until termination criteria met:

      • Build optimal dependency tree T with which to model D.

      • Use T to stochastically generate K bitstrings. Evaluate them.

      • Execute a fast-search procedure, initialized with the best solutions generated from T.

      • Replace up to M of the worst bitstrings in D with the best bitstrings found during the fast-search run just executed.

    • Return best bitstring ever evaluated
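The COMIT loop above can be skeletonized as follows. This is a sketch with injectable components: `build_model`, `sample_model`, and `fast_search` are hypothetical stand-ins for the tree-learning, tree-sampling, and fast-search (e.g. hillclimbing) pieces, and the parameter names are ours.

```python
import random

def comit(f, n, build_model, sample_model, fast_search,
          D_size=100, K=50, M=20, iters=10, rng=None):
    """Skeleton of the COMIT loop with pluggable model and search steps."""
    rng = rng or random.Random()
    # initialize dataset D from the uniform distribution
    D = [[rng.randrange(2) for _ in range(n)] for _ in range(D_size)]
    best_x, best_f = None, float("-inf")
    for _ in range(iters):
        model = build_model(D)                         # e.g. a dependency tree
        samples = [sample_model(model, rng) for _ in range(K)]
        start = max(samples, key=f)                    # best sample seeds search
        found = fast_search(f, start)                  # e.g. hillclimbing
        for x in found:
            if f(x) > best_f:
                best_x, best_f = list(x), f(x)
        # replace up to M of the worst bitstrings in D with the best found
        D.sort(key=f)                                  # worst first
        m = min(M, len(found))
        D[:m] = [list(x) for x in found[:m]]
    return best_x, best_f
```

With a marginal-frequency model and a one-pass greedy bit-flipper plugged in, this skeleton already solves a simple bit-counting problem; the real algorithm swaps in the optimal dependency tree for `build_model`/`sample_model`.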



COMIT with Hill-climbing

  • Use single best bitstring generated by tree as starting point of a stochastic hillclimbing algorithm that allows at most PATIENCE moves to points of equal value before giving up

  • |D| kept at 1000; M set to 100.

  • We compare COMIT vs.:

    • Hillclimbing with restarts from bitstrings chosen randomly from uniform distribution

    • Hillclimbing with restarts from best bitstring out of K chosen randomly from uniform distribution
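The patience-limited stochastic hillclimber described above might look like this. A sketch under our naming: the `max_evals` cap is our addition (not from the slides) so the search also terminates on functions with no equal-valued moves.

```python
import random

def patience_hillclimb(f, x, patience=100, max_evals=10000, rng=random):
    """Stochastic hillclimbing: flip a random bit, accept if the value does
    not decrease, and give up after `patience` consecutive moves to
    equal-valued points (or after `max_evals` total proposals)."""
    x = list(x)
    fx = f(x)
    best_x, best_f = x[:], fx
    equal_moves = evals = 0
    while equal_moves < patience and evals < max_evals:
        evals += 1
        i = rng.randrange(len(x))
        x[i] ^= 1                      # propose a one-bit neighbor
        fy = f(x)
        if fy > fx:
            fx, equal_moves = fy, 0
            if fy > best_f:
                best_x, best_f = x[:], fy
        elif fy == fx:
            equal_moves += 1           # sideways move: accepted but counted
        else:
            x[i] ^= 1                  # downhill move: reject, flip back
    return best_x, best_f
```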



COMIT w/Hillclimbing: Example of Behavior

  • TSP domain: 100 cities, 700 bits

[Plot: tour length (×10³) vs. evaluation number (×10³) for COMIT and the baseline hillclimber]



COMIT w/Hillclimbing: Results

  • Each number is average over at least 25 runs

  • Each algorithm given 200,000 evaluation function calls

  • Highlighted: better than each non-COMIT hillclimber with confidence > 95%

  • AHCxxx: pick best of xxx randomly generated starting points before hillclimbing

  • COMITxxx: pick best of xxx starting points generated by tree before hillclimbing



COMIT with PBIL

  • Generate K samples from T; evaluate them

  • Initialize PBIL’s P vector according to the unconditional distributions contained in the best C% of these examples

  • PBIL run terminated after 5000 evaluations without improvement

  • |D| kept at 1000; M set to 100 (as before)

  • PBIL parameters: α = 0.15; update P with the single best example out of 50 each iteration; some “mutation” of P used as well



COMIT w/PBIL: Results

  • Each number is average over at least 50 runs

  • Each algorithm given 600,000 evaluation function calls

  • Highlighted: better than opposing algorithm(s) with confidence > 95%

  • PBIL+R: PBIL with restart after 5000 evaluations without improvement (P reinitialized to 0.5)



Conclusions & Future Work

  • COMIT makes probabilistic modeling applicable to much larger problems

  • COMIT led to significant improvements over baseline search algorithm in most problems tested

  • Future work

    • Applying COMIT to more interesting problems

    • Using COMIT to combine results of multiple search algorithms

    • Incorporating domain knowledge



Future Work

  • Making algorithm based on complex Bayesian Networks more practical

    • Combine w/ simpler search algorithms, à la COMIT?

  • Applying COMIT to more interesting problems and other baseline search algorithms

    • WALKSAT?

  • Using more sophisticated probabilistic models

  • Using COMIT to combine results of multiple search algorithms

