Fast Probabilistic Modeling for Combinatorial Optimization

Scott Davies

School of Computer Science

Carnegie Mellon University

(and Justsystem Pittsburgh Research Center)

Joint work with Shumeet Baluja

- Maximize “evaluation function” f(x)
- input: fixed-length bitstring x = (x1, x2, …,xn)
- output: real value

- x might represent:
- job shop schedules
- TSP tours
- discretized numeric values
- etc.

- Our focus: “Black Box” optimization
- No domain-dependent heuristics

[Figure: the black box maps a bitstring, e.g. x = 1001001..., to a real value, e.g. f(x) = 37.4]

- Hill-climbing, simulated annealing
- Generate candidate solutions “neighboring” single current working solution (e.g. differing by one bit)
- Typically make no attempt to model how particular bits affect solution quality

- Genetic algorithms
- Attempt to implicitly capture dependency of solution quality on bit values by maintaining a population of candidate solutions
- Use crossover and mutation operators on population members to generate new candidate solutions

- Maintain an explicit probability distribution P from which we generate new candidate solutions
- Initialize P to uniform distribution
- Until termination criteria met:
- Stochastically generate K candidate solutions from P
- Evaluate them
- Update P to make it more likely to generate solutions “similar” to the “good” solutions

- Return best bitstring ever evaluated

- Several different choices for what sorts of P to use and how to update it after candidate solution evaluation

- Population-Based Incremental Learning (PBIL) [Baluja, 1995]
- Maintains a vector of probabilities: one independent probability P(xi=1) for each bit xi. (Each initialized to 0.5)
- Until termination criteria met:
- Generate a population of K bitstrings from P. Evaluate them.
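- For each of the best M bitstrings of these K, adjust P to make that bitstring more likely:
P(xi=1) ← (1 - a)*P(xi=1) + a   if xi is set to 1
P(xi=1) ← (1 - a)*P(xi=1)       if xi is set to 0
or, equivalently, P(xi=1) ← (1 - a)*P(xi=1) + a*xi, where a is the learning rate (between 0 and 1)
- Also sometimes uses “mutation” and/or pushes P away from bad solutions

- Often works very well (compared to hillclimbing and GAs) despite its independence assumption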
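The PBIL loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `pbil`, the toy objective `onemax`, and the default parameter values are assumptions, and the optional "mutation" and negative-example updates are omitted.

```python
import random

def pbil(f, n, K=50, M=1, alpha=0.1, iterations=200, rng=None):
    """Minimal PBIL sketch: maximize f over bitstrings of length n."""
    rng = rng or random.Random(0)
    p = [0.5] * n  # p[i] is P(x_i = 1); every bit starts at 0.5
    best_x, best_val = None, float("-inf")
    for _ in range(iterations):
        # Sample K bitstrings, each bit drawn independently from p.
        population = [[1 if rng.random() < p_i else 0 for p_i in p]
                      for _ in range(K)]
        scored = sorted(((f(x), x) for x in population),
                        key=lambda t: t[0], reverse=True)
        if scored[0][0] > best_val:
            best_val, best_x = scored[0]
        # Shift p toward each of the best M samples of this generation.
        for _, x in scored[:M]:
            p = [(1 - alpha) * p_i + alpha * x_i for p_i, x_i in zip(p, x)]
    return best_x, best_val

def onemax(x):
    """Toy evaluation function (an assumption): number of 1 bits."""
    return sum(x)
```

Calling `pbil(onemax, 20)` returns the best bitstring seen and its value; any black-box `f` that accepts a list of 0/1 values can be substituted.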

- MIMIC [De Bonet, et al., 1997]: sort bits into a Markov chain in which each bit’s distribution is conditioned on the value of the previous bit in the chain
- e.g.: P(x1, x2, …, xn) = P(x3)*P(x6|x3)* P(x2|x6)*…
- Generate chain from best N% of bitstrings evaluated so far; generate new bitstrings from chain; repeat

- More general framework: Bayesian networks

[Figure: example Bayesian network over bits x3, x12, x6, x2, x17, x9]

P(x1,…,xn) = P(x3)*P(x12)*P(x2|x3,x12)*P(x6|x3)*...

- Specified by:
- Network structure (must be a DAG)
- Probability distribution for each variable (bit) given each possible combination of its parents’ values

- Two operations we need to perform during optimization:
- Given a network, generate bitstrings from the probability distribution represented by that network
- Trivial with Bayesian networks

- Given a set of “good” bitstrings, find the network most likely to have generated it
- Given a fixed network structure, trivial to fill in the probability tables optimally
- But what structure should we use?

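Generating a bitstring from a Bayesian network amounts to sampling each bit after its parents. The sketch below assumes the network is stored as a topological order, a parent map, and conditional probability tables keyed by parent values; these names and the three-bit example network are hypothetical.

```python
import random

def sample_bitstring(order, parents, cpt, rng):
    """Ancestral sampling: visit bits so that parents come before children.

    order   -- bit indices in topological order
    parents -- parents[i] is a tuple of bit i's parent indices
    cpt     -- cpt[i][parent_values] is P(x_i = 1 | parents = parent_values)
    """
    x = {}
    for i in order:
        parent_values = tuple(x[j] for j in parents[i])
        x[i] = 1 if rng.random() < cpt[i][parent_values] else 0
    return [x[i] for i in sorted(x)]

# Hypothetical three-bit network: x0 -> x2 <- x1, with x2 = x0 XOR x1.
parents = {0: (), 1: (), 2: (0, 1)}
cpt = {
    0: {(): 0.5},
    1: {(): 0.5},
    2: {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0},
}
rng = random.Random(0)
samples = [sample_bitstring([0, 1, 2], parents, cpt, rng) for _ in range(100)]
```

Each sample costs one table lookup per bit, which is why this step is cheap once the structure and tables are known.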

- Problem statement: given a dataset D and a set of allowable Bayesian networks, find the network B with the maximum posterior probability:

P(B|D) = P(D|B)*P(B) / P(D)

Maximizing log P(D|B) reduces to maximizing

−Σi H’(xi|Πi), summed over all bits i = 1, …, n

where

- Πi is the set of xi’s parents in B
- H’(…) is the average entropy of an empirical distribution exhibited in D
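As a rough illustration of this score, the sketch below computes the negated sum of empirical conditional entropies for a given parent assignment. The function names and the data layout (a list of 0/1 lists) are assumptions made for the example.

```python
import math
from collections import Counter

def conditional_entropy(data, i, parents):
    """Empirical entropy of bit i given its parents, in bits."""
    n = len(data)
    joint = Counter((tuple(x[j] for j in parents), x[i]) for x in data)
    parent_counts = Counter(tuple(x[j] for j in parents) for x in data)
    h = 0.0
    for (parent_values, _), count in joint.items():
        # -P(parents, x_i) * log2 P(x_i | parents)
        h -= (count / n) * math.log2(count / parent_counts[parent_values])
    return h

def network_score(data, parents):
    """Negated sum of conditional entropies; higher is better."""
    return -sum(conditional_entropy(data, i, parents[i])
                for i in range(len(parents)))
```

On the dataset `[[0, 0], [0, 0], [1, 1], [1, 1]]`, the empty network scores -2 bits, while letting x0 be the parent of x1 scores -1 bit, since x1 is then fully determined.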

[Figure: example tree-shaped network over bits x17, x22, x5, x2, x13, x19, x6, x14, x8]

- Let’s allow each bit to be conditioned on at most one other bit.

- Adding an arc from xj to xi increases network score by
H(xi) - H(xi|xj)   (xj’s “information gain” with xi)
= H(xj) - H(xj|xi)   (not necessarily obvious, but true)
= I(xi, xj)   (mutual information between xi and xj)

- To find optimal single-parent tree-shaped network, just find maximum spanning tree using I(xi, xj) as the weight for the edge between xi and xj. [Chow and Liu, 1968].
- Can be done in O(n^2) time (assuming D has already been reduced to sufficient statistics)
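A compact sketch of the tree construction follows: pairwise mutual information as edge weights, then Prim's algorithm for the maximum spanning tree. The function names are illustrative. For simplicity this version recomputes mutual information directly from the raw data, so only the spanning-tree step is O(n^2); it does not use precomputed sufficient statistics.

```python
import math

def mutual_information(data, i, j):
    """Empirical mutual information I(x_i, x_j) in bits."""
    n = len(data)
    joint = {}
    for x in data:
        key = (x[i], x[j])
        joint[key] = joint.get(key, 0) + 1
    p_i = {a: sum(c for (u, _), c in joint.items() if u == a) / n for a in (0, 1)}
    p_j = {b: sum(c for (_, v), c in joint.items() if v == b) / n for b in (0, 1)}
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / (p_i[a] * p_j[b]))
    return mi

def chow_liu_tree(data):
    """Return parent[i] for every bit (None for the root, bit 0)."""
    n_bits = len(data[0])
    weight = [[mutual_information(data, i, j) if i != j else 0.0
               for j in range(n_bits)] for i in range(n_bits)]
    parent = [None] * n_bits
    in_tree = [False] * n_bits
    in_tree[0] = True  # arbitrary choice of root
    # best[k] = (weight of the heaviest edge from k into the tree, tree endpoint)
    best = [(weight[0][k], 0) for k in range(n_bits)]
    for _ in range(n_bits - 1):
        j = max((k for k in range(n_bits) if not in_tree[k]),
                key=lambda k: best[k][0])
        parent[j] = best[j][1]
        in_tree[j] = True
        for k in range(n_bits):
            if not in_tree[k] and weight[j][k] > best[k][0]:
                best[k] = (weight[j][k], j)
    return parent
```

For data in which x1 always equals x0 and x2 is independent of both, `chow_liu_tree` attaches x1 to x0 with weight 1 bit; x2 is attached by a zero-weight edge, which changes nothing in the score.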

[Baluja & Davies, 1997]

- Start with a dataset D initialized from the uniform distribution
- Until termination criteria met:
- Build optimal dependency tree T with which to model D.
- Generate K bitstrings from probability distribution represented by T. Evaluate them.
- Add best M bitstrings to D after decaying the weight of all datapoints already in D by a factor a between 0 and 1.

- Return best bitstring ever evaluated.
- Running time: O(K*n + M*n^2) per iteration
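One plausible way to reach the M*n^2 term is to keep D only as decayed pairwise counts rather than as a list of bitstrings. The sketch below assumes that representation; the function name `update_pair_counts` and the exact layout of `counts` are hypothetical, not taken from the slides.

```python
def update_pair_counts(counts, total, new_bitstrings, decay):
    """Decay the old statistics, then add each new bitstring with weight 1.

    counts[i][j] -- weighted number of datapoints with x_i = 1 and x_j = 1
                    (counts[i][i] is the weighted count of x_i = 1)
    total        -- weighted number of datapoints in D
    """
    n = len(counts)
    for i in range(n):
        for j in range(n):
            counts[i][j] *= decay
    total *= decay
    for x in new_bitstrings:
        ones = [i for i, bit in enumerate(x) if bit]
        for i in ones:          # at most n^2 increments per new bitstring
            for j in ones:
                counts[i][j] += 1.0
        total += 1.0
    return counts, total
```

From `counts[i][j]`, `counts[i][i]`, `counts[j][j]` and `total`, all four cells of the joint distribution of any pair of bits can be recovered, which is what the mutual-information weights need.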

- Noisy Graph-Coloring example
- For each edge connected to vertices of different colors, add +1 to evaluation function with probability 0.5
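A noisy evaluation function of this kind might look as follows. The encoding of vertex colors as fixed-width binary chunks of the bitstring, and the names used, are assumptions for illustration; the slides do not specify the encoding.

```python
import random

def noisy_coloring_score(x, edges, bits_per_vertex, rng):
    """Each edge whose endpoints differ in color scores +1 with probability 0.5."""
    def color(v):
        # Assumed encoding: vertex v's color is a binary number in its own chunk.
        chunk = x[v * bits_per_vertex:(v + 1) * bits_per_vertex]
        return int("".join(map(str, chunk)), 2)

    score = 0
    for u, v in edges:
        if color(u) != color(v) and rng.random() < 0.5:
            score += 1
    return score
```

Because of the coin flip, evaluating the same bitstring twice can give different values, so the optimizer only ever sees a noisy estimate of solution quality.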

- Tested the following optimization algorithms on a variety of small problems, up to 256 bits long:
- Optimization with optimal dependency trees
- Optimization with chain-shaped networks (à la MIMIC)
- PBIL
- Hillclimbing
- Genetic algorithms

- General trend:
- Hillclimbing & GAs did relatively poorly
- Among probabilistic methods, PBIL < Chains < Trees: more accurate models are better

- The maximum spanning tree algorithm gives us the optimal Bayesian Network in which each node has at most one parent.
- What about finding the best network in which each node has at most K parents for K>1?
- NP-complete problem! [Chickering, et al., 1995]
- However, search heuristics such as hillclimbing can be used to look for “good” network structures (e.g., [Heckerman, et al., 1995]).

- Initialize D with C bitstrings from uniform distribution, and Bayesian network B to empty network containing no edges
- Until termination criteria met:
- Perform steepest-ascent hillclimbing from B to find locally optimal network B’. Repeat until no changes increase score:
- Evaluate how each possible edge addition, removal, or reversal would affect the penalized log-likelihood score
- Perform change that maximizes increase in score.

- Set B to B’.
- Generate and evaluate K bitstrings from B.
- Decay weight of datapoints in D by a.
- Add best M of the K recently generated datapoints to D.

- Return best bitstring ever evaluated.

- Does better than Tree-based optimization algorithm on some toy problems
- Roughly the same as Tree-based algorithm on others
- Significantly more computation than Tree-based algorithm, however, despite efficiency hacks
- Why not much better results?
- Too much emphasis on exploitation rather than exploration?
- Steepest-ascent hillclimbing over network structures not good enough, particularly when starting from old networks?

- Tree-based algorithm’s O(n^2) execution time per iteration very expensive for large problems
- Even more so for algorithm based on more complicated Bayesian networks
- One possible approach: use probabilistic models to select good starting points for faster optimization algorithms, e.g. hillclimbing or PBIL

- Combining Optimizers with Mutual Information Trees [Baluja and Davies, 1998]
- Initialize dataset D with bitstrings drawn from uniform distribution
- Until termination criteria met:
- Build optimal dependency tree T with which to model D.
- Use T to stochastically generate K bitstrings. Evaluate them.
- Execute a fast-search procedure, initialized with the best solutions generated from T.
- Replace up to M of the worst bitstrings in D with the best bitstrings found during the fast-search run just executed.

- Return best bitstring ever evaluated

- Use single best bitstring generated by tree as starting point of a stochastic hillclimbing algorithm that allows at most PATIENCE moves to points of equal value before giving up
- |D| kept at 1000; M set to 100.
- We compare COMIT vs.:
- Hillclimbing with restarts from bitstrings chosen randomly from uniform distribution
- Hillclimbing with restarts from best bitstring out of K chosen randomly from uniform distribution
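The hillclimber used inside COMIT could be sketched as below. This is one reading of the description, with labeled assumptions: moves are single-bit flips, the count of equal-valued moves resets after every strict improvement, and an evaluation budget `max_evals` is added so the loop always terminates. The names are illustrative.

```python
import random

def hillclimb(f, x, patience, max_evals, rng):
    """Stochastic hillclimbing by single-bit flips (maximization)."""
    x = list(x)
    fx = f(x)
    evals, equal_moves = 1, 0
    while evals < max_evals:
        i = rng.randrange(len(x))
        x[i] ^= 1                      # propose flipping one bit
        fy = f(x)
        evals += 1
        if fy > fx:
            fx, equal_moves = fy, 0    # improvement: accept, reset the counter
        elif fy == fx and equal_moves < patience:
            equal_moves += 1           # sideways move: accept, but count it
        elif fy == fx:
            x[i] ^= 1                  # patience exhausted: undo and give up
            break
        else:
            x[i] ^= 1                  # worse: undo the flip
    return x, fx, evals
```

In COMIT the starting point `x` would be the best bitstring sampled from the tree; in the baselines it would be a uniformly random bitstring, or the best of K such bitstrings.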

- TSP domain: 100 cities, 700 bits

[Figure: tour length (× 10^3) versus evaluation number (× 10^3) for Hillclimber and COMIT]

- Each number is average over at least 25 runs
- Each algorithm given 200,000 evaluation function calls
- Highlighted: better than each non-COMIT hillclimber with confidence > 95%
- AHCxxx: pick best of xxx randomly generated starting points before hillclimbing
- COMITxxx: pick best of xxx starting points generated by tree before hillclimbing

- Generate K samples from T; evaluate them
- Initialize PBIL’s Pvector according to the unconditional distributions contained in the best C% of these examples
- PBIL run terminated after 5000 evaluations without improvement
- |D| kept at 1000; M set to 100 (as before)
- PBIL parameters: a=.15; update P with single best example out of 50 each iteration; some “mutation” of P used as well
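Seeding PBIL from the tree's samples can be illustrated as follows: rank the samples by score, keep the best fraction, and set each entry of the probability vector to the frequency of 1s at that bit. The function name and the `top_fraction` parameter (C% written as a fraction) are assumptions for the example.

```python
def init_p_vector(samples, scores, top_fraction):
    """Set P(x_i = 1) to the frequency of 1s at bit i among the best samples."""
    ranked = sorted(zip(scores, samples), key=lambda t: t[0], reverse=True)
    keep = max(1, int(len(ranked) * top_fraction))  # always keep at least one
    best = [x for _, x in ranked[:keep]]
    n = len(best[0])
    return [sum(x[i] for x in best) / keep for i in range(n)]
```

Only the unconditional per-bit frequencies survive this step, so the pairwise dependencies captured by the tree influence PBIL solely through which samples make it into the best C%.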

- Each number is average over at least 50 runs
- Each algorithm given 600,000 evaluation function calls
- Highlighted: better than opposing algorithm(s) with confidence > 95%
- PBIL+R: PBIL with restart after 5000 evaluations without improvement (P reinitialized to 0.5)

- COMIT makes probabilistic modeling applicable to much larger problems
- COMIT led to significant improvements over baseline search algorithm in most problems tested
- Future work
- Applying COMIT to more interesting problems and other baseline search algorithms
- WALKSAT?

- Using COMIT to combine results of multiple search algorithms
- Incorporating domain knowledge
- Making algorithm based on complex Bayesian networks more practical
- Combine w/ simpler search algorithms, à la COMIT?

- Using more sophisticated probabilistic models