
Likely-Admissible & Sub-symbolic Heuristics



Presentation Transcript


  1. 26-08-2004 Valencia Likely-Admissible & Sub-symbolic Heuristics Marco Ernandes Cognitive Science PhD Student Email: ernandes@dii.unisi.it Web: www.dii.unisi.it/~ernandes Marco Gori Professor of Computer Science Email: marco@dii.unisi.it Web: www.dii.unisi.it/~marco

  2. Heuristic Search [Figure: a search tree from start x0 toward goal *, annotated with node n, path cost g(n), and estimate h(n).] • Search algorithms: A*, IDA*, BS*, … • Heuristic information: h(n), typically an estimate of the distance from node n to the goal • Heuristic usage policy: how to combine h(n) & g(n) to obtain f(n)
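As a concrete reference for what follows, a minimal A* sketch in Python; states are assumed hashable and comparable (e.g. tuples), and `neighbors` and `h` are hypothetical callables supplied by the problem, not anything prescribed by the slides:

```python
import heapq

def a_star(start, goal, neighbors, h):
    """Return the cost of a cheapest path from start to goal, or None."""
    open_heap = [(h(start), 0, start)]            # entries are (f, g, state)
    best_g = {start: 0}
    while open_heap:
        f, g, state = heapq.heappop(open_heap)
        if state == goal:
            return g                              # optimal if h is admissible
        if g > best_g.get(state, float("inf")):
            continue                              # stale queue entry
        for succ, cost in neighbors(state):
            g2 = g + cost
            if g2 < best_g.get(succ, float("inf")):
                best_g[succ] = g2
                heapq.heappush(open_heap, (g2 + h(succ), g2, succ))
    return None
```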

  3. Optimal Search for NP Problems • 2 approaches: • Rigid admissibility • requires optimistic heuristics • ALWAYS retrieves optimal solutions: C = C* • Relaxed admissibility • ε-admissible search (e.g. WA*) • retrieves solutions with bounded costs: C ≤ (1+ε)C* • the problem is no longer NP-complete

  4. Two families of heuristics • “Online” heuristics: • The h(n) value is computed during search, when a node is visited. • An AI classic: Manhattan Distance • “Memory-based” heuristics: • Offline phase: resolution of all possible subproblems and storage of all the results. • Online phase: decomposition of a node into subproblems and database querying. • Successfully used for rigid admissibility.

  5. “Online” heuristic research: how to improve Manhattan estimations? • Working on its main bias: locality. • Manhattan considers each piece of the problem as completely independent from the rest, hence it has no way to determine how tiles influence each other (tile conflicts): hM = h* − GAP. • Manhattan also ignores the influence of the blank tile. • [Figure: an 8-puzzle configuration with hM = 3 but h* = 11.]
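A sketch of the Manhattan Distance for the (n²−1)-puzzle; boards are flat tuples, and the goal layout (1, …, n²−1, 0) is an assumed convention, not fixed by the slides:

```python
def manhattan(board, n=3):
    """Sum over tiles of horizontal + vertical distance to the goal square.
    The blank (0) is ignored; tile t is assumed to belong at index t - 1."""
    dist = 0
    for pos, tile in enumerate(board):
        if tile == 0:
            continue
        goal = tile - 1
        dist += abs(pos // n - goal // n) + abs(pos % n - goal % n)
    return dist
```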

  6. “Online” heuristic research: how to improve Manhattan estimations? • 1: Manhattan Correction (Hansson et al., 1992). The idea is to increment the estimation with ad hoc techniques, maintaining admissibility. • 2: ABSOLVER approach (Prieditis, 1989). Automatically inventing admissible heuristics through constraint elimination. • 3: Higher-Order Heuristics (Korf, 1996). Generalizing Manhattan by considering subproblems of a configuration rather than single elements.

  7. Manhattan Corrections • Linear Conflicts, Corner Tiles, Last Moves (Hansson et al., 1992) • Non-Linear Conflicts, Corner Deduction, First Moves (Ernandes, 2003) • Conflict Deduction (Ernandes, 2003)

  8. Examples [Figure: 8-puzzle configurations illustrating each correction.] • Linear Conflicts: computes conflicts on the same row/column. • Corner Tiles: computes conflicts using corner properties. • Last Moves: computes the last two moves needed to complete the puzzle. • Non-Linear Conflicts: computes conflicts on a different row/column (two types). • Corner Deduction: as Corner Tiles but with correct tiles on the diagonal.
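As an illustration of the first correction, a naive row-only Linear Conflicts sketch. One hedge: this raw pair count can exceed the admissible correction when three or more tiles mutually conflict; the version of Hansson et al. instead counts the minimum number of tiles that must leave the line. Column conflicts would be scanned symmetrically.

```python
def linear_conflicts_rows(board, n=3):
    """Count reversed pairs among tiles already in their goal row; each such
    conflict forces at least 2 extra moves on top of Manhattan."""
    extra = 0
    for row in range(n):
        in_goal_row = [t for t in board[row * n:(row + 1) * n]
                       if t != 0 and (t - 1) // n == row]
        for i in range(len(in_goal_row)):
            for j in range(i + 1, len(in_goal_row)):
                # left-to-right order on the board vs. order of goal columns
                if (in_goal_row[i] - 1) % n > (in_goal_row[j] - 1) % n:
                    extra += 2
    return extra
```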

  9. Conflict Deduction • It is more convenient to implement the various techniques separately. • The corrections cannot all be added together: inadmissibility! • A tile involved in a conflict counts only once. • To maximize the estimation we use, for each tile, the technique that gives the highest contribution.

  10. Higher-Order Heuristics [Figure: a 15-puzzle configuration and its goal.] • “Ad hoc” techniques generate strongly problem-dependent heuristics. • They are not sufficient to attack bigger problems such as the 24-puzzle. • Manhattan has to be generalized another way: considering the distance-to-goal of several elements (tiles) taken together. • First example, Pairwise Distances: instead of computing the distance of 1 independent tile, we use pairs of tiles.

  11. Higher-Order Heuristics: problems • We can strengthen the Pairwise Distance by computing it for all possible tile pairs and then seeking the combination that maximizes the estimation: a Maximum Weighted Matching Problem (see the sketch below). • PD remains poorly informed. We would need triples of tiles, but the matching problem then becomes NP-complete (Korf, 1996). • Hence the only Higher-Order Heuristic that can be used efficiently online is Pairwise Distance, which is too poorly informed: less informed than Conflict Deduction!
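A sketch of the pairwise-distance idea under stated assumptions: `pd` is a hypothetical precomputed table mapping an ordered tile pair and its two board positions to the exact cost of solving those two tiles together (building it is the offline part), and the maximization is delegated to networkx:

```python
import networkx as nx

def pairwise_heuristic(board, pd):
    """Maximum-weight matching over two-tile distances (pairs are disjoint)."""
    pos = {t: i for i, t in enumerate(board) if t != 0}
    G = nx.Graph()
    tiles = sorted(pos)
    for i, t1 in enumerate(tiles):
        for t2 in tiles[i + 1:]:
            G.add_edge(t1, t2, weight=pd[(t1, t2)][(pos[t1], pos[t2])])
    matching = nx.max_weight_matching(G, maxcardinality=True)
    # With an odd number of tiles one tile stays unmatched; a full version
    # would add its single-tile (Manhattan) distance.
    return sum(G[u][v]["weight"] for u, v in matching)
```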

  12. From Higher-Order Heuristics to memory-based heuristics [Figure: a 15-puzzle configuration and its goal.] • Higher-Order Heuristics could ignore the maximization problem and consider pre-designed tile groups (and increase their size). • Solving subproblems of 3 or more tiles (patterns) is too expensive during search: we need to do this offline.

  13. Disjoint Pattern Databases (Korf & Taylor, 2002) [Figure: two disjoint pattern partitions of the 15-puzzle, DPDB 1 and DPDB 2.] • Additive version of Pattern Databases (Culberson & Schaeffer, 1996), where the patterns are considered independently. • Manhattan is the simplest Disjoint Pattern DB: 1 tile = 1 pattern. DPDBs, unlike PDBs, always dominate Manhattan. • On the 15-puzzle they perform 75 times faster than non-additive PDBs, and DB generation is much easier because distances can be computed backwards by disarranging the patterns. • Different DPDBs can be combined by taking the maximum: global speedup over Manhattan = 2000, space reduction = 13000.
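A sketch of the additive online lookup, assuming the databases were built offline (e.g. by breadth-first search backwards from the goal). The `dbs` layout — each disjoint tile group mapped to a table keyed by the positions of its tiles — is a hypothetical structure for illustration:

```python
def dpdb_heuristic(board, dbs):
    """Sum the stored move counts of each disjoint pattern. Because every
    table counts moves of its own tiles only, the sum stays admissible."""
    pos = {tile: i for i, tile in enumerate(board)}
    total = 0
    for pattern, table in dbs.items():              # disjoint tile groups
        key = tuple(pos[t] for t in sorted(pattern))
        total += table[key]
    return total
```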

  14. DPDBs and the 24-puzzle • This technique solved the 24-puzzle between 1.1 and 21 times faster than classic Higher-Order Heuristics: avg. 2 days. • But in many cases it visits more nodes! • The technique evidently does not scale with problem dimensions. • Maintaining the same time complexity on the 35-puzzle would mean increasing the number of DB entries from 10¹³ to 10²⁸.

  15. Criticizing the classic approach • We believe it is more sensible to investigate the combination: “online” heuristics + relaxed admissibility. • A) Because rigid admissibility offers no way to face problems of greater & greater dimensions: • “Online” admissible heuristics → NP-hard in time • “Memory-based” admissible heuristics → NP-hard in space • B) Because admissibility is a sufficient condition for optimality, not a necessary one!

  16. “Admissible overestimations” • Some overestimations obviously don't affect optimality: • Constant overestimations • Overestimations outside the optimal path • Optimal-path overestimations coupled with overestimations in “brother” sub-branches. • In some domains further overestimations are admissible: • Uniform-cost problems: h < h* + c (“move games”) • Orthogonal single-piece-move problems: h < h* + 2c (“atomic Manhattan-space problems”, like the sliding-tile puzzle) • Simple experiment with the 8-puzzle and A* (sketched below): • Use the heuristic h' = hM + s with s variable. • If 0 < s < 2, search is optimal, but grows more inefficient as s → 2. • If s = 2, search can be suboptimal, and regains space efficiency.
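A sketch of that experiment, reusing the `a_star` and `manhattan` sketches above:

```python
def shifted_manhattan(s, n=3):
    """h'(x) = hM(x) + s. Since hM <= h*, any 0 < s < 2 keeps h' < h* + 2,
    so by the admissible-overestimation property of this domain A* stays
    optimal (while expanding more nodes as s approaches 2); at s = 2 the
    bound is lost and solutions may exceed the optimum."""
    return lambda board: manhattan(board, n) + s
```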

  17. Likely-Admissible Search • We relax the optimality requirement in a probabilistic sense (not qualitatively, as ε-admissible search does). • Why is it a better approach than ε-admissibility? • It allows us to retrieve TRULY OPTIMAL solutions. • It still allows us to change the nature of search complexity. • It allows us to study the complexity while pushing p asymptotically to 1. • Because search can rely on any heuristic, unlike ε-admissible search, which works only on already-proven-admissible ones. • Because we can better combine search with statistical machine learning techniques: using universal approximators we can automatically generate heuristics.

  18. – Likely-Admissible Search – A statistical framework • Any given non-admissible heuristic can be used. The only requisite is a prior statistical analysis of its overestimation frequencies. • We write P(h ≤ h*) for the probability that heuristic h underestimates h* for a given state x ∈ X. • We write p_h for the probability of optimally solving a problem using h and A*. • A main goal of the framework is to obtain p_h from P(h ≤ h*): WE WANT TO ESTIMATE OPTIMALITY FROM ADMISSIBILITY.

  19. – Likely-Admissible Search – Trivial case: single heuristic • Only the overestimations along the optimal path p* affect optimality, hence, given solution depth d: p_h ≥ P(h ≤ h*)^d (eq. 1) • Considering the admissible-overestimation theorem, in the sliding-tile puzzle domain this becomes: p_h ≥ P(h < h* + 2)^d (eq. 2)

  20. – Likely-Admissible Search – Effect of the Admissible Overestimations Th. • Underestimating h* + 2 is MUCH EASIER than underestimating h*! • The best heuristic generated for the 8-puzzle overestimated h* in 28.4% of cases, but h* + 2 in only 1.9%!

  21. – Likely-Admissible Search – Multiple Heuristics • To enrich the heuristic information we can generate many heuristics and use them simultaneously. • With j different heuristics we can take the smallest evaluation each time, in order to stress admissibility: h(x) = min_i h_i(x). • Thus: p_h ≥ [1 − ∏_i (1 − P(h_i < h* + 2))]^d (eq. 3) which, when all heuristics share the same underestimation probability, reduces to: p_h ≥ [1 − (1 − P(h < h* + 2))^j]^d (eq. 3b)
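The combination itself is a one-liner; the combined value underestimates h* + 2 whenever at least one of the j estimates does, which is exactly what eq. 3 exploits:

```python
def min_heuristic(heuristics):
    """Combine several learned heuristics by taking the smallest estimate."""
    return lambda board: min(h(board) for h in heuristics)
```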

  22. – Likely-Admissible Search – Multiple Heuristics • A common problem: we desire an optimality p_H… how many heuristics do we have to use to obtain it? For simplicity, assume all j heuristics have the same given P(h < h* + 2). Inverting eq. 3b: j = ⌈ log_(1 − P(h < h* + 2)) (1 − p_H^(1/d)) ⌉ (eq. 4) • j grows logarithmically with the term 1 / (1 − p_H^(1/d)), which grows both with d and with p_H (because d > 1 and p_H < 1).

  23. – Likely-Admissible Search – Some Examples • 8-puzzle: how many heuristics? • d ≈ 22 • Desired optimality 99.9% → p_H = 0.999 • Given heuristics have P(h < h* + 2) = 0.95 • j = log_0.05 (1 − 0.999^(1/22)) = log_0.05 0.0000455 ≈ 3.33 → 4 heuristics • 15-puzzle: how many heuristics? • d ≈ 53 • Same desired optimality • Given heuristics have P(h < h* + 2) = 0.93 • j = log_0.07 (1 − 0.999^(1/53)) = log_0.07 0.0000189 ≈ 4.1 → 5 heuristics
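Eq. 4 is easy to check mechanically; a small sketch reproducing the two slide examples:

```python
import math

def heuristics_needed(d, p, P):
    """Eq. 4: smallest j such that (1 - P)**j <= 1 - p**(1/d), where P is
    the per-heuristic probability of staying below h* + 2."""
    return math.ceil(math.log(1 - p ** (1 / d), 1 - P))

print(heuristics_needed(22, 0.999, 0.95))   # 8-puzzle example  -> 4
print(heuristics_needed(53, 0.999, 0.93))   # 15-puzzle example -> 5
```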

  24. – Likely-Admissible Search – Main Problems • Equations 3 and 3b assume: • INDEPENDENT PROBABILITY DISTRIBUTION: the overestimation probabilities of the competing heuristics h_j(x) are independently distributed over X. • Equation 2 assumes: • CONSTANT PROBABILITY: the underestimation probability P(h < h* + 2) is the same for all x, independently of h*(x). • Both assumptions are very strong: • We observed experimentally that ANN heuristics map X with similar overestimation probabilities. • We observed that the average error grows with h*, and the overestimation probability grows with it.

  25. – Likely-Admissible Search – Prediction capability • Eq. 3 is not usable directly, since it requires total independence. • Optimality growth seems roughly linear (not exponential) in the number of heuristics. It improves noticeably with learning over different datasets. • The trivial equation 2 gives a probabilistic lower bound on the effective search optimality: • Extremely precise if the estimate is above 80%. • Imprecise (but always pessimistic) for low predictions. • Optimistic predictions are very rare and stem from the CONSTANT PROBABILITY assumption. • Predictions are much more accurate than ε-admissible search predictions.

  26. – Likely-Admissible Search – Optimality prediction: 8-puzzle [Figure: predicted vs. measured optimality on the 8-puzzle.]

  27. – Likely-Admissible Search – Optimality prediction: 15-puzzle [Figure: predicted vs. measured optimality on the 15-puzzle.]

  28. Sub-symbolic heuristics • We used standard MLP networks. [Figure: the MLP heuristic at work — a board configuration is extracted as an example, encoded as a binary input vector (e.g. 01001000010…), the network returns the estimation h(n) on request, and backpropagation (BP) updates the weights through an error check.]

  29. – Sub-symbolic heuristics – Are sub-symbolic heuristics “online”? • We believe so, even though there is an offline learning phase, for 2 reasons: • 1. Nodes visited during search are generally UNSEEN. • Exactly as humans often do with learned heuristics: we don't recover a heuristic value from a database, we compute it using the inner rules that the heuristic provides. • 2. The learned heuristic should be dimension-independent: learning over small problems could be reused for bigger problems (e.g. 8-puzzle → 15-puzzle). This is not possible with memory-based heuristics.

  30. – Sub-symbolic heuristics – Outputs & Targets • Two options: • A) 1 linear output neuron • B) n [0/1] output neurons • A is much better. • Two possible targets: • A) “direct” target function → o(x) = h*(x) • B) “gap” target → o(x) = h*(x) − hM(x) (which takes advantage of Manhattan too) • Experiments: B improves on A only in bigger problems such as the 15-puzzle.

  31. – Sub-symbolic heuristics – Input coding [Figure: an 8-puzzle configuration with its three encodings.] • A (N² inputs): one-hot coding — input (N·k + t) is active if square k is occupied by value t. Example: 000000100 001000000 000010000 … • B (2N^(3/2) inputs): row/column coding — the row and column targets in block k of value t are high if square k is occupied by value t. Example: 001 100 100 001 010 010 100 010 … • C (2N inputs): for each square, compute the horizontal & vertical distances of its tile from the tile's goal square. Example: −2 0 0 −1 −1 +1 +1 −1 0 +1 0 0
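A sketch of coding C, the most compact of the three (2N inputs for N squares). The goal layout, and giving the blank the last square as its target, are assumptions for illustration:

```python
def coding_c(board, n=3):
    """For each square, the horizontal and vertical displacement of its
    current tile from that tile's goal square (blank aimed at the last one)."""
    inputs = []
    for pos, tile in enumerate(board):
        goal = tile - 1 if tile != 0 else len(board) - 1
        inputs.append(goal % n - pos % n)      # horizontal distance
        inputs.append(goal // n - pos // n)    # vertical distance
    return inputs
```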

  32. – Sub-symbolic heuristics – Learning Algorithm • Backpropagation with a new error function, instead of the classic E_d = o_d − t_d over example d. • We introduce a coefficient of asymmetry in order to stress admissibility: E_d = (1 − w)(o_d − t_d) if (o_d − t_d) < 0; E_d = (1 + w)(o_d − t_d) if (o_d − t_d) > 0, with 0 < w < 1. • The modified backprop minimizes E(W) = ½ Σ_{d∈D} r_d (o_d − t_d)², with r_d = (1 + w) or r_d = (1 − w). • We used a dynamically decreasing w, in order to stress underestimation while learning is easy and to relax it later. Momentum α = 0.8 helped smoothness.
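A minimal sketch of the asymmetric error and of the gradient term that backpropagation would actually propagate:

```python
def asymmetric_error(o, t, w):
    """0 < w < 1: overestimations (o > t) weigh (1 + w), underestimations
    (1 - w), so the regression is pushed toward underestimating."""
    r = (1 + w) if o > t else (1 - w)
    return 0.5 * r * (o - t) ** 2

def asymmetric_error_grad(o, t, w):
    """dE/do, the error signal fed into standard backpropagation."""
    r = (1 + w) if o > t else (1 - w)
    return r * (o - t)
```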

  33. – Sub-symbolic heuristics – Asymmetric Regression • This is a general idea for backpropagation learning. • It can suit any regression problem where overestimations harm more than underestimations (or vice versa). • Heuristic machine learning is an ideal application field. [Figure: symmetric vs. asymmetric error curves.]

  34. – Sub-symbolic heuristics – Dataset Generation • Examples are previously (optimally) solved configurations. • Few examples are sufficient for good learning: a few hundred already give faster search than Manhattan. • Experimental “ideal” sets: 8-puzzle → 10,000 examples; 15-puzzle → 25,000 (1/(500·10⁶) of the problem space!). • IMPORTANT: these examples have to be representative of the cases met in search trees, not of random cases! [see the 15-puzzle search-tree distribution] • Hence avg. h* should stay around d/2: over 60% of 15-puzzle examples have d < 30, about 80% have d < 45. • Dataset generation is much easier than expected, and fully parallelizable: generating two 25,000-example 15-puzzle datasets took 100 hours, half the learning time.

  35. – Sub-symbolic heuristics – Modifying estimations “a posteriori” • Using trunc() → mandatory for IDA*. • Adapting the value to Manhattan's parity: • increases IDA* efficiency by 30%; • does not harm admissibility, thanks to the “admissible overestimations” theorem. • Shifting to Manhattan* in search endings. • Maintaining dominance over Manhattan*. • Arbitrary estimation reduction.
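A sketch of the first two fixes. On the sliding-tile puzzle h* always has the same parity as the Manhattan distance (each move changes both by exactly ±1), so a learned estimate can be snapped to that parity; plain Manhattan stands in here for the slide's Manhattan*:

```python
import math

def adjust(h_ann, h_manh):
    """Truncate for IDA*, snap to Manhattan's parity, keep dominance."""
    h = math.trunc(h_ann)              # integer thresholds for IDA*
    if (h - h_manh) % 2 != 0:
        h += 1                         # wrong parity: round up one step
    return max(h, h_manh)              # never fall below Manhattan
```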

  36. Experimental Results: 8-puzzle using A* & single heuristics • Test set: 2000 random configurations. [Figure: bar chart of visited nodes (up to ~1400) and execution time (up to ~150 ms) for Manhattan, Conflict Deduction, 1 ANN, 1 ANN & asymmetric learning, and 1 ANN & “a posteriori” techniques; average solution lengths range from the optimal 21.97 up to 22.91.]

  37. Experimental Results: 8-puzzle using A* and multiple heuristics • Test set: 2000 random configurations.

  38. Experimental Results: 15-puzzle using IDA* and multiple heuristics • Test set: 700 random configurations (avg. d = 52.7; nodes visited with Manhattan = 3.7 × 10⁸).

  39. Experimental Results: some comparisons • Try the “demo” at: http://www.dii.unisi.it/~ernandes/samloyd/ • Compared to ε-admissible search: • WIDA* with w = 1.25 and h = “conflict deduction”: predicted d = 66, actual d = 54.49, nodes visited = 42374 • IDA* with 1 ANN: actual d = 54.45, nodes = 24711 • Compared to Manhattan: • IDA* with 1 ANN (optimality ≈ 30%): 1/1000 execution time, 1/15000 nodes visited • IDA* with 2 ANNs (opt. ≈ 50%): 1/500 time, 1/13000 nodes • IDA* with 4 ANNs (opt. ≈ 90%): 1/70 time, 1/2800 nodes • Compared to DPDBs: • IDA* with 1 ANN: between −17% and +13% nodes visited, between 1.4 and 3.5 times slower

  40. Conclusions • We defined a new framework of relaxed-admissible search: likely-admissible search. • This statistical framework is more appealing than ε-admissibility: • it relaxes the quantity of optimal solutions, not their quality • it works with any non-admissible heuristic • it can exploit statistical learning techniques • Likely-admissible search + sub-symbolic heuristics: • performance on the 15-puzzle can challenge DPDB heuristics • represents a way to speed up solving and avoid memory abuse while still retrieving optimal solutions.

  41. Further Work • 1: Generalization of the input coding, with two goals: • A) reduce the dimension of the input representation • B) allow learning across different problem dimensions • An idea: using graphs and recurrent ANNs to generate heuristics. • 2: Auto-feed Learning: • The system should be able to generate its own dataset automatically during learning, increasing complexity gradually. • 3: Network specialization: • Train and apply heuristics only over a certain domain of complexity (e.g. guided by Manhattan Distance) during search.

  42. 26-08-2004 Valencia Likely-Admissible & Sub-symbolic Heuristics THANK YOU FOR YOUR ATTENTION
