Search Algorithms for Discrete Optimization Problems
• Discrete Optimization - Basics
• Sequential Search Algorithms
• Parallel Depth-First Search
• Parallel Best-First Search
• Speedup Anomalies in Parallel Search Algorithms
Discrete Optimization Problems
• Discrete optimization problems (DOPs) are a class of computationally expensive problems of significant theoretical and practical interest.
• Characterized by a systematic search for a solution from among many potential solutions, subject to constraints.
• For many such problems, exhaustive search is infeasible; some form of directed search is performed instead.
• Examples from classical computer science:
  • The traveling salesman problem
  • The 0/1 knapsack problem
  • The n-queens problem
  • The 15- and 8-puzzles
• Examples in commerce, banking, and industry:
  • Financial forecasting
  • Airline fleet and crew assignment
  • VLSI chip layout
Solution Techniques for DOPs
• Several search and optimization techniques can be used on DOPs:
• Branch-and-bound (B&B) search:
  • May be described as finding the minimal value of a function f(x) over a set of admissible values of the argument x, called the feasible region.
  • A B&B procedure requires branching and bounding:
    • Branching: a smart way of covering the feasible region by several smaller feasible subregions (ideally, a partition into subregions).
    • Bounding: a fast way of finding upper and lower bounds for the optimal solution within a feasible subregion.
  • Recursive branching naturally results in a tree of subregions called a state-space tree.
  • The state-space tree may need to be pruned for increased efficiency.
• Dynamic programming:
  • Uses a top-down or bottom-up approach to reduce the runtime of algorithms having the following properties:
    • Overlapping subproblems: the problem can be broken down into subproblems that are reused several times (e.g., Fibonacci).
    • Optimal substructure: an optimal solution can be constructed efficiently from optimal solutions to subproblems (e.g., shortest paths).
  • Memoization may be used on overlapping subproblems for increased efficiency, as in the sketch below.
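A minimal sketch of the memoization idea, using the Fibonacci example mentioned above (Python is used for all sketches here; this is illustrative, not from the original slides):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    # Each subproblem fib(k) is solved once and then reused, turning the
    # exponentially branching recursion into O(n) distinct calls.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(40))  # 102334155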
Solution Techniques for DOPs
• Hill climbing
  • A graph search algorithm in which the current path is extended with a successor node that is closer to the solution (without retaining its siblings or parent).
  • Used widely in AI for reaching a goal state from a starting node.
  • Limitation: may find only a local maximum, failing to reach the goal, a global maximum.
• Simulated annealing (SA)
  • A generic probabilistic algorithm for locating a good approximation to the global optimum of a given function in a large search space.
  • Can be used to overcome the local-maxima problem of hill climbing.
• Genetic algorithms (GAs)
  • A search technique used to find approximate solutions to DOPs.
  • GAs use techniques inspired by evolutionary biology such as inheritance, mutation, natural selection, and recombination (or crossover).
  • Candidate solutions (called individuals) to an optimization problem evolve toward better solutions using fitness analyses.
Definitions
• A discrete optimization problem (DOP) is defined as a tuple (S, f):
  • S: the set of feasible states
  • f: a cost function f : S → R
• The objective of a DOP is to find a feasible solution xopt such that f(xopt) ≤ f(x) for all x ∈ S.
• Examples
  • 0/1 integer linear programming problem: given an m×n matrix A and vectors b and c, find a vector x such that
    • x contains only 0s and 1s,
    • Ax ≥ b, and
    • f(x) = xᵀc is minimized.
  • 8-puzzle problem: given an initial configuration of an 8-puzzle, find the shortest sequence of moves that leads to the final configuration.
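A brute-force sketch of the 0/1 integer programming problem just defined, enumerating all 2^n candidate vectors (the function name and the tiny instance are illustrative; the directed searches discussed below exist precisely to avoid this exhaustive enumeration):

```python
from itertools import product

def solve_01_ilp(A, b, c):
    """Exhaustively search all 0/1 vectors x with Ax >= b, minimizing x^T c."""
    m, n = len(A), len(A[0])
    best_x, best_cost = None, float("inf")
    for x in product((0, 1), repeat=n):
        feasible = all(
            sum(A[i][j] * x[j] for j in range(n)) >= b[i] for i in range(m)
        )
        if feasible:
            cost = sum(c[j] * x[j] for j in range(n))
            if cost < best_cost:
                best_x, best_cost = x, cost
    return best_x, best_cost

# Tiny instance: minimize x1 + 2*x2 subject to x1 + x2 >= 1.
print(solve_01_ilp([[1, 1]], [1], [1, 2]))  # ((1, 0), 1)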
Discrete Optimization: Example
Formulating DOPs as Graph Search
• The feasible space S is typically very large.
• Many DOPs can be formulated as finding a minimum-cost path in a graph.
• Nodes in the graph correspond to states; states are classified as either terminal or non-terminal.
• Some of the states correspond to feasible solutions, whereas others do not.
• Edges correspond to the costs of moving from one state to another.
• These graphs are called state-space graphs.
Examples of State-Space Graphs
Examples of State-Space Graphs
• 0/1 integer linear programming problem
  • States correspond to partial assignments of values to components of the vector x.
Admissible Heuristics
• Often, it is possible to estimate the cost of reaching the goal state from an intermediate state.
• This estimate, called a heuristic estimate, can be effective in guiding the search toward the solution.
• If the estimate is guaranteed to be an underestimate, the heuristic is called an admissible heuristic.
• Admissible heuristics have desirable properties in terms of optimality of the solution (as we shall see later).
• An admissible heuristic for the 8-puzzle is as follows:
  • Assume that each position in the 8-puzzle grid is represented as a pair.
  • The distance between positions (i, j) and (k, l) is defined as |i - k| + |j - l|. This distance is called the Manhattan distance.
  • The sum of the Manhattan distances between the current and final positions of all tiles is an admissible heuristic (see the sketch below).
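A possible implementation of this heuristic, assuming states are length-9 tuples in row-major order with 0 for the blank (the representation is an assumption, not fixed by the slides):

```python
def manhattan(state, goal):
    """Sum of Manhattan distances of all tiles from their goal positions.
    The blank (0) contributes nothing to the estimate."""
    dist = 0
    for tile in range(1, 9):
        i, j = divmod(state.index(tile), 3)
        k, l = divmod(goal.index(tile), 3)
        dist += abs(i - k) + abs(j - l)
    return dist

goal = (1, 2, 3, 4, 5, 6, 7, 8, 0)
print(manhattan((1, 2, 3, 4, 5, 6, 7, 0, 8), goal))  # 1: tile 8 is one move away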
Parallel Discrete Optimization: Motivation
• DOPs are generally NP-hard problems. Does parallelism really help much?
  • We cannot reduce their worst-case running time to a polynomial one.
  • For many problems, the average-case runtime is polynomial.
  • Often, we can find suboptimal solutions in polynomial time.
  • Bigger problem instances can be solved.
• Many DOPs have smaller state spaces but require real-time solutions.
• Parallel processing may be the only way to obtain acceptable performance!
Exploring the State-Space Graphs
• The solution is discovered by exploring the state-space graph, which is exponentially large.
• Heuristic estimates of the solution cost are used to guide the search:
  • The estimated cost of reaching a feasible solution through the current state x is l(x) = g(x) + h(x), where g(x) is the cost of reaching x and h(x) is the heuristic estimate of the remaining cost.
  • Admissible heuristics are heuristics that correspond to lower bounds on the actual cost.
  • The Manhattan distance is an admissible heuristic for the 8-puzzle problem.
• The idea is to explore the state-space graph using heuristic cost estimates to guide the search, and not to spend time exploring "bad" or "unpromising" states.
Search Space Structure: Trees vs. Graphs
• Is the search space a tree or a graph?
• Trees
  • Each new successor leads to an unexplored part of the search space.
  • Example: the search space of 0/1 integer linear programming.
• Graphs
  • A state can be reached along multiple paths.
  • Whenever a state is generated, it is necessary to check whether it has already been generated.
  • Example: the search space of the 8-puzzle.
• Next, we consider sequential solution strategies for DOPs formulated as tree or graph search problems.
Sequential Exploration Strategies
• Depth-First Search
  • Simple backtracking
    • Performs DFS until it finds the first feasible solution, then terminates.
    • Uses no heuristic to order the successors of the expanded node.
    • Not guaranteed to find an optimal solution.
    • Ordered backtracking uses heuristics to order successors.
  • Depth-first branch-and-bound (DFBB)
    • Partial solutions that are inferior to the current best solution are discarded.
    • Does not explore paths leading to inferior solutions.
  • Iterative deepening
    • The tree is expanded up to a certain depth; if no feasible solution is found, the depth bound is increased and the entire process is repeated.
    • Useful when a solution exists close to the root on an alternative branch.
  • Iterative deepening A* (IDA*)
    • Imposes a cost bound rather than a depth bound.
    • Finds an optimal solution when an admissible heuristic is used.
DFS Storage Requirements and Data Structures
• Suitable primarily for state-space graphs that are trees.
• At each step of DFS, untried alternatives must be stored for backtracking.
• If m is the amount of storage required to store a state and d is the maximum depth, then the total space requirement of the DFS algorithm is O(md).
• The state-space tree searched by DFS can be efficiently represented as a stack; the memory requirement of the stack is linear in the depth of the tree (see the sketch below).
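A minimal sketch of stack-based simple backtracking; `successors` and `is_goal` are assumed caller-supplied functions:

```python
def dfs_first_solution(root, successors, is_goal):
    """Simple backtracking: depth-first search that stops at the first
    feasible solution. The stack holds the untried alternatives, so the
    storage is O(md) for states of size m and maximum depth d."""
    stack = [root]
    while stack:
        state = stack.pop()            # expand the deepest untried state
        if is_goal(state):
            return state
        stack.extend(successors(state))
    return None                        # tree exhausted: no feasible solution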
Depth-First Search On 8-Puzzle
Representing a DFS Tree
Sequential Exploration Strategies
• Best-First Search (BFS)
  • BFS algorithms use a heuristic to guide the search.
  • If the heuristic is admissible, BFS finds the optimal solution.
  • Maintains OPEN/CLOSED lists.
• A* algorithm
  • The heuristic estimate is used to order the nodes in the open list.
  • Suitable for state-space graphs that are either trees or graphs.
  • Large memory complexity: proportional to the number of states visited.
Data Structures for BFS Algorithms
• The core data structure is a list, called the open list, that stores unexplored nodes sorted on their heuristic estimates.
• The best node is selected from the list, expanded, and its offspring are inserted at the right position.
• BFS of graphs must be slightly modified to account for multiple paths to the same node:
  • A closed list stores all the nodes that have been previously seen.
  • If a newly generated node already exists in the open or closed list with a better heuristic value, the node is not inserted into the open list.
The A* Algorithm
• A BFS technique that uses admissible heuristics.
• Defines the function l(x) for each node x as l(x) = g(x) + h(x), where g(x) is the cost of getting to node x and h(x) is an admissible heuristic estimate of getting from node x to the solution.
• The open list is sorted on l(x).
• The space requirement of BFS is exponential in the search depth! A sketch follows.
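A compact A* sketch along these lines; `successors` is assumed to yield (state, cost) pairs, and the 8-puzzle's Manhattan function above could serve as h:

```python
import heapq
import itertools

def a_star(start, successors, h, is_goal):
    """A* sketch: the open list is a heap ordered on l(x) = g(x) + h(x);
    `closed` records the best g(x) seen for each state (graph search)."""
    tie = itertools.count()                 # breaks ties between equal l values
    open_list = [(h(start), 0, next(tie), start)]
    closed = {}
    while open_list:
        l, g, _, state = heapq.heappop(open_list)
        if is_goal(state):
            return g                        # cost of an optimal solution
        if state in closed and closed[state] <= g:
            continue                        # already expanded more cheaply
        closed[state] = g
        for nxt, cost in successors(state):
            heapq.heappush(open_list, (g + cost + h(nxt), g + cost, next(tie), nxt))
    return None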
Best-First Search On 8-Puzzle
Search Overhead Factor
• The amount of work done by serial and parallel formulations of search algorithms is often different. Why?
• Let W be the serial work and Wp the parallel work.
• The search overhead factor is defined as Wp/W.
• An upper bound on the speedup is p × (W/Wp).
• Assumptions for the subsequent analyses:
  • The time to expand each node, tc, is the same.
  • W and Wp are the numbers of nodes expanded by the serial and the parallel formulations, respectively.
• From the above, the sequential run time is given by TS = tcW; we assume that tc = 1.
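For illustration (the numbers are hypothetical): if a serial search expands W = 1000 nodes while the parallel formulation expands Wp = 1250, the search overhead factor is 1250/1000 = 1.25, and the speedup on p = 16 processors is bounded above by 16 × (1000/1250) = 12.8.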
Parallel Depth-First Search: Challenges
• The computation is dynamic and unstructured.
  • Why dynamic? Why unstructured?
• Decomposition approaches? Do we do the same work as the sequential algorithm?
• Mapping approaches? How do we ensure load balance?
• Static partitioning of unstructured trees yields poor performance; dynamic load balancing is required.
Tree Search Can Be Unstructured
Parallel DFS: Dynamic Load Balancing
• The entire search space is assigned to one processor to begin with; unexplored states are conveniently stored as local stacks at the processors.
• When a processor runs out of work, it gets more work from another processor.
  • This trades communication cost for load balance.
  • It is done using work requests and responses on message-passing machines, and by locking and extracting work on shared-address-space machines.
• When a processor reaches a final state, all processors terminate.
Parallel Depth-First Search: Dynamic Load Balancing
Performance of Parallel DFS
• Two important issues determine the performance of parallel DFS:
  • Work-splitting strategies: which states from my stack should I give away?
  • Load-balancing strategies: which processor should I ask for work?
• Work-splitting strategies
  • Work is split by splitting the stack into two; ideally, neither of the split pieces should be small.
  • Select nodes near the bottom of the stack (node splitting), or
  • Select some nodes from each level (stack splitting).
  • The second strategy generally yields a more even split of the space (see the sketch below).
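One way to realize stack splitting is to donate alternate entries of the stack, so the donated nodes come from all levels; this is a sketch under that interpretation, not the only possible splitting rule:

```python
def split_work(stack):
    """Stack-splitting sketch: donate every other untried node, so the
    donated work spans all levels rather than only the nodes near the
    bottom of the stack (node splitting)."""
    donated = stack[::2]     # entries at even positions, across all levels
    kept = stack[1::2]       # entries at odd positions stay with the donor
    return kept, donated

kept, donated = split_work(["a", "b", "c", "d", "e"])
print(kept, donated)         # ['b', 'd'] ['a', 'c', 'e']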
Parameters in Parallel DFS: Work Splitting
Load-Balancing Schemes
• Whom do you request work from? Note: we would like to distribute work requests evenly, in a global sense.
• Asynchronous round robin (ARR):
  • Each processor maintains its own counter and makes requests in a round-robin fashion.
  • Work requests are generated independently by each processor.
• Global round robin (GRR):
  • The system maintains a global counter, and requests are made in a round-robin fashion globally.
  • Ensures that successive work requests are distributed evenly over all processors.
• Random polling (RP):
  • Request work from a randomly selected processor.
A sketch of the three donor-selection rules follows.
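A side-by-side sketch of the three rules; the shared GRR counter is simulated by a module-level variable (a real system would keep it in shared memory with atomic updates, and would also skip the requester itself):

```python
import random

GLOBAL_COUNTER = 0                       # simulates the system-wide shared counter

def next_donor(me, p, scheme, local_counter):
    """Pick the donor for one work request by processor `me` out of p.
    Returns the donor rank and the updated local (ARR) counter."""
    global GLOBAL_COUNTER
    if scheme == "ARR":                  # asynchronous round robin: private counter
        donor = local_counter
        local_counter = (local_counter + 1) % p
    elif scheme == "GRR":                # global round robin: one shared counter
        donor = GLOBAL_COUNTER
        GLOBAL_COUNTER = (GLOBAL_COUNTER + 1) % p
    else:                                # random polling
        donor = random.randrange(p)
    return donor, local_counter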
Termination Detection
• How do you know when everyone's done?
• A number of algorithms have been proposed.
Dijkstra's Token Termination Detection
• Assume that all processors are organized in a logical ring.
• Assume, for now, that work transfers can only happen from Pi to Pj if j > i.
• Processor P0 initiates a token on the ring when it goes idle.
• Each intermediate processor receives this token and forwards it when it becomes idle.
• When the token reaches processor P0, all processors are done.
Dijkstra's Token Termination Detection
Dijkstra's Token Termination Detection
• When processor P0 goes idle, it colors itself green and initiates a green token.
• If processor Pj sends work to processor Pi and j > i, then processor Pj becomes red.
• If processor Pi has the token and Pi is idle, it passes the token to Pi+1:
  • If Pi is red, the color of the token is set to red before it is sent to Pi+1.
  • If Pi is green, the token is passed unchanged.
  • After Pi passes the token to Pi+1, Pi becomes green.
• The algorithm terminates when processor P0 receives a green token and is itself idle. A simulation sketch follows.
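A single-round simulation of the token's circulation, assuming every processor is idle when it forwards the token (the data layout is illustrative):

```python
def token_circulation(colors):
    """One circulation of Dijkstra's token around the ring. colors[i] is
    'red' if Pi sent work to a lower-indexed processor since it last passed
    the token. Returns True if P0 gets a green token back (termination)."""
    token = "green"                      # P0 initiates a green token
    for i in range(len(colors)):
        if colors[i] == "red":
            token = "red"                # a red processor taints the token
        colors[i] = "green"              # each processor turns green after passing
    return token == "green"

print(token_circulation(["green", "red", "green"]))    # False: another round needed
print(token_circulation(["green", "green", "green"]))  # True: all done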
Dijkstra's Token Termination Detection
Tree-Based Termination Detection
• Associate weights with individual pieces of work. Initially, processor P0 has all the work and a weight of one.
• Whenever the work is partitioned, the weight is split in half and sent along with the work.
• When a processor finishes its work, it sends the weight back to its parent.
• Termination is signaled when the weight at processor P0 becomes 1 again.
• Note that underflow and finite precision are important issues in this scheme (see the sketch below).
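A sketch of the weight bookkeeping; exact rationals are used here to sidestep the underflow issue just noted (a real system might instead use scaled integers):

```python
from fractions import Fraction

weights = {0: Fraction(1), 1: Fraction(0)}   # P0 starts with all work, weight 1

def send_work(src, dst):
    half = weights[src] / 2                  # half the weight travels with the work
    weights[src] -= half
    weights[dst] += half

def finish(proc, parent):
    weights[parent] += weights[proc]         # return the weight to the parent
    weights[proc] = Fraction(0)

send_work(0, 1)                              # P0 -> P1: both now hold 1/2
finish(1, 0)                                 # P1 finishes and returns its weight
print(weights[0] == 1)                       # True: P0 back at weight 1 => done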
Tree-Based Termination Detection
Parallel Formulations of Depth-First Branch-and-Bound
• Parallel formulations of depth-first branch-and-bound (DFBB) search are similar to those of DFS.
• Each processor has a copy of the current best solution, which it uses as a local bound.
• When a processor finds a new solution, it compares its cost with the current best solution; if the cost is better, it broadcasts the new cost to all processors.
• If a processor's local best-solution cost is worse than the global best, only the efficiency of the search is affected, not its correctness.
Parallel Formulations of IDA*
• Common cost bound:
  • Each processor is given the same cost bound.
  • Processors use parallel DFS on the tree within the cost bound.
  • The drawback of this scheme is that there might not be enough concurrency.
• Variable cost bound:
  • Each processor works on a different cost bound.
  • The major drawback here is that a solution is not guaranteed to be optimal until all lower cost bounds have been exhausted.
• In each case, parallel DFS is the search kernel.
Parallel Best-First Search
• The core data structure is the open list (typically implemented as a priority queue).
• Each processor locks this queue, extracts the best node, and unlocks it.
• Successors of the node are generated, their heuristic functions are estimated, and the nodes are inserted into the open list as necessary, after appropriate locking.
• Since we expand more than one node at a time, we may expand nodes that would not be expanded by a sequential algorithm. A sketch follows.
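A sketch of one expansion step on a centralized open list protected by a lock; `successors` and `h` are assumed caller-supplied, and expansion deliberately happens outside the critical section:

```python
import heapq
import itertools
import threading

open_lock = threading.Lock()
open_list = []                        # shared heap of (l, g, tie, node)
tie = itertools.count()               # tie-breaker so nodes need not be comparable

def expand_one(successors, h):
    """Lock, extract the best node, unlock; expand; reinsert under the lock."""
    with open_lock:
        if not open_list:
            return None
        l, g, _, node = heapq.heappop(open_list)
    children = successors(node)       # the expensive expansion is not serialized
    with open_lock:
        for child, cost in children:
            heapq.heappush(open_list, (g + cost + h(child), g + cost, next(tie), child))
    return node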
Parallel BFS: Centralized Lists
Problems with Centralized Lists
• The termination criterion of sequential BFS fails for parallel BFS, since p nodes from the open list are being expanded concurrently.
• The open list is a point of contention, which can severely limit speedup even on shared-address-space architectures.
• Distributed open lists can help.
Parallel Best-First Search
• Avoid contention by having multiple open lists.
• Initially, the search space is statically divided across these open lists, and processors operate on them concurrently.
• Since the heuristic values of nodes in these lists may diverge significantly, we must periodically balance the quality of nodes across the lists.
• A number of quality-balancing strategies based on ring, blackboard, or random communication are possible.
Parallel Best-First Search
Parallel Best-First Graph Search
• Graph search involves a closed list, where the major operation is a lookup (on a key corresponding to the state).
• The classic data structure for this is a hash table.
• Hashing can be parallelized by using two functions: the first hashes each node to a processor, and the second hashes within that processor.
• This strategy can be combined with the idea of multiple open lists: if a node does not exist in the closed list, it is inserted into the open list at the target of the first hash function.
• In addition to facilitating lookup, the randomization also equalizes the quality of nodes in the various open lists. A sketch of the first-level hash follows.
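A sketch of the first-level hash; crc32 over a canonical encoding stands in for any hash that all processors compute identically (Python's built-in hash() is salted per process, so it would disagree across processors):

```python
import zlib

def home_processor(state, p):
    """First-level hash: map a state to the processor that owns its
    closed-list entry (and its open list, under the combined scheme)."""
    return zlib.crc32(repr(state).encode()) % p

# Duplicate check for a newly generated node s: send s to home_processor(s, p);
# that processor probes its local (second-level) hash table and, if s is new,
# inserts s into its own open list.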
Speedup Anomalies in Parallel Search
• Since the search space explored by each processor is determined dynamically at runtime, the actual work done can vary significantly across executions.
• Speedup anomalies on p processors:
  • Acceleration anomalies: executions yielding speedups greater than p.
  • Deceleration anomalies: executions yielding speedups less than p.
• Speedup anomalies also manifest themselves in best-first search algorithms: if the heuristic function is good, the work done in parallel best-first search is typically more than that of its serial counterpart.
Speedup Anomalies in Parallel Search