
Lecture 3. Pessimistic GP-Schema Theories



  1. Lecture 3 Pessimistic GP-Schema Theories. This lecture will cover the material in Ch. 4 of the Langdon-Poli text. It covers definitions and theorems relating to different types of schemata for Genetic Programming. All the previous extensions of schema theory from Genetic Algorithms to Genetic Programming have had severe problems: the size of the tree representing a program can vary from generation to generation, making any analogy with the simpler GA schema theory problematic at best.

  2. Lecture 3 Rosca’s Tree Schemata. This refers to a definition of schema that was part of the research leading up to Rosca’s PhD thesis. Definition. A schema is a contiguous tree fragment rooted at the root node of the tree. Semantically, a rooted tree schema is a set of programs. As an example (the accompanying picture is not included in the transcript), consider the schema (+ # x), where the # is matched by the subtree (- 2 x); the particular instance of the schema is the program (+ (- 2 x) x). Note that in a tree fragment (as in O’Reilly’s definition) the “don’t care” symbols can only appear in “leaf position”.
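To make the matching semantics concrete, here is a minimal sketch (not from the text; the nested-tuple representation and the function name are my own) of how a Rosca-style rooted schema can be tested against a program tree:

# Programs and schemata as nested tuples: ('+', ('-', 2, 'x'), 'x') stands
# for the program (+ (- 2 x) x); '#' is the "don't care" leaf symbol.

def matches(schema, program):
    """True if `program` instantiates the rooted schema `schema`."""
    if schema == '#':                      # a don't-care leaf matches any subtree
        return True
    if isinstance(schema, tuple) and isinstance(program, tuple):
        return (len(schema) == len(program) and
                all(matches(s, p) for s, p in zip(schema, program)))
    return schema == program               # two leaves: symbols must agree

# The slide's example: (+ # x) is instantiated by (+ (- 2 x) x).
assert matches(('+', '#', 'x'), ('+', ('-', 2, 'x'), 'x'))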

  3. Lecture 3 A reason for introducing rootedness as a characteristic of a schema is that it allows us to introduce a notion of position, which was missing from the previous definitions of schemata for GP. A consequence is that a given schema can be instantiated at most once in any given program. We introduce some notation (a small code sketch follows the list).
  • Let H denote a schema, and Pt the population (a multiset, or bag) of programs at generation t.
  • For a program h, let N(h) denote the size of the program, i.e., the number of nodes of the program tree.
  • For a schema H, let O(H) denote the order of the schema, i.e., the number of defining (non-wild-card) symbols it contains.
  • Let m(H, t) be the number of programs in Pt that instantiate schema H.
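A small continuation of the sketch above (same assumed representation) computing N(h), O(H) and m(H, t):

def size(tree):                            # N(h): number of nodes
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(size(c) for c in tree[1:])

def order(schema):                         # O(H): defining (non-'#') symbols
    if not isinstance(schema, tuple):
        return 0 if schema == '#' else 1
    return 1 + sum(order(c) for c in schema[1:])

def count_instances(schema, population):   # m(H, t), reusing `matches` above
    return sum(1 for h in population if matches(schema, h))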

  4. Lecture 3
  • Let f(H, t) denote the fitness of schema H in Pt. As usual, this is defined as the average fitness of the instances of H: f(H, t) = Σ_{h∈H∩Pt} f(h) / m(H, t).
  • Let f̄(t) be the average fitness of population Pt: f̄(t) = Σ_{h∈Pt} f(h) / M, where M is the population size.
  • Let pm denote the probability of mutation, and let pxo denote the probability of cross-over.
  In the absence of both mutation and cross-over, and in the presence of fitness proportionate reproduction, the expected number of programs matching schema H in population Pt+1 (i.e., the expected number of instances of H) will be given by:
  E[m(H, t+1)] = m(H, t) · f(H, t) / f̄(t).
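Plugging hypothetical numbers into the selection-only expectation (all values below are made up purely for illustration):

m_H, f_H, f_bar = 10, 6.0, 4.0             # m(H,t), f(H,t), average fitness
expected = m_H * f_H / f_bar               # E[m(H,t+1)] = 15.0
# an above-average schema (f_H > f_bar) is expected to gain instances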

  5. Lecture 3 We now introduce mutation and cross-over. We assume the events of mutation and cross-over to be mutually exclusive, which means that the probability that neither will occur is just 1 - pm - pxo, and that at least one will occur is pm + pxo. We now need to compute the probability of disruption of a schema in both cases. In either case, a schema is disrupted by altering a defined node in the schema. In mutation, the subtree attached at that node will be replaced by an arbitrary subtree; in cross-over, the subtree will be replaced by a subtree randomly chosen from the program represented by the second parent. Although some replacements may result in the schema not being altered at all, a “pessimistic assessment” will assume disruption each time such a replacement occurs.

  6. Lecture 3 A worst case assumption will then conclude that the probability that a given tree h, containing a schema H, will lead to the disruption of the schema is given by O(H)/N(h) = the number of defined symbols in the schema, divided by the total number of symbols in the tree: Pd(H, h, t) = O(H)/N(h). For each tree in Pt which is an instance of H (i.e., h ∈ Pt ∩ H), an upper bound on the probability of creating a “disrupted schema” is given by
  Pd(H, h, t) · f(h) / (m(H, t) · f(H, t)),
  where the fraction is just the relative frequency, among all programs that are instances of schema H, with which h will be chosen for reproduction (fitness proportionate reproduction).

  7. Lecture 3 Summing over all programs, we have the probability that schema H will be disrupted under either mutation or cross-over:
  Pd(H, t) = Σ_{h∈H∩Pt} Pd(H, h, t) · f(h) / (m(H, t) · f(H, t)).
  The total probability that schema H will be disrupted is bounded above by (pm + pxo) · Pd(H, t), and the total probability that it will not be disrupted from generation t to generation t+1 is bounded below by 1 - (pm + pxo) · Pd(H, t). The final formula, with everything thrown in:
  E[m(H, t+1)] ≥ m(H, t) · (f(H, t) / f̄(t)) · [1 - (pm + pxo) · Pd(H, t)].
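The whole pessimistic bound can be assembled in a few lines. This is a sketch under the same assumed representation as before (it reuses `matches`, `size` and `order` from the earlier sketches; `fitness` is a user-supplied function):

def rosca_disruption_bound(schema, population, fitness):
    """Pd(H, t): pessimistic probability that a genetic operation at a
    defining node disrupts H, averaged over the instances of H with
    fitness-proportionate weights."""
    insts = [h for h in population if matches(schema, h)]
    m_H = len(insts)
    f_H = sum(fitness(h) for h in insts) / m_H       # f(H, t)
    return sum((order(schema) / size(h)) * fitness(h) / (m_H * f_H)
               for h in insts)

def expected_instances(schema, population, fitness, p_m, p_xo):
    """Lower bound on E[m(H, t+1)] from Rosca's schema theorem."""
    insts = [h for h in population if matches(schema, h)]
    f_H = sum(fitness(h) for h in insts) / len(insts)
    f_bar = sum(fitness(h) for h in population) / len(population)
    p_d = rosca_disruption_bound(schema, population, fitness)
    return len(insts) * (f_H / f_bar) * (1 - (p_m + p_xo) * p_d)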

  8. Lecture 3 Rosca provides no definition of the notion of “length of a schema”, and the trees in the successive populations can still grow without bound. Schemata divide the space of all programs into subspaces containing programs of different “shapes” - as trees - and sizes. We now move to schemata that attempt to embody more of the properties of GA-schemata: the Fixed-Size-and-Shape Schemata. If we succeed, we should be able to obtain results closer to those of the GA schema theorems, and, especially, results that are more “usable” for predictive purposes.

  9. Lecture 3 An example (figure not included in the transcript).

  10. Lecture 3 We take our cue from binary-coded GAs (figure not included in the transcript).

  11. Lecture 3 Definition. A GP-schema is a rooted tree composed of nodes from the set F ∪ T ∪ {=}, where F and T are the function set and terminal set used in the current GP universe (a “run”), and the operator = is a polymorphic function with as many arities as the number of different arities of the elements of F ∪ T, the terminals in T being 0-arity functions. Ex.: with F = {+, -}, T = {x, y}, the schema H = (+ (- = y) =) has the following four programs as instances: (+ (- x y) x) (+ (- x y) y) (+ (- y y) x) (+ (- y y) y).
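The instance set of such a schema can be enumerated mechanically. A sketch (the representation and names are assumptions; here, as in the example, all functions are binary, so an internal = ranges over F and a leaf = over T):

from itertools import product

F, T = ['+', '-'], ['x', 'y']

def expand(schema):
    """Yield every program instantiating a fixed-size-and-shape schema."""
    if not isinstance(schema, tuple):                 # leaf position
        yield from (T if schema == '=' else [schema])
        return
    heads = F if schema[0] == '=' else [schema[0]]    # function position
    kids = [list(expand(c)) for c in schema[1:]]
    for head, combo in product(heads, product(*kids)):
        yield (head, *combo)

print(list(expand(('+', ('-', '=', 'y'), '='))))      # the four programs above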

  12. Lecture 3 Definition (Order). The number of non-= symbols is called the order O(H) of a schema H. Ex.: for H = (+ (- = y) =), O(H) = 3. Definition (Length). The total number of nodes in the schema is called the length N(H) of a schema H. Ex.: for H = (+ (- = y) =), N(H) = 5. Definition (Defining Length). The number of links in the minimum tree fragment including all the non-= symbols within a schema H is called the defining length L(H) of the schema. Ex.: the program (+ (- 2 y) x) gives rise to 2^5 = 32 schemata, one for each subset of its five nodes replaced by =. Some of them: (+ (- 2 y) x), (+ (- = y) =), (= (- 2 =) x), (= (= = =) =).
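Order and length are immediate to compute; the defining length takes a little care, since the minimal fragment is rooted at the deepest common ancestor of the defining nodes. A sketch (same assumed representation; paths are tuples of child indices):

def defining_length(schema):
    """L(H): number of links in the minimal tree fragment that spans all
    non-'=' nodes (rooted at their deepest common ancestor)."""
    def paths(node, prefix=()):
        # Yield the path (a tuple of child indices) of every defining node.
        sym = node[0] if isinstance(node, tuple) else node
        if sym != '=':
            yield prefix
        if isinstance(node, tuple):
            for i, child in enumerate(node[1:]):
                yield from paths(child, prefix + (i,))
    ps = list(paths(schema))
    k = 0
    while all(len(p) > k for p in ps) and len({p[k] for p in ps}) == 1:
        k += 1                                  # descend to the common ancestor
    edges = {p[:i] for p in ps for i in range(k + 1, len(p) + 1)}
    return len(edges)

# Ex.: defining_length(('+', ('-', '=', 'y'), '=')) == 2, the two links
# connecting the defining nodes +, - and y.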

  13. Lecture 3 Note: order, length and defining length of a schema are independent of the shape and size of programs in the population that instantiate the schema. Note: one of the claims made for standard GAs is that a schema can somehow be thought of as “sampling in parallel”. Although the arguments for the exploitation of “implicit parallelism” are weak at best (if not outright nonsense), we can observe that, in an “alphabet” consisting of two binary functions and two terminals, a schema H of order 5 (O(H) = 5: five defined positions in the tree) and length N(H) = 30 (i.e., 30 nodes in the tree, 25 “don’t care”s) corresponds to 2^(N(H) - O(H)) = 2^25 different programs: interior nodes can be filled by functions (2 choices each time), while leaf nodes can be filled by terminals (2 choices each time). Furthermore, a program of length N (N(h) = N) corresponds to 2^N schemata (there are 2^N subsets of a set of N nodes).
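The counting is easy to check directly (the numbers repeat the slide's example):

N_H, O_H = 30, 5
print(2 ** (N_H - O_H))   # 2^25 = 33554432 programs instantiate the schema
print(2 ** N_H)           # 2^30 schemata arise from one 30-node program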

  14. Lecture 3 Note: this definition of GP-schemata is somewhat lower level than any of those given in the previous lecture. The sense in which this is true is that schemata according to the definitions by Koza, Altenberg, O’Reilly and Whigham can be represented by sets of schemata as defined here, while some of these schemata cannot be represented via the other definitions. Ex.: O’Reilly’s schema H = [(+ # #)] in a program universe where all functions have arity 2 and all program trees have maximum depth 2, can be represented by our set of schemata {(+ = =), (= = (+ = =)), (= (+ = =) =), (= (+ = =) (= = =)), (= (= = =) (+ = =)), (+ = (= = =)), (+ (= = =) =), (+ (= = =) (= = =))}. With a different “alphabet” and more complex trees, the set just grows.

  15. Lecture 3 Recall that in O’Reilly’s definition, schemata are sets of subtrees and tree fragments, where tree fragments are trees with at least one leaf replaced by the “don’t care” symbol, which can be matched by any subtree - including those with just one node. The expression (= x y), which is a perfectly good schema in the current context, cannot be represented at all via O’Reilly’s definition. Rosca’s schemata can also be represented by sets of current schemata, while schemata whose defining nodes do not form a compact subtree (i.e., the minimum tree fragment linking all the defining nodes must contain “don’t care” nodes) cannot be represented via his definition. We now look at some different classes of schemata.

  16. Lecture 3 Definition (Hyperspaces and hyperplanes). A schema G is a hyperspace if it does not contain any defining nodes (i.e., O(G) = 0). A schema H is a hyperplane if it contains at least one defining node. The schema G(H) obtained by replacing all the defining nodes in hyperplane H with “don’t care” symbols is called the hyperspace associated with H. A graphical interpretation of this definition is given in the next figure.
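Computing the associated hyperspace is a one-line tree rewrite; a sketch under the same assumed representation:

def hyperspace(schema):
    """G(H): replace every defining node of H with the don't-care '='."""
    if not isinstance(schema, tuple):
        return '='
    return ('=', *(hyperspace(child) for child in schema[1:]))

# Ex.: hyperspace(('+', ('-', '=', 'y'), '=')) == ('=', ('=', '=', '='), '=')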

  17. Lecture 3 Picture (not included in the transcript).

  18. Lecture 3 Note: unless a size or depth limit is imposed on program trees, the number of hyperspaces is infinite. One could say that the “program universe” is infinite and can be decomposed as a disjoint union of hyperspaces. Under the assumption that both the function and terminal alphabets (sets F and T) are finite (which is the case in practice, although both alphabets are potentially infinite), each hyperspace is finite. If we consider each position in a schema as a “coordinate”, we can think of a schema of size n as an n-dimensional “space”, each dimension parametrized by the appropriate (function or terminal) alphabet. Fixing the value in one dimension is equivalent to fixing a hyperplane in a geometric interpretation. The next figure tries to show this.

  19. Lecture 3 Picture (not included in the transcript).

  20. Lecture 3 Point Mutation and One-Point Cross-over. In the case of GAs, point mutation replaced one position in the chromosome with another value from the acceptable alphabet. In this case, the analogue is the replacement of a function by a function of the same arity, or the replacement of a terminal by another terminal. Cross-over (in GAs) requires taking two parents (chromosomes) and deciding at what point the chromosomes are “broken”, with the tail of the first parent’s chromosome replaced by the tail of the second parent’s (and, possibly, vice-versa, to obtain two offspring). In GP, the idea up to this point has been to arbitrarily pick a node (internal or otherwise) of the first parent, pick an arbitrary node of the second, and attach the subtree rooted at that node in place of the subtree at the node chosen for the first parent.
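A sketch of arity-preserving point mutation (the function/terminal sets and the per-node rate p_m are illustrative assumptions, as is the representation):

import random

FUNCS = {2: ['+', '-']}                    # functions grouped by arity
TERMS = ['x', 'y']

def point_mutate(tree, p_m=0.05):
    """Replace nodes with same-arity alternatives, so the shape of the
    tree never changes."""
    if not isinstance(tree, tuple):        # terminal: swap for a terminal
        return random.choice(TERMS) if random.random() < p_m else tree
    head = tree[0]
    if random.random() < p_m:              # function: swap within same arity
        head = random.choice(FUNCS[len(tree) - 1])
    return (head, *(point_mutate(child, p_m) for child in tree[1:]))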

  21. Lecture 3 This choice leads, among other consequences, to the arbitrary growth of the offspring in size and depth: there is no way to bound the offspring other than concluding that the depth of the offspring is no worse than the sum of the depths of the parents. Combined with the possibility of multiple instantiations of a schema by the same tree, we lose control of any simple (= easily computable) intuition about consequences (as we saw in several of the schema theorems we derived). We are going to choose a method that leaves us with a tree of depth no larger than that of the deeper of the parents. The procedure is presented on the next slide, starting with descriptions of its component steps.

  22. Lecture 3 One-point cross-over proceeds in three steps (a code sketch follows the list).
  • Alignment. Copies of the two parent trees are recursively (jointly) traversed, starting from the root nodes, to identify the parts with the same shape (= the same arity in the nodes visited). Recursion is stopped along each path as soon as an arity mismatch is found (or the end is reached). All links so found are stored.
  • Cross-over point selection. A random cross-over point is selected with uniform probability among the stored links.
  • Swap. The two subtrees below the common cross-over point are swapped, in the same way as in standard cross-over.
  The figure on the next slide gives an example and shows the relationship to “binary GA” mechanisms.
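A sketch of the three steps in code (same assumed representation; links are identified by paths of child indices):

import random

def common_links(a, b, path=()):
    """Alignment: yield the links lying in the common region of a and b
    (traversal stops along a path at the first arity mismatch)."""
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for i, (ca, cb) in enumerate(zip(a[1:], b[1:])):
            yield path + (i,)              # the link down to the i-th child
            yield from common_links(ca, cb, path + (i,))

def subtree_at(tree, path):
    for i in path:
        tree = tree[i + 1]
    return tree

def replace_at(tree, path, new):
    if not path:
        return new
    kids = list(tree[1:])
    kids[path[0]] = replace_at(kids[path[0]], path[1:], new)
    return (tree[0], *kids)

def one_point_xover(a, b):
    links = list(common_links(a, b))
    if not links:                          # roots already mismatch in arity
        return a, b
    cut = random.choice(links)             # uniform choice of cross-over point
    return (replace_at(a, cut, subtree_at(b, cut)),
            replace_at(b, cut, subtree_at(a, cut)))   # swap below the point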

  23. Lecture 3 Ex. (figure not included in the transcript).

  24. Lecture 3 Note. Features of one-point-cross-over. 1. Assume we start with an arbitrary finite population. There will be at least one tree of minimum depth and at least one of maximum depth. The mechanism just described will under no circumstances generate trees shallower or deeper than these two - thus the tree depth is bounded in both directions by that of the initial population. This means, at a minimum, that the wrong choice of initial population may keep us from reaching an acceptable solution (at least if we use only point mutation and one-point-cross-over): we get just the best approximation in a bounded portion of the search space.

  25. Lecture 3 2. The population should tend to converge. This follows from the fact that, with a random initial population, only the very tops of the program trees will result - on average - in matches, which means that, through several initial generations one-point-cross-overs will occur almost exclusively “at the top”. Only after enough of the population has converged to “matching tops” is it possible for one-point-cross-overs to begin working in the lower levels. Once the tops are “fixed”, in the absence of point mutation, they will not change, since swapped subtrees will be attached to the same points of the same patterns. Eventually, the match propagates down the trees, leading to a convergence of the population “from the top down”.

  26. Lecture 3 3. The offspring of one-point-cross-over inherits the common upper structure of its parents. Crossing a program with itself will only result in the same program being reproduced. This property is called “closure under cross-over”: one-point-cross-over applied to two programs sampling the same schema results in programs also sampling the same schema. This property has a search-space reduction effect, directing search towards regions of the search space that show consistently high fitness. 4. We can do the calculations to model GP schema disruption… And we now turn to a theorem that describes the disruption and survival of GP schemata.

  27. Lecture 3 Fitness Proportionate Selection. The first thing we compute is the effect of fitness proportionate selection. If {h∈H} denotes the event “a program h sampling schema H is selected”; if m(H, t) denotes the number of programs in a population Pt of size M which instantiate schema H; if f(H, t) denotes the fitness of schema H in the population Pt, i.e., Σ_{h∈H} f(h) / m(H, t); if f̄(t) denotes the average fitness of a program in Pt; if f(h) denotes the fitness of a program h; then the probability that a program h sampling schema H is selected is given by the formula
  Pr{h∈H} = m(H, t) · f(H, t) / (M · f̄(t)).
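In code, this probability is just a ratio of fitness sums. A self-contained sketch (the = wild-card here follows this section's definition, unlike the '#' matcher sketched earlier; names and representation are still assumptions):

def samples(schema, program):
    """True if `program` samples the fixed-size-and-shape schema `schema`
    ('=' matches any symbol; shapes must coincide exactly)."""
    s_tup, p_tup = isinstance(schema, tuple), isinstance(program, tuple)
    if s_tup != p_tup or (s_tup and len(schema) != len(program)):
        return False
    if not s_tup:
        return schema == '=' or schema == program
    head_ok = schema[0] == '=' or schema[0] == program[0]
    return head_ok and all(samples(s, p)
                           for s, p in zip(schema[1:], program[1:]))

def p_select(schema, population, fitness):
    """Pr{h in H} = m(H,t) f(H,t) / (M f_bar(t)): the fitness mass of the
    instances of H divided by the total fitness mass of the population."""
    total = sum(fitness(h) for h in population)
    return sum(fitness(h) for h in population if samples(schema, h)) / total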

  28. Lecture 3 This indicates that, if H has better than average fitness, it will be selected with higher probability (exponentially so) from one generation to the next. Unfortunately, the finiteness of the population in each generation (and the fact that it changes from one generation to the next) means that the mean fitness changes as well. The result is thus more wishful thinking than sound theory… The next problem is that we have to quantify the effects of both cross-over and mutation. How can we quantify the probability of disruption?

  29. Lecture 3 One-Point Cross-over. There are two ways in which a schema H can be disrupted (= the offspring of a schema instance is no longer an instance of the same schema) by one-point cross-over. Dc1(H) and Dc2(H) will denote the events “H is disrupted when a program h matching H is crossed over with a second program ĥ”, in cases c1 and c2 respectively. Case c1. A program h is crossed over with a program ĥ of a different shape. The notion of “shape” is simply that of the associated hyperspace G(H), sampled by all programs with the same shape as the instances of H. The event just described can also be expressed as the joint event Dc1(H) = {Dc(H), ĥ ∉ G(H)}. Two examples are shown on the next slide.

  30. Lecture 3 Ex. (figure not included in the transcript). Another example: let H = (+ = =), h = (+ x y), ĥ = (+ z (+ 2 3)). Swapping (+ 2 3) with y produces (+ x (+ 2 3)), which has a different shape from G(H) = (= = =). So the new program does not sample G(H) and, therefore, cannot match H.

  31. Lecture 3 We now try to compute the probability of Dc1: Pr{Dc1(H)} = Pr{Dc(H) | ĥ ∉ G(H)} · Pr{ĥ ∉ G(H)}. We start by computing the second term: Pr{ĥ ∉ G(H)} = 1 - Pr{ĥ ∈ G(H)}. Since the parents are selected independently, the probabilities associated with the second parent, ĥ, are the same as those for the first, h. Since we are using fitness proportionate reproduction (with replacement), we have
  Pr{ĥ ∈ G(H)} = m(G(H), t) · f(G(H), t) / (M · f̄(t)).
  The first term is harder to compute and will be denoted by pdiff(t). See the next slide for an example of why the complications arise. pdiff(t) will be interpreted as the fragility of a schema w.r.t. the shapes of the other schemata in the population.

  32. Lecture 3 Ex. (figure not included in the transcript).

  33. Lecture 3 The second way leading to disruption of a schema H involves crossing a program h∈H over with a program ĥ with the same structure as h (i.e., ĥ ∈ G(H)), but which does not sample H (i.e., ĥ ∉ H). We first note that (as observed before) schemata are closed under one-point cross-over: if two programs sampling the same schema undergo one-point cross-over, the resulting program (in the absence of mutation) samples the same schema. So we have to account for programs with the same shape (ĥ ∈ G(H)), but sampling a different schema (ĥ ∉ H): Dc2(H) = {Dc(H), ĥ ∉ H, ĥ ∈ G(H)} describes the appropriate joint event. We can express its probability as Pr{Dc2(H)} = Pr{Dc(H) | ĥ ∉ H, ĥ ∈ G(H)} · Pr{ĥ ∉ H, ĥ ∈ G(H)}. We can see this in the figure on the next slide.

  34. Lecture 3 The second term, Pr{ĥ ∉ H, ĥ ∈ G(H)}, is just the probability that the second parent does not match H, but has the same shape G(H) as the first parent. Since H ⊆ G(H), we also have that {ĥ ∈ H} ⊆ {ĥ ∈ G(H)}. We can re-write these ideas as Pr{ĥ ∉ H, ĥ ∈ G(H)} = Pr{ĥ ∈ G(H)} - Pr{ĥ ∈ H}. Pictorially, this is shown on the next slide:

  35. Lecture 3 Picture for the earlier discussion (not included in the transcript).

  36. Lecture 3 We can use the formulae for fitness proportionate selection to compute:
  Pr{ĥ ∉ H, ĥ ∈ G(H)} = m(G(H), t) · f(G(H), t) / (M · f̄(t)) - m(H, t) · f(H, t) / (M · f̄(t)).
  Consider now the first part of the formula giving Pr{Dc2(H)}: Pr{Dc(H) | ĥ ∉ H, ĥ ∈ G(H)}.

  37. Lecture 3 Let B(H) be the event that the cross-over point is between the defining nodes (the non-= ones) of H (= it occurs on one of the links in the minimal subtree that contains all the non-= nodes). The probability of this event is given by the quotient of the number of links in this minimal tree (the defining length of H) divided by the total number of links in H: Pr{B(H)} = L(H)/(N(H) - 1). This probability can be interpreted as the intrinsic fragility of the node composition of the schema. Since the occurrence of B(H) is necessary but not sufficient for disruption (not all such cross-overs will disrupt the schema), we have Pr{Dc(H) | ĥ ∉ H, ĥ ∈ G(H)} ≤ Pr{B(H)}.

  38. Lecture 3 The table giving an example of such a disruption is not included in the transcript.

  39. Lecture 3 Combining this with fitness proportionate selection we have:
  Pr{Dc2(H)} ≤ (L(H)/(N(H) - 1)) · (m(G(H), t) · f(G(H), t) - m(H, t) · f(H, t)) / (M · f̄(t)).
  Since the events Dc1(H) and Dc2(H) are mutually exclusive, we have an upper bound on Pr{Dc(H)}:
  Pr{Dc(H)} ≤ pdiff(t) · (1 - m(G(H), t) · f(G(H), t) / (M · f̄(t))) + (L(H)/(N(H) - 1)) · (m(G(H), t) · f(G(H), t) - m(H, t) · f(H, t)) / (M · f̄(t)).

  40. Lecture 3 Considering that the probability of cross-over is 0 ≤ pxo ≤ 1, the probability of schema disruption due to cross-over is just pxo · Pr{Dc(H)}. Point Mutation. Let pm be the probability of a point mutation (per node). The mutation will disrupt the schema only if it occurs at one of the “fixed” nodes, of which there are O(H): Pr{Dm(H)} = 1 - (1 - pm)^O(H).

  41. Lecture 3 We can now glue all of these effects together: E[m(H, t+1)] ≥ M · Pr{h∈H} · (1 - Pr{Dm(H)}) · (1 - pxo · Pr{Dc(H)}). Replacing the expressions by more detailed ones: Theorem. In a generational GP with fitness proportionate selection, one-point cross-over and point mutation:
  E[m(H, t+1)] ≥ m(H, t) · (f(H, t)/f̄(t)) · (1 - pm)^O(H) · {1 - pxo · [pdiff(t) · (1 - m(G(H), t) · f(G(H), t)/(M · f̄(t))) + (L(H)/(N(H) - 1)) · (m(G(H), t) · f(G(H), t) - m(H, t) · f(H, t))/(M · f̄(t))]}.

  42. Lecture 3 In the case of fitness proportionate selection with replacement, we have Pr{h∈H} = m(H, t) · f(H, t)/(M · f̄(t)) and Pr{h∈G(H)} = m(G(H), t) · f(G(H), t)/(M · f̄(t)). In the case of an arbitrary “selection with replacement” strategy where Pr{h∈H} = p(H, t), the theorem can be restated via the formula:
  E[m(H, t+1)] ≥ M · p(H, t) · (1 - pm)^O(H) · {1 - pxo · [pdiff(t) · (1 - p(G(H), t)) + (L(H)/(N(H) - 1)) · (p(G(H), t) - p(H, t))]}.
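To see the bound in action, one can plug numbers into the restated theorem; everything below is hypothetical, chosen only to exercise the formula:

M, p_m, p_xo = 500, 0.01, 0.9             # population size, mutation, xover
p_H, p_G, p_diff = 0.02, 0.05, 0.8        # p(H,t), p(G(H),t), pdiff(t)
O_H, N_H, L_H = 3, 5, 2                   # order, length, defining length

p_dc = p_diff * (1 - p_G) + (L_H / (N_H - 1)) * (p_G - p_H)
bound = M * p_H * (1 - p_m) ** O_H * (1 - p_xo * p_dc)
print(round(bound, 2))                    # lower bound on E[m(H, t+1)]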

  43. Lecture 3 The formulae above are somewhat more complicated than those for GAs - which is reasonable, since the latter have no structural changes from one generation to the next: the chromosomes are of fixed length and fixed interpretation, with all genes in fixed positions. The essentially unbounded and arbitrary tree representation that programs require introduces many more complications. At this point, several difficulties have been swept under the rug, but we have enough to provide some guesses as to what might be going on (note all the qualifiers…) as the generations succeed one another.

  44. Lecture 3 Early stages of evolution. The initial population, being a very small random sample of the universe of programs, is likely to contain programs of many different shapes: this simply means that the number of programs that sample the same schema is proportionately very small. It is thus extremely unlikely that two trees h and h’, of different shape, will exchange, by cross-over, a subtree of exactly the same shape. This can be interpreted as saying that pdiff is, approximately, 1. Furthermore, few shapes are likely to be represented by more than one program, so the probability of schema disruption due to cross-over is very large (even before the effects of Dc2(H) events - disruption between programs of the same shape).

  45. Lecture 3 Notice that our discussion left schema construction out of consideration: at the beginning of the run we would expect new schemata to be constructed, as well as old schemata receiving more instances (if successful). Since schema construction was left out, one might expect our estimate for m(H, t+1) to be quite conservative. Later Stages of Evolution. With a small mutation rate and just one-point cross-over, one would expect that, eventually, the population will begin to converge. This implies that the diversity of shapes (and sizes) will decrease. In turn this will make the common part between programs undergoing cross-over larger, and it will include more and more terminals. Thus the probability of swapping subtrees with the same structure will increase, and the fragility of the schema (pdiff) will decrease.

  46. Lecture 3 This also means that the number of different program structures appearing in the population will decrease and that Pr{ĥ ∈ G(H)} will become larger. As we examine the formula, this would indicate that the contribution of the term corresponding to cross-over disruption between trees of the same shape and size (Pr{Dc2(H)}) will become larger. Eventually, the whole population will correspond to just one shape. From then on, the evolution will be, essentially, the same as that for GAs.
