**Genetic Algorithms (Evolutionary Computing)** • Genetic Algorithms are used to try to “evolve” the solution to a problem • Generate prototype solutions called chromosomes (individuals) • Backpack problem as example: • http://home.ksp.or.jp/csd/english/ga/gatrial/Ch9_A2_4.html • All individuals form the population • Generate new individuals by reproduction • Use a fitness function to evaluate individuals • Survival of the fittest: population has a fixed size • Individuals with higher fitness are more likely to reproduce
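The loop described on this slide can be sketched for a small backpack (knapsack) instance. This is only an illustration under stated assumptions: the item list, the capacity, the tournament selection rule, and the names `fitness`, `pick_parent`, `make_child`, and `evolve` are all hypothetical and not taken from the linked example.

```python
import random

# Hypothetical backpack instance: (weight, value) per item, plus a capacity.
ITEMS = [(12, 4), (2, 2), (1, 1), (1, 2), (4, 10)]
CAPACITY = 15

def fitness(chromosome):
    """Total value of the packed items, or 0 if the pack is overweight."""
    weight = sum(w for (w, _), gene in zip(ITEMS, chromosome) if gene)
    value = sum(v for (_, v), gene in zip(ITEMS, chromosome) if gene)
    return value if weight <= CAPACITY else 0

def pick_parent(population):
    """Fitter individuals are more likely to reproduce (tournament of two)."""
    a, b = random.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def make_child(p1, p2, mutation_rate=0.1):
    """One-point crossover followed by an occasional single-gene flip."""
    cut = random.randrange(1, len(p1))
    child = p1[:cut] + p2[cut:]
    if random.random() < mutation_rate:
        i = random.randrange(len(child))
        child[i] = 1 - child[i]
    return child

def evolve(pop_size=20, generations=50, seed=0):
    random.seed(seed)
    population = [[random.randint(0, 1) for _ in ITEMS] for _ in range(pop_size)]
    for _ in range(generations):
        children = [make_child(pick_parent(population), pick_parent(population))
                    for _ in range(pop_size)]
        # Survival of the fittest: the population keeps a fixed size.
        population = sorted(population + children, key=fitness, reverse=True)[:pop_size]
    return population[0]

best = evolve()
```

Each chromosome is a list of 0/1 genes, one per item, so `fitness(best)` is the value of the best pack found. Penalizing overweight packs with a fitness of 0 is one simple design choice among several.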

**Reproduction Methods** • Mutation • Alter a single gene in the chromosome randomly to create a new chromosome • Example • Cross-over • Pick a random location within chromosome • New chromosome receives first set of genes from parent 1, second set from parent 2 • Example • Inversion • Reverse the chromosome
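The three reproduction methods can be sketched on bit-string chromosomes. The representation (a list of 0/1 genes) and the function names are assumptions made for illustration only.

```python
import random

def mutate(chromosome, rng=random):
    """Flip one randomly chosen gene; the parent is left unchanged."""
    child = list(chromosome)
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child

def crossover(parent1, parent2, rng=random):
    """Genes before a random cut come from parent 1, the rest from parent 2."""
    cut = rng.randrange(1, len(parent1))
    return parent1[:cut] + parent2[cut:]

def invert(chromosome):
    """Reverse the whole chromosome."""
    return chromosome[::-1]
```

For instance, `invert([1, 0, 0, 1, 1])` gives `[1, 1, 0, 0, 1]`, while `mutate` and `crossover` return a different child on each call because the gene and the cut are chosen at random.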

**Interpretation** • Genetic algorithms try to solve a hill climbing problem • Method is parallelizable • The trick is in how you represent the chromosome • Tries to avoid local maxima by keeping many chromosomes at a time

**Another Example: Traveling Sales Rep Problem** • How to represent a chromosome? • What effects does this have on crossover and mutation?

**TSP** • Chromosome: Ordering of city numbers • (1 9 2 4 6 5 7 8 3) • What can go wrong with crossover? • To fix, use order crossover technique • Take two chromosomes, and take two random locations to cut • p1 = (1 9 2 | 4 6 5 7 | 8 3) • p2 = (4 5 9 | 1 8 7 6 | 2 3) • Goal: preserve as much as possible of the orderings in the chromosomes
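One way to see what can go wrong is to apply plain one-point crossover to the two parents above. The sketch below assumes a cut after the third city; the helper names are hypothetical.

```python
p1 = [1, 9, 2, 4, 6, 5, 7, 8, 3]
p2 = [4, 5, 9, 1, 8, 7, 6, 2, 3]

def one_point_crossover(parent1, parent2, cut):
    """Plain crossover: head of parent 1 followed by tail of parent 2."""
    return parent1[:cut] + parent2[cut:]

def is_valid_tour(chromosome, n_cities=9):
    """A tour must visit every city exactly once."""
    return sorted(chromosome) == list(range(1, n_cities + 1))

child = one_point_crossover(p1, p2, cut=3)
repeated = sorted({city for city in child if child.count(city) > 1})
missing = sorted(set(p1) - set(child))
```

Here `child` visits some cities twice (`repeated`) and skips others (`missing`), so it is no longer a legal ordering of the cities. That is the motivation for a crossover that respects the permutation structure.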

**Order Crossover** • p1 = (1 9 2 | 4 6 5 7 | 8 3) • p2 = (4 5 9 | 1 8 7 6 | 2 3) • New p1 will look like: • c1 = (x x x | 4 6 5 7 | x x) • To fill in c1, first produce ordered list of cities from p2, starting after cut, eliminating cities in c1 • 2 3 9 1 8 • Drop them into c1 in order • c1 = (2 3 9 4 6 5 7 1 8) • Do similarly in reverse to obtain • c2 = (3 9 2 1 8 7 6 4 5)
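The procedure on this slide can be sketched as follows. The function name and argument names are hypothetical. The sketch follows the slide's convention of dropping the remaining cities into the empty slots from left to right; other descriptions of order crossover start filling after the second cut instead.

```python
def order_crossover(keep_from, fill_from, cut1, cut2):
    """Keep keep_from[cut1:cut2] in place and fill the other slots,
    left to right, with the remaining cities in the order they appear
    in fill_from when read starting after the second cut."""
    kept = keep_from[cut1:cut2]
    rotated = fill_from[cut2:] + fill_from[:cut2]
    fillers = [city for city in rotated if city not in kept]
    # cut1 slots lie to the left of the kept section; the rest go after it.
    return fillers[:cut1] + kept + fillers[cut1:]

p1 = [1, 9, 2, 4, 6, 5, 7, 8, 3]
p2 = [4, 5, 9, 1, 8, 7, 6, 2, 3]
c1 = order_crossover(p1, p2, 3, 7)  # [2, 3, 9, 4, 6, 5, 7, 1, 8]
c2 = order_crossover(p2, p1, 3, 7)  # [3, 9, 2, 1, 8, 7, 6, 4, 5]
```

Both children are valid tours, and each keeps one parent's middle section intact while inheriting the relative order of the other cities from the second parent.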

**Mutation & Inversion** • What can go wrong with mutation? • What is wrong with inversion?

**Mutation & Inversion** • Redefine mutation as picking two random spots in path, and swapping • p1 = (1 9 2 4 6 5 7 8 3) • c1 = (1 9 8 4 6 5 7 2 3) • Redefine inversion as picking a random middle section and reversing: • p1 = (1 9 2 | 4 6 5 7 8 | 3) • c1 = (1 9 2 | 8 7 5 6 4 | 3) • Another example: • http://home.online.no/~bergar/mazega.htm
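A minimal sketch of the redefined operators, reproducing the two examples on this slide. The function names and the optional position arguments are assumptions added so the examples can be reproduced exactly; when the positions are omitted they are chosen at random.

```python
import random

def swap_mutation(path, i=None, j=None):
    """Swap the cities at two positions (random if not given)."""
    if i is None or j is None:
        i, j = random.sample(range(len(path)), 2)
    child = list(path)
    child[i], child[j] = child[j], child[i]
    return child

def segment_inversion(path, start=None, end=None):
    """Reverse the section path[start:end] (random if not given)."""
    if start is None or end is None:
        start, end = sorted(random.sample(range(len(path) + 1), 2))
    return path[:start] + path[start:end][::-1] + path[end:]

p1 = [1, 9, 2, 4, 6, 5, 7, 8, 3]
mutated = swap_mutation(p1, 2, 7)       # [1, 9, 8, 4, 6, 5, 7, 2, 3]
inverted = segment_inversion(p1, 3, 8)  # [1, 9, 2, 8, 7, 5, 6, 4, 3]
```

Both operators only rearrange cities, so the child is always a valid tour.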

**Reinforcement Learning** • Game playing: So far, we have told the agent the value of a given board position. • How can an agent learn which board positions are important? • Play a whole bunch of games, and receive reward at end (+ or -) • How do you determine utility of states that aren’t ending states?

**The setup: Possible game states** • Terminal states have reward • Mission: Estimate utility of all possible game states

**Passive Learning** • Agent learns by “watching” • Fixed probability of moving from one state to another

**Sample Results**

**Technique #1: Naive Updating** • Also known as Least Mean Squares (LMS) approach • Starting at home, obtain sequence of states to terminal state • Utility of terminal state = reward • Loop back over all other states • Utility for state i = running average of all rewards seen for state i
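The running-average update can be sketched as below, assuming each game is recorded as a list of state names ending in a terminal state, with a single reward received at the end. The state names and the function name `naive_update` are hypothetical.

```python
from collections import defaultdict

utilities = defaultdict(float)  # running average of rewards per state
counts = defaultdict(int)       # how many times each state has been seen

def naive_update(trajectory, reward):
    """Credit the final reward to every state visited in this game."""
    for state in trajectory:
        counts[state] += 1
        # Incremental form of the running average.
        utilities[state] += (reward - utilities[state]) / counts[state]

naive_update(["A", "B", "win"], reward=+1.0)
naive_update(["A", "C", "lose"], reward=-1.0)
```

After these two games, state `A` has seen one win and one loss, so its estimate is 0, while `B` and `C` sit at +1 and -1 because each has been seen only once.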

**Naive Updating Analysis** • Minimizes mean square error with respect to seen data • Works, but converges slowly • Must play lots of games • Ignores that utility of a state should depend on successor

**Technique #2: Adaptive Dynamic Programming** • Utility of a state depends entirely on the successor state • If a state has one successor, utility should be the same • If a state has multiple successors, utility should be expected value of successors

**Finding the utilities** • To find all utilities, just solve equations • This is done via dynamic programming • “Gold standard” – this gets you the right values instantly, no convergence or iteration • Completely intractable for large problems: • For a real game, it means finding actual utilities of all states • Assumes that you know Mij
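As a sketch of "just solve equations", assume the utilities satisfy U(i) = R(i) + Σ_j M_ij U(j), where R(i) is the reward in state i (zero for non-terminal states here) and M_ij is the known transition probability. That form, the tiny four-state example, and the helper names are assumptions for illustration. The system is solved directly with Gauss-Jordan elimination, and it assumes every state eventually reaches a terminal state so the system has a unique solution.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def adp_utilities(states, transition, reward):
    """Solve (I - M) U = R for all states at once."""
    index = {s: k for k, s in enumerate(states)}
    n = len(states)
    a = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for i, successors in transition.items():
        for j, prob in successors.items():
            a[index[i]][index[j]] -= prob
    b = [reward.get(s, 0.0) for s in states]
    return dict(zip(states, solve_linear(a, b)))

states = ["A", "B", "win", "lose"]
transition = {"A": {"B": 1.0}, "B": {"win": 0.8, "lose": 0.2}}
reward = {"win": 1.0, "lose": -1.0}
U = adp_utilities(states, transition, reward)
```

In this example `B` gets the expected value of its two successors (0.8 - 0.2 = 0.6), and `A`, which has `B` as its only successor, gets the same utility.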

**Technique 3: Temporal Difference Learning** • Want utility to depend on successors, but want to solve iteratively • Whenever you observe a transition from i to j: • a = learning rate • difference between successive states = temporal difference • Converges faster than Naive updating
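The update rule itself did not survive on this slide. A standard form, assumed here, is U(i) ← U(i) + α (R(i) + U(j) − U(i)), where α is the learning rate (the slide's "a") and R(i) is the reward in state i. The sketch below applies it to two observed transitions; the state names and the function name `td_update` are hypothetical.

```python
def td_update(utilities, rewards, i, j, alpha=0.1):
    """Nudge U(i) toward R(i) + U(j) after observing a transition i -> j."""
    u_i = utilities.get(i, 0.0)
    u_j = utilities.get(j, 0.0)
    temporal_difference = rewards.get(i, 0.0) + u_j - u_i
    utilities[i] = u_i + alpha * temporal_difference

rewards = {"win": 1.0, "lose": -1.0}  # only terminal states carry reward
utilities = dict(rewards)             # a terminal state's utility is its reward

td_update(utilities, rewards, "B", "win", alpha=0.5)  # U(B) becomes 0.5
td_update(utilities, rewards, "A", "B", alpha=0.5)    # U(A) becomes 0.25
```

Unlike the direct solution of the ADP equations, each update touches only one state and needs no transition model, which is why it can be applied after every observed move.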

**Passive Learning in Unknown Environment** • Unknown environment = transition probabilities unknown • Only affects technique 2, Adaptive Dynamic Programming • Iteratively: • Estimate transition probabilities based on what you’ve seen • Solve dynamic programming problem with best estimates so far
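The first step, estimating transition probabilities from what has been seen, can be sketched by counting observed transitions. The game records and the function name `estimate_transitions` are hypothetical.

```python
from collections import Counter, defaultdict

def estimate_transitions(trajectories):
    """Estimate M[i][j] as the fraction of visits to i that moved on to j."""
    counts = defaultdict(Counter)
    for trajectory in trajectories:
        for i, j in zip(trajectory, trajectory[1:]):
            counts[i][j] += 1
    return {i: {j: n / sum(successors.values()) for j, n in successors.items()}
            for i, successors in counts.items()}

games = [["A", "B", "win"], ["A", "B", "lose"], ["A", "C", "lose"]]
M = estimate_transitions(games)
```

Here `A` moved to `B` in two of three games, so the estimate for that transition is 2/3. The estimates would then be plugged into the ADP equations and refreshed as more games are observed.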

**Active Learning in an Unknown Environment** • Probability of going from one state to another now depends on action • ADP equations are now:
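The equations are missing from this slide. A common action-dependent form, assumed here, is U(i) = R(i) + max_a Σ_j M^a_ij U(j): the agent takes the action with the best expected successor utility. The sketch below solves this by repeated sweeps (value iteration), which is a swapped-in solution method rather than anything the slide specifies. The states, actions, and probabilities are hypothetical.

```python
def value_iteration(states, actions, transition, reward, sweeps=100):
    """Repeatedly apply U(i) = R(i) + max_a sum_j M[(i, a)][j] * U(j)."""
    utilities = {s: reward.get(s, 0.0) for s in states}
    for _ in range(sweeps):
        updated = {}
        for s in states:
            options = [sum(p * utilities[j] for j, p in transition[(s, a)].items())
                       for a in actions if (s, a) in transition]
            # Terminal states have no actions, so they keep just their reward.
            updated[s] = reward.get(s, 0.0) + (max(options) if options else 0.0)
        utilities = updated
    return utilities

states = ["start", "win", "lose"]
actions = ["safe", "risky"]
transition = {
    ("start", "safe"): {"win": 0.5, "lose": 0.5},
    ("start", "risky"): {"win": 0.7, "lose": 0.3},
}
reward = {"win": 1.0, "lose": -1.0}
U = value_iteration(states, actions, transition, reward)
```

In this example the `risky` action has the higher expected utility (0.7 - 0.3 = 0.4 versus 0), so that is the value assigned to `start`.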

**Exploration: where should agent go to learn utilities?** • Suppose you’re trying to learn optimal blackjack strategies • Do you follow best utility, in order to win? • Do you move around at random, hoping to learn more (and losing lots in the process)? • Following best utility all the time can get you stuck at an imperfect solution • Following random moves can lose a lot

**Where should agent go to learn utilities?** • f(u,n) = exploration function • depends on utility of move, and number of times that agent has tried it • One possibility: • Try a move a bunch of times, then eventually settle
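The formula for the "one possibility" did not survive on this slide. One common choice consistent with "try a move a bunch of times, then eventually settle", assumed here, is to return an optimistic value until a move has been tried some minimum number of times and the learned utility afterwards. The threshold, the optimistic value, and the blackjack-style move names are hypothetical.

```python
def exploration_value(u, n, optimistic_reward=1.0, min_tries=5):
    """f(u, n): be optimistic about moves tried fewer than min_tries times."""
    return optimistic_reward if n < min_tries else u

def choose_move(utility, tries, moves):
    """Pick the move with the highest exploration value."""
    return max(moves, key=lambda m: exploration_value(utility[m], tries[m]))

utility = {"hit": 0.2, "stand": -0.1}
tries = {"hit": 10, "stand": 2}
first_choice = choose_move(utility, tries, ["hit", "stand"])  # "stand"

tries["stand"] = 5
later_choice = choose_move(utility, tries, ["hit", "stand"])  # "hit"
```

While `stand` is under-explored the agent keeps trying it despite its low estimate; once it has been tried enough, the agent settles on the move with the best learned utility.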

**Generalization in Reinforcement Learning** • Maintaining utilities for all seen states in a real game is intractable. • Instead, treat it as a supervised learning problem • Training set consists of (state, utility) pairs • Learn to predict utility from state • This is a regression problem, not a classification problem • Can use neural network with multiple outputs
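The regression view can be sketched with a linear model trained by stochastic gradient descent, used here in place of a neural network to keep the example short. The two state features (own pieces, opponent pieces), the training pairs, and the function names are all hypothetical.

```python
def fit_linear(examples, epochs=2000, rate=0.05):
    """Fit utility ~ w . features + bias by per-example gradient steps."""
    n = len(examples[0][0])
    weights = [0.0] * (n + 1)  # last weight is the bias term
    for _ in range(epochs):
        for features, utility in examples:
            x = list(features) + [1.0]
            error = utility - sum(w * xi for w, xi in zip(weights, x))
            weights = [w + rate * error * xi for w, xi in zip(weights, x)]
    return weights

def predict(weights, features):
    """Predicted utility for a state, including states never seen in training."""
    x = list(features) + [1.0]
    return sum(w * xi for w, xi in zip(weights, x))

# Hypothetical (state features, utility) training pairs.
examples = [((2, 0), 1.0), ((0, 2), -1.0), ((1, 1), 0.0),
            ((2, 1), 0.5), ((1, 2), -0.5)]
weights = fit_linear(examples)
unseen = predict(weights, (2, 2))
```

The point of the sketch is that the model stores a handful of weights rather than one utility per state, so it can give an estimate for a state such as `(2, 2)` that never appeared in the training set.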

**Other applications** • Applies to any situation where something is to learn from reinforcement • Possible examples: • Toy robot dogs • Petz • That darn paperclip • “The only winning move is not to play”