Presentation Transcript
Spring 2011 Artificial Intelligence COSC 40503 WEEK 2

Antonio Sanchez

Texas Christian University

Interacting with the environment

[Diagram: the environment delivers a stimulus, connectionism produces a response, and feedback from the environment closes the loop.]
Credit assignment at work: winning tic tac toe

[Diagram: tic tac toe boards before and after learning — Won => increase the probabilities of choosing a given play.]
Credit Assignment and Connectionism
  • For all of its actions the environment delivers one single composite feedback
  • The automaton must distribute the prize or punishment across the network, generating a credit assignment model
  • In this way the automaton generates an adequate internal pattern of behavior
  • This is what we call learning
  • There are many methodologies to model such behavior
  • Here we shall use CLS (Collective Learning Systems)
So Far ….

Artificial Intelligence deals with knowledge and learning

Artificial learning is obtained by:

  • Traversing knowledge bases (rule-based and logical programming)
  • Artificial selection (genetic and evolutionary algorithms)
  • Adaptive methods (connectionism and feedback)

Adaptive behavior studies have their roots in Pavlov’s studies of animal conditioning

Two recurring concepts: Feedback and Connectionism

  • Perceptron (Rosenblatt, Selfridge)
  • Learning Automata (Tsetlin, Narendra, Barto)
  • Neural Networks (Rumelhart, McClelland)
  • Collective Learning (Samuel, Michie, Bock)
  • Cybernetic Loop (Wiener, Rosenblueth, Ashby)
  • RP Policies (Thathachar, Viswanathan, Fu)
  • Backpropagation (Sejnowski, Hopfield)
  • Algedonic Loop (Beer, Bock)
  • Credit Assignment
  • Interacting with the Environment

CLS Formalization
  • CLS = [ AUTOMATA, MA ]
  • Where AUTOMATA = { I, O, STM, A }
  • I : a vector of possible entries or stimuli
  • O : a vector of possible responses or actions
  • STM : the transition matrix where the probability Pij of choosing response Oj is stored for each stimulus Ii
  • A : an algedonic algorithm (punishment / reward) that modifies the distinct Pij according to the compensation policy of the automaton; it is precisely this algorithm that represents learning
  • MA : the environment, which emits a series of stimuli I and evaluates the responses O of the AUTOMATA; that evaluation determines, through the algorithm A, the values applied to the Pij of the STM
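
To make this formalization concrete, here is a minimal Java sketch of the components, assuming a fixed number of stimuli and responses; the class and method names (LearningAutomaton, algedonic) are illustrative rather than from the slides, and the update rule mirrors the algedonic compensation formulas given later.

// Minimal sketch of the CLS AUTOMATA = { I, O, STM, A } (illustrative, not the original code)
public class LearningAutomaton
{ int entries;           // |I| : number of possible stimuli
  int outputs;           // |O| : number of possible responses
  float[][] STM;         // STM[i][j] = probability Pij of choosing response j for stimulus i

  LearningAutomaton( int entries, int outputs )
  { this.entries = entries;
    this.outputs = outputs;
    STM = new float[entries][outputs];
    for ( int i = 0; i < entries; i = i + 1 )        // start purely descriptive:
      for ( int j = 0; j < outputs; j = j + 1 )      // uniform probability over all responses
        STM[i][j] = 1.0f / outputs; }

  // A : the algedonic (reward / punishment) algorithm that updates row i after the feedback
  void algedonic( int i, int k, boolean reward, float beta )
  { if ( reward )
    { float delta = beta * (1 - STM[i][k]);          // increase the chosen response ...
      STM[i][k] = STM[i][k] + delta;
      for ( int j = 0; j < outputs; j = j + 1 )      // ... and take it evenly from the others
        if ( j != k ) STM[i][j] = STM[i][j] - delta / (outputs - 1); }
    else
    { float delta = (beta / 2) * STM[i][k];          // decrease the chosen response ...
      STM[i][k] = STM[i][k] - delta;
      for ( int j = 0; j < outputs; j = j + 1 )      // ... and spread it evenly over the others
        if ( j != k ) STM[i][j] = STM[i][j] + delta / (outputs - 1); } } }

MA, the environment, would then be a separate component that feeds stimuli to this automaton and calls algedonic with the composite feedback at the end of a game. Unlike the full pseudocode later, this simplified sketch adjusts every other response, not only the legal (non-zero) ones.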
CLS mapping

[Diagram: an STM whose rows hold the probabilities of the possible moves for each board, e.g. 1/2 1/4 0 0 1/4 and 0 0 3/4 1/8 0 1/8, going from a description of the possible moves to a prescription of the best moves; the loop alternates the selection method, the other player's turn, the second turn, and finally the compensation method.]

Looking at the options, the CLS selects its move and gives the board to the other player. After the other player selects a move, the CLS takes a second move, and so on until the game is over.

After the game is over and the winner is determined, the compensation method modifies the probabilities, and the STM becomes more prescriptive (knowledge) rather than just descriptive (information).

CLS Pseudo Code

// Selection process (mode: "random" or "max")
int selection( float[][] STM, int situation, String mode )
{ int selectionMade = 0; int max; float cumulative, number;
  if ( mode.equals("max") )                        // greedy: choose the most probable play
  { max = 1;
    for ( j = 1; j <= outputs; j = j + 1 )
      if ( STM[situation][j] > STM[situation][max] ) max = j;
    selectionMade = max; }
  else                                             // random: sample from the cumulative distribution
  { number = random(seed);                         // uniform random number in [0,1)
    cumulative = 0; selectionMade = 0; j = 0;
    while ( selectionMade == 0 )
    { j = j + 1;
      cumulative = cumulative + STM[situation][j];
      if ( cumulative > number ) selectionMade = j; } }
  return selectionMade; }


import java.awt.*;
import java.applet.Applet;

public class cls extends Applet
{ float[][] STM = new float[entries][outputs];     // transition matrix of play probabilities
  int[][] LOG = new int[ent_max][3];               // per turn: [1] situation seen, [2] play made
  String game; String mode;
  int turn, turnmx, i, k, j, situation, play, times;

  // Basic loop: play times_max games, log every move, then compensate at the end of each game
  public void init()
  { for ( times = 1; times <= times_max; times = times + 1 )
    { clear(LOG);
      game = "playing"; situation = 0; turn = 0;
      while ( game.equals("playing") )
      { turn = turn + 1;
        play = selection(STM, situation, mode);
        LOG[turn][1] = situation;
        LOG[turn][2] = play;
        game = evalua(situation);
        if ( game.equals("playing") )
        { situation = otherPlay(play);             // the other player moves
          game = evalua(situation); } }
      turnmx = turn;
      compensation(STM, LOG, game); } } }

// Procedure to modify the probabilities in the STM after a game
void compensation( float[][] STM, int[][] LOG, String game )
{ float reward, punish, normal, nplays;
  for ( turn = 1; turn <= turnmx; turn = turn + 1 )
  { i = LOG[turn][1];                       // situation seen at this turn
    k = LOG[turn][2];                       // play chosen at this turn
    nplays = STM[i][0];                     // number of possible plays in situation i (stored in column 0)

    // In Reward, increase the probability of the chosen play
    if ( game.equals("Won") )
    { reward = ß*(1 - STM[i][k]);
      STM[i][k] = STM[i][k] + reward;
      normal = reward/(nplays - 1);         // take the same amount evenly from the other legal plays
      for ( j = 1; j <= outputs; j = j + 1 )
        if (( j != k ) && ( STM[i][j] != 0 ))
          STM[i][j] = STM[i][j] - normal; }

    // In Punishment, reduce the probability of the chosen play
    else
    { punish = ß/2*STM[i][k];
      STM[i][k] = STM[i][k] - punish;
      normal = punish/(nplays - 1);         // redistribute it evenly among the other legal plays
      for ( j = 1; j <= outputs; j = j + 1 )
        if (( j != k ) && ( STM[i][j] != 0 ))
          STM[i][j] = STM[i][j] + normal; } } }

Algedonic compensation

In case of a Reward (with 0 < ß < 1):

For the selection i -> k in the STM (the selected play):
STM(t+1)i,k = STM(t)i,k + ß*(1 – STM(t)i,k)

For the other transitions i -> j, with j ≠ k:
STM(t+1)i,j = STM(t)i,j – ß*(1 – STM(t)i,k)/(n-1)

In case of a Punishment (with 0 < ß < 1):

For the selection i -> k in the STM (the selected play):
STM(t+1)i,k = STM(t)i,k – ß*STM(t)i,k

For the other transitions i -> j, with j ≠ k:
STM(t+1)i,j = STM(t)i,j + ß*STM(t)i,k/(n-1)
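
As a quick worked example (with assumed numbers, not from the slides): suppose row i of the STM is (1/2, 1/4, 1/4), the selected play k is the first one, n = 3 and ß = 0.4.

Reward:  STM(t+1)i,k = 0.5 + 0.4*(1 – 0.5) = 0.7, and each of the other two plays loses 0.4*(1 – 0.5)/2 = 0.1, giving the row (0.7, 0.15, 0.15).
Punishment:  STM(t+1)i,k = 0.5 – 0.4*0.5 = 0.3, and each of the other two plays gains 0.4*0.5/2 = 0.1, giving the row (0.3, 0.35, 0.35).

In both cases the row still sums to 1.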

CLS Non-linear Compensation

Rewards more at the beginning:
STM[i][k] = STM[i][k] + ß*(1-STM[i][k])*(1-STM[i][k])

Rewards more at the end:
STM[i][k] = STM[i][k] + ß*(1-STM[i][k])*STM[i][k]

Punishes more at the beginning:
STM[i][k] = STM[i][k] - ß/2*STM[i][k]*(1-STM[i][k])

Punishes more at the end:
STM[i][k] = STM[i][k] - ß/2*STM[i][k]*STM[i][k]

  • Think about the case of a reward/punishment (R/P) in my everyday life. How much do I listen to an R/P? It depends on:
    • Who is giving it to me
    • What my expectation is
    • The recent evaluations I have had
  • We can take such concerns into account, for example:
    • The domain of ß is 0 < ß < 1
    • A value of 0 causes no learning, while a value of 1 saturates the STM, driving it to one selection only
    • A reward/inaction scheme is achieved by using ß = 0 when punishing
    • When punishing, using ß/2 reduces the chance of wrongly updating probabilities

Boltzmann entropy

As a measurement of order, entropy can be defined as

S = - Σ STM[i][j] * log2( STM[i][j] )   over all i,j

Using entropy we can check how well organized the STM is: the lower the value of the entropy, the more uneven the probabilities in the STM are.
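
A minimal Java sketch of this entropy check, assuming the STM layout used in the pseudocode above (the method name entropy is illustrative); entries with probability 0 are skipped, taking 0*log2(0) as 0:

// Entropy of the STM: the lower the value, the more uneven (more organized) the probabilities
static double entropy( float[][] STM )
{ double s = 0.0;
  for ( int i = 0; i < STM.length; i = i + 1 )
    for ( int j = 0; j < STM[i].length; j = j + 1 )
      if ( STM[i][j] > 0 )                          // skip zeros: 0 * log2(0) is taken as 0
        s = s - STM[i][j] * Math.log(STM[i][j]) / Math.log(2.0);
  return s; }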

CLS Selection Schemes

Since we are storing probabilities, their values in each row should always sum to 1. The selection process is done using the cumulative distribution:

number = random(seed);
cumulative = 0; selectionMade = 0; j = 0;
while ( selectionMade == 0 )
{ j = j + 1;
  cumulative = cumulative + STM[situation][j];
  if ( cumulative > number ) selectionMade = j; }

However, expecting a good result, the selection can instead use the maximum:

max = 1;
for ( j = 1; j <= outputs; j = j + 1 )
  if ( STM[situation][j] > STM[situation][max] ) max = j;
selectionMade = max;

With this alternative the use of probabilities is no longer necessary, and a simple histogram calculation can be used to tally the selections made in each row of the STM, as sketched below.
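
A minimal sketch of that histogram alternative, assuming an integer tally matrix HIST with the same shape as the STM (HIST and the method names are illustrative, not part of the original pseudocode):

// Tally-based selection: keep counts instead of probabilities and pick the largest tally
int selectionByHistogram( int[][] HIST, int situation )
{ int best = 1;
  for ( int j = 1; j < HIST[situation].length; j = j + 1 )
    if ( HIST[situation][j] > HIST[situation][best] ) best = j;
  return best; }

// After a won game, reinforce every logged (situation, play) pair by bumping its tally
void reinforce( int[][] HIST, int[][] LOG, int turnmx )
{ for ( int turn = 1; turn <= turnmx; turn = turn + 1 )
    HIST[ LOG[turn][1] ][ LOG[turn][2] ] = HIST[ LOG[turn][1] ][ LOG[turn][2] ] + 1; }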

STM in time

[Diagram: the STM (inputs i by output options j) evolving over Δ time.]

The initial STM is a descriptive matrix of the possible actions in the game, mostly the rules of the game (information). As time goes by the STM becomes more prescriptive of the game, showing the right moves (knowledge).

In reality it is not only Δ time but also Δ information that reduces the entropy and guides the STM towards the right answers; this is the reason why information is also called negentropy.

Knowledge Representation (basic)

  • In addition to the transitions in a game, it is also necessary to represent other aspects of the game, such as:
    • The board or state of the game
    • The value of each board state
  • This becomes an important part of knowledge representation

A board could be represented as a string: '----0----' for the opening board with only the center taken, or '--X-0----' after a reply. Yet it pays to represent equal boards with the same notation '--X-0----'.

If we use only arrays, you will see that the number of boards makes the amount of information to represent quite big. As a matter of fact, this is where data structures first began to be used instead of plain arrays, and the concept of the linked list was developed in languages such as IPL, Lisp, and Snobol.

Why?
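
One way to see why: here is a minimal Java sketch (entirely illustrative; the helper names rowFor, canonical, rotate and mirror are not from the slides) that keys STM rows by the 9-character board string and collapses boards that are equal under rotation and mirroring into a single canonical entry, instead of reserving one array slot per raw board:

import java.util.*;

public class BoardIndex
{ // Map each board string (e.g. "--X-0----") to an STM row index,
  // so boards that are equal under symmetry share a single entry
  static Map<String,Integer> rowIndex = new HashMap<>();

  static int rowFor( String board )
  { String key = canonical(board);
    if ( !rowIndex.containsKey(key) ) rowIndex.put(key, rowIndex.size());
    return rowIndex.get(key); }

  // Canonical form: the lexicographically smallest of the 4 rotations and their mirrors
  static String canonical( String b )
  { String best = b, cur = b;
    for ( int r = 0; r < 4; r = r + 1 )
    { cur = rotate(cur);
      if ( cur.compareTo(best) < 0 ) best = cur;
      String m = mirror(cur);
      if ( m.compareTo(best) < 0 ) best = m; }
    return best; }

  static String rotate( String b )           // 90-degree clockwise rotation of the 3x3 board
  { int[] p = {6,3,0, 7,4,1, 8,5,2};
    StringBuilder s = new StringBuilder();
    for ( int i : p ) s.append(b.charAt(i));
    return s.toString(); }

  static String mirror( String b )           // left-right mirror of the 3x3 board
  { int[] p = {2,1,0, 5,4,3, 8,7,6};
    StringBuilder s = new StringBuilder();
    for ( int i : p ) s.append(b.charAt(i));
    return s.toString(); } }

Keying by a canonical string keeps only one entry per class of equivalent boards, which is exactly the kind of saving that array-only storage cannot give.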

Behavior in time

[Plot: Δ performance (up to 100%) versus Δ time — learning, then stable; after a change of rules, relearning and back to stability.]

Learning versus Preprogrammed Knowledge

[Plot: Δ performance versus Δ time — slow learning stays below the preprogrammed system, which remains limited until it is reprogrammed; the learning system, however, will relearn and come back to stability.]