A neglected problem in the computational theory of mind Object Tracking and the Mind-World gap

In the course of these lectures I will try to show how several interconnected concepts are essential to understanding mind. They are: • Picking out, individuating, and nonconceptual selection • The type-token distinction – everyone is familiar with these terms, but many fail to see their importance and relevance • This distinction crosses the proximal-distal distinction • The need for “tagging” or “marking” individuals to keep them distinct (but where does the tag reside?) • The correspondence problem: when do two proximal tokens correspond to the same individual (same distal token)? • The binding problem: how does the visual system indicate that several properties are conjoined, i.e., that they are properties of the same individual?

Before I begin I would like you to see a ‘video game’ that will figure in the last part of my talk • The demonstration shows a task called “Multiple Object Tracking” • Track the initially-distinct (flashing) items through the trial (here 10 secs) and indicate at the end which items are the “targets” • After each example I’d like you to ask yourself, “How do I do it?” • If you are like most of our subjects you will have no idea, or a false idea…

Keep track of the objects that flash

How do we do it? What properties of individual objects do we use?

Going behind occluding surfaces does not disrupt tracking. Scholl, B. J., & Pylyshyn, Z. W. (1999). Tracking multiple items through occlusion: Clues to visual objecthood. Cognitive Psychology, 38(2), 259-290.

Not all well-defined features can be tracked: Track the endpoints of these lines. The endpoints move exactly as the squares did!

The basic problem of cognitive science • What determines our behavior is not how the world is, but how we represent it as being • As Chomsky pointed out in his review of Skinner, if we describe behavior in relation to the objective properties of the world, we would have to conclude that behavior is essentially stimulus-independent • Nearly every naturally-occurring person-level action or behavioral regularity is cognitively penetrable • Any information that changes beliefs can systematically and rationally change behavior

Representation and Mind Why representations are essential • Do representations only come into play in “higher level” mental activities, such as reasoning? • Even at early stages of perception many of the states that must be postulated are representations (i.e. what they are about plays a role in explanations).

Examples from vision (1): Intrapercept constraints. Epstein, W. (1982). Percept-percept couplings. Perception, 11, 75-83. [Figure labels: Far Top / Far High; Front Bottom / Back bottom]

Another example of a classical representation

Other forms of representation…. Note the essential role played by the letter-labels • Lines FG, BC are parallel and equal. • Lines EH, AD are parallel and equal. • Lines FB, GC are parallel and equal. • Lines EA, HD are parallel and equal. • Vertices EF, HG, DC and AB are joined.... Other predicate-argument representations • Part-Of{Cube; Top-Face(EFGH), Bottom-Face(ABCD), Front-Face(FGCB), Back-Face(EHDA)} • Part-Of{Top-Face; Front-Edge(FG), Back-Edge(EH), Left-Edge(EF), Right-Edge(HG)}, …
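The role of the letter-labels can be made concrete with a small sketch. The Python below is only an illustration and not part of the lecture: the names FACES and shared_vertices are hypothetical, while the face labels are copied from the slide. It suggests why the labels matter: they are what allows two separate predicates to be recognized as applying to the same individual vertex.

```python
# Hypothetical encoding of the labelled cube from the slide.
# Each face is named by the vertex labels that bound it.
FACES = {
    "Top-Face": "EFGH",
    "Bottom-Face": "ABCD",
    "Front-Face": "FGCB",
    "Back-Face": "EHDA",
}


def shared_vertices(face_a, face_b):
    """Return the vertex labels that occur in both face descriptions."""
    return sorted(set(FACES[face_a]) & set(FACES[face_b]))


# The labels F and G bind the top face and the front face to the same
# two individuals; strip the labels and this link cannot be stated.
top_and_front = shared_vertices("Top-Face", "Front-Face")   # ['F', 'G']
top_and_bottom = shared_vertices("Top-Face", "Bottom-Face")  # []
```

Note that the labels here are supplied by the theorist. The sketch says nothing about how a visual system could come to have such labels, which is the question the following slides raise.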

What’s wrong with this picture? What’s wrong is that the CTM is incomplete — it does not address a number of fundamental questions • It fails to specify how representations connect with what they represent – it’s not enough to use English words in the representation (that’s been a common confusion in AI) or to draw pictures (a common confusion in theories of reasoning with mental images) • English labels and pictures may help the theorist recall which objects are being referred to … • But what makes it the case that a particular mental symbol refers to one thing rather than another? • How are concepts grounded? (Symbol Grounding Problem)

Another way to look at what the Computational Theory of Mind lacks • The missing function in the CTM is a mechanism that allows perception to refer to individual tokens in the visual field directly and nonconceptually: • Not as “whatever has properties P1, P2, P3, ...”, but as a singular term that refers directly to an individual and does not appeal to the prior representation of the individual’s properties. • Such a reference is like a proper name or a pointer in a computer data structure, or like a demonstrative term (like this or that) in natural language. But it is different from all of these. E.g. • Unlike a demonstrative or a deictic term, the reference is not determined by discourse context • Unlike a proper name it only refers to objects currently in view • Unlike the usual sort of pointer it does not refer by addressing a location; rather it is like a pointer in a computer which serves as a variable and does not refer via a location, despite what the term “pointer” might imply.

An example from personal history: Why we need to pick out individual things without referring to their properties • We wanted to develop a computer system that would reason about geometry by actually drawing a diagram and noticing adventitious properties of the diagram from which it would conjecture lemmas to prove • We wanted the system to be as psychologically realistic as possible so we assumed that it had a narrow field of view and noticed only limited, spatially-restricted information as it examined the drawing • This immediately raised the problem of coordinating noticings and led us to the idea of visual indexes to keep track of previously encoded parts of the diagram.

Begin by drawing a line…. L1

Now draw a second line…. L2

And draw a third line…. L3

Notice what you have so far….(noticings are local – you encode what you attend to) L1 V6 L2 There is an intersection of two lines… But which of the two lines you drew are they? There is no way to indicate which individual things are seen again without a way to refer to individual (token) things

Look around some more to see what is there …. L5 L2 V12 Here is another intersection of two lines… Is it the same intersection as the one seen earlier? Without a special way to keep track of individuals the only way to tell would be to encode unique properties of each of the lines. Which properties should you encode?

In examining a geometrical figure one only gets to see a sequence of local glimpses

The incremental construction of visual representations requires solving a correspondence problem over time • We have to determine whether a particular individual element seen at time t is identical to another individual element seen at an earlier time. This is one manifestation of the correspondence problem. • Solving the correspondence problem is equivalent to picking out and tracking the identity of token individuals as they change their appearance, their location or the way they are encoded or conceptualized • To do that we need the capacity to refer to token individuals (I will call them objects) without doing so by appealing to their properties. This requires a special form of demonstrative reference I call a Visual Index.
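A minimal sketch of why descriptions alone do not settle the correspondence problem. It assumes, purely for illustration, that distal tokens can be modelled as Python objects and that an index behaves like an object reference; the class Line and the function describe are hypothetical names. The analogy is loose, since the lecture stresses that a visual index differs from an ordinary pointer.

```python
class Line:
    """A distal token. Two lines may share every encoded property."""

    def __init__(self, orientation):
        self.orientation = orientation


def describe(line):
    """What a local glimpse encodes: type-level properties only."""
    return ("line", line.orientation)


l1 = Line("oblique")
l2 = Line("oblique")  # a different token of the same type

# Matching by description: two distinct lines look like one individual.
matched_by_description = describe(l1) == describe(l2)

# Matching by index: keep a reference to the token itself.
index_then, index_now = l1, l2
matched_by_index = index_then is index_now

# A token can change its appearance and still be the same individual.
description_before = describe(l1)
l1.orientation = "vertical"
description_changed = describe(l1) != description_before
still_same_token = index_then is l1
```

Here the description-based test wrongly merges l1 and l2, and it would also lose track of l1 once its orientation changes, whereas the reference-based test is unaffected by either.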

A note about the use of labels in this example • There are two purposes for figure labels. One is to specify what type of individual it is (line, vertex,..). The other is to specify which individual it is so it can be bound to the argument of a predicate which can then be evaluated. • The second of these is what I am concerned with because it is essential that we be able to indicate which individual a predicate applies to. • Many people (e.g., Marr, Yantis) have suggested that individuals may be marked by tags. But that won’t do since one cannot literally place a tag on an object. Even if we could it would not obviate the need to refer directly to individuals for the same reason that labels didn’t help in the geometry examples discussed earlier. • Labeling things in the world is not enough because to refer to the line labeled L1 you would have to be able to think “this is line L1” and you could not think that unless you had a way of first picking out the referent of this.

The difference between a direct (demonstrative) way and a descriptive (attributive) way of picking something out has produced many “You are here” cartoons. It is also illustrated in this recent New Yorker cartoon…

The difference between descriptive and demonstrative ways of picking something out (illustrated in this New Yorker cartoon by Sipress )

Referring and ‘Picking out’ • Picking out entails individuating, in the sense of separating an individual from a background (what Gestalt psychologists called a figure-ground distinction) and from all other possible things • This sort of picking out has been studied in psychology under the heading of focal or selective attention. • Focal attention can be understood as an instance of demonstrative reference! • Focal attention appears to pick out and adhere to objects rather than places • In addition to the usual unitary attention there is also evidence for a mechanism of multiple direct references (about 4 or 5), that I have called a visual index or a FINST • Indexes are different from split focal attention in many ways that we have studied in our laboratory (I will mention a few later) • A visual index is like a pointer in a computer data structure – it allows access but does not itself reveal anything about what is being pointed to

The requirements for picking out and keeping track of several individual things reminded me of an early comic book character called Plastic Man

Imagine being able to place several of your fingers on things in the world without recognizing their properties while doing so. You could then refer to those things (e.g. ‘what finger #2 is touching’) and could move your attention to them. You would then be said to possess FINgers of INSTantiation (FINSTs)
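The finger metaphor can be sketched as a small data structure. This is an illustrative toy rather than a model proposed in the lecture: the class FinstPool, its methods, and the capacity of 4 are assumptions (the lecture says only "about 4 or 5"). The point it tries to capture is that an index gives access to a thing without itself encoding any of the thing's properties.

```python
class FinstPool:
    """A small, fixed pool of indexes that can each be bound to one thing."""

    def __init__(self, capacity=4):  # capacity is an assumed value
        self.capacity = capacity
        self._slots = {}  # slot number -> the thing currently indexed

    def grab(self, thing):
        """Bind a free index to `thing`; return its number, or None if none is free."""
        for slot in range(1, self.capacity + 1):
            if slot not in self._slots:
                self._slots[slot] = thing
                return slot
        return None

    def release(self, slot):
        """Free an index so it can be bound to something else."""
        self._slots.pop(slot, None)

    def query(self, slot, probe):
        """Evaluate `probe` on whatever index `slot` is touching."""
        return probe(self._slots[slot])


pool = FinstPool()
things = [{"colour": c} for c in ("red", "green", "blue", "white", "black")]
slots = [pool.grab(t) for t in things]  # the fifth thing gets no index

# 'What finger #2 is touching' is looked up only when a probe is applied.
colour_of_second = pool.query(2, lambda t: t["colour"])
```

The pool stores nothing about colour or position. Properties are read off only when a probe is evaluated against an indexed thing, loosely mirroring the idea of referring first and encoding properties afterwards.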

Some questions raised by this view of indexing as primitive reference • Is there a limit on the number of such indexes? If so: • Is it a fixed structural (architectural) property? • Can it be altered by different tasks, experience, etc.? • How is it different from focal attention? • What determines whether something is attended? • What object properties allow objects to be tracked? • How can an object be selected without being selected as “the object with property P” (e.g., the object at location <x,y>)? “Selection” is a misleading term. • Without some unique property how do you know which object you have selected? This is a misleading way to put it.

[Diagram labels: Information (causal) link; FINST; Demonstrative reference link] FINSTs and Object Files are the basic mechanisms that link the world and its conceptualization. The only thing in this picture that is conceptual is what’s in the Object Files (unless you count a reference as conceptual). Object File contents are conceptual!

A note on terminology • A FINST provides a reference to an individual visible ‘thing’ • I sometimes call this referent a FING by analogy with FINST and sometimes an object to conform with usage in psychology • A FINST does not pick out or refer to something as an object, because OBJECT is a concept. So FINGs are nonconceptual. Maybe “proto-object”? • I have also called it a pointer, but that erroneously suggests that it points to the location of an object, as opposed to the object itself. In a computer, a pointer is the name of a stored datum. • I have said that a FINST is a visual demonstrative like ‘this’ or ‘that’, but this too is misleading because the reference of a demonstrative depends on the context and intentions of the speaker • I have also noted that a FINST is like a proper name but that won’t do either since a name can pick out something not in sensory contact whereas a FINST can only refer to a visible item (or one that has been only briefly out of sight).

A quick tour of some evidence for FINSTs • The correspondence problem • The binding problem • Evaluating multi-place visual predicates (recognizing multi-element patterns) • Operating over several visual elements at once without having to search for them first • Subitizing • Subset selection • Multiple-Object Tracking • Cognizing space without requiring a spatial display in the head

Dawson Configuration (Dawson & Pylyshyn, 1988)

Apparent Motion solves a correspondence problem. Dawson Configuration (Dawson & Pylyshyn, 1988) Linear trajectory? Curved trajectory? Which criterion does the visual module prefer?

Apparent Motion solves a correspondence problem. Dawson Configuration (Dawson & Pylyshyn, 1988) Nearest vector distance? Nearest mean distance? Nearest configural distance? Which criterion does the visual module prefer?
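To make the idea of a matching criterion concrete, here is a hedged sketch of one simple candidate: pair the elements of two successive frames so that the summed displacement is as small as possible. This is an assumption chosen for illustration, not the criterion the lecture endorses and not a definition of the "vector", "mean" or "configural" distances named on the slide. The function name match_min_total_distance is hypothetical, and the brute-force search is only practical for a handful of elements.

```python
from itertools import permutations
from math import dist


def match_min_total_distance(frame1, frame2):
    """Pair each point in frame1 with a point in frame2 (same length).

    Returns a list m where m[i] is the index in frame2 matched to
    frame1[i], chosen to minimize the summed Euclidean displacement.
    """
    best = min(
        permutations(range(len(frame2))),
        key=lambda p: sum(dist(frame1[i], frame2[j]) for i, j in enumerate(p)),
    )
    return list(best)


# Two elements; in the second frame they are listed in the other order.
frame1 = [(0, 0), (10, 0)]
frame2 = [(9, 0), (1, 0)]
matching = match_min_total_distance(frame1, frame2)  # [1, 0]
```

Different criteria can disagree about which pairing wins for the same pair of frames, which is what makes configurations like this one useful for asking which criterion the visual system actually uses.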

Dawson Configuration (animated)

Dawson Configuration: Different Shapes Ignored

Yantis’s use of the “Ternus Configuration” to demonstrate the early visual effect of objecthood. Short time delays result in “element motion” (the middle object persists as the “same object” so it does not appear to move)

Long time delays result in “group motion” because the middle object does not persist but is perceived as a new object each time it reappears

But long delays, when the disappearance appears to be due to occlusion by an opaque surface, maintain objecthood, and therefore behave like short delays

A quick tour of some evidence for FINSTs • The correspondence problem • The binding problem • Evaluating multi-place visual predicates (recognizing multi-element patterns) • Operating over several visual elements at once without having to search for them first • Subitizing • Subset search • Multiple-Object Tracking • Cognizing space without requiring a spatial display in the head

Encoding conjunctions of properties and solving the Binding Problem • Experiments have shown that detecting conjunctions of several properties involves attending to the bearers of the properties. These studies have provided a basis for understanding an important problem in visual analysis – the Binding Problem • The following aside is to illustrate some aspects of the problem of encoding conjunctions.

How are conjunctions of features detected? Read the vertical line of digits in the following display. Under these conditions Conjunction Errors are very frequent.

Rapid visual search (Treisman) Find the following simple figure in the next slide:

This case is easy – and the time is independent of how many nontargets there are – because there is only one red item. This is called a ‘popout’ search

This case is also easy – and the time is independent of how many nontargets there are – because there is only one right-leaning item. This is also a ‘popout’ search.

Rapid visual search (conjunction) Find the following simple figure in the next slide:
