Probabilistic Plan Recognition

Probabilistic Plan Recognition Kathryn Blackmond Laskey Department of Systems Engineering and Operations Research George Mason University Dagstuhl Seminar April 2011

The problem of plan recognition is to take as input a sequence of actions performed by an actor and to infer the goal pursued by the actor and also to organize the action sequence in terms of a plan structure Schmidt, Sridharan and Goodson, 1978 …the problem of plan recognition is largely a problem of inference under conditions of uncertainty. Charniak and Goldman, 1993

PPR in a Nutshell Thomas Bayes(1702-1762) • Represent • set of possible plans • anticipated evidence for each plan • Specify • prior probabilities for plans • likelihood for evidence given plans • Infer plans using Bayes Rule …or just directly specify P(plan|obs) Bayes, Thomas. An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370- 418, 1763.

Why Probability? • Theoretically well-founded representation for relative plausibility of competing explanations • Unified approach to inference and learning • Combine engineered and learned knowledge • Many general-purpose exact and approximate algorithms with strong theoretical justification and practical success • Good results on many interesting problems • But… • Inference and learning (exact and approximate) are NP hard • Balancing tractability and expressiveness is a major research and engineering challenge

Representing Plans and Observations • Plan recognition requires a computational representation of possible plans and observable evidence • Goals • Actions • When executed in combination, actions are expected (with high probability) to achieve the goal • Preconditions / postconditions of actions • Constraints • Most notably, temporal ordering • Observables • Actions may or may not be directly observable • Sometimes we observe effects of actions • Hierarchical decomposition of the above • For probabilistic plan recognition, we need to assign probabilities to these elements • Balance expressivity against tractability of inference & learning

Some Representations for PPR • Bayesian networks • Hidden Markov Models / Dynamic Bayesian Networks • Plan Recognition Bayesian Networks / Probabilistic Relational Models / Multi-Entity Bayesian Networks • Bayesian Abductive Logic Programs • Stochastic Grammars • Conditional Random Fields • Markov Logic Networks Each of these formalisms can be thought of as a way of representing a set of “possible worlds” and defining a probability measure on an algebra of subsets

Graphical Probability Models • Factorize joint distribution into factors involving only a few variables each • Graph represents conditional independence assumptions • Local distributions specify probability information for small groups of related variables • Factors are combined into joint distribution • Drastically simplifies specification,inference and learning • 20 possible goals, 100 possible actions • Fully general model  2.5x1031 probabilities • “Naïve Bayes model”  19x20x100=38,000 probabilities • If each goal has only 10 associated actions then “naïve Bayes model”  19x10 = 190 probabilities • Naïve Bayes inference scales as #variables x #states/variable Naïve Bayes Model

Bayesian Network (BN) • Directed graph represents dependencies • Joint distribution factors as • Factored representation makes specification, inference and learning tractable for interesting classes of problems • Directed graph naturally represents causality • Effects of intervention via “do” operator • Explaining away Pr(R,E,I,W,T,B,S) = Pr(R)Pr(E)Pr(I|R)Pr(W|R)Pr(T|E,I)Pr(B|W)Pr(S|W) 127 probabilities  14 probabilities

Possible and Probable Worlds • “Traditional or deductive logic admits only three attitudes to any proposition: definite proof, disproof, or blank ignorance.” (Jeffreys) • Semantics of classical logic is based on possible worlds • Set of possible worlds defined by language, domain, and axioms • In propositional logic, possible worlds assign truth values to atoms (e.g., R  T; W  T; E  F) • Probability theory • Set of possible worlds is called the sample space • Probability measure maps subsets to real numbers • Probability axioms are a natural extension of classical propositional logic to likelihood • BN combines propositional logic with probability

Other Factored Representations • Markov network: factorization specified by undirected graph • More natural for domains without natural causal direction • Joint distribution factorizes as: • Chain graph: factorization specified by graph with both directed and undirected edges • Representations to exploit context-specific independence • Probability trees • Tree-structured parameterization for local distributions in a BN C indexes cliques in the graph xiC is ith variable in clique C kC is size of clique C Z is a normalization constant

Conditional Random Fields • Bayesian networks are generative models • Represent joint probability over plans and observations • Realistic dependence models often yield intractable inference • Conditional (or discriminative) model directly represents probability of plans given observations • Can allow some dependencies to be relaxed • CRFs are discriminative • Undirected graph represents local dependencies • Potential function represents strength of dependence • A CRF is a family of MRFs (a mapping from observations to potentials)

Inference in Graphical Models • Exact inference • E.g., Belief propagation, junction tree, bucket elimination, symbolic probabilistic inference, cutset conditioning • Exploit graph structure / factorization to simplify computation • Infeasible for complex problems • Approximate (deterministic) • E.g., Loopy BP, variationalBayes • Approximate (stochastic) • E.g., Gibbs sampling, Metropolis-Hastings sampling, likelihood weighting • Combinations • E.g., Bidyuk and Dechter (2007) – cutset sampling

Goal: compute probability distribution of random variable B given evidence (assume B itself is not known) Key idea: impact of belief in B from evidence "above" B and evidence "below" B can be processed separately Justification: B d-separates “above” random variables from “below” random variables Random variables “above” B A1 A4 A5 p p A2 A3 A6 ? D7 D5 B Random variables “below” B l D1 D6 D2 D3 D4 = evidence random variable Belief Propagation for Singly Connected BNs • This picture depicts the updating process for one node. • The algorithm simultaneously updates beliefs for all the nodes. • Loopy BP applies BP to network with loops; often results in good approximation

Likelihood Weighting (for BNs) • Proceed through non-evidence variables in order consistent with partial ordering induced by graph • Sample variable according to its local probability distribution • Calculate weight proportional to Pr(evidence | sampled values) • Repeat Step 1 until done • Estimate Pr(Variable=value) by weighted sample frequency

Junction Tree Algorithm • Compile BN into junction tree • Tree of clusters of nodes • Has JT property: variable belonging to 2 clusters must belong to all clusters along path connecting them • Becomes part of the knowledge representation • Changes only if the graph changes • Use local local message-passing algorithm to propagate beliefs in the junction tree • Query on any node or any set of nodes in same cluster can be computed from cluster joint distribution ABC BCDEH CDGEH DEGHJ DFGHJ FGJK JKL

Gibbs Sampling • Initialize • Evidence variables assigned to observed values • Arbitrary value for other variables • Sample non-evidence nodes one at a time: • Sample with probability Pr(variable | Markov blanket) • Replace with newly sampled value • Repeat Step 2 until done • Estimate Pr(Variable=value) by sample frequency • Markov blanket • In BN: parents, children, co-parents • In MN: neighbors • Variable is conditionally independent of rest of network given its Markov blanket

Cutset Sampling (for BNs) • Find a loop cutset • Initialize cutset variables • Do until done • Propagate beliefs on non-cutset variables • Do Gibbs iteration on cutset • Estimate P(Variable=value) by averaging probability over samples • This is a kind of “Rao-Blackwellization” • Reduce variance of Monte Carlo estimator by replacing a sampling step with an exact computation with same expected value

Variational Inference • Method for approximating posterior distribution of unobserved variables given observed variables • Approximation finds distribution in family with simpler functional form (e.g., remove some arcs in graph) by minimizing a measure of distance from true posterior • Estimation via “variational EM” • Alternate between “expectation” and “maximization” steps • Converges to local minimum of distance function • Yields lower bound for marginal likelihood • Often faster but less accurate than MC

Extending Expressive Power of BNs • Propositional logic + probability is insufficiently expressive for requirements of plan recognition • Repeated structure • Multiple interrelated entities (e.g., plans, actors, actions) • Type hierarchy and inheritance • Unbounded number of potentially relevant variables • Some formalisms with greater expressive power: • PBN (Plan recognition BN) • PRM (Probabilistic Relational Models) • OOBN (Object-Oriented Bayesian Networks) • MEBN (Multi-Entity BN) • Plates • BALP (Bayesian AbductiveLogic Programs) Charniak and Goldman (1993)

Example: Maritime Domain Awareness Entities, attributes and relations

MDA Probabilistic Ontology Built in UnBBayes-MEBN

MDA SSBN Screenshot of situation-specific BN in UnBBayes-MEBN (open-source tool for building & reasoning with PR-OWL ontologies)

Protégé Plugin for UnBBayes

drag-and-drop Drag-and-Drop Mapping

Markov Logic Networks • First-order knowledge base with weight attached to formulas and clauses • KB + individual constants  ground Markov network containing variable for each grounding of a formula in the KB • Compact language for specifying large Markov networks

MLN Example(Richardson and Domingos, 2006)

CRFs for Chat Recognition(Hsu, Lian and Jih, 2011) • Subscript indexes pairs of individuals • Yit represents chatting activity of pair • Xit represents observed acoustic features • Dependence structure: • Within-pair temporal dependence • Between-pair concurrent dependence • Can be represented as MLN

Possible and Probable FO Worlds • In first-order logic, a possible world (aka “structure”) assigns: • Each constant symbol to a domain element (e.g., go3  obj23) • Each n-ary function symbol to a function on n-tuples of domain elements (e.g., (go-stp pln1)  obj23 • Each n-ary relation symbol to a set of n-tuples of domain elements (e.g., inst  {(obj23, go-), (obj78, liquor-store), (obj78, store) … } • A first-order probabilistic logic assigns a probability measure to first-order structures • This is called “measure model” semantics (Gaifman,1964)

FOL + Probability: Issues • Probability zero ≠ unsatisfiable • E.g., every possible value of a continuous distribution has probability zero • FOL is undecidable; FOL + probability is not even semi-decidable • Example: IID sequence of coin tosses, 0 < P(H) < 1 • Given any finite sequence of prior tosses, both H and T are possible • We cannot disprove anynon-extreme probability distribution from a finite sequence of tosses • Wrong solution: “We will prevent you from expressing this query because we cannot tractably compute the answer.” • Better solution: “Represent the problem you really want to solve, and then figure out a way to approximate the answer.” • Think carefully about what the real problem is!

Knowledge Based Model Construction • KBMC system contains : • Base representation that represents goals, plans, actions, actors, observables, constraints, etc. • Model construction procedure that maps a context and/or query into a target model • At problem solving time • Construct a problem-specific Bayesian network • Process queries on constructed model using general-purpose BN algorithm • Advantages of expressive representation • Understandability • Maintainability • Knowledge reuse • Exploit repeated structure (representation, inference, learning) • Construct only as much of model as needed for query

Hypothesis Management • Constructed BN rapidly becomes intractable, especially in presence of existence and association uncertainty • What do we really need to represent? • Heuristics help to avoid constructing (or prune) very unlikely hypotheses (or variables with very weak impact on conclusions) • E.g., from only “John went to the airport” do not nominate hypothesis that John intends to set off a bomb • But a security system needs to be on the alert for prospective bombers!

Lifted Inference • Constructed BN (propositionalized theory) typically contains repeated structure • Applying standard BN inference often results in many repetitions of the identical computation • Lifted inference algorithms detect such repetitions • “Lift” problem from ground to first-order level • Perform computation only once • Very active area of research (Braz, et al., 2005)

Learning = Inference … in theory, at least i=1,…,N j=1,…,M Plate model for parameter learning of store-of local distribution

Representing Temporal Evolution • Plans evolve in time • HMM / DBN / PDBN replicate variables describing temporally evolving situation Hidden Markov Model (HMM) unobservable evolving state + observable indicator Partially Dynamic Bayesian Network (PDBN) some variables not time-dependent Dynamic Bayesian Network (DBN) factored representation of state / observable

DBN Inference • Any BN inference algorithm can be applied to a finite-horizon DBN • Special-case inference algorithms exploit DBN structure • “Rollup” algorithm marginalizes out past hidden states given past observations to explicitly represent only a sliding window • Viterbi algorithm finds most probable values of hidden states given observations • Forward-backward algorithm estimates marginal distributions for hidden states given observations • Exact inference is generally intractable • Factored frontier algorithm approximates marginalization of past hidden state for intractable DBNs • Particle filter is a temporal variant of likelihood weighting with resampling • Beware of static nodes!

Particle Filter Resampling initialization Likelihood weighting Resampling Evolution Likelihood weighting • Maintains sample of weighted particles • Each particle is a single realization of all non-evidence nodes • Particle is weighted by likelihood of observation given particle • Particles are resampled with probability proportional to weight From van derMerwe et al. (undated)

Particle Impoverishment • Particles with large weights are sampled more often, leading to low particle diversity • This effect is counteracted by “spreading” effects of process noise • Impoverishment is very serious when: • Observations are extremely unlikely • Low “process noise” leads to long dwell times in widely separated basins of attraction • “In fact, for the case of very small process noise, all particles will collapse to a single point within a few iterations.” • “If the process noise is zero, then using a particle filter is not entirely appropriate.” (Arulampulam et al., 2002)

       X      Particle Filter with Static Nodes • PF cannot recover from impoverishment of static node • Some approaches: • Estimate separate PF for each combination of static node • Only if static node has small state space • Regularized PF - artificial evolution of static node • Ad hoc; no justification for amount of perturbation; information loss over time • Shrinkage (Liu & West) • Combines ideas from artificial evolution & kernel smoothing • Perturbation “shrinks” static node for each particle toward weighted sample mean • Perturbation holds variance of set of particles constant • Correlation in disturbances compensates for information loss • Resample-Move (Gilks & Berzuini) • Metropolis-Hastings step corrects for particle impoverishment • MH sampling of static node involves entire trajectory but is performed less frequently as runs become longer

Stochastic Grammars • Motivation: find representation that is sufficiently expressive for plan recognition but more tractable than general DBN inference • A stochastic grammar is a set of stochastic production rules for generating sequences of actions (terminal symbols in the grammar) • Modularity of production rules yields factored joint distribution

Stochastic Grammar - Example • Taken fromGeiband Goldman (2009) • Plans are represented as and/or tree with temporal constraints

Stochastic Grammar - Inference • Parsing algorithms can be applied to compute restricted class of queries • If plans can be represented in a given formalism then that formalism’s inference algorithms can be applied to process queries • We are often interested in a broader class of queries than traditional parsing algorithms can handle (e.g., we usually have not observed all actions) • Parse tree can be converted to DBN • Enables answering a broader class of queries • Can exploit structure of grammar to improve tractability of inference • Special-purpose algorithms exploit grammar structure

Where Do We Stand? • Contributions of probabilistic methods • Useful way of thinking about problems • Unified approach to reasoning, parameter learning, structure learning • Principled combination of KE with learning • Can learn from small, moderate and large samples • Many general-purpose exact and approximate algorithms with strong theoretical justification and practical success • Good results (better than previous state of the art) on many interesting problems • Many challenging problems remain • Exact learning and inference are intractable • High-dimensional multi-modal distributions are just plain ugly • All inference algorithms break down on the toughest cases • Asymptotics doesn’t mean much when the long run is millions of years! • With good engineering backed by solid theory, we will continue to make progress

Bibliography (1 of 2) Arulampalam, M. Maskell, S., Gordon, N. and Clapp, T. A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking, IEEE Transactions on Signal Processing, 50 , pp. 174–188, 2002. Bidyuk, B. and Dechter, R. "Cutset Sampling for Bayesian Networks", Journal of Artificial Intelligence Research 28, pages 1-48, 2007. Braz, R., Amir, E. and Roth, D. Lifted First-Order Probabilistic Inference. Proceedings of the International Joint Conference on Artificial Intelligence, 2005. Bui, H., Venkatesh, S. and West, G. "Policy Recognition in the Abstract Hidden Markov Model", Artificial Intelligence, Journal of Artificial Intelligence Research, Volume 17, pages 451-499, 2002. Charniak, E. and Goldman, R. A Bayesian Model of Plan Recognition. Artificial Intelligence, 64: 53-79, 1993. Charniak, E. and Goldman, R. A Probabilistic Model of Plan Recognition. Proceedings of the Ninth Conference on Artificial Intelligence 1991. Darwiche, A. Modeling and Reasoning with Bayesian Networks. Cambridge University Press. 2009. Gaifman, H. Concerning measures in First-Order calculi. Israel Journal of Mathematics, 2, 1–18, 1964. Geib, C.W. and Goldman, R.P. A Probabilistic Plan Recognition Algorithm Based on Plan Tree Grammars. Artificial Intelligence 173, pp. 1101–1132, 2009. Gilks, W.R. and Berzuini, C. Following a Moving Target—Monte Carlo Inference for Dynamic Bayesian Models,” Journal of the Royal Statistical Society B, 63, pp. 127–146, 2001. Hsu, J., Lian, C., and Jih, W. Probabilistic Models for Concurrent Chatting Activity Recognition. ACM Transactions on Intelligent Systems and Technology, Vol. 2, No. 1, 2011. Jensen, F., Bayesian Networks and Decision Graphs (2nd edition). Springer, 2007. Korb, K. and Nicholson, A. Bayesian Artificial Intelligence. Chapman and Hall, 2003. Koller, D., Friedman, N. Probabilistic Graphical Models. MIT Press, 2009. Lafferty, J., McCallum, A., and Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, 2001. Laskey, K.B., MEBN: A Language for First-Order Bayesian Knowledge Bases, Artificial Intelligence, 172(2-3): 140-178, 2008

Bibliography (2 of 2) Liao, L. Patterson, D. J. Fox, D. and Kautz, H. Learning and Inferring Transportation Routines. Artificial Intelligence, 2007. Liao, L., Fox, D., AND Kautz, H. Hierarchical Conditional Random Fields for GPS-based Activity Recognition. In Springer Tracts in Advanced Robotics. Springer, 2007. Liu, J. and West, M., Combined Parameter and State Estimation in Simulation-Based Filtering,” in Sequential Monte Carlo Methods in Practice, A. Doucet, J. F. G. de Freitas, and N. J. Gordon, Eds. New York: Springer-Verlag, 2001. Musso, C. Oudjane, N and LeGland, F. Improving Regularised Particle Filters, in Sequential Monte Carlo Methods in Practice, A. Doucet, J. F. G. de Freitas, and N. J. Gordon, Eds. New York: Springer-Verlag, 2001. Neapolitan, R. Learning Bayesian Networks. Prentice Hall, 2003. Pearl, J. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988. Pynadath, D.V. and Wellman, M.P. Probabilistic State-Dependent Grammars for Plan Recognition. Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, 2000. Pynadath, D. V. and Wellman, M. P. Generalized queries on probabilistic context-free grammars. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):65–77, 1998. Richardson, M. and Domingos, P., Markov Logic Networks. Machine Learning, 62, 107-136, 2006. Schmidt, C., Sridharan, N., and Goodson, J., The plan recognition problem: An Intersection of psychology and Artificial Intelligence, Artificial Intelligence 11 pp. 45–83, 1978. van derMerwe, R., Doucet, A., de Freitas, N. and Wan, E. The Unscented Particle Filter, Adv. Neural Inform. Process. Syst. 2000. van derMerwe, R., Doucet, A., de Freitas, N. and Wan, E. (undated) “The unscented particle filter,” http://cslu.cse.ogi.edu/publications/ps/UPF_CSLU_talk.pdf Wellman, M.P., J.S. Breese, and R.P. Goldman (1992) From knowledge bases to decision models. The Knowledge Engineering Review, 7(1):35-53.

Probabilistic Plan Recognition