Keep the Adversary Guessing: Agent Security by Policy Randomization


Presentation Transcript


  1. Keep the Adversary Guessing: Agent Security by Policy Randomization Praveen Paruchuri University of Southern California paruchur@usc.edu

  2. Motivation: The Prediction Game • Police vehicle • Patrols 4 regions • Can you predict the patrol pattern? • Pattern 1 • Pattern 2 • Randomization decreases predictability • Increases security

  3. Domains • Police patrolling groups of houses • Scheduled activities at airports, such as security checks, refueling, etc. • The adversary monitors these activities • Randomized policies

  4. Problem Definition • Problem: Security for agents in uncertain adversarial domains • Assumptions for the agent/agent team: • Variable information about the adversary • Adversary cannot be modeled (Part 1): action/payoff structure unavailable • Adversary is partially modeled (Part 2): probability distribution over adversaries • Assumptions for the adversary: • Knows the agent's plan/policy • Exploits the action predictability

  5. Outline • Security via Randomization • No Adversary Model: randomization + quality constraints (MDP/Dec-POMDP) • Partial Adversary Model: mixed strategies via Bayesian Stackelberg games • Contributions: new, efficient algorithms

  6. No Adversary Model: Solution Technique • Intentional policy randomization for security • Information minimization game • MDP/POMDP: sequential decision making under uncertainty • POMDP: Partially Observable Markov Decision Process • Maintain quality constraints • Resource constraints (time, fuel, etc.) • Frequency constraints (likelihood of crime, property value)

  7. Randomization with Quality Constraints • Example constraint: Fuel used < Threshold

  8. No Adversary Model: Contributions • Two main contributions • Single-agent case: • Nonlinear program with an entropy-based metric • Hard to solve (exponential) • Convert to a linear program: BRLP (Binary Search for Randomization LP) • Multi-agent case: RDR (Rolling Down Randomization) • Randomized policies for decentralized POMDPs

  9. MDP-based single-agent case • An MDP is a tuple <S, A, P, R> • S – set of states • A – set of actions • P – transition function • R – reward function • Basic terms used: • x(s,a): expected number of times action a is taken in state s • Policy as a function of the MDP flows: π(s,a) = x(s,a) / Σ_a' x(s,a')
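A minimal Python sketch of the flow-to-policy conversion above; the function name, the NumPy layout of x(s,a), and the uniform fallback for states with zero flow are illustrative choices rather than details from the talk.

```python
import numpy as np

def policy_from_flows(x):
    """Turn MDP flow variables x(s,a) into a randomized policy pi(s,a) = x(s,a) / sum_a' x(s,a')."""
    totals = x.sum(axis=1, keepdims=True)
    uniform = np.full_like(x, 1.0 / x.shape[1])
    # States with zero flow are never visited; fall back to a uniform choice there.
    return np.where(totals > 0, x / np.where(totals > 0, totals, 1.0), uniform)

# Example: 2 states, 2 actions
x = np.array([[3.0, 1.0],
              [0.0, 0.0]])
print(policy_from_flows(x))   # [[0.75, 0.25], [0.5, 0.5]]
```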

  10. Entropy: Measure of randomness • Randomness or information content quantified using entropy (Shannon 1948): H(p) = -Σ_i p_i log p_i • Entropy for an MDP policy π derived from flows x(s,a): • Additive Entropy – add the entropies of each state: H_A = Σ_s H(π(s,·)) • Weighted Entropy – weigh each state's entropy by its contribution to the total flow: H_W = Σ_s w_s H(π(s,·)), with w_s = Σ_a x(s,a) / Σ_s',a' x(s',a')
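The entropy formulas did not survive extraction, so the sketch below simply follows the slide's verbal definitions: additive entropy sums the per-state entropies, weighted entropy weighs each state by its share of the total flow. The exact weighting and log base in the original work may differ.

```python
import numpy as np

def mdp_entropies(x, eps=1e-12):
    """Additive and weighted entropy of the randomized policy induced by flows x(s,a)."""
    totals = x.sum(axis=1, keepdims=True)
    pi = np.where(totals > eps, x / np.where(totals > eps, totals, 1.0), 0.0)
    # Per-state Shannon entropy H(s) = -sum_a pi(s,a) * log pi(s,a)
    h_state = -np.sum(np.where(pi > eps, pi * np.log(pi), 0.0), axis=1)
    additive = float(h_state.sum())                       # add the entropies of each state
    weights = totals.ravel() / max(float(x.sum()), eps)   # each state's share of the total flow
    weighted = float(weights @ h_state)                   # weigh each state by its flow
    return additive, weighted

x = np.array([[3.0, 1.0],
              [1.0, 1.0]])
print(mdp_entropies(x))
```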

  11. Randomized Policy Generation • Non-linear program: maximize entropy (expressed as a function of the flows x(s,a)) subject to reward above a threshold • Exponential-time to solve • Linearize to obtain a poly-time algorithm: BRLP (Binary Search for Randomization LP)

  12. BRLP: Efficient Randomized Policy • Inputs: a high-entropy reference policy (e.g., the uniform policy) and a target reward • LP for BRLP • Entropy controlled with the parameter β

  13. BRLP in Action • [Figure: a scale of increasing β, from β = 0 (deterministic, maximum-reward policy) through β = 0.5 to β = 1 (maximum-entropy policy); the binary search adjusts β until the target reward is met]
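The BRLP linear program itself is not reproduced in the transcript, so the sketch below is an assumption about its shape: each LP maximizes expected reward over the MDP flow variables while forcing x(s,a) ≥ β·x̂(s,a) for the flows x̂ of a high-entropy reference policy, and an outer binary search on β finds the most-randomized solution whose reward still meets the target (β = 0 recovers the deterministic reward-maximizing policy, β = 1 the reference policy). The tiny two-state MDP and the use of scipy.optimize.linprog are illustrative choices.

```python
import numpy as np
from scipy.optimize import linprog

# --- Tiny illustrative MDP (2 states, 2 actions); the numbers are made up ---
nS, nA, gamma = 2, 2, 0.9
P = np.zeros((nS, nA, nS))                      # P[s, a, s'] = transition probability
P[0, 0] = [0.9, 0.1]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.8, 0.2]; P[1, 1] = [0.1, 0.9]
R = np.array([[1.0, 0.0], [0.0, 2.0]])          # R[s, a] = immediate reward
alpha = np.array([0.5, 0.5])                    # initial state distribution

# Flow conservation: sum_a x(s,a) - gamma * sum_{s',a'} P[s',a',s] x(s',a') = alpha[s]
A_eq = np.zeros((nS, nS * nA))
for s in range(nS):
    for a in range(nA):
        A_eq[s, s * nA + a] += 1.0
        for s2 in range(nS):
            A_eq[s2, s * nA + a] -= gamma * P[s, a, s2]
c = -R.ravel()                                  # linprog minimizes, so negate the reward

def reward_lp(beta, x_hat):
    """Maximize expected reward subject to flows x >= beta * x_hat (assumed BRLP-style bound)."""
    bounds = [(beta * x_hat[i], None) for i in range(nS * nA)]
    res = linprog(c, A_eq=A_eq, b_eq=alpha, bounds=bounds)
    return -res.fun, res.x

# Reference flows x_hat: occupation measure of the uniform (maximum-entropy) policy
P_unif = P.mean(axis=1)
d = np.linalg.solve(np.eye(nS) - gamma * P_unif.T, alpha)
x_hat = np.repeat(d / nA, nA)

max_reward, _ = reward_lp(0.0, x_hat)           # beta = 0: deterministic, maximum reward
target = 0.9 * max_reward                       # e.g. tolerate a 10% reward loss

lo, hi = 0.0, 1.0                               # binary search for the largest feasible beta
for _ in range(20):
    mid = (lo + hi) / 2
    r, _ = reward_lp(mid, x_hat)
    lo, hi = (mid, hi) if r >= target else (lo, mid)

reward, x = reward_lp(lo, x_hat)
pi = x.reshape(nS, nA)
print(f"beta = {lo:.3f}, reward = {reward:.3f}")
print("randomized policy:\n", (pi / pi.sum(axis=1, keepdims=True)).round(3))
```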

  14. Results (averaged over 10 MDPs) • For a given reward threshold: • Highest entropy: Weighted Entropy, a 10% average gain over BRLP • Fastest: BRLP, a 7-fold average speedup over the nonlinear expected-entropy program

  15. Multi-Agent Case: Problem • Maximize entropy for agent teams subject to a reward threshold • For the agent team: • Decentralized POMDP framework • No communication between agents • For the adversary: • Knows the agents' policy • Exploits the action predictability

  16. Policy trees: Deterministic vs. Randomized • [Figure: two depth-2 policy trees over actions A1, A2 and observations O1, O2 – the deterministic policy tree assigns a single action at each observation history, while the randomized policy tree assigns a distribution over actions at each node]
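A small sketch of the two representations the figure contrasts: a deterministic policy tree stores one action per node, a randomized tree stores a distribution over actions. The class names and the depth-2 example are my own.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class DeterministicNode:
    action: str                                   # the single action taken at this node
    children: Dict[str, "DeterministicNode"] = field(default_factory=dict)   # keyed by observation

@dataclass
class RandomizedNode:
    action_dist: Dict[str, float]                 # probability of each action at this node
    children: Dict[str, "RandomizedNode"] = field(default_factory=dict)      # keyed by observation

# Depth-2 trees over actions {A1, A2} and observations {O1, O2}
det = DeterministicNode("A1", {"O1": DeterministicNode("A1"),
                               "O2": DeterministicNode("A2")})
rnd = RandomizedNode({"A1": 0.7, "A2": 0.3},
                     {"O1": RandomizedNode({"A1": 0.5, "A2": 0.5}),
                      "O2": RandomizedNode({"A1": 0.2, "A2": 0.8})})
```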

  17. RDR: Rolling Down Randomization • Input: • Best (local or global) deterministic joint policy • Percent of reward loss allowed • d parameter – sets the number of turns each agent gets • Ex: d = 0.5 => number of steps = 1/d = 2, so each agent gets one turn (for the 2-agent case) • A single-agent MDP problem is solved at each step

  18. RDR with d = 0.5 (M = maximum joint reward) • Step 1 – Agent 1's turn: fix Agent 2's policy, maximize joint entropy subject to joint reward > 90% of M • Step 2 – Agent 2's turn: fix Agent 1's policy, maximize joint entropy subject to joint reward > 80% of M
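The loop below is only a skeleton of the rolling-down procedure described on the two slides above; `improve` is a hypothetical callback standing in for the single-agent solve (fix the other agent's policy, maximize joint entropy subject to the rolled-down reward bound), which the transcript does not spell out.

```python
def rdr(joint_policy, max_reward, reward_loss=0.2, d=0.5, improve=None):
    """Skeleton of Rolling Down Randomization for a team of agents.

    joint_policy: list with one policy object per agent (representation left open).
    improve: hypothetical solver called as improve(joint_policy, agent, threshold).
    """
    steps = int(round(1.0 / d))                  # e.g. d = 0.5 -> 2 steps
    per_step_loss = reward_loss / steps          # roll the reward bound down gradually
    for k in range(1, steps + 1):
        agent = (k - 1) % len(joint_policy)      # agents take turns
        threshold = max_reward * (1.0 - k * per_step_loss)   # 90% of M, then 80% of M, ...
        joint_policy[agent] = improve(joint_policy, agent, threshold)
    return joint_policy
```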

  19. RDR Details • To derive the single-agent MDP at each step: • New transition, observation, and belief-update rules are needed • Original (single-agent POMDP) belief update: b'(s') ∝ O(s', a, o) Σ_s P(s' | s, a) b(s) • New belief update: the same form, but with the fixed teammate's randomized policy folded in by averaging over the actions it may take
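The standard single-agent POMDP belief update can be stated with confidence; the teammate-marginalized variant below is only a guess at the shape of RDR's new rule (it averages over the fixed teammate's action distribution and ignores its observations for simplicity), since the original formulas were lost from the transcript.

```python
import numpy as np

def belief_update(b, a, o, P, O):
    """Standard POMDP update: b'(s') ∝ O[s', a, o] * sum_s P[s, a, s'] * b[s]."""
    b_next = O[:, a, o] * (b @ P[:, a, :])
    return b_next / b_next.sum()

def belief_update_fixed_teammate(b, a1, o1, pi2, P, O):
    """Assumed RDR-style update with the teammate's policy pi2[a2] fixed.

    P[s, a1, a2, s'] and O[s', a1, a2, o1] are the joint-model transition and
    observation functions; the marginalization over a2 is my simplification.
    """
    b_next = np.zeros_like(b)
    for a2, p_a2 in enumerate(pi2):
        b_next += p_a2 * O[:, a1, a2, o1] * (b @ P[:, a1, a2, :])
    return b_next / b_next.sum()

# Tiny demo of the standard update: 2 states, 1 action, 2 observations
P = np.array([[[0.7, 0.3]], [[0.4, 0.6]]])      # P[s, a, s']
O = np.array([[[0.8, 0.2]], [[0.3, 0.7]]])      # O[s', a, o]
print(belief_update(np.array([0.5, 0.5]), a=0, o=1, P=P, O=O))
```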

  20. Experimental Results: Reward Threshold vs. Weighted Entropy (averaged over 10 instances)

  21. Security with Partial Adversary Modeled • Police agent patrolling a region. • Many adversaries (robbers) • Different motivations, different times and places • Model (Action & Payoff) of each adversary known • Probability distribution known over adversaries • Modeled as Bayesian Stackelberg game

  22. Bayesian Game • It contains: • A set of agents N (police and robbers) • A set of types θ_m for each agent (police and robber types) • A set of strategies σ_i for each agent i • A probability distribution over types П_j : θ_j → [0, 1] • A utility function U_i : θ_1 × θ_2 × σ_1 × σ_2 → R

  23. Stackelberg Game • Agent as leader • Commits to its strategy first: the patrol policy • Adversaries as followers • Optimize against the leader's fixed strategy • Observe patrol patterns to leverage that information • Example (agent = row player, adversary = column player): the Nash equilibrium <a, a> gives payoffs [2, 1]; if the leader commits to the uniform random strategy {0.5, 0.5}, the follower plays b and the payoffs become [3.5, 1]
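The payoff table itself did not survive extraction; the sketch below uses a 2x2 matrix consistent with the numbers quoted on the slide ([2, 1] at the Nash equilibrium and [3.5, 1] under the uniform commitment), so treat the exact entries as an assumption.

```python
import numpy as np

# Assumed payoffs (agent = row player, adversary = column player), chosen to match the slide
R = np.array([[2.0, 4.0],      # agent (leader) payoffs
              [1.0, 3.0]])
C = np.array([[1.0, 0.0],      # adversary (follower) payoffs
              [0.0, 2.0]])

# Pure-strategy Nash equilibrium <a, a>: both play their first strategy
print("Nash <a,a> payoffs:", (R[0, 0], C[0, 0]))             # (2.0, 1.0)

# Leader commits to the uniform mixed strategy {0.5, 0.5}
x = np.array([0.5, 0.5])
j = int(np.argmax(x @ C))                                     # follower best-responds
print("Follower plays:", "ab"[j])                             # 'b'
print("Payoffs:", (float(x @ R[:, j]), float((x @ C)[j])))    # (3.5, 1.0)
```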

  24. Previous work: Conitzer & Sandholm, AAAI'05 and EC'06 • MIP-Nash (AAAI'05): efficient procedure for finding the best Nash equilibrium • Multiple LPs method (EC'06): given a normal-form game, finds the optimal leader strategy to commit to • Bayesian to normal-form game via the Harsanyi transformation: exponentially many adversary strategies; NP-hard • For every joint pure strategy j of the adversary (R, C: agent and adversary payoff matrices), solve: maximize Σ_i R_ij x_i subject to Σ_i C_ij x_i ≥ Σ_i C_ij' x_i for all j', Σ_i x_i = 1, x ≥ 0
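A compact implementation of the multiple-LPs idea as described on the slide: one LP per follower pure strategy j, constraining j to be a best response while maximizing the leader's payoff, then keeping the best j. It reuses the assumed 2x2 matrices from the previous sketch.

```python
import numpy as np
from scipy.optimize import linprog

def multiple_lps(R, C):
    """For each follower pure strategy j, maximize sum_i R[i,j] x_i over leader mixed
    strategies x that keep j a best response; return the best (value, x, j)."""
    n_leader, n_follower = R.shape
    best = (-np.inf, None, None)
    for j in range(n_follower):
        c = -R[:, j]                                   # linprog minimizes
        # Best-response constraints: sum_i (C[i,j'] - C[i,j]) x_i <= 0 for all j' != j
        A_ub = np.array([C[:, jp] - C[:, j] for jp in range(n_follower) if jp != j])
        res = linprog(c,
                      A_ub=A_ub if len(A_ub) else None,
                      b_ub=np.zeros(len(A_ub)) if len(A_ub) else None,
                      A_eq=np.ones((1, n_leader)), b_eq=[1.0],
                      bounds=[(0, None)] * n_leader)
        if res.success and -res.fun > best[0]:
            best = (-res.fun, res.x, j)
    return best

R = np.array([[2.0, 4.0], [1.0, 3.0]])
C = np.array([[1.0, 0.0], [0.0, 2.0]])
value, x, j = multiple_lps(R, C)
# Optimal commitment here is about 3.67 with x ≈ (2/3, 1/3): better than the Nash payoff (2)
# and better than the uniform commitment (3.5).
print(round(value, 3), x.round(3), j)
```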

  25. Bayesian Stackelberg Game: Approach • Two approaches: • Heuristic solution – ASAP: Agent Security via Approximate Policies • Exact solution – DOBSS: Decomposed Optimal Bayesian Stackelberg Solver • Exponential savings: • No Harsanyi transformation • No exponential number of LPs • A single MILP (Mixed-Integer Linear Program)

  26. ASAP vs DOBSS • ASAP: Heuristic • Control probability of strategy • Discrete probability space • Generates k-uniform policies • k = 3 => Probability = {0, 1/3, 2/3, 1} • Simple and easy to implement • DOBSS: Exact • Modify ASAP Algorithm • Discrete to continuous probability space • Focus of rest of talk

  27. DOBSS Details • Previous work: • Fix an adversary (joint) pure strategy • Solve an LP to find the best agent strategy • My approach: • For each agent mixed strategy, find the adversary's best response • Advantages: • Decomposition technique: given the agent's strategy, each adversary type can find its best response independently • A mathematical technique yields a single MILP

  28. Obtaining the MILP • Decompose by adversary type (no Harsanyi transformation) • Substitute for the products of leader and follower variables (e.g., z_ij = x_i · q_j) to linearize, yielding a single MILP
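The DOBSS MILP itself is not reproduced in the transcript. Below is a minimal MILP sketch for a single adversary type only, using a standard big-M linearization of the leader-commitment problem rather than DOBSS's exact decomposed z-variable formulation; the payoff matrices are the same assumed example as above, and pulp is just one convenient way to state a MILP.

```python
import numpy as np
import pulp

R = np.array([[2.0, 4.0], [1.0, 3.0]])    # leader payoffs (assumed example)
C = np.array([[1.0, 0.0], [0.0, 2.0]])    # follower payoffs
nL, nF = R.shape
M = 100.0                                  # big-M constant, larger than any payoff gap

prob = pulp.LpProblem("stackelberg_milp", pulp.LpMaximize)
x = [pulp.LpVariable(f"x_{i}", lowBound=0, upBound=1) for i in range(nL)]  # leader mixed strategy
q = [pulp.LpVariable(f"q_{j}", cat="Binary") for j in range(nF)]           # follower pure response
v_f = pulp.LpVariable("v_f")               # follower's best-response value
v_l = pulp.LpVariable("v_l")               # leader's value (the objective)

prob += v_l
prob += pulp.lpSum(x) == 1
prob += pulp.lpSum(q) == 1
for j in range(nF):
    follower_payoff = pulp.lpSum(C[i, j] * x[i] for i in range(nL))
    leader_payoff = pulp.lpSum(R[i, j] * x[i] for i in range(nL))
    prob += v_f >= follower_payoff                    # no strategy beats the chosen response
    prob += v_f <= follower_payoff + (1 - q[j]) * M   # the chosen j attains that best value
    prob += v_l <= leader_payoff + (1 - q[j]) * M     # leader is paid against the chosen j

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([pulp.value(v) for v in x], [pulp.value(v) for v in q], pulp.value(v_l))
```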

  29. Experiments: Domain • Patrolling domain: security agent and robbers • The security agent patrols houses • Ex: visit house a, observing that house and its neighbor • Plans cover a patrol of length 2 • 6 or 12 agent strategies for 3 or 4 houses • Robbers can attack any house: 3 possible choices with 3 houses • Reward depends on the house and the agent's position • The joint strategy space of the robbers is exponential: 3^10 joint strategies for 3 houses and 10 robbers

  30. Sample Patrolling Domain: 3 & 4 houses • [Results figure – follower types handled: 3 houses: multiple LPs 7, DOBSS 20; 4 houses: multiple LPs 6, DOBSS 12]

  31. Conclusion • Agent cannot model adversary • Intentional randomization algorithms for MDP/Dec-POMDP • Agent has partial model of adversary • Efficient MILP solution for Bayesian Stackelberg games

  32. Vision • Incorporating machine learning • Dynamic environments • Resource constrained agents • Constraints might be unknown in advance • Developing real world applications • Police patrolling, Airport security

  33. Thank You • Any comments/questions?
