Learning from Disagreeing Demonstrators


Presentation Transcript


    Slide 1:Learning from Disagreeing Demonstrators

    Bruno N. da Silva
    University of British Columbia
    bnds@cs.ubc.ca

    Slide 2:Motivation

    Some traditional cases of Learning from Demonstration assume a human expert
    In some (subjective) tasks, there might not be a single expert
        How to drive from point A to B

    Slide 3:Motivation

    In general, these tasks involve more than one feature
        e.g. in the driving domain, we want to optimize travel time and number of crashes
    Different contexts lead to different tradeoffs between features
    Idiosyncratic demonstrators do not reflect on their routine approach to the problem

    Slide 4:Problem definition

    How can we integrate idiosyncratic (disagreeing) demonstrations to form a homogeneous and effective policy?

    Slide 5:Solution

    We extend the framework presented by Argall et al., 2007
        Traditional demonstrations in the first stage
        Robot execution and human critique in the second stage
            Robot collects critiques
            Robot updates policy

    Slide 6:The 1st stage of the mechanism

    Slide 7:The 2nd stage of the mechanism

    Slide 8:A little more concretely…

    The first stage can be interpreted as a set of datapoints (p_m, a_n, c)
        Perception p_m
        Action a_n
        Confidence in the mapping c
    The criticism will affect the confidence
        If the critic praises the execution, increase c
        If the critic knocks the execution, decrease c
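
The datapoint triple and the praise/knock rule above can be sketched in Python. This is a minimal illustration, not the presenter's implementation: the `Datapoint` class, the `apply_critique` function, and the fixed step `delta` are all hypothetical names and values chosen for the example, and the slide does not say how large the increase or decrease should be.

```python
from dataclasses import dataclass


@dataclass
class Datapoint:
    perception: tuple   # p_m: what the robot perceived
    action: str         # a_n: the action demonstrated for that perception
    confidence: float   # c: confidence in the perception-to-action mapping


def apply_critique(point: Datapoint, praised: bool, delta: float = 0.1) -> Datapoint:
    """Return a copy of the datapoint with c raised on praise, lowered on a knock.

    delta is a hypothetical fixed step size (an assumption, not from the slides).
    """
    change = delta if praised else -delta
    return Datapoint(point.perception, point.action, point.confidence + change)


# Usage: one demonstrated mapping, critiqued both ways.
point = Datapoint(perception=(0.0, 1.0), action="turn_left", confidence=1.0)
praised_point = apply_critique(point, praised=True)
knocked_point = apply_critique(point, praised=False)
```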

    Slide 9:But let’s not be naïve

    If demonstrators “lie” in the demonstration, they would “lie” in the criticism
    Therefore, associate a reputation r_i with each demonstration d_i
    And update the confidence level carefully
        c := c + r_i * f(feedback)
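
A small sketch of the reputation-weighted update c := c + r_i * f(feedback) follows. The slide does not define f, so the mapping used here (+1 for praise, -1 for a knock) is an assumption made only for illustration, as are the function names and the string labels for feedback.

```python
def f(feedback: str) -> float:
    """Hypothetical feedback signal: +1 for praise, -1 for a knock (an assumption)."""
    return {"praise": 1.0, "knock": -1.0}[feedback]


def update_confidence(c: float, r_i: float, feedback: str) -> float:
    """Apply c := c + r_i * f(feedback), scaling the critique by the critic's reputation."""
    return c + r_i * f(feedback)


# Usage: the same knock moves c less when the critic's reputation is lower.
from_trusted = update_confidence(c=1.0, r_i=0.5, feedback="knock")
from_distrusted = update_confidence(c=1.0, r_i=0.0, feedback="knock")
```

With this form, a critic whose reputation has dropped to zero no longer changes the confidence at all, which is one way to read "carefully" on the slide.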

    Slide 10:Adjusting reputation ranks

    And adjust r_i based on (lack of) improvement from d_i’s feedback
        r_i := r_i + ? * evaluation(feedback)
    evaluation(.) can be interpreted as a Pareto improvement from the feedback
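
One possible reading of the reputation update is sketched below. Several parts are assumptions: the coefficient that appears as "?" in the transcript is replaced by a hypothetical step size `step`; `evaluation` is given the feature costs measured before and after the feedback was applied rather than the feedback itself; and it returns +1 for a Pareto improvement and -1 otherwise. The features follow the driving example (travel time, number of crashes), with lower values being better.

```python
from typing import Sequence


def is_pareto_improvement(before: Sequence[float], after: Sequence[float]) -> bool:
    """True if no cost got worse and at least one got strictly better (lower is better)."""
    no_worse = all(a <= b for a, b in zip(after, before))
    strictly_better = any(a < b for a, b in zip(after, before))
    return no_worse and strictly_better


def evaluation(before: Sequence[float], after: Sequence[float]) -> float:
    """Hypothetical scoring: +1 for a Pareto improvement, -1 for lack of one."""
    return 1.0 if is_pareto_improvement(before, after) else -1.0


def update_reputation(r_i: float, before: Sequence[float], after: Sequence[float],
                      step: float = 0.1) -> float:
    """Apply r_i := r_i + step * evaluation(...); step is an assumed coefficient."""
    return r_i + step * evaluation(before, after)


# Usage: costs are (travel time in minutes, number of crashes).
improved = update_reputation(0.5, before=(30.0, 2.0), after=(25.0, 2.0))
traded_off = update_reputation(0.5, before=(30.0, 2.0), after=(25.0, 3.0))
```

In the second call the travel time drops but crashes rise, so the change is a tradeoff rather than a Pareto improvement and the reputation goes down under this scoring.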

    Slide 11:Current investigations

    Policy conversion?
        Rate of conversion?
    What are the long-term effects on human demonstrators?
        Frustration?
        Repudiation?
    Will critiques really be mindful?

    Slide 12:Thanks!

    Questions?
