This lecture explores SeqGAN and reinforcement learning approaches for chatbot generation, based on two recent research papers. It covers maximizing expected reward, policy gradient ascent, and why direct differentiation through the discriminator is problematic. The chatbot training process and rewards that encourage continuation, novelty, and semantic coherence are discussed. The use of a conditional GAN, the feasibility of backpropagation, and the improved WGAN alternative are explored. The adoption of RL to maximize the discriminator's score as a reward, and rewarding at the sentence versus word level, are explained. Experimental examples illustrate the effectiveness of these techniques in improving chatbot performance.
Based on the following two papers:
• L. Yu, W. Zhang, J. Wang, Y. Yu. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. AAAI, 2017.
• J. Li, W. Monroe, T. Shi, S. Jean, A. Ritter, D. Jurafsky. Adversarial Learning for Neural Dialogue Generation. arXiv:1701.06547v4, 2017.
• And H.-Y. Lee's lecture notes.
Maximizing Expected Reward
• Setup: an encoder–generator chatbot produces a response; a human (in place of a discriminator) scores it, and we update the generator parameters θ.
• We wish to maximize the expected reward: θ* = arg maxθ γθ, where
  γθ = Σh P(h) Σx R(h,x) Pθ(x|h) ≈ (1/N) Σi=1..N R(hi, xi) — by sampling (h1,x1), …, (hN,xN)
• But now, how do we differentiate this sampled estimate?
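A minimal sketch of the sampled estimate above: draw dialogue histories, sample responses from the current generator, and average the rewards. The helpers `sample_response` and `reward` stand in for the generator and the human (or discriminator) scorer and are assumptions, not code from the papers.

```python
# Monte Carlo estimate of gamma_theta ~= (1/N) * sum_i R(h_i, x_i),
# where (h_i, x_i) are sampled dialogue pairs.
import random

def estimate_expected_reward(histories, sample_response, reward, n_samples=64):
    total = 0.0
    for _ in range(n_samples):
        h = random.choice(histories)      # h ~ P(h)
        x = sample_response(h)            # x ~ P_theta(x | h)
        total += reward(h, x)             # R(h, x)
    return total / n_samples
```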
Policy gradient
γθ = Σh P(h) Σx R(h,x) Pθ(x|h) ≈ (1/N) Σi=1..N R(hi, xi)
∇γθ = Σh P(h) Σx R(h,x) ∇Pθ(x|h)
    = Σh P(h) Σx R(h,x) Pθ(x|h) ∇Pθ(x|h) / Pθ(x|h)
    = Σh P(h) Σx R(h,x) Pθ(x|h) ∇log Pθ(x|h)
    ≈ (1/N) Σi=1..N R(hi, xi) ∇log Pθ(xi|hi)   (by sampling)
But how do we use this in practice?
Policy gradient
• Gradient ascent: θnew ← θold + η ∇γθ|θ=θold, with
  ∇γθ ≈ (1/N) Σi=1..N R(hi, xi) ∇log Pθ(xi|hi)
• Notes:
  1. Without R(h,x), this is maximum likelihood.
  2. Without R(h,x), we already know how to optimize this.
  3. To approximate the weighting, we can: if R(hi,xi) = k, repeat (hi,xi) k times; if R(hi,xi) = −k, repeat (hi,xi) k times with learning rate −η.
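The ascent step above can be written as a standard reward-weighted log-likelihood update. The sketch below assumes PyTorch and a hypothetical `generator.log_prob(h, x)` method returning the summed log-probability of a sampled response; it is illustrative, not the papers' implementation.

```python
# Policy-gradient (REINFORCE-style) step:
# theta <- theta + eta * (1/N) sum_i R(h_i, x_i) * grad log P_theta(x_i | h_i)
import torch

def policy_gradient_step(generator, optimizer, batch, rewards):
    # batch: list of (h_i, x_i) pairs; rewards: tensor of R(h_i, x_i)
    optimizer.zero_grad()
    log_probs = torch.stack([generator.log_prob(h, x) for h, x in batch])
    # Gradient *ascent* on the expected reward = descent on its negation.
    loss = -(rewards * log_probs).mean()
    loss.backward()
    optimizer.step()
```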
If R(hi,xi) is always positive
• Because Pθ(x|h) is a probability distribution, the update raises the probability of every sampled response; responses that were not sampled have their probability pushed down by normalization, even if they are good.
[Figure: Pθ(x|h) over responses (h,x1), (h,x2), (h,x3) — ideal case vs. the distortion caused by sampling when some responses are not sampled.]
Solution: subtract a baseline
• If R(hi,xi) is always positive, subtract a baseline b:
  (1/N) Σi=1..N R(hi, xi) ∇log Pθ(xi|hi)  →  (1/N) Σi=1..N (R(hi, xi) − b) ∇log Pθ(xi|hi)
• With the baseline, below-average responses have their probability pushed down instead of merely increased less, so not-sampled responses are not unfairly suppressed.
[Figure: Pθ(x|h) over (h,x1), (h,x2), (h,x3) before and after subtracting the baseline.]
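A tiny sketch of the baseline trick, using the batch-mean reward as the (assumed, simple) choice of b:

```python
# Replace R(h_i, x_i) with the advantage R(h_i, x_i) - b in the update,
# so below-average samples receive a negative weight.
import torch

def advantage(rewards: torch.Tensor) -> torch.Tensor:
    b = rewards.mean()   # simple baseline: mean reward of the batch
    return rewards - b   # use these weights in place of raw rewards
```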
Chatbot by SeqGAN
• Replace the human with a discriminator, using the reward function:
  R(h,x) = λ1 r1(h,x) + λ2 r2(h,x) + λ3 r3(h,x)
  where r1 encourages continuation, r2 encourages saying something new, and r3 encourages semantic coherence.
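The composite reward can be sketched as a weighted sum; the individual reward functions and the weights below are placeholders, not values from the slide.

```python
# Illustrative combination R(h, x) = lambda1*r1 + lambda2*r2 + lambda3*r3,
# where r1, r2, r3 score continuation, novelty, and semantic coherence.
def combined_reward(h, x, r1, r2, r3, lambdas=(0.25, 0.25, 0.5)):
    l1, l2, l3 = lambdas  # hypothetical weights
    return l1 * r1(h, x) + l2 * r2(h, x) + l3 * r3(h, x)
```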
Chatbot by conditional GAN
• Generator: given an input sentence/history h, an encoder–decoder (En/De) chatbot produces a response sentence x.
• Discriminator: given h and a response x, it judges real or fake, trained against human dialogues.
Can we do backpropagation?
• The chatbot (encoder–decoder, decoding from <BOS>) outputs discrete tokens (e.g. A, B) via a sampling step, and the discriminator returns a scalar score.
• Tuning the generator a little bit will not change the discrete output, so no useful gradient flows from the discriminator back to the generator.
• Alternative: an improved WGAN applied to the generator's output distributions (ignoring the sampling process).
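A small illustration of why backpropagation fails here (an assumed toy example, not from the papers): sampling or argmax over the softmax yields a discrete token index, and a small change in the logits usually leaves that index unchanged, so the token carries no gradient back to the generator.

```python
# Discrete decoding breaks the gradient path.
import torch

logits = torch.tensor([2.0, 1.0, 0.5], requires_grad=True)
token = torch.argmax(torch.softmax(logits, dim=0))  # discrete index, no gradient
print(token)  # tensor(0); nudging `logits` slightly still yields index 0
```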
SeqGAN solution, using RL
• Use the output of the discriminator as the reward.
• Update the generator to increase the discriminator score, i.e. to get the maximum reward:
  ∇γθ ≈ (1/N) Σi=1..N R(hi, xi) ∇log Pθ(xi|hi)
• Different from typical RL: the discriminator also keeps updating.
Training alternates two phases:
• g-step: fix the discriminator θt; sample pairs (h1,x1), …, (hN,xN) from the chatbot and score them, R(h1,x1), …, R(hN,xN). New objective: (1/N) Σi=1..N R(hi, xi) log Pθ(xi|hi); update θt+1 ← θt + η ∇γθ|θ=θt, with ∇γθ ≈ (1/N) Σi=1..N R(hi, xi) ∇log Pθ(xi|hi).
• d-step: update the discriminator to separate real (human) dialogues from fake (generated) ones.
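A schematic of this alternating loop, assuming hypothetical `chatbot` and `discriminator` objects with sampling, scoring, and update methods; it sketches the procedure above rather than reproducing the papers' code.

```python
# Alternating SeqGAN-style training: policy-gradient g-step, then d-step.
def train_seqgan(chatbot, discriminator, histories, human_pairs,
                 g_steps=1, d_steps=5, n_iters=1000):
    for _ in range(n_iters):
        for _ in range(g_steps):                          # g-step
            batch = [(h, chatbot.sample(h)) for h in histories]
            rewards = [discriminator.score(h, x) for h, x in batch]
            chatbot.policy_gradient_update(batch, rewards)
        for _ in range(d_steps):                          # d-step
            fake = [(h, chatbot.sample(h)) for h in histories]
            discriminator.update(real=human_pairs, fake=fake)
```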
Rewarding a sentence vs. a word
• Consider the example: hi = “what is your name”, xi = “I don’t know”.
• Then log Pθ(xi|hi) = log Pθ(x1i|hi) + log Pθ(x2i|hi, x1i) + log Pθ(x3i|hi, x1:2i).
• If this sentence receives a low reward, every term is pushed down, including the probability of the first word “I” — yet for a good response such as “I am Ming Li”, the probability of “I” should go up.
• With enough sampled sentences this usually balances out, but when there are not enough samples we can assign rewards at the word level.
Rewarding at word level
• Reward at the sentence level was: ∇γθ ≈ (1/N) Σi=1..N (R(hi, xi) − b) ∇log Pθ(xi|hi)
• Change to the word level: ∇γθ ≈ (1/N) Σi=1..N Σt=1..T (Q(hi, x1:ti) − b) ∇log Pθ(xti|hi, x1:t−1i)
• How do we estimate Q? Monte Carlo.
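A minimal sketch of the word-level objective for one sampled response, assuming the per-token log-probabilities and the Q estimates are given as tensors (names are illustrative):

```python
# Each token x_t gets its own credit Q(h, x_{1:t}) - b instead of one
# sentence-level reward; minimizing this loss performs gradient ascent.
import torch

def word_level_loss(token_log_probs: torch.Tensor,  # (T,), log P_theta(x_t | h, x_{1:t-1})
                    q_values: torch.Tensor,          # (T,), Q(h, x_{1:t})
                    baseline: float = 0.0) -> torch.Tensor:
    return -((q_values - baseline) * token_log_probs).sum()
```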
Monte Carlo estimation of Q
• How do we estimate Q(hi, x1:ti)? E.g. Q(“what is your name?”, “I”).
• Sample sentences starting with “I” using the current generator and evaluate each with the discriminator:
  xA = “I am Ming Li”, D(hi, xA) = 1.0
  xB = “I am happy”, D(hi, xB) = 0.1
  xC = “I don’t know”, D(hi, xC) = 0.1
  xD = “I am superman”, D(hi, xD) = 0.8
• Average: Q(hi, “I”) = (1.0 + 0.1 + 0.1 + 0.8)/4 = 0.5
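A hedged sketch of this estimate: roll out the current generator several times from the given prefix, score each completed sentence with the discriminator, and average. `rollout` and `score` are assumed helper functions.

```python
# Monte Carlo estimate of Q(h, x_{1:t}) by completing the prefix and averaging
# the discriminator scores of the rollouts.
def estimate_q(h, prefix_tokens, rollout, score, n_rollouts=4):
    total = 0.0
    for _ in range(n_rollouts):
        full_sentence = rollout(h, prefix_tokens)   # complete prefix with the generator
        total += score(h, full_sentence)            # discriminator score D(h, x)
    return total / n_rollouts                       # e.g. (1.0 + 0.1 + 0.1 + 0.8)/4 = 0.5
```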
Experiments on chatbots
• Reinforce = SeqGAN with reinforcement learning at the sentence level.
• REGS Monte Carlo = SeqGAN with RL at the word level (reward for every generation step, estimated by Monte Carlo).
Example results from Li et al. 2016 (Li, Monroe, Ritter, Galley, Gao, Jurafsky, EMNLP 2016).