  • PPO, and keeping the policy on a short leash

    In both the REINFORCE post and the more recent Soft Actor-Critic one I left the same IOU lying around: a proper technical walkthrough of PPO. This is me finally paying it. Proximal Policy Optimization is the algorithm that a good share of modern reinforcement learning applications run on, including...

  • GRPO, and letting the group be the baseline

    The PPO post built its objective up as a stack of repairs to the plain policy gradient, and the last and most expensive piece in that stack was the critic, a second network the size of a small policy whose only job was to supply a baseline. GRPO is what...

  • Soft Actor-Critic, and paying the agent to stay unsure

    A while ago I wrote about REINFORCE, which learns a policy directly and then, rather wastefully, throws away every trajectory the moment it has finished updating on it. That is the price of being on-policy: the data was generated by the current policy, so once the policy changes the data...

  • REINFORCE, landing on planets, and playing Flappy-Bird

    This is a quick blog post about the REINFORCE algorithm in reinforcement learning and a recent project I made with it. It also mentions another, earlier project. REINFORCE: the top level. REINFORCE stands for “REward Increment = Nonnegative Factor * Offset Reinforcement * Characteristic Eligibility”. It is a...

  • Dueling DQN, and splitting a state's worth from an action's

    The Double DQN post changed how the bootstrap target is computed and did not touch the network at all: the same convolutional trunk and the same single head emitting one number per action, with only the arithmetic on top of it rearranged. The dueling architecture is the other half of that...