Rishi Jain

World models and the text ceiling

Large language models are trained on trillions of tokens of text, and they are very good, but the way they come to be good has always struck me as a little odd when you set it next to how a person learns. A child does not read the internet. It...

June 14, 2026

Notes
PPO as a stack of repairs to the policy gradient

In both the REINFORCE post and the more recent Soft Actor-Critic one I left the same IOU lying around, a proper technical walkthrough of PPO, and this is me finally paying it. Proximal Policy Optimization is the algorithm that a good many modern reinforcement learning applications run on, including...

June 9, 2026

Notes
GRPO and the critic it throws away

The PPO post built its objective up as a stack of repairs to the plain policy gradient, and the last and most expensive piece in that stack was the critic, a second network the size of a small policy whose only job was to supply a baseline. GRPO is what...

June 9, 2026

Notes
Soft Actor-Critic: paying an agent to stay unsure

A while ago I wrote about REINFORCE, which learns a policy directly and then, rather wastefully, throws away every trajectory the moment it has finished updating on it. That is the price of being on-policy: the data was generated by the current policy, so once the policy changes the data...

June 8, 2026

Notes
REINFORCE on a lunar lander and Flappy Bird Code

This is a quick blog post about the REINFORCE algorithm in reinforcement learning and a recent project I built with it. It also mentions an earlier project. REINFORCE: the top level. REINFORCE stands for “REward Increment = Nonnegative Factor * Offset Reinforcement * Characteristic Eligibility”. It is a...

June 6, 2026

Projects

World models and the text ceiling

PPO as a stack of repairs to the policy gradient

GRPO and the critic it throws away

Soft Actor-Critic: paying an agent to stay unsure

REINFORCE on a lunar lander and Flappy Bird Code ↗