Soft Actor-Critic: paying an agent to stay unsure

A while ago I wrote about REINFORCE, which learns a policy directly and then, rather wastefully, throws away every trajectory the moment it has finished updating on it. That is the price of being on-policy: the data was generated by the current policy, so once the policy changes the data is stale and has to go. I wanted to understand an algorithm that does the opposite, that hoards its old experience in a buffer and keeps squeezing learning out of it long after the policy that produced it has moved on, and that also does something I found genuinely odd at first, which is to deliberately pay the agent to stay uncertain about what to do. That algorithm is Soft Actor-Critic, and these are the notes on why both of those choices, off-policy learning and a reward for randomness, turn out to be the same good idea wearing two hats.

These notes assume you are comfortable with the actor-critic setup that sits under policy gradients, which I covered in the REINFORCE post, and with the overestimation problem in value learning, which I drew out in some detail in the Double DQN post. SAC borrows from both worlds, an actor that learns a policy and a critic that learns values, so it helps to have each one already in hand.

Actor, critic, and an appetite for randomness

SAC keeps two kinds of network and trains them against each other in the usual actor-critic dance. The actor is a policy network that, given a state, produces a probability distribution over actions rather than a single action, so the agent samples what to do instead of computing it deterministically, which is what lets it explore a range of possibilities and stay adaptable when the environment shifts under it. The critic is a value network that grades the actor’s choices, estimating how much long-run reward an action is worth in a given state, and SAC keeps two of these critics rather than one for a reason we will get to.

What makes it soft, and what separates it from a plain actor-critic, is that the objective the actor is chasing has a second term bolted on. An ordinary agent wants to maximize reward and nothing else. A SAC agent wants to maximize reward plus the entropy of its own policy, which is to say it is rewarded both for collecting return and for keeping its action distribution as spread out and undecided as it can get away with. That single addition is the maximum-entropy framework, and almost everything distinctive about SAC, its exploration, its stability, its appetite for old data, follows from it.

It is also an off-policy algorithm, which is the trait it shares with Q-learning and not with REINFORCE. Rather than learning only from the most recent batch of interactions and then discarding them, SAC writes every transition it sees, the state, the action, the reward, the next state and whether the episode ended, into a replay buffer, and it trains by sampling old transitions from that buffer at random. The same experience gets reused many times across many updates, which is what makes it so much more sample-efficient than an on-policy method, and it is the entropy term, as we will see, that keeps that reuse honest.

The objective: reward plus entropy

Let me write the appetite for randomness down properly. The entropy of the policy in a state \( s \) is the expected surprise of its own actions,

\[ \mathcal{H}\big(\pi(\cdot \mid s)\big) = \mathbb{E}_{a\sim\pi}\big[-\log \pi(a \mid s)\big], \]

which is large when the policy spreads its probability across many actions and small when it commits almost all of its mass to one. With a discrete set of actions, a deterministic policy has zero entropy and a uniform one has the most entropy available; with continuous actions the same ordering holds, though the differential entropy of a policy collapsing onto a point falls without bound rather than stopping at zero. The standard reinforcement-learning objective maximizes expected reward summed over a trajectory, and SAC adds the entropy of the policy at every step, each weighted by a coefficient \( \alpha \),

\[ J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t)\sim\rho_\pi}\Big[\, r(s_t, a_t) + \alpha\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big) \,\Big], \]

where \( \rho_\pi \) is the distribution over states and actions that the policy induces as it runs. Read the bracket slowly: at each step the agent banks the reward it actually earned and, alongside it, a bonus of \( \alpha \) times how undecided it managed to remain. An agent under this objective is trying to find good actions while staying as spread out as the rewards will allow, holding back from collapsing all of its confidence onto the single best thing it has seen so far.
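To make the bracket concrete, here is a minimal sketch for a discrete action set. The helper names `entropy` and `soft_return` and the example distributions are mine, invented for illustration, and the function scores a single sampled trajectory rather than the full expectation in \( J(\pi) \).

```python
import numpy as np

def entropy(probs):
    """Shannon entropy H = E[-log pi] of a discrete action distribution, in nats."""
    p = np.asarray(probs, dtype=float)
    nz = p > 0  # treat 0 * log 0 as 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def soft_return(rewards, step_probs, alpha):
    """Sum over steps of r_t + alpha * H(pi(.|s_t)) for one sampled trajectory."""
    return sum(r + alpha * entropy(p) for r, p in zip(rewards, step_probs))

# Hypothetical action distributions over four actions.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
one_hot = [1.0, 0.0, 0.0, 0.0]

for name, p in [("uniform", uniform), ("peaked", peaked), ("one_hot", one_hot)]:
    print(name, entropy(p))

# Two steps of reward 1.0: the agent is paid extra for the step where it stayed undecided.
print(soft_return([1.0, 1.0], [uniform, one_hot], alpha=0.5))
```

The uniform distribution earns the full \( \log 4 \) bonus, the one-hot distribution earns nothing, and \( \alpha \) sets how much that difference is worth in units of reward.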

The coefficient \( \alpha \), called the temperature, is the dial between the two appetites. Push it toward zero and the entropy bonus vanishes, leaving ordinary reward maximization and a policy that collapses onto the single best action it has found, which is pure exploitation. Push it up and the entropy bonus dominates, and the agent will happily sacrifice reward to keep its options spread, which is exploration taken to its limit. Everything interesting in SAC happens in the trade between those two, and the temperature is what sets the exchange rate.

The soft-optimal policy is a Boltzmann distribution

The lovely thing about this objective is that, for a fixed critic, you can write down exactly what the best policy looks like, and it is not some opaque thing a network has to grope toward. Suppose the critic hands us a value \( Q(a) \) for each action in some state, and we want the policy density \( \pi \) that maximizes expected value plus entropy in that state. That is the problem

\[ \max_{\pi}\ \int \pi(a)\,\big(Q(a) - \alpha \log \pi(a)\big)\,da \quad \text{subject to} \quad \int \pi(a)\,da = 1, \]

where the first term is the expected value under the policy and the second is \( \alpha \) times the entropy written out as \( -\mathbb{E}[\log \pi] \). To solve it we attach a Lagrange multiplier \( \lambda \) for the constraint that the density integrates to one and take the functional derivative with respect to the value of \( \pi \) at each action, which gives

\[ Q(a) - \alpha \log \pi(a) - \alpha - \lambda = 0. \]

The \( -\alpha \) appears because differentiating \( -\alpha\,\pi \log \pi \) brings down both a \( -\alpha \log \pi \) and a \( -\alpha \). Solving this for \( \log \pi(a) \) gives \( \log \pi(a) = \frac{Q(a) - \alpha - \lambda}{\alpha} \), and exponentiating both sides separates cleanly into a part that depends on the action and a part that does not,

\[ \pi(a) = \exp\!\Big(\tfrac{Q(a)}{\alpha}\Big)\cdot\exp\!\Big(\tfrac{-\alpha-\lambda}{\alpha}\Big). \]

The second factor carries no \( a \) in it, so it is just a constant that makes the whole thing integrate to one, and folding it into a normalizer \( Z = \int \exp(Q(a)/\alpha)\,da \) leaves the result in its final form,

\[ \pi(a) = \frac{1}{Z}\exp\!\Big(\frac{Q(a)}{\alpha}\Big). \]

So the optimal soft policy is a Boltzmann distribution over the action values at temperature \( \alpha \): each action’s probability is exponential in its value, scaled by how hot the temperature is. The same constant \( Z \) that normalized the policy also gives, through its logarithm, the soft value of the state, the best reward-plus-entropy the agent can expect from there,

\[ V(s) = \alpha \log \int \exp\!\Big(\frac{Q(s, a)}{\alpha}\Big)\,da = \alpha \log Z, \]

and that expression is worth staring at, because \( \alpha \log \int \exp(Q/\alpha) \) is the log-sum-exp, the smooth stand-in for the maximum. As \( \alpha \to 0 \) it converges exactly to \( \max_a Q(s, a) \), the hard greedy value, and the Boltzmann policy above sharpens into a spike on the single best action. The soft value is a softened maximum and the soft policy a softened argmax, and the temperature is how much softening you ask for.
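A quick numerical check of that limit, as a sketch on a discrete action set where the integral becomes a sum. The function names and the three action values are made up for the example.

```python
import numpy as np

def boltzmann_policy(q, alpha):
    """pi(a) proportional to exp(Q(a) / alpha), computed with the max subtracted for stability."""
    z = np.asarray(q, dtype=float) / alpha
    w = np.exp(z - z.max())
    return w / w.sum()

def soft_value(q, alpha):
    """alpha * log sum_a exp(Q(a) / alpha), the discrete analogue of the soft value V(s)."""
    q = np.asarray(q, dtype=float)
    m = q.max()
    return float(m + alpha * np.log(np.sum(np.exp((q - m) / alpha))))

q = np.array([1.0, 2.0, 0.5])  # hypothetical action values
for alpha in (1.0, 0.1, 0.01):
    print(alpha, soft_value(q, alpha), boltzmann_policy(q, alpha).round(3))
```

As \( \alpha \) shrinks, the printed soft value closes in on \( \max_a Q = 2 \) and the policy piles its mass onto the second action. In this discrete case the soft value always sits at or above the hard maximum, with the gap being the entropy bonus the agent collects.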

The widget below is that Boltzmann policy with a temperature dial you can turn. The gray curve is a fixed value landscape \( Q(a) \) over a one-dimensional continuous action, with a tall peak on the right marking the genuinely best action and a shorter peak on the left marking a worse but still decent option. The gold fill is the soft policy \( \pi(a) \propto \exp(Q(a)/\alpha) \) it induces. Cool the temperature right down and the policy becomes a narrow spike sitting on the global peak, the dashed greedy action, ignoring everything else. Warm it up and the policy widens, and past a point it puts real probability on the worse peak too, which is exactly the behavior you want, since the only way to find out whether an action you have underrated is secretly good is to keep trying it.

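[Interactive widget: the Boltzmann policy over the value landscape, with a temperature dial. Readout: α, entropy H, soft value V, P(best), P(worse).]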

The grey curve is the value landscape Q(a): the tall peak on the right is the genuinely best action, the shorter peak on the left a worse option. The gold fill is the soft policy π(a) ∝ exp(Q(a)/α). Turn α down and the policy collapses onto the best action, the dashed line; turn it up and the policy spreads, eventually willing to try the worse option, which is exactly how it would discover a peak it had underrated.

Watch the readout as you move the dial. The entropy climbs as the temperature rises, the soft value swells with it once the entropy bonus starts to outweigh the reward the agent is giving up, and the probability mass leaks from the best basin into the worse one. At the lowest temperature nearly all of the mass sits on the best action and the policy has all but stopped exploring, and at the highest it is approaching the flat uniform distribution that explores everything and commits to nothing.
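If you would rather reproduce the readout offline, the sketch below computes the same four numbers on a grid. The landscape is my own stand-in, two Gaussian bumps with invented heights and widths, not the one the widget uses, so only the qualitative behavior should match.

```python
import numpy as np

a = np.linspace(-1.0, 1.0, 2001)  # one-dimensional action grid
da = a[1] - a[0]
# Hypothetical landscape: tall peak at a = 0.5, shorter peak at a = -0.5.
Q = 1.0 * np.exp(-((a - 0.5) / 0.15) ** 2) + 0.6 * np.exp(-((a + 0.5) / 0.15) ** 2)

def readout(alpha):
    """Return (entropy, soft value, P(best basin), P(worse basin)) at temperature alpha."""
    logits = Q / alpha
    logits -= logits.max()
    density = np.exp(logits)
    density /= density.sum() * da                 # normalize to a density on the grid
    entropy = -np.sum(density * np.log(density)) * da
    m = Q.max()
    soft_v = m + alpha * np.log(np.sum(np.exp((Q - m) / alpha)) * da)
    p_best = density[a > 0].sum() * da            # mass in the right-hand basin
    p_worse = density[a < 0].sum() * da           # mass in the left-hand basin
    return entropy, soft_v, p_best, p_worse

for alpha in (0.02, 0.2, 2.0):
    print(alpha, [round(float(x), 3) for x in readout(alpha)])
```

Because the action is continuous, the entropy here is a differential entropy and goes negative once the policy is narrower than the unit interval, which is why the coldest setting prints a negative number.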

The temperature dial, and tuning it without guessing

The trouble with a single fixed temperature is that the right amount of exploration is not a constant. Early in training the agent knows almost nothing, its value estimates are noise, and it badly wants high entropy so it covers the action space and does not commit to whatever happened to look good first. Later, once it has a real sense of where the reward is, that same high entropy is just throwing away return on actions it already knows are worse. You want the temperature high at the start and low by the end, and picking a schedule for that by hand is the kind of fiddly hyperparameter babysitting that tends to go wrong.

SAC’s answer is to stop setting the temperature directly and instead set a target entropy \( \bar{\mathcal{H}} \), the amount of randomness you would like the policy to hold on average, and let the algorithm adjust \( \alpha \) to hit it. The temperature gets its own little objective,

\[ J(\alpha) = \mathbb{E}_{a\sim\pi}\big[-\alpha\big(\log \pi(a \mid s) + \bar{\mathcal{H}}\big)\big], \]

which is minimized with respect to \( \alpha \) by gradient descent alongside everything else. The gradient with respect to \( \alpha \) works out to \( \mathcal{H}_{\text{current}} - \bar{\mathcal{H}} \), the current entropy minus the target, so the descent step has a pleasantly intuitive shape: when the policy’s current entropy has fallen below the target, meaning it has grown too decisive too soon, the gradient is negative and the step pushes \( \alpha \) up and buys back exploration, and when the entropy is sitting above the target, the gradient is positive and the step pulls \( \alpha \) down and lets the agent start cashing in. The temperature falls naturally over training as the policy earns the right to be confident, without anyone drawing the schedule in advance, and this adaptive temperature is a good part of why SAC tends to just work across tasks that would need very different fixed settings.
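Here is a minimal sketch of one such temperature step, written out so the signs are easy to check. Differentiating \( J(\alpha) \) with respect to \( \alpha \) leaves \( \mathbb{E}[-(\log \pi + \bar{\mathcal{H}})] \), which is the current entropy estimate minus the target. The function name, the learning rate and the positivity clamp are my assumptions; many implementations optimize \( \log \alpha \) instead to keep the temperature positive.

```python
import numpy as np

def temperature_step(alpha, log_probs, target_entropy, lr=0.01):
    """One gradient-descent step on J(alpha) = E[-alpha * (log pi + target_entropy)]."""
    # dJ/dalpha = E[-(log pi + target)] = (current entropy estimate) - (target entropy)
    grad = -np.mean(np.asarray(log_probs, dtype=float) + target_entropy)
    # Clamp to stay positive; an assumption of this sketch, not part of the objective.
    return max(alpha - lr * grad, 1e-8)

# Entropy estimate 0.2 is below the target 1.0, so alpha should rise.
too_decisive = temperature_step(0.2, log_probs=[-0.1, -0.3], target_entropy=1.0)
# Entropy estimate 2.0 is above the target 1.0, so alpha should fall.
too_random = temperature_step(0.2, log_probs=[-1.5, -2.5], target_entropy=1.0)
print(too_decisive, too_random)
```

The batch of log-probabilities stands in for actions freshly sampled from the current policy, whose negative mean is the entropy estimate.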

Two critics, and a habit of pessimism

Now to the second critic I mentioned and quietly deferred. The critic in any value-based method is trained by bootstrapping, pointing each estimate at the reward plus the discounted value of where it lands next, and I spent the whole Double DQN post on the fact that this bootstrap leans systematically high, because the maximization buried in the target seeks out whichever action got the most generous upward kick of noise and reports that inflated value as truth. Left alone the optimism compounds through training, the critic talks itself into a goldmine that is not there, and the policy chasing that critic wobbles.

SAC takes a blunt approach to this. It keeps two critics, \( Q_{\theta_1} \) and \( Q_{\theta_2} \), trained on the same data but initialized differently so that their estimation errors do not line up perfectly, and when it builds the target value it takes the smaller of the two,

\[ y = r + \gamma\Big(\min_{i=1,2} Q_{\theta'_i}(s', a') - \alpha \log \pi(a' \mid s')\Big), \qquad a' \sim \pi(\cdot \mid s'), \]

where \( a' \) is freshly sampled from the current policy at the next state and the \( -\alpha \log \pi(a' \mid s') \) term carries the entropy bonus forward into the value, since in the soft world the value of a state includes the randomness you get to keep there. The piece doing the anti-optimism work is the \( \min \). Taking the minimum of two estimates is a deliberately pessimistic operation, and where the bootstrap’s maximization leans high, the minimum over two critics leans low, so the two biases push against each other.
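In array form the target is only a few lines. This is a sketch with made-up numbers standing in for the critic outputs and log-probabilities. The optional `done` mask is my addition and does not appear in the formula above; it reflects the common practice of zeroing the bootstrap at terminal transitions.

```python
import numpy as np

def soft_td_target(reward, q1_next, q2_next, logp_next, alpha, gamma=0.99, done=None):
    """y = r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s')) for a batch of transitions."""
    reward = np.asarray(reward, dtype=float)
    min_q = np.minimum(q1_next, q2_next)  # pessimistic pick between the two critics
    soft_next = min_q - alpha * np.asarray(logp_next, dtype=float)
    if done is not None:
        # Assumption: no bootstrapping past the end of an episode.
        soft_next = (1.0 - np.asarray(done, dtype=float)) * soft_next
    return reward + gamma * soft_next

# Hypothetical batch of two transitions.
y = soft_td_target(
    reward=[1.0, 0.0],
    q1_next=np.array([2.0, 5.0]),
    q2_next=np.array([3.0, 4.0]),
    logp_next=[-1.0, -2.0],
    alpha=0.2,
    gamma=0.9,
)
print(y)
```

Both critics are then regressed onto the same `y`, so the pessimism enters through the target rather than through either critic’s own loss.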

The widget makes the leaning concrete on the cleanest possible case. An action’s true value is exactly zero, so any nonzero average is bias the estimator invented from noise. Two independent critics each report a noisy estimate of that zero. A single critic is unbiased, as it should be, and its running average sits right on zero. Take the minimum of the two, the way SAC does, and the average drifts below the truth and settles on the closed form \( -c/3 \), where each critic’s noise is drawn uniformly from \( [-c, c] \).¹

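[Interactive widget: running averages of a single critic and of the minimum of two critics. Readout: draws, single critic, min of two, true value 0.000, −c/3.]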

The action is worth exactly 0, so any nonzero average is invented. A single critic is unbiased and its average sits on 0. Keep the smaller of two critics, the way SAC does, and the average settles on −c/3: a deliberate pessimism that leans low to cancel the optimism the bootstrap leans high.

Run a thousand draws and watch the two traces separate: the single critic hugging zero, the minimum sitting calmly below it. That downward bias is the entire point. The bootstrap is busy pushing the estimates up, and the minimum gives back a controlled amount of pessimism to cancel it, which keeps the critic from running away with itself and gives the whole system the smoother, steadier learning that SAC is known for. It is a cruder instrument than Double DQN’s trick of decoupling selection from evaluation, but it is cheap, it composes with everything else, and in the noisy off-policy setting SAC lives in it does the job.
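The same experiment fits in a few lines of NumPy. This is a sketch assuming uniform noise on \( [-c, c] \) with \( c = 1 \); the seed and the number of draws are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
c = 1.0
n = 200_000

# Two independent noisy estimates of a true value of exactly zero.
e1 = rng.uniform(-c, c, size=n)
e2 = rng.uniform(-c, c, size=n)

single_mean = e1.mean()               # a single critic: unbiased
min_mean = np.minimum(e1, e2).mean()  # the smaller of two critics: biased low

print("single critic:", single_mean)
print("min of two:   ", min_mean)
print("closed form:  ", -c / 3)
```

With this many draws the single-critic average lands within sampling error of zero and the min-of-two average within sampling error of \( -c/3 \).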

Off-policy, and why the buffer keeps paying out

The replay buffer is what lets SAC reuse experience, and reuse is where the sample efficiency comes from. Every transition the agent has ever seen sits in the buffer, and each training step samples a batch of old transitions at random rather than waiting for fresh ones, so a single expensive interaction with the environment goes on contributing to the learning long after it happened. In a setting where collecting data is slow or costly, a real robot arm rather than a simulator running a thousand copies in parallel, that reuse is the difference between a feasible training run and an impossible one.
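A replay buffer is a small data structure, and a minimal sketch looks like this. The class and method names are mine. The fixed capacity that overwrites the oldest transitions is an assumption that practical implementations usually make for memory reasons, whereas the description above keeps everything.

```python
import random

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.storage = []
        self.position = 0               # next slot to overwrite once full
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        transition = (state, action, reward, next_state, done)
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.position] = transition  # overwrite the oldest entry
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        """Draw a batch uniformly at random, without replacement, as five parallel tuples."""
        batch = self.rng.sample(self.storage, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.storage)

# Toy usage with integer states; a real agent would store arrays.
buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(state=t, action=0.1 * t, reward=1.0, next_state=t + 1, done=(t == 4))
print(len(buffer), buffer.sample(2))
```

Sampling uniformly at random is what breaks the correlation between consecutive steps of an episode, so each batch mixes transitions from many different moments in training.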

This is exactly what REINFORCE cannot do. Being on-policy, its gradient is only valid for data drawn from the current policy, so the moment it updates, the trajectories it just collected are the wrong distribution and have to be thrown out. SAC sidesteps that because its critics learn values by bootstrapping on single transitions rather than estimating policy gradients from whole trajectories. The reward and next state that followed an action are facts about the environment, not about the policy that happened to choose it, and the one policy-dependent piece of the target, the next action \( a' \), is freshly sampled from the current policy each time the transition is replayed, so an old transition still carries usable information no matter how much the policy has drifted since.

The entropy term and the buffer turn out to support each other rather than just coexisting. A policy kept deliberately spread out visits a wider range of states and actions than a sharpening, greedy one would, so the experience piling up in the buffer is more varied, which gives the critics a richer and less repetitive diet to learn from. The appetite for randomness does double duty here. Beyond exploring the environment in the moment, it keeps the buffer stocked with the kind of varied data that off-policy learning feeds on, which is the sense in which the two headline choices are really one idea.

The loop, briefly

Stitched together the algorithm is a steady cycle. The agent acts in the environment by sampling from its stochastic policy and stores each transition in the buffer, then samples a batch of old transitions back out of it. With that batch it updates both critics by regressing them onto the min-clipped soft target above, updates the actor by nudging its policy toward the Boltzmann distribution the critics imply, which in practice means maximizing \( \mathbb{E}_{a\sim\pi}[Q(s, a) - \alpha \log \pi(a \mid s)] \), and nudges the temperature toward its entropy target. Then it does it all again.
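The claim that maximizing \( \mathbb{E}_{a\sim\pi}[Q(s, a) - \alpha \log \pi(a \mid s)] \) nudges the policy toward the Boltzmann distribution can be checked on a toy case. The sketch below is not the SAC actor update itself: it uses a single state, three discrete actions, a softmax policy over free logits and the exact expectation rather than sampled actions, all of which are simplifications of mine.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def actor_objective_grad(logits, q, alpha):
    """Exact gradient of sum_a pi(a) * (Q(a) - alpha * log pi(a)) with respect to the logits."""
    pi = softmax(logits)
    f = q - alpha * np.log(pi)
    return pi * (f - np.dot(pi, f))

q = np.array([1.0, 0.5, -0.2])  # hypothetical critic values for one state
alpha = 0.5
logits = np.zeros_like(q)       # start from a uniform policy

for _ in range(5000):
    logits += 0.5 * actor_objective_grad(logits, q, alpha)  # plain gradient ascent

learned = softmax(logits)
boltzmann = softmax(q / alpha)
print(learned.round(4), boltzmann.round(4))
```

The gradient vanishes exactly when \( Q(a) - \alpha \log \pi(a) \) is the same for every action, which is the Boltzmann condition, so the two printed distributions should agree to several decimal places.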

One implementation detail is worth naming because it is what makes the actor update differentiable. The policy does not output an action directly. It outputs the mean and spread of a Gaussian, and the action is formed by drawing a noise sample \( \epsilon \) and computing \( a = \tanh(\mu_\phi(s) + \sigma_\phi(s)\,\epsilon) \), the reparameterization trick that pushes the randomness into \( \epsilon \) so the gradient can flow through \( \mu \) and \( \sigma \). The \( \tanh \) squashes the action into a bounded range, which matters for continuous control where torques and velocities have limits, and it costs a small correction term in the log-probability to account for the squashing, but the upshot is that the actor can be trained by ordinary backpropagation through the critic.
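Here is a sketch of that sampling step and its log-probability in NumPy, to show where the correction term comes from. The function name is mine, and the small constant inside the logarithm is a numerical-safety assumption of this sketch. NumPy has no automatic differentiation, so this shows only the forward computation; in a real implementation the same expressions would be written in an autodiff framework.

```python
import numpy as np

def sample_squashed_gaussian(mu, sigma, rng):
    """Draw a = tanh(mu + sigma * eps) and return (a, log pi(a)), summed over action dims."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    eps = rng.standard_normal(mu.shape)  # all the randomness lives in eps
    u = mu + sigma * eps                 # pre-squash Gaussian sample
    a = np.tanh(u)                       # bounded action in (-1, 1)
    # Log-density of u under the diagonal Gaussian N(mu, sigma^2).
    log_prob_u = np.sum(-0.5 * eps**2 - np.log(sigma) - 0.5 * np.log(2 * np.pi), axis=-1)
    # Change of variables: da/du = 1 - tanh(u)^2, so subtract log(1 - a^2) per dimension.
    correction = np.sum(np.log(1.0 - a**2 + 1e-6), axis=-1)
    return a, log_prob_u - correction

rng = np.random.default_rng(0)
# A batch of 4 states with a 2-dimensional action; mu and sigma stand in for network outputs.
actions, log_probs = sample_squashed_gaussian(np.zeros((4, 2)), 0.5 * np.ones((4, 2)), rng)
print(actions)
print(log_probs)
```

The correction is the log of the Jacobian of \( \tanh \): squashing compresses density near the bounds, and the log-probability has to account for that or the entropy term would be computed for the wrong distribution.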

A last small thing that is easy to forget: the stochastic policy is for training. When you actually deploy the agent and want its best behavior, you stop sampling and act on the mean of the policy, the deterministic center of the distribution, because the randomness was scaffolding for exploration and learning, not something you want injected into performance once the learning is done.

Where SAC pulls ahead of PPO

It is worth setting SAC against PPO, the other workhorse of continuous control, because the contrast sharpens what each is for. The first and biggest difference is the one this whole post has been circling: SAC is off-policy and reuses its buffer, while PPO is on-policy and needs fresh data for every update, so SAC is markedly more sample-efficient and pulls ahead whenever generating new experience is slow or expensive, which is most of the time outside a fast simulator.

The second is exploration. PPO does carry an entropy bonus, but it is a small regularizing nudge tacked onto the side of the objective, whereas in SAC entropy is written into the heart of what the agent is optimizing and the temperature controlling it is tuned automatically as learning progresses. That makes SAC the stronger explorer in complicated environments where the agent really does need to try a wide spread of actions before the good ones become clear, and it means SAC adapts its own exploration over training in a way PPO has no built-in mechanism to match. The twin critics and their clipped minimum add a stability against overestimation that PPO, leaning on a single value baseline, handles less directly, and in genuinely continuous action spaces the squashed stochastic policy tends to be the better-fitted tool.

None of this makes PPO the wrong choice, and it would be dishonest to pretend otherwise. PPO has fewer moving parts, no second critic, no temperature to tune, no buffer to manage, and that simplicity buys a reliability that is genuinely valuable, especially when you can throw a massive parallel simulator at it and data is effectively free. SAC is the algorithm to reach for when each sample is precious and the exploration problem is hard, and PPO is the one to reach for when samples are cheap and you would rather not babysit a more elaborate machine. The maximum-entropy framing is what earns SAC its edge on the first kind of problem, and it is the same edge whether you read it as better exploration, steadier critics, or a buffer kept full of varied data.

What’s next

The PPO technical explanation I have been promising is the natural companion to this one, since the two algorithms answer the same question with opposite instincts about how to treat their data. After that the plan is still the Hindi-English translator series, with its own posts on RoPE, beam search and the transformer, and a write-up of the HuggingFace small-models hackathon.

¹ The closed form drops out of the order statistics, the mirror image of the maximum used in the Double DQN post. For two independent draws on \( [0, 1] \) the minimum has cumulative distribution \( 1 - (1 - x)^2 \) and density \( 2(1 - x) \), so its mean is \( \int_0^1 x\cdot 2(1 - x)\,dx = 1/3 \). Rescaling that uniform onto \( [-c, c] \) by \( \epsilon = 2c\,u - c \) gives \( \mathbb{E}[\min] = 2c\cdot\frac{1}{3} - c = -c/3 \).