Dueling DQN, and splitting a state's worth from an action's
by Rishi Jain
The Double DQN post changed how the bootstrap target is computed and did not touch the network at all: the same convolutional trunk and the same single head emitting one number per action, with only the arithmetic on top of it rearranged. The dueling architecture is the other half of that same run of small, almost suspiciously cheap DQN improvements, and it goes the other way: it leaves the target arithmetic exactly as it was and changes the shape of the network instead, on the observation that a single head forced to name a value for every action is doing more work than most states actually require. These are the notes on what that reshaping is, why the obvious way of writing it down is broken, and what the slightly less obvious fix buys you. They assume you are comfortable with DQN itself (the replay buffer, the lagged target network, and the Bellman update), and they lean lightly on the Double DQN notes at the end, since the two are almost always run together.
A single Q head has to learn every action
A plain DQN keeps one function \( Q(s, a) \) that, given a state, returns a value for each action, and the only signal it ever trains on is how wrong those values are. That is a perfectly reasonable thing to ask for when the action genuinely matters, but a great many states are not like that. Picture an agent driving down a long empty stretch of road with no other cars and no junction in sight: steering a touch left, steering a touch right, holding straight, none of it changes the return in any meaningful way, and the only fact worth knowing about this state is that it is a comfortable, safe, moderately valuable place to be. A plain DQN cannot say that fact directly. It has to express “this state is worth about 8, and by the way every action keeps it at about 8” by separately pinning down \( Q(s, \text{left}) \approx 8 \), \( Q(s, \text{straight}) \approx 8 \), \( Q(s, \text{right}) \approx 8 \), refining each of those numbers from its own trickle of experience and keeping them consistent with one another by accident rather than by design.
The thought behind the dueling architecture is that there are really two different questions tangled together in \( Q(s, a) \), and they have very different shapes. One is how good it is to be in this state at all, a single number that does not depend on the action, and the other is how much better or worse each particular action is than the typical action available here, which is where the action-dependence actually lives. Write the first as the value \( V(s) \) and the second as the advantage \( A(s, a) \), and the claim is that the action value is just their sum,
\[ Q(s, a) = V(s) + A(s, a), \]
with \( V(s) \) carrying the part of the estimate that the whole state shares and \( A(s, a) \) carrying only the per-action deviation from it. On the empty road the value stream learns the single number 8 and every advantage sits near zero, which is a far more honest description of the situation than three independently wobbling action values that happen to agree.
Two streams after the trunk
The architecture realizes that split literally. The shared feature extractor, the convolutional or fully-connected layers that turn raw observations into features, is kept exactly as it was, and the network forks only after it. One branch, the value stream, collapses the features down to a single scalar \( V(s) \). The other branch, the advantage stream, produces a vector with one entry \( A(s, a) \) per action. An aggregation step at the very end recombines the scalar and the vector into the action values, and from the outside the network still consumes a state and emits one \( Q(s, a) \) per action, so it drops into an existing DQN training loop without any other change.
Carrying the parameters explicitly will matter in a moment, so let the shared trunk have parameters \( \theta \), the value stream its own head parameters \( \beta \), and the advantage stream its own head parameters \( \alpha \). The value head computes \( V(s; \theta, \beta) \) and the advantage head computes \( A(s, a; \theta, \alpha) \), the two sharing the trunk \( \theta \) and differing only in the small heads stacked on top. Nothing about this is expensive: a couple of extra fully-connected layers and one scalar output is the entire cost, and the forward and backward passes are the same order of work as before.
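For readers who want the fork spelled out, here is a minimal sketch in plain Python of a shared trunk feeding a value head and an advantage head. Everything in it is an assumption made for illustration: the class name `DuelingStreams`, the single hidden layer standing in for the trunk, the layer sizes, and the random initialisation. The aggregation step is deliberately left out, so the method returns the raw scalar and the raw vector.

```python
import random


def linear(weights, bias, x):
    """Dense layer; weights holds one row of input weights per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(n_out, n_in, rng):
    """Hypothetical initialiser: small uniform weights, zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class DuelingStreams:
    """Shared trunk (theta), value head (beta), advantage head (alpha)."""

    def __init__(self, n_obs, n_hidden, n_actions, seed=0):
        rng = random.Random(seed)
        self.theta = init_layer(n_hidden, n_obs, rng)      # shared trunk
        self.beta = init_layer(1, n_hidden, rng)           # value head
        self.alpha = init_layer(n_actions, n_hidden, rng)  # advantage head

    def streams(self, obs):
        features = relu(linear(*self.theta, obs))    # computed once, used twice
        value = linear(*self.beta, features)[0]      # scalar V(s)
        advantages = linear(*self.alpha, features)   # one A(s, a) per action
        return value, advantages


net = DuelingStreams(n_obs=4, n_hidden=8, n_actions=3)
value, advantages = net.streams([0.1, -0.2, 0.3, 0.0])
print(value, advantages)
```

The only structural difference from a single-head network is the last two lines of `streams`: the same features feed two small heads instead of one.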
One observable, two unknowns
The naive aggregation, the bare \( Q = V + A \) written above, has a quiet problem that becomes obvious the moment you ask what the loss can actually see. Training only ever compares the network’s \( Q(s, a) \) against a target built from rewards and bootstrapped next-state values, so the loss is a function of \( Q \) alone and never of \( V \) and \( A \) separately. The trouble is that infinitely many different splits give the same \( Q \). Take any constant \( c \) at all, add it to the value and subtract it from every advantage, and the action values come out byte-for-byte identical,
\[ Q(s, a) = \big( V(s) + c \big) + \big( A(s, a) - c \big), \]
because the two \( c \) terms cancel inside the sum. The network is therefore free to park \( V(s) \) anywhere it likes and absorb the difference into the advantages, and since the loss is blind to which split it chose, gradient descent has no reason to settle on the one we wanted, the one where \( V \) really means the value of the state. The two streams can drift, the value stream learning some arbitrary offset and the advantage stream silently compensating, and the clean interpretation we built the architecture for evaporates while the loss reports no complaint.
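The cancellation is easy to check numerically. The sketch below uses a hypothetical helper `naive_q` for the bare sum and made-up numbers for a three-action state; it shifts the value by a few arbitrary constants and compensates in the advantages.

```python
def naive_q(value, advantages):
    """Bare aggregation: Q(s, a) = V(s) + A(s, a), with no constraint."""
    return [value + a for a in advantages]


value = 3.0
advantages = [-1.0, -1.0, 2.0]

for c in (0.0, 7.0, -100.0):
    shifted_value = value + c                         # V + c
    shifted_advantages = [a - c for a in advantages]  # A - c for every action
    # Each shift prints the same action values: [2.0, 2.0, 5.0]
    print(c, shifted_value, naive_q(shifted_value, shifted_advantages))
```

All three splits produce the same action values, so a loss computed from those action values alone cannot tell them apart.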
The widget below is that ambiguity made draggable. A single state has four fixed action values, the four bars, and those bars are the only thing the loss ever observes, so they never move. The waterline is the value \( V \), and each action’s advantage is the gap between its bar top and that line, gold when the bar pokes above the waterline and red when it sits below. Drag the waterline anywhere and watch what happens: \( V \) changes, all four advantages change with it, and yet every reconstruction \( V + A \) lands back on the same bar, which is the whole problem in one motion. The entire vertical range is loss-equivalent, and nothing in the bars themselves tells you where the waterline ought to go.
The fix is to stop leaving the split free and impose a constraint that picks exactly one waterline out of the family. The dueling architecture does this by subtracting the mean advantage inside the aggregation, so that the value the network actually emits is
\[ Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \frac{1}{\vert \mathcal{A} \vert}\sum_{a'} A(s, a'; \theta, \alpha) \right), \]
where \( \vert \mathcal{A} \vert \) is the number of actions and the sum runs over all of them. The bracketed term is the raw advantage with the average advantage removed, which forces the bracketed advantages to sum to zero no matter what the advantage stream emits, and that single constraint nails the waterline down. Average both sides over the actions and the bracketed term vanishes by construction, leaving
\[ \frac{1}{\vert \mathcal{A} \vert}\sum_{a} Q(s, a) = V(s), \]
so the value stream is pinned to mean exactly the average action value of the state, which is a genuinely sensible thing for it to mean. In the widget this is the “V → mean(Q)” button: it drops the waterline straight onto the average of the four bars, the one height where the gold gaps and the red gaps balance out and the advantages sum to zero.
A small worked example makes the pinning concrete. Suppose a state with three actions whose true action values are \( Q = (2, 2, 5) \), two actions tied and a third clearly better. The mean is \( 3 \), so the mean-subtracted split is \( V = 3 \) with advantages \( A = (-1, -1, 2) \), which do sum to zero, and adding them back gives \( (3-1, 3-1, 3+2) = (2, 2, 5) \). Now take the same action values but let the value stream wander off to \( V = 10 \): the advantages it would need are \( A = (-8, -8, -5) \), and these reconstruct the very same \( (2, 2, 5) \). Both splits are perfect as far as the loss is concerned, and the only thing separating them is the zero-mean constraint, which the first satisfies and the second violates by a mile. Impose the constraint and the second split is no longer reachable, so the value stream has nowhere to drift to.
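The same numbers can be pushed through a mean-subtracted aggregation in code. This is a sketch with a hypothetical function name, `dueling_q`, operating on one state's value and raw advantage vector.

```python
def dueling_q(value, raw_advantages):
    """Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a'))."""
    mean_adv = sum(raw_advantages) / len(raw_advantages)
    return [value + (a - mean_adv) for a in raw_advantages]


# Zero-mean advantages pass through unchanged.
print(dueling_q(3.0, [-1.0, -1.0, 2.0]))   # [2.0, 2.0, 5.0]

# A constant offset in the raw advantages is removed by the aggregation.
print(dueling_q(3.0, [-8.0, -8.0, -5.0]))  # [2.0, 2.0, 5.0]

# A wandering value is now visible in the output: the loss can see it.
q = dueling_q(10.0, [-8.0, -8.0, -5.0])
print(q, sum(q) / len(q))                  # [9.0, 9.0, 12.0] 10.0
```

The last call is the important one. With the constraint in place, moving \( V \) to 10 moves the mean of the emitted action values to 10, so the drifted split no longer reproduces \( (2, 2, 5) \) and the loss penalizes it.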
Why the mean and not the max
The mean is not the only constraint that would pin the waterline, and the more principled-looking choice is the maximum,
\[ Q(s, a) = V(s) + \left( A(s, a) - \max_{a'} A(s, a') \right), \]
which subtracts the largest advantage rather than the average. This one has a cleaner interpretation: it forces the advantage of the best action to be exactly zero, which makes \( V(s) \) equal to \( \max_{a} Q(s, a) \), the value of acting optimally, matching the textbook definition of the advantage function where the best action has no advantage over itself. In the widget it is the “V → max(Q)” button, dropping the waterline onto the tallest bar so that every advantage is zero or negative.
The reason the mean wins in practice is stability, and it comes from what each constraint does to the value stream’s target as the network learns. With the max form, the waterline is tied to whichever action currently looks best, so the instant the advantage stream changes its mind about the top action, even by a hair, the whole waterline jumps to a different bar and \( V(s) \) lurches with it. Early in training the argmax flips constantly as noisy advantages reshuffle, and the value stream ends up chasing a target that keeps teleporting. The mean form ties the waterline to the average instead, which moves smoothly as the advantages adjust and does not care which action is momentarily on top, so the value stream sees a calm target and converges without the thrashing. The original dueling architecture paper tries both and reports the mean as the more stable of the two, and the price is only that \( V(s) \) now means the average action value rather than the optimal one. That is a perfectly usable meaning, and it leaves the greedy policy untouched, because the subtracted term is the same for every action in a state and so cannot change their ranking.
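One way to see the difference in code, offered as an illustration rather than as the paper's own argument, is to look at how the subtracted baseline depends on each raw advantage. The helper `baseline_gradient` below is hypothetical: it returns the derivative of the baseline (the mean or the max of the raw advantages) with respect to each entry, taking the usual subgradient for the max.

```python
def baseline_gradient(raw_advantages, mode):
    """Derivative of the subtracted baseline w.r.t. each raw advantage."""
    n = len(raw_advantages)
    if mode == "mean":
        # The mean depends on every advantage equally, whatever their order.
        return [1.0 / n] * n
    if mode == "max":
        # The max depends only on the current top action.
        top = max(range(n), key=lambda i: raw_advantages[i])
        return [1.0 if i == top else 0.0 for i in range(n)]
    raise ValueError(f"unknown mode: {mode}")


before = [0.50, 0.49, -1.00]  # action 0 narrowly on top
after = [0.49, 0.50, -1.00]   # a tiny change puts action 1 on top

for mode in ("mean", "max"):
    print(mode, baseline_gradient(before, mode), baseline_gradient(after, mode))
```

Under the mean the dependence is identical before and after the reshuffle. Under the max it switches entirely from one action to another after a change of 0.01, which is the bar-hopping described above.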
Where the speed-up actually comes from
With the split well-posed, the benefit is in how the gradient gets spread around, and there are two threads to it. The first is that the value stream learns from every transition through a state regardless of which action was taken, because the mean-subtracted aggregation routes part of every action’s error back into the single shared \( V(s) \). A plain DQN updates only the one action value it has a sample for and leaves the others to wait their turn, so in a state with many actions each individual \( Q(s, a) \) is refined slowly, one visit at a time. The dueling value stream, by contrast, is touched on every visit no matter the action, so it accumulates evidence about the state’s worth far faster than any single action value ever could, and in the many states where the worth of being there is most of what matters this is most of the learning done early and cheaply.
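A tabular sketch makes the routing visible. It assumes a squared-error loss \( \tfrac{1}{2}(Q(s, a) - y)^2 \) on the sampled action with the target \( y \) held constant, and it treats \( V \) and the raw advantages of one state as free parameters. Both function names are hypothetical.

```python
def dueling_gradients(value, raw_advantages, action, target):
    """Gradients of 0.5 * (Q(s, action) - target)**2 for the dueling form."""
    n = len(raw_advantages)
    mean_adv = sum(raw_advantages) / n
    q_taken = value + raw_advantages[action] - mean_adv
    error = q_taken - target
    grad_value = error  # dQ/dV is 1 whichever action was sampled
    grad_advantages = [
        error * ((1.0 if b == action else 0.0) - 1.0 / n)  # dQ/dA_b
        for b in range(n)
    ]
    return grad_value, grad_advantages


def plain_gradients(q_values, action, target):
    """Same loss for a plain per-action table: only one entry is touched."""
    error = q_values[action] - target
    return [error if b == action else 0.0 for b in range(len(q_values))]


for action in range(4):
    grad_v, grad_a = dueling_gradients(0.0, [0.0] * 4, action, target=8.0)
    print("dueling", action, grad_v, grad_a)
    print("plain  ", action, plain_gradients([0.0] * 4, action, target=8.0))
```

Whichever action is sampled, the value parameter receives the full error, while each plain table entry receives a gradient only on the one visit in four that samples it. The advantage gradients also sum to zero, so they only redistribute credit among the actions.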
The second thread is the empty-road case generalized. In states where the choice of action barely changes the outcome, a plain DQN still has to discover that fact the hard way, by separately learning each \( Q(s, a) \) and watching them converge to nearly the same number, which is a lot of redundant effort to establish that the actions were interchangeable all along. The dueling network represents that situation natively: the value stream learns the shared number and the advantage stream learns a vector close to zero, and it never had to refine per-action preferences that did not exist. You can watch the shape of this directly in the widget by switching to “actions barely matter”, where the four bars nearly coincide, the advantage gaps collapse toward nothing, and the single waterline is carrying essentially the entire action value on its own. That is exactly the regime where decoupling pays, and it is common enough in real environments that the architecture earns its keep without any change to the learning rule.
Dueling rarely rides alone
Because the dueling change is purely architectural, living entirely inside the network and its aggregation step, it is orthogonal to almost everything else in the DQN recipe, and that orthogonality is why it is seldom used by itself. It says nothing about how the bootstrap target is computed, so it composes directly with Double DQN, which leaves the architecture alone and instead splits the next-state max into a selection by the online network and an evaluation by the target network to take the overestimation bias back out. The two fixes address completely different leaks, one the shape of the function approximator and the other the optimism of the target, so stacking them is free and standard, and the combination is usually just called a dueling double DQN.
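As a reminder of what the target side looks like, here is a sketch of the two bootstrap targets for a single transition. The function names and argument layout are assumptions; `next_q_online` and `next_q_target` stand for the per-action outputs of the online and target networks at the next state, which in a dueling double DQN would both come out of the dueling aggregation.

```python
def dqn_target(reward, done, gamma, next_q_target):
    """Plain DQN: the target network both selects and evaluates."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, done, gamma, next_q_online, next_q_target):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return reward
    best = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best]


next_q_online = [1.0, 3.0, 2.0]  # online network prefers action 1
next_q_target = [4.0, 2.5, 1.0]  # target network's estimates

print(dqn_target(1.0, False, 0.99, next_q_target))
print(double_dqn_target(1.0, False, 0.99, next_q_online, next_q_target))
```

Neither function knows or cares how the action values were produced, which is the orthogonality in practice: swapping the network head for a dueling one changes nothing here.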
It says nothing about how transitions are drawn from the replay buffer either, so it composes equally well with prioritized experience replay, which stops sampling uniformly and instead draws transitions roughly in proportion to their TD-error, on the reasoning that a transition the network is currently very wrong about carries more to learn from than one it already predicts well, with an importance-sampling correction to undo the bias that non-uniform sampling would otherwise introduce. Put the architectural fix, the target fix, and the sampling fix together and you have three of the six ingredients of the Rainbow agent, which is essentially the observation that this whole run of independent DQN improvements was addressing different leaks and could simply be added up.
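The sampling side can be sketched in a few lines as well. This follows the proportional form of prioritized replay, where priorities are TD-error magnitudes raised to a power and the importance weights are normalized by their maximum. The function names, the exponents `alpha` and `beta`, the small constant `eps`, and the linear scan over the buffer (in place of the sum-tree used in practice) are all choices made for illustration.

```python
import random


def sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """P(i) proportional to (|delta_i| + eps) ** alpha."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probs, beta=0.4):
    """w_i = (N * P(i)) ** (-beta), scaled so the largest weight is 1."""
    n = len(probs)
    weights = [(n * p) ** (-beta) for p in probs]
    top = max(weights)
    return [w / top for w in weights]


def sample_indices(probs, batch_size, rng):
    """Draw buffer indices with replacement according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


td_errors = [0.1, 2.0, 0.5, 0.0]  # made-up TD errors for four transitions
probs = sampling_probabilities(td_errors)
weights = importance_weights(probs)
batch = sample_indices(probs, batch_size=8, rng=random.Random(0))
print(probs, weights, batch)
```

The transition with the largest TD-error is drawn most often and gets the smallest weight, so its gradient contribution is scaled down to compensate for being oversampled.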
Takeaway
The dueling architecture is a bet that the value of a state and the relative merit of each action are different enough in shape to be worth representing with different parts of the network, a single shared scalar for the first and a per-action vector for the second. Written naively as \( V + A \) the bet does not pay, because the loss only ever constrains their sum and the split is free to drift, so the real content of the idea is the mean subtraction that pins the value stream to the average action value and forces the advantages to sum to zero. That one constraint is what turns a redundant pair of streams into a faster learner, since the value stream then collects evidence on every visit to a state rather than every visit to a particular action, and it costs a couple of extra layers and no change at all to the target the network trains toward.