QGF avoids both failure modes with one key idea: instead of fully integrating the denoising ODE,
we can obtain a cheap, first-order approximation of the denoised action by
taking a single Euler integration step along the reference velocity field:
$$\hat{a}_1 \;=\; a_t \;+\; v_\theta(s,\, a_t,\, t)\cdot(1-t)$$
Then the critic gradient can be approximated with:
$$\nabla_{a_t} Q(s, a_1) \approx \nabla_{a_t} Q(s, \hat{a}_1) = \left(\frac{\partial \hat{a}_1}{\partial a_t}\right)^\top \nabla_{\hat{a}_1} Q(s, \hat{a}_1),$$
which is the product of the gradient of $Q$ at the approximate denoised action and the (transposed) Jacobian of the denoised action with respect to the noisy action, $J = \frac{\partial \hat{a}_1}{\partial a_t} = I + (1-t)\,\frac{\partial v_\theta(s, a_t, t)}{\partial a_t}$.
Empirically, we find that $J$ can be ill-behaved since it requires differentiation through $v_\theta$,
and simply replacing it with the identity gives better performance, effectively computing:
$$\nabla_{a_t} Q(s, a_1) \approx \hat{J}^\top \: \nabla_{\hat{a}_1} Q(s, \hat{a}_1) \quad \text{where} \quad {\hat{a}_1 = a_t + v_\theta(s, a_t,t) \cdot (1-t), \: \hat{J}=I}$$
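The identity-Jacobian estimator above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: `toy_velocity` and `toy_q_grad` are hypothetical stand-ins for $v_\theta$ and $\nabla_a Q$, chosen only so the numbers are easy to follow.

```python
import numpy as np

def qgf_gradient(q_grad, velocity, s, a_t, t):
    """Approximate grad_{a_t} Q(s, a_1) with the identity-Jacobian estimator."""
    # One Euler step from time t to time 1 along the reference velocity field.
    a_hat = a_t + (1.0 - t) * velocity(s, a_t, t)
    # With J_hat = I, the estimate is just the critic gradient at a_hat.
    return q_grad(s, a_hat)

# Hypothetical stand-ins (assumptions for illustration only).
def toy_velocity(s, a, t):
    return s - a  # contracts the action toward the state vector

def toy_q_grad(s, a):
    return 1.0 - a  # gradient of Q(s, a) = -0.5 * ||a - 1||^2

s = np.zeros(2)
a_t = np.array([2.0, -1.0])
g = qgf_gradient(toy_q_grad, toy_velocity, s, a_t, t=0.5)
print(g)
```

Note that no derivative of `velocity` is ever taken: the velocity field is only evaluated, which is what makes the estimator cheap.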
During inference, this gradient is added to the velocity field at each denoising step to "guide" denoising towards higher-value actions:
Algorithm — QGF Inference
Input: state $s$, reference flow $v_\theta$, critic $Q_\phi$, guidance weight $1/\beta$, steps $T$
$a_0 \sim \mathcal{N}(0, I)$
for $t = 0,\, 1/T,\, \ldots,\, 1 - 1/T$ do
$a' \leftarrow a_t + (1-t)\,v_\theta(s, a_t, t)$
$g \leftarrow \nabla_{a'} Q_\phi(s, a')$
$a_{t+1/T} \leftarrow a_t + \tfrac{1}{T}\bigl(v_\theta(s, a_t, t) + \tfrac{1}{\beta}\,g\bigr)$
end for
return $a_1$
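The inference loop above might look as follows in NumPy. This is a sketch under stated assumptions: the initial noise `a0` is sampled by the caller, and `toy_velocity`, `toy_q` and `toy_q_grad` are hypothetical stand-ins for the learned flow and critic.

```python
import numpy as np

def qgf_sample(velocity, q_grad, s, a0, beta, num_steps):
    """Guided Euler denoising: follow v + (1/beta) * critic gradient."""
    a = np.array(a0, dtype=float)      # a_0, expected to be drawn from N(0, I)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(s, a, t)
        a_look = a + (1.0 - t) * v     # one-step denoised estimate a'
        g = q_grad(s, a_look)          # critic gradient at a' (identity Jacobian)
        a = a + dt * (v + g / beta)    # guided Euler step
    return a

# Hypothetical stand-ins (assumptions for illustration only).
def toy_velocity(s, a, t):
    return s - a

def toy_q(s, a):
    return -0.5 * np.sum((a - 1.0) ** 2)

def toy_q_grad(s, a):
    return 1.0 - a

rng = np.random.default_rng(0)
s = np.zeros(2)
a0 = rng.standard_normal(2)
guided = qgf_sample(toy_velocity, toy_q_grad, s, a0, beta=1.0, num_steps=10)
# beta = inf switches guidance off and recovers plain reference sampling.
unguided = qgf_sample(toy_velocity, toy_q_grad, s, a0, beta=np.inf, num_steps=10)
print(toy_q(s, guided), toy_q(s, unguided))
```

Each step costs one velocity evaluation and one critic gradient, with no backpropagation through the denoising chain.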
Two design choices in QGF may appear to be crude approximations:
we drop the Jacobian entirely, and we use a first-order approximation of the denoised action.
Both look like shortcuts one would tolerate only for efficiency.
Surprisingly, we find that neither is merely a compromise: both approximations outperform their more "exact" counterparts.
Intuitively:
More quantitatively, we find that these choices yield a lower-variance
gradient estimator that is better at optimizing $Q$-values.
We compare below three different gradient estimators:
- QGF: $\nabla_{a_t} Q(s, a_1) \approx \hat{J}^\top \: \nabla_{\hat{a}_1} Q(s, \hat{a}_1)$
- QGF-Jacobian: $\nabla_{a_t} Q(s, a_1) \approx J^\top \: \nabla_{\hat{a}_1} Q(s, \hat{a}_1)$
- QGF-chain: $\nabla_{a_t} Q(s, a_1) \approx \hat{J}^\top \: \nabla_{a_t} Q(s, \mathrm{ODE}(a_t))$
where $\hat{J}=I$ and $J=\frac{\partial \hat{a}_1}{\partial a_t}$.
QGF-Jacobian and QGF-chain are the more "exact" counterparts of QGF: the former uses the exact
Jacobian, and the latter the exact denoising process.
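To make the three estimators concrete, here is a toy NumPy sketch. Everything in it is an assumption made for illustration: the velocity field is a hypothetical linear map $v(a) = Wa + b$ (so $J = I + (1-t)W$ is available in closed form), the state is omitted, and QGF-chain is read as "evaluate the critic gradient at the fully integrated action, with the identity in place of the ODE Jacobian".

```python
import numpy as np

# Hypothetical linear velocity field v(a, t) = W a + b (state omitted).
W = np.array([[-1.0, 0.5], [0.0, -1.0]])
b = np.array([0.5, 0.5])

def velocity(a, t):
    return W @ a + b

def q_value(a):
    return -0.5 * np.sum((a - 1.0) ** 2)

def q_grad(a):
    return 1.0 - a  # gradient of q_value

def lookahead(a_t, t):
    # Single Euler step from t to 1.
    return a_t + (1.0 - t) * velocity(a_t, t)

def qgf(a_t, t):
    # Identity Jacobian, one-step lookahead.
    return q_grad(lookahead(a_t, t))

def qgf_jacobian(a_t, t):
    # Exact Jacobian of the one-step lookahead: J = I + (1 - t) W.
    J = np.eye(len(a_t)) + (1.0 - t) * W
    return J.T @ q_grad(lookahead(a_t, t))

def qgf_chain(a_t, t, num_steps=20):
    # Identity Jacobian, but the action is denoised with a multi-step Euler ODE solve.
    a, dt = np.array(a_t, dtype=float), (1.0 - t) / num_steps
    for k in range(num_steps):
        a = a + dt * velocity(a, t + k * dt)
    return q_grad(a)

a_t = np.zeros(2)
for name, fn in [("QGF", qgf), ("QGF-Jacobian", qgf_jacobian), ("QGF-chain", qgf_chain)]:
    print(name, fn(a_t, 0.5))
```

In this linear toy the Jacobian is benign; the ill-behaved case described above arises when $\partial v_\theta / \partial a_t$ comes from differentiating a learned network.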
We evaluate these gradient estimators, together with the OOD and BPTT gradients from above, on two metrics:
(1) gradient noise sensitivity and (2) $Q$-value optimization ability.
For (1), we compute the cosine similarity between the gradient at $a_t$ and at $a_t + \epsilon$,
where $\epsilon$ is a small perturbation.
For (2), we treat each gradient estimator as an "optimizer" that searches for
high-$Q$-value actions; we find this ability to roughly correlate with performance.
The analysis results above show that QGF's two approximations together make it the
least sensitive to gradient noise and the best $Q$-value optimizer. It also outperforms
the OOD and BPTT gradients from above.
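The two metrics used in this comparison can be sketched as follows. The exact evaluation protocol is not spelled out here, so this is only one plausible reading: the perturbation is assumed Gaussian with a hypothetical `noise_scale`, and "optimizer" is taken to mean a few plain gradient-ascent steps with a hypothetical `step_size`.

```python
import numpy as np

def cosine_similarity(u, v, tiny=1e-12):
    # tiny guards against division by zero for vanishing gradients.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + tiny))

def noise_sensitivity(grad_fn, a_t, noise_scale, num_samples, rng):
    """Metric (1): mean cosine similarity between grad(a_t) and grad(a_t + eps)."""
    g = grad_fn(a_t)
    sims = []
    for _ in range(num_samples):
        eps = noise_scale * rng.standard_normal(a_t.shape)
        sims.append(cosine_similarity(g, grad_fn(a_t + eps)))
    return float(np.mean(sims))  # closer to 1 means less sensitive to noise

def q_after_ascent(grad_fn, q_fn, a_t, step_size, num_steps):
    """Metric (2): Q-value reached when the estimator drives gradient ascent."""
    a = np.array(a_t, dtype=float)
    for _ in range(num_steps):
        a = a + step_size * grad_fn(a)
    return float(q_fn(a))

# Hypothetical critic used only to exercise the two functions.
def toy_q(a):
    return -0.5 * np.sum((a - 1.0) ** 2)

def toy_grad(a):
    return 1.0 - a

rng = np.random.default_rng(0)
a_t = np.zeros(2)
print(noise_sensitivity(toy_grad, a_t, noise_scale=0.1, num_samples=32, rng=rng))
print(q_after_ascent(toy_grad, toy_q, a_t, step_size=0.5, num_steps=10))
```

Any of the estimators discussed here can be passed as `grad_fn`, which allows a like-for-like comparison on the same starting actions.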