QGF: Test-Time Gradient Guidance of Flow Policies in RL

TL;DR

We propose QGF (Q-Guided Flow), an RL algorithm that performs policy optimization entirely at test time. QGF trains both a reference flow policy (via behavioral cloning) and a value function critic and, at test time, uses a novel critic gradient estimator to guide the reference flow policy to sample higher-value actions without any additional policy learning.

Why Study Test-Time Policy Improvement?

Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the supervised imitation learning setting, incorporating them into reinforcement learning (RL) pipelines for policy improvement has proven more difficult. Doing so often requires specialized training objectives or back-propagating through the denoising process, which causes well-known stability issues and hurts scalability. A simpler and more scalable alternative is to perform policy improvement entirely at test time: train a reference policy with standard behavioral cloning, learn a critic separately with standard TD learning, and then, at inference time, use the critic to "guide" the reference policy towards higher-value actions.

Specifically, we would like to leverage the critic gradient $\nabla_a Q(s,a)$ for efficient guidance of the reference policy's denoising process towards high-value directions. Is there a way to do this that makes policy performance as good as algorithms that specifically train the actor to maximize returns?

Problems with Naïve Gradient Guidance

Recall that the solution to the KL-regularized reward maximization problem has the closed form:

$$\pi(a \mid s) \;\propto\; \hat{\pi}(a \mid s) \cdot \exp\!\left(\tfrac{1}{\beta}Q(s,a)\right),$$

where $\hat{\pi}$ is the reference policy and $\beta$ is the regularization strength. Since diffusion models (and, under some constraints, flow matching models) learn the score function, we can take $\nabla_a \log$ of both sides; the normalizing constant does not depend on $a$ and drops out:

$$\nabla_a \log \pi(a \mid s) \;=\; \nabla_a \log \hat{\pi}(a \mid s) \;+\; \tfrac{1}{\beta}\,\nabla_a Q(s,a).$$
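As a quick sanity check of the identity above, the sketch below uses a hypothetical 1D setup that is not taken from the paper: a Gaussian reference policy and a quadratic critic. Under those assumptions the tilted policy is itself Gaussian, so its score can be written in closed form and compared with the sum on the right-hand side. All names and constants are illustrative.

```python
# Toy check of: score(pi) = score(pi_ref) + (1 / beta) * dQ/da.
# Assumptions (illustrative, not from the paper): 1D Gaussian reference
# policy and quadratic critic, so the tilted policy is Gaussian as well.

MU, SIGMA = 0.5, 1.2      # reference policy N(MU, SIGMA^2)
A_STAR, BETA = 2.0, 0.8   # critic optimum and regularization strength


def ref_score(a):
    """d/da log N(a; MU, SIGMA^2)."""
    return -(a - MU) / SIGMA**2


def q_grad(a):
    """d/da of the assumed critic Q(a) = -(a - A_STAR)^2."""
    return -2.0 * (a - A_STAR)


def tilted_score(a):
    """Score of pi(a) proportional to N(a; MU, SIGMA^2) * exp(Q(a) / BETA)."""
    precision = 1.0 / SIGMA**2 + 2.0 / BETA
    mean = (MU / SIGMA**2 + 2.0 * A_STAR / BETA) / precision
    return -(a - mean) * precision


def guided_score(a):
    """Right-hand side of the identity: reference score plus scaled critic gradient."""
    return ref_score(a) + q_grad(a) / BETA
```

In this toy case the two scores agree exactly at every action; with a learned critic and noisy actions $a_t$ the relation is only approximate.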

We can extend this identity to noisy perturbations $a_t$ of the action, where it holds only approximately:

$$\nabla_{a_t} \log \pi(a_t \mid s) \;\approx\; \nabla_{a_t} \log \hat{\pi}(a_t \mid s) \;+\; \tfrac{1}{\beta}\,\nabla_{a_t} Q(s,a_t).$$

This suggests that, to sample from the improved policy $\pi$, we can modify the denoising process to integrate the score function of the reference policy plus a guidance term $\tfrac{1}{\beta}\nabla_{a_t} Q(s, a_t)$. This is analogous to classifier guidance, with the learned $Q$-function replacing the classifier.

✗ The Most Natural Approach: OOD Gradient $\nabla_{a_t} Q(s,\, a_t)$

Following the derivation above, the guidance term is directly $\nabla_{a_t} Q(s, a_t)$, the critic gradient at the noisy action $a_t$. However, the critic $Q(s,a)$ is trained exclusively on fully denoised actions from the dataset, and intermediate noisy actions $a_t$ can lie far outside that training distribution. Guiding with this gradient therefore relies on the $Q$-function's gradient being correct at out-of-distribution (OOD) inputs, which is not generally guaranteed.

In fact, we provide an illustrative example below showing that using this OOD gradient for guidance can result in a suboptimal solution.

1D illustrative example comparing OOD, BPTT, and QGF guidance

1D illustrative example. A flow matching model maps Gaussian noise to a tri-modal distribution; $Q$ is negative $L_2$ distance to the optimal action $a^*$. Guidance with the OOD gradient $\nabla_{a_t} Q(s, a_t)$ does not result in the optimal solution — regardless of guidance weight, it consistently misdirects the flow to a suboptimal action. BPTT and QGF both converge to $a^*$.

✗ A More Principled Approach: BPTT Gradient $\nabla_{a_t} Q\!\left(s,\, \mathrm{ODE}(a_t)\right)$

A more principled alternative avoids querying $Q$ on OOD inputs: since flow matching deterministically maps $a_t$ to a fully denoised action $a_1 = \mathrm{ODE}(a_t)$, we can define $Q(s, a_t) := Q(s, a_1)$. The gradient then is $\nabla_{a_t} Q(s, \mathrm{ODE}(a_t))$ and calculating it requires backpropagating through the entire denoising chain.

However, this BPTT gradient is expensive to compute and highly sensitive to noise. Small perturbations in $a_t$ lead to wildly different gradient directions, causing instability. See the illustrative example below, and also our quantitative analysis in the paper.

BPTT gradient instability in 1D illustrative example

BPTT gradient instability. The same 1D illustrative setting shows that the BPTT gradient can be highly unstable, especially at larger guidance weights.
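To make the cost of the BPTT gradient concrete, here is a minimal sketch on a hypothetical 1D flow (not the paper's setup). It assumes noise $\mathcal{N}(0,1)$ is mapped to a single Gaussian mode by linear-interpolation flow matching, for which the marginal velocity has a closed form, and it assumes a quadratic critic. The gradient is accumulated with the chain rule through every Euler step, which is the quantity backpropagation through time computes.

```python
# Toy BPTT gradient: d/da_t Q(ODE(a_t)), unrolled through Euler steps.
# Assumptions (illustrative): 1D target N(M, S^2), closed-form velocity,
# critic Q(a) = -(a - A_STAR)^2.

M, S = 1.0, 0.2
A_STAR = 1.5


def velocity_slope(t):
    """dv/da for the closed-form velocity v(a, t) = M + slope(t) * (a - t * M)."""
    var_t = (1.0 - t) ** 2 + (t * S) ** 2
    return (t * S**2 - (1.0 - t)) / var_t


def velocity(a, t):
    return M + velocity_slope(t) * (a - t * M)


def ode_with_sensitivity(a_t, t, num_steps):
    """Euler-integrate from time t to 1 and also return d a_1 / d a_t."""
    dt = (1.0 - t) / num_steps
    a, da = a_t, 1.0
    for k in range(num_steps):
        tk = t + k * dt
        da *= 1.0 + dt * velocity_slope(tk)  # chain rule through one Euler step
        a += dt * velocity(a, tk)
    return a, da


def bptt_grad(a_t, t, num_steps=20):
    """Critic gradient at the fully denoised action, pulled back to a_t."""
    a_1, da1_dat = ode_with_sensitivity(a_t, t, num_steps)
    return -2.0 * (a_1 - A_STAR) * da1_dat
```

Each call unrolls `num_steps` velocity evaluations and their derivatives, and the gradient has to be recomputed at every guided denoising step, so the cost grows quadratically with the number of steps.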

Our Method: Q-Guided Flow (QGF)

QGF avoids both failure modes with one key idea: instead of fully integrating the denoising ODE, we can obtain a cheap, first-order approximation of the denoised action by taking a single Euler integration step along the reference velocity field:

$$\hat{a}_1 \;=\; a_t \;+\; v_\theta(s,\, a_t,\, t)\cdot(1-t)$$

Then the critic gradient can be approximated with:

$$\nabla_{a_t} Q(s, a_1) \approx \nabla_{a_t} Q(s, \hat{a}_1) = \left(\frac{\partial \hat{a}_1}{\partial a_t}\right)^\top \nabla_{\hat{a}_1} Q(s, \hat{a}_1),$$

which is the product of the gradient of $Q$ at the approximate denoised action and the Jacobian of the approximate denoised action with respect to the noisy action, $J = \frac{\partial \hat{a}_1}{\partial a_t}$. Empirically, we find that $J$ can be ill-behaved since it requires differentiating through $v_\theta$, and that simply replacing it with the identity gives better performance, effectively computing:

$$\nabla_{a_t} Q(s, a_1) \approx \hat{J}^\top \: \nabla_{\hat{a}_1} Q(s, \hat{a}_1) \quad \text{where} \quad {\hat{a}_1 = a_t + v_\theta(s, a_t,t) \cdot (1-t), \: \hat{J}=I}$$
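The two estimators above can be sketched on a hypothetical 1D flow. The setup is an assumption for illustration only: noise $\mathcal{N}(0,1)$ mapped to a single Gaussian mode with a closed-form velocity, a quadratic critic, and a central finite difference standing in for automatic differentiation of $J$.

```python
# Toy comparison of the QGF estimator (J replaced by identity) with the
# variant that keeps J = d a_hat / d a_t.
# Assumptions (illustrative): 1D target N(M, S^2), critic Q(a) = -(a - A_STAR)^2.

M, S = 1.0, 0.2
A_STAR = 1.5


def velocity(a, t):
    """Closed-form marginal velocity for N(0, 1) -> N(M, S^2)."""
    var_t = (1.0 - t) ** 2 + (t * S) ** 2
    return M + (t * S**2 - (1.0 - t)) / var_t * (a - t * M)


def one_step_denoise(a_t, t):
    """a_hat = a_t + (1 - t) * v(a_t, t): single Euler step to time 1."""
    return a_t + (1.0 - t) * velocity(a_t, t)


def q_grad(a):
    return -2.0 * (a - A_STAR)


def qgf_grad(a_t, t):
    """Critic gradient at a_hat, with the Jacobian replaced by the identity."""
    return q_grad(one_step_denoise(a_t, t))


def qgf_jacobian_grad(a_t, t, h=1e-4):
    """Same gradient multiplied by J, estimated here by finite differences."""
    jac = (one_step_denoise(a_t + h, t) - one_step_denoise(a_t - h, t)) / (2 * h)
    return jac * q_grad(one_step_denoise(a_t, t))
```

In this particular toy, $J = t S^2 / \bigl((1-t)^2 + t^2 S^2\bigr)$, which is zero at $t = 0$ because the one-step estimate from pure noise is the data mean regardless of $a_0$. The Jacobian variant therefore provides no guidance at the first step, while the identity variant does. This is a property of the toy and does not reproduce the variance behavior reported for learned velocity fields.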

During inference, this gradient is added to the velocity field at each denoising step to "guide" denoising towards higher-value actions:

Algorithm — QGF Inference

Input: state $s$, reference flow $v_\theta$, critic $Q_\phi$, guidance weight $1/\beta$, steps $T$

$a_0 \sim \mathcal{N}(0, I)$

for $t = 0,\, 1/T,\, \ldots,\, 1 - 1/T$ do

$a' \leftarrow a_t + (1-t)\,v_\theta(s, a_t, t)$ // approx. clean action (1-step Euler)

$g \leftarrow \nabla_{a'} Q_\phi(s, a')$ // QGF gradient estimator without Jacobian

$a_{t+1/T} \leftarrow a_t + \tfrac{1}{T}\bigl(v_\theta(s, a_t, t) + \tfrac{1}{\beta}\,g\bigr)$ // Q-guided step

end for

return $a_1$
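The inference loop above can be sketched end to end on a 1D toy in the spirit of the illustrative example. Everything concrete here is an assumption: the reference flow is replaced by the closed-form marginal velocity that maps $\mathcal{N}(0,1)$ to an equal-weight tri-modal Gaussian mixture, and the critic is the negative distance to an optimal action placed on one mode. In practice $v_\theta$ and $Q_\phi$ would be learned networks.

```python
import math
import random

# Assumptions (illustrative): tri-modal 1D data, closed-form velocity,
# critic Q(a) = -|a - A_STAR|.
MODES = (-2.0, 0.0, 2.0)
S = 0.2          # standard deviation of each mode
A_STAR = 2.0     # optimal action, placed on one of the modes


def velocity(a, t):
    """Closed-form marginal velocity for N(0, 1) -> Gaussian mixture."""
    var_t = (1.0 - t) ** 2 + (t * S) ** 2
    coef = (t * S**2 - (1.0 - t)) / var_t
    # Posterior responsibility of each mode given the noisy action a at time t.
    logits = [-((a - t * m) ** 2) / (2.0 * var_t) for m in MODES]
    top = max(logits)
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return sum(w / total * (m + coef * (a - t * m)) for w, m in zip(weights, MODES))


def q_value(a):
    return -abs(a - A_STAR)


def q_grad(a):
    return 0.0 if a == A_STAR else math.copysign(1.0, A_STAR - a)


def qgf_sample(a_0, guidance_weight, num_steps=20):
    """Q-guided Euler denoising; guidance_weight plays the role of 1 / beta."""
    a = a_0
    for k in range(num_steps):
        t = k / num_steps
        v = velocity(a, t)
        a_hat = a + (1.0 - t) * v   # approximate clean action (1-step Euler)
        g = q_grad(a_hat)           # critic gradient, Jacobian dropped
        a += (v + guidance_weight * g) / num_steps
    return a


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]
mean_q_ref = sum(q_value(qgf_sample(a0, 0.0)) for a0 in noise) / len(noise)
mean_q_qgf = sum(q_value(qgf_sample(a0, 1.0)) for a0 in noise) / len(noise)
```

Setting `guidance_weight` to 0 recovers plain sampling from the reference flow, so comparing `mean_q_ref` with `mean_q_qgf` shows how much the guidance term shifts samples towards higher-value actions in this toy.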

The Two Approximations Beat Their Exact Counterparts

Two design choices in QGF may look like crude approximations that one would only tolerate for efficiency: we drop the Jacobian entirely, and we use a first-order approximation of the denoised action. Surprisingly, we find that neither is merely a compromise: both approximations outperform their more "exact" counterparts. Intuitively:

1: First-Order Approx. vs. Full ODE Denoising

Following the full denoising process of the base BC flow restricts the denoised action to cover the full dataset distribution, while our approximation allows small deviations from the exact dataset distribution and allows the flow to choose only certain modes of the dataset distribution.

2: Identity Approx. ($\hat{J}=I$) vs. Exact Jacobian

The Jacobian can be ill-conditioned, especially at early denoising steps where the one-step estimate is a crude approximation of the ground-truth denoising process. Including it amplifies noise and dramatically increases gradient variance.

More quantitatively, we find that these choices yield a lower-variance gradient estimator that is better at optimizing $Q$-values. We compare below three different gradient estimators:

QGF: $\nabla_{a_t} Q(s, a_1) \approx \hat{J}^\top \: \nabla_{\hat{a}_1} Q(s, \hat{a}_1)$
QGF-Jacobian: $\nabla_{a_t} Q(s, a_1) \approx J^\top \: \nabla_{\hat{a}_1} Q(s, \hat{a}_1)$
QGF-chain: $\nabla_{a_t} Q(s, a_1) \approx \hat{J}^\top \: \nabla_{a_1} Q(s, a_1)$ with $a_1 = \mathrm{ODE}(a_t)$

where $\hat{J}=I$ and $J=\frac{\partial \hat{a}_1}{\partial a_t}$. Both QGF-Jacobian and QGF-chain are the more "exact" counterparts of QGF, respectively using the exact Jacobian, and the exact denoising process.

We evaluate these gradient estimators, together with the OOD and BPTT gradients from above, with two metrics: (1) gradient noise sensitivity and (2) $Q$-value optimization ability. For (1), we compute the cosine similarity between the gradient at $a_t$ and at $a_t + \epsilon$, where $\epsilon$ is a small perturbation. For (2), we view each gradient estimator as an "optimizer" that searches for high-$Q$ actions and measure the $Q$-values of the actions it produces, which we find roughly correlate with performance.

Gradient noise sensitivity. Cosine similarity between the gradient at $a_t$ and at $a_t + \epsilon$. Values near 1 indicate low variance. QGF has the lowest variance of all estimators. Averaged over 20 tasks, 4 seeds.
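Metric (1) can be sketched as a small helper. This is an illustrative version under assumed choices (Gaussian perturbations, a fixed perturbation scale, plain Python lists as actions); the exact perturbation distribution and averaging used in the paper are not specified here. The two gradient fields at the bottom are hypothetical stand-ins for a stable and an unstable estimator.

```python
import math
import random


def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def noise_sensitivity(grad_fn, a_t, eps_scale=0.05, num_trials=100, seed=0):
    """Mean cosine similarity between grad_fn(a_t) and grad_fn(a_t + eps)."""
    rng = random.Random(seed)
    base = grad_fn(a_t)
    sims = []
    for _ in range(num_trials):
        perturbed = [x + rng.gauss(0.0, eps_scale) for x in a_t]
        sims.append(cosine_similarity(base, grad_fn(perturbed)))
    return sum(sims) / len(sims)


# Hypothetical gradient fields used only to exercise the metric.
def smooth_grad(a):
    return [-2.0 * (x - 1.0) for x in a]      # varies slowly with a


def jagged_grad(a):
    return [math.cos(40.0 * x) for x in a]    # changes sign under small shifts


smooth_score = noise_sensitivity(smooth_grad, [0.0, 0.0, 0.0])
jagged_score = noise_sensitivity(jagged_grad, [0.0, 0.0, 0.0])
```

A score near 1 means the estimator points in nearly the same direction after a small perturbation of $a_t$; the rapidly oscillating field scores much lower than the smooth one.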

Q-value optimization ability for each gradient estimator

$Q$-value optimization ability. $Q$ values of final denoised actions under each gradient guidance scheme. QGF's two approximations together make it the best gradient-based optimizer, closely approaching the best-of-$N$ oracle. The OOD gradient exploits the critic and doesn't perform well (see paper for details).

The analysis above shows that QGF's two approximations together make it both the least sensitive to gradient noise and the best $Q$-value optimizer among the estimators compared, including the OOD and BPTT gradients.

Results

Comparing QGF with Test-Time and Train-Time Baselines

We group baselines into two categories: test-time methods, which use a learned critic to optimize action sampling from a BC policy entirely at test time, and training-time methods, which train a policy to maximize $Q(s, a)$ during training. QGF outperforms all prior test-time RL methods by a significant margin and is competitive with the best training-time baseline. QGF also outperforms its variant QGF-Jacobian, showing that dropping the Jacobian in the gradient estimator improves performance.

Offline RL results: QGF vs all baselines

Offline RL performance at 500k training steps (20 tasks from OGBench, 10 seeds). Interestingly, the best training-time baseline (EDP) also uses a first-order approximation of the denoised action.

Scaling Test-Time Compute with Best-of-$N$

Best-of-$N$ (BFN) sampling is an effective way to improve policy performance when additional test-time compute is available. However, we find that BFN is orders of magnitude more expensive in FLOPs than QGF and other test-time methods, and QGF alone outperforms BFN ($N$=4). A variant of our method, QGF+BFN, which combines QGF with BFN sampling, matches BFN ($N$=16) at much lower compute cost, showing that gradient guidance is both efficient and effective.

Performance with best-of-$N$ sampling (20 tasks from OGBench, 10 seeds).

GFLOPs per action for each test-time method.
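For reference, best-of-$N$ selection itself is simple to sketch. The function below is a generic illustration, not the paper's implementation: `sample_action` and `q_value` are hypothetical callables standing in for a policy sampler and a critic, and the toy policy and critic at the bottom are assumptions made only so the example runs.

```python
import random


def best_of_n(sample_action, q_value, state, n, rng):
    """Draw n candidate actions and keep the one with the highest Q-value."""
    candidates = [sample_action(state, rng) for _ in range(n)]
    return max(candidates, key=lambda a: q_value(state, a))


# Hypothetical stand-ins: a tri-modal 1D policy and a distance-based critic.
def toy_policy(state, rng):
    return rng.choice((-2.0, 0.0, 2.0)) + rng.gauss(0.0, 0.2)


def toy_q(state, action):
    return -abs(action - 2.0)


action = best_of_n(toy_policy, toy_q, None, 16, random.Random(0))
```

Every candidate requires a full denoising pass plus a critic evaluation, so the cost grows linearly with $N$. Passing a Q-guided sampler as `sample_action` would be one plausible way to combine guidance with best-of-$N$ selection, though how QGF+BFN is actually composed is not detailed here.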

Scaling QGF to Harder Tasks and Larger Models

On offline goal-conditioned long-horizon tasks, QGF is consistently the best-performing method as difficulty increases and is consistently better than QGF-Jacobian. It also scales better with model size (~$4\times$ gain at 3.2M params), while training-time baselines plateau or collapse.

Goal-conditioned offline RL at 1M steps (25 tasks from OGBench, 10 seeds).

Performance vs. model size on cube-triple (5 tasks).

QGF with Different Critics

QGF is agnostic to the critic type: plugging in a higher-quality TD-based critic (QAM) further boosts performance compared to IQL critics, and QGF+QAM critic outperforms QAM itself.

QGF with IQL vs. TD-based (QAM) critic (20 tasks, 4 seeds).

BibTeX

@article{zhou2026test,
  title   = {Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning},
  author  = {Zhou, Zhiyuan and Peng, Andy and Xu, Charles and Li, Qiyang and Springenberg, Jost Tobias and Frans, Kevin and Levine, Sergey},
  journal = {arXiv preprint arXiv:2606.11087},
  year    = {2026},
}
