vLLM V0 To V1: Correctness Before Corrections In RL

TL;DR

Hugging Face has completed key fixes in vLLM V1, restoring backend behavior to match vLLM V0, particularly in logprobs computation. This correction is vital for reliable reinforcement learning training. Remaining uncertainties include the impact of these fixes on other RL algorithms and future updates.

Hugging Face reports that vLLM V1 now matches vLLM V0 in backend behavior for its reinforcement learning (RL) training setup. Four fixes, mainly to logprobs computation and runtime defaults, closed the gap.

The company identified discrepancies in rollout logprobs, runtime defaults, inflight weight updates, and the use of fp32 lm_head as the main causes of initial mismatches between vLLM V1 and the vLLM V0 reference. These issues affected training metrics such as clip rate, KL divergence, entropy, and reward, which initially diverged from the expected behavior.
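
Of the four factors listed above, the fp32 lm_head concerns the numeric precision of the final projection to vocabulary logits. The sketch below is only an illustration of why that precision can shift logprobs; it is not Hugging Face's or vLLM's code. The logit values are made up, and bf16 is approximated by truncating a float32 to its top 16 bits (real hardware typically rounds to nearest instead).

```python
import math
import struct


def to_bf16_truncated(x):
    """Approximate bf16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


# Hypothetical logits for a three-token vocabulary.
logits_full = [12.3456, 11.9876, 3.14159]
logits_bf16 = [to_bf16_truncated(x) for x in logits_full]

logprobs_full = log_softmax(logits_full)
logprobs_bf16 = log_softmax(logits_bf16)

# Largest per-token logprob difference caused by the precision loss alone.
max_gap = max(abs(a - b) for a, b in zip(logprobs_full, logprobs_bf16))
print(f"max logprob gap: {max_gap:.4f}")
```

Even with identical weights, a small per-token gap of this kind feeds directly into the policy ratio, which is one reason the rollout engine and the trainer need to agree on where precision is lost.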

To address this, Hugging Face set the logprobs mode to processed_logprobs, fixed runtime defaults such as prefix caching and async scheduling, and aligned the inflight weight update procedure. With these changes, vLLM V1 produced training metrics consistent with the vLLM V0 reference, according to the comparative figures the team published.
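
The difference between raw and processed logprobs can be sketched in a few lines. As we understand vLLM's terminology, "raw" values are computed before sampling processors are applied and "processed" values after them (temperature being the simplest case). The code below is a plain-Python illustration with made-up logits and a hypothetical temperature; it does not call vLLM.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def raw_logprobs(logits):
    """Logprobs of the unmodified model distribution."""
    return log_softmax(logits)


def processed_logprobs(logits, temperature):
    """Logprobs of the distribution tokens are actually sampled from."""
    return log_softmax([x / temperature for x in logits])


# Hypothetical logits and sampling temperature.
logits = [2.0, 1.0, 0.0]
temperature = 0.7

raw = raw_logprobs(logits)
processed = processed_logprobs(logits, temperature)
gap = [p - r for p, r in zip(processed, raw)]

for i, (r, p) in enumerate(zip(raw, processed)):
    print(f"token {i}: raw={r:.4f} processed={p:.4f}")
```

If the trainer scores tokens under the temperature-scaled distribution while the rollout engine reports raw values, the two sets of logprobs disagree even when the weights are identical, so the reported mode has to match what the trainer expects.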

Why It Matters

This matters because RL workflows depend on precise logprobs. Correct backend behavior underpins training stability, policy updates, and overall model performance, especially in online RL setups such as PipelineRL and in algorithms such as PPO and GRPO.

By fixing these core issues before modifying the RL objective, Hugging Face emphasizes the importance of backend correctness as a foundation for subsequent model improvements and training consistency.

Background

The move from vLLM V0 to V1 involved a major rewrite of the inference engine, and problems first surfaced in early training metrics. The primary concern was a mismatch in logprobs, which directly affects policy ratios and reward calculations in RL training. vLLM 0.8.5 served as the V0 reference and vLLM 0.18.1 was used for V1, which complicated direct comparisons.
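
To make the link between logprobs and policy ratios concrete, here is a minimal sketch, assuming a PPO-style per-token importance ratio exp(trainer logprob - rollout logprob) and a clip range of 0.2. All numbers and function names are hypothetical, and counting a token as clipped whenever its ratio leaves the range is a simplification (real objectives also depend on the sign of the advantage).

```python
import math


def importance_ratios(trainer_logprobs, rollout_logprobs):
    """Per-token ratio between the trainer's and the rollout engine's probabilities."""
    return [math.exp(t - r) for t, r in zip(trainer_logprobs, rollout_logprobs)]


def clip_rate(ratios, eps=0.2):
    """Fraction of tokens whose ratio falls outside [1 - eps, 1 + eps]."""
    clipped = sum(1 for r in ratios if r < 1.0 - eps or r > 1.0 + eps)
    return clipped / len(ratios)


# Hypothetical per-token logprobs computed by the trainer.
trainer = [-0.26, -1.69, -0.50, -2.10]

# Rollout engine agrees with the trainer: every ratio is exactly 1.
matched_rollout = [-0.26, -1.69, -0.50, -2.10]

# Rollout engine reports slightly different logprobs for the same tokens.
mismatched_rollout = [-0.41, -1.41, -0.50, -2.45]

matched_rate = clip_rate(importance_ratios(trainer, matched_rollout))
mismatched_rate = clip_rate(importance_ratios(trainer, mismatched_rollout))

print(f"clip rate, matched backend:    {matched_rate:.2f}")
print(f"clip rate, mismatched backend: {mismatched_rate:.2f}")
```

In this toy case the policy has not changed at all, yet half of the tokens are clipped once the rollout logprobs drift, which is the kind of spurious signal that shows up in metrics such as clip rate and KL divergence.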

Early V1 runs showed deviations in key training signals, prompting an investigation into three potential causes: semantic differences in logprobs, inference-path variations, and objective misalignments. The team prioritized ruling out the inference-path and semantic causes first, which led to the recent fixes.

“We fixed the backend behavior by aligning logprobs processing, runtime defaults, and weight update procedures, ensuring V1 matches V0 in output quality.”

— Hugging Face engineering team

“Correct backend behavior is fundamental for stable RL training, and these fixes lay the groundwork for future improvements.”

— Hugging Face spokesperson

What Remains Unclear

It remains unclear how these fixes will impact other RL algorithms beyond the tested GSPO setup or how future updates might further refine backend behavior. The long-term stability and performance in diverse training scenarios are still being evaluated.

What’s Next

Hugging Face plans to monitor the impact of these fixes across different RL methods and models, with upcoming updates focusing on further backend optimizations and potential integration of new features to enhance training fidelity.

Key Questions

Why was aligning logprobs so critical in vLLM V1?

Accurate logprobs are essential for correct policy ratios, reward calculations, and overall RL training stability. Discrepancies can lead to unstable training dynamics and suboptimal policy updates.

What specific changes were made to fix the backend behavior?

The team set the logprobs mode to processed_logprobs, fixed runtime defaults such as prefix caching and async scheduling, and aligned inflight weight updates to match vLLM V0 behavior.

Will these fixes affect other RL algorithms or only GSPO?

While primarily tested with GSPO, the fixes address fundamental backend behaviors that are relevant to any RL system relying on rollout logprobs, so they are expected to benefit other algorithms as well.

Are further updates planned for vLLM V1?

Yes, Hugging Face intends to continue refining backend stability, performance, and feature support to ensure consistent RL training across various setups.

Author

The Genius Factory Team
