What Can RL Bring to VLA Generalization?
An Empirical Study

Jijia Liu*, Feng Gao*, Bingwen Wei, Xinlei Chen, Qingmin Liao, Yi Wu, Chao Yu, Yu Wang
Tsinghua University
*Equal Contribution
Corresponding Authors

Overview of our study

We conduct an empirical study to evaluate the generalization benefits of reinforcement learning (RL) fine-tuning versus supervised fine-tuning (SFT) for Vision-Language-Action (VLA) models.

In out-of-distribution tests, RL enhances VLA generalization substantially in Execution, improves moderately in Semantics, and performs on par with SFT for Vision.

Preliminary: Vision-Language-Action model

We base our study on OpenVLA (Kim et al., 2024), an open-source model that achieves state-of-the-art performance on various robot tasks.

At each time step the policy receives a single RGB image and an instruction, i.e., the history length H=1, and outputs a sequence of discretized action tokens representing the predicted control commands.
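OpenVLA's action tokens are discrete bins over each continuous control dimension. Below is a minimal sketch of the de-tokenization step, assuming 256 uniform bins over [-1, 1] and a 7-DoF action layout; both are illustrative assumptions rather than the exact OpenVLA tokenizer.

```python
import numpy as np

# Assumed discretization: 256 uniform bins per action dimension over [-1, 1].
# The 7-D action layout (x, y, z, roll, pitch, yaw, gripper) is illustrative.
NUM_BINS = 256
BIN_CENTERS = np.linspace(-1.0, 1.0, NUM_BINS)

def detokenize_action(action_token_ids: np.ndarray) -> np.ndarray:
    """Map discrete action token ids (one per action dimension)
    back to continuous control commands via bin centers."""
    return BIN_CENTERS[action_token_ids]

# Example: 7 predicted token ids -> 7 continuous action values.
token_ids = np.array([128, 64, 200, 127, 127, 127, 255])
print(detokenize_action(token_ids))
```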

OpenVLA model architecture

1. Effective RL fine-tuning of VLA models


1.1 RL algorithms: PPO, GRPO, DPO

We consider three representative RL algorithms (PPO, GRPO, and DPO) and fine-tune the OpenVLA model with LoRA.
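A minimal sketch of how LoRA adapters can be attached to the OpenVLA backbone with the Hugging Face peft library; the rank, alpha, and target modules shown here are illustrative assumptions, not necessarily the configuration used in this study.

```python
import torch
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

# Load the pretrained VLA backbone.
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
)

# Assumed LoRA hyperparameters; rank, alpha, and target modules are illustrative.
lora_cfg = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```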

Our findings indicate that PPO consistently outperforms GRPO and DPO, likely because non-stationary dynamics destabilize GRPO, while sparse rewards and distribution shift limit DPO.
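For reference, a minimal sketch of the PPO clipped surrogate loss over action-token log-probabilities; the clip ratio and the per-token formulation are illustrative assumptions rather than the exact training code.

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate loss over sampled action tokens.

    logp_new / logp_old: log-probs of the sampled action tokens under the
    current and behavior policies; advantages: per-step advantage estimates
    (e.g., from GAE using a learned critic).
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```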

Overview and performance of different RL methods


1.2 Design factors of PPO

Shared actor-critic backbone: saves 45% VRAM and trains 53% faster (see the sketch after this list).

VLA warm-up: converges with about 50% fewer environment steps.

Minimal PPO epochs: reduces wall-clock time with similar sample efficiency.
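A minimal sketch of the shared actor-critic idea, assuming the value head is a small MLP on the backbone's final hidden state; the pooling choice and head sizes are illustrative, not the exact architecture.

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Actor and critic share the (LoRA-adapted) VLA backbone; only a small
    value head is added, so the memory cost of a separate critic is avoided."""

    def __init__(self, vla_backbone: nn.Module, hidden_dim: int):
        super().__init__()
        self.backbone = vla_backbone
        # Assumed value head: a two-layer MLP on the final hidden state.
        self.value_head = nn.Sequential(
            nn.Linear(hidden_dim, 512), nn.ReLU(), nn.Linear(512, 1)
        )

    def forward(self, **inputs):
        out = self.backbone(**inputs, output_hidden_states=True)
        last_hidden = out.hidden_states[-1]          # (batch, seq, hidden_dim)
        logits = out.logits                          # action-token logits (actor)
        value = self.value_head(last_hidden[:, -1])  # state value from last token (critic)
        return logits, value.squeeze(-1)
```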

Actor-critic architecture of VLA RL fine-tuning
Performance of ablating key design choices

2. Evaluating fine-tuning methods on VLA generalization


2.1 Environments and datasets

Inspired by prior works (Fan et al., 2025; Stone et al., 2023) and the concept of Vision-Language-Action models, we define three dimensions of generalization:

Vision: We include both foreground and background changes, as well as image-level dynamic noise.

Semantics: We consider unseen variations in objects, receptacles, and instruction phrasings, as well as several new tasks.

Execution: We investigate changes in the initial positions of the object and receptacle, as well as the robot's initial pose.

In the training setting, we randomize along three axes: 16 tables (Vision), 16 objects (Semantics), and perturbations of object and receptacle poses (Execution).

At test time, we hold at least one of these factors out of distribution, introducing 9 novel objects, 16 unseen receptacles, 5 new table surroundings, and 16 distractor textures.
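A hypothetical configuration summarizing the factor split above; the key names are illustrative, while the counts follow the text.

```python
# Hypothetical layout of the train/test factor split; key names are illustrative.
TRAIN_RANDOMIZATION = {
    "vision":    {"tables": 16},
    "semantics": {"objects": 16},
    "execution": {"object_pose": "perturbed", "receptacle_pose": "perturbed"},
}
OOD_TEST_FACTORS = {
    "novel_objects": 9,
    "unseen_receptacles": 16,
    "new_table_surroundings": 5,
    "distractor_textures": 16,
}
```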



2.2 Comparison between SFT and RL


Performance comparison of RL and SFT w.r.t. number of demonstrations

In terms of data scale, SFT saturates at 16k demonstrations. At convergence, RL performs comparably to SFT-16k in the training setting and 42.6% better on unseen objects and tables.


Generalization performance of SFT and RL

In terms of generalization, we find:

Vision: SFT and RL perform comparably

Semantics: RL improves moderately

Execution: RL improves substantially

3. Appendix: More demonstration videos


3.1 Vision tasks


Vision - Unseen Table


Case 1: both SFT and RL fail to grasp

Case 2: SFT grasps and sticks, RL succeeds



Vision - Dynamic Texture (weak)


Case 1: SFT fails to put on plate, RL fails to grasp

Case 2: SFT fails to grasp, RL succeeds



Vision - Dynamic Texture (strong)


Case 1: SFT sticks after grasping, RL fails to grasp

Case 2: SFT moves arm without holding the object, RL succeeds



Vision - Dynamic Noise (weak)


Case 1: both SFT and RL fail to grasp

Case 2: SFT moves arm without holding the object, RL succeeds



Vision - Dynamic Noise (strong)


Case 1: SFT fails to grasp, RL grasps and drops the object

Case 2: SFT fails to grasp, RL succeeds



3.2 Semantics tasks


Semantics - Unseen Objects


Case 1: SFT fails to grasp, RL grasps and drops the object off the table

Case 2: SFT sticks, RL succeeds



Semantics - Unseen Receptacles


Case 1: SFT grasps and idles, RL doesn't grasp

Case 2: SFT fails to put the object, RL succeeds



Semantics - Unseen Instruction Phrasings

Instruct 1: pick up kitchen shovel and set it down on plate

Instruct 2: Put banana onto plate.


Case 1: both SFT and RL fail to grasp on the first attempt

Case 2: SFT sticks after grasping, RL succeeds



Semantics - Multi-Object (both seen)

Instruct 1: put watering can on plate

Instruct 2: put BBQ sauce on plate


Case 1: both SFT and RL fail to grasp

Case 2: SFT sticks after grasping, RL succeeds



Semantics - Multi-Object (both unseen)

Instruct 1: put champagne glass on plate

Instruct 2: put travel cup on plate


Case 1: SFT tries to grasp the wrong object, RL puts the wrong object on plate

Case 2: SFT sticks after grasping, RL succeeds



Semantics - Distractive Receptacles


Case 1: SFT puts the object on the wrong receptacle, RL drops the object

Case 2: SFT hovers after grasping, RL succeeds



Semantics - Multi-Recep. (both unseen)

Instruct 1: put banana on sheet metal

Instruct 2: put plastic bottle on tomato slice


Case 1: SFT puts the object on the correct receptacle, then moves it to the wrong one; RL directly puts the object on the wrong receptacle

Case 2: SFT hovers after grasping, RL succeeds



3.3 Execution tasks


Execution - Unseen Position (obj. & recep.)


Case 1: SFT fails to grasp, RL sticks

Case 2: SFT moves arm without holding the object, RL succeeds



Execution - Unseen Robot Init Pose


Case 1: both SFT and RL fail to grasp

Case 2: SFT fails to grasp, RL succeeds



Execution - Mid-Episode Obj. Reposition


Case 1: both SFT and RL fail to grasp

Case 2: SFT moves arm without holding the object, RL succeeds