What Can RL Bring to VLA Generalization?
An Empirical Study

Jijia Liu*, Feng Gao*, Bingwen Wei, Xinlei Chen, Qingmin Liao, Yi Wu, Chao Yu, Yu Wang
Tsinghua University
*Equal Contribution
Corresponding Authors

Overview of our study

We conduct an empirical study to evaluate the generalization benefits of reinforcement learning (RL) fine-tuning versus supervised fine-tuning (SFT) for Vision-Language-Action (VLA) models.

In out-of-distribution tests, RL enhances VLA generalization substantially in Execution, improves moderately in Semantics, and performs on par with SFT for Vision.

Preliminary: Vision-Language-Action model

We base our study on OpenVLA (Kim et al., 2024), an open-source model that achieves state-of-the-art performance on various robot tasks.

At each time step the policy receives a single RGB image and an instruction, i.e., the history length H=1, and outputs a sequence of discretized action tokens representing the predicted control commands.
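OpenVLA's action tokens are discrete bins over each continuous control dimension. Below is a minimal sketch of the de-tokenization step, assuming 256 uniform bins over [-1, 1] and a 7-DoF action layout; both are illustrative assumptions rather than the exact OpenVLA tokenizer.

```python
import numpy as np

# Assumed discretization: 256 uniform bins per action dimension over [-1, 1].
# The 7-D action layout (x, y, z, roll, pitch, yaw, gripper) is illustrative.
NUM_BINS = 256
BIN_CENTERS = np.linspace(-1.0, 1.0, NUM_BINS)

def detokenize_action(action_token_ids: np.ndarray) -> np.ndarray:
    """Map discrete action token ids (one per action dimension)
    back to continuous control commands via bin centers."""
    return BIN_CENTERS[action_token_ids]

# Example: 7 predicted token ids -> 7 continuous action values.
token_ids = np.array([128, 64, 200, 127, 127, 127, 255])
print(detokenize_action(token_ids))
```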

OpenVLA model architecture

1. Effective RL fine-tuning of VLA models


1.1 RL algorithms: PPO, GRPO, DPO

We consider three representative RL algorithms (PPO, GRPO, and DPO) and fine-tune the OpenVLA model with LoRA.
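A minimal sketch of how LoRA adapters can be attached to the OpenVLA backbone with the Hugging Face peft library; the rank, alpha, and target modules shown here are illustrative assumptions, not necessarily the configuration used in this study.

```python
import torch
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

# Load the pretrained VLA backbone.
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
)

# Assumed LoRA hyperparameters; rank, alpha, and target modules are illustrative.
lora_cfg = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```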

Our findings indicate that PPO consistently outperforms GRPO and DPO, likely because non-stationary dynamics destabilize GRPO, while sparse rewards and distribution shift limit DPO.
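For reference, a minimal sketch of the PPO clipped surrogate loss over action-token log-probabilities; the clip ratio and the per-token formulation are illustrative assumptions rather than the exact training code.

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate loss over sampled action tokens.

    logp_new / logp_old: log-probs of the sampled action tokens under the
    current and behavior policies; advantages: per-step advantage estimates
    (e.g., from GAE using a learned critic).
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```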

Overview and performance of different RL methods


1.2 Design factors of PPO

Shared actor-critic backbone: saves 45% VRAM and trains 53% faster (see the sketch after this list).

VLA warm-up: converges with about 50% fewer environment steps.

Minimal PPO epochs: reduces wall-clock time with similar sample efficiency.
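A minimal sketch of the shared actor-critic idea, assuming the value head is a small MLP on the backbone's final hidden state; the pooling choice and head sizes are illustrative, not the exact architecture.

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Actor and critic share the (LoRA-adapted) VLA backbone; only a small
    value head is added, so the memory cost of a separate critic is avoided."""

    def __init__(self, vla_backbone: nn.Module, hidden_dim: int):
        super().__init__()
        self.backbone = vla_backbone
        # Assumed value head: a two-layer MLP on the final hidden state.
        self.value_head = nn.Sequential(
            nn.Linear(hidden_dim, 512), nn.ReLU(), nn.Linear(512, 1)
        )

    def forward(self, **inputs):
        out = self.backbone(**inputs, output_hidden_states=True)
        last_hidden = out.hidden_states[-1]          # (batch, seq, hidden_dim)
        logits = out.logits                          # action-token logits (actor)
        value = self.value_head(last_hidden[:, -1])  # state value from last token (critic)
        return logits, value.squeeze(-1)
```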

Actor-critic architecture of VLA RL fine-tuning
Performance of ablating key design choices

2. Evaluating fine-tuning methods on VLA generalization


2.1 Environments and datasets

Inspired by prior works (Fan et al., 2025; Stone et al., 2023) and the concept of Vision-Language-Action models, we define three dimensions of generalization:

Vision: We include both foreground and background changes, as well as image-level dynamic noise.

Semantics: We consider unseen variations in objects, receptacles, and instruction phrasings, as well as several new tasks.

Execution: We investigate changes in the initial positions of the object and receptacle, as well as the robot's initial pose.

In the training setting, we randomize along three axes: 16 tables (Vision), 16 objects (Semantics), and perturbations of object and receptacle poses (Execution).

At test time, we hold at least one of these factors out of distribution, introducing 9 novel objects, 16 unseen receptacles, 5 new table surroundings, and 16 distractor textures.
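A hypothetical configuration summarizing the factor split above; the key names are illustrative, while the counts follow the text.

```python
# Hypothetical layout of the train/test factor split; key names are illustrative.
TRAIN_RANDOMIZATION = {
    "vision":    {"tables": 16},
    "semantics": {"objects": 16},
    "execution": {"object_pose": "perturbed", "receptacle_pose": "perturbed"},
}
OOD_TEST_FACTORS = {
    "novel_objects": 9,
    "unseen_receptacles": 16,
    "new_table_surroundings": 5,
    "distractor_textures": 16,
}
```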



2.2 Comparison between SFT and RL


Performance comparison of RL and SFT w.r.t. number of demonstrations

In terms of data scale, SFT saturates at 16k demonstrations. At convergence, RL performs comparably to SFT-16k in the training setting and 42.6% better on unseen objects and tables.


Generalization performance of SFT and RL

In terms of generalization, we find:

Vision: SFT and RL perform comparably

Semantics: RL improves moderately

Execution: RL improves substantially

3. Appendix: More demonstration videos


3.1 Vision tasks


Vision - Unseen Table


Case 1: both SFT and RL fail to grasp

Case 2: SFT grasps and sticks, RL succeeds



Vision - Dynamic Texture (weak)


Case 1: SFT fails to put on plate, RL fails to grasp

Case 2: SFT fails to grasp, RL succeeds



Vision - Dynamic Texture (strong)


Case 1: SFT sticks after grasping, RL fails to grasp

Case 2: SFT moves arm without holding the object, RL succeeds



Vision - Dynamic Noise (weak)


Case 1: both SFT and RL fail to grasp

Case 2: SFT moves arm without holding the object, RL succeeds



Vision - Dynamic Noise (strong)


Case 1: SFT fails to grasp, RL grasps and drops the object

Case 2: SFT fails to grasp, RL succeeds



3.2 Semantics tasks


Semantics - Unseen Objects


Case 1: SFT fails to grasp, RL grasps and drops the object off the table

Case 2: SFT sticks, RL succeeds



Semantics - Unseen Receptacles


Case 1: SFT grasps and idles, RL doesn't grasp

Case 2: SFT fails to put the object, RL succeeds



Semantics - Unseen Instruction Phrasings

Instruct 1: pick up kitchen shovel and set it down on plate

Instruct 2: Put banana onto plate.


Case 1: both SFT and RL fail to grasp on the first attempt

Case 2: SFT sticks after grasping, RL succeeds



Semantics - Multi-Object (both seen)

Instruct 1: put watering can on plate

Instruct 2: put BBQ sauce on plate


Case 1: both SFT and RL fail to grasp

Case 2: SFT sticks after grasping, RL succeeds



Semantics - Multi-Object (both unseen)

Instruct 1: put champagne glass on plate

Instruct 2: put travel cup on plate


Case 1: SFT tries to grasp the wrong object, RL puts the wrong object on plate

Case 2: SFT sticks after grasping, RL succeeds



Semantics - Distractive Receptacles


Case 1: SFT puts the object on the wrong receptacle, RL drops the object

Case 2: SFT hovers after grasping, RL succeeds



Semantics - Multi-Recep. (both unseen)

Instruct 1: put banana on sheet metal

Instruct 2: put plastic bottle on tomato slice


Case 1: SFT puts the object on the correct receptacle, then moves it to the wrong one; RL directly puts the object on the wrong receptacle

Case 2: SFT hovers after grasping, RL succeeds



3.3 Execution tasks


Execution - Unseen Position (obj. & recep.)


Case 1: SFT fails to grasp, RL sticks

Case 2: SFT moves arm without holding the object, RL succeeds



Execution - Unseen Robot Init Pose


Case 1: both SFT and RL fail to grasp

Case 2: SFT fails to grasp, RL succeeds



Execution - Mid-Episode Obj. Reposition


Case 1: both SFT and RL fail to grasp

Case 2: SFT moves arm without holding the object, RL succeeds