VLA or IL? A Controlled Dataset for Testing Whether Finetuning Turns Your VLA into a Fancy Imitation Learner

wpnews.pro

Robot manipulation is a robot's ability to physically interact with objects: grasping them, moving them precisely, and adapting to changes in the environment. Traditional approaches such as Imitation Learning (IL) [ACT, Diffusion Policy] learn directly from human demonstrations, mapping visual observations to actions. While effective in controlled settings, these policies generalize poorly beyond them. Vision-Language-Action (VLA) models [RT-2, OpenVLA, π series] represent a promising new paradigm. A VLA typically consists of a VLM backbone and an action expert: the VLM, pretrained on internet-scale vision-language data, provides high-level semantic understanding of the scene and the natural language instruction; the action expert then takes this semantic representation and outputs concrete robot actions. The entire architecture is trained end-to-end, enabling a VLA to both understand what it is asked to do and execute it, rather than simply memorizing fixed scene-action mappings as traditional IL approaches do.

A typical VLA model consisting of a VLM backbone and an action expert (image from π₀)

A VLA model is first pretrained on large-scale diverse data to acquire general visual and language understanding, then finetuned on a smaller dataset of demonstrations for a target task and environment. However, recent work has raised serious concerns about this finetuning process. Several studies suggest that finetuning causes VLAs to degrade into imitation learners that memorize scene-specific action sequences from the training distribution instead of drawing on the scene understanding that the VLM backbone provides. LIBERO-PRO finds that model trajectories remain nearly identical when the target object is replaced or removed, or when the instruction is corrupted. LIBERO-Plus further shows that models fail when the target object is displaced.

These observations raise a fundamental question: after finetuning, does the VLA degrade into a fancy imitation learner that relies purely on memorized scene-action mappings?

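To test this, I identify two key properties that an effective VLA should satisfy:

1. Language grounding: the model acts on the object named in the prompt.
2. Spatial generalization: the model can locate and grasp the target object at a position it has not seen before.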

I design a controlled dataset that independently varies these two properties, forming a 2x2 experimental design.

If a VLA truly understands language, changing the prompt to refer to a different object that is present in the scene should change the model's behavior accordingly. If a VLA truly generalizes spatially, moving the target object to an unseen position should not affect its ability to locate and grasp it. Failure in either case would suggest that the model relies on memorized scene-action mappings rather than genuine understanding.

VLA models are commonly finetuned on the LIBERO simulation benchmark. To precisely test language grounding and spatial generalization, I construct a controlled dataset based on one of its sub-suites, LIBERO-Object, which allows me to independently vary the prompt and object positions while keeping everything else fixed.

In LIBERO-Object, each task shares the same structure: a floor scene with one target object and 5 distractor objects, where the robot must pick up the target object and place it in a basket.

The 10 tasks in LIBERO-Object are:

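To construct the 2x2 controlled dataset, I vary two factors independently:

1. Prompt: seen (the original instruction) or unseen (an instruction that refers to a distractor object present in the scene).
2. Object positions: original (the layout from the task definition) or shuffled (objects randomly reassigned across the floor regions).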

This yields 4 conditions per task, and 40 controlled scenes in total:

Condition	Seen prompt	Unseen prompt
Original position	Baseline	Tests language grounding
Shuffled position	Tests spatial generalization	Tests both
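
As a quick sanity check on the counts, the 2x2 design can be enumerated in a few lines of Python. This is only an illustrative sketch: the task identifiers are placeholders, and the condition names simply follow the `position_prompt` pattern used later in this post (for example `original_seen`).

```python
from itertools import product

# The two factors of the 2x2 design.
POSITIONS = ("original", "shuffled")
PROMPTS = ("seen", "unseen")


def condition_names():
    """Return the four condition names, e.g. 'original_seen'."""
    return [f"{pos}_{prompt}" for pos, prompt in product(POSITIONS, PROMPTS)]


def enumerate_scenes(tasks):
    """Pair every task with every condition."""
    return [(task, cond) for task in tasks for cond in condition_names()]


# Placeholder identifiers standing in for the 10 LIBERO-Object tasks.
tasks = [f"task_{i}" for i in range(10)]
scenes = enumerate_scenes(tasks)

print(condition_names())
print(len(scenes))  # 10 tasks x 4 conditions = 40
```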

One example series from the controlled dataset is shown above. To better highlight the target object in each scene, a blue circle is drawn around it.

In LIBERO, each task is defined by a BDDL configuration file, which specifies the scene layout, object placements, and the natural language prompt. During both training and inference, the VLA model receives the :language field as its prompt.

Below is the baseline BDDL for the milk task (original_seen):

(define (problem LIBERO_Floor_Manipulation)
  (:domain robosuite)
  (:language Pick the milk and place it in the basket)  ; [CHANGEABLE] language prompt

  (:objects
    milk_1 - milk
    basket_1 - basket
    cream_cheese_1 - cream_cheese
    tomato_sauce_1 - tomato_sauce
    butter_1 - butter
    orange_juice_1 - orange_juice
    chocolate_pudding_1 - chocolate_pudding
  )

  (:obj_of_interest
    milk_1    ; [CHANGEABLE] target object
    basket_1
  )

  (:init
    (On milk_1 floor_target_object_region)           ; [CHANGEABLE] object positions
    (On cream_cheese_1 floor_other_object_region_0)
    (On tomato_sauce_1 floor_other_object_region_1)
    (On butter_1 floor_other_object_region_2)
    (On orange_juice_1 floor_other_object_region_3)
    (On chocolate_pudding_1 floor_other_object_region_4)
    (On basket_1 floor_bin_region)                   ; fixed
  )

  (:goal
    (And (In milk_1 basket_1_contain_region))  ; [CHANGEABLE] target object
  )
)

To generate the controlled variants, I modify the fields marked [CHANGEABLE]:

Unseen prompt conditions: The :language field is changed to refer to a distractor object that is physically present in the scene. For example, “Pick the milk and place it in the basket” is changed to “Pick the tomato sauce and place it in the basket”. The :obj_of_interest field is updated from milk_1 to tomato_sauce_1, and the :goal field is updated from (In milk_1 basket_1_contain_region) to (In tomato_sauce_1 basket_1_contain_region).
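
The unseen-prompt edit can be sketched as a plain text rewrite of the BDDL file. The snippet below is not the generation script from the repository; `retarget` is a hypothetical helper, and it assumes the three fields appear exactly in the form shown in the baseline BDDL above.

```python
import re


def retarget(bddl: str, old_obj: str, new_obj: str, new_prompt: str) -> str:
    """Point the prompt, object of interest, and goal at a different object.

    The :init section is left untouched, so both objects stay in the scene.
    """
    # 1. Replace the :language prompt.
    out = re.sub(
        r"\(:language [^)]*\)",
        lambda m: f"(:language {new_prompt})",
        bddl,
        count=1,
    )
    # 2. Replace the first entry of :obj_of_interest.
    out = re.sub(
        r"(\(:obj_of_interest\s+)" + re.escape(old_obj) + r"\b",
        lambda m: m.group(1) + new_obj,
        out,
        count=1,
    )
    # 3. Replace the object inside the goal predicate.
    return out.replace(f"(In {old_obj} ", f"(In {new_obj} ")


SAMPLE = """(define (problem LIBERO_Floor_Manipulation)
  (:language Pick the milk and place it in the basket)
  (:obj_of_interest
    milk_1
    basket_1
  )
  (:init
    (On milk_1 floor_target_object_region)
    (On tomato_sauce_1 floor_other_object_region_1)
  )
  (:goal
    (And (In milk_1 basket_1_contain_region))
  )
)"""

variant = retarget(
    SAMPLE,
    old_obj="milk_1",
    new_obj="tomato_sauce_1",
    new_prompt="Pick the tomato sauce and place it in the basket",
)
print(variant)
```

Because only the three target-related fields change, the milk remains where it was in the scene; the only thing that differs from the baseline is which object the instruction and the success check refer to.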

Shuffled position conditions: The object placements in the :init section are randomly reassigned across the available floor regions (target_object_region, other_object_region_0 to other_object_region_4). For example, milk_1, which was originally at floor_target_object_region, may be reassigned to floor_other_object_region_3 after shuffling. The basket position at floor_bin_region remains fixed.
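
The position shuffle can be sketched in the same spirit. Again, this is an illustrative stand-in rather than the repository's script: `shuffle_positions` and the `seed` parameter are hypothetical, and the regular expression assumes placements are written as `(On <object> floor_..._region)` like in the baseline BDDL.

```python
import random
import re

# Matches placements on the shufflable regions only; floor_bin_region
# (the basket) is deliberately not matched, so it stays fixed.
INIT_LINE = re.compile(
    r"\(On (\w+) (floor_(?:target_object_region|other_object_region_\d+))\)"
)


def shuffle_positions(bddl: str, seed: int = 0) -> str:
    """Randomly reassign objects to the same set of floor regions."""
    placements = INIT_LINE.findall(bddl)  # list of (object, region) pairs
    regions = [region for _, region in placements]
    rng = random.Random(seed)  # seeded so a variant can be regenerated
    rng.shuffle(regions)
    new_region = {obj: reg for (obj, _), reg in zip(placements, regions)}
    return INIT_LINE.sub(
        lambda m: f"(On {m.group(1)} {new_region[m.group(1)]})", bddl
    )


SAMPLE_INIT = """(:init
    (On milk_1 floor_target_object_region)
    (On cream_cheese_1 floor_other_object_region_0)
    (On tomato_sauce_1 floor_other_object_region_1)
    (On basket_1 floor_bin_region)
  )"""

print(shuffle_positions(SAMPLE_INIT, seed=1))
```

A uniform shuffle can occasionally leave the target object where it started, so a real generator would probably want to check that the target actually moved before accepting a variant.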

The generation script and the full dataset are available at: https://github.com/FN8211/Control-Dataset

To validate the dataset, I ran pi0.5 with the LIBERO finetuned checkpoint on the four conditions using the milk task. The results are shown below:

Condition	Seen prompt	Unseen prompt
Original position	✅ Success	❌ Failure
Shuffled position	❌ Failure	❌ Failure

original_seen: Pick the milk and place it in the basket (original position) — Success

original_unseen: Pick the tomato sauce and place it in the basket (original position) — Failure

shuffled_seen: Pick the milk and place it in the basket (shuffled position) — Failure

shuffled_unseen: Pick the tomato sauce and place it in the basket (shuffled position) — Failure

The model succeeds only in the baseline condition, where both the prompt and object positions match the training distribution exactly. Changing either the prompt or the object positions — even when the target object is still present in the scene — causes complete failure.
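
One way to read a 2x2 result is to map each failed cell back to the property it probes. The helper below is a hypothetical sketch of that reasoning, not part of the dataset; the dictionary keys follow the condition names used above, and the values are the outcomes reported for the milk task.

```python
def diagnose(results: dict) -> list:
    """Map success/failure per condition to the property each one tests."""
    if not results["original_seen"]:
        # Without a working baseline the other cells cannot be interpreted.
        return ["baseline failed"]
    findings = []
    if not results["original_unseen"]:
        findings.append("language grounding failed")
    if not results["shuffled_seen"]:
        findings.append("spatial generalization failed")
    if not results["shuffled_unseen"]:
        findings.append("combined condition failed")
    return findings


# Outcomes reported above for the milk task.
milk_results = {
    "original_seen": True,
    "original_unseen": False,
    "shuffled_seen": False,
    "shuffled_unseen": False,
}
print(diagnose(milk_results))
```

This only summarizes a single rollout per condition on one task, so it should be read as a check that the dataset separates the two properties, not as a measured success rate.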

Source: dev.to (original article)