# VLA or IL? A Controlled Dataset for Testing Whether Finetuning Turns Your VLA into a Fancy Imitation Learner

> Source: <https://dev.to/fn8211/vla-or-il-a-controlled-dataset-for-testing-whether-finetuning-turns-your-vla-into-a-fancy-4m6i>
> Published: 2026-05-26 00:33:13+00:00

Robot manipulation is the ability of a robot to interact with objects in the physical world: grasping them, moving them precisely, and adapting to changes in the environment. Traditional approaches such as Imitation Learning (IL) [[ACT](https://arxiv.org/abs/2304.13705), [Diffusion Policy](https://arxiv.org/abs/2303.04137)] learn directly from human demonstrations, mapping visual observations to actions. While effective in controlled settings, these policies generalize poorly. Vision-Language-Action (VLA) models [[RT-2](https://arxiv.org/abs/2307.15818), [OpenVLA](https://arxiv.org/abs/2406.09246), [π series](https://arxiv.org/abs/2410.24164)] represent a promising new paradigm. A VLA typically consists of a VLM backbone and an action expert: the VLM, pretrained on internet-scale vision-language data, provides rich high-level semantic understanding of the scene and the natural language instruction; the action expert then takes this semantic representation and outputs concrete robot actions. The entire architecture is trained end-to-end, so a VLA is expected not only to understand what it is asked to do but also to execute it, rather than simply memorizing fixed scene-action mappings like traditional IL approaches.

*A typical VLA model consisting of a VLM backbone and an action expert (image from π₀)*

A VLA model is first pretrained on large-scale diverse data to acquire general visual and language understanding, then finetuned on a smaller dataset of demonstrations for a target task and environment. However, recent work has raised serious concerns about this finetuning process. Several studies suggest that finetuning causes VLAs to degrade into imitation learners that memorize scene-specific action sequences from the training distribution rather than drawing on the VLM backbone's understanding of the scene. [LIBERO-PRO](https://arxiv.org/abs/2510.03827) finds that model trajectories remain nearly identical when the target object is replaced or removed, or when the instruction is corrupted. [LIBERO-Plus](https://arxiv.org/abs/2510.13626) further shows that models fail when the target object is displaced.

These observations raise a fundamental question: **after finetuning, does the VLA degrade into a fancy imitation learner that relies purely on memorized scene-action mappings?**

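To test this, I identify two key properties that an effective VLA should satisfy:

**Language grounding**: the model acts on the object that the instruction names.

**Spatial generalization**: the model handles object positions it did not see during finetuning.

I design a controlled dataset that independently varies these two properties, forming a 2x2 experimental design.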

If a VLA truly understands language, changing the prompt to refer to a different object that is present in the scene should change the model's behavior accordingly. If a VLA truly generalizes spatially, moving the target object to an unseen position should not affect its ability to locate and grasp it. Failure in either case would suggest that the model relies on memorized scene-action mappings rather than genuine understanding.

VLA models are commonly finetuned on the [LIBERO](https://libero-project.github.io/datasets) simulation benchmark. To precisely test language grounding and spatial generalization, I construct a controlled dataset based on one of its sub-suites, LIBERO-Object, which allows me to independently vary the prompt and object positions while keeping everything else fixed.

In LIBERO-Object, each task shares the same structure: a floor scene with one target object and 5 distractor objects, where the robot must pick up the target object and place it in a basket.

LIBERO-Object contains 10 tasks of this form, each with a different target object.

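To construct the 2x2 controlled dataset, I vary two factors independently:

**Prompt**: seen (the task's original instruction) or unseen (an instruction that names a distractor object present in the scene).

**Object position**: original (the layout from the LIBERO-Object task) or shuffled (objects reassigned to different floor regions).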

This yields 4 conditions per task, and 40 controlled scenes in total:

| | Seen prompt | Unseen prompt |
|---|---|---|
| Original position | Baseline | Tests language grounding |
| Shuffled position | Tests spatial generalization | Tests both |
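As a quick check on the arithmetic, the 2x2 design can be enumerated in a few lines. This is only a sketch: the task identifiers are hypothetical placeholders, while the condition names follow the `original_seen` / `shuffled_unseen` labels used later in this post.

```python
from itertools import product

# Hypothetical placeholder ids; the real suite names each task by its target object.
TASKS = [f"task_{i}" for i in range(10)]
POSITIONS = ["original", "shuffled"]
PROMPTS = ["seen", "unseen"]


def enumerate_scenes(tasks=TASKS):
    """Return one (task, condition) pair per controlled scene."""
    return [
        (task, f"{position}_{prompt}")
        for task, position, prompt in product(tasks, POSITIONS, PROMPTS)
    ]


scenes = enumerate_scenes()
print(len(scenes))  # 10 tasks x 4 conditions
```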

One example series from the controlled dataset is shown above. To better highlight the target object in each scene, a blue circle is drawn around it.

In LIBERO, each task is defined by a BDDL configuration file, which specifies the scene layout, object placements, and the natural language prompt. During both training and inference, the VLA model receives the `:language` field as its prompt.

Below is the baseline BDDL for the milk task (`original_seen`):

```
(define (problem LIBERO_Floor_Manipulation)
  (:domain robosuite)
  (:language Pick the milk and place it in the basket)  ; [CHANGEABLE] language prompt

  (:objects
    milk_1 - milk
    basket_1 - basket
    cream_cheese_1 - cream_cheese
    tomato_sauce_1 - tomato_sauce
    butter_1 - butter
    orange_juice_1 - orange_juice
    chocolate_pudding_1 - chocolate_pudding
  )

  (:obj_of_interest
    milk_1    ; [CHANGEABLE] target object
    basket_1
  )

  (:init
    (On milk_1 floor_target_object_region)           ; [CHANGEABLE] object positions
    (On cream_cheese_1 floor_other_object_region_0)
    (On tomato_sauce_1 floor_other_object_region_1)
    (On butter_1 floor_other_object_region_2)
    (On orange_juice_1 floor_other_object_region_3)
    (On chocolate_pudding_1 floor_other_object_region_4)
    (On basket_1 floor_bin_region)                   ; fixed
  )

  (:goal
    (And (In milk_1 basket_1_contain_region))  ; [CHANGEABLE] target object
  )
)
```
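To make the structure of this file concrete, here is a minimal sketch that pulls the prompt and the object placements out of BDDL text with regular expressions. It is not LIBERO's own parser; the helper names `read_language` and `read_placements` are made up for illustration, and the patterns assume the layout shown in the excerpt above.

```python
import re

# Shortened excerpt of the baseline file, including one trailing comment.
BDDL_EXCERPT = """
(define (problem LIBERO_Floor_Manipulation)
  (:language Pick the milk and place it in the basket)  ; [CHANGEABLE] language prompt
  (:init
    (On milk_1 floor_target_object_region)
    (On cream_cheese_1 floor_other_object_region_0)
    (On basket_1 floor_bin_region)
  )
)
"""


def read_language(bddl: str) -> str:
    """Return the text of the :language field (hypothetical helper)."""
    match = re.search(r"\(:language\s+([^)]*)\)", bddl)
    if match is None:
        raise ValueError("no :language field found")
    return match.group(1).strip()


def read_placements(bddl: str) -> dict:
    """Map each object id to the region named in its (On ...) clause."""
    return dict(re.findall(r"\(On\s+(\S+)\s+([^\s)]+)\)", bddl))


print(read_language(BDDL_EXCERPT))
print(read_placements(BDDL_EXCERPT))
```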

To generate the controlled variants, I modify the fields marked `[CHANGEABLE]`:

**Unseen prompt conditions**: The `:language` field is changed to refer to a distractor object that is physically present in the scene. For example, “Pick the milk and place it in the basket” is changed to “Pick the tomato sauce and place it in the basket”. The `:obj_of_interest` field is updated from `milk_1` to `tomato_sauce_1`, and the `:goal` field is updated from `(In milk_1 basket_1_contain_region)` to `(In tomato_sauce_1 basket_1_contain_region)`.
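The three field changes for an unseen-prompt variant can be sketched as plain text substitutions. This is an illustration, not the generation script from the repository: the function name `retarget` is hypothetical, and it assumes object ids follow the `<name>_1` pattern from the excerpt so that `tomato_sauce_1` maps to the words "tomato sauce".

```python
import re

BDDL_EXCERPT = """(define (problem LIBERO_Floor_Manipulation)
  (:language Pick the milk and place it in the basket)
  (:obj_of_interest
    milk_1
    basket_1
  )
  (:init
    (On milk_1 floor_target_object_region)
    (On tomato_sauce_1 floor_other_object_region_1)
  )
  (:goal
    (And (In milk_1 basket_1_contain_region))
  )
)"""


def retarget(bddl: str, old_obj: str, new_obj: str) -> str:
    """Point the prompt, :obj_of_interest, and :goal at new_obj (hypothetical helper)."""
    # Assumption: ids look like "<name>_<index>", so strip the index to get the words.
    old_name = old_obj.rsplit("_", 1)[0].replace("_", " ")
    new_name = new_obj.rsplit("_", 1)[0].replace("_", " ")
    # 1. Swap the object name inside the :language field only.
    bddl = re.sub(
        r"(\(:language\s+)([^)]*)(\))",
        lambda m: m.group(1) + m.group(2).replace(old_name, new_name) + m.group(3),
        bddl,
        count=1,
    )
    # 2. Swap the first entry of :obj_of_interest.
    bddl = re.sub(
        r"(\(:obj_of_interest\s+)" + re.escape(old_obj) + r"\b",
        lambda m: m.group(1) + new_obj,
        bddl,
        count=1,
    )
    # 3. Swap the goal predicate; the (On ...) placements are left untouched.
    return bddl.replace(f"(In {old_obj} ", f"(In {new_obj} ")


variant = retarget(BDDL_EXCERPT, "milk_1", "tomato_sauce_1")
print(variant)
```

Note that the `:init` section is deliberately unchanged, so the milk stays in the scene at its original region and only the instruction and success criterion move to the tomato sauce.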

**Shuffled position conditions**: The object placements in the `:init` section are randomly reassigned across the available floor regions (`floor_target_object_region` and `floor_other_object_region_0` to `floor_other_object_region_4`). For example, `milk_1`, which was originally at `floor_target_object_region`, may be reassigned to `floor_other_object_region_3` after shuffling. The basket position at `floor_bin_region` remains fixed.
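One way to sketch the shuffle is to collect the regions from the `(On ...)` clauses, permute them, and write them back while skipping the basket. The function name `shuffle_placements` and the seeded `random.Random` are assumptions for illustration; the repository's script may do this differently.

```python
import random
import re

PLACEMENT = re.compile(r"\(On\s+(\S+)\s+([^\s)]+)\)")
FIXED_REGIONS = {"floor_bin_region"}  # the basket stays where it is

INIT_EXCERPT = """(:init
    (On milk_1 floor_target_object_region)
    (On cream_cheese_1 floor_other_object_region_0)
    (On tomato_sauce_1 floor_other_object_region_1)
    (On butter_1 floor_other_object_region_2)
    (On basket_1 floor_bin_region)
)"""


def shuffle_placements(bddl: str, seed: int = 0) -> str:
    """Randomly permute the regions of all movable objects (hypothetical helper)."""
    movable = [(obj, reg) for obj, reg in PLACEMENT.findall(bddl)
               if reg not in FIXED_REGIONS]
    regions = [reg for _, reg in movable]
    random.Random(seed).shuffle(regions)  # seeded so a variant is reproducible
    new_region = {obj: reg for (obj, _), reg in zip(movable, regions)}

    def rewrite(match: re.Match) -> str:
        obj, region = match.group(1), match.group(2)
        # Objects in fixed regions are absent from new_region and keep their slot.
        return f"(On {obj} {new_region.get(obj, region)})"

    return PLACEMENT.sub(rewrite, bddl)


shuffled = shuffle_placements(INIT_EXCERPT, seed=1)
print(shuffled)
```

A plain permutation can leave the target object in its original region by chance. If the condition needs the target to move, a stricter variant could resample until its region differs.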

The generation script and the full dataset are available at: [https://github.com/FN8211/Control-Dataset](https://github.com/FN8211/Control-Dataset)

To validate the dataset, I ran [pi0.5](https://arxiv.org/abs/2504.16054) with the LIBERO-finetuned checkpoint on the four conditions of the milk task. The results are shown below:

| | Seen prompt | Unseen prompt |
|---|---|---|
| Original position | ✅ Success | ❌ Failure |
| Shuffled position | ❌ Failure | ❌ Failure |

*original_seen: Pick the milk and place it in the basket (original position) — Success*

*original_unseen: Pick the tomato sauce and place it in the basket (original position) — Failure*

*shuffled_seen: Pick the milk and place it in the basket (shuffled position) — Failure*

*shuffled_unseen: Pick the tomato sauce and place it in the basket (shuffled position) — Failure*

On the milk task, the model succeeds only in the baseline condition, where both the prompt and the object positions match the training distribution exactly. Changing either the prompt or the object positions causes failure, even though the target object is still present in the scene.
