VLA or IL? A Controlled Dataset for Testing Whether Finetuning Turns Your VLA into a Fancy Imitation Learner

A developer has created a controlled dataset to test whether finetuning Vision-Language-Action (VLA) models degrades them into imitation learners that memorize scene-action mappings rather than genuinely understanding language and spatial relationships. The dataset, based on the LIBERO-Object simulation benchmark, independently varies the prompt and the object positions across 40 scenes to evaluate whether VLAs can adapt to unseen language instructions or object locations after finetuning. Early results on a single task suggest that a finetuned VLA fails these tests, raising concerns that it relies on memorized patterns rather than true visual-language comprehension.

Robot manipulation is the ability of a robot to interact with objects in the physical world: grasping them, moving them precisely, and adapting to changes in the environment. Traditional approaches such as Imitation Learning (IL; [ACT](https://arxiv.org/abs/2304.13705), [Diffusion Policy](https://arxiv.org/abs/2304.13705)) learn directly from human demonstrations, mapping visual observations to actions. While effective in controlled settings, these policies struggle to generalize.

Vision-Language-Action (VLA) models ([RT-2](https://arxiv.org/abs/2304.13705), [OpenVLA](https://arxiv.org/abs/2304.13705), [π series](https://arxiv.org/abs/2304.13705)) represent a promising new paradigm. A VLA typically consists of a VLM backbone and an action expert: the VLM, pretrained on internet-scale vision-language data, provides rich high-level semantic understanding of the scene and the natural language instruction; the action expert then takes this semantic representation and outputs concrete robot actions.
The entire architecture is trained end-to-end, enabling VLAs not only to understand what they are asked to do but also to execute it, rather than simply memorizing fixed scene-action mappings like traditional IL approaches.

Figure: A typical VLA model consisting of a VLM backbone and an action expert (image from π₀).

A VLA model is first pretrained on large-scale diverse data to acquire general visual and language understanding, then finetuned on a smaller dataset of demonstrations for a target task and environment.

However, recent work has raised serious concerns about this finetuning process. Several studies suggest that finetuning causes VLAs to degrade into imitation learners that memorize scene-specific action sequences from the training distribution, rather than drawing on the scene understanding provided by the VLM backbone. [LIBERO-PRO](https://arxiv.org/abs/2510.03827) finds that model trajectories remain nearly identical when the target object is replaced or removed, or when the instruction is corrupted. [LIBERO-Plus](https://arxiv.org/abs/2510.13626) further shows that models fail when the target object is displaced.

These observations raise a fundamental question: after finetuning, does the VLA degrade into a fancy imitation learner that relies purely on memorized scene-action mappings?

To test this, I identify two key properties that an effective VLA should satisfy: language grounding and spatial generalization. I design a controlled dataset that independently varies these two properties, forming a 2x2 experimental design. If a VLA truly understands language, changing the prompt to refer to a different object that is present in the scene should change the model's behavior accordingly. If a VLA truly generalizes spatially, moving the target object to an unseen position should not affect its ability to locate and grasp it. Failure in either case would suggest that the model relies on memorized scene-action mappings rather than genuine understanding.
VLA models are commonly finetuned on the [LIBERO](https://libero-project.github.io/datasets) simulation benchmark. To precisely test language grounding and spatial generalization, I construct a controlled dataset based on one of its sub-suites, LIBERO-Object, which allows me to independently vary the prompt and object positions while keeping everything else fixed.

In LIBERO-Object, each task shares the same structure: a floor scene with one target object and 5 distractor objects, where the robot must pick up the target object and place it in a basket. LIBERO-Object contains 10 such tasks.

To construct the 2x2 controlled dataset, I vary two factors independently: the prompt (seen or unseen) and the object positions (original or shuffled). This yields 4 conditions per task, and 40 controlled scenes in total:

| | Seen prompt | Unseen prompt |
|---|---|---|
| Original position | Baseline | Tests language grounding |
| Shuffled position | Tests spatial generalization | Tests both |

One example series from the controlled dataset is shown above. To better highlight the target object in each scene, a blue circle is drawn around it.

In LIBERO, each task is defined by a BDDL configuration file, which specifies the scene layout, object placements, and the natural language prompt. During both training and inference, the VLA model receives the `:language` field as its prompt.
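As a quick sanity check on the 2x2 design described in this section, the conditions can be enumerated programmatically. This is a minimal sketch: the underscore-joined condition names (for example `original_seen`) and the `PURPOSE` mapping are illustrative assumptions, not identifiers taken from the dataset's generation script.

```python
from itertools import product

NUM_TASKS = 10  # number of LIBERO-Object tasks, as stated in the text
POSITIONS = ("original", "shuffled")
PROMPTS = ("seen", "unseen")

# What each cell of the 2x2 table is meant to probe (labels from the table).
PURPOSE = {
    ("original", "seen"): "baseline",
    ("original", "unseen"): "language grounding",
    ("shuffled", "seen"): "spatial generalization",
    ("shuffled", "unseen"): "both",
}


def conditions():
    """Return the four condition names in table order (hypothetical naming)."""
    return [f"{pos}_{prompt}" for pos, prompt in product(POSITIONS, PROMPTS)]


for pos, prompt in product(POSITIONS, PROMPTS):
    print(f"{pos}_{prompt}: tests {PURPOSE[(pos, prompt)]}")

# 4 conditions per task times 10 tasks gives the 40 controlled scenes.
total_scenes = len(conditions()) * NUM_TASKS
print(total_scenes)
```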
Below is the baseline BDDL for the milk task (original seen):

```
(define (problem LIBERO_Floor_Manipulation)
  (:domain robosuite)
  (:language Pick the milk and place it in the basket)  ; CHANGEABLE language prompt
  (:objects
    milk_1 - milk
    basket_1 - basket
    cream_cheese_1 - cream_cheese
    tomato_sauce_1 - tomato_sauce
    butter_1 - butter
    orange_juice_1 - orange_juice
    chocolate_pudding_1 - chocolate_pudding
  )
  (:obj_of_interest
    milk_1  ; CHANGEABLE target object
    basket_1
  )
  (:init
    (On milk_1 floor_target_object_region)  ; CHANGEABLE object positions
    (On cream_cheese_1 floor_other_object_region_0)
    (On tomato_sauce_1 floor_other_object_region_1)
    (On butter_1 floor_other_object_region_2)
    (On orange_juice_1 floor_other_object_region_3)
    (On chocolate_pudding_1 floor_other_object_region_4)
    (On basket_1 floor_bin_region)  ; fixed
  )
  (:goal
    (And (In milk_1 basket_1_contain_region))  ; CHANGEABLE target object
  )
)
```

To generate the controlled variants, I modify the fields marked CHANGEABLE:

Unseen prompt conditions: The `:language` field is changed to refer to a distractor object that is physically present in the scene. For example, "Pick the milk and place it in the basket" is changed to "Pick the tomato sauce and place it in the basket". The `:obj_of_interest` field is updated from `milk_1` to `tomato_sauce_1`, and the `:goal` field is updated from `(In milk_1 basket_1_contain_region)` to `(In tomato_sauce_1 basket_1_contain_region)`.

Shuffled position conditions: The object placements in the `:init` section are randomly reassigned across the available floor regions (`target_object_region`, `other_object_region_0` to `other_object_region_4`). For example, `milk_1`, which was originally at `floor_target_object_region`, may be reassigned to `floor_other_object_region_3` after shuffling. The basket position at `floor_bin_region` remains fixed.
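The two edits described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's generation script (which is linked in the repository): it assumes BDDL identifiers are written with underscores and parentheses as in standard LIBERO task files, and the names `retarget_prompt`, `shuffle_positions`, and `PLACEMENT` are hypothetical. Re-drawing the shuffle until the layout differs from the original is also an added assumption.

```python
import random
import re

# Baseline milk task; underscored identifiers and parentheses are assumed.
BASELINE_BDDL = """(define (problem LIBERO_Floor_Manipulation)
  (:domain robosuite)
  (:language Pick the milk and place it in the basket)
  (:objects
    milk_1 - milk
    basket_1 - basket
    cream_cheese_1 - cream_cheese
    tomato_sauce_1 - tomato_sauce
    butter_1 - butter
    orange_juice_1 - orange_juice
    chocolate_pudding_1 - chocolate_pudding
  )
  (:obj_of_interest
    milk_1
    basket_1
  )
  (:init
    (On milk_1 floor_target_object_region)
    (On cream_cheese_1 floor_other_object_region_0)
    (On tomato_sauce_1 floor_other_object_region_1)
    (On butter_1 floor_other_object_region_2)
    (On orange_juice_1 floor_other_object_region_3)
    (On chocolate_pudding_1 floor_other_object_region_4)
    (On basket_1 floor_bin_region)
  )
  (:goal
    (And (In milk_1 basket_1_contain_region))
  )
)"""

# Matches movable placements only; floor_bin_region (the basket) never matches.
PLACEMENT = re.compile(
    r"\(On (\w+) (floor_(?:target_object_region|other_object_region_\d+))\)"
)


def retarget_prompt(bddl: str, old: str, new: str) -> str:
    """Unseen-prompt variant: point :language, :obj_of_interest and :goal at `new`."""
    old_words, new_words = old.replace("_", " "), new.replace("_", " ")
    # Rewrite the object name inside the natural-language prompt.
    bddl = re.sub(
        r"\(:language ([^)]*)\)",
        lambda m: "(:language " + m.group(1).replace(old_words, new_words) + ")",
        bddl,
    )
    # Swap the first entry of :obj_of_interest (the target object).
    bddl = re.sub(
        r"(\(:obj_of_interest\s+)" + re.escape(f"{old}_1"),
        lambda m: m.group(1) + f"{new}_1",
        bddl,
    )
    # Update the goal predicate; the :init placements are left untouched.
    return bddl.replace(f"(In {old}_1 ", f"(In {new}_1 ")


def shuffle_positions(bddl: str, rng: random.Random) -> str:
    """Shuffled-position variant: reassign objects across the six floor regions."""
    placements = PLACEMENT.findall(bddl)
    objects = [obj for obj, _ in placements]
    regions = [region for _, region in placements]
    shuffled = regions[:]
    while shuffled == regions:  # assumption: insist on a layout that differs
        rng.shuffle(shuffled)
    new_region = dict(zip(objects, shuffled))
    return PLACEMENT.sub(
        lambda m: f"(On {m.group(1)} {new_region[m.group(1)]})", bddl
    )


unseen = retarget_prompt(BASELINE_BDDL, "milk", "tomato_sauce")
shuffled = shuffle_positions(BASELINE_BDDL, random.Random(0))
shuffled_unseen = shuffle_positions(unseen, random.Random(0))
```

Note that this sketch only guarantees that the overall layout changes. It does not guarantee that the target object itself moves, so a stricter variant might re-draw until the target leaves its original region.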
The generation script and the full dataset are available at: https://github.com/FN8211/Control-Dataset

To validate the dataset, I ran [pi0.5](https://arxiv.org/abs/2504.16054) with the LIBERO finetuned checkpoint on the four conditions using the milk task. The results are shown below:

| | Seen prompt | Unseen prompt |
|---|---|---|
| Original position | ✅ Success | ❌ Failure |
| Shuffled position | ❌ Failure | ❌ Failure |

- original seen: "Pick the milk and place it in the basket", original position: Success
- original unseen: "Pick the tomato sauce and place it in the basket", original position: Failure
- shuffled seen: "Pick the milk and place it in the basket", shuffled position: Failure
- shuffled unseen: "Pick the tomato sauce and place it in the basket", shuffled position: Failure

The model succeeds only in the baseline condition, where both the prompt and object positions match the training distribution exactly. Changing either the prompt or the object positions, even when the target object is still present in the scene, causes complete failure.
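The way the four outcomes map back to the two properties can be written down explicitly. The sketch below is only an illustration of that reading: it takes one success flag per condition, as in the results reported above for the milk task, and the `diagnose` function and its output keys are hypothetical names.

```python
# Outcomes reported above for the milk task (True = success).
results = {
    ("original", "seen"): True,
    ("original", "unseen"): False,
    ("shuffled", "seen"): False,
    ("shuffled", "unseen"): False,
}


def diagnose(results: dict) -> dict:
    """Read the 2x2 grid: each off-baseline cell probes one property."""
    if not results[("original", "seen")]:
        # Without a working baseline, the other cells cannot be interpreted.
        return {"baseline_ok": False}
    return {
        "baseline_ok": True,
        "language_grounding": results[("original", "unseen")],
        "spatial_generalization": results[("shuffled", "seen")],
        "both_combined": results[("shuffled", "unseen")],
    }


report = diagnose(results)
for name, passed in report.items():
    print(f"{name}: {'pass' if passed else 'fail'}")
```

Checking the baseline first matters: if the model failed even on the training-matched scene, failures in the other cells would say nothing about language grounding or spatial generalization.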