Segmenting Robot Video into Actionable Subtasks

35 min read

A benchmark and field report on using VLMs to turn robot and egocentric video into timestamped subtask annotations.

Robotics
Annotations
Subtask
Benchmarks

TL;DR

- We introduce WGO‑Bench, a new benchmark for testing robotics subtask annotation performance across 100 egocentric and robot-video episodes with 743 annotated segments spanning 62 unique high-level task instructions.
- We ran over 60 experiments to find the best subtask annotation pipeline: the best subtask segmentation method reaches 0.306 F1, subtask labeling reaches 61.0% accuracy, and the best end-to-end pipeline reaches 0.168 F1.
- Gemini models are clearly the best for this task, with the best model (Gemini 3.5 Flash) outperforming the best non-Gemini model (GPT-5.5) by 24.5%.
- Our best end-to-end method uses contact sheets to keep inference cheap, costing $2.64 per hour of video (batch pricing), or roughly 19x less than human annotation.
- The full pipeline is open source and implemented in Refiner; see the ready-to-use subtask annotation example to run it on your own videos.

Why annotate subtasks?

Imagine walking into a kitchen you have never seen before with an instruction: "Make me goulash." If you have never cooked it, you will need to learn it. To do so, you need more than the final instruction; you need the steps, the objects, and where to find them: open the left-most shelf, take out the cutting board, place it on the counter, pick up an onion, peel it, put it on the board, chop it, and so on.

Robot learning has a similar problem. To teach robots new long-horizon tasks, we need more than weak high-level instructions. For a robotics demonstration video, the useful signal is which subtask is happening at each moment, and where one subtask ends and the next begins.

Subtasks are becoming a central learning signal in recent robotics work. Zawalski et al. (2025; "Robotic Control via Embodied Chain-of-Thought Reasoning", https://arxiv.org/abs/2407.08693) use subtasks together with chain-of-thought reasoning between plans and actions. The recent π series (Physical Intelligence et al., 2025; "$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization", https://arxiv.org/abs/2504.16054) and RT‑H (Belkhale et al., 2024; "RT-H: Action Hierarchies Using Language", https://arxiv.org/abs/2403.01823) use semantic subtask prediction alongside low-level action learning, with both showing substantial gains from this extra supervision. Subtasks are also useful beyond direct policy training: SARM (Kim et al., 2025; "Subtask-Aware Visual Reward Learning from Segmented Demonstrations", https://arxiv.org/abs/2502.20630) uses them for reward modeling.

As robotics data collection continues to scale, we need annotation pipelines that can keep up. Paying human annotators to watch every hour of video quickly stops being feasible. Despite the promising results, there is little public material on how to mine subtask annotations at scale. The closest public writeup we found is Scale's dense video captioning post (Choghari et al., 2026; "The Path to Large Scale Dense Video Captioning", https://labs.scale.com/blog/path-to-large-scale-dense-video-captioning), but it focuses on hand/egocentric manipulation videos only and starts from already separated clips. For robotics, that skips two harder problems: taking a raw episode and deciding where one subtask ends and the next begins, and testing whether the same methods transfer from egocentric video to robot-camera settings. To fill this gap, we created a scalable pipeline to have models annotate subtasks without any human intervention, costing $2.64 per hour of video (batch pricing), making it roughly 19x cheaper than humans. This post shares the lessons we learned from this effort, including the best end-to-end method we found for mining subtasks from both egocentric and robot videos, as well as our new benchmark for robotics subtask annotation: WGO‑Bench (What's Going On Bench).

The full pipeline is open-sourced in Refiner, our robotics data processing framework. To run it on your own data, see the ready-to-use example code.

Measuring Progress: WGO‑Bench

To iterate and choose the best approach, we needed a benchmark. Directly training and evaluating robot policies on every candidate method would be very slow and expensive, so we built a new benchmark, WGO‑Bench, to measure how close VLMs can get to the performance of human annotators, who are still employed in most current industrial efforts.

Benchmark composition

To create WGO‑Bench, a diverse subtask annotation benchmark, we collected and manually annotated 100 episodes spanning head-camera recordings from Galaxea World (Jiang et al., 2025; "Galaxea Open-World Dataset and G0 Dual-System VLA Model", https://arxiv.org/abs/2509.00576), third-person camera views of station-arm manipulation from DROID (Khazatsky et al., 2025; "DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset", https://arxiv.org/abs/2403.12945), and egocentric videos from HomER (Toloka, 2026; "HomER v2: Home Egocentric Robotics Dataset", Hugging Face). In total, it contains 743 annotated segments across 62 unique high-level task instructions.

Section	Type	Viewpoint	Samples	Unique tasks	Total duration	Avg ep len	Resolution	Segments
HomER	Human	Egocentric	25	17	39.2 min	94.0s	Mixed, mostly 1920x1080 / 848x480	470
DROID	Robot	External robot camera	50	26	24.9 min	29.9s	320x180	150
Galaxea	Robot	Robot head camera	25	19	7.4 min	17.7s	1280x720	123
Total	Mixed	Mixed	100	62	71.5 min	42.9s	Mixed	743

Manually annotating subtasks

We manually annotated WGO‑Bench demonstrations following a strict annotation protocol: segments are atomic manipulation events, boundaries follow object-state changes, and labels must be self-contained enough to train policies without relying on previous actions.

Annotation protocol details

The episodes were segmented into atomic manipulation events rather than motion fragments. A subtask ends when the event is complete, not when the robot hand returns to a neutral pose. Unless there is a clear idle period, the next subtask starts immediately after the previous one.

Boundaries are placed at object-manipulation changes: when an object becomes held, is released, reaches a new location, or a door or lid changes state. Camera motion, hesitation, and tiny hand adjustments are not separate subtasks.

Labels are self-contained. They do not refer to previous human or robot actions, and they describe the manipulated object and target location as precisely as possible: not "put the cup on the table", but "put the cup on the table next to the bowl." This prevents ambiguity because most robotic policies do not take past frames or actions as input.

Annotation interface

For the annotations themselves, we first tried existing tools like CVAT and Labelbox, but found them too inflexible. Instead, we built a simple API for manipulating datasets and calling generative models on our platform, then connected it to Codex. Annotators can point it at a dataset, describe what needs to be labeled, and generate a custom interface on demand, with model calls wired in for pre-segmentation or pre-labeling. All edits go through the same data layer, allowing parallel work, while keeping the data consistent. This interface is currently in closed beta; if you would like access, request it here.

Lessons from the annotation trenches

Even with a clear annotation protocol and a purpose-built UI, subtask annotation is difficult and slow for humans. The most obvious issue is time: without prefilled model suggestions, one minute of video easily takes more than ten minutes to annotate carefully.

Egocentric videos are usually the most time-consuming. Hands move quickly, subtasks are often much shorter than in robot videos, and both hands can act at once, for example picking up a knife with the left hand while moving a tomato with the right. Even deciding where one action ends and the next begins can become ambiguous, such as whether sliding a container across a table should be split into pick -> slide or treated as one slide action.

The labels themselves are also hard to write precisely. Locations can be difficult to describe when there is no clear object to anchor them to, and manipulated objects are not always easy to recognize, especially in lower-resolution video. Fast egocentric motion makes this worse because a pick can happen in only a few frames, leaving little visual evidence for where the pick ends and the next action begins.

Grading

For grading, we define three related tasks: segmentation, labeling, and end-to-end annotation. Segmentation tests whether a model can find the right time boundaries. Labeling tests whether a model can name a segment when the correct time window is already given. End‑to‑end annotation tests the full setting: the model must both find the segment and label it correctly.

For segmentation, the model predicts timestamped subtask boundaries. Labels can be included, but they are ignored for this score. We use Segment F1 and count a predicted segment as matching a gold (human-annotated) segment when IoU >= 0.75. Before scoring, we snap the first predicted start time and the last predicted end time to the human annotation boundaries, since the outer edges of an episode are often ambiguous (it is not always clear whether the first action starts at the beginning of the video, when the robot first appears, or when the robot starts moving; the same issue applies at the end).
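The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual scorer: the function names are hypothetical, and both the greedy one-to-one matching and the per-episode F1 are assumptions, since the post does not say how matches are assigned or how scores are aggregated across episodes.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def iou(a: Segment, b: Segment) -> float:
    """Temporal intersection-over-union of two segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def snap_outer_edges(pred: List[Segment], gold: List[Segment]) -> List[Segment]:
    """Move the first predicted start and the last predicted end onto the gold edges."""
    if not pred or not gold:
        return list(pred)
    snapped = list(pred)
    snapped[0] = (gold[0][0], snapped[0][1])
    snapped[-1] = (snapped[-1][0], gold[-1][1])
    return snapped


def segment_f1(pred: List[Segment], gold: List[Segment], threshold: float = 0.75) -> float:
    """F1 over segments; a prediction matches an unused gold segment when IoU >= threshold."""
    gold = sorted(gold)
    pred = snap_outer_edges(sorted(pred), gold)
    used = set()
    true_positives = 0
    for p in pred:
        for gi, g in enumerate(gold):
            if gi not in used and iou(p, g) >= threshold:
                used.add(gi)
                true_positives += 1
                break
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [(0.0, 4.0), (4.0, 10.0)]
pred = [(0.5, 4.2), (4.2, 7.0), (7.0, 9.5)]
print(segment_f1(pred, gold))
```

With a threshold above 0.5 and non-overlapping segments on both sides, a predicted segment can clear the threshold with at most one gold segment, so greedy matching does not lose any matches in this setting.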

For labeling, the model receives the gold segment boundaries and predicts one label per segment. We use a Scale‑style LLM‑as‑judge rubric, adapted for subtask labels, with gemini-3.5-flash as the judge.

Prompt: Label judge rubric

You are judging whether a predicted subtask label matches a gold subtask label.

Gold label:
{gt_label}

Predicted label:
{pred_label}

Episode instruction:
{instruction}

Accept if:
- It describes the same manipulation event or world-state change.
- The main action is correct.
- The main manipulated object is correct.
- Source, destination, direction, or spatial relation is correct when central to the event.
- Wording can differ; synonyms are fine.
- It may be slightly less detailed than the gold label if it is still useful.

Reject if:
- The action is wrong.
- The main object is wrong.
- Source, destination, or direction is flipped or wrong.
- It describes a different event.
- It is too vague to identify the subtask.
- It hallucinates an important object or action.

Ignore:
- Grammar.
- Minor wording differences.
- Timing; timing is evaluated separately.

Return only JSON:
{"match": true}

Finally, we evaluate the end-to-end setting. It uses the same setup as segmentation, but a predicted segment also has to pass the labeling test in order to be considered a match.
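As a rough sketch of how the two tests combine, the snippet below parses a judge reply of the form requested by the rubric and accepts a predicted segment only when both the IoU test and the label test pass. The helper names are hypothetical, and treating an unparsable judge reply as a rejection is an assumption, not something the post specifies.

```python
import json


def judge_says_match(reply: str) -> bool:
    """Return True only if the reply contains a JSON object with "match": true."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return False  # assumption: no JSON object means the label is rejected
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and data.get("match") is True


def end_to_end_match(iou_value: float, judge_reply: str, threshold: float = 0.75) -> bool:
    """A predicted segment counts only if it passes both the IoU and the label test."""
    return iou_value >= threshold and judge_says_match(judge_reply)


print(end_to_end_match(0.82, '{"match": true}'))
print(end_to_end_match(0.82, '{"match": false}'))
print(end_to_end_match(0.60, '{"match": true}'))
```

One practical detail when filling the rubric template: it contains literal braces in its JSON example, so a plain string replacement of the placeholders is simpler than `str.format`, which would try to interpret those braces as fields.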

Part I: Finding subtask boundaries

Since we split annotation into boundary detection and labeling, the first question is how to find the subtask boundaries. Some methods can use extra robot state, such as gripper position, joint state, or end-effector motion. We intentionally do not assume access to that signal, because egocentric video and many scraped robot-video datasets only provide pixels.

Fixed‑length baseline

As a sanity-check baseline, we ignore the video content and task instruction entirely. We simply split each episode into consecutive segments that are exactly 5.77s long, the mean gold segment duration in WGO‑Bench, and score the resulting boundaries against the human subtask boundaries. This reaches 0.070 Segment F1 on the full 100-episode benchmark.
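A minimal sketch of this baseline is shown below. The function name is hypothetical, and letting the final segment be shorter than 5.77s when the episode length is not an exact multiple is an assumption about how the tail is handled.

```python
from typing import List, Tuple


def fixed_length_segments(duration_sec: float, segment_sec: float = 5.77) -> List[Tuple[float, float]]:
    """Split [0, duration_sec] into consecutive fixed-length segments."""
    segments = []
    i = 0
    while i * segment_sec < duration_sec:
        start = i * segment_sec
        end = min((i + 1) * segment_sec, duration_sec)  # assumption: shorter tail segment
        segments.append((start, end))
        i += 1
    return segments


# The 5.77s figure follows from the benchmark table: 71.5 minutes over 743 segments.
mean_segment_sec = 71.5 * 60 / 743
print(round(mean_segment_sec, 2))

# Example: an average-length Galaxea episode (17.7s).
print(fixed_length_segments(17.7))
```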

Splitting Videos by Similarity

Our first non-trivial approach follows prior work on VLM captioning for subtask segmentation (Suzuki et al., 2026; "Proprioception Enhances Vision Language Model in Generating Captions and Subtask Segmentations for Robot Task", https://arxiv.org/abs/2512.20876). We ask a VLM (Gemini 3.5 Flash) to caption frames from the video, embed those captions, then split segments when the cosine similarity between neighboring caption embeddings drops far enough. Unfortunately, the threshold suggested in the paper heavily over-segments, and even after sweeping many similarity thresholds this method only barely beats the fixed-length baseline, reaching 0.081 F1.
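The splitting step can be sketched as follows, assuming the caption embeddings have already been computed. The function names and the toy vectors are illustrative only, and interpreting "drops far enough" as cosine similarity falling below a fixed threshold is an assumption.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def split_by_similarity(
    timestamps: List[float],
    embeddings: List[Sequence[float]],
    threshold: float,
) -> List[Tuple[float, float]]:
    """Start a new segment wherever neighboring embeddings fall below the threshold."""
    if not timestamps:
        return []
    segments = []
    seg_start = timestamps[0]
    for i in range(1, len(timestamps)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            segments.append((seg_start, timestamps[i]))
            seg_start = timestamps[i]
    # Close the last segment unless the split landed exactly on the final frame.
    if seg_start < timestamps[-1] or not segments:
        segments.append((seg_start, timestamps[-1]))
    return segments


times = [0.0, 1.0, 2.0, 3.0]
toy_embeddings = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
print(split_by_similarity(times, toy_embeddings, threshold=0.8))
```

A lower threshold produces fewer, longer segments, which is the knob swept in the experiments described above.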

One possible issue is the unnecessary conversion through text: image -> caption -> embedding. In the hope that this was the bottleneck, we skipped captioning and embedded the frames directly with Gemini Embedding 2. Unfortunately, this made things much worse, resulting in 0.007 F1.

Prompt: Gemini embedding

Represent this robot video frame based on the current sub-goal the robot or human is pursuing. Focus on what object is being manipulated, what state change is underway, where the object is moving, and whether the current sub-goal appears different from nearby moments. Do not describe the image; produce an embedding useful for grouping frames that belong to the same ongoing sub-goal.

Asking the Model Directly

Per‑frame timestamped images

Instead of relying on embedding heuristics, we next fed frames directly into the model and asked it to segment the video. The key representation problem was how to represent time, as the model needed to know which frame corresponded to which timestamp.

Most approaches give the model time information through special temporal tokens (Li et al., 2025; "Universal Video Temporal Grounding with Generative Multi-modal Large Language Models", https://arxiv.org/abs/2506.18883) or text. We used the simplest text-based version: interleave each sampled frame with its timestamp, use an instruction close to the one given to human annotators, and pack all frames into one prompt.

Because the model is already generating a structured segment list, we also ask it to include a short subtask label for each segment. That adds no meaningful extra cost and makes the outputs easier to inspect, but the experiments in this section are scored on boundary quality.

In this setup, the model receives a long sequence of separate images. Each frame is paired with timestamp text, and the prompt asks the model to use those timestamps when choosing subtask boundaries.

This resulted in a score of 0.193 F1, easily clearing the fixed-length baseline.

Contact Sheets: Better and Cheaper

The frame-based approach has two problems: it is expensive, and the performance is still weak. In our setup the Gemini API cost is $0.188 per minute of video while only reaching 0.193 F1.

We take inspiration from Scale's contact sheets and pack multiple frames into one large image. In our case, each sheet contains 20 frames in a 4-row by 5-column layout, sampled every 0.5 seconds, such that one sheet spans 10 seconds of video.

To encode timestamps, we use the same basic strategy as the frame setup, but describe the contact-sheet map instead of attaching one timestamp to each separate frame.
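To make the layout concrete, the sketch below maps a sampled frame index to its sheet, tile position, and timestamp. The function names are hypothetical; the reading order (left-to-right, then top-to-bottom) is taken from the input decomposition prompt later in this post and is assumed to apply here as well.

```python
import math
from typing import Dict, Union


def tile_position(
    frame_index: int, rows: int = 4, cols: int = 5, step_sec: float = 0.5
) -> Dict[str, Union[int, float]]:
    """Locate a sampled frame on its contact sheet (row-major reading order)."""
    per_sheet = rows * cols
    sheet, tile = divmod(frame_index, per_sheet)
    row, col = divmod(tile, cols)
    return {"sheet": sheet, "row": row, "col": col, "time_sec": frame_index * step_sec}


def sheet_count(num_frames: int, rows: int = 4, cols: int = 5) -> int:
    """Number of contact sheets needed to hold all sampled frames."""
    return math.ceil(num_frames / (rows * cols))


print(tile_position(27))  # a frame on the second sheet
print(sheet_count(343))   # number of sheets for a 343-frame episode
```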

This only gives a small quality bump to 0.201 F1. The bigger win is cost: roughly 12x cheaper, dropping to $0.0158 per minute of video.

The cost drop from contact sheets mostly comes from how Gemini counts image inputs. Images up to 384 pixels in both dimensions count as 258 tokens, and larger images are split into 768x768 tiles, with each tile also counted as 258 tokens. Sending frames one by one pays that image cost for every sampled frame. Packing frames into one contact sheet pays for the sheet tiles instead, which amortizes the visual token cost across many frames. We also suspect contact sheets can be slightly better for quality because they reduce the number of separate images in the prompt; most models are trained with only a few images at a time, so a single composed image may be closer to the training distribution than a long list of individual frames.
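The amortization can be illustrated with the counting rule exactly as stated above. This is a simplified sketch: the real tiling logic may differ in its details, the function name is hypothetical, and the square 224x224 frame size is an assumption made only for the example.

```python
import math

TOKENS_PER_TILE = 258


def image_tokens(width: int, height: int) -> int:
    """Token count under the simplified rule described in the text."""
    if width <= 384 and height <= 384:
        return TOKENS_PER_TILE
    tiles = math.ceil(width / 768) * math.ceil(height / 768)
    return tiles * TOKENS_PER_TILE


frame_w = frame_h = 224  # assumption: square 224px frames
rows, cols = 4, 5

separate = rows * cols * image_tokens(frame_w, frame_h)
sheet = image_tokens(cols * frame_w, rows * frame_h)

print("20 separate frames:", separate, "tokens")
print("one 4x5 contact sheet:", sheet, "tokens")
```

Under these assumptions the sheet uses a fraction of the image tokens of the separate frames. The measured cost difference reported above also depends on factors this sketch does not model, such as the original frame resolution, text and output tokens, and pricing.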

Timestamp Cues

The auxiliary labels made the failure mode easy to inspect. Even though label quality is not the metric here, the generated labels are usually reasonable: the model knows that an object was picked up, placed, opened, or moved. The weak point is boundary placement. It can describe the broad event, but still misses finer subtasks and struggles to place the start and end times in the right place.

We initially avoided engraving timestamps directly into the frames because several works (Singh et al., 2019; "Towards VQA models that can read", CVPR; Liu et al., 2024; "OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models", Science China Information Sciences, doi:10.1007/s11432-024-4235-6), as well as the Scale blog (Choghari et al., 2026), warn against relying on visual text. Still, we decided to try it. Beyond plain timestamps, we also tried different encodings: replacing timestamps with IDs like {A-Z}_{1-20}, where the letter marks the sheet and the number marks the tile inside the sheet; combining textual and visual cues; and changing where the timestamp is rendered.
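The ID encoding can be sketched as a pair of small helpers. The names are hypothetical, and mapping IDs back to time through a fixed 0.5s sampling step and 20 tiles per sheet is an assumption based on the contact-sheet setup described earlier.

```python
import string


def tile_id(frame_index: int, per_sheet: int = 20) -> str:
    """Encode a frame index as <sheet letter>_<1-based tile number>, e.g. B_8."""
    sheet, tile = divmod(frame_index, per_sheet)
    if sheet >= len(string.ascii_uppercase):
        raise ValueError("this scheme only covers 26 sheets (A-Z)")
    return f"{string.ascii_uppercase[sheet]}_{tile + 1}"


def tile_id_to_seconds(tid: str, per_sheet: int = 20, step_sec: float = 0.5) -> float:
    """Decode an ID back to a timestamp, assuming a fixed sampling step."""
    letter, number = tid.split("_")
    frame_index = (ord(letter) - ord("A")) * per_sheet + int(number) - 1
    return frame_index * step_sec


print(tile_id(27), tile_id_to_seconds(tile_id(27)))
```

With letters A to Z and 20 tiles per sheet at 0.5s spacing, this scheme covers at most 26 x 10s = 260s of video, which is one practical limit of the encoding as written.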

Visual cues worked surprisingly well, pushing segmentation to 0.263 F1. Unfortunately, replacing timestamps with ID codes did not help, and combining visual cues with a text timestamp map made the result worse. So we kept the simple version: visual timestamps directly on the contact sheet.

Because visual cues worked so well, we also tried a few timestamp renderings: a yellow box in the top-right of each frame with black timestamp text, a white side strip with the frame time, and a large black box at the bottom-center with white timestamp text. None of them beat the original in-tile timestamp.

Contact Sheet Sweeps

After the timestamp sweep, the obvious next question was whether the rest of the contact-sheet design mattered too. We therefore swept resolution, sampling rate, and the number of frames per sheet.

Resolution

Surprisingly, scaling up the resolution did not help. It is not obvious why. When we resize benchmark videos down to 224px, even humans start to have more trouble annotating them, so we would expect normal-resolution sheets to help much more than they do. One possible explanation is that very large contact sheets hit practical limits in the model image pipeline, such as resizing, cropping, or less reliable attention over huge tiled images.

Sampling rate and sheet size

Again, the sweep did not improve over the original contact-sheet setting. Sampling more densely seemed to add noise or make the prompt harder to parse, while larger sheets did not provide enough extra context to justify the added visual complexity.

Input Decomposition Experiments

One reason we suspected the model might struggle was input size. For the longest episodes in our benchmark, a single full-episode call would send roughly 343 frames across 18 contact sheets, totaling around 20k input tokens. Since increasing context size and the number of separate input images typically degrades performance, we expected splitting segmentation into multiple smaller calls might help.

In each call, we sent one contact sheet with visual cues and additionally asked Gemini to emit whether the first segment was a continuation. We then postprocessed the results by merging continuations if either Gemini emitted the continuation flag or the segments matched exactly. We tried this both with a 2-frame overlap, to improve continuity matching, and with no overlap. To improve continuity further, we also tried adding textual information about the last unfinished segment from the previous call.
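The postprocessing step can be sketched as below. The field names follow the JSON shape used in the input decomposition prompt; the function name is hypothetical, and reading "matched exactly" as "same subtask text as the last accepted segment" is an assumption.

```python
from typing import Dict, List


def merge_sheet_segments(sheets: List[List[Dict]]) -> List[Dict]:
    """Concatenate per-sheet segments, merging a sheet's first segment into the
    previous one when it is flagged as a continuation or repeats the same label."""
    merged: List[Dict] = []
    for segments in sheets:
        for i, seg in enumerate(segments):
            continues = bool(seg.get("continues_previous"))
            same_label = bool(merged) and seg["subtask"] == merged[-1]["subtask"]
            if merged and i == 0 and (continues or same_label):
                merged[-1]["end_sec"] = max(merged[-1]["end_sec"], seg["end_sec"])
            else:
                merged.append(
                    {"start_sec": seg["start_sec"], "end_sec": seg["end_sec"], "subtask": seg["subtask"]}
                )
    return merged


sheets = [
    [
        {"start_sec": 0.0, "end_sec": 4.0, "subtask": "pick up the cup", "continues_previous": False},
        {"start_sec": 4.0, "end_sec": 10.0, "subtask": "move the cup to the sink", "continues_previous": False},
    ],
    [
        {"start_sec": 10.0, "end_sec": 12.0, "subtask": "carry the cup to the sink", "continues_previous": True},
        {"start_sec": 12.0, "end_sec": 20.0, "subtask": "turn on the tap", "continues_previous": False},
    ],
]
for seg in merge_sheet_segments(sheets):
    print(seg)
```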

The main failure mode was that the model treated those artificial split points as real boundaries, ending segments there far more often than the gold annotations do.

Unfortunately, none of these methods beat the simple strategy of sending all contact sheets in one call. Still, some form of decomposition will likely matter when scaling to much longer episodes.

The results were fairly surprising: we thought adding overlap or last-segment context would reduce this split-boundary bias, but the opposite happened.

Prompt: Input decomposition

Segment the current timestamped contact sheet from one continuous robot video.

Return only JSON with this shape:
{"segments":[{"start_sec":0.0,"end_sec":1.0,"subtask":"short action description","continues_previous":false}],"end_state":"short description of what may continue after this sheet"}

Context:
- This is sheet {sheet_index} of {sheet_count}.
- Current sheet visible time range: {start_sec:.2f}s to {end_sec:.2f}s.
- Only output segments that overlap emit range: {emit_start_sec:.2f}s to {emit_end_sec:.2f}s.
- Frames before emit_start_sec are overlap context only.
- Each sheet has 4 rows and 5 columns. Time runs left-to-right, then top-to-bottom.
- Every tile has a timestamp in the top-left corner. Use those visible timestamps.
- Episode instruction: {instruction}
- Previous accepted segments: {previous_segments_json}

Rules:
- Treat each segment as one manipulation event that changes the world state.
- Good boundaries happen when an object becomes held, is released, reaches a new location, a lid/door changes state, a tool starts/stops affecting a surface, or contents visibly move.
- If the first visible event continues the final previous accepted segment, set continues_previous=true on that first segment and use the same subtask wording if possible.
- Do not create a boundary just because this contact sheet starts.
- Avoid idle time, camera motion, hesitation, and tiny hand adjustments.

The Model Matters

Given that we had converged on image inputs, we next checked whether the bottleneck was the model itself. We benchmarked Google models alongside other proprietary and open-source models.

Gemini was clearly ahead of the other frontier labs in this setup. We were also surprised that Gemini Robotics ER did not do better, given that it should be tuned for spatial and robotics tasks. The open-source models were weaker here and often over-predicted segments.

Prompt Search with GEPA

At this point, the failures had a pattern. The model could often see the manipulation, but it applied the wrong granularity. It treated approach, grasp adjustment, hesitation, retreat, and tiny repositioning as separate events, while our annotations only count completed manipulation events.

Instead of continuing to hand-tune prompts, we used GEPA (Agrawal et al., 2026; "GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning", https://arxiv.org/abs/2507.19457) on a separately annotated 15-episode validation set. The goal was to search for prompts that make the model follow the annotation protocol described above: segment atomic manipulation events, ignore incidental motion, and place boundaries at completed world-state changes.

The base prompt, event_reconstruct_contact_sheets_v1, already described the right kind of boundaries: objects becoming held, released, moved, opened, closed, affected by tools, or visibly changing state. What it did not do strongly enough was constrain granularity. The best GEPA‑found prompt, completed_events_duration_prior_v1, keeps the same event vocabulary but adds stricter counting rules: only completed manipulation events, no separate approach/grasp-adjustment/retreat segments unless the world state changes, no merging of distinct pick/place/open/close/pour/wipe events, and a light duration prior for the expected segment length.

The best prompt predicted fewer segments than the base prompt (550 vs. 645) and improved overall F1 by removing many incidental-motion boundaries without losing too many real subtasks.

Final segmentation recipe

After running 54 segmentation experiments, spanning everything from 0.007 F1 to 0.306 F1, we converged on the following recipe:

- sample frames every 0.5s
- render them as 224px tiles
- pack 20 frames per contact sheet in a 5-column layout
- draw visual timestamps directly on the frames
- use Gemini 3.5 Flash with a completed-events prompt

The prompt itself was less about describing the image and more about enforcing the annotation protocol. It asked for completed manipulation events, not every visible movement; it split when objects were picked up, released, moved, opened, closed, affected by tools, or transferred; and it explicitly told the model not to split approach, grasp adjustment, retreat, hesitation, or tiny repositioning unless the world state changed.

Part II: Turning Segments Into Subtasks

The second part of subtask annotation is labeling: given a fixed segment, describe the subtask happening inside it. This is considerably easier than boundary discovery because the model no longer has to reason over the whole episode at once; most subtasks are shorter than 10s.

As described in Measuring Progress: WGO‑Bench, we evaluate labels with an LLM judge rather than exact string matching. A predicted label is correct if it describes the same manipulation event as the human annotation. Unlike boundary detection, this task does not have an obvious trivial baseline: there is no simple way to generate meaningful subtask labels without looking at the video.

We start from what worked best in our boundary experiments, with Scale's dense video captioning post (Choghari et al., 2026) as a useful reference point for compact video inputs. Once the segment boundaries are fixed, we only need to figure out how much visual evidence and surrounding context the model needs to name the event correctly.

Input Formatting

As with segmentation, the first question is how to represent the video to the model. Labeling is easier because the input is already a short fixed segment, not a full episode, but it was still worth checking whether the same input-format lessons hold.

For most labeling experiments, we used the following prompt, based on what worked in segmentation.

Prompt: Labeling

Annotate the fixed robot video segment shown in the contact sheet.

Return only JSON:
{"label":"short descriptive subtask label"}

Focus on the state change caused by the segment.

Rules:
- The frames are chronological and timestamped.
- The segment boundaries are fixed; do not create, split, merge, or move segments.
- Compare the beginning and end of the segment, then describe the completed visible change.
- Use one concise imperative phrase.
- Name the manipulated object and the action/state change.
- Include source, destination, side, direction, final placement, opened/closed state, filled/cleaned/cut/drawn/folded part when visible.
- If the segment is a continuous process, describe the process and its target, e.g. "wipe the wooden table with the cloth" or "dice the onion on the cutting board".
- Do not mention timestamps, frame numbers, uncertainty, or invisible intent.

Episode instruction: {instruction}

We treat labeling as a state-change question: given a fixed time window, compare the beginning and end, then name the completed manipulation event. The prompt stayed the same across these runs; only the visual input changed.

We tried different ways to feed visual information to the model: an MP4 video clip of the complete target segment, a single contact sheet made from three sampled frames, and the same three frames sent as separate image inputs.

Surprisingly, separate frame inputs were slightly better than contact sheets for labeling. We still continued with contact sheets because they are about 12x cheaper, and the gap was small enough that cost mattered more.

More Frames Are Not Always Better

The next question was how many frames to put inside the contact sheet. Unlike segmentation, labeling starts from well-bounded segments, so we can sample only a few frames uniformly and still capture most actions. In many cases, the start and end state already carry the useful signal.

We therefore ablated the number of frames per target sheet.

The best target-only setting uses only 5 frames. Beyond that, adding more frames does not meaningfully improve accuracy. However, this depends on how well the segments are split: with noisier or less precise boundaries, the model might need more frames to recover what happened.
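Uniform sampling inside a fixed segment can be sketched as follows. The function name is hypothetical; including both endpoints (so the start and end states are always visible) and the optional `min_gap_sec` cap for very short segments are assumptions, not details confirmed by the post.

```python
from typing import List


def sample_times(
    start_sec: float, end_sec: float, max_frames: int = 5, min_gap_sec: float = 0.0
) -> List[float]:
    """Pick up to max_frames evenly spaced timestamps, including both endpoints."""
    if end_sec < start_sec or max_frames < 1:
        raise ValueError("need end_sec >= start_sec and max_frames >= 1")
    duration = end_sec - start_sec
    n = max_frames
    if min_gap_sec > 0:
        # Assumption: avoid near-duplicate frames in very short segments.
        n = min(max_frames, int(duration / min_gap_sec) + 1)
    if n == 1 or duration == 0:
        return [start_sec + duration / 2]
    step = duration / (n - 1)
    return [start_sec + i * step for i in range(n)]


print(sample_times(2.0, 4.0))
print(sample_times(0.0, 0.5, min_gap_sec=0.25))
```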

Before and After Helps

The third ablation asks whether the model benefits from context beyond the target segment itself. Many actions are easier to name when you see what came immediately before or after: after picking up a cup, the next state often tells you where it was placed; after opening a door, the following segment can clarify whether the action was entering, closing, or repositioning.

We tested the following context variants:

- whole-episode overview (one contact sheet sampled uniformly across the full episode)
- local +/-1s context around the target segment
- local +/-2s context around the target segment
- previous/current/next fixed-segment context
- previous/current/next fixed-segment context plus episode overview

Adding context helps, but only when it is structured as whole neighboring segments. A local +/-1s or +/-2s window is often too narrow: boundaries frequently sit near grasp or release events, so the padding does not show enough of the previous or next state. Whole‑episode overview also does not help much, likely because the global instruction already provides most of that high-level context.
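The difference between the two kinds of context can be made concrete with a small sketch. Both helper names are hypothetical, and segments are represented as plain (start, end) tuples for illustration.

```python
from typing import Dict, List, Optional, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def padded_context(segment: Segment, pad_sec: float, episode_end_sec: float) -> Segment:
    """Local +/- pad window around the target, clipped to the episode."""
    start, end = segment
    return (max(0.0, start - pad_sec), min(episode_end_sec, end + pad_sec))


def neighbor_context(segments: List[Segment], index: int) -> Dict[str, Optional[Segment]]:
    """Whole previous/current/next segments; missing neighbors are None."""
    previous = segments[index - 1] if index > 0 else None
    following = segments[index + 1] if index + 1 < len(segments) else None
    return {"previous": previous, "current": segments[index], "next": following}


segments = [(0.0, 3.0), (3.0, 4.0), (4.0, 9.0)]
print(padded_context(segments[1], pad_sec=1.0, episode_end_sec=9.0))
print(neighbor_context(segments, 1))
```

In this toy example the +/-1s window around the middle segment only reaches 5.0s, while the neighboring-segment context covers the whole next segment up to 9.0s, including its end state.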

Final recipe

For labeling, the final recipe is simple:

- use the fixed segment boundaries
- render three small contact sheets: previous segment, current segment, next segment
- sample up to 5 frames per segment uniformly
- ask Gemini 3.5 Flash to label only the current segment
- use the neighboring segments only to disambiguate what changed

This gave our best raw-labeling result: 453/743 = 61.0% accuracy. This is broadly consistent with Scale's finding that past/current/future visual context is the strongest representation for clip-level labeling (Choghari et al., 2026). The main difference is that we use compact contact sheets instead of their hand-collage setup, which keeps the same temporal contrast while making the pipeline substantially cheaper to run.

End‑to‑End Evaluation

After finding the best segmentation and labeling recipes, the final question is whether the full task should be done in one pass or split into two steps: segment first, then relabel the predicted segments.

Our first experiment simply combined the best segmentation setup with the best labeling setup, but to our surprise this performed worse than taking the labels directly from segmentation. We therefore ran one more experiment: use the original segment label as a strong prior, then ask the model to verify and minimally correct it using the previous, current, and next segment images.

Method	Segment F1	Label accuracy (temporal matches)	Semantic precision	Semantic recall	Semantic E2E F1
best segment -> label	`0.302`	`71.5%`	`0.184`	`0.132`	`0.154`
one-pass segmentation labels	`0.302`	`73.7%`	`0.190`	`0.136`	`0.158`
segment -> seeded relabeling	`0.302`	`78.1%`	`0.201`	`0.144`	`0.168`
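
Reading the two columns before the final one as semantic precision and recall, the final end-to-end F1 column is consistent with their harmonic mean, up to rounding of the inputs. That reading is an inference from the numbers, so treat it as an assumption. A minimal check, with the values copied from the table:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (assumed precision, assumed recall, reported end-to-end F1) per method.
ROWS = {
    "best segment -> label": (0.184, 0.132, 0.154),
    "one-pass segmentation labels": (0.190, 0.136, 0.158),
    "segment -> seeded relabeling": (0.201, 0.144, 0.168),
}

if __name__ == "__main__":
    for name, (p, r, reported) in ROWS.items():
        computed = f1(p, r)
        # Inputs are rounded to three decimals, so allow a small tolerance.
        status = "ok" if abs(computed - reported) < 0.0015 else "mismatch"
        print(f"{name}: computed {computed:.4f}, reported {reported:.3f} ({status})")
```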

Prompt: Seeded relabeling #

Annotate one fixed segment from a longer video.
Return only JSON:
{"label":"short descriptive subtask label"}
Inputs:
- The first image is the previous fixed segment, if it exists; otherwise it is blank/context only.
- The second image is the current target segment.
- The third image is the next fixed segment, if it exists; otherwise it is blank/context only.
- Each image is timestamped with absolute video time.
Episode instruction:
{instruction}
Target segment:
{segment_index} of {segment_count}
Target time:
{start_sec:.2f}s to {end_sec:.2f}s
Original predicted label for this exact segment:
{seed_label}
Rules:
- Label only the current target segment.
- Use previous/next images only to disambiguate what changed during the current segment.
- Treat the original predicted label as a strong prior, not as ground truth.
- Verify and minimally correct the original label using the current target segment.
- If the original label describes the same action and main object, keep it, only improving grammar or adding clearly visible essential details.
- If it is too vague but directionally correct, make it more specific.
- If it describes the previous/next segment, the wrong action, wrong object, wrong destination, or wrong state change, replace it.
- Do not describe the previous or next segment.
- Do not split or merge the fixed segment.
- Do not introduce a new action unless it is clearly visible in the current target segment.
- Do not make the label broader than the fixed segment.
- Use one concise imperative phrase.
- Include the exact action and manipulated object.
- Include source, destination, side, direction, final location, opened/closed/filled/cleaned state, or affected part when visible and central.
- Do not mention timestamps, frame numbers, uncertainty, candidates, or invisible intent.

That finally beat the one-pass segmentation method and raised label accuracy on temporal matches to 78.1% and semantic E2E F1 to 0.168.
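
The seeded relabeling prompt uses Python-style placeholders such as `{seed_label}` and `{start_sec:.2f}`, so one way to fill it is `str.format`. The sketch below uses an abbreviated copy of the prompt and makes two assumptions that are ours, not the article's: the helper names `build_prompt` and `parse_label` are hypothetical, and falling back to the seed label when the reply is not valid JSON is just one possible policy. Note that the literal JSON braces in the prompt have to be doubled, otherwise `str.format` treats them as a placeholder.

```python
import json

# Abbreviated copy of the prompt. The JSON example uses doubled braces so that
# str.format emits them literally instead of parsing them as a field.
PROMPT_TEMPLATE = """Annotate one fixed segment from a longer video.
Return only JSON:
{{"label":"short descriptive subtask label"}}
Episode instruction:
{instruction}
Target segment:
{segment_index} of {segment_count}
Target time:
{start_sec:.2f}s to {end_sec:.2f}s
Original predicted label for this exact segment:
{seed_label}
"""


def build_prompt(instruction: str, segment_index: int, segment_count: int,
                 start_sec: float, end_sec: float, seed_label: str) -> str:
    """Fill the template for one predicted segment."""
    return PROMPT_TEMPLATE.format(
        instruction=instruction,
        segment_index=segment_index,
        segment_count=segment_count,
        start_sec=start_sec,
        end_sec=end_sec,
        seed_label=seed_label,
    )


def parse_label(reply: str, seed_label: str) -> str:
    """Read {"label": ...} from the reply; keep the seed label on failure."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return seed_label
    label = data.get("label") if isinstance(data, dict) else None
    if isinstance(label, str) and label.strip():
        return label.strip()
    return seed_label


if __name__ == "__main__":
    print(build_prompt("Put the cup on the shelf", 2, 5, 3.5, 7.25, "pick up cup"))
    print(parse_label('{"label": "pick up the cup from the counter"}', "pick up cup"))
```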

The tradeoff is cost. Seeded relabeling improves semantic accuracy, but it can get expensive because each predicted segment triggers another model call and the relabeling prompt often incurs a lot of thinking tokens. Batch pricing cuts this roughly in half, but segmentation-only is still the cheaper default when cost is the priority. If label quality matters more and the budget allows it, running segmentation followed by seeded relabeling gives the best end-to-end result.

Stage	Calls	Cost / hour	Batch cost / hour
segmentation	`100`	`$0.86/h`	`$0.43/h`
seeded relabeling	`532`	`$4.41/h`	`$2.21/h`
total end-to-end	`632`	`$5.27/h`	`$2.64/h`
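
The totals in this table are plain sums of the two stages. The short sketch below reproduces them and also shows how much of the cost and call volume comes from relabeling. The dictionary layout and variable names are ours for illustration; the numbers are copied from the table.

```python
# Per-stage figures copied from the cost table (USD per hour of video).
STAGES = {
    "segmentation": {"calls": 100, "cost": 0.86, "batch_cost": 0.43},
    "seeded relabeling": {"calls": 532, "cost": 4.41, "batch_cost": 2.21},
}

total_calls = sum(stage["calls"] for stage in STAGES.values())
total_cost = sum(stage["cost"] for stage in STAGES.values())
total_batch_cost = sum(stage["batch_cost"] for stage in STAGES.values())

# Share of the standard-price bill that comes from the relabeling stage.
relabel_share = STAGES["seeded relabeling"]["cost"] / total_cost

# Relabeling calls issued per segmentation call.
relabel_calls_per_segmentation_call = (
    STAGES["seeded relabeling"]["calls"] / STAGES["segmentation"]["calls"]
)

if __name__ == "__main__":
    print(f"calls: {total_calls}")
    print(f"standard: ${total_cost:.2f}/h, batch: ${total_batch_cost:.2f}/h")
    print(f"batch / standard ratio: {total_batch_cost / total_cost:.2f}")
    print(f"relabeling share of cost: {relabel_share:.0%}")
    print(f"relabeling calls per segmentation call: "
          f"{relabel_calls_per_segmentation_call:.2f}")
```

Most of the bill comes from the relabeling stage, which is why segmentation-only remains the cheaper default.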

Failure Mode Breakdown

Table 2 separates the two failure modes: label accuracy on temporal matches reaches 78.1%, while Segment F1 is still only 0.302. Clearly the main bottleneck is segmentation.

The largest segmentation failure mode is short subtasks. This mirrors the human annotation problem described in the manual labeling section: quick pick, place, and adjustment events can span only a few frames or seconds, making their boundaries hard to place. The same issue shows up in the pipeline. Segments shorter than 2s have by far the lowest recall, only 0.074. This is most pronounced in HomER, the egocentric subset with the lowest F1, both because it has the highest concentration of short segments and because it contains the concurrent-subtask issues described in the labeling section.
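
A duration breakdown like this can be computed by bucketing gold segments by length and checking whether each one is recovered. The sketch below is a simplified illustration under stated assumptions: a gold segment counts as recovered if any predicted segment overlaps it with temporal IoU of at least 0.5, matching is not enforced to be one-to-one, and only the "shorter than 2s" bucket comes from the text. The threshold, the other bucket edges, and the function names are hypothetical and may differ from the benchmark's actual matching rule.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def bucket(duration: float) -> str:
    """Assign a gold segment to a duration bucket (edges are illustrative)."""
    if duration < 2.0:
        return "<2s"
    if duration < 5.0:
        return "2-5s"
    return ">=5s"


def recall_by_duration(gold: List[Segment], pred: List[Segment],
                       iou_threshold: float = 0.5) -> Dict[str, float]:
    """Fraction of gold segments per bucket matched by some prediction."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for g in gold:
        name = bucket(g[1] - g[0])
        totals[name] += 1
        if any(temporal_iou(g, p) >= iou_threshold for p in pred):
            hits[name] += 1
    return {name: hits[name] / totals[name] for name in totals}


if __name__ == "__main__":
    gold = [(0.0, 1.0), (1.0, 6.0), (6.0, 8.0)]
    pred = [(0.0, 6.0), (6.0, 8.0)]  # the short first segment was merged away
    print(recall_by_duration(gold, pred))
```

In the toy example the one-second gold segment is swallowed by a longer prediction, so its IoU stays far below the threshold even though the prediction covers it completely. Under a rule like this one, merged boundaries hit short subtasks hardest.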

Once the model lands on the correct temporal segment, labeling is much less broken than segmentation. Interestingly, segments with correctly predicted boundaries are easier to label than all gold segments, with label accuracy increasing from 61% to 70.8%. When it comes to errors, the action itself is mostly correct (92.0% slot accuracy), but the problem is grounding. Most error cases come from underspecification of state ("pour water into glass" vs "pour water into glass until it's full") or initial/final location ("pick up the plate below the table" vs "pick up a plate").

Final end-to-end recipe

Both pipelines, segmentation-only and segmentation followed by seeded relabeling, are open-sourced in Refiner, our robotics data processing framework. To run them on your own videos, see the ready-to-use subtask annotation example.

Conclusion #

We have shown that VLMs can pull useful subtask annotations out of raw robot videos, but the way we show the video to the model matters more than we expected. The best setup was also the simplest one. Timestamped contact sheets were cheaper and worked better than long sequences of individual frames. Prompt wording and model choice still changed the results a lot, and labeling worked best when the model could see what happened just before and just after the segment.

WGO‑Bench is still a small benchmark, but it gives us concrete tools to measure this problem end-to-end, including boundary discovery, segment labeling, and the combined semantic score. We are releasing it because subtask annotations are becoming an important training signal for long-horizon robot learning, and the community needs public evaluation targets for how to produce them at scale.

This is also the broader reason we are building Macrodata Labs: better data for better robots. Robotics teams should be able to turn raw physical-world experience into inspectable, enriched, reusable training signal without rebuilding the same data plumbing for every project or relying on hard-to-scale human annotation. Subtask annotation is one example of that larger problem, and we will keep tackling common robotics-data bottlenecks so more teams can turn real-world demonstrations into real-world progress.

Interested in collaborating on this topic or improving physical AI data pipelines?

source & further reading

macrodata.co — original article