# Evaluate the whole setup, not just the model: a practical workflow for video VLMs

> Source: <https://dev.to/exploringvisionandai/evaluate-the-whole-setup-not-just-the-model-a-practical-workflow-for-video-vlms-2a2g>
> Published: 2026-06-24 20:36:46+00:00

This is a reworked, shorter version of a research note we wrote on the VideoDB Labs blog. I work on the team at VideoDB. The original post and the open source repo are linked at the bottom, and this article is set to canonical back to the original.

Most "which VLM is best" comparisons answer a question almost nobody actually has. For a real video workflow the output does not depend on the model alone. It depends on the segmentation strategy, frame sampling, resolution, the prompt, the model, any reasoning budget, latency limits, and whatever post-processing runs after. Swap any one of those and the numbers move.

So the unit you compare is not model A vs model B. It is configuration A vs configuration B, on your data, at the quality, latency, and cost you can actually support.
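One way to make "configuration" concrete is to treat it as a single hashable record. The sketch below is not the repo's code; the field names, the `describe_v1` prompt id, and the `model-a`/`model-b` labels are all placeholders chosen for illustration.

```python
from dataclasses import dataclass, asdict
from itertools import product
import hashlib
import json


@dataclass(frozen=True)
class EvalConfig:
    """One full setup under test. All field names are illustrative."""
    segmentation: str       # e.g. "fixed_10s" or "shot_detect" (hypothetical labels)
    frames_per_scene: int
    resolution: int         # longest side in pixels
    prompt_id: str
    model: str
    reasoning_budget: int   # 0 means no extra reasoning tokens

    def config_id(self) -> str:
        # Stable short hash so every output can be traced to its exact setup.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:12]


def build_grid(models, frame_counts, resolutions):
    """Cross the knobs you want to vary; keep the rest fixed."""
    return [
        EvalConfig("shot_detect", f, r, "describe_v1", m, 0)
        for m, f, r in product(models, frame_counts, resolutions)
    ]


grid = build_grid(["model-a", "model-b"], [4, 16], [512])
for cfg in grid:
    print(cfg.config_id(), cfg)
```

Because the id is derived from every field, changing only the frame count or only the prompt gives you a different configuration, which is the point: those runs should not be reported under the same model name.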

Retrieval, monitoring, summarization, moderation, metadata extraction, and Q&A are different tasks. They produce different outputs and tolerate different errors. Before picking a model, write down four things:

Those answers tell you where to start. Short-lived actions push you toward denser sampling and more frames. Mostly static video lets you get away with lighter extraction. Latency or cost pressure means you put the cheap configurations in the benchmark early.
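To see what "denser sampling" means in practice, here is a minimal sketch of a sampler that picks evenly spaced timestamps inside a scene. The function name, the sampling-rate parameter, and the cap on frames are assumptions for illustration, not the pipeline's actual extraction logic.

```python
def sample_timestamps(start: float, end: float, fps: float, max_frames: int) -> list[float]:
    """Evenly spaced timestamps inside [start, end), capped at max_frames."""
    if end <= start or fps <= 0 or max_frames <= 0:
        return []
    duration = end - start
    # At least one frame per scene, at most max_frames.
    n = max(1, min(max_frames, int(duration * fps)))
    step = duration / n
    # Sample the middle of each sub-interval to stay off scene-boundary frames.
    return [round(start + step * (i + 0.5), 3) for i in range(n)]


# Mostly static video: one frame every two seconds is often enough.
light = sample_timestamps(0.0, 10.0, fps=0.5, max_frames=8)

# Short-lived actions: sample faster and let the cap limit cost.
dense = sample_timestamps(0.0, 10.0, fps=4.0, max_frames=8)

print(len(light), light)
print(len(dense), dense)
```

The cap matters as much as the rate: it is what keeps a long scene from turning into an expensive request, so both belong in the configuration you benchmark.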

The dataset is the center of the eval. Public benchmarks are fine for a sanity check, but they do not answer the question teams actually care about, which is whether this works on their footage.

A useful set includes normal cases, hard cases, near-miss negatives, boring stretches, and the failure modes you already know about. Surveillance data should include occlusion, low light, motion blur, empty scenes, and crowded scenes. Meeting data should include crosstalk, screen shares, poor audio, and long static sections. Do not build the set around what is easy to label. Build it around the decision you need to make.
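A cheap guard against building the set around what is easy to label is to tag every item with the slices it belongs to and check coverage before running anything. This is a sketch under assumed names: the slice labels, the manifest shape, and the video ids are made up for the example.

```python
from collections import Counter

# Hypothetical slice names mirroring the categories described above.
REQUIRED_SLICES = {"normal", "hard", "near_miss_negative", "boring", "known_failure"}


def slice_coverage(items, min_per_slice=1):
    """Count items per slice tag and list required slices that are underfilled."""
    counts = Counter(tag for item in items for tag in item["slices"])
    missing = sorted(s for s in REQUIRED_SLICES if counts[s] < min_per_slice)
    return counts, missing


# Illustrative manifest; an item can carry more than one tag.
items = [
    {"video_id": "cam1_0001", "slices": ["normal"]},
    {"video_id": "cam1_0002", "slices": ["hard", "low_light"]},
    {"video_id": "cam2_0001", "slices": ["near_miss_negative"]},
    {"video_id": "cam2_0002", "slices": ["boring", "empty_scene"]},
]

counts, missing = slice_coverage(items)
print("underfilled slices:", missing)
```

Domain-specific tags such as `low_light` or `empty_scene` ride along with the generic ones, so the same manifest later lets you report scores per slice instead of one blended number.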

For retrieval, the question is whether the right moment shows up, how high it ranks, and whether similar-but-wrong clips stay out. For alerting, it is whether the alert stream is usable. For summarization, it is whether the summary is factually correct and avoids inventing things. For metadata extraction, it is often better to score field by field. Decide early whether missed events or false alarms cost you more, because that is a product choice, not an academic one.
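Three of these task-specific scores are simple enough to sketch in a few lines. The functions below are generic textbook metrics, not the repo's scorers, and the cost weights are placeholder values you would set from your own product tradeoff.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Retrieval: 1/rank of the first correct clip, 0.0 if none was returned."""
    for rank, clip_id in enumerate(ranked_ids, start=1):
        if clip_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def alert_report(tp, fp, fn, miss_cost=5.0, false_alarm_cost=1.0):
    """Alerting: precision, recall, and a cost that encodes which error hurts more."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    cost = fn * miss_cost + fp * false_alarm_cost
    return {"precision": precision, "recall": recall, "cost": cost}


def field_accuracy(predictions, references):
    """Metadata extraction: exact-match rate per field, not one blended score."""
    fields = sorted({f for ref in references for f in ref})
    scores = {}
    for f in fields:
        pairs = [(p.get(f), r[f]) for p, r in zip(predictions, references) if f in r]
        scores[f] = sum(p == r for p, r in pairs) / len(pairs)
    return scores


print(reciprocal_rank(["c3", "c1", "c7"], {"c1"}))
print(alert_report(tp=8, fp=2, fn=2))
print(field_accuracy(
    [{"color": "red", "count": 2}, {"color": "blue", "count": 3}],
    [{"color": "red", "count": 2}, {"color": "green", "count": 3}],
))
```

The `miss_cost` and `false_alarm_cost` arguments are where the product decision lives: two configurations with the same precision and recall can rank differently once you say which mistake is more expensive.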

A benchmark is not just a score. You want to be able to answer four questions: what exact input produced this output, which configuration generated it, how it was scored, and what changed between two runs. Keeping the per-item context (video id, scene start and end, frame URLs, extraction config, prompt, model, output, scores) means a regression points you at a cause instead of a mystery. We use Langfuse for this layer in the repo, but the principle holds with any tracing setup.
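Independent of the tracing tool, the per-item record can be as plain as a dataclass. The sketch below deliberately uses no Langfuse calls; the record shape follows the fields listed above, while the `regressions` helper, the `accuracy` metric name, and the sample values are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class ItemRecord:
    video_id: str
    scene_start: float
    scene_end: float
    frame_urls: list
    extraction_config: dict
    prompt: str
    model: str
    output: str
    scores: dict


def item_key(rec):
    # The same scene in two runs must map to the same key.
    return (rec.video_id, rec.scene_start, rec.scene_end)


def regressions(baseline, candidate, metric, tolerance=0.0):
    """Scenes whose score dropped between two runs, with old and new values."""
    base = {item_key(r): r for r in baseline}
    dropped = []
    for rec in candidate:
        old = base.get(item_key(rec))
        if old is not None and rec.scores[metric] < old.scores[metric] - tolerance:
            dropped.append((item_key(rec), old.scores[metric], rec.scores[metric]))
    return dropped


# Illustrative records: same scene, two configurations.
run_a = [ItemRecord("v1", 0.0, 8.0, ["f0.jpg", "f1.jpg"], {"frames": 16},
                    "describe_v1", "model-a", "a person enters", {"accuracy": 1.0})]
run_b = [ItemRecord("v1", 0.0, 8.0, ["f0.jpg"], {"frames": 4},
                    "describe_v1", "model-a", "an empty room", {"accuracy": 0.0})]

for key, old, new in regressions(run_a, run_b, "accuracy"):
    print(key, old, "->", new)
```

With the full record on both sides, a drop like this one can be read directly: the model and prompt are unchanged, the frame count went from 16 to 4, so extraction is the first suspect.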

The goal is not a leaderboard. By the end of a run you should know which configuration becomes the default path, which lighter setup is good enough for easy cases, which stronger setup to reserve for hard slices, and where the system still fails. And when quality falls short, the first fix is often to get more signal into the model (denser sampling, more frames, better scene boundaries) before reaching for a bigger model.
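Turning a results table into those decisions can be a couple of filters. This is a rough sketch: the result rows, config names, and every number in them are invented for the example, and the selection rules are one reasonable choice, not the repo's.

```python
def pick_default(results, max_latency_s, max_cost_per_hour):
    """Best overall quality among configurations that fit the budget, else None."""
    eligible = [
        r for r in results
        if r["latency_s"] <= max_latency_s and r["cost_per_hour"] <= max_cost_per_hour
    ]
    return max(eligible, key=lambda r: r["quality"], default=None)


def cheapest_good_enough(results, slice_name, min_quality):
    """Cheapest configuration that clears a quality bar on one slice, else None."""
    ok = [r for r in results if r["slice_quality"][slice_name] >= min_quality]
    return min(ok, key=lambda r: r["cost_per_hour"], default=None)


# Made-up numbers, only to show the shape of the decision.
results = [
    {"config": "small_4f", "quality": 0.71, "latency_s": 1.2, "cost_per_hour": 0.4,
     "slice_quality": {"easy": 0.90, "hard": 0.45}},
    {"config": "small_16f", "quality": 0.80, "latency_s": 2.5, "cost_per_hour": 0.9,
     "slice_quality": {"easy": 0.93, "hard": 0.62}},
    {"config": "large_16f", "quality": 0.86, "latency_s": 6.0, "cost_per_hour": 3.5,
     "slice_quality": {"easy": 0.94, "hard": 0.78}},
]

default = pick_default(results, max_latency_s=3.0, max_cost_per_hour=2.0)
easy_path = cheapest_good_enough(results, "easy", min_quality=0.90)
hard_path = cheapest_good_enough(results, "hard", min_quality=0.75)
print(default["config"], easy_path["config"], hard_path["config"])
```

In this toy table the highest-scoring configuration is not the default because it misses the latency limit, which is exactly the kind of outcome a model-only leaderboard hides.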

We open-sourced the pipeline so you can run the same process on your own videos, define your own metrics, swap in your own models, and compare configurations without rebuilding the stack.

Happy to answer questions about the workflow or the tradeoffs in the comments.
