Detection tells you what is in a frame. Decisions require knowing what changed, what it means, and what should happen next. Those are different questions, and no detector, however accurate, can answer them.
TL;DR — Every real-world vision deployment eventually hits the same wall: the question the business asks is not the question the model answers. Detection answers “what is in this frame?” Decisions require “what changed?”, “what does it mean?”, and “what should happen next?” Bridging that gap takes four things no model provides: state, transitions, validation, and memory.
Founder & Chief Architect, VAOS. Building machine perception infrastructure, cognitive runtimes, and vision-native reasoning systems.
In Part 1, we left a detector staring at a coffee shelf, counting croissants it would forget about a sixtieth of a second later. We said the problem wasn’t eyesight — it was the lack of a layer that maintains trustworthy state. That was the diagnosis. This part is the autopsy: four deployments, four industries, the same cause of death every time.
Every object detector on earth answers exactly one question:
“What is in this frame?”
And it answers it beautifully. Person, 0.94. Car, 0.89. Bicycle, 0.76. Sixty times a second, confidence scores included, no complaints, no coffee breaks.
The trouble is that no business, robot, or safety system has ever actually needed to know what is in a frame. What they need to know is:
“What changed?” “What does it mean?” “What should happen next?”
These aren’t harder versions of the same question. They’re different questions, and the difference is structural: the first is about a frame, and the other three are about time. You can stare at a single image for a thousand years and never find the answer to “what changed?” in it. That answer lives in the relationship between this frame and every frame before it — and the detector threw all of those away.
This is the gap between vision and decision. Let’s walk through it four times, because it’s the kind of thing you only really believe once you’ve seen it happen in your own building.
Back to Dana’s pastry case. The detector reports, faithfully and forever:
croissant: 0.97
croissant: 0.95
croissant: 0.93
That’s the detector’s entire vocabulary. Three words, repeated until the heat death of the universe. But the events Dana actually cares about look like this:
SOLD — count decreased, and stayed decreased
RESTOCKED — count increased after staff entered the zone
LOW STOCK — count crossed below threshold and persisted
EMPTY — count reached zero and held
Look closely at those four. Not one of them lives in a frame. Sold is a comparison between now and earlier. Restocked is a change plus a cause. Low stock is a threshold plus persistence. Empty is a value plus duration. Every single event the business cares about is a state transition — and the detector has no concept of state, so it can never emit one. You can ask it nicely. It will simply keep saying croissant, 0.97.
And it gets funnier, in the way production systems are always funny in hindsight. A barista’s hand passes over the case, hiding three croissants for two seconds. A naïve system dutifully logs three sales, then three miraculous restocks. By lunch it thinks the shop has sold four hundred croissants and is somehow still full. A system with validation knows croissants don’t sell and resurrect themselves in 200 milliseconds — and quietly throws the nonsense out.
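To make the hand-over-the-case failure concrete, here is a minimal Python sketch of a counter that only believes a change once it has persisted. Everything in it is an assumption for illustration: the `ShelfCounter` name, the five-second `persist_s` window, and the simplification that any confirmed increase counts as a restock (the staff-in-zone check described above is left out).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ShelfCounter:
    """Turns noisy per-frame counts into SOLD / RESTOCKED events."""
    confirmed: int                     # the count we currently believe
    persist_s: float = 5.0             # assumed: a change must hold this long
    _candidate: Optional[int] = None   # a differing count we are still doubting
    _since: float = 0.0                # when that differing count first appeared
    events: List[Tuple[float, str, int]] = field(default_factory=list)

    def observe(self, t: float, count: int) -> None:
        if count == self.confirmed:
            # Observation agrees with belief: drop any pending change.
            self._candidate = None
            return
        if count != self._candidate:
            # A new disagreement: start timing it, but believe nothing yet.
            self._candidate, self._since = count, t
            return
        if t - self._since >= self.persist_s:
            # The change held long enough: record a transition.
            delta = count - self.confirmed
            kind = "SOLD" if delta < 0 else "RESTOCKED"
            self.events.append((t, kind, abs(delta)))
            self.confirmed, self._candidate = count, None


shelf = ShelfCounter(confirmed=6)

# A hand hides three croissants for two seconds, then moves away.
for t, count in [(0.0, 6), (1.0, 3), (2.0, 3), (3.0, 6), (4.0, 6)]:
    shelf.observe(t, count)

# One croissant is actually sold, and the count stays down.
for t in range(5, 12):
    shelf.observe(float(t), 5)

print(shelf.events)  # [(10.0, 'SOLD', 1)]
```

The occlusion never becomes an event because the count returns to the confirmed value before the persistence window expires. The price is latency: the genuine sale is reported five seconds after it happens, which is the trade-off any persistence rule makes.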
It’s 2 a.m. on the night shift. The detector says, helpfully:
pallet: 0.91
What the operations team needs at 2 a.m. is:
pallet #P-2847 has remained in staging zone B
for 40 minutes — exceeds 20-minute dwell limit
Count the things in that one sentence that detection cannot give you. #P-2847 needs identity — knowing this is the same pallet across thousands of frames, not a new one blinking into existence each time. Has remained needs memory. 40 minutes needs temporal accounting. Exceeds the limit needs a rule pinned to this specific square of floor — staging zone B tolerates twenty minutes; the lane five meters over might shrug at two hours.
A frame-by-frame system sees that pallet about 72,000 times across forty minutes at 30 frames per second (twice that at sixty) and finds every single sighting utterly unremarkable. The whole story, the thing worth waking someone up for, exists only in the accumulation. And accumulation needs a layer the detector simply doesn’t have.
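A rough sketch of that accumulation in Python might look like the following. It assumes an upstream tracker already supplies a stable `track_id` for each pallet, which is the identity problem described above and is not solved here. The `DwellMonitor` class, the zone names, and the limits table are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# Assumed per-zone dwell limits in seconds; zone names are illustrative.
DWELL_LIMIT_S = {"staging_B": 20 * 60, "lane_5": 2 * 60 * 60}


@dataclass
class DwellMonitor:
    # (track_id, zone) -> time the track was first seen in that zone
    entered: Dict[Tuple[str, str], float] = field(default_factory=dict)
    # (track_id, zone) pairs that have already raised an alert
    alerted: Set[Tuple[str, str]] = field(default_factory=set)

    def observe(self, t: float, track_id: str, zone: str) -> List[str]:
        """Feed one tracked sighting; return any newly raised alerts."""
        # If the track has moved zones, forget its dwell in the old one.
        stale = [k for k in self.entered if k[0] == track_id and k[1] != zone]
        for key in stale:
            del self.entered[key]
            self.alerted.discard(key)

        key = (track_id, zone)
        start = self.entered.setdefault(key, t)   # memory of first sighting
        dwell = t - start
        limit = DWELL_LIMIT_S.get(zone)
        if limit is not None and dwell > limit and key not in self.alerted:
            self.alerted.add(key)                 # alert once, not every frame
            return [f"{track_id} in {zone} for {dwell / 60:.0f} min, "
                    f"limit {limit / 60:.0f} min"]
        return []


monitor = DwellMonitor()
alerts: List[str] = []
for minute in range(0, 41):   # one sighting per minute, for brevity
    alerts += monitor.observe(minute * 60.0, "P-2847", "staging_B")

print(alerts)  # ['P-2847 in staging_B for 21 min, limit 20 min']
```

Forty-one sightings go in and exactly one alert comes out, at the first sighting past the limit. None of the individual sightings differs from the others; only the stored entry time makes the twenty-first minute different from the first.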
A mobile robot rolls down a corridor and its detector blurts:
obstacle: 0.88
And then what? Stopping is the easy part — a brick can stop. Deciding is the hard part, and deciding needs answers the frame doesn’t contain:
Situation understood? Is this a wall, a parked cart, or a person mid-stride? A wall says reroute. A person says yield. A cart that’s been parked there for an hour says go tell a human.
State changed? Was this obstacle here three seconds ago? Something that just appeared in a corridor that was clear a moment ago is a completely different situation from something the robot’s been tracking all along. Honestly, it might be a perception glitch worth doubting before anyone slams the brakes.
Recovery possible? Will it move on its own? A person clears in seconds; a pallet has all night. The right move — wait, reroute, or escalate — depends entirely on the predicted trajectory of the state, not on the detection.
A robot that rediscovers the world from scratch sixty times a second can’t answer any of these. It can only flinch. And a warehouse full of robots flinching at shadows is a very expensive way to stand still.
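As a hedged illustration of how those three questions turn into a choice, here is a small Python policy. The `decide` function, the `Action` names, and both thresholds (half a second of tracking before trusting an obstacle, an hour before escalating a parked cart) are assumptions made for this sketch, not values from any real robot stack.

```python
from enum import Enum


class Action(Enum):
    VERIFY = "verify"      # doubt the observation before reacting
    WAIT = "wait"          # yield and let it clear
    REROUTE = "reroute"
    ESCALATE = "escalate"  # go tell a human


def decide(kind: str, tracked_for_s: float, stationary_for_s: float) -> Action:
    """Pick a response from the obstacle type plus two facts only history supplies.

    tracked_for_s:    how long this obstacle has been continuously tracked
    stationary_for_s: how long it has stayed in the same place
    """
    if tracked_for_s < 0.5:
        # It just appeared in a corridor that was clear: possibly a glitch.
        return Action.VERIFY
    if kind == "person":
        # People usually clear in seconds, so yielding is cheap.
        return Action.WAIT
    if kind == "cart" and stationary_for_s >= 3600:
        # Parked for an hour: nobody is coming back for it soon.
        return Action.ESCALATE
    # Walls, and anything else that is not going to move on its own.
    return Action.REROUTE


sudden = decide("cart", tracked_for_s=0.1, stationary_for_s=0.0)
person = decide("person", tracked_for_s=4.0, stationary_for_s=0.0)
abandoned = decide("cart", tracked_for_s=3700.0, stationary_for_s=3700.0)
wall = decide("wall", tracked_for_s=10.0, stationary_for_s=10.0)
```

The detector's output, `kind`, is only one of three arguments. The other two cannot be computed from a single frame, which is why a frame-by-frame robot has nothing to call this function with.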
Here the stakes stop being funny. The detector reports:
human: 0.96
But safe interaction depends on situations that exist only across time:
HUMAN APPROACHING — distance decreasing across frames
HUMAN RETREATING — distance increasing across frames
HUMAN DISTRESSED — posture anomaly persisting beyond noise
UNSAFE INTERACTION — proximity + velocity + zone, combined over time
Approaching and retreating produce identical individual frames — the only difference is trajectory, which is to say, the only difference is history. Distressed means telling a persistent pattern apart from one odd pose. Unsafe is a compound judgment over proximity, motion, and context that no confidence score will ever express.
A humanoid that knows only “human: 0.96” knows a human exists. It does not know whether to keep going, slow down, freeze, or call for help. Everything that matters about physical AI safety lives in that gap — and not one bit of it is a detection problem.
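The approaching-versus-retreating case is easy to sketch in Python. The example below assumes some upstream component already estimates the distance to the person in meters; the `ApproachTracker` class, the one-second window, and the 0.1 m/s deadband are hypothetical choices.

```python
from collections import deque
from typing import Deque, Tuple


class ApproachTracker:
    """Classifies motion from a short history of (time, distance) samples."""

    def __init__(self, window_s: float = 1.0, deadband_mps: float = 0.1):
        self.window_s = window_s          # how much history to keep
        self.deadband_mps = deadband_mps  # rates below this count as steady
        self.history: Deque[Tuple[float, float]] = deque()

    def update(self, t: float, distance_m: float) -> str:
        self.history.append((t, distance_m))
        # Drop samples older than the window.
        while t - self.history[0][0] > self.window_s:
            self.history.popleft()
        t0, d0 = self.history[0]
        if t == t0:
            return "UNKNOWN"              # one sample: no trajectory yet
        rate = (distance_m - d0) / (t - t0)   # meters per second
        if rate < -self.deadband_mps:
            return "APPROACHING"
        if rate > self.deadband_mps:
            return "RETREATING"
        return "STEADY"


# Two people who both end up exactly 2.0 m away in the final frame.
walker_in, walker_out = ApproachTracker(), ApproachTracker()
for t, d in [(0.0, 3.0), (0.5, 2.5), (1.0, 2.0)]:
    label_in = walker_in.update(t, d)
for t, d in [(0.0, 1.0), (0.5, 1.5), (1.0, 2.0)]:
    label_out = walker_out.update(t, d)

print(label_in, label_out)  # APPROACHING RETREATING
```

The last sample is identical in both runs, a person at 2.0 m, yet the labels are opposite. The first call on each tracker returns `UNKNOWN`, which is the honest answer a single frame can give.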
Four domains, one identical structure. In every case, the system needs exactly four capabilities that sit between the model and the decision:
Four industries, one identical gap. Frame-level detections on the left can only become decision-ready situations on the right by passing through state, transitions, validation, and memory.
State — an explicit answer to “what is true right now?” The shelf is low. The pallet is in zone B. The corridor is blocked. The human is within two meters. Not the model’s latest output — the world as currently believed.
Transition — change as a first-class event. Stocked → low at 14:32. Clear → blocked two seconds ago. The decisions live here, because “what changed?” can only be answered by something that bothered to record the change.
Validation — a bouncer at the door between observation and belief. Croissants don’t vanish in 200ms. Pallets don’t teleport. Humans don’t move at 40 km/h indoors. Observations that break the rules of the scene get turned away before they corrupt anything — which is the entire difference between an alert and a false alarm.
Memory — the history that makes duration, identity, and trend even thinkable. Forty minutes of dwell. Three hours of slow stock decay. The same pallet across 72,000 frames. Without memory, every question that starts with “how long,” “since when,” or “is this the same one” is unanswerable in principle, not just in practice.
Here’s what these four have in common: none of them is a model property. You cannot fine-tune your way to memory. You cannot scale your way to state. They’re architecture — a layer that has to be built, on purpose, between vision and decision. No amount of GPU will accidentally grow you one.
Here’s the uncomfortable one-line summary of all four stories:
The answer to every question that matters is not in the frame.
It’s in the difference between frames, the persistence across frames, the plausibility given the scene, and the history of everything seen so far. Detection was never going to provide that, because detection was never designed to. Asking a detector “what should happen next?” is like asking a thermometer to run the building’s heating: it can tell you it’s 19 degrees, with great precision, forever — it has no idea whether that’s a problem.
What’s needed is a runtime that takes detections in as evidence and produces validated, persistent understanding out the other side — a loop that observes, updates state, records transitions, validates them, and remembers.
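One way to picture such a loop is the Python sketch below. It is a generic illustration under stated assumptions, not the architecture this series goes on to describe: `PerceptionRuntime`, the `Validator` signature, and the `plausible_speed` rule (built on the 40 km/h figure mentioned earlier) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

# (old_value, new_value, seconds_since_last_accepted_observation) -> plausible?
Validator = Callable[[Any, Any, float], bool]


@dataclass
class PerceptionRuntime:
    validate: Validator
    state: Dict[str, Any] = field(default_factory=dict)        # current belief
    last_seen: Dict[str, float] = field(default_factory=dict)  # time of last accepted observation
    transitions: List[Tuple[float, str, Any, Any]] = field(default_factory=list)  # memory
    rejected: int = 0

    def observe(self, t: float, key: str, value: Any) -> None:
        if key not in self.state:
            # First evidence for this key: adopt it as the initial belief.
            self.state[key], self.last_seen[key] = value, t
            return
        old = self.state[key]
        if value != old:
            if not self.validate(old, value, t - self.last_seen[key]):
                self.rejected += 1        # implausible: belief is untouched
                return
            self.transitions.append((t, key, old, value))  # record the change
            self.state[key] = value                         # update belief
        self.last_seen[key] = t


MAX_SPEED_MPS = 40 / 3.6   # 40 km/h expressed in meters per second


def plausible_speed(old: float, new: float, dt: float) -> bool:
    """Reject position jumps that imply an impossible indoor speed."""
    return dt > 0 and abs(new - old) / dt <= MAX_SPEED_MPS


rt = PerceptionRuntime(validate=plausible_speed)
rt.observe(0.0, "human_x", 2.0)
rt.observe(0.1, "human_x", 2.1)   # about 1 m/s: accepted
rt.observe(0.2, "human_x", 9.0)   # about 69 m/s: rejected
rt.observe(0.3, "human_x", 2.3)   # about 1 m/s from the last accepted value
```

After these four observations the runtime holds one belief (`human_x` is 2.3), two recorded transitions, and one rejected observation. The detector's outputs went in as evidence; the teleporting one never became part of the state.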
Detection produces observations. Decisions require a runtime that maintains state, records transitions, validates belief, and remembers history.
In Part 3, we’ll introduce OSTVAL — a cognitive runtime for machine perception.
Next in this series: “OSTVAL: A Cognitive Runtime for Machine Perception” — the architecture of the loop: Observe → State → Transition → Validate → Act → Learn, and why naming the layer matters as much as building it.
AUTHOR’S NOTE:
Machine Perception is an ongoing series exploring the architecture of cognitive systems for vision, robotics, and autonomous machines. Questions, critiques, and alternative viewpoints are encouraged.
For further discussion around machine perception, cognitive runtimes, OSTVAL, and VAOS, join the conversation in the VAOS community:
https://discord.com/invite/Mg7TztrU
https://github.com/vaos-online
The Missing Layer Between Vision and Decision was originally published in Towards AI on Medium.