{"slug": "give-my-agent-eyes", "title": "Give My Agent Eyes", "summary": "Roboflow's RF-DETR computer vision model now powers vision agents that detect objects in real time and autonomously act on that data, such as suggesting recipes from a fridge's contents or building a shopping list. The system layers perception, reasoning, action, and iteration into a closed loop, enabling AI to not only see but also make decisions and execute tasks without human coding. This development bridges the gap between object detection and autonomous decision-making, allowing users to build functional vision agents without writing code.", "body_md": "[Computer vision models](https://roboflow.com/models?ref=blog.roboflow.com) have gotten remarkably good at seeing. Models such as Roboflow's [RF-DETR](https://roboflow.com/model/rf-detr?ref=blog.roboflow.com) can detect objects in real time with state-of-the-art accuracy and return clean, structured outputs: the items in your fridge, the parts on your line, the vehicles in your lot.\n\nVision agents take that perception and put it to work. By pairing a vision model's output with reasoning, action, and iteration in a closed loop, an agent doesn't just detect the items in your fridge: it suggests recipes, builds a meal plan, and drafts your shopping list. Perception becomes the foundation for decisions.\n\nIn this blog, we will explore what [vision agents](https://blog.roboflow.com/vision-agents/) are, how they bridge the gap between detecting objects and reasoning about them, and how you can build one yourself without writing a single line of code.\n\n## What Is a Vision Agent?\n\nA vision agent is an AI system that can see, reason, act, and iterate autonomously across multiple steps to complete a goal.\n\nTo understand how vision agents work in practice, let's break them down into layers. 
Each layer handles a different part of the pipeline, turning raw visual input into meaningful actions.\n\n**Perception:** The perception layer uses a computer vision model to process images or video frames. It extracts visual features and detects objects, including their classes, locations, and confidence scores, producing a structured representation of the scene and identifying regions of interest for further analysis.\n\n**Reasoning:** The reasoning layer uses the outputs from the perception layer, such as cropped regions of interest and detection metadata, to interpret the scene. Commonly powered by a [large multimodal model](https://roboflow.com/workflows/block/lmm?ref=blog.roboflow.com) (LMM), it enables the system to understand relationships, infer context, and reason about what is happening.\n\n**Action:** The action layer executes the decision reached by the reasoning layer by interacting with external systems. This can include calling APIs, updating databases, triggering workflows, or controlling devices to produce real-world effects.\n\n**Iteration:** The iteration layer ensures the system operates continuously rather than as a one-time process. It feeds new visual input back into the pipeline, updates internal state, and repeats the full cycle as new data becomes available.\n\n## Giving the Agent Eyes\n\nA common misconception in vision agent systems is that large multimodal models (LMMs) can handle both perception and reasoning on their own. While LMMs are powerful at understanding and interpreting visual information, they are not ideal for the initial step of *locating* relevant objects in a complex scene.\n\nThis is where a dedicated computer vision model, such as Roboflow's [RF-DETR](https://roboflow.com/model/rf-detr?ref=blog.roboflow.com), which can process visual data and produce structured outputs, becomes essential.\n\n#### Why Full Images Break LLMs\n\nFeeding full, unprocessed images directly into an LMM creates several issues. 
Real-world scenes are often noisy, cluttered, and contain far more information than is necessary for a specific task. Processing the entire image at once is computationally expensive and slow, especially when scaling to real-time applications.\n\nMore importantly, LMMs lack precise spatial focus. Without explicit guidance on *where* to look, they may attend to irrelevant regions or miss critical details entirely. This can lead to inefficient reasoning and, in some cases, hallucinated outputs where the model attempts to infer details it cannot clearly observe.\n\n#### The Solution: A Fast Spotter Model (RF-DETR Layer)\n\nTo solve this, vision agents use a specialized perception layer powered by fast [object detection models](https://playground.roboflow.com/models/task/object-detection?ref=blog.roboflow.com) such as [RF-DETR](https://roboflow.com/model/rf-detr?ref=blog.roboflow.com) created by Roboflow. This model acts as a real-time spotter, scanning the full scene at high frame rates (e.g., 30fps) to identify relevant objects and regions of interest.\n\nInstead of passing the entire image to the LMM, the spotter isolates and crops only the most important areas at high resolution. 
These focused inputs are then forwarded to the reasoning model, allowing it to operate on clean, relevant visual data.\n\nThis separation of tasks makes the system significantly more efficient, accurate, and scalable.\n\n### The Reasoning Layer\n\nThe reasoning layer is powered by Large Multimodal Models (LMMs) such as [Gemini 3.1 Pro](https://playground.roboflow.com/models/google/gemini-3-1-pro?ref=blog.roboflow.com), [Claude Sonnet 4.5](https://playground.roboflow.com/models/anthropic/claude-4-5-sonnet?ref=blog.roboflow.com), or [GPT-5.5](https://playground.roboflow.com/models/openai/gpt-5-5?ref=blog.roboflow.com), which act as the system's inspector.\n\nLarge Multimodal Models are AI systems that can process and understand multiple types of input at the same time, such as images, text, and structured data. In vision agents, they are responsible for interpreting visual information rather than simply detecting objects. This makes them well-suited for higher-level reasoning tasks that require understanding context and relationships within a scene.\n\nThis layer takes the cropped image output from the Perception Layer and uses it to answer complex, context-heavy questions that require interpretation rather than simple detection. 
Because it sees only the relevant regions of the image, the inspector reasons over high-quality visual evidence instead of the full noisy scene, and can concentrate on interpreting what those detections *mean* in context.\n\nFor example, it can determine whether a medical reading is physiologically possible, or evaluate whether a metal surface passes quality standards in an [industrial inspection](https://roboflow.com/industries/industrial-manufacturing?ref=blog.roboflow.com) setting.\n\nIn this way, the reasoning layer transforms structured visual inputs into meaningful judgments, bridging the gap between detection and actionable understanding.\n\n### Orchestrating the Agent in Roboflow Workflows\n\nNow that we understand what each layer does, it is time to build the pipeline. [Roboflow Workflows](https://roboflow.com/workflows?ref=blog.roboflow.com) gives you a low-code visual canvas where you can chain together every block described above with no custom code required. Here is a step-by-step walkthrough of exactly how to do it.\n\nTo bring this all together, we are going to build a real vision agent from scratch. The use case is simple but practical: a hydration monitoring agent that detects whether a bottle is present at a desk, crops it, and uses an LMM to reason about the scene before sending an email alert. It is a straightforward example that demonstrates every layer of the vision agent architecture, and the same pipeline can be adapted to almost any physical monitoring task by swapping out the detection class, the prompt, and the action block.\n\n### Step 1: Log in to Roboflow\n\nNavigate to [Roboflow](https://roboflow.com/?ref=blog.roboflow.com) and sign in. If you do not have an account, you can sign up for free; it only takes a minute. Once you are in, make sure you have a workspace set up. 
All your workflows and models will live here.\n\n### Step 2: Create a New Workflow\n\nIn the left-hand sidebar, click the **Workflows** tab and then select **Create Workflow**.\n\nOn the next screen, choose **Build Your Own** and click **Create Workflow**. This opens the visual canvas, a drag-and-drop editor where you will assemble your pipeline block by block.\n\nThis workflow will process images one frame at a time. The input can come from uploaded images, a connected camera feed, or periodic captures depending on your deployment setup.\n\n### Step 3: Add the Object Detection Block (RF-DETR)\n\nThe first block you will add is the object detection model. This is the perception layer, the Spotter that scans each incoming frame and tells the rest of the pipeline where to look.\n\nClick the **Add Block** icon and search for the **Object Detection** block. Connect it to the Image Input block already on the canvas. Once added, click on it to open the configuration panel on the right. Under the **Model** subheading, click **Public Models**. You will see a list of pre-trained models; select [RF-DETR](https://roboflow.com/model/rf-detr?ref=blog.roboflow.com) at the top of the list. Next to it, click the **Size** dropdown and switch it to **Small (512x512)**. Then click **Select Model**.\n\n### Step 4: Add a Detections Filter\n\nThe raw detection output from RF-DETR Small will include every class it finds in the frame. Since it is pretrained on the [Microsoft COCO dataset](https://blog.roboflow.com/coco-dataset/), it can detect 80 different object classes out of the box, including person, laptop, car, chair, and many more. In a real scene it will likely detect several of these at once, but for our hydration monitoring agent the only class we care about is `bottle`.\n\nThe **Detections Filter** block is inserted immediately after the Object Detection block to discard everything else. 
Click **Add Block**, search for **Detections Filter**, and connect it to the Object Detection block.\n\nTo configure the filter, click the block to open the **Configure Detections Filtering** panel. Set **Filter By** to **Class and Confidence**. Under **Filter Detections By**, check **Object Class**, set the operator to **Include**, and enter the following in the class name field: `bottle`\n\nAny detection whose class name is not in this list is discarded before being passed downstream. The Confidence checkbox can be checked and set to a minimum score such as 0.5.\n\nBy filtering down to just the bottle, each crop produced in the next step will be tight and focused on exactly the right object, giving the reasoning layer clean, relevant visual evidence rather than a cluttered frame full of detections it does not need to reason about.\n\n### Step 5: Add a Dynamic Crop Block\n\nOnce the Detections Filter has isolated the bottle, the **Dynamic Crop** block uses its bounding box coordinates to extract a tight, high-resolution crop from the original image.\n\nClick **Add Block**, search for **Dynamic Crop**, and connect it to the Detections Filter block. No additional configuration is needed; it automatically reads each filtered detection and extracts the relevant portion of the frame.\n\nThis is the step that makes the reasoning layer accurate and cost-efficient. Instead of passing the full cluttered scene to the LMM, you are giving it a clean, focused image of exactly the object it needs to reason about. This is what prevents hallucinations and keeps inference fast.\n\n### Step 6: Add the Vision Agent Block (LMM)\n\nNow the reasoning layer kicks in. Click **Add Block**, search for **Google Gemini**, and connect it to the Dynamic Crop block.\n\nClick on **Additional Properties** and configure the following:\n\n**API Key:** Select **Roboflow Managed API Key** to get started without any additional setup. 
If you have your own [Google AI Studio](https://aistudio.google.com/?ref=blog.roboflow.com) API key, you can paste it in here instead for more control over your usage limits. Note that Gemini 3 Pro is the most capable model on the list and ideal for complex reasoning tasks. However, because of rate limiting on the shared managed key, we used **Gemini 3 Flash** for this demo, which still produces excellent results for a task like this one.\n\n**Task Type:** Set this to **Structured Output Generation**. This tells the block to expect a JSON schema rather than a free-text response, which is essential for the next step.\n\n**Output Structure:** This is where you define the JSON template you want Gemini to fill in. Paste in the following:\n\n```\n{\n  \"bottle_present\": \"true or false, whether a bottle is clearly visible in the crop\",\n  \"bottle_status\": \"being held, on desk, empty, or unknown\",\n  \"person_present\": \"true or false, whether a person or part of a person is visible near the bottle\",\n  \"confidence\": \"number between 0 and 1\",\n  \"reasoning\": \"brief explanation of the visual cues used to reach this conclusion. Base this only on what is visible in the crop\"\n}\n```\n\nAdjust the thinking level based on your needs. For simpler high-throughput tasks, a lower setting keeps latency low. For complex or high-stakes decisions, such as medical imaging or quality checks where errors are costly, set it higher to allow the model to reason more carefully before responding.\n\nBecause the model receives a focused crop rather than the raw scene, it applies its full reasoning capacity to high-quality visual evidence rather than background noise.\n\n### Step 7: Add a JSON Parser Block\n\nGemini returns its response as text. 
Even with Structured Output Generation enabled, you need the **JSON Parser** block to cleanly extract each field and make it available as an individual variable that downstream blocks can use.\n\nThink of it this way: Gemini hands you a sealed envelope with everything inside it. The JSON Parser opens that envelope and sorts the contents into labeled folders so the next block can grab exactly what it needs without digging through everything.\n\nClick **Add Block**, search for **JSON Parser**, and connect it to the Google Gemini block. In the **Expected Fields** section, enter the following:\n\n`bottle_present, bottle_status, person_present, confidence, reasoning`\n\nThese must match the key names in your Output Structure exactly, including spelling and underscores. If they do not match, the parser will return empty or false values instead of the real data.\n\n### Step 8: Add an Email Notification Block\n\nWith structured output in hand, the final step is to act on it. Click **Add Block**, search for **Email Notification**, and connect it to the JSON Parser block.\n\nConfigure it as follows:\n\n**Email Provider:** Leave this as **Roboflow Managed API Key**.\n\n**Subject:**\n\n```\nHydration Reminder: Bottle Detected at Desk\n```\n\n**Receiver Email:** Enter your email address where you want to receive the alerts.\n\n**Message:**\n\n```\n<u><strong>Hydration Reminder: Bottle Detected at Desk</strong></u><br><strong>Bottle Present:</strong> {{ $parameters.bottle_present }}<br><strong>Bottle Status:</strong> {{ $parameters.bottle_status }}<br><strong>Person Present:</strong> {{ $parameters.person_present }}<br><strong>Confidence:</strong> {{ $parameters.confidence }}<br><strong>Reasoning:</strong> {{ $parameters.reasoning }}\n```\n\n**Message Parameters:** Click the **JSON** button at the top of the block to open the Advanced JSON Editor. 
Under `message_parameters`, map each field to its corresponding JSON Parser output as follows:\n\n```\n\"message_parameters\": {\n    \"bottle_present\": \"$steps.json_parser.bottle_present\",\n    \"bottle_status\": \"$steps.json_parser.bottle_status\",\n    \"person_present\": \"$steps.json_parser.person_present\",\n    \"confidence\": \"$steps.json_parser.confidence\",\n    \"reasoning\": \"$steps.json_parser.reasoning\"\n}\n```\n\n**Cooldown Seconds:** Leave at `5` for testing. In a production deployment you would raise this to avoid being flooded with repeated alerts.\n\nEmail is the simplest action to demonstrate here, but the same pattern applies to any downstream system. You could swap this block out for a Webhook to post results to a Slack channel, a database write, or an API call to trigger another workflow entirely. The structured JSON output makes it straightforward to connect to whatever system you need.\n\n### Step 9: Test the Workflow\n\nWith all blocks connected, it is time to validate the pipeline end to end. 
In the [Roboflow Workflows](https://roboflow.com/workflows?ref=blog.roboflow.com) editor, click the **Test** button and upload an image containing a bottle.\n\nYou will see the workflow run through each block in sequence: RF-DETR detects the bottle, the Detections Filter confirms it passes, the Dynamic Crop isolates it, Gemini 3 Flash reasons about the scene, the JSON Parser extracts the structured fields, and the Email Notification fires.\n\nWithin a few seconds you should receive an email with the full hydration check results including the bottle status, whether a person is present, the confidence score, and Gemini's reasoning about what it observed in the crop.\n\n### Alternative: Building the Workflow with the Roboflow Agent\n\nIf you want a faster way to get started, [Roboflow Workflows](https://roboflow.com/workflows?ref=blog.roboflow.com) has a built-in [Agent panel](https://app.roboflow.com/solutions/chat/new?ref=blog.roboflow.com) on the left side of the editor. Instead of adding and configuring each block manually, you can just describe what you want and the agent will build the pipeline for you.\n\nOpen the Agent panel and enter the following prompt:\n\nThe agent will generate the workflow automatically. You may still need to go in and make small adjustments, but it gets you most of the way there without touching a single block manually. One thing to note is that the agent exposes every intermediate step as an output by default. Once the workflow is built, open the **Outputs** block and remove anything you do not need.\n\n## Give My Agent Eyes Conclusion\n\nVision agents represent a meaningful shift in what computer vision systems can actually do. 
By combining a fast specialist model for perception with a large multimodal model for reasoning, and wiring them together through Roboflow Workflows, you get a system that does not just detect things but understands them and acts on them.\n\nThe pipeline we walked through is the same architecture you can adapt to almost any physical task. Swap the detection model, rewrite the prompt, and change the action block. The structure stays the same. If you want to go deeper on how vision agents work conceptually, [this guide covers the full architecture in detail](https://blog.roboflow.com/vision-agents/). And if you are looking to extend your pipeline with code, check out [the best coding agents for vision AI](https://blog.roboflow.com/best-coding-agent-for-vision-ai/) to find the right tools for the job.\n\nWhether you prefer building block by block on the visual canvas or letting the Roboflow Agent generate the workflow from a single description, the tools are there and free to start with. Pick a physical task you want to automate, and start building.\n\n**Cite this Post**\n\nUse the following entry to cite this post in your research:\n\n[Yajat Mittal](/author/yajat/). (Jun 11, 2026).\nGive My Agent Eyes. Roboflow Blog: https://blog.roboflow.com/give-my-agent-eyes/", "url": "https://wpnews.pro/news/give-my-agent-eyes", "canonical_source": "https://blog.roboflow.com/give-my-agent-eyes/", "published_at": "2026-06-11 13:25:33+00:00", "updated_at": "2026-06-11 18:17:44.483122+00:00", "lang": "en", "topics": ["computer-vision", "ai-agents", "machine-learning", "artificial-intelligence", "ai-products"], "entities": ["Roboflow", "RF-DETR"], "alternates": {"html": "https://wpnews.pro/news/give-my-agent-eyes", "markdown": "https://wpnews.pro/news/give-my-agent-eyes.md", "text": "https://wpnews.pro/news/give-my-agent-eyes.txt", "jsonld": "https://wpnews.pro/news/give-my-agent-eyes.jsonld"}}