# How to Add an LLM to a Vision Pipeline (And When to Avoid It)

> Source: <https://blog.roboflow.com/adding-an-llm-to-a-vision-pipeline/>
> Published: 2026-06-15 19:38:31+00:00

Standard object detection pipelines are built to localize and classify: they excel at answering what an object is and where it is located in the image. However, they struggle to extract finer details and cannot reason about the regions they locate. When a system needs to read text, assess a condition, or make a routing decision based on what it sees, a language model becomes a necessary addition to the pipeline.

This guide covers how to chain a detector with a vision-capable LLM inside a [Roboflow Workflow](https://roboflow.com/workflows/build?ref=blog.roboflow.com), and when that combination is worth the added complexity vs. when it isn't.

## When to Add an LLM (and When Not To)

Not every pipeline benefits from a reasoning layer. Adding one unnecessarily introduces latency, cost, and architectural overhead. Base the decision on what the pipeline is actually required to produce.

### Add an LLM when the task involves:

- Text extraction and interpretation: like titles, serial numbers, labels, signage. LLMs handle stylized, degraded, or irregular typography far more reliably than classical OCR.
- Structured output: like JSON records written directly to a database, spreadsheet, or to use in further processes without manual intervention.
- Contextual judgment: like assessing damage, estimating condition, or routing detections to different actions based on visual context.
- High visual variability: when the input space is too broad to cover with a classification model.

### Stick with a specialized detection model when:

- The output is bounding boxes, counts, or real-time tracking, and no further interpretation is required.
- Latency is a hard constraint for your project. Cloud LLM calls add hundreds of milliseconds, which adds up drastically at high framerates.
- A classification model already handles the task cleanly and at a lower cost.
- [Edge](https://roboflow.com/ai/edge?ref=blog.roboflow.com) deployment makes sending payloads to a large cloud model impractical.
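
To make the latency point concrete, here is a rough back-of-the-envelope sketch. The numbers (30 fps, a 300 ms LLM round trip, three crops per frame, calls made one after another) are illustrative assumptions, not measurements from this pipeline.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, to keep up with the stream."""
    return 1000.0 / fps


def llm_time_per_frame_ms(crops_per_frame: int, llm_latency_ms: float) -> float:
    """Total LLM time per frame if each crop is sent as one sequential call."""
    return crops_per_frame * llm_latency_ms


# Assumed values for illustration only.
FPS = 30
LLM_LATENCY_MS = 300.0
CROPS_PER_FRAME = 3

budget = frame_budget_ms(FPS)
llm_time = llm_time_per_frame_ms(CROPS_PER_FRAME, LLM_LATENCY_MS)
print(f"Budget per frame: {budget:.1f} ms")
print(f"LLM time per frame: {llm_time:.1f} ms")
print(f"Over budget by roughly {llm_time / budget:.0f}x")
```

Under these assumptions the reasoning stage cannot run on every frame, which is why real-time tracking is usually better left to the detector alone.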

In this article, we will build a practical book cataloging workflow that perfectly captures all four criteria for using an LLM.

Books present a unique combination of visual and textual challenges: cover designs feature stylistic variety, titles require context-aware character recognition, physical wear demands baseline judgment, and the final output must compile into a standardized, structured record.

## The Two-Stage Architecture

The process works like this: a fast, specialized detector handles positions, and the LLM receives and processes only the cropped region relevant to the task. The full image never reaches the more expensive model, so tokens aren’t wasted on irrelevant pixels.

**Perception Layer:** An [RF-DETR](https://blog.roboflow.com/rf-detr/) model scans the frame and returns bounding box coordinates for every detected object. This is fast, cost-efficient, and accurate on custom domains.

**Reasoning Layer:** A vision-capable LLM receives the isolated crop and returns a structured JSON record: title, author, genre, language, condition notes, and a recommended next action.
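
The two layers can be sketched as a small orchestration function. This is a minimal illustration of the pattern, not Roboflow's implementation: the `Detection` class, the `run_two_stage` function, and the stub stages are all hypothetical names introduced here so the example runs without any model or API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Detection:
    # Hypothetical container; field names are assumptions for this sketch.
    label: str
    confidence: float
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


def run_two_stage(
    image,
    detect: Callable[[object], List[Detection]],
    crop: Callable[[object, Detection], object],
    reason: Callable[[object], dict],
    min_confidence: float = 0.5,
) -> List[dict]:
    """Run perception first, then reasoning on each confident crop only."""
    records = []
    for det in detect(image):
        if det.confidence < min_confidence:
            continue  # skip the expensive LLM call for weak detections
        records.append(reason(crop(image, det)))
    return records


# Stub stages standing in for the detector, the cropper, and the LLM.
def fake_detect(image):
    return [
        Detection("book", 0.92, (0, 0, 10, 10)),
        Detection("book", 0.31, (5, 5, 8, 8)),
    ]


def fake_crop(image, det):
    return det.box


def fake_reason(crop_region):
    return {"status": "SUCCESS", "region": crop_region}


records = run_two_stage("shelf.jpg", fake_detect, fake_crop, fake_reason)
print(records)
```

Only the high-confidence detection reaches the reasoning stub, which mirrors how the workflow keeps LLM calls limited to regions worth reading.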

### Step 1: Set Up Your Roboflow Workspace

Sign in to your [Roboflow dashboard](https://app.roboflow.com/?ref=blog.roboflow.com). If you're new to the platform, [create a free account](https://app.roboflow.com/login?ref=blog.roboflow.com). Then we can perform model training, workflow building, and deployment all there.

### Step 2: Source a Book Detection Dataset

Head to [Roboflow Universe](https://roboflow.com/universe?ref=blog.roboflow.com) and search for a book detection dataset. This [book detection dataset](https://universe.roboflow.com/zebra-learn/book-4abtl?ref=blog.roboflow.com) is a reliable starting point. Fork it into your workspace to import the images and annotations directly.

If you're working with your own collection, upload your photos and annotate them using [Roboflow Annotate](https://roboflow.com/annotate?ref=blog.roboflow.com), labelling each book region.

### Step 3: Train Your Detection Model

For a complete walkthrough of training an object detection model on a custom dataset, including preprocessing, augmentation, and evaluating results, refer to the [how to train RF-DETR on a custom dataset](https://blog.roboflow.com/train-rf-detr-on-a-custom-dataset/) guide. Once the model is trained and hosted on Roboflow, return here to wire it into the workflow.

### Step 4: Select Your LLM

Model selection for the reasoning layer has a meaningful impact on output quality. Different LLMs vary considerably in OCR accuracy, structured output consistency, and reliability on stylized or degraded text.

The [Roboflow Playground Model Rankings](https://playground.roboflow.com/ranking?ref=blog.roboflow.com) page is the most practical place to evaluate candidates. You can run the same image through multiple vision-capable models side-by-side, with live rankings made using community evaluations. For text-heavy tasks, the [OCR-specific rankings](https://playground.roboflow.com/ranking/ocr?ref=blog.roboflow.com) are the most relevant filter.

For this build, we're using [GPT-5-mini](https://playground.roboflow.com/models/openai/gpt-5-mini?ref=blog.roboflow.com), which produces consistent structured output, performs well on varied typography, and is fast enough to keep the pipeline responsive. However, you should test your own images in the Playground before committing, as leaderboard performance doesn't always translate directly to a specific dataset.

## Building the Workflow: Adding an LLM to a Vision Pipeline

With the detector deployed and LLM selected, open the [Roboflow Workflows](https://roboflow.com/workflows/build?ref=blog.roboflow.com) builder to connect everything.

### 1. Initialize the Workflow

Navigate to the Workflows tab in your dashboard sidebar, click Create Workflow, and select Build Your Own to start from a blank canvas.

### 2. Object Detection Block

Add an **Object Detection Model** block and connect it to your trained book detector. This scans the full input image and returns bounding box coordinates for every detected book.

### 3. Detections Filter

Real images have a lot of noise, like overlapping boxes, low-confidence predictions on partially visible spines, and background clutter. A **Detections Filter** block discards anything below your confidence threshold, ensuring the LLM only receives clean, high-probability detections to prevent unnecessary API calls on ambiguous regions.

Add a detection filter block and configure the JSON like so:

```
{
  "type": "roboflow_core/detections_filter@v1",
  "name": "detections_filter",
  "predictions": "$steps.model.predictions",
  "operations": [
    {
      "type": "DetectionsFilter",
      "filter_operation": {
        "type": "StatementGroup",
        "operator": "and",
        "statements": [
          {
            "type": "BinaryStatement",
            "left_operand": {
              "type": "DynamicOperand",
              "operand_name": "_",
              "operations": [
                {
                  "type": "ExtractDetectionProperty",
                  "property_name": "confidence"
                }
              ]
            },
            "comparator": {
              "type": "(Number) >="
            },
            "right_operand": {
              "type": "StaticOperand",
              "value": 0.5
            },
            "negate": false
          }
        ]
      }
    }
  ],
  "operations_parameters": {}
}
```
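
In plain Python terms, the configuration above amounts to keeping every detection whose confidence is at least 0.5. The sketch below shows that logic on its own; the dictionary shape of each prediction is an assumption for illustration, and `filter_by_confidence` is a hypothetical helper, not part of the Workflows runtime.

```python
def filter_by_confidence(predictions, threshold=0.5):
    """Keep predictions whose confidence is at or above the threshold."""
    return [p for p in predictions if p["confidence"] >= threshold]


# Made-up predictions to exercise the filter.
sample = [
    {"class": "book", "confidence": 0.91},
    {"class": "book", "confidence": 0.50},
    {"class": "book", "confidence": 0.22},
]

kept = filter_by_confidence(sample)
print([p["confidence"] for p in kept])
```

Because the comparator is `>=`, a detection sitting exactly on the threshold is kept. Raising the threshold trades missed books for fewer LLM calls.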

### 4. Dynamic Crop

The **Dynamic Crop** block uses the bounding box coordinates from the filter step to isolate each detected book as its own high-resolution image. This ensures that the reasoning layer receives a clean, zoomed-in crop rather than a small region of a wide-angle shelf photo.
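
Conceptually, the crop step converts each box into pixel bounds and slices the image. The sketch below assumes center-format boxes (`x`, `y` as the box center, plus `width` and `height`) and represents the image as a list of pixel rows so it runs without any imaging library. Both helper names are hypothetical, and this is not the block's actual implementation.

```python
def box_to_bounds(x, y, width, height, img_w, img_h):
    """Convert a center-format box to clamped (left, top, right, bottom) pixels."""
    left = max(0, int(round(x - width / 2)))
    top = max(0, int(round(y - height / 2)))
    right = min(img_w, int(round(x + width / 2)))
    bottom = min(img_h, int(round(y + height / 2)))
    return left, top, right, bottom


def crop_rows(image_rows, bounds):
    """Crop an image stored as a list of pixel rows."""
    left, top, right, bottom = bounds
    return [row[left:right] for row in image_rows[top:bottom]]


# A tiny 6x4 "image" where each pixel stores its own (column, row) coordinate.
image = [[(c, r) for c in range(6)] for r in range(4)]

bounds = box_to_bounds(x=3, y=2, width=4, height=2, img_w=6, img_h=4)
crop = crop_rows(image, bounds)
print(bounds)
print(len(crop), len(crop[0]))
```

Clamping to the image edges matters for books that are partly out of frame, since a box can extend past the border.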

This detect-then-filter-then-crop-then-reason sequence is a consistent pattern across Roboflow Workflows, appearing in receipt scanning, shipping label extraction, and medical device inspection pipelines.

### 5. The Reasoning Layer: GPT-5-mini

This is often the most important block: the LLM (the **Vision Agent** block) is the brain of the workflow. For our use case, add an OpenAI block, set the task to **Structured Output Generation**, connect the image input to the Dynamic Crop output, and select GPT-5-mini as the model.

Define the JSON schema to be returned for each book cover:

```
{
  "status": "SUCCESS, UNREADABLE, or FLAG_RENEWAL",
  "book_title": "The primary title of the book visible on the cover.",
  "author_name": "The identified author or authors.",
  "estimated_genre": "The thematic genre based on title context (e.g., 'Science Fiction', 'Biography').",
  "primary_language": "The dominant language of the cover text.",
  "cover_condition_notes": "Brief observations regarding visible damage, tears, or staining.",
  "next_database_action": "INSERT_RECORD, FLAG_MANUAL_QC, or REFRESH_IMAGE"
}
```

A standard OCR tool returns raw strings and leaves all decision-making downstream. This prompt delegates that judgment to the model directly, allowing unreadable covers to get flagged, damaged books to get queued for manual review, and clean scans to go to the database. That's the reasoning layer doing work that a regular detection model cannot.
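
Downstream code can then act on the model's decision with a simple dispatch. The following is a minimal sketch: the field values come from the schema above, while the queue names and the `route` function are hypothetical and would be replaced by whatever your system uses.

```python
def route(record):
    """Map a parsed record to a downstream queue name (names are hypothetical)."""
    if record.get("status") == "UNREADABLE":
        return "rescan_queue"
    action = record.get("next_database_action")
    if action == "INSERT_RECORD":
        return "catalog_db"
    if action == "REFRESH_IMAGE":
        return "rescan_queue"
    # FLAG_MANUAL_QC and any unexpected value go to a person.
    return "manual_review"


clean = {"status": "SUCCESS", "next_database_action": "INSERT_RECORD"}
damaged = {"status": "SUCCESS", "next_database_action": "FLAG_MANUAL_QC"}
blurry = {"status": "UNREADABLE", "next_database_action": "INSERT_RECORD"}

print(route(clean), route(damaged), route(blurry))
```

Sending unknown values to manual review is a deliberately conservative default, since an LLM can occasionally return a value outside the listed options.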

For guidance on improving LLM output quality in vision pipelines, the [Roboflow prompting guide](https://blog.roboflow.com/prompting-tips-for-large-language-models-with-vision/) covers strategies applicable to OCR and structured output tasks.

### 6. JSON Parser

Finally, add a **JSON Parser** block after the Vision Agent and configure it to extract:

`status, book_title, author_name, estimated_genre, primary_language, cover_condition_notes, next_database_action`

Connect the all_properties output to the final Output block, and the workflow is complete.
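
If you ever handle the model's raw text reply yourself, the equivalent parsing step looks roughly like the sketch below. It is an illustration of the idea rather than the JSON Parser block's implementation: `parse_llm_record` is a hypothetical helper, and the Markdown-fence handling is an assumption about how some models format their replies.

```python
import json

EXPECTED_FIELDS = [
    "status", "book_title", "author_name", "estimated_genre",
    "primary_language", "cover_condition_notes", "next_database_action",
]

FENCE = "`" * 3  # Markdown code-fence marker


def parse_llm_record(raw_text):
    """Parse the model's text reply into a dict limited to the expected fields.

    Returns (record, missing); fields absent from the reply are set to None.
    """
    text = raw_text.strip()
    # Some models wrap JSON in a Markdown code fence; strip it if present.
    if text.startswith(FENCE):
        text = text.strip("`")
        if text.lower().startswith("json"):
            text = text[4:]
    data = json.loads(text)
    missing = [f for f in EXPECTED_FIELDS if f not in data]
    record = {f: data.get(f) for f in EXPECTED_FIELDS}
    return record, missing


# Made-up reply for illustration.
raw = '{"status": "SUCCESS", "book_title": "Dune", "author_name": "Frank Herbert"}'
record, missing = parse_llm_record(raw)
print(record["book_title"], missing)
```

Reporting missing fields separately, instead of raising an error, lets the caller decide whether an incomplete record should be retried or sent for manual review.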

## Where to Take It Next

The two-stage architecture generalizes well beyond this simple example. The same detect, then crop, then reason pattern applies across many different fields:

- **Retail shelf auditing**: you can detect products, extract price tags and SKUs, and flag misplaced or out-of-stock items.
- **Document processing**: you can detect form regions, extract field values, and route records based on document type or completeness.
- **Industrial inspection**: you can detect components, read serial numbers or date codes, and flag parts outside acceptable parameters.

In each case, the structure remains largely constant: a fine-tuned detector handles spatial localization, and the LLM is reserved for tasks that require reading, reasoning, or judgment.

## Conclusion: When to Add an LLM to a Vision Pipeline

Whether to add an LLM comes down to what the pipeline needs to produce. Detection and counting tasks are better served by a specialized model alone, which is faster, cheaper, and simpler. When the task involves reading, reasoning, or conditional routing, a language model becomes the right tool for that stage of the pipeline.

[Create a free Roboflow account](https://app.roboflow.com/login?ref=blog.roboflow.com) and explore datasets on [Roboflow Universe](https://universe.roboflow.com/search?q=class%3Abook&ref=blog.roboflow.com) to get started.

**Cite this Post**

Use the following entry to cite this post in your research:

[Aarnav Shah](/author/aarnavshah/). (Jun 15, 2026).
How to Add an LLM to a Vision Pipeline (And When to Avoid It). Roboflow Blog: https://blog.roboflow.com/adding-an-llm-to-a-vision-pipeline/
