Don't wait for a multimodal model: you can read images now
Introduction
Have you ever had this frustrating moment? You are coding with deepseek-v4 in OpenCode, your code throws an error, you want to screenshot it and send it to DeepSeek, and then you remember that DeepSeek cannot read images.
I have to say deepseek-v4 is cheap, easy to use, and has a long context. It has already become my main coding model. But as of mid-June, DeepSeek still hasn't released a multimodal version. That means it cannot handle anything involving images, such as reading error screenshots, interpreting charts, or recreating pages from visual designs.
I am not the only one frustrated. My friends are all waiting eagerly too.
But I found a way: I developed a small OpenCode plugin called observer that lets deepseek-v4 call a multimodal agent and so read images indirectly.
After more than a month of polishing, this plugin now handles all image-related coding tasks in my daily work. Today, I will share how I built this plugin, hoping it can help you too.
The plugin code and agent definitions mentioned in this article are at the end. Feel free to grab them.
Demo of Real-World Results
Before diving into the long tutorial, you probably care most about how well this plugin works and whether it is worth your time to try. So let me show you some screenshots of the plugin in action.
1. Interpreting error stack traces
We start with the simplest task: have deepseek-v4 interpret a screenshot of an error stack trace and find the key information. I randomly picked a screenshot of an error I encountered at work:
Then in OpenCode Desktop, I sent this image to the plan agent using deepseek-v4-pro and asked it to provide a solution:
As you can see, the plan agent gave an answer based on the information in the screenshot.
2. Interpreting charts
Another multimodal use case is interpreting charts from documents. For this example, I took a screenshot of a company's annual revenue chart and tested it. I still used the plan agent with deepseek-v4-pro. For an extra challenge, I asked the agent to give some key insights on the numbers in the chart:
The agent read the numbers from the chart and provided some key insights:
3. Developing HTML pages from designs
In frontend development, the biggest demand for multimodal capability is recreating visual designs. Here I found a design with complex page elements to see if the build agent using deepseek-v4-flash could recreate the page:
Here is the recreated page:
One thing is sure: the deepseek-v4-flash model generated the frontend code, and it only took one prompt to get this result. It did not get a 100% match, but with a few more rounds of conversation, you can tweak it until it is perfect. Keep in mind that deepseek-v4-flash is dirt cheap. It is several times, sometimes as much as ten times, cheaper than multimodal models like kimi k2.6 or qwen3.7 plus. On price, they are not in the same league.
Of course, you can also crop a section of the page, mark the areas that need attention, and ask DeepSeek to adjust them, like this:
The agent perceives the marked area and gives the primary agent an adjustment plan per your request.
4. Generating HTML pages from hand-drawn sketches
Maybe you are like me and have zero design skills. No problem. We can hand-draw rough sketches. The agent can understand them. For example, in a recent project, I hand-drew a few web page design sketches:
Then deepseek-v4-flash helped me recreate the page:
Impressive, right?
Detailed Implementation Walkthrough
I know you cannot wait any longer. Let me jump straight into the implementation details.
The whole image-reading plugin consists of two parts:
- A sub-agent configured with a multimodal LLM. It runs in a separate sub-session, reads the images uploaded by the user, parses them into detailed text descriptions based on the scenario, and returns the results to the DeepSeek model in the main session.
- An OpenCode plugin that intercepts images uploaded by the user, saves them as files, and triggers the sub-agent to read the images at the right time.
In other words, the plugin is the "dispatcher," and the sub-agent is the "image reader." They work together through independent sub-sessions without messing up the main session's context.
Let me start with the design of the agent.
Designing the image-reading agent
Since the source code is at the end of this article, I won't paste it here. I will only cover the design thinking behind the agent.
This agent does the actual image reading, so make sure it uses a multimodal LLM. Here I used the kimi-for-coding/k2p6 model. The setup is simple. Just put the configuration in the frontmatter of the agent's Markdown file (that's the YAML block wrapped in three dashes at the very top of the file).
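As a rough illustration of the frontmatter described above, the sketch below assembles the text of an agent file in JavaScript. Only the model ID kimi-for-coding/k2p6 comes from this article. The field names description, mode, and model, the description text, and the helper buildAgentFile are assumptions for illustration, so check them against the real agent file at the end of the article and against OpenCode's agent documentation.

```javascript
// Builds the text of an agent Markdown file: a YAML frontmatter block
// wrapped in three dashes, followed by the prompt body.
// Field names are assumptions, not copied from the author's file.
function buildAgentFile({ description, model, body }) {
  const frontmatter = [
    "---",
    `description: ${description}`,
    "mode: subagent", // assumed: marks the agent as callable by a primary agent
    `model: ${model}`,
    "---",
  ].join("\n");
  return `${frontmatter}\n\n${body.trim()}\n`;
}

const observerFile = buildAgentFile({
  description: "Reads images and returns detailed text descriptions",
  model: "kimi-for-coding/k2p6", // the multimodal model named in the article
  body: "(the mode definitions for the image-reading agent go here)",
});

console.log(observerFile);
```

The function only returns a string, so you can inspect the result before saving it anywhere.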
Of course, it won't match a native multimodal agent. This approach, where another multimodal agent converts an image to a text description and then passes it back to DeepSeek, inevitably loses a lot of information.
To capture as many image details as possible, I broke the reading process into different scenarios. Each scenario corresponds to a different working mode, with its own trigger keywords and output format:
Mode A: Page Restoration. Keywords like: restore, HTML, page, design mockup, etc. Main task: describe the image at the pixel level with precision to help the main agent write an identical HTML page:
### Mode A: Page Restoration
**Signal words**: Reproduce, HTML, page, design mockup, screenshot reproduction, refactor, frontend, CSS, layout, slice images, implement, pixel-perfect, 1:1, precise reproduction, replicate, mobile, app screenshot, component, visual design, Figma, XD
**Task**: Describe the webpage/app interface screenshot with pixel-level precision, helping the main agent write an identical HTML/CSS page.
**Simplified mode**: If the signal words contain one of `rough`, `approximate`, `briefly describe`, `quick and simple`, only output A1 (page overview) + A5 (page text list), skip the rest of the sections.
To describe the page layout well, I also told the agent to output the page structure using ASCII art. My experiments show this ASCII approach is effective.
Mode B: Issue Location and Fix. When my screenshot has areas marked with red boxes, arrows, etc., and I ask the agent to pay special attention, this mode kicks in.
### Mode B: Issue Location and Fix
**Signal words**: issue, fix, adjust, wrong, error, bug, tweak, something off, not normal, mark, red box, arrow, circle, look here, this area, this part, skewed, misaligned, spacing, not aligned, wrong color, wrong font, overflow, overlap
**Task**: Identify the problem areas marked or pointed out in the screenshot, analyze the symptoms and possible causes, and give specific fix suggestions.
Mode C: Error Log Extraction. I use this mode a lot in daily work. For example, when a remote computer blocks the clipboard, we can take a screenshot and let the agent analyze the error stack trace in the image.
### Mode C: Error Log Extraction
**Signal words**: error, log, stack, stack trace, exception, crash, traceback, warning, fail, 500, 404, timeout, panic
**Task**: Extract the error/log text from the screenshot precisely, word for word, keeping all technical details so the main agent can locate and fix the code.
Mode D: Text/Conversation Extraction and Analysis. This is the basic OCR function: it recognizes conversation roles, text hierarchy, and content relationships.
### Mode D: Text/Conversation Extraction and Analysis (Default)
**Signal words**: extract text, OCR, recognize text, read text, conversation, copywriting, clarify, content relationships, what was said, transcribe, organize
**Task**: Extract all text from the image, clarify conversation roles, text hierarchy, content relationships, and logical structure.
Mode E: Chart Interpretation. This mode extracts data metrics from charts, as I demonstrated earlier.
### Mode E: Chart/Data Visualization Extraction
**Signal words**: charts, line charts, bar charts, pie charts, scatter plots, radar charts, heatmaps, area charts, trends, data visualization
**Task**: Extract data points, axis labels, and trend info from chart screenshots for the main agent to analyze data relationships.
I did not decide on these five modes all at once. I added them one by one as I needed them in real scenarios.
This brings up a problem: several modes share overlapping signal words, but each mode has a different output format. If there is a conflict, which mode should the agent use to read the image?
Here I used a priority method. I defined five priority levels as follows:
C (Error Log Extraction) > E (Chart Data Extraction) > B (Issue Location and Fix) > A (Page Restoration) > D (Text/Conversation Extraction and Analysis)
If I add new modes later, I will adjust the priorities.
Finally, I set Mode D as the default. When none of the previous modes match the user's request, it will use OCR mode to handle the image reading task.
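The priority rule and the default can be sketched in a few lines of JavaScript. In the real setup the multimodal agent applies this rule by following its prompt, not by running code, so treat this purely as an illustration of the logic. The keyword lists are a small subset of the signal words above (with "chart" and "trend" shortened to stems), and the names SIGNAL_WORDS, matchedModes, and resolveMode are hypothetical.

```javascript
// Illustrative subset of the signal words for each mode.
const SIGNAL_WORDS = {
  A: ["html", "page", "design mockup", "css", "layout", "figma"],
  B: ["fix", "adjust", "bug", "red box", "arrow", "misaligned", "error"],
  C: ["error", "stack trace", "exception", "traceback", "timeout", "panic"],
  D: ["extract text", "ocr", "conversation", "transcribe"],
  E: ["chart", "trend", "data visualization"],
};

// Priority order from the article: C > E > B > A > D.
const PRIORITY = ["C", "E", "B", "A", "D"];
const DEFAULT_MODE = "D";

// Returns every mode whose signal words appear in the request,
// already sorted from highest to lowest priority.
function matchedModes(request) {
  const text = request.toLowerCase();
  return PRIORITY.filter((mode) =>
    SIGNAL_WORDS[mode].some((word) => text.includes(word))
  );
}

// Picks the highest-priority match, or Mode D when nothing matches.
function resolveMode(request) {
  const matches = matchedModes(request);
  return matches.length > 0 ? matches[0] : DEFAULT_MODE;
}

console.log(resolveMode("Fix this error on the page")); // C
console.log(resolveMode("What does this picture show?")); // D
```

The first request contains signal words for modes B, C, and A, and the priority order settles it on C. Plain substring matching is crude (for example, "arrow" also matches "narrow"), which is one reason to leave the real decision to the model.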
Once the image-reading agent is designed, just drop the Markdown file into the ~/.config/opencode/agents/ directory and it will take effect.
Designing the OpenCode plugin
Compared to the agent, the plugin design is much simpler. It is just a single JS file with a little over 100 lines, and it handles two main jobs:
The first job uses the experimental.chat.system.transform hook. Every time a request is sent to the model, the plugin checks whether the model has multimodal capability. If not, it adds this prompt to the system prompt:
## Image Reading
- You should use the @observer sub-agent to read images.
- When a message like [Image saved to: <path>] appears in the conversation, call the @observer sub-agent and tell it to read the image file at that path.
The second job uses the chat.message hook. When the user pastes an image into the input box and sends a message, the image exists as base64-encoded text in the message body. The plugin intercepts this data, decodes it, and saves it as an image file in a temp directory. Then it replaces the original image content in the message with text like [Image saved to: <path>].
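The two jobs above can be sketched as plain Node.js functions. This is a minimal sketch, not the plugin itself: the article does not show the argument shapes of the two hooks, so the functions work on simple stand-ins. The { type, url } message part shape, the observer- file prefix, and the function names are all assumptions. In a real plugin, these helpers would be called from inside the experimental.chat.system.transform and chat.message hooks.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");
const { randomUUID } = require("node:crypto");

// Prompt text quoted in the article.
const IMAGE_READING_PROMPT = [
  "## Image Reading",
  "- You should use the @observer sub-agent to read images.",
  "- When a message like [Image saved to: <path>] appears in the conversation, " +
    "call the @observer sub-agent and tell it to read the image file at that path.",
].join("\n");

// Known image subtypes mapped to file extensions; anything else gets ".img".
const EXTENSIONS = { png: "png", jpeg: "jpg", gif: "gif", webp: "webp" };

// Job 1: add the prompt only when the model cannot read images itself.
function withImageReadingPrompt(systemParts, modelSupportsImages) {
  if (modelSupportsImages) return systemParts;
  return [...systemParts, IMAGE_READING_PROMPT];
}

// Job 2a: decode one base64 data URL, write it to dir, return the placeholder.
function saveDataUrlImage(dataUrl, dir) {
  const match = /^data:image\/(\w+);base64,(.+)$/s.exec(dataUrl);
  if (!match) return null; // not an inline image, leave it alone
  const ext = EXTENSIONS[match[1].toLowerCase()] ?? "img";
  const filePath = path.join(dir, `observer-${randomUUID()}.${ext}`);
  fs.writeFileSync(filePath, Buffer.from(match[2], "base64"));
  return `[Image saved to: ${filePath}]`;
}

// Job 2b: swap every inline image part for a text part holding the placeholder.
// The { type, url } part shape is a hypothetical stand-in for the real message.
function replaceImageParts(parts, dir) {
  return parts.map((part) => {
    if (part.type !== "file" || typeof part.url !== "string") return part;
    const placeholder = saveDataUrlImage(part.url, dir);
    return placeholder ? { type: "text", text: placeholder } : part;
  });
}

// Usage: one text part and one tiny inline PNG (just the 8-byte PNG signature).
const tempDir = fs.mkdtempSync(path.join(os.tmpdir(), "observer-"));
const rewritten = replaceImageParts(
  [
    { type: "text", text: "Why does this fail?" },
    { type: "file", url: "data:image/png;base64,iVBORw0KGgo=" },
  ],
  tempDir
);
console.log(rewritten[1].text);
console.log(withImageReadingPrompt(["You are a coding agent."], false).length); // 2
```

Keeping the decoding and file writing in small functions like these makes them easy to test without running OpenCode at all.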
The plugin's entire workflow looks like this:
You need to place the plugin in the ~/.config/opencode/plugin/ directory for it to work.
That covers the implementation principles of the OpenCode plugin and the agent that add image-reading capability to deepseek-v4.
Why Did I Design This Plugin?
After reading this whole walkthrough, you might ask: why not just start a new conversation and use a multimodal model to read images directly? Why go through all this trouble? What is the benefit?
I will answer from the following angles:
- Cost. We use the two deepseek-v4 models because they are so cost-effective. Switching to a multimodal model directly would defeat the purpose of using them.
- Conversation context. Tasks like interpreting an error stack trace or extracting text from a conversation screenshot happen in the middle of a coding task. The existing conversation context is crucial for finding a fix based on the error. If you start a new conversation to read the image, you lose that context.
- Context length. Different models support different context lengths. The deepseek-v4 model supports a 1M context length, while models like kimi k2.6 only support 256k. If you temporarily switch to k2.6 in the middle of a conversation to read an image, it might immediately trigger OpenCode's context compression, causing key info loss. So the best solution is to use a sub-agent that reads the image in a separate sub-session and returns the result to the main agent.
Conclusion
We are still waiting eagerly for DeepSeek to release a multimodal model. Once native multimodal capability arrives, we can retire all this fussing with plugins and sub-agents.
But until then, the observer plugin is a workable solution. And the process itself has value: once you understand OpenCode's plugin and agent orchestration, you can apply the same thinking to patch capability gaps in other models.
This mechanism is not limited to OpenCode. It works on other coding agent platforms too; you just need to adjust the implementation a bit.
The plugin code and agent definitions mentioned in this article are all at the end. Grab them there. If you think this article helps you, feel free to share it with your friends.
Further Reading
In the previous article, I added a reflection mechanism to the OpenSpec workflow, making deepseek-v4-pro match or even exceed the coding performance of the opus model. Click here to learn more:
Here is the source code for this article. Feel free to grab it: