DeepSeek-V4 Can't Read Images? I Made It Read

Don't wait for a multimodal model; you can have image reading now.

Introduction

Have you ever had that frustrating moment? You are coding with deepseek-v4 in OpenCode, your code throws an error, you want to screenshot it and send it to DeepSeek, and then you remember that DeepSeek cannot read images.

I have to say deepseek-v4 is cheap, easy to use, and has a long context. It has already become my main coding model. But as of mid-June, DeepSeek still hasn't released a multimodal version. That means it cannot do anything involving images, such as reading error screenshots, interpreting charts, or recreating pages from visual designs.

I am not the only one frustrated. My friends are all waiting eagerly too.

But I found a way: I developed a small plugin for OpenCode called observer that lets deepseek-v4 call a multimodal agent and so gain the ability to read images indirectly.

After more than a month of polishing, this plugin now handles all image-related coding tasks in my daily work. Today, I will share how I built it, hoping it can help you too.

The plugin code and agent definitions mentioned in this article are at the end. Feel free to grab them.

Demo of Real-World Results

Before diving into the long tutorial, you probably care most about how well this plugin works and whether it is worth your time to try. So let me show you some screenshots of the plugin in action.

1. Interpreting error stack traces

We start with the simplest task: have deepseek-v4 interpret a screenshot of an error stack trace and find the key information.

I randomly picked a screenshot of an error I encountered at work.

Then, in OpenCode Desktop, I sent this image to the plan agent running deepseek-v4-pro and asked it to provide a solution.

As you can see, the plan agent gave an answer based on the information in the screenshot.

2. Interpreting charts

Another multimodal use case is interpreting charts from documents. For this example, I took a screenshot of a company's annual revenue chart and tested it.

I again used the plan agent with deepseek-v4-pro. For an extra challenge, I asked the agent to give some key insights on the numbers in the chart.

The agent read the numbers from the chart and provided some key insights.

3. Developing HTML pages from designs

In frontend development, the biggest demand for multimodal capability is recreating visual designs.

Here I found a design with complex page elements to see whether the build agent running deepseek-v4-flash could recreate the page.

Here is the recreated page.

One thing is sure: the deepseek-v4-flash model generated the frontend code, and it took only one prompt to get this result. It did not get a 100% match, but with a few more rounds of conversation, you can tweak it until it is perfect.

Keep in mind that deepseek-v4-flash is dirt cheap. It costs a fraction, in some cases about a tenth, of what multimodal models like kimi k2.6 or qwen3.7 plus cost. They are not in the same league.

Of course, you can also crop a section of the page, mark the areas that need attention, and ask DeepSeek to adjust them.

The agent perceives the marked area and gives the primary agent an adjustment plan that follows your request.

4. Generating HTML pages from hand-drawn sketches

Maybe you are like me and have zero design skills. No problem. We can hand-draw rough sketches, and the agent can understand them.
For example, in a recent project, I hand-drew a few web page design sketches.

Then deepseek-v4-flash helped me recreate the page.

Impressive, right?

Detailed Implementation Walkthrough

I know you cannot wait any longer. Let me jump straight into the implementation details.

The whole image-reading plugin consists of two parts:

- A sub-agent configured with a multimodal LLM. It runs in a separate sub-session, reads the images uploaded by the user, parses them into detailed text descriptions based on the scenario, and returns the results to the DeepSeek model in the main session.
- An OpenCode plugin that intercepts images uploaded by the user, saves them as files, and triggers the sub-agent to read the images at the right time.

In other words, the plugin is the "dispatcher," and the sub-agent is the "image reader." They work together through independent sub-sessions, so the main session's context stays clean.

Let me start with the design of the agent.

Designing the image-reading agent

Since the source code is at the end of this article, I won't paste it here. I will only cover the design thinking behind the agent.

This agent does the actual image reading, so make sure it uses a multimodal LLM. Here I used the kimi-for-coding/k2p6 model.

The setup is simple. Just put the configuration in the frontmatter of the agent's Markdown file (that's the YAML block wrapped in three dashes at the very top of the file).

Of course, it won't match a native multimodal agent. This approach, where another multimodal agent converts an image to a text description and then passes it back to DeepSeek, inevitably loses a lot of information.

To capture as many image details as possible, I broke the reading process into different scenarios. Each scenario corresponds to a different working mode, with its own trigger keywords and output format.

Mode A: Page Restoration. Keywords like: restore, HTML, page, design mockup, etc.
Main task: describe the image at the pixel level with precision to help the main agent write an identical HTML page:

Mode A: Page Restoration
Signal words: Reproduce, HTML, page, design mockup, screenshot reproduction, refactor, frontend, CSS, layout, slice images, implement, pixel-perfect, 1:1, precise reproduction, replicate, mobile, app screenshot, component, visual design, Figma, XD
Task: Describe the webpage/app interface screenshot with pixel-level precision, helping the main agent write an identical HTML/CSS page.
Simplified mode: If the signal words contain one of "rough", "approximate", "briefly describe", or "quick and simple", only output A1 page overview + A5 page text list, and skip the rest of the sections.

To describe the page layout well, I also told the agent to output the page structure using ASCII art. My experiments show this ASCII approach is effective.

Mode B: Issue Location and Fix. When my screenshot has areas marked with red boxes, arrows, etc., and I ask the agent to pay special attention to them, this mode kicks in.

Mode B: Issue Location and Fix
Signal words: issue, fix, adjust, wrong, error, bug, tweak, something off, not normal, mark, red box, arrow, circle, look here, this area, this part, skewed, misaligned, spacing, not aligned, wrong color, wrong font, overflow, overlap
Task: Identify the problem areas marked or pointed out in the screenshot, analyze the symptoms and possible causes, and give specific fix suggestions.

Mode C: Error Log Extraction. I use this mode a lot in daily work. For example, when a remote computer blocks the clipboard, we can take a screenshot and let the agent analyze the error stack trace in the image.
Mode C: Error Log Extraction
Signal words: error, log, stack, stack trace, exception, crash, traceback, warning, fail, 500, 404, timeout, panic
Task: Extract the error/log text from the screenshot precisely, word for word, keeping all technical details so the main agent can locate and fix the code.

Mode D: Text/Conversation Extraction and Analysis. This is the basic OCR function. It just recognizes the conversation roles, text hierarchy, and content relationships.

Mode D: Text/Conversation Extraction and Analysis (Default)
Signal words: extract text, OCR, recognize text, read text, conversation, copywriting, clarify, content relationships, what was said, transcribe, organize
Task: Extract all text from the image, clarify conversation roles, text hierarchy, content relationships, and logical structure.

Mode E: Chart Interpretation. This mode extracts data metrics from charts, as I demonstrated earlier.

Mode E: Chart/Data Visualization Extraction
Signal words: charts, line charts, bar charts, pie charts, scatter plots, radar charts, heatmaps, area charts, trends, data visualization
Task: Extract data points, axis labels, and trend info from chart screenshots for the main agent to analyze data relationships.

I did not decide on these five modes all at once. I added them one by one as I needed them in real scenarios.

This brings up a problem: several modes share overlapping signal words, but each mode has a different output format. If there is a conflict, which mode should the agent use to read the image?

Here I used a priority method. I defined five priority levels as follows:

1. C: Error Log Extraction
2. E: Chart Data Extraction
3. B: Issue Location and Fix
4. A: Page Restoration
5. D: Text/Conversation Extraction and Analysis

If I add new modes later, I will adjust the priorities.

Finally, I set Mode D as the default. When none of the other modes match the user's request, the agent uses OCR mode to handle the image reading task.
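To make the priority rule concrete, here is a minimal JavaScript sketch of it. This is not code from the plugin or the agent: as far as this article describes it, the rule lives in the agent's Markdown definition and the model applies it when reading the prompt. The names SIGNAL_WORDS, PRIORITY, matchingModes, and selectMode are hypothetical, the word lists are a small subset of the ones above, and I am assuming the priority list is ordered from highest to lowest.

```javascript
// Hypothetical illustration of the priority rule; not taken from the plugin.
// Each mode keeps a few of its signal words (lowercased subset of the lists above).
const SIGNAL_WORDS = {
  A: ["html", "page", "design mockup", "pixel-perfect", "figma"],
  B: ["fix", "bug", "error", "red box", "arrow", "misaligned"],
  C: ["error", "stack trace", "exception", "traceback", "timeout"],
  D: ["ocr", "extract text", "transcribe"],
  E: ["chart", "pie chart", "trend", "heatmap"],
};

// Assumption: the list is ordered from highest to lowest priority.
const PRIORITY = ["C", "E", "B", "A", "D"];
const DEFAULT_MODE = "D";

// Return every mode with at least one signal word in the request.
// Naive substring matching: "page" would also match "homepage".
function matchingModes(request) {
  const text = request.toLowerCase();
  return Object.keys(SIGNAL_WORDS).filter((mode) =>
    SIGNAL_WORDS[mode].some((word) => text.includes(word))
  );
}

// Pick the highest-priority matching mode, or fall back to the default.
function selectMode(request) {
  const matches = matchingModes(request);
  const winner = PRIORITY.find((mode) => matches.includes(mode));
  return winner || DEFAULT_MODE;
}

console.log(selectMode("Please fix the error in this stack trace"));
console.log(selectMode("What is in this picture?"));
```

In the first call, the word "error" belongs to both Mode B and Mode C, so the request matches both and the priority order breaks the tie in favor of C. The second call matches nothing and falls back to Mode D.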
Once the image-reading agent is designed, just drop the Markdown file into the ~/.config/opencode/agents/ directory and it will take effect.

Designing the OpenCode plugin

Compared to the agent, the plugin design is much simpler. It is just a single JS file with a little over 100 lines, and it handles two main jobs:

Use the experimental.chat.system.transform hook. Every time a request is sent to the model, it checks if the model has multimodal capability. If not, it adds this prompt to the system prompt:

Image Reading
- You should use the @observer sub-agent to read images.
- When a message like Image saved to: