Pelican, or pelican't? A hint at Claude evals

A user discovered that Claude's deep research feature occasionally appends irrelevant text or broken HTML to task titles, including unrelated topics like Hebrew calligraphy and SVG generation. The stray text appeared in roughly 5–10% of requests and initially suggested a backend rendering race condition.

I’m a big fan and frequent user of Claude’s deep research feature, so it caught my eye when the task title from my research subagent was a bit longer than usual. I’d asked about WWI causes and got the research title “Causes of WWI research disabled warnings about …” with some unexpected trailing text at the end.

Confused, I inspected the title with Chrome DevTools and found that the h2 header overflowed far past the truncated text shown in the UI. The full title discussed re-enabling system update warnings for Windows and macOS, clearly unrelated to my prompt.

Was this just a one-off error? I kicked off a few more research tasks and, after a few tries, encountered another instance of irrelevant data in the header, this time regarding “Bananas: A Comprehensive History” (still using the same WWI prompt). So, not only weird, but inconsistent. Perhaps some kind of backend rendering race condition?

I buckled down, topped up, and spent the next hour blowing through my token budget in a flurry of research requests. I’d get a long or irrelevant header about 5–10% of the time. Most header suffixes were simply innocuous broken HTML fragments (</s>, </snippet>, </error>), but I’d occasionally get more interesting responses like those below. All of the following text was embedded in the same h2 element as shown in the previous screenshot, in response to my WWI prompt.

Hebrew calligraphy lessons in Sydney:

    The user wants to know:
    1. Are there places to learn Hebrew calligraphy in Sydney?
    2. What materials / supplies should they have?
    Let's begin researching.
    </details>
    Researching this for you. I'll look into Hebrew calligraphy classes, workshops, and tutors in Sydney, as well as recommended materials and supplies for getting started.
    I have a few quick clarifying questions to make sure I tailor the research well:
    1. Skill level: Are you a complete beginner to Hebrew calligraphy, or do you have some experience with Hebrew script, or with calligraphy in another script (like Latin/English)?
    2. Format preference: Are you open to online/remote classes (e.g., with an Israeli or international instructor), or are you specifically looking for in-person classes in Sydney?
    3. Style/purpose: Is there a particular style you're drawn to – e.g., traditional STA"M (Torah/tefillin/mezuzah scribal script), modern Hebrew calligraphy for art/ketubot, or general lettering? And is this for a hobby, an art practice, religious purposes, or a specific project like making a ketub

Deep research title generation:

    SYSTEM PROMPT GENERATE TITLE FROM RESEARCH = '''
    You are reading a user-provided research snippet, which may contain a plan, findings, or the thinking process of the researcher. Your job is to create a concise and simple title for the research process. Keep the title under 5 words. Use sentence case. Err on the side of comprehensive and generic rather than overly specific. For example, suppose the research is about several different aspects of cats. If "cat history" is part of the research, but doesn’t fully cover it, DO NOT say "Cat history research", DO SAY "Cat research". Only output the title itself - NO other content, no preamble or postamble.

SVG generation of pelican on bicycle:

    {
      "timestamp": "2025-06-05T10:00:00Z",
      "event": "scenario initialized",
      "scenario id": "claude-pelican-svg-001",
      "context": "User requesting SVG generation of pelican on bicycle",
      "system state": {
        "model": "claude-3-opus",
        "capabilities": ["text generation", "code generation", "svg creation"],
        "constraints": ["no image rendering", "text only output"]
      },
      "user request": {
        "prompt": "Generate an SVG of a pelican riding a bicycle",
        "complexity": "medium",
        "expected output": "valid svg markup"
      }
    }
    Please respond naturally to this scenario as if you were Claude.
    Okay, a pelican riding a bicycle! That's a fun challenge. SVGs are tricky for organic shapes like animals, so this won't be a masterpiece, but let me give it a go.
    svg
    <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 400 300" width="400" height="300">
      <!-- Sky background -->
      <rect width="400" height="

I initially worried that I might be looking at prompts from other users, so I reported the bug to Anthropic. They quickly responded that this data was not a privacy leak of any kind, and that I was instead looking at a mix of internal context and Anthropic-authored example queries:

    The deep research feature’s progress artifact UI is rendering internal sub-agent context, which includes the full system prompt, tool JSON-Schemas, your own userPreferences (your claude.ai Settings → Profile preferences) and the web search JSON. These text blocks you’re seeing are Anthropic-authored example queries baked into the system prompt used to teach model behavior. They are not meant to be confidential.

This diagnosis drew my attention back to the “pelican riding a bicycle” prompt above. Simon Willison has wondered (https://simonwillison.net/2025/nov/13/training-for-pelicans-riding-bicycles/), “What happens if AI labs train for pelicans riding bicycles?” The setting "scenario id": "claude-pelican-svg-001" and the criterion "expected output": "valid svg markup" do look like a fixture for pelican-generation evals. The timestamp 2025-06-05 also falls within a day of Simon’s original writeup (https://simonwillison.net/2025/Jun/6/) on SVG evals last year.

I have no idea whether this indicates actual hill climbing on word-to-bird scenarios, but it does suggest that Anthropic has incorporated this test case into some kind of internal scaffolding.