Stop reading logs: Debugging Ray on Anyscale with Agent Skills

Anyscale released agent skills for debugging Ray workloads, including /anyscale-platform-fix and /anyscale-platform-inspect, which automate troubleshooting of failing pipelines. A user demonstrated fixing a vLLM KV cache sizing issue on a 24 GB L4 GPU by running a single prompt, with the agent performing static validation, live inspection, and configuration edits without manual log reading.

Aydin Abiar and Philip Wang | June 11, 2026

My Ray Data and vLLM pipeline that captions video frames won’t start. vLLM bails during engine initialization: it can’t fit a KV cache for Qwen2.5-VL-7B on the 24 GB L4, and the number it prints for free cache memory is negative.

I have a pipeline.py, a job spec pinning L4 GPU workers, a 7B vision-language model, and a line that reads MAX_MODEL_LEN = 32768. One of those numbers is wrong for the others, but which, and to what value? The startup log dead-ends in an allocator error, the vLLM memory docs are open in one browser tab, and a half-remembered GitHub issue about KV-cache sizing is open in another.

The actual debugging (work out how many tokens the workload really needs, how much the card can hold, and which knob to turn) is mechanical work. It just takes hours of doc-diving. Most of that work shouldn’t need a human.

Anyscale just shipped a set of agent skills for exactly this, available through its CLI for use with the coding agent of your choice: Claude Code, Cursor, Codex, and others.

Today I want to show, not tell. One file, one prompt, and /anyscale-platform-fix takes the pipeline from won’t-start to a validated production job. What follows is my actual session, condensed.

What Anyscale shipped

Anyscale ships two families of skills. Workload skills write the Ray and Anyscale code itself: a Ray Train loop, a Ray Serve deployment, a Ray Data pipeline.
Platform skills run that code on Anyscale, inspect it when something goes wrong, and fix it. This post is about the platform skills, and specifically about the two debugging skills, /anyscale-platform-inspect and /anyscale-platform-fix, that you reach for when a workload misbehaves.

Three user-invocable platform skills, ready to drop into your coding agent:

- /anyscale-platform-fix: debug and fix failing workloads end-to-end. The orchestrator.
- /anyscale-platform-inspect: read-only diagnosis. Static validation of local directories, live inspection of running workloads via the Anyscale API, log/event/metric retrieval, source-grounded reports.
- /anyscale-platform-run: generate Anyscale configs and execute. Workspace setup, service deploy, job submit.

Install once, via the Anyscale CLI:

    anyscale skills install

See the install guide (https://docs.anyscale.com/agent-skills/install) for prerequisites. Open your coding agent in any project and the three slash commands are available.

Back to that pipeline. I’m sitting in the repo where pipeline.py lives, and my entire prompt is one line:

    /anyscale-platform-fix pipeline.py

From here on, every step below is the agent’s. I type that line, answer when the agent stops to ask (the skill never invents a choice that carries real trade-offs; it surfaces it), and give a few quick go-aheads in between. Everything else (inspect, reproduce, diagnose, edit, re-run, validate) the agent does on its own, following the fixed loop the skill defines.

1. Static validation first

Before touching a cluster, the agent runs a static pass over the project, calling /anyscale-platform-inspect on the directory pipeline.py lives in.

By the way: /anyscale-platform-inspect is the read-only half of the toolkit. Hand it a directory, a job/service/workspace ID, a console URL, or even a pasted error string, and it pulls logs, events, metrics, and state, grounds what it finds against Ray and Anyscale source, and hands back a structured report.
It never edits or deploys anything; that’s /anyscale-platform-fix’s job.

This time something does come back, just not a blocking bug. The agent grounds the Ray Data LLM API against Ray 2.54.0 source and confirms the code is sound: the ray.data.llm imports resolve, vLLMEngineProcessorConfig is used correctly, and the job YAML and the Dockerfile validate. What it can’t settle without a live run are three runtime risks it flags anyway: a numpy==2.x pin that could break torch and vLLM’s ABI at install; an apt package, libgl1-mesa-glx, that’s been renamed since the base image was cut; and, most important, MAX_MODEL_LEN = 32768 against a 24 GB L4, which looks like it’ll leave no room for a KV cache. Static analysis can name the risk; only a run settles it. So the agent surfaces the three and asks me how to proceed.

2. Push to a workspace and reproduce

I pick “validate in workspace”: static analysis can only flag the risks, but a run confirms them. The agent moves into the execution skill, builds the image, and brings up a debug-video-frame-captioning workspace on an L4 to run python pipeline.py, calling /anyscale-platform-run to do it.

Another aside: /anyscale-platform-run is the execution half. It turns a workload directory into the right Anyscale artifact (workspace.yaml, service.yaml, or job.yaml), runs the anyscale command, and waits for the workload to reach the state it should. /anyscale-platform-fix leans on it for every push, re-run, and final deploy.

The build finishes, the workspace comes up, and the run gets as far as constructing the vLLM engine, where it stops cold. vLLM’s startup profiler does its pass and finds that, after loading Qwen2.5-VL-7B’s weights, the KV cache it would need for MAX_MODEL_LEN = 32768 doesn’t fit: available cache memory comes out negative, about −0.24 GiB. The engine declines to start.
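
To see why a 32768-token context is a tight fit on this card, it helps to put rough numbers on the KV cache. What follows is a back-of-the-envelope sketch, not vLLM’s profiler. The helper names are made up for illustration, and the model shape (28 layers, 4 KV heads, head dimension 128, 2-byte cache entries) is my assumption for the language-model half of Qwen2.5-VL-7B, so check it against the model’s config.json before relying on it. Only the −0.24 GiB figure comes from the run itself; the 1.0 GiB budget is hypothetical.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_needed_gib(max_model_len: int, per_token_bytes: int) -> float:
    """Cache needed to hold a single sequence of max_model_len tokens."""
    return max_model_len * per_token_bytes / GIB


def max_len_that_fits(available_gib: float, per_token_bytes: int) -> int:
    """Longest single sequence whose KV cache fits in the available memory."""
    if available_gib <= 0:
        return 0  # nothing fits when the budget is already negative
    return int(available_gib * GIB) // per_token_bytes


# ASSUMED model shape (not taken from the post); verify against config.json.
per_token = kv_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128)

print(per_token)                              # 57344 bytes, i.e. 56 KiB per token
print(kv_cache_needed_gib(32768, per_token))  # 1.75
print(max_len_that_fits(-0.24, per_token))    # 0: the budget reported by the run
print(max_len_that_fits(1.0, per_token))      # 18724, for a hypothetical 1.0 GiB budget
```

Under those assumptions a single 32768-token sequence needs about 1.75 GiB of cache, so a budget that is already below zero after weights and profiling cannot hold it, whatever the batch size. The sketch ignores the vision encoder and activation memory; it only shows how the ceiling on context length moves with whatever memory is actually left for the cache.
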
This is the cleanest kind of failure to catch: not a mid-run OOM an hour in, but the engine refusing to initialize at all, in the first couple of minutes, every time.

Worth a note before the diagnosis: /anyscale-platform-inspect and /anyscale-platform-run are user-invocable on their own, and /anyscale-platform-fix just chains them with the loop and the asks in between. Want a one-off postmortem? /anyscale-platform-inspect prodjob