{"slug": "from-30-minutes-to-8-how-llm-mode-reflect-works", "title": "From 30 Minutes to 8: How LLM-Mode Reflect Works", "summary": "A developer reduced the reflect phase of the `akm improve` pipeline from 35 minutes to 8 minutes by eliminating unnecessary overhead. The new LLM mode makes direct HTTP calls to the LLM endpoint instead of spawning full agent subprocesses, cutting per-call latency from 30 seconds to 6-10 seconds. The approach also enables multi-turn self-refine, where the model iterates on its own output within a single call, improving draft quality without additional round trips.", "body_md": "This is part thirteen in a series about managing the growing pile of skills, scripts, and context that AI coding agents depend on. [Part ten](https://dev.to/itlackey/the-improvement-loop-how-akm-keeps-your-agent-sharp-2d4d) covered the full improve pipeline: all five phases and how they connect. [Part fourteen](https://dev.to/itlackey/your-agent-has-a-memory-that-runs-while-you-sleep-20oh) covers what 48 runs per day looks like in practice, including hardware benchmarks and the reliability bugs that surface at that frequency.\n\nThe reflect pass inside `akm improve` has three execution modes. Most installs are still running the slowest one.\n\nAgent mode, the original, spawns an opencode or claude subprocess for each reflect call. The subprocess starts cold, acquires a session, assembles context, makes its LLM call, and exits. That cold-start overhead is real: each call takes approximately 30 seconds on a quiet machine. Run `akm improve` against a 69-ref stash and the reflect phase alone costs about 35 minutes.\n\nSDK mode eliminated the subprocess. The reflect call runs in-process, cutting per-call latency to 10–15 seconds. A 69-ref run drops to 12–17 minutes, which is better but still bounded by round-trip overhead that the reflect task does not actually need.\n\nLLM mode removes that overhead entirely. 
The context for reflect is statically pre-assembled: no live tool calls, no file reads, no external context needed. A direct HTTP call to the LLM endpoint is sufficient, and it costs 6–10 seconds per call. A 69-ref run completes in 8–10 minutes.\n\n| Mode | Per-call latency | 69-ref run |\n|---|---|---|\n| agent (CLI subprocess) | ~30s | ~35 min |\n| sdk (in-process) | ~10–15s | ~12–17 min |\n| llm (direct HTTP) | ~6–10s | ~8–10 min |\n\nThe 3–4× end-to-end improvement comes from eliminating overhead that was never necessary for what reflect does.\n\nThe reflect pass takes a stash asset, examines its current content, and proposes a refined version. The inputs are fixed before the pass starts: the asset text, its metadata, and the improvement prompt. Nothing changes mid-call. No files need to be opened. No search queries need to fire. No external context needs to be pulled in.\n\nAgent mode was useful when akm's improve pipeline was first built: the agent subprocess was already the primary execution model, and reflect rode along. But the properties that make agents valuable (tool use, live context access, multi-step reasoning over changing state) are not exercised by reflect. Spawning a full agent process for a stateless inference call adds 20+ seconds of overhead with no quality benefit.\n\nLLM mode makes the execution match the task: assemble the context once, make one HTTP call, get the result.\n\nLLM mode adds a capability that agent mode does not have: multi-turn self-refine.\n\nWhen reflect runs in LLM mode, it sends the initial draft back as an assistant turn. The model sees its own prior output and the refine prompt together in the same context window. This is a standard multi-turn pattern for iterative generation: the model can catch inconsistencies, tighten reasoning, and improve the draft without requiring a second top-level call.\n\nAgent mode, by contrast, passes context forward through prompt text. Each subprocess run starts fresh. 
There is no conversation history to reason against.\n\nThe practical difference shows on longer or more complex assets, where a single forward pass produces a draft with inconsistencies the model catches immediately when it sees its own output. Multi-turn self-refine handles this inside the single reflect call.\n\nFor providers that advertise `supportsJsonSchema: true` in their profile config, LLM mode requests structured JSON output. The response is validated against the reflect output schema before being accepted as a proposal.\n\nThis eliminates a class of parse failures that occurs when a model returns well-formed prose but with section markers or formatting that does not align with the expected output shape. The model knows the schema before it generates the response, so the output conforms rather than being post-hoc parsed.\n\nAgent mode produces unstructured text that the pipeline parses with heuristics. LLM mode with `supportsJsonSchema: true` eliminates the heuristics.\n\nLLM mode requires Config v2 (`configVersion: \"0.8.0\"`). If you have not migrated yet:\n\n```\n# Preview the transformation\nakm config migrate --dry-run\n\n# Apply (writes a timestamped backup first)\nakm config migrate\n```\n\nWith v2 in place, add a named LLM profile and point the reflect process at it:\n\n```\n{\n  \"configVersion\": \"0.8.0\",\n  \"profiles\": {\n    \"llm\": {\n      \"openai-mini\": {\n        \"endpoint\": \"https://api.openai.com/v1/chat/completions\",\n        \"model\": \"gpt-4o-mini\",\n        \"apiKey\": \"${OPENAI_API_KEY}\",\n        \"supportsJsonSchema\": true\n      }\n    },\n    \"improve\": {\n      \"default\": {\n        \"processes\": {\n          \"reflect\": { \"mode\": \"llm\", \"profile\": \"openai-mini\" }\n        }\n      }\n    }\n  },\n  \"defaults\": { \"llm\": \"openai-mini\" }\n}\n```\n\nThat is the complete change. 
On the next `akm improve` run, reflect dispatches HTTP calls to the `openai-mini` profile instead of spawning subprocesses. The proposal queue, review workflow, and everything downstream are unchanged.\n\nThe profile config is an endpoint and a model name. Nothing in the LLM mode path is OpenAI-specific: it issues standard chat completions requests. Any OpenAI-compatible server works, including LM Studio running locally.\n\nTo point reflect at a local LM Studio instance:\n\n```\n{\n  \"configVersion\": \"0.8.0\",\n  \"profiles\": {\n    \"llm\": {\n      \"local-reflect\": {\n        \"endpoint\": \"http://192.168.1.100:1234/v1/chat/completions\",\n        \"model\": \"your-local-model-name\",\n        \"supportsJsonSchema\": false\n      }\n    },\n    \"improve\": {\n      \"default\": {\n        \"processes\": {\n          \"reflect\": { \"mode\": \"llm\", \"profile\": \"local-reflect\" }\n        }\n      }\n    }\n  }\n}\n```\n\nSet `supportsJsonSchema: false` unless you have confirmed that the local model and LM Studio version support structured output. Most local models handle the reflect task correctly through standard chat completions without schema enforcement; the output is smaller and more predictable than consolidation plans, so parse failures are rare.\n\nFor a machine running a 9B model on an RTX 4060 Ti, LLM mode reflect benchmarks in the 8–12 second range per call, comparable to the cloud figures in the table above, with no API costs and no data leaving your network.\n\nLLM mode is appropriate for reflect because reflect has static inputs. Other improve processes do not share that property.\n\nStay on agent mode when the process needs live tool calls. If you have a custom improve workflow that reads files, calls `akm search`, or pulls external context mid-run, that process requires an agent that can execute tools. 
LLM mode does not have tool dispatch: it is a direct HTTP call to a completions endpoint, nothing more.\n\nStay on agent mode when the reflect task for a specific asset type requires context that is assembled dynamically: search results, graph lookups, or file reads that depend on the asset's content. Those lookups require a running agent.\n\nThe standard reflect pass (refining an existing asset based on its content and metadata) does not require either of these. LLM mode is the right default for it.\n\nA 69-ref `akm improve` run that used to block for 35 minutes now completes in 8–10 minutes. The reflect proposals are of the same quality, and in some cases better, because multi-turn self-refine catches first-draft inconsistencies. Structured output for cloud providers eliminates parse failures that previously required manual retries.\n\nThe change is a config update:\n\n```\n# Migrate config if still on v1\nakm config migrate\n\n# Then add the llm profile + reflect process entry (see snippet above)\n# Preview what the next run would process without writing anything\nakm improve --dry-run\n```\n\nThe next improve run after that shows reflect calls completing in the 6–10 second range instead of 30.\n\nLLM mode reflect is available in akm 0.8.0. The full configuration reference is in [docs/configuration.md](https://github.com/itlackey/akm/blob/main/docs/configuration.md). The Config v2 key mapping is in the [v0.7 to v0.8 migration guide](https://github.com/itlackey/akm/blob/main/docs/migration/v0.7-to-v0.8.md#config-v2-migration-reflect-multi-mode).\n\nFor a broader view of the improve pipeline (all five phases, scheduling, and how reflect feeds the downstream consolidation and distill passes), see [The Improvement Loop: How akm Keeps Your Agent Sharp](https://dev.to/itlackey/the-improvement-loop-how-akm-keeps-your-agent-sharp-2d4d). 
For debugging improve runs when something goes wrong (stale DB entries, hallucinated merge plans, pre-flight filters), see [Your Agent Has a Memory That Runs While You Sleep](https://dev.to/itlackey/your-agent-has-a-memory-that-runs-while-you-sleep-20oh).", "url": "https://wpnews.pro/news/from-30-minutes-to-8-how-llm-mode-reflect-works", "canonical_source": "https://dev.to/itlackey/from-30-minutes-to-8-how-llm-mode-reflect-works-3eef", "published_at": "2026-06-04 00:31:15+00:00", "updated_at": "2026-06-04 00:42:31.690396+00:00", "lang": "en", "topics": ["ai-agents", "large-language-models", "ai-tools", "ai-infrastructure", "mlops"], "entities": ["akm", "opencode", "claude", "LLM"], "alternates": {"html": "https://wpnews.pro/news/from-30-minutes-to-8-how-llm-mode-reflect-works", "markdown": "https://wpnews.pro/news/from-30-minutes-to-8-how-llm-mode-reflect-works.md", "text": "https://wpnews.pro/news/from-30-minutes-to-8-how-llm-mode-reflect-works.txt", "jsonld": "https://wpnews.pro/news/from-30-minutes-to-8-how-llm-mode-reflect-works.jsonld"}}