llm-research-manager


A pattern for maintaining the live epistemic state of long-horizon research projects using agents and LLMs. An LLM Research Manager is a deterministic project-state layer that helps agents and humans keep track of hypotheses, evidence, decisions, questions, risks, contradictions, and next-decision pressure over time. This idea file is meant to be copied into your LLM agent (OpenAI Codex, Claude Code, Google Antigravity, OpenCode, etc.). It communicates the high-level idea; your agent builds out the specifics in collaboration with you.

Most people's experience using LLMs to understand a long-running project looks like RAG: point the agent at the files, let it find the relevant pieces, and ask it to generate an answer. This works, but every session is still a fresh reconstruction. The agent can describe what the project looks like now, but it does not reliably know how the hypotheses got there, which evidence moved them, what became stale, or which decisions still depend on old assumptions. A wiki helps organize knowledge, but a research project needs more than organized files. It needs memory for how the reasoning changed.

LLM Research Manager is a pattern for maintaining that live research state. Instead of asking the agent to keep one giant summary in its head, you give it an explicit schema for hypotheses, evidence, decisions, questions, risks, contradictions, support, opposition, dependency, and drift. When new material arrives, the agent reads it in that frame: what does this touch, what does it strengthen, what does it weaken, what question did it answer, what decision might need to reopen? The goal is not to manage every idea in the project. The goal is to keep the important deltas from disappearing.

The useful artifact is the maintained state: hypotheses tied to the evidence that moves them, decisions tied to the hypotheses they rely on, questions kept visible until they are resolved, and risks or contradictions no longer buried in prose. Each refresh gives the next session a compact account of what changed, why it changed, and where the project is now under pressure. The state compounds because it records movement, not just content.

You do not usually write this state by hand. You guide the project, provide sources, challenge interpretations, and decide what is worth keeping. The agent does the translation work: reading sources, drafting records, linking evidence, updating questions, and explaining the generated views. The manager is the discipline around that work: its help and schema coach the agent to think in terms of linkage, support, opposition, dependency, staleness, and falsification instead of producing another loose summary. The human decides meaning, the agent interprets, and the manager keeps the interpretation structured enough to compare over time.

Although this pattern fits research programs best, it is useful anywhere reasoning evolves over many sessions: AI or ML development, scientific and engineering investigations, long software projects, due diligence, technical analysis, strategy work, policy work, product investigations, or any project where the hard part is not finding more files but seeing how the core ideas moved.

There are four layers:

Project sources - your ordinary project files. Papers, notes, READMEs, notebooks, experiment outputs, code, meeting notes, reports, images, and anything else you already use. These are the raw material, not the maintained state. The agent reads them for interpretation. The manager tells the agent how to treat them: sources can be searched, cited, and discussed, but they do not move a hypothesis, reopen a decision, or resolve a question unless that change is captured in a structured record.

Structured records - small explicit files that track the maintained interpretation and linkage. Hypotheses, evidence, experiments, decisions, questions, and risks live here. They do not replace the source material; they make the research state explicit enough to inspect and update. The LLM can draft these records from sources or conversation, but the record-level change should be visible before it changes the tracked state. Evidence should point back to the result, paper, memo, run, or observation it came from, so later review can inspect why a hypothesis changed instead of trusting a detached note.

Maintained artifacts - generated files maintained from the structured records and project sources. This can include a compiled state file, a source index, history snapshots, and recent-change diffs. The agent uses these files as its starting point for the next session, and you can ask the agent to turn them into prose when you need a briefing. They are inspection surfaces, not the underlying source material: they show what changed, what is linked, what is stale, and where the project is under pressure, but durable tracking changes belong back in the structured records.

The schema and help surface - help.txt. This is the manager. It tells the LLM how this project tracks research state: what record types exist, how IDs are named, which artifacts should be maintained, what checks to perform, and what workflows to follow. It is also the coaching layer. It teaches the agent to think in terms of hypothesis linkage, support and opposition, dependencies, uncertainty, staleness, and falsification, instead of treating every new file as another summarization task.

The ownership boundary is simple:

The human adds sources and research insights, directs judgment, and corrects the state when needed.
The agent translates sources into structured records, maintains the artifacts, and explains the state.
The manager defines the schema and coaches the agent on how to use it.

This boundary is the point. The human should not have to manually maintain a giant status document. The LLM should not be left to invent a private structure every session. The manager keeps the agent's work explicit, comparable, and easy to audit.

Onboard. This happens when you start tracking a project. You tell the agent you want to track the project with the manager, and you give it the initial hypothesis, research question, or decision frame. The agent asks the manager how this project tracks state, which means it reads help.txt before changing anything. The manager tells the agent what record types exist, how artifacts are laid out, how names work, which generated state and evolution files to maintain, and where source material, tracking records, and maintained artifacts separate. The agent creates the initial tracking structure, writes the first records, and builds the first maintained artifacts before deeper research work begins. This compounds because future sessions start from maintained project state instead of a cold reread.

Ingest. This happens when you add a new file, update an existing file, or bring in new material from a conversation. You tell the agent to ingest it through the manager. The agent asks the manager how this project represents new material, then reads the source itself. It identifies what changed, which hypotheses, decisions, questions, or risks are touched, and what record updates are needed. A result can become evidence. An emerging interpretation can become a hypothesis. A blocker can become a question. A choice can become a decision. A tension can become a risk. The agent writes or updates the structured records, then updates the maintained artifacts so the new material is connected to the current research state rather than left as an isolated summary. This compounds because each source changes the maintained network of hypotheses, evidence, and open questions.

Review. This happens when you ask where the project stands, what changed, what is weak, or what deserves attention next. The agent asks the manager how review works, then reads the structured records and maintained artifacts. It checks the relationships the manager says to care about: support, opposition, dependency, drift, stale assumptions, open questions, and risks. Then it gives you the review output: a status answer, briefing, comparison, current-state note, contradiction pass, next-decision scaffold, or direct answer with links back to records and sources. If the review discovers a durable tracking change, the agent does not leave it trapped in chat or in a prose answer. It turns that finding back into structured records and regenerated artifacts. This compounds because useful analysis becomes part of the maintained state.
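One review check can be sketched in a few lines: find decisions that depend on a hypothesis with more opposing than supporting evidence. This is a minimal sketch, not the manager's required logic. The field names (`supports`, `opposes`, `depends_on`) and the sample records are assumptions made for illustration, and counting evidence is a crude stand-in for judgment: it only points the human at places worth a closer look.

```python
from collections import Counter

def decisions_under_pressure(records):
    """Flag decisions whose hypotheses have more opposing than supporting evidence."""
    support, oppose = Counter(), Counter()
    for ev in records["evidence"]:
        support.update(ev.get("supports", []))
        oppose.update(ev.get("opposes", []))

    flagged = []
    for dec in records["decisions"]:
        weakened = [h for h in dec.get("depends_on", []) if oppose[h] > support[h]]
        if weakened:
            flagged.append({"decision": dec["id"], "weakened": weakened})
    return flagged

# Hypothetical records, invented for this example.
records = {
    "evidence": [
        {"id": "E1", "supports": ["H1"]},
        {"id": "E2", "opposes": ["H2"]},
        {"id": "E3", "opposes": ["H2"]},
        {"id": "E4", "supports": ["H2"]},
    ],
    "decisions": [
        {"id": "D1", "depends_on": ["H1"]},
        {"id": "D2", "depends_on": ["H2"]},
    ],
}

flagged = decisions_under_pressure(records)
print(flagged)
```

Here D2 is flagged because H2 has two opposing evidence items and one supporting item. Whether D2 should actually be reopened is still a human call.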

Evolve. This happens when the judgment changes. Ingest starts from new material; evolve starts from the conclusion that the research state itself needs to move. You tell the agent that a hypothesis, decision, question, or risk needs to change, or you ask it to reconcile the state after a review. The agent asks the manager how hypothesis changes should be represented. Then it edits the relevant structured records, preserves or updates the evidence and decision linkages, and updates the maintained artifacts so the movement is visible. A hypothesis can be strengthened, weakened, split, rewritten, retired, or replaced. A decision can be reopened. A question can be resolved or reframed. A risk can be accepted, closed, or made more specific. This compounds because the project remembers not only the current hypothesis, but how the core ideas moved and why.
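An evolve move can be sketched as a function that rewrites one hypothesis record, leaves its ID and evidence links alone, and returns a change event for the history. This is a sketch under assumptions: the function name, the `status` values, and the event fields are hypothetical, and a real project would follow whatever help.txt defines.

```python
import copy

def narrow_hypothesis(records, hyp_id, new_statement, reason):
    """Return updated records plus a change event; the input records are not modified."""
    updated = copy.deepcopy(records)
    for hyp in updated["hypotheses"]:
        if hyp["id"] == hyp_id:
            event = {
                "id": hyp_id,
                "change": "narrowed",
                "from": hyp["statement"],
                "to": new_statement,
                "reason": reason,
            }
            hyp["statement"] = new_statement
            hyp["status"] = "narrowed"
            return updated, event
    raise KeyError(f"unknown hypothesis: {hyp_id}")

# Hypothetical records, invented for this example.
records = {
    "hypotheses": [{"id": "H2", "statement": "The cache causes the slowdown", "status": "active"}],
    "evidence": [{"id": "E5", "opposes": ["H2"]}],
}

updated, event = narrow_hypothesis(
    records, "H2",
    "The cache causes the slowdown on cold starts only",
    reason="E5 shows warm runs are unaffected",
)
print(event["from"], "->", event["to"])
```

Because the hypothesis keeps its ID, E5 still points at it after the change, and the event records what the statement used to say and why it moved.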

Keep the role split visible while doing this. The human directs judgment, challenges interpretations, and corrects the state when needed. The agent reads sources, writes records, maintains artifacts, and explains the project state. The manager defines the schema and coaches the agent on how to keep hypotheses, evidence, decisions, questions, and risks linked over time.

For example: a new experiment report lands in the project. The agent reads it and writes one evidence record linked to two existing hypotheses. The evidence supports one hypothesis and opposes the other, but the opposed hypothesis is not automatically rewritten. The maintained artifacts now show the changed evidence pressure. During review, you and the agent decide the opposed hypothesis should be narrowed rather than retired. That is an evolve move: the agent updates the hypothesis record, preserves the evidence linkage, and updates the artifacts so future sessions can see what changed.

In practice, this is a local working loop. The agent works in the project directory, reads the schema and help file, edits records and artifacts, and uses diffs as an inspection surface. The human inspects the maintained views, asks questions, and corrects the state when the agent gets the meaning wrong. The next session starts from maintained research state and recent movement instead of a cold reread of the entire project.

A research manager works because the project state is split into files with different jobs. Sources carry the underlying material. Structured records track the interpretation and linkages. Maintained artifacts make that tracked state cheap to inspect. Evolution artifacts show how it changed. help.txt teaches the agent how to keep those roles separate.

The schema and help file, help.txt, tells the agent how to operate the loop. It defines the record types, naming rules, link semantics, review workflows, and operational expectations. This is the coaching layer: before the agent writes or reviews anything, it reads help.txt to understand what counts as a hypothesis, evidence, support, opposition, dependency, stale state, contradiction, or durable change.

The source files are the underlying material: papers, notes, READMEs, notebooks, experiment outputs, code, meeting notes, reports, images, and anything else the project already uses. The agent reads these files directly when it needs evidence or context. They are not replaced by the manager.

The structured records are the tracking layer: hypotheses.yaml, evidence.yaml, decisions.yaml, questions.yaml, risks.yaml, experiments.yaml, or the same records split into smaller files. They make the research state explicit by recording what the project is tracking and how each hypothesis, evidence item, decision, question, risk, or experiment points back to sources and to each other. The agent writes or updates these records during onboard, ingest, and evolve. Sources carry the underlying material; structured records carry the maintained interpretation and linkage.
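To make the record shapes concrete, here is one possible sketch using Python dataclasses. Every field name, ID, and path below is an assumption for illustration; the pattern does not prescribe a format, and your agent would normally settle the real schema in help.txt and store the records as YAML.

```python
from dataclasses import dataclass, field, asdict

# Hypothetical record shapes; adjust the fields to whatever help.txt defines.
@dataclass
class Hypothesis:
    id: str
    statement: str
    status: str = "active"  # for example: active, narrowed, retired

@dataclass
class Evidence:
    id: str
    summary: str
    source: str  # the result, paper, memo, or run this evidence came from
    supports: list[str] = field(default_factory=list)  # hypothesis IDs
    opposes: list[str] = field(default_factory=list)   # hypothesis IDs

@dataclass
class Decision:
    id: str
    summary: str
    depends_on: list[str] = field(default_factory=list)  # hypothesis IDs

# Made-up example records.
h1 = Hypothesis("H1", "Larger batches stabilize training")
e1 = Evidence("E1", "Run 14 diverged at batch size 512",
              source="runs/run14/report.md", opposes=["H1"])
d1 = Decision("D1", "Use batch size 512 for the next sweep", depends_on=["H1"])

# asdict() yields plain dictionaries that could be written out as YAML or JSON.
records = {
    "hypotheses": [asdict(h1)],
    "evidence": [asdict(e1)],
    "decisions": [asdict(d1)],
}
print(records["evidence"][0])
```

The links live on the evidence and decision records, so a hypothesis does not need to carry its own list of evidence.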

The maintained artifacts, such as state.json and index.json, make the tracked state cheap to inspect. state.json is the compiled current view. index.json tells the agent where sources, records, artifacts, and ID mentions live. The agent reads these files before drilling into the project, then answers in prose when you ask for a briefing, review, contradiction pass, or next-decision scaffold. The files are generated from the records and sources, not edited as truth.
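As a rough illustration of "generated, not edited", the sketch below compiles a state view and a small index from in-memory records. The function names, the output layout, and the sample records are assumptions; a real manager would read the YAML records from disk and write state.json and index.json in whatever shape the project chooses.

```python
import json

def compile_state(records):
    """Compile a current view: which evidence supports or opposes each hypothesis."""
    hypotheses = {
        h["id"]: {"statement": h["statement"], "supported_by": [], "opposed_by": []}
        for h in records["hypotheses"]
    }
    warnings = []
    for ev in records["evidence"]:
        for link, bucket in (("supports", "supported_by"), ("opposes", "opposed_by")):
            for hyp_id in ev.get(link, []):
                if hyp_id in hypotheses:
                    hypotheses[hyp_id][bucket].append(ev["id"])
                else:
                    # Surface broken references instead of dropping them silently.
                    warnings.append(f"{ev['id']} links to unknown hypothesis {hyp_id}")
    return {"hypotheses": hypotheses, "warnings": warnings}

def build_index(records):
    """Map every record ID to its record type and, where present, its source path."""
    return {
        rec["id"]: {"kind": kind, "source": rec.get("source")}
        for kind, group in records.items()
        for rec in group
    }

# Hypothetical records, invented for this example.
records = {
    "hypotheses": [
        {"id": "H1", "statement": "The index speeds up lookups"},
        {"id": "H2", "statement": "The index has no write cost"},
    ],
    "evidence": [
        {"id": "E1", "source": "reports/exp-07.md", "supports": ["H1"], "opposes": ["H2"]},
        {"id": "E2", "source": "notes/call.md", "supports": ["H9"]},
    ],
}

state = compile_state(records)
index = build_index(records)
state_json = json.dumps(state, indent=2)  # what would be written to state.json
print(state_json)
```

The broken link from E2 to the nonexistent H9 shows up as a warning in the compiled state, which gives the agent something explicit to act on.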

The evolution artifacts, such as history/ and last_diff.json, preserve movement over time. History keeps previous states or events. The latest diff shows what changed since the previous refresh: added records, removed records, modified records, renamed IDs, and changed derived state. These files let the agent explain how the tracked state moved instead of only describing where it stands now.
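The core of a latest-diff artifact can be sketched as a comparison of two snapshots keyed by record ID. This is a simplified sketch: the function name and the snapshot contents are made up, and it does not detect renames, which would appear here as one removed ID plus one added ID unless the manager tracks them separately.

```python
def diff_states(previous, current):
    """Compare two {record_id: record} snapshots and report what moved."""
    prev_ids, curr_ids = set(previous), set(current)
    return {
        "added": sorted(curr_ids - prev_ids),
        "removed": sorted(prev_ids - curr_ids),
        "modified": sorted(i for i in prev_ids & curr_ids if previous[i] != current[i]),
    }

# Hypothetical snapshots from two refreshes.
previous = {
    "H1": {"statement": "Cache misses cause the slowdown", "status": "active"},
    "Q1": {"text": "Is the slowdown reproducible?", "status": "open"},
    "R1": {"text": "Benchmark hardware may change", "status": "open"},
}
current = {
    "H1": {"statement": "Cache misses cause the slowdown on cold starts", "status": "narrowed"},
    "E1": {"summary": "Warm runs show no slowdown", "opposes": ["H1"]},
    "R1": {"text": "Benchmark hardware may change", "status": "open"},
}

last_diff = diff_states(previous, current)
print(last_diff)
```

In this example H1 is reported as modified, E1 as added, and Q1 as removed, while the unchanged R1 does not appear at all.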

The result is a loop the next session can enter quickly. Sources provide material, records preserve interpretation and linkage, maintained artifacts make the tracked state navigable, evolution artifacts preserve change, and help.txt teaches the agent how to keep those pieces coherent. The goal is not to build a search system or prewrite every human-readable report; the goal is to make hypothesis movement explicit enough that the next session can see what changed, why it changed, and what still needs judgment.

Once the manual ingest loop works, a periodic cron job or standing agent routine can keep the project from drifting out of date. The useful version is "ingest everything changed since the last review": the trigger finds files added or modified after the last review marker, passes that file list to the agent, and asks the agent to ingest them through the manager. The trigger does not decide what the change means; it starts the workflow and provides the changed files or context. The agent still reads help.txt, reads the sources, decides which hypotheses, evidence, decisions, questions, or risks are touched, updates the structured records, and regenerates the maintained and evolution artifacts. Keep the boundary simple: triggers start work, the agent interprets and writes, and the manager defines the schema and coaching rules. Automation should reduce missed updates, not silently create hypotheses, rewrite decisions, or resolve questions.
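The trigger side of that boundary can be very small. The sketch below only lists files modified after a review marker and leaves all interpretation to the agent. It assumes the marker is stored as a Unix timestamp and that modification times are trustworthy, which may not hold after a fresh checkout; the function name and the demo files are hypothetical.

```python
import os
import tempfile
from pathlib import Path

def changed_since(root, marker_time, ignore=(".git",)):
    """List files under root whose modification time is later than marker_time."""
    root = Path(root)
    changed = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        if any(part in ignore for part in rel.parts):
            continue  # skip version-control internals and other ignored directories
        if path.stat().st_mtime > marker_time:
            changed.append(rel.as_posix())
    return sorted(changed)

# Demo in a throwaway directory with explicitly set modification times.
with tempfile.TemporaryDirectory() as tmp:
    marker = 1_700_000_000.0  # hypothetical "last review" timestamp
    old_note = Path(tmp, "notes.md")
    old_note.write_text("older note")
    new_report = Path(tmp, "results", "run2.md")
    new_report.parent.mkdir()
    new_report.write_text("new result")
    os.utime(old_note, (marker - 60, marker - 60))
    os.utime(new_report, (marker + 60, marker + 60))
    found = changed_since(tmp, marker)

print(found)
```

The resulting list is what the trigger would hand to the agent along with the instruction to ingest through the manager.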

- Use git. Keep sources, structured records, and generated state in the same repo when possible. The diff is the easiest way to inspect what the agent changed, and history gives you a rollback path when a record update is wrong.
- Start sessions from state, index, and diff. Have the agent read the compiled state, navigation index, and latest diff before opening raw sources. This keeps the session focused on what moved instead of reconstructing the whole project.
- Make evidence do the linking. Add evidence links as soon as the evidence is created. Unlinked evidence is just a note, and claims should not have to carry their own manually maintained evidence lists.
- Prefer small records over catch-all records. One focused evidence item, decision, question, or risk is easier to update, retire, rename, and review than a large record that mixes several meanings.
- Watch warnings and empty results. A useful manager should tell the agent when references are broken, enums were normalized, files were skipped, or a query found nothing. Treat those signals as part of the workflow, not as noise.
- Rename explicitly. Stable IDs matter because records link to each other over time. If an ID needs to change, use an explicit rename flow or make the change in a way that preserves old links.
- Leave rich document understanding to the agent. Do not make the manager responsible for OCR, image interpretation, spreadsheet reasoning, or deep PDF understanding. The agent can read and interpret sources with its own tools, then write the structured records that matter. The manager should stay focused on help.txt, the schema, link rules, generated state, and evolution tracking.
- Automate only after the manual loop works. A periodic cron job or standing agent routine can ingest files changed since the last review, but set it up after the schema, warnings, and review flow are reliable. Automation should reduce bookkeeping, not hide interpretation.
- Keep help.txt short and operational. Agents follow compact instructions better. Put record types, link rules, workflow triggers, and warning-handling rules in help.txt; keep long rationale elsewhere.
- Generate readable outputs on demand. Ask the agent for memos, reviews, plots, slide decks, dashboards, or narrative summaries when useful. Those outputs do not need to be part of the required manager state. If they contain a durable tracking change, convert it back into structured records.
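The "rename explicitly" tip can be sketched as a single function that changes an ID and every link pointing at it in one pass. This is one possible shape, not a prescribed flow: the link field names (`supports`, `opposes`, `depends_on`) and the sample records are assumptions.

```python
LINK_FIELDS = ("supports", "opposes", "depends_on")  # hypothetical link field names

def rename_id(records, old_id, new_id):
    """Return a copy of records with old_id renamed and every link to it updated."""
    all_ids = {rec["id"] for group in records.values() for rec in group}
    if old_id not in all_ids:
        raise KeyError(f"unknown ID: {old_id}")
    if new_id in all_ids:
        raise ValueError(f"ID already in use: {new_id}")

    renamed = {}
    for kind, group in records.items():
        renamed[kind] = []
        for rec in group:
            rec = dict(rec)  # shallow copy so the input records stay untouched
            if rec["id"] == old_id:
                rec["id"] = new_id
            for link_field in LINK_FIELDS:
                if link_field in rec:
                    rec[link_field] = [new_id if x == old_id else x for x in rec[link_field]]
            renamed[kind].append(rec)
    return renamed

# Hypothetical records, invented for this example.
records = {
    "hypotheses": [{"id": "H1", "statement": "Latency is dominated by I/O"}],
    "evidence": [{"id": "E1", "supports": ["H1"]}],
    "decisions": [{"id": "D1", "depends_on": ["H1"]}],
}

renamed = rename_id(records, "H1", "H-io-latency")
print(renamed["evidence"][0]["supports"], renamed["decisions"][0]["depends_on"])
```

Refusing a rename to an ID that already exists keeps two records from silently merging their links.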

The hard part of long research projects is not just remembering each component; it is remembering the current relationship between hypotheses, evidence, decisions, risks, and unresolved questions as that relationship changes. Status docs get stale, agent chats disappear into history, notes pile up, and decisions keep moving forward after their evidence has weakened. Contradictions that should force a judgment can sit in different files and never meet each other. This gets harder when more than one person is involved: each person may know their own thread, but the shared hypothesis state is harder to see, including which result moved the project, which assumption weakened, which question was answered, and which decision now depends on evidence that no longer holds. Without a maintained state layer, the project starts relying on meetings, memory, and private summaries to keep the reasoning coherent.

LLMs can help, but only if they have a maintained surface to work against. If every session starts by rereading the whole project, the agent spends its effort reconstructing context instead of improving it. The research manager makes the bookkeeping explicit by giving the agent a current state to read, a schema to follow, and coaching rules for what counts as support, opposition, dependency, staleness, contradiction, and durable change. The useful split is straightforward: the human directs judgment, the agent handles translation and synthesis, and the manager keeps the state explicit enough to review, compare, and update over time.

This document describes the pattern, not one required implementation. The exact record format, file layout, commands, generated artifacts, and automation loop should fit the project and the agent environment: a small project may only need a few structured records, a state file, and a diff, while a larger project may want history, review queues, stable rename flows, and scheduled ingestion. The invariant is that research state should be explicit, local, reviewable, and easy for an agent to update. Hypotheses, evidence, decisions, questions, and risks should not live only in chat history or scattered notes; they should be written into a maintained structure the next session can read. Give the pattern to an LLM agent and have it build the smallest manager that fits the project: one that keeps hypotheses, evidence, decisions, questions, risks, and changes explicit enough for the next session to understand what moved, why it moved, and what still needs judgment.

source & further reading

Original article: gist.github.com
