# Findings from running remyxai-cli autoresearch across 5 production repos — per-repo inventory of architectural extension points missing to receive recent AI methods

> Source: <https://gist.github.com/smellslikeml/3ec1a655433f622c8e8cdd11b04c8cdf>
> Published: 2026-07-03 17:40:00+00:00

Recent AI research lands in existing codebases through specific extension points — modules, callbacks, or data-structure fields where a new method can plug in. Which extension points a repo provides determines which methods can be tried against it without a rewrite. We ran an agentic method-search loop that dispatches recent arxiv papers as draft integrations against 6 production repos; the by-product across 36 cycles was a per-repo inventory of the specific extension points those repos are missing.

Packaged as a CLI subcommand in [remyxai-cli #46](https://github.com/remyxai/remyxai-cli/pull/46):

```
remyxai outrider autoresearch --repo owner/name \
    --cycles 5 --budget 25 \
    --provider zai --model glm-5.2
```

Candidates come from the Remyx engine's ranker for the target's Research Interest, auto-extracted via `remyxai interests from-repo <owner/name>`. Three dispatch modes: default (ranker's top-N), `--pin-arxiv <id>` for an exact paper, and `--search-method "..."` for a free-text query resolved to a top hit.

Per cycle: a hypothesis LLM checks capability-fit against the target and picks a candidate (or SKIPs with alternative-query suggestions if none fit). If picked, [Outrider](https://github.com/remyxai/outrider) drafts an integration and audits fidelity against the paper's reference implementation. A decision LLM classifies the resulting artifact as MERGE / ITERATE / LEAD (paper mismatched but preflight surfaced a viable experiment) / REJECT / SKIP / INFRA_FAIL. The trace persists between cycles so subsequent picks read prior failure modes.
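The per-cycle control flow described above can be sketched as follows. This is a minimal illustration, not the `remyxai-cli` implementation: `run_cycles`, `Trace`, `CycleRecord`, and the three injected callables (`pick`, `draft_and_audit`, `classify`) are hypothetical names standing in for the hypothesis LLM, Outrider, and the decision LLM. Treating a `RuntimeError` as the INFRA_FAIL signal is also an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Decision(Enum):
    MERGE = "MERGE"
    ITERATE = "ITERATE"
    LEAD = "LEAD"
    REJECT = "REJECT"
    SKIP = "SKIP"
    INFRA_FAIL = "INFRA_FAIL"


@dataclass
class CycleRecord:
    candidate: Optional[str]  # arXiv id of the picked paper, None on SKIP
    decision: Decision
    notes: str = ""


@dataclass
class Trace:
    records: List[CycleRecord] = field(default_factory=list)

    def failure_notes(self) -> List[str]:
        """Notes from earlier non-MERGE cycles, readable by the next pick."""
        return [r.notes for r in self.records
                if r.decision is not Decision.MERGE and r.notes]


def run_cycles(
    candidates: List[str],
    pick: Callable[[List[str], Trace], Optional[str]],  # hypothesis step (assumed shape)
    draft_and_audit: Callable[[str], dict],             # drafting step (assumed shape)
    classify: Callable[[dict], Tuple[Decision, str]],   # decision step (assumed shape)
    cycles: int,
) -> Trace:
    trace = Trace()
    for _ in range(cycles):
        choice = pick(candidates, trace)  # sees prior failure modes via the trace
        if choice is None:
            trace.records.append(CycleRecord(None, Decision.SKIP, "no capability fit"))
            continue
        try:
            artifact = draft_and_audit(choice)
        except RuntimeError as exc:  # assumption: infra problems surface as exceptions
            trace.records.append(CycleRecord(choice, Decision.INFRA_FAIL, str(exc)))
            continue
        decision, notes = classify(artifact)
        trace.records.append(CycleRecord(choice, decision, notes))
    return trace
```

Passing the whole trace to `pick` is the piece that matters here: it is what lets a later cycle avoid a candidate that already failed, or reuse a preflight note from an earlier LEAD.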

36 cycles across 6 targets, under $22 total.

| Target | Cycles | Result mix |
|---|---|---|
| `remyxai/VQASynth` | 10 | 1 LEAD, 4 REJECT, 5 SKIP |
| `smellslikeml/OpenRLHF` | 5 | 4 LEAD, 1 INFRA_FAIL |
| `smellslikeml/OLMo-core` | 5 | 3 LEAD, 1 ITERATE, 1 SKIP |
| `smellslikeml/open-instruct` | 5 | 3 LEAD, 1 ITERATE, 1 SKIP |
| `smellslikeml/atg-research` | 2 | 2 SKIP (see below) |
| `smellslikeml/diffusers` | 9 | 4 LEAD, 2 REJECT, 1 INFRA_FAIL, 2 SKIP |
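As a quick consistency check, the table can be tallied directly. The snippet below only re-adds the numbers reported in the table; the variable names are illustrative.

```python
from collections import Counter

# Result mix per target, copied from the table above.
RESULTS = {
    "remyxai/VQASynth": {"LEAD": 1, "REJECT": 4, "SKIP": 5},
    "smellslikeml/OpenRLHF": {"LEAD": 4, "INFRA_FAIL": 1},
    "smellslikeml/OLMo-core": {"LEAD": 3, "ITERATE": 1, "SKIP": 1},
    "smellslikeml/open-instruct": {"LEAD": 3, "ITERATE": 1, "SKIP": 1},
    "smellslikeml/atg-research": {"SKIP": 2},
    "smellslikeml/diffusers": {"LEAD": 4, "REJECT": 2, "INFRA_FAIL": 1, "SKIP": 2},
}
# Cycle counts per target, from the same table.
CYCLES = {
    "remyxai/VQASynth": 10,
    "smellslikeml/OpenRLHF": 5,
    "smellslikeml/OLMo-core": 5,
    "smellslikeml/open-instruct": 5,
    "smellslikeml/atg-research": 2,
    "smellslikeml/diffusers": 9,
}

totals = Counter()
for repo, mix in RESULTS.items():
    # Each row's outcomes should add up to that row's cycle count.
    assert sum(mix.values()) == CYCLES[repo], repo
    totals.update(mix)

print(sum(totals.values()))  # 36
print(dict(totals))
```

Every row sums to its cycle count, and the outcome totals are 15 LEAD, 11 SKIP, 6 REJECT, 2 ITERATE, and 2 INFRA_FAIL, with no MERGE in this run.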

Every artifact is public on the target fork. Each bullet below is a named integration gap with a link to the artifact where preflight described it; papers listed are ones currently blocked by that gap.

**`remyxai/VQASynth`**

- No trainer / optimizer / loss scaffolding. Training-side papers like [CoLT](https://arxiv.org/abs/2606.31986) and [ScreenAnnotator](https://arxiv.org/abs/2606.18846) can't land. ([#103](https://github.com/remyxai/VQASynth/issues/103), [#105](https://github.com/remyxai/VQASynth/issues/105))
- No agent loop, tool dispatch, or memory. [S-Agent](https://arxiv.org/abs/2606.20515) can't land in the single-shot inference path; preflight suggested wrapping `depth.py/localize.py/scene_fusion.py` as tools behind a training-free planner in a new `vqasynth/agent.py`. ([#106](https://github.com/remyxai/VQASynth/issues/106))
- No proxy-reward layer over `evaluation.py`. A rejected test-time-scaling paper suggested using ground-truth scoring as an ORM stand-in. ([#104](https://github.com/remyxai/VQASynth/issues/104))

→ **Suggested next step:** add `vqasynth/agent.py`. Preflight already sketched its shape, and it opens the agentic-VLM inference family.
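To make the "tools behind a training-free planner" idea concrete, here is a minimal sketch of what such an `agent.py` could look like. Everything in it is hypothetical: the real signatures of the depth, localization, and scene-fusion stages are not shown in the source, so the three tool functions are stubs that just add a field to a state dict, and the keyword rules are a placeholder for whatever planner a paper like S-Agent would actually use.

```python
from typing import Callable, Dict, List

Tool = Callable[[dict], dict]


# Stubs standing in for the existing stages (assumed interface: state in, state out).
def depth_tool(state: dict) -> dict:
    return {**state, "depth": "depth-map"}


def localize_tool(state: dict) -> dict:
    return {**state, "boxes": ["object-0"]}


def scene_fusion_tool(state: dict) -> dict:
    return {**state, "scene": "fused-point-cloud"}


TOOLS: Dict[str, Tool] = {
    "depth": depth_tool,
    "localize": localize_tool,
    "scene_fusion": scene_fusion_tool,
}

# Training-free planner: keyword rules, no learned parameters (illustrative only).
RULES = [
    (("far", "distance", "closer"), ["localize", "depth", "scene_fusion"]),
    (("where", "left", "right"), ["localize"]),
]


def plan(question: str) -> List[str]:
    """Return an ordered, de-duplicated list of tool names for the question."""
    q = question.lower()
    steps: List[str] = []
    for keywords, tools in RULES:
        if any(k in q for k in keywords):
            for name in tools:
                if name not in steps:
                    steps.append(name)
    return steps


def run_agent(image_path: str, question: str) -> dict:
    """Dispatch the planned tools in order, recording which ones ran."""
    state = {"image": image_path, "question": question, "trace": []}
    for name in plan(question):
        state = TOOLS[name](state)
        state["trace"].append(name)
    return state
```

For example, `run_agent("kitchen.png", "How far is the chair from the table?")` would run all three stubs in order, while a question with no spatial keywords dispatches no tools and falls through to whatever the single-shot path does today.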

**`smellslikeml/OpenRLHF`**

- No step-level reward assignment. [MRPO](https://arxiv.org/abs/2606.31825) needs per-step credit assignment; trainer is trajectory-level. Preflight named `openrlhf/trainer/ppo_utils/step_reward.py`. ([#10](https://github.com/smellslikeml/OpenRLHF/issues/10))
- No optimizer-benchmarking harness. Muon wired at `openrlhf/utils/deepspeed/deepspeed.py:148`, no A/B path against AdamW. ([#11](https://github.com/smellslikeml/OpenRLHF/issues/11))
- No reward-labeler plug-in point. [PBRS](https://arxiv.org/abs/2606.27180) proposed a VLM preference labeler; reward-computation has no injection hook. ([#12](https://github.com/smellslikeml/OpenRLHF/issues/12))
- No explicit warm-up phase separated from steady-state. [FORCE](https://arxiv.org/abs/2606.26006) needs one. ([#13](https://github.com/smellslikeml/OpenRLHF/issues/13))

→ **Suggested next step:** create `openrlhf/trainer/ppo_utils/step_reward.py`. Most-cited missing host across cycles; downstream additions layer on once it exists.
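The difference between trajectory-level and step-level credit can be shown with a generic discounted return-to-go. This is a textbook construction used only as an illustration; it is not MRPO's method and not OpenRLHF code, and `step_returns` is a hypothetical name for what a `step_reward.py` host might expose.

```python
from typing import List, Sequence


def step_returns(
    step_rewards: Sequence[float],
    outcome_reward: float,
    gamma: float = 1.0,
) -> List[float]:
    """Per-step credit as discounted return-to-go.

    step_rewards: optional per-step signals (all zeros if none exist).
    outcome_reward: the single trajectory-level reward, added at the last step.
    """
    if not step_rewards:
        return []
    rewards = list(step_rewards)
    rewards[-1] += outcome_reward
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step sees the discounted sum of what follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Trajectory-level only: every step receives identical credit.
print(step_returns([0.0, 0.0, 0.0], outcome_reward=1.0))             # [1.0, 1.0, 1.0]
# With a per-step signal and discounting, credit differs by step.
print(step_returns([0.0, 0.5, 0.0], outcome_reward=1.0, gamma=0.5))  # [0.5, 1.0, 1.0]
```

With no per-step signal and `gamma=1.0`, the function reduces to the trajectory-level behaviour, so a host like this could be introduced without changing existing runs.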

**`smellslikeml/OLMo-core`**

- No learned budget-constrained hybrid-attention layer selection. [FlashMorph](https://arxiv.org/abs/2606.30562)'s per-layer gating maps onto existing attention types via a gate module; preflight scoped a minimal experiment on a synthetic needle-in-haystack retrieval set with the backbone frozen. ([#6](https://github.com/smellslikeml/OLMo-core/issues/6))
- [BluTrain](https://arxiv.org/abs/2606.24780) (full C++/CUDA training system) has no module-level import surface. Preflight suggested picking one technique (memory allocator setting, Triton kernel, backend config) and prototyping PyTorch-native via `src/integration_tests/test_train_small_model.py`. ([#7](https://github.com/smellslikeml/OLMo-core/issues/7))
- [HyperQuant](https://arxiv.org/abs/2606.23406)'s Tensor-Core MMA kernels can't land, but the RHT → lattice → Rice-coding weight-compression path can, as a standalone utility on pretrained OLMo-2 checkpoints, measured against paper Table 1 operating points. ([#8](https://github.com/smellslikeml/OLMo-core/issues/8))
- ITERATE on [DMuon](https://arxiv.org/abs/2606.27153): distributed Muon optimizer variant that may already be reachable through the existing `dion` dependency pin; needs a version check before committing to import. ([#5](https://github.com/smellslikeml/OLMo-core/issues/5))
- SKIP cycle correctly identified 6 inference-serving papers (KV-cache eviction, disaggregated serving) that don't fit a training framework. Refinement suggestions map to first-class abstractions: `MoERouterV2`, `SequenceMixer`, MXFP8 stack, `MultimodalLM`.

→ **Suggested next step:** the MoE-routing refinement query is the highest-leverage option. OLMo-core has first-class MoE infrastructure (`MoERouterV2`, SharedExperts, fused-MoE) that could host improved routing / load-balancing methods currently outside the ranker's top-N.
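For the HyperQuant lead, the final stage of the RHT → lattice → Rice-coding path is the easiest to prototype standalone. The sketch below is plain Golomb-Rice coding with a zigzag mapping for signed integers. It is not HyperQuant's implementation: the RHT and lattice-quantization stages are not shown, the input is assumed to be a list of signed integer indices, and bits are kept as a string for readability rather than packed.

```python
from typing import List, Sequence


def zigzag(n: int) -> int:
    """Map signed to unsigned: 0, -1, 1, -2, 2 ... -> 0, 1, 2, 3, 4 ..."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(u: int) -> int:
    return u >> 1 if u % 2 == 0 else -((u + 1) >> 1)


def rice_encode(values: Sequence[int], k: int) -> str:
    """Rice code with parameter k: unary quotient, '0' stop bit, k-bit remainder."""
    bits: List[str] = []
    for n in values:
        u = zigzag(n)
        q, r = u >> k, u & ((1 << k) - 1)
        bits.append("1" * q + "0")
        if k:
            bits.append(format(r, f"0{k}b"))
    return "".join(bits)


def rice_decode(bits: str, k: int, count: int) -> List[int]:
    out: List[int] = []
    pos = 0
    for _ in range(count):
        q = 0
        while bits[pos] == "1":  # read the unary quotient
            q += 1
            pos += 1
        pos += 1                 # skip the '0' stop bit
        r = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        out.append(unzigzag((q << k) | r))
    return out


indices = [0, -1, 3, -4, 2]        # stand-in for quantized weight indices
code = rice_encode(indices, k=2)
print(code, len(code))             # 000001101010111000 18
assert rice_decode(code, k=2, count=len(indices)) == indices
```

Small-magnitude indices dominate after a good quantizer, which is why a short unary quotient plus a fixed `k`-bit remainder tends to be compact. Choosing `k` per tensor or per block is left out here.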

**`smellslikeml/open-instruct`**

All 3 LEADs cluster in the GRPO rollout/reward path, a coherent surface for research absorption:

- [RolloutPipe](https://arxiv.org/abs/2606.26997)'s CGP+FGD pipelining maps onto GRPO's disaggregated rollout/train loop at the `grpo_olmo_core_actor.py` / `actor_manager.py` group-dispatch boundary. Preflight suggested the lighter CGP-only variant as an incremental change. ([#11](https://github.com/smellslikeml/open-instruct/issues/11))
- [GeoAlign](https://arxiv.org/abs/2606.26917) needs hidden-state displacement plumbing the GRPO loop doesn't expose yet; preflight scoped a logging-only probe in `scripts/train/debug/single_gpu_on_beaker.sh` to confirm the metric is observable before building the full forward-pass callback. ([#12](https://github.com/smellslikeml/open-instruct/issues/12))
- [SCPO](https://arxiv.org/abs/2606.25852)'s step-level credit shaping fits GRPO's reward path as a config-gated module, validated on the repo's existing tool-use rollouts via `environments/`. ([#13](https://github.com/smellslikeml/open-instruct/issues/13))
- ITERATE on [a supervisory-signal paper](https://arxiv.org/abs/2606.26027): assessed as too broad to land, but preflight surfaced a control-token probability-mass diagnostic callback in `grpo_callbacks.py` as the minimum-viable probe. ([#14](https://github.com/smellslikeml/open-instruct/issues/14))
- SKIP identified vision/multimodal papers (CFPO, VeriEvol, MRPO) that don't fit text-only open-instruct. Refinement suggestions referenced IcePop (the repo's own off-policy correction module) by name.

→ **Suggested next step:** SCPO's step-level credit shaping is the tightest fit — reuses the existing reward-path abstraction and can be validated against tool-use rollouts the repo already runs.
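The control-token probability-mass diagnostic mentioned above reduces to a small metric: the share of next-token probability that falls on a chosen set of token ids. The sketch below computes only that metric in pure Python. The callback interface in `grpo_callbacks.py` is not shown in the source, so `control_token_mass` and its arguments are hypothetical, and which ids count as control tokens is an assumption left to the caller.

```python
import math
from typing import Iterable, Sequence


def control_token_mass(logits: Sequence[float], control_ids: Iterable[int]) -> float:
    """Fraction of softmax probability assigned to the given token ids."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    ids = set(control_ids)                     # de-duplicate so no id is counted twice
    return sum(exps[i] for i in ids) / total


# Toy vocabulary of 4 tokens with uniform logits; ids 1 and 3 are "control" tokens.
print(control_token_mass([0.0, 0.0, 0.0, 0.0], control_ids=[1, 3]))  # 0.5
```

Logging this value per step is the kind of logging-only probe preflight favoured elsewhere in this inventory: it confirms the signal is observable before any training-side change is made.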

**`smellslikeml/diffusers`**

- Video-scheduler injection points less flexible than image pipelines. [Vera](https://arxiv.org/abs/2606.23610) and [PRISM](https://arxiv.org/abs/2606.20310) rejected on the same axis.
- No preference-head over intermediate latents. PRISM couldn't land.
- SKIP outcomes named frontier surfaces the ranker isn't reaching: DiT attention quantization (4-bit W4A16), few-step consistency-model schedulers, LoRA/PEFT for FLUX/SD3, modular composable pipelines.

→ **Suggested next step:** extend the existing quantization workstream (AutoRound, torchao) to DiT attention. Fits an active axis, lands the "efficient diffusion" family currently backlogged.

**`smellslikeml/atg-research`**

Pinterest's `atg-research` is a catch-all monorepo organizing multiple unrelated research directions (joint-rl-diffusion, InteractRank, others) as sibling top-level directories. The interest anchored on one direction; preflight read a different subdirectory. Both cycles skipped honestly.

This is not a bug, but a repo shape the current per-repo targeting doesn't accommodate. Follow-up: extend Research Interest configuration to accept a `subdir` scope.

Gaps across targets cluster into two shapes:

- **Missing hosts**: a file or module that doesn't exist but would need to (VQASynth's `agent.py`, OpenRLHF's `step_reward.py`). Higher effort, unlocks entire method families.
- **Missing hooks in existing infrastructure**: a callback, data-structure field, or plug-in point in code that already exists (open-instruct's hidden-state displacement plumbing for the GRPO forward pass, diffusers' preference-scoring callback during denoising). Lower effort, unlocks specific variants.

Producing this inventory manually would require reading every recent paper in the target's subfield and imagining each integration path. The connection between a small architectural change and the methods it would unblock is often invisible from inside the codebase. The dispatch mode surfaces that connection as a by-product.

Cost profile: SKIP outcome $0.02, dispatched REJECT ~$1-2, LEAD ~$1-2. Under $22 across 36 cycles produced the whole inventory above.
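The cost figures can be sanity-checked against the outcome counts from the results table. The snippet works in cents to avoid floating-point noise. The counts are summed from the table; per-cycle costs for ITERATE and INFRA_FAIL are not stated in the source, so they are left out of the priced range.

```python
# Outcome counts summed over the six targets in the results table.
COUNTS = {"LEAD": 15, "REJECT": 6, "SKIP": 11, "ITERATE": 2, "INFRA_FAIL": 2}

SKIP_CENTS = 2                # $0.02 per SKIP
DISPATCH_CENTS = (100, 200)   # ~$1-2 per dispatched REJECT or LEAD
TOTAL_CAP_CENTS = 2200        # reported total: under $22

dispatched = COUNTS["LEAD"] + COUNTS["REJECT"]
skips = COUNTS["SKIP"] * SKIP_CENTS
low = skips + dispatched * DISPATCH_CENTS[0]
high = skips + dispatched * DISPATCH_CENTS[1]
avg_cap = TOTAL_CAP_CENTS / sum(COUNTS.values())

print(f"priced outcomes: ${low / 100:.2f} to ${high / 100:.2f}")  # $21.22 to $42.22
print(f"average per cycle: at most about ${avg_cap / 100:.2f}")   # about $0.61
```

Only the low end of the priced range fits under the reported $22 total, which suggests dispatched cycles in this run averaged close to $1 rather than $2.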

The dispatch mode handles find-and-draft. Design and review remain the maintainer's — the inventory just makes those decisions cheaper because the specific missing piece is named.
