Recent AI research lands in existing codebases through specific extension points — modules, callbacks, or data-structure fields where a new method can plug in. Which extension points a repo provides determines which methods can be tried against it without a rewrite. We ran an agentic method-search loop that dispatches recent arxiv papers as draft integrations against 6 production repos; the by-product across 36 cycles was a per-repo inventory of the specific extension points those repos are missing.
Packaged as a CLI subcommand in remyxai-cli #46:
remyxai outrider autoresearch --repo owner/name \
--cycles 5 --budget 25 \
--provider zai --model glm-5.2
Candidates come from the Remyx engine's ranker for the target's Research Interest, auto-extracted via `remyxai interests from-repo <owner/name>`. Three dispatch modes: default (ranker's top-N), `--pin-arxiv <id>` for an exact paper, and `--search-method "..."` for a free-text query resolved to a top hit.
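The three dispatch modes can be sketched as a small argument builder. This is a hypothetical helper, not part of remyxai-cli: it only assembles the flags quoted above, and treating `--pin-arxiv` and `--search-method` as mutually exclusive is an assumption.

```python
from typing import List, Optional


def build_autoresearch_argv(
    repo: str,
    cycles: int = 5,
    budget: int = 25,
    pin_arxiv: Optional[str] = None,
    search_method: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for one of the three dispatch modes (hypothetical helper)."""
    if pin_arxiv and search_method:
        # Assumption: the two non-default modes cannot be combined.
        raise ValueError("choose either pin_arxiv or search_method, not both")

    argv = [
        "remyxai", "outrider", "autoresearch",
        "--repo", repo,
        "--cycles", str(cycles),
        "--budget", str(budget),
    ]
    if pin_arxiv:
        argv += ["--pin-arxiv", pin_arxiv]          # exact paper
    elif search_method:
        argv += ["--search-method", search_method]  # free-text query
    # Neither flag set: default mode, the ranker's top-N.
    return argv
```

The returned list could be handed to `subprocess.run` once the CLI is installed; the sketch itself never invokes it.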
Per cycle: a hypothesis LLM checks capability-fit against the target and picks a candidate (or SKIPs with alternative-query suggestions if none fit). If picked, Outrider drafts an integration and audits fidelity against the paper's reference implementation. A decision LLM classifies the resulting artifact as MERGE / ITERATE / LEAD (paper mismatched but preflight surfaced a viable experiment) / REJECT / SKIP / INFRA_FAIL. The trace persists between cycles so subsequent picks read prior failure modes.
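The per-cycle control flow described above can be sketched as follows. The outcome labels come from the text; the `Trace` class, the callable signatures, and the choice to model infrastructure failures as `RuntimeError` are assumptions made for illustration, not Outrider's actual interfaces.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class Outcome(Enum):
    MERGE = "MERGE"
    ITERATE = "ITERATE"
    LEAD = "LEAD"
    REJECT = "REJECT"
    SKIP = "SKIP"
    INFRA_FAIL = "INFRA_FAIL"


@dataclass
class Trace:
    """Notes carried between cycles so later picks can read earlier failure modes."""
    entries: List[str] = field(default_factory=list)


def run_cycle(
    candidates: List[str],
    pick: Callable[[List[str], Trace], Optional[str]],
    draft_and_audit: Callable[[str], str],
    decide: Callable[[str], Outcome],
    trace: Trace,
) -> Outcome:
    """One cycle: pick (or SKIP), draft and audit, classify, record."""
    chosen = pick(candidates, trace)
    if chosen is None:
        trace.entries.append("SKIP: no candidate fit the target")
        return Outcome.SKIP
    try:
        artifact = draft_and_audit(chosen)
    except RuntimeError as exc:
        # Assumption: infrastructure errors surface as exceptions.
        trace.entries.append(f"INFRA_FAIL on {chosen}: {exc}")
        return Outcome.INFRA_FAIL
    outcome = decide(artifact)
    trace.entries.append(f"{outcome.value} on {chosen}")
    return outcome
```

Passing the same `Trace` into every call is what lets a later `pick` avoid candidates that already failed for a known reason.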
36 cycles across 6 targets, under $22 total.
| Target | Cycles | Result mix |
|---|---|---|
| remyxai/VQASynth | 10 | 1 LEAD, 4 REJECT, 5 SKIP |
| smellslikeml/OpenRLHF | 5 | 4 LEAD, 1 INFRA_FAIL |
| smellslikeml/OLMo-core | 5 | 3 LEAD, 1 ITERATE, 1 SKIP |
| smellslikeml/open-instruct | 5 | 3 LEAD, 1 ITERATE, 1 SKIP |
| smellslikeml/atg-research | 2 | 2 SKIP (see below) |
| smellslikeml/diffusers | 9 | 4 LEAD, 2 REJECT, 1 INFRA_FAIL, 2 SKIP |
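As a quick cross-check, the per-target result mixes in the table can be tallied. The numbers below are transcribed from the table; the dictionary layout and the `tally` helper are only an illustration.

```python
from collections import Counter
from typing import Dict

# Result mix per target, transcribed from the table above.
RESULTS: Dict[str, Dict[str, int]] = {
    "remyxai/VQASynth": {"LEAD": 1, "REJECT": 4, "SKIP": 5},
    "smellslikeml/OpenRLHF": {"LEAD": 4, "INFRA_FAIL": 1},
    "smellslikeml/OLMo-core": {"LEAD": 3, "ITERATE": 1, "SKIP": 1},
    "smellslikeml/open-instruct": {"LEAD": 3, "ITERATE": 1, "SKIP": 1},
    "smellslikeml/atg-research": {"SKIP": 2},
    "smellslikeml/diffusers": {"LEAD": 4, "REJECT": 2, "INFRA_FAIL": 1, "SKIP": 2},
}


def tally(results: Dict[str, Dict[str, int]]) -> Counter:
    """Sum outcome counts across all targets."""
    totals: Counter = Counter()
    for mix in results.values():
        totals.update(mix)
    return totals


totals = tally(RESULTS)
print(sum(totals.values()))  # 36
print(totals["LEAD"])        # 15
print(totals["SKIP"])        # 11
```

The rows sum to the 36 cycles quoted above, with LEAD as the most common outcome.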
Every artifact is public on the target fork. Each bullet below is a named integration gap with a link to the artifact where preflight described it; papers listed are ones currently blocked by that gap.
- No trainer / optimizer / loss scaffolding. Training-side papers like CoLT and ScreenAnnotator can't land. (#103, #105)
- No agent loop, tool dispatch, or memory. S-Agent can't land in the single-shot inference path; preflight suggested wrapping `depth.py` / `localize.py` / `scene_fusion.py` as tools behind a training-free planner in a new `vqasynth/agent.py`. (#106)
- No proxy-reward layer over `evaluation.py`. A rejected test-time-scaling paper suggested using ground-truth scoring as an ORM stand-in. (#104)
→ Suggested next step: add `vqasynth/agent.py`. Preflight sketched the shape, and it opens the agentic-VLM inference family.
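A minimal sketch of what such an agent host could look like, assuming a rule-based (training-free) planner and a shared state dict as memory. The three tool functions are stand-ins: the real entry points of `depth.py`, `localize.py`, and `scene_fusion.py` are not shown here, so every signature and return value below is hypothetical.

```python
from typing import Any, Callable, Dict, List


# Stand-in tools with assumed signatures; each takes and returns the shared state.
def estimate_depth(state: Dict[str, Any]) -> Dict[str, Any]:
    return {**state, "depth": "depth-map"}


def localize_objects(state: Dict[str, Any]) -> Dict[str, Any]:
    return {**state, "objects": ["obj-0"]}


def fuse_scene(state: Dict[str, Any]) -> Dict[str, Any]:
    return {**state, "scene": (state["depth"], tuple(state["objects"]))}


TOOLS: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]] = {
    "depth": estimate_depth,
    "localize": localize_objects,
    "scene_fusion": fuse_scene,
}


def plan(question: str) -> List[str]:
    """Training-free planner: a fixed rule instead of a learned policy."""
    steps = ["depth", "localize"]
    if "between" in question or "distance" in question:
        steps.append("scene_fusion")  # spatial questions use the fused scene
    return steps


def run_agent(image: Any, question: str) -> Dict[str, Any]:
    """Dispatch each planned tool in order, threading one state dict as memory."""
    state: Dict[str, Any] = {"image": image, "question": question, "history": []}
    for name in plan(question):
        state = TOOLS[name](state)
        state["history"].append(name)
    return state
```

Keeping the planner as a plain function means a paper's planning policy could replace `plan` without touching tool dispatch.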
- No step-level reward assignment. MRPO needs per-step credit assignment; trainer is trajectory-level. Preflight named `openrlhf/trainer/ppo_utils/step_reward.py`. (#10)
- No optimizer-benchmarking harness. Muon wired at `openrlhf/utils/deepspeed/deepspeed.py:148`, no A/B path against AdamW. (#11)
- No reward-labeler plug-in point. PBRS proposed a VLM preference labeler; reward computation has no injection hook. (#12)
- No explicit warm-up phase separated from steady-state. FORCE needs one. (#13)
→ Suggested next step: create `openrlhf/trainer/ppo_utils/step_reward.py`. Most-cited missing host across cycles; downstream additions layer on once it exists.
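To make the gap concrete, here is one way a step-level reward host could turn a trajectory-level scalar into per-token rewards. This is an illustrative sketch, not MRPO's scheme or OpenRLHF code: the function name, the step-boundary representation, and the proportional split are all assumptions.

```python
from typing import List, Optional, Sequence


def assign_step_rewards(
    num_tokens: int,
    step_ends: Sequence[int],
    trajectory_reward: float,
    step_scores: Optional[Sequence[float]] = None,
) -> List[float]:
    """Spread one trajectory-level reward over step boundaries.

    step_ends holds the index of the last token of each step. Without
    step_scores the reward is split evenly across steps; with them it is
    split in proportion to each step's score.
    """
    if not step_ends:
        raise ValueError("need at least one step")
    if any(not 0 <= end < num_tokens for end in step_ends):
        raise ValueError("step end out of range")

    if step_scores is None:
        weights = [1.0 / len(step_ends)] * len(step_ends)
    else:
        if len(step_scores) != len(step_ends):
            raise ValueError("one score per step required")
        total = sum(step_scores)
        if total <= 0:
            raise ValueError("step scores must sum to a positive value")
        weights = [score / total for score in step_scores]

    rewards = [0.0] * num_tokens
    for end, weight in zip(step_ends, weights):
        rewards[end] += trajectory_reward * weight  # credit lands on the step's last token
    return rewards
```

A trajectory-level trainer corresponds to the special case of a single step ending at the final token, which is why the other gaps in this list can layer on top of such a module.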
- No learned budget-constrained hybrid-attention layer selection. FlashMorph's per-layer gating maps onto existing attention types via a gate module; preflight scoped a minimal experiment on a synthetic needle-in-haystack retrieval set with the backbone frozen. (#6)
- BluTrain (full C++/CUDA training system) has no module-level import surface. Preflight suggested picking one technique (memory allocator setting, Triton kernel, backend config) and prototyping it PyTorch-native via `src/integration_tests/test_train_small_model.py`. (#7)
- HyperQuant's Tensor-Core MMA kernels can't land, but the RHT → lattice → Rice-coding weight-compression path can, as a standalone utility on pretrained OLMo-2 checkpoints, measured against paper Table 1 operating points. (#8)
- ITERATE on DMuon: a distributed Muon optimizer variant that may already be reachable through the existing `dion` dependency pin; needs a version check before committing to import. (#5)
- SKIP cycle correctly identified 6 inference-serving papers (KV-cache eviction, disaggregated serving) that don't fit a training framework. Refinement suggestions map to first-class abstractions: `MoERouterV2`, `SequenceMixer`, MXFP8 stack, `MultimodalLM`.
→ Suggested next step: the MoE-routing refinement query has the highest leverage. OLMo-core has first-class MoE infrastructure (MoERouterV2, SharedExperts, fused-MoE) that could host improved routing / load-balancing methods currently outside the ranker's top-N.
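The version check mentioned for the DMuon ITERATE can be sketched with the standard library alone. The distribution name passed in and the minimum version are placeholders: which `dion` release, if any, exposes a distributed Muon variant is not stated here and would need to be confirmed.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Keep the leading numeric components: '0.3.1rc1' -> (0, 3, 1)."""
    parts = []
    for chunk in text.split("."):
        digits = ""
        for ch in chunk:
            if not ch.isdigit():
                break
            digits += ch
        if digits:
            parts.append(int(digits))
        if len(digits) < len(chunk):
            break  # stop at the first non-numeric suffix such as 'rc1'
    return tuple(parts)


def at_least(installed: str, minimum: str) -> bool:
    """Compare two version strings after padding them to equal length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b


def installed_at_least(dist_name: str, minimum: str) -> Optional[bool]:
    """True/False for an installed distribution, None when it is absent."""
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
    return at_least(installed, minimum)
```

A call such as `installed_at_least("dion", "<minimum>")` would gate the import; pre-release suffixes are ignored by this simplified parser, so a stricter comparison would be needed if they matter.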
All 3 LEADs cluster in the GRPO rollout/reward path — a coherent surface for research absorption:
- RolloutPipe's CGP+FGD pipelining maps onto GRPO's disaggregated rollout/train loop at the `grpo_olmo_core_actor.py` / `actor_manager.py` group-dispatch boundary. Preflight suggested the lighter CGP-only variant as an incremental change. (#11)
- GeoAlign needs hidden-state displacement plumbing the GRPO loop doesn't expose yet; preflight scoped a logging-only probe in `scripts/train/debug/single_gpu_on_beaker.sh` to confirm the metric is observable before building the full forward-pass callback. (#12)
- SCPO's step-level credit shaping fits GRPO's reward path as a config-gated module, validated on the repo's existing tool-use rollouts via `environments/`. (#13)
- ITERATE on a supervisory-signal paper: landed as too broad, but preflight surfaced a control-token probability-mass diagnostic callback in `grpo_callbacks.py` as the minimum-viable probe. (#14)
- SKIP identified vision/multimodal papers (CFPO, VeriEvol, MRPO) that don't fit text-only open-instruct. Refinement suggestions referenced IcePop (the repo's own off-policy correction module) by name.
→ Suggested next step: SCPO's step-level credit shaping is the tightest fit — reuses the existing reward-path abstraction and can be validated against tool-use rollouts the repo already runs.
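The control-token probability-mass diagnostic from the ITERATE item is small enough to sketch in plain Python. Which token ids count as control tokens, and the callback interface `grpo_callbacks.py` actually expects, are assumptions; the `ControlMassLogger` class below is hypothetical and only records values.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one next-token logit vector."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def control_token_mass(logits: Sequence[float], control_ids: Iterable[int]) -> float:
    """Total next-token probability assigned to the given control-token ids."""
    probs = softmax(logits)
    return sum(probs[i] for i in set(control_ids))


class ControlMassLogger:
    """Logging-only probe: records the metric per step and never alters training."""

    def __init__(self, control_ids: Iterable[int]) -> None:
        self.control_ids = list(control_ids)
        self.history: List[Tuple[int, float]] = []

    def on_step(self, step: int, logits: Sequence[float]) -> None:
        self.history.append((step, control_token_mass(logits, self.control_ids)))
```

Because the probe only reads logits, it can confirm the metric is observable before any reward-shaping logic is wired in.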
- Video-scheduler injection points less flexible than image pipelines. Vera and PRISM rejected on the same axis.
- No preference-head over intermediate latents. PRISM couldn't land.
- SKIP outcomes named frontier surfaces the ranker isn't reaching: DiT attention quantization (4-bit W4A16), few-step consistency-model schedulers, LoRA/PEFT for FLUX/SD3, modular composable pipelines.
→ Suggested next step: extend the existing quantization workstream (AutoRound, torchao) to DiT attention. Fits an active axis, lands the "efficient diffusion" family currently backlogged.
Pinterest's atg-research is a catch-all monorepo organizing multiple unrelated research directions (joint-rl-diffusion, InteractRank, others) as sibling top-level directories. The interest anchored on one direction; preflight read a different subdirectory. Both cycles skipped honestly.
Not a bug, but a repo shape the current per-repo targeting doesn't accommodate. Follow-up: extend Research Interest configuration to accept a subdir scope.
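One possible shape for that follow-up, as a sketch: a scope object that restricts which repo paths preflight reads. The `ResearchInterestScope` class and its `subdir` field are hypothetical and do not describe the current Remyx configuration; the file names in the usage note are invented for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List, Optional


@dataclass(frozen=True)
class ResearchInterestScope:
    repo: str
    subdir: Optional[str] = None  # None means the whole repo

    def includes(self, path: str) -> bool:
        """True when the path sits inside the scoped subdirectory."""
        if self.subdir is None:
            return True
        scope = PurePosixPath(self.subdir)
        candidate = PurePosixPath(path)
        # Compare path components, not string prefixes, so that
        # 'joint-rl-diffusion-v2/...' does not match 'joint-rl-diffusion'.
        return candidate == scope or scope in candidate.parents


def filter_paths(scope: ResearchInterestScope, paths: List[str]) -> List[str]:
    """Keep only the paths that fall inside the scope."""
    return [p for p in paths if scope.includes(p)]
```

With `ResearchInterestScope("smellslikeml/atg-research", "joint-rl-diffusion")`, a path such as `joint-rl-diffusion/train.py` is kept while `InteractRank/model.py` is dropped, so interest extraction and preflight would read the same directory.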
Gaps across targets cluster into two shapes:
- Missing hosts: a file or module that doesn't exist but would need to (VQASynth's `agent.py`, OpenRLHF's `step_reward.py`). Higher effort, unlocks entire method families.
- Missing hooks in existing infrastructure: a callback, data-structure field, or plug-in point in code that already exists (open-instruct's hidden-state displacement plumbing for the GRPO forward pass, diffusers' preference-scoring callback during denoising). Lower effort, unlocks specific variants.
Producing this inventory manually would require reading every recent paper in the target's subfield and imagining each integration path. The connection between a small architectural change and the methods it would unblock is often invisible from inside the codebase. The dispatch mode surfaces that connection as a by-product.
Cost profile: SKIP outcome $0.02, dispatched REJECT ~$1-2, LEAD ~$1-2. Under $22 across 36 cycles produced the whole inventory above.
The dispatch mode handles find-and-draft. Design and review remain the maintainer's — the inventory just makes those decisions cheaper because the specific missing piece is named.