{"slug": "findings-from-running-remyxai-cli-autoresearch-across-5-production-repos-per-of", "title": "Findings from running remyxai-cli autoresearch across 5 production repos — per-repo inventory of architectural extension points missing to receive recent AI methods", "summary": "A developer ran an agentic method-search loop across 6 production repositories, spending under $22 on 36 cycles to inventory missing architectural extension points for recent AI methods. The tool, packaged as a CLI subcommand in remyxai-cli, identified gaps such as missing trainer scaffolding, agent loops, and reward-labeler plug-in points, with results publicly available on target forks.", "body_md": "Recent AI research lands in existing codebases through specific extension points — modules, callbacks, or data-structure fields where a new method can plug in. Which extension points a repo provides determines which methods can be tried against it without a rewrite. We ran an agentic method-search loop that dispatches recent arXiv papers as draft integrations against 6 production repos; the by-product across 36 cycles was a per-repo inventory of the specific extension points those repos are missing.\n\nThe loop is packaged as a CLI subcommand in [remyxai-cli #46](https://github.com/remyxai/remyxai-cli/pull/46):\n\n```\nremyxai outrider autoresearch --repo owner/name \\\n    --cycles 5 --budget 25 \\\n    --provider zai --model glm-5.2\n```\n\nCandidates come from the Remyx engine's ranker for the target's Research Interest, which is auto-extracted via `remyxai interests from-repo <owner/name>`. There are three dispatch modes: default (ranker's top-N), `--pin-arxiv <id>` for an exact paper, and `--search-method \"...\"` for a free-text query resolved to a top hit.\n\nPer cycle: a hypothesis LLM checks capability-fit against the target and picks a candidate (or SKIPs with alternative-query suggestions if none fit). 
If picked, [Outrider](https://github.com/remyxai/outrider) drafts an integration and audits fidelity against the paper's reference implementation. A decision LLM classifies the resulting artifact as MERGE / ITERATE / LEAD (paper mismatched but preflight surfaced a viable experiment) / REJECT / SKIP / INFRA_FAIL. The trace persists between cycles so subsequent picks read prior failure modes.\n\n36 cycles across 6 targets, under $22 total.\n\n| Target | Cycles | Result mix |\n|---|---|---|\n| `remyxai/VQASynth` | 10 | 1 LEAD, 4 REJECT, 5 SKIP |\n| `smellslikeml/OpenRLHF` | 5 | 4 LEAD, 1 INFRA_FAIL |\n| `smellslikeml/OLMo-core` | 5 | 3 LEAD, 1 ITERATE, 1 SKIP |\n| `smellslikeml/open-instruct` | 5 | 3 LEAD, 1 ITERATE, 1 SKIP |\n| `smellslikeml/atg-research` | 2 | 2 SKIP (see below) |\n| `smellslikeml/diffusers` | 9 | 4 LEAD, 2 REJECT, 1 INFRA_FAIL, 2 SKIP |\n\nEvery artifact is public on the target fork. Each bullet below is a named integration gap with a link to the artifact where preflight described it; papers listed are ones currently blocked by that gap.\n\n- No trainer / optimizer / loss scaffolding. Training-side papers like [CoLT](https://arxiv.org/abs/2606.31986) and [ScreenAnnotator](https://arxiv.org/abs/2606.18846) can't land. ([#103](https://github.com/remyxai/VQASynth/issues/103), [#105](https://github.com/remyxai/VQASynth/issues/105))\n- No agent loop, tool dispatch, or memory. [S-Agent](https://arxiv.org/abs/2606.20515) can't land in the single-shot inference path — preflight suggested wrapping `depth.py/localize.py/scene_fusion.py` as tools behind a training-free planner in a new `vqasynth/agent.py`. ([#106](https://github.com/remyxai/VQASynth/issues/106))\n- No proxy-reward layer over `evaluation.py`. A rejected test-time-scaling paper suggested using ground-truth scoring as an ORM stand-in. 
([#104](https://github.com/remyxai/VQASynth/issues/104))\n\n→ **Suggested next step:** add `vqasynth/agent.py` — preflight sketched the shape, opens the agentic-VLM inference family.\n\n- No step-level reward assignment. [MRPO](https://arxiv.org/abs/2606.31825) needs per-step credit assignment; trainer is trajectory-level. Preflight named `openrlhf/trainer/ppo_utils/step_reward.py`. ([#10](https://github.com/smellslikeml/OpenRLHF/issues/10))\n- No optimizer-benchmarking harness. Muon wired at `openrlhf/utils/deepspeed/deepspeed.py:148`, no A/B path against AdamW. ([#11](https://github.com/smellslikeml/OpenRLHF/issues/11))\n- No reward-labeler plug-in point. [PBRS](https://arxiv.org/abs/2606.27180) proposed a VLM preference labeler; reward-computation has no injection hook. ([#12](https://github.com/smellslikeml/OpenRLHF/issues/12))\n- No explicit warm-up phase separated from steady-state. [FORCE](https://arxiv.org/abs/2606.26006) needs one. ([#13](https://github.com/smellslikeml/OpenRLHF/issues/13))\n\n→ **Suggested next step:** create `openrlhf/trainer/ppo_utils/step_reward.py`. Most-cited missing host across cycles; downstream additions layer on once it exists.\n\n- No learned budget-constrained hybrid-attention layer selection. [FlashMorph](https://arxiv.org/abs/2606.30562)'s per-layer gating maps onto existing attention types via a gate module — preflight scoped a minimal experiment on a synthetic needle-in-haystack retrieval set with the backbone frozen. ([#6](https://github.com/smellslikeml/OLMo-core/issues/6))\n- [BluTrain](https://arxiv.org/abs/2606.24780) (full C++/CUDA training system) has no module-level import surface. Preflight suggested picking one technique (memory allocator setting, Triton kernel, backend config) and prototyping PyTorch-native via `src/integration_tests/test_train_small_model.py`. 
([#7](https://github.com/smellslikeml/OLMo-core/issues/7))\n- [HyperQuant](https://arxiv.org/abs/2606.23406)'s Tensor-Core MMA kernels can't land, but the RHT → lattice → Rice-coding weight-compression path can, as a standalone utility on pretrained OLMo-2 checkpoints, measured against paper Table 1 operating points. ([#8](https://github.com/smellslikeml/OLMo-core/issues/8))\n- ITERATE on [DMuon](https://arxiv.org/abs/2606.27153) — distributed Muon optimizer variant that may already be reachable through the existing `dion` dependency pin; needs a version check before committing to import. ([#5](https://github.com/smellslikeml/OLMo-core/issues/5))\n- SKIP cycle correctly identified 6 inference-serving papers (KV-cache eviction, disaggregated serving) that don't fit a training framework. Refinement suggestions map to first-class abstractions: `MoERouterV2`, `SequenceMixer`, MXFP8 stack, `MultimodalLM`.\n\n→ **Suggested next step:** the MoE-routing refinement query is the highest-leverage — OLMo-core has first-class MoE infrastructure (MoERouterV2, SharedExperts, fused-MoE) that could host improved routing / load-balancing methods currently outside the ranker's top-N.\n\nAll 3 LEADs cluster in the GRPO rollout/reward path — a coherent surface for research absorption:\n\n- [RolloutPipe](https://arxiv.org/abs/2606.26997)'s CGP+FGD pipelining maps onto GRPO's disaggregated rollout/train loop at the `grpo_olmo_core_actor.py` / `actor_manager.py` group-dispatch boundary. Preflight suggested the lighter CGP-only variant as an incremental change. ([#11](https://github.com/smellslikeml/open-instruct/issues/11))\n- [GeoAlign](https://arxiv.org/abs/2606.26917) needs hidden-state displacement plumbing the GRPO loop doesn't expose yet — preflight scoped a logging-only probe in `scripts/train/debug/single_gpu_on_beaker.sh` to confirm the metric is observable before building the full forward-pass callback. 
([#12](https://github.com/smellslikeml/open-instruct/issues/12))\n- [SCPO](https://arxiv.org/abs/2606.25852)'s step-level credit shaping fits GRPO's reward path as a config-gated module, validated on the repo's existing tool-use rollouts via `environments/`. ([#13](https://github.com/smellslikeml/open-instruct/issues/13))\n- ITERATE on [a supervisory-signal paper](https://arxiv.org/abs/2606.26027) — landed as too broad but preflight surfaced a control-token probability-mass diagnostic callback in `grpo_callbacks.py` as the minimum-viable probe. ([#14](https://github.com/smellslikeml/open-instruct/issues/14))\n- SKIP identified vision/multimodal papers (CFPO, VeriEvol, MRPO) that don't fit text-only open-instruct. Refinement suggestions referenced IcePop (the repo's own off-policy correction module) by name.\n\n→ **Suggested next step:** SCPO's step-level credit shaping is the tightest fit — reuses the existing reward-path abstraction and can be validated against tool-use rollouts the repo already runs.\n\n- Video-scheduler injection points less flexible than image pipelines. [Vera](https://arxiv.org/abs/2606.23610) and [PRISM](https://arxiv.org/abs/2606.20310) rejected on the same axis.\n- No preference-head over intermediate latents. PRISM couldn't land.\n- SKIP outcomes named frontier surfaces the ranker isn't reaching: DiT attention quantization (4-bit W4A16), few-step consistency-model schedulers, LoRA/PEFT for FLUX/SD3, modular composable pipelines.\n\n→ **Suggested next step:** extend the existing quantization workstream (AutoRound, torchao) to DiT attention. Fits an active axis, lands the \"efficient diffusion\" family currently backlogged.\n\nPinterest's `atg-research` is a catch-all monorepo organizing multiple unrelated research directions (joint-rl-diffusion, InteractRank, others) as sibling top-level directories. The interest anchored on one direction; preflight read a different subdirectory. 
Both cycles skipped honestly.\n\nNot a bug — a repo shape the current per-repo targeting doesn't accommodate. Follow-up: extend Research Interest configuration to accept a `subdir` scope.\n\nGaps across targets cluster into two shapes:\n\n- **Missing hosts** — a file or module that doesn't exist but would need to (VQASynth's `agent.py`, OpenRLHF's `step_reward.py`). Higher effort, unlocks entire method families.\n- **Missing hooks in existing infrastructure** — a callback, data-structure field, or plug-in point in code that already exists (open-instruct's hidden-state displacement plumbing for the GRPO forward pass, diffusers' preference-scoring callback during denoising). Lower effort, unlocks specific variants.\n\nProducing this inventory manually would require reading every recent paper in the target's subfield and imagining each integration path. The connection between a small architectural change and the methods it would unblock is often invisible from inside the codebase. The dispatch mode surfaces that connection as a by-product.\n\nCost profile: SKIP outcome $0.02, dispatched REJECT ~$1-2, LEAD ~$1-2. Under $22 across 36 cycles produced the whole inventory above.\n\nThe dispatch mode handles find-and-draft. 
Design and review remain the maintainer's — the inventory just makes those decisions cheaper because the specific missing piece is named.", "url": "https://wpnews.pro/news/findings-from-running-remyxai-cli-autoresearch-across-5-production-repos-per-of", "canonical_source": "https://gist.github.com/smellslikeml/3ec1a655433f622c8e8cdd11b04c8cdf", "published_at": "2026-07-03 17:40:00+00:00", "updated_at": "2026-07-04 16:18:40.342270+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "ai-research", "developer-tools", "ai-agents"], "entities": ["remyxai-cli", "remyxai/VQASynth", "smellslikeml/OpenRLHF", "smellslikeml/OLMo-core", "smellslikeml/open-instruct", "smellslikeml/atg-research", "smellslikeml/diffusers", "Outrider"], "alternates": {"html": "https://wpnews.pro/news/findings-from-running-remyxai-cli-autoresearch-across-5-production-repos-per-of", "markdown": "https://wpnews.pro/news/findings-from-running-remyxai-cli-autoresearch-across-5-production-repos-per-of.md", "text": "https://wpnews.pro/news/findings-from-running-remyxai-cli-autoresearch-across-5-production-repos-per-of.txt", "jsonld": "https://wpnews.pro/news/findings-from-running-remyxai-cli-autoresearch-across-5-production-repos-per-of.jsonld"}}