{"slug": "coding-a-classical-robot-controller-in-the-age-of-coding-agents", "title": "Coding a Classical Robot Controller in the Age of Coding Agents", "summary": "A team of two fathers competing as MacCody in the AI for Industry Challenge spent six weeks using coding agents to build a classical robot controller for a UR5e to insert fiber connectors into a randomized task board. The team did not make the top 30 but documented their mixed results, finding that a human-plus-coding-agent workflow provided real leverage for iterating on explicit, rule-based control code while also revealing clear limits compared to learned-policy approaches. The experiment tested whether classical controller development could become meaningfully competitive in a task typically requiring training data and retraining loops.", "body_md": "Hello everyone!\n\nWe competed in the AI for Industry Challenge under the team name **MacCody** and wanted to share our experiences in the Qualification Phase. Unfortunately, we did not make it into the top 30, but we wanted to contribute what we learned, since it may be interesting for roboticists and people working with LLMs / coding agents.\n\nDisclaimer: this is not a polished success story, and it is not an anti-agent story either. It is a retrospective about what happened when we used coding agents heavily on a hard robotics task for six weeks, from **April 1 to May 15** (not full-time, but fitted around day jobs and family life). The result was mixed: real leverage, but also clear limits.\n\nWhat we were really trying to find out\n\nThe [AI for Industry Challenge](https://www.intrinsic.ai/events/ai-for-industry-challenge) asked teams to control a UR5e in Gazebo and insert SFP and SC fiber connectors into ports on a randomized task board, using camera images, force/torque sensing, and joint state information.\n\nOne obvious reaction to a task like that is: this sounds like a learned-policy problem, i.e. 
collecting a big set of training data and training a model to control the robot.\n\nWe were interested in a different question.\n\nCould a human plus coding-agent workflow actually build, iterate on, and debug a **classical controller** (that is, ordinary source code with explicit rules, rather than a policy learned from training data) far enough that it becomes meaningfully competitive in a task where many people would otherwise assume they need training, datasets, and retraining loops?\n\nThat was the real experiment for us and may also serve as a benchmarking sandbox to evaluate coding agents for hard tasks.\n\nAnother reason this route felt worth trying was that coding agents had changed the economics of experimentation. In our case, tokens and code generation felt cheap enough to make broad exploration attractive, partly because model access was subsidized. That cost picture may look different in the future, but during this project it mattered. It made it feel reasonable to try explicit ideas, build tooling, throw away weak directions, and keep going.\n\nWe were not only chasing one lucky run. We were trying to find explicit strategies that still worked when seeds changed, board poses changed, NIC placements changed, and grasps were perturbed.\n\nTeam context\n\nThis was a small-team effort by two fathers, done around day jobs and family life.\n\nWe were not starting from zero, but we were also not equally prepared. Prior ROS/Gazebo and robotics-research experience existed mainly on one side of the team, and that clearly lowered the early friction. Still, looking back, the biggest speedup did not come from raw prior knowledge alone. It came from **adjacent experience multiplied by coding-agent automation**. The agent made it cheaper to turn partial understanding into code, tools, overlays, wrappers, and experiments.\n\nWe also ended up staying with a mostly **single-agent + human-in-the-loop** workflow. 
Part of that was preference: in a noisy, contact-rich task like this, preserving one shared mental picture of what the controller was trying to do often felt more valuable than maximizing autonomous parallelism.\n\nPart of it was also timing. Multi-agent workflows and more autonomous “dark factory” patterns still felt quite experimental to us at that point. On a problem this entangled, we suspected they were more likely to create extra chaos than useful structure.\n\nSetup\n\nOur setup was intentionally simple:\n\n-\na rented **RTX 3090** server with **10 CPU cores** and **more than 32 GB RAM**\n\n-\n**Pi** as the coding agent interface\n\n-\n`pi-autoresearch` for autonomous benchmark / edit / analyze loops\n\n-\nlater `pi-telegram` for asynchronous notifications and image pushes\n\n-\na **GitHub Copilot Business** subscription for sustained model access\n\nThe compute side was rented by the hour rather than built as a dedicated training setup. That does not mean the project was cheap in human time. It does mean the cost profile felt very different from a training-heavy route.\n\nOutcome\n\nEarly on, when the leaderboard was still sparse, we briefly sat at **rank 6 with a score of 47.34**. By the end, our final public position was **rank 65 with 112.75**. Our **best recorded internal score** reached **151.71**. But because the task had real variance and the policy was still moving during implementation sprints, we treat that as a peak observation rather than a stable final level.\n\nThat is **not** proof that classical control beats learned methods. It is **not** proof that we solved the challenge. We did not finish in the top 30.\n\nBut it was enough to change our prior. 
It made us take seriously the idea that coding agents can materially accelerate classical robotics work, to the point where explicit controllers are worth reconsidering in places where many people might otherwise default to training.\n\nController evolution\n\nThe policy was not one clever trick. Over time, it turned into a staged classical visuotactile controller.\n\nA rough summary is:\n\n-\nvisual acquisition and target selection\n\n-\nprealignment above the board\n\n-\nguarded descent\n\n-\ncontact classification\n\n-\nlocal search near the socket mouth\n\n-\ncompliant insertion\n\n-\nverification and recovery\n\nThat summary is cleaner than the lived development process. In practice, the controller became more explicit because we kept running into the same hard question in different forms: **what should the system trust right now?** Vision? Contact? Recovery logic? Retreat logic? Retry logic?\n\nEarly phase: just getting a classical baseline to work\n\nThe earliest viable versions were comparatively simple:\n\n-\ncolor-based segmentation for SFP and SC cues\n\n-\niterative image-based correction loops\n\n-\nboard-orientation estimation for prealignment\n\n-\nguarded descent with force-aware break and retreat logic\n\nThis was enough to produce a real classical baseline. 
It was not enough once randomization and late-stage contact started dominating the failures.\n\nMid phase: the system got more geometric and more explicit\n\nLater, the perception stack became more geometric and more constrained by the board model:\n\n-\nboard silhouette detection\n\n-\nhomography / perspective rectification into a canonical board view\n\n-\nURDF-guided ROI targeting for likely NIC / rail locations\n\n-\nGeneralized Hough Transform proposals for faceplate candidates\n\n-\nchamfer matching for edge-template verification\n\n-\ngeometric validation of aperture spacing and faceplate dimensions\n\n-\ncircular ferrule detection for SC with geometric consistency checks\n\nAt the implementation level, a lot of this sat on top of classical [OpenCV](https://github.com/opencv/opencv) building blocks rather than a learned perception stack.\n\nAt the same time, the controller became more state-machine-like, not because we had started with a love of state machines, but because the handoffs kept becoming decisive. 
A lot of the real work was no longer “detect the board” or “move down.” It was questions like:\n\n-\nwhen do we trust the current visual evidence enough to descend?\n\n-\nwhen has descent really become contact?\n\n-\nwhen does contact look like likely entry rather than just bumping the world?\n\n-\nwhen should the controller stop trying to force the current plan and instead recover?\n\nLate phase: the hard part was no longer mainly finding the target\n\nBy the later branch, the controller increasingly looked like a classical peg-in-hole controller rather than a generic vision-guided one:\n\n-\ncontact classification instead of pure force thresholding\n\n-\nlocal micro-map style search\n\n-\nring-based retries\n\n-\nboundary-aware probing\n\n-\nentrance-search behavior near the socket mouth\n\n-\nverify/recover logic that treated partial progress, likely seating, and true insertion as different states\n\nUnder stronger randomization, the targeting logic also became more structural. Later detector work used classical geometric methods such as:\n\n-\nGaussian smoothing and Canny edge detection\n\n-\nprobabilistic Hough line detection\n\n-\nlength-weighted angle histograms to estimate the dominant board direction\n\n-\nprojection profiles and peak finding to identify rail / bracket structure\n\n-\nphysical spacing constraints and geometric pairing rules for plausible SFP mouth candidates\n\nThese two images show that phase well.\n\n**Figure 1: late-stage edge detector overlay**\n\nLate-stage live overlay from the edge-based detector work. Candidate lines, selected SFP pair hypotheses, and the board structure are all made explicit rather than hidden behind a single detector output.\n\n**Figure 2: edge projection and peak selection**\n\nROI view together with a projection profile and selected peaks. 
This is a good example of how the later randomization-heavy phase relied on explicit geometric signals instead of only broad region heuristics.\n\nBy that point, the bottleneck was no longer mainly target finding. It was converting near-mouth alignment into real seating. That led to more careful jam logic, stronger but still bounded compliant pushes, explicit cable-drift handling, and retry geometries that tried to use cable flex constructively instead of treating it only as noise.\n\n**Figure 3: full insertion milestone**\n\nOne of our observed full insertions in simulation. For us this was an important breakthrough, but not yet a statistically robust solution.\n\nAgent leverage\n\nThe most obvious benefit was speed, but the more important story was where that speed actually showed up.\n\nThe agent helped most in the engineering loop **around** the controller, not only in the controller logic itself. It helped with simulator bring-up, wrappers, benchmark orchestration, telemetry, observability, documentation, memory artifacts, perception tooling, and bounded controller evolution inside a fast experiment loop.\n\nThat distinction matters. A lot of the leverage did not look like “the agent invented the winning idea.” It looked more like this:\n\n-\nbuilding one more debug overlay so we could see candidate geometry instead of guessing\n\n-\nwriting one more benchmark wrapper so variants could be compared faster\n\n-\nrestructuring a phase boundary so retries made more sense\n\n-\nturning vague suspicions into inspectable outputs\n\nThere were also two less obvious benefits.\n\nThe first was **domain ramp-up**. This was a robotics subdomain we did not work in every day. The agent helped with vocabulary, decomposition, and building a more explicit mental model of insertion, contact, seating, jamming, recovery, and verification.\n\nThe second was **controller restructuring**, not only tuning. 
A meaningful part of the value was that the agent helped repeatedly reshape the controller itself: clearer phases, better instrumentation, sharper retries, and more explicit recovery behavior.\n\nPerception is a good example. We stayed with classical vision partly because it was much faster to iterate on than standing up a learned perception stack with datasets, labeling, training, and retraining loops. The detectors were explicit and inspectable. The agent could help build geometry checks, debug overlays, URDF-based validations, and image-processing logic while the human often stayed in a fuzzier steering role: “this is probably the wrong region,” “show the candidate mouth pairs,” “show the peaks,” “show me where contact likely began.”\n\n**Figure 4: agent-built classical vision debug overlay**\n\nDebug overlay from the classical perception pipeline. A lot of this detector logic and its overlays were developed with the coding agent, while the human often steered more by pointing out what looked wrong or what should be made visible next.\n\nAgent limits\n\nThe strongest early gains came from [Autoresearch](https://github.com/karpathy/autoresearch)-style hill climbing. That worked well while the task still had many accessible local improvements.\n\nLater, once the problem became more nonlinear, contact-rich, variance-heavy, and entangled across perception, contact, and recovery, the limitations became much harder to ignore.\n\nOne recurring pattern was this: a local change could look promising, and still disappear once the full loop was running again.\n\nWe could isolate some perception questions. But a clean integrated benchmark split was often impossible. The behavior that mattered emerged from the interaction between visual targeting, guarded motion, contact classification, compliance, cable effects, and recovery. 
In a continuous state space with continuous dynamics, small local improvements did not reliably compose into robust end-to-end gains.\n\nThat was true not only for the agent, but also for us as collaborators. A clean division of ownership across subsystems often broke down because the decisive failures lived in the interactions between them.\n\nThe recurring failure modes were:\n\n-\nlocal-optima chasing\n\n-\nelaborating the current direction instead of challenging it\n\n-\nweak taste about which branch deserved attention\n\n-\ndifficulty identifying the true latent bottleneck\n\n-\npoor translation from offline improvement to online robustness\n\n-\nfuzzy or wrong physical assumptions about contact, motion, and geometry\n\nA concise way to say it is this: the agent was often good at **improving a direction**, but much worse at deciding whether the direction itself should survive.\n\nAnother especially important lesson was that the agent often **needed more context than it knew how to acquire by itself**.\n\nIt could use extra context very well once it existed. But it was not reliable at noticing that its world model was under-specified and then building the right context-sourcing loop on its own. We repeatedly had to intervene with more explicit simulator details, sensor-frame clarifications, robot and connector geometry, URDF constraints, and written physical assumptions.\n\nAutoresearch worked - until it didn’t\n\nOne part of the story that is easy to understate is that we **wanted** autoresearch to work. We were quite taken with the Karpathy-style idea of an agent that could keep proposing, testing, and refining improvements in a mostly autonomous loop, and early results encouraged that optimism.\n\nEarly on, that loop was highly productive because the task still had many accessible local improvements. Later, once the problem became dominated by variance, branching, contact mechanics, and ambiguous physical evidence, it became much less trustworthy. 
The loop could keep improving a direction long after that direction should have been questioned or abandoned. That was the point where autoresearch had to be demoted from main driver to subordinate tool.\n\nThe bottleneck moved from generation to selection\n\nThat shift changed the human role as well. Once coding and local iteration became cheap, the scarcer skill was no longer generating one more idea, but deciding which idea deserved attention, which evidence could be trusted, and which branch needed to be killed.\n\nIn that sense, coding agents changed what had to be learned. More of the work moved upward into problem framing, experiment design, algorithm selection, deciding what details mattered in the larger scheme, and steering generated subsystems rather than hand-authoring every line.\n\nThat was often a real advantage. But it also had a downside: on later branches the controller became highly monolithic, and working at a more abstract level may have made it easier to defer structural cleanup for too long. Since our final result was meaningful but not top-tier, we do not want to romanticize that tradeoff.\n\nObservability and tooling\n\nA recurring practical problem was that a score would move, but we still would not know why.\n\nScalar scores and arbitrary logging were too impoverished to let the agent form a good high-level model of what had happened in simulation. They were often too impoverished for us too.\n\nThat is a big part of why we built `query_trial_images.sh`, a small utility that extracted a few representative screenshots around selected trial events and timestamps. It was not just a convenience script. It was a way to turn a long messy run into a compact visual summary that both the humans and the agent could actually reason about.\n\nThat changed the workflow. Better decisions often required better packaging of evidence first.\n\nThat was also where harness constraints started to matter directly. 
Modern multimodal LLMs can consume images as well as text, but providers still limit how much image material can be sent in a single request, even when the nominal context window is very large. We hit those limits often enough, and they consistently surfaced as `HTTP 413` errors, that we had to move toward more selective, timeline-aware review instead of just dumping screenshots into the loop.\n\nThis was not only an internal tooling annoyance. It was a concrete limitation at the LLM/provider interface, and it shaped how evidence had to be packaged for the agent.\n\nWorkflow shift\n\nOne part of the project that felt genuinely new to us was the **shape of the workflow around the controller**.\n\nBy the end, the agent had become part of an asynchronous supervision loop:\n\n-\nlong-running experiments were launched on a rented remote machine\n\n-\nmilestone updates and selected images were pushed into Telegram\n\n-\nSSH and VNC let us inspect runs from elsewhere\n\n-\nsmall image-query tools let us ask targeted visual questions instead of rereading huge logs\n\n-\na foldable-phone workflow was sometimes enough to keep the loop moving\n\nFor a small team working around day jobs and family life, this changed the practical shape of robotics development. It no longer required continuous local attendance at the workstation. It became something we could supervise intermittently and remotely, with the agent surfacing the moments worth attention.\n\nBut the interesting part was not only that the workflow became more mobile. It was that it became more **chat-native**. 
The agent was helping write code, run experiments, collect evidence, summarize outcomes, and push important artifacts into communication channels.\n\n**Figure 5: mobile debugging / remote supervision workflow**\n\nFoldable-phone split-screen workflow: offline detector analysis on one side, visual overlay on the other.\n\n**Figure 6: asynchronous experiment updates**\n\nTelegram updates from long-running experiment loops, including insertion-rate summaries and milestone events such as rare full insertions.\n\nCollaboration under constraints\n\nIn theory, two people plus coding agents should have enabled a lot of parallel exploration. But another part of the story was simply the reality of the setup. We had two humans, but effectively one shared experimental setup: one rented hardware instance, one primary agent interface, and benchmarks that often ran for a long time. In practice, that often meant one person was waiting on a long integrated run while the other had ideas that could not yet be validated. The bottleneck was not just writing more code. It was getting access to trustworthy feedback from the shared setup.\n\nThat made collaboration less about independent parallel tracks and more about careful staging, handoff, and deciding which hypothesis deserved the shared experimental slot. Handoffs mattered. Shared context mattered. Staying aligned on what the system was actually doing mattered. 
For us, this started to feel like one of the frontier questions in the whole project: not only what coding agents can do, but what human collaboration around them should look like.\n\nReason for sharing\n\nThe public discussion around coding agents often feels too binary: either polished success stories or dismissive takes.\n\nOur experience was mixed in a way that feels more useful.\n\nIf we had to compress the main takeaway, it would be this:\n\ncoding agents already make classical robotics development faster and cheaper than we would have expected, but they do not remove the need for architecture, observability, physical grounding, or human judgment.\n\nThat is enough, in our view, to make explicit classical approaches worth reconsidering in places where many people might assume only learned policies are viable.\n\nTransferable lessons\n\nWe do not think the lessons here are specific only to robotics. Since our normal day jobs span other domains, projects, and industries as well, part of what makes this exercise interesting to us is precisely the question of what transfers and what does not.\n\nA few patterns seem likely to transfer to other hard engineering domains as well:\n\n-\n**Implementation gets cheaper faster than validation does.** Coding agents make implementation, instrumentation, and variant generation much cheaper, but they do not remove the need for validation. The bottleneck often moves upward: away from writing code and toward selecting hypotheses, deciding what evidence matters, and recognizing the real bottleneck.\n\n-\n**Observability becomes even more valuable.** Agents are much more effective when the system is observable. 
Logs, screenshots, targeted summaries, and compact review tools do not just help humans; they help make the problem legible to the agent.\n\n-\n**Decomposition only works where the domain allows it.** In tightly coupled systems, local improvements often fail to compose cleanly, which limits both autonomous hill-climbing and naive parallelization.\n\n-\n**Collaboration may be limited by experimental bandwidth, not coding capacity.** In practice, the scarce resource may be one simulator, one staging environment, one hardware target, or one benchmark queue. We suspect that mix - cheap implementation, expensive validation, and scarce experimental slots - will show up in many domains beyond robotics.\n\nA compact way to say it is this: coding agents seem to be changing the economics of **implementation** faster than the economics of **validation**.\n\nFinal methodological note\n\nThis retrospective itself followed the same basic human-in-the-loop pattern.\n\nIt did not come purely from memory, but also not purely from the archive. A lot of the first material existed only as messy recollections, hunches, and half-formed notes accumulated during the project. We used those as prompts, then checked and refined them against git history, archived Pi sessions, internal screenshots, and repository notes, with an agent helping turn that mix into a more coherent account.\n\nOne thing that became quite clear to us through this is that the sessions with coding agents are increasingly becoming artifacts in their own right. 
Not just the final code, not just the final score, but also the session history, the reasoning path, the tooling decisions, the dead ends, and the retries.\n\nSo even this writeup is, in a small second-order sense, part of the same experiment.\n\n**We wish all other participants a successful time in the later phases and would like to thank the organizers of this challenge for providing us the toolset to experiment.**", "url": "https://wpnews.pro/news/coding-a-classical-robot-controller-in-the-age-of-coding-agents", "canonical_source": "https://discourse.openrobotics.org/t/post-qualification-contribution-coding-a-classical-robot-controller-in-the-age-of-coding-agents-an-honest-assessment/55120", "published_at": "2026-05-27 21:52:43+00:00", "updated_at": "2026-05-27 22:14:56.775673+00:00", "lang": "en", "topics": ["robotics", "large-language-models", "ai-agents", "ai-tools", "ai-research"], "entities": ["MacCody", "AI for Industry Challenge", "Intrinsic", "UR5e", "Gazebo", "LLMs", "Coding Agents", "SFP"], "alternates": {"html": "https://wpnews.pro/news/coding-a-classical-robot-controller-in-the-age-of-coding-agents", "markdown": "https://wpnews.pro/news/coding-a-classical-robot-controller-in-the-age-of-coding-agents.md", "text": "https://wpnews.pro/news/coding-a-classical-robot-controller-in-the-age-of-coding-agents.txt", "jsonld": "https://wpnews.pro/news/coding-a-classical-robot-controller-in-the-age-of-coding-agents.jsonld"}}