Coding a Classical Robot Controller in the Age of Coding Agents


Hello everyone!

We competed in the AI for Industry Challenge under the team name MacCody and wanted to share our experiences in the Qualification Phase. We unfortunately did not make it into the top 30, but we wanted to contribute this piece of knowledge, as it may be interesting for roboticists and people working with LLMs / coding agents.

Disclaimer: this is not a polished success story, and it is not an anti-agent story either. It is a retrospective about what happened when we used coding agents heavily on a hard robotics task for six weeks, from April 1 to May 15 (not full-time, but fitted around day jobs and family life). The result was mixed: real leverage, but also clear limits.

What we were really trying to find out

The AI for Industry Challenge asked teams to control a UR5e in Gazebo and insert SFP and SC fiber connectors into ports on a randomized task board, using camera images, force/torque sensing, and joint state information.

One obvious reaction to a task like that is: this sounds like a learned-policy problem, i.e. collecting a big set of training data and training a model to control the robot.

We were interested in a different question.

Could a human plus coding-agent workflow actually build, iterate on, and debug a classical controller (that is, ordinary source code with explicit rules, rather than a policy learned from training data) far enough that it becomes meaningfully competitive in a task where many people would otherwise assume they need training, datasets, and retraining loops?

That was the real experiment for us. The same task may also serve as a benchmarking sandbox for evaluating coding agents on hard problems.

Another reason this route felt worth trying was that coding agents had changed the economics of experimentation. In our case, tokens and code generation felt cheap enough to make broad exploration attractive, partly because model access was subsidized. That cost picture may look different in the future, but during this project it mattered. It made it feel reasonable to try explicit ideas, build tooling, throw away weak directions, and keep going.

We were not only chasing one lucky run. We were trying to find explicit strategies that still worked when seeds changed, board poses changed, NIC placements changed, and grasps were perturbed.

Team context

This was a small-team effort by two fathers, done around day jobs and family life.

We were not starting from zero, but we were also not equally prepared. Prior ROS/Gazebo and robotics-research experience existed mainly on one side of the team, and that clearly lowered the early friction. Still, looking back, the biggest speedup did not come from raw prior knowledge alone. It came from adjacent experience multiplied by coding-agent automation. The agent made it cheaper to turn partial understanding into code, tools, overlays, wrappers, and experiments.

We also ended up staying with a mostly single-agent + human-in-the-loop workflow. Part of that was preference: in a noisy, contact-rich task like this, preserving one shared mental picture of what the controller was trying to do often felt more valuable than maximizing autonomous parallelism.

Part of it was also timing. Multi-agent workflows and more autonomous “dark factory” patterns still felt quite experimental to us at that point. On a problem this entangled, we suspected they were more likely to create extra chaos than useful structure.

Setup

Our setup was intentionally simple:

a rented RTX 3090 server with 10 CPU cores and more than 32 GB RAM

Pi as the coding agent interface

pi-autoresearch for autonomous benchmark / edit / analyze loops

later pi-telegram for asynchronous notifications and image pushes

a GitHub Copilot Business subscription for sustained model access

The compute side was rented by the hour rather than built as a dedicated training setup. That does not mean the project was cheap in human time. It does mean the cost profile felt very different from a training-heavy route.

Outcome

Early on, when the leaderboard was still sparse, we briefly sat at rank 6 with a score of 47.34. By the end, our final public position was rank 65 with 112.75. Our best recorded internal score reached 151.71. But because the task had real variance and the policy was still moving during implementation sprints, we treat that as a peak observation rather than a stable final level.

That is not proof that classical control beats learned methods. It is not proof that we solved the challenge. We did not finish in the top 30.

But it was enough to change our prior. It made us take seriously the idea that coding agents can materially accelerate classical robotics work, to the point where explicit controllers are worth reconsidering in places where many people might otherwise default to training.

Controller evolution

The policy was not one clever trick. Over time, it turned into a staged classical visuotactile controller.

A rough summary is:

visual acquisition and target selection

prealignment above the board

guarded descent

contact classification

local search near the socket mouth

compliant insertion

verification and recovery

That summary is cleaner than the lived development process. In practice, the controller became more explicit because we kept running into the same hard question in different forms: what should the system trust right now? Vision? Contact? Recovery logic? Retreat logic? Retry logic?
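To make the staged structure concrete, here is a minimal sketch of such a phase machine in Python. It is an illustration only, not the actual controller code: the phase names mirror the list above, while the transition table, the retry budget, and the names `Phase`, `step`, and `run` are assumptions made for this example.

```python
from enum import Enum, auto


class Phase(Enum):
    ACQUIRE = auto()   # visual acquisition and target selection
    PREALIGN = auto()  # prealignment above the board
    DESCEND = auto()   # guarded descent
    CLASSIFY = auto()  # contact classification
    SEARCH = auto()    # local search near the socket mouth
    INSERT = auto()    # compliant insertion
    VERIFY = auto()    # verification
    RECOVER = auto()   # recovery: retreat, then start over
    DONE = auto()
    FAILED = auto()


# Assumed happy-path successor of each phase.
NEXT_ON_SUCCESS = {
    Phase.ACQUIRE: Phase.PREALIGN,
    Phase.PREALIGN: Phase.DESCEND,
    Phase.DESCEND: Phase.CLASSIFY,
    Phase.CLASSIFY: Phase.INSERT,   # contact looks like entry
    Phase.SEARCH: Phase.INSERT,
    Phase.INSERT: Phase.VERIFY,
    Phase.VERIFY: Phase.DONE,
    Phase.RECOVER: Phase.ACQUIRE,
}


def step(phase, ok, retries_left):
    """Return (next_phase, retries_left) for one phase outcome."""
    if phase in (Phase.DONE, Phase.FAILED):
        return phase, retries_left
    if ok:
        return NEXT_ON_SUCCESS[phase], retries_left
    if phase is Phase.CLASSIFY:
        # Contact that does not look like entry: search locally first.
        return Phase.SEARCH, retries_left
    if phase is Phase.RECOVER or retries_left == 0:
        return Phase.FAILED, retries_left
    return Phase.RECOVER, retries_left - 1


def run(outcomes, max_retries=2):
    """Drive the machine with a scripted list of per-phase outcomes."""
    phase, retries = Phase.ACQUIRE, max_retries
    trace = [phase]
    for ok in outcomes:
        phase, retries = step(phase, ok, retries)
        trace.append(phase)
        if phase in (Phase.DONE, Phase.FAILED):
            break
    return trace
```

Scripted outcomes stand in for the real phase results here; the point is only that every handoff, including the decision to recover, becomes an explicit and testable transition.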

Early phase: just getting a classical baseline to work

The earliest viable versions were comparatively simple:

color-based segmentation for SFP and SC cues

iterative image-based correction loops

board-orientation estimation for prealignment

guarded descent with force-aware break and retreat logic

This was enough to produce a real classical baseline. It was not enough once randomization and late-stage contact started dominating the failures.
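As an illustration of the last item, a guarded descent with a force-based break and a retreat target could be sketched as follows. This is a simplified stand-in, not the original implementation: `read_force_z` is a hypothetical callback, and the step size, force threshold, and retreat distance are placeholder values.

```python
def guarded_descent(read_force_z, z_start, z_min, step=0.001,
                    contact_force=5.0, retreat=0.005):
    """Step downward until the measured force reaches a threshold.

    read_force_z(z) is a hypothetical callback that returns the normal
    force (N) measured at height z (m). All numeric defaults are
    placeholders, not tuned values.

    Returns (status, z_stop, z_retreat), where z_retreat is the height
    to back off to after contact (never above z_start).
    """
    z = z_start
    while z > z_min:
        z = max(z - step, z_min)  # never go below the safety floor
        if abs(read_force_z(z)) >= contact_force:
            z_retreat = min(z + retreat, z_start)
            return "contact", z, z_retreat
    return "no_contact", z, z
```

A caller would move the arm to `z_retreat` after a `"contact"` result and treat `"no_contact"` as a sign that the target height estimate was wrong.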

Mid phase: the system got more geometric and more explicit

Later, the perception stack became more geometric and more constrained by the board model:

board silhouette detection

homography / perspective rectification into a canonical board view

URDF-guided ROI targeting for likely NIC / rail locations

Generalized Hough Transform proposals for faceplate candidates

chamfer matching for edge-template verification

geometric validation of aperture spacing and faceplate dimensions

circular ferrule detection for SC with geometric consistency checks

At the implementation level, a lot of this sat on top of classical OpenCV building blocks rather than a learned perception stack.
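The geometric validation step lends itself to a small example. The sketch below checks candidate aperture centers, assumed to be already expressed in millimetres in the rectified board view, against an expected pitch. The function name, the tolerance, and the pitch used in the usage note are illustrative assumptions, not real connector dimensions.

```python
from itertools import combinations
from math import hypot


def plausible_pairs(centers_mm, expected_pitch_mm, tol_mm=1.0):
    """Return (i, j, distance) for candidate pairs near the expected pitch.

    centers_mm: list of (x, y) candidate centers in the rectified view.
    expected_pitch_mm and tol_mm are placeholders for values that would
    come from the board model.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(centers_mm), 2):
        d = hypot(a[0] - b[0], a[1] - b[1])
        if abs(d - expected_pitch_mm) <= tol_mm:
            pairs.append((i, j, d))
    return pairs
```

For example, `plausible_pairs([(0, 0), (14, 0), (40, 3)], 14.0)` keeps only the first two candidates as a pair and rejects the outlier.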

At the same time, the controller became more state-machine-like, not because we had started with a love of state machines, but because the handoffs kept becoming decisive. A lot of the real work was no longer “detect the board” or “move down.” It was questions like:

when do we trust the current visual evidence enough to descend?

when has descent really become contact?

when does contact look like likely entry rather than just bumping the world?

when should the controller stop trying to force the current plan and instead recover?
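Questions like these can be turned into an explicit classifier rather than a single force threshold. The following is a deliberately simple sketch under assumed signals: the inputs, the label names, and all thresholds are hypothetical and only show the shape of such a rule set.

```python
def classify_contact(force_z, depth_below_surface_mm, lateral_force,
                     contact_n=3.0, entry_depth_mm=1.5, jam_lateral_n=4.0):
    """Label a contact event from force and depth evidence.

    force_z: normal force in N.
    depth_below_surface_mm: how far the tip is below the expected
        faceplate surface (positive means inside the opening).
    lateral_force: sideways force magnitude in N.
    All thresholds are placeholder values.
    """
    if abs(force_z) < contact_n:
        return "free"
    if depth_below_surface_mm >= entry_depth_mm:
        # Below the surface with force: either entering or wedged.
        if abs(lateral_force) >= jam_lateral_n:
            return "jammed"
        return "likely_entry"
    return "surface_contact"
```

The useful property is that "bumping the world" and "probably entering the socket" become different outputs that later phases can react to differently.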

Late phase: the hard part was no longer mainly finding the target

By the later branch, the controller increasingly looked like a classical peg-in-hole controller rather than a generic vision-guided one:

contact classification instead of pure force thresholding

local micro-map style search

ring-based retries

boundary-aware probing

entrance-search behavior near the socket mouth

verify/recover logic that treated partial progress, likely seating, and true insertion as different states
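A ring-based retry pattern can be sketched as a generator of lateral offsets around the current estimate of the socket mouth. The radii and the number of points per ring below are arbitrary example values, and `ring_offsets` is a name invented for this illustration.

```python
from math import cos, sin, tau


def ring_offsets(radii_mm, points_per_ring=8):
    """Yield (dx, dy) offsets in mm: the center first, then each ring.

    radii_mm: increasing radii to probe, e.g. [0.5, 1.0].
    """
    yield (0.0, 0.0)
    for r in radii_mm:
        for k in range(points_per_ring):
            angle = tau * k / points_per_ring
            yield (r * cos(angle), r * sin(angle))
```

A retry loop would try each offset in turn with a guarded descent and stop at the first one that classifies as likely entry.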

Under stronger randomization, the targeting logic also became more structural. Later detector work used classical geometric methods such as:

Gaussian smoothing and Canny edge detection

probabilistic Hough line detection

length-weighted angle histograms to estimate the dominant board direction

projection profiles and peak finding to identify rail / bracket structure

physical spacing constraints and geometric pairing rules for plausible SFP mouth candidates
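The length-weighted angle histogram from this list can be written in a few lines of plain Python. The sketch assumes line segments have already been detected (for example by a probabilistic Hough transform) and are given as endpoint tuples; the bin width and the function name are assumptions for this example.

```python
from math import atan2, degrees, hypot


def dominant_direction_deg(segments, bin_deg=5):
    """Estimate the dominant line direction in [0, 180) degrees.

    segments: iterable of (x1, y1, x2, y2). Each segment votes for its
    angle bin with a weight equal to its length, so a few long edges
    outweigh many short ones. Returns the center of the heaviest bin.
    """
    n_bins = 180 // bin_deg
    hist = [0.0] * n_bins
    for x1, y1, x2, y2 in segments:
        length = hypot(x2 - x1, y2 - y1)
        if length == 0:
            continue
        # Direction is undirected, so fold angles into [0, 180).
        angle = degrees(atan2(y2 - y1, x2 - x1)) % 180.0
        hist[int(angle // bin_deg) % n_bins] += length
    best = max(range(n_bins), key=hist.__getitem__)
    return best * bin_deg + bin_deg / 2
```

One known weakness of this simple version is that directions close to 0 or 180 degrees are split across the first and last bins; a more careful version would smooth the histogram circularly.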

These two images show that phase well.

Figure 1: late-stage edge detector overlay

Late-stage live overlay from the edge-based detector work. Candidate lines, selected SFP pair hypotheses, and the board structure are all made explicit rather than hidden behind a single detector output.

Figure 2: edge projection and peak selection

ROI view together with a projection profile and selected peaks. This is a good example of how the later randomization-heavy phase relied on explicit geometric signals instead of only broad region heuristics.
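The projection-and-peaks idea can be illustrated without any image library. In this sketch the edge mask is a plain list of rows of 0/1 values, and the peak picker uses a simple height threshold plus a minimum separation; both function names and all parameters are assumptions for illustration rather than the detector's real settings.

```python
def column_profile(edge_mask):
    """Sum a binary edge mask (list of equal-length rows) per column."""
    return [sum(col) for col in zip(*edge_mask)]


def find_peaks(profile, min_height, min_distance):
    """Return indices of local maxima, strongest first when they conflict.

    A candidate must reach min_height and be a local maximum. Candidates
    closer than min_distance to an already accepted, stronger peak are
    dropped. The result is sorted by index.
    """
    candidates = [
        i for i in range(1, len(profile) - 1)
        if profile[i] >= min_height
        and profile[i] > profile[i - 1]
        and profile[i] >= profile[i + 1]
    ]
    kept = []
    for i in sorted(candidates, key=lambda i: profile[i], reverse=True):
        if all(abs(i - j) >= min_distance for j in kept):
            kept.append(i)
    return sorted(kept)
```

The selected peak positions can then be checked against expected rail or bracket spacing before any of them is trusted as a target.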

By that point, the bottleneck was no longer mainly target finding. It was converting near-mouth alignment into real seating. That led to more careful jam logic, stronger but still bounded compliant pushes, explicit cable-drift handling, and retry geometries that tried to use cable flex constructively instead of treating it only as noise.

Figure 3: full insertion milestone

One of our observed full insertions in simulation. For us this was an important breakthrough, but not yet a statistically robust solution.

Agent leverage

The most obvious benefit was speed, but the more important story was where that speed actually showed up.

The agent helped most in the engineering loop around the controller, not only in the controller logic itself. It helped with simulator bring-up, wrappers, benchmark orchestration, telemetry, observability, documentation, memory artifacts, perception tooling, and bounded controller evolution inside a fast experiment loop.

That distinction matters. A lot of the leverage did not look like “the agent invented the winning idea.” It looked more like this:

building one more debug overlay so we could see candidate geometry instead of guessing

writing one more benchmark wrapper so variants could be compared faster

restructuring a phase boundary so retries made more sense

turning vague suspicions into inspectable outputs

There were also two less obvious benefits.

The first was domain ramp-up. This was a robotics subdomain we did not work in every day. The agent helped with vocabulary, decomposition, and building a more explicit mental model of insertion, contact, seating, jamming, recovery, and verification.

The second was controller restructuring, not only tuning. A meaningful part of the value was that the agent helped repeatedly reshape the controller itself: clearer phases, better instrumentation, sharper retries, and more explicit recovery behavior.

Perception is a good example. We stayed with classical vision partly because it was much faster to iterate on than standing up a learned perception stack with datasets, labeling, training, and retraining loops. The detectors were explicit and inspectable. The agent could help build geometry checks, debug overlays, URDF-based validations, and image-processing logic while the human often stayed in a fuzzier steering role: “this is probably the wrong region,” “show the candidate mouth pairs,” “show the peaks,” “show me where contact likely began.”

Figure 4: agent-built classical vision debug overlay

Debug overlay from the classical perception pipeline. A lot of this detector logic and its overlays were developed with the coding agent, while the human often steered more by pointing out what looked wrong or what should be made visible next.

Agent limits

The strongest early gains came from Autoresearch-style hill climbing. That worked well while the task still had many accessible local improvements.

Later, once the problem became more nonlinear, contact-rich, variance-heavy, and entangled across perception, contact, and recovery, the limitations became much harder to ignore.

One recurring pattern was this: a local change could look promising, and still disappear once the full loop was running again.

We could isolate some perception questions. But a clean integrated benchmark split was often impossible. The behavior that mattered emerged from the interaction between visual targeting, guarded motion, contact classification, compliance, cable effects, and recovery. In a continuous state space with continuous dynamics, small local improvements did not reliably compose into robust end-to-end gains.
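One way to guard against gains that vanish in the full loop is to compare a variant against the baseline on the same seeds and look at the spread of the per-seed differences. The sketch below is a rough heuristic, not a formal statistical test, and it is not the benchmark tooling from the project; the function names and the factor `z` are assumptions.

```python
from math import sqrt
from statistics import mean, stdev


def paired_gain(baseline, variant):
    """Mean per-seed score gain and its standard error.

    baseline, variant: scores from the same seeds, in the same order.
    """
    if len(baseline) != len(variant) or len(baseline) < 2:
        raise ValueError("need two equally long score lists, length >= 2")
    diffs = [v - b for b, v in zip(baseline, variant)]
    return mean(diffs), stdev(diffs) / sqrt(len(diffs))


def looks_real(baseline, variant, z=2.0):
    """Heuristic: the gain exceeds z standard errors of the differences."""
    gain, se = paired_gain(baseline, variant)
    return gain - z * se > 0
```

With only a handful of seeds this stays crude, but it makes it harder to celebrate a variant whose advantage comes from one or two lucky runs.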

That was true not only for the agent, but also for us as collaborators. A clean division of ownership across subsystems often broke down because the decisive failures lived in the interactions between them.

The recurring failure modes were:

local-optima chasing

elaborating the current direction instead of challenging it

weak taste about which branch deserved attention

difficulty identifying the true latent bottleneck

poor translation from offline improvement to online robustness

fuzzy or wrong physical assumptions about contact, motion, and geometry

A concise way to say it is this: the agent was often good at improving a direction, but much worse at deciding whether the direction itself should survive.

Another especially important lesson was that the agent often needed more context than it knew how to acquire by itself.

It could use extra context very well once it existed. But it was not reliable at noticing that its world model was under-specified and then building the right context-sourcing loop on its own. We repeatedly had to intervene with more explicit simulator details, sensor-frame clarifications, robot and connector geometry, URDF constraints, and written physical assumptions.

Autoresearch worked - until it didn’t

One part of the story that is easy to understate is that we wanted autoresearch to work. We were quite taken with the Karpathy-style idea of an agent that could keep proposing, testing, and refining improvements in a mostly autonomous loop, and early results encouraged that optimism.

Early on, that loop was highly productive because the task still had many accessible local improvements. Later, once the problem became dominated by variance, branching, contact mechanics, and ambiguous physical evidence, it became much less trustworthy. The loop could keep improving a direction long after that direction should have been questioned or abandoned. That was the point where autoresearch had to be demoted from main driver to subordinate tool.

The bottleneck moved from generation to selection

That shift changed the human role as well. Once coding and local iteration became cheap, the scarcer skill was no longer generating one more idea, but deciding which idea deserved attention, which evidence could be trusted, and which branch needed to be killed.

In that sense, coding agents changed what had to be learned. More of the work moved upward into problem framing, experiment design, algorithm selection, deciding what details mattered in the larger scheme, and steering generated subsystems rather than hand-authoring every line.

That was often a real advantage. But it also had a downside: on later branches the controller became highly monolithic, and working at a more abstract level may have made it easier to defer structural cleanup for too long. Since our final result was meaningful but not top-tier, we do not want to romanticize that tradeoff.

Observability and tooling

A recurring practical problem was that a score would move, but we still would not know why.

Scalar scores and arbitrary logging were too impoverished to let the agent form a good high-level model of what had happened in simulation. They were often too impoverished for us too.

That is a big part of why we built query_trial_images.sh, a small utility that extracted a few representative screenshots around selected trial events and timestamps. It was not just a convenience script. It was a way to turn a long messy run into a compact visual summary that both the humans and the agent could actually reason about.
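The core selection logic of such a tool can be sketched in Python. This is not the script itself, whose internals are not shown here: it only illustrates picking, for each event, the frame whose timestamp is closest. The function names and the event labels in the usage note are invented for the example.

```python
from bisect import bisect_left


def nearest_frame(frame_times, t):
    """Return the frame timestamp closest to t.

    frame_times must be non-empty and sorted ascending.
    """
    i = bisect_left(frame_times, t)
    if i == 0:
        return frame_times[0]
    if i == len(frame_times):
        return frame_times[-1]
    before, after = frame_times[i - 1], frame_times[i]
    return before if t - before <= after - t else after


def select_frames(frame_times, events):
    """Map each (label, time) event to (label, nearest frame time)."""
    return [(label, nearest_frame(frame_times, t)) for label, t in events]
```

For instance, with frames every 0.5 s, an event labelled `"contact"` at 0.6 s would be paired with the frame at 0.5 s, which keeps the review set down to one image per event.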

That changed the workflow. Better decisions often required better packaging of evidence first.

That was also where harness constraints started to matter directly. Modern multimodal LLMs can consume images as well as text, but providers still limit how much image material can be sent in a single request, even when the nominal context window is very large. We hit those limits often enough, and they consistently surfaced as HTTP 413 errors, that we had to move toward more selective, timeline-aware review instead of just dumping screenshots into the loop.

This was not only an internal tooling annoyance. It was a concrete limitation at the LLM/provider interface, and it shaped how evidence had to be packaged for the agent.
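A simple way to respect such limits is to split the selected images into batches that stay under a byte budget and an image count. The sketch below is an assumption-laden illustration: the actual provider limits are not stated here and vary, so `max_bytes` and `max_count` are placeholders to be set by the caller.

```python
def batch_by_budget(items, max_bytes, max_count):
    """Split (name, size_bytes) items into sequential batches.

    Items keep their original (for example chronological) order. A new
    batch starts whenever adding the next item would exceed max_bytes
    or max_count. Raises ValueError if a single item cannot fit at all.
    """
    batches, current, used = [], [], 0
    for name, size in items:
        if size > max_bytes:
            raise ValueError(f"{name} alone exceeds the byte budget")
        if current and (used + size > max_bytes or len(current) >= max_count):
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        batches.append(current)
    return batches
```

Each batch would then go into its own request, ideally with a short text summary of the timeline so the agent can connect the batches.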

Workflow shift

One part of the project that felt genuinely new to us was the shape of the workflow around the controller.

By the end, the agent had become part of an asynchronous supervision loop:

long-running experiments were launched on a rented remote machine

milestone updates and selected images were pushed into Telegram

SSH and VNC let us inspect runs from elsewhere

small image-query tools let us ask targeted visual questions instead of rereading huge logs

a foldable-phone workflow was sometimes enough to keep the loop moving

For a small team working around day jobs and family life, this changed the practical shape of robotics development. It no longer required continuous local attendance at the workstation. It became something we could supervise intermittently and remotely, with the agent surfacing the moments worth attention. But the interesting part was not only that the workflow became more mobile. It was that it became more chat-native. The agent was helping write code, run experiments, collect evidence, summarize outcomes, and push important artifacts into communication channels.

Figure 5: mobile debugging / remote supervision workflow

Foldable-phone split-screen workflow: offline detector analysis on one side, visual overlay on the other.

Figure 6: asynchronous experiment updates

Telegram updates from long-running experiment loops, including insertion-rate summaries and milestone events such as rare full insertions.

Collaboration under constraints

In theory, two people plus coding agents should have enabled a lot of parallel exploration. But another part of the story was simply the reality of the setup. We had two humans, but effectively one shared experimental setup: one rented hardware instance, one primary agent interface, and benchmarks that often ran for a long time. In practice, that often meant one person was waiting on a long integrated run while the other had ideas that could not yet be validated. The bottleneck was not just writing more code. It was getting access to trustworthy feedback from the shared setup.

That made collaboration less about independent parallel tracks and more about careful staging, handoff, and deciding which hypothesis deserved the shared experimental slot. Handoffs mattered. Shared context mattered. Staying aligned on what the system was actually doing mattered. For us, this started to feel like one of the frontier questions in the whole project: not only what coding agents can do, but what human collaboration around them should look like.

Reason for sharing

The public discussion around coding agents often feels too binary: either polished success stories or dismissive takes.

Our experience was mixed in a way that feels more useful:

If we had to compress the main takeaway, it would be this: coding agents already make classical robotics development faster and cheaper than we would have expected, but they do not remove the need for architecture, observability, physical grounding, or human judgment.

That is enough, in our view, to make explicit classical approaches worth reconsidering in places where many people might assume only learned policies are viable.

Transferable lessons

We do not think the lessons here are specific only to robotics. Since our normal day jobs span other domains, projects, and industries as well, part of what makes this exercise interesting to us is precisely the question of what transfers and what does not.

A few patterns seem likely to transfer to other hard engineering domains as well:

Implementation gets cheaper faster than validation does. Coding agents make implementation, instrumentation, and variant generation much cheaper, but they do not remove the need for validation. The bottleneck often moves upward: away from writing code and toward selecting hypotheses, deciding what evidence matters, and recognizing the real bottleneck.

Observability becomes even more valuable. Agents are much more effective when the system is observable. Logs, screenshots, targeted summaries, and compact review tools do not just help humans; they help make the problem legible to the agent.

Decomposition only works where the domain allows it. In tightly coupled systems, local improvements often fail to compose cleanly, which limits both autonomous hill-climbing and naive parallelization.

Collaboration may be limited by experimental bandwidth, not coding capacity. In practice, the scarce resource may be one simulator, one staging environment, one hardware target, or one benchmark queue. We suspect that mix - cheap implementation, expensive validation, and scarce experimental slots - will show up in many domains beyond robotics.

A compact way to say it is this: coding agents seem to be changing the economics of implementation faster than the economics of validation.

Final methodological note

This retrospective itself followed the same basic human-in-the-loop pattern.

It did not come purely from memory, but also not purely from the archive. A lot of the first material existed only as messy recollections, hunches, and half-formed notes accumulated during the project. We used those as prompts, then checked and refined them against git history, archived Pi sessions, internal screenshots, and repository notes, with an agent helping turn that mix into a more coherent account.

One thing that became quite clear to us through this is that the sessions with coding agents are increasingly becoming artifacts in their own right. Not just the final code, not just the final score, but also the session history, the reasoning path, the tooling decisions, the dead ends, and the retries.

So even this writeup is, in a small second-order sense, part of the same experiment.

We wish all other participants success in the later phases and would like to thank the organizers of this challenge for providing us with the toolset to experiment.

Links

source & further reading

discourse.openrobotics.org — original article
