A Game of Robot Telephone

A developer tested LLM code translation by passing a Go program through nine other languages and back to Go, finding that the final Go version grew from 94 to 443 lines while still producing the correct output. The experiment used a chain of LLM-generated rewrites through TypeScript, Python, Ruby, C++, Java, Haskell, Common Lisp, Zig, Rust, and back to Go, with each step verified against a test API.

## Intro

Way back when [AltaVista Babel Fish](https://en.wikipedia.org/wiki/Babel_Fish_(website)) first appeared online, it became a fun game on IRC to take a phrase, translate it through a chain of languages, and then back into English. We all also played the game of telephone as kids, trying to pass a message through the class by whispering in each other's ears. Sometimes the result was surprisingly poetic, but often it was complete gibberish as errors compounded and mutated along the way.

With the endless stream of LLM-fueled "rewrite this in X" posts doing the rounds, I thought it would be fun to try a similar game, but with code:

- Start with a small but non-trivial program in Go
- Pass it through a chain of LLM-generated rewrites
- Bring it back to Go
- See what survived

The final program produced the right answer, but grew to nearly five times its original size thanks to a grab bag of semantic souvenirs it had picked up from the languages it passed through.

## The Task

The task the code performs should have enough moving parts to make translation interesting, but be everyday enough to be achievable without heavy frameworks or a multitude of libraries. The program I settled on does the following:

- Accepts and validates a URL from the command line
- Makes an HTTP request to the URL to retrieve a list of TODOs in JSON format
- Parses the JSON response into values representing TODOs
- Reads the current local date
- Parses and validates the TODO deadline dates in YYYY-MM-DD format
- Groups TODOs by user ID
- Counts completed and overdue TODOs per user
- Sorts summaries by completed/overdue counts
- Formats a fixed-width table of results and prints to stdout

The initial implementation relies entirely on Go's well-suited standard library.
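The grouping, counting and sorting steps listed above can be sketched roughly as follows. This is a minimal sketch and not the original implementation: the `Todo` and `Summary` types and the `parseDay` and `summarize` functions are hypothetical names, and both the definition of overdue (not completed, due before today) and the sort order (completed descending, then missed descending, then user ID ascending) are assumptions.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Todo holds the fields the task needs. The struct is hypothetical.
type Todo struct {
	UserID    int    `json:"userId"`
	Completed bool   `json:"completed"`
	DueDate   string `json:"dueDate"`
}

// Summary holds the per-user counts that end up in the table.
type Summary struct {
	UserID, Completed, Missed int
}

// parseDay parses and validates a YYYY-MM-DD date.
func parseDay(s string) (time.Time, error) {
	return time.Parse("2006-01-02", s)
}

// summarize groups todos by user ID and counts completed and missed items.
// Assumption: a todo is missed when it is not completed and due before today.
func summarize(todos []Todo, today time.Time) ([]Summary, error) {
	byUser := map[int]*Summary{}
	for _, t := range todos {
		due, err := parseDay(t.DueDate)
		if err != nil {
			return nil, fmt.Errorf("invalid dueDate %q: %w", t.DueDate, err)
		}
		s, ok := byUser[t.UserID]
		if !ok {
			s = &Summary{UserID: t.UserID}
			byUser[t.UserID] = s
		}
		switch {
		case t.Completed:
			s.Completed++
		case due.Before(today):
			s.Missed++
		}
	}

	out := make([]Summary, 0, len(byUser))
	for _, s := range byUser {
		out = append(out, *s)
	}
	// Assumed order: completed desc, then missed desc, then user ID asc.
	sort.Slice(out, func(i, j int) bool {
		a, b := out[i], out[j]
		if a.Completed != b.Completed {
			return a.Completed > b.Completed
		}
		if a.Missed != b.Missed {
			return a.Missed > b.Missed
		}
		return a.UserID < b.UserID
	})
	return out, nil
}

func main() {
	// Use the local calendar date, normalized the same way as due dates.
	today, err := parseDay(time.Now().Format("2006-01-02"))
	if err != nil {
		panic(err)
	}
	todos := []Todo{
		{UserID: 1, Completed: false, DueDate: "1900-01-01"},
		{UserID: 1, Completed: true, DueDate: "2999-12-31"},
	}
	rows, err := summarize(todos, today)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%-6s%-11s%s\n", "USER", "COMPLETED", "MISSED")
	for _, r := range rows {
		fmt.Printf("%-6d%-11d%d\n", r.UserID, r.Completed, r.Missed)
	}
}
```

Note how much the struct types decide up front: an integer user ID and a boolean `completed` leave no room for interpretation.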
## Example API Response

```json
{
  "userId": 1,
  "id": 1,
  "title": "delectus aut autem",
  "completed": false,
  "dueDate": "1900-01-01"
},
{
  "userId": 1,
  "id": 2,
  "title": "quis ut nam facilis et officia qui",
  "completed": true,
  "dueDate": "2999-12-31"
},
...
```

- The input has twenty TODOs spread across seven users.
- Due dates range from 1900 to 2999, so which TODOs count as overdue can depend on the date at the time of running.
- There are no "poorly formatted" inputs.

## Example output

```
USER  COMPLETED  MISSED
3     2          2
2     2          1
4     2          1
5     1          3
1     1          1
7     0          1
6     0          0
```

## Links in the Chain

The full chain of languages used was:

```
Go -> TypeScript -> Python -> Ruby -> C++ -> Java -> Haskell -> Common Lisp -> Zig -> Rust -> Go
```

Every step was run by a fresh Codex process, and executed in an isolated worktree. The prompt used is given in full below. It details what to deliver and how to check the results (detailed below). It also encourages use of idiomatic code, installation of a toolchain and use of popular libraries. I found that Codex was VERY prone to, for example, writing its own JSON parser instead of installing a toolchain.

## The Prompt

Each generated project contains its own build files, dependencies and `run.sh`, a wrapper script that accepts a single argument (the TODOs endpoint) and runs the newly generated code.

## An Oracle to Guide Us

The API also exposed a `POST /conform` endpoint. Codex could run the newly generated project against `/todos`, gather the stdout output and post it to this endpoint to get feedback as to whether it performed its task correctly. If it failed, Codex could inspect the program locally, repair it and try again. This was to prevent Codex from being sneaky and peeking at expected outputs or another canonical implementation. And Codex is really sneaky.

## Results

Every language in the chain was able to generate the expected table.
The original Go implementation was 94 lines, but the final one grew to 443 lines:

| | Original Go | Final Go |
|---|---|---|
| Main implementation | 94 lines | 443 lines |
| User IDs | integers | arbitrary JSON values |
| Completed | Boolean | `true`, `"true"` or numeric `1` |
| Fields | read when needed | all required up front |
| HTTP timeout | 10 seconds | none |
| Date parsing | Go standard library | hand-written validation |
| HTTP status descriptions | Go standard library | hand-written lookup table |

None of the extra machinery changed the fixture output. Most of it existed to preserve decisions made by intermediate implementations and the accumulated "backward compatibility" behavior of previous steps.

## Truth Gets Complicated

In the original Go implementation, `completed` was [a simple boolean value](https://github.com/minikomi/semantic_drift/blob/ed8ce179b9ebd278c49706f3828d40fe11fb5603/runs/latest/01-go/project/main.go#L12-L17) within the JSON emitted by the API:

```go
Completed bool `json:"completed"`
```

Immediately, the TypeScript rewrite changes this behavior subtly. In Go, a type mismatch in the JSON, like a string where a bool is expected, returns an error. TypeScript with axios, however, trusts the type annotation at compile time but does no runtime validation. Any non-empty string value for `completed`, even `"false"`, would register as true. In practice, the API never returns strange data here, but it has repercussions down the line.

Python and Ruby supplied their own ideas of truthiness, simply checking `if todo.completed`. Java then used Jackson's coercion rules, and by the Haskell stage the behavior had been made explicit: Boolean `true`, the string `"true"` and numeric `1` were true. That rule survives and becomes encoded in the Common Lisp, Zig and Rust implementations.
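To make the starting point of this drift concrete, here is a minimal sketch of the strictness the original Go program got for free. The `Todo` struct and `decodeTodo` helper are hypothetical and reduced to the one field under discussion; the only behavior relied on is that `encoding/json` returns an error when a JSON string or number is decoded into a `bool` field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Todo is a hypothetical struct carrying only the completed field.
type Todo struct {
	Completed bool `json:"completed"`
}

// decodeTodo decodes one JSON object into a Todo.
// A type mismatch on completed is reported as an error, not coerced.
func decodeTodo(raw string) (Todo, error) {
	var t Todo
	err := json.Unmarshal([]byte(raw), &t)
	return t, err
}

func main() {
	inputs := []string{
		`{"completed": true}`,   // accepted
		`{"completed": "true"}`, // rejected: string where a bool is expected
		`{"completed": 1}`,      // rejected: number where a bool is expected
	}
	for _, raw := range inputs {
		t, err := decodeTodo(raw)
		fmt.Printf("%-24s completed=%v err=%v\n", raw, t.Completed, err)
	}
}
```

Every later stage that accepted `"true"` or `1` was loosening this contract, even though the fixture never exercised the difference.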
### As Boolean

Lisp:

```lisp
(defun as-boolean (value)
  (or (eq value 'yason:true)
      (and (stringp value) (string= value "true"))
      (and (numberp value) (= value 1))))
```

Zig:

```zig
fn asBoolean(value: std.json.Value) bool {
    return switch (value) {
        .bool => |b| b,
        .string => |s| std.mem.eql(u8, s, "true"),
        .integer => |n| n == 1,
        .float => |n| n == 1.0,
        .number_string => |s| std.mem.eql(u8, s, "1") or std.mem.eql(u8, s, "1.0"),
        else => false,
    };
}
```

Rust:

```rust
fn as_boolean(value: &Value) -> bool {
    match value {
        Value::Bool(value) => *value,
        Value::String(value) => value == "true",
        Value::Number(value) => number_is_one(value),
        _ => false,
    }
}
```

When the program returns to Go it looks like this:

```go
func asBoolean(value any) bool {
	switch v := value.(type) {
	case bool:
		return v
	case string:
		return v == "true"
	case json.Number:
		return numberIsOne(v)
	default:
		return false
	}
}
```

The final Go program carefully preserves the semantics of the boolean coercion inherited from the stages before it.

This is the main pattern in the chain. A language adds an interpretation, the next translation treats it as intentional, and eventually it becomes explicit compatibility code. Because the oracle pins the output, these rules have no effect on it, but they add cognitive cruft.

## What's in an ID

The original program decoded `userId` directly into an `int`. The TypeScript annotation looked equally strict but again provides no runtime validation. Python and Ruby use dicts keyed by `todo["userId"]`, allowing any type to be used as a hash key. Once we hit the C++ stage, the ambiguity is made official by accepting any JSON value as a user ID.

From that point onward, user IDs could be strings, numbers, Booleans, null, arrays or objects. The implementation needed rules for grouping, displaying and sorting all of them. The final Go program therefore has:

- `jsonKey` to serialize values for grouping
- `displayValue` to print arbitrary JSON values
- `compareUserID` to order mixed types
- `cloneJSONValue`, which no longer did anything

It also preserved JSON numbers as text.
`1`, `1.0` and `1e0` could become three different grouping keys while all being displayed as `1`. None of this was needed by the task: the fixture only contains integer IDs, a fact which was encoded succinctly in the original implementation by the type in the Go structs. Again the ambiguity of the oracle and the passage through dynamic languages caused downstream ambiguities to congregate and gave us a whole bunch of fresh souvenirs.

## Error Handling Fossilizes

The original Go client got HTTP status text from the standard library. Later languages exposed status text differently, or not at all. By the Common Lisp stage the program carried its own table of HTTP reason phrases. Zig copied it. Rust copied it. The final Go translation copied it back into a language which already knew how to produce those strings. The final program contains a switch covering status codes from `400 Bad Request` to `505 HTTP Version Not Supported`.

This is not an LLM inventing random code. It is doing exactly what the setup asked: preserving observable behavior from its source. The problem is that an incidental workaround had quietly become observable behavior. This is mainly a failure of the prompt, which too strictly encouraged treating the previous stage's source as the canonical example.

## Useful Behavior Disappears

Drift did not only add things. The original Go HTTP client had a ten-second timeout. TypeScript, Python, Ruby, C++, Java, Haskell and Common Lisp all retained some form of timeout. During the Zig translation step, the timeout was dropped. With no signal from there onward, the Rust translation had no timeout either, and the final Go program calls `http.Get` directly with no timeout specified. In our case, the TODOs API never serves a slow response and all stages pass without a problem.

This highlights another side to fixture-driven conformance: behavior tested by the fixture becomes sacred. Behavior outside it can disappear without a trace.
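In Go the lost behavior amounts to a single field. A minimal sketch, assuming hypothetical `newClient` and `hasDeadline` helpers and a `todoTimeout` constant; only the ten-second value comes from the original program, the rest is illustrative:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// todoTimeout matches the ten-second limit of the original client.
const todoTimeout = 10 * time.Second

// newClient returns a client whose requests fail once todoTimeout elapses.
func newClient() *http.Client {
	return &http.Client{Timeout: todoTimeout}
}

// hasDeadline reports whether requests made with c are bounded in time.
// A zero Timeout on http.Client means no timeout at all.
func hasDeadline(c *http.Client) bool {
	return c.Timeout > 0
}

func main() {
	fmt.Println("explicit client bounded:", hasDeadline(newClient()))
	// http.Get is a wrapper around http.DefaultClient, which sets no Timeout.
	fmt.Println("default client bounded:", hasDeadline(http.DefaultClient))
}
```

Nothing in a fast, always-available fixture can tell these two clients apart, which is why the difference survived every conformance check.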
## Side Trips

### Smaller Chains

I also ran three shorter round trips, inspired by failed long-chain experiments.

| Chain | Final Go | Main souvenir |
|---|---|---|
| Go -> Bash -> Go | 105 lines | dates compared as strings |
| Go -> PHP -> Go | 272 lines | explicit emulation of PHP casting and truthiness |
| Go -> Erlang -> Go | 130 lines | Erlang-style errors and a non-strict sort function |

- The Bash version originally poisoned all implementations downstream by doing fancy `jq` manipulations. It also compared ISO dates lexically instead of using some form of date parsing. This works for valid YYYY-MM-DD values, so the returning Go program kept doing it. Invalid or missing dates no longer behaved like the original.
- The PHP round trip had the clearest example of semantic baggage. The returning Go implementation added helpers for PHP integer conversion, Boolean truthiness, string conversion, associative-array iteration and PHP-shaped date errors.
- The Erlang round trip stayed fairly close to the original, but introduced a subtle bug in the sort. Erlang's sort predicate uses "less than or equal to" (`=<`). When translated into Go, that became `<=`. Go's `sort.Slice` needs a strict "less than" function, so passing it "less than or equal to" is incorrect. The fixture data never triggered the bad case, so the bug went unnoticed.

### Adjusted Prompt Run

Another run with a [slightly adjusted prompt](https://github.com/minikomi/semantic_drift/blob/b31dc1d865ba151fbe3075054cb730da5ed05c1c/prompts/rewrite-neutral.md) heavily favored DIY solutions to the translation of ambiguous truthiness and dict keys once exiting the archipelago of dynamic languages. [A huge amount of code](https://github.com/minikomi/semantic_drift/blob/b31dc1d865ba151fbe3075054cb730da5ed05c1c/runs/run-20260621-202351/11-go/project/main.go#L16-L369) found its way to the final Go implementation just to parse the JSON TODOs.
It seems to have first appeared in [the C++ implementation](https://github.com/minikomi/semantic_drift/blob/b31dc1d865ba151fbe3075054cb730da5ed05c1c/runs/run-20260621-202351/05-cpp/project/main.cpp#L189-L250) and grown from there. The system prompt evidently still has a lot of sway over how the translation progresses, especially when the task runs in a loop building off of itself, and often in unexpected ways.

## What does it tell us?

Can we take a lot away from this quick experiment? The oracle was deliberately stunted, the task trivial and the prompt underspecified. I could have gone back and made the prompt more deliberately strict about preserving types, timeouts and so on. Still, I think there's a kernel of wisdom to be had here when it comes to working with agentic coding setups.

The final implementation was a suitcase bearing the stickers of each language we visited along the way: JavaScript truthiness, generic JSON values from C++, coercion rules made explicit in Haskell, an HTTP status table from Common Lisp, date-validation rules passed through Zig and Rust, all of it rendered back into Go switches and helper functions.

The translation Codex performed is conservative, but has its quirks. It preserves what the source does, including workarounds and accidents, because its only means of checking correctness does not distinguish those from the point of the program.

The key takeaway is that LLM translations relying purely on test conformance are prone to sneaky behavior: satisfying the outcome while cutting corners elsewhere. Knowing this, setup, prompting, testing and final review are all extremely important to get right. LLMs are just as lazy as we are, and considerably better at hiding it.

## Notes

- Bash was originally in the full chain. It introduced enough shell-specific behavior that I replaced it with PHP.
- PHP then pushed its coercion rules into every downstream translation, so I replaced it with C++ for the full run.
- The first setup used [libfaketime](https://github.com/wolfcw/libfaketime) to freeze the date. It proved troublesome and exposed details of the harness to generated programs. The current fixture spans enough dates to remain useful while using the real local clock.
- The oracle calculates its expected output when its server starts. Crossing local midnight during a run remains a small race.
- Step-specific analysis can be seen in [CHAIN_ANALYSIS.md](https://github.com/minikomi/semantic_drift/blob/5ff9907f7d0b70d7812296eaf52a5183f57fa453/CHAIN_ANALYSIS.md).