A Game of Robot Telephone

wpnews.pro

Way back when AltaVista Babel Fish first appeared online, it became a fun game on IRC to take a phrase, translate it through a chain of languages, and then back into English. We all also played the game of telephone as kids, trying to pass a message through the class by whispering in each other's ears.

Sometimes the result was surprisingly poetic, but often the result was complete gibberish as errors compounded and mutated along the way.

With the endless stream of LLM-fueled "rewrite this in X" posts doing the rounds, I thought it would be fun to try a similar game, but with code:

Start with a small but non-trivial program in Go
Pass it through a chain of LLM-generated rewrites
Bring it back to Go
See what survived

The final program produced the right answer, but grew to nearly five times its original size, thanks to a grab bag of semantic souvenirs it had picked up from the languages it passed through.

The task the code performs should have enough moving parts to make translation interesting, but be everyday enough to be achievable without heavy frameworks or a multitude of libraries.

The program I settled on does the following:

Accepts and validates a URL from the command line
Makes an HTTP request to the URL to retrieve a list of TODOs in JSON format
Parses the JSON response into values representing TODOs
Reads the current local date
Parses and validates the TODO deadline dates (YYYY-MM-DD format)
Groups TODOs by user ID

Counts completed and overdue TODOs per user
Sorts summaries by completed/overdue counts
Formats a fixed-width table of results and prints to stdout

The initial implementation relies entirely on Go's (well-suited) standard library.

Example API Response #

  [
    {
      "userId": 1,
      "id": 1,
      "title": "delectus aut autem",
      "completed": false,
      "dueDate": "1900-01-01"
    },
    {
      "userId": 1,
      "id": 2,
      "title": "quis ut nam facilis et officia qui",
      "completed": true,
      "dueDate": "2999-12-31"
    },
    ...
  ]

The input has twenty TODOs spread across seven users.
Due dates range from 1900 to 2999, so the missed counts may vary depending on the date the program is run.
There are no "poorly formatted" inputs.

Example output #

USER  COMPLETED  MISSED
3     2          2
2     2          1
4     2          1
5     1          3
1     1          1
7     0          1
6     0          0
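
The grouping, counting and sorting steps can be sketched in a few lines of Go. This is an illustration only, not the original 94-line program: the Summary type and the summarize and mustDate helpers are hypothetical names, a TODO is assumed to count as missed when it is incomplete and due before today, and the sort order (completed descending, then missed descending, then user ID ascending) is an assumption inferred from the example table above.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Todo mirrors the fields in the example API response.
type Todo struct {
	UserID    int    `json:"userId"`
	ID        int    `json:"id"`
	Title     string `json:"title"`
	Completed bool   `json:"completed"`
	DueDate   string `json:"dueDate"`
}

// Summary is a hypothetical per-user row of the output table.
type Summary struct {
	UserID, Completed, Missed int
}

// mustDate parses a YYYY-MM-DD date and panics on bad input (demo helper).
func mustDate(s string) time.Time {
	t, err := time.Parse("2006-01-02", s)
	if err != nil {
		panic(err)
	}
	return t
}

// summarize groups todos by user and counts completed and missed items.
// Assumption: "missed" means not completed and due strictly before today.
func summarize(todos []Todo, today time.Time) ([]Summary, error) {
	byUser := map[int]*Summary{}
	for _, t := range todos {
		due, err := time.Parse("2006-01-02", t.DueDate)
		if err != nil {
			return nil, fmt.Errorf("todo %d: invalid due date %q: %w", t.ID, t.DueDate, err)
		}
		s, ok := byUser[t.UserID]
		if !ok {
			s = &Summary{UserID: t.UserID}
			byUser[t.UserID] = s
		}
		switch {
		case t.Completed:
			s.Completed++
		case due.Before(today):
			s.Missed++
		}
	}
	rows := make([]Summary, 0, len(byUser))
	for _, s := range byUser {
		rows = append(rows, *s)
	}
	// Assumed order: completed desc, missed desc, user ID asc.
	sort.Slice(rows, func(i, j int) bool {
		a, b := rows[i], rows[j]
		if a.Completed != b.Completed {
			return a.Completed > b.Completed
		}
		if a.Missed != b.Missed {
			return a.Missed > b.Missed
		}
		return a.UserID < b.UserID
	})
	return rows, nil
}

func main() {
	todos := []Todo{
		{UserID: 1, ID: 1, Completed: false, DueDate: "1900-01-01"},
		{UserID: 1, ID: 2, Completed: true, DueDate: "2999-12-31"},
		{UserID: 2, ID: 3, Completed: false, DueDate: "2999-12-31"},
	}
	rows, err := summarize(todos, mustDate("2024-06-01"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%-5s %-10s %s\n", "USER", "COMPLETED", "MISSED")
	for _, r := range rows {
		fmt.Printf("%-5d %-10d %d\n", r.UserID, r.Completed, r.Missed)
	}
}
```

Because the struct fields are typed as int and bool, the decoder rejects anything else before this code ever runs, which is the strictness the rest of the chain gradually lost.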

The full chain of languages used was:

Go -> TypeScript -> Python -> Ruby -> C++ -> Java -> Haskell -> Common Lisp -> Zig -> Rust -> Go

Every step was run by a fresh Codex process and executed in an isolated worktree. The prompt used is given in full below. It details what to deliver and how to check the results. It also encourages use of idiomatic code, installation of a toolchain and use of popular libraries. I found that Codex was VERY prone to, for example, writing its own JSON parser instead of installing a toolchain.

The Prompt #

Each generated project contains its own build files, dependencies and run.sh, a wrapper script accepting a single argument (the TODOs endpoint) and tasked with running the newly generated language's code.

The API also exposed a POST /conform endpoint.

Codex could run the newly generated project against /todos, gather the stdout output and post it to this endpoint to get feedback as to whether it performed its task correctly.

If it failed, Codex could inspect the program locally, repair it and try again.

This was to prevent Codex from being sneaky and peeking at expected outputs or another canonical implementation.

And Codex is really sneaky.

Every language in the chain was able to generate the expected table.

The original Go implementation was 94 lines, but the final one grew to 443 lines:

Aspect	Original Go	Final Go
Main implementation	94 lines	443 lines
User IDs	integers	arbitrary JSON values
Completed	Boolean	`true`, `"true"` or numeric `1`
Fields	read when needed	all required up front
HTTP timeout	10 seconds	none
Date parsing	Go standard library	hand-written validation
HTTP status descriptions	Go standard library	hand-written lookup table

None of the extra machinery changed the fixture output. Most of it existed to preserve decisions made by intermediate implementations and the "backward compatibility" behavior accumulated from previous steps.

Truth Gets Complicated #

In the original Go implementation, completed was a simple boolean value within the JSON emitted by the API:

Completed bool `json:"completed"`

Immediately, the TypeScript rewrite changed this behavior subtly.

In Go, a type mismatch in the JSON, like a string where a bool is expected, returns an error. TypeScript with axios, however, trusts the type annotation at compile time but does no runtime validation. A non-empty string value for completed would register as true, regardless of its content.

In practice, the API never returns strange data here, but it has repercussions down the line.

Python and Ruby supplied their own ideas of truthiness, simply checking if todo.completed. Java then used Jackson's coercion rules, and by the Haskell stage the behavior had been made explicit: Boolean true, the string "true" and numeric 1 were true.

That rule survived and was encoded in the Common Lisp, Zig and Rust implementations.

As Boolean #

Lisp

  (defun as-boolean (value)
    (or (eq value 'yason:true)
        (and (stringp value) (string= value "true"))
        (and (numberp value) (= value 1))))

Zig

  fn asBoolean(value: std.json.Value) bool {
    return switch (value) {
        .bool => |b| b,
        .string => |s| std.mem.eql(u8, s, "true"),
        .integer => |n| n == 1,
        .float => |n| n == 1.0,
        .number_string => |s| std.mem.eql(u8, s, "1") or std.mem.eql(u8, s, "1.0"),
        else => false,
    };
  }

Rust

  fn as_boolean(value: &Value) -> bool {
    match value {
        Value::Bool(value) => *value,
        Value::String(value) => value == "true",
        Value::Number(value) => number_is_one(value),
        _ => false,
    }
  }

When the program returns to Go it looks like this:

func asBoolean(value any) bool {
	switch v := value.(type) {
	case bool:
		return v
	case string:
		return v == "true"
	case json.Number:
		return numberIsOne(v)
	default:
		return false
	}
}

The final Go program carefully preserves the semantics of the boolean coercion inherited from the stages before it.

This is the main pattern in the chain. A language adds an interpretation, the next translation treats it as intentional, and eventually it becomes explicit compatibility code. Because the oracle only checks the fixture output, these additions have no effect on the result, but they contribute extra cognitive cruft.
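
To see the inherited rule in action, here is a runnable sketch built around the asBoolean function shown above. The body of numberIsOne is not shown in the article, so the version here is an assumption, as are the coerceAll helper and the use of Decoder.UseNumber to obtain json.Number values:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// numberIsOne is a hypothetical stand-in for the unseen helper: it treats
// any numeric literal equal to 1 (1, 1.0, 1e0) as true.
func numberIsOne(n json.Number) bool {
	f, err := n.Float64()
	return err == nil && f == 1
}

// asBoolean is the coercion function from the final Go program.
func asBoolean(value any) bool {
	switch v := value.(type) {
	case bool:
		return v
	case string:
		return v == "true"
	case json.Number:
		return numberIsOne(v)
	default:
		return false
	}
}

// coerceAll decodes a JSON array with UseNumber, so numbers arrive as
// json.Number, and applies asBoolean to each element.
func coerceAll(raw string) ([]bool, error) {
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.UseNumber()
	var values []any
	if err := dec.Decode(&values); err != nil {
		return nil, err
	}
	out := make([]bool, len(values))
	for i, v := range values {
		out[i] = asBoolean(v)
	}
	return out, nil
}

func main() {
	got, err := coerceAll(`[true, "true", 1, 1.0, "false", 0, null, "yes"]`)
	if err != nil {
		panic(err)
	}
	fmt.Println(got) // [true true true true false false false false]
}
```

Only the first element of that input could ever come from the fixture API; the other seven branches exist purely to honor earlier stages.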

What's in an ID #

The original program decoded userId directly into an int.

The TypeScript annotation looked equally strict but again provided no runtime validation. Python and Ruby used dicts keyed by todo["userId"], allowing any type to be used as a hash key. By the C++ stage, the ambiguity had been made official: any JSON value was accepted as a user ID.

From that point onward, user IDs could be strings, numbers, Booleans, null, arrays or objects. The implementation needed rules for grouping, displaying and sorting all of them.

The final Go program therefore has:

jsonKey to serialize values for grouping
displayValue to print arbitrary JSON values
compareUserID to order mixed types
cloneJSONValue, which no longer did anything

It also preserved JSON numbers as text. 1, 1.0 and 1e0 could become three different grouping keys while all being displayed as 1.
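
The number-as-text effect can be reproduced with json.Number, which keeps the literal exactly as it appeared in the input. A small sketch; groupingKeys is a hypothetical helper and does not mirror the final program's jsonKey:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// groupingKeys decodes a JSON array of numeric user IDs into json.Number
// values, which preserve the original literal text, and counts how often
// each literal appears.
func groupingKeys(raw string) (map[string]int, error) {
	var ids []json.Number
	if err := json.Unmarshal([]byte(raw), &ids); err != nil {
		return nil, err
	}
	counts := map[string]int{}
	for _, id := range ids {
		counts[id.String()]++
	}
	return counts, nil
}

func main() {
	counts, err := groupingKeys(`[1, 1.0, 1e0]`)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(counts)) // 3: the keys are "1", "1.0" and "1e0"

	// The original approach, decoding into int, refuses 1.0 outright.
	var strict []int
	fmt.Println(json.Unmarshal([]byte(`[1, 1.0]`), &strict) != nil) // true
}
```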

None of this was needed by the task - the fixture only contains integer IDs, a fact which was encoded succinctly in the original implementation by the type in the Go structs. Again the ambiguity of the oracle and the passage through dynamic languages caused downstream ambiguities to congregate and gave us a whole bunch of fresh souvenirs.

Error Handling Fossilizes #

The original Go client got HTTP status text from the standard library. Later languages exposed status text differently, or not at all.

By the Common Lisp stage the program carried its own table of HTTP reason phrases. Zig copied it. Rust copied it. The final Go translation copied it back into a language which already knew how to produce those strings.

The final program contains a switch covering status codes from 400 Bad Request to 505 HTTP Version Not Supported.
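
For comparison, Go's standard library already exposes these reason phrases through http.StatusText. A minimal sketch; describeStatus is a hypothetical helper name and the output format is an assumption, not the program's actual error message:

```go
package main

import (
	"fmt"
	"net/http"
)

// describeStatus formats a status code with the standard library's
// reason phrase, so no hand-written lookup table is needed.
func describeStatus(code int) string {
	return fmt.Sprintf("%d %s", code, http.StatusText(code))
}

func main() {
	fmt.Println(describeStatus(http.StatusBadRequest))              // 400 Bad Request
	fmt.Println(describeStatus(http.StatusHTTPVersionNotSupported)) // 505 HTTP Version Not Supported
}
```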

This is not an LLM inventing random code. It is doing exactly what the setup asked: preserving observable behavior from its source. The problem is that an incidental workaround had quietly become observable behavior. This is mainly a failure of the prompt, which too strongly encouraged treating the previous stage's source as the canonical example.

Useful Behavior Disappears #

Drift did not only add things.

The original Go HTTP client had a ten-second timeout. TypeScript, Python, Ruby, C++, Java, Haskell and Common Lisp all retained some form of timeout.

During the Zig translation step, the timeout was dropped. With no signal from there onward, the Rust translation carried no timeout either, and the final Go program calls http.Get directly with no timeout specified.

In our case, the TODOs API never serves a slow response and all stages pass without a problem. This highlights another side to fixture-driven conformance: behavior tested by the fixture becomes sacred. Behavior outside it can disappear without a trace.
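
The difference between the two Go versions comes down to a single field. A sketch of both forms, assuming the original used the client-level Timeout setting; newClient is a hypothetical name:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newClient returns an HTTP client with an explicit overall timeout of
// ten seconds, matching the behavior described for the original program.
func newClient() *http.Client {
	return &http.Client{Timeout: 10 * time.Second}
}

func main() {
	// http.Get uses http.DefaultClient, whose zero Timeout means no timeout.
	fmt.Println("default client timeout:", http.DefaultClient.Timeout) // 0s
	fmt.Println("explicit client timeout:", newClient().Timeout)      // 10s
}
```

Calling newClient().Get(url) instead of http.Get(url) is all it takes to restore the lost behavior, but nothing in the fixture would ever ask for it.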

Smaller Chains #

I also ran three shorter round trips, inspired by failed long-chain experiments.

Chain	Final Go	Main souvenir
Go -> Bash -> Go	105 lines	dates compared as strings
Go -> PHP -> Go	272 lines	explicit emulation of PHP casting and truthiness
Go -> Erlang -> Go	130 lines	Erlang-style errors and a non-strict sort function

The Bash version originally poisoned all implementations downstream by doing fancy jq manipulations. It also compared ISO dates lexically instead of using some form of date parsing. This works for valid YYYY-MM-DD values, so the returning Go program kept doing it. Invalid or missing dates no longer behaved like the original.

The PHP round trip had the clearest example of semantic baggage. The returning Go implementation added helpers for PHP integer conversion, Boolean truthiness, string conversion, associative-array iteration and PHP-shaped date errors.
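
Returning to the Bash souvenir for a moment: the gap between lexical comparison and real date parsing only shows up on bad input. A small sketch with hypothetical helper names (overdueLexical, overdueParsed), not code from either round trip:

```go
package main

import (
	"fmt"
	"time"
)

// overdueLexical mimics the Bash-derived check: plain string comparison.
func overdueLexical(due, today string) bool {
	return due < today
}

// overdueParsed validates both dates before comparing them.
func overdueParsed(due, today string) (bool, error) {
	d, err := time.Parse("2006-01-02", due)
	if err != nil {
		return false, err
	}
	t, err := time.Parse("2006-01-02", today)
	if err != nil {
		return false, err
	}
	return d.Before(t), nil
}

func main() {
	today := "2024-06-01"
	for _, due := range []string{"1900-01-01", "2999-12-31", "2024-13-45", ""} {
		parsed, err := overdueParsed(due, today)
		fmt.Printf("%q lexical=%v parsed=%v invalid=%v\n",
			due, overdueLexical(due, today), parsed, err != nil)
	}
}
```

Both functions agree on the two valid dates. The impossible date and the empty string are silently classified by the lexical check (the empty string even counts as overdue), while the parsing version reports them as errors.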

The Erlang round trip stayed fairly close to the original, but introduced a subtle bug in the sort. Erlang's sort predicate uses "less than or equal to" (=<). When translated into Go, that became <=. Go's sort.Slice needs a strict "less than" function, so passing it "less than or equal to" is incorrect. The fixture data never triggered the bad case, so the bug went unnoticed.
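
The contract violation is easiest to see on a pair of equal elements. A minimal sketch; the row type and function names are hypothetical, and sort.SliceStable is used here instead of sort.Slice only so that ties come out in a predictable order:

```go
package main

import (
	"fmt"
	"sort"
)

type row struct{ user, completed int }

// lessOrEqual mirrors the translated Erlang predicate. It reports true for
// equal elements, which breaks the contract Go's sort functions expect.
func lessOrEqual(a, b row) bool { return a.completed <= b.completed }

// strictLess is the required form: it reports false for equal elements.
func strictLess(a, b row) bool { return a.completed < b.completed }

// sortRows sorts a copy of rows by completed count using the strict
// comparator; the stable sort keeps ties in input order.
func sortRows(rows []row) []row {
	out := append([]row(nil), rows...)
	sort.SliceStable(out, func(i, j int) bool { return strictLess(out[i], out[j]) })
	return out
}

func main() {
	a, b := row{user: 1, completed: 2}, row{user: 2, completed: 2}
	// A valid "less" can never hold in both directions at once.
	fmt.Println(lessOrEqual(a, b) && lessOrEqual(b, a)) // true: contract violated
	fmt.Println(strictLess(a, b) && strictLess(b, a))   // false
	fmt.Println(sortRows([]row{{3, 1}, a, b}))
}
```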

Adjusted Prompt Run #

Another run with a slightly adjusted prompt heavily favored DIY solutions to the translation of ambiguous truthiness and dict keys once exiting the archipelago of dynamic languages. A huge amount of code found its way into the final Go implementation just to parse the JSON TODOs. It seems to have appeared in the C++ implementation and grown from there.

It seems that the system prompt still has a lot of sway over how the translation progresses, especially when the task runs in a loop building off of itself. Often this happens in unexpected ways.

Can we take a lot away from this quick experiment? The oracle was deliberately stunted, the task trivial and the prompt underspecified. I could have gone back and made the prompt more deliberately strict about preserving types, timeouts and so on. Still, I think there's a kernel of wisdom to be had here when it comes to working with agentic coding setups.

The final implementation was a suitcase bearing the stickers of each language we visited along the way: JavaScript truthiness, generic JSON values from C++, coercion rules made explicit in Haskell, an HTTP status table from Common Lisp, date-validation rules passed through Zig and Rust, all of it rendered back into Go switches and helper functions.

The translation Codex performed is conservative, but has its quirks. It preserves what the source does, including workarounds and accidents, because the only way it can check correctness does not distinguish those from the point of the program.

The key takeaway is that LLM translations relying purely on test conformance are prone to sneaky behavior: satisfying the outcome while cutting corners elsewhere. Knowing this, setup, prompting, testing and final review are all extremely important to get right. LLMs are just as lazy as we are, and considerably better at hiding it.

Bash was originally in the full chain. It introduced enough shell-specific behavior that I replaced it with PHP.
PHP then pushed its coercion rules into every downstream translation, so I replaced it with C++ for the full run.
The first setup used libfaketime to freeze the date. It proved troublesome and exposed details of the harness to generated programs. The current fixture spans enough dates to remain useful while using the real local clock.
The oracle calculates its expected output when its server starts. Crossing local midnight during a run remains a small race.
Step-specific analysis can be seen in CHAIN_ANALYSIS.md

source & further reading

poyo.co — original article