{"slug": "a-new-approach-to-interpretability-round-trip-neural-network-compilation", "title": "A new approach to interpretability: round-trip neural network compilation-decompilation", "summary": "A new programming language called Sutra, developed by Emma Leonhart, compiles programs into tensor-op graphs that can be trained like neural networks and then decompiled back into symbolic source code with floating-point precision. The round-trip compilation-decompilation process creates a verified behavioral isomorphism between the symbolic program and the trained network, offering a mathematical property that could address interpretability challenges in neuro-symbolic AI systems. The approach is currently demonstrated as a proof of concept for specific trained parameters, with the author seeking feedback on whether this property constitutes a meaningful form of interpretability or falls to standard objections against neuro-symbolic methods.", "body_md": "After going down a Wikipedia rabbit hole that started at [hyperdimensional computing](https://en.wikipedia.org/wiki/Hyperdimensional_computing), I ended up making a programming language that is quite different from the programming languages I know of.\n\nSutra is a typed, GPU-native programming language I have been building. Its values are vectors and its programs compile to tensor-op graphs, the same kind of fused tensor computation a small neural network runs as. The paper is at [arXiv:2605.20919](https://arxiv.org/abs/2605.20919) and the compiler is on [GitHub](https://github.com/EmmaLeonhart/Sutra).\n\nThis post is about one specific property of that setup, which I will call the round-trip, and a question I genuinely do not know the answer to: whether the property is a useful kind of interpretability, or whether it falls to the standard objection.\n\nThe idea is that a neural network compiled from a Sutra program can be trained and then decompiled into a different symbolic program. 
Right now it works by changing specific parameters during constrained training, but my vision is to train an AI model to decompile compatible neural networks more generally.\n\n**What the round trip is**\n\nThe forward direction is just the compiler: a Sutra program compiles deterministically to a tensor-op graph. Because the graph is tensors, you can train it. The round-trip is the reverse direction. You take the trained parameters and write them back into Sutra source, and that source recompiles to a graph that reproduces the trained network's behavior to floating-point precision.\n\nThe symbolic source is therefore not a description sitting next to the network. It is a program that provably compiles to the exact computation the network performs.\n\nThis is demonstrated so far as a proof of concept: specific trained parameters writing back to source, not yet a general procedure across arbitrary program structures. I say more about the limits below, because they matter for how much weight the rest of this can carry.\n\n**Why I think the isomorphism matters**\n\nI want to be careful here because this is where I'm reasoning beyond what's demonstrated.\n\nThe standard objection to neuro-symbolic approaches on LessWrong is Wentworth's \"[Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc](https://www.lesswrong.com/posts/gebzzEwn2TaA6rGkc/deep-learning-systems-are-not-less-interpretable-than-logic-probability-etc)\", the argument that labeling nodes in a symbolic system doesn't actually give you interpretability, because you can't verify the labels still mean what they say after training.\n\nI think Sutra's claim is structurally different from labeling. The round-trip isn't about whether variable names are semantically accurate. It's about whether there's a verified behavioral isomorphism between the symbolic source and the compiled network, and that isomorphism is checkable without any reference to what the variable names mean. 
You verify it by checking that the compiled graph reproduces trained behavior to floating-point precision. That's a mathematical property, not a semantic one.\n\nThis also speaks to a recurring question here, most directly Edy Nastase's thread [asking why neuro-symbolic systems get so little attention in alignment](https://www.lesswrong.com/posts/qkSrWqzJKrjbSZzvr/why-are-neuro-symbolic-systems-not-considered-when-it-comes). The strongest answer in that thread, from Tailcalled, is that no neurosymbolic architecture has demonstrated a meaningfully better safety property than deep learning, and Thane Ruthenis adds that part of why the research is missing is that the whole direction looks too intimidating to pursue. I am not claiming the round-trip clears that bar. I am trying to state one concrete, checkable property and put it in front of people who can tell me whether it is the kind of thing that would count, or whether it is another property that sounds useful and isn't.\n\nWhat I think this enables, if the round-trip can be made to work reliably at scale: a symbolic articulation of what process a neural network is executing, not what its representations mean. That's different from interpretability in the Wentworth sense. It's closer to being able to formally reason about the computation.\n\nI'm not claiming this solves alignment or that it's sufficient for safety. I'm claiming it's a different kind of property than what's usually discussed, and I'd like to understand whether people think it's a useful kind of property.\n\n**Where I actually am**\n\nThe round-trip is demonstrated as a proof of concept for specific trained parameters writing back to source. I'm currently working on making the training-back-to-code path work more generally across more program structures. 
The formal verification work (using the symbolic-neural correspondence as a basis for verifying properties of the training process) is a direction I want to pursue but haven't started yet, partly because I'd want a collaborator with more FV background than I have.\n\nThe longer arc: once round-tripping works reliably, you have a corpus of (original source, trained source, compiled graph) triples. That's training data for a learned decompiler: a model that takes a trained tensor and produces Sutra source whose compiled graph matches it. At that point the loop closes in a way that I think has interesting properties for self-improvement with maintained legibility.\n\n**What I'm looking for**\n\nPrimarily: people who think the isomorphism claim is wrong or uninteresting, and can tell me specifically why. Also anyone with formal verification background who finds the neural process verification angle interesting.\n\nGitHub: [https://github.com/EmmaLeonhart/Sutra](https://github.com/EmmaLeonhart/Sutra)\n\narXiv: [https://arxiv.org/abs/2605.20919](https://arxiv.org/abs/2605.20919)", "url": "https://wpnews.pro/news/a-new-approach-to-interpretability-round-trip-neural-network-compilation", "canonical_source": "https://www.lesswrong.com/posts/CCcDbSLBj5T7HzJKv/a-new-approach-to-interpretability-round-trip-neural-network", "published_at": "2026-05-29 22:41:08+00:00", "updated_at": "2026-05-29 22:48:45.073894+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "neural-networks", "ai-research"], "entities": ["Sutra", "Emma Leonhart", "arXiv", "GitHub", "Wikipedia", "Hyperdimensional computing"], "alternates": {"html": "https://wpnews.pro/news/a-new-approach-to-interpretability-round-trip-neural-network-compilation", "markdown": "https://wpnews.pro/news/a-new-approach-to-interpretability-round-trip-neural-network-compilation.md", "text": "https://wpnews.pro/news/a-new-approach-to-interpretability-round-trip-neural-network-compilation.txt", "jsonld": 
"https://wpnews.pro/news/a-new-approach-to-interpretability-round-trip-neural-network-compilation.jsonld"}}