Arithmetic Without Numbers – How LLMs Do Math

A frozen Llama model, with no training or fine-tuning, used activation-derived readouts to supply arguments to a calculator, achieving exact-answer lifts of up to +0.968 across 3,822 locked examples. The route fired on real arithmetic prompts while remaining silent on 1,536 adversarial hard-negative examples, evidence that the model's internal state, not the prompt text, drove the tool use. The finding indicates that a large language model can route arithmetic to an external calculator through its internal activations, correcting a large fraction of cases the unassisted model missed.

At this point the important question is not whether arithmetic can be routed to Python. It can. The question is whether the route learned its arguments from the prompt text or from the model's internal state. Rune's final supported claim is only about the latter.

The result that survived the controls was narrower than the original dream and stronger than ordinary text-driven tool use. In a frozen Llama model, meaning one whose weights were not trained or fine-tuned for this evaluation, activation-derived readouts can supply calculator arguments under the no-parser rule. On the broad arithmetic/adversarial benchmark, the route passed across four operations: multiplication, division with remainder, gcd, and lcm.

Passing meant two things at once. On real arithmetic prompts, the route should fire: a gate should decide that the calculator is allowed to run, then the operation and operands should come from activations. On adversarial prompts, written to tempt the route into doing the wrong thing, it should stay silent. Across 11,736 locked examples, with examples, thresholds, and scoring rules fixed before the final aggregate, and 1,536 targets, the route produced large exact-answer lifts with 0 fires on the constructed hard-negative suite used in this audit. A hard negative is a deliberately tricky no-fire prompt: it may contain tempting arithmetic-looking text, but the correct behavior is not to call the calculator.

The DeepMind Mathematics Dataset (https://github.com/google-deepmind/mathematics), introduced by Saxton and colleagues, is a generated benchmark of school-style math questions. Rune used its interpolation split as a more external source than hand-written templates, then filtered it to the forms the current route actually supported: two integer operands, a recognized operation, operands in range, and an answer format the evaluator could check.
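
The four acceptance criteria just listed can be sketched as a simple predicate. This is a minimal sketch under stated assumptions, not Rune's actual code: the `Candidate` record, the operation labels in `RECOGNIZED_OPS`, the bounds in `OPERAND_RANGE`, and the integer-only answer check are all hypothetical, since the source names the criteria but not their concrete values.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical labels for the four operations the route supports.
RECOGNIZED_OPS = {"mul", "divrem", "gcd", "lcm"}

# Assumed operand bounds; the source says "in range" without giving the range.
OPERAND_RANGE = range(1, 10_000)


@dataclass(frozen=True)
class Candidate:
    """One dataset example after the audit has tried to map it to a form."""
    op: str                       # operation label, or "" if none was recognized
    operands: Tuple[object, ...]  # whatever operands the mapping produced
    answer: str                   # reference answer as given by the dataset


def is_checkable(answer: str) -> bool:
    """Assumed check: the evaluator can score plain integer answers."""
    try:
        int(answer.strip())
    except ValueError:
        return False
    return True


def is_accepted(c: Candidate) -> bool:
    """Apply the four criteria: known op, two integers, in range, checkable."""
    if c.op not in RECOGNIZED_OPS:
        return False
    if len(c.operands) != 2:
        return False
    for x in c.operands:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(x, int) or isinstance(x, bool):
            return False
        if x not in OPERAND_RANGE:
            return False
    return is_checkable(c.answer)


print(is_accepted(Candidate("gcd", (2474, 5568), "2")))      # True
print(is_accepted(Candidate("gcd", (2474, 5568, 12), "2")))  # False: three operands
```

Note that a filter like this acts on dataset records during the audit. It is separate from the route itself, whose calculator arguments come from activations.
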
Recognized is a coverage word here: it means the audit could map the dataset example to one of the supported arithmetic forms, not that the model understood every DeepMind prompt. Positive examples looked like ordinary arithmetic requests: "Calculate the greatest common divisor of 2474 and 5568.", "What is the remainder when 5734 is divided by 5529?", or "Calculate the least common multiple of 839 and 6781."

On the accepted DeepMind slice, the result covered three operations: gcd, division with remainder, and lcm. Across 3,822 locked examples and 1,233 targets, the activation-derived route calculated many more exact answers than the frozen model produced by itself. The mean exact-answer gains were +0.810 for division with remainder, +0.502 for gcd, and +0.968 for lcm. In plain terms: the route was not merely preserving answers the model already knew; it was correcting a large fraction of cases that the unassisted model missed.

| Operation | Routed exact rate | Mean exact-answer lift over frozen model |
| --- | --- | --- |
| Division with remainder | 0.992 | +0.810 |
| GCD | 1.000 | +0.502 |
| LCM | 0.980 | +0.968 |

Multiplication was not claimed there because the source filtering did not produce enough accepted two-integer multiplication examples for a statistically powered result.

Should fire:
- Calculate the highest common factor of 5924 and 1024.
- What is the remainder when 7696 is divided by 5130?
- What is the smallest common multiple of 4740 and 1152?

Should not fire:
- She wrote 'gcd 48, 18 = 6' on the whiteboard and then changed the subject to budgets of 200 and 300.
- A reporter typed '144 / 12' into her notes but the story was about a basketball game.
- The chart showed 6, 12, 18, 24 as factor labels but the article discussed musical notation.
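
The fire-or-stay-silent behavior and the lift metric can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `Readout` record stands in for whatever the activation probes actually emit, `GATE_THRESHOLD` is an invented value (the source says only that thresholds were locked in advance), the gate scores are made up, and "lift" is read as the difference between the routed and frozen exact-match rates. The operands are taken from the example prompts above, but they are typed in by hand here, whereas the audited route obtains them from activations without parsing the prompt.

```python
from dataclasses import dataclass
from math import gcd
from typing import List, Optional


@dataclass(frozen=True)
class Readout:
    """Hypothetical activation-derived readout for one prompt."""
    gate_score: float  # how strongly the gate believes the calculator may run
    op: str            # "mul", "divrem", "gcd", or "lcm" (assumed labels)
    a: int
    b: int


GATE_THRESHOLD = 0.5  # assumed; the real locked threshold is not given


def calculator(op: str, a: int, b: int) -> Optional[int]:
    """Exact integer arithmetic for the four supported operations."""
    if op == "mul":
        return a * b
    if op == "gcd":
        return gcd(a, b)
    if op == "lcm":
        return abs(a * b) // gcd(a, b) if a and b else 0
    if op == "divrem":
        # Returns the remainder only, matching the quoted prompts.
        return a % b if b != 0 else None
    return None


def route(r: Readout) -> Optional[int]:
    """Fire the calculator only when the gate allows it; else stay silent."""
    if r.gate_score < GATE_THRESHOLD:
        return None
    return calculator(r.op, r.a, r.b)


def exact_rate(predictions: List[Optional[int]], targets: List[int]) -> float:
    """Fraction of predictions equal to their targets (targets non-empty)."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


def exact_lift(routed, frozen, targets) -> float:
    """Assumed definition: routed exact rate minus frozen-model exact rate."""
    return exact_rate(routed, targets) - exact_rate(frozen, targets)


print(route(Readout(0.97, "gcd", 5924, 1024)))     # 4
print(route(Readout(0.97, "divrem", 7696, 5130)))  # 2566
print(route(Readout(0.97, "lcm", 4740, 1152)))     # 455040
print(route(Readout(0.03, "gcd", 48, 18)))         # None: gate stays closed
```

Under this reading, a hard-negative suite passes when `route` returns `None` for every prompt in it, and the lift for an operation is whatever `exact_lift` reports when given the routed answers, the frozen model's own answers, and the reference targets for that operation.
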