{"slug": "rules-not-weights", "title": "Rules, Not Weights", "summary": "A machine learning approach that inverts the usual iteration loop: instead of treating the scoring rule as fixed and iterating only on model weights, the authors make the scoring rule itself the outer search variable, while weights are still trained for every candidate rule. They introduce a symbolic engine that represents scoring rules as searchable, comparable data structures (s-expressions) rather than Python code, enabling efficient mutation, deduplication, and caching during rule search. This engine compiles scoring rules into trainable artifacts with automatically derived symbolic gradients, eliminating the need for hand-written backward passes or autodiff tapes.", "body_md": "Mainstream ML fixes the scoring rule and trains the weights. We're exploring the opposite — searching the rule.\nIn mainstream ML, the scoring rule is part of the backbone — softmax over logits, cross-entropy loss, the attention pattern — and iteration works by adapting weights to that fixed rule. From GPT-2 through Llama 3 and Mistral, the pretraining rule is the same: next-token cross-entropy on softmax over logits. What moved between them was parameter count, architecture details like normalization and positional encoding, tokenizer, and training data. The scoring rule itself is a given.\nWe're exploring the opposite. Weights still get trained for every candidate — gradient descent does not go away — but the outer search variable is the scoring rule itself. A small symbolic engine makes the rule a term — data a search can compare, mutate, and dedup — so rule-search becomes tractable.\nThree background forces made leaving the rule fixed rational. Autograd made any well-behaved forward cheap to differentiate. Scaling laws rewarded growing parameters and data against the same loss. Softmax cross-entropy matured into a default — MLE for a categorical, p − y gradients, clean composition with attention. 
A fourth force kept the rule fixed in practice: the cost of rule search itself.\nThat fourth force is what our engine changes. Even granting cheap autograd — even granting an LLM that can write candidate forwards on demand — a search still needs a representation it can compare, mutate, and cache, and a population of Python modules is not that. Without tooling that makes a rule into searchable data, rule search at grammar scale is not something a team would run.\nMost production training stacks build an autograd graph at runtime. PyTorch's backward and TensorFlow's GradientTape record operations as the forward pass runs, then walk the recorded graph to compute gradients. That makes any fixed forward cheap to differentiate — but the forward is code, and searching code is expensive even when an LLM writes the candidates. Two Python modules that compute near-equivalent things can look arbitrarily different, a population of modules does not support structural crossover or dedup, and the gradient lives inside a tape rather than in a form you can inspect or hash. Our engine makes the rule a term. defsymbolic reads an s-expression, simplifies it, differentiates it symbolically once at macro-expansion time, and stores the expression beside its per-parameter gradient map. Search becomes data manipulation — canonicalize, hash, dedup, mutate subtrees, emit thousands — and the training loop consumes the same-shaped artifact regardless of which expression produced it.\nThink of the engine as a compiler for scoring rules. 
A developer writes a rule as an expression; parse-expr at src/wave_grad/eml.clj:11-37 reads the expression into a tagged tree of :add, :mul, :exp, :log, :eml nodes; simplify at src/wave_grad/eml.clj:48-102 folds constants, eliminates zeros and ones, and flattens associative operators so derivatives do not blow up; diff at src/wave_grad/eml.clj:104-133 walks the tree once per learnable weight to produce another tagged expression; and defsymbolic at src/wave_grad/eml.clj:154-162 stores the expression and the gradient map side by side on the resulting def. Parse, simplify, differentiate, store. Four passes and a macro:\n(defmacro defsymbolic\n  [name params form]\n  (let [expr (-> form parse-expr simplify)\n        grads (zipmap params (map #(diff expr %) params))]\n    `(def ~name\n       {:name ~(keyword name)\n        :parameters '~(vec params)\n        :expression '~expr\n        :gradients '~grads})))\nWhat the developer hands in looks like a piece of math. What comes back is the trainable artifact the training loop already consumes — no hand-written backward pass, no autodiff tape, no retraining-stack rebuild between experiments.\nA rule is a short expression that scores one candidate — one token in a sequence, one line in a story, one memory chunk in a retrieval window. The expression has learnable weights inside it, but the weight vector is not the unit of iteration. The expression is. Weights are trained per candidate rule by the inner loop; the outer loop swaps the rule. We swap one expression for another, the engine re-differentiates the new one at compile time, the gradient map moves with the rule, and the training loop runs without code changes.\nThe primitive the engine is shaped around is EML. Softmax normalizes across a sequence — every token's weight depends on every other token in the same step. EML scores each token on its own evidence, so scores become composable across sequences, comparable across positions, and directly thresholdable. 
Softmax does not give any of those properties for free.\nThe base operator is one line (Koji Odrzywołek, All elementary functions from a single operator, https://arxiv.org/html/2603.21852v2):\neml(x, y) = e^x − ln(y)\nThis formula means: take the exponential of the first argument and subtract the natural log of the second. Pairing EML with the constant 1 is enough to generate every elementary function — the NAND-gate analogy for continuous math.\nWe call EML with a structured denominator: score = eml(signal_sum, 1 + exp(damping_sum)), which reduces — because ln(1 + e^x) is softplus — to exp(signal) − softplus(damping). This formula means: two learned weighted sums, one signal saying why a token might matter, the other damping penalizing why it might be noisy or stale.\nThe derivative is also one line:\nD[eml(u, v)] = e^u · u' − v' / v\nThis formula means: the gradient of an EML node is the exponential of the first argument u times u's own gradient, minus the second argument v's gradient divided by v itself. Compose EML into larger weighted expressions and the gradient comes back as another expression tree. This is the reason :eml is a named node type in the engine rather than desugared into exp and log — the tidy derivative is worth encoding once at the primitive.\nBecause a new rule is a one-line source edit, a generator can write hundreds of rules from a grammar and the training loop consumes the lot without any code change. Rule design becomes rule search. Three stages run the same loop against progressively more realistic settings.\nStage one is a synthetic hard suite. The headline task and-match rewards tokens that match both a color clue and a shape clue; the eml-gated-symbolic rule lands at 0.613 final answer score against an oracle upper bound of 0.663, and the best linear baseline reaches 0.500. 
Across the harder task variants, different EML rules win on different variants — eml-symbolic wins the hardest at 0.600 against 0.525 for the linear baselines, but no single rule dominates the suite. The interesting finding is not which rule is best; it is that the engine reaches the top group on every task by surfacing a task-shaped rule fast, and a different task rewards a different rule shape.\nStage two is bAbI supporting-fact retrieval on tasks 2 and 3. A broad sweep grammar generated one-term candidates; an expanded sweep added a second damping term to the survivors. The rule that won — eml-auto-03-03-d-drop, a two-term damping variant — landed at support recall 0.204 against the linear baseline's 0.051, four-to-one. Exact retrieval is still 0.000 across the board: the system finds supporting facts above chance, but it does not yet return them in the top slot. The interesting finding is the move to honesty — a proxy metric moves decisively while the strongest possible metric stays at floor — and that the winning rule in the final rerun did not come from a human picking a shape.\nStage three is TinyStories memory selection inside the nanochat inference path — a real-language next-chunk benchmark on a sparse memory budget. The searched rule eml-ts-length_norm-quote_recent-none lands at average continuation loss 3.8344 against 3.8911 for the fixed sparse-memory baseline, 4.0311 for recent-only truncation, and 3.6815 for full context as upper bound. Target overlap moves the same way: 0.2077 for EML against 0.1513 and 0.0815 for the sparse baselines. The interesting finding is that the same iteration loop — engine, rule grammar, sweep, rerun gate — transfers into a real GPT-style inference path on real-language data, and the rule it surfaces beats both sparse baselines without rewiring the backbone.\nThe memory rule search is not the only track. A parallel experiment applied EML to gradient update rules on the same TinyStories model. 
Three designs — gradient gating, optimizer replacement, and lr-scaled correction — ran against a baseline Muon optimizer across four seeds (320 steps, 6-layer, batch 4). Averaged across seeds: correction at 3.1726 validation loss, baseline 3.1931, gating 3.1943, replacement 3.2248. Correction beats baseline on every individual seed. The result that initially favored replacement on a short run reversed when the training horizon grew — a finding only visible through iteration at the rule level. The candidate set was hand-designed, not searched. The direction is consistent; the search has not run.\nIn stage one a human picks the shape and the engine differentiates it. In stage two a sweep grammar picks the shape and the rerun gate keeps it or discards it. Rule search stops being manual. The same loop carried from synthetic tasks to offline retrieval to a live generation path without the backbone changing shape. The thing being searched in all three stages was the expression, not the weights. The loop is not hyperparameter sweeping, and it is not architecture search — it is a search over the expression we score with.\nNone of this says the engine has solved anything. Exact retrieval on bAbI is 0.000. On TinyStories the prefix-match rate is 0.000 across every strategy — EML selects the right material, but selection is not yet generation. The continuation-loss gap between EML and the fixed sparse baseline is real but small in absolute terms (3.8344 against 3.8911). The synthetic wins are task-local: different rules win on different variants, not one rule across the suite. The 0.204 support recall on bAbI aggregates over a split: Task 2 recall is 0.350, Task 3 is 0.058. The claim this post supports is narrower — rule-level iteration is cheap enough to run an algorithmic search loop, the loop surfaces rules a designer would not have written, and those rules carry across three settings without the backbone changing shape. 
How far those rules can be pushed is the next question.", "url": "https://wpnews.pro/news/rules-not-weights", "canonical_source": "https://danieltan.weblog.lol/2026/04/rules-not-weights", "published_at": "2026-04-15 13:27:00+00:00", "updated_at": "2026-05-20 14:17:50.544290+00:00", "lang": "en", "topics": ["machine-learning", "large-language-models", "research", "artificial-intelligence"], "entities": ["GPT-2", "Llama 3", "Mistral"], "alternates": {"html": "https://wpnews.pro/news/rules-not-weights", "markdown": "https://wpnews.pro/news/rules-not-weights.md", "text": "https://wpnews.pro/news/rules-not-weights.txt", "jsonld": "https://wpnews.pro/news/rules-not-weights.jsonld"}}