{"slug": "synaxi-predict-i-m-trying-to-predict-token-cost-before-it-happens", "title": "Synaxi-predict: I'm trying to predict token cost before it happens", "summary": "Synaxi-predict, a new tool from Synaxi, predicts the token cost, turn count, and pass rate of a Claude Code task before execution, enabling users to select the optimal model and avoid wasted tokens. The tool uses an MLP trained on ~53k agent runs and integrates with Claude Code to automatically record actual results for continuous improvement. It is part of the Synaxi ecosystem, which also includes a macOS app that reduces Claude API costs by stripping token waste.", "body_md": "Predicts the cost, turn count, and pass rate of a Claude Code task before it runs — so you can pick the right model without wasting tokens on a bad fit. Closes the loop by capturing actual results and feeding them back into the model.\n\n**Part of the Synaxi ecosystem.**\n\n[Synaxi](https://synaxi.ai) is a macOS app that cuts Claude API costs by stripping token waste from every request before it leaves your machine — deduplicating tool schemas, pruning stale conversation history, compressing verbose JSON, and more. Average reduction: 40%+ per request, with no code changes and under 1ms added latency. Free for personal use.\n\n`synaxi-predict` tackles the complementary problem: picking the right model *before* the task runs. Together they cover both sides of Claude cost control — less waste per token, and fewer tokens on the wrong model.\n\n```\n/synaxi-predict Fix the failing migration\n        │\n        ▼\n  Model              Est. 
cost   Turns   Pass\n  ─────────────────────────────────────────────\n  single-haiku       $    0.35    28.1    8%  ◀ recommended\n  single-sonnet      $    0.62    18.4   11%\n        │\n        ▼  [you pick a model]\n        │\n        ▼\n  Subagent runs the task with the chosen model\n        │\n        ├─ bin/parse-session reads the subagent's session JSONL\n        │   → exact turns, token counts, real cost (not estimated)\n        │\n        ├─ Eval agent checks git diff + test output → passed: true/false\n        │\n        └─ bin/record-actual logs prediction vs. actuals\n           → feeds back into next training run\n```\n\nPredictions use an MLP trained on ~53k agent runs (SWE-bench, SWE-smith, OpenHands, loong0814, real Claude Code runs). Input features: TF-IDF on task text + tree-sitter code complexity features from the current repo (see [Features](#features)).\n\nInside any Claude Code session:\n\n```\n/plugin marketplace add BeadW/synaxi-predict\n/plugin install synaxi-predict\n```\n\nOn the next session start, Claude Code automatically:\n\n- Installs the Python package (`pip install -e`)\n- Downloads the model artifact (~190MB) from GitHub Releases into your platform data directory (`~/Library/Application Support/synaxi-predict/` on macOS, `~/.local/share/synaxi-predict/` on Linux)\n\nUpdates happen the same way — bump `version` in `.claude-plugin/plugin.json`, release, and the hook re-runs on next session.\n\nCopy `.env.example` to `.env` and add your `ANTHROPIC_API_KEY` if you plan to run benchmarks.\n\n```\ngit clone https://github.com/BeadW/synaxi-predict ~/synaxi-predict\ncd ~/synaxi-predict\ngit lfs pull        # download trained model (~190MB)\npip install -e .\n```\n\nIn any Claude Code session, type:\n\n```\n/synaxi-predict Fix the failing login migration\n```\n\nClaude runs the predictor, shows the table, and asks which model you want. 
After you pick, it dispatches a subagent with that model, then automatically records the actual cost and turns against the prediction.\n\nOnce installed, Claude invokes this skill automatically whenever it decides to spawn a subagent — no explicit command needed. The prediction table is computed at skill load time via dynamic injection (tree-sitter code features included), so there's no extra tool call overhead.\n\n```\n# Predict for a task (shows all models)\nbin/predict \"Add OAuth login\" --repo-path /path/to/project\n\n# Predict for Claude Code models only\nbin/predict \"Add OAuth login\" --models single --repo-path .\n\n# List all supported models\nbin/predict --list-models\n\n# Show model training date\nbin/predict --version\n\n# Parse a subagent session for exact metrics (agentId from Agent tool result)\nbin/parse-session <agentId> /path/to/project\n\n# Record actuals manually\nbin/record-actual <pred_id> --turns 18 --cost 0.42 --passed true\n```\n\nEach prediction combines three input groups:\n\n**1. Text features** — TF-IDF (L2-normalised) over the model name prepended to the task description. Captures task type, verb, domain keywords.\n\n**2. Tree-sitter code features** — extracted from the Python files in your repo at prediction time. Requires `tree-sitter` and `tree-sitter-python` (included in core dependencies). If unavailable the model falls back to text features only.\n\n| Feature | Description |\n|---|---|\n| `loc` | Total lines of code across changed/all `.py` files |\n| `functions` | Count of `def` statements |\n| `classes` | Count of `class` statements |\n| `branches` | `if` + `for` + `while` statements |\n| `try_blocks` | `try/except` blocks |\n| `n_files` | Number of `.py` files analysed |\n| `avg_loc` | `loc / n_files` |\n| `branch_density` | `branches / max(loc, 1)` |\n| `has_code_features` | `1` if extraction succeeded, `0` if it fell back to zeros |\n\n**3. 
Model context** — per-model average prompt tokens from training data (proxy for context window pressure).\n\nEvery completed task produces a ground-truth record in `data/actuals_live.jsonl`, including the tree-sitter snapshot taken at prediction time:\n\n```\n{\n  \"prediction_id\": \"c7df172d\",\n  \"model\": \"single-haiku\",\n  \"pred_cost\": 0.504,  \"actual_cost\": 0.243,\n  \"pred_turns\": 86.8,  \"actual_turns\": 8,\n  \"passed\": true,\n  \"code_features\": {\n    \"loc\": 9042, \"functions\": 460, \"classes\": 94,\n    \"branches\": 623, \"try_blocks\": 51, \"n_files\": 30,\n    \"avg_loc\": 301.4, \"branch_density\": 0.069, \"has_code_features\": 1\n  }\n}\n```\n\nThe `code_features` field is what makes contributed actuals valuable for retraining — it lets the model learn from real Claude Code runs on real codebases, not just SWE-bench benchmarks.\n\nAfter accumulating enough actuals, retrain:\n\n```\npython -m predictor.train\n```\n\n| Source | Records | Notes |\n|---|---|---|\n| SWE-smith | 25,826 | Synthetic SWE tasks, claude-3-7/3-5-sonnet/gpt-4o |\n| loong0814 mini | 9,990 | SWE-bench Verified, 5 models |\n| OpenHands SWE-bench Lite | 5,693 | 19 models, real pass/fail |\n| loong0814 full | 3,639 | Real API costs from accumulated_cost |\n| Claude Code runs | 725 | HumanEval/MBPP runs, upsampled ~14× |\n\n| Target | R² | MAE | Within 2× |\n|---|---|---|---|\n| turns | 0.59 | 10.4 turns | 91% |\n| completion tokens | 0.32 | 2,936 tok | 75% |\n| pass rate (AUC-ROC) | 0.91 | — | 84% acc |\n\nTurn predictions are calibrated for SWE-bench-style tasks; real Claude Code runs tend to use fewer turns than predicted. 
This improves as more actuals are recorded via the `synaxi-predict` skill.\n\n```\nbin/                     Executable wrappers (predict, record-actual, parse-session)\npredictor/               Core prediction, training, and session-parsing logic\n  predict.py             CLI entry point and cost calculation\n  train.py               MLP training pipeline\n  parse_session.py       Parses Claude Code session JSONL for exact metrics\n  record_actual.py       Records actuals against predictions\nscripts/                 Dataset importers and eval tools\n  import_*.py            Normalise benchmark datasets → data/runs/\n  extract_*.py           Compute tree-sitter features for benchmark repos\n  eval_holdout.py        R², MAE, within-2× on 20% holdout\n  eval_pass_rate.py      AUC-ROC, Brier score, calibration\nfeatures/code/           Benchmark task definitions (HumanEval, MBPP, etc.)\ndata/runs/               Training corpus (JSONL, one record per agent run)\ndata/models/             Trained model artifact (Git LFS)\ndata/code_features.json  Tree-sitter features per benchmark instance\n.claude-plugin/          Plugin metadata (plugin.json, marketplace.json)\ncommands/                /synaxi-predict slash command\nskills/                  synaxi-predict skill (auto-invoked on subagent spawn)\n```\n\nEach `scripts/import_*.py` pulls a public benchmark and normalises it into `data/runs/`. New importers should produce JSONL with:\n\n```\n{\n  \"task_id\":           \"source/instance_id\",\n  \"strategy\":          \"model-id\",\n  \"task_text\":         \"description of the task\",\n  \"prompt_tokens_raw\": 45000,\n  \"completion_tokens\": 8200,\n  \"num_turns\":         32,\n  \"total_cost_usd\":    0.84,\n  \"passed_criteria\":   true,\n  \"mode\":              \"multi-turn\"\n}\n```\n\n`prompt_tokens_raw` = context size at the final API call. 
`completion_tokens` = total across all turns.\n\nActuals from real Claude Code runs are the most valuable training signal — benchmark data (SWE-bench etc.) doesn't capture how the model behaves on everyday coding tasks or typical codebases.\n\nEvery time the `synaxi-predict` skill completes a task it writes a record to `data/actuals_live.jsonl` containing the task text, actual turns/cost, pass result, and the tree-sitter code features of your repo at prediction time. You can share these records to improve the model for everyone:\n\n```\nbin/contribute      # shows uncontributed records, prompts to share via GitHub issue\nbin/contribute --all  # non-interactive, contributes everything\n```\n\nRequires the `gh` CLI to be authenticated (`gh auth login`). Each record is posted as a GitHub issue with the `contribution` label and validated before being merged into the training set.\n\n**What gets shared:** task text, model name, predicted vs actual turns/cost, pass/fail, and the `code_features` snapshot. No file contents, no diffs, no personal information.\n\n**When to contribute:** after a few tasks have accumulated — check with `bin/contribute` to see what's pending. The more diverse the tasks and codebases, the better the calibration for real Claude Code usage.\n\nSee [CONTRIBUTING.md](/BeadW/synaxi-predict/blob/main/CONTRIBUTING.md). 
PRs welcome — especially new benchmark importers and actuals data.\n\n[MIT](/BeadW/synaxi-predict/blob/main/LICENSE) — use freely, attribution appreciated.", "url": "https://wpnews.pro/news/synaxi-predict-i-m-trying-to-predict-token-cost-before-it-happens", "canonical_source": "https://github.com/BeadW/synaxi-predict", "published_at": "2026-06-16 22:16:58+00:00", "updated_at": "2026-06-16 22:30:44.100698+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "ai-tools", "large-language-models", "ai-agents"], "entities": ["Synaxi", "Claude Code", "Anthropic", "SWE-bench", "OpenHands", "GitHub", "Python", "macOS"], "alternates": {"html": "https://wpnews.pro/news/synaxi-predict-i-m-trying-to-predict-token-cost-before-it-happens", "markdown": "https://wpnews.pro/news/synaxi-predict-i-m-trying-to-predict-token-cost-before-it-happens.md", "text": "https://wpnews.pro/news/synaxi-predict-i-m-trying-to-predict-token-cost-before-it-happens.txt", "jsonld": "https://wpnews.pro/news/synaxi-predict-i-m-trying-to-predict-token-cost-before-it-happens.jsonld"}}