{"slug": "chronos-vs-toto-zero-shot-forecasting-benchmark-results", "title": "Chronos vs Toto: Zero-Shot Forecasting Benchmark Results", "summary": "A benchmark comparing Chronos-Bolt and Toto zero-shot forecasting models on Prometheus and OpenSearch telemetry data found that both models outperformed naive baselines on periodic memory signals but struggled with heavy-tailed, spike-prone CPU utilization series. The evaluation, using MASE for point accuracy and CRPS for uncertainty quality, showed Chronos-Bolt produced calibrated quantile forecasts for long-horizon capacity planning, while Toto demonstrated competitive performance on stable periodic patterns.", "body_md": "Good forecasts help with capacity planning and quieter alerts. But one traffic spike or memory leak can make any forecast useless. The goal is simple: prove your forecast beats a [naive baseline](https://otexts.com/fpp3/simple-methods.html#na%C3%AFve-method) and stays reliable under uncertainty.\n\nIn this post, we compare two forecasting models, Chronos ([Chronos‑Bolt](https://huggingface.co/amazon/chronos-bolt-base)) and [Toto](https://www.datadoghq.com/blog/datadog-time-series-foundation-model/), on telemetry from [Prometheus](https://prometheus.io/) and [OpenSearch](https://opensearch.org/). We judge them with two easy metrics: [MASE](https://otexts.com/fpp3/accuracy.html#mean-absolute-scaled-error-mase) for point accuracy and [CRPS](https://en.wikipedia.org/wiki/Continuous_ranked_probability_score) for the quality of uncertainty.\n\n*Figure: Forecast fan chart for a periodic memory signal (5m aggregation, 256-step horizon). Chronos emits calibrated 0.1–0.9 quantiles.*\n\nLong‑horizon forecasts matter for capacity planning. Teams need to anticipate storage growth, provision compute, and schedule scaling windows without constant firefighting. 
A longer horizon (for example, 256–336 steps) surfaces trend and seasonality far enough ahead to guide procurement, autoscaling policies, and SLO budgets.\n\nBands, not just point lines, are critical in operations. The quantile envelope translates uncertainty into action: alert thresholds can follow the 0.9 band on spike‑prone services, while budgetary plans anchor around the median or 0.8. When bands widen, you get an early warning that risk is rising even if the point forecast looks stable.\n\nWe evaluate both models in a zero‑shot setting: each is used out‑of‑the‑box, without fine‑tuning on these specific series. This highlights how well the models generalize to new telemetry without labeled training data.\n\nFor background, see [zero‑shot learning](https://en.wikipedia.org/wiki/Zero-shot_learning) and the first part of this blog series, [Zero-Shot Forecasting: Our Search for a Time-Series Foundation Model](https://www.parseable.com/blog/zero-shot-forecasting).\n\nOur dataset comes from the [OpenTelemetry Demo](https://opentelemetry.io/ecosystem/demo) (Astronomy Shop), focusing on two common signals: memory utilization from Prometheus and CPU utilization from OpenSearch.\n\nForecasting in observability is hard. Real systems are [bursty](https://en.wikipedia.org/wiki/Burstiness), undergo [regime shifts](https://en.wikipedia.org/wiki/Concept_drift), and only sometimes show [seasonality](https://en.wikipedia.org/wiki/Seasonality). Some series, like Prometheus memory at 5m/10m, show clear cycles and relatively stable behavior. Others, like OpenSearch CPU, are [heavy‑tailed](https://en.wikipedia.org/wiki/Heavy-tailed_distribution) and spike‑prone, with [outliers](https://en.wikipedia.org/wiki/Outlier) that can dwarf the average. Ignore these realities and you get pretty charts with inaccurate results.\n\nA quick exploratory data analysis confirmed this. Memory at 5m/10m looked periodic and well‑behaved, while OpenSearch CPU showed high variability, extreme kurtosis (many tail events), and frequent outliers. 
Two very different forecasting problems.\n\n| Series (name) | Window | Mean/Median (%) | Std | CV | Outliers (%) | Notes |\n|---|---|---|---|---|---|---|\n| Prometheus memory (`mem_util_5m_prometheus`) | 5m | mean ≈ 41.3 | ≈ 12.9 | ≈ 0.31 | 0 | Slight negative trend (−0.006/interval), R² ≈ 0.14 |\n| Prometheus memory (`mem_util_10m_prometheus`) | 10m | mean ≈ 36.3 | ≈ 11.0 | ≈ 0.30 | ≈ 4.9 | Mild positive trend; skewness ≈ 1.02 |\n| Prometheus memory (`mem_util_10s_prometheus`) | 10s | mean ≈ 38.4 | ≈ 2.0 | ≈ 0.053 | ≈ 2.0 | Heavy tails (kurtosis ≈ 19.3); step changes |\n| OpenSearch CPU (`cpu_util_10s_opensearch`) | 10s | median ≈ 6.45 | — | ≈ 1.32 | — | Spikes > 200%; heavy tails (kurtosis ≈ 9.0); large swings |\n\nHow to read this:\n\n[MASE](https://otexts.com/fpp3/accuracy.html#scaled-errors) compares your model's absolute errors to a naive benchmark. It's scale-independent and easy to interpret.\n\nIn this study, we compute MASE using sktime’s `mean_absolute_scaled_error` with `sp=1` (non‑seasonal). This implicitly benchmarks against a one‑step naive baseline: the forecast at time `t` equals the observed value at `t‑1`. Our input is univariate per series, and we provide `y_train` for each rolling origin. MASE < 1 means the model beats the naive baseline.\n\nImplementation note:\n\n```\nfrom sktime.performance_metrics.forecasting import mean_absolute_scaled_error\nmean_absolute_scaled_error(y_true, y_pred, y_train=y_train, sp=1)  # sp=1: scale by the last-value naive error on y_train\n```\n\n[CRPS](https://en.wikipedia.org/wiki/Continuous_ranked_probability_score) evaluates the entire predictive distribution against the observed value. Lower is better. While MASE gauges the central forecast, CRPS rewards calibrated uncertainty, exactly what SREs need for risk‑aware alert thresholds and error budgets. In production, aim for MASE < 1 and CRPS at or below the naive baseline. 
Sharp but overconfident distributions are hazardous.\n\n**Rule of thumb:** Use MASE to prove you've beaten naive; use CRPS to prove your uncertainty is honest.\n\nGood forecast metrics are only useful if teams can act on them. Parseable helps teams move from benchmark scores to production decisions with dashboards, alerts, anomaly detection, and forecasting workflows built into one observability platform.\n\n[Start your free trial]\n\nChronos (Chronos‑Bolt) uses [direct multi‑step forecasting](https://skforecast.org/0.17.0/user_guides/direct-multi-step-forecasting.html): it’s trained to output up to 64 steps in one shot and, in practice, degrades minimally even out to 512 steps. It produces quantiles from 0.1 to 0.9; requesting other quantiles results in errors. The upside: clean fan charts and efficient long‑horizon generation, especially on stable, periodic series.\n\nToto is [autoregressive](https://en.wikipedia.org/wiki/Autoregressive_model): it generates forecasts step by step, sampling from a parametric distribution at each step. More samples usually mean more stable forecasts and better CRPS, but also higher latency. Toto accepts a `num_samples` parameter, which sets the number of samples it generates to make a forecast. In practice, Toto handled horizons up to 512 steps without issue. Across much of our test data, a smaller `num_samples` (32) often performed best, cutting inference time (from ~9× slower to ~5× slower than Chronos‑Bolt base) without sacrificing accuracy. 
Treat this as a solid starting point, then tune as needed.\n\n- MASE: sktime’s `mean_absolute_scaled_error` with `sp=1` (non‑seasonal, naive last‑value baseline: forecast at time `t` equals observed at `t‑1`); input is univariate; `y_train` provided per rolling origin.\n- Series names: `<metric>_<window>_<source>`, for example, `mem_util_5m_prometheus` or `cpu_util_10s_opensearch`.\n- CSV files: `<prediction_length>_<data_used>.csv`, for example, `512_mem_util_5m_prometheus.csv`.\n- Plot folders: `<prediction_length>_<data_used>/`; each folder contains two plots: `chronos` and `toto`.\n- Root paths in this repo: plots live under `Forecast Plots` and CSVs under `CSV Files`.\n\nCSV header dictionary:\n\n- `Toto Time`: Toto inference time for the horizon (ms or s, as exported)\n- `Chronos Time`: Chronos inference time for the horizon\n- `Toto MASE`: MASE for Toto’s point forecast vs naïve\n- `Chronos MASE`: MASE for Chronos’s point forecast vs naïve\n- `Toto CRPS`: CRPS for Toto’s predictive distribution\n- `Chronos CRPS`: CRPS for Chronos’s predictive distribution\n- `Input Length`: Context length used for inference\n\nExample CSVs:\n\nThis setup mirrors how you'd actually deploy forecasting in an observability stack: rolling updates, short and long horizons, and guardrails against regression.\n\nAll forecasting results are available in the [GitHub repo](https://github.com/parseablehq/zero-shot-forecasting).\n\n**General observations (zero‑shot)**: Both models perform well on series with clear cyclic structure (Prometheus memory at 5m/10m) and degrade when periodicity is weak (10s windows, spike‑prone OpenSearch CPU). Use MASE to confirm improvement over naive and CRPS to ensure uncertainty isn’t over‑confident.\n\n*What you’ll see in the charts*: smooth cycles, tight quantile bands, and stable long‑horizon fans.\n\n**Metrics**: Both Chronos and Toto consistently beat naive (MASE < 1). 
Chronos often edges ahead at 512‑step horizons thanks to its direct multi‑step design, which keeps error growth in check. CRPS is strong for both, reflecting predictable cycles and well‑calibrated uncertainty.\n\n*Figures: Memory utilization, 5m aggregation, 256‑step horizon. Top: Chronos; Bottom: Toto. Bands show 0.1–0.9 quantiles.*\n\nCSV:\n\n**What this means in ops**: Prefer Chronos for long‑range capacity planning on periodic services. Drive alerts from quantile bands (for example, 0.9 for spikes, median/0.8 for budgets) rather than a single point line.\n\n*What you’ll see in the charts*: frequent spikes, asymmetric tails, and wider bands—especially during bursts.\n\n**Metrics**: CRPS becomes the deciding factor. Toto can better capture heavy tails with adequate sampling, improving distributional calibration—but with higher latency. Chronos still performs, but if the bands look too narrow on spike‑prone series, be cautious: over‑sharp uncertainty is a red flag.\n\n*Figures: CPU utilization (OpenSearch), 5m aggregation, 64‑step horizon. Top: Chronos; Bottom: Toto. Wider bands reflect tail risk.*\n\nCSV:\n\n**What this means in ops**: Increase Toto’s samples when tail risk matters (on‑call noise, error budgets). If latency is a concern, dial back samples or switch horizons. Consider widening alert thresholds to follow the upper quantile band during known burst windows.\n\n**What you’ll see in the charts**: weaker periodicity, noisier signals, and visibly broader uncertainty.\n\nForecasting results are only the first step. Parseable gives teams a practical place to operationalize those insights across logs, metrics, and traces, so stable trends, tail risk, and widening uncertainty bands can actually drive action.\n\n[See Parseable in action].\n\n**Metrics**: Without strong cyclic structure, both models degrade. Expect MASE ≈ 1 (or marginally better) and wider CRPS. 
That’s not failure; it’s honest uncertainty.\n\n*Figures: Memory utilization, 10s aggregation, 64‑step horizon. Top: Chronos; Bottom: Toto. Bands widen due to weak periodicity.*\n\nCSV:\n\n**What this means in ops**: For near‑term paging, prefer 64‑step horizons and smoothing/aggregation. For capacity planning, use 5m/10m aggregates where structure is clearer.\n\n**How to read these figures**: the shaded fan shows 0.1–0.9 quantiles; the central line is the median. Tighter shading means lower uncertainty. If the upper band crosses your alert threshold, assume higher risk even if the median stays below it.\n\nCompute MASE with `mean_absolute_scaled_error` (`sp=1`, non‑seasonal) using the naive last‑value baseline; provide `y_train` for each rolling origin; input is univariate per series.\n\n```\n- Chronos: use 0.1–0.9 quantiles; generate 64, 256, and 336 steps (longer if needed). Save outputs and latencies.\n- Toto: start with `num_samples=32`; record inference latency and accuracy. If tails matter and CRPS looks over‑confident, increase `num_samples`.\n```\n\nIf you are already benchmarking forecasting models on telemetry, the next step is productionizing the workflow. 
Parseable Pro includes forecasting alerts, dashboards, SQL access, and AI-native analysis, so teams can turn experiments into operational guardrails faster.\n\n[Try Parseable Pro free].\n\n```\n- Compute MASE with sktime’s `mean_absolute_scaled_error` (`sp=1`, non‑seasonal) using the naive last‑value baseline; provide `y_train` for each rolling origin; input is univariate per series.\n- No seasonal naive: we do not use seasonal baselines; the metric’s scaling uses last‑value (`sp=1`).\n- Gate to proceed: target MASE < 1 by a clear margin (≥10–20%) on stable series.\n- Track both MASE (point) and [CRPS](https://en.wikipedia.org/wiki/Continuous_ranked_probability_score) (probabilistic).\n- Acceptance gates:\n  - MASE < 1 for your chosen horizon(s).\n  - CRPS ≤ naive and stable across windows (no drift).\n  - Coverage check: ~80% of actuals inside the 0.1–0.9 band (by design it won’t be 100%; the point is “not too tight”).\n- If coverage is low or CRPS rises: widen bands (Chronos, Toto) or increase samples (Toto).\n- Paging/on‑call: prefer 64‑step horizons; alert when the 0.9 quantile stays above a threshold for N consecutive points (e.g., N = 3–5 windows).\n- Capacity planning: use 256–336 steps; plan to the median and budget to the 0.8/0.9 quantile.\n- Executive reports: show median plus bands; avoid single‑line forecasts.\n- Use quantile bands for thresholds (e.g., 0.9 for “high‑risk” services).\n- Page only if both conditions hold for N windows: upper quantile > threshold AND observed is trending up.\n- Add dampening: suspend pages if uncertainty (band width) is rapidly expanding after a deploy.\n- Dashboard by service/window (10s/5m/10m):\n  - MASE and CRPS as time series; 7‑day moving averages.\n  - Quantile coverage (% of points within 0.1–0.9 band).\n  - Inference latency by horizon and model; model version and config (samples/context).\n- Alert on trends, not spikes:\n  - MASE ≥ 1 for N consecutive windows.\n  - CRPS +25% vs baseline for M windows.\n  - 
Coverage < 70–75% for K windows (over‑tight bands).\n- If MASE worsens or CRPS inflates, automatically fall back to the naive (last‑value) baseline.\n- Temporarily widen alert thresholds to quantile bands (e.g., use 0.9 instead of the mean) until retraining.\n- When noise is high (10s windows), aggregate to 5m/10m for capacity decisions.\n- Monitor missing timestamps, step changes, near‑zero plateaus, and value spikes; backfill or mask before scoring.\n- Watch for regime shifts (deploys/traffic changes) that flip winners; keep baselines alive for quick rollbacks.\n```\n\n- For `mem_util_5m_prometheus` capacity planning, anchor to the median for purchase decisions and keep a “risk band” at 0.9. If leadership asks for p95, publish it with a note on interpolation and prove that ~95% of points fell below it in the last 7 days.\n- On `cpu_util_5m_opensearch`, 64‑step predictions keep CRPS low and coverage healthy. At 256 steps, CRPS inflates; bump `num_samples` (e.g., from 32 to 64) or forecast in two 128‑step segments and re‑seed with observations.\n- `mem_util_10m_prometheus` loses its daily dip. Chronos still looks fine at the median, but coverage falls to ~60%. Flip alerts to the baseline temporarily and retrain on post‑deploy data.\n- `mem_util_10s_prometheus` at 64 steps shows honest wide bands and MASE near 1. Aggregate to 5m for planning (where Chronos beats naive) and keep 10s only for exploratory drilling.\n\nIf you remember only two things, make them these:\n\nMASE tells you if your point forecast truly beats naive.\n\nCRPS tells you if your uncertainty is believable when it matters most.\n\nOn periodic series, Chronos often wins, especially at long horizons, thanks to direct multi-step forecasting and clean quantile bands.\n\nOn spiky, heavy-tailed series, Toto shines when you tune sampling to balance CRPS and latency. 
Most teams will benefit from using both: Chronos for stable workloads, Toto where tails do the talking.\n\nForecasting becomes valuable when it moves beyond offline evaluation and starts shaping real operational decisions. Parseable helps teams do that with unified telemetry, forecasting alerts, dashboards, and AI-native analysis built for production observability.\n\n[Start your free trial].", "url": "https://wpnews.pro/news/chronos-vs-toto-zero-shot-forecasting-benchmark-results", "canonical_source": "https://dev.to/team-parseable/chronos-vs-toto-zero-shot-forecasting-benchmark-results-1101", "published_at": "2026-05-27 04:41:57+00:00", "updated_at": "2026-05-27 04:52:27.828771+00:00", "lang": "en", "topics": ["machine-learning", "artificial-intelligence", "ai-products", "ai-tools", "ai-infrastructure"], "entities": ["Chronos", "Toto", "Amazon", "Datadog", "Prometheus", "OpenSearch", "Chronos-Bolt"], "alternates": {"html": "https://wpnews.pro/news/chronos-vs-toto-zero-shot-forecasting-benchmark-results", "markdown": "https://wpnews.pro/news/chronos-vs-toto-zero-shot-forecasting-benchmark-results.md", "text": "https://wpnews.pro/news/chronos-vs-toto-zero-shot-forecasting-benchmark-results.txt", "jsonld": "https://wpnews.pro/news/chronos-vs-toto-zero-shot-forecasting-benchmark-results.jsonld"}}