Chronos vs Toto: Zero-Shot Forecasting Benchmark Results

A benchmark comparing Chronos-Bolt and Toto zero-shot forecasting models on Prometheus and OpenSearch telemetry data found that both models outperformed naive baselines on periodic memory signals but struggled with heavy-tailed, spike-prone CPU utilization series. The evaluation, using MASE for point accuracy and CRPS for uncertainty quality, showed Chronos-Bolt produced calibrated quantile forecasts for long-horizon capacity planning, while Toto demonstrated competitive performance on stable periodic patterns.

Good forecasts help with capacity planning and quieter alerts. But one traffic spike or memory leak can make any forecast useless. The goal is simple: prove your forecast beats a [naive baseline](https://otexts.com/fpp3/simple-methods.html#na%C3%AFve-method) and stays reliable under uncertainty.

In this post, we compare two forecasting models, Chronos ([Chronos‑Bolt](https://huggingface.co/amazon/chronos-bolt-base)) and [Toto](https://www.datadoghq.com/blog/datadog-time-series-foundation-model/), on telemetry from [Prometheus](https://prometheus.io/) and [OpenSearch](https://opensearch.org/). We judge them with two easy metrics: [MASE](https://otexts.com/fpp3/accuracy.html#mean-absolute-scaled-error-mase) for point accuracy and [CRPS](https://en.wikipedia.org/wiki/Continuous_ranked_probability_score) for the quality of uncertainty.

Figure: Forecast fan chart for a periodic memory signal (5m aggregation, 256-step horizon). Chronos emits calibrated 0.1–0.9 quantiles.

Long‑horizon forecasts matter for capacity planning. Teams need to anticipate storage growth, provision compute, and schedule scaling windows without constant firefighting. A longer horizon (for example, 256–336 steps) surfaces trend and seasonality far enough ahead to guide procurement, autoscaling policies, and SLO budgets.

Bands, not just point lines, are critical in operations.
The quantile envelope translates uncertainty into action: alert thresholds can follow the 0.9 band on spike‑prone services, while budgetary plans anchor around the median or the 0.8 quantile. When bands widen, you get early warning that risk is rising even if the point forecast looks stable.

We evaluate both models in a zero‑shot setting (used out‑of‑the‑box without fine‑tuning on these specific series). This highlights how well the models generalize to new telemetry without labeled training data. For background, see [zero‑shot learning](https://en.wikipedia.org/wiki/Zero-shot_learning) and the first part of this blog post series, [Zero-Shot Forecasting: Our Search for a Time-Series Foundation Model](https://www.parseable.com/blog/zero-shot-forecasting).

Our dataset comes from the [OpenTelemetry Demo](https://opentelemetry.io/ecosystem/demo) (Astronomy Shop), focusing on two common signals: Prometheus memory utilization and OpenSearch CPU utilization.

Forecasting in observability is hard. Real systems are [bursty](https://en.wikipedia.org/wiki/Burstiness), undergo [regime shifts](https://en.wikipedia.org/wiki/Concept_drift), and only sometimes show [seasonality](https://en.wikipedia.org/wiki/Seasonality). Some series, like Prometheus memory at 5m/10m, show clear cycles and relatively stable behavior. Others, like OpenSearch CPU, are [heavy‑tailed](https://en.wikipedia.org/wiki/Heavy-tailed_distribution) and spike‑prone, with [outliers](https://en.wikipedia.org/wiki/Outlier) that can dwarf the average. Ignore these realities and you get pretty charts with inaccurate results.

A quick exploratory data analysis confirmed it. Memory at 5m/10m looked periodic and well‑behaved, while OpenSearch CPU showed high variability, extreme kurtosis (many tail events), and frequent outliers. Two very different forecasting problems.
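The summary statistics used in this kind of exploratory analysis (coefficient of variation, kurtosis, outlier share) can be sketched in a few lines. This is a minimal illustration, not the analysis code behind the post: the 1.5 × IQR outlier rule, the population standard deviation, the excess-kurtosis convention, and the two synthetic series are all assumptions made for the example.

```python
import math
import statistics


def summarize(values):
    """Return mean, std, CV, excess kurtosis, and IQR-based outlier share (%)."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed convention)
    cv = std / mean  # coefficient of variation; assumes a non-zero mean
    # Excess kurtosis: fourth standardized moment minus 3 (0 for a normal distribution).
    kurtosis = sum(((v - mean) / std) ** 4 for v in values) / n - 3.0
    # Tukey fences (assumed rule): points beyond 1.5 * IQR from the quartiles are outliers.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = sum(1 for v in values if v < low or v > high)
    return {
        "mean": mean,
        "std": std,
        "cv": cv,
        "kurtosis": kurtosis,
        "outlier_pct": 100.0 * outliers / n,
    }


# Synthetic stand-ins (hypothetical data, not the benchmark series):
# a smooth periodic signal and a low baseline with occasional large spikes.
steady = [40.0 + 5.0 * math.sin(2 * math.pi * i / 12) for i in range(120)]
spiky = [200.0 if i % 20 == 0 else 5.0 + (i % 5) for i in range(100)]

steady_stats = summarize(steady)
spiky_stats = summarize(spiky)
print({k: round(v, 3) for k, v in steady_stats.items()})
print({k: round(v, 3) for k, v in spiky_stats.items()})
```

On the synthetic data, the periodic series has a small CV and no flagged outliers, while the spiky series has a CV above 1 and a large positive kurtosis, which is the same qualitative split the analysis describes.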
| Series name | Window | Mean/Median (%) | Std | CV | Outliers (%) | Notes |
|---|---|---|---|---|---|---|
| Prometheus memory (`mem_util_5m_prometheus`) | 5m | mean ≈ 41.3 | ≈ 12.9 | ≈ 0.31 | 0 | Slight negative trend (−0.006/interval), R² ≈ 0.14 |
| Prometheus memory (`mem_util_10m_prometheus`) | 10m | mean ≈ 36.3 | ≈ 11.0 | ≈ 0.30 | ≈ 4.9 | Mild positive trend; skewness ≈ 1.02 |
| Prometheus memory (`mem_util_10s_prometheus`) | 10s | mean ≈ 38.4 | ≈ 2.0 | ≈ 0.053 | ≈ 2.0 | Heavy tails (kurtosis ≈ 19.3); step changes |
| OpenSearch CPU (`cpu_util_10s_opensearch`) | 10s | median ≈ 6.45 | — | ≈ 1.32 | — | Spikes 200%; heavy tails (kurtosis ≈ 9.0); large swings |

How to read this: CV is the coefficient of variation (standard deviation divided by the mean), so a higher CV means more variability relative to the level of the series.

[MASE](https://otexts.com/fpp3/accuracy.html#scaled-errors) compares your model's absolute errors to a naive benchmark. It's scale-independent and easy to interpret: a value below 1 means the model beats the naive baseline, and a value above 1 means it does worse.

In this study, we compute MASE using sktime’s `mean_absolute_scaled_error` with `sp=1` (non‑seasonal). This implicitly benchmarks against a one‑step naive baseline: the forecast at time t equals the observed value at t‑1. Our input is univariate per series, and we provide `y_train` for each rolling origin.

Implementation note: `mean_absolute_scaled_error(y_true, y_pred, y_train=y_train, sp=1)`

[CRPS](https://en.wikipedia.org/wiki/Continuous_ranked_probability_score) evaluates the entire predictive distribution against the observed value. Lower is better. While MASE gauges the central forecast, CRPS rewards calibrated uncertainty, exactly what SREs need for risk‑aware alert thresholds and error budgets.

In production, aim for MASE < 1 and CRPS at or below the naive baseline. Sharp but overconfident distributions are hazardous.

Rule of thumb: Use MASE to prove you've beaten naive; use CRPS to prove your uncertainty is honest.
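To make the two metrics concrete, here is a small from-scratch sketch. It follows the textbook definitions (MASE as forecast MAE divided by the in-sample MAE of the naive forecast, and a sample-based CRPS estimate of the form E|X − y| − ½·E|X − X′|) and is not sktime's implementation or the benchmark's evaluation code. The function names and toy numbers are hypothetical, and it scores a single forecast origin instead of a rolling evaluation.

```python
def mase(y_true, y_pred, y_train, sp=1):
    """Forecast MAE divided by the in-sample MAE of the lag-`sp` naive forecast."""
    mae_forecast = sum(abs(a - f) for a, f in zip(y_true, y_pred)) / len(y_true)
    # With sp=1 the naive forecast for time t is simply the value at t-1.
    naive_errors = [abs(y_train[i] - y_train[i - sp]) for i in range(sp, len(y_train))]
    mae_naive = sum(naive_errors) / len(naive_errors)
    return mae_forecast / mae_naive


def crps_from_samples(samples, observed):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|. Lower is better."""
    n = len(samples)
    distance_to_obs = sum(abs(x - observed) for x in samples) / n
    spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return distance_to_obs - 0.5 * spread


# Toy data (hypothetical): a short training history and a two-step test window.
y_train = [10.0, 12.0, 11.0, 13.0, 12.0]
y_true = [14.0, 13.0]

model_mase = mase(y_true, [13.5, 13.5], y_train)  # some model's point forecast
naive_mase = mase(y_true, [12.0, 12.0], y_train)  # repeat the last training value

# Two predictive distributions for an observed value of 14:
honest_crps = crps_from_samples([12.0, 13.0, 14.0, 15.0, 16.0], 14.0)    # wide, centered
overconfident_crps = crps_from_samples([10.0] * 5, 14.0)                # sharp, wrong

print(model_mase, naive_mase, honest_crps, overconfident_crps)
```

In this toy case the model scores a MASE of about 0.33 against 1.0 for the last-value forecast, and the sharp but wrong distribution is penalized far more heavily by CRPS than the wider, well-centered one.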
Chronos (Chronos‑Bolt) uses [direct multi‑step forecasting](https://skforecast.org/0.17.0/user_guides/direct-multi-step-forecasting.html): it’s trained to output up to 64 steps in one shot and, in practice, degrades minimally even out to 512 steps. It produces quantiles from 0.1 to 0.9; requesting other quantiles results in errors. The upside: clean fan charts and efficient long‑horizon generation, especially on stable, periodic series.

Toto is [autoregressive](https://en.wikipedia.org/wiki/Autoregressive_model): it generates forecasts step by step, sampling from a parametric distribution at each step. More samples usually mean more stable forecasts and better CRPS, but also higher latency. Toto accepts a `num_samples` parameter that sets how many samples it generates to make a forecast. In practice, Toto handled horizons up to 512 steps without issue. Across much of our test data, a smaller setting of `num_samples=32` often performed best, reducing inference time from ~9× slower to ~5× slower than `chronos-bolt-base` without sacrificing accuracy. Treat this as a solid starting point, then tune as needed.

Implementation note: MASE is computed with `mean_absolute_scaled_error` and `sp=1` (non‑seasonal, naive last‑value baseline: the forecast at time t equals the observed value at t‑1); the input is univariate, and `y_train` is provided per rolling origin.
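The step-by-step sampling described for Toto can be illustrated with a toy model. The sketch below rolls a simple AR(1) process forward with Gaussian noise and then reads quantile bands off the sample paths. It is only an analogy for autoregressive sampling in general: it is not Toto's architecture or API, and the function names, the AR(1) parameters, and the nearest-rank quantile rule are all assumptions.

```python
import random


def sample_paths(last_value, horizon, num_samples, phi=0.9, level=40.0, sigma=2.0, seed=0):
    """Roll a toy AR(1) model forward, drawing one Gaussian sample per step per path."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    paths = []
    for _ in range(num_samples):
        value, path = last_value, []
        for _ in range(horizon):
            # Each step conditions on the previously sampled value (autoregression).
            value = level + phi * (value - level) + rng.gauss(0.0, sigma)
            path.append(value)
        paths.append(path)
    return paths


def quantile_bands(paths, levels=(0.1, 0.5, 0.9)):
    """Per-step empirical quantiles across sample paths (nearest-rank, assumed rule)."""
    bands = {q: [] for q in levels}
    for step_values in zip(*paths):  # iterate over forecast steps
        ordered = sorted(step_values)
        for q in levels:
            index = round(q * (len(ordered) - 1))
            bands[q].append(ordered[index])
    return bands


paths = sample_paths(last_value=45.0, horizon=8, num_samples=32)
bands = quantile_bands(paths)
for step in range(8):
    print(step + 1, [round(bands[q][step], 2) for q in (0.1, 0.5, 0.9)])
```

The cost structure is visible in the loops: the number of random draws is `num_samples × horizon`, which is why a lower sample count cuts latency, while fewer paths also make the empirical quantiles noisier.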