Chronos vs Toto: Zero-Shot Forecasting Benchmark Results

A benchmark comparing Chronos-Bolt and Toto zero-shot forecasting models on Prometheus and OpenSearch telemetry data found that both models outperformed naive baselines on periodic memory signals but struggled with heavy-tailed, spike-prone CPU utilization series. The evaluation, using MASE for point accuracy and CRPS for uncertainty quality, showed Chronos-Bolt produced calibrated quantile forecasts for long-horizon capacity planning, while Toto demonstrated competitive performance on stable periodic patterns.

Good forecasts help with capacity planning and quieter alerts. But one traffic spike or memory leak can make any forecast useless. The goal is simple: prove your forecast beats a [naive baseline](https://otexts.com/fpp3/simple-methods.html#na%C3%AFve-method) and stays reliable under uncertainty.

In this post, we compare two forecasting models, Chronos ([Chronos-Bolt](https://huggingface.co/amazon/chronos-bolt-base)) and [Toto](https://www.datadoghq.com/blog/datadog-time-series-foundation-model/), on telemetry from [Prometheus](https://prometheus.io/) and [OpenSearch](https://opensearch.org/). We judge them with two easy metrics: [MASE](https://otexts.com/fpp3/accuracy.html#mean-absolute-scaled-error-mase) for point accuracy and [CRPS](https://en.wikipedia.org/wiki/Continuous_ranked_probability_score) for the quality of uncertainty.

Figure: Forecast fan chart for a periodic memory signal (5m aggregation, 256-step horizon). Chronos emits calibrated 0.1–0.9 quantiles.

Long-horizon forecasts matter for capacity planning. Teams need to anticipate storage growth, provision compute, and schedule scaling windows without constant firefighting. A longer horizon (for example, 256–336 steps) surfaces trend and seasonality far enough ahead to guide procurement, autoscaling policies, and SLO budgets.

Bands, not just point lines, are critical in operations. The quantile envelope translates uncertainty into action: alert thresholds can follow the 0.9 band on spike-prone services, while budgetary plans anchor around the median or 0.8. When bands widen, you get early warning that risk is rising even if the point forecast looks stable.

We evaluate both models in a zero-shot setting (used out-of-the-box without fine-tuning on these specific series). This highlights how well the models generalize to new telemetry without labeled training data.
For background, see [zero-shot learning](https://en.wikipedia.org/wiki/Zero-shot_learning) and the first part of this blog-post series, [Zero-Shot Forecasting: Our Search for a Time-Series Foundation Model](https://www.parseable.com/blog/zero-shot-forecasting).

Our dataset comes from the [OpenTelemetry Demo](https://opentelemetry.io/ecosystem/demo) (Astronomy Shop), focusing on two common signals: memory utilization (Prometheus) and CPU utilization (OpenSearch).

Forecasting in observability is hard. Real systems are [bursty](https://en.wikipedia.org/wiki/Burstiness), undergo [regime shifts](https://en.wikipedia.org/wiki/Concept_drift), and only sometimes show [seasonality](https://en.wikipedia.org/wiki/Seasonality). Some series (like Prometheus memory at 5m/10m) show clear cycles and relatively stable behavior. Others (like OpenSearch CPU) are [heavy-tailed](https://en.wikipedia.org/wiki/Heavy-tailed_distribution) and spike-prone, with [outliers](https://en.wikipedia.org/wiki/Outlier) that can dwarf the average. Ignore these realities and you get pretty charts with inaccurate results.

A quick exploratory data analysis confirmed it. Memory at 5m/10m looked periodic and well-behaved, while OpenSearch CPU showed high variability, extreme kurtosis (many tail events), and frequent outliers. Two very different forecasting problems.
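
As a rough illustration of the kind of summary statistics such an exploratory pass produces (mean, standard deviation, coefficient of variation, kurtosis, outlier share), here is a minimal standard-library sketch. The post does not say how outliers or kurtosis were defined, so the 1.5 × IQR rule and the non-excess kurtosis used here are assumptions, and the two toy series are made up for the example.

```python
import statistics


def summarize(values):
    """Return mean, std, coefficient of variation, kurtosis and outlier share."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Coefficient of variation: spread relative to the mean.
    cv = std / mean if mean else float("nan")
    # Non-excess kurtosis (a normal distribution scores about 3); an assumption.
    m4 = sum((v - mean) ** 4 for v in values) / n
    kurtosis = m4 / std ** 4 if std else float("nan")
    # Outlier share via the 1.5 * IQR rule; the rule itself is an assumption.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = sum(1 for v in values if v < low or v > high)
    return {
        "mean": mean,
        "std": std,
        "cv": cv,
        "kurtosis": kurtosis,
        "outlier_pct": 100.0 * outliers / n,
    }


# Toy data: a steady memory-like series and a spiky CPU-like series.
steady = [40.0, 41.0, 39.0, 40.5, 39.5, 40.0, 41.0, 39.0]
spiky = [6.0, 7.0, 6.5, 6.0, 7.0, 6.5, 6.0, 200.0]

for name, series in (("steady", steady), ("spiky", spiky)):
    print(name, summarize(series))
```

A high CV together with a non-zero outlier share is the signature of the spike-prone series described above; the steady series shows neither.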

| Series name | Window | Mean/Median (%) | Std | CV | Outliers (%) | Notes |
|---|---|---|---|---|---|---|
| Prometheus memory (`mem_util_5m_prometheus`) | 5m | mean ≈ 41.3 | ≈ 12.9 | ≈ 0.31 | 0 | Slight negative trend (−0.006/interval), R² ≈ 0.14 |
| Prometheus memory (`mem_util_10m_prometheus`) | 10m | mean ≈ 36.3 | ≈ 11.0 | ≈ 0.30 | ≈ 4.9 | Mild positive trend; skewness ≈ 1.02 |
| Prometheus memory (`mem_util_10s_prometheus`) | 10s | mean ≈ 38.4 | ≈ 2.0 | ≈ 0.053 | ≈ 2.0 | Heavy tails (kurtosis ≈ 19.3); step changes |
| OpenSearch CPU (`cpu_util_10s_opensearch`) | 10s | median ≈ 6.45 | n/a | ≈ 1.32 | n/a | Spikes 200%; heavy tails (kurtosis ≈ 9.0); large swings |

[MASE](https://otexts.com/fpp3/accuracy.html#scaled-errors) compares your model's absolute errors to a naive benchmark. It's scale independent and easy to interpret: MASE < 1 means the model beats the naive baseline.

In this study, we compute MASE using sktime's `mean_absolute_scaled_error` with `sp=1` (non-seasonal). This implicitly benchmarks against a one-step naive baseline: the forecast at time t equals the observed value at t-1. Our input is univariate per series, and we provide `y_train` for each rolling origin.

Implementation note: `mean_absolute_scaled_error(y_true, y_pred, y_train=y_train, sp=1)`.

[CRPS](https://en.wikipedia.org/wiki/Continuous_ranked_probability_score) evaluates the entire predictive distribution against the observed value. Lower is better. While MASE gauges the central forecast, CRPS rewards calibrated uncertainty, exactly what SREs need for risk-aware alert thresholds and error budgets.

In production, aim for MASE < 1 and CRPS at or below the naive baseline. Sharp but overconfident distributions are hazardous.

Rule of thumb: Use MASE to prove you've beaten naive; use CRPS to prove your uncertainty is honest.

Good forecast metrics are only useful if teams can act on them.
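
To make the two metrics concrete, here is a minimal standard-library sketch. It is not the sktime implementation used in the study: `mase` follows the standard MASE definition with `sp=1`, and the two CRPS helpers are common approximations (twice the mean pinball loss over a quantile grid for quantile forecasts such as Chronos's 0.1–0.9 output, and the sample-based estimator for sampled forecasts such as Toto's). All function names and toy numbers are illustrative assumptions.

```python
def mase(y_true, y_pred, y_train, sp=1):
    """Forecast MAE divided by the in-sample MAE of the lag-`sp` naive forecast."""
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    naive_errors = [abs(y_train[i] - y_train[i - sp]) for i in range(sp, len(y_train))]
    scale = sum(naive_errors) / len(naive_errors)
    return mae / scale


def pinball(y, q_pred, level):
    """Quantile (pinball) loss of one predicted quantile at one level."""
    diff = y - q_pred
    return level * diff if diff >= 0 else (level - 1.0) * diff


def crps_from_quantiles(y, quantile_preds, levels):
    """Approximate CRPS for one observation as twice the mean pinball loss."""
    losses = [pinball(y, q, lvl) for q, lvl in zip(quantile_preds, levels)]
    return 2.0 * sum(losses) / len(losses)


def crps_from_samples(y, samples):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    term1 = sum(abs(s - y) for s in samples) / n
    term2 = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term1 - 0.5 * term2


# Toy numbers, made up for illustration.
y_train = [10.0, 12.0, 11.0, 13.0, 12.0]
print(mase([14.0, 15.0], [13.5, 14.5], y_train))
print(crps_from_quantiles(2.0, [1.0, 2.0, 3.0], [0.1, 0.5, 0.9]))
print(crps_from_samples(2.0, [1.0, 3.0]))
```

With only nine quantile levels the quantile-based value is a coarse approximation of the full CRPS integral, so compare it only against baselines scored the same way.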
Parseable helps teams move from benchmark scores to production decisions with dashboards, alerts, anomaly detection, and forecasting workflows built into one observability platform. Start your free trial.

Chronos (Chronos-Bolt) uses [direct multi-step forecasting](https://skforecast.org/0.17.0/user_guides/direct-multi-step-forecasting.html): it's trained to output up to 64 steps in one shot and, in practice, degrades minimally even out to 512 steps. It produces quantiles from 0.1 to 0.9; requesting other quantiles results in errors. The upside: clean fan charts and efficient long-horizon generation, especially on stable, periodic series.

Toto is [autoregressive](https://en.wikipedia.org/wiki/Autoregressive_model): it generates forecasts step by step, sampling from a parametric distribution at each step. More samples usually mean more stable forecasts and better CRPS, but also higher latency. Toto accepts a parameter called `num_samples`, which sets the number of samples it generates to make a forecast. In practice, Toto handled horizons up to 512 steps without issue. Across much of our test data, a smaller `num_samples=32` often performed best, reducing inference time from ~9× slower to ~5× slower than chronos-bolt-base without sacrificing accuracy. Treat this as a solid starting point, then tune as needed.

The evaluation setup and file layout:

- MASE is computed with `mean_absolute_scaled_error` with `sp=1` (non-seasonal, naive last-value baseline: the forecast at time t equals the observed value at t-1); input is univariate; `y_train` is provided per rolling origin.
- Series are named `<metric>_<window>_<source>`, for example `mem_util_5m_prometheus` or `cpu_util_10s_opensearch`.
- Result CSVs are named `<prediction_length>_<data_used>.csv`, for example `512_mem_util_5m_prometheus.csv`.
- Plot folders are named `<prediction_length>_<data_used>/`; each folder contains two plots: chronos and toto.

Root paths in this repo: plots live under Forecast Plots and CSVs under CSV Files.
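
Returning to the sampling trade-off described for Toto: the sketch below does not call Toto at all. It only illustrates, under stated assumptions, why a quantile estimated from more samples is more stable from run to run. The lognormal distribution stands in for a heavy-tailed predictive distribution, and the function names, trial count, and seed are arbitrary choices for the example.

```python
import random
import statistics


def sample_quantile(samples, level):
    """Simple order-statistic estimate of a quantile from raw samples."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(level * len(ordered)))
    return ordered[index]


def quantile_spread(num_samples, trials=300, level=0.9, seed=0):
    """Std of the estimated quantile across repeated sampling runs."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        # Stand-in for one forecast step: draw num_samples heavy-tailed values.
        draws = [rng.lognormvariate(0.0, 1.0) for _ in range(num_samples)]
        estimates.append(sample_quantile(draws, level))
    return statistics.pstdev(estimates)


for n in (8, 32, 256):
    print(n, round(quantile_spread(n), 3))
```

The spread shrinks as the sample count grows, while cost grows roughly linearly with it, which is the latency trade-off behind tuning `num_samples`.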

CSV header dictionary:

- Toto Time: Toto inference time for the horizon (ms or s, as exported)
- Chronos Time: Chronos inference time for the horizon
- Toto MASE: MASE for Toto's point forecast vs naive
- Chronos MASE: MASE for Chronos's point forecast vs naive
- Toto CRPS: CRPS for Toto's predictive distribution
- Chronos CRPS: CRPS for Chronos's predictive distribution
- Input Length: context length used for inference

This setup mirrors how you'd actually deploy forecasting in an observability stack: rolling updates, short and long horizons, and guardrails against regression. All the forecasting results are available in the [GitHub repo](https://github.com/parseablehq/zero-shot-forecasting).

General observations (zero-shot): Both models perform well on series with clear cyclic structure (Prometheus memory at 5m/10m) and degrade when periodicity is weak (10s windows, spike-prone OpenSearch CPU). Use MASE to confirm improvement over naive and CRPS to ensure uncertainty isn't over-confident.

What you'll see in the charts: smooth cycles, tight quantile bands, and stable long-horizon fans.

Metrics: Both Chronos and Toto consistently beat naive (MASE < 1). Chronos often edges ahead at 512-step horizons thanks to its direct multi-step design, which keeps error growth in check. CRPS is strong for both, reflecting predictable cycles and well-calibrated uncertainty.

Figures: Memory utilization, 5m aggregation, 256-step horizon. Top: Chronos; Bottom: Toto. Bands show 0.1–0.9 quantiles.

What this means in ops: Prefer Chronos for long-range capacity planning on periodic services. Drive alerts from quantile bands (for example, 0.9 for spikes, median/0.8 for budgets) rather than a single point line.

What you'll see in the charts: frequent spikes, asymmetric tails, and wider bands, especially during bursts.

Metrics: CRPS becomes the deciding factor. Toto can better capture heavy tails with adequate sampling, improving distributional calibration, but with higher latency.
Chronos still performs, but if the bands look too narrow on spike-prone series, be cautious: over-sharp uncertainty is a red flag.

Figures: CPU utilization (OpenSearch), 5m aggregation, 64-step horizon. Top: Chronos; Bottom: Toto. Wider bands reflect tail risk.

What this means in ops: Increase Toto's samples when tail risk matters (on-call noise, error budgets). If latency is a concern, dial back samples or switch horizons. Consider widening alert thresholds to follow the upper quantile band during known burst windows.

Forecasting results are only the first step. Parseable gives teams a practical place to operationalize those insights across logs, metrics, and traces, so stable trends, tail risk, and widening uncertainty bands can actually drive action. See Parseable in action.

What you'll see in the charts: weaker periodicity, noisier signals, and visibly broader uncertainty.

Metrics: Without strong cyclic structure, both models degrade. Expect MASE ≈ 1 (or marginally better) and higher CRPS. That's not failure, it's honest uncertainty.

Figures: Memory utilization, 10s aggregation, 64-step horizon. Top: Chronos; Bottom: Toto. Bands widen due to weak periodicity.

What this means in ops: For near-term paging, prefer 64-step horizons and smoothing/aggregation. For capacity planning, use 5m/10m aggregates where structure is clearer.

How to read these figures: the shaded fan shows 0.1–0.9 quantiles; the central line is the median. Tighter shading means lower uncertainty. If the upper band crosses your alert threshold, assume higher risk even if the median stays below it.
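
The reading rule above (treat an upper-band crossing as elevated risk even when the median is below the threshold) can be sketched as a small labeling function. The label names, threshold, and forecast values are hypothetical and only illustrate the logic.

```python
def flag_risk(median, upper, threshold):
    """Label each forecast step by how the band sits relative to the threshold."""
    labels = []
    for m, u in zip(median, upper):
        if m >= threshold:
            labels.append("breach_expected")  # even the median crosses
        elif u >= threshold:
            labels.append("elevated_risk")  # only the upper band crosses
        else:
            labels.append("ok")
    return labels


# Hypothetical memory-utilization forecast (%), with the 0.9 quantile as the upper band.
median = [60.0, 62.0, 65.0, 70.0]
upper = [70.0, 78.0, 86.0, 95.0]
print(flag_risk(median, upper, threshold=85.0))
# prints ['ok', 'ok', 'elevated_risk', 'elevated_risk']
```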

If you are already benchmarking forecasting models on telemetry, the next step is productionizing the workflow. Parseable Pro includes forecasting alerts, dashboards, SQL access, and AI-native analysis, so teams can turn experiments into operational guardrails faster. Try Parseable Pro free.

- Compute MASE with sktime's `mean_absolute_scaled_error` (`sp=1`, non-seasonal) using the naive last-value baseline; provide `y_train` for each rolling origin; input is univariate per series.
- No seasonal naive: we do not use seasonal baselines; the metric's scaling uses last-value (`sp=1`).
- Gate to proceed: target MASE < 1 by a clear margin (≥10–20%) on stable series.
- Chronos: use 0.1–0.9 quantiles; generate 64, 256, and 336 steps (longer if needed). Save outputs and latencies.
- Toto: start with `num_samples=32`; record inference latency and accuracy. If tails matter and CRPS looks over-confident, increase `num_samples`.
- Track both MASE (point) and [CRPS](https://en.wikipedia.org/wiki/Continuous_ranked_probability_score) (probabilistic).
- Acceptance gates:
  - MASE < 1 for your chosen horizon(s).
  - CRPS ≤ naive and stable across windows (no drift).
  - Coverage check: ~80% of actuals inside the 0.1–0.9 band (by design it won't be 100%; the point is "not too tight").
- If coverage is low or CRPS rises: widen bands (Chronos, Toto) or increase samples (Toto).
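
The acceptance gates above can be expressed as a small check. This is a sketch under assumptions, not the pipeline used in the study: the function names are hypothetical, and the `min_coverage` floor of 0.70 is an illustrative default for the 0.1–0.9 band, whose nominal coverage is 80%.

```python
def band_coverage(actuals, lower, upper):
    """Fraction of observed values that fall inside the [lower, upper] band."""
    inside = sum(1 for y, lo, hi in zip(actuals, lower, upper) if lo <= y <= hi)
    return inside / len(actuals)


def passes_gates(mase, crps, naive_crps, coverage, min_coverage=0.70):
    """Beat naive on MASE, match or beat it on CRPS, and keep coverage from collapsing."""
    return mase < 1.0 and crps <= naive_crps and coverage >= min_coverage


# Toy window: four of five actuals land inside the 0.1-0.9 band.
actuals = [5.0, 6.0, 7.0, 8.0, 20.0]
lower = [4.0, 5.0, 6.0, 7.0, 8.0]
upper = [6.0, 7.0, 8.0, 9.0, 10.0]

coverage = band_coverage(actuals, lower, upper)
print(coverage, passes_gates(mase=0.8, crps=1.2, naive_crps=1.5, coverage=coverage))
```

Running the check per rolling window, rather than once over the whole history, is what lets you see drift in CRPS or coverage.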

- Paging/on-call: prefer 64-step horizons; alert when the 0.9 quantile stays above a threshold for N consecutive points (e.g., N = 3–5 windows).
- Capacity planning: use 256–336 steps; plan to the median and budget to the 0.8/0.9 quantile.
- Executive reports: show median plus bands; avoid single-line forecasts.

- Use quantile bands for thresholds (e.g., 0.9 for "high-risk" services).
- Page only if both conditions hold for N windows: upper quantile > threshold AND observed is trending up.
- Add dampening: suspend pages if uncertainty band width is rapidly expanding after a deploy.

- Dashboard by service/window (10s/5m/10m):
  - MASE and CRPS as time series; 7-day moving averages.
  - Quantile coverage (% of points within the 0.1–0.9 band).
  - Inference latency by horizon and model; model version and config (samples/context).
- Alert on trends, not spikes:
  - MASE ≥ 1 for N consecutive windows.
  - CRPS +25% vs baseline for M windows.
  - Coverage < 70–75% for K windows (over-tight bands).

- If MASE worsens or CRPS inflates, automatically fall back to the naive last-value baseline.
- Temporarily widen alert thresholds to quantile bands (e.g., use 0.9 instead of the mean) until retraining.
- When noise is high (10s windows), aggregate to 5m/10m for capacity decisions.

- Monitor missing timestamps, step changes, near-zero plateaus, and value spikes; backfill or mask before scoring.
- Watch for regime shifts (deploys/traffic changes) that flip winners; keep baselines alive for quick rollbacks.

For `mem_util_5m_prometheus` capacity planning, anchor to the median for purchase decisions and keep a "risk band" at 0.9. If leadership asks for p95, publish it with a note on interpolation and prove that ~95% of points fell below it in the last 7 days.

On `cpu_util_5m_opensearch`, 64-step predictions keep CRPS low and coverage healthy. At 256 steps, CRPS inflates: bump `num_samples` (e.g., from 32 to 64) or forecast in two 128-step segments and re-seed with observations.

`mem_util_10m_prometheus` loses its daily dip.
Chronos still looks fine at the median, but coverage falls to ~60%. Flip alerts to the baseline temporarily and retrain on post-deploy data.

`mem_util_10s_prometheus` at 64 steps shows honest wide bands and MASE near 1. Aggregate to 5m for planning (where Chronos beats naive) and keep 10s only for exploratory drilling.

If you remember only two things, make them these: MASE tells you if your point forecast truly beats naive. CRPS tells you if your uncertainty is believable when it matters most.

On periodic series, Chronos often wins, especially at long horizons, thanks to direct multi-step forecasting and clean quantile bands. On spiky, heavy-tailed series, Toto shines when you tune sampling to balance CRPS and latency. Most teams will benefit from using both: Chronos for stable workloads, Toto where tails do the talking.

Forecasting becomes valuable when it moves beyond offline evaluation and starts shaping real operational decisions. Parseable helps teams do that with unified telemetry, forecasting alerts, dashboards, and AI-native analysis built for production observability. Start your free trial.