# How to Brier-grade your own ML option-pricing forecasts in 40 lines of Python

> Source: <https://dev.to/connerlambden/how-to-brier-grade-your-own-ml-option-pricing-forecasts-in-40-lines-of-python-2gb2>
> Published: 2026-05-27 03:33:09+00:00

If you ship a probabilistic forecast, the single highest-value habit you can build is *logging your forecasts so you can grade them later*. Sabermetrics figured this out forty years ago. Weather forecasting has done it for a century. Most ML model owners still do not do it.

This post walks through a 40-line Python recipe that logs an ML option-pricing model's per-contract probability-ITM forecast to a CSV, so you can compute the Brier loss after the option expires. The recipe is part of a small open-source cookbook for the [Helium MCP](https://heliumtrades.com/mcp-page/) REST surface — an MCP server that also exposes its tools as plain HTTPS GETs, which makes it convenient as a teaching substrate even if you do not use MCP.

You will not need an API key, a signup, or a Python SDK.

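For every option contract we care about, we want one row that records when the forecast was made, which contract it covers (symbol, strike, expiration, option type), what the model predicted (fair value and probability of finishing ITM), and, once the option expires, what actually happened.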

When we Brier-grade later, we get one number per contract. Average across many contracts and we have a directly comparable calibration score — exactly the discipline a baseball win-probability model or a weather precipitation forecast gets graded on.

The Helium server exposes its option-pricing tool at this URL:

```
GET https://heliumtrades.com/mcp_option_price/
    ?symbol=AAPL&strike=310&expiration=2026-06-26&option_type=call
```

Plain GET, query parameters in, JSON out, no auth header, free tier of 50 calls per IP per day. A live call returns:

```
{
  "symbol": "AAPL",
  "strike": 310.0,
  "expiration": "2026-06-26",
  "option_type": "call",
  "predicted_price": 6.53,
  "prob_itm": 0.42,
  "options_data_date": "2026-05-26"
}
```

Two of those fields are forecasts about the future: `predicted_price` (the model's fair value) and `prob_itm` (the model's probability the option finishes ITM at expiration). The expiration date in the request is the fixed resolution date. That gives us a clean falsifiable target.
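Since `prob_itm` is the number that gets graded, it can be worth checking it before it reaches the log. The sketch below is one way to do that, using the sample response shown above; the helper name `extract_prob_itm` is hypothetical and not part of the Helium API, and the rule that a gradeable forecast must be a number in [0, 1] is an assumption about what you want to accept.

```python
import json

# Sample response copied from the example above; a live call would supply this text.
SAMPLE_RESPONSE = """{
  "symbol": "AAPL",
  "strike": 310.0,
  "expiration": "2026-06-26",
  "option_type": "call",
  "predicted_price": 6.53,
  "prob_itm": 0.42,
  "options_data_date": "2026-05-26"
}"""


def extract_prob_itm(payload):
    """Return prob_itm as a float, or raise ValueError if it cannot be graded.

    Hypothetical helper: the checks here are assumptions, not API guarantees.
    """
    prob = payload.get("prob_itm")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(prob, bool) or not isinstance(prob, (int, float)):
        raise ValueError(f"prob_itm missing or non-numeric: {prob!r}")
    if not 0.0 <= prob <= 1.0:
        raise ValueError(f"prob_itm outside [0, 1]: {prob!r}")
    return float(prob)


forecast = extract_prob_itm(json.loads(SAMPLE_RESPONSE))
print(f"gradeable forecast: {forecast}")
```

A row with a missing or out-of-range probability cannot be scored later, so rejecting it at logging time keeps the eventual average honest.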

```
"""Log Helium's ML option-price + prob_itm forecasts to a CSV so you can
Brier-grade them at expiration.
"""
import csv
import sys
from datetime import datetime, timezone
from pathlib import Path

import requests

ENDPOINT = "https://heliumtrades.com/mcp_option_price/"
LOG_FILE = Path("calibration_log.csv")

def main(symbol, strike, expiration, option_type):
    params = {
        "symbol": symbol, "strike": strike,
        "expiration": expiration, "option_type": option_type,
    }
    resp = requests.get(ENDPOINT, params=params, timeout=30)
    resp.raise_for_status()  # fail loudly on HTTP error statuses
    data = resp.json()

    # Write the header only when the log file is first created.
    is_new = not LOG_FILE.exists()
    with LOG_FILE.open("a", newline="") as f:
        w = csv.writer(f)
        if is_new:
            w.writerow([
                "timestamp", "symbol", "strike", "expiration", "option_type",
                "helium_predicted_price", "helium_prob_itm", "helium_data_date",
                "market_mark", "realized_underlying_price", "realized_itm",
                "brier_loss",
            ])
        w.writerow([
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated
            # as of Python 3.12).
            datetime.now(timezone.utc).isoformat(timespec="seconds"),
            symbol, strike, expiration, option_type,
            data.get("predicted_price"), data.get("prob_itm"),
            data.get("options_data_date"),
            # Outcome columns stay blank until the contract expires.
            "", "", "", "",
        ])
    # Use .get() here too, so a missing field cannot raise after the row is written.
    print(f"Logged {symbol} ${strike} {option_type.upper()} {expiration}: "
          f"predicted={data.get('predicted_price')} prob_itm={data.get('prob_itm')}")

if __name__ == "__main__":
    if len(sys.argv) != 5:
        sys.exit("usage: python track.py SYMBOL STRIKE EXPIRATION call|put")
    main(sys.argv[1], float(sys.argv[2]), sys.argv[3], sys.argv[4])
```

Save as `track.py`, then:

```
pip install requests
python track.py AAPL 310 2026-06-26 call
python track.py AAPL 295 2026-06-26 put
python track.py NVDA 220 2026-07-17 call
# repeat for any contracts you want to grade later
```

The script appends one row per contract to `calibration_log.csv`. Rerun it once a day for the same contracts to capture how the forecast evolves over time.
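If the same contract is logged on more than one day, the log holds several rows for it, and averaging all of them would count that contract several times in the final score. One option, sketched below, is to collapse the log to a single forecast per contract before grading. The helper name `one_row_per_contract` and the sample rows are made up for illustration; the column names match the header `track.py` writes.

```python
import csv
import io

# Columns that together identify one contract in the log.
KEY_FIELDS = ("symbol", "strike", "expiration", "option_type")


def one_row_per_contract(rows, keep="first"):
    """Collapse repeated log rows to one forecast per contract.

    rows: iterable of dicts in log order (oldest first, as the script appends).
    keep: "first" keeps the earliest forecast, "last" keeps the most recent.
    """
    chosen = {}
    for row in rows:
        key = tuple(row[field] for field in KEY_FIELDS)
        if keep == "last" or key not in chosen:
            chosen[key] = row
    return list(chosen.values())


# Made-up log excerpt: AAPL was logged on two days, NVDA once.
SAMPLE_LOG = """timestamp,symbol,strike,expiration,option_type,helium_prob_itm
2026-05-26T14:00:00,AAPL,310.0,2026-06-26,call,0.42
2026-05-27T14:00:00,AAPL,310.0,2026-06-26,call,0.45
2026-05-27T14:00:00,NVDA,220.0,2026-07-17,call,0.30
"""

sample_rows = list(csv.DictReader(io.StringIO(SAMPLE_LOG)))
earliest = one_row_per_contract(sample_rows, keep="first")
latest = one_row_per_contract(sample_rows, keep="last")
print([r["helium_prob_itm"] for r in earliest])
print([r["helium_prob_itm"] for r in latest])
```

Grading the earliest forecast measures the model at its longest horizon; grading the latest measures it closest to expiration. Either is defensible as long as the choice is made before the outcomes are known.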

At expiration, fill in the realized underlying price and compute Brier loss. For a single contract the Brier loss for the prob_itm forecast is:

```
brier_loss = (prob_itm - realized_itm) ** 2
```

where `realized_itm` is 1 if the contract finished in the money and 0 otherwise. Score every contract you logged, average the losses, and you have a calibration number you can compare across models, weeks, or strike regimes.
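To make the arithmetic concrete, here is a small worked sketch of the formula. The function names are illustrative, the closing prices are made up, and treating a close exactly at the strike as not in the money is an assumption. The 0.42 forecast is the one from the sample response earlier in the post.

```python
def realized_itm(option_type, strike, close):
    """Return 1 if the option is in the money at the expiration close, else 0.

    Assumption: a close exactly at the strike counts as not in the money.
    """
    if option_type.lower() == "call":
        return 1 if close > strike else 0
    return 1 if close < strike else 0


def brier_loss(prob_itm, outcome):
    """Squared error between the forecast probability and the 0/1 outcome."""
    return (prob_itm - outcome) ** 2


def mean_brier(pairs):
    """Average Brier loss over (prob_itm, outcome) pairs."""
    losses = [brier_loss(p, o) for p, o in pairs]
    return sum(losses) / len(losses)


# A 0.42 forecast on a 310 call, under two made-up expiration closes.
finished_itm = realized_itm("call", 310.0, 315.0)   # close above strike
finished_otm = realized_itm("call", 310.0, 305.0)   # close below strike
print(f"{brier_loss(0.42, finished_itm):.4f}")  # (0.42 - 1)^2 = 0.3364
print(f"{brier_loss(0.42, finished_otm):.4f}")  # (0.42 - 0)^2 = 0.1764

# Reference point: always forecasting 0.5 scores 0.25 whatever happens.
print(mean_brier([(0.5, 1), (0.5, 0)]))
```

Lower is better, 0 is a perfect score, and 0.25 is what a forecaster earns by always saying 0.5, so a mean loss above 0.25 means the forecasts did worse than never committing to anything.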

A quick scorer:

``` python
import pandas as pd

df = pd.read_csv("calibration_log.csv")

def realized_itm(row):
    s = float(row["realized_underlying_price"])
    k = float(row["strike"])
    # Strict comparison: a close exactly at the strike is at the money, not ITM.
    if str(row["option_type"]).lower() == "call":
        return 1 if s > k else 0
    return 1 if s < k else 0

# read_csv loads blank cells as NaN, so filter with notna() rather than != "".
resolved = df[df["realized_underlying_price"].notna()].copy()
if resolved.empty:
    raise SystemExit("No resolved contracts yet.")
resolved["realized_itm"] = resolved.apply(realized_itm, axis=1)
resolved["brier_loss"] = (
    resolved["helium_prob_itm"].astype(float) - resolved["realized_itm"]
) ** 2

print(f"Contracts graded: {len(resolved)}")
print(f"Mean Brier loss: {resolved['brier_loss'].mean():.4f}")
print("Calibration histogram:")
# include_lowest=True keeps forecasts of exactly 0 in the first bin.
bins = pd.cut(resolved["helium_prob_itm"].astype(float),
              [0, 0.25, 0.5, 0.75, 1.0], include_lowest=True)
print(resolved.groupby(bins, observed=False)["realized_itm"].mean())
```

The calibration histogram is the part most people skip. A model with mean Brier loss of 0.18 can still be wildly miscalibrated in specific probability bins (overconfident at extreme ends, say). The histogram tells you *where* it is miscalibrated.
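The per-bin view is easier to read when each bin shows three things side by side: how many forecasts fell in it, the average forecast, and the realized hit rate. Below is a dependency-free sketch of that table. The function name `reliability_table` and the forecast/outcome pairs are invented for illustration; the data was chosen so the extreme bins look overconfident.

```python
def reliability_table(pairs, edges=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Compare mean forecast to realized hit rate within probability bins.

    pairs: iterable of (prob, outcome) with outcome 0 or 1.
    Returns a list of (low, high, count, mean_prob, hit_rate) for non-empty bins.
    Bins are (low, high], except that the first bin also includes its low edge.
    """
    bins = [[] for _ in range(len(edges) - 1)]
    for prob, outcome in pairs:
        if not edges[0] <= prob <= edges[-1]:
            continue  # skip values that are not valid probabilities
        for i in range(len(bins)):
            if prob <= edges[i + 1]:
                bins[i].append((prob, outcome))
                break
    table = []
    for i, members in enumerate(bins):
        if not members:
            continue
        n = len(members)
        mean_prob = sum(p for p, _ in members) / n
        hit_rate = sum(o for _, o in members) / n
        table.append((edges[i], edges[i + 1], n, mean_prob, hit_rate))
    return table


# Made-up outcomes: the 0.9 forecasts hit only half the time.
sample_pairs = [
    (0.9, 1), (0.9, 1), (0.9, 0), (0.9, 0),
    (0.1, 0), (0.1, 0), (0.1, 1), (0.1, 0),
    (0.5, 1), (0.5, 0),
]
table = reliability_table(sample_pairs)
for low, high, n, mean_prob, hit_rate in table:
    print(f"({low:.2f}, {high:.2f}]  n={n}  "
          f"forecast={mean_prob:.2f}  realized={hit_rate:.2f}")
```

A well-calibrated model has forecast and realized columns that roughly agree in every bin. With only a handful of contracts per bin the hit rates are noisy, so the count column matters as much as the gap.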

Most quant content compares predicted prices to current prices and stops there. That comparison cannot distinguish between "the model is right and the market is wrong" and the reverse — and both are unfalsifiable until expiration. Probability-ITM, on the other hand, has an unambiguous resolution: the underlying either closes above the strike or it does not.

So `prob_itm` is the friendliest output to grade. If you want to spend an hour playing with calibration intuition, log forecasts for 50 contracts across a few different expirations, wait for them to resolve, and run the scorer.
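Before the scorer can run, the outcome columns need values. The post does not say where to get expiration closes, so the sketch below assumes you supply them yourself as a dictionary keyed by symbol and expiration. The helper name `resolve_rows`, the NVDA forecast, and the closing price are made up; the column names are the ones `track.py` writes.

```python
def resolve_rows(rows, closes):
    """Fill the outcome columns for rows whose expiration close is known.

    rows: list of dicts using the log's column names (values as strings).
    closes: dict mapping (symbol, expiration) to the underlying's close.
    Rows without a known close are left untouched.
    """
    for row in rows:
        close = closes.get((row["symbol"], row["expiration"]))
        if close is None:
            continue
        strike = float(row["strike"])
        # Assumption: a close exactly at the strike is not in the money.
        if row["option_type"].lower() == "call":
            itm = 1 if close > strike else 0
        else:
            itm = 1 if close < strike else 0
        row["realized_underlying_price"] = close
        row["realized_itm"] = itm
        # Only score rows that actually carry a forecast.
        if row.get("helium_prob_itm") not in ("", None):
            row["brier_loss"] = (float(row["helium_prob_itm"]) - itm) ** 2
    return rows


demo_rows = [
    {"symbol": "AAPL", "strike": "310.0", "expiration": "2026-06-26",
     "option_type": "call", "helium_prob_itm": "0.42",
     "realized_underlying_price": "", "realized_itm": "", "brier_loss": ""},
    {"symbol": "NVDA", "strike": "220.0", "expiration": "2026-07-17",
     "option_type": "call", "helium_prob_itm": "0.30",
     "realized_underlying_price": "", "realized_itm": "", "brier_loss": ""},
]
demo_closes = {("AAPL", "2026-06-26"): 315.0}  # made-up closing price

resolved_demo = resolve_rows(demo_rows, demo_closes)
for row in resolved_demo:
    print(row["symbol"], row["realized_itm"], row["brier_loss"])
```

Writing the updated rows back with `csv.DictWriter` gives the scorer a log in which resolved and unresolved contracts can be told apart by whether `realized_underlying_price` is blank.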

The same pattern — one endpoint, one short script, real output — works for the other tools the Helium server exposes:

`overall credibility`, `fearful bias`, `emotionality_score`, or any other dimension

All six recipes are in the open-source cookbook here:

➡️ [github.com/connerlambden/helium-mcp-cookbook](https://github.com/connerlambden/helium-mcp-cookbook)

The cookbook is MIT-licensed. Fork it, modify it, write your own recipes. PRs welcome.

The same ten tools are also exposed as a remote MCP server. If you would rather call them from inside Claude Desktop, Cursor, or any MCP-aware client, the config is:

```
{
  "mcpServers": {
    "helium": {
      "command": "npx",
      "args": ["mcp-remote", "https://heliumtrades.com/mcp"]
    }
  }
}
```

After a client restart your LLM can call the same tools by name. The Helium repo is at [github.com/connerlambden/helium-mcp](https://github.com/connerlambden/helium-mcp).

If your model emits probabilities, you should grade them. The friction-free version is a 40-line script and a CSV. The day you put that habit in place is the day your forecasts start improving — not because the model changes, but because you finally have a feedback signal to learn from.