If you ship a probabilistic forecast, the single highest-value habit you can build is logging your forecasts so you can grade them later. Sabermetrics figured this out forty years ago. Weather forecasting has done it for a century. Most ML model owners still do not do it.
This post walks through a 40-line Python recipe that logs an ML option-pricing model's per-contract forecast of the probability of finishing in the money (ITM) to a CSV, so you can compute the Brier loss after the option expires. The recipe is part of a small open-source cookbook for the Helium MCP REST surface, an MCP server that also exposes its tools as plain HTTPS GETs, which makes it convenient as a teaching substrate even if you do not use MCP.
You will not need an API key, a signup, or a Python SDK.
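For every option contract we care about, we want one row that records when the forecast was made, which contract it covers (symbol, strike, expiration, option type), what the model predicted (price and probability of finishing ITM), and blank columns for the realized outcome, to be filled in at expiration.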
When we Brier-grade later, we get one number per contract. Average across many contracts and we have a directly comparable calibration score, exactly the discipline a baseball win-probability model or a weather precipitation forecast gets graded on.
The Helium server exposes its option-pricing tool at this URL:
GET https://heliumtrades.com/mcp_option_price/
?symbol=AAPL&strike=310&expiration=2026-06-26&option_type=call
Plain GET, JSON in / JSON out, no auth header, free tier of 50 calls per IP per day. A live call returns:
{
  "symbol": "AAPL",
  "strike": 310.0,
  "expiration": "2026-06-26",
  "option_type": "call",
  "predicted_price": 6.53,
  "prob_itm": 0.42,
  "options_data_date": "2026-05-26"
}
Two of those fields are forecasts about the future: predicted_price (the model's fair value) and prob_itm (the model's probability the option finishes ITM at expiration). The expiration date in the request is the fixed resolution date. That gives us a clean falsifiable target.
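Because a logged forecast cannot be corrected after the fact, it may be worth sanity-checking a response before writing it down. The sketch below is one way to do that. `validate_forecast` is a hypothetical helper, not something the cookbook or the server provides, and the checks assume the response shape shown above.

```python
def validate_forecast(data):
    """Return (predicted_price, prob_itm) or raise ValueError.

    Hypothetical helper: assumes the response shape shown above.
    """
    missing = [k for k in ("predicted_price", "prob_itm") if data.get(k) is None]
    if missing:
        raise ValueError(f"response is missing fields: {missing}")
    price = float(data["predicted_price"])
    prob = float(data["prob_itm"])
    # A probability outside [0, 1] would make the Brier loss meaningless.
    if not 0.0 <= prob <= 1.0:
        raise ValueError(f"prob_itm out of range: {prob}")
    if price < 0.0:
        raise ValueError(f"predicted_price is negative: {price}")
    return price, prob


# Values copied from the sample response above.
sample = {"symbol": "AAPL", "strike": 310.0, "predicted_price": 6.53, "prob_itm": 0.42}
print(validate_forecast(sample))
```

Failing loudly here is a design choice: a row with a missing or out-of-range probability would otherwise surface only weeks later, at grading time.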
"""Log Helium's ML option-price + prob_itm forecasts to a CSV so you can
Brier-grade them at expiration.
"""
import csv
import sys
from datetime import datetime, timezone
from pathlib import Path

import requests

ENDPOINT = "https://heliumtrades.com/mcp_option_price/"
LOG_FILE = Path("calibration_log.csv")


def main(symbol, strike, expiration, option_type):
    params = {
        "symbol": symbol, "strike": strike,
        "expiration": expiration, "option_type": option_type,
    }
    resp = requests.get(ENDPOINT, params=params, timeout=30)
    resp.raise_for_status()
    data = resp.json()

    # Write the header only when the log file is first created.
    is_new = not LOG_FILE.exists()
    with LOG_FILE.open("a", newline="") as f:
        w = csv.writer(f)
        if is_new:
            w.writerow([
                "timestamp", "symbol", "strike", "expiration", "option_type",
                "helium_predicted_price", "helium_prob_itm", "helium_data_date",
                "market_mark", "realized_underlying_price", "realized_itm",
                "brier_loss",
            ])
        w.writerow([
            # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12.
            datetime.now(timezone.utc).isoformat(timespec="seconds"),
            symbol, strike, expiration, option_type,
            data.get("predicted_price"), data.get("prob_itm"),
            data.get("options_data_date"),
            "", "", "", "",  # outcome columns, filled in after expiration
        ])
    # Use .get() here too so a missing field cannot crash after the row is written.
    print(f"Logged {symbol} ${strike} {option_type.upper()} {expiration}: "
          f"predicted={data.get('predicted_price')} prob_itm={data.get('prob_itm')}")


if __name__ == "__main__":
    main(sys.argv[1], float(sys.argv[2]), sys.argv[3], sys.argv[4])
Save as track.py, then:
pip install requests
python track.py AAPL 310 2026-06-26 call
python track.py AAPL 295 2026-06-26 put
python track.py NVDA 220 2026-07-17 call
The script appends one row per contract to calibration_log.csv. Run it once a day for the same contracts to capture how the forecast evolves over time.
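The four trailing columns in each row are placeholders for the outcome. Here is a minimal sketch of filling in realized_underlying_price once a contract has expired. `fill_realized` and the `closes` mapping are illustrative assumptions, not part of the cookbook, and the closing prices have to come from your own data source.

```python
import csv
from pathlib import Path


def fill_realized(log_path, closes):
    """Fill realized_underlying_price for expired contracts.

    `closes` maps (symbol, expiration) -> closing price of the underlying
    on the expiration date. Hypothetical helper; returns rows updated.
    """
    log_path = Path(log_path)
    with log_path.open(newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)

    updated = 0
    for row in rows:
        key = (row["symbol"], row["expiration"])
        # Only touch rows that are still unresolved and that we have a price for.
        if row["realized_underlying_price"] == "" and key in closes:
            row["realized_underlying_price"] = closes[key]
            updated += 1

    # Rewrite the whole file with the same header and column order.
    with log_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return updated
```

A call would look like `fill_realized("calibration_log.csv", {("AAPL", "2026-06-26"): close_price})`, where `close_price` is whatever your price source reports for that date. Rows that already have a realized price are left alone, so the helper is safe to re-run.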
At expiration, fill in the realized underlying price and compute Brier loss. For a single contract the Brier loss for the prob_itm forecast is:
brier_loss = (prob_itm - realized_itm) ** 2
where realized_itm is 1 if the contract finished in the money and 0 otherwise. Score every contract you logged, average the losses, and you have a calibration number you can compare across models, weeks, or strike regimes.
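To make the arithmetic concrete, here is the formula applied to the 0.42 forecast from the sample response under each possible outcome. Both outcomes are hypothetical, since that contract has not resolved.

```python
def brier_loss(prob_itm, realized_itm):
    """Squared error between a probability forecast and a 0/1 outcome."""
    return (prob_itm - realized_itm) ** 2


# The 0.42 forecast from the sample response, under each possible outcome.
print(round(brier_loss(0.42, 1), 4))  # finishes ITM: (0.42 - 1)^2 -> 0.3364
print(round(brier_loss(0.42, 0), 4))  # finishes OTM: (0.42 - 0)^2 -> 0.1764

# Reference point: a forecast of 0.5 scores 0.25 whatever happens.
print(brier_loss(0.5, 1), brier_loss(0.5, 0))  # -> 0.25 0.25
```

Lower is better and 0 is a perfect score. A mean loss below 0.25 beats always answering 0.5, which is a handy sanity baseline when reading the averaged number.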
A quick scorer:
import pandas as pd

df = pd.read_csv("calibration_log.csv")


def realized_itm(row):
    s = float(row["realized_underlying_price"])
    k = float(row["strike"])
    # Strictly past the strike: a close exactly at the strike is not ITM.
    if row["option_type"] == "call":
        return 1 if s > k else 0
    return 1 if s < k else 0


# read_csv parses blank cells as NaN, so keep only rows with a realized price.
resolved = df[df["realized_underlying_price"].notna()].copy()
if resolved.empty:
    raise SystemExit("No resolved contracts to grade yet.")
resolved["realized_itm"] = resolved.apply(realized_itm, axis=1)
resolved["brier_loss"] = (
    resolved["helium_prob_itm"].astype(float) - resolved["realized_itm"]
) ** 2

print(f"Contracts graded: {len(resolved)}")
print(f"Mean Brier loss: {resolved['brier_loss'].mean():.4f}")
print("Calibration histogram:")
# Observed ITM frequency within each forecast-probability bin.
print(resolved.groupby(
    pd.cut(resolved["helium_prob_itm"].astype(float),
           [0, 0.25, 0.5, 0.75, 1.0], include_lowest=True),
    observed=False,
)["realized_itm"].mean())
The calibration histogram is the part most people skip. A model with mean Brier loss of 0.18 can still be wildly miscalibrated in specific probability bins (overconfident at extreme ends, say). The histogram tells you where it is miscalibrated.
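One way to read the per-bin output is to put the mean forecast next to the observed frequency in each bin. The gap between the two is the miscalibration. The sketch below does this with pandas on made-up forecasts and outcomes, purely for illustration. None of these numbers come from Helium.

```python
import pandas as pd

# Made-up forecasts and outcomes, for illustration only.
toy = pd.DataFrame({
    "prob_itm":     [0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.90, 0.95],
    "realized_itm": [0,    1,    0,    1,    1,    0,    1,    0],
})

# Same bin edges as the scorer; include_lowest keeps a forecast of exactly 0.
bins = pd.cut(
    toy["prob_itm"], [0, 0.25, 0.5, 0.75, 1.0], include_lowest=True
).rename("bin")

table = toy.groupby(bins, observed=True).agg(
    n=("realized_itm", "size"),
    mean_forecast=("prob_itm", "mean"),
    observed_freq=("realized_itm", "mean"),
)
# Positive gap: the forecast was higher than what actually happened.
table["gap"] = table["mean_forecast"] - table["observed_freq"]
print(table.round(3))
```

In this toy data every bin resolves ITM half the time, so the gap is negative in the low bins and positive in the high ones: the forecasts are too extreme at both ends. With only two contracts per bin this is an illustration, not evidence. Real bins need many more resolved contracts before the gaps mean anything.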
Most quant content compares predicted prices to current prices and stops there. That comparison cannot distinguish between "the model is right and the market is wrong" and the reverse, and both are unfalsifiable until expiration. Probability-ITM, on the other hand, has an unambiguous resolution: the underlying either closes on the in-the-money side of the strike (above it for a call, below it for a put) or it does not.
So prob_itm is the friendliest output to grade. If you want to spend an hour playing with calibration intuition, log forecasts for 50 contracts across a few different expirations, wait for them to resolve, and run the scorer.
The same pattern (one endpoint, one short script, real output) works for the other tools the Helium server exposes:
overall credibility, fearful bias, emotionality_score, or any other dimension
All six recipes are in the open-source cookbook here:
➡️ github.com/connerlambden/helium-mcp-cookbook
The cookbook is MIT-licensed. Fork it, modify it, write your own recipes. PRs welcome.
The same ten tools are also exposed as a remote MCP server. If you would rather call them from inside Claude Desktop, Cursor, or any MCP-aware client, the config is:
{
  "mcpServers": {
    "helium": {
      "command": "npx",
      "args": ["mcp-remote", "https://heliumtrades.com/mcp"]
    }
  }
}
After a client restart your LLM can call the same tools by name. The Helium repo is at github.com/connerlambden/helium-mcp.
If your model emits probabilities, you should grade them. The friction-free version is a 40-line script and a CSV. The day you put that habit in place is the day your forecasts start improving, not because the model changes, but because you finally have a feedback signal to learn from.