How to Brier-grade your own ML option-pricing forecasts in 40 lines of Python

A developer has published a 40-line Python script that logs machine-learning option-pricing forecasts from the Helium MCP REST API to a CSV file, enabling Brier-score calibration after contracts expire. The open-source recipe records per-contract probability-in-the-money forecasts and predicted prices via plain HTTPS GET requests without requiring an API key or signup, then computes Brier loss against realized outcomes at expiration. The approach mirrors the forecast-grading discipline used in sabermetrics and weather forecasting, providing directly comparable calibration scores for ML probability models.

If you ship a probabilistic forecast, the single highest-value habit you can build is logging your forecasts so you can grade them later. Sabermetrics figured this out forty years ago. Weather forecasting has done it for a century. Most ML model owners still do not do it.

This post walks through a 40-line Python recipe that logs an ML option-pricing model's per-contract probability-ITM forecast to a CSV, so you can compute the Brier loss after the option expires. The recipe is part of a small open-source cookbook for the Helium MCP (https://heliumtrades.com/mcp-page/) REST surface: an MCP server that also exposes its tools as plain HTTPS GETs, which makes it convenient as a teaching substrate even if you do not use MCP. You will not need an API key, a signup, or a Python SDK.

For every option contract we care about, we want one row that records the contract (symbol, strike, expiration, option type), the model's forecast (predicted price and probability ITM) along with the time it was logged, and blank columns for the realized outcome, to be filled in at expiration. When we Brier-grade later, we get one number per contract. Average across many contracts and we have a directly comparable calibration score, exactly the discipline a baseball win-probability model or a weather precipitation forecast gets graded on.

The Helium server exposes its option-pricing tool at this URL:

    GET https://heliumtrades.com/mcp_option_price/?symbol=AAPL&strike=310&expiration=2026-06-26&option_type=call

Plain GET, query parameters in, JSON out, no auth header, free tier of 50 calls per IP per day. A live call returns:

    {
      "symbol": "AAPL",
      "strike": 310.0,
      "expiration": "2026-06-26",
      "option_type": "call",
      "predicted_price": 6.53,
      "prob_itm": 0.42,
      "options_data_date": "2026-05-26"
    }

Two of those fields are forecasts about the future: predicted_price (the model's fair value) and prob_itm (the model's probability that the option finishes ITM at expiration). The expiration date in the request is the fixed resolution date. That gives us a clean falsifiable target.

    """Log Helium's ML option-price + prob_itm forecasts to a CSV
    so you can Brier-grade them at expiration.
    """
    import csv
    import sys
    from datetime import datetime, timezone
    from pathlib import Path

    import requests

    ENDPOINT = "https://heliumtrades.com/mcp_option_price/"
    LOG_FILE = Path("calibration_log.csv")


    def main(symbol, strike, expiration, option_type):
        params = {
            "symbol": symbol,
            "strike": strike,
            "expiration": expiration,
            "option_type": option_type,
        }
        resp = requests.get(ENDPOINT, params=params, timeout=30)
        resp.raise_for_status()
        data = resp.json()

        # Write the header only when the log file is first created.
        is_new = not LOG_FILE.exists()
        with LOG_FILE.open("a", newline="") as f:
            w = csv.writer(f)
            if is_new:
                w.writerow([
                    "timestamp", "symbol", "strike", "expiration", "option_type",
                    "helium_predicted_price", "helium_prob_itm", "helium_data_date",
                    "market_mark", "realized_underlying_price", "realized_itm",
                    "brier_loss",
                ])
            w.writerow([
                # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated).
                datetime.now(timezone.utc).isoformat(timespec="seconds"),
                symbol, strike, expiration, option_type,
                data.get("predicted_price"), data.get("prob_itm"),
                data.get("options_data_date"),
                "", "", "", "",  # outcome columns, filled in at expiration
            ])
        print(
            f"Logged {symbol} ${strike} {option_type.upper()} {expiration}: "
            f"predicted={data['predicted_price']} prob_itm={data['prob_itm']}"
        )


    if __name__ == "__main__":
        main(sys.argv[1], float(sys.argv[2]), sys.argv[3], sys.argv[4])

Save as track.py, then:

    pip install requests
    python track.py AAPL 310 2026-06-26 call
    python track.py AAPL 295 2026-06-26 put
    python track.py NVDA 220 2026-07-17 call
    # repeat for any contracts you want to grade later

Each run appends one row for the given contract to calibration_log.csv. Re-run it once a day to snapshot how the forecast evolves over time. At expiration, fill in the realized underlying price and compute the Brier loss.

For a single contract, the Brier loss for the prob_itm forecast is:

    brier_loss = (prob_itm - realized_itm) ** 2

where realized_itm is 1 if the contract finished in the money and 0 otherwise. Score every contract you logged, average the losses, and you have a calibration number you can compare across models, weeks, or strike regimes.
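To make the arithmetic concrete, here is a minimal sketch that scores the prob_itm of 0.42 from the sample response under both possible outcomes. The function name brier_loss and the three-forecast list are illustrative assumptions, not part of the cookbook.

```python
def brier_loss(prob_itm: float, realized_itm: int) -> float:
    """Squared error between a forecast probability and a 0/1 outcome."""
    return (prob_itm - realized_itm) ** 2


# The same 0.42 forecast, scored under each possible outcome.
loss_if_itm = brier_loss(0.42, 1)  # contract finished in the money
loss_if_otm = brier_loss(0.42, 0)  # contract finished out of the money

# Hypothetical (forecast, outcome) pairs, averaged into one score.
forecasts = [(0.42, 1), (0.80, 1), (0.10, 0)]
mean_loss = sum(brier_loss(p, y) for p, y in forecasts) / len(forecasts)

print(round(loss_if_itm, 4))  # 0.3364
print(round(loss_if_otm, 4))  # 0.1764
print(round(mean_loss, 4))    # 0.1288
```

A useful reference point: a forecaster who always says 0.5 scores exactly 0.25 on every contract, so a mean loss at or above 0.25 means the forecasts are no better than a coin flip.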
A quick scorer:

    import pandas as pd

    df = pd.read_csv("calibration_log.csv")


    def realized_itm(row):
        s = float(row["realized_underlying_price"])
        k = float(row["strike"])
        # A close exactly at the strike counts as ITM here.
        if row["option_type"] == "call":
            return 1 if s >= k else 0
        return 1 if s <= k else 0


    # Keep only rows whose realized price has been filled in
    # (read_csv loads blank cells as NaN, not as empty strings).
    resolved = df[df["realized_underlying_price"].notna()].copy()
    resolved["realized_itm"] = resolved.apply(realized_itm, axis=1)
    resolved["brier_loss"] = (
        resolved["helium_prob_itm"].astype(float) - resolved["realized_itm"]
    ) ** 2

    print(f"Contracts graded: {len(resolved)}")
    print(f"Mean Brier loss: {resolved['brier_loss'].mean():.4f}")
    print("Calibration histogram:")
    # Realized ITM frequency within each forecast-probability bin.
    bins = pd.cut(
        resolved["helium_prob_itm"].astype(float),
        [0, 0.25, 0.5, 0.75, 1.0],
        include_lowest=True,
    )
    print(resolved.groupby(bins, observed=False)["realized_itm"].mean())

The calibration histogram is the part most people skip. A model with a mean Brier loss of 0.18 can still be wildly miscalibrated in specific probability bins (overconfident at the extremes, say). The histogram tells you where it is miscalibrated: in a well-calibrated model, the realized ITM frequency in each bin falls inside that bin's probability range.

Most quant content compares predicted prices to current prices and stops there. That comparison cannot distinguish between "the model is right and the market is wrong" and the reverse, and neither claim can be tested until expiration. Probability-ITM, on the other hand, has an unambiguous resolution: the underlying either closes above the strike or it does not. So prob_itm is the friendliest output to grade.

If you want to spend an hour building calibration intuition, log forecasts for 50 contracts across a few different expirations, wait for them to resolve, and run the scorer.

The same pattern (one endpoint, one short script, real output) works for the other tools the Helium server exposes: overall_credibility, fearful_bias, emotionality_score, or any other dimension. All six recipes are in the open-source cookbook here: https://github.com/connerlambden/helium-mcp-cookbook

The cookbook is MIT-licensed. Fork it, modify it, write your own recipes. PRs welcome.
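The pattern generalizes because the logging step does not care which tool produced the JSON. As a rough sketch (not code from the cookbook), the helper below takes any fetch function plus a list of response fields to record. The names log_fields and fake_fetch are hypothetical, and the stub returns the sample values from this post so the example runs offline. Endpoint paths for the other tools are not given in this post, so none are assumed.

```python
import csv
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def log_fields(fetch, params, fields, log_file):
    """Append one timestamped row of selected response fields to a CSV.

    fetch:  callable that takes the params dict and returns parsed JSON (a dict).
    fields: response keys to record; keys missing from the response are left blank.
    """
    data = fetch(params)
    path = Path(log_file)
    is_new = not path.exists()
    row = {"timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds")}
    row.update(params)
    row.update({name: data.get(name, "") for name in fields})
    with path.open("a", newline="") as f:
        w = csv.DictWriter(f, fieldnames=list(row))
        if is_new:
            w.writeheader()  # header only on first write
        w.writerow(row)
    return row


def fake_fetch(params):
    """Stub standing in for a real HTTP call (hypothetical, offline)."""
    return {"predicted_price": 6.53, "prob_itm": 0.42}


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_log.csv"
    contract = {"symbol": "AAPL", "strike": 310.0}
    row = log_fields(fake_fetch, contract, ["predicted_price", "prob_itm"], demo_path)
    log_fields(fake_fetch, contract, ["predicted_price", "prob_itm"], demo_path)
    logged_lines = demo_path.read_text().splitlines()

print(len(logged_lines))  # 3: one header line plus two logged rows
```

To log from a live endpoint, one could pass something like `lambda p: requests.get(ENDPOINT, params=p, timeout=30).json()` as the fetch argument, keeping the rest unchanged.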
The same ten tools are also exposed as a remote MCP server. If you would rather call them from inside Claude Desktop, Cursor, or any MCP-aware client, the config is:

    {
      "mcpServers": {
        "helium": {
          "command": "npx",
          "args": ["mcp-remote", "https://heliumtrades.com/mcp"]
        }
      }
    }

After a client restart, your LLM can call the same tools by name. The Helium repo is at https://github.com/connerlambden/helium-mcp.

If your model emits probabilities, you should grade them. The friction-free version is a 40-line script and a CSV. The day you put that habit in place is the day your forecasts start improving, not because the model changes, but because you finally have a feedback signal to learn from.