Fast Automatic ML Hyperparameter Tuning Using Optuna (w. MLflow model registry and IRIS DB)

A developer demonstrated automatic hyperparameter tuning using Optuna with MLflow and InterSystems IRIS database. The approach efficiently optimizes LightGBM models on the California Housing dataset, leveraging Optuna's search algorithms and MLflow for experiment tracking and model registry. The integration with IRIS enables concurrent studies and scalable hyperparameter searches.

This article presents a straightforward approach to automatically and efficiently tune hyperparameters for machine learning models using Optuna as the optimization framework. We explore how to use both Optuna's native storage options and InterSystems IRIS as a database backend to track the progress of hyperparameter searches. We also show how MLflow can be used to monitor experiments and manage models through its tracking and model registry UI.

This article is based on this Kaggle Notebook (https://www.kaggle.com/code/jorgeivnjh/fast-automatic-ml-hyperparameter-tuning-w-optuna), which you can run and edit yourself.

When training ML models, the choice of hyperparameters can strongly influence performance. They are not the only factor, but they can significantly affect both convergence and generalization.

Tuning hyperparameters manually takes a lot of effort, largely because hyperparameters interact with each other, so tuning them one at a time is usually not enough. For example, higher regularization may require a lower learning rate for more stable optimization. A more complex model may require stronger regularization to avoid overfitting, but at the same time, a very small learning rate on a complex model can make learning too slow.

Optuna is an open source library, released under the MIT license (which allows commercial use), that automates hyperparameter search for ML models built with popular frameworks such as scikit-learn, PyTorch, TensorFlow, and LightGBM.
It works by defining a search space and an objective metric to either minimize or maximize. Optuna then explores the search space efficiently to find well-performing configurations.

Here we use Optuna to tune a LightGBM model on a small example dataset and show how to scale the search using shared database storage. We will also use MLflow for experiment tracking and model registry, and IRIS DB as a possible Optuna storage backend for concurrent studies. We will use the California Housing dataset, commonly used in ML examples, to populate IRIS tables and run the tuning workflow.

Note: For the IRIS part, you will need an existing IRIS instance that you can connect to. I am using the one created with Docker by running the docker-compose file from this repo (https://github.com/JorgeIvanJH/IRIS_and_MLflow-Continuous-Training-Pipeline). I am also using the environment variables and requirements.txt from that repository, together with Python 3.12.

```python
import os
import datetime as dt

import dotenv
import sklearn
import pandas as pd
import sqlalchemy
from sqlalchemy import create_engine
import optuna
import lightgbm as lgb
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
import seaborn as sns
import matplotlib.pyplot as plt

dotenv.load_dotenv()

# Connection settings for the existing IRIS database
server = os.getenv("IRIS_SERVER")
port = os.getenv("IRIS_PORT")  # Standard InterSystems superserver port
namespace = os.getenv("IRIS_NAMESPACE")
username = os.getenv("IRIS_USERNAME")
password = os.getenv("IRIS_PASSWORD")

print(f"pandas version: {pd.__version__}")
print(f"sklearn version: {sklearn.__version__}")
print(f"sqlalchemy version: {sqlalchemy.__version__}")
print(f"optuna version: {optuna.__version__}")
print(f"lightgbm version: {lgb.__version__}")
print(f"seaborn version: {sns.__version__}")
print(f"matplotlib version: {plt.matplotlib.__version__}")
```

```
pandas version: 2.3.3
sklearn version: 1.8.0
sqlalchemy version: 2.0.46
optuna version: 4.8.0
lightgbm version: 4.6.0
seaborn version: 0.13.2
matplotlib version: 3.10.8
```

Optuna (https://optuna.org/) is a hyperparameter optimization framework that speeds up tuning by training multiple model configurations and learning from their results. It provides:

For a richer intro to Optuna, see this video: https://www.youtube.com/watch?v=P6NwZVl8ttc

A practical approach to efficiently find good hyperparameters is:

Important: Hyperparameter tuning must use an appropriate validation setup. Otherwise, we may only find the configuration that best overfits the validation split, rather than one that generalizes well to the dataset at hand.

The cell below loads the California Housing dataset with scikit-learn's fetch_california_housing and replaces any spaces in the column names with underscores.

```python
import sklearn.datasets

# Load California Housing dataset as pandas objects
X, y = sklearn.datasets.fetch_california_housing(return_X_y=True, as_frame=True)
X.columns = [col.replace(" ", "_") for col in X.columns]
y.name = "median_house_value"

# Single DataFrame holding the features plus the target column
df = X.copy()
df[y.name] = y
```

It is essential to choose the right cross-validation strategy. The choice depends on the task (regression or classification), whether the target is imbalanced, whether the order of samples matters, and whether there are groups in the data. For example, if multiple rows belong to the same patient, we may want to avoid having samples from the same patient appear in both training and validation splits. Refer to this summary (https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#sphx-glr-auto-examples-model-selection-plot-cv-indices-py) of the options available in scikit-learn for further guidance.
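To make the patient example concrete, here is a minimal sketch using scikit-learn's GroupKFold. The toy arrays and patient IDs are made up for illustration and are not part of the housing dataset used in this article; the sketch only checks that no patient ends up on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 12 rows belonging to 4 patients (3 rows each).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(12, 2))
y_toy = rng.normal(size=12)
patient_ids = np.repeat(["p1", "p2", "p3", "p4"], 3)

cv = GroupKFold(n_splits=4)
leaked = []  # patients found in both train and validation, per fold
for fold, (train_idx, val_idx) in enumerate(
    cv.split(X_toy, y_toy, groups=patient_ids)
):
    train_patients = set(patient_ids[train_idx].tolist())
    val_patients = set(patient_ids[val_idx].tolist())
    # A patient appearing on both sides would leak information.
    leaked.append(train_patients & val_patients)
    print(f"fold {fold}: validation patients = {sorted(val_patients)}")
```

When a grouped splitter like this is used with cross_val_score, the group labels are passed through its groups argument so that the splitter can see them.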
For simplicity, we can use the following decision rules:

```
if time order matters:
    use TimeSeriesSplit (no shuffle equivalent)
else:
    if groups exist:
        if classification and classes are imbalanced:
            use StratifiedGroupKFold (no shuffle equivalent)
        else:
            use GroupKFold (or GroupShuffleSplit)
    else:
        if classification and classes are imbalanced:
            use StratifiedKFold (or StratifiedShuffleSplit)
        else:
            use KFold (or ShuffleSplit)
```

```python
crossvalstrategy = KFold(n_splits=3, shuffle=True, random_state=42)
```

After choosing the model, in this case LightGBM, we define the hyperparameters that we want to tune and the metric that we want to optimize.

The cells in this section can be run multiple times until we reach a satisfactory performance level. The variables marked as "Tweak" are the ones we are likely to adjust between studies.

The general process is:

Since this is a regression task, we use mean squared error as the metric to minimize. The metric is evaluated using the cross-validation strategy defined above.

Note: When storage=storage_url points to a supported database, such as SQLite or InterSystems IRIS, Optuna automatically creates the tables needed to track studies, trials, parameters, and results. Each study is identified by its study_name. If the same study_name and database are reused with load_if_exists=True, Optuna resumes from the existing study instead of starting from scratch. This shared storage is also what enables concurrent optimization: multiple processes, or even multiple machines, can connect to the same database and contribute trials to the same study.
```python
NUM_TRIALS = 20  # Tweak
os.environ["LOKY_MAX_CPU_COUNT"] = str(os.cpu_count())


def objective(trial):
    # Search space: each suggest_* call defines one hyperparameter range
    param = {
        "learning_rate": trial.suggest_float("learning_rate", 0.001, 0.2, log=True),  # Tweak
        "max_depth": trial.suggest_int("max_depth", 3, 50),  # Tweak
        "n_estimators": trial.suggest_int("n_estimators", 50, 1000),  # Tweak
        "num_leaves": trial.suggest_categorical("num_leaves", [16, 31, 63, 127, 255]),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),  # Tweak
        "max_bin": trial.suggest_categorical("max_bin", [63, 127, 255]),
    }
    model = lgb.LGBMRegressor(**param)
    scores = cross_val_score(
        model, X, y,
        cv=crossvalstrategy,
        scoring="neg_mean_squared_error",
        n_jobs=-1,
    )
    # scikit-learn returns negated MSE, so flip the sign to get MSE to minimize
    return -scores.mean()


study = optuna.create_study(
    study_name=f"lightgbm_hyperparam_tuning_{dt.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}",
    direction="minimize",
    storage=storage_url,  # None keeps the study in memory; a database URL persists it
    load_if_exists=True,
    sampler=optuna.samplers.TPESampler(seed=42),
)
study.optimize(objective, n_trials=NUM_TRIALS, show_progress_bar=True, n_jobs=1)

best_params = study.best_params
print(f"\nBest parameters: {best_params}")
print(f"\nBest performance: {study.best_value}")
```

```
[I 2026-05-13 15:58:38,618] A new study created in memory with name: lightgbm_hyperparam_tuning_2026-05-13_15-58-38
0%| | 0/20 00:00
```