Azure Databricks for MLOps and Feature Engineering at Scale with Apache Spark, Delta Lake, and MLflow

Azure Databricks provides a production-grade feature engineering pipeline for MLOps using Apache Spark, Delta Lake, and MLflow. The pipeline follows the Medallion Architecture with Bronze, Silver, and Gold layers to transform raw data into ML-ready features. A customer churn prediction use case demonstrates append-only ingestion, deduplication, schema enforcement, and aggregation for scalable feature engineering.

Raw data doesn't win model competitions. Features do. And when your raw data is tens of billions of rows sitting across multiple sources, you can't afford to run pandas in a notebook and call it a day. In this tutorial I'll walk through building a production-grade feature engineering pipeline on Azure Databricks using: The use case is a customer churn prediction system, but the patterns apply to any ML feature pipeline. The pipeline follows the Medallion Architecture — a layered approach where data gets progressively cleaner and more feature-ready as it moves from Bronze to Silver to Gold. MLflow sits across all three layers tracking every run. | Layer | Delta Table | What happens here | Typical latency | |---|---|---|---| Bronze | churn.bronze.events | Raw ingest, no transforms, append only | Minutes | Silver | churn.silver.customers | Deduplication, null handling, schema enforcement | Minutes | Gold | churn.gold.features | Aggregations, window functions, encoding | Minutes to hours | MLflow Run | N/A | Training, metric logging, artifact storage | Hours | Registry | N/A | Versioned model store, stage promotion | On demand | The Bronze layer is append-only. No transforms. No business logic. Just get the data in and preserve it exactly as it arrived so you can always replay from source. python from pyspark.sql import SparkSession from pyspark.sql.functions import current timestamp, lit from delta.tables import DeltaTable spark = SparkSession.builder.getOrCreate Read raw events from ADLS Gen2 / Event Hub / source of choice raw events = spark.read.format 'json' .load 'abfss://raw@yourstorage.dfs.core.windows.net/events/' Add ingestion metadata — never mutate source columns bronze df = raw events.withColumn ' ingested at', current timestamp \ .withColumn ' source', lit 'events api' Write to Bronze Delta table — append only, no overwrites bronze df.write \ .format 'delta' \ .mode 'append' \ .option 'mergeSchema', 'true' \ .saveAsTable 'churn.bronze.events' print f"Bronze rows written: {bronze df.count }" Why append-only?If your downstream pipeline produces bad features, you want to replay from Bronze without re-ingesting from source. Overwriting Bronze breaks that ability. Silver is where you enforce schema, handle nulls, deduplicate, and standardize. Think of it as your canonical, trusted dataset. python from pyspark.sql.functions import col, to timestamp, when, trim, upper from delta.tables import DeltaTable bronze = spark.table 'churn.bronze.events' silver df = bronze \ .filter col 'customer id' .isNotNull \ .filter col 'event type' .isNotNull \ .dropDuplicates 'customer id', 'event id' \ .withColumn 'event ts', to timestamp col 'event timestamp' \ .withColumn 'event type', upper trim col 'event type' \ .withColumn 'country code', when col 'country' .isNull , lit 'UNKNOWN' .otherwise upper col 'country' \ .select 'customer id', 'event id', 'event type', 'event ts', 'country code', 'product id', 'session id', ' ingested at', Upsert into Silver using Delta MERGE — idempotent on re-runs if DeltaTable.isDeltaTable spark, 'churn.silver.customers' : silver table = DeltaTable.forName spark, 'churn.silver.customers' silver table.alias 'tgt' .merge silver df.alias 'src' , 'tgt.customer id = src.customer id AND tgt.event id = src.event id' .whenNotMatchedInsertAll .execute else: silver df.write.format 'delta' .saveAsTable 'churn.silver.customers' print f"Silver table updated. Total rows: {spark.table 'churn.silver.customers' .count }" This is the heart of the pipeline. We compute aggregated, windowed, and encoded features that the model will actually train on. from pyspark.sql.functions import col, count, countDistinct, sum as sum, avg, datediff, max as max, min as min, current date, expr, when from pyspark.sql.window import Window silver = spark.table 'churn.silver.customers' ------------------------------------------------------------------ 1. Aggregate features per customer over 30 / 90 day windows ------------------------------------------------------------------ today = current date agg features = silver \ .withColumn 'days since event', datediff today, col 'event ts' \ .groupBy 'customer id' \ .agg count 'event id' .alias 'total events' , countDistinct 'session id' .alias 'total sessions' , countDistinct 'product id' .alias 'distinct products' , sum when col 'days since event' <= 30, 1 .otherwise 0 .alias 'events last 30d' , sum when col 'days since event' <= 90, 1 .otherwise 0 .alias 'events last 90d' , max 'event ts' .alias 'last event ts' , min 'event ts' .alias 'first event ts' , \ .withColumn 'days since last event', datediff today, col 'last event ts' \ .withColumn 'customer tenure days', datediff today, col 'first event ts' \ .withColumn 'avg events per day', col 'total events' / col 'customer tenure days' + 1 ------------------------------------------------------------------ 2. Encode churn risk tier as ordinal feature ------------------------------------------------------------------ feature df = agg features \ .withColumn 'recency tier', when col 'days since last event' <= 7, lit 3 active .when col 'days since last event' <= 30, lit 2 at risk .otherwise lit 1 churned \ .withColumn 'engagement score', col 'events last 30d' 0.6 + col 'events last 90d' 0.4 / col 'customer tenure days' + 1 ------------------------------------------------------------------ 3. Write to Gold feature store — overwrite with partition by date ------------------------------------------------------------------ feature df \ .withColumn 'feature date', current date \ .write \ .format 'delta' \ .mode 'overwrite' \ .option 'replaceWhere', f"feature date = '{today}'" \ .saveAsTable 'churn.gold.features' print f"Gold features written: {feature df.count } customers" With features in Gold, we hand off to MLflow to train, track, and register the model. Notice we log the Delta table version so we can always reproduce exactly which feature snapshot trained which model. python import mlflow import mlflow.sklearn from mlflow.models.signature import infer signature from sklearn.ensemble import GradientBoostingClassifier from sklearn.model selection import train test split from sklearn.metrics import roc auc score, f1 score import pandas as pd mlflow.set experiment '/churn-prediction/feature-pipeline' Read Gold features — capture Delta version for reproducibility gold table = DeltaTable.forName spark, 'churn.gold.features' delta version = gold table.history 1 .select 'version' .collect 0 0 features pdf = spark.table 'churn.gold.features' .toPandas FEATURE COLS = 'total events', 'total sessions', 'distinct products', 'events last 30d', 'events last 90d', 'days since last event', 'customer tenure days', 'avg events per day', 'recency tier', 'engagement score', TARGET = 'churned' X = features pdf FEATURE COLS y = features pdf TARGET X train, X test, y train, y test = train test split X, y, test size=0.2, random state=42 with mlflow.start run run name=f'gbm-features-v{delta version}' as run: params = {'n estimators': 200, 'max depth': 5, 'learning rate': 0.05} model = GradientBoostingClassifier params, random state=42 model.fit X train, y train y pred = model.predict X test y prob = model.predict proba X test :, 1 Log everything mlflow.log params params mlflow.log metric 'roc auc', roc auc score y test, y prob mlflow.log metric 'f1 score', f1 score y test, y pred mlflow.log param 'delta feature version', delta version mlflow.log param 'feature columns', FEATURE COLS mlflow.log param 'training rows', len X train Log model with signature signature = infer signature X train, y pred mlflow.sklearn.log model model, artifact path='churn-gbm', signature=signature, registered model name='churn-prediction-gbm', print f"Run ID: {run.info.run id}" print f"ROC-AUC: {roc auc score y test, y prob :.4f}" print f"Feature Delta version logged: {delta version}" One of the best things about Delta Lake is time travel. If a model behaves unexpectedly in production, you can reload the exact feature snapshot it was trained on. python Reload the exact feature version that trained a specific model run import mlflow run = mlflow.get run 'your-run-id-here' feature version = int run.data.params 'delta feature version' Rehydrate that exact feature snapshot historical features = spark.read \ .format 'delta' \ .option 'versionAsOf', feature version \ .table 'churn.gold.features' print f"Loaded feature snapshot from Delta version {feature version}" print f"Row count: {historical features.count }" You can now retrain on the exact same data to reproduce the result | Tool | Role in pipeline | Why not the alternative | |---|---|---| Apache Spark | Distributed feature computation | Pandas single node, OOM at scale , Dask less native Databricks integration | Delta Lake | Feature storage with versioning | Parquet no ACID, no time travel , Hive tables no merge support | MLflow Tracking | Experiment and param logging | Manual logging not reproducible , W&B extra cost, less native on Databricks | MLflow Registry | Model versioning and promotion | Custom model store more ops overhead | Medallion Architecture | Pipeline layer separation | Flat pipelines hard to debug, no replay capability | Delta MERGE | Idempotent Silver upserts | Overwrite destroys history , append creates duplicates | Shuffle partitions matter. Spark defaults to 200 shuffle partitions which is fine for small data but will bottleneck at scale. Set spark.conf.set "spark.sql.shuffle.partitions", "auto" on Databricks Runtime 10+ or tune it manually to 2-3x your core count . Z-ordering on Gold features. If you're querying Gold by customer id frequently, add OPTIMIZE churn.gold.features ZORDER BY customer id after the write. This co-locates related data and cuts query times dramatically on large tables. Log Delta version in every MLflow run. This is non-negotiable for reproducibility. Without it you can't prove which feature snapshot trained which model, which becomes a compliance problem in regulated industries. Cluster autoscaling for feature jobs. Feature engineering jobs tend to have spiky resource needs big during aggregation, small during writes . Enable autoscaling on your Databricks cluster and set a min/max node count rather than a fixed size. The combination of Spark, Delta Lake, and MLflow on Databricks gives you a feature engineering pipeline that is reproducible Delta time travel + MLflow param logging , scalable Spark handles billions of rows , and auditable every run is tracked, every feature version is stored . The Medallion Architecture keeps the pipeline modular — you can rerun just the Gold layer if you change a feature definition without touching Bronze or Silver, and MLflow ties model performance back to the exact feature version that produced it.