{"slug": "azure-databricks-for-mlops-and-feature-engineering-at-scale-with-apache-spark", "title": "Azure Databricks for MLOps and Feature Engineering at Scale with Apache Spark, Delta Lake, and MLflow", "summary": "Azure Databricks provides a production-grade feature engineering pipeline for MLOps using Apache Spark, Delta Lake, and MLflow. The pipeline follows the Medallion Architecture with Bronze, Silver, and Gold layers to transform raw data into ML-ready features. A customer churn prediction use case demonstrates append-only ingestion, deduplication, schema enforcement, and aggregation for scalable feature engineering.", "body_md": "Raw data doesn't win model competitions. Features do. And when your raw data is tens of billions of rows sitting across multiple sources, you can't afford to run pandas in a notebook and call it a day.\n\nIn this tutorial I'll walk through building a production-grade feature engineering pipeline on **Azure Databricks** using Apache Spark, Delta Lake, and MLflow.\n\nThe use case is a customer churn prediction system, but the patterns apply to any ML feature pipeline.\n\nThe pipeline follows the **Medallion Architecture** — a layered approach where data gets progressively cleaner and more feature-ready as it moves from Bronze to Silver to Gold. MLflow sits across all three layers tracking every run.\n\n| Layer | Delta Table | What happens here | Typical latency |\n|---|---|---|---|\n| Bronze | `churn.bronze.events` | Raw ingest, no transforms, append only | Minutes |\n| Silver | `churn.silver.customers` | Deduplication, null handling, schema enforcement | Minutes |\n| Gold | `churn.gold.features` | Aggregations, window functions, encoding | Minutes to hours |\n| MLflow Run | N/A | Training, metric logging, artifact storage | Hours |\n| Registry | N/A | Versioned model store, stage promotion | On demand |\n\nThe Bronze layer is append-only. No transforms. No business logic. 
Just get the data in and preserve it exactly as it arrived so you can always replay from source.\n\n``` python\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql.functions import current_timestamp, lit\nfrom delta.tables import DeltaTable\n\nspark = SparkSession.builder.getOrCreate()\n\n# Read raw events from ADLS Gen2 / Event Hub / source of choice\nraw_events = spark.read.format('json').load('abfss://raw@yourstorage.dfs.core.windows.net/events/')\n\n# Add ingestion metadata — never mutate source columns\nbronze_df = raw_events.withColumn('_ingested_at', current_timestamp()) \\\n                       .withColumn('_source', lit('events_api'))\n\n# Write to Bronze Delta table — append only, no overwrites\nbronze_df.write \\\n    .format('delta') \\\n    .mode('append') \\\n    .option('mergeSchema', 'true') \\\n    .saveAsTable('churn.bronze.events')\n\nprint(f\"Bronze rows written: {bronze_df.count()}\")\n```\n\n**Why append-only?** If your downstream pipeline produces bad features, you want to replay from Bronze without re-ingesting from source. Overwriting Bronze breaks that ability.\n\nSilver is where you enforce schema, handle nulls, deduplicate, and standardize. 
Think of it as your canonical, trusted dataset.\n\n``` python\nfrom pyspark.sql.functions import col, lit, to_timestamp, when, trim, upper\nfrom delta.tables import DeltaTable\n\nbronze = spark.table('churn.bronze.events')\n\nsilver_df = bronze \\\n    .filter(col('customer_id').isNotNull()) \\\n    .filter(col('event_type').isNotNull()) \\\n    .dropDuplicates(['customer_id', 'event_id']) \\\n    .withColumn('event_ts',     to_timestamp(col('event_timestamp'))) \\\n    .withColumn('event_type',   upper(trim(col('event_type')))) \\\n    .withColumn('country_code', when(col('country').isNull(), lit('UNKNOWN'))\n                                .otherwise(upper(col('country')))) \\\n    .select(\n        'customer_id',\n        'event_id',\n        'event_type',\n        'event_ts',\n        'country_code',\n        'product_id',\n        'session_id',\n        '_ingested_at',\n    )\n\n# Upsert into Silver using Delta MERGE — idempotent on re-runs\n# DeltaTable.isDeltaTable expects a storage path, so check the catalog by table name instead\nif spark.catalog.tableExists('churn.silver.customers'):\n    silver_table = DeltaTable.forName(spark, 'churn.silver.customers')\n    silver_table.alias('tgt').merge(\n        silver_df.alias('src'),\n        'tgt.customer_id = src.customer_id AND tgt.event_id = src.event_id'\n    ).whenNotMatchedInsertAll().execute()\nelse:\n    silver_df.write.format('delta').saveAsTable('churn.silver.customers')\n\nprint(f\"Silver table updated. Total rows: {spark.table('churn.silver.customers').count()}\")\n```\n\nThis is the heart of the pipeline. We compute aggregated, windowed, and encoded features that the model will actually train on.\n\n``` python\nfrom pyspark.sql.functions import (\n    col, count, countDistinct, sum as _sum,\n    avg, datediff, max as _max, min as _min,\n    current_date, expr, lit, when\n)\nfrom pyspark.sql.window import Window\n\nsilver = spark.table('churn.silver.customers')\n\n# ------------------------------------------------------------------\n# 1. 
Aggregate features per customer over 30 / 90 day windows\n# ------------------------------------------------------------------\ntoday = current_date()\n\nagg_features = silver \\\n    .withColumn('days_since_event', datediff(today, col('event_ts'))) \\\n    .groupBy('customer_id') \\\n    .agg(\n        count('event_id')                                          .alias('total_events'),\n        countDistinct('session_id')                                .alias('total_sessions'),\n        countDistinct('product_id')                                .alias('distinct_products'),\n        _sum(when(col('days_since_event') <= 30, 1).otherwise(0)) .alias('events_last_30d'),\n        _sum(when(col('days_since_event') <= 90, 1).otherwise(0)) .alias('events_last_90d'),\n        _max('event_ts')                                           .alias('last_event_ts'),\n        _min('event_ts')                                           .alias('first_event_ts'),\n    ) \\\n    .withColumn('days_since_last_event', datediff(today, col('last_event_ts'))) \\\n    .withColumn('customer_tenure_days',  datediff(today, col('first_event_ts'))) \\\n    .withColumn('avg_events_per_day',\n        col('total_events') / (col('customer_tenure_days') + 1))\n\n# ------------------------------------------------------------------\n# 2. Encode churn risk tier as ordinal feature\n# ------------------------------------------------------------------\nfeature_df = agg_features \\\n    .withColumn('recency_tier',\n        when(col('days_since_last_event') <= 7,  lit(3))   # active\n       .when(col('days_since_last_event') <= 30, lit(2))   # at risk\n       .otherwise(lit(1))                                   # churned\n    ) \\\n    .withColumn('engagement_score',\n        (col('events_last_30d') * 0.6 + col('events_last_90d') * 0.4) /\n        (col('customer_tenure_days') + 1)\n    )\n\n# ------------------------------------------------------------------\n# 3. 
Write to Gold feature store: replace only today's feature_date slice\n# ------------------------------------------------------------------\nfrom datetime import date\n\n# `today` above is a Column expression, so build a plain date string for the predicate\nrun_date = date.today().isoformat()\n\nfeature_df \\\n    .withColumn('feature_date', expr(f\"DATE '{run_date}'\")) \\\n    .write \\\n    .format('delta') \\\n    .mode('overwrite') \\\n    .option('replaceWhere', f\"feature_date = '{run_date}'\") \\\n    .saveAsTable('churn.gold.features')\n\nprint(f\"Gold features written: {feature_df.count()} customers\")\n```\n\nWith features in Gold, we hand off to MLflow to train, track, and register the model. Notice we log the Delta table version so we can always reproduce exactly which feature snapshot trained which model.\n\n``` python\nimport mlflow\nimport mlflow.sklearn\nfrom mlflow.models.signature import infer_signature\nfrom sklearn.ensemble import GradientBoostingClassifier\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics import roc_auc_score, f1_score\nimport pandas as pd\n\nmlflow.set_experiment('/churn-prediction/feature-pipeline')\n\n# Read Gold features — capture Delta version for reproducibility\ngold_table  = DeltaTable.forName(spark, 'churn.gold.features')\ndelta_version = gold_table.history(1).select('version').collect()[0][0]\n\nfeatures_pdf = spark.table('churn.gold.features').toPandas()\n\nFEATURE_COLS = [\n    'total_events', 'total_sessions', 'distinct_products',\n    'events_last_30d', 'events_last_90d', 'days_since_last_event',\n    'customer_tenure_days', 'avg_events_per_day',\n    'recency_tier', 'engagement_score',\n]\n# Assumes a `churned` label column has been joined into the Gold table;\n# the Gold step above does not create it\nTARGET = 'churned'\n\nX = features_pdf[FEATURE_COLS]\ny = features_pdf[TARGET]\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n\nwith mlflow.start_run(run_name=f'gbm-features-v{delta_version}') as run:\n\n    params = {'n_estimators': 200, 'max_depth': 5, 'learning_rate': 0.05}\n    model  = GradientBoostingClassifier(**params, random_state=42)\n    model.fit(X_train, y_train)\n\n    y_pred = model.predict(X_test)\n    y_prob = 
model.predict_proba(X_test)[:, 1]\n\n    # Log everything\n    mlflow.log_params(params)\n    mlflow.log_metric('roc_auc', roc_auc_score(y_test, y_prob))\n    mlflow.log_metric('f1_score', f1_score(y_test, y_pred))\n    mlflow.log_param('delta_feature_version', delta_version)\n    mlflow.log_param('feature_columns', FEATURE_COLS)\n    mlflow.log_param('training_rows', len(X_train))\n\n    # Log model with signature\n    signature = infer_signature(X_train, y_pred)\n    mlflow.sklearn.log_model(\n        model,\n        artifact_path='churn-gbm',\n        signature=signature,\n        registered_model_name='churn-prediction-gbm',\n    )\n\n    print(f\"Run ID: {run.info.run_id}\")\n    print(f\"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}\")\n    print(f\"Feature Delta version logged: {delta_version}\")\n```\n\nOne of the best things about Delta Lake is time travel. If a model behaves unexpectedly in production, you can reload the exact feature snapshot it was trained on.\n\n``` python\n# Reload the exact feature version that trained a specific model run\nimport mlflow\n\nrun = mlflow.get_run('your-run-id-here')\nfeature_version = int(run.data.params['delta_feature_version'])\n\n# Rehydrate that exact feature snapshot\nhistorical_features = spark.read \\\n    .format('delta') \\\n    .option('versionAsOf', feature_version) \\\n    .table('churn.gold.features')\n\nprint(f\"Loaded feature snapshot from Delta version {feature_version}\")\nprint(f\"Row count: {historical_features.count()}\")\n\n# You can now retrain on the exact same data to reproduce the result\n```\n\n| Tool | Role in pipeline | Why not the alternative |\n|---|---|---|\n| Apache Spark | Distributed feature computation | Pandas (single node, OOM at scale), Dask (less native Databricks integration) |\n| Delta Lake | Feature storage with versioning | Parquet (no ACID, no time travel), Hive tables (no merge support) |\n| MLflow Tracking | Experiment and param logging | Manual logging (not reproducible), W&B (extra cost, less native on Databricks) |\n| MLflow Registry | Model versioning and promotion | Custom model store (more ops overhead) |\n| Medallion Architecture | Pipeline layer separation | Flat pipelines (hard to debug, no replay capability) |\n| Delta MERGE | Idempotent Silver upserts | Overwrite (destroys history), append (creates duplicates) |\n\n**Shuffle partitions matter.** Spark defaults to 200 shuffle partitions, which is fine for small data but will bottleneck at scale. Set `spark.conf.set(\"spark.sql.shuffle.partitions\", \"auto\")` on Databricks Runtime 10+ or tune it manually to 2-3x your core count.\n\n**Z-ordering on Gold features.** If you're querying Gold by `customer_id` frequently, add `OPTIMIZE churn.gold.features ZORDER BY (customer_id)` after the write. This co-locates related data and cuts query times dramatically on large tables.\n\n**Log Delta version in every MLflow run.** This is non-negotiable for reproducibility. Without it you can't prove which feature snapshot trained which model, which becomes a compliance problem in regulated industries.\n\n**Cluster autoscaling for feature jobs.** Feature engineering jobs tend to have spiky resource needs (big during aggregation, small during writes). 
Enable autoscaling on your Databricks cluster and set a min/max node count rather than a fixed size.\n\nThe combination of Spark, Delta Lake, and MLflow on Databricks gives you a feature engineering pipeline that is reproducible (Delta time travel + MLflow param logging), scalable (Spark handles billions of rows), and auditable (every run is tracked, every feature version is stored).\n\nThe Medallion Architecture keeps the pipeline modular — you can rerun just the Gold layer if you change a feature definition without touching Bronze or Silver, and MLflow ties model performance back to the exact feature version that produced it.", "url": "https://wpnews.pro/news/azure-databricks-for-mlops-and-feature-engineering-at-scale-with-apache-spark", "canonical_source": "https://dev.to/jubinsoni/azure-databricks-for-feature-engineering-at-scale-with-apache-spark-delta-lake-and-mlflow-3k4n", "published_at": "2026-06-28 01:35:55+00:00", "updated_at": "2026-06-28 02:03:39.876007+00:00", "lang": "en", "topics": ["machine-learning", "large-language-models", "mlops", "developer-tools"], "entities": ["Azure Databricks", "Apache Spark", "Delta Lake", "MLflow", "Medallion Architecture"], "alternates": {"html": "https://wpnews.pro/news/azure-databricks-for-mlops-and-feature-engineering-at-scale-with-apache-spark", "markdown": "https://wpnews.pro/news/azure-databricks-for-mlops-and-feature-engineering-at-scale-with-apache-spark.md", "text": "https://wpnews.pro/news/azure-databricks-for-mlops-and-feature-engineering-at-scale-with-apache-spark.txt", "jsonld": "https://wpnews.pro/news/azure-databricks-for-mlops-and-feature-engineering-at-scale-with-apache-spark.jsonld"}}