{"slug": "real-time-ai-feature-engineering-with-spark-structured-streaming-and-databricks", "title": "Real-Time AI Feature Engineering with Spark Structured Streaming and Databricks Feature Store", "summary": "Databricks introduced real-time AI feature engineering using Spark Structured Streaming and the Databricks Feature Store, now part of Unity Catalog as Feature Engineering in Unity Catalog. The solution addresses training-serving skew by ensuring point-in-time correct features from raw Kafka events to online serving in milliseconds. The streaming pipeline reads from Kafka, computes windowed aggregations, and writes features to the Feature Store via foreachBatch.", "body_md": "Building point-in-time correct, production-grade feature pipelines — from raw Kafka events to online feature serving in milliseconds, using Spark Structured Streaming and the Databricks Feature Store.\n\nFeature engineering is where most ML projects silently fail in production. Not because the model is wrong — but because the **features the model sees at training time are different from the features it sees at inference time**. This is called **training-serving skew**, and it's the #1 silent killer of ML systems.\n\nThree specific failure modes cause it:\n\nThe **Databricks Feature Store** — now part of Unity Catalog as **Feature Engineering in Unity Catalog** — solves all three by:\n\nUnderstanding the data model behind the Feature Store is essential for designing correct pipelines. Here's how the entities relate:\n\nThe critical relationship: a **Model Version** is bound to a **Training Set**, which records exactly which feature tables and which point-in-time lookups were used. 
This is how Databricks guarantees reproducibility: you can always re-create the exact training data that produced any model version.\n\n```\n# Databricks Runtime ML 13.x+ recommended\n# Feature Engineering in Unity Catalog (formerly Feature Store)\n\n%pip install databricks-feature-engineering==0.6.0 --quiet\ndbutils.library.restartPython()\n\nfrom databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup\nfrom databricks.feature_engineering.entities.feature_serving_endpoint import (\n    ServedEntity, EndpointCoreConfig\n)\nfrom pyspark.sql import functions as F, SparkSession\nfrom pyspark.sql.types import (\n    StructType, StructField, StringType, LongType,\n    DoubleType, TimestampType, ArrayType\n)\nimport mlflow\n\nspark = SparkSession.builder.getOrCreate()\nfe = FeatureEngineeringClient()\n\n# Unity Catalog paths\nCATALOG       = \"prod\"\nFEATURE_DB    = f\"{CATALOG}.feature_store\"\nEVENTS_TABLE  = f\"{CATALOG}.silver.events_clean\"\nKAFKA_BROKER  = \"kafka-broker.internal:9092\"\nKAFKA_TOPIC   = \"user-events\"\n\n# Checkpoint locations (ADLS / S3 / GCS)\nCHECKPOINT_BASE = \"abfss://checkpoints@storage.dfs.core.windows.net/features\"\n```\n\nThe streaming pipeline reads from Kafka, computes windowed aggregations using Spark's stateful streaming engine, and writes features to the Feature Store via `foreachBatch`. 
This keeps the feature table continuously fresh.\n\n```\n# ── Streaming Feature Pipeline ────────────────────────────────────────────────\n\n# Step 1: Define the raw event schema from Kafka\nevent_schema = StructType([\n    StructField(\"user_id\",       StringType(),    False),\n    StructField(\"event_type\",    StringType(),    True),\n    StructField(\"product_id\",    StringType(),    True),\n    StructField(\"revenue\",       DoubleType(),    True),\n    StructField(\"session_id\",    StringType(),    True),\n    StructField(\"platform\",      StringType(),    True),\n    StructField(\"event_ts\",      TimestampType(), False),\n])\n\n# Step 2: Read from Kafka\nraw_stream = (\n    spark.readStream\n        .format(\"kafka\")\n        .option(\"kafka.bootstrap.servers\", KAFKA_BROKER)\n        .option(\"subscribe\", KAFKA_TOPIC)\n        .option(\"startingOffsets\", \"latest\")\n        .option(\"failOnDataLoss\", \"false\")\n        .load()\n        .select(\n            F.from_json(F.col(\"value\").cast(\"string\"), event_schema).alias(\"data\"),\n            F.col(\"timestamp\").alias(\"kafka_ts\")\n        )\n        .select(\"data.*\", \"kafka_ts\")\n)\n\n# Step 3: Apply watermark and compute windowed features\n# Watermark: tolerate up to 10 minutes of late data\nwindowed_features = (\n    raw_stream\n        .withWatermark(\"event_ts\", \"10 minutes\")\n        .groupBy(\n            F.col(\"user_id\"),\n            F.window(F.col(\"event_ts\"), \"1 hour\", \"15 minutes\").alias(\"window\")\n        )\n        .agg(\n            F.count(\"*\").alias(\"event_count_1h\"),\n            F.sum(F.when(F.col(\"event_type\") == \"purchase\", F.col(\"revenue\"))\n                  .otherwise(0)).alias(\"revenue_1h\"),\n            # Exact distinct counts are not supported on streaming DataFrames,\n            # so use the approximate (HyperLogLog-based) variant instead.\n            F.approx_count_distinct(\"session_id\").alias(\"session_count_1h\"),\n            F.approx_count_distinct(\"product_id\").alias(\"unique_products_1h\"),\n            F.sum(F.when(F.col(\"event_type\") == \"purchase\", 1)\n                  
.otherwise(0)).alias(\"purchase_count_1h\"),\n            # Platform of the most recent event in the window (F.first is order-dependent)\n            F.max_by(\"platform\", \"event_ts\").alias(\"last_platform\"),\n        )\n        # Flatten window struct to scalar columns\n        .withColumn(\"window_start\", F.col(\"window.start\"))\n        .withColumn(\"window_end\",   F.col(\"window.end\"))\n        .withColumn(\"feature_ts\",   F.col(\"window.end\"))   # timestamp key for PIT lookup\n        .drop(\"window\")\n        # Derived features\n        .withColumn(\"conversion_rate_1h\",\n            F.when(F.col(\"event_count_1h\") > 0,\n                   F.col(\"purchase_count_1h\") / F.col(\"event_count_1h\"))\n            .otherwise(0.0))\n        .withColumn(\"avg_revenue_per_purchase_1h\",\n            F.when(F.col(\"purchase_count_1h\") > 0,\n                   F.col(\"revenue_1h\") / F.col(\"purchase_count_1h\"))\n            .otherwise(0.0))\n)\n\n# Step 4: Write to Feature Store via foreachBatch\n# foreachBatch gives us transactional writes per micro-batch\ndef write_to_feature_store(batch_df, batch_id):\n    \"\"\"\n    Called on each micro-batch. Merges feature data into the Feature Store\n    table, matching rows on its primary keys (user_id + feature_ts).\n    \"\"\"\n    if batch_df.isEmpty():\n        return\n\n    fe.write_table(\n        name=f\"{FEATURE_DB}.user_activity_features\",\n        df=batch_df,\n        mode=\"merge\",             # upsert: update existing, insert new\n    )\n    print(f\"Batch {batch_id}: wrote {batch_df.count()} feature rows\")\n\n# Step 5: Create the feature table (idempotent: safe to re-run)\n# In Unity Catalog the time series column must also be listed among the primary keys.\n# The keyword is timeseries_columns in the pinned client version; check yours if it differs.\ntry:\n    fe.create_table(\n        name=f\"{FEATURE_DB}.user_activity_features\",\n        primary_keys=[\"user_id\", \"feature_ts\"],\n        timeseries_columns=[\"feature_ts\"],\n        schema=windowed_features.schema,\n        description=(\n            \"Real-time user activity features computed from event stream. \"\n            \"1-hour sliding window, refreshed every 15 minutes. \"\n            \"Primary key: user_id. 
Timestamp key: feature_ts (window end).\"\n        ),\n    )\n    print(\"Feature table created.\")\nexcept Exception:\n    print(\"Feature table already exists — continuing.\")\n\n# Step 6: Launch the streaming query\nstreaming_query = (\n    windowed_features.writeStream\n        .outputMode(\"update\")               # update mode for stateful aggregations\n        .option(\"checkpointLocation\", f\"{CHECKPOINT_BASE}/user_activity\")\n        .trigger(processingTime=\"5 minutes\") # micro-batch every 5 min\n        .foreachBatch(write_to_feature_store)\n        .start()\n)\n\nprint(f\"Streaming query '{streaming_query.name}' running...\")\nprint(f\"Status: {streaming_query.status}\")\n```\n\nThis is the most critical part of the Feature Store. When creating training data, we must join labels to features **at the timestamp of the label event** — not the current time. This prevents data leakage.\n\n```\n# ── Point-in-Time Correct Training Dataset ────────────────────────────────────\n\n# Step 1: Load the label dataset\n# Each row = one prediction target event, with the exact timestamp\n# at which a model would have needed to make a prediction.\n\nlabels_df = (\n    spark.table(f\"{CATALOG}.gold.churn_labels\")\n        .select(\n            \"user_id\",\n            \"churn_label\",                        # 0 = retained, 1 = churned\n            F.col(\"observation_ts\").alias(\"event_timestamp\"),  # point-in-time anchor\n            \"experiment_split\"                    # train/val/test\n        )\n        .filter(F.col(\"observation_ts\") >= \"2024-01-01\")\n)\n\nprint(f\"Label rows: {labels_df.count():,}\")\nlabels_df.show(5)\n# +----------+-----------+---------------------+-----------------+\n# | user_id  |churn_label| event_timestamp     | experiment_split|\n# +----------+-----------+---------------------+-----------------+\n# | u_123456 | 0         | 2024-03-15 14:22:00 | train           |\n# | u_789012 | 1         | 2024-03-15 18:45:00 | train           
|\n\n# Step 2: Define feature lookups\n# timestamp_lookup_key names the label column (event_timestamp) used as the point-in-time anchor.\n# Databricks will join each label row to the feature values\n# that were valid at event_timestamp, not the latest values.\n\nfeature_lookups = [\n    # User activity features: 1h window features from the streaming pipeline\n    FeatureLookup(\n        table_name=f\"{FEATURE_DB}.user_activity_features\",\n        feature_names=[\n            \"event_count_1h\",\n            \"revenue_1h\",\n            \"session_count_1h\",\n            \"unique_products_1h\",\n            \"purchase_count_1h\",\n            \"conversion_rate_1h\",\n            \"avg_revenue_per_purchase_1h\",\n            \"last_platform\",\n        ],\n        lookup_key=\"user_id\",\n        timestamp_lookup_key=\"event_timestamp\",    # ← PIT anchor\n    ),\n\n    # User profile features: slower-changing, from batch pipeline\n    FeatureLookup(\n        table_name=f\"{FEATURE_DB}.user_profile_features\",\n        feature_names=[\n            \"account_age_days\",\n            \"lifetime_revenue\",\n            \"preferred_category\",\n            \"subscription_tier\",\n        ],\n        lookup_key=\"user_id\",\n        timestamp_lookup_key=\"event_timestamp\",    # ← PIT anchor\n    ),\n\n    # Transaction aggregates: 30d and 90d rolling windows\n    FeatureLookup(\n        table_name=f\"{FEATURE_DB}.transaction_features\",\n        feature_names=[\n            \"purchase_count_30d\",\n            \"purchase_count_90d\",\n            \"avg_order_value_30d\",\n            \"days_since_last_purchase\",\n            \"category_diversity_score\",\n        ],\n        lookup_key=\"user_id\",\n        timestamp_lookup_key=\"event_timestamp\",\n    ),\n]\n\n# Step 3: Create training dataset (Feature Store handles the PIT join)\n# Only the train split is joined here. The lookup key, the timestamp anchor, and the\n# split column are excluded so the model is fit on feature columns alone.\ntraining_set = fe.create_training_set(\n    df=labels_df.filter(F.col(\"experiment_split\") == \"train\"),\n    feature_lookups=feature_lookups,\n    label=\"churn_label\",\n    
exclude_columns=[\"user_id\", \"event_timestamp\", \"experiment_split\"],\n)\n\n# The returned DataFrame has features + labels, PIT-correct\ntraining_df = training_set.load_df()\nprint(f\"Training rows: {training_df.count():,}\")\nprint(f\"Training cols: {len(training_df.columns)}\")\ntraining_df.show(3)\n\n# Step 4: Train model and log via Feature Store (preserves lineage!)\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.ensemble import GradientBoostingClassifier\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.preprocessing import OneHotEncoder\n\n# Fill nulls before converting: 0 for numeric columns, \"unknown\" for string columns\ntrain_pdf = training_df.fillna(0).fillna(\"unknown\").toPandas()\n\nX_train = train_pdf.drop(columns=[\"churn_label\"])\ny_train = train_pdf[\"churn_label\"]\n\n# String-valued features must be encoded before the gradient-boosted trees can use them\ncategorical_cols = [\"last_platform\", \"preferred_category\", \"subscription_tier\"]\n\nmodel = Pipeline([\n    (\"encode\", ColumnTransformer(\n        [(\"cat\", OneHotEncoder(handle_unknown=\"ignore\"), categorical_cols)],\n        remainder=\"passthrough\",   # numeric features pass through unchanged\n        sparse_threshold=0,        # always return a dense array\n    )),\n    (\"gbm\", GradientBoostingClassifier(\n        n_estimators=300,\n        learning_rate=0.05,\n        max_depth=5,\n        subsample=0.8,\n        random_state=42,\n    )),\n])\n\nwith mlflow.start_run(run_name=\"churn-gbm-v1\") as run:\n    model.fit(X_train, y_train)\n\n    # Log model via Feature Store; this records the feature lineage\n    fe.log_model(\n        model=model,\n        artifact_path=\"churn_model\",\n        flavor=mlflow.sklearn,\n        training_set=training_set,      # ← binds model to its feature lookups\n        registered_model_name=f\"{CATALOG}.ml.user_churn_model\",\n    )\n    print(f\"Logged model with feature lineage. Run: {run.info.run_id}\")\n```\n\nFor real-time inference, the model needs features in milliseconds, not the seconds it takes to query Delta Lake. Databricks Feature Store can publish features to an **online store** (DynamoDB, Cosmos DB, MySQL, etc.) 
for low-latency reads.\n\n```\n# ── Publish Features to Online Store ─────────────────────────────────────────\n# Online stores are configured per feature table.\n# Here we sync user_activity_features to a Databricks online table for low-latency lookups.\n# Online tables are managed through the Databricks SDK, not FeatureEngineeringClient.\n# NOTE: the call shape below follows earlier databricks-sdk releases; newer releases\n# take create(table=OnlineTable(name=..., spec=...)), so check your installed version.\n\nfrom databricks.sdk import WorkspaceClient\nfrom databricks.sdk.service.catalog import (\n    OnlineTableSpec, OnlineTableSpecTriggeredSchedulingPolicy\n)\n\nw = WorkspaceClient()\nONLINE_TABLE = f\"{FEATURE_DB}.user_activity_features_online\"  # illustrative name\n\n# Create an online table spec (backed by a serverless real-time compute layer)\nonline_table_spec = OnlineTableSpec(\n    primary_key_columns=[\"user_id\"],\n    timeseries_key=\"feature_ts\",            # serve the latest row per user_id\n    source_table_full_name=f\"{FEATURE_DB}.user_activity_features\",\n    run_triggered=OnlineTableSpecTriggeredSchedulingPolicy.from_dict(\n        {\"triggered\": \"true\"}\n    ),  # sync on-demand\n    # OR for continuous sync:\n    # run_continuous=OnlineTableSpecContinuousSchedulingPolicy.from_dict({\"continuous\": \"true\"})\n)\n\n# Create the online table and inspect its sync status\nonline_table = w.online_tables.create(name=ONLINE_TABLE, spec=online_table_spec)\nprint(f\"Online table: {online_table.name}\")\nprint(f\"Status:       {online_table.status.detailed_state}\")\n```\n\nAt inference time, Model Serving performs the feature lookups automatically, joining the incoming request data with features from the online store before passing them to the model.\n\n```\n# ── Real-Time Feature Serving at Inference ────────────────────────────────────\n\nimport requests, json\n\nWORKSPACE_URL = \"https://<workspace>.azuredatabricks.net\"\nTOKEN = dbutils.secrets.get(\"prod-scope\", \"databricks-token\")\n\n# Option 1: Model Serving with automatic feature lookup\n# When you logged the model with fe.log_model(), Databricks knows which\n# features to fetch. 
You only send the lookup key (user_id) at inference time.\n\ndef predict_churn(user_ids: list) -> list:\n    \"\"\"\n    Send only user_id — the serving endpoint fetches features automatically\n    from the online store and runs inference.\n    \"\"\"\n    payload = {\n        \"dataframe_records\": [\n            {\"user_id\": uid} for uid in user_ids\n        ]\n    }\n    resp = requests.post(\n        f\"{WORKSPACE_URL}/serving-endpoints/churn-predictor/invocations\",\n        headers={\n            \"Authorization\": f\"Bearer {TOKEN}\",\n            \"Content-Type\":  \"application/json\",\n        },\n        data=json.dumps(payload),\n        timeout=5,\n    )\n    resp.raise_for_status()\n    return resp.json()[\"predictions\"]\n\n# Example usage\npredictions = predict_churn([\"u_123456\", \"u_789012\", \"u_345678\"])\nfor uid, pred in zip([\"u_123456\", \"u_789012\", \"u_345678\"], predictions):\n    print(f\"{uid}: churn_probability = {pred:.4f}\")\n# u_123456: churn_probability = 0.0821\n# u_789012: churn_probability = 0.7643\n# u_345678: churn_probability = 0.1209\n\n# Option 2: Direct feature lookup via the Feature Serving endpoint\n# Useful when you want raw features without running inference\ndef get_features(user_ids: list) -> dict:\n    payload = {\n        \"dataframe_records\": [{\"user_id\": uid} for uid in user_ids]\n    }\n    resp = requests.post(\n        f\"{WORKSPACE_URL}/serving-endpoints/user-features-serving/invocations\",\n        headers={\n            \"Authorization\": f\"Bearer {TOKEN}\",\n            \"Content-Type\":  \"application/json\",\n        },\n        data=json.dumps(payload),\n        timeout=5,\n    )\n    return resp.json()\n\n# Option 3: Batch scoring (offline) — uses Delta offline store\n# No online store needed; reads directly from the feature table with PIT lookup\nbatch_labels = spark.table(f\"{CATALOG}.gold.users_to_score_today\") \\\n    .select(\"user_id\", 
F.current_timestamp().alias(\"event_timestamp\"))\n\nbatch_predictions = fe.score_batch(\n    model_uri=f\"models:/{CATALOG}.ml.user_churn_model@champion\",\n    df=batch_labels,\n    result_type=\"double\",\n)\n\nbatch_predictions.select(\"user_id\", \"prediction\") \\\n    .write.format(\"delta\").mode(\"overwrite\") \\\n    .saveAsTable(f\"{CATALOG}.gold.churn_scores_daily\")\n```\n\nA summary of the feature tables in our pipeline, their update cadence, and their role in the ML lifecycle:\n\n| Feature Table | Primary Key | Timestamp Key | Update Method | Latency | Used In |\n|---|---|---|---|---|---|\n| `user_activity_features` | `user_id` | `feature_ts` | Spark Structured Streaming | ~5 min | Real-time churn, recommendation |\n| `transaction_features` | `user_id` | `feature_ts` | Scheduled batch (hourly) | ~60 min | Churn, LTV prediction |\n| `user_profile_features` | `user_id` | `updated_at` | CDC from OLTP (near real-time) | ~2 min | All models |\n| `product_features` | `product_id` | `feature_ts` | Scheduled batch (daily) | ~24 hr | Recommendation, search ranking |\n| `session_features` | `session_id` | `session_end_ts` | Streaming (micro-batch) | ~1 min | Click-through rate, abandon prediction |\n| `cohort_features` | `cohort_id` | `computed_at` | Weekly batch | ~7 days | Segmentation, A/B analysis |\n\nFreshness vs cost tradeoff: Streaming features can cost roughly 10× more to compute than batch features (continuous cluster vs scheduled job). Only promote a feature to streaming if your model's performance degrades meaningfully with stale data; validate this with an offline ablation study first.\n\nPoint-in-time lookups via `timestamp_lookup_key` are non-negotiable for any model trained on time-series data. A missing `event_timestamp` in your label table is a data leakage bug waiting to happen.\n\n`fe.log_model()` is the right model logging call, not `mlflow.sklearn.log_model()`. 
It records feature lineage, enabling reproducible re-training and automatic feature lookup at serving time.`fe.score_batch()`\n\n**Databricks — Feature Engineering in Unity Catalog (Overview)**\n\n🔗 [https://docs.databricks.com/en/machine-learning/feature-store/uc/feature-tables-uc.html](https://docs.databricks.com/en/machine-learning/feature-store/uc/feature-tables-uc.html)\n\n**Databricks — Create and Manage Online Tables**\n\n🔗 [https://docs.databricks.com/en/machine-learning/feature-store/online-tables.html](https://docs.databricks.com/en/machine-learning/feature-store/online-tables.html)\n\n**Databricks — Point-in-Time Feature Lookups**\n\n🔗 [https://docs.databricks.com/en/machine-learning/feature-store/time-series.html](https://docs.databricks.com/en/machine-learning/feature-store/time-series.html)\n\n**Apache Spark — Structured Streaming Programming Guide**\n\n🔗 [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)\n\n**Apache Spark — Streaming Watermarks for Late Data Handling**\n\n🔗 [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking)\n\n**Databricks — Feature Store Python API Reference**\n\n🔗 [https://docs.databricks.com/en/machine-learning/feature-store/python-api.html](https://docs.databricks.com/en/machine-learning/feature-store/python-api.html)\n\n**Databricks — Score Batch with Feature Store**\n\n🔗 [https://docs.databricks.com/en/machine-learning/feature-store/score-batch.html](https://docs.databricks.com/en/machine-learning/feature-store/score-batch.html)\n\n**\"Feature Stores for ML\" — Feast Documentation (open-source reference)**\n\n🔗 [https://docs.feast.dev/](https://docs.feast.dev/)\n\n**\"Rethinking Feature Stores\" — Chip Huyen (huyenchip.com)**\n\n🔗 
[https://huyenchip.com/2023/01/08/feature-store.html](https://huyenchip.com/2023/01/08/feature-store.html)\n\n**Databricks — Model Serving with Automatic Feature Lookup**\n\n🔗 [https://docs.databricks.com/en/machine-learning/model-serving/feature-store-model-serving.html](https://docs.databricks.com/en/machine-learning/model-serving/feature-store-model-serving.html)\n\n**\"Building Machine Learning Pipelines\" — Hannes Hapke & Catherine Nelson (O'Reilly)**\n\n🔗 [https://www.oreilly.com/library/view/building-machine-learning/9781492053187/](https://www.oreilly.com/library/view/building-machine-learning/9781492053187/)\n\n*This concludes the 4-part Databricks series:*", "url": "https://wpnews.pro/news/real-time-ai-feature-engineering-with-spark-structured-streaming-and-databricks", "canonical_source": "https://dev.to/jubinsoni/real-time-ai-feature-engineering-with-spark-structured-streaming-and-databricks-feature-store-eii", "published_at": "2026-06-24 09:41:21+00:00", "updated_at": "2026-06-24 09:43:26.929951+00:00", "lang": "en", "topics": ["machine-learning", "large-language-models", "ai-infrastructure", "developer-tools", "mlops"], "entities": ["Databricks", "Spark Structured Streaming", "Databricks Feature Store", "Unity Catalog", "Kafka", "MLflow", "Feature Engineering in Unity Catalog", "FeatureEngineeringClient"], "alternates": {"html": "https://wpnews.pro/news/real-time-ai-feature-engineering-with-spark-structured-streaming-and-databricks", "markdown": "https://wpnews.pro/news/real-time-ai-feature-engineering-with-spark-structured-streaming-and-databricks.md", "text": "https://wpnews.pro/news/real-time-ai-feature-engineering-with-spark-structured-streaming-and-databricks.txt", "jsonld": "https://wpnews.pro/news/real-time-ai-feature-engineering-with-spark-structured-streaming-and-databricks.jsonld"}}