{"slug": "metadata-routing", "title": "Metadata Routing", "summary": "A developer discovered metadata routing in scikit-learn, a feature that elegantly solves the problem of passing sample weights and groups through complex ML pipelines. The feature, enabled via set_config(enable_metadata_routing=True), allows pipelines to route auxiliary information like fraud detection weights and customer IDs to specific components, eliminating hacky workarounds.", "body_md": "A couple of months ago, I stumbled upon [this video by Vincent D. Warmerdam](https://www.youtube.com/watch?v=lQ_-Aja-slA) about metadata routing in scikit-learn. I'll be honest, I had no idea what \"metadata routing\" even meant, but Vincent's explanation completely changed how I think about building ML pipelines.\n\nThe video showed me that one of the most frustrating problems in scikit-learn, passing sample weights and groups through complex pipelines, finally had an elegant solution. It piqued my curiosity enough that I dove deep into the feature and tested it extensively, and honestly, I was surprised by how little coverage it gets in technical blogs and articles. So I figured, why not write about it myself and share what I learned?\n\nIf you've ever struggled with imbalanced datasets, grouped cross-validation, or just wanted to pass custom information through your pipelines, this article is for you.\n\nLet's start with a concrete example. You're building a credit card fraud detection model with this data:\n\n```\n# Your training data\nX = transaction_features  # Amount, merchant, time, location, etc.\ny = is_fraud             # 0 = legitimate, 1 = fraud\n\n# But you also have additional information:\nsample_weights = [1.0, 1.0, 10.0, 1.0, ...]  # Fraud transactions weighted 10x\ncustomer_ids = [101, 102, 101, 103, ...]      
# Which customer made each transaction\n```\n\n**Metadata** is the \"extra information\" beyond your features (X) and labels (y):\n\n- `sample_weight`: how much each transaction should count (here, fraud weighted 10x)\n- `groups`: which customer made each transaction\n\nImagine you're building a fraud detection system for a financial company. You have:\n\n**The Challenge:** Your model needs to:\n\n**The problem?**\n\nThis \"metadata\" (weights, groups) isn't part of your feature matrix X or labels y. It's auxiliary information that needs to flow through your entire ML pipeline.\n\n**Before metadata routing (introduced experimentally in scikit-learn 1.3, with support in `Pipeline` and the cross-validation tools following in 1.4), this was painful.** Let's see why.\n\nPrior to metadata routing, you'd face multiple interconnected problems:\n\n``` python\nimport numpy as np\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.linear_model import LogisticRegression\n\n# Your fraud detection pipeline\npipe = Pipeline([\n    ('scaler', StandardScaler()),\n    ('clf', LogisticRegression())\n])\n\n# You have fraud weights (fraudulent transactions weighted 10x)\nfraud_weights = np.where(y == 1, 10.0, 1.0)\n\n# This doesn't work!\npipe.fit(X, y, sample_weight=fraud_weights)  # ValueError: Pipeline expects step-prefixed names like clf__sample_weight\n```\n\n``` python\nfrom sklearn.model_selection import cross_val_score, GroupKFold\n\n# You have customer IDs (can't split customers across folds)\ncustomer_groups = df['customer_id'].values\n\n# groups reaches the splitter, but the weights have no route to the scorer\nscores = cross_val_score(\n    pipe, X, y,\n    cv=GroupKFold(n_splits=5),\n    groups=customer_groups  # Splitting is grouped, scoring stays unweighted\n)\n```\n\n``` python\nfrom sklearn.model_selection import GridSearchCV\n\n# You need BOTH weights AND groups during hyperparameter tuning\nparam_grid = {'clf__C': [0.1, 1.0, 10.0]}\ngrid = GridSearchCV(pipe, param_grid, cv=GroupKFold(n_splits=5))\n\n# This fails: the unprefixed sample_weight is rejected by the pipeline\ngrid.fit(X, y, sample_weight=fraud_weights, groups=customer_groups)  # Doesn't work\n```\n\nSo you can begin to see the problem by now. Pipelines had no way to route this metadata to specific components. 
You'd have to use hacky workarounds like `clf__sample_weight`, which was inconsistent, grew unwieldy with nested pipelines, and could not carry weights to the scorer during cross-validation.\n\nMetadata routing solves all three problems with a clean, explicit API. Here's how it transforms our fraud detection pipeline:\n\n``` python\nfrom sklearn import set_config\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.metrics import accuracy_score, make_scorer\nfrom sklearn.model_selection import GridSearchCV, GroupKFold, cross_val_score\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.preprocessing import StandardScaler\n\n# Enable metadata routing globally\nset_config(enable_metadata_routing=True)\n\n# Build the fraud detection pipeline\npipe = Pipeline([\n    ('scaler', StandardScaler()),\n    ('clf', LogisticRegression())\n])\n\n# Configure metadata routing - declare what each component needs\npipe['scaler'].set_fit_request(sample_weight=False)  # Accepts weights, so opt out explicitly\npipe['clf'].set_fit_request(sample_weight=True)\npipe['clf'].set_score_request(sample_weight=True)\n\n# Problem 1 SOLVED: Pass weights through pipeline\npipe.fit(X, y, sample_weight=fraud_weights)\n\n# Problem 2 SOLVED: Use groups in cross-validation\n# (with routing enabled, metadata goes through `params`, not `groups=`)\nscores = cross_val_score(\n    pipe, X, y,\n    cv=GroupKFold(n_splits=5),\n    params={'groups': customer_groups}\n)\n\n# Problem 3 SOLVED: Combine weights AND groups in GridSearchCV\n# An explicit scorer requests the weights so that scoring is weighted too\nweighted_acc = make_scorer(accuracy_score).set_score_request(sample_weight=True)\ngrid = GridSearchCV(pipe, param_grid, cv=GroupKFold(n_splits=5), scoring=weighted_acc)\ngrid.fit(X, y, sample_weight=fraud_weights, groups=customer_groups)  # Both work!\n\nprint(\"Best model handles imbalance AND respects customer grouping!\")\n```\n\n**What changed?** Each component explicitly declares what metadata it needs using `set_*_request()` methods. The pipeline then routes metadata to the right places. Simple, explicit, powerful.\n\nHere's what you need to know:\n\n- `set_fit_request()`: metadata for `fit()`\n- `set_score_request()`: metadata for `score()`\n- `set_predict_request()`: metadata for `predict()` (available when `predict()` accepts metadata)\n\n**Important:**\n\nThe pipeline doesn't pass metadata to every step. Only components that explicitly call `set_*_request(metadata=True)` will receive that metadata. 
Components that don't request metadata won't receive it, even if you pass it to the pipeline. One caveat: a step that accepts a piece of metadata but has no request set for it is not skipped silently. scikit-learn raises an `UnsetMetadataPassedError`, so set the request to `False` when a step should ignore metadata it could consume.\n\n```\n# Example: Selective routing\npipe = Pipeline([\n    ('scaler', StandardScaler()),        # Accepts sample_weight, so we opt out explicitly\n    ('clf', LogisticRegression())        # Requests sample_weight\n])\n\npipe['scaler'].set_fit_request(sample_weight=False)  # Scaler ignores weights\npipe['clf'].set_fit_request(sample_weight=True)      # Only clf gets weights\n\n# When you call:\npipe.fit(X, y, sample_weight=weights)\n\n# What happens:\n# - scaler.fit(X, y) → NO sample_weight (request set to False)\n# - clf.fit(X_scaled, y, sample_weight=weights) → Gets sample_weight (requested it)\n```\n\nLet's build a custom transformer that uses sample weights during fitting. This is useful for weighted feature scaling or selection.\n\n``` python\nimport numpy as np\nfrom sklearn.base import BaseEstimator, TransformerMixin\n\nclass WeightedStandardScaler(BaseEstimator, TransformerMixin):\n    \"\"\"StandardScaler that respects sample weights during fitting.\"\"\"\n\n    def fit(self, X, y=None, sample_weight=None):\n        \"\"\"Fit scaler using weighted mean and std.\"\"\"\n        X = np.asarray(X, dtype=float)\n        if sample_weight is None:\n            sample_weight = np.ones(X.shape[0])\n\n        # Normalize weights\n        sample_weight = np.asarray(sample_weight, dtype=float)\n        sample_weight = sample_weight / sample_weight.sum()\n\n        # Compute weighted statistics (fitted attributes are set here, not in __init__)\n        self.mean_ = np.average(X, axis=0, weights=sample_weight)\n        variance = np.average((X - self.mean_) ** 2, axis=0, weights=sample_weight)\n        self.std_ = np.sqrt(variance)\n\n        return self\n\n    def transform(self, X):\n        \"\"\"Transform using fitted statistics.\"\"\"\n        return (np.asarray(X, dtype=float) - self.mean_) / self.std_\n\n    # No get_metadata_routing() override is needed for a plain consumer:\n    # because fit() accepts sample_weight, BaseEstimator generates\n    # set_fit_request() automatically.\n\n# Usage\nfrom sklearn import set_config\nset_config(enable_metadata_routing=True)\n\nX = np.random.randn(100, 5)\nweights = np.random.rand(100)\n\nscaler = WeightedStandardScaler()\nX_scaled = scaler.fit_transform(X, sample_weight=weights)\n```\n\nHere's what matters when building custom estimators:\n\n- Accept a `sample_weight` parameter in the `fit()` method; `BaseEstimator` then provides `set_fit_request()` for you\n- Override `get_metadata_routing()` only when your estimator is itself a router (a meta-estimator that forwards metadata to sub-estimators); that is where `MetadataRouter` and `add_self_request()` come in\n- Handle the `None` case when metadata isn't provided\n\nNow let's use our custom transformer in a pipeline with multiple metadata consumers.\n\n``` python\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import train_test_split\n\n# Create sample data\nX = np.random.randn(1000, 10)\ny = (X[:, 0] + X[:, 1] > 0).astype(int)\nsample_weights = np.random.rand(1000)\n\nX_train, X_test, y_train, y_test, w_train, w_test = train_test_split(\n    X, y, sample_weights, test_size=0.2, random_state=42\n)\n\n# Build pipeline with metadata routing\npipe = Pipeline([\n    ('scaler', WeightedStandardScaler()),\n    ('classifier', LogisticRegression(max_iter=1000))\n])\n\n# Configure routing: both steps need sample_weight\n# (requests are set on the steps; the Pipeline itself only routes)\npipe['scaler'].set_fit_request(sample_weight=True)\npipe['classifier'].set_fit_request(sample_weight=True)\n\n# Fit with sample weights - they're routed to both steps\npipe.fit(X_train, y_train, sample_weight=w_train)\n\n# Score also supports metadata routing\npipe['classifier'].set_score_request(sample_weight=True)\nscore = pipe.score(X_test, y_test, sample_weight=w_test)\n\nprint(f\"Weighted accuracy: {score:.3f}\")\n```\n\n**Pipeline Routing Rules:**\n\n`set_*_request()`\n\nMetadata routing shines in hyperparameter tuning scenarios where you need to pass weights or groups to cross-validation.\n\n``` python\nfrom sklearn.model_selection import GridSearchCV\nfrom sklearn.datasets import make_classification\n\n# Generate 
imbalanced dataset\nX, y = make_classification(\n    n_samples=1000, n_features=20, n_informative=15,\n    n_redundant=5, weights=[0.9, 0.1], random_state=42\n)\n\n# Create sample weights to handle imbalance\nsample_weights = np.where(y == 1, 10.0, 1.0)\n\n# Build pipeline (liblinear supports both the l1 and l2 penalties searched below)\npipe = Pipeline([\n    ('scaler', WeightedStandardScaler()),\n    ('clf', LogisticRegression(max_iter=1000, solver='liblinear'))\n])\n\n# Configure metadata routing for both steps\npipe['scaler'].set_fit_request(sample_weight=True)\npipe['clf'].set_fit_request(sample_weight=True)\npipe['clf'].set_score_request(sample_weight=True)\n\n# A scorer only receives the weights if it requests them\nfrom sklearn.metrics import accuracy_score, make_scorer\nweighted_acc = make_scorer(accuracy_score).set_score_request(sample_weight=True)\n\n# GridSearchCV with metadata routing\nparam_grid = {\n    'clf__C': [0.1, 1.0, 10.0],\n    'clf__penalty': ['l1', 'l2']\n}\n\ngrid_search = GridSearchCV(\n    pipe,\n    param_grid,\n    cv=5,\n    scoring=weighted_acc,\n    n_jobs=-1\n)\n\n# Fit with sample weights - they're used in both fitting and scoring\ngrid_search.fit(X, y, sample_weight=sample_weights)\n\nprint(f\"Best params: {grid_search.best_params_}\")\nprint(f\"Best weighted score: {grid_search.best_score_:.3f}\")\n\n# Access the best model\nbest_pipe = grid_search.best_estimator_\n```\n\n**GridSearchCV Routing Features:**\n\n- `groups` parameter for GroupKFold and similar splitters\n\n**Using Groups for Cross-Validation:**\n\n``` python\nfrom sklearn.model_selection import GroupKFold\n\n# Create grouped data (e.g., multiple samples per patient)\ngroups = np.repeat(np.arange(100), 10)  # 100 groups, 10 samples each\n\n# Configure the search to split by group and score with weights\ngrid_search = GridSearchCV(\n    pipe,\n    param_grid,\n    cv=GroupKFold(n_splits=5),\n    scoring=weighted_acc,\n    n_jobs=-1\n)\n\n# Pass groups to ensure they're not split across folds\ngrid_search.fit(X, y, groups=groups, sample_weight=sample_weights)\n```\n\nSometimes you need to pass different metadata values to different pipeline steps. 
Metadata aliasing lets you route metadata under different names.\n\n``` python\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.linear_model import LogisticRegression\n\n# Scenario: You have two types of weights\n# - feature_weights: for weighted feature scaling\n# - sample_weights: for weighted model training\n\n# Create pipeline\npipe = Pipeline([\n    ('scaler', WeightedStandardScaler()),\n    ('clf', LogisticRegression())\n])\n\n# Configure aliasing: each step's sample_weight is fed from a differently named argument\npipe['scaler'].set_fit_request(sample_weight='feature_weights')  # Alias\npipe['clf'].set_fit_request(sample_weight='sample_weights')      # Alias\n\n# Now you can pass both types of weights\npipe.fit(\n    X, y,\n    feature_weights=feature_importance_weights,  # Goes to scaler\n    sample_weights=class_balance_weights         # Goes to classifier\n)\n```\n\n**Use cases for aliasing:**\n\n**Important:**\n\nThe parameter name you use in `fit()` must match the alias, not the internal parameter name.\n\nMetadata routing works seamlessly with nested pipelines, automatically propagating metadata through all levels.\n\n``` python\nfrom sklearn.pipeline import Pipeline, FeatureUnion\nfrom sklearn.decomposition import PCA\nfrom sklearn.feature_selection import SelectKBest\n\n# Build nested pipeline: preprocessing pipeline inside main pipeline\npreprocessing = Pipeline([\n    ('scaler', WeightedStandardScaler()),\n    ('features', FeatureUnion([\n        ('pca', PCA(n_components=10)),\n        ('select', SelectKBest(k=5))\n    ]))\n])\n\nmain_pipe = Pipeline([\n    ('preprocess', preprocessing),\n    ('clf', LogisticRegression())\n])\n\n# Configure routing at any level\nmain_pipe['preprocess']['scaler'].set_fit_request(sample_weight=True)\nmain_pipe['clf'].set_fit_request(sample_weight=True)\n\n# Metadata routes through all levels automatically\nmain_pipe.fit(X, y, sample_weight=weights)\n```\n\n**What happens:**\n\nA 
few things to remember about nested pipelines:\n\n`pipe['outer']['inner']`\n\n**Complex example with FeatureUnion:**\n\n```\n# FeatureUnion with different metadata needs\nfeature_union = FeatureUnion([\n    ('weighted_pca', WeightedPCA()),      # Needs sample_weight\n    ('standard_select', SelectKBest())    # Doesn't need sample_weight\n])\n\npipe = Pipeline([\n    ('features', feature_union),\n    ('clf', LogisticRegression())\n])\n\n# Only weighted_pca gets the weights\npipe['features'].transformer_list[0][1].set_fit_request(sample_weight=True)\npipe['clf'].set_fit_request(sample_weight=True)\n\npipe.fit(X, y, sample_weight=weights)\n# weights go to weighted_pca and clf, but not to standard_select\n```\n\n**1. Always Enable Metadata Routing Explicitly**\n\n``` python\nfrom sklearn import set_config\nset_config(enable_metadata_routing=True)\n```\n\n**2. Use Descriptive Metadata Names**\n\n```\n# Good: clear purpose\nestimator.set_fit_request(sample_weight=True, class_prior=True)\n\n# Avoid: generic names\nestimator.set_fit_request(metadata=True)\n```\n\n**3. Configure Routing at Pipeline Creation**\n\n```\n# Configure immediately after creating pipeline\npipe = Pipeline([...])\npipe['step1'].set_fit_request(sample_weight=True)\npipe['step2'].set_fit_request(sample_weight=True)\n```\n\n**4. Handle None Gracefully in Custom Estimators**\n\n``` python\ndef fit(self, X, y=None, sample_weight=None):\n    if sample_weight is None:\n        sample_weight = np.ones(len(X))\n    # ... 
rest of implementation\n```\n\n**Pitfall 1: Forgetting to Enable Metadata Routing**\n\n```\n# Without set_config(enable_metadata_routing=True), this raises a ValueError\npipe.fit(X, y, sample_weight=weights)  # Metadata routing not enabled!\n```\n\n**Pitfall 2: Not Configuring All Steps**\n\n```\n# Only the classifier is configured; a scaler that accepts sample_weight has no request set\npipe['clf'].set_fit_request(sample_weight=True)\npipe.fit(X, y, sample_weight=weights)  # Raises UnsetMetadataPassedError for the scaler\n# Fix: pipe['scaler'].set_fit_request(sample_weight=True)  # or False to opt out\n```\n\n**Pitfall 3: Mixing Old and New APIs**\n\n```\n# Don't use both approaches\npipe.fit(X, y, clf__sample_weight=weights)  # Old way\npipe['clf'].set_fit_request(sample_weight=True)  # New way\n```\n\n**Pitfall 4: Forgetting to Request Metadata in score()**\n\n```\npipe['clf'].set_fit_request(sample_weight=True)\n# Forgot this:\n# pipe['clf'].set_score_request(sample_weight=True)\npipe.score(X, y, sample_weight=weights)  # Raises UnsetMetadataPassedError: score request never set\n```\n\n**Check Routing Configuration:**\n\n```\n# Inspect what metadata a component requests\nprint(pipe['clf'].get_metadata_routing())\n```\n\n**Verify Metadata is Being Used:**\n\n``` python\n# Add logging to custom estimators\ndef fit(self, X, y=None, sample_weight=None):\n    print(f\"Received sample_weight: {sample_weight is not None}\")\n    # ... rest of implementation\n```\n\n**Test with and without Metadata:**\n\n```\n# Ensure your estimator works both ways\nestimator.fit(X, y)  # Without metadata\nestimator.fit(X, y, sample_weight=weights)  # With metadata\n```\n\n- `n_jobs` in GridSearchCV and similar\n- `memory` parameter in Pipeline for caching\n\n**Use metadata routing when you need to:**\n\n**Don't use metadata routing when:**\n\nWhen I first started working with metadata routing, I struggled to clearly demarcate what should be a feature versus what should be metadata. For instance, in the earlier credit card fraud use case we saw, I kept asking myself: \"Should customer fraud history be a feature? What about transaction timestamps? 
Customer IDs?\"\n\nThe line felt blurry, and I made several mistakes before understanding the distinction. Let me share what I learned, so you can avoid the confusion I went through.\n\n**Features (X):** Information the model uses to make predictions\n\n**Metadata:** Information about how to train/evaluate the model, but not used for predictions\n\nLet's look at some ambiguous cases:\n\n```\n# Transaction amount as a FEATURE\nX = [[100.50, 'online', 'electronics'],  # Amount is a feature\n     [25.00, 'store', 'groceries']]\ny = [0, 1]  # Fraud labels\n\n# The model learns: \"Large electronics purchases online are suspicious\"\n```\n\n**Decision:**\n\nFeature - The model uses amount to predict fraud\n\n```\n# Customer's fraud history as METADATA (sample weight)\nX = [[100.50, 'online', 'electronics'],\n     [25.00, 'store', 'groceries']]\ny = [0, 1]\n\n# Customer 1 has 0% fraud history → weight = 1.0\n# Customer 2 has 50% fraud history → weight = 5.0 (pay more attention!)\nsample_weights = [1.0, 5.0]\n```\n\n**Decision:**\n\nMetadata - Tells the model \"pay more attention to this sample\" but isn't used for prediction\n\n**But wait!** You could also make this a feature:\n\n```\n# Customer fraud history as a FEATURE\nX = [[100.50, 'online', 'electronics', 0.0],   # Added fraud_history\n     [25.00, 'store', 'groceries', 0.5]]\ny = [0, 1]\n```\n\n**Decision:**\n\nFeature - Now the model learns \"customers with high fraud history are risky\"\n\nAsk yourself: **\"Should the model learn patterns from this, or does it tell the model how to learn?\"**\n\n| Scenario | Feature or Metadata? | Why? |\n|---|---|---|\n| Transaction amount | Feature | Model predicts based on amount |\n| Customer ID | Metadata (groups) | For grouping in CV, not prediction |\n| Time of day | Feature | Model learns \"3 AM transactions are suspicious\" |\n| Data quality score | Metadata (weight) | \"Trust this sample more/less\" |\n| Previous fraud count | Could be either! | See below |\n| Geographic location | Feature | Model learns regional patterns |\n| Sample collection date | Metadata (groups) | For time-based CV splits |\n\nSome information genuinely could be either. Here's how to decide:\n\n**Option 1: As a Feature**\n\n```\nX = [[100, 'online', 2],  # 2 previous frauds\n     [50, 'store', 0]]     # 0 previous frauds\n```\n\n**Option 2: As Metadata (Sample Weight)**\n\n```\nX = [[100, 'online'],\n     [50, 'store']]\nsample_weights = [5.0, 1.0]  # Weight based on fraud history\n```\n\n**Option 3: Both!**\n\n```\nX = [[100, 'online', 2],\n     [50, 'store', 0]]\nsample_weights = [5.0, 1.0]\n```\n\n**Use as Feature when:**\n\n**Use as Metadata when:**\n\n**Use as Both when:**\n\n**Mistake 1: Using customer ID as a feature**\n\n```\nX = [[101, 100, 'online'],  # Customer ID as feature\n     [102, 50, 'store']]\n```\n\nProblem: Model memorizes customers instead of learning patterns. Use as metadata (groups) instead!\n\n**Mistake 2: Using sample importance as a feature**\n\n```\nX = [[100, 'online', 5.0],  # Importance score as feature\n     [50, 'store', 1.0]]\n```\n\nProblem: Importance score won't be available at prediction time. Use as metadata (sample_weight)!\n\n**Better Approach: Separate concerns**\n\n```\nX = [[100, 'online'],\n     [50, 'store']]\nsample_weights = [5.0, 1.0]  # Importance\ngroups = [101, 102]           # Customer IDs\n```\n\n**Features** = What the model learns from\n\n**Metadata** = How the model learns\n\nWhen in doubt, ask yourself: \"Will this be available when making predictions on new data?\" If no, it's probably metadata!\n\nLooking back at everything we've covered, metadata routing really changes the game for building ML pipelines in scikit-learn. No more hacky workarounds with `clf__sample_weight` or struggling to pass groups through cross-validation. You just declare what each component needs, and the routing system handles the rest. 
It's cleaner, more explicit, and honestly just makes sense.\n\n**What you should remember:**\n\n- `set_config(enable_metadata_routing=True)` to turn routing on\n- `set_*_request()` methods to declare metadata requirements\n- Handle `None` gracefully in custom estimators\n\n**Next Steps:**\n\n**Author's Note:** This article targets scikit-learn 1.4+: routing first appeared in 1.3, and support in `Pipeline` and the cross-validation tools followed in 1.4. Metadata routing is opt-in (disabled by default), and the scikit-learn documentation has described it as experimental, so check the release notes for your version. Legacy parameter passing (e.g., `clf__sample_weight`) still works when routing is disabled.", "url": "https://wpnews.pro/news/metadata-routing", "canonical_source": "https://dev.to/akshay_devkarama_414c087/metadata-routing-4ld2", "published_at": "2026-06-19 18:28:52+00:00", "updated_at": "2026-06-19 18:36:28.822600+00:00", "lang": "en", "topics": ["machine-learning", "developer-tools"], "entities": ["scikit-learn", "Vincent D. Warmerdam"], "alternates": {"html": "https://wpnews.pro/news/metadata-routing", "markdown": "https://wpnews.pro/news/metadata-routing.md", "text": "https://wpnews.pro/news/metadata-routing.txt", "jsonld": "https://wpnews.pro/news/metadata-routing.jsonld"}}