{"slug": "fde-architecture-framework-build-production-ml-systems-that-don-t-break", "title": "FDE Architecture Framework: Build Production ML Systems That Don't Break", "summary": "The FDE (Feature-Decision-Execution) architecture framework separates ML prediction, business logic, and system action into distinct layers to prevent production failures common in monolithic ML services. The tutorial teaches how to implement each layer with clean interfaces, test independently, and deploy with different scaling strategies, using a fraud detection system as an example.", "body_md": "# FDE Architecture Framework: Build Production ML Systems That Don't Break\n\nFeature-Decision-Execution (FDE) is the layered architecture pattern that separates ML prediction from business logic from system action — the pattern that makes production ML systems maintainable, auditable, and safe to iterate on.\n\n**Level:** Intermediate\n\n**Time to complete:** 60–90 minutes\n\n**Prerequisites:** Python, basic familiarity with REST APIs and databases; no prior MLOps experience required\n\n## Learning Objectives\n\nBy the end of this tutorial you will be able to:\n\n- Explain the three layers of the FDE architecture and why they are separated\n- Implement each layer with a clean interface contract in Python\n- Wire the layers together into a working fraud detection system\n- Test each layer independently and in combination\n- Deploy layers independently with different scaling and release strategies\n\n## Table of Contents\n\n- [Why Monolithic ML Services Fail](#part-1)\n- [The FDE Pattern: Three Layers, Three Concerns](#part-2)\n- [The Feature Layer: Serving ML Inputs](#part-3)\n- [The Decision Layer: Model + Business Logic](#part-4)\n- [The Execution Layer: Taking Action Safely](#part-5)\n- [Tutorial: FDE for Banking Fraud Detection](#part-6)\n- [Layer Contracts: The API Between Layers](#part-7)\n- [Testing Strategy for 
FDE Systems](#part-8)\n- [Deployment Patterns: Independent Layer Operations](#part-9)\n- [When to Apply FDE (and When Not To)](#part-10)\n- [Exercises](#exercises)\n\n## Part 1 — Why Monolithic ML Services Fail\n\nThe typical lifecycle of a monolithic ML service looks like this:\n\n**Month 1**: Data scientist trains a fraud model. Engineer wraps it in a Flask app. The app reads features from the database, runs the model, writes a decision to the decisions table, and calls the fraud case management API. It works.\n\n**Month 6**: The model needs to be retrained. The engineer updates the model artifact, redeploys the service, and the business rules that were hardcoded alongside the model in the same function break because the output schema changed. Hotfix deployed. Three other integration points also break.\n\n**Month 12**: The fraud case management API is being replaced. The new API requires a different payload structure. The engineer modifies the service. But the service also calls a legacy audit system with the old format, and that system can’t be changed without a 6-week change request. The team works around it by transforming the payload twice in the same function.\n\n**Month 18**: Nobody understands the service anymore. Changing the model requires understanding the business rules. Changing the business rules requires understanding the API integrations. Changing the API integrations requires understanding the model output format. There is no test for any of it.\n\nThis is not a hypothetical. It is the standard lifecycle of 80% of production ML systems built without an explicit architecture pattern.\n\nThe root cause is the same in every case: **prediction, business logic, and system action are tangled in a single service**. When one changes, the others break. 
When something goes wrong in production, it’s unclear which layer is the source.\n\nFDE separates these concerns at the architecture level.\n\n## Part 2 — The FDE Pattern: Three Layers, Three Concerns\n\n```\n┌─────────────────────────────────────────────────────────┐\n│                    INCOMING REQUEST                      │\n│              (transaction, API call, event)              │\n└────────────────────────┬────────────────────────────────┘\n                         │\n              ┌──────────▼──────────┐\n              │    FEATURE LAYER    │  ← \"What do we know?\"\n              │                     │\n              │  • Feature serving  │\n              │  • Signal retrieval │\n              │  • Feature groups   │\n              │  • Freshness checks │\n              └──────────┬──────────┘\n                         │ FeatureSet\n              ┌──────────▼──────────┐\n              │   DECISION LAYER    │  ← \"What should we do?\"\n              │                     │\n              │  • ML model(s)      │\n              │  • Business rules   │\n              │  • Output contract  │\n              │  • Explainability   │\n              └──────────┬──────────┘\n                         │ DecisionResult\n              ┌──────────▼──────────┐\n              │  EXECUTION LAYER    │  ← \"Do it safely\"\n              │                     │\n              │  • Action dispatch  │\n              │  • Idempotency      │\n              │  • Rollback plan    │\n              │  • Audit logging    │\n              └──────────┬──────────┘\n                         │ ExecutionResult\n                    ┌────▼────┐\n                    │ CALLER  │\n                    └─────────┘\n```\n\nEach layer has a single responsibility and communicates with adjacent layers through a typed contract. 
The contract is the key: it means each layer can be developed, tested, deployed, and scaled independently.\n\n**Feature Layer**: given a request context (customer ID, transaction data, session info), produce the feature vector the decision layer needs. Nothing else.\n\n**Decision Layer**: given a feature set, produce a decision (approve/decline/review, score, recommended action, reason codes). Nothing else.\n\n**Execution Layer**: given a decision, take the appropriate action in downstream systems safely and idempotently. Nothing else.\n\n## Part 3 — The Feature Layer: Serving ML Inputs\n\n### 3.1 Responsibilities\n\nThe Feature Layer is responsible for:\n\n- Serving pre-computed features from a fast store (Redis, DynamoDB, Feast)\n- Computing real-time features from the incoming request\n- Joining online features with request-time signals\n- Validating feature completeness and freshness\n- Returning a typed `FeatureSet` object\n\n### 3.2 The FeatureSet Contract\n\n``` python\nfrom dataclasses import dataclass, field\nfrom typing import Optional, Dict, Any\nfrom datetime import datetime\nfrom enum import Enum\n\nclass FeatureFreshness(Enum):\n    FRESH      = \"fresh\"        # within expected refresh window\n    STALE      = \"stale\"        # older than expected, but available\n    MISSING    = \"missing\"      # not available at all\n    SYNTHETIC  = \"synthetic\"    # imputed from fallback logic\n\n@dataclass\nclass FeatureGroup:\n    \"\"\"A logical group of related features with metadata.\"\"\"\n    name:        str\n    features:    Dict[str, Any]\n    freshness:   FeatureFreshness\n    computed_at: datetime\n    source:      str  # \"online_store\", \"real_time\", \"fallback\"\n\n@dataclass\nclass FeatureSet:\n    \"\"\"\n    The contract from Feature Layer → Decision Layer.\n    Contains all features the decision model needs, with provenance.\n    \"\"\"\n    request_id:     str\n    customer_id:    str\n    timestamp:      datetime\n    groups:        
 Dict[str, FeatureGroup]\n    is_complete:    bool       # False if any critical group is MISSING\n    warnings:       list = field(default_factory=list)\n\n    def get(self, group: str, feature: str, default=None):\n        \"\"\"Safe feature access with default fallback.\"\"\"\n        grp = self.groups.get(group)\n        if grp is None or grp.freshness == FeatureFreshness.MISSING:\n            return default\n        return grp.features.get(feature, default)\n\n    def get_vector(self, group: str) -> Dict[str, Any]:\n        \"\"\"Get all features in a group as a flat dict.\"\"\"\n        grp = self.groups.get(group)\n        if grp is None:\n            return {}\n        return grp.features\n```\n\n### 3.3 Implementing the Feature Layer\n\n``` python\nimport redis\nimport json\nimport numpy as np\nfrom datetime import datetime, timedelta\nfrom typing import Optional\nimport logging\n\nlogger = logging.getLogger(__name__)\n\nclass FeatureLayer:\n    \"\"\"\n    Feature serving layer for the FDE architecture.\n    Combines online-stored pre-computed features with real-time signals.\n    \"\"\"\n\n    def __init__(self, redis_url: str = \"redis://localhost:6379\",\n                 freshness_threshold_minutes: int = 60):\n        self.redis  = redis.from_url(redis_url, decode_responses=True)\n        self.freshness_threshold = timedelta(minutes=freshness_threshold_minutes)\n\n    def get_features(self, request_id: str, customer_id: str,\n                     transaction: dict) -> FeatureSet:\n        \"\"\"\n        Main entry point: build a complete FeatureSet for a decisioning request.\n        \"\"\"\n        timestamp = datetime.utcnow()\n        groups    = {}\n        warnings  = []\n\n        # 1. Online-stored customer risk features (Redis)\n        groups[\"customer_risk\"] = self._get_customer_risk(customer_id, timestamp)\n\n        # 2. 
Spend behaviour features (Redis, refreshed hourly)\n        groups[\"spend_behaviour\"] = self._get_spend_behaviour(customer_id, timestamp)\n\n        # 3. Real-time transaction features (computed from the request itself)\n        groups[\"transaction_context\"] = self._compute_transaction_features(\n            transaction, timestamp\n        )\n\n        # 4. Session risk features (real-time from request)\n        groups[\"session_context\"] = self._compute_session_features(\n            transaction, timestamp\n        )\n\n        # Check completeness: \"customer_risk\" is always required\n        is_complete = groups[\"customer_risk\"].freshness != FeatureFreshness.MISSING\n\n        if not is_complete:\n            warnings.append(\"customer_risk features missing — decisioning may degrade\")\n\n        for name, group in groups.items():\n            if group.freshness == FeatureFreshness.STALE:\n                warnings.append(f\"{name} features are stale (>60min old)\")\n\n        return FeatureSet(\n            request_id=request_id,\n            customer_id=customer_id,\n            timestamp=timestamp,\n            groups=groups,\n            is_complete=is_complete,\n            warnings=warnings,\n        )\n\n    def _get_customer_risk(self, customer_id: str,\n                           now: datetime) -> FeatureGroup:\n        \"\"\"Retrieve pre-computed customer risk features from Redis.\"\"\"\n        key  = f\"features:customer_risk:{customer_id}\"\n        data = self.redis.hgetall(key)\n\n        if not data:\n            return FeatureGroup(\n                name=\"customer_risk\",\n                features={},\n                freshness=FeatureFreshness.MISSING,\n                computed_at=now,\n                source=\"online_store\",\n            )\n\n        computed_at = datetime.fromisoformat(data.get(\"_computed_at\", now.isoformat()))\n        age         = now - computed_at\n        freshness   = (FeatureFreshness.FRESH if age < 
self.freshness_threshold\n                       else FeatureFreshness.STALE)\n\n        features = {\n            \"credit_score\":        float(data.get(\"credit_score\", 0)),\n            \"fraud_score_30d\":     float(data.get(\"fraud_score_30d\", 0)),\n            \"account_age_months\":  int(data.get(\"account_age_months\", 0)),\n            \"dispute_count_90d\":   int(data.get(\"dispute_count_90d\", 0)),\n            \"velocity_1h\":         float(data.get(\"velocity_1h\", 0)),   # $ in last hour\n            \"velocity_24h\":        float(data.get(\"velocity_24h\", 0)),  # $ in last 24h\n            \"international_ratio\": float(data.get(\"international_ratio\", 0)),\n        }\n\n        return FeatureGroup(\n            name=\"customer_risk\",\n            features=features,\n            freshness=freshness,\n            computed_at=computed_at,\n            source=\"online_store\",\n        )\n\n    def _get_spend_behaviour(self, customer_id: str,\n                              now: datetime) -> FeatureGroup:\n        \"\"\"Retrieve spend pattern features.\"\"\"\n        key  = f\"features:spend:{customer_id}\"\n        data = self.redis.hgetall(key)\n\n        if not data:\n            # Return synthetic defaults rather than MISSING for non-critical features\n            return FeatureGroup(\n                name=\"spend_behaviour\",\n                features={\"avg_txn_amount_30d\": 0.0, \"top_mcc\": \"unknown\"},\n                freshness=FeatureFreshness.SYNTHETIC,\n                computed_at=now,\n                source=\"fallback\",\n            )\n\n        return FeatureGroup(\n            name=\"spend_behaviour\",\n            features={\n                \"avg_txn_amount_30d\": float(data.get(\"avg_txn_amount_30d\", 0)),\n                \"std_txn_amount_30d\": float(data.get(\"std_txn_amount_30d\", 0)),\n                \"top_mcc\":            data.get(\"top_mcc\", \"unknown\"),\n                \"unique_merchants_7d\": 
int(data.get(\"unique_merchants_7d\", 0)),\n                \"weekend_spend_ratio\": float(data.get(\"weekend_spend_ratio\", 0.5)),\n            },\n            freshness=FeatureFreshness.FRESH,\n            computed_at=now,\n            source=\"online_store\",\n        )\n\n    def _compute_transaction_features(self, txn: dict,\n                                       now: datetime) -> FeatureGroup:\n        \"\"\"Compute features from the current transaction — always real-time.\"\"\"\n        amount    = float(txn.get(\"amount\", 0))\n        is_intl   = txn.get(\"country\", \"US\") != \"US\"\n        is_online = txn.get(\"channel\", \"\") in (\"web\", \"mobile\", \"api\")\n        is_night  = now.hour < 6 or now.hour >= 22\n\n        return FeatureGroup(\n            name=\"transaction_context\",\n            features={\n                \"amount\":          amount,\n                \"amount_log\":      float(np.log1p(amount)),\n                \"is_international\": int(is_intl),\n                \"is_online\":       int(is_online),\n                \"is_night\":        int(is_night),\n                \"mcc\":             txn.get(\"mcc\", \"0000\"),\n                \"merchant_country\": txn.get(\"country\", \"US\"),\n            },\n            freshness=FeatureFreshness.FRESH,\n            computed_at=now,\n            source=\"real_time\",\n        )\n\n    def _compute_session_features(self, txn: dict,\n                                   now: datetime) -> FeatureGroup:\n        \"\"\"Compute session-level risk signals.\"\"\"\n        return FeatureGroup(\n            name=\"session_context\",\n            features={\n                \"device_fingerprint_match\": int(txn.get(\"device_known\", True)),\n                \"ip_country_match\":         int(txn.get(\"ip_matches_billing\", True)),\n                \"auth_method\":              txn.get(\"auth_method\", \"pin\"),\n            },\n            freshness=FeatureFreshness.FRESH,\n            
computed_at=now,\n            source=\"real_time\",\n        )\n```\n\n## Part 4 — The Decision Layer: Model + Business Logic\n\n### 4.1 The DecisionResult Contract\n\n``` python\nfrom dataclasses import dataclass, field\nfrom typing import List, Optional\nfrom enum import Enum\n\nclass DecisionAction(Enum):\n    APPROVE  = \"approve\"\n    DECLINE  = \"decline\"\n    REVIEW   = \"review\"   # send to human review queue\n    CHALLENGE = \"challenge\"  # step-up authentication\n\n@dataclass\nclass DecisionResult:\n    \"\"\"\n    The contract from Decision Layer → Execution Layer.\n    Encodes what to do, why, and how confident we are.\n    \"\"\"\n    request_id:      str\n    customer_id:     str\n    action:          DecisionAction\n    fraud_score:     float            # 0.0 (safe) to 1.0 (fraud)\n    confidence:      float            # model confidence in this decision\n    reason_codes:    List[str]        # human-readable reasons (for regulatory)\n    policy_applied:  str              # which rule or model made this decision\n    model_version:   str\n    is_model_driven: bool             # False if overridden by a hard rule\n    metadata:        dict = field(default_factory=dict)\n```\n\n### 4.2 Implementing the Decision Layer\n\n``` python\nimport numpy as np\nfrom sklearn.ensemble import GradientBoostingClassifier\nimport joblib\n\nclass DecisionLayer:\n    \"\"\"\n    Decision layer: combines ML model scores with business rules\n    to produce a typed DecisionResult.\n\n    Critically: business rules are explicit and auditable,\n    not embedded in the model.\n    \"\"\"\n\n    # Fraud score thresholds (easily tunable without model retrain)\n    DECLINE_THRESHOLD  = 0.85\n    REVIEW_THRESHOLD   = 0.60\n    CHALLENGE_THRESHOLD = 0.40\n\n    def __init__(self, model_path: str, model_version: str):\n        self.model         = joblib.load(model_path)\n        self.model_version = model_version\n\n    def decide(self, feature_set: FeatureSet) -> 
DecisionResult:\n        \"\"\"\n        Core decision logic: run model, apply business rules, return result.\n        \"\"\"\n        # Step 1: Check if any hard-rule pre-empts the model\n        hard_rule_result = self._check_hard_rules(feature_set)\n        if hard_rule_result is not None:\n            return hard_rule_result\n\n        # Step 2: If feature set is incomplete, apply conservative policy\n        if not feature_set.is_complete:\n            return self._incomplete_features_policy(feature_set)\n\n        # Step 3: Run the ML model\n        fraud_score, confidence = self._score(feature_set)\n\n        # Step 4: Apply threshold policy\n        action, reason_codes = self._apply_thresholds(\n            fraud_score, confidence, feature_set\n        )\n\n        return DecisionResult(\n            request_id=feature_set.request_id,\n            customer_id=feature_set.customer_id,\n            action=action,\n            fraud_score=fraud_score,\n            confidence=confidence,\n            reason_codes=reason_codes,\n            policy_applied=\"gbm_v3_threshold_policy\",\n            model_version=self.model_version,\n            is_model_driven=True,\n        )\n\n    def _check_hard_rules(self,\n                           fs: FeatureSet) -> Optional[DecisionResult]:\n        \"\"\"\n        Hard rules that override the model.\n        These encode: regulatory requirements, credit policy,\n        known fraud patterns, and operational constraints.\n        \"\"\"\n        customer_id = fs.customer_id\n\n        # Rule 1: Immediate decline for known fraud lists (OFAC, internal blocklist)\n        if self._is_on_blocklist(customer_id):\n            return DecisionResult(\n                request_id=fs.request_id,\n                customer_id=customer_id,\n                action=DecisionAction.DECLINE,\n                fraud_score=1.0,\n                confidence=1.0,\n                reason_codes=[\"BLOCKED_ENTITY\"],\n                
policy_applied=\"hard_rule:blocklist\",\n                model_version=self.model_version,\n                is_model_driven=False,\n            )\n\n        # Rule 2: Extreme velocity — always decline\n        velocity_1h = fs.get(\"customer_risk\", \"velocity_1h\", 0)\n        if velocity_1h > 10_000:\n            return DecisionResult(\n                request_id=fs.request_id,\n                customer_id=customer_id,\n                action=DecisionAction.DECLINE,\n                fraud_score=0.95,\n                confidence=1.0,\n                reason_codes=[\"VELOCITY_EXCEEDED\"],\n                policy_applied=\"hard_rule:velocity_limit\",\n                model_version=self.model_version,\n                is_model_driven=False,\n            )\n\n        # Rule 3: International transaction on account flagged as domestic-only\n        if (fs.get(\"transaction_context\", \"is_international\")\n                and fs.get(\"customer_risk\", \"international_ratio\", 0) < 0.01):\n            # Step up to challenge rather than decline\n            return DecisionResult(\n                request_id=fs.request_id,\n                customer_id=customer_id,\n                action=DecisionAction.CHALLENGE,\n                fraud_score=0.50,\n                confidence=0.70,\n                reason_codes=[\"UNUSUAL_GEO\"],\n                policy_applied=\"hard_rule:geo_challenge\",\n                model_version=self.model_version,\n                is_model_driven=False,\n            )\n\n        return None  # no hard rule fired; proceed to model\n\n    def _score(self, fs: FeatureSet) -> tuple:\n        \"\"\"Build feature vector and run GBM model.\"\"\"\n        risk    = fs.get_vector(\"customer_risk\")\n        spend   = fs.get_vector(\"spend_behaviour\")\n        txn     = fs.get_vector(\"transaction_context\")\n        session = fs.get_vector(\"session_context\")\n\n        # Deviation of current amount from historical average\n        avg = 
spend.get(\"avg_txn_amount_30d\", txn.get(\"amount\", 1))\n        std = spend.get(\"std_txn_amount_30d\", avg * 0.5) or 1.0\n        amount_zscore = (txn.get(\"amount\", 0) - avg) / std\n\n        X = np.array([[\n            risk.get(\"fraud_score_30d\", 0),\n            risk.get(\"velocity_1h\", 0) / 1000,\n            risk.get(\"velocity_24h\", 0) / 10000,\n            risk.get(\"dispute_count_90d\", 0),\n            risk.get(\"international_ratio\", 0),\n            txn.get(\"amount_log\", 0),\n            amount_zscore,\n            txn.get(\"is_international\", 0),\n            txn.get(\"is_online\", 0),\n            txn.get(\"is_night\", 0),\n            session.get(\"device_fingerprint_match\", 1),\n            session.get(\"ip_country_match\", 1),\n        ]])\n\n        proba      = self.model.predict_proba(X)[0]\n        fraud_prob = float(proba[1])\n        confidence = float(max(proba))  # confidence is how far from 0.5\n\n        return fraud_prob, confidence\n\n    def _apply_thresholds(self, fraud_score: float, confidence: float,\n                           fs: FeatureSet) -> tuple:\n        \"\"\"Map fraud score to action with reason codes.\"\"\"\n        if fraud_score >= self.DECLINE_THRESHOLD:\n            action  = DecisionAction.DECLINE\n            reasons = self._get_reason_codes(fraud_score, fs)\n        elif fraud_score >= self.REVIEW_THRESHOLD:\n            action  = DecisionAction.REVIEW\n            reasons = self._get_reason_codes(fraud_score, fs)\n        elif fraud_score >= self.CHALLENGE_THRESHOLD:\n            action  = DecisionAction.CHALLENGE\n            reasons = [\"ELEVATED_RISK\"]\n        else:\n            action  = DecisionAction.APPROVE\n            reasons = [\"WITHIN_NORMAL_PARAMETERS\"]\n\n        return action, reasons\n\n    def _get_reason_codes(self, score: float,\n                           fs: FeatureSet) -> List[str]:\n        \"\"\"Generate regulatory-grade reason codes (FCRA compliant).\"\"\"\n        codes = 
[]\n        if fs.get(\"customer_risk\", \"velocity_1h\", 0) > 2000:\n            codes.append(\"HIGH_VELOCITY\")\n        if fs.get(\"transaction_context\", \"is_international\"):\n            codes.append(\"INTERNATIONAL_TRANSACTION\")\n        if fs.get(\"transaction_context\", \"is_night\"):\n            codes.append(\"UNUSUAL_TIME\")\n        if not fs.get(\"session_context\", \"device_fingerprint_match\", True):\n            codes.append(\"UNRECOGNISED_DEVICE\")\n        if not codes:\n            codes.append(\"MODEL_RISK_SCORE\")\n        return codes[:4]  # max 4 reason codes per FCRA\n\n    def _incomplete_features_policy(self, fs: FeatureSet) -> DecisionResult:\n        \"\"\"Conservative policy when features are missing.\"\"\"\n        return DecisionResult(\n            request_id=fs.request_id,\n            customer_id=fs.customer_id,\n            action=DecisionAction.REVIEW,\n            fraud_score=0.50,\n            confidence=0.10,\n            reason_codes=[\"INSUFFICIENT_FEATURES\"],\n            policy_applied=\"fallback:incomplete_features\",\n            model_version=self.model_version,\n            is_model_driven=False,\n            metadata={\"warnings\": fs.warnings},\n        )\n\n    def _is_on_blocklist(self, customer_id: str) -> bool:\n        # In production: call blocklist service or Redis set\n        return False\n```\n\n## Part 5 — The Execution Layer: Taking Action Safely\n\n### 5.1 Responsibilities\n\nThe Execution Layer takes the `DecisionResult` and acts on it. 
Its responsibilities are:\n\n- **Routing**: send the decision to the right downstream system (approve → payment rails, decline → decline handler, review → case management queue)\n- **Idempotency**: ensure that retried requests don’t double-execute actions\n- **Audit logging**: write an immutable record of every action taken\n- **Rollback**: for reversible actions, maintain rollback capability\n\n### 5.2 Implementing the Execution Layer\n\n``` python\nimport json\nimport uuid\nimport time\nimport redis\nfrom typing import Optional\nfrom dataclasses import dataclass\n\n@dataclass\nclass ExecutionResult:\n    \"\"\"The response from the Execution Layer back to the caller.\"\"\"\n    request_id:     str\n    action_taken:   str\n    success:        bool\n    reference_id:   Optional[str]   # downstream system reference\n    is_idempotent:  bool            # True if this was a duplicate request\n    audit_trail_id: str\n    error:          Optional[str] = None\n\nclass ExecutionLayer:\n    \"\"\"\n    Execution layer: safely dispatches decisions to downstream systems.\n    Handles idempotency, audit logging, and rollback for reversible actions.\n    \"\"\"\n\n    def __init__(self, redis_url: str, audit_logger, payment_client,\n                 review_queue_client):\n        self.redis         = redis.from_url(redis_url)\n        self.audit         = audit_logger\n        self.payments      = payment_client\n        self.review_queue  = review_queue_client\n        self.idempotency_ttl = 86400  # 24 hours\n\n    def execute(self, decision: DecisionResult,\n                transaction: dict) -> ExecutionResult:\n        \"\"\"\n        Main entry point: execute the decision safely.\n        \"\"\"\n        # Step 1: Idempotency check\n        idempotency_key = f\"exec:idempotent:{decision.request_id}\"\n        existing = self.redis.get(idempotency_key)\n\n        if existing:\n            # Request already processed — return cached result\n            cached = json.loads(existing)\n            
return ExecutionResult(\n                request_id=decision.request_id,\n                action_taken=cached[\"action_taken\"],\n                success=True,\n                reference_id=cached.get(\"reference_id\"),\n                is_idempotent=True,\n                audit_trail_id=cached[\"audit_trail_id\"],\n            )\n\n        # Step 2: Execute the action\n        audit_id = str(uuid.uuid4())\n        result   = self._dispatch(decision, transaction, audit_id)\n\n        # Step 3: Write to audit log (always — even on failure)\n        self.audit.write({\n            \"audit_trail_id\":  audit_id,\n            \"request_id\":      decision.request_id,\n            \"customer_id\":     decision.customer_id,\n            \"decision\":        decision.action.value,\n            \"fraud_score\":     decision.fraud_score,\n            \"reason_codes\":    decision.reason_codes,\n            \"policy_applied\":  decision.policy_applied,\n            \"model_version\":   decision.model_version,\n            \"is_model_driven\": decision.is_model_driven,\n            \"action_taken\":    result.action_taken,\n            \"success\":         result.success,\n            \"reference_id\":    result.reference_id,\n            \"timestamp\":       time.time(),\n            \"transaction\":     transaction,\n        })\n\n        # Step 4: Cache for idempotency\n        if result.success:\n            self.redis.setex(\n                idempotency_key,\n                self.idempotency_ttl,\n                json.dumps({\n                    \"action_taken\":    result.action_taken,\n                    \"reference_id\":    result.reference_id,\n                    \"audit_trail_id\":  audit_id,\n                }),\n            )\n\n        return result\n\n    def _dispatch(self, decision: DecisionResult,\n                   txn: dict, audit_id: str) -> ExecutionResult:\n        \"\"\"Route decision to the appropriate handler.\"\"\"\n        action = 
decision.action\n\n        if action == DecisionAction.APPROVE:\n            return self._handle_approve(decision, txn, audit_id)\n        elif action == DecisionAction.DECLINE:\n            return self._handle_decline(decision, txn, audit_id)\n        elif action == DecisionAction.REVIEW:\n            return self._handle_review(decision, txn, audit_id)\n        elif action == DecisionAction.CHALLENGE:\n            return self._handle_challenge(decision, txn, audit_id)\n        else:\n            raise ValueError(f\"Unknown action: {action}\")\n\n    def _handle_approve(self, decision, txn, audit_id) -> ExecutionResult:\n        \"\"\"Approve: authorise the transaction on the payment rails.\"\"\"\n        try:\n            ref = self.payments.authorise(\n                transaction_id=txn[\"transaction_id\"],\n                amount=txn[\"amount\"],\n                merchant=txn[\"merchant_id\"],\n                auth_code=str(uuid.uuid4())[:8].upper(),\n            )\n            return ExecutionResult(\n                request_id=decision.request_id,\n                action_taken=\"approved\",\n                success=True,\n                reference_id=ref,\n                is_idempotent=False,\n                audit_trail_id=audit_id,\n            )\n        except Exception as e:\n            return ExecutionResult(\n                request_id=decision.request_id,\n                action_taken=\"approve_failed\",\n                success=False,\n                reference_id=None,\n                is_idempotent=False,\n                audit_trail_id=audit_id,\n                error=str(e),\n            )\n\n    def _handle_decline(self, decision, txn, audit_id) -> ExecutionResult:\n        \"\"\"Decline: reject the transaction with reason codes.\"\"\"\n        self.payments.decline(\n            transaction_id=txn[\"transaction_id\"],\n            decline_codes=decision.reason_codes,\n        )\n        return ExecutionResult(\n            
request_id=decision.request_id,\n            action_taken=\"declined\",\n            success=True,\n            reference_id=None,\n            is_idempotent=False,\n            audit_trail_id=audit_id,\n        )\n\n    def _handle_review(self, decision, txn, audit_id) -> ExecutionResult:\n        \"\"\"Review: route to human fraud analyst queue.\"\"\"\n        case_id = self.review_queue.enqueue({\n            \"transaction\": txn,\n            \"fraud_score\": decision.fraud_score,\n            \"reason_codes\": decision.reason_codes,\n            \"audit_trail_id\": audit_id,\n            \"priority\": \"high\" if decision.fraud_score > 0.75 else \"normal\",\n        })\n        return ExecutionResult(\n            request_id=decision.request_id,\n            action_taken=\"queued_for_review\",\n            success=True,\n            reference_id=case_id,\n            is_idempotent=False,\n            audit_trail_id=audit_id,\n        )\n\n    def _handle_challenge(self, decision, txn, audit_id) -> ExecutionResult:\n        \"\"\"Challenge: trigger step-up authentication flow.\"\"\"\n        challenge_id = self.payments.initiate_challenge(\n            transaction_id=txn[\"transaction_id\"],\n            challenge_type=\"otp_sms\",\n        )\n        return ExecutionResult(\n            request_id=decision.request_id,\n            action_taken=\"challenge_initiated\",\n            success=True,\n            reference_id=challenge_id,\n            is_idempotent=False,\n            audit_trail_id=audit_id,\n        )\n```\n\n## Part 6 — Tutorial: Wire It Together\n\n### 6.1 The FDE Orchestrator\n\n``` python\nimport uuid\nimport logging\nfrom datetime import datetime  # used for the fallback FeatureSet timestamp\n\nlogger = logging.getLogger(__name__)\n\nclass FraudDecisionService:\n    \"\"\"\n    FDE orchestrator for the fraud decision system.\n    Coordinates Feature → Decision → Execution, handles errors at each layer.\n    \"\"\"\n\n    def __init__(self, feature_layer: FeatureLayer,\n                 decision_layer: 
DecisionLayer,\n                 execution_layer: ExecutionLayer):\n        self.features  = feature_layer\n        self.decision  = decision_layer\n        self.execution = execution_layer\n\n    def process(self, transaction: dict) -> dict:\n        \"\"\"\n        Full FDE pipeline for a single transaction.\n        Returns a structured response for the calling payment system.\n        \"\"\"\n        request_id  = transaction.get(\"request_id\") or str(uuid.uuid4())\n        customer_id = transaction[\"customer_id\"]\n\n        # ── Layer 1: Features ─────────────────────────────────────────────────\n        try:\n            feature_set = self.features.get_features(\n                request_id=request_id,\n                customer_id=customer_id,\n                transaction=transaction,\n            )\n        except Exception as e:\n            logger.error(f\"Feature layer failed for {request_id}: {e}\")\n            # Fail open with synthetic empty features (or fail closed — your policy)\n            feature_set = FeatureSet(\n                request_id=request_id,\n                customer_id=customer_id,\n                timestamp=datetime.utcnow(),\n                groups={},\n                is_complete=False,\n                warnings=[f\"Feature layer error: {e}\"],\n            )\n\n        # ── Layer 2: Decision ─────────────────────────────────────────────────\n        try:\n            decision = self.decision.decide(feature_set)\n        except Exception as e:\n            logger.error(f\"Decision layer failed for {request_id}: {e}\")\n            # Safe fallback: review rather than approve or decline\n            decision = DecisionResult(\n                request_id=request_id,\n                customer_id=customer_id,\n                action=DecisionAction.REVIEW,\n                fraud_score=0.5,\n                confidence=0.0,\n                reason_codes=[\"DECISION_SYSTEM_ERROR\"],\n                
policy_applied=\"fallback:system_error\",\n                model_version=\"unknown\",\n                is_model_driven=False,\n            )\n\n        # ── Layer 3: Execution ────────────────────────────────────────────────\n        try:\n            result = self.execution.execute(decision, transaction)\n        except Exception as e:\n            logger.critical(f\"Execution layer failed for {request_id}: {e}\")\n            return {\n                \"request_id\":    request_id,\n                \"status\":        \"error\",\n                \"action\":        \"system_error\",\n                \"error\":         str(e),\n            }\n\n        return {\n            \"request_id\":      request_id,\n            \"action\":          result.action_taken,\n            \"reference_id\":    result.reference_id,\n            \"fraud_score\":     decision.fraud_score,\n            \"reason_codes\":    decision.reason_codes,\n            \"audit_trail_id\":  result.audit_trail_id,\n            \"model_version\":   decision.model_version,\n        }\n```\n\n### 6.2 Usage Example\n\n``` python\n# Initialise the service (in production: inject via DI container)\nservice = FraudDecisionService(\n    feature_layer=FeatureLayer(redis_url=\"redis://localhost:6379\"),\n    decision_layer=DecisionLayer(\n        model_path=\"models/fraud_gbm_v3.joblib\",\n        model_version=\"v3.2.1\",\n    ),\n    execution_layer=ExecutionLayer(\n        redis_url=\"redis://localhost:6379\",\n        audit_logger=AuditLogger(),\n        payment_client=PaymentClient(),\n        review_queue_client=ReviewQueueClient(),\n    ),\n)\n\n# Process a transaction\nresult = service.process({\n    \"request_id\":      \"txn-20260612-001\",\n    \"customer_id\":     \"customer_8472\",\n    \"transaction_id\":  \"auth-0001-20260612\",\n    \"amount\":          1850.00,\n    \"merchant_id\":     \"merchant_12345\",\n    \"mcc\":             \"5411\",  # grocery store\n    \"country\":         \"US\",\n    
\"channel\":         \"mobile\",\n    \"device_known\":    True,\n    \"ip_matches_billing\": True,\n    \"auth_method\":     \"biometric\",\n})\n\nprint(result)\n# {'request_id': 'txn-20260612-001', 'action': 'approved',\n#  'reference_id': 'AUTH-7F3A', 'fraud_score': 0.12,\n#  'reason_codes': ['WITHIN_NORMAL_PARAMETERS'],\n#  'audit_trail_id': '...', 'model_version': 'v3.2.1'}\n```\n\n## Part 7 — Layer Contracts: The API Between Layers\n\nThe contracts (`FeatureSet`, `DecisionResult`, `ExecutionResult`) are the most important part of the FDE architecture. Here’s what makes a good contract:\n\n**Typed and validated.** Use Python dataclasses, Pydantic, or a schema registry. Untyped dicts between layers are the first step toward the monolith failure mode.\n\n**Versioned.** When the Decision Layer adds a new field to `DecisionResult`, the Execution Layer should not break. Design contracts with forward compatibility in mind: add fields, don’t remove them; use `Optional` types for new fields.\n\n**Observable.** Every contract exchange should be logged at the INFO level with at minimum: request_id, timestamp, layer, and key decision fields. This is what makes the system debuggable when something goes wrong.\n\n**Documented.** Every field should have a docstring. Future engineers reading the code shouldn’t need to trace through three layers to understand what `confidence` means in `DecisionResult`.\n\n## Part 8 — Testing Strategy for FDE Systems\n\nOne of the most powerful properties of FDE is testability. 
Each layer can be tested independently.\n\n``` python\nimport pytest\nfrom datetime import datetime\nfrom unittest.mock import MagicMock, patch\n\nclass TestDecisionLayer:\n    \"\"\"Unit tests for the Decision Layer — no Feature or Execution Layer needed.\"\"\"\n\n    def setup_method(self):\n        self.decision = DecisionLayer(\n            model_path=\"tests/fixtures/mock_model.joblib\",\n            model_version=\"test-v1\",\n        )\n\n    def test_hard_rule_velocity_triggers_decline(self):\n        \"\"\"Velocity hard rule should decline before model runs.\"\"\"\n        fs = FeatureSet(\n            request_id=\"test-001\",\n            customer_id=\"c001\",\n            timestamp=datetime.utcnow(),\n            groups={\n                \"customer_risk\": FeatureGroup(\n                    name=\"customer_risk\",\n                    features={\"velocity_1h\": 15000.0},  # exceeds $10K limit\n                    freshness=FeatureFreshness.FRESH,\n                    computed_at=datetime.utcnow(),\n                    source=\"online_store\",\n                ),\n                \"transaction_context\": FeatureGroup(\n                    name=\"transaction_context\",\n                    features={\"is_international\": 0, \"amount\": 500.0, \"amount_log\": 6.2},\n                    freshness=FeatureFreshness.FRESH,\n                    computed_at=datetime.utcnow(),\n                    source=\"real_time\",\n                ),\n                \"session_context\": FeatureGroup(\n                    name=\"session_context\",\n                    features={},\n                    freshness=FeatureFreshness.FRESH,\n                    computed_at=datetime.utcnow(),\n                    source=\"real_time\",\n                ),\n                \"spend_behaviour\": FeatureGroup(\n                    name=\"spend_behaviour\",\n                    features={},\n                    freshness=FeatureFreshness.SYNTHETIC,\n                    computed_at=datetime.utcnow(),\n                  
  source=\"fallback\",\n                ),\n            },\n            is_complete=True,\n        )\n\n        result = self.decision.decide(fs)\n\n        assert result.action == DecisionAction.DECLINE\n        assert result.is_model_driven == False\n        assert \"VELOCITY_EXCEEDED\" in result.reason_codes\n\n    def test_normal_transaction_approves(self):\n        \"\"\"Low-risk transaction should approve.\"\"\"\n        # Build a low-risk feature set\n        fs = build_low_risk_feature_set(\"test-002\", \"c002\")\n        result = self.decision.decide(fs)\n        assert result.action == DecisionAction.APPROVE\n        assert result.fraud_score < 0.40\n\n    def test_incomplete_features_route_to_review(self):\n        \"\"\"Incomplete feature set should trigger review, not approve.\"\"\"\n        fs = FeatureSet(\n            request_id=\"test-003\", customer_id=\"c003\",\n            timestamp=datetime.utcnow(), groups={},\n            is_complete=False, warnings=[\"customer_risk missing\"],\n        )\n        result = self.decision.decide(fs)\n        assert result.action == DecisionAction.REVIEW\n        assert \"INSUFFICIENT_FEATURES\" in result.reason_codes\n\nclass TestFDEIntegration:\n    \"\"\"Integration tests: all three layers working together.\"\"\"\n\n    def test_full_pipeline_approve(self):\n        \"\"\"End-to-end: normal transaction should be approved.\"\"\"\n        service = build_test_service()  # uses test doubles for external systems\n        result  = service.process(build_normal_transaction())\n        assert result[\"action\"] == \"approved\"\n        assert result[\"fraud_score\"] < 0.40\n\n    def test_idempotency(self):\n        \"\"\"Same request_id processed twice should return same result.\"\"\"\n        service = build_test_service()\n        txn     = build_normal_transaction()\n\n        result1 = service.process(txn)\n        result2 = service.process(txn)  # same request_id\n\n        assert result1[\"action\"]         
== result2[\"action\"]\n        assert result1[\"audit_trail_id\"] == result2[\"audit_trail_id\"]\n        # Second call should be idempotent\n```\n\n## Part 9 — Deployment: Independent Layer Operations\n\nThe FDE architecture enables a deployment pattern that monolithic services can’t support: **layer-level canary releases**.\n\n```\n┌─────────────────────────────────────────────────────────────┐\n│                   CANARY DEPLOYMENT                         │\n│                                                             │\n│  Feature Layer v1.2  ──→  Decision Layer v3 (90%)  ──→  Exec│\n│                      └──→  Decision Layer v4 (10%)  ──→  Exec│\n│                                                             │\n│  This lets you A/B test a new model without                 │\n│  touching the Feature or Execution layers.                  │\n└─────────────────────────────────────────────────────────────┘\n```\n\n**Canary pattern for the Decision Layer:**\n\n``` python\nimport random\n\nclass CanaryDecisionLayer:\n    \"\"\"\n    Routes a fraction of traffic (candidate_pct, e.g. 0.10 = 10%) to a\n    candidate Decision Layer while keeping the rest on the stable version.\n    \"\"\"\n\n    def __init__(self, stable: DecisionLayer, candidate: DecisionLayer,\n                 candidate_pct: float = 0.10):\n        self.stable    = stable\n        self.candidate = candidate\n        self.pct       = candidate_pct\n\n    def decide(self, feature_set: FeatureSet) -> DecisionResult:\n        # Assumes DecisionResult exposes a mutable `metadata` dict\n        if random.random() < self.pct:\n            result = self.candidate.decide(feature_set)\n            result.metadata[\"canary\"] = True\n        else:\n            result = self.stable.decide(feature_set)\n            result.metadata[\"canary\"] = False\n        return result\n```\n\nShadow mode — run the new layer in parallel but don’t use its output — is the safest way to validate a new Decision Layer before any traffic switches:\n\n``` python\nclass ShadowDecisionLayer:\n    \"\"\"\n    Runs the shadow layer in 
parallel for comparison, but returns stable output.\n    Logs shadow vs stable divergence for analysis.\n    \"\"\"\n\n    def __init__(self, stable: DecisionLayer, shadow: DecisionLayer,\n                 metrics_client):\n        self.stable  = stable\n        self.shadow  = shadow\n        self.metrics = metrics_client\n\n    def decide(self, feature_set: FeatureSet) -> DecisionResult:\n        stable_result = self.stable.decide(feature_set)\n\n        # Run the shadow layer. This sketch calls it inline, so it adds latency;\n        # in production, dispatch it to a background worker so it never blocks.\n        try:\n            shadow_result = self.shadow.decide(feature_set)\n            agrees = stable_result.action == shadow_result.action\n            self.metrics.record(\"shadow_agreement\", int(agrees), tags={\n                \"stable_action\": stable_result.action.value,\n                \"shadow_action\": shadow_result.action.value,\n            })\n        except Exception:\n            # A shadow failure must never affect the stable decision\n            self.metrics.record(\"shadow_error\", 1)\n\n        return stable_result  # always return stable\n```\n\n## Part 10 — When to Apply FDE (and When Not To)\n\n**Apply FDE when:**\n\n- The system takes actions in downstream systems (rather than just returning predictions)\n- Multiple teams own different parts of the pipeline (data engineering owns Feature, ML owns Decision, platform engineering owns Execution)\n- Regulatory audit trails are required\n- You need to be able to change the model without changing the action logic\n- Latency budget allows for the overhead of layer boundaries (adds ~5–10ms for well-implemented serialisation)\n\n**Don’t apply FDE when:**\n\n- You’re building a pure prediction service that returns a score with no action\n- The system is small enough that a single team maintains the entire pipeline\n- Latency budget is so tight (< 10ms) that layer boundaries are prohibitive\n- The model and business logic are fundamentally inseparable (e.g., the business logic is just a thin wrapper on the model output)\n\nFDE is an architectural pattern for systems that 
act, not just systems that predict. If your system returns a score and a human or another system decides what to do with it, a simpler architecture is appropriate.\n\n## Part 11 — Exercises\n\n### Exercise 1: Add a Fourth Layer — Feedback\n\nA production FDE system needs a fourth layer: Feedback, which observes the outcomes of executed decisions and routes them back to the Feature Layer (to update online features) and the Decision Layer (to trigger model retraining).\n\nDesign and implement a `FeedbackLayer` class that:\n\n- Accepts feedback events (transaction disputed, fraud confirmed, challenge passed)\n- Updates the customer’s `velocity_1h` and `fraud_score_30d` in Redis\n- Logs feedback events to a stream for offline model retraining\n- Closes the loop: the next transaction for this customer uses updated features\n\n### Exercise 2: Rate Limiting in the Execution Layer\n\nThe Execution Layer should rate-limit decline actions per customer: if a customer has been declined 3 times in 10 minutes, the 4th request should be escalated to human review rather than auto-declined.\n\nAdd this logic to `ExecutionLayer._handle_decline()` using Redis counters with TTL. What should happen to the idempotency cache when a declined request is re-routed to review?\n\n### Exercise 3: Feature Layer Fallback Hierarchy\n\nThe Feature Layer currently falls back to a synthetic default when `spend_behaviour` is missing. 
Implement a tiered fallback hierarchy:\n\n- First: try the online Redis store\n- Second: try a warm cache (recent data from the last 24h, stored in a secondary Redis key)\n- Third: compute an approximate feature from the incoming transaction itself\n- Last resort: return `SYNTHETIC` with a global mean value\n\nAdd a `fallback_tier` field to `FeatureGroup` to track which tier served each feature.\n\n### Exercise 4: Explainability Endpoint\n\nAdd a `/explain/{request_id}` API endpoint to the FDE service that returns:\n\n- The feature values that were used for this decision\n- Which features were most influential (SHAP values from the GBM model)\n- Whether a hard rule or the model made the decision\n- The full audit trail for regulatory inquiry\n\n### Exercise 5: Multi-Model Decision Layer\n\nExtend the Decision Layer to support a model ensemble: GBM as the primary scorer, a logistic regression as a second check, and a rule-based anomaly detector. Implement a `VotingDecisionLayer` that:\n\n- Takes the maximum fraud score from all three\n- Requires two-out-of-three agreement before approving a transaction above $5,000\n- Logs which model(s) drove the decision\n\n## Summary\n\n| Layer | Responsibility | Input | Output |\n|---|---|---|---|\n| Feature | “What do we know?” | request_id, customer_id, transaction | `FeatureSet` |\n| Decision | “What should we do?” | `FeatureSet` | `DecisionResult` |\n| Execution | “Do it safely” | `DecisionResult`, transaction | `ExecutionResult` |\n\nThe FDE pattern doesn’t make ML easier. It makes production ML *maintainable* — which is harder and more important. The model you build today will be replaced in 18 months. The architecture you build today will outlast three model generations. 
Build it right.\n\n## Further Reading\n\n- [Designing ML Systems — Chip Huyen (2022)](https://www.oreilly.com/library/view/designing-machine-learning/9781098107956/): Chapter 7 covers serving patterns that align with FDE\n- [Feast Feature Store Documentation](https://docs.feast.dev/)\n- [Real-Time ML for Production — Made With ML](https://madewithml.com/courses/mlops/serving/)\n- [The ML Test Score — Breck et al., Google (2017)](https://research.google/pubs/the-ml-test-score-a-rubric-for-ml-production-readiness-and-technical-debt-reduction/)\n- [Rules of Machine Learning — Google](https://developers.google.com/machine-learning/guides/rules-of-ml)", "url": "https://wpnews.pro/news/fde-architecture-framework-build-production-ml-systems-that-don-t-break", "canonical_source": "https://superml.dev/fde-architecture-framework-ml-serving-tutorial", "published_at": "2026-06-20 01:38:19.517767+00:00", "updated_at": "2026-06-20 01:38:21.424479+00:00", "lang": "en", "topics": ["machine-learning", "mlops", "ai-infrastructure", "developer-tools"], "entities": ["FDE Architecture Framework", "Python", "REST APIs", "Flask"], "alternates": {"html": "https://wpnews.pro/news/fde-architecture-framework-build-production-ml-systems-that-don-t-break", "markdown": "https://wpnews.pro/news/fde-architecture-framework-build-production-ml-systems-that-don-t-break.md", "text": "https://wpnews.pro/news/fde-architecture-framework-build-production-ml-systems-that-don-t-break.txt", "jsonld": "https://wpnews.pro/news/fde-architecture-framework-build-production-ml-systems-that-don-t-break.jsonld"}}