{"slug": "introducing-cross-repository-ci-relay-scalable-ci-for-pytorchs-out-of-tree", "title": "Introducing Cross-Repository CI Relay: Scalable CI for PyTorch’s Out-of-Tree Backends", "summary": "PyTorch introduced the Cross-Repository CI Relay (CRCR), an automated pipeline that triggers and tracks CI in downstream repositories whenever changes are made to pytorch/pytorch. Results are displayed on the PyTorch CI HUD, giving maintainers a unified dashboard for both in-tree and cross-repository CI health. The system supports tiered participation levels, from notification-only to blocking checks, to close the coordination gap between upstream and downstream projects.", "body_md": "### TL;DR\n\nPyTorch now has a Cross-Repository CI Relay (CRCR) that automatically triggers and tracks CI in downstream repositories whenever a PR is opened or a commit is pushed against `pytorch/pytorch`. Results flow back to the [PyTorch CI HUD](https://hud.pytorch.org/crcr), giving maintainers a single dashboard for both in-tree and cross-repository CI health, without requiring downstream repos to build custom integrations.\n\n## The Problem: Blind Spots in PyTorch CI\n\nPyTorch sits at the center of a large ecosystem. Hardware backends like Intel XPU, AMD ROCm, Apple MPS, and Qualcomm AI Engine maintain their own repositories with custom operator implementations and kernels. Ecosystem projects like vLLM, SGLang, and Hugging Face Transformers depend on PyTorch as a foundational library. 
Even within PyTorch, some backends are partially in-tree (like CPU architecture-specific code) and others rely on CI that runs outside the main test suite.\n\nPyTorch has a mature upstream CI that does extensive testing, including merge-blocking checks, label-driven workflows, and re-runnable PR checks, but it runs only on `pytorch/pytorch`.\n\nUntil now, these downstream repositories had no standard way to:\n\n- **Know when to test**: Maintainers relied on polling or manual triggers to discover upstream changes that might break their code.\n- **Report results back**: Even when downstream repos ran CI against upstream PRs, the results lived in separate dashboards, invisible to PyTorch core reviewers and the community.\n- **Correlate failures**: A PyTorch PR that breaks three downstream projects produced three independent failure signals with no unified view.\n\nThis created a coordination gap: PyTorch maintainers couldn’t see downstream breakage and so couldn’t make an informed decision before merging, and downstream teams couldn’t easily signal regressions to upstream.\n\n## The Solution: Cross-Repository CI Relay\n\nThe Cross-Repository CI Relay (CRCR) closes this gap with a fully automated pipeline that connects upstream PyTorch events to downstream CI and routes status and results back to the PyTorch CI HUD.\n\n**How CRCR works**: A PyTorch PR triggers the relay. The relay dispatches to all registered downstream repos. Those repos run workflows in their CI and report the status and results back via an authenticated callback. The results appear on the PyTorch HUD within seconds.\n\n### Participation Levels\n\nThe relay uses a tiered allowlist to support incremental onboarding and differentiated access. 
Downstream repos progress through four levels as they mature:\n\n| Level | Dispatch | Callback to HUD | Description |\n|---|---|---|---|\n| L1 | Yes | No | Receive dispatches only (notification tier) |\n| L2 | Yes | Yes | Full pipeline: dispatch + HUD reporting |\n| L3 | Yes | Yes | Adds a non-blocking check run on the upstream PR (triggered via a CRCR-specific label) |\n| L4 | Yes | Yes | Adds a blocking check run on the upstream PR (auto-triggered for every PR); reserved for critical downstream projects |\n\n**Use cases by level**:\n\n- **L1 (Notify)**: A new project starting integration. It receives dispatch events to trigger CI but doesn’t report back yet. This is the entry point, useful for validating that the dispatch pipeline works before committing to full reporting.\n- **L2 (Report)**: A downstream project running compatibility checks against upstream PRs. Results appear on the CRCR HUD dashboard, giving PyTorch maintainers visibility into downstream health.\n- **L3 (Signal)**: A mature project that wants upstream PR reviewers to see its CI status directly in the GitHub PR checks tab. The check run is non-blocking: it is informational and doesn’t prevent merging.\n- **L4 (Gate)**: A critical downstream project where upstream breakage has production impact. The check run blocks merging if it fails. Requires on-call contacts for triage.\n\nL3 and L4 are reserved for future capabilities as the system matures and the community defines promotion criteria. 
The requirements for advancing to each level are detailed in [RFC-0050](https://github.com/pytorch/rfcs/blob/master/RFC-0050-Cross-Repository-CI-Relay-for-PyTorch-Out-of-Tree-Backends.md).\n\n## End-to-End Architecture\n\n*Figure 1: End-to-end data flow from a PyTorch PR through the relay to the CI HUD.*\n\n| Component | Technology | Purpose |\n|---|---|---|\n| Webhook Lambda | Python 3.12, AWS Lambda | Receive GitHub webhooks, fan out dispatches |\n| Callback Lambda | Python 3.12, AWS Lambda | Verify OIDC, enforce state machine, forward to HUD |\n| Redis | Amazon ElastiCache (TLS) | State machine, allowlist cache, rate limiting |\n| HUD API | Next.js on Vercel | Receive relay data, write to DynamoDB |\n| DynamoDB | `torchci-oot-workflow-job` table | Primary storage for downstream CI records |\n| ClickHouse | `default.oot_workflow_job` table | Analytics queries for HUD frontend |\n| Replicator | AWS Lambda (DynamoDB Streams) | Real-time DynamoDB → ClickHouse replication |\n| Callback Action | Composite GitHub Action | OIDC minting + payload delivery for downstream repos |\n\nThe system has five stages:\n\n**Stage 1: Webhook Reception**. When a pull request or push event occurs on `pytorch/pytorch`, GitHub sends a signed webhook to the Webhook Lambda. The Lambda verifies the HMAC-SHA256 signature, confirms the event originates from the upstream repository, and loads the tiered allowlist.\n\n**Stage 2: Fan-Out Dispatch**. The Lambda sends a `repository_dispatch` event to every allowlisted downstream repository in parallel, carrying the full PR context (SHA, PR number, action). A `DISPATCHED` state is recorded in Redis with a timestamp, which later enables queue-time measurement.\n\n**Stage 3: Downstream CI Execution**. Each downstream repository runs its CI workflow, building against the PR’s commit SHA and executing its test suite. 
The workflow uses a composite GitHub Action ([cross-repo-ci-relay-callback](https://github.com/pytorch/test-infra/tree/main/.github/actions/cross-repo-ci-relay-callback)) to report status at two points: `in_progress` when the job starts, and `completed` with the conclusion when it finishes.\n\n**Stage 4: Authenticated Callback**. Each callback carries a GitHub OIDC token, a short-lived RS256 JWT that cryptographically proves which repository sent it. The Callback Lambda verifies this token, checks the allowlist, enforces rate limits, validates the state machine, and forwards the result to HUD.\n\n**Stage 5: HUD Persistence and Display**. The HUD API writes the record to DynamoDB. A DynamoDB Stream triggers the ClickHouse replicator, which lands the data in ClickHouse for the frontend to query. The result appears on `hud.pytorch.org/crcr` within seconds.\n\nAll secrets (GitHub App keys, HUD bot token, Redis credentials) are stored in AWS Secrets Manager. Redis connections use TLS in the Lambda runtime. The Webhook and Callback Lambdas run in a dedicated PyTorch-owned AWS account, separate from the HUD’s Vercel-linked AWS account.\n\n## What Downstream Repos Need to Do\n\nThe integration burden on downstream repos is minimal. A repository needs to:\n\n- Be added to the [allowlist](https://github.com/pytorch/pytorch/blob/main/.github/allowlist.yml) (a one-line YAML entry).\n- Add a workflow file that listens to `repository_dispatch` events.\n
- Use the composite callback action to report `in_progress` and `completed`.\n\nHere’s a minimal downstream workflow:\n\n```\nname: PyTorch CI\non:\n  repository_dispatch:\n    types: [pull_request]\n\npermissions:\n  # Lets the callback action mint a GitHub OIDC token\n  id-token: write\n\njobs:\n  test:\n    runs-on: ubuntu-latest\n    steps:\n      # Report that the job has started\n      - uses: pytorch/test-infra/.github/actions/cross-repo-ci-relay-callback@main\n        with:\n          status: in_progress\n\n      - uses: actions/checkout@v4\n      - name: Run tests\n        id: tests\n        run: |\n          # Your build and test logic here\n          echo \"outcome=success\" >> \"$GITHUB_OUTPUT\"\n\n      # Report the final result, even if an earlier step failed\n      - if: always()\n        uses: pytorch/test-infra/.github/actions/cross-repo-ci-relay-callback@main\n        with:\n          status: completed\n          conclusion: ${{ steps.tests.outputs.outcome }}\n          test-results: '{\"passed\": 42, \"failed\": 0, \"skipped\": 3}'\n```\n\nThere are no callback URLs to configure, no secrets to manage, and no custom authentication code to write. The action handles OIDC token minting, payload construction, and delivery.\n\nFor a step-by-step onboarding guide and a working reference implementation, see the [CRCR onboarding documentation](https://github.com/pytorch/crcr-test).\n\n## Security Model\n\nAccepting CI results from external repositories into PyTorch’s infrastructure requires careful security design. The relay enforces five stages of validation before any data reaches the HUD.\n\n*Figure 2: Every callback passes through five security layers before reaching the HUD.*\n\n### Security Stage 1: OIDC Identity Verification\n\nThe callback action mints a GitHub OIDC token with audience `pytorch-cross-repo-ci-relay`. The Callback Lambda verifies this token against GitHub’s public JWKS endpoint using RS256. 
The `repository` claim in the token is cryptographically bound to the calling repository: a workflow in `org/repo-a` cannot produce a token claiming to be `org/repo-b`.\n\nThis is the foundation of the trust model: **the relay never trusts self-reported identity**. The OIDC-verified `verified_repo` is the single source of truth.\n\n### Security Stage 2: Allowlist Authorization\n\nThe allowlist is a YAML file in `pytorch/pytorch`, controlled by PyTorch maintainers. Repositories must be explicitly listed at **L2 or higher** to have their callback results accepted. The allowlist supports four tiers (L1–L4), enabling differentiated access as the system evolves.\n\n```\nL1:\n  - org/backend-dispatch-only\nL2:\n  - org/backend-with-hud-reporting\nL4:\n  - org/trusted-backend: oncall1, oncall2\n```\n\n### Security Stage 3: Rate Limiting\n\nA per-repository sliding-window rate limiter prevents any single downstream repo from flooding the pipeline. The rate limiter is **fail-closed**: if Redis is unreachable, callbacks are rejected (HTTP 500) rather than allowed through unchecked.\n\n### Security Stage 4: State Machine Validation\n\nThe relay maintains a three-state lifecycle in Redis for every dispatch:\n\n*Figure 3: Valid and invalid state transitions enforced by the relay.*\n\nThe state machine guarantees:\n\n- **No callbacks without dispatch**: A downstream repo cannot inject results for a PR event that was never relayed.\n- **No duplicates**: Replaying the same `in_progress` or `completed` callback is rejected.\n- **No skipped states**: `COMPLETED` without a prior `IN_PROGRESS` is rejected.\n- **Per-job tracking**: Each job execution gets its own `check_run_id` (GitHub-assigned, not controllable by the workflow), supporting multi-job workflows.\n\n### Security Stage 5: Data Separation\n\nThe relay forwards data to HUD in two explicit namespaces:\n\n```\n{\n  \"trusted\": {\n    \"verified_repo\": \"org/backend-repo\",\n    \"downstream_repo_level\": 
\"L2\",\n    \"ci_metrics\": { \"queue_time\": 1.23, \"execution_time\": 45.6 }\n  },\n  \"untrusted\": {\n    \"callback_payload\": {\n      \"workflow\": {\n        \"status\": \"completed\",\n        \"conclusion\": \"success\",\n        \"name\": \"CI\",\n        \"test_results\": { \"passed\": 42, \"failed\": 0, \"skipped\": 3 }\n      }\n    }\n  }\n}\n```\n\nThe `trusted` block contains relay-generated fields that HUD can rely on. The `untrusted` block contains the downstream’s self-reported data. HUD uses `trusted.verified_repo` for attribution and displays `untrusted.callback_payload.workflow` as informational data.\n\n## Known Limitations\n\nCRCR authenticates identity, not correctness. Each downstream repository is responsible for the accuracy of its own callbacks.\n\nA compromised maintainer of an allowlisted repo can forge conclusion values in future callbacks, for example by reporting `success` for a failing CI run. The impact is limited to **displaying incorrect data on the HUD**; this cannot affect PyTorch’s build infrastructure, inject code into upstream, or influence merge decisions (unless the repo is at a future gating tier).\n\nMitigations:\n\n- The `verified_repo` field always identifies the true caller (OIDC-guaranteed).\n- Misbehaviour is observable in HUD data and CloudWatch logs.\n- The offending repo can be removed from the allowlist, immediately revoking access.\n\nCross-validating reported conclusions against the GitHub Check Runs API is a planned future enhancement.\n\n## CI Metrics: Queue Time and Execution Time\n\nThe relay computes two timing metrics from its state machine timestamps:\n\n- **Queue time**: Time between `DISPATCHED` (webhook sends `repository_dispatch`) and `IN_PROGRESS` (downstream job starts). This measures GitHub Actions queue delays.\n- **Execution time**: Time between `IN_PROGRESS` and `COMPLETED`. 
This measures actual CI execution duration.\n\nThese metrics are forwarded to HUD in the `trusted.ci_metrics` block and displayed on the dashboard, giving infrastructure teams visibility into both platform-level queuing and per-backend test performance.\n\n## The HUD Dashboard\n\nThe CRCR results are displayed on the PyTorch CI HUD at [hud.pytorch.org/crcr](https://hud.pytorch.org/crcr):\n\n- **Summary page** (`/crcr`): Aggregated pass rates, job counts, and average execution times across all backends over the last 14 days.\n- **Backend dashboard** (`/crcr/{org}/{repo}`): Per-PR matrix view showing individual job results, test counts, and execution times for a specific backend.\n\nThe dashboard uses the same ClickHouse-backed query infrastructure as the rest of the PyTorch CI HUD, ensuring consistent performance and familiar UX for maintainers.\n\n## Getting Started\n\nIf you maintain an out-of-tree PyTorch backend and want to integrate with CRCR:\n\n- **Request allowlist addition**: Open a PR to add your repository to [pytorch/pytorch/.github/allowlist.yml](https://github.com/pytorch/pytorch/blob/main/.github/allowlist.yml).\n- **Add the workflow**: Copy the minimal workflow template above into your repository’s `.github/workflows/` directory.\n- **Ensure permissions**: The workflow must declare `permissions: id-token: write` for OIDC token minting.\n- **Verify on HUD**: Once your first dispatch completes, check [hud.pytorch.org/crcr](https://hud.pytorch.org/crcr) for your results.\n\nFor detailed documentation, see the [CRCR README](https://docs.pytorch.org/docs/main/accelerator/ci.html) and [RFC-0050](https://github.com/pytorch/rfcs/blob/master/RFC-0050-Cross-Repository-CI-Relay-for-PyTorch-Out-of-Tree-Backends.md).\n\n## What’s Next\n\n- **Stale job cleanup**: A scheduled job to detect and mark `in_progress` records that never received a completion callback.\n- **Upstream merge gating**: Using L3/L4 tier results to optionally block upstream merges when critical 
downstream backends are broken.\n- **Push event support**: Extending dispatch beyond pull request events to cover post-merge CI on the main branch.\n\n## Acknowledgements\n\nCRCR was designed and implemented as a collaboration between the PyTorch Foundation infrastructure team and the Linux Foundation. We’d like to thank the reviewers and infrastructure engineers who helped shape the system: @malfet, @albanD, @atalman, @jathu, @zxiiro, @fffrog, @KarhouTam and @can-gaa-hou for architectural design and development, infrastructure deployment and operational support.\n\n*For questions or feedback, reach out on the PyTorch Dev Infra discussion forum or file an issue in pytorch/test-infra. You can also reach us in the #crcr Slack channel.*", "url": "https://wpnews.pro/news/introducing-cross-repository-ci-relay-scalable-ci-for-pytorchs-out-of-tree", "canonical_source": "https://pytorch.org/blog/introducing-cross-repository-ci-relay-scalable-ci-for-pytorchs-out-of-tree-backends/", "published_at": "2026-06-29 14:18:42+00:00", "updated_at": "2026-06-29 14:31:38.508879+00:00", "lang": "en", "topics": ["developer-tools", "ai-infrastructure", "machine-learning"], "entities": ["PyTorch", "Intel XPU", "AMD ROCm", "Apple MPS", "Qualcomm AI Engine", "vLLM", "SGLang", "Hugging Face Transformers"], "alternates": {"html": "https://wpnews.pro/news/introducing-cross-repository-ci-relay-scalable-ci-for-pytorchs-out-of-tree", "markdown": "https://wpnews.pro/news/introducing-cross-repository-ci-relay-scalable-ci-for-pytorchs-out-of-tree.md", "text": "https://wpnews.pro/news/introducing-cross-repository-ci-relay-scalable-ci-for-pytorchs-out-of-tree.txt", "jsonld": "https://wpnews.pro/news/introducing-cross-repository-ci-relay-scalable-ci-for-pytorchs-out-of-tree.jsonld"}}