{"slug": "introducing-kreuzcrawl-v0-3-0", "title": "Introducing kreuzcrawl v0.3.0", "summary": "Kreuzcrawl v0.3.0 ships with 14 language bindings, a tiered WAF-aware dispatch engine, and reduces peak streaming memory from ~2.5 GB to ~20 MB. The release also enables SSRF defense by default across all outbound calls and is the first API-stable version of the web crawling tool.", "body_md": "kreuzcrawl began as a Rust core with bindings for ten languages. v0.3.0 ships fourteen, adds a tiered WAF-aware dispatch engine, cuts peak streaming memory from ~2.5 GB to ~20 MB, and enables SSRF defense across every outbound call path by default. It is the first release we consider API-stable.\n\nThis post covers what changed, why each decision was made, and what the harder engineering problems looked like from the inside.\n\n| Area | v0.2.0 | v0.3.0 |\n|---|---|---|\n| Language bindings | 10 | 14 (+Dart, Kotlin/Android, Swift, Zig) |\n| Peak streaming memory | ~2.5 GB | ~20 MB |\n| SSRF protection | opt-in | on by default |\n| Dispatch model | static HTTP / bypass / browser | tiered, signal-driven escalation |\n| WAF fingerprints | — | 35 (7 vendors plus generic patterns) |\n| Fingerprint hot-reload | — | lock-free (`ArcSwap`), 500 ms debounce |\n| MCP tools | partial | 1:1 with CLI, safety-annotated |\n| CLI subcommands | scrape, crawl | + batch-scrape, batch-crawl, download, citations |\n| Robots / sitemap parsers | engine-internal | public modules |\n| API stability | preview | stable |\n\nv0.2.0 shipped Rust, Python, Node.js, Ruby, Go, Java, C#, PHP, Elixir, and WebAssembly.\n\nv0.3.0 adds **Dart**, **Kotlin/Android**, **Swift**, and **Zig**, bringing the total to fourteen.\n\nNone of the per-language glue is written by hand. Every binding is generated from the Rust core by [alef](https://github.com/xberg-io/alef), our polyglot binding generator.\n\nThe Dart and Kotlin/Android packages bind through the C FFI layer (`kreuzcrawl-ffi`) via `dart:ffi` and JNI respectively. 
Swift binds through clang. Zig uses `@cImport` against the same C header.\n\nThe generation pipeline was also hardened in this release: the Docker publish matrix now builds each architecture natively rather than via QEMU emulation, the Dart build no longer requires the Flutter SDK for pub.dev publishes, Swift artifactbundle checksums are injected automatically, and the Elixir/PHP/Ruby releases preserve their lock files through the source-publish step.\n\n=== \"Python\"\n\n``` sh\npip install kreuzcrawl\n```\n\n=== \"Node.js\"\n\n``` sh\nnpm install @xberg/kreuzcrawl\n```\n\n=== \"Rust\"\n\n``` sh\ncargo add kreuzcrawl\n```\n\n=== \"Go\"\n\n``` sh\ngo get github.com/xberg-io/kreuzcrawl/packages/go\n```\n\n=== \"Java\"\n\n``` xml\n<dependency>\n  <groupId>io.xberg.kreuzcrawl</groupId>\n  <artifactId>kreuzcrawl</artifactId>\n  <version>0.3.0</version>\n</dependency>\n```\n\n=== \"Kotlin (Android)\"\n\n``` groovy\nimplementation(\"io.xberg.kreuzcrawl.android:kreuzcrawl-android:0.3.0\")\n```\n\n=== \"C#\"\n\n``` sh\ndotnet add package Kreuzcrawl\n```\n\n=== \"Ruby\"\n\n``` sh\ngem install kreuzcrawl\n```\n\n=== \"PHP\"\n\n``` sh\ncomposer require xberg-io/kreuzcrawl\n```\n\n=== \"Elixir\"\n\n``` elixir\n{:kreuzcrawl, \"~> 0.3\"}\n```\n\n=== \"Dart\"\n\n``` sh\ndart pub add kreuzcrawl\n```\n\n=== \"Swift\"\n\n``` swift\n// Package.swift\n.package(url: \"https://github.com/xberg-io/kreuzcrawl\", from: \"0.3.0\")\n```\n\n=== \"Zig\"\n\n``` sh\nzig fetch --save https://github.com/xberg-io/kreuzcrawl/archive/v0.3.0.tar.gz\n```\n\n=== \"WebAssembly\"\n\n``` sh\nnpm install @xberg/kreuzcrawl-wasm\n```\n\n`crawl_stream()` and `batch_crawl_stream()` previously accumulated every page result in memory before the caller received any of them. On a large crawl (tens of thousands of pages, each carrying extracted text, metadata, links, and images) the peak working set reached approximately 2.5 GB.\n\nThe fix is a change in ownership: each page result is moved into `CrawlEvent::Page` and emitted immediately. The caller receives it, processes it, and drops it. 
The engine never holds more than the current in-flight pages, bounded by the concurrency setting.\n\n``` rust\n// The event type (unchanged externally; behavior changed internally)\npub enum CrawlEvent {\n    Page { result: Box<CrawlPageResult> }, // boxed result, moved to the caller on send\n    Error { url: String, error: String },\n    Complete { pages_crawled: usize },\n}\n```\n\n`CrawlPageResult` is boxed, moved into the variant, and dropped when the caller's loop moves past it. The engine holds no reference after the send.\n\n``` python\n# Python: pages are processed and released one at a time.\n# `engine` is an existing crawl engine and `process` is your own handler.\nfrom kreuzcrawl import crawl_stream\n\nasync def consume(engine):\n    async for event in crawl_stream(engine, \"https://example.com\"):\n        if event.type == \"page\":\n            process(event)  # the event can be released once the loop advances\n```\n\nPeak working set on a 10,000-page crawl with default concurrency (16): **~20 MB**.\n\nThe non-streaming `crawl()` is unchanged. It accumulates by contract, because callers need the complete `CrawlResult`. The two code paths are kept separate. Merging them would push the accumulation pattern onto callers, which is the same problem moved one level up.\n\n!!! tip \"Choosing between `crawl()` and `crawl_stream()`\"\n\nUse `crawl()` when you need the full result set in memory. Use `crawl_stream()` for large crawls, progress tracking, or when you process results one at a time. At scale the difference is roughly 2.5 GB versus 20 MB of peak memory.\n\nWeb crawlers take URLs as input and make HTTP requests — the exact primitive an attacker needs to reach internal services. Every path that accepts a URL now validates it against an `SsrfPolicy` before making the request: `scrape()`, `crawl()`, `batch_crawl()`, sitemap fetches, robots.txt fetches, asset downloads, and link enqueue.\n\nBy default, the policy denies these ranges and schemes:\n\n| Category | Ranges |\n|---|---|\n| Loopback | `127.0.0.0/8`, `::1/128` |\n| Private (RFC 1918) | `10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16` |\n| Link-local / cloud metadata | `169.254.0.0/16` (incl. 
`169.254.169.254`), `fe80::/10` |\n| Unspecified | `0.0.0.0/8` |\n| Multicast | `224.0.0.0/4`, `ff00::/8` |\n| IPv6 unique-local | `fc00::/7` |\n| Non-http(s) schemes | `file://`, `ftp://`, `gopher://`, … |\n\nChecking the hostname at validation time is insufficient. An attacker can register `evil.example.com`, serve a public IP at validation, then update DNS to point to `192.168.1.1` once the check passes.\n\nThe policy resolves every hostname via DNS and validates **all returned IP addresses**. If any resolved IP is in the deny list, the request is refused, regardless of what the others resolve to.\n\n``` rust\n// From kreuzcrawl/src/net/ssrf.rs\n// Resolve the host and collect every address DNS returns.\nlet addresses: Vec<IpAddr> = tokio::net::lookup_host(&lookup_addr).await?\n    .map(|addr| addr.ip())\n    .collect();\n\n// Refuse the request if any resolved address is denied by the policy.\nfor ip in &addresses {\n    if !is_ip_permitted(*ip, policy) {\n        return Err(SsrfError::DeniedByPolicy {\n            reason: classify_private_ip(*ip),\n        });\n    }\n}\n```\n\nThe `Location` header of each `30x` response is re-resolved and re-validated before the next hop is taken. This closes the redirect-chain attack: a public URL that redirects to `http://169.254.169.254/latest/meta-data/` is refused at the second hop. Redirect following is bounded by `SsrfPolicy::max_redirects` (default: 5).\n\nPrivate-network access can be re-enabled for the whole process or for a single config:\n\n``` sh\n# Environment variable — applies to every crawler in the process\nexport KREUZCRAWL_ALLOW_PRIVATE_NETWORK=1\n```\n\n``` rust\n// Per-config builder — applies to a single CrawlConfig\nCrawlConfig::builder()\n    .allow_private_networks(true)\n    .ssrf_allowlist_host(HostMatcher::Cidr(\"10.0.0.0/8\".into()))\n    .build()\n```\n\n!!! 
warning \"Wasm targets\"\n\nOn `wasm32`, SSRF checking is disabled — the browser's fetch API and same-origin policy are the enforcing boundary, and `tokio::net::lookup_host` is unavailable in that context.\n\nBefore v0.3.0, the dispatch decision was static: HTTP, or bypass-vendor, or browser — chosen at config time and fixed for the duration of the crawl. This had an obvious cost problem: routing every request through a bypass provider because 5% of pages are blocked is expensive.\n\nThe new engine chains tiers and escalates based on per-attempt signals.\n\n``` rust\npub enum Tier {\n    Http,    // plain HTTP fetch\n    Bypass,  // vendor-managed bypass (Zyte, ScrapingBee, Bright Data, …)\n    Browser, // headless Chrome via Chromiumoxide\n}\n\npub enum EscalationStrategy {\n    None,              // HTTP only; surface all failures\n    BrowserOnly,       // HTTP → Browser on block  ← default\n    BypassFirst,       // always use bypass (legacy behaviour)\n    BypassOnly,        // HTTP → Bypass on block; no browser\n    BypassThenBrowser, // HTTP → Bypass → Browser; maximum resilience\n}\n```\n\nAll dispatch enums are `#[non_exhaustive]`: new variants can be added without breaking downstream `match` expressions, which must already include a wildcard arm.\n\nDetecting a WAF challenge page requires inspecting both response headers and body.\n\nA naïve approach — one regex per fingerprint per response — scales as O(fingerprints × body_length). With 35 fingerprints that is 35 passes over every response body.\n\nAll body-pattern signals across all fingerprints are compiled into a **single Aho-Corasick automaton** at startup. 
One scan of the response body returns the set of matching pattern indices; each maps to a fingerprint via a flat `Vec<usize>`.\n\n``` rust\npub struct Rules {\n    fingerprints: Vec<Fingerprint>,\n    automaton: AhoCorasick,         // single automaton over all patterns\n    pattern_to_fp: Vec<usize>,      // AC pattern index → fingerprint index\n}\n```\n\nThe body scan is capped at **100 KB** (`CHALLENGE_BODY_LIMIT`). WAF challenge pages are small; real content pages overwhelmingly exceed this threshold. This bounds scan cost without missing signals.\n\nHeader signals are checked first (constant time per fingerprint). If a fingerprint fires on headers alone, the body scan is skipped entirely.\n\n**Current corpus:** 35 fingerprints across Cloudflare (10), DataDome (6), PerimeterX (5), Imperva (5), AWS WAF (4), F5 (2), Akamai (1), and generic corroborating patterns (2).\n\nThe fingerprint corpus is a TOML file (`rules/waf_fingerprints.toml`). In Kubernetes deployments, it is managed as a ConfigMap, so operators can update signatures without restarting the process.\n\nThe compiled `Rules` is wrapped in `arc_swap::ArcSwap`. `TomlClassifier::watch()` starts a filesystem watcher that atomically swaps the rule set when the file changes:\n\n``` rust\npub struct TomlClassifier {\n    rules: ArcSwap<Rules>,\n}\n\nimpl TomlClassifier {\n    pub fn watch(self: &Arc<Self>, path: impl AsRef<Path>) -> Result<WatchHandle, WatchError> {\n        watch::start_watch(Arc::clone(self), path.as_ref())\n    }\n}\n```\n\nEvents are debounced by 500 ms. This handles both editors that write via tmpfile+rename and the Kubernetes ConfigMap atomic projection mechanism, which produces the same sequence of filesystem events.\n\nThe engine tracks a block rate per domain using an exponentially weighted moving average (EWMA). 
High block rates promote the starting tier: a domain that has been blocking consistently starts at `Bypass` or `Browser` rather than always attempting `Http` first.\n\nThe `DomainStatePort` trait is injectable:\n\n``` rust\n#[async_trait]\npub trait DomainStatePort: Send + Sync + fmt::Debug {\n    async fn recommend(&self, domain: &str) -> DomainRecommendation;\n    async fn observe(&self, domain: &str, observation: &DomainObservation);\n}\n```\n\nThe default implementation (`EwmaDomainState`) is wired in automatically.\n\n[kreuzberg-cloud](https://github.com/xberg-io/kreuzberg-cloud) replaces it with a distributed store for cross-instance domain intelligence.\n\nA dispatch profile combines the strategy, retry policy, and classifier:\n\n``` rust\nuse std::sync::Arc;\nuse kreuzcrawl::{\n    CrawlConfig, DispatchProfile, EscalationStrategy,\n    SimpleRetryPolicy, TomlClassifier,\n};\n\nlet config = CrawlConfig::builder()\n    .dispatch(\n        DispatchProfile::builder()\n            .strategy(EscalationStrategy::BypassThenBrowser)\n            .retry_policy(Arc::new(SimpleRetryPolicy::new().with_max_retries(3)))\n            .waf_classifier(Arc::new(TomlClassifier::builtin()))\n            .build(),\n    )\n    .build();\n```\n\nThe MCP server now exposes tools 1:1 with the CLI — `scrape`, `batch_scrape`, `batch_crawl`, `download`, and `generate_citations`. Earlier releases had partial coverage; v0.3.0 closes the gap.\n\nEach tool declares three safety properties from the MCP spec:\n\n| Property | Value | Meaning |\n|---|---|---|\n| `read_only` | `true` | does not modify external state |\n| `destructive` | `false` | does not delete or overwrite anything |\n| `open_world` | `true` | makes network requests to caller-specified URLs |\n\n`open_world: true` is the meaningful one. MCP hosts can use it to apply additional sandboxing or prompt for confirmation before an agent makes outbound requests. 
The SSRF policy is the enforcement layer: a request to `http://169.254.169.254/` returns a `SsrfPolicyViolation` error before any network activity occurs.\n\nThe server runs in two modes depending on how it is invoked:\n\n- stdio: run as a subprocess with `kreuzcrawl mcp`.\n- HTTP at `/mcp`: used for service deployments. Enabled when the binary is built with `--features api,mcp`.\n\n``` sh\n# stdio mode (subprocess)\nkreuzcrawl mcp\n\n# HTTP mode\nkreuzcrawl serve  # exposes /mcp alongside the REST API\n```\n\nFour subcommands complete the CLI's 1:1 mapping with the core and MCP surfaces; the table also lists `version`:\n\n| Command | Description |\n|---|---|\n| `batch-scrape <urls…>` | Scrape multiple URLs concurrently, emit structured JSON |\n| `batch-crawl <urls…>` | Crawl from multiple seed URLs with shared concurrency budget |\n| `download <url>` | Fetch and save assets to disk (PDF, DOCX, images, …) |\n| `citations <url>` | Extract structured citations and references from a page |\n| `version` | Print version and build metadata |\n\n``` sh\n# Crawl two seeds, output Markdown\nkreuzcrawl batch-crawl \\\n  https://docs.example.com \\\n  https://blog.example.com \\\n  --depth 3 \\\n  --format markdown\n```\n\n`kreuzcrawl::robots` and `kreuzcrawl::sitemap` are now public modules, usable without constructing a crawl engine:\n\n``` rust\nuse kreuzcrawl::robots::{parse_robots_txt, is_path_allowed};\nuse kreuzcrawl::sitemap::{parse_sitemap_xml, parse_sitemap_index};\n\n// Standalone robots.txt check — both functions are infallible\nlet rules = parse_robots_txt(robots_body, \"Googlebot\");\nlet allowed = is_path_allowed(\"/private/\", &rules);\n\n// Standalone sitemap parse — infallible\nlet urls = parse_sitemap_xml(sitemap_body);\n\n// Sitemap index (points to child sitemaps)\nlet index = parse_sitemap_index(index_body);\n```\n\nThis is useful for compliance tooling, link-graph builders, and crawl planners that need to evaluate `robots.txt` access rules or enumerate URLs from a sitemap without running a full crawl.\n\n`BrowserPool`, 
`BrowserPoolConfig`, `NativeBrowserExecutor`, and `NativeBrowserExecutorConfig` are now public. Callers that run many crawls against the same targets can construct and warm a pool once and reuse it:\n\n``` rust\nuse kreuzcrawl::{BrowserPool, BrowserPoolConfig, CrawlEngineBuilder};\n\nlet pool = BrowserPool::new(BrowserPoolConfig::default()); // sync, returns Arc<BrowserPool>\npool.warm().await?; // pre-open Chrome tabs up to pool capacity\n\nlet engine = CrawlEngineBuilder::new(config)\n    .with_browser_pool(pool)\n    .build()\n    .await?;\n```\n\nWithout pool injection, each engine creates and tears down its own Chrome instance. With pool injection, the browser process persists across crawl jobs, which is useful when you are running many short crawls in a tight loop.\n\nv0.3.0 adds two OpenTelemetry counters:\n\n| Counter | Description |\n|---|---|\n| `crawl_waf_blocks_total` | Number of times a WAF fingerprint fired, labeled by vendor |\n| `crawl_backend_escalations_total` | Number of tier escalations, labeled by source and target tier |\n\nThese are emitted unconditionally via `opentelemetry::global`; no feature gate is required. Consumers that do not configure an OTel exporter incur no overhead beyond the counter increment.\n\nThe WAF subsystem also gained property-based tests, `cargo-fuzz` targets covering the TOML corpus loader and Aho-Corasick automaton, and Criterion benchmarks measuring classification throughput at scale.\n\nThis is the first release kreuzcrawl declares stable. The commitments:\n\n- The `kreuzcrawl` crate public surface.\n- The C FFI layer (`kreuzcrawl-ffi`) is stable at MAJOR.MINOR. Struct layouts are frozen at MAJOR.MINOR boundaries.\n- Public enums (`EscalationStrategy`, `EscalationReason`, `Tier`, `CrawlError`, `NetworkErrorKind`) are `#[non_exhaustive]`. New variants are non-breaking; callers outside the crate must include wildcard arms.\n- A binding at `0.3.x` targets a Rust core at `0.3.x`.\n\nThe public API changes are largely additive. 
Two changes require attention:\n\n**`CrawlError::WafBlocked` is now a struct variant.** The previous unit variant becomes `CrawlError::WafBlocked { vendor, message }`. Match arms that name it need updating:\n\n``` rust\n// Before\nCrawlError::WafBlocked => { /* handle */ }\n\n// After\nCrawlError::WafBlocked { vendor, message } => {\n    eprintln!(\"blocked by {vendor}: {message}\");\n}\n```\n\n**`SimpleRetryPolicy` retry count is now exact.** The previous implementation had an off-by-one: `max_retries=3` produced 2 retries. The API also changed: `new()` now takes no arguments (defaults to 3 retries); use `.with_max_retries(n)` to override. Update call sites that were compensating for the off-by-one or passing a count to `new()`.\n\nv0.3.0 stabilises the core surface. The areas we are actively working on:\n\n- **`HostMatcher` in language bindings.** The `allowlist` field on `SsrfPolicy` is currently `#[alef(skip)]` because the untagged-enum FFI representation is not yet finalized. Expect it in a 0.3.x patch once the tagged-enum form is decided.\n- The `ProxyProvider` trait and `StaticProxyProvider` are public in this release; per-request proxy selection and rotation are landing in 0.3.x.", "url": "https://wpnews.pro/news/introducing-kreuzcrawl-v0-3-0", "canonical_source": "https://dev.to/kreuzberg/introducing-kreuzcrawl-v030-8di", "published_at": "2026-06-25 09:52:08+00:00", "updated_at": "2026-06-25 10:13:29.031934+00:00", "lang": "en", "topics": ["developer-tools", "ai-infrastructure", "large-language-models"], "entities": ["Kreuzcrawl", "xberg-io", "alef", "Dart", "Kotlin", "Swift", "Zig", "WebAssembly"], "alternates": {"html": "https://wpnews.pro/news/introducing-kreuzcrawl-v0-3-0", "markdown": "https://wpnews.pro/news/introducing-kreuzcrawl-v0-3-0.md", "text": "https://wpnews.pro/news/introducing-kreuzcrawl-v0-3-0.txt", "jsonld": "https://wpnews.pro/news/introducing-kreuzcrawl-v0-3-0.jsonld"}}