{"slug": "i-tested-every-web-scraping-tool-against-lazada-here-s-what-actually-works-may", "title": "I Tested Every Web Scraping Tool Against Lazada — Here's What Actually Works (May 2026)", "summary": "A developer tested the open-source Python framework Scrapling against Lazada Singapore, a production site protected by Google reCAPTCHA and custom slider verification, using only a single 4GB VPS with no residential proxies. The framework's three-tier fetcher system—HTTP, browser, and stealth—allowed the developer to bypass anti-bot measures while keeping memory usage under control, though the initial decision to skip the Camoufox browser due to RAM concerns later proved to be a mistake. The built-in MCP server provided 14 tools for AI agent integration, with session management being critical to avoid out-of-memory crashes during concurrent scraping operations.", "body_md": "I came across [Scrapling](https://github.com/D4Vinci/Scrapling) through a recommendation on X and decided to put it through its paces, not against a demo page but against Lazada Singapore, a production site protected by Google reCAPTCHA and a custom slider verification. The setup: a single 4GB VPS, no residential proxies, no credits, just open-source tools.\n\nHere's the full journey: installation pitfalls, wiring it into an AI agent, choosing the right browser for the job, and the real-world benchmarks that followed.\n\nScrapling is an adaptive web scraping framework for Python (BSD-3, v0.4.8). It handles everything from single HTTP requests to full-scale concurrent crawls. What sets it apart from the BeautifulSoup/Scrapy world:\n\nThree fetcher tiers: HTTP (`Fetcher`, curl_cffi), browser (`DynamicFetcher`, Playwright Chromium), and stealth (`StealthyFetcher`, Chromium + anti-bot patches). 
Swap with one line.\n\nBuilt-in MCP tools: an AI agent can call `mcp_scrapling_get`, `mcp_scrapling_fetch`, and `mcp_scrapling_stealthy_fetch` directly.\n\nIt's open source, pip-installable, and designed to be the backbone of a scraping stack, not just another tool in the toolbox.\n\nThis is where the real story starts. The VPS has 4GB RAM, 2 vCPUs, 77GB disk, and runs an AI agent gateway (615MB baseline). Every browser installation decision matters.\n\n```\npip install \"scrapling[fetchers,ai]\"   # HTTP + Chromium + MCP server (quoted so the shell leaves the brackets alone)\nscrapling install                     # Downloads Playwright browsers\n```\n\nThis pulls in Playwright Chromium, Firefox, and WebKit (~1.3GB disk), plus `curl_cffi` for HTTP requests and `patchright` (a Playwright fork) for browser automation.\n\n**Camoufox.** Every discussion about Scrapling seems to mention a GitHub thread where someone's VPS hit 1.4GB of RAM running Camoufox. That was enough to scare me off: on a 4GB machine, 1.4GB for one browser is a non-starter. So I skipped it and let Scrapling's StealthyFetcher fall back to Chromium.\n\nTurns out this was the wrong call. More on that later.\n\n```python\nfrom scrapling.fetchers import Fetcher\n\n# Plain HTTP fetch: no browser process is started\npage = Fetcher.get('https://quotes.toscrape.com/', timeout=15)\n# Pull the text of every quote on the page\nquotes = page.css('.quote .text::text').getall()\n# 0.88s, 200 OK, 10 quotes parsed\n# Memory: 56MB RSS\n```\n\nClean. Fast. No browser needed. The HTTP fetcher uses `curl_cffi` with TLS fingerprint impersonation, so it looks like Chrome to the server without paying the RAM cost of a browser.\n\nScrapling ships with a built-in MCP (Model Context Protocol) server. 
Start it with `scrapling mcp` and your AI coding agent gets 14 native tools, including:\n\n| Tool | What it does |\n|---|---|\n| `get` / `bulk_get` | HTTP fetch with CSS selector extraction |\n| `fetch` / `bulk_fetch` | Browser fetch with JS rendering |\n| `stealthy_fetch` / `bulk_stealthy_fetch` | Anti-bot browser fetch |\n| `open_session` / `close_session` / `list_sessions` | Persistent browser management |\n| `screenshot` | Full-page PNG/JPEG capture |\n\nThe key advantage: CSS selector support means the agent extracts only relevant elements instead of dumping entire pages into context. Token savings compound fast.\n\nThe MCP server's session tools aren't optional. They're the difference between stable and catastrophic:\n\n```\n# Pseudocode: open_session / fetch / close_session stand for MCP tool calls\n\n# ❌ Don't do this in a loop\nfor url in urls:\n    page = StealthyFetcher.fetch(url)  # New browser every time\n\n# ✅ Do this instead\nsession_id = open_session(type=\"dynamic\")\nfor url in urls:\n    page = fetch(url, session_id=session_id)  # Reuses same browser\nclose_session(session_id)\n```\n\nOne browser, reused. Without sessions, each one-shot fetch spawns a new Chromium process. After 5+ calls, memory pressure spikes. After 20+, you're in OOM territory.\n\nScrapling's three fetchers form a natural escalation ladder:\n\n| Tier | Fetcher | Engine | Best for |\n|---|---|---|---|\n| 1 | `Fetcher` | curl_cffi (HTTP) | Static pages, APIs |\n| 2 | `DynamicFetcher` | Playwright Chromium | JS-rendered SPAs |\n| 3 | `StealthyFetcher` | Chromium + anti-bot patches | Cloudflare, bot detection |\n\nSame API across all three. Same CSS selectors. Same response object. You're not choosing between different libraries. You're choosing how much overhead to pay.\n\nBut the real question is: **do you need a browser at all?** Let's benchmark.\n\n| Fetcher | Avg time per page | vs Fastest |\n|---|---|---|\n| `Fetcher` (HTTP) | 0.77s | 1× |\n| `DynamicFetcher` (Chromium) | 3.66s | 4.8× |\n| `StealthyFetcher` | ~4s | 5.2× |\n\nThe HTTP fetcher is absurdly fast. 
Browser-based tools add roughly 3 seconds of overhead *per page*. That gap compounds: 10 pages is about 8s vs 37 to 40s. 100 pages is 77s vs roughly 6 to 7 minutes.\n\n| Fetcher | RAM Delta |\n|---|---|\n| `Fetcher` (HTTP) | ~0 MB |\n| `StealthyFetcher` | +120 MB |\n| `DynamicFetcher` | +180 MB |\n\nThe rule is simple: **start at tier 1 and only escalate when proven necessary.** A static page doesn't need a browser. A JS-rendered page doesn't automatically need stealth. A page with anti-bot protection doesn't automatically need a different IP. Prove each escalation before taking it.\n\nRemember how I skipped Camoufox because of that 1.4GB horror story? After getting the stack running, I decided to test it properly.\n\n```\npip install camoufox\npython -m camoufox fetch  # Downloads the browser binary (~713MB)\n```\n\n**Camoufox is actually the lightest browser.** Measured on my VPS:\n\n| Browser | RAM (headless) | Stealth Level |\n|---|---|---|\n| Camoufox (Firefox) | 81 MB | C++-level |\n| Scrapling StealthyFetcher (Chromium) | 120 MB | JS-patched |\n| Scrapling DynamicFetcher (Chromium) | 180 MB | None |\n\nThe 1.4GB from that GitHub thread was user error: spawning a fresh browser per request without closing old ones. The same thing happens with any browser. Camoufox is a debloated Firefox fork: telemetry stripped, Mozilla services removed, `navigator.webdriver` genuinely absent at the C++ level.\n\n**But there's a catch:** Scrapling's StealthyFetcher uses `patchright` (a Playwright Chromium fork) and does NOT auto-detect Camoufox. 
They don't integrate at the browser level because Playwright's Firefox protocol differs from Chromium's.\n\nThe workaround is straightforward:\n\n``` python\nfrom camoufox import Camoufox\nfrom scrapling import Selector\n\n# Camoufox: stealth browsing with Firefox fingerprint (81MB)\nwith Camoufox(headless=True) as browser:\n    page = browser.new_page()\n    page.goto('https://target.com')\n    html = page.content()\n\n# Scrapling: adaptive parsing with CSS/XPath\nsel = Selector(html)\ndata = sel.css('.product::text').getall()\n```\n\nCamoufox fetches undetected. Scrapling parses with adaptive resilience. Best of both worlds — but it's slow. More on that next.\n\n| Browser | Avg Page Load |\n|---|---|\n| Scrapling DynamicFetcher (Chromium) | 3.66s |\n| Camoufox (Firefox) | 8.84s |\n\n11× slower than the HTTP fetcher, 2.4× slower than Chromium. Firefox on Linux pays a cold-start tax. Camoufox earns its place at tier 5 in the ladder — not a replacement for Chromium, but a fallback when Chromium's fingerprint is the problem.\n\nAll of this — the speed data, the memory measurements, the Camoufox discovery — points to one design:\n\n```\nPriority 1:  Fetcher (HTTP)              0.77s   ~0 MB    Static pages\n   ↓ page is empty / JS-rendered?\nPriority 3:  DynamicFetcher (Chromium)    3.66s   180 MB   JS-rendered SPAs\n   ↓ blocked by anti-bot?\nPriority 4:  StealthyFetcher (Chromium)   ~4s     120 MB   Cloudflare, basic WAF\n   ↓ Chromium itself blocked?\nPriority 5:  Camoufox (Firefox)           8.84s    81 MB   Firefox fingerprint\n   ↓ CAPTCHA / aggressive WAF?\nPriority 6:  Firecrawl enhanced proxy     ~3-5s    credits Hard targets\n```\n\nEach tier costs more — time or money. Only escalate when proven necessary. The ladder is encoded as an agent skill, so every scraping task automatically starts at tier 1 and escalates on failure.\n\nLazada SG was the proving ground. Two-layer defense: Google reCAPTCHA → custom slider verification. 
In a previous test (early May 2026), only Lightpanda's Zig-based browser survived. Every Chromium tool got blocked.\n\nRunning the ladder:\n\n| Priority | Tool | Page 1 | Page 2 | Page 3 | Time |\n|---|---|---|---|---|---|\n| 1 | HTTP Fetcher | ❌ Empty | — | — | 0.77s |\n| 3 | DynamicFetcher | ✅ 41 items | ✅ 41 items | ✅ 41 items | ~3s/page |\n| 5 | Camoufox | ✅ 40 items | — | — | 42s/page |\n\nThe ladder worked as designed: the HTTP fetcher came back empty, DynamicFetcher returned 41 items on each of three pages at about 3s per page, and Camoufox returned 40 items on page 1 at 42s per page.\n\nThe ladder saved me from jumping straight to Camoufox or paying Firecrawl credits when a simple Chromium browser handled everything.\n\n```\nPriority 1:  Scrapling Fetcher (HTTP)      0.77s   $0\nPriority 3:  Scrapling DynamicFetcher       3.66s   $0\nPriority 4:  Scrapling StealthyFetcher      ~4s     $0\nPriority 5:  Camoufox + Scrapling Selector  8.84s   $0\nPriority 6:  Firecrawl enhanced proxy       ~3-5s   credits\n```\n\nEverything runs on a single 4GB VPS. Peak memory with one browser session: ~800MB including the AI agent gateway. 39GB free disk after cleaning stale caches and old kernels. Total scraping cost: $0.\n\n**Installation is the first test.** Read the docs before `pip install`. Know what each dependency costs in RAM. Skip what you don't need. You can always add it later.\n\n**The 1.4GB Camoufox story was user error.** Spawning browsers in a loop without sessions will eat any machine. With persistent sessions, Camoufox is the lightest browser in the stack at 81MB. Don't take benchmark threads on faith. Run your own.\n\n**Speed differences compound silently.** 0.77s vs 8.84s is nothing for one page. For 100 pages, it's 77 seconds vs nearly 15 minutes. The gap grows with every page you add, so choosing the right tier matters more the larger the crawl.\n\n**Fingerprint diversity is a superpower.** Having both Chromium and Firefox in your arsenal means you can get past sites that target either one. Camoufox is slow, but it's a different shape entirely, and sometimes that's all you need.\n\n**Wire the ladder, not the tools.** Individual tools leave you guessing. 
A priority ladder gives you a protocol: start cheap, escalate on failure. Encode it as an agent skill and you never have to think about it again.\n\n**Scrapling is the platform, not just a fetcher.** Adaptive element tracking, three-tier architecture, spider framework with pause/resume, MCP server for AI agents — it's the foundation everything else plugs into. The benchmarks measure its fetchers, but the framework is what makes them interchangeable.\n\n*Questions? Find me on X @mariatanbobo*", "url": "https://wpnews.pro/news/i-tested-every-web-scraping-tool-against-lazada-here-s-what-actually-works-may", "canonical_source": "https://dev.to/mariatanbobo/i-tested-every-web-scraping-tool-against-lazada-heres-what-actually-works-may-2026-16pg", "published_at": "2026-05-30 03:18:10+00:00", "updated_at": "2026-05-30 03:41:51.889149+00:00", "lang": "en", "topics": ["ai-tools", "ai-agents", "ai-infrastructure"], "entities": ["Scrapling", "Lazada", "Google reCAPTCHA", "Playwright", "BeautifulSoup", "Scrapy", "D4Vinci", "MCP"], "alternates": {"html": "https://wpnews.pro/news/i-tested-every-web-scraping-tool-against-lazada-here-s-what-actually-works-may", "markdown": "https://wpnews.pro/news/i-tested-every-web-scraping-tool-against-lazada-here-s-what-actually-works-may.md", "text": "https://wpnews.pro/news/i-tested-every-web-scraping-tool-against-lazada-here-s-what-actually-works-may.txt", "jsonld": "https://wpnews.pro/news/i-tested-every-web-scraping-tool-against-lazada-here-s-what-actually-works-may.jsonld"}}