Crawlee for Python: Build a Web Crawling Pipeline with Robots Handling, Link Graphs, and RAG Chunk Export

Apify released a tutorial for Crawlee for Python, demonstrating how to build a web crawling pipeline with robots handling, link graphs, and RAG chunk export. The tutorial covers environment setup, static and dynamic crawling, structured extraction, and downstream data processing using tools like BeautifulSoupCrawler, ParselCrawler, and PlaywrightCrawler. This enables developers to efficiently scrape and process web data for AI applications.

In this tutorial, we build a full Crawlee-for-Python (https://github.com/apify/crawlee-python) workflow that covers environment setup, local website generation, static crawling, dynamic crawling, structured extraction, and downstream data processing. We begin by configuring a compatible Crawlee runtime with pinned Pydantic support, Playwright browser installation, persistent storage directories, and Colab-safe execution handling. We then generate a realistic local demo website containing product pages, documentation pages, blog content, internal links, robots.txt rules, JSON-LD metadata, and JavaScript-rendered catalog items. Using BeautifulSoupCrawler, we perform fast recursive HTML crawling and extract page titles, metadata, text previews, outgoing links, product attributes, documentation headings, code blocks, and blog tags. With ParselCrawler, we run precise CSS- and XPath-based extraction on product detail pages. With PlaywrightCrawler, we render JavaScript content in a headless Chromium browser, wait for dynamic DOM elements to appear, extract client-side data, and capture full-page screenshots.
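To make the robots.txt rules mentioned above concrete, here is a minimal sketch that evaluates the demo site's rules with the standard library's `urllib.robotparser`, independent of Crawlee. It assumes the rules apply to the wildcard user agent, and the host and port in the URL are placeholders. Crawlee's own robots handling may differ in matching details, so treat this only as an illustration of what the `Disallow: /admin/` rule means.

```python
from urllib.robotparser import RobotFileParser

# Rules matching the demo site's robots.txt (wildcard user agent assumed).
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /
"""


def is_allowed(path, user_agent="*"):
    """Return True if the given path may be fetched under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    # The host and port are placeholders; only the path is matched against the rules.
    return parser.can_fetch(user_agent, f"http://127.0.0.1:8000{path}")


print(is_allowed("/index.html"))         # True
print(is_allowed("/admin/hidden.html"))  # False
```

The standard-library parser applies the first matching rule, which is why the `Disallow: /admin/` line takes effect even though `Allow: /` follows it.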
Setting Up the Crawlee Python Runtime and Helpers python import os import sys import re import csv import json import time import math import shutil import socket import hashlib import asyncio import textwrap import subprocess import threading from pathlib import Path from functools import partial from http.server import ThreadingHTTPServer, SimpleHTTPRequestHandler from importlib.metadata import version, PackageNotFoundError SETUP SENTINEL = "/content/.crawlee python tutorial setup done v2" def sh command, check=True, quiet=False : print f"\n$ {command}" result = subprocess.run command, shell=True, text=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, if not quiet and result.stdout: print result.stdout -5000: if check and result.returncode = 0: raise RuntimeError f"Command failed with exit code {result.returncode}: {command}" return result.returncode == 0 def package version package name : try: return version package name except PackageNotFoundError: return None def is good pydantic version v : if not v: return False m = re.match r"^ \d+ \. \d+ ", v if not m: return False major, minor = int m.group 1 , int m.group 2 return major == 2 and minor == 11 current crawlee = package version "crawlee" current pydantic = package version "pydantic" needs setup = not os.path.exists SETUP SENTINEL or current crawlee is None or not is good pydantic version current pydantic if needs setup: print "PHASE 1: Installing compatible Crawlee + Pydantic + Playwright dependencies." print "After this finishes, Colab will restart automatically. Then run this same cell again." 
sh f'{sys.executable} -m pip uninstall -y crawlee pydantic pydantic-core', check=False sh f'{sys.executable} -m pip install -q -U ' f'"pydantic =2.11,<2.12" ' f'"crawlee all " ' f'pandas matplotlib networkx nest asyncio beautifulsoup4 parsel' sh f'{sys.executable} -m playwright install --with-deps chromium', check=False Path SETUP SENTINEL .write text "done", encoding="utf-8" print "\nInstalled versions:" sh f'{sys.executable} -m pip show crawlee pydantic pydantic-core', check=False try: import google.colab print "\nRestarting Colab runtime now. After it reconnects, run this same cell again." os.kill os.getpid , 9 except Exception: raise SystemExit "Setup complete. Restart the runtime/kernel manually, then run this cell again." print "PHASE 2: Dependencies are ready. Running the Crawlee tutorial." import pandas as pd import matplotlib.pyplot as plt import networkx as nx import nest asyncio nest asyncio.apply TUTORIAL ROOT = Path "/content/crawlee python advanced tutorial" SITE DIR = TUTORIAL ROOT / "demo site" OUTPUT DIR = TUTORIAL ROOT / "outputs" STORAGE DIR = TUTORIAL ROOT / "crawlee storage" SCREENSHOT DIR = OUTPUT DIR / "screenshots" for path in SITE DIR, OUTPUT DIR, STORAGE DIR : if path.exists : shutil.rmtree path for path in SITE DIR, OUTPUT DIR, STORAGE DIR, SCREENSHOT DIR : path.mkdir parents=True, exist ok=True os.environ "CRAWLEE STORAGE DIR" = str STORAGE DIR os.environ "CRAWLEE LOG LEVEL" = "INFO" os.environ "CRAWLEE PURGE ON START" = "true" from crawlee import Glob, ConcurrencySettings from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext, ParselCrawler, ParselCrawlingContext, PlaywrightCrawler, PlaywrightCrawlingContext, try: import crawlee print "Crawlee version:", crawlee. version except Exception: print "Crawlee imported successfully." 
print "Pydantic version:", package version "pydantic" def safe slug value : value = re.sub r" ^a-zA-Z0-9 +", "-", str value .strip "-" .lower return value or "item" def money to float value : if value is None: return None cleaned = re.sub r" ^0-9. ", "", str value return float cleaned if cleaned else None def normalize text value, max len=None : value = re.sub r"\s+", " ", value or "" .strip return value :max len if max len else value def write file path, content : path = Path path path.parent.mkdir parents=True, exist ok=True path.write text textwrap.dedent content .strip + "\n", encoding="utf-8" We begin by preparing the complete Colab runtime for the Crawlee tutorial. We install compatible versions of Crawlee, Pydantic, Playwright, and the required analysis libraries, and handle the automatic restart required after setup. We then configure storage folders, environment variables, crawler imports, and helper functions to ensure the rest of the workflow runs smoothly. Generating the Demo Website and Product Catalog PRODUCTS = { "sku": "CRW-101", "name": "Crawler Reliability Kit", "category": "automation", "price": 149.0, "rating": 4.8, "stock": 18, "features": "retry policy", "queue replay", "structured logs" , "related": "CRW-202", "CRW-303" , }, { "sku": "CRW-202", "name": "Playwright Rendering Pack", "category": "browser", "price": 249.0, "rating": 4.7, "stock": 9, "features": "headless chromium", "screenshots", "dynamic DOM extraction" , "related": "CRW-101", "CRW-404" , }, { "sku": "CRW-303", "name": "RAG Extraction Bundle", "category": "ai-data", "price": 199.0, "rating": 4.9, "stock": 13, "features": "clean text chunks", "metadata capture", "JSONL export" , "related": "CRW-101", "CRW-505" , }, { "sku": "CRW-404", "name": "Anti-Fragile Session Toolkit", "category": "resilience", "price": 299.0, "rating": 4.6, "stock": 5, "features": "session rotation", "state recovery", "graceful failures" , "related": "CRW-202", "CRW-505" , }, { "sku": "CRW-505", "name": 
"Data Export Control Plane", "category": "storage", "price": 179.0, "rating": 4.5, "stock": 21, "features": "datasets", "key-value store", "CSV and JSON export" , "related": "CRW-303", "CRW-404" , }, def layout title, body, extra head="", extra script="" : css = """ <style body { font-family: Inter, system-ui, -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif; margin: 0; background: f7f7fb; color: 1f2430; } header { background: 202638; color: white; padding: 28px 40px; } nav a { color: dbe7ff; margin-right: 18px; text-decoration: none; font-weight: 600; } main { max-width: 1050px; margin: 0 auto; padding: 32px; } .grid { display: grid; grid-template-columns: repeat auto-fit, minmax 230px, 1fr ; gap: 18px; } .card, article, .panel { background: white; border: 1px solid e5e7ef; border-radius: 16px; padding: 20px; box-shadow: 0 8px 25px rgba 20, 30, 60, 0.05 ; } .price { font-size: 1.3rem; font-weight: 800; } .tag { display: inline-block; background: edf2ff; border: 1px solid d6e0ff; border-radius: 999px; padding: 4px 10px; margin: 3px; font-size: 0.82rem; } .stock-low { color: b42318; font-weight: 700; } .stock-ok { color: 067647; font-weight: 700; } code, pre { background: 111827; color: d1fae5; border-radius: 10px; } pre { padding: 16px; overflow-x: auto; } footer { padding: 30px 40px; color: 606779; } </style """ return f""" < doctype html <html lang="en" <head <meta charset="utf-8" <meta name="viewport" content="width=device-width, initial-scale=1" <meta name="description" content="{title} page for a Crawlee Python tutorial demo website." 
<title {title}</title {css} {extra head} </head <body <header <h1 {title}</h1 <nav <a href="/index.html" Home</a <a href="/products/product-crw-101.html" Products</a <a href="/docs/getting-started.html" Docs</a <a href="/blog/crawling-at-scale.html" Blog</a <a href="/dynamic.html" Dynamic JS Page</a <a href="/admin/hidden.html" Admin</a </nav </header <main {body}</main <footer Local demo website generated for Crawlee Python advanced tutorial.</footer {extra script} </body </html """ def build demo site : write file SITE DIR / "robots.txt", """ User-agent: Disallow: /admin/ Allow: / """, product cards = for product in PRODUCTS: product cards.append f""" <div class="card product-teaser" data-sku="{product 'sku' }" data-category="{product 'category' }" <h2 <a href="/products/product-{safe slug product 'sku' }.html" {product 'name' }</a </h2 <p {product 'category' } crawler module with rating {product 'rating' }.</p <p class="price" data-price="{product 'price' }" ${product 'price' :.2f}</p <p class="{'stock-low' if product 'stock' < 10 else 'stock-ok'}" Stock: {product 'stock' }</p </div """ write file SITE DIR / "index.html", layout "Crawlee Demo Commerce + Docs Hub", f""" <section class="panel" <h2 Why this site exists</h2 <p This local website gives us predictable pages for testing Crawlee without scraping a third-party website. We include static HTML pages, documentation pages, product detail pages, a blog article, robots.txt, and a JavaScript-rendered page. 
</p </section <h2 Featured crawler modules</h2 <section class="grid" {''.join product cards } </section <section class="panel" <h2 Internal links for recursive crawling</h2 <ul <li <a href="/docs/getting-started.html" Getting started guide</a </li <li <a href="/docs/advanced-routing.html" Advanced routing guide</a </li <li <a href="/blog/crawling-at-scale.html" Crawling at scale article</a </li <li <a href="/dynamic.html" JavaScript-rendered catalog</a </li <li <a href="/admin/hidden.html" Admin page blocked by robots and crawler filters</a </li </ul </section """, , for product in PRODUCTS: related links = "\n".join f'<li <a class="related-link" href="/products/product-{safe slug sku }.html" {sku}</a </li ' for sku in product "related" feature list = "\n".join f"<li {feature}</li " for feature in product "features" json ld = json.dumps { "@context": "https://schema.org", "@type": "Product", "sku": product "sku" , "name": product "name" , "category": product "category" , "offers": { "@type": "Offer", "price": product "price" , "priceCurrency": "USD", }, "aggregateRating": { "@type": "AggregateRating", "ratingValue": product "rating" , }, }, indent=2, write file SITE DIR / "products" / f"product-{safe slug product 'sku' }.html", layout f"{product 'name' } | Product Detail", f""" <article class="product" data-sku="{product 'sku' }" data-category="{product 'category' }" data-rating="{product 'rating' }" data-stock="{product 'stock' }" <h2 class="product-title" {product 'name' }</h2 <p class="sku" SKU: <strong {product 'sku' }</strong </p <p class="category" Category: <strong {product 'category' }</strong </p <p class="price" data-price="{product 'price' }" ${product 'price' :.2f}</p <p class="rating" Rating: {product 'rating' } / 5</p <p class="{'stock-low' if product 'stock' < 10 else 'stock-ok'}" Stock: {product 'stock' }</p <h3 Features</h3 <ul class="features" {feature list}</ul <h3 Related modules</h3 <ul {related links}</ul </article <script 
type="application/ld+json" {json ld}</script """, , We create a realistic product catalog that becomes the structured data source for our demo website. We define reusable HTML layout logic, styling, navigation, and page templates to make the local website look and behave like a small commercial and documentation portal. We then generate the homepage and product detail pages, including prices, ratings, stock levels, product features, related links, and JSON-LD metadata. Adding Docs, Blog, Dynamic, and Admin Pages write file SITE DIR / "docs" / "getting-started.html", layout "Getting Started with Reliable Crawlers", """ <article class="doc" data-doc-id="getting-started" <h2 HTTP-first crawling strategy</h2 <p We start with HTTP crawlers because they are lightweight and efficient. Browser crawling is reserved for pages that need JavaScript rendering. </p <h2 Core extraction fields</h2 <p Each crawler extracts URL, title, page type, text summary, outgoing links, and page-specific metadata. </p <pre <code crawler = BeautifulSoupCrawler max requests per crawl=20 </code </pre <p <a href="/docs/advanced-routing.html" Next: advanced routing</a </p </article """, , write file SITE DIR / "docs" / "advanced-routing.html", layout "Advanced Routing and Storage", """ <article class="doc" data-doc-id="advanced-routing" <h2 Queue filtering</h2 <p We filter links to keep the crawl focused on the same local domain and skip admin pages. </p <h2 Storage design</h2 <p Structured rows go to datasets. Binary screenshots and snapshots go to a key-value store. 
</p <pre <code await context.enqueue links include= Glob "https://example.com/ " </code </pre <p <a href="/blog/crawling-at-scale.html" Read the scaling article</a </p </article """, , write file SITE DIR / "blog" / "crawling-at-scale.html", layout "Crawling at Scale", """ <article class="blog-post" data-author="demo-team" data-reading-time="7" <h2 Scaling crawler jobs without losing reliability</h2 <p Production crawlers need controlled concurrency, retry behavior, stable request queues, structured exports, and monitoring-ready output. </p <p For AI data workflows, we also normalize text, preserve source URLs, create chunks, and record extraction provenance. </p <span class="tag" queues</span <span class="tag" datasets</span <span class="tag" rag</span <span class="tag" playwright</span </article """, , dynamic items = json.dumps { "sku": "JS-900", "name": "Dynamic Inventory Scanner", "price": 329.0, "stock": 4, "desc": "Rendered only after JavaScript executes.", }, { "sku": "JS-901", "name": "Client-Side Review Miner", "price": 279.0, "stock": 11, "desc": "Created by browser-side DOM manipulation.", }, { "sku": "JS-902", "name": "Async Catalog Watcher", "price": 389.0, "stock": 7, "desc": "Useful for testing PlaywrightCrawler extraction.", }, , indent=2, dynamic script = f""" <script const dynamicItems = {dynamic items}; function renderItems {{ const root = document.querySelector " dynamic-products" ; root.innerHTML = ""; for const item of dynamicItems {{ const card = document.createElement "div" ; card.className = "card js-card"; card.dataset.sku = item.sku; card.dataset.price = item.price; card.dataset.stock = item.stock; card.innerHTML = <h3 ${{item.name}}</h3 <p class="desc" ${{item.desc}}</p <p class="price" $${{item.price.toFixed 2 }}</p <p class="${{item.stock < 8 ? 
"stock-low" : "stock-ok"}}" Stock: ${{item.stock}}</p ; root.appendChild card ; }} document.querySelector " render-status" .textContent = "Rendered " + dynamicItems.length + " JavaScript items."; }} setTimeout renderItems, 600 ; </script """ write file SITE DIR / "dynamic.html", layout "JavaScript Rendered Catalog", """ <section class="panel" <h2 Dynamic content test</h2 <p A plain HTTP crawler can download this page, but it will not see the cards below until JavaScript runs. PlaywrightCrawler opens a real browser and extracts the rendered DOM. </p <p id="render-status" Waiting for JavaScript rendering...</p </section <section id="dynamic-products" class="grid" </section """, extra script=dynamic script, , write file SITE DIR / "admin" / "hidden.html", layout "Hidden Admin Page", """ <article class="panel" <h2 This page should be skipped</h2 <p The crawler excludes this admin path to demonstrate control over the rawl scope </p </article """, , build demo site print f"Demo site generated at: {SITE DIR}" class QuietHandler SimpleHTTPRequestHandler : def log message self, format, args : pass def start local server directory : probe = socket.socket probe.bind "127.0.0.1", 0 port = probe.getsockname 1 probe.close handler = partial QuietHandler, directory=str directory httpd = ThreadingHTTPServer "127.0.0.1", port , handler thread = threading.Thread target=httpd.serve forever, daemon=True thread.start base url = f"http://127.0.0.1:{port}" time.sleep 0.5 return httpd, base url def extract json ld soup : blocks = for script in soup.select 'script type="application/ld+json" ' : raw = script.string or script.get text if not raw: continue try: blocks.append json.loads raw except Exception: blocks.append {"raw": raw} return blocks def write json path, rows : path = Path path path.write text json.dumps rows, ensure ascii=False, indent=2 , encoding="utf-8" def write csv path, rows : path = Path path if not rows: path.write text "", encoding="utf-8" return flattened = for row in 
rows: flat = {} for key, value in row.items : if isinstance value, list, dict : flat key = json.dumps value, ensure ascii=False else: flat key = value flattened.append flat fieldnames = sorted {key for row in flattened for key in row.keys } with path.open "w", newline="", encoding="utf-8" as f: writer = csv.DictWriter f, fieldnames=fieldnames writer.writeheader writer.writerows flattened We expand the demo website by adding documentation pages, a blog article, a JavaScript-rendered catalog page, and an admin page intended to be excluded from crawling. We use these pages to test different crawling scenarios, including static HTML extraction, documentation parsing, blog metadata extraction, dynamic browser rendering, and crawl filtering. We also start a local HTTP server and define utilities to extract JSON-LD content and export crawl results to JSON and CSV. Static Crawling with BeautifulSoupCrawler and ParselCrawler python async def run beautifulsoup crawl base url : print "\n=== 1 BeautifulSoupCrawler: fast recursive HTTP crawl ===" rows = crawler = BeautifulSoupCrawler parser="html.parser", max requests per crawl=30, max request retries=1, respect robots txt file=True, concurrency settings=ConcurrencySettings desired concurrency=4, max concurrency=6, , @crawler.router.default handler async def request handler context: BeautifulSoupCrawlingContext - None: soup = context.soup url = context.request.url title = normalize text soup.title.get text " ", strip=True if soup.title else "" meta description = "" meta tag = soup.find "meta", attrs={"name": "description"} if meta tag: meta description = normalize text meta tag.get "content", "" out links = for a in soup.select "a href " : href = a.get "href" label = normalize text a.get text " ", strip=True , 120 out links.append {"href": href, "label": label} page text = normalize text soup.get text " ", strip=True , 1000 if "/products/" in url: page type = "product" elif "/docs/" in url: page type = "documentation" elif 
"/blog/" in url: page type = "blog" elif "/dynamic" in url: page type = "dynamic-shell" else: page type = "index" row = { "source": "beautifulsoup-http", "url": url, "title": title, "page type": page type, "meta description": meta description, "text preview": page text, "out links": out links, "json ld": extract json ld soup , "extracted at unix": time.time , } if page type == "product": article = soup.select one "article.product" if article: price node = soup.select one ".price" row "product" = { "sku": article.get "data-sku" , "category": article.get "data-category" , "name": normalize text soup.select one ".product-title" .get text " ", strip=True if soup.select one ".product-title" else "" , "price": money to float price node.get "data-price" if price node else None , "rating": float article.get "data-rating" if article.get "data-rating" else None, "stock": int article.get "data-stock" if article.get "data-stock" else None, "features": normalize text li.get text " ", strip=True for li in soup.select ".features li" , } if page type == "documentation": row "doc" = { "headings": normalize text h.get text " ", strip=True for h in soup.select "h2, h3" , "code blocks": normalize text code.get text " ", strip=True for code in soup.select "pre code" , } if page type == "blog": row "blog" = { "author": soup.select one ".blog-post" .get "data-author" if soup.select one ".blog-post" else None, "reading time": soup.select one ".blog-post" .get "data-reading-time" if soup.select one ".blog-post" else None, "tags": normalize text tag.get text " ", strip=True for tag in soup.select ".tag" , } rows.append row await context.push data row await context.enqueue links include= Glob f"{base url}/ " , exclude= Glob f"{base url}/admin/ " , Glob f"{base url}/dynamic.html" , , await crawler.run f"{base url}/index.html" write json OUTPUT DIR / "beautifulsoup crawl.json", rows write csv OUTPUT DIR / "beautifulsoup crawl.csv", rows print f"BeautifulSoup rows extracted: {len rows }" return 
rows async def run parsel precision crawl base url : print "\n=== 2 ParselCrawler: precise CSS/XPath extraction from product pages ===" rows = product urls = f"{base url}/products/product-{safe slug product 'sku' }.html" for product in PRODUCTS crawler = ParselCrawler max requests per crawl=len product urls , max request retries=1, concurrency settings=ConcurrencySettings desired concurrency=5, max concurrency=8, , @crawler.router.default handler async def request handler context: ParselCrawlingContext - None: selector = context.selector title = selector.css "title::text" .get sku = selector.css "article.product::attr data-sku " .get category = selector.css "article.product::attr data-category " .get rating = selector.css "article.product::attr data-rating " .get stock = selector.css "article.product::attr data-stock " .get name = selector.css ".product-title::text" .get price = selector.css ".price::attr data-price " .get features = normalize text feature for feature in selector.css ".features li::text" .getall row = { "source": "parsel-precision", "url": context.request.url, "title": normalize text title , "sku": sku, "name": normalize text name , "category": category, "price": money to float price , "rating": float rating if rating else None, "stock": int stock if stock else None, "features": features, "xpath title": normalize text selector.xpath "//title/text " .get , } rows.append row await context.push data row await crawler.run product urls write json OUTPUT DIR / "parsel products.json", rows write csv OUTPUT DIR / "parsel products.csv", rows print f"Parsel product rows extracted: {len rows }" return rows We implement the static crawling part of the workflow using BeautifulSoupCrawler and ParselCrawler. With BeautifulSoupCrawler, we recursively crawl the local website and extract page titles, metadata, text previews, outgoing links, product details, documentation headings, code blocks, and blog tags. 
With ParselCrawler, we perform more targeted CSS and XPath extraction from product pages to collect clean product-level fields, including SKU, category, price, rating, stock, and features. Dynamic Rendering with PlaywrightCrawler and Link Graphs python async def run playwright dynamic crawl base url : print "\n=== 3 PlaywrightCrawler: browser-rendered JavaScript crawl ===" rows = crawler = PlaywrightCrawler max requests per crawl=2, max request retries=1, headless=True, browser type="chromium", browser launch options={ "args": "--no-sandbox", "--disable-dev-shm-usage" , }, goto options={ "wait until": "domcontentloaded", }, concurrency settings=ConcurrencySettings desired concurrency=1, max concurrency=2, , @crawler.router.default handler async def request handler context: PlaywrightCrawlingContext - None: await context.page.wait for selector ".js-card", timeout=10000 cards = await context.page.locator ".js-card" .evaluate all """ cards = cards.map card = { const h3 = card.querySelector "h3" ; const desc = card.querySelector ".desc" ; const price = card.querySelector ".price" ; return { sku: card.dataset.sku, name: h3 ? h3.textContent.trim : null, description: desc ? desc.textContent.trim : null, price text: price ? 
price.textContent.trim : null, price: Number card.dataset.price , stock: Number card.dataset.stock , rendered text: card.innerText.trim }; } """ screenshot bytes = await context.page.screenshot full page=True screenshot path = SCREENSHOT DIR / "dynamic catalog full page.png" screenshot path.write bytes screenshot bytes try: kvs = await context.get key value store await kvs.set value key="dynamic-catalog-full-page", value=screenshot bytes, content type="image/png", except Exception as exc: print "Key-value store screenshot save skipped:", repr exc for card in cards: row = { card, "source": "playwright-rendered-js", "url": context.request.url, "screenshot path": str screenshot path , "extracted at unix": time.time , } rows.append row await context.push data rows try: await crawler.run f"{base url}/dynamic.html" except Exception as exc: print "Playwright section failed gracefully." print "Reason:", repr exc write json OUTPUT DIR / "playwright dynamic.json", rows write csv OUTPUT DIR / "playwright dynamic.csv", rows print f"Playwright dynamic rows extracted: {len rows }" return rows def flatten products rows : products = for row in rows: if row.get "page type" == "product" and isinstance row.get "product" , dict : product = row "product" products.append { "source": row.get "source" , "url": row.get "url" , "sku": product.get "sku" , "name": product.get "name" , "category": product.get "category" , "price": product.get "price" , "rating": product.get "rating" , "stock": product.get "stock" , "features": "; ".join product.get "features", , } elif row.get "source" == "parsel-precision": products.append { "source": row.get "source" , "url": row.get "url" , "sku": row.get "sku" , "name": row.get "name" , "category": row.get "category" , "price": row.get "price" , "rating": row.get "rating" , "stock": row.get "stock" , "features": "; ".join row.get "features", , } elif row.get "source" == "playwright-rendered-js": products.append { "source": row.get "source" , "url": row.get 
"url" , "sku": row.get "sku" , "name": row.get "name" , "category": "dynamic-js", "price": row.get "price" or money to float row.get "price text" , "rating": None, "stock": row.get "stock" , "features": row.get "description" , } return products def absolute url base url, href : if not href: return None if href.startswith "http://" or href.startswith "https://" : return href if href.startswith "/" : return base url + href return base url + "/" + href def build link graph base url, rows : graph = nx.DiGraph for row in rows: src = row.get "url" if not src: continue graph.add node src, title=row.get "title", "" , page type=row.get "page type", "" , for link in row.get "out links", or : dst = absolute url base url, link.get "href" if not dst: continue if "/admin/" in dst: continue graph.add node dst graph.add edge src, dst, label=link.get "label", "" return graph We handle dynamic content using PlaywrightCrawler, which opens the JavaScript-rendered page in a headless Chromium browser. We wait for client-side product cards to appear, extract their rendered fields, capture a full-page screenshot, and save the browser-based results for later analysis. We then define helper functions to normalize product records and build a directed link graph from the internal links discovered during crawling. Building AI-Ready Outputs and Running the Pipeline python def make rag chunks rows, max chars=700 : chunks = for row in rows: text = row.get "text preview" or row.get "rendered text" or row.get "description" or "" text = normalize text text if not text: continue sentences = re.split r" ?<= . ? 
\s+", text current = "" for sentence in sentences: if len current + len sentence + 1 <= max chars: current = current + " " + sentence .strip else: if current: chunks.append { "chunk id": hashlib.sha1 row.get "url", "" + current .encode .hexdigest :12 , "url": row.get "url" , "source": row.get "source" , "page type": row.get "page type" , "title": row.get "title" or row.get "name" , "text": current, } current = sentence if current: chunks.append { "chunk id": hashlib.sha1 row.get "url", "" + current .encode .hexdigest :12 , "url": row.get "url" , "source": row.get "source" , "page type": row.get "page type" , "title": row.get "title" or row.get "name" , "text": current, } return chunks def analyze outputs base url, bs4 rows, parsel rows, playwright rows : all rows = bs4 rows + parsel rows + playwright rows products = flatten products all rows crawl df = pd.DataFrame all rows product df = pd.DataFrame products if not product df.empty: product df "price" = pd.to numeric product df "price" , errors="coerce" product df "stock" = pd.to numeric product df "stock" , errors="coerce" product df "rating" = pd.to numeric product df "rating" , errors="coerce" product df "inventory value" = product df "price" product df "stock" graph = build link graph base url, bs4 rows graph path = OUTPUT DIR / "site link graph.graphml" if graph.number of nodes 0: nx.write graphml graph, graph path chunks = make rag chunks all rows rag path = OUTPUT DIR / "rag chunks.jsonl" with rag path.open "w", encoding="utf-8" as f: for chunk in chunks: f.write json.dumps chunk, ensure ascii=False + "\n" crawl json path = OUTPUT DIR / "combined crawl results.json" crawl json path.write text json.dumps all rows, ensure ascii=False, indent=2 , encoding="utf-8", product csv path = OUTPUT DIR / "normalized product catalog.csv" if not product df.empty: product df.to csv product csv path, index=False price plot path = OUTPUT DIR / "product price chart.png" if not product df.empty and product df "price" .notna 
.any():
        plot_df = product_df.dropna(subset=["price"]).copy()
        plot_df["label"] = plot_df["sku"].fillna("unknown") + "\n" + plot_df["source"].fillna("")
        ax = plot_df.plot(
            kind="bar",
            x="label",
            y="price",
            legend=False,
            figsize=(11, 5),
            title="Extracted Product Prices by Source",
        )
        ax.set_xlabel("Product / extraction source")
        ax.set_ylabel("Price")
        plt.xticks(rotation=35, ha="right")
        plt.tight_layout()
        plt.savefig(price_plot_path, dpi=160)
        plt.show()

    graph_stats = {
        "nodes": graph.number_of_nodes(),
        "edges": graph.number_of_edges(),
        "weakly_connected_components": nx.number_weakly_connected_components(graph) if graph.number_of_nodes() else 0,
    }
    if graph.number_of_nodes() > 0:
        in_degrees = dict(graph.in_degree())
        out_degrees = dict(graph.out_degree())
        graph_stats["top_in_degree"] = sorted(
            in_degrees.items(),
            key=lambda x: x[1],
            reverse=True,
        )[:5]
        graph_stats["top_out_degree"] = sorted(
            out_degrees.items(),
            key=lambda x: x[1],
            reverse=True,
        )[:5]

    summary = {
        "base_url": base_url,
        "rows_total": len(all_rows),
        "beautifulsoup_rows": len(bs4_rows),
        "parsel_rows": len(parsel_rows),
        "playwright_rows": len(playwright_rows),
        "products_total": len(product_df),
        "rag_chunks_total": len(chunks),
        "graph": graph_stats,
        "outputs": {
            "beautifulsoup_json": str(OUTPUT_DIR / "beautifulsoup_crawl.json"),
            "beautifulsoup_csv": str(OUTPUT_DIR / "beautifulsoup_crawl.csv"),
            "parsel_json": str(OUTPUT_DIR / "parsel_products.json"),
            "parsel_csv": str(OUTPUT_DIR / "parsel_products.csv"),
            "playwright_json": str(OUTPUT_DIR / "playwright_dynamic.json"),
            "playwright_csv": str(OUTPUT_DIR / "playwright_dynamic.csv"),
            "combined_json": str(crawl_json_path),
            "product_csv": str(product_csv_path) if product_csv_path.exists() else None,
            "rag_jsonl": str(rag_path),
            "graphml": str(graph_path) if graph_path.exists() else None,
            "price_plot": str(price_plot_path) if price_plot_path.exists() else None,
            "screenshots_dir": str(SCREENSHOT_DIR),
        },
    }

    summary_path = OUTPUT_DIR / "run_summary.md"
    summary_path.write_text(
        "# Crawlee Python Advanced Tutorial Run Summary\n\n"
        f"- Local demo site: {base_url}\n"
        f"- Total extracted rows: {summary['rows_total']}\n"
        f"- BeautifulSoup rows: {summary['beautifulsoup_rows']}\n"
        f"- Parsel rows: {summary['parsel_rows']}\n"
        f"- Playwright rows: {summary['playwright_rows']}\n"
        f"- Normalized products: {summary['products_total']}\n"
        f"- RAG chunks: {summary['rag_chunks_total']}\n"
        f"- Link graph nodes: {graph_stats['nodes']}\n"
        f"- Link graph edges: {graph_stats['edges']}\n\n"
        "## Output files\n\n"
        + "\n".join(f"- {k}: {v}" for k, v in summary["outputs"].items())
        + "\n",
        encoding="utf-8",
    )

    print("\n=== 4 Analysis summary ===")
    print(json.dumps(summary, indent=2, ensure_ascii=False))

    try:
        from IPython.display import display, Markdown, Image as IPImage

        display(Markdown("## Crawlee crawl preview"))
        if not crawl_df.empty:
            preview_cols = [col for col in ["source", "page_type", "title", "url"] if col in crawl_df.columns]
            display(crawl_df[preview_cols].head(12))

        display(Markdown("## Normalized product catalog"))
        if not product_df.empty:
            display(product_df.head(20))

        if price_plot_path.exists():
            display(Markdown("## Product price chart"))
            display(IPImage(filename=str(price_plot_path)))

        screenshot_path = SCREENSHOT_DIR / "dynamic_catalog_full_page.png"
        if screenshot_path.exists():
            display(Markdown("## Playwright screenshot of JavaScript-rendered page"))
            display(IPImage(filename=str(screenshot_path)))

        display(Markdown(f"## Output directory\n{OUTPUT_DIR}"))
    except Exception as exc:
        print("Notebook display skipped:", repr(exc))

    return summary


async def main():
    httpd, base_url = start_local_server(SITE_DIR)
    print(f"\nLocal demo website is running at: {base_url}/index.html")
    try:
        bs4_rows = await run_beautifulsoup_crawl(base_url)
        parsel_rows = await run_parsel_precision_crawl(base_url)
        playwright_rows = await run_playwright_dynamic_crawl(base_url)
        summary = analyze_outputs(base_url, bs4_rows, parsel_rows, playwright_rows)
        return summary
    finally:
        httpd.shutdown()
        print("\nLocal demo server shut down.")

loop = asyncio.get_event_loop()
summary = loop.run_until_complete(main())

print("\nTutorial complete.")
print(f"All outputs are in: {OUTPUT_DIR}")
print("Key files:")
for file_path in sorted(OUTPUT_DIR.rglob("*")):
    if file_path.is_file():
        print(" -", file_path)

We process the extracted crawl data into analysis-ready and AI-ready outputs. We create RAG-style JSONL chunks, combine all crawl results, build a normalized product catalog, generate a GraphML link graph, and visualize product prices with Matplotlib. Finally, we run the full pipeline end-to-end, display previews in the notebook, save all generated artifacts, and print the final output file paths.

Conclusion

We now have a complete Crawlee-based crawling and data engineering pipeline that converts a small website into structured, reusable datasets. We used crawl scoping, robots.txt handling, concurrency settings, link enqueuing, browser rendering, key-value storage, and dataset exports to simulate patterns used in production web crawling systems. We normalized the extracted product data, saved the crawl outputs as JSON and CSV, created a GraphML link graph with NetworkX, generated JSONL chunks for retrieval-augmented generation workflows, and visualized the extracted product prices with Matplotlib.

Check out the full code here https://github.com/MARKTECHPOST-AI-MEDIA-INC/AI-Agents-Projects-Tutorials/blob/main/Agentic%20Workflows/crawlee_python_static_dynamic_web_crawling_Marktechpost.ipynb .
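To make the RAG chunk export step more concrete, here is a minimal sketch of how crawled page text could be split into overlapping word windows and written as JSONL. It is not the notebook's implementation: the helper names `chunk_text` and `export_rag_jsonl`, the record fields, the window sizes, and the demo rows are all assumptions chosen for illustration, and it relies only on the standard library.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def chunk_text(text, max_words=40, overlap=10):
    """Split text into overlapping word windows (hypothetical helper)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


def export_rag_jsonl(rows, path, max_words=40, overlap=10):
    """Write one JSON object per chunk and return the number of chunks written."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for row in rows:
            pieces = chunk_text(row.get("text", ""), max_words, overlap)
            for index, piece in enumerate(pieces):
                # Stable ID derived from the URL and the chunk position.
                raw_id = f"{row['url']}#{index}".encode("utf-8")
                record = {
                    "id": hashlib.sha1(raw_id).hexdigest()[:16],
                    "url": row["url"],
                    "title": row.get("title"),
                    "chunk_index": index,
                    "text": piece,
                }
                fh.write(json.dumps(record, ensure_ascii=False) + "\n")
                count += 1
    return count


# Made-up rows standing in for crawl output (assumed shape, not the notebook's).
demo_rows = [
    {
        "url": "http://127.0.0.1:8000/docs/intro.html",
        "title": "Intro",
        "text": " ".join(f"word{i}" for i in range(100)),
    },
    {
        "url": "http://127.0.0.1:8000/blog/post.html",
        "title": "Post",
        "text": "short page",
    },
]
demo_path = Path(tempfile.mkdtemp()) / "rag_chunks.jsonl"
written = export_rag_jsonl(demo_rows, demo_path)
print(f"Wrote {written} chunks to {demo_path}")
```

Overlapping windows keep a little shared context between neighboring chunks, which tends to help retrieval when an answer straddles a chunk boundary. Deriving the ID from the URL and chunk index means re-running the export on unchanged pages yields the same IDs.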
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.