agents.md: Teaching AI Agents How to Scrape (The Future of Web Automation)

wpnews.pro

You're building a Scrapy scraper. You ask Claude Code for help. "Add pagination to my spider."

Claude reads your code. It sees scraper.py, settings.py, the folder structure. Then it makes suggestions.

But the suggestions don't match your style. You use specific naming conventions. Your error handling is particular. Your Scrapy middleware is custom. Claude doesn't know any of this.

So it writes code that technically works but doesn't fit. You spend 30 minutes rewriting it to match your patterns. You ask again. Same problem. Claude doesn't learn. You have to explain everything every time.

This is frustrating. You want the AI to understand your project. Not just the code, but the philosophy. The patterns. The conventions. The way you actually work.

Then you discover agents.md.

A simple Markdown file. You drop it in your project root. It teaches AI agents how you work. Your naming conventions. Your architectural decisions. Your scraping patterns. Your error handling approach. Everything.

Now when you ask Claude Code for help, it reads agents.md first. It understands your project. The code it writes matches your style immediately. No rewriting. No repeating yourself.

You're not building a scraper anymore. You're building a scraper that teaches AI how to build scrapers.

This is agents.md. And it's changing how developers work with AI.

Let me show you.

agents.md is a Markdown file that teaches AI coding agents how to work on your project.

Think of it like this:

README.md teaches humans about your project. What it does. How to install it. How to contribute.

agents.md teaches AI about your project. The patterns you use. The decisions you made. How to write code that fits your style.

Both live in your project. Both are important. They serve different audiences.

Scrapy projects are complex. You have spiders, pipelines, middleware, settings. You have patterns for error handling. Conventions for naming. Architectural decisions about how data flows.

When you ask AI for help, it sees all this complexity. But it doesn't understand the patterns. The philosophy. The way you think about scraping.

agents.md fixes this.

You document your patterns once. AI reads it. Now every suggestion, every piece of generated code, follows your patterns automatically.

There are several variants of this concept:

agents.md: Open format for any AI coding agent (the file is conventionally named AGENTS.md). Works with GitHub Copilot, Cursor, OpenAI Codex CLI, and more.

SKILL.md: Anthropic's format for specific reusable skills. More structured. Separate folder per skill.

CLAUDE.md: Claude Code's own project instruction file.

agents.md is the most widely supported of the three, so this blog focuses on agents.md.

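agents.md has two parts: YAML frontmatter (metadata such as name and description) and a Markdown body (your structure, conventions, and patterns).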

Here's a minimal agents.md file:

---
name: ecommerce-scraper
description: Web scraper for e-commerce product data
---


## Project Structure

- spiders/ - Scrapy spiders for different sites
- pipelines.py - Data cleaning and storage
- settings.py - Scrapy configuration
- items.py - Item definitions

## Naming Conventions

Spiders are named: spider_<site_name>.py

Examples: spider_amazon.py, spider_ebay.py

## Error Handling

All network errors are caught and logged

That's it. Simple. The agent reads this. Understands your project. Follows your style.

name: Your project's identifier (used internally)

description: Short description (50-100 chars). Tells the agent when to use this.

Optional fields:

---
name: ecommerce-scraper
description: Web scraper for e-commerce product data
version: 1.0.0
author: Your Name
tags: scrapy, web-scraping, ecommerce
license: MIT
---

These fields help the agent understand the project scope. They're optional but helpful for teams.
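
To make the split between frontmatter and body concrete, here is a minimal sketch that separates the two. It assumes the frontmatter is a flat list of `key: value` lines between two `---` lines; the helper name `split_frontmatter` is made up for this illustration, and a real project would more likely use a YAML parser.

```python
def split_frontmatter(text):
    """Split an agents.md string into (metadata dict, Markdown body).

    Assumption: frontmatter is flat `key: value` lines between two `---` lines.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter: the whole file is the body

    metadata = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing --- is the Markdown body
            body = "\n".join(lines[index + 1:]).strip()
            return metadata, body
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()

    return {}, text  # opening --- was never closed: treat it all as body


sample = """---
name: ecommerce-scraper
description: Web scraper for e-commerce product data
---

## Project Structure

- spiders/ - Scrapy spiders for different sites
"""

metadata, body = split_frontmatter(sample)
print(metadata["name"])      # ecommerce-scraper
print(body.splitlines()[0])  # ## Project Structure
```
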

Here's how to build an effective agents.md for a scraper.

In your project root:

touch agents.md
---
name: product-scraper
description: Scrapy spider for scraping product data
version: 1.0.0
author: Your Name
tags: scrapy, web-scraping, python
---
## Project Structure

- scrapers/spiders/ - All Scrapy spiders
  - One spider per website
  - Named spider_<site>.py

- scrapers/pipelines.py - Data processing
  - TextCleaningPipeline - Normalize text
  - DuplicateRemovalPipeline - Remove duplicates
  - DatabaseStoragePipeline - Save to PostgreSQL

- scrapers/items.py - Data schema definitions
- scrapers/settings.py - Scrapy configuration
- scrapers/middlewares.py - Custom middleware
## Naming Conventions

Spiders: spider_<site_name>.py
- Example: spider_amazon.py, spider_ebay.py
- Class: <SiteName>Spider (CamelCase)
- Example: AmazonSpider, EbaySpider

Pipelines: <FunctionName>Pipeline
- Example: TextCleaningPipeline, DuplicateRemovalPipeline

Items: <DataType>Item
- Example: ProductItem, ReviewItem

Methods: snake_case
- parse_product(), extract_price(), clean_text()

Include code examples showing your approach:

## Spider Patterns

### Basic Spider Structure

Every spider follows this pattern:

class <SiteName>Spider(scrapy.Spider):
    name = '<site-identifier>'
    allowed_domains = ['<domain.com>']
    start_urls = ['<url>']

    def parse(self, response):
        for item in response.css('<selector>'):
            yield {
                'name': item.css('.name::text').get(),
                'price': item.css('.price::text').get(),
            }

        next_page = response.css('<next-page-selector>').get()
        if next_page:
            yield response.follow(next_page, self.parse)

### Error Handling

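Wrap parsing logic in try-except and log the failing URL:

def parse_product(self, response):
    try:
        name = response.css('.name::text').get()
        if not name:
            # Skip pages without a product name, but leave a trace in the log
            self.logger.warning(f"No name found on {response.url}")
            return

        price = response.css('.price::text').get()
        yield {'name': name.strip(), 'price': price, 'url': response.url}

    except Exception as e:
        # Log and move on so one bad page does not stop the crawl
        self.logger.error(f"Error parsing {response.url}: {e}")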

### Price Extraction

Prices should be extracted as floats:

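def clean_price(self, price_str):
    import re
    if not price_str:  # .get() returns None when the selector matches nothing
        return None
    # Remove thousands separators so "1,299.99" parses as 1299.99
    match = re.search(r'\d+(?:\.\d+)?', price_str.replace(',', ''))
    return float(match.group(0)) if match else None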

### Data Validation

Check for required fields before yielding:

def parse(self, response):
    for item in response.css('.product'):
        product = {
            'name': item.css('.name::text').get(),
            'price': self.clean_price(item.css('.price::text').get()),
            'url': response.url,
        }

        if product['name'] and product['price'] is not None:
            yield product
## Scrapy Settings

Key settings in settings.py:

CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'

Pipelines (in order):
1. TextCleaningPipeline (normalize text)
2. DuplicateRemovalPipeline (remove duplicates)
3. DatabaseStoragePipeline (save to database)
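
The ordered list above maps onto Scrapy's `ITEM_PIPELINES` setting, where items pass through pipelines from the lowest number to the highest. The sketch below shows that ordering rule in plain Python. The priorities 100, 200, and 300 and the `scrapers.pipelines` module path follow the project layout used in this article, and `pipeline_order` is a hypothetical helper, not part of Scrapy.

```python
# Declared out of order on purpose: only the numbers decide the run order.
ITEM_PIPELINES = {
    "scrapers.pipelines.DatabaseStoragePipeline": 300,
    "scrapers.pipelines.TextCleaningPipeline": 100,
    "scrapers.pipelines.DuplicateRemovalPipeline": 200,
}


def pipeline_order(pipelines):
    """Return pipeline class names sorted from lowest to highest priority value."""
    ordered = sorted(pipelines, key=pipelines.get)
    return [path.rsplit(".", 1)[-1] for path in ordered]


print(pipeline_order(ITEM_PIPELINES))
# ['TextCleaningPipeline', 'DuplicateRemovalPipeline', 'DatabaseStoragePipeline']
```
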
## Common Tasks

### Adding a New Spider

1. Create scrapers/spiders/spider_<sitename>.py
2. Inherit from scrapy.Spider
3. Define name, allowed_domains, start_urls
4. Implement parse() method
5. Return items with required fields (name, price, url)
6. Test locally: scrapy crawl <spider-name>

### Adding a New Pipeline

1. Create new class in pipelines.py
2. Implement process_item() method
3. Add to ITEM_PIPELINES in settings.py
4. Assign integer priority (lower = earlier)

### Debugging a Spider

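Debug specific spider (-a sets a spider attribute as a string; the spider must read self.debug itself):
scrapy crawl <spider-name> -a debug=True

Inspect the HTML Scrapy downloaded (JavaScript is not executed):
scrapy shell 'https://example.com'
response.css('.selector').get()

Check all items before pipeline:
ITEM_PIPELINES = {}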
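
The "Adding a New Pipeline" steps above can be sketched as a plain class. This article names `TextCleaningPipeline` but never shows its body, so the whitespace normalization below is an assumption about what "normalize text" means. It also assumes items are plain dicts, as in the spider examples. A Scrapy pipeline needs no base class, so the sketch runs without Scrapy installed.

```python
class TextCleaningPipeline:
    """Normalize whitespace in every string field of an item (illustrative)."""

    def process_item(self, item, spider):
        for key, value in item.items():
            if isinstance(value, str):
                # Collapse runs of whitespace and trim both ends
                item[key] = " ".join(value.split())
        return item  # returning the item hands it to the next pipeline


pipeline = TextCleaningPipeline()
cleaned = pipeline.process_item(
    {"name": "  Wireless \n Mouse  ", "price": 19.99}, spider=None
)
print(cleaned)  # {'name': 'Wireless Mouse', 'price': 19.99}
```
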
## Dependencies

Core:
- Scrapy 2.9+
- Python 3.8+

Data Processing:
- pandas (data cleaning)
- sqlalchemy (database)
- psycopg2 (PostgreSQL)

Testing:
- pytest
- pytest-scrapy

Install all:
pip install -r requirements.txt

Document what your items actually look like:

## Item Format

All items follow this structure:

{
    'name': str,              # Product name (required)
    'price': float,           # Price in USD (required)
    'original_price': float,  # Before discount (optional)
    'rating': float,          # 0-5 stars (optional)
    'review_count': int,      # Number of reviews (optional)
    'url': str,               # Source URL (required)
    'site': str,              # Which site (amazon, ebay, etc)
    'scraped_at': str,        # ISO datetime
}
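
A documented item format is easy to turn into a check. The sketch below flags missing required fields and fills in `site` and `scraped_at`. The helper names `missing_fields` and `stamp` are hypothetical, and using UTC for the timestamp is an assumption, since the format above only says "ISO datetime".

```python
from datetime import datetime, timezone

REQUIRED_FIELDS = ("name", "price", "url")  # marked "required" in the item format


def missing_fields(item):
    """Return the required fields that are absent or empty in an item."""
    return [field for field in REQUIRED_FIELDS if item.get(field) in (None, "")]


def stamp(item, site):
    """Return a copy of the item with the site and an ISO 8601 UTC timestamp."""
    return {**item, "site": site,
            "scraped_at": datetime.now(timezone.utc).isoformat()}


item = stamp({"name": "Wireless Mouse", "price": 19.99,
              "url": "https://example.com/p/1"}, site="amazon")
print(missing_fields(item))               # []
print(missing_fields({"name": "Mouse"}))  # ['price', 'url']
```
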

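Claude Code reads project instructions from your project root. Its native file is CLAUDE.md, so if your version does not pick up agents.md on its own, add a line containing @agents.md to CLAUDE.md to import it.

Then just ask for help and Claude will follow your patterns.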

Example:

"Add a spider for Target.com following the existing patterns"

Claude reads agents.md. Knows your naming conventions. Your error handling. Your spider template. Generates code that fits perfectly.

Create agents.md in your repo root. GitHub Copilot reads it in VS Code/GitHub.

When you start typing a new spider, Copilot autocompletes following your patterns.

Same as GitHub Copilot. Cursor reads agents.md automatically.

You can reference agents.md when using Codex.

Start small. Add more as you need it.

---
name: my-scraper
description: Scrapy scraper for product data
---

## Project Structure

- spiders/ - Scrapy spiders
- pipelines.py - Data processing
- items.py - Data schema

## Naming

Spiders: spider_<site>.py
Methods: snake_case

## Error Handling

Wrap in try-except. Log errors with URL.

Start with this and get it working. Then expand in stages:

1. Add the spider template, pipeline patterns, and common tasks.
2. Add the full reference: all patterns, all conventions, troubleshooting.
3. Add complex patterns, edge cases, and performance tips.

You write a 5,000-word agents.md. Too much. The rules that matter get buried, and the file takes up context on every request.

Fix: Keep it under 2,000 words. Use progressive disclosure. Reference external files:

"See docs/advanced-patterns.md for complex scraping scenarios."

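The word limit is simple to check with a script. Here is a rough sketch: the 2,000-word figure comes from the advice above, while the counting rule (whitespace-separated tokens that contain a letter or digit) and the helper names are assumptions chosen for this example.

```python
import re

WORD_LIMIT = 2000  # the limit suggested above


def count_words(markdown_text):
    """Count whitespace-separated tokens that contain a letter or digit."""
    return sum(1 for token in markdown_text.split() if re.search(r"\w", token))


def check_length(markdown_text, limit=WORD_LIMIT):
    """Return (word_count, within_limit) for an agents.md string."""
    words = count_words(markdown_text)
    return words, words <= limit


sample = "## Naming\n\nSpiders: spider_<site>.py\nMethods: snake_case\n"
print(check_length(sample))  # (5, True)
```

To check a real file, pass its contents instead of the sample string, for example `check_length(open("agents.md").read())`.
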
"Use good names."

Too vague. AI doesn't know what good means.

Fix: Be specific. Show examples.

"Spiders: spider_<site>.py. Classes: <SiteName>Spider. Methods: snake_case."

You update your patterns but forget to update agents.md.

AI follows the old patterns from the file.

Fix: Update agents.md whenever you change patterns. Commit it to Git.

You show an example but give no context.

Fix: Explain why. Show both example and reasoning.

"Remember to use kwargs for flexibility."

New developers don't understand this.

Fix: Explain for beginners. Show what it does.
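
For the kwargs note, "show what it does" could look like the short sketch below. The function name and its default options are invented for the example and do not come from Scrapy or from this project.

```python
def build_request_options(url, **kwargs):
    """Return request options; extra keywords override or extend the defaults.

    **kwargs collects keyword arguments the signature does not name, so callers
    can pass new options without this function having to change.
    """
    options = {"url": url, "timeout": 30, "retries": 3}
    options.update(kwargs)
    return options


print(build_request_options("https://example.com"))
# {'url': 'https://example.com', 'timeout': 30, 'retries': 3}
print(build_request_options("https://example.com", retries=5, proxy="http://localhost:8080"))
# {'url': 'https://example.com', 'timeout': 30, 'retries': 5, 'proxy': 'http://localhost:8080'}
```
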

When multiple developers work on the scraper, agents.md becomes crucial.

All developers read the same agents.md. Everyone follows the same patterns.

No more debates: "Should this be snake_case or CamelCase?" It's in agents.md.

New developer joins. You say: "Read agents.md. It explains how we work."

They read one file. Now they can contribute following your patterns immediately.

Reviewer checks if code follows agents.md patterns. Reduces comments about style.

Focus on logic, not formatting.

Add this to agents.md:

"AI will check:

When you ask Claude Code for help, it checks these automatically.

Here's a complete agents.md for an e-commerce scraper:

---
name: ecommerce-scraper
description: Scrapy spider for scraping product data
version: 1.0.0
author: Your Team
tags: scrapy, web-scraping, ecommerce, python
license: MIT
---


This Scrapy project scrapes product data from multiple e-commerce websites.

## Quick Facts

- Framework: Scrapy 2.9+
- Python: 3.8+
- Database: PostgreSQL
- Sites: Amazon, eBay, Walmart, Best Buy
- Update Frequency: Daily via cron job

## Project Structure

ecommerce-scraper/
├── scrapers/
│   ├── spiders/
│   │   ├── spider_amazon.py
│   │   ├── spider_ebay.py
│   │   └── spider_walmart.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── middlewares.py
├── requirements.txt
└── agents.md

## Data Flow

Start URLs
    ↓
Parse Product Page
    ↓
Extract: name, price, rating, reviews, url
    ↓
Validation Pipeline
    ↓
Text Cleaning
    ↓
Deduplication
    ↓
Database Storage

## Item Structure

All items follow this structure:

{
    'name': str,              # Product name (required)
    'price': float,           # Price in USD (required)
    'original_price': float,  # Before discount (optional)
    'rating': float,          # 0-5 stars (optional)
    'review_count': int,      # Number of reviews (optional)
    'url': str,               # Source URL (required)
    'site': str,              # Which site
    'scraped_at': str,        # ISO datetime
}

## Naming Conventions

Spiders: spider_<site>.py
Classes: <SiteName>Spider (CamelCase)
Methods: snake_case

Examples:
- Spider file: spider_amazon.py
- Class: AmazonSpider
- Method: extract_price()

## Spider Template

class <SiteName>Spider(scrapy.Spider):
    name = '<site-slug>'
    allowed_domains = ['<domain.com>']
    start_urls = ['<start-url>']

    def parse(self, response):
        try:
            for product in response.css('<selector>'):
                yield self.extract_product(product, response)

            next_page = response.css('<next-selector>').get()
            if next_page:
                yield response.follow(next_page, self.parse)

        except Exception as e:
            self.logger.error(f"Error parsing {response.url}: {e}")

    def extract_product(self, product, response):
        return {
            'name': product.css('.name::text').get('').strip(),
            'price': self.extract_price(product),
            'url': response.urljoin(product.css('a::attr(href)').get()),
            'site': self.name,
        }

    def extract_price(self, product):
        import re
        price_str = product.css('.price::text').get('')
        # Remove thousands separators so "1,299.99" parses as 1299.99
        match = re.search(r'\d+(?:\.\d+)?', price_str.replace(',', ''))
        return float(match.group(0)) if match else None

## Error Handling

Always wrap in try-except. Log errors with URL.

## Scrapy Settings

CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'

ITEM_PIPELINES = {
    'scrapers.pipelines.TextCleaningPipeline': 100,
    'scrapers.pipelines.DuplicateRemovalPipeline': 200,
    'scrapers.pipelines.DatabaseStoragePipeline': 300,
}

## Common Tasks

Adding a New Spider:
1. Create scrapers/spiders/spider_<sitename>.py
2. Inherit from scrapy.Spider
3. Define name, allowed_domains, start_urls
4. Implement parse() method
5. Test: scrapy crawl <spider-name>

Testing a Spider:
scrapy shell 'https://example.com'
response.css('.selector').get()

Running All Spiders:
scrapy crawl amazon && scrapy crawl ebay && scrapy crawl walmart

## Dependencies

- Scrapy 2.9+
- Python 3.8+
- sqlalchemy
- psycopg2-binary
- pandas

Install: pip install -r requirements.txt

This is a complete agents.md. When Claude Code reads this, it understands everything about your project. Every suggestion follows your patterns.

agents.md teaches AI agents how your project works.

Markdown file in project root. YAML frontmatter + Markdown body. Describes patterns, conventions, architecture. Read automatically by Claude Code, GitHub Copilot, Cursor.

AI writes code matching your style. No rewriting AI-generated code. Faster onboarding for new developers. Consistent project standards. Works with all AI coding agents.

Keep it under 2,000 words. Be specific, not vague. Update when you change patterns. Use progressive disclosure. Explain the why, not just examples.

One agents.md = everyone on same page. Faster code review. Easier onboarding. Enforce standards without debate.

You're not just building a scraper. You're building a scraper that teaches AI how to build scrapers.

The future of development is humans and AI working together. agents.md is how you teach the AI to be a good collaborator.

The investment in documenting patterns pays dividends every time you ask for AI help.

source & further reading

dev.to (original article)
