{"slug": "agents-md-teaching-ai-agents-how-to-scrape-the-future-of-web-automation", "title": "agents.md: Teaching AI Agents How to Scrape (The Future of Web Automation)", "summary": "A developer has created a Markdown file called \"agents.md\" that teaches AI coding agents how to work on a specific project, solving the problem of AI-generated code not matching a developer's style or conventions. By placing this file in the project root, AI agents like Claude Code first read the developer's naming conventions, architectural decisions, and error-handling patterns before generating code, eliminating the need for manual rewrites. The approach mirrors how README.md documents projects for humans, with agents.md serving as a universal instruction set for AI agents across tools like GitHub Copilot, Cursor, and OpenAI Codex CLI.", "body_md": "You're building a Scrapy scraper. You ask Claude Code for help. \"Add pagination to my spider.\"\n\nClaude reads your code. It sees scraper.py, settings.py, the folder structure. Then it makes suggestions.\n\nBut the suggestions don't match your style. You use specific naming conventions. Your error handling is particular. Your Scrapy middleware is custom. Claude doesn't know any of this.\n\nSo it writes code that technically works but doesn't fit. You spend 30 minutes rewriting it to match your patterns. You ask again. Same problem. Claude doesn't learn. You have to explain everything every time.\n\nThis is frustrating. You want the AI to understand your project. Not just the code, but the philosophy. The patterns. The conventions. The way you actually work.\n\nThen you discover agents.md.\n\nA simple Markdown file. You drop it in your project root. It teaches AI agents how you work. Your naming conventions. Your architectural decisions. Your scraping patterns. Your error handling approach. Everything.\n\nNow when you ask Claude Code for help, it reads agents.md first. It understands your project. 
The code it writes matches your style immediately. No rewriting. No repeating yourself.\n\nYou're not building a scraper anymore. You're building a scraper that teaches AI how to build scrapers.\n\nThis is agents.md. And it's changing how developers work with AI.\n\nLet me show you.\n\nagents.md is a Markdown file that teaches AI coding agents how to work on your project.\n\nThink of it like this:\n\nREADME.md teaches humans about your project. What it does. How to install it. How to contribute.\n\nagents.md teaches AI about your project. The patterns you use. The decisions you made. How to write code that fits your style.\n\nBoth live in your project. Both are important. They serve different audiences.\n\nScrapy projects are complex. You have spiders, pipelines, middleware, settings. You have patterns for error handling. Conventions for naming. Architectural decisions about how data flows.\n\nWhen you ask AI for help, it sees all this complexity. But it doesn't understand the patterns. The philosophy. The way you think about scraping.\n\nagents.md fixes this.\n\nYou document your patterns once. AI reads it. Now every suggestion, every piece of generated code, follows your patterns automatically.\n\nThere are several variants of this concept:\n\nagents.md: Open standard for any AI coding agent. Works with GitHub Copilot, Cursor, OpenAI Codex CLI, and more.\n\nSKILL.md: Anthropic's format for specific reusable skills. More structured. Separate folder per skill.\n\nCLAUDE.md: Claude Code's specific variant (newer, more features).\n\nagents.md is the universal format. It works everywhere. 
This post focuses on agents.md.\n\nagents.md has two parts: YAML frontmatter and a Markdown body.\n\nHere's a minimal agents.md file:\n\n```\n---\nname: ecommerce-scraper\ndescription: Web scraper for e-commerce product data\n---\n\n# E-commerce Scraper\n\n## Project Structure\n\n- spiders/ - Scrapy spiders for different sites\n- pipelines.py - Data cleaning and storage\n- settings.py - Scrapy configuration\n- items.py - Item definitions\n\n## Naming Conventions\n\nSpiders are named: spider_<site_name>.py\n\nExamples: spider_amazon.py, spider_ebay.py\n\n## Error Handling\n\nAll network errors are caught and logged\n```\n\nThat's it. Simple. The agent reads this. Understands your project. Follows your style.\n\nname: Your project's identifier (used internally)\n\ndescription: Short description (50-100 chars). Tells the agent when to use this.\n\nOptional fields:\n\n```\n---\nname: ecommerce-scraper\ndescription: Web scraper for e-commerce product data\nversion: 1.0.0\nauthor: Your Name\ntags: scrapy, web-scraping, ecommerce\nlicense: MIT\n---\n```\n\nThese fields help the agent understand the project scope. 
They're optional but helpful for teams.\n\nHere's how to build an effective agents.md for a scraper.\n\nIn your project root:\n\n```\ntouch agents.md\n---\nname: product-scraper\ndescription: Scrapy spider for scraping product data\nversion: 1.0.0\nauthor: Your Name\ntags: scrapy, web-scraping, python\n---\n## Project Structure\n\n- scrapers/spiders/ - All Scrapy spiders\n  - One spider per website\n  - Named spider_<site>.py\n\n- scrapers/pipelines.py - Data processing\n  - TextCleaningPipeline - Normalize text\n  - DuplicateRemovalPipeline - Remove duplicates\n  - DatabaseStoragePipeline - Save to PostgreSQL\n\n- scrapers/items.py - Data schema definitions\n- scrapers/settings.py - Scrapy configuration\n- scrapers/middlewares.py - Custom middleware\n## Naming Conventions\n\nSpiders: spider_<site_name>.py\n- Example: spider_amazon.py, spider_ebay.py\n- Class: <SiteName>Spider (CamelCase)\n- Example: AmazonSpider, EbaySpider\n\nPipelines: <FunctionName>Pipeline\n- Example: TextCleaningPipeline, DuplicateRemovalPipeline\n\nItems: <SiteName>Item\n- Example: ProductItem, ReviewItem\n\nMethods: snake_case\n- parse_product(), extract_price(), clean_text()\n```\n\nInclude code examples showing your approach:\n\n```\n## Spider Patterns\n\n### Basic Spider Structure\n\nEvery spider follows this pattern:\n\nclass <SiteName>Spider(scrapy.Spider):\n    name = '<site-identifier>'\n    allowed_domains = ['<domain.com>']\n    start_urls = ['<url>']\n\n    def parse(self, response):\n        # Extract items\n        for item in response.css('<selector>'):\n            yield {\n                'name': item.css('.name::text').get(),\n                'price': item.css('.price::text').get(),\n            }\n\n        # Handle pagination\n        next_page = response.css('<next-page-selector>').get()\n        if next_page:\n            yield response.follow(next_page, self.parse)\n\n### Error Handling\n\nAlways wrap network calls in try-except:\n\ndef parse_product(self, response):\n  
  try:\n        name = response.css('.name::text').get()\n        if not name:\n            self.logger.warning(f\"No name found on {response.url}\")\n            return\n\n        price = response.css('.price::text').get()\n        # process...\n\n    except Exception as e:\n        self.logger.error(f\"Error parsing {response.url}: {e}\")\n        return\n\n### Price Extraction\n\nPrices should be extracted as floats:\n\ndef clean_price(self, price_str):\n    import re\n    match = re.search(r'\\d+\\.?\\d*', price_str)\n    return float(match.group(0)) if match else None\n\n### Data Validation\n\nCheck for required fields before yielding:\n\ndef parse(self, response):\n    for item in response.css('.product'):\n        product = {\n            'name': item.css('.name::text').get(),\n            'price': self.clean_price(item.css('.price::text').get()),\n            'url': response.url,\n        }\n\n        # Only yield if required fields present\n        if product['name'] and product['price']:\n            yield product\n## Scrapy Settings\n\nKey settings in settings.py:\n\nCONCURRENT_REQUESTS = 16\nCONCURRENT_REQUESTS_PER_DOMAIN = 2\nDOWNLOAD_DELAY = 2\nRANDOMIZE_DOWNLOAD_DELAY = True\nUSER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'\n\nPipelines (in order):\n1. TextCleaningPipeline (normalize text)\n2. DuplicateRemovalPipeline (remove duplicates)\n3. DatabaseStoragePipeline (save to database)\n## Common Tasks\n\n### Adding a New Spider\n\n1. Create scrapers/spiders/spider_<sitename>.py\n2. Inherit from scrapy.Spider\n3. Define name, allowed_domains, start_urls\n4. Implement parse() method\n5. Return items with required fields (name, price, url)\n6. Test locally: scrapy crawl <spider-name>\n\n### Adding a New Pipeline\n\n1. Create new class in pipelines.py\n2. Implement process_item() method\n3. Add to ITEM_PIPELINES in settings.py\n4. 
Assign integer priority (lower = earlier)\n\n### Debugging a Spider\n\nDebug specific spider:\nscrapy crawl spider_name -a debug=True\n\nSee rendered HTML:\nscrapy shell 'https://example.com'\nresponse.css('.selector').get()\n\nCheck all items before pipeline:\nITEM_PIPELINES = {}\n## Dependencies\n\nCore:\n- Scrapy 2.9+\n- Python 3.8+\n\nData Processing:\n- pandas (data cleaning)\n- sqlalchemy (database)\n- psycopg2 (PostgreSQL)\n\nTesting:\n- pytest\n- pytest-scrapy\n\nInstall all:\npip install -r requirements.txt\n```\n\nDocument what your items actually look like:\n\n```\n## Item Format\n\nAll items follow this structure:\n\n{\n    'name': str,              # Product name (required)\n    'price': float,           # Price in USD (required)\n    'original_price': float,  # Before discount (optional)\n    'rating': float,          # 0-5 stars (optional)\n    'review_count': int,      # Number of reviews (optional)\n    'url': str,               # Source URL (required)\n    'site': str,              # Which site (amazon, ebay, etc)\n    'scraped_at': str,        # ISO datetime\n}\n```\n\nClaude Code automatically reads agents.md in your project root.\n\nJust ask for help and Claude will follow your patterns.\n\nExample:\n\n\"Add a spider for Target.com following the existing patterns\"\n\nClaude reads agents.md. Knows your naming conventions. Your error handling. Your spider template. Generates code that fits perfectly.\n\nCreate agents.md in your repo root. GitHub Copilot reads it in VS Code/GitHub.\n\nWhen you start typing a new spider, Copilot autocompletes following your patterns.\n\nSame as GitHub Copilot. Cursor reads agents.md automatically.\n\nYou can reference agents.md when using Codex.\n\nStart small. 
Add more as you need it.\n\n```\n---\nname: my-scraper\ndescription: Scrapy scraper for product data\n---\n\n## Project Structure\n\n- spiders/ - Scrapy spiders\n- pipelines.py - Data processing\n- items.py - Data schema\n\n## Naming\n\nSpiders: spider_<site>.py\nMethods: snake_case\n\n## Error Handling\n\nWrap in try-except. Log errors with URL.\n```\n\nStart with this. Get it working. Then expand.\n\nAdd spider template, pipeline patterns, common tasks.\n\nAdd full reference, all patterns, all conventions, troubleshooting.\n\nAdd complex patterns, edge cases, performance tips.\n\nYou write a 5,000-word agents.md. Too much. AI can't process it all.\n\nFix: Keep it under 2,000 words. Use progressive disclosure. Reference external files:\n\n\"See docs/advanced-patterns.md for complex scraping scenarios.\"\n\n\"Use good names.\"\n\nToo vague. AI doesn't know what \"good\" means.\n\nFix: Be specific. Show examples.\n\n\"Spiders: spider_<site>.py. Classes: <SiteName>Spider. Methods: snake_case.\"\n\nYou update your patterns but forget to update agents.md.\n\nAI follows the old patterns from the file.\n\nFix: Update agents.md whenever you change patterns. Commit it to Git.\n\nYou show an example but give no context.\n\nFix: Explain why. Show both example and reasoning.\n\n\"Remember to use kwargs for flexibility.\"\n\nNew developers don't understand this.\n\nFix: Explain for beginners. Show what it does.\n\nWhen multiple developers work on the scraper, agents.md becomes crucial.\n\nAll developers read the same agents.md. Everyone follows the same patterns.\n\nNo more debates: \"Should this be snake_case or CamelCase?\" It's in agents.md.\n\nA new developer joins. You say: \"Read agents.md. It explains how we work.\"\n\nThey read one file. Now they can contribute following your patterns immediately.\n\nThe reviewer checks whether code follows agents.md patterns. 
Reduces comments about style.\n\nFocus on logic, not formatting.\n\nAdd this to agents.md:\n\n\"AI will check:\n\nWhen you ask Claude Code for help, it checks these automatically.\n\nHere's a complete agents.md for an e-commerce scraper:\n\n```\n---\nname: ecommerce-scraper\ndescription: Scrapy spider for scraping product data\nversion: 1.0.0\nauthor: Your Team\ntags: scrapy, web-scraping, ecommerce, python\nlicense: MIT\n---\n\n# E-commerce Product Scraper\n\nThis Scrapy project scrapes product data from multiple e-commerce websites.\n\n## Quick Facts\n\n- Framework: Scrapy 2.9+\n- Python: 3.8+\n- Database: PostgreSQL\n- Sites: Amazon, eBay, Walmart, Best Buy\n- Update Frequency: Daily via cron job\n\n## Project Structure\n\necommerce-scraper/\n├── scrapers/\n│   ├── spiders/\n│   │   ├── spider_amazon.py\n│   │   ├── spider_ebay.py\n│   │   └── spider_walmart.py\n│   ├── items.py\n│   ├── pipelines.py\n│   ├── settings.py\n│   └── middlewares.py\n├── requirements.txt\n└── agents.md\n\n## Data Flow\n\nStart URLs\n    ↓\nParse Product Page\n    ↓\nExtract: name, price, rating, reviews, url\n    ↓\nValidation Pipeline\n    ↓\nText Cleaning\n    ↓\nDeduplication\n    ↓\nDatabase Storage\n\n## Item Structure\n\nAll items follow this structure:\n\n{\n    'name': str,              # Product name (required)\n    'price': float,           # Price in USD (required)\n    'original_price': float,  # Before discount (optional)\n    'rating': float,          # 0-5 stars (optional)\n    'review_count': int,      # Number of reviews (optional)\n    'url': str,               # Source URL (required)\n    'site': str,              # Which site\n    'scraped_at': str,        # ISO datetime\n}\n\n## Naming Conventions\n\nSpiders: spider_<site>.py\nClasses: <SiteName>Spider (CamelCase)\nMethods: snake_case\n\nExamples:\n- Spider file: spider_amazon.py\n- Class: AmazonSpider\n- Method: extract_price()\n\n## Spider Template\n\nclass <SiteName>Spider(scrapy.Spider):\n    name = 
'<site-slug>'\n    allowed_domains = ['<domain.com>']\n    start_urls = ['<start-url>']\n\n    def parse(self, response):\n        try:\n            # Extract products\n            for product in response.css('<selector>'):\n                yield self.extract_product(product, response)\n\n            # Handle pagination\n            next_page = response.css('<next-selector>').get()\n            if next_page:\n                yield response.follow(next_page, self.parse)\n\n        except Exception as e:\n            self.logger.error(f\"Error parsing {response.url}: {e}\")\n\n    def extract_product(self, product, response):\n        return {\n            'name': product.css('.name::text').get('').strip(),\n            'price': self.extract_price(product),\n            'url': response.urljoin(product.css('a::attr(href)').get()),\n            'site': self.name,\n        }\n\n    def extract_price(self, product):\n        import re\n        price_str = product.css('.price::text').get('')\n        match = re.search(r'\\d+\\.?\\d*', price_str)\n        return float(match.group(0)) if match else None\n\n## Error Handling\n\nAlways wrap in try-except. Log errors with URL.\n\n## Scrapy Settings\n\nCONCURRENT_REQUESTS = 16\nCONCURRENT_REQUESTS_PER_DOMAIN = 2\nDOWNLOAD_DELAY = 2\nRANDOMIZE_DOWNLOAD_DELAY = True\nUSER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'\n\nITEM_PIPELINES = {\n    'scrapers.pipelines.TextCleaningPipeline': 100,\n    'scrapers.pipelines.DuplicateRemovalPipeline': 200,\n    'scrapers.pipelines.DatabaseStoragePipeline': 300,\n}\n\n## Common Tasks\n\nAdding a New Spider:\n1. Create scrapers/spiders/spider_<sitename>.py\n2. Inherit from scrapy.Spider\n3. Define name, allowed_domains, start_urls\n4. Implement parse() method\n5. 
Test: scrapy crawl <spider-name>\n\nTesting a Spider:\nscrapy shell 'https://example.com'\nresponse.css('.selector').get()\n\nRunning All Spiders:\nscrapy crawl amazon && scrapy crawl ebay && scrapy crawl walmart\n\n## Dependencies\n\n- Scrapy 2.9+\n- Python 3.8+\n- sqlalchemy\n- psycopg2-binary\n- pandas\n\nInstall: pip install -r requirements.txt\n```\n\nThis is a complete agents.md. When Claude Code reads this, it understands everything about your project. Every suggestion follows your patterns.\n\nagents.md teaches AI agents how your project works.\n\nMarkdown file in project root. YAML frontmatter + Markdown body. Describes patterns, conventions, architecture. Read automatically by Claude Code, GitHub Copilot, Cursor.\n\nAI writes code matching your style. No rewriting AI-generated code. Faster onboarding for new developers. Consistent project standards. Works with all AI coding agents.\n\nKeep it under 2,000 words. Be specific, not vague. Update when you change patterns. Use progressive disclosure. Explain the why, not just examples.\n\nOne agents.md = everyone on same page. Faster code review. Easier onboarding. Enforce standards without debate.\n\nYou're not just building a scraper. You're building a scraper that teaches AI how to build scrapers.\n\nThe future of development is humans and AI working together. 
agents.md is how you teach the AI to be a good collaborator.\n\n**Next Steps:**\n\nThe investment in documenting patterns pays dividends every time you ask for AI help.", "url": "https://wpnews.pro/news/agents-md-teaching-ai-agents-how-to-scrape-the-future-of-web-automation", "canonical_source": "https://dev.to/ikram_khan/-agentsmd-teaching-ai-agents-how-to-scrape-the-future-of-web-automation-4266", "published_at": "2026-05-28 10:36:47+00:00", "updated_at": "2026-05-28 10:53:13.928526+00:00", "lang": "en", "topics": ["ai-agents", "ai-tools", "large-language-models"], "entities": ["Claude Code", "Scrapy", "agents.md"], "alternates": {"html": "https://wpnews.pro/news/agents-md-teaching-ai-agents-how-to-scrape-the-future-of-web-automation", "markdown": "https://wpnews.pro/news/agents-md-teaching-ai-agents-how-to-scrape-the-future-of-web-automation.md", "text": "https://wpnews.pro/news/agents-md-teaching-ai-agents-how-to-scrape-the-future-of-web-automation.txt", "jsonld": "https://wpnews.pro/news/agents-md-teaching-ai-agents-how-to-scrape-the-future-of-web-automation.jsonld"}}