{"slug": "why-i-built-a-cli-to-automate-web-research-instead-of-relying-on-browser-tabs", "title": "Why I built a CLI to automate web research instead of relying on browser tabs", "summary": "A developer built a modular CLI tool called 'research-loop' to automate web research, addressing the inefficiency of manually collecting and synthesizing information from browser tabs. The tool runs a complete research loop—search, scrape, clean, synthesize, and report—without requiring a GPU or database, and supports scheduled monitoring with notifications via Discord or Telegram. It uses a five-file architecture for separation of concerns and includes fallback parsing strategies for non-article pages.", "body_md": "A few months ago I noticed something annoying about how I worked: I was spending more time *collecting* information than actually thinking about it.\n\nThe pattern was always the same. Open a search engine, open a dozen tabs, skim past the SEO filler and cookie banners, copy the paragraphs that actually mattered into a doc, paste the whole mess into an LLM and ask it to make sense of things. Then, a week later, do it again because whatever I was tracking had changed.\n\nAt some point I stopped asking \"how do I do this faster\" and started asking why I was doing it by hand at all.\n\nChatGPT and Perplexity are fine for a single question. They're worse at the part I actually needed help with, which was repetition: running the same research loop on a schedule, keeping a record of what changed, and getting a notification when it did. Neither tool is built to sit in the background and check on a topic for you.\n\nPlain scraping scripts have the opposite problem. They get you raw HTML, not understanding. You still have to strip out nav bars and footers by hand, and the moment you point one at a list-style page like Hacker News instead of a blog post, it falls apart.\n\nAnd bookmarking is just deferring the problem. 
A folder of forty saved links isn't research, it's homework you haven't done yet.\n\nI wanted something in between: automated enough to skip the tab-hoarding, but still producing something I could read and trust, not just a black-box answer.\n\nIt's a modular CLI that runs the whole research loop (search, scrape, clean, synthesize, report) on its own, and stays lightweight enough to run on a laptop with no GPU and no database.\n\nA single run looks like this: you give it a topic and a focus area (what you specifically want answered), it searches the web, pulls and cleans the pages, synthesizes a report, and writes it to disk. There's also a loop mode, so the same query can re-run every few hours and ping you on Discord or Telegram if you want to monitor something over time instead of researching it once.\n\nI deliberately didn't build this as one big script. It's five files, each doing one job, called in sequence:\n\n```\nmain.py        → terminal UI and orchestration\nscraper.py     → search + concurrent crawling + HTML parsing\nanalyzer.py    → synthesis (AI or offline)\nnotifier.py    → saving reports, sending alerts\nconfig_manager → reading/writing settings\n```\n\n`main.py` doesn't know anything about how scraping works internally, and `scraper.py` doesn't know anything about Discord webhooks. That separation made it much easier to add the offline summarizer later without touching the scraping code at all, and it's the kind of decision that only pays off once you try to change something six weeks in.\n\n**Getting clean text out of arbitrary HTML.** `readability-lxml` is good at finding \"the article\" inside a page, but it assumes the page *is* an article. Point it at a Hacker News thread or a GitHub repo listing and it often returns almost nothing, because there's no single article body to extract. 
The fix was to treat readability as the first attempt, not the only one: if it returns under 200 characters of usable text, the code falls back to a structural BeautifulSoup pass that looks for `<article>`, `<main>`, or common content selectors instead. Two different parsing strategies, picked automatically based on what the page actually looks like.\n\n**Supporting three different LLM providers without three different code paths.** Gemini, OpenAI, and Claude all have different request shapes, but the thing I actually cared about (send scraped context, get back a structured Markdown report) is identical across all of them. So each provider gets its own thin function that builds the right payload and hits the right endpoint, but all three feed into the same `synthesize_topics` router, and all three fall back to the same offline summarizer if the API call fails for any reason. The interface is the constant; the providers are interchangeable behind it.\n\n**An offline mode that's actually usable.** No API key, no internet dependency on a third-party model, still get a real report. This is where most of the actual algorithm work went: extract keywords from the query and focus area, score every sentence in the scraped text by keyword density and position (earlier sentences in a paragraph, earlier paragraphs in a document, weighted higher), then deduplicate near-identical sentences before assembling the top results into a report with an executive summary and per-source findings. It's not as fluent as an LLM-written summary, but it's not nothing either, and it means the tool works the moment you clone it.\n\n**Staying out of the database trap.** It would have been easy to reach for SQLite the moment I wanted history or saved searches. I didn't. Reports are timestamped Markdown and JSON files in a `reports/` folder, and saved search presets live in a plain `config.json`. You can read everything by double-clicking it. 
No schema, no migrations, no ORM.\n\nFocal Harvest isn't trying to replace search engines or chat-based AI assistants. It automates the mechanical part, gathering and organizing information, so you spend your attention on evaluating it instead of assembling it. If you want a single deep conversational answer to one question, this is the wrong tool. If you want a repeatable, schedulable research pipeline that produces a file you can actually keep, that's the gap it's filling.\n\nA few areas I haven't gotten to yet:\n\nIf any of that sounds interesting, or if you've built something in this space and have opinions about the architecture, I'd genuinely like to hear them. Issues and pull requests are open.\n\nIf you've got your own version of the tab-hoarding problem, I'd like to hear about it in the comments. What does your research loop look like, and where does it break down?", "url": "https://wpnews.pro/news/why-i-built-a-cli-to-automate-web-research-instead-of-relying-on-browser-tabs", "canonical_source": "https://dev.to/techno_neighbour/why-i-built-a-cli-to-automate-web-research-instead-of-relying-on-browser-tabs-5b35", "published_at": "2026-06-30 09:35:10+00:00", "updated_at": "2026-06-30 09:48:45.248116+00:00", "lang": "en", "topics": ["developer-tools", "artificial-intelligence", "large-language-models", "ai-agents"], "entities": ["ChatGPT", "Perplexity", "Gemini", "OpenAI", "Claude", "Discord", "Telegram", "BeautifulSoup"], "alternates": {"html": "https://wpnews.pro/news/why-i-built-a-cli-to-automate-web-research-instead-of-relying-on-browser-tabs", "markdown": "https://wpnews.pro/news/why-i-built-a-cli-to-automate-web-research-instead-of-relying-on-browser-tabs.md", "text": "https://wpnews.pro/news/why-i-built-a-cli-to-automate-web-research-instead-of-relying-on-browser-tabs.txt", "jsonld": "https://wpnews.pro/news/why-i-built-a-cli-to-automate-web-research-instead-of-relying-on-browser-tabs.jsonld"}}