{"slug": "show-hn-takoqa-a-harness-to-get-a-swarm-of-agents-to-break-your-application", "title": "Show HN: TakoQA – A harness to get a swarm of agents to break your application", "summary": "TakoQA, a new open-source testing tool, uses a swarm of LLM-driven browser agents to autonomously break web applications by executing plain-language missions. The tool observes pages visually, decides actions via an LLM, and reports bugs with screenshots, video, and replay data, enabling functional, exploratory, UX, and regression testing without product-specific knowledge.", "body_md": "**A swarm of browser agents that breaks your web app before your users do.**\n\n*Plain-language missions in, real bugs out.*\n\ntakoqa drives a real Chromium browser against your running app, perceives each page the way a person does, and decides its next action with an LLM: clicking, typing, uploading, and exploring toward a goal you describe in plain language. Along the way it watches for broken behavior and reports what it finds, with screenshots, a video, and a step-by-step replay.\n\nThe engine knows nothing about any specific product. Everything app-specific\nlives in a single profile file, so pointing takoqa at a new app is just writing\na new `profiles/*.yaml`.\n\nEach step runs a four-beat loop:\n\n1. **Observe**: tag every visible interactive element with a ref number, plus a screenshot and the page text.\n2. **Decide**: the LLM is given that list (and the screenshot) and picks one human action, addressing elements by ref, never by CSS selector.\n3. **Act**: Playwright performs the action; the target is highlighted on-page first so the recording shows exactly what was clicked.\n4. **Check**: captured console errors, uncaught exceptions, and HTTP responses run through the oracles. 
A finding is raised when something looks broken.\n\nAt the end of each mission an LLM judge decides whether the user's goal was actually met and flags UX/quality issues even when the flow technically worked.\n\n- **Functional bugs**: JS exceptions, 5xx responses, console errors, crash text.\n- **Exploratory/edge cases**: give it a goal and no script; it wanders.\n- **UX/quality**: the judge flags confusing or degraded flows.\n- **Regressions**: every run is saved (JSON, screenshots, video, trace) for run-to-run comparison.\n\ntakoqa gets smarter the more it runs, without anyone editing the profile:\n\n- **Known-bugs baseline** (`--baseline`) classifies each finding as `new`/`known`/`muted` so a repeat run reports only what changed.\n- **Learned store**: during `--loop` the harness distills durable app facts from what it saw (routes that turned out to be gated, controls that never did anything, what each page actually offers, missions already tried) into a per-profile JSON sidecar. The next run merges the confident subset into the app map it hands the acting agent, so it stops re-discovering the same things. Facts need ≥2 sightings to count and decay if not re-seen, so a one-off flake never ossifies. Learnings inform the *agent* only, never the judge.\n- `--mute \"<kind|title>\" --as \"<reason>\"` marks a finding as a known non-bug. It is dropped from the report and the CI gate, and the reason is fed to the LLM judge as a \"do not flag\" exclusion next run, so a triaged non-bug stops coming back. (The reason is the *only* feedback signal allowed to reach the judge.)\n\nThe baseline (`baseline/`), recipes (`recipes/`), and learned store (`learned/`)\nare plain, human-inspectable JSON; delete an entry to forget it.\n\n```\nnpm install\nnpx playwright install chromium\n\n# Copy the template and point it at your app:\ncp profiles/example.yaml profiles/myapp.local.yaml   # *.local.yaml is gitignored\n\nANTHROPIC_API_KEY=sk-... 
npx tsx src/run.ts \\\n  --profile profiles/myapp.local.yaml --tag smoke\n```\n\nOutputs land in `runs/<profile>-<timestamp>/`:\n\n- `index.html`: self-contained replay (step timeline, screenshots, embedded video, and findings). Open it in any browser.\n- `findings.txt` / `run.json`: human- and machine-readable results.\n- `missions/<id>/video.webm` and `trace.zip`: per-mission recordings (`npx playwright show-trace <path>` for the time-travel viewer).\n\n| Flag | Effect |\n|---|---|\n| `--headed` | Watch the browser live |\n| `--tag <t>` | Run only missions with this tag |\n| `--base-url <url>` | Override the profile's baseUrl (local/staging/prod) |\n| `--no-record` | Skip video/trace for fast headless runs |\n| `--mock` | Run the loop with a scripted client (no API key) |\n\nA profile declares **intent and failure conditions**, not clicks. See\n[profiles/example.yaml](/Tako-Research/TakoQA/blob/main/profiles/example.yaml) for a documented template:\n`baseUrl`, an `auth` strategy, `personas` (who's driving), `invariants` (what\ncounts as a bug), and `missions` (goals + success criteria the LLM judge uses).\n\ntakoqa is verified against a deliberately-buggy fixture app, so no real app or API key is needed:\n\n```\nnpm test          # oracle unit tests + engine integration tests\nnpm run test:unit # fast, browserless oracle tests only\nnpm run selfeval  # absolute gate: does it catch the planted bugs? (see below)\nnpm run eval      # comparative gate: did it regress vs the previous state?\nnpm run metaeval  # meta gate: is every detector exercised AND protected?\n```\n\nThese are three gates on three different questions. `selfeval` asks *do we catch\nthe planted bugs* (absolute recall/precision). `eval` asks *did we get worse than\nlast time* (comparative, per-case). 
`metaeval` asks *would we even notice if a\ndetector broke* (coverage + mutation), the question the other two can't answer.\n\n`npm run selfeval` is the regression gate on takoqa's own coverage. It runs the\nreal engine over the planted-bug fixture in two passes (functional + security),\nscores the findings against a co-located ground-truth manifest\n(`test/fixture-manifest.ts`), and asserts full recall over every must-catch case\nwith zero false positives on the clean routes. A refactor that stops an oracle\nfrom firing, or starts crying wolf on a clean page, fails this gate and names\nthe exact case. Adding a planted route to the fixture forces a matching manifest\nentry, so coverage can't silently rot.\n\n`npm run eval` goes one step further than the absolute self-eval gate: it scores\nthe harness against the planted-bug fixture **and** diffs that score against the\nprevious committed record (`eval/eval_ledger.jsonl`), reporting the delta, not\njust the value. A per-case regression (a bug caught before, missed now) fails the\ngate even when aggregate recall is unchanged, which the absolute recall gate\ncan't see. Each record stamps git provenance + a byte-hash of the fixture, so a\nstale baseline over a different fixture simply stops being comparable. `npm run eval -- --record` appends a new record, so every accepted improvement becomes the\nprior state the next change is measured against.\n\n`npm run metaeval` gates the gate itself. 
The self-eval proves takoqa catches the\nplanted bugs, but it can't tell you whether every detector takoqa *ships* is\nactually exercised: a detector with no fixture case, or one always co-caught by\nanother kind, could quietly stop firing and both gates above would stay green.\nThe meta-eval answers two questions:\n\n- **Coverage**: is every deterministic detector kind exercised by a fixture case? `KIND_CLASS` (in `src/metaeval.ts`) classifies every `FindingKind` as a `detector` or an LLM/agent `judgment`; because it's an exhaustive map, adding a new kind is a *compile error* until it's classified, so a detector can't ship without a coverage decision.\n- **Mutation / ablation**: would the self-eval actually *fail* if a detector broke? For each detector it drops that kind's findings from a passing report and re-scores: if a previously-caught case now misses, the detector is `protected`; if the case stays caught (some other kind covers it), it's `shadowed`, meaning covered on paper but the eval is blind to it breaking.\n\nLike the comparative eval, it records to `eval/eval_ledger.jsonl` (as the\n`harness_meta` task) and diffs against the previous state, so a detector going\n`protected → unprotected` fails the gate. `npm run metaeval -- --record` appends a\nnew baseline.\n\nRoute discovery is pluggable, so takoqa points at any app, not just Next.js.\n`--explore` / `--matrix` accept `--app-dir <path>` (read a Next.js app-router\ntree), `--routes a,b,c` (an explicit, app-agnostic list), or `--sitemap <url>`\n(extract same-origin paths from a sitemap.xml). A profile can pin the same via\n`explore.source` (or keep the `explore.appDir` shorthand).\n\n```\ndocker build -t takoqa .\ndocker run --rm --network host -e ANTHROPIC_API_KEY=sk-... 
\\\n  -v \"$PWD/runs:/app/runs\" takoqa --profile profiles/example.yaml --tag smoke\n```\n\nSee [docker-compose.example.yml](/Tako-Research/TakoQA/blob/main/docker-compose.example.yml) for wiring takoqa\ninto an app's compose stack.\n\nMIT. See [LICENSE](/Tako-Research/TakoQA/blob/main/LICENSE).", "url": "https://wpnews.pro/news/show-hn-takoqa-a-harness-to-get-a-swarm-of-agents-to-break-your-application", "canonical_source": "https://github.com/Tako-Research/TakoQA", "published_at": "2026-06-25 03:36:45+00:00", "updated_at": "2026-06-25 04:14:23.025643+00:00", "lang": "en", "topics": ["ai-agents", "large-language-models", "developer-tools", "computer-vision", "generative-ai"], "entities": ["TakoQA", "Playwright", "Chromium", "Anthropic", "LLM"], "alternates": {"html": "https://wpnews.pro/news/show-hn-takoqa-a-harness-to-get-a-swarm-of-agents-to-break-your-application", "markdown": "https://wpnews.pro/news/show-hn-takoqa-a-harness-to-get-a-swarm-of-agents-to-break-your-application.md", "text": "https://wpnews.pro/news/show-hn-takoqa-a-harness-to-get-a-swarm-of-agents-to-break-your-application.txt", "jsonld": "https://wpnews.pro/news/show-hn-takoqa-a-harness-to-get-a-swarm-of-agents-to-break-your-application.jsonld"}}