
How Web Pulse works

The exact pipeline that turns ~58 RSS feeds into structured JSON, summaries, topics, entities, and sentiment, refreshed every 30 minutes.

Sources

We poll a curated list of ~58 AI/tech RSS feeds. The full list is at /sources. New sources are added only after a real human evaluates them: no algorithmic firehose.

Sources are tagged by domain (org-blog, news-org, indie, vendor-doc) and weighted by past accuracy. No paywalled or aggregator-of-aggregators feeds.

Ingest

Every 30 minutes a scheduler hits all feeds with HEAD requests, then issues a GET for the ones that changed (HTTP validators and conditional requests are specified in RFC 9110). Articles are deduplicated by canonical URL at intake: the same story behind 3 tracking links becomes one row.
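
As an illustration of canonical-URL dedup, here is a minimal sketch, not the production code. The tracking-parameter list, the "www." stripping, and the function names canonical_url and dedupe are assumptions made for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical tracking parameters; the pipeline's real list is not documented here.
TRACKING_PARAMS = {"fbclid", "gclid", "ref", "mc_cid", "mc_eid"}
TRACKING_PREFIXES = ("utm_",)


def canonical_url(url: str) -> str:
    """Normalize a URL so that tracker variants of one story compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if host.startswith("www."):  # assumption: www and bare host serve the same content
        host = host[4:]
    port = parts.port
    default_port = (scheme == "http" and port == 80) or (scheme == "https" and port == 443)
    if port and not default_port:
        host = f"{host}:{port}"
    # Drop tracking parameters and sort the rest so parameter order does not matter.
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS and not k.lower().startswith(TRACKING_PREFIXES)
    )
    path = parts.path.rstrip("/") or "/"
    # The fragment is discarded: it never reaches the server.
    return urlunsplit((scheme, host, path, urlencode(query), ""))


def dedupe(urls):
    """Map each canonical URL to the first original URL seen for it."""
    seen = {}
    for url in urls:
        seen.setdefault(canonical_url(url), url)
    return seen
```

With this sketch, https://www.example.com/story?utm_source=x, https://example.com/story/?fbclid=abc and https://example.com/story?ref=tw#top all collapse to a single key.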

Failures are retried with backoff for up to 1 hour. Persistent failures mark the source as degraded on /sources.
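
One way to picture the retry policy is the sketch below. It assumes exponential backoff and reads "up to 1h" as a one-hour total retry window; the base delay, the growth factor, and the names backoff_delays and fetch_with_retry are hypothetical.

```python
import time


def backoff_delays(base_s: float = 60.0, factor: float = 2.0, budget_s: float = 3600.0):
    """Yield growing delays until the next wait would exceed the total budget."""
    elapsed = 0.0
    delay = base_s
    while elapsed + delay <= budget_s:
        yield delay
        elapsed += delay
        delay *= factor


def fetch_with_retry(fetch, sleep=time.sleep, **backoff_kwargs) -> bool:
    """Call fetch() until it returns True or the retry budget runs out.

    Returns False when every attempt failed, which is the point at which
    a caller would mark the source as degraded.
    """
    if fetch():
        return True
    for delay in backoff_delays(**backoff_kwargs):
        sleep(delay)
        if fetch():
            return True
    return False
```

Passing sleep as a parameter keeps the sketch testable without real waiting.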

Process

Each new article runs through DeepSeek (deepseek-chat) for:

  • Summary: 2-4 sentences, factual, no opinion.
  • Entity extraction: people, orgs, products, places. Linked to canonical entity records.
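
The two outputs above can be checked before they are stored. The following is a sketch under assumptions: the JSON shape, the entity type labels, and the naive sentence splitter are all hypothetical, and no model API call is shown.

```python
import json
import re
from dataclasses import dataclass, field

# Hypothetical labels for the four entity kinds named above.
ENTITY_TYPES = {"person", "org", "product", "place"}


@dataclass
class Entity:
    name: str
    type: str


@dataclass
class Processed:
    summary: str
    entities: list = field(default_factory=list)


def sentence_count(text: str) -> int:
    """Naive count: split on whitespace that follows ., ! or ?."""
    return len([s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s])


def parse_model_output(raw: str) -> Processed:
    """Parse the model's JSON reply and enforce the 2-4 sentence rule."""
    data = json.loads(raw)
    summary = data["summary"].strip()
    if not 2 <= sentence_count(summary) <= 4:
        raise ValueError("summary must be 2-4 sentences")
    entities = [
        Entity(e["name"].strip(), e["type"])
        for e in data.get("entities", [])
        if e.get("type") in ENTITY_TYPES  # silently drop unknown entity types
    ]
    return Processed(summary, entities)
```

Linking each entity to its canonical record would happen after this step and is not shown.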

Classify

Articles get a primary topic plus up to 3 secondary topics from a 24-topic ontology. Sentiment is positive / neutral / negative, based on an LLM rubric.
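
The constraints above (one primary topic, at most 3 secondary topics, three sentiment labels) can be enforced with a small validator. This is a sketch: the topic names are placeholders, not the real 24-topic ontology, and the function name validate is hypothetical.

```python
from dataclasses import dataclass

SENTIMENTS = ("positive", "neutral", "negative")
MAX_SECONDARY = 3
# Placeholder subset for illustration; the real ontology has 24 curated topics.
ONTOLOGY = frozenset({"models", "chips", "policy", "funding", "research"})


@dataclass(frozen=True)
class Classification:
    primary: str
    secondary: tuple
    sentiment: str


def validate(primary, secondary, sentiment, ontology=ONTOLOGY) -> Classification:
    """Reject labels outside the ontology and enforce the secondary-topic cap."""
    if primary not in ontology:
        raise ValueError(f"unknown primary topic: {primary}")
    cleaned = []
    for topic in secondary:
        if topic not in ontology:
            raise ValueError(f"unknown secondary topic: {topic}")
        # Drop duplicates and any repeat of the primary topic, keeping order.
        if topic != primary and topic not in cleaned:
            cleaned.append(topic)
    if len(cleaned) > MAX_SECONDARY:
        raise ValueError("at most 3 secondary topics are allowed")
    if sentiment not in SENTIMENTS:
        raise ValueError(f"unknown sentiment: {sentiment}")
    return Classification(primary, tuple(cleaned), sentiment)
```

Validating against a fixed set keeps a model from inventing topics that the curated ontology does not contain.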

The 24 topics are curated, not auto-clustered. The ontology is revised manually as the AI landscape shifts.

Index

Storage is SQLite (WAL mode) with FTS5 for full-text search. Vector embeddings (mxbai-embed-large-v1) support semantic similarity. The ETag is computed from a content hash, so clients that send it back get 304s for free.
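
A minimal sketch of the storage and caching ideas, using only the Python standard library. The table name articles_fts, its columns, the SHA-256 choice, and the function names are assumptions; the sketch also assumes the local SQLite build includes FTS5 and ignores weak validators and multi-value If-None-Match headers.

```python
import hashlib
import sqlite3


def open_db(path: str) -> sqlite3.Connection:
    """Open a file-backed database in WAL mode with an FTS5 index."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # WAL needs a real file, not :memory:
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS articles_fts USING fts5(title, summary)"
    )
    return conn


def search(conn: sqlite3.Connection, query: str):
    """Return matching titles, best match first."""
    rows = conn.execute(
        "SELECT title FROM articles_fts WHERE articles_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]


def etag_for(body: bytes) -> str:
    """Derive a strong ETag from the response body's content hash."""
    return '"' + hashlib.sha256(body).hexdigest()[:32] + '"'


def respond(body: bytes, if_none_match=None):
    """Return (status, etag, payload); 304 with no payload when the ETag matches."""
    tag = etag_for(body)
    if if_none_match == tag:
        return 304, tag, b""
    return 200, tag, body
```

Because the tag depends only on the bytes served, an unchanged response produces the same ETag on every refresh, and a client that echoes it in If-None-Match receives an empty 304.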

Corrections

If a summary is wrong or an entity is misidentified, email oss@wpnews.pro. We fix it within 24h and post a correction in the article's history. No takedowns, only corrections.