How Web Pulse works
The exact pipeline that turns ~58 RSS feeds into structured JSON, summaries, topics, entities, and sentiment โ refreshed every 30 minutes.
Sources sources
We poll a curated list of ~58 AI/tech RSS feeds. The full list is at /sources. New sources are added when a real human evaluates them โ no algorithmic firehose.
Sources are tagged by domain (org-blog, news-org, indie, vendor-doc) and weighted by past accuracy. No paywalled or aggregator-of-aggregators feeds.
Ingest ingest
Every 30 minutes a scheduler hits all feeds with HEAD requests, then GET for changed ones (RFC 5005). Articles are canonical-URL-deduped at intake โ the same story behind 3 trackers becomes one row.
Failures are retried with backoff up to 1h. Persistent failures mark the source degraded on /sources.
Process process
Each new article runs through DeepSeek (deepseek-chat) for:
- Summary: 2-4 sentences, factual, no opinion.
- Entity extraction: people, orgs, products, places. Linked to canonical entity records.
Classify classify
Articles get a primary topic + up to 3 secondary topics from a 24-topic ontology. Sentiment is positive / neutral / negative based on LLM rubric.
The 24 topics are curated โ not auto-clustered. Cluster-evolution happens manually as the AI landscape shifts.
Index index
Storage is SQLite (WAL mode) with FTS5 for full-text search. Vector embeddings (mxbai-embed-large-v1) for semantic similarity. ETag computed from content hash โ clients get 304s for free.
Corrections corrections
If a summary is wrong or an entity is misidentified, email oss@wpnews.pro. We fix it within 24h and post a correction in the article's history. No takedowns โ only corrections.