{"slug": "extract-plain-text-from-medium-posts-for-rag-and-search-indexes", "title": "Extract Plain Text from Medium Posts for RAG and Search Indexes", "summary": "A developer created an API that extracts clean plain text from Medium articles, stripping navigation, clap bars, and scripts for use in embeddings, summarization, and full-text search. The tool provides separate endpoints for article content and metadata, enabling chunking of text for vector databases and RAG pipelines. The solution supports integration with OpenAI embeddings, Ollama, or other models for LLM training and retrieval applications.", "body_md": "Chunk clean article content for embeddings, summarization, and full-text search, skipping nav, clap bars, and scripts.\n\n**HTML embeds** are for humans; **plain text** is for chunking, embeddings, and summarization. One call should return body text without nav, clap bars, or script tags.\n\nTool outcome: `ingest-medium-article.ts` → chunked documents in your vector DB.\n\n`GET /article/{id}/content` → plain text. `GET /article/{id}` for title, tags, author metadata.\n\n```js\nconst API = 'https://api.zenndra.com';\nconst headers = { Authorization: `Bearer ${process.env.ZENNDRA_API_KEY}` };\n\n// Fetch plain-text content and metadata in parallel for one article.\nexport async function fetchArticleText(articleId) {\n  const [contentRes, metaRes] = await Promise.all([\n    fetch(`${API}/article/${articleId}/content`, { headers }),\n    fetch(`${API}/article/${articleId}`, { headers }),\n  ]);\n\n  // Fail loudly on HTTP errors instead of parsing an error body.\n  if (!contentRes.ok || !metaRes.ok) {\n    throw new Error(`Fetch failed for article ${articleId}: ${contentRes.status}/${metaRes.status}`);\n  }\n\n  const { content } = await contentRes.json();\n  const meta = await metaRes.json();\n\n  return {\n    id: articleId,\n    title: meta.title,\n    tags: meta.tags,\n    text: content,\n  };\n}\n\n// Split text into word-based chunks of `size` words; neighbors share `overlap` words.\nexport function chunkText(text, { size = 800, overlap = 100 } = {}) {\n  // A step of zero or less would never advance the loop.\n  if (overlap >= size) throw new RangeError('overlap must be smaller than size');\n  const words = text.trim().split(/\\s+/);\n  const chunks = [];\n  for (let i = 0; i < words.length; i += size - overlap) {\n    chunks.push(words.slice(i, i + size).join(' '));\n    // Stop once a chunk reaches the end, so the tail is not repeated.\n    if (i + size >= words.length) break;\n  }\n  return chunks.filter(Boolean);\n}\n```\n\nWire `chunkText` to [OpenAI 
embeddings](https://platform.openai.com/docs/guides/embeddings), [Ollama](https://ollama.com/), or your host’s model: swap the vector client, keep the ingest shape.\n\nKeep `article_id` and `chunk_index` in metadata for citations.\n\nFor human-readable syndication, see [embed articles](https://./embed-medium-articles-on-website.md), which has a different threat model than LLM training.\n\nKeywords: `medium plain text api`, `medium rag pipeline`, `medium embeddings`, `medium article content extraction`, `llm medium`.", "url": "https://wpnews.pro/news/extract-plain-text-from-medium-posts-for-rag-and-search-indexes", "canonical_source": "https://dev.to/zenndraapi/extract-plain-text-from-medium-posts-for-rag-and-search-indexes-14mm", "published_at": "2026-05-30 09:15:17+00:00", "updated_at": "2026-05-30 09:41:23.102314+00:00", "lang": "en", "topics": ["ai-tools", "ai-infrastructure", "ai-products", "artificial-intelligence", "large-language-models"], "entities": ["Zenndra", "OpenAI", "Ollama"], "alternates": {"html": "https://wpnews.pro/news/extract-plain-text-from-medium-posts-for-rag-and-search-indexes", "markdown": "https://wpnews.pro/news/extract-plain-text-from-medium-posts-for-rag-and-search-indexes.md", "text": "https://wpnews.pro/news/extract-plain-text-from-medium-posts-for-rag-and-search-indexes.txt", "jsonld": "https://wpnews.pro/news/extract-plain-text-from-medium-posts-for-rag-and-search-indexes.jsonld"}}