{"slug": "adding-markdown-support-end-to-end-part-7", "title": "Adding Markdown Support End-to-End (Part 7)", "summary": "A developer added Markdown file support to the Sift RAG stack, touching validation, extraction, MIME types, and a database schema migration. The change required updates across six layers including API validation, content-type mapping, extraction dispatch, frontend file picker, and a CHECK constraint. A subtle browser bug emerged from mismatched Content-Type headers in presigned S3 URLs.", "body_md": "*What it actually takes to wire a new file type through a layered RAG stack — validation, extraction, MIME quirks, and a schema migration.*\n\nOne of the things I deliberately built into Sift is a clear separation of concerns across the stack. The API validates what's allowed, the pipeline handles extraction, the frontend controls what you can drop, and the database enforces the constraint at rest. That structure pays off when you add features — but it also means \"add Markdown support\" touches more layers than you might expect.\n\nThis post walks through what that change actually looked like: every file it touched, the subtle bug that emerged in the browser, and the schema migration that closed the loop.\n\nBefore making any change, I mapped out every place the stack has an opinion about file types:\n\n1. `DocumentsFunction` validates the extension before issuing a presigned upload URL\n2. `DocumentService` maps extensions to S3 content types for signing\n3. `extract_handler.py` dispatches on extension to the right extractor\n4. `UploadDropzone` controls what the file picker and drag-and-drop accept\n5. `useDocuments.ts` sets the `Content-Type` header on the S3 PUT\n6. `CHECK` constraint on `documents.file_type` enforces the allowed set at rest\n\nEach of these is independent. 
A gap in any one of them causes a different failure mode: the API rejects the upload at step 1, the pipeline silently fails at step 3, the S3 PUT returns a 403 at step 5, or the database insert throws at step 6.\n\nThe entry point is `DocumentsFunction.cs`. When a client calls `POST /documents/upload-url`, the function checks the extension against an allowed set before doing anything else:\n\n``` csharp\n// Before\nvar allowedExtensions = new HashSet<string> { \"pdf\", \"docx\", \"csv\", \"txt\" };\n\n// After\nvar allowedExtensions = new HashSet<string> { \"pdf\", \"docx\", \"csv\", \"txt\", \"md\" };\n```\n\nAnd in `DocumentService.cs`, the content-type map that drives the presigned URL signing:\n\n```\nprivate static readonly Dictionary<string, string> ContentTypes = new()\n{\n    [\"pdf\"]  = \"application/pdf\",\n    [\"docx\"] = \"application/vnd.openxmlformats-officedocument.wordprocessingml.document\",\n    [\"csv\"]  = \"text/csv\",\n    [\"txt\"]  = \"text/plain\",\n    [\"md\"]   = \"text/markdown\",   // added\n};\n```\n\nThe presigned URL is signed for a specific `Content-Type`. Whatever the client sends in the PUT must match exactly — S3 rejects mismatches with a 403. That matching requirement is what caused the browser bug described below.\n\nThe extraction handler in `extract_handler.py` already had a clean dispatch pattern:\n\n```\nif ext == \"pdf\":\n    text, page_count = _extract_pdf(content)\nelif ext == \"docx\":\n    text, page_count = _extract_docx(content)\nelif ext == \"csv\":\n    text, page_count = _extract_csv(content)\nelif ext == \"txt\":\n    text       = content.decode(\"utf-8\", errors=\"replace\")\n    page_count = 1\nelse:\n    raise ValueError(f\"Unsupported file type: {ext}\")\n```\n\nMarkdown is plain UTF-8 text with formatting syntax. The right extraction strategy here is just to read it as-is and let the chunker and embedding model deal with the content. 
The Markdown syntax (headers, bold, code fences) doesn't hurt RAG quality — the embedding model handles natural text well enough that the punctuation is just noise rather than a problem.\n\nThe change was a one-liner:\n\n```\nelif ext in (\"txt\", \"md\"):\n    text       = content.decode(\"utf-8\", errors=\"replace\")\n    page_count = 1\n```\n\nIf you wanted to strip Markdown syntax before embedding, you could run the content through a parser like `mistune` and extract just the text nodes. For the scope of this project that's premature — the current approach works and keeps the pipeline dependency-free for this case.\n\n`UploadDropzone.tsx` uses the `accept` prop to tell the browser which files to allow:\n\n```\n// Before\naccept={{ \"application/pdf\": [\".pdf\"], \"text/plain\": [\".txt\"], ... }}\n\n// After\naccept={{ \"application/pdf\": [\".pdf\"], \"text/plain\": [\".txt\"], \"text/markdown\": [\".md\"], ... }}\n```\n\nThis controls both the native file picker dialog (what's visible and selectable) and drag-and-drop validation (what gets highlighted vs. rejected). Both are client-side UX — neither is a security boundary — but they matter for usability.\n\nThis is the part that didn't work on the first try.\n\nWhen the frontend uploads a file to S3, it needs to set the `Content-Type` header to match whatever the presigned URL was signed for. The original code used `file.type` — the MIME type the browser reports for the selected file:\n\n```\nawait axios.put(uploadUrl, file, {\n  headers: { \"Content-Type\": file.type },\n});\n```\n\nFor PDFs this works fine. For `.md` files it doesn't. `file.type` for Markdown is unreliable across browsers: Chrome reports `\"\"` (empty string), some environments report `\"text/plain\"`. The presigned URL was signed for `\"text/markdown\"`. 
An empty string or `\"text/plain\"` in the `Content-Type` header causes S3 to reject the PUT with a 403.\n\nThe fix is to not trust `file.type` at all. Instead, derive the content type from the file extension:\n\n``` ts\nconst MIME_MAP: Record<string, string> = {\n  pdf:  \"application/pdf\",\n  docx: \"application/vnd.openxmlformats-officedocument.wordprocessingml.document\",\n  csv:  \"text/csv\",\n  txt:  \"text/plain\",\n  md:   \"text/markdown\",\n};\n\nfunction getMimeType(filename: string): string {\n  const ext = filename.split(\".\").pop()?.toLowerCase() ?? \"\";\n  return MIME_MAP[ext] ?? \"application/octet-stream\";\n}\n\n// Usage\nawait axios.put(uploadUrl, file, {\n  headers: { \"Content-Type\": getMimeType(file.name) },\n});\n```\n\nThis makes the MIME type determination consistent across every browser and OS, and it keeps the frontend and API in sync: both derive the content type from the extension, so `MIME_MAP` has to mirror the `ContentTypes` map in `DocumentService.cs`.\n\nThe lesson here is general: `file.type` is a hint from the browser and operating system, not a contract. For any workflow where the content type has downstream consequences (like S3 presigned URL validation), always derive it yourself from the extension.\n\nThe database had a CHECK constraint on `documents.file_type` from the initial schema:\n\n```\nCHECK (file_type IN ('pdf', 'csv', 'docx', 'txt'))\n```\n\nWithout updating this, every Markdown document insert would fail with a constraint violation — after the file had already been uploaded to S3 and the pipeline had started. The migration is straightforward:\n\n```\n-- migrations/002_add_md_file_type.sql\nALTER TABLE documents\n  DROP CONSTRAINT documents_file_type_check,\n  ADD  CONSTRAINT documents_file_type_check\n       CHECK (file_type IN ('pdf', 'csv', 'docx', 'txt', 'md'));\n```\n\nDrop the old constraint, add the new one. 
Because Aurora Serverless v2 is the backing store and this is a DDL statement with no data rewrite, it completes nearly instantly on a small table. One caveat: adding a CHECK constraint generally makes the database validate the existing rows (PostgreSQL scans the table unless the constraint is added as `NOT VALID`), so the cost is not entirely independent of table size.\n\nThe migration is applied via `scripts/migrate-local.py` against the RDS Data API. No VPN, no bastion host — just a boto3 `execute_statement` call.\n\nHere's the complete path for a Markdown upload after all the changes:\n\n1. A user drops `README.md` onto the dropzone — accepted because `text/markdown` is in the `accept` map\n2. The frontend calls `POST /documents/upload-url` with `{ fileName: \"README.md\", fileType: \"md\" }`\n3. The API checks `\"md\"` against the allowed set, maps it to `text/markdown`, and issues a presigned S3 PUT URL\n4. The frontend PUTs the file to S3 with `Content-Type: text/markdown` derived from the extension map\n5. S3 emits an `Object Created` event → EventBridge → Step Functions\n6. The `ExtractText` Lambda reads the S3 object, sees extension `md`, decodes UTF-8 — done\n7. `MarkReady` sets status to `ready`; the database insert succeeds because the CHECK constraint now includes `md`\n\nSeven steps that could each fail independently. The layered change ensures they all agree.\n\nWhy not just `.txt`?\n\nThe question comes up: since Markdown is plain text, why not just rename it to `.txt` at upload time and skip all of this?\n\nThe immediate answer is that it loses information. A `.txt` file and a Markdown file aren't the same thing — Markdown has structure (headers, lists, code blocks) that could eventually be used to improve chunking or embedding quality. Stripping it at upload time forecloses that option.\n\nThe deeper answer is that the explicit `md` type in the database lets you query by format later. If you want to add a Markdown-aware chunker that splits on heading boundaries instead of character windows, you can target those documents specifically. 
A generic `txt` label makes that kind of targeted improvement impossible without re-classifying every document.\n\nIf you wanted to add EPUB or HTML support, the same checklist applies:\n\n1. Add the extension to the allowed set in `DocumentsFunction.cs`\n2. Add its content type to the map in `DocumentService.cs`\n3. Add an extraction branch in `extract_handler.py`\n4. Add its MIME type to the `accept` map in `UploadDropzone.tsx`\n5. Make sure the S3 PUT in `useDocuments.ts` sends the matching `Content-Type`\n6. Extend the `file_type` CHECK constraint with a migration\n\nEach layer is independently responsible for its concern. The checklist is mechanical, but that's actually the goal — a new file type shouldn't require rethinking the architecture.", "url": "https://wpnews.pro/news/adding-markdown-support-end-to-end-part-7", "canonical_source": "https://dev.to/josh_blair/adding-markdown-support-end-to-end-part-7-24g1", "published_at": "2026-06-15 20:45:32+00:00", "updated_at": "2026-06-15 21:02:32.962228+00:00", "lang": "en", "topics": ["developer-tools", "artificial-intelligence", "machine-learning", "large-language-models", "ai-infrastructure"], "entities": ["Sift", "S3", "Markdown", "DocumentsFunction", "DocumentService", "UploadDropzone", "extract_handler.py", "useDocuments.ts"], "alternates": {"html": "https://wpnews.pro/news/adding-markdown-support-end-to-end-part-7", "markdown": "https://wpnews.pro/news/adding-markdown-support-end-to-end-part-7.md", "text": "https://wpnews.pro/news/adding-markdown-support-end-to-end-part-7.txt", "jsonld": "https://wpnews.pro/news/adding-markdown-support-end-to-end-part-7.jsonld"}}