What it actually takes to wire a new file type through a layered RAG stack: validation, extraction, MIME quirks, and a schema migration.
One of the things I deliberately built into Sift is a clear separation of concerns across the stack. The API validates what's allowed, the pipeline handles extraction, the frontend controls what you can drop, and the database enforces the constraint at rest. That structure pays off when you add features, but it also means "add Markdown support" touches more layers than you might expect.
This post walks through what that change actually looked like: every file it touched, the subtle bug that emerged in the browser, and the schema migration that closed the loop.
Before making any change, I mapped out every place the stack has an opinion about file types:
1. DocumentsFunction validates the extension before issuing a presigned upload URL
2. DocumentService maps extensions to S3 content types for signing
3. extract_handler.py dispatches on extension to the right extractor
4. UploadDropzone controls what the file picker and drag-and-drop accept
5. useDocuments.ts sets the Content-Type header on the S3 PUT
6. CHECK constraint on documents.file_type enforces the allowed set at rest

Each of these is independent. A gap in any one of them causes a different failure mode: the API rejects the upload at step 1, the pipeline silently fails at step 3, the S3 PUT returns a 403 at step 5, or the database insert throws at step 6.
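One way to keep the six layers in the list above from drifting apart is a small consistency check. The sketch below is hypothetical: the layer names mirror the list, but the hard-coded sets stand in for values that would really have to be read out of each source file, and `find_gaps` is an illustrative name, not something that exists in Sift.

```python
# Hypothetical snapshot of what each layer accepts. In a real check these
# sets would be parsed from the source files rather than hard-coded here.
LAYERS: dict[str, set[str]] = {
    "DocumentsFunction": {"pdf", "docx", "csv", "txt", "md"},
    "DocumentService": {"pdf", "docx", "csv", "txt", "md"},
    "extract_handler.py": {"pdf", "docx", "csv", "txt", "md"},
    "UploadDropzone": {"pdf", "docx", "csv", "txt", "md"},
    "useDocuments.ts": {"pdf", "docx", "csv", "txt", "md"},
    # Simulates the state before the schema migration has been applied.
    "documents.file_type CHECK": {"pdf", "docx", "csv", "txt"},
}


def find_gaps(layers: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return, per layer, the extensions another layer accepts but it does not."""
    everything = set().union(*layers.values())
    return {
        name: everything - allowed
        for name, allowed in layers.items()
        if everything - allowed
    }


if __name__ == "__main__":
    for layer, missing in find_gaps(LAYERS).items():
        print(f"{layer} is missing: {sorted(missing)}")
```

Run as part of a test suite, a check like this would flag the forgotten migration before a Markdown upload ever reached the database.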
The entry point is DocumentsFunction.cs. When a client calls POST /documents/upload-url, the function checks the extension against an allowed set before doing anything else:
// Before
var allowedExtensions = new HashSet<string> { "pdf", "docx", "csv", "txt" };
// After
var allowedExtensions = new HashSet<string> { "pdf", "docx", "csv", "txt", "md" };
And in DocumentService.cs, the content-type map that drives the presigned URL signing:
private static readonly Dictionary<string, string> ContentTypes = new()
{
    ["pdf"] = "application/pdf",
    ["docx"] = "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ["csv"] = "text/csv",
    ["txt"] = "text/plain",
    ["md"] = "text/markdown", // added
};
The presigned URL is signed for a specific Content-Type. Whatever the client sends in the PUT must match exactly; S3 rejects mismatches with a 403. That matching requirement is what caused the browser bug described below.
The extraction handler in extract_handler.py already had a clean dispatch pattern:
if ext == "pdf":
    text, page_count = _extract_pdf(content)
elif ext == "docx":
    text, page_count = _extract_docx(content)
elif ext == "csv":
    text, page_count = _extract_csv(content)
elif ext == "txt":
    text = content.decode("utf-8", errors="replace")
    page_count = 1
else:
    raise ValueError(f"Unsupported file type: {ext}")
Markdown is plain UTF-8 text with formatting syntax. The right extraction strategy here is just to read it as-is and let the chunker and embedding model deal with the content. The Markdown syntax (headers, bold, code fences) doesn't hurt RAG quality: the embedding model handles natural text well enough that the extra punctuation is minor noise rather than a real problem.
The change was a one-liner:
elif ext in ("txt", "md"):
    text = content.decode("utf-8", errors="replace")
    page_count = 1
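The same dispatch can also be written as a lookup table, which makes the shared handling of `txt` and `md` explicit. This is a sketch rather than Sift's actual handler: `_decode_text`, `EXTRACTORS`, and `extract` are illustrative names, the lower-casing of the extension is an added assumption, and the PDF, DOCX, and CSV extractors are left out because their implementations aren't shown here.

```python
from typing import Callable

Extractor = Callable[[bytes], tuple[str, int]]


def _decode_text(content: bytes) -> tuple[str, int]:
    """Decode plain text or Markdown as UTF-8 and report a single page."""
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    return content.decode("utf-8", errors="replace"), 1


# Extension -> extractor. Only the text-like types are wired up in this sketch.
EXTRACTORS: dict[str, Extractor] = {
    "txt": _decode_text,
    "md": _decode_text,
}


def extract(ext: str, content: bytes) -> tuple[str, int]:
    """Dispatch on the lower-cased extension; unknown types raise ValueError."""
    try:
        extractor = EXTRACTORS[ext.lower()]
    except KeyError:
        raise ValueError(f"Unsupported file type: {ext}") from None
    return extractor(content)
```

With a table like this, supporting another plain-text format would be one new dictionary entry, while the unsupported-type error stays in a single place.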
If you wanted to strip Markdown syntax before embedding, you could run the content through a parser like mistune and extract just the text nodes. For the scope of this project that's premature: the current approach works and keeps the pipeline dependency-free for this case.
UploadDropzone.tsx uses the accept prop to tell the browser which files to allow:
// Before
accept={{ "application/pdf": [".pdf"], "text/plain": [".txt"], ... }}
// After
accept={{ "application/pdf": [".pdf"], "text/plain": [".txt"], "text/markdown": [".md"], ... }}
This controls both the native file picker dialog (what's visible and selectable) and drag-and-drop validation (what gets highlighted vs. rejected). Both are client-side UX, and neither is a security boundary, but they matter for usability.
This is the part that didn't work on the first try.
When the frontend uploads a file to S3, it needs to set the Content-Type header to match whatever the presigned URL was signed for. The original code used file.type, the MIME type the browser reports for the selected file:
await axios.put(uploadUrl, file, {
  headers: { "Content-Type": file.type },
});
For PDFs this works fine. For .md files it doesn't. file.type for Markdown is unreliable across browsers: Chrome reports "" (empty string), while some environments report "text/plain". The presigned URL was signed for "text/markdown", so an empty string or "text/plain" in the Content-Type header causes S3 to reject the PUT with a 403.
The fix is to not trust file.type at all. Instead, derive the content type from the file extension:
const MIME_MAP: Record<string, string> = {
  pdf: "application/pdf",
  docx: "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  csv: "text/csv",
  txt: "text/plain",
  md: "text/markdown",
};

function getMimeType(filename: string): string {
  const ext = filename.split(".").pop()?.toLowerCase() ?? "";
  return MIME_MAP[ext] ?? "application/octet-stream";
}

// Usage
await axios.put(uploadUrl, file, {
  headers: { "Content-Type": getMimeType(file.name) },
});
This makes the MIME type determination consistent across every browser and OS, and it keeps the frontend and API in sync: both derive the content type from the file extension, using matching extension-to-type maps.
The lesson here is general: file.type is a hint from the operating system, not a contract. For any workflow where the content type has downstream consequences (like S3 presigned URL validation), always derive it yourself from the extension.
The database had a CHECK constraint on documents.file_type from the initial schema:
CHECK (file_type IN ('pdf', 'csv', 'docx', 'txt'))
Without updating this, every Markdown document insert would fail with a constraint violation, and only after the file had already been uploaded to S3 and the pipeline had started. The migration is straightforward:
-- migrations/002_add_md_file_type.sql
ALTER TABLE documents
    DROP CONSTRAINT documents_file_type_check,
    ADD CONSTRAINT documents_file_type_check
        CHECK (file_type IN ('pdf', 'csv', 'docx', 'txt', 'md'));
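Drop the old constraint, add the new one. Because this is a DDL statement with no table rewrite, it completes nearly instantly on Aurora Serverless v2 for a small table. It is not entirely size-independent, though: assuming the PostgreSQL-compatible engine (which the documents_file_type_check naming suggests), adding a CHECK constraint scans the existing rows to validate them, so the runtime grows with the table.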
The migration is applied via scripts/migrate-local.py against the RDS Data API. No VPN, no bastion host, just a boto3 execute_statement call.
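For readers who haven't used the Data API, here is a minimal sketch of what such a script could look like. It is not the actual contents of scripts/migrate-local.py: `apply_migration` and `main` are illustrative names, and the ARNs and database name are placeholders. It sends the whole file as a single statement, which fits this migration because the ALTER TABLE is one statement.

```python
from pathlib import Path


def apply_migration(client, sql: str, *, resource_arn: str, secret_arn: str, database: str) -> dict:
    """Run one SQL statement through the RDS Data API and return the raw response."""
    return client.execute_statement(
        resourceArn=resource_arn,
        secretArn=secret_arn,
        database=database,
        sql=sql,
    )


def main() -> None:
    # Imported lazily so apply_migration can be exercised with a stub client.
    import boto3

    client = boto3.client("rds-data")
    sql = Path("migrations/002_add_md_file_type.sql").read_text(encoding="utf-8")
    apply_migration(
        client,
        sql,
        # Placeholder values: substitute the real cluster ARN, secret ARN, and database.
        resource_arn="arn:aws:rds:REGION:ACCOUNT_ID:cluster:CLUSTER_NAME",
        secret_arn="arn:aws:secretsmanager:REGION:ACCOUNT_ID:secret:SECRET_NAME",
        database="DATABASE_NAME",
    )
```

Calling `main()` from the script's entry point would apply the migration using whatever AWS credentials are configured locally. Passing the client in as an argument keeps the helper easy to test without touching AWS.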
Here's the complete path for a Markdown upload after all the changes:
1. Drop README.md onto the dropzone: accepted because text/markdown is in the accept map
2. The frontend calls POST /documents/upload-url with { fileName: "README.md", fileType: "md" }
3. The API validates "md" against the allowed set, maps it to text/markdown, and issues a presigned S3 PUT URL
4. The frontend PUTs the file with Content-Type: text/markdown, derived from the extension map
5. S3 emits an Object Created event → EventBridge → Step Functions
6. The ExtractText Lambda reads the S3 object, sees extension md, decodes UTF-8, done
7. MarkReady sets status to ready; the database insert succeeds because the CHECK constraint now includes md

Seven steps that could each fail independently. The layered change ensures they all agree.
The question comes up: since Markdown is plain text, why not just rename it to .txt at upload time and skip all of this?
The immediate answer is that it loses information. A .txt file and a Markdown file aren't the same thing: Markdown has structure (headers, lists, code blocks) that could eventually be used to improve chunking or embedding quality. Stripping it at upload time forecloses that option.
The deeper answer is that the explicit md type in the database lets you query by format later. If you want to add a Markdown-aware chunker that splits on heading boundaries instead of character windows, you can target those documents specifically. A generic txt label makes that kind of targeted improvement impossible without re-classifying every document.
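To make the heading-boundary idea concrete, here is a rough sketch of such a chunker. It is an illustration under simplifying assumptions, not part of Sift: `chunk_by_heading` is a hypothetical name, only ATX headings (`#` through `######`) count as boundaries, fence tracking is approximate, and there is no limit on chunk size.

```python
import re

_HEADING = re.compile(r"^#{1,6}\s+\S")
_FENCE = re.compile(r"^(`{3,}|~{3,})")


def chunk_by_heading(markdown: str) -> list[str]:
    """Split Markdown into chunks that each start at an ATX heading."""
    chunks: list[list[str]] = [[]]
    in_fence = False
    for line in markdown.splitlines():
        if _FENCE.match(line.lstrip()):
            # Toggle on every fence line; "#" lines inside a fence are code, not headings.
            in_fence = not in_fence
        elif not in_fence and _HEADING.match(line) and chunks[-1]:
            # A heading outside a fence starts a new chunk.
            chunks.append([])
        chunks[-1].append(line)
    joined = ("\n".join(lines).strip() for lines in chunks)
    # Drop chunks that turned out to be empty or whitespace-only.
    return [chunk for chunk in joined if chunk]
```

Any text before the first heading becomes its own chunk. A production version would probably also cap chunk length and fall back to character windows for very long sections.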
If you wanted to add EPUB or HTML support, the same checklist applies:
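1. Add the extension to the allowed set in DocumentsFunction.cs
2. Add its content type to the map in DocumentService.cs
3. Add an extraction branch in extract_handler.py
4. Add the MIME type to the accept map in UploadDropzone.tsx
5. Add the extension to the MIME map used by useDocuments.ts
6. Update the file_type CHECK constraint with a migration

Each layer is independently responsible for its concern. The checklist is mechanical, but that's actually the goal: a new file type shouldn't require rethinking the architecture.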