# Preprocessing Different File Types

Haystack released a tutorial on June 17, 2026, teaching beginners how to build an indexing pipeline that preprocesses markdown, PDF, and text files using the `FileTypeRouter` component. The pipeline converts files into documents, cleans and splits them, creates embeddings, and writes to a document store, with an optional RAG pipeline using a Hugging Face API key.

Tutorial: Preprocessing Different File Types

- Last Updated: June 17, 2026
- Level: Beginner
- Time to complete: 15 minutes
- Goal: After completing this tutorial, you’ll have learned how to build an indexing pipeline that will preprocess files based on their file type, using the `FileTypeRouter`.

💡 Optional: After creating the indexing pipeline in this tutorial, there is an optional section that shows you how to create a RAG pipeline on top of the document store you just created. You must have a Hugging Face API Key for this section.

## Components Used

- `FileTypeRouter`: This component will help you route files based on their corresponding MIME type to different components
- `MarkdownToDocument`: This component will help you convert markdown files into Haystack Documents
- `PyPDFToDocument`: This component will help you convert pdf files into Haystack Documents
- `TextFileToDocument`: This component will help you convert text files into Haystack Documents
- `DocumentCleaner` (optional): This component will help you to make Documents more readable by removing extra whitespaces etc.
- `DocumentSplitter`: This component will help you to split your Document into chunks
- `SentenceTransformersDocumentEmbedder`: This component will help you create embeddings for Documents.
- `DocumentWriter`: This component will help you write Documents into the DocumentStore

## Overview

In this tutorial, you’ll build an indexing pipeline that preprocesses different types of files (markdown, txt and pdf). Each file will have its own `FileConverter`.
The rest of the indexing pipeline is fairly standard: split the documents into chunks, trim whitespace, create embeddings and write them to a Document Store. Optionally, you can keep going to see how to use these documents in a query pipeline as well.

## Installing dependencies

```bash
%%bash

pip install haystack-ai huggingface-api-haystack
pip install sentence-transformers-haystack
pip install markdown-it-py mdit_plain pypdf
pip install gdown
```

## Download All Files

The files you will use in this tutorial are stored in a [GDrive folder](https://drive.google.com/drive/folders/1n9yqq5Gl_HWfND5bTlrCwAOycMDt5EMj). Either download the files directly from the GDrive folder or run the code below. If you’re running this tutorial on Colab, you’ll find the downloaded files under the “/recipe_files” folder in the “files” tab on the left.

Just like most real life data, these files are a mishmash of different types.

```python
import gdown

url = "https://drive.google.com/drive/folders/1n9yqq5Gl_HWfND5bTlrCwAOycMDt5EMj"
output_dir = "recipe_files"

# Download the whole folder into ./recipe_files
gdown.download_folder(url, quiet=True, output=output_dir)
```

## Create a Pipeline to Index Documents

Next, you’ll create a pipeline to index documents. To keep things uncomplicated, you’ll use an `InMemoryDocumentStore`, but this approach would also work with any other flavor of `DocumentStore`.

You’ll need a different file converter class for each file type in our data sources: `.pdf`, `.txt`, and `.md` in this case. Our `FileTypeRouter` connects each file type to the proper converter.
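To make the routing idea concrete before the real components appear, here is a minimal standard-library sketch of grouping files by MIME type. It is not how `FileTypeRouter` is implemented; the function name `route_by_mime_type`, the `"unclassified"` bucket name, and the example file names are assumptions made purely for illustration.

```python
import mimetypes
from collections import defaultdict
from pathlib import Path

# Register Markdown explicitly, since not every platform maps ".md" by default.
mimetypes.add_type("text/markdown", ".md")


def route_by_mime_type(paths, mime_types):
    """Group paths by guessed MIME type (hypothetical helper, not a Haystack API).

    Paths whose guessed type is not in `mime_types` go to "unclassified".
    """
    routed = defaultdict(list)
    for path in paths:
        guessed, _encoding = mimetypes.guess_type(str(path))
        key = guessed if guessed in mime_types else "unclassified"
        routed[key].append(Path(path))
    return dict(routed)


# Hypothetical file names, used only for illustration.
example = route_by_mime_type(
    ["vegan_flan.md", "lasagna.pdf", "hemp_cheese.txt", "photo.png"],
    ["text/plain", "application/pdf", "text/markdown"],
)
```

Each key of `example` plays the role of one router output, which is the same shape the pipeline relies on when each MIME type is connected to its own converter. The actual components for this tutorial are set up next.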
```python
from haystack.components.writers import DocumentWriter
from haystack.components.converters import MarkdownToDocument, PyPDFToDocument, TextFileToDocument
from haystack.components.preprocessors import DocumentSplitter, DocumentCleaner
from haystack.components.routers import FileTypeRouter
from haystack_integrations.components.embedders.sentence_transformers import SentenceTransformersDocumentEmbedder
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

# One router output per MIME type, and one converter per file type
file_type_router = FileTypeRouter(mime_types=["text/plain", "application/pdf", "text/markdown"])
text_file_converter = TextFileToDocument()
markdown_converter = MarkdownToDocument()
pdf_converter = PyPDFToDocument()
```

From there, the steps to this indexing pipeline are a bit more standard. The `DocumentCleaner` removes extra whitespace. Then the `DocumentSplitter` breaks the documents into chunks of 150 words, with an overlap of 50 words between consecutive chunks to avoid missing context.

```python
document_cleaner = DocumentCleaner()
document_splitter = DocumentSplitter(split_by="word", split_length=150, split_overlap=50)
```

Now you’ll add a `SentenceTransformersDocumentEmbedder` to create embeddings from the documents. As the last step in this pipeline, the `DocumentWriter` will write them to the `InMemoryDocumentStore`.

```python
document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
document_writer = DocumentWriter(document_store)
```

After creating all the components, add them to the indexing pipeline.
```python
preprocessing_pipeline = Pipeline()
preprocessing_pipeline.add_component(instance=file_type_router, name="file_type_router")
preprocessing_pipeline.add_component(instance=text_file_converter, name="text_file_converter")
preprocessing_pipeline.add_component(instance=markdown_converter, name="markdown_converter")
preprocessing_pipeline.add_component(instance=pdf_converter, name="pypdf_converter")
preprocessing_pipeline.add_component(instance=document_cleaner, name="document_cleaner")
preprocessing_pipeline.add_component(instance=document_splitter, name="document_splitter")
preprocessing_pipeline.add_component(instance=document_embedder, name="document_embedder")
preprocessing_pipeline.add_component(instance=document_writer, name="document_writer")
```

Next, connect them 👇

```python
# Route each MIME type to its converter
preprocessing_pipeline.connect("file_type_router.text/plain", "text_file_converter.sources")
preprocessing_pipeline.connect("file_type_router.application/pdf", "pypdf_converter.sources")
preprocessing_pipeline.connect("file_type_router.text/markdown", "markdown_converter.sources")

# Feed all converted documents through cleaning, splitting, embedding and writing
preprocessing_pipeline.connect("text_file_converter", "document_cleaner")
preprocessing_pipeline.connect("pypdf_converter", "document_cleaner")
preprocessing_pipeline.connect("markdown_converter", "document_cleaner")
preprocessing_pipeline.connect("document_cleaner", "document_splitter")
preprocessing_pipeline.connect("document_splitter", "document_embedder")
preprocessing_pipeline.connect("document_embedder", "document_writer")
```

Let’s test this pipeline with a few recipes I’ve written. Are you getting hungry yet?

```python
from pathlib import Path

# Pass every downloaded file to the router
preprocessing_pipeline.run({"file_type_router": {"sources": list(Path(output_dir).glob("**/*"))}})
```

🎉 If you only wanted to learn how to preprocess documents, you can stop here! If you want to see an example of using those documents in a RAG pipeline, read on.
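As a recap of the splitting step in the pipeline above, the following sketch shows what word-based chunking with overlap amounts to, using the same numbers (`split_length=150`, `split_overlap=50`). It is a plain-Python illustration under simplifying assumptions, not the `DocumentSplitter` implementation; the helper name `split_words` is hypothetical, and the real component may treat whitespace and the final chunk differently.

```python
def split_words(text, split_length=150, split_overlap=50):
    """Split text into word chunks that share `split_overlap` words (illustrative helper)."""
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be non-negative and smaller than split_length")

    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current window already reaches the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


# Synthetic 400-word input, used only to show how the windows overlap.
sample_text = " ".join(f"w{i}" for i in range(400))
sample_chunks = split_words(sample_text)
```

With these settings each window advances by 100 words, so the last 50 words of one chunk reappear as the first 50 words of the next. That repetition is what keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.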
## (Optional) Build a pipeline to query documents

Now, let’s build a RAG pipeline that answers queries based on the documents you just created in the section above. For this step, we will be using the [HuggingFaceAPIChatGenerator](https://docs.haystack.deepset.ai/docs/huggingfaceapichatgenerator), so you must have a [Hugging Face API Key](https://huggingface.co/settings/tokens) for this section. We will be using the `Qwen/Qwen2.5-7B-Instruct` model.

```python
import os
from getpass import getpass

# Prompt for the token only if it is not already set in the environment
if "HF_API_TOKEN" not in os.environ:
    os.environ["HF_API_TOKEN"] = getpass("Enter Hugging Face token:")
```

In this step you’ll build a query pipeline to answer questions about the documents.

This pipeline takes the prompt, searches the document store for relevant documents, and passes those documents along to the LLM to formulate an answer.

⚠️ Notice how we used `sentence-transformers/all-MiniLM-L6-v2` to create embeddings for our documents before. This is why we will be using the same model to embed incoming questions.

```python
from haystack_integrations.components.embedders.sentence_transformers import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.huggingface_api import HuggingFaceAPIChatGenerator

# Jinja template: the retrieved documents and the question are filled in at run time
template = [
    ChatMessage.from_user(
        """
Answer the questions based on the given context.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{ question }}
Answer:
"""
    )
]

pipe = Pipeline()
pipe.add_component("embedder", SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"))
pipe.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store))
pipe.add_component("chat_prompt_builder", ChatPromptBuilder(template=template))
pipe.add_component(
    "llm",
    HuggingFaceAPIChatGenerator(
        api_type="serverless_inference_api",
        api_params={"model": "Qwen/Qwen2.5-7B-Instruct", "provider": "together"},
    ),
)

pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever", "chat_prompt_builder.documents")
pipe.connect("chat_prompt_builder.prompt", "llm.messages")
```

Try it out yourself by running the code below. If all has gone well, you should have a complete shopping list from all the recipe sources. 🧂🥥🧄

```python
question = (
    "What ingredients would I need to make vegan keto eggplant lasagna, vegan persimmon flan, and vegan hemp cheese?"
)

pipe.run({"embedder": {"text": question}, "chat_prompt_builder": {"question": question}})
```

{’llm’: {‘replies’: ChatMessage role=