Preprocessing Different File Types

Haystack released a tutorial on June 17, 2026, teaching beginners how to build an indexing pipeline that preprocesses markdown, PDF, and text files using the FileTypeRouter component. The pipeline converts files into documents, cleans and splits them, creates embeddings, and writes to a document store, with an optional RAG pipeline using a Hugging Face API key.

Last Updated: June 17, 2026

- Level: Beginner
- Time to complete: 15 minutes
- Goal: After completing this tutorial, you’ll have learned how to build an indexing pipeline that will preprocess files based on their file type, using the FileTypeRouter.

💡 Optional: After creating the indexing pipeline in this tutorial, there is an optional section that shows you how to create a RAG pipeline on top of the document store you just created. You must have a Hugging Face API Key for this section.

Components Used

- FileTypeRouter: This component will help you route files based on their corresponding MIME type to different components
- MarkdownToDocument: This component will help you convert markdown files into Haystack Documents
- PyPDFToDocument: This component will help you convert pdf files into Haystack Documents
- TextFileToDocument: This component will help you convert text files into Haystack Documents
- DocumentCleaner (optional): This component will help you to make Documents more readable by removing extra whitespaces etc.
- DocumentSplitter: This component will help you to split your Document into chunks
- SentenceTransformersDocumentEmbedder: This component will help you create embeddings for Documents.
- DocumentWriter: This component will help you write Documents into the DocumentStore

Overview

In this tutorial, you’ll build an indexing pipeline that preprocesses different types of files (markdown, txt and pdf). Each file will have its own FileConverter. The rest of the indexing pipeline is fairly standard: trim whitespace, split the documents into chunks, create embeddings and write them to a Document Store.

Optionally, you can keep going to see how to use these documents in a query pipeline as well.
Installing dependencies

```bash
%%bash

pip install haystack-ai huggingface-api-haystack
pip install sentence-transformers-haystack
pip install markdown-it-py mdit_plain pypdf
pip install gdown
```

Download All Files

Files that you will use in this tutorial are stored in a GDrive folder (https://drive.google.com/drive/folders/1n9yqq5Gl_HWfND5bTlrCwAOycMDt5EMj). Either download files directly from the GDrive folder or run the code below. If you’re running this tutorial on colab, you’ll find the downloaded files under the “/recipe_files” folder in the “files” tab on the left.

Just like most real life data, these files are a mishmash of different types.

```python
import gdown

url = "https://drive.google.com/drive/folders/1n9yqq5Gl_HWfND5bTlrCwAOycMDt5EMj"
output_dir = "recipe_files"

# Download the whole shared folder into output_dir
gdown.download_folder(url, quiet=True, output=output_dir)
```

Create a Pipeline to Index Documents

Next, you’ll create a pipeline to index documents. To keep things uncomplicated, you’ll use an InMemoryDocumentStore, but this approach would also work with any other flavor of DocumentStore.

You’ll need a different file converter class for each file type in our data sources: .pdf, .txt, and .md in this case. Our FileTypeRouter connects each file type to the proper converter.
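Before building the real components, the routing idea can be sketched with the standard library alone. This is only an illustration of grouping sources by MIME type, not how FileTypeRouter is implemented: the `SUFFIX_TO_MIME` table, the `route_by_mime_type` helper, and the file names are all hypothetical, and the "unclassified" bucket is this sketch's way of handling files that match none of the requested types.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical suffix-to-MIME table for this sketch only; the real
# FileTypeRouter works out MIME types itself.
SUFFIX_TO_MIME = {
    ".txt": "text/plain",
    ".pdf": "application/pdf",
    ".md": "text/markdown",
}


def route_by_mime_type(paths, mime_types):
    """Group paths by MIME type; anything not requested lands in 'unclassified'."""
    routed = defaultdict(list)
    for path in paths:
        mime = SUFFIX_TO_MIME.get(Path(path).suffix.lower())
        key = mime if mime in mime_types else "unclassified"
        routed[key].append(Path(path))
    return dict(routed)


# Hypothetical file names, used only to show the grouping.
sources = ["vegan_flan.pdf", "hemp_cheese.md", "lasagna.txt", "photo.jpg"]
routed = route_by_mime_type(sources, ["text/plain", "application/pdf", "text/markdown"])
for mime, files in routed.items():
    print(mime, [f.name for f in files])
```

Each group would then be handed to the converter that understands that file type, which is the role the converter components play in the pipeline below.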
```python
from haystack.components.writers import DocumentWriter
from haystack.components.converters import MarkdownToDocument, PyPDFToDocument, TextFileToDocument
from haystack.components.preprocessors import DocumentSplitter, DocumentCleaner
from haystack.components.routers import FileTypeRouter
from haystack_integrations.components.embedders.sentence_transformers import SentenceTransformersDocumentEmbedder
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
# Route each source to an output named after its MIME type
file_type_router = FileTypeRouter(mime_types=["text/plain", "application/pdf", "text/markdown"])
text_file_converter = TextFileToDocument()
markdown_converter = MarkdownToDocument()
pdf_converter = PyPDFToDocument()
```

From there, the steps to this indexing pipeline are a bit more standard. The DocumentCleaner removes extra whitespace. Then the DocumentSplitter breaks the documents into chunks of 150 words, with a 50-word overlap to avoid missing context.

```python
document_cleaner = DocumentCleaner()
document_splitter = DocumentSplitter(split_by="word", split_length=150, split_overlap=50)
```

Now you’ll add a SentenceTransformersDocumentEmbedder to create embeddings from the documents. As the last step in this pipeline, the DocumentWriter will write them to the InMemoryDocumentStore.

```python
document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
document_writer = DocumentWriter(document_store)
```

After creating all the components, add them to the indexing pipeline.
```python
preprocessing_pipeline = Pipeline()
preprocessing_pipeline.add_component(instance=file_type_router, name="file_type_router")
preprocessing_pipeline.add_component(instance=text_file_converter, name="text_file_converter")
preprocessing_pipeline.add_component(instance=markdown_converter, name="markdown_converter")
preprocessing_pipeline.add_component(instance=pdf_converter, name="pypdf_converter")
preprocessing_pipeline.add_component(instance=document_cleaner, name="document_cleaner")
preprocessing_pipeline.add_component(instance=document_splitter, name="document_splitter")
preprocessing_pipeline.add_component(instance=document_embedder, name="document_embedder")
preprocessing_pipeline.add_component(instance=document_writer, name="document_writer")
```

Next, connect them 👇

```python
# Each MIME-type output of the router feeds the matching converter
preprocessing_pipeline.connect("file_type_router.text/plain", "text_file_converter.sources")
preprocessing_pipeline.connect("file_type_router.application/pdf", "pypdf_converter.sources")
preprocessing_pipeline.connect("file_type_router.text/markdown", "markdown_converter.sources")
# All converters feed the shared clean -> split -> embed -> write chain
preprocessing_pipeline.connect("text_file_converter", "document_cleaner")
preprocessing_pipeline.connect("pypdf_converter", "document_cleaner")
preprocessing_pipeline.connect("markdown_converter", "document_cleaner")
preprocessing_pipeline.connect("document_cleaner", "document_splitter")
preprocessing_pipeline.connect("document_splitter", "document_embedder")
preprocessing_pipeline.connect("document_embedder", "document_writer")
```

Let’s test this pipeline with a few recipes I’ve written. Are you getting hungry yet?

```python
from pathlib import Path

preprocessing_pipeline.run({"file_type_router": {"sources": list(Path(output_dir).glob("**/*"))}})
```

🎉 If you only wanted to learn how to preprocess documents, you can stop here! If you want to see an example of using those documents in a RAG pipeline, read on.
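To make the splitter settings used in the indexing pipeline (150-word chunks with a 50-word overlap) more concrete, here is a minimal standard-library sketch of word-based splitting with overlap. The `split_words` helper is hypothetical and written for illustration only; the actual DocumentSplitter may treat whitespace and the final chunk differently.

```python
def split_words(text, split_length=150, split_overlap=50):
    """Split text into chunks of split_length words, where each chunk
    repeats the last split_overlap words of the previous one."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this window already reached the end of the text
    return chunks


# A synthetic 400-word "document" made of numbered tokens.
text = " ".join(f"w{i}" for i in range(400))
chunks = split_words(text)
for chunk in chunks:
    tokens = chunk.split()
    print(len(tokens), tokens[0], tokens[-1])
```

With these settings the window advances 100 words at a time, so the 400-word input yields four chunks of 150, 150, 150 and 100 words, and the first 50 words of each chunk are the last 50 words of the one before it. That shared span is what keeps a sentence that straddles a chunk boundary retrievable from either side.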
(Optional) Build a pipeline to query documents

Now, let’s build a RAG pipeline that answers queries based on the documents you just created in the section above. For this step, we will be using the HuggingFaceAPIChatGenerator (https://docs.haystack.deepset.ai/docs/huggingfaceapichatgenerator), so you must have a Hugging Face API Key (https://huggingface.co/settings/tokens) for this section. We will be using the Qwen/Qwen2.5-7B-Instruct model.

```python
import os
from getpass import getpass

if "HF_API_TOKEN" not in os.environ:
    os.environ["HF_API_TOKEN"] = getpass("Enter Hugging Face token:")
```

In this step you’ll build a query pipeline to answer questions about the documents.

This pipeline takes the prompt, searches the document store for relevant documents, and passes those documents along to the LLM to formulate an answer.

⚠️ Notice how we used sentence-transformers/all-MiniLM-L6-v2 to create embeddings for our documents before. Query and document embeddings must come from the same model to be comparable, which is why we will be using the same model to embed incoming questions.

```python
from haystack_integrations.components.embedders.sentence_transformers import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.huggingface_api import HuggingFaceAPIChatGenerator

template = [
    ChatMessage.from_user(
        """
Answer the questions based on the given context.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{ question }}
Answer:
"""
    )
]

pipe = Pipeline()
pipe.add_component("embedder", SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"))
pipe.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store))
pipe.add_component("chat_prompt_builder", ChatPromptBuilder(template=template))
pipe.add_component(
    "llm",
    HuggingFaceAPIChatGenerator(
        api_type="serverless_inference_api",
        api_params={"model": "Qwen/Qwen2.5-7B-Instruct", "provider": "together"},
    ),
)

# Query embedding -> retrieved documents -> rendered prompt -> LLM
pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever", "chat_prompt_builder.documents")
pipe.connect("chat_prompt_builder.prompt", "llm.messages")
```

Try it out yourself by running the code below. If all has gone well, you should have a complete shopping list from all the recipe sources. 🧂🥥🧄

```python
question = (
    "What ingredients would I need to make vegan keto eggplant lasagna, vegan persimmon flan, and vegan hemp cheese?"
)

pipe.run({"embedder": {"text": question}, "chat_prompt_builder": {"question": question}})
```

```
{'llm': {'replies': [ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text='To make vegan keto eggplant lasagna, vegan persimmon flan, and vegan hemp cheese, you would need the following ingredients:\n\n Vegan Keto Eggplant Lasagna\n- 2 large eggplants\n- Salt Hella salt \n- 1/2 cup store-bought vegan mozzarella for topping\n- Pesto ingredients: 4 oz basil, 1/4 cup almonds, 1/4 cup nutritional yeast, 1/4 cup olive oil, 1 recipe vegan pesto \n- Spinach Tofu Ricotta 14 oz firm or extra firm tofu, 10 oz spinach, juice of 1 lemon, garlic powder to taste, salt to taste \n- Macadamia Nut Cheese 1 cup macadamia nuts, 1/4 cup nutritional yeast, 1/4 cup olive oil, 1 recipe vegan pesto, 1 recipe spinach tofu ricotta, 1 tsp garlic powder, juice of half a lemon, salt to taste \n\n Vegan Persimmon Flan\n- 2 average-sized fuyu persimmons, strained\n- 1 tbsp cornstarch\n- 1/2 tsp agar agar\n- 1 tbsp agave nectar, or to taste\n- 2 tbsp granulated sugar\n- 1/4 cup coconut creme\n- 1/2 cup almond milk\n- 1/2 tsp vanilla\n\n Vegan Hemp Cheese\n- 1/2 cup sunflower seeds\n- 1/2 cup hemp hearts\n- 1.5 teaspoons miso paste\n- 1 tsp nutritional yeast\n- 1/4 cup rejuvelac\n- 1/4th teaspoon salt, or to taste\n\n Additional Tools and Notes\n- Casserole dish 9 x 13 \n- 2 ramekins\n- Blender or food processor\n- Saucepan\n- Immersion blender optional \n- Clean glass bowl\n- Rubber band\n- Dish towel\n- Knife\n- Hot water bath method for flan \n\nThese ingredients and tools will allow you to prepare all three dishes as described in the provided context.')], _name=None, _meta={'model': 'Qwen/Qwen2.5-7B-Instruct', 'finish_reason': 'stop', 'index': 0, 'usage': {'prompt_tokens': 2031, 'completion_tokens': 446}})]}}
```

What’s next

Congratulations on building an indexing pipeline that can preprocess different file types. Go forth and ingest all the messy real-world data into your workflows.
💥 To stay up to date on the latest Haystack developments, you can sign up for our newsletter (https://landing.deepset.ai/haystack-community-updates).

Thanks for reading!