{"slug": "preprocessing-different-file-types", "title": "Preprocessing Different File Types", "summary": "Haystack released a tutorial on June 17, 2026, teaching beginners how to build an indexing pipeline that preprocesses markdown, PDF, and text files using the FileTypeRouter component. The pipeline converts files into documents, cleans and splits them, creates embeddings, and writes to a document store, with an optional RAG pipeline using a Hugging Face API key.", "body_md": "# Tutorial: Preprocessing Different File Types\n\nLast Updated: June 17, 2026\n\n- **Level**: Beginner\n- **Time to complete**: 15 minutes\n- **Goal**: After completing this tutorial, you’ll have learned how to build an indexing pipeline that will preprocess files based on their file type, using the `FileTypeRouter`.\n\n💡 (Optional): After creating the indexing pipeline in this tutorial, there is an optional section that shows you how to create a RAG pipeline on top of the document store you just created. You must have a [Hugging Face API Key](https://huggingface.co/settings/tokens) for this section.\n\n## Components Used\n\n- `FileTypeRouter`: This component will help you route files based on their corresponding MIME type to different components\n- `MarkdownToDocument`: This component will help you convert markdown files into Haystack Documents\n- `PyPDFToDocument`: This component will help you convert PDF files into Haystack Documents\n- `TextFileToDocument`: This component will help you convert text files into Haystack Documents\n- `DocumentCleaner` (optional): This component will help you make Documents more readable by removing extra whitespace etc.\n- `DocumentSplitter`: This component will help you split your Documents into chunks\n- `SentenceTransformersDocumentEmbedder`: This component will help you create embeddings for Documents\n- `DocumentWriter`: This component will help you write Documents into the DocumentStore\n\n## Overview\n\nIn this tutorial, you’ll build an indexing pipeline that 
preprocesses different types of files (markdown, txt and pdf). Each file type will have its own `FileConverter`. The rest of the indexing pipeline is fairly standard: clean the documents, split them into chunks, create embeddings and write them to a Document Store.\n\nOptionally, you can keep going to see how to use these documents in a query pipeline as well.\n\n## Installing dependencies\n\n```\n%%bash\npip install haystack-ai huggingface-api-haystack\npip install sentence-transformers-haystack\npip install markdown-it-py mdit_plain pypdf\npip install gdown\n```\n\n## Download All Files\n\nThe files that you will use in this tutorial are stored in a [GDrive folder](https://drive.google.com/drive/folders/1n9yqq5Gl_HWfND5bTlrCwAOycMDt5EMj). Either download the files directly from the GDrive folder or run the code below. If you’re running this tutorial on Colab, you’ll find the downloaded files under the “/recipe_files” folder in the “files” tab on the left.\n\nJust like most real-life data, these files are a mishmash of different types.\n\n``` python\nimport gdown\n\nurl = \"https://drive.google.com/drive/folders/1n9yqq5Gl_HWfND5bTlrCwAOycMDt5EMj\"\noutput_dir = \"recipe_files\"\n\ngdown.download_folder(url, quiet=True, output=output_dir)\n```\n\n## Create a Pipeline to Index Documents\n\nNext, you’ll create a pipeline to index documents. To keep things uncomplicated, you’ll use an `InMemoryDocumentStore`, but this approach would also work with any other flavor of `DocumentStore`.\n\nYou’ll need a different file converter class for each file type in our data sources: `.pdf`, `.txt`, and `.md` in this case. 
Our `FileTypeRouter` connects each file type to the proper converter.\n\n``` python\nfrom haystack.components.writers import DocumentWriter\nfrom haystack.components.converters import MarkdownToDocument, PyPDFToDocument, TextFileToDocument\nfrom haystack.components.preprocessors import DocumentSplitter, DocumentCleaner\nfrom haystack.components.routers import FileTypeRouter\nfrom haystack_integrations.components.embedders.sentence_transformers import SentenceTransformersDocumentEmbedder\nfrom haystack import Pipeline\nfrom haystack.document_stores.in_memory import InMemoryDocumentStore\n\ndocument_store = InMemoryDocumentStore()\nfile_type_router = FileTypeRouter(mime_types=[\"text/plain\", \"application/pdf\", \"text/markdown\"])\ntext_file_converter = TextFileToDocument()\nmarkdown_converter = MarkdownToDocument()\npdf_converter = PyPDFToDocument()\n```\n\nFrom there, the steps to this indexing pipeline are a bit more standard. The `DocumentCleaner` removes extra whitespace. Then the `DocumentSplitter` breaks the documents into chunks of 150 words, with an overlap of 50 words to avoid missing context.\n\n```\ndocument_cleaner = DocumentCleaner()\ndocument_splitter = DocumentSplitter(split_by=\"word\", split_length=150, split_overlap=50)\n```\n\nNow you’ll add a `SentenceTransformersDocumentEmbedder` to create embeddings from the documents. 
As the last step in this pipeline, the `DocumentWriter` will write them to the `InMemoryDocumentStore`.\n\n```\ndocument_embedder = SentenceTransformersDocumentEmbedder(model=\"sentence-transformers/all-MiniLM-L6-v2\")\ndocument_writer = DocumentWriter(document_store)\n```\n\nAfter creating all the components, add them to the indexing pipeline.\n\n```\npreprocessing_pipeline = Pipeline()\npreprocessing_pipeline.add_component(instance=file_type_router, name=\"file_type_router\")\npreprocessing_pipeline.add_component(instance=text_file_converter, name=\"text_file_converter\")\npreprocessing_pipeline.add_component(instance=markdown_converter, name=\"markdown_converter\")\npreprocessing_pipeline.add_component(instance=pdf_converter, name=\"pypdf_converter\")\npreprocessing_pipeline.add_component(instance=document_cleaner, name=\"document_cleaner\")\npreprocessing_pipeline.add_component(instance=document_splitter, name=\"document_splitter\")\npreprocessing_pipeline.add_component(instance=document_embedder, name=\"document_embedder\")\npreprocessing_pipeline.add_component(instance=document_writer, name=\"document_writer\")\n```\n\nNext, connect them 👇\n\n```\npreprocessing_pipeline.connect(\"file_type_router.text/plain\", \"text_file_converter.sources\")\npreprocessing_pipeline.connect(\"file_type_router.application/pdf\", \"pypdf_converter.sources\")\npreprocessing_pipeline.connect(\"file_type_router.text/markdown\", \"markdown_converter.sources\")\npreprocessing_pipeline.connect(\"text_file_converter\", \"document_cleaner\")\npreprocessing_pipeline.connect(\"pypdf_converter\", \"document_cleaner\")\npreprocessing_pipeline.connect(\"markdown_converter\", \"document_cleaner\")\npreprocessing_pipeline.connect(\"document_cleaner\", \"document_splitter\")\npreprocessing_pipeline.connect(\"document_splitter\", \"document_embedder\")\npreprocessing_pipeline.connect(\"document_embedder\", \"document_writer\")\n```\n\nLet’s test this pipeline with a few recipes I’ve 
written. Are you getting hungry yet?\n\n``` python\nfrom pathlib import Path\n\npreprocessing_pipeline.run({\"file_type_router\": {\"sources\": list(Path(output_dir).glob(\"**/*\"))}})\n```\n\n🎉 If you only wanted to learn how to preprocess documents, you can stop here! If you want to see an example of using those documents in a RAG pipeline, read on.\n\n## (Optional) Build a pipeline to query documents\n\nNow, let’s build a RAG pipeline that answers queries based on the documents you just created in the section above. For this step, we will be using the [`HuggingFaceAPIChatGenerator`](https://docs.haystack.deepset.ai/docs/huggingfaceapichatgenerator), so you must have a [Hugging Face API Key](https://huggingface.co/settings/tokens) for this section. We will be using the `Qwen/Qwen2.5-7B-Instruct` model.\n\n``` python\nimport os\nfrom getpass import getpass\n\nif \"HF_API_TOKEN\" not in os.environ:\n    os.environ[\"HF_API_TOKEN\"] = getpass(\"Enter Hugging Face token:\")\n```\n\nIn this step you’ll build a query pipeline to answer questions about the documents.\n\nThis pipeline embeds the question, searches the document store for relevant documents, and passes those documents along to the LLM to formulate an answer.\n\n⚠️ Notice how we used `sentence-transformers/all-MiniLM-L6-v2` to create embeddings for our documents before. 
This is why we will be using the same model to embed incoming questions.\n\n``` python\nfrom haystack_integrations.components.embedders.sentence_transformers import SentenceTransformersTextEmbedder\nfrom haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever\nfrom haystack.components.builders import ChatPromptBuilder\nfrom haystack.dataclasses import ChatMessage\nfrom haystack_integrations.components.generators.huggingface_api import HuggingFaceAPIChatGenerator\n\ntemplate = [\n    ChatMessage.from_user(\n        \"\"\"\nAnswer the questions based on the given context.\n\nContext:\n{% for document in documents %}\n    {{ document.content }}\n{% endfor %}\n\nQuestion: {{ question }}\nAnswer:\n\"\"\"\n    )\n]\npipe = Pipeline()\npipe.add_component(\"embedder\", SentenceTransformersTextEmbedder(model=\"sentence-transformers/all-MiniLM-L6-v2\"))\npipe.add_component(\"retriever\", InMemoryEmbeddingRetriever(document_store=document_store))\npipe.add_component(\"chat_prompt_builder\", ChatPromptBuilder(template=template))\npipe.add_component(\n    \"llm\",\n    HuggingFaceAPIChatGenerator(\n        api_type=\"serverless_inference_api\",\n        api_params={\"model\": \"Qwen/Qwen2.5-7B-Instruct\", \"provider\": \"together\"}\n    ),\n)\n\npipe.connect(\"embedder.embedding\", \"retriever.query_embedding\")\npipe.connect(\"retriever\", \"chat_prompt_builder.documents\")\npipe.connect(\"chat_prompt_builder.prompt\", \"llm.messages\")\n```\n\nTry it out yourself by running the code below. If all has gone well, you should have a complete shopping list from all the recipe sources. 
🧂🥥🧄\n\n```\nquestion = (\n    \"What ingredients would I need to make vegan keto eggplant lasagna, vegan persimmon flan, and vegan hemp cheese?\"\n)\n\npipe.run({\"embedder\": {\"text\": question}, \"chat_prompt_builder\": {\"question\": question}})\n```\n\n{'llm': {'replies': [ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text='To make vegan keto eggplant lasagna, vegan persimmon flan, and vegan hemp cheese, you would need the following ingredients:\\n\\n### Vegan Keto Eggplant Lasagna\\n- 2 large eggplants\\n- Salt (Hella salt)\\n- 1/2 cup store-bought vegan mozzarella for topping\\n- Pesto (ingredients: 4 oz basil, 1/4 cup almonds, 1/4 cup nutritional yeast, 1/4 cup olive oil, 1 recipe vegan pesto)\\n- Spinach Tofu Ricotta (14 oz firm or extra firm tofu, 10 oz spinach, juice of 1 lemon, garlic powder to taste, salt to taste)\\n- Macadamia Nut Cheese (1 cup macadamia nuts, 1/4 cup nutritional yeast, 1/4 cup olive oil, 1 recipe vegan pesto, 1 recipe spinach tofu ricotta, 1 tsp garlic powder, juice of half a lemon, salt to taste)\\n\\n### Vegan Persimmon Flan\\n- 2 average-sized fuyu persimmons, strained\\n- 1 tbsp cornstarch\\n- 1/2 tsp agar agar\\n- 1 tbsp agave nectar, or to taste\\n- 2 tbsp granulated sugar\\n- 1/4 cup coconut creme\\n- 1/2 cup almond milk\\n- 1/2 tsp vanilla\\n\\n### Vegan Hemp Cheese\\n- 1/2 cup sunflower seeds\\n- 1/2 cup hemp hearts\\n- 1.5 teaspoons miso paste\\n- 1 tsp nutritional yeast\\n- 1/4 cup rejuvelac\\n- 1/4th teaspoon salt, or to taste\\n\\n### Additional Tools and Notes\\n- Casserole dish (9 x 13)\\n- 2 ramekins\\n- Blender or food processor\\n- Saucepan\\n- Immersion blender (optional)\\n- Clean glass bowl\\n- Rubber band\\n- Dish towel\\n- Knife\\n- Hot water bath method (for flan)\\n\\nThese ingredients and tools will allow you to prepare all three dishes as described in the provided context.')], _name=None, _meta={'model': 'Qwen/Qwen2.5-7B-Instruct', 'finish_reason': 'stop', 'index': 0, 'usage': 
{'prompt_tokens': 2031, 'completion_tokens': 446}})]}}\n\n## What’s next\n\nCongratulations on building an indexing pipeline that can preprocess different file types. Go forth and ingest all the messy real-world data into your workflows. 💥\n\nTo stay up to date on the latest Haystack developments, you can\n[sign up for our newsletter](https://landing.deepset.ai/haystack-community-updates). Thanks for reading!", "url": "https://wpnews.pro/news/preprocessing-different-file-types", "canonical_source": "https://haystack.deepset.ai/tutorials/30_file_type_preprocessing_index_pipeline/", "published_at": "2026-06-17 00:00:00+00:00", "updated_at": "2026-06-24 12:18:46.031427+00:00", "lang": "en", "topics": ["developer-tools", "ai-tools", "natural-language-processing"], "entities": ["Haystack", "FileTypeRouter", "MarkdownToDocument", "PyPDFToDocument", "TextFileToDocument", "DocumentCleaner", "DocumentSplitter", "SentenceTransformersDocumentEmbedder"], "alternates": {"html": "https://wpnews.pro/news/preprocessing-different-file-types", "markdown": "https://wpnews.pro/news/preprocessing-different-file-types.md", "text": "https://wpnews.pro/news/preprocessing-different-file-types.txt", "jsonld": "https://wpnews.pro/news/preprocessing-different-file-types.jsonld"}}