{"slug": "aichain-tools-search-conversion-embeddings", "title": "AIchain Tools: Search, Conversion, Embeddings", "summary": "Yait_aichain, an open-source AI pipeline framework, introduces a modular tool system for search, document conversion, and embeddings. The framework keeps its core dependency surface small, shipping only Markdown conversion as a hard dependency while search providers and other tools plug in via optional extras or external MCP servers. This design avoids fragile dependency trees as specialized tools evolve rapidly.", "body_md": "An LLM knows everything up to its training cutoff and nothing after. Ask it about yesterday's stock price, hand it a PDF, or expect it to tell you which of your 10,000 support tickets are duplicates — and you'll hit a wall. Tools are how you close that gap.\n\nThis is article #6 in the series on building AI pipelines with yait_aichain. Three categories today: search for live data, Markdown conversion for universal document ingestion, and embeddings for tasks well beyond basic RAG.\n\nBefore any code, one architectural decision is worth understanding. **yait_aichain** deliberately keeps its core dependency surface small. Only one tool ships inside the package as a hard dependency: Markdown conversion. Everything else — search providers, embeddings, vector stores, rerankers — is either an optional extra you install separately or a pluggable **MCP (Model Context Protocol)** server running as a separate process.\n\nWhy? Specialized tools change fast. Search APIs come and go. OpenAI has shipped three generations of embedding models since 2022, and the pace hasn't slowed. Tying all of that into the core library would create a fragile dependency tree that breaks every time a provider ships a breaking change. So the universal stuff lives inside. The specialized things plug in from outside.\n\nA model trained on data through April 2024 can't tell you what happened yesterday. 
Search tools fix that.\n\n[yait_aichain](https://github.com/yaitio/aichain) includes a **Perplexity search integration** as an optional extra. Install it first:\n\n``` bash\npip install \"yait-aichain[perplexity]\"\n```\n\nThen import and call it:\n\n``` python\nfrom yait_aichain.tools import searchPerplexity\n\nsearch = searchPerplexity()\nresult = search.run(\"latest LLM benchmarks 2025\")\nprint(result)\n```\n\nOne function call, one string back: grounded, citation-backed, pulled from live web results.\n\nBut Perplexity isn't the only option. You might prefer Brave Search for its independence from big-tech indices, or SerpAPI for its structured Google results. These providers connect through MCP servers, external processes that yait_aichain discovers and calls at runtime. The same pattern applies to any MCP server; here it is with a NewsAPI server:\n\n``` python\n\"\"\"\nStart the MCP server first:\n    python your_newsapi_server.py\n\nMCP server: newsapi on http://127.0.0.1:8009/mcp\n\nTools available:\n    search_news        - search articles (requires: q, sources, or domains)\n    get_top_headlines  - breaking news\n    get_sources        - list all available news sources\n\"\"\"\n\nfrom yait_aichain.tools import Tool\n# The Skill/Chain connects to MCP tools via the protocol at runtime.\n# No MCP-specific dependencies live inside yait_aichain itself.\n```\n\nThe MCP pattern is intentional. A NewsAPI server has its own API key, its own rate limits, its own Python dependencies. Keeping that in a separate process means your core pipeline stays clean. Swap Brave for SerpAPI by pointing at a different MCP server, with no code changes in your Chain.\n\nHere's a scenario every developer hits: you need an LLM to process a PDF report, a DOCX contract, a PowerPoint deck, and a URL. Each format needs a different parser. 
Each parser has its own quirks, its own edge cases, its own dependencies.\n\nOr you use one call:\n\n``` python\nfrom yait_aichain.tools import convertToMD\n\ntool = convertToMD()\n\n# Convert a URL\nweb_content = tool.run(\"https://example.com\")\nprint(web_content[:500])\n\n# Convert a local PDF\npdf_content = tool.run(\"quarterly_report.pdf\")\n\n# Convert a DOCX file\ndoc_content = tool.run(\"contract_v3.docx\")\n```\n\nUnder the hood, `convertToMD` wraps Microsoft's open-source `markitdown` library, bundled as a hard dependency of yait_aichain, so no separate install is required. It handles PDF, DOCX, PPTX, XLSX, HTML, and URLs out of the box. The output is always Markdown, a format LLMs handle well.\n\nNo configuration objects. No format detection logic. No parsing pipelines. One call turns nearly any document into clean Markdown ready for a Skill or Chain. That's why it's the only tool that lives inside yait_aichain as a hard dependency: document-to-Markdown conversion is universal enough to earn its place in the core.\n\nWhen developers hear \"embeddings,\" they usually think retrieval-augmented generation: chunk documents, embed them, store them in a vector database, retrieve at query time. That's one use case. Not the only one.\n\n``` python\nfrom yait_aichain.tools import Embedding\n\nemb = Embedding(model=\"text-embedding-3-small\")\nvectors = emb.embed([\n    \"How do I reset my password?\",\n    \"I can't log into my account\",\n    \"What are your pricing plans?\",\n    \"Password reset not working\",\n])\n```\n\nEach call returns a list of float vectors, one per input string. What you do with those vectors determines the use case.\n\n**Semantic search.** Skip keyword matching entirely. A query for \"authentication issues\" will surface tickets about password resets, SSO failures, and 2FA problems, because the meaning matches, not the words.\n\n**Deduplication.** Compare cosine similarity between support tickets. 
Tickets 0, 1, and 3 above will cluster tightly: they're semantically close despite different wording. Set a similarity threshold (0.92 is a starting point to tune against your own data) and you can automatically flag likely duplicates across thousands of entries without a human reading each one.\n\n**Clustering.** Feed the vectors into k-means or HDBSCAN. Group product reviews by theme without writing a single regex. Find natural topic boundaries in a corpus of research papers.\n\nThe `VectorDB` class gives you an in-process vector store for quick prototyping:\n\n``` python\nfrom yait_aichain.tools import VectorDB, vectorQuery\n\ndb = VectorDB()\ndb.add(\n    documents=[\"Password reset guide\", \"Pricing FAQ\", \"SSO troubleshooting\"],\n    ids=[\"d1\", \"d2\", \"d3\"],\n)\n\nresults = vectorQuery(db, query=\"I can't log in\", top_k=2)\nfor r in results:\n    print(r)\n```\n\nNote: `vectorQuery` is a standalone function rather than a method on `VectorDB` because it operates across multiple database instances in pipeline contexts; the separation is intentional.\n\nWhen result quality matters and your document set is large, add a reranking step:\n\n``` python\nfrom yait_aichain.tools import Reranker\n\nreranker = Reranker()\nranked = reranker.rerank(\n    query=\"latest AI benchmarks\",\n    documents=[\"doc A about benchmarks\", \"doc B about pricing\", \"doc C about model eval\"],\n)\nprint(ranked)\n```\n\nThe reranker takes coarse results from vector search and reorders them with a cross-encoder. That is more expensive per comparison, but on published benchmarks like BEIR, cross-encoder reranking typically improves precision at the top-k positions over vector search alone. Think of `results` from the vector query as your candidate pool and `ranked` as the final ordered list you pass to the model.\n\nEvery tool in yait_aichain follows the same pattern: subclass `Tool`, define `name` and `description`, implement `run(self, input: str)`. 
The single-string signature is what lets the framework invoke your tool through the standard LLM tool-calling protocol: the model sends a string (or a JSON blob you parse yourself), and your tool returns a string. Here's a live currency exchange tool:\n\n``` python\nimport json\nimport urllib3\nfrom yait_aichain.tools import Tool\nfrom yait_aichain import Model, Skill\n\nclass FXRateTool(Tool):\n    \"\"\"Fetch live FX rate from frankfurter.app.\"\"\"\n    name = \"fx_rate\"\n    description = (\n        \"Return the current exchange rate between two currencies. \"\n        \"Input must be a JSON string with keys 'base' and 'target', \"\n        \"e.g. '{\\\"base\\\": \\\"USD\\\", \\\"target\\\": \\\"EUR\\\"}'.\"\n    )\n\n    def run(self, input: str) -> str:\n        # Production code should validate input and handle network errors.\n        params = json.loads(input)\n        base = params.get(\"base\", \"USD\")\n        target = params.get(\"target\", \"EUR\")\n        http = urllib3.PoolManager()\n        resp = http.request(\n            \"GET\",\n            f\"https://api.frankfurter.app/latest?from={base}&to={target}\",\n        )\n        data = json.loads(resp.data.decode())\n        rate = data[\"rates\"][target]\n        return f\"1 {base} = {rate} {target}\"\n\n# Use it standalone\ntool = FXRateTool()\nprint(tool.run('{\"base\": \"USD\", \"target\": \"EUR\"}'))  # e.g. 1 USD = 0.8842 EUR (live rate, varies)\n\n# Or plug it into a Skill\nmodel = Model(\"claude-3-5-sonnet-20241022\")\nskill = Skill(\n    model=model,\n    input={\n        \"messages\": [\n            {\n                \"role\": \"user\",\n                # yait_aichain uses \"content\" as the universal message field\n                # across all supported providers.\n                \"content\": \"What is the USD to EUR rate? Use the fx_rate tool.\",\n            }\n        ]\n    },\n    tools=[FXRateTool()],\n)\n```\n\nThe `tools=[]` parameter on Skill accepts any list of Tool instances. 
The model sees the tool names and descriptions, decides when to call them, and incorporates the results into its response. You write the logic once. The framework handles the tool-calling protocol across all eight supported providers: OpenAI, Anthropic, Google Gemini, xAI Grok, Mistral, Groq, Ollama, and any OpenAI-compatible endpoint.\n\n| Inside yait_aichain | Outside (pluggable) |\n|---|---|\n| `convertToMD` (Markdown conversion) | Search (Perplexity via optional extra; Brave, SerpAPI via MCP) |\n| `Tool` base class | Custom tools you write |\n| Core: Model, Skill, Chain, Pool, Agent | MCP servers with their own dependencies |\n| (none) | Embeddings, VectorDB, Reranker (optional extras) |\n\nThe boundary is deliberate. Markdown conversion is universal: nearly every pipeline needs to ingest documents. Search providers, embedding models, and vector stores are opinionated choices that vary by project. Keep them outside and your base `pip install yait-aichain` stays light. Add only what you actually use.\n\nTools extend what a language model can do: read today's news, parse your company's PDFs, find semantic patterns across thousands of documents. The API surface stays small. 
The capabilities grow with your project.", "url": "https://wpnews.pro/news/aichain-tools-search-conversion-embeddings", "canonical_source": "https://dev.to/yait/aichain-tools-search-conversion-embeddings-1gnp", "published_at": "2026-06-17 13:11:00+00:00", "updated_at": "2026-06-17 13:21:52.750532+00:00", "lang": "en", "topics": ["developer-tools", "large-language-models", "ai-infrastructure"], "entities": ["yait_aichain", "Perplexity", "Brave Search", "SerpAPI", "NewsAPI", "Microsoft", "markitdown", "OpenAI"], "alternates": {"html": "https://wpnews.pro/news/aichain-tools-search-conversion-embeddings", "markdown": "https://wpnews.pro/news/aichain-tools-search-conversion-embeddings.md", "text": "https://wpnews.pro/news/aichain-tools-search-conversion-embeddings.txt", "jsonld": "https://wpnews.pro/news/aichain-tools-search-conversion-embeddings.jsonld"}}