AIchain Tools: Search, Conversion, Embeddings

Yait_aichain, an open-source AI pipeline framework, introduces a modular tool system for search, document conversion, and embeddings. The framework keeps its core dependency surface small, shipping only Markdown conversion as a hard dependency while search providers and other tools plug in via optional extras or external MCP servers. This design avoids fragile dependency trees as specialized tools evolve rapidly.

An LLM knows everything up to its training cutoff and nothing after. Ask it about yesterday's stock price, hand it a PDF, or expect it to tell you which of your 10,000 support tickets are duplicates, and you'll hit a wall. Tools are how you close that gap.

This is article 6 in the series on building AI pipelines with yait_aichain. Three categories today: search for live data, Markdown conversion for universal document ingestion, and embeddings for tasks well beyond basic RAG.

Before any code, one architectural decision is worth understanding. yait_aichain deliberately keeps its core dependency surface small. Only one tool ships inside the package as a hard dependency: Markdown conversion. Everything else (search providers, embeddings, vector stores, rerankers) is either an optional extra you install separately or a pluggable MCP (Model Context Protocol) server running as a separate process.

Why? Specialized tools change fast. Search APIs come and go. OpenAI has shipped three generations of embedding models since 2022, and the pace hasn't slowed. Tying all of that into the core library would create a fragile dependency tree that breaks every time a provider ships a breaking change. So the universal stuff lives inside. The specialized things plug in from outside.

A model trained on data through April 2024 can't tell you what happened yesterday. Search tools fix that. [yait_aichain](https://github.com/yaitio/aichain) includes a Perplexity search integration as an optional extra. Install it first, then import:

```bash
pip install "yait-aichain[perplexity]"
```

```python
from yait_aichain.tools import searchPerplexity

search = searchPerplexity()
result = search.run("latest LLM benchmarks 2025")
print(result)
```

One function call, one string back: grounded, citation-backed, pulled from live web results.

But Perplexity isn't the only option. You might prefer Brave Search for its independence from big-tech indices, or SerpAPI for its structured Google results.
These providers connect through MCP servers, external processes that yait_aichain discovers and calls at runtime:

```python
"""
Start the MCP server first:
    python your_newsapi_server.py

MCP server: newsapi on http://127.0.0.1:8009/mcp
Tools available:
    search_news:       search articles (requires: q, sources, or domains)
    get_top_headlines: breaking news
    get_sources:       list all available news sources
"""
from yait_aichain.tools import Tool

# The Skill/Chain connects to MCP tools via the protocol at runtime.
# No MCP-specific dependencies live inside yait_aichain itself.
```

The MCP pattern is intentional. A NewsAPI server has its own API key, its own rate limits, its own Python dependencies. Keeping that in a separate process means your core pipeline stays clean. Swap Brave for SerpAPI by pointing at a different MCP server, with no code changes in your Chain.

Here's a scenario every developer hits: you need an LLM to process a PDF report, a DOCX contract, a PowerPoint deck, and a URL. Each format needs a different parser. Each parser has its own quirks, its own edge cases, its own dependencies.

Or you use one line:

```python
from yait_aichain.tools import convertToMD

tool = convertToMD()

# Convert a URL
web_content = tool.run("https://example.com")
print(web_content[:500])

# Convert a local PDF
pdf_content = tool.run("quarterly_report.pdf")

# Convert a DOCX file
doc_content = tool.run("contract_v3.docx")
```

Under the hood, `convertToMD` wraps Microsoft's open-source markitdown library, bundled as a hard dependency of yait_aichain, so no separate install is required. It handles PDF, DOCX, PPTX, XLSX, HTML, and URLs out of the box. The output is always Markdown, which happens to be the format LLMs consume best.

No configuration objects. No format detection logic. No parsing pipelines. One call turns nearly any document into clean Markdown ready for a Skill or Chain.
That's why it's the only tool that lives inside yait_aichain as a hard dependency: document-to-Markdown conversion is universal enough to earn its place in the core.

When developers hear "embeddings," they usually think retrieval-augmented generation: chunk documents, embed them, store them in a vector database, retrieve at query time. That's one use case. Not the only one.

```python
from yait_aichain.tools import Embedding

emb = Embedding(model="text-embedding-3-small")
vectors = emb.embed([
    "How do I reset my password?",
    "I can't log into my account",
    "What are your pricing plans?",
    "Password reset not working",
])
```

Each call returns a list of float vectors, one per input string. What you do with those vectors determines the use case.

Semantic search. Skip keyword matching entirely. A query for "authentication issues" will surface tickets about password resets, SSO failures, and 2FA problems, because the meaning matches, not the words.

Deduplication. Compare cosine similarity between support tickets. Tickets 0, 1, and 3 above should cluster tightly: they describe the same login problem in different words. Set a similarity threshold (0.92 is a reasonable starting point; tune it for your model and data) and you can automatically flag duplicates across thousands of entries without a human reading a single one.

Clustering. Feed the vectors into k-means or HDBSCAN. Group product reviews by theme without writing a single regex. Find natural topic boundaries in a corpus of research papers.

The `VectorDB` class gives you an in-process vector store for quick prototyping:

```python
from yait_aichain.tools import VectorDB, vectorQuery

db = VectorDB()
db.add(
    documents=["Password reset guide", "Pricing FAQ", "SSO troubleshooting"],
    ids=["d1", "d2", "d3"],
)

results = vectorQuery(db, query="I can't log in", top_k=2)
for r in results:
    print(r)
```

Note: `vectorQuery` is a standalone function rather than a method on `VectorDB` because it operates across multiple database instances in pipeline contexts. The separation is intentional.
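
To make the deduplication idea above concrete, here is a minimal sketch in plain Python. It assumes nothing about yait_aichain: `cosine_similarity` and `find_duplicates` are hypothetical helper names, and the three-dimensional vectors are made-up stand-ins for real embedding output, chosen only so the example runs without an API key. The 0.92 threshold is the one mentioned above.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


def find_duplicates(vectors, threshold=0.92):
    """Return (i, j, similarity) for every pair at or above the threshold."""
    pairs = []
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine_similarity(vectors[i], vectors[j])
        if sim >= threshold:
            pairs.append((i, j, sim))
    return pairs


# Toy vectors (invented for illustration, not real embeddings).
toy_vectors = [
    [0.90, 0.10, 0.05],  # "How do I reset my password?"
    [0.85, 0.20, 0.05],  # "I can't log into my account"
    [0.05, 0.10, 0.95],  # "What are your pricing plans?"
    [0.88, 0.12, 0.08],  # "Password reset not working"
]

duplicate_pairs = find_duplicates(toy_vectors)
for i, j, sim in duplicate_pairs:
    print(f"tickets {i} and {j}: similarity {sim:.3f}")
```

With these toy values, the three login-related tickets pair up and the pricing question stays on its own. The pairwise loop is quadratic in the number of tickets, so for large sets you would likely query a vector store for near neighbors instead of comparing everything against everything.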

When result quality matters and your document set is large, add a reranking step:

```python
from yait_aichain.tools import Reranker

reranker = Reranker()
ranked = reranker.rerank(
    query="latest AI benchmarks",
    documents=[
        "doc A about benchmarks",
        "doc B about pricing",
        "doc C about model eval",
    ],
)
print(ranked)
```

The reranker takes coarse results from vector search and reorders them with a cross-encoder. That is more expensive per comparison, but on published benchmarks like BEIR, cross-encoder reranking typically improves precision at top-k positions over vector search alone. Think of `results` as your candidate pool and `ranked` as the final ordered list you pass to the model.

Every tool in yait_aichain follows the same pattern: subclass `Tool`, define `name` and `description`, implement `run(self, input: str)`. The single-string signature is what lets the framework invoke your tool through the standard LLM tool-calling protocol: the model sends a string (or a JSON blob you parse yourself), and your tool returns a string.

Here's a live currency exchange tool:

```python
import json

import urllib3

from yait_aichain import Model, Skill
from yait_aichain.tools import Tool


class FXRateTool(Tool):
    """Fetch live FX rate from frankfurter.app."""

    name = "fx_rate"
    description = (
        "Return the current exchange rate between two currencies. "
        "Input must be a JSON string with keys 'base' and 'target', "
        "e.g. '{\"base\": \"USD\", \"target\": \"EUR\"}'."
    )

    def run(self, input: str) -> str:
        # Production code should validate input and handle network errors.
        params = json.loads(input)
        base = params.get("base", "USD")
        target = params.get("target", "EUR")
        http = urllib3.PoolManager()
        resp = http.request(
            "GET",
            f"https://api.frankfurter.app/latest?from={base}&to={target}",
        )
        data = json.loads(resp.data.decode())
        rate = data["rates"][target]
        return f"1 {base} = {rate} {target}"


# Use it standalone
tool = FXRateTool()
print(tool.run('{"base": "USD", "target": "EUR"}'))
# e.g. 1 USD = 0.8842 EUR

# Or plug it into a Skill
model = Model("claude-3-5-sonnet-20241022")
skill = Skill(
    model=model,
    input={
        "messages": [
            {
                "role": "user",
                # yait_aichain uses "content" as the universal message field
                # across all supported providers.
                "content": "What is the USD to EUR rate? Use the fx_rate tool.",
            }
        ]
    },
    tools=[FXRateTool()],
)
```

The `tools=` parameter on `Skill` accepts any list of `Tool` instances. The model sees the tool names and descriptions, decides when to call them, and incorporates the results into its response.

You write the logic once. The framework handles the tool-calling protocol across all eight supported providers: OpenAI, Anthropic, Google Gemini, xAI Grok, Mistral, Groq, Ollama, and any OpenAI-compatible endpoint.

| Inside yait_aichain | Outside (pluggable) |
|---|---|
| `convertToMD`: Markdown conversion | Search (Perplexity via optional extra; Brave, SerpAPI via MCP) |
| `Tool` base class | Custom tools you write |
| Core: Model, Skill, Chain, Pool, Agent | MCP servers with their own dependencies |
| (nothing else) | Embeddings, VectorDB, Reranker (optional extras) |

The boundary is deliberate. Markdown conversion is universal: every pipeline needs to ingest documents. Search providers, embedding models, and vector stores are opinionated choices that vary by project. Keep them outside and your base `pip install yait-aichain` stays light. Add only what you actually use.

Tools extend what a language model can do: read today's news, parse your company's PDFs, find semantic patterns across thousands of documents. The API surface stays small. The capabilities grow with your project.
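
One loose end from the `FXRateTool` example: its comment says production code should validate input. A minimal sketch of what that validation might look like follows. `parse_fx_input` is a hypothetical helper, not part of yait_aichain, and the three-letter uppercase check is an assumption about what counts as a plausible currency code rather than a rule taken from the frankfurter.app API. Network error handling is left out so the block stays runnable offline.

```python
import json
import re

# Assumption: currency codes are three ASCII letters (ISO 4217 style).
CURRENCY_CODE = re.compile(r"[A-Z]{3}")


def parse_fx_input(raw: str) -> tuple[str, str]:
    """Parse the tool's JSON input and return a validated (base, target) pair."""
    try:
        params = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"input is not valid JSON: {exc.msg}") from exc

    if not isinstance(params, dict):
        raise ValueError("input must be a JSON object with 'base' and 'target'")

    # Same defaults as the FXRateTool example: USD to EUR.
    base = str(params.get("base", "USD")).strip().upper()
    target = str(params.get("target", "EUR")).strip().upper()

    for label, code in (("base", base), ("target", target)):
        if not CURRENCY_CODE.fullmatch(code):
            raise ValueError(f"{label} is not a three-letter currency code: {code!r}")

    return base, target


print(parse_fx_input('{"base": "usd", "target": "eur"}'))
```

Inside `run`, you could call such a helper first and return the error message as the tool's string result, so the model sees what went wrong and can retry with corrected input.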