Agent engineering: smolagents

HuggingFace's smolagents framework introduces CodeAgent, a code-writing agent that generates and executes Python in a sandboxed Jupyter kernel environment, replacing traditional tool-calling approaches. The agent uses a ReAct loop to write, run, and observe code, with remote executors providing VM-level isolation and state persistence across steps.

This is the sixth take on the same news reader. Previous versions used tool-calling agents. This time I used [smolagents](https://huggingface.co/docs/smolagents/) from HuggingFace, specifically its `CodeAgent`. Instead of calling predefined tools, the agent writes Python code and executes it in a sandbox.

## How CodeAgent works

A `CodeAgent` follows a [ReAct loop](https://huggingface.co/docs/smolagents/conceptual_guides/react): think, act, observe, repeat. At each step, the LLM generates a Python snippet, the framework executes it, and stdout plus the return value feed back as the next observation. To finish, the agent calls `final_answer(value)`.

```python
from smolagents import CodeAgent, LiteLLMModel

agent = CodeAgent(
    tools=[],
    model=LiteLLMModel("anthropic/claude-haiku-4-5"),
    executor_type="blaxel",
    executor_kwargs={"sandbox_name": "news-reader"},
    additional_authorized_imports=["httpx", "bs4", "pydantic"],
    max_steps=15,
    verbosity_level=2,
)
result = agent.run(task_prompt)
```

smolagents has no native Anthropic client. `LiteLLMModel` wraps the [LiteLLM](https://docs.litellm.ai/) Python library, which is better known as a proxy server but here works as a local library that sends requests directly to `api.anthropic.com`. The `anthropic/` prefix in the model ID tells LiteLLM which provider to use.

## Sandboxes: Jupyter repurposed

smolagents has several [executors](https://huggingface.co/docs/smolagents/reference/python_executors) that determine where the generated code runs.

The default `LocalPythonExecutor` is a restricted AST interpreter running in your process. It walks the generated code’s syntax tree and evaluates it node by node, blocking modules like `os` and `subprocess`, and functions like `eval` and `exec`. Only a small whitelist of safe imports is available. You can expand it with `additional_authorized_imports`, but the local executor still can’t make HTTP requests or touch the filesystem unless you explicitly opt in.

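To make the allowlist idea concrete, here is a minimal sketch of a static AST check. It is not how smolagents implements `LocalPythonExecutor`: the real executor interprets the tree node by node, while this sketch only scans the tree for unauthorized imports and blocked calls before anything runs. The `find_violations` helper and both name sets are hypothetical, chosen for illustration.

```python
import ast

# Illustrative sets only; these are assumptions, not smolagents' actual lists.
ALLOWED_IMPORTS = {"math", "json", "re"}
BLOCKED_CALLS = {"eval", "exec", "compile", "__import__"}


def find_violations(source, extra_imports=frozenset()):
    """Return a list of human-readable policy violations found in `source`."""
    allowed = ALLOWED_IMPORTS | set(extra_imports)
    violations = []
    for node in ast.walk(ast.parse(source)):
        # Collect the module names referenced by this node, if it is an import.
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # empty for relative imports like "from . import x"
        else:
            names = []
        for name in names:
            root = name.split(".")[0]  # "os.path" is judged by its top-level package
            if root not in allowed:
                violations.append(f"import of '{root}' is not authorized")
        # Flag direct calls to blocked builtins such as eval(...).
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                violations.append(f"call to '{node.func.id}' is blocked")
    return violations
```

Passing `extra_imports={"httpx"}` plays the role that `additional_authorized_imports` plays in the real executor: it widens the allowlist without touching the blocked calls. A scan like this is easy to evade with indirection (for example `getattr` tricks), which is one reason an interpreter that evaluates every node is a stronger design than a pre-check.
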
The remote executors take a different approach. They spin up an isolated environment and run a Jupyter kernel inside it. smolagents connects to the kernel over WebSocket and sends code the same way a Jupyter notebook sends cells to its kernel. This is the Jupyter ecosystem repurposed. The kernel manages state between steps, so variables defined in one step are available in the next.

With `executor_type="blaxel"`, the sandbox is a remote VM that boots in under 25ms from hibernation. The `additional_authorized_imports` parameter tells smolagents to pip install those packages into the VM before the first step. The isolation is at the VM level rather than the interpreter level.

## What the agent actually generated

Here’s what the agent generated, as logged with `verbosity_level=2`.

Step 1: imports, Pydantic models, and fetching both pages:

```python
import httpx
from bs4 import BeautifulSoup
from pydantic import BaseModel, Field

class NewsItem(BaseModel):
    title: str
    url: str
    source: str
    tags: list[str]
    summary: str
    discussion_url: str | None = None

class ScraperResult(BaseModel):
    items: list[NewsItem]
    report: str

hn_response = httpx.get("https://news.ycombinator.com", timeout=10)
hn_html = hn_response.text

lobsters_response = httpx.get("https://lobste.rs", timeout=10)
lobsters_html = lobsters_response.text
```

Step 2: parse Hacker News with keyword filtering:

```python
relevant_keywords = [
    'python', 'ai', 'ml', 'machine learning', 'tool', 'developer',
    'architecture', 'software', 'programming', 'database', 'framework',
    'library', 'algorithm', 'performance', 'deployment', 'devops',
    'rust', 'golang', 'javascript', 'typescript', 'web', 'backend',
    'frontend', 'distributed', 'system'
]
skip_keywords = [
    'business', 'funding', 'vc', 'startup funding', 'social media',
    'twitter', 'elon', 'drama', 'crypto', 'bitcoin', 'nft',
    'politics', 'hiring', 'jobs'
]

hn_soup = BeautifulSoup(hn_html, 'html.parser')
hn_rows = hn_soup.find_all('tr', class_='athing')

for idx, row in enumerate(hn_rows[:30]):
    title_cell = row.find('span', class_='titleline')
    a_tag = title_cell.find('a')
    title = a_tag.get_text(strip=True)
    url = a_tag.get('href', '')
    title_lower = title.lower()
    is_relevant = any(kw in title_lower for kw in relevant_keywords)
    is_skip = any(kw in title_lower for kw in skip_keywords)
    if is_relevant and not is_skip:
        hn_items.append(...)
```

Output: “Found 7 relevant items on Hacker News”

Step 3: parse Lobsters:

```python
lobsters_soup = BeautifulSoup(lobsters_html, 'html.parser')
lobsters_rows = lobsters_soup.find_all('li', class_='story')

for idx, row in enumerate(lobsters_rows[:40]):
    title_elem = row.find('a', class_='u-url')
    title = title_elem.get_text(strip=True)
    url = title_elem.get('href', '')
    tags_container = row.find('ul', class_='tags')
    ...
    comments_link = row.find('a', class_='comments_label')
    ...
```

Output: “Found 6 relevant items on Lobsters”

The Lobsters selectors (`li.story`, `a.u-url`) were correct for titles and URLs, but the tag and discussion URL selectors returned nothing.

Steps 4 and 5 generated tags from a keyword map and validated with Pydantic:

```python
validated = ScraperResult(**result)
final_answer(validated.model_dump())
```

Output: “Validation successful 13 items validated”

## Structured output through self-validation

Unlike Pydantic AI’s `output_type=MyModel`, smolagents has no built-in schema enforcement for the final answer. `final_answer` is a built-in tool that smolagents injects into every agent. The system prompt tells the agent to call it when the job is done, and the framework stops the loop. The agent can pass any value to it.

I included Pydantic models in the prompt and told the agent to validate before returning. The agent could ignore the instruction, but in practice it doesn’t. If validation fails, the exception becomes the next observation and the agent has remaining steps to fix the data.

For a harder guarantee, smolagents has `final_answer_checks`: functions that run on the host before accepting the result.
If a check returns `False`, the agent continues:

```python
def validate_result(answer, *args, **kwargs):
    # smolagents passes extra context (such as the agent's memory) alongside
    # the answer; this check only needs the answer itself.
    try:
        ScraperResult(**answer)
        return True
    except Exception:
        return False

agent = CodeAgent(..., final_answer_checks=[validate_result])
```

This runs on your machine, outside the sandbox. The agent can’t bypass it.

## The tradeoff: code vs. judgment

A tool-calling agent would decide “this article about perceptrons is relevant to AI and Python” because it reads the content. A CodeAgent writes keyword-matching code at generation time. In my run, “Trusted Computing Frequently Asked Questions” got tagged as `['ml', 'web', 'rust']` and “How to fix a laptop that reboots randomly” got tagged as `['ml', 'web']` because the keyword filter matched on substrings. The summaries were just the article titles repeated verbatim.

The Perplexity team recently published [research](https://research.perplexity.ai/articles/rethinking-search-as-code-generation) arguing that code-generating agents outperform tool-calling agents for search tasks. Their claim is that code expresses complex retrieval logic more naturally than a sequence of tool calls. The news reader task is too simple to test this, but the approach is gaining traction beyond HuggingFace.

## What would make this better

I intentionally kept the implementation naive to see what a zero-tool CodeAgent produces out of the box.

**The agent guessed page structure.** The Hacker News selectors (`tr.athing`, `span.titleline`) were correct. The Lobsters selectors for titles (`li.story`, `a.u-url`) worked, but the selectors for tags and discussion URLs didn’t match anything. The agent has no way to verify its guesses against the actual HTML. A tool-calling agent with a web fetch tool would have read the markup and adapted. For a CodeAgent, the fix is to give it a deterministic parsing tool. You’d write `parse_hn` and `parse_lobsters` tools with tested selectors, and let the agent call them from its code.
The [smolagents docs](https://huggingface.co/docs/smolagents/tutorials/building_good_agents) recommend exactly this: “Whenever possible, logic should be based on deterministic functions rather than agentic decisions.”

**Keyword matching replaced LLM judgment.** The agent wrote a keyword filter instead of evaluating each article. A better architecture would split scraping from judgment: one agent or deterministic code fetches and parses the raw data, and a second agent reads the titles and summaries to filter and tag them. smolagents has [managed agents](https://huggingface.co/docs/smolagents/tutorials/building_good_agents) for this. You pass one agent as a managed agent to another, and the manager calls it like a function from its generated code.

**Pydantic validation felt bolted on.** I told the agent to validate with Pydantic in the prompt, and it did, but defining models in generated code to validate generated data is circular. If the agent controls both the schema and the data, validation catches typos but not structural problems. A more natural approach for a CodeAgent would be to write results to a structured store (SQLite, for example) where the schema is enforced externally. The sandbox can run `sqlite3` or any Python library. The agent writes `INSERT` statements, and the database rejects malformed data.

**Everything runs sequentially.** The agent uses one kernel and parses one site after another. A more natural architecture would run two CodeAgents in parallel, one per site, each with its own sandbox. A third agent would collect their results, then filter and summarize them. smolagents supports this with `managed_agents`, where one agent calls others as functions from its generated code.

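The externally enforced schema described under “Pydantic validation felt bolted on” could look roughly like this. It is a sketch using only the standard library’s `sqlite3`, not code from the project: the table name, the constraints, the two `source` values, and the `open_store` and `try_insert` helpers are all assumptions made for illustration, and tags are left out for brevity. SQLite is loosely typed by default, so the sketch relies on `NOT NULL`, `UNIQUE`, and `CHECK` constraints to reject bad rows.

```python
import sqlite3
from typing import Optional

# Hypothetical schema loosely mirroring the NewsItem fields (tags omitted).
SCHEMA = """
CREATE TABLE news_items (
    title          TEXT NOT NULL CHECK (length(title) > 0),
    url            TEXT NOT NULL UNIQUE CHECK (url LIKE 'http%'),
    source         TEXT NOT NULL CHECK (source IN ('hackernews', 'lobsters')),
    summary        TEXT NOT NULL,
    discussion_url TEXT
)
"""
COLUMNS = ("title", "url", "source", "summary", "discussion_url")


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the store with a schema the agent does not get to define."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def try_insert(conn: sqlite3.Connection, row: dict) -> Optional[str]:
    """Insert one row; return None on success or the database's error text."""
    params = {col: row.get(col) for col in COLUMNS}  # missing keys become NULL
    try:
        with conn:  # commits on success, rolls back if the insert fails
            conn.execute(
                "INSERT INTO news_items (title, url, source, summary, discussion_url) "
                "VALUES (:title, :url, :source, :summary, :discussion_url)",
                params,
            )
        return None
    except sqlite3.IntegrityError as exc:
        return str(exc)


if __name__ == "__main__":
    store = open_store()
    item = {"title": "Example", "url": "https://example.com/a",
            "source": "lobsters", "summary": "An example item."}
    print(try_insert(store, item))                     # accepted
    print(try_insert(store, {**item, "title": None}))  # rejected by the schema
```

In an agent run, a rejected insert would surface as an exception in the next observation, the same feedback path as a failed Pydantic validation, except that the rules live outside the generated code.
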
## Comparing the approaches

| | Tool-calling agents | CodeAgent |
|---|---|---|
| What the LLM produces | Tool name + arguments | Python code |
| Execution | Framework calls the function | Jupyter kernel runs the code |
| Available capabilities | Only registered tools | Anything Python can do |
| Safety model | Tool allowlist + argument validation | Sandbox isolation (VM/container) |
| Structured output | Schema validation with retry | Self-validation in generated code |
| Best for | Content requiring LLM judgment | Procedural tasks with clear logic |

The full project is on [GitHub](https://github.com/imankulov/news-reader).