How I Built a Self-Managing AI Lab with Hermes Agent on an Intel Arc GPU

A developer built a self-managing AI workspace using Hermes Agent on an Intel Arc GPU, enabling an autonomous agent to run local inference, automate research, and manage cron jobs without human intervention. The system, running on a GMKtec EVO-T1 mini-PC with an Intel Core Ultra 9 processor and 64GB RAM, coordinates multiple LLM backends and maintains 40+ specialized skills for devops and research tasks. The agent also diagnosed and fixed a critical SYCL backend bug that prevented models from loading at 131K context, demonstrating its ability to self-maintain and troubleshoot.

This is a submission for the [Hermes Agent Challenge](https://dev.to/challenges/hermes-agent-2026-05-15): Write About Hermes Agent

## What I Built

A self-managing AI workspace powered by [Hermes Agent](https://hermes-agent.nousresearch.com), where an autonomous agent runs the local inference stack on an Intel Arc GPU, automates research and documentation, manages cron jobs, and coordinates multiple LLM backends without human micromanagement. The human directs goals; the agent executes everything.

Hardware: GMKtec EVO-T1 mini-PC (Intel Core Ultra 9 285H, Intel Arc 140T iGPU, 64GB DDR5-5600), a pocketable 45W system that runs autonomous AI agents 24/7.

The system manages:

- Local LLM inference via llama.cpp on the Intel Arc iGPU (SYCL backend)
- Automated research pipelines feeding structured docs into a persistent vault
- Multi-model testing and benchmarking: 9+ models from 9B to 35B parameters
- Cron-driven monitoring: market data, system health, memory management
- Self-maintaining skills: the agent updates its own skills and docs when things change

## Architecture

The agent runs as a Hermes session with:

- Persistent memory: notes about the environment, user preferences, tool quirks, project conventions
- Durable skills: 40+ specialized procedures for devops, mlops, research, etc.
- Toolsets: terminal, browser, file, cron, git, and more
- Full system access: builds, debugs, tunes, and documents everything autonomously

## GMKtec EVO-T1 Hardware

The host is a GMKtec EVO-T1 mini-PC:

- CPU: Intel Core Ultra 9 285H (Arrow Lake, 16 cores, up to 5.4GHz)
- iGPU: Intel Arc 140T (8 Xe-cores / 128 vector engines, shares system DDR5 as VRAM)
- RAM: 64GB DDR5-5600 (~58GB addressable by GPU)
- Power: ~45W sustained under full load
- Form factor: ~0.6L, pocketable

The Intel Arc 140T iGPU is the inference engine. With the llama.cpp SYCL backend and Intel oneAPI 2026.0, the agent runs GGUF models locally at 131K context.
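Because the iGPU takes its VRAM from system DDR5, a pre-launch budget check is one simple way to reason about headroom. The sketch below is a hypothetical helper, not part of Hermes Agent or llama.cpp. It uses the ~58GB GPU-addressable pool from the hardware list and the ~9.7GB footprint this article reports for 9B models at 130K context; the `reserve_gb` safety margin is an assumption.

```python
from dataclasses import dataclass

GPU_POOL_GB = 58.0  # GPU-addressable share of the 64GB DDR5, per the hardware list


@dataclass(frozen=True)
class Footprint:
    """Measured memory footprint of one loaded model (weights + KV cache)."""
    name: str
    total_gb: float


def headroom_gb(loaded: list[Footprint], pool_gb: float = GPU_POOL_GB) -> float:
    """Return the memory left in the shared pool after all loaded models."""
    used = sum(m.total_gb for m in loaded)
    return pool_gb - used


def fits(candidate: Footprint, loaded: list[Footprint], reserve_gb: float = 4.0) -> bool:
    """True if the candidate fits while leaving reserve_gb free.

    reserve_gb is an assumed safety margin, not a figure from the article.
    """
    return headroom_gb(loaded) - candidate.total_gb >= reserve_gb


sushi = Footprint("Sushi 9B @ 130K ctx", 9.7)  # footprint reported in this article
print(round(headroom_gb([sushi]), 1))  # 48.3
```

The result is consistent with the roughly 48GB of headroom listed under Key Numbers.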
A critical kernel-level SYCL fix was required to prevent JIT compilation crashes at large context sizes: removing the `-ze-intel-greater-than-4GB-buffer-required` linker flag and setting `ONEAPI_DEVICE_SELECTOR=level_zero:gpu`. The agent diagnosed and applied it.

## How It Was Built

All implementation was done by Hermes Agent. The human directed high-level goals; the agent executed every technical step.

### Step 1: Local Inference Server (llama.cpp on Intel Arc)

The agent built a llama.cpp inference server backed by Intel Arc SYCL. The server handles model loading, per-model context sizing, and speculative decoding configuration.

The critical subtlety: different models need different context sizes. `CTX_SIZE` must be set per model, not globally. A 9B coder model gets 130K; a 27B model gets 65K. The agent handles this via model-specific startup configs.

Major SYCL fix: the SYCL backend had a critical bug. The `-ze-intel-greater-than-4GB-buffer-required` linker flag in `ggml-sycl/CMakeLists.txt` caused JIT compilation failures on the CPU SYCL device whenever an operation fell back from the GPU. Removing this flag and setting `ONEAPI_DEVICE_SELECTOR=level_zero:gpu` to restrict execution to the GPU eliminated the `RMS_NORM` crash that prevented models from loading at 131K context. The agent found this, diagnosed it, and fixed it.
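Per-model context sizing and the GPU-only device selector can be sketched as a small launch helper. This is a minimal illustration, not the agent's actual startup config: the registry keys, file paths, port, and exact token counts are hypothetical, and only the 130K / 65K split and the `ONEAPI_DEVICE_SELECTOR` value come from the text above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    gguf_path: str  # hypothetical path to the GGUF file
    ctx_size: int   # context window for this model only


# Hypothetical registry; sizes mirror the 130K / 65K split described above.
MODELS = {
    "coder-9b": ModelConfig("/models/coder-9b.gguf", 130_000),
    "general-27b": ModelConfig("/models/general-27b.gguf", 65_000),
}


def launch_spec(name: str, port: int = 8080) -> tuple[list[str], dict[str, str]]:
    """Build the llama-server argv and environment for one model."""
    cfg = MODELS[name]
    argv = [
        "llama-server",
        "-m", cfg.gguf_path,
        "-c", str(cfg.ctx_size),  # per-model context size, never a global default
        "-ngl", "99",             # offload all layers to the GPU
        "--port", str(port),
    ]
    env = dict(os.environ)
    # Expose only the Level Zero GPU device so no operation lands on the CPU SYCL device.
    env["ONEAPI_DEVICE_SELECTOR"] = "level_zero:gpu"
    return argv, env


argv, env = launch_spec("coder-9b")
print(" ".join(argv))
```

Passing the result to `subprocess.Popen(argv, env=env)` would start the server; keeping the spec as plain data makes it easy to log or diff when a model's context size changes.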
### Step 2: Hermes Agent Configuration

Configured Hermes with:

- OpenRouter as default provider (cloud fallback)
- Local llama-server as local provider (primary for privacy-bound work)
- Skills system for recurring task patterns
- Memory persistence across sessions

### Step 3: Cron Jobs for Automation

The agent uses Hermes cron to run scheduled research, commit/push cycles, and health checks:

- Market data monitoring (Polymarket, Kalshi feeds)
- Workspace backup automation
- Codebase quality scans
- Security monitoring (SSH brute-force, system health, CVE feeds)

### Step 4: Research Pipeline (research vault)

The agent does autonomous research and documents findings in a structured vault:

## Model Lineup

The system coordinates multiple GGUF models depending on task type:

| Model | Architecture | Params | Context | Quant | Role | Notes |
|---|---|---|---|---|---|---|
| Qwen3.5-9B-Sushi-Coder-RL | Qwen 3.5 MoE | 9B | 130K | Q4_K_M | Daily driver | RL-tuned, best agentic quality, clean JSON output |
| Qwen3-Coder-30B-A3B | Qwen 3 MoE | 30B (3B active) | 65K | Q3_K_M | Coding specialist | Best decode throughput, strong at code generation |
| Qwen3.6-35B-UD-IQ4_NL | Qwen 3.5 MoE | 35B | 65K | UD-IQ4_NL | Reasoning | Highest reasoning quality, heavier VRAM cost |
| Qwen3.5-9B-DeepSeek-V4-Flash | Qwen 3.5 hybrid | 9B | 130K | Q4_K_M | Secondary | Fastest prefill, but output is reasoning-only (`content` field empty) |
| Qwopus3.5-9B-Coder-MTP | Qwen 3.5 w/ MTP | 9B | 8K effective | Q4_K_M | Deprecated | MTP merge caused KV cache contamination, garbled output |

### Why These Models

- Sushi 9B is the only production-viable 9B model for agentic work on this hardware: it passed all 6 agentic tests with 0 HTTP 500 errors, produced valid JSON, and retained multi-turn context correctly.
- Coder 30B is a MoE model (30B total, 3B active parameters), so decode is fast despite the large parameter count: 11.52 t/s decode vs 8.24 t/s for the 9B model.
- DS-V4-Flash is useful for quick reasoning tasks where you don't need structured output; 190 t/s prefill makes it fast for short prompts.
- 27B-class models fill the gap between 9B and 35B: reasonable quality without the VRAM overhead of the larger model in the shared memory pool.

## Agentic Benchmark Results

Comprehensive agentic evaluations were run across all 9B models at 131K context:

| Model | Tests Pass | HTTP 500 | JSON Valid | Total Time | Quality |
|---|---|---|---|---|---|
| Sushi 9B | 6/6 | 0 | Yes (3/3) | 561s | Best |
| DS-V4-Flash | 6/6 | 0 | No (0/3) | 592s | Reasoning-only |
| Qwopus MTP | 2/6 | 4 | No (0/3) | 256s | Broken |

### Key Findings

**Sushi 9B** (production daily driver):

- Only model to pass all 6 agentic tests without errors
- Correct multi-turn context retention across 3 turns
- Valid structured JSON output (T2: 3/3 score)
- Correct VRAM calculations (all 9B models: ~9.7GB at 130K ctx, no OOM risk within the 58GB shared pool)
- Best instruction following (10 constraints, 4 paragraphs)

**Qwopus MTP** (deprecated):

- 4 out of 6 tests returned HTTP 500 internal server errors
- Garbled output containing mixed Chinese/English pseudotext
- KV cache contamination: corrupted output poisons subsequent requests
- This is a model quality issue in the MTP merge, not fixable by configuration

**DS-V4-Flash** (secondary):

- Stable, but all output lands in `reasoning_content` only (the `content` field is empty)
- Coherent reasoning, but cannot produce valid structured JSON in `content`
- Fast prefill (190 t/s) but 8.24 t/s decode

## Technical Decisions Validated

- **Local-first, cloud-fallback**: All inference runs locally by default. Cloud is used only for models not running locally.
- **Per-model context sizing**: Context window sizes are model-specific, not global. This prevents OOM on the Arc GPU's shared VRAM.
- **Skills over prompting**: Every recurring workflow is encoded as a skill file. The system maintains itself.
- **Git-backed vault**: All research auto-commits to GitHub. The workspace is the artifact.
- **Automated security monitoring**: The agent watches for intrusions, monitors CVE feeds, and posts alerts to Discord, so the workspace defends itself.
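The "JSON Valid" column in the benchmark results separates models that put parseable JSON in the reply's `content` field from models that answer only in `reasoning_content`. A check of that kind might look like the sketch below. It is an illustration rather than the actual benchmark harness: the function name and category labels are hypothetical, and it assumes an OpenAI-style chat message dict with the two fields named in the findings.

```python
import json


def classify_reply(message: dict) -> str:
    """Classify one assistant message from an OpenAI-style chat completion.

    Returns "json_valid", "invalid_json", "reasoning_only", or "empty".
    """
    content = (message.get("content") or "").strip()
    reasoning = (message.get("reasoning_content") or "").strip()
    if not content:
        # Nothing in the user-facing field; note whether the model only "thought".
        return "reasoning_only" if reasoning else "empty"
    try:
        json.loads(content)
    except json.JSONDecodeError:
        return "invalid_json"
    return "json_valid"


print(classify_reply({"content": '{"status": "ok"}'}))                   # json_valid
print(classify_reply({"content": "", "reasoning_content": "thinking"}))  # reasoning_only
```

Under this scheme, a reply with an empty `content` field counts as a JSON failure even when the reasoning text is coherent, which matches how DS-V4-Flash is scored in the table.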
## Security Infrastructure

The server runs automated security monitoring set up by Hermes Agent:

- UFW firewall: default deny incoming, SSH only from LAN + Tailscale
- fail2ban: auto-ban after 3 failed SSH attempts
- Cron `security-monitor`: every 30 min; checks brute-force, new devices, firewall, services, gateway
- Cron `vulnerability-feed-monitor`: every 12 hours; CVE monitoring for Ubuntu, kernel, Docker, Freebox OS
- Discord alerts: CRITICAL and HIGH severity findings posted automatically
- Pentest tools: nmap, masscan, tcpdump, arp-scan, netcat, wireshark

## Key Numbers

- 58GB shared VRAM on Intel Arc 140T
- 130K context window (Sushi 9B)
- 9.7GB total VRAM usage at 130K ctx for 9B models (weights + KV cache)
- 48GB VRAM headroom at 130K ctx
- 8.24 t/s decode speed (Sushi 9B)
- 166 t/s prefill speed (Sushi 9B)
- 190 t/s prefill speed (DS-V4-Flash)
- ~36-37s per generation turn (Sushi 9B at 256 max tokens)
- 0 HTTP 500 errors across 6 agentic tests (Sushi 9B)
- 9+ GGUF models tested (9B through 35B parameters)
- 6+ months of continuous local inference development by Hermes Agent
- Automated security monitoring: log analysis, intrusion detection, CVE feed monitoring, Discord alerts

## Demo / How to Replicate

The entire setup (llama.cpp SYCL build, Hermes Agent config, benchmark suite, and documentation) was built and maintained by Hermes Agent.

Minimal setup:

All local model research, SYCL GPU debugging, production inference setup, benchmark design, security hardening, and this blog article were implemented by Hermes Agent. The human directed goals and validated results. The agent executed every step, from kernel flag surgery to final documentation.