*This is a submission for the Hermes Agent Challenge: Write About Hermes Agent*
## What I Built
A self-managing AI workspace powered by Hermes Agent, where an autonomous agent runs the local inference stack on an Intel Arc GPU, automates research and documentation, manages cron jobs, and coordinates multiple LLM backends without human micro-management. The human directs goals; the agent executes everything.
Hardware: GMKtec EVO-T1 mini-PC (Intel Core Ultra 9 285H, Intel Arc 140T iGPU, 64GB DDR5-5600), a pocketable 45W system that runs autonomous AI agents 24/7.
The system manages:
- Local LLM inference via llama.cpp on Intel Arc SYCL (iGPU)
- Automated research pipelines feeding structured docs into a persistent vault
- Multi-model testing and benchmarking: 9+ models across 9B to 35B parameters
- Cron-driven monitoring: market data, system health, memory management
- Self-maintaining skills: the agent updates its own skills and docs when things change
## Architecture
The agent runs as a Hermes session with:
- Persistent memory: notes about the environment, user preferences, tool quirks, project conventions
- Durable skills: 40+ specialized procedures for devops, mlops, research, etc.
- Toolsets: terminal, browser, file, cron, git, and more
- Full system access: builds, debugs, tunes, and documents everything autonomously
### GMKtec EVO-T1 Hardware
The host is a GMKtec EVO-T1 mini-PC:
- CPU: Intel Core Ultra 9 285H (Arrow Lake, 16 cores, up to 5.4GHz)
- iGPU: Intel Arc 140T (8 Xe-cores with 128 vector engines, shares system DDR5 as VRAM)
- RAM: 64GB DDR5-5600 (~58GB addressable by GPU)
- Power: ~45W sustained under full load
- Form factor: ~0.6L, pocketable
The Intel Arc 140T iGPU is the inference engine. With llama.cpp SYCL backend and Intel oneAPI 2026.0, the agent runs GGUF models locally at 131K context. A critical kernel-level SYCL fix (removing the `-ze-intel-greater-than-4GB-buffer-required` CUDA-style linker flag and setting `ONEAPI_DEVICE_SELECTOR=level_zero:gpu`) was required to prevent JIT compilation crashes at large context sizes. It was diagnosed and applied by the agent.
## How It Was Built
All implementation was done by Hermes Agent. The human directed high-level goals; the agent executed every technical step.
### Step 1: Local Inference Server (llama.cpp on Intel Arc)
The agent built a llama.cpp inference server backed by Intel Arc SYCL. The server handles model loading, context sizing per model, and spec decode configuration.
The critical subtlety: different models need different context sizes. CTX_SIZE must be set per-model, not globally. A 9B coder model gets 130k; a 27B model gets 65k. The agent handles this via model-specific startup configs.
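A minimal sketch of per-model context sizing, assuming the server is started through a small launcher that looks up each model's settings. The model keys, file paths, exact token counts, and the full GPU offload value are placeholders for illustration; only the `llama-server` flags themselves (`--model`, `--ctx-size`, `--n-gpu-layers`, `--port`) come from llama.cpp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    gguf_path: str  # hypothetical path to the GGUF file
    ctx_size: int   # context window in tokens, chosen per model


# Illustrative values: the article gives roughly 130k for a 9B coder model
# and 65k for a 27B model. Names and paths are placeholders.
MODELS = {
    "coder-9b": ModelConfig("models/coder-9b.gguf", 130_000),
    "general-27b": ModelConfig("models/general-27b.gguf", 65_000),
}


def server_command(name: str, port: int = 8080) -> list[str]:
    """Build a llama-server command line with the model's own context size."""
    cfg = MODELS[name]
    return [
        "llama-server",
        "--model", cfg.gguf_path,
        "--ctx-size", str(cfg.ctx_size),
        "--n-gpu-layers", "99",  # assumption: offload all layers to the iGPU
        "--port", str(port),
    ]
```

Keeping the context size next to the model path means a model swap cannot silently inherit a window sized for a smaller model.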
Major SYCL fix: The SYCL backend had a critical bug. The `-ze-intel-greater-than-4GB-buffer-required` linker flag in `ggml-sycl/CMakeLists.txt` caused JIT compilation failures on the CPU SYCL device whenever an operation fell back from the GPU. Removing this flag and setting `ONEAPI_DEVICE_SELECTOR=level_zero:gpu` to restrict execution to the GPU eliminated the RMS_NORM crash that prevented models from loading at 131K context. The agent found this, diagnosed it, and fixed it.
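The two parts of the fix can be sketched as follows, assuming the server is launched from Python. The helper names are hypothetical. The sketch only reports where the flag appears rather than editing the build file, because the exact form of the CMake line is not given here.

```python
import os

FLAG = "-ze-intel-greater-than-4GB-buffer-required"


def gpu_only_env(base=None):
    """Return a copy of the environment with SYCL restricted to the Level Zero GPU."""
    env = dict(os.environ if base is None else base)
    env["ONEAPI_DEVICE_SELECTOR"] = "level_zero:gpu"
    return env


def lines_with_flag(cmake_text: str) -> list[int]:
    """Return 1-based line numbers of a CMakeLists.txt body that mention FLAG."""
    return [
        number
        for number, line in enumerate(cmake_text.splitlines(), start=1)
        if FLAG in line
    ]
```

The environment would then be passed to the launcher, for example `subprocess.run(command, env=gpu_only_env())`, so the restriction applies only to the inference process.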
### Step 2: Hermes Agent Configuration
Configured Hermes with:
- OpenRouter as default provider (cloud fallback)
- Local llama-server as local provider (primary for privacy-bound work)
- Skills system for recurring task patterns
- Memory persistence across sessions
### Step 3: Cron Jobs for Automation
The agent uses Hermes cron to run scheduled research, commit/push cycles, and health checks:
- Market data monitoring (Polymarket, Kalshi feeds)
- Workspace backup automation
- Codebase quality scans
- Security monitoring (SSH brute-force, system health, CVE feeds)
### Step 4: Research Pipeline (research vault)
The agent does autonomous research and documents findings in a structured vault.
## Model Lineup
The system coordinates multiple GGUF models depending on task type:
| Model | Architecture | Params | Context | Quant | Role | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3.5-9B-Sushi-Coder-RL | Qwen 3.5 MoE | 9B | 130K | Q4_K_M | Daily driver | RL-tuned, best agentic quality, clean JSON output |
| Qwen3-Coder-30B-A3B | Qwen 3 MoE | 30B (3B active) | 65K | Q3_K_M | Coding specialist | Best decode throughput, strong at code generation |
| Qwen3.6-35B-UD-IQ4_NL | Qwen 3.5 MoE | 35B | 65K | UD-IQ4_NL | Reasoning | Highest reasoning quality, heavier VRAM cost |
| Qwen3.5-9B-DeepSeek-V4-Flash | Qwen 3.5 hybrid | 9B | 130K | Q4_K_M | Secondary | Fastest prefill, but output is reasoning-only (content field empty) |
| Qwopus3.5-9B-Coder-MTP | Qwen 3.5 w/ MTP | 9B | 8K effective | Q4_K_M | Deprecated | MTP merge caused KV cache contamination, garbled output |
### Why These Models
- Sushi 9B is the only production-viable 9B model for agentic work on this hardware: it passed all 6 agentic tests with 0 HTTP 500 errors, produced valid JSON, and retained multi-turn context correctly.
- Coder 30B is a MoE model (30B total, 3B active parameters), so decode is fast despite the large parameter count: 11.52 t/s decode vs 8.24 t/s for the 9B model.
- DS-V4-Flash is useful for quick reasoning tasks where you don't need structured output; 190 t/s prefill makes it fast for short prompts.
- 27B class models fill the gap between 9B and 35B, giving reasonable quality without the VRAM overhead of the larger model in the shared memory pool.
## Agentic Benchmark Results
Ran comprehensive agentic evaluations across all 9B models at 131K context:
| Model | Tests Pass | HTTP 500 | JSON Valid | Total Time | Quality |
| --- | --- | --- | --- | --- | --- |
| Sushi 9B | 6/6 | 0 | Yes (3/3) | 561s | Best |
| DS-V4-Flash | 6/6 | 0 | No (0/3) | 592s | Reasoning-only |
| Qwopus MTP | 2/6 | 4 | No (0/3) | 256s | Broken |
### Key Findings
**Sushi 9B (production daily driver):**
- Passed all 6 agentic tests with no HTTP 500 errors, and the only model that also produced valid JSON
- Correct multi-turn context retention across 3 turns
- Valid structured JSON output (T2: 3/3 score)
- Correct VRAM calculations (all 9B models: ~9.7GB at 130K ctx, no OOM risk within the 58GB shared pool)
- Best instruction following (10 constraints, 4 paragraphs)
**Qwopus MTP (deprecated):**
- 4 out of 6 tests returned HTTP 500 internal server errors
- Garbled output containing mixed Chinese/English pseudotext
- KV cache contamination: corrupted output poisons subsequent requests
- This is a model quality issue in the MTP merge and is not fixable by configuration
**DS-V4-Flash (secondary):**
- Stable, but all output is in reasoning_content only (content field empty)
- Coherent reasoning but cannot produce valid structured JSON in content
- Fast prefill (190 t/s) but 8.24 t/s decode
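The difference between a usable structured reply and a reasoning-only one can be illustrated with a small check, assuming responses follow the OpenAI-style chat completion shape that llama-server returns. This is a sketch of the idea, not the benchmark's actual scoring code; the function name and sample payloads are made up.

```python
import json


def structured_output(response: dict):
    """Return parsed JSON from the assistant's `content`, or None if unusable."""
    message = response["choices"][0]["message"]
    content = (message.get("content") or "").strip()
    if not content:
        return None  # reasoning-only reply: nothing landed in `content`
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        return None  # text was returned, but it is not valid JSON


# Hypothetical payloads for illustration.
valid_reply = {"choices": [{"message": {"content": '{"status": "ok"}'}}]}
reasoning_only = {
    "choices": [{"message": {"content": "", "reasoning_content": "Let me think."}}]
}
```

Under this check a reply whose text sits only in `reasoning_content` counts as a JSON failure even when the reasoning itself is coherent, which matches the "No (0/3)" pattern reported for DS-V4-Flash.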
### Technical Decisions Validated
- Local-first, cloud-fallback: All inference runs local by default. Cloud only for models not running locally.
- Per-model context sizing: Context window sizes are model-specific, not global. This prevents OOM on the Arc GPU's shared VRAM.
- Skills over prompting: Every recurring workflow is encoded as a skill file. The system maintains itself.
- Git-backed vault: All research auto-commits to GitHub. The workspace is the artifact.
- Automated security monitoring: The agent watches for intrusions, monitors CVE feeds, and posts alerts to Discord, so the workspace defends itself.
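The local-first, cloud-fallback rule can be written as a one-line routing decision. This is a sketch under assumptions: the local address is llama-server's default port, the model names are placeholders, and how Hermes itself selects a provider is not shown in this write-up.

```python
LOCAL_BASE_URL = "http://127.0.0.1:8080/v1"      # assumed llama-server address
CLOUD_BASE_URL = "https://openrouter.ai/api/v1"  # OpenRouter's OpenAI-compatible API


def pick_backend(model: str, local_models: set[str]) -> str:
    """Prefer the local server; use the cloud only for models not served locally."""
    return LOCAL_BASE_URL if model in local_models else CLOUD_BASE_URL
```

For example, `pick_backend("sushi-9b", {"sushi-9b"})` returns the local URL, while any model name missing from the set is routed to OpenRouter.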
## Security Infrastructure
The server runs automated security monitoring set up by Hermes Agent:
- UFW firewall: default deny incoming, SSH only from LAN + Tailscale
- fail2ban: auto-ban after 3 failed SSH attempts
- Cron: security-monitor, every 30 min, checks brute-force, new devices, firewall, services, gateway
- Cron: vulnerability-feed-monitor, every 12 hours, CVE monitoring for Ubuntu, kernel, Docker, Freebox OS
- Discord alerts: CRITICAL and HIGH severity findings posted automatically
- Pentest tools: nmap, masscan, tcpdump, arp-scan, netcat, wireshark
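The two monitoring cadences above map onto standard five-field cron expressions. The sketch below assumes the scheduler accepts that syntax, which is not confirmed here for Hermes cron. The matcher is deliberately simplified: it checks only the minute and hour fields and supports just `*`, `*/n`, and plain integers.

```python
from datetime import datetime

# Job names come from the list above; the expressions are assumed equivalents.
SCHEDULES = {
    "security-monitor": "*/30 * * * *",            # every 30 minutes
    "vulnerability-feed-monitor": "0 */12 * * *",  # every 12 hours, on the hour
}


def _field_matches(field: str, value: int) -> bool:
    """Match one cron field against a value (supports '*', '*/n', and integers)."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0
    return value == int(field)


def is_due(expr: str, when: datetime) -> bool:
    """Check the minute and hour fields only; day and month fields are ignored."""
    minute, hour, *_ = expr.split()
    return _field_matches(minute, when.minute) and _field_matches(hour, when.hour)
```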
## Key Numbers
- 58GB shared VRAM on Intel Arc 140T
- 130K context window (Sushi 9B)
- 9.7GB total VRAM usage at 130K ctx for 9B models (weights + KV cache)
- 48GB VRAM headroom at 130K ctx
- 8.24 t/s decode speed (Sushi 9B)
- 166 t/s prefill speed (Sushi 9B)
- 190 t/s prefill speed (DS-V4-Flash)
- ~36-37s per generation turn (Sushi 9B at 256 max_tokens)
- 0 HTTP 500 errors across 6 agentic tests (Sushi 9B)
- 9+ GGUF models tested (9B through 35B parameters)
- 6+ months of continuous local inference development by Hermes Agent
- Automated security monitoring: log analysis, intrusion detection, CVE feed monitoring, Discord alerts
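Two of these figures can be cross-checked against each other with simple arithmetic. The sketch below uses only the numbers listed above; the variable names are illustrative.

```python
POOL_GB = 58.0     # shared VRAM addressable by the iGPU
USED_GB = 9.7      # weights + KV cache for a 9B model at 130K context
DECODE_TPS = 8.24  # Sushi 9B decode speed in tokens per second
MAX_TOKENS = 256   # generation cap per turn

# Memory left in the shared pool after loading a 9B model.
headroom_gb = POOL_GB - USED_GB

# Time spent on decode alone if all 256 tokens are generated.
decode_seconds = MAX_TOKENS / DECODE_TPS

print(f"headroom: {headroom_gb:.1f} GB")
print(f"decode time: {decode_seconds:.1f} s")
```

The headroom works out to about 48.3GB, consistent with the 48GB figure, and decode alone accounts for about 31 seconds of the reported 36-37 second turn. The remaining 5 to 6 seconds would be consistent with prompt prefill and request overhead, though the prompt lengths used are not stated.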
## Demo / How to Replicate
The entire setup (llama.cpp SYCL build, Hermes Agent config, benchmark suite, and documentation) was built and maintained by Hermes Agent.
Minimal setup:
All local model research, SYCL GPU debugging, production inference setup, benchmark design, security hardening, and this blog article were implemented by Hermes Agent. The human directed goals and validated results. The agent executed every step, from kernel flag surgery to final documentation.