How I Built a Self-Managing AI Lab with Hermes Agent on an Intel Arc GPU


*This is a submission for the Hermes Agent Challenge: Write About Hermes Agent*

# What I Built

A self-managing AI workspace powered by Hermes Agent: an autonomous agent runs the local inference stack on an Intel Arc GPU, automates research and documentation, manages cron jobs, and coordinates multiple LLM backends without human micromanagement. The human directs goals; the agent executes everything.

Hardware: GMKtec EVO-T1 mini-PC (Intel Core Ultra 9 285H, Intel Arc 140T iGPU, 64GB DDR5-5600) — a pocketable 45W system that runs autonomous AI agents 24/7.

The system manages:

- Local LLM inference via llama.cpp on Intel Arc SYCL (iGPU)
- Automated research pipelines feeding structured docs into a persistent vault
- Multi-model testing and benchmarking: 9+ models across 9B to 35B parameters
- Cron-driven monitoring: market data, system health, memory management
- Self-maintaining skills: the agent updates its own skills and docs when things change

# Architecture

The agent runs as a Hermes session with:

- Persistent memory: notes about the environment, user preferences, tool quirks, project conventions
- Durable skills: 40+ specialized procedures for devops, mlops, research, etc.
- Toolsets: terminal, browser, file, cron, git, and more
- Full system access: builds, debugs, tunes, and documents everything autonomously

GMKtec EVO-T1 Hardware

The host is a GMKtec EVO-T1 mini-PC:

- CPU: Intel Core Ultra 9 285H (Arrow Lake, 16 cores, up to 5.4GHz)
- iGPU: Intel Arc 140T (8 Xe-cores with 128 vector engines, shares system DDR5 as VRAM)
- RAM: 64GB DDR5-5600 (~58GB addressable by GPU)
- Power: ~45W sustained under full load
- Form factor: ~0.6L, pocketable

The Intel Arc 140T iGPU is the inference engine. With the llama.cpp SYCL backend and Intel oneAPI 2026.0, the agent runs GGUF models locally at 131K context. A critical kernel-level SYCL fix (removing the `-ze-intel-greater-than-4GB-buffer-required` CUDA-style linker flag and setting `ONEAPI_DEVICE_SELECTOR=level_zero:gpu`) was required to prevent JIT compilation crashes at large context sizes. The agent diagnosed and applied it.

# How It Was Built

All implementation was done by Hermes Agent. The human directed high-level goals; the agent executed every technical step.

Step 1: Local Inference Server (llama.cpp on Intel Arc)

The agent built a llama.cpp inference server backed by Intel Arc SYCL. The server handles model loading, per-model context sizing, and speculative decoding configuration.

The critical subtlety: different models need different context sizes. CTX_SIZE must be set per-model, not globally. A 9B coder model gets 130k; a 27B model gets 65k. The agent handles this via model-specific startup configs.
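
As a rough illustration of per-model context sizing, the sketch below builds a `llama-server` command line from a small registry. The registry keys, file paths, and the `gpu_layers` default are hypothetical placeholders, and "130k" and "65k" are read here as 130,000 and 65,000 tokens; this is not the agent's actual startup config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Per-model launch settings; all names and paths here are hypothetical."""
    gguf_path: str
    ctx_size: int          # context window in tokens, set per model
    gpu_layers: int = 99   # assumption: offload every layer to the iGPU


# Hypothetical registry. Context sizes follow the article (130k for a 9B
# coder model, 65k for a 27B model); the paths are placeholders.
MODELS = {
    "coder-9b": ModelConfig("/models/coder-9b.gguf", ctx_size=130_000),
    "general-27b": ModelConfig("/models/general-27b.gguf", ctx_size=65_000),
}


def llama_server_args(name: str, port: int = 8080) -> list[str]:
    """Build a llama-server command line for one model."""
    cfg = MODELS[name]
    return [
        "llama-server",
        "--model", cfg.gguf_path,
        "--ctx-size", str(cfg.ctx_size),
        "--n-gpu-layers", str(cfg.gpu_layers),
        "--port", str(port),
    ]


if __name__ == "__main__":
    print(" ".join(llama_server_args("general-27b")))
```

Keeping the context size next to the model path means a new model cannot be added without someone choosing its context window explicitly.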

Major SYCL fix: The SYCL backend had a critical bug. The `-ze-intel-greater-than-4GB-buffer-required` linker flag in `ggml-sycl/CMakeLists.txt` caused JIT compilation failures on the CPU SYCL device whenever an operation fell back from the GPU. Removing this flag and setting `ONEAPI_DEVICE_SELECTOR=level_zero:gpu` to restrict execution to the GPU eliminated the RMS_NORM crash that prevented models from loading at 131K context. The agent found, diagnosed, and fixed it.
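
The runtime half of this fix can be sketched as a small launcher helper. The function names below are hypothetical; only the environment variable value and the flag string come from the text above, and the CMake check is a plain substring test rather than a real parser.

```python
import os
from typing import Dict, Optional

PROBLEM_FLAG = "-ze-intel-greater-than-4GB-buffer-required"


def gpu_only_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment restricted to the Level Zero GPU."""
    env = dict(os.environ if base is None else base)
    # Value taken from the article: expose only the Level Zero GPU device,
    # so no operation can fall back to the CPU SYCL device.
    env["ONEAPI_DEVICE_SELECTOR"] = "level_zero:gpu"
    return env


def flag_still_present(cmake_text: str) -> bool:
    """Report whether the problematic flag still appears in a CMake file's text."""
    return PROBLEM_FLAG in cmake_text
```

A launcher could pass the result to `subprocess.Popen(cmd, env=gpu_only_env())`, which keeps the restriction scoped to the inference server instead of the whole shell session.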

Step 2: Hermes Agent Configuration

Configured Hermes with:

- OpenRouter as default provider (cloud fallback)
- Local llama-server as local provider (primary for privacy-bound work)
- Skills system for recurring task patterns
- Memory persistence across sessions

Step 3: Cron Jobs for Automation

The agent uses Hermes cron to run scheduled research, commit/push cycles, and health checks:

- Market data monitoring (Polymarket, Kalshi feeds)
- Workspace backup automation
- Codebase quality scans
- Security monitoring (SSH brute-force, system health, CVE feeds)
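
A scheduled health check might look like the sketch below, which polls the `/health` endpoint of llama-server. The base URL, timeout, and status labels are assumptions made for illustration; the Hermes cron definition itself is not shown because its format is not described here.

```python
import urllib.error
import urllib.request


def classify_health(status_code: int) -> str:
    """Map an HTTP status from llama-server's /health endpoint to a label."""
    if status_code == 200:
        return "ok"
    if status_code == 503:
        return "loading"   # server is up but the model is not ready yet
    return "error"


def check_server(base_url: str = "http://127.0.0.1:8080", timeout: float = 5.0) -> str:
    """Query /health and return 'ok', 'loading', 'error', or 'down'."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return classify_health(resp.status)
    except urllib.error.HTTPError as exc:
        # urlopen raises for non-2xx responses; the status code is still useful.
        return classify_health(exc.code)
    except (urllib.error.URLError, OSError):
        return "down"      # connection refused, timeout, DNS failure, etc.
```

Separating `classify_health` from the network call keeps the decision logic testable without a running server.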

Step 4: Research Pipeline (research vault)

The agent conducts autonomous research and documents its findings in a structured vault.

# Model Lineup

The system coordinates multiple GGUF models depending on task type.

Why These Models

- Sushi 9B is the only production-viable 9B model for agentic work on this hardware: it passed all 6 agentic tests with 0 HTTP 500 errors, produced valid JSON, and retained multi-turn context correctly
- Coder 30B is a MoE model (30B total, 3B active parameters), so decode is fast despite the large parameter count: 11.52 t/s decode vs 8.24 t/s for the 9B model
- DS-V4-Flash is useful for quick reasoning tasks where you don't need structured output; 190 t/s prefill makes it fast for short prompts
- 27B class models fill the gap between 9B and 35B: reasonable quality without the VRAM overhead of the larger model in the shared memory pool

# Agentic Benchmark Results

Ran comprehensive agentic evaluations across all 9B models at 131K context:

| Model | Tests Pass | HTTP 500 | JSON Valid | Total Time | Quality |
|---|---|---|---|---|---|
| Sushi 9B | 6/6 | 0 | Yes (3/3) | 561s | Best |
| DS-V4-Flash | 6/6 | 0 | No (0/3) | 592s | Reasoning-only |
| Qwopus MTP | 2/6 | 4 | No (0/3) | 256s | Broken |

Key Findings

**Sushi 9B (production daily driver):**

- Passed all 6 agentic tests with 0 HTTP 500 errors, and the only model that also produced valid JSON
- Correct multi-turn context retention across 3 turns
- Valid structured JSON output (T2: 3/3 score)
- Correct VRAM calculations (all 9B models: ~9.7GB at 130K ctx, no OOM risk within the 58GB shared pool)
- Best instruction following (10 constraints, 4 paragraphs)

**Qwopus MTP (deprecated):**

- 4 out of 6 tests returned HTTP 500 internal server errors
- Garbled output containing mixed Chinese/English pseudotext
- KV cache contamination: corrupted output poisons subsequent requests
- This is a model quality issue in the MTP merge, not fixable by configuration

**DS-V4-Flash (secondary):**

- Stable, but all output is in reasoning_content only (content field empty)
- Coherent reasoning but cannot produce valid structured JSON in content
- Fast prefill (190 t/s) but 8.24 t/s decode
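
The JSON validity failure described above can be detected with a check along these lines. It assumes an OpenAI-compatible chat completion response parsed into a dict; the function name is hypothetical and this is not the benchmark's actual scoring code.

```python
import json
from typing import Any, Optional


def extract_json_content(response: dict) -> Optional[Any]:
    """Return parsed JSON from message.content, or None if empty or invalid."""
    message = response["choices"][0]["message"]
    content = (message.get("content") or "").strip()
    if not content:
        # Covers the DS-V4-Flash case: text only in reasoning_content.
        return None
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        return None


if __name__ == "__main__":
    reasoning_only = {"choices": [{"message": {
        "content": "", "reasoning_content": "The answer is 4."}}]}
    print(extract_json_content(reasoning_only))
```

A harness built this way scores a reasoning-only reply as a structured-output failure even when the reasoning itself is coherent, which matches the "No (0/3)" entry in the table.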

Technical Decisions Validated

- Local-first, cloud-fallback: All inference runs locally by default. Cloud is used only for models not running locally.
- Per-model context sizing: Context window sizes are model-specific, not global. This prevents OOM on the Arc GPU's shared VRAM.
- Skills over prompting: Every recurring workflow is encoded as a skill file. The system maintains itself.
- Git-backed vault: All research auto-commits to GitHub. The workspace is the artifact.
- Automated security monitoring: The agent watches for intrusions, monitors CVE feeds, and posts alerts to Discord, so the workspace defends itself.
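
To make the per-model context sizing decision concrete, here is a back-of-the-envelope KV cache estimate. It uses the standard formula for a transformer with grouped-query attention and an f16 cache. The architecture numbers in the usage example are hypothetical and are not those of the models above, so it is not expected to reproduce the 9.7GB figure, which depends on the real architecture and cache quantization. Compute buffers are ignored.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_tokens: int, bytes_per_elem: int = 2) -> int:
    """Size of the K and V caches for one sequence, in bytes."""
    # Factor 2 covers keys and values; every layer stores both per token.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem


def max_context(budget_bytes: int, weights_bytes: int, n_layers: int,
                n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """Largest context (in tokens) whose KV cache fits beside the weights."""
    per_token = kv_cache_bytes(n_layers, n_kv_heads, head_dim, 1, bytes_per_elem)
    return max(0, (budget_bytes - weights_bytes) // per_token)


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 8 KV heads, head_dim 128, f16 cache.
    size = kv_cache_bytes(32, 8, 128, 131_072)
    print(f"{size / 2**30:.1f} GiB of KV cache at 131072 tokens")
```

Because the cache grows linearly with context length, halving the context for a larger model frees memory in direct proportion, which is why a single global context size is a poor fit for a shared memory pool.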

# Security Infrastructure

The server runs automated security monitoring set up by Hermes Agent:

- UFW firewall: default deny incoming, SSH only from LAN + Tailscale
- fail2ban: auto-ban after 3 failed SSH attempts
- Cron: security-monitor, every 30 min, checks brute-force attempts, new devices, firewall, services, gateway
- Cron: vulnerability-feed-monitor, every 12 hours, CVE monitoring for Ubuntu, kernel, Docker, Freebox OS
- Discord alerts: CRITICAL and HIGH severity findings posted automatically
- Pentest tools: nmap, masscan, tcpdump, arp-scan, netcat, wireshark
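
The brute-force check could be approximated as follows. This is an illustrative sketch, not fail2ban's implementation or the agent's actual monitor: it assumes the common OpenSSH "Failed password" log line format, and the threshold of 3 mirrors the fail2ban setting listed above.

```python
import re
from collections import Counter
from typing import Iterable, List

# Matches lines such as:
#   "Failed password for invalid user admin from 203.0.113.5 port 52344 ssh2"
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?\S+ from (\S+) port \d+"
)


def offending_ips(log_lines: Iterable[str], max_retry: int = 3) -> List[str]:
    """Return source IPs with at least max_retry failed SSH logins, sorted."""
    counts: Counter = Counter()
    for line in log_lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(1)] += 1
    return sorted(ip for ip, n in counts.items() if n >= max_retry)
```

A monitoring job could feed this the recent lines of the SSH auth log and raise an alert when the returned list is non-empty.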

# Key Numbers

- 58GB shared VRAM on Intel Arc 140T
- 130K context window (Sushi 9B)
- 9.7GB total VRAM usage at 130K ctx for 9B models (weights + KV cache)
- 48GB VRAM headroom at 130K ctx
- 8.24 t/s decode speed (Sushi 9B)
- 166 t/s prefill speed (Sushi 9B)
- 190 t/s prefill speed (DS-V4-Flash)
- ~36-37s per generation turn (Sushi 9B at 256 max_tokens)
- 0 HTTP 500 errors across 6 agentic tests (Sushi 9B)
- 9+ GGUF models tested (9B through 35B parameters)
- 6+ months of continuous local inference development by Hermes Agent
- Automated security monitoring: log analysis, intrusion detection, CVE feed monitoring, Discord alerts
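
The per-turn latency figure can be sanity-checked from the throughput numbers in the list above. The sketch assumes a simple two-phase model (prefill, then decode) with no other overhead; the 900-token prompt in the usage example is a hypothetical value chosen for illustration, not a measured one.

```python
def turn_seconds(prompt_tokens: int, max_tokens: int,
                 prefill_tps: float = 166.0, decode_tps: float = 8.24) -> float:
    """Estimate one generation turn: prompt processing plus token decoding."""
    return prompt_tokens / prefill_tps + max_tokens / decode_tps


if __name__ == "__main__":
    decode_only = turn_seconds(0, 256)
    with_prompt = turn_seconds(900, 256)   # hypothetical 900-token prompt
    print(f"decode only: {decode_only:.1f}s, with prompt: {with_prompt:.1f}s")
```

Decoding 256 tokens at 8.24 t/s takes about 31 seconds on its own, so under this model most of each turn is spent in decode rather than prefill.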

# Demo / How to Replicate

The entire setup — llama.cpp SYCL build, Hermes Agent config, benchmark suite, and documentation — was built and maintained by Hermes Agent.

Minimal setup:

All local model research, SYCL GPU debugging, production inference setup, benchmark design, security hardening, and this blog article were implemented by Hermes Agent. The human directed goals and validated results. The agent executed every step, from kernel flag surgery to final documentation.

Source: dev.to (original article)