# How I Built a Self-Managing AI Lab with Hermes Agent on an Intel Arc GPU

> Source: <https://dev.to/starrzan/how-i-built-a-self-managing-ai-workspace-with-hermes-agent-2lf6>
> Published: 2026-05-30 21:28:10+00:00

*This is a submission for the [Hermes Agent Challenge](https://dev.to/challenges/hermes-agent-2026-05-15): Write About Hermes Agent*

## What I Built

A self-managing AI workspace powered by [Hermes Agent](https://hermes-agent.nousresearch.com), where an autonomous agent runs the local inference stack on an Intel Arc GPU, automates research and documentation, manages cron jobs, and coordinates multiple LLM backends without human micromanagement. The human directs goals; the agent executes everything.

**Hardware:** GMKtec EVO-T1 mini-PC (Intel Core Ultra 9 285H, Intel Arc 140T iGPU, 64GB DDR5-5600) — a pocketable 45W system that runs autonomous AI agents 24/7.

The system manages:

- **Local LLM inference** via llama.cpp on Intel Arc SYCL (iGPU)
- **Automated research pipelines** feeding structured docs into a persistent vault
- **Multi-model testing and benchmarking**: 9+ models across 9B to 35B parameters
- **Cron-driven monitoring**: market data, system health, memory management
- **Self-maintaining skills**: the agent updates its own skills and docs when things change

## Architecture

The agent runs as a Hermes session with:

- **Persistent memory**: notes about the environment, user preferences, tool quirks, project conventions
- **Durable skills**: 40+ specialized procedures for devops, mlops, research, etc.
- **Toolsets**: terminal, browser, file, cron, git, and more
- **Full system access**: builds, debugs, tunes, and documents everything autonomously

### GMKtec EVO-T1 Hardware

The host is a **GMKtec EVO-T1** mini-PC:

- **CPU:** Intel Core Ultra 9 285H (Arrow Lake, 16 cores, up to 5.4GHz)
- **iGPU:** Intel Arc 140T (8 Xe cores with 128 vector engines, shares system DDR5 as VRAM)
- **RAM:** 64GB DDR5-5600 (~58GB addressable by GPU)
- **Power:** ~45W sustained under full load
- **Form factor:** ~0.6L, pocketable

The Intel Arc 140T iGPU is the inference engine. With the llama.cpp SYCL backend and Intel oneAPI 2026.0, the agent runs GGUF models locally at 131K context. A critical kernel-level SYCL fix (removing the `-ze-intel-greater-than-4GB-buffer-required` linker flag and setting `ONEAPI_DEVICE_SELECTOR=level_zero:gpu`) was required to prevent JIT compilation crashes at large context sizes. The agent diagnosed and applied it.

## How It Was Built

All implementation was done by Hermes Agent. The human directed high-level goals; the agent executed every technical step.

### Step 1: Local Inference Server (llama.cpp on Intel Arc)

The agent built a llama.cpp inference server backed by Intel Arc SYCL. The server handles model loading, per-model context sizing, and speculative decoding configuration.

The critical subtlety: different models need different context sizes. `CTX_SIZE` must be set per model, not globally. A 9B coder model gets 130K; a 27B model gets 65K. The agent handles this via model-specific startup configs.
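Per-model sizing of this kind can be sketched as a small lookup that turns a model key into a `llama-server` command line. This is a minimal sketch, not the agent's actual startup config: the model keys, file paths, and port are hypothetical placeholders, and writing 130K and 65K as 130,000 and 65,000 tokens is an assumption about the exact counts. The `-m`, `-c`, `-ngl`, and `--port` options are standard `llama-server` flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    gguf_path: str  # hypothetical path to the GGUF file
    ctx_size: int   # context window in tokens, chosen per model


# Hypothetical registry: keys and paths are placeholders, not the real lineup.
MODELS = {
    "coder-9b": ModelConfig("models/coder-9b.Q4_K_M.gguf", 130_000),
    "dense-27b": ModelConfig("models/dense-27b.Q4_K_M.gguf", 65_000),
}


def build_server_command(model_key: str, port: int = 8080) -> list[str]:
    """Return the llama-server argv for one model with its own context size."""
    cfg = MODELS[model_key]
    return [
        "llama-server",
        "-m", cfg.gguf_path,
        "-c", str(cfg.ctx_size),  # per-model value, never a global default
        "-ngl", "99",             # offload all layers to the GPU
        "--port", str(port),
    ]


if __name__ == "__main__":
    print(" ".join(build_server_command("coder-9b")))
```

Keeping the context size next to the model path means a larger model cannot silently inherit a window that only fits a smaller one.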

**Major SYCL fix:** The SYCL backend had a critical bug: the `-ze-intel-greater-than-4GB-buffer-required` linker flag in `ggml-sycl/CMakeLists.txt` caused JIT compilation failures on the CPU SYCL device whenever an operation fell back from the GPU. Removing this flag and setting `ONEAPI_DEVICE_SELECTOR=level_zero:gpu` to restrict execution to the GPU eliminated the RMS_NORM crash that prevented models from loading at 131K context. The agent found, diagnosed, and fixed it.
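The device-restriction half of that fix can be illustrated with a helper that prepares the environment for the server process. This is a sketch under assumptions: the helper name is hypothetical, and only the `ONEAPI_DEVICE_SELECTOR=level_zero:gpu` value comes from this article. The CMake flag removal is a source edit and is not shown.

```python
import os


def gpu_only_env(base_env=None):
    """Copy an environment and restrict SYCL to Level Zero GPU devices."""
    env = dict(os.environ if base_env is None else base_env)
    # Value quoted in the article: it hides the CPU SYCL device so that
    # no operation can fall back to it.
    env["ONEAPI_DEVICE_SELECTOR"] = "level_zero:gpu"
    return env


if __name__ == "__main__":
    # A real launch would pass this mapping as env= to subprocess.Popen
    # together with the llama-server command line.
    print(gpu_only_env()["ONEAPI_DEVICE_SELECTOR"])
```

Copying the environment rather than mutating `os.environ` keeps the restriction scoped to the inference server instead of every child process the agent starts.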

### Step 2: Hermes Agent Configuration

Configured Hermes with:

- OpenRouter as default provider (cloud fallback)
- Local llama-server as local provider (primary for privacy-bound work)
- Skills system for recurring task patterns
- Memory persistence across sessions
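The routing between the two providers can be sketched as a single decision function. This is an illustration only: it follows the local-first, cloud-fallback rule stated later in this article, it does not reproduce Hermes Agent's real configuration format, and the function name, provider labels, and model keys are hypothetical.

```python
from typing import Iterable


def choose_provider(model: str, local_models: Iterable[str], local_healthy: bool) -> str:
    """Return 'local' when the local server is up and serves the model, else 'openrouter'."""
    if local_healthy and model in set(local_models):
        return "local"
    # Cloud fallback: the model is not loaded locally or the server is down.
    return "openrouter"


if __name__ == "__main__":
    served = ["coder-9b"]  # hypothetical key for the locally loaded model
    print(choose_provider("coder-9b", served, local_healthy=True))
    print(choose_provider("some-cloud-model", served, local_healthy=True))
```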

### Step 3: Cron Jobs for Automation

The agent uses Hermes cron to run scheduled research, commit/push cycles, and health checks:

- Market data monitoring (Polymarket, Kalshi feeds)
- Workspace backup automation
- Codebase quality scans
- Security monitoring (SSH brute-force, system health, CVE feeds)
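Interval-based scheduling of such jobs can be sketched in a few lines. The job names and intervals come from the Security Infrastructure section of this article; the data structure and function are hypothetical and say nothing about how Hermes cron is actually implemented.

```python
from datetime import datetime, timedelta

# Intervals as stated in this article; the layout itself is an assumption.
JOB_INTERVALS = {
    "security-monitor": timedelta(minutes=30),
    "vulnerability-feed-monitor": timedelta(hours=12),
}


def due_jobs(last_run: dict, now: datetime) -> list:
    """Return the names of jobs whose interval has elapsed since their last run."""
    due = []
    for name, interval in JOB_INTERVALS.items():
        previous = last_run.get(name)
        # A job that has never run is always due.
        if previous is None or now - previous >= interval:
            due.append(name)
    return due


if __name__ == "__main__":
    now = datetime(2026, 5, 30, 12, 0)
    print(due_jobs({"security-monitor": now - timedelta(minutes=45)}, now))
```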

### Step 4: Research Pipeline (research vault)

The agent does autonomous research and documents findings in a structured vault.

## Model Lineup

The system coordinates multiple GGUF models depending on task type:

| Model | Architecture | Params | Context | Quant | Role | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| **Qwen3.5-9B-Sushi-Coder-RL** | Qwen 3.5 MoE | 9B | 130K | Q4_K_M | Daily driver | RL-tuned, best agentic quality, clean JSON output |
| **Qwen3-Coder-30B-A3B** | Qwen 3 MoE | 30B (3B active) | 65K | Q3_K_M | Coding specialist | Best decode throughput, strong at code generation |
| **Qwen3.6-35B-UD-IQ4_NL** | Qwen 3.5 MoE | 35B | 65K | UD-IQ4_NL | Reasoning | Highest reasoning quality, heavier VRAM cost |
| **Qwen3.5-9B-DeepSeek-V4-Flash** | Qwen 3.5 hybrid | 9B | 130K | Q4_K_M | Secondary | Fastest prefill, but output is reasoning-only (content field empty) |
| **Qwopus3.5-9B-Coder-MTP** | Qwen 3.5 w/ MTP | 9B | 8K effective | Q4_K_M | Deprecated | MTP merge caused KV cache contamination, garbled output |

### Why These Models

- **Sushi 9B** is the only production-viable 9B model for agentic work on this hardware: it passed all 6 agentic tests with 0 HTTP 500 errors, produced valid JSON, and retained multi-turn context correctly
- **Coder 30B** is an MoE model (30B total, 3B active parameters), so decode is fast despite the large parameter count: 11.52 t/s decode vs 8.24 t/s for the 9B model
- **DS-V4-Flash** is useful for quick reasoning tasks where you don't need structured output; its 190 t/s prefill makes it fast for short prompts
- **27B class models** fill the gap between 9B and 35B, offering reasonable quality without the VRAM overhead of the larger model in the shared memory pool

## Agentic Benchmark Results

Ran comprehensive agentic evaluations across all 9B models at 131K context:

| Model | Tests Pass | HTTP 500 | JSON Valid | Total Time | Quality |
| --- | --- | --- | --- | --- | --- |
| **Sushi 9B** | 6/6 | 0 | Yes (3/3) | 561s | Best |
| **DS-V4-Flash** | 6/6 | 0 | No (0/3) | 592s | Reasoning-only |
| **Qwopus MTP** | 2/6 | 4 | No (0/3) | 256s | Broken |

### Key Findings

**Sushi 9B (production daily driver):**

- Only model to pass all 6 agentic tests without errors
- Correct multi-turn context retention across 3 turns
- Valid structured JSON output (T2: 3/3 score)
- Correct VRAM calculations (all 9B models: ~9.7GB at 130K ctx, no OOM risk within the 58GB shared pool)
- Best instruction following (10 constraints, 4 paragraphs)

**Qwopus MTP (deprecated):**

- 4 out of 6 tests returned HTTP 500 internal server errors
- Garbled output containing mixed Chinese/English pseudotext
- KV cache contamination — corrupted output poisons subsequent requests
- This is a model quality issue in the MTP merge — not fixable by configuration

**DS-V4-Flash (secondary):**

- Stable, but all output is in reasoning_content only (content field empty)
- Coherent reasoning but cannot produce valid structured JSON in content
- Fast prefill (190 t/s) but 8.24 t/s decode
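The difference between a valid-JSON reply and a reasoning-only reply can be made concrete with a small check. This is a sketch, not the benchmark's actual code: it assumes an OpenAI-style chat completion shape (`choices[0].message`) with the `content` and `reasoning_content` fields named above, and the function name and sample payloads are hypothetical.

```python
import json


def content_is_valid_json(response: dict) -> bool:
    """True only if choices[0].message.content is non-empty and parses as JSON."""
    message = response["choices"][0]["message"]
    content = (message.get("content") or "").strip()
    if not content:
        return False  # reasoning-only replies fail here
    try:
        json.loads(content)
    except json.JSONDecodeError:
        return False
    return True


# Hypothetical payloads illustrating the two behaviours described above.
STRUCTURED = {"choices": [{"message": {"content": '{"status": "ok"}'}}]}
REASONING_ONLY = {
    "choices": [{"message": {"content": "", "reasoning_content": "thinking..."}}]
}

if __name__ == "__main__":
    print(content_is_valid_json(STRUCTURED), content_is_valid_json(REASONING_ONLY))
```

A check like this ignores `reasoning_content` entirely, which is why a model can reason coherently and still score 0/3 on structured output.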

### Technical Decisions Validated

- **Local-first, cloud-fallback**: All inference runs locally by default. Cloud is used only for models not running locally.
- **Per-model context sizing**: Context window sizes are model-specific, not global. This prevents OOM on the Arc GPU's shared VRAM.
- **Skills over prompting**: Every recurring workflow is encoded as a skill file. The system maintains itself.
- **Git-backed vault**: All research auto-commits to GitHub. The workspace is the artifact.
- **Automated security monitoring**: The agent watches for intrusions, monitors CVE feeds, and posts alerts to Discord, so the workspace defends itself.

## Security Infrastructure

The server runs automated security monitoring set up by Hermes Agent:

- **UFW firewall**: default deny incoming, SSH only from LAN + Tailscale
- **fail2ban**: auto-ban after 3 failed SSH attempts
- **Cron: security-monitor** (every 30 min): checks brute-force, new devices, firewall, services, gateway
- **Cron: vulnerability-feed-monitor** (every 12 hours): CVE monitoring for Ubuntu, kernel, Docker, Freebox OS
- **Discord alerts**: CRITICAL and HIGH severity findings posted automatically
- **Pentest tools**: nmap, masscan, tcpdump, arp-scan, netcat, wireshark
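The brute-force check amounts to counting failed logins per source address. The sketch below illustrates the idea and is not fail2ban's implementation or the agent's monitor: it assumes OpenSSH-style `Failed password for ... from <ip> port <n>` log lines, borrows the threshold of 3 from the fail2ban rule above, and uses hypothetical sample lines with documentation-range addresses.

```python
import re
from collections import Counter

# Matches the usual OpenSSH failure line, with or without "invalid user".
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?\S+ from (\S+) port \d+"
)


def offenders(log_lines, threshold=3):
    """Return sorted source addresses with at least `threshold` failed logins."""
    counts = Counter()
    for line in log_lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(1)] += 1
    return sorted(ip for ip, n in counts.items() if n >= threshold)


# Hypothetical log excerpt for illustration.
SAMPLE = [
    "sshd[101]: Failed password for invalid user admin from 203.0.113.5 port 4001 ssh2",
    "sshd[102]: Failed password for root from 203.0.113.5 port 4002 ssh2",
    "sshd[103]: Failed password for root from 198.51.100.7 port 5001 ssh2",
    "sshd[104]: Accepted publickey for dev from 192.0.2.10 port 6001 ssh2",
    "sshd[105]: Failed password for root from 203.0.113.5 port 4003 ssh2",
]

if __name__ == "__main__":
    print(offenders(SAMPLE))
```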

## Key Numbers

- **58GB** shared VRAM on Intel Arc 140T
- **130K** context window (Sushi 9B)
- **9.7GB** total VRAM usage at 130K ctx for 9B models (weights + KV cache)
- **48GB** VRAM headroom at 130K ctx
- **8.24 t/s** decode speed (Sushi 9B)
- **166 t/s** prefill speed (Sushi 9B)
- **190 t/s** prefill speed (DS-V4-Flash)
- **~36-37s** per generation turn (Sushi 9B at 256 max_tokens)
- **0** HTTP 500 errors across 6 agentic tests (Sushi 9B)
- **9+** GGUF models tested (9B through 35B parameters)
- **6+ months** of continuous local inference development by Hermes Agent
- **Automated security monitoring**: log analysis, intrusion detection, CVE feed monitoring, Discord alerts
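Several of these figures can be cross-checked against each other with simple arithmetic. The sketch below uses only numbers quoted in this article. Attributing the part of a turn that decode does not account for to prefill and request overhead is an interpretation, not a measurement.

```python
SHARED_POOL_GB = 58.0   # GPU-addressable memory
USED_9B_GB = 9.7        # weights + KV cache for a 9B model at 130K ctx
DECODE_TPS_9B = 8.24    # Sushi 9B decode speed
DECODE_TPS_30B = 11.52  # Coder 30B (3B active) decode speed


def vram_headroom_gb() -> float:
    """Memory left in the shared pool with one 9B model loaded."""
    return round(SHARED_POOL_GB - USED_9B_GB, 1)


def decode_seconds(max_tokens: int = 256) -> float:
    """Time to decode max_tokens at the 9B decode rate."""
    return round(max_tokens / DECODE_TPS_9B, 1)


def moe_decode_speedup() -> float:
    """Decode-rate ratio of the 30B MoE model over the 9B model."""
    return round(DECODE_TPS_30B / DECODE_TPS_9B, 2)


if __name__ == "__main__":
    print(vram_headroom_gb())    # 48.3, listed above as 48GB
    print(decode_seconds())      # 31.1 of the ~36-37s turn
    print(moe_decode_speedup())  # 1.4
```

Decoding 256 tokens at 8.24 t/s takes about 31 seconds, so roughly 5 to 6 seconds of each 36-37 second turn falls outside decode.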

## Demo / How to Replicate

The entire setup — llama.cpp SYCL build, Hermes Agent config, benchmark suite, and documentation — was built and maintained by Hermes Agent.

Minimal setup:

*All local model research, SYCL GPU debugging, production inference setup, benchmark design, security hardening, and this blog article were implemented by Hermes Agent. The human directed goals and validated results. The agent executed every step, from kernel flag surgery to final documentation.*
