Running ASR for smart homes in the NPU of Intel processors

wpnews.pro

I run my own smart home: Home Assistant, voice assistant pipeline, the whole self-hosted thing. The speech-to-text step (Parakeet TDT 0.6B v3 over the Wyoming protocol) had been running for months on my Intel NUC (i3-1220P) with a 12 GB RTX 3060 eGPU. I recently upgraded my home server to a full desktop with an AMD 7900XTX, and since I want to save as much of the VRAM as I can for LLMs, I've been running NVIDIA's Parakeet on CPU since then.

It works fine, but it always nagged me: my new home server has an Intel Core Ultra 7 265K (Arrow Lake) with the built-in "AI Boost" NPU, and that silicon was sitting completely idle.

With the AI hype, chip manufacturers have started to slap NPUs on their chips, mostly so they can put "AI" in the product names, but little to no software actually makes use of them, although some projects are starting to pop up here and there.

So I decided to see if I could put that stupidly underused chunk of silicon to work on a workload that should, on paper, be ideal for it.

And it worked remarkably well, but the road was bumpy.

Same Spanish audio clips, similar wyoming-onnx-asr stack, but I swapped the inference backend from plain ORT-CPU to OpenVINO targeting the NPU, and I went from using the INT8 quantized model on the CPU to using the full-precision FP32 model on the NPU.

Results averaged from 10 runs after 1 warmup round.

Audio	Backend	Avg latency	Energy / inference	Power above idle
10 s	CPU INT8	978 ms	44.6 J	45.6 W
10 s	NPU FP32	204 ms ⚡	4.2 J	20.5 W
20 s	CPU INT8	1 708 ms	79.8 J	46.7 W
20 s	NPU FP32	615 ms ⚡	7.8 J	12.7 W
60 s	CPU INT8	5 011 ms	237.7 J	47.4 W
60 s	NPU FP32	818 ms ⚡	11.0 J	13.4 W

3-6× faster wall time. 10-22× less energy per transcription. For a workload that runs quite often in my home (I have 5 satellites and I don't reach for switches often), this is the kind of result that makes me wonder why nobody seems to be doing it.
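The speedup and energy ratios can be re-derived from the table with a few lines of Python. This is only a sketch: it assumes energy per inference is simply power above idle multiplied by average latency, which matches the table's energy column to within rounding. The function names are mine, not from the repo.

```python
# (audio length, backend) -> (avg latency in ms, power above idle in W),
# copied from the benchmark table.
ROWS = {
    ("10 s", "CPU INT8"): (978, 45.6),
    ("10 s", "NPU FP32"): (204, 20.5),
    ("20 s", "CPU INT8"): (1708, 46.7),
    ("20 s", "NPU FP32"): (615, 12.7),
    ("60 s", "CPU INT8"): (5011, 47.4),
    ("60 s", "NPU FP32"): (818, 13.4),
}


def energy_joules(latency_ms, power_w):
    """Assumed model: energy = power above idle x time spent on the inference."""
    return power_w * latency_ms / 1000.0


def compare(audio):
    """Return (speedup, energy ratio) of the NPU over the CPU for one audio length."""
    cpu_ms, cpu_w = ROWS[(audio, "CPU INT8")]
    npu_ms, npu_w = ROWS[(audio, "NPU FP32")]
    speedup = cpu_ms / npu_ms
    energy_ratio = energy_joules(cpu_ms, cpu_w) / energy_joules(npu_ms, npu_w)
    return speedup, energy_ratio


for audio in ("10 s", "20 s", "60 s"):
    speedup, energy_ratio = compare(audio)
    print(f"{audio}: {speedup:.1f}x faster, {energy_ratio:.1f}x less energy")
```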

For a nice voice assistant, response speed is a critical part of the experience. An extra 500 ms doesn't make for a terrible experience, but every bit of latency you shave off improves it.

I've packaged the whole thing into a Docker image: 👉 ghcr.io/cibernox/wyoming-parakeet-on-intel-npu. If you have a Core Ultra chip and are running Home Assistant, you can docker run it and skip everything below. But if you want the story…

Quick context. The home server is a Proxmox 9.x box, Intel Core Ultra 7 265K, 64 GB DDR5, an AMD 7900XTX dedicated GPU, and various LXC + Docker workloads (Home Assistant, llama.cpp on GPU, paperless-ngx, the usual). I'd been running Parakeet TDT on CPU at ~0.5-0.8 s per utterance. Acceptable, but not "instant": it was a downgrade from the RTX 3060 that I could live with, but I could feel it too.

The CPU baseline is genuinely strong on this chip: Parakeet's INT8 ONNX through ORT-CPU benefits from AVX-VNNI INT8 matmuls and the 265K is beefier than most home servers. So when I say the NPU is 3-6× faster, I'm not comparing it to a low-power N150 mini-PC. This is a 20-core desktop-class CPU at 125 W TDP.

The Intel NPU on Arrow Lake is rated at 13 TOPS. By LLM-accelerator standards that's tiny, and AMD already boasts NPUs with 40 TOPS. But Parakeet's encoder is exactly the kind of work an NPU is designed for: matrix multiplications with predictable shapes and modest activation memory. Worth trying.

At first you'd think "yeah, I just install OpenVINO and the NPU driver, right?" And it almost works. The container detected the NPU device node but reported available_devices: ['CPU']. No NPU.

The reason, after some ZE_ENABLE__DEBUG_TRACE=1 archeology:

ZE__DEBUG_TRACE: Load Library of libze_intel_vpu.so.1 failed

Ubuntu 24.04's bundled Level Zero (libze1 v1.16) is looking for the legacy library name libze_intel_vpu.so.1. I should have figured this out faster than I did: this chip was released in 2025, so it's to be expected that Ubuntu needs some help getting it to work. Recent Intel NPU driver builds install libze_intel_npu.so.1 instead: different name, same library. libze1 needs to be v1.17 or newer to know about the new name.

Fix is straightforward once you know:

RUN curl -fL -O \
    "https://github.com/oneapi-src/level-zero/releases/download/v1.28.6/libze1_1.28.6+u24.04_amd64.deb" \
    && apt-get install -y --no-install-recommends ./libze1*.deb

Now ov.Core().available_devices returns ['CPU', 'NPU'] and the full device name comes back as Intel(R) AI Boost. 🎉
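A small check like the following can be dropped into the container to confirm the NPU is visible before going any further. It is a sketch, not code from the repo: describe_devices is a name I made up, and it only relies on the available_devices attribute and the FULL_DEVICE_NAME property that OpenVINO's Core exposes.

```python
def describe_devices(core):
    """Map every device the given OpenVINO Core reports to its full name."""
    return {
        device: core.get_property(device, "FULL_DEVICE_NAME")
        for device in core.available_devices
    }


if __name__ == "__main__":
    try:
        import openvino as ov
    except ImportError:
        print("openvino is not installed in this environment")
    else:
        devices = describe_devices(ov.Core())
        for device, name in devices.items():
            print(f"{device}: {name}")
        if "NPU" not in devices:
            # Most likely the Level Zero loader or the NPU driver is missing.
            print("No NPU reported by OpenVINO")
```

Taking the Core as an argument keeps the helper easy to exercise with a stand-in object on machines that have no NPU.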

The model I was already using is INT8 quantized. Natural first move: feed the same ONNX to OpenVINO targeting NPU. It blows up:

[OpenVINO-EP] Output names mismatch between OpenVINO and ONNX

What's happening: the INT8 Parakeet ONNX uses DynamicQuantizeLinear/MatMulInteger/DequantizeLinear chains, and OpenVINO's graph optimizer aggressively folds those into native INT8 matmuls. The folding renames or drops intermediate tensors that the runtime is trying to read back. Hard fail at first inference.

Worse: even if you find a way to coax it through (I tried onnxruntime-openvino, raw OpenVINO with enable_qdq_optimizer, even NNCF post-training quantization), INT8 runs slower than FP32 on this NPU. The Intel NPU is BF16-native: it converts everything to BF16 internally. Feeding it INT8 just means extra dequant/requant on every operator boundary.

The right move is the opposite of what I expected: use the FP32 model. It's 4× bigger on disk (2.5 GB vs 650 MB) but the compiler converts it cleanly to BF16 for the NPU and runs full speed.

NOTE: After all these tests I found that someone has created an FP16 version of Parakeet that is ~1.5 GB. I tried it briefly and it performed much better than INT8 but still 15% slower than FP32. I am not sure why, but if you are RAM-constrained you might prefer that one.

The Parakeet encoder accepts a dynamic input shape (batch, 128, T) where T is the number of mel-feature frames, which is proportional to audio length. A 1.5 s "lights off" command is 150 frames; a 60 s dictation is 6 000 frames. ONNX Runtime on CPU handles that natively: every call allocates whatever shape comes in.

Quick aside: what's a "mel-feature frame", you may ask? (It's OK, I didn't know until yesterday.) Speech models don't ingest raw audio. The audio is sliced into overlapping ~25 ms windows, each window converted into a 128-element vector of mel-frequency magnitudes (energies at different frequency bands, weighted to match human hearing). Parakeet does this conversion at 100 frames per second, so T = audio_seconds × 100. That's the dimension that varies with utterance length.
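The frame arithmetic is easy to sanity-check in code. This sketch only encodes the two numbers stated above (100 frames per second, 128 mel bins); the helper names are mine, and a real feature extractor may differ by a frame or so depending on how it pads the edges.

```python
FRAMES_PER_SECOND = 100  # frame rate stated in the article
N_MELS = 128             # mel bins per frame


def mel_frame_count(audio_seconds):
    """Approximate number of mel frames T for a clip of the given duration."""
    return round(audio_seconds * FRAMES_PER_SECOND)


def encoder_input_shape(audio_seconds, batch=1):
    """Shape (batch, 128, T) the encoder would see for that clip."""
    return (batch, N_MELS, mel_frame_count(audio_seconds))


print(encoder_input_shape(1.5))   # short voice command
print(encoder_input_shape(60.0))  # long dictation
```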

The Intel NPU absolutely does not do dynamic shapes. At least I couldn't find a way. The compiler bakes the tile sizes and memory layout into the compiled blob based on the static input dimensions. Hand it an unbounded dynamic shape and OpenVINO refuses to compile:

[ERROR] Upper bounds are not specified for node '/pre_encode/Cast' (type 'Convert'):
        input '0' bounds are '[9223372036854775807]'

I tried bounded dynamic shapes too (ov.PartialShape([1, 128, ov.Dimension(1, 2000)])), but the bounds don't propagate through every internal op of the Conformer, so the compiler still hits unbounded operands and bails out.

Three options:

Option 3 is the only sane answer unless someone can prove me wrong on allowing dynamic shapes. Since smart home commands are usually rather quick, here are the bucket sizes I settled on for my Spanish smart-home traffic and the NPU encoder time for each:

Bucket	Typical traffic	Encoder time on NPU
5 s	"Apaga la luz de la cocina y la del comedor"	~55 ms
20 s	Voice notes, reminders	~150 ms

Without buckets, every single utterance would pay the 20 s bucket's ~150 ms encoder cost. With the 5 s bucket added, the most common commands now spend only 55 ms on the encoder phase. We could have smaller buckets, and I did try that, but each bucket requires a new compilation step and takes disk space and memory, so I thought that 2 tiers was granular enough.
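Picking a bucket at inference time boils down to "smallest bucket that fits". Here is a minimal sketch of that selection using the two bucket sizes above. The names are hypothetical, and returning None for audio longer than the largest bucket is my assumption; what the real image does with such audio is not covered here.

```python
import bisect

FRAMES_PER_SECOND = 100
BUCKET_SECONDS = (5, 20)  # the two tiers described above
BUCKET_FRAMES = tuple(s * FRAMES_PER_SECOND for s in BUCKET_SECONDS)  # (500, 2000)


def pick_bucket(n_frames, buckets=BUCKET_FRAMES):
    """Return the smallest bucket (in frames) that holds n_frames, or None.

    `buckets` must be sorted ascending. None means the clip is longer than
    the largest compiled bucket and the caller needs some other strategy.
    """
    index = bisect.bisect_left(buckets, n_frames)
    return buckets[index] if index < len(buckets) else None


print(pick_bucket(150))   # a 1.5 s command lands in the 5 s bucket
print(pick_bucket(1200))  # a 12 s voice note lands in the 20 s bucket
```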

For most of this investigation I was getting "NPU" and "CPU" timings within noise of each other and was about to declare the NPU not worth it.

Turned out my integration shim was being attached to the wrong attribute on the loaded onnx-asr model.

onnx_asr.load_model() returns a TextResultsAsrAdapter that wraps the actual ASR object on .asr. The wrapper does NOT proxy attribute writes. So this:

model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3", ...)
model._encoder = OpenVINOEncoderShim(...)  # ← attribute added to wrapper, ignored

…just adds an attribute to the wrapper that nothing reads. model.recognize() still routes through model.asr._encoder, which is the original ORT-CPU session. Every "NPU" benchmark I had been running was secretly plain ORT-CPU with an extra unused NPU encoder warming up uselessly in memory.

One-line fix:

model.asr._encoder = OpenVINOEncoderShim(...)  # ← actually used
model.asr._decoder_joint = OpenVINODecoderShim(...)

Once corrected, the real numbers landed where the silicon could deliver them. Lesson: when integrating with someone else's pipeline, add a tracer that confirms your code is actually being called before you trust any benchmark. This was on me.
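The kind of tracer that would have caught this can be very small. The sketch below wraps a callable and counts invocations, then reproduces the trap with two toy classes. Inner and Wrapper are hypothetical stand-ins for the adapter and the object it wraps, not onnx-asr's real classes, and CallTracer is a name I made up.

```python
class CallTracer:
    """Wrap a callable and count how many times it is actually invoked."""

    def __init__(self, fn, name):
        self._fn = fn
        self.name = name
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        return self._fn(*args, **kwargs)


class Inner:
    """Toy stand-in for the real ASR object: it owns the encoder."""

    def __init__(self):
        self._encoder = lambda audio: f"cpu({audio})"

    def recognize(self, audio):
        return self._encoder(audio)


class Wrapper:
    """Toy stand-in for the adapter: forwards calls, not attribute writes."""

    def __init__(self, asr):
        self.asr = asr

    def recognize(self, audio):
        return self.asr.recognize(audio)


model = Wrapper(Inner())
npu_encoder = CallTracer(lambda audio: f"npu({audio})", "npu-encoder")

model._encoder = npu_encoder       # wrong: lands on the wrapper, never read
model.recognize("hola")
wrong_calls = npu_encoder.calls    # the tracer shows the shim was never hit

model.asr._encoder = npu_encoder   # right: replaces the encoder actually used
result = model.recognize("hola")
right_calls = npu_encoder.calls

print(wrong_calls, right_calls, result)
```

Asserting that the counter is non-zero after a warmup call is a cheap guard to run before any benchmark.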

Take the FP32 encoder (encoder-model.onnx, 2.4 GB of external data) and the FP32 decoder, fix T to a bucket size, and compile for NPU:

import openvino as ov
core = ov.Core()
model = core.read_model("encoder-model.onnx")
model.reshape({"audio_signal": [1, 128, T_fixed], "length": [1]})
compiled = core.compile_model(model, "NPU", config={
    "CACHE_DIR": "/data/ov_cache",
    "PERFORMANCE_HINT": "LATENCY",
    "NPU_TURBO": "YES",
})

Pick a bucket with T_fixed ≥ the actual mel-frame count. Zero-pad to that bucket's length and pass length=actual so the encoder knows where the real audio ends.

Plug the shims into onnx_asr by assigning to model.asr._encoder and model.asr._decoder_joint (NOT model._encoder, see the wrong-attribute trap above).

NPU compile time is ~12 s per bucket cold, ~1 s when the CACHE_DIR blob hits. First container start is ~80 s with all buckets; subsequent restarts are fast because everything is cached.
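The zero-padding step can be sketched with numpy as follows. This is an illustration under assumptions: pad_to_bucket is a hypothetical helper, the features are assumed to arrive as a (1, 128, T) array, and int64 for the length input is my guess at what the exported model expects.

```python
import numpy as np


def pad_to_bucket(features, t_fixed):
    """Zero-pad (batch, n_mels, T) features along time to exactly t_fixed frames.

    Returns the padded array plus a length array holding the real frame
    count, so the encoder can ignore the padding.
    """
    batch, n_mels, t_actual = features.shape
    if t_actual > t_fixed:
        raise ValueError(f"{t_actual} frames do not fit in a {t_fixed}-frame bucket")
    padded = np.zeros((batch, n_mels, t_fixed), dtype=features.dtype)
    padded[:, :, :t_actual] = features
    length = np.array([t_actual], dtype=np.int64)  # assumed dtype for `length`
    return padded, length


# A fake 1.5 s utterance (150 frames) padded into the 5 s bucket (500 frames).
features = np.ones((1, 128, 150), dtype=np.float32)
padded, length = pad_to_bucket(features, 500)
print(padded.shape, length)
```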

So you don't have to:

onnxruntime-openvino with the INT8 model
InferRequests on the decoder
INFERENCE_PRECISION_HINT=f16
MODEL_PRIORITY=HIGH

Voice commands arrive sporadically: a few seconds of speech after several minutes of silence. The relevant metric isn't steady-state throughput while transcribing a 90-minute podcast, it's single-shot cold-after-idle latency, because the CPU's caches and clocks are cold and the NPU might be in a low-power state.

I run my home server with aggressive power-saving (deep C states, PCIe sleep — my AMD 7900XTX idles at 4 W). "Idle" wall power is around 32-38 W (as idle as a server running 20 containers can be). I was worried these would punish cold inference. They don't.

The NPU has no observable wake-up penalty. Cold-after-idle:

Audio	CPU INT8	NPU FP32
10 s	918 ms	276 ms
20 s	1 628 ms	693 ms
60 s	4 756 ms	884 ms
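Cold-after-idle numbers like the ones above can be collected with a harness along these lines. It is a sketch, not the script behind this table: cold_latency_ms is a hypothetical helper, and the 300 s default idle period is my reading of "several minutes of silence".

```python
import statistics
import time


def cold_latency_ms(fn, idle_seconds=300.0, runs=3, sleep=time.sleep):
    """Time single calls to fn(), each preceded by an idle period.

    No warmup is done on purpose: the point is to measure the first call
    after the hardware has had a chance to drop into low-power states.
    Returns (mean latency in ms, list of individual samples in ms).
    """
    samples = []
    for _ in range(runs):
        sleep(idle_seconds)
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples), samples


if __name__ == "__main__":
    # Stand-in workload; in practice fn would transcribe a fixed audio clip.
    mean_ms, samples = cold_latency_ms(lambda: sum(range(100_000)), idle_seconds=1.0)
    print(f"mean {mean_ms:.2f} ms over {len(samples)} cold runs")
```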

Real Home Assistant trace for "apaga la luz de la cocina y la del comedor" ("turn off the kitchen light and the dining-room one", a longer-than-average sentence): CPU 0.71 s vs NPU 0.18 s, identical transcript.

The result that genuinely surprised me: this 13-TOPS NPU running Parakeet ends up as fast as or faster than the same model running on an NVIDIA RTX 3060 (~13 TFLOPS at FP16), which I had been using as an eGPU on my previous server. The RTX did 0.15-0.3 s per utterance. The NPU does 0.1-0.2 s. Same ballpark, and:

The NPU's active power is lower than the RTX's idle. On a workload that's mostly idle anyway, that's a 10× efficiency gain in steady state and infinite in active comparison.

For 13 TOPS, that's a remarkable use of silicon. The "NPUs are marketing" take is wrong for at least this workload.

Now, I am not claiming that the NPU is more powerful than a 3060. It clearly isn't. But I suspect it can match or beat it because (and this is just a theory) it wakes up faster than a discrete GPU, and for a short burst of work like this, that gives it a head start the NVIDIA card can't overcome. I'm sure the GPU would win on commands longer than 10 seconds, but those are very rare.

I packaged everything into a public Docker image. If you have:

/dev/accel/accel0 on your host (lsmod | grep intel_vpu to verify)

You can do this:

docker run -d \
  --name wyoming-parakeet-npu \
  --device /dev/accel/accel0 \
  -e LANGUAGE=es \
  -p 10300:10300 \
  -v parakeet-data:/data \
  --restart unless-stopped \
  ghcr.io/cibernox/wyoming-parakeet-on-intel-npu:latest

First boot downloads ~3.2 GB of model weights and compiles the NPU buckets (~60-90 s). Subsequent restarts are under 5 s. Point Home Assistant's Wyoming integration at tcp://<host>:10300 and you're done.

Repo with source, Dockerfile, docs and a docker-compose.yml example: github.com/cibernox/wyoming-parakeet-on-intel-npu.
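Before adding the integration in Home Assistant, it can help to confirm that the container is actually listening. The sketch below is a plain TCP reachability check and nothing more: wyoming_port_open is a hypothetical helper, it does not speak the Wyoming protocol, and the host name in the usage line is a placeholder.

```python
import socket


def wyoming_port_open(host, port=10300, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # "homeserver.local" is a placeholder; use your Docker host's address.
    print(wyoming_port_open("homeserver.local"))
```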

If you're playing with this, things I haven't done yet that I think could move the needle:

onnx_asr (numpy state-handling, control flow) accounts for a meaningful fraction of total time on long audio.

PRs welcome.

This work stands on top of several open-source projects, all of which made this hack possible:

Thanks to all of them for shipping working code.

source & further reading

dev.to (original article)