Show HN: Dual YOLOv8n UAV Detection on RK3588S at 42 FPS Using NPU

A developer built a real-time YOLOv8n UAV detection pipeline on the Rockchip RK3588S SoC that runs at 46 FPS using all three NPU cores, saturating the camera's frame rate. The pipeline uses only ~140 MB of RAM per stream and runs entirely on fixed-function silicon, keeping the CPU free. It also integrates an on-device LLM (Qwen2.5-0.5B) to generate natural-language assessments when a tracked UAV leaves the scene.

Real-time YOLOv8n UAV detection at the sensor's 46 FPS ceiling, in ~140 MB of RAM.

A high-throughput, low-footprint computer-vision pipeline for the Rockchip RK3588S SoC: it captures live 1080p MIPI frames, runs YOLOv8n across all 3 NPU cores in parallel (lifting throughput from ~31 to 46 FPS; the camera, not the pipeline, is now the limit), and streams the annotated result to HDMI or RTSP. Capture, color-convert/resize and inference run entirely on fixed-function silicon (ISP, RGA, NPU), so the CPU stays free and memory holds flat at ~140 MB per stream. That is small enough to run on even the cheapest 2 GB RK3588S boards, not just high-end dev kits. Targets any RK3588S board; built and tested on the Khadas Edge2.

Then it goes a step further: when a tracked UAV leaves the scene, an on-device LLM (Qwen2.5-0.5B, on the same NPU) writes a natural-language assessment of what just happened. The whole thing is a chain of small, independent processes connected by Unix-domain sockets: detections flow downstream into multi-object tracking, temporal-feature extraction, a presence FSM, and the on-demand LLM summary.

Highlights

- Saturates the sensor: 3-thread NPU inference lifts throughput from ~31 FPS to the 46 FPS camera ceiling, so the pipeline is no longer the bottleneck.
- Fully hardware-accelerated: capture (ISP), color-convert/resize (RGA), and inference (NPU) never touch the CPU, giving a flat ~140 MB RSS per stream.
- Runs on any RK3588S board: because the footprint is so small (~140 MB for one stream, ~290 MB for two), it fits comfortably on the cheapest RK3588S boards on the market, even 2 GB models that sell for as little as ~€90, not just high-end dev kits.
- Two cameras at once: independent per-device sockets let two streams run and be controlled side by side.
- Composable pipeline: detection → ByteTrack → temporal features → presence FSM → on-demand LLM summary, each a separate process.
- NPU hand-off for the LLM: a blackout / resume control plane frees the whole NPU so the LLM runs at full speed, then hands it back to the cameras.

Target hardware: any RK3588S-based board, aarch64 Linux, with an OS08A10 MIPI camera. Developed and tested on the Khadas Edge2. Cross-compiles from x86-64/WSL or builds natively on the board.

For the full software architecture (Mermaid diagrams of the internal pipeline and the multi-process topology) see docs/architecture.md (/alebal123bal/khadas_yolov8n_multithread/blob/main/docs/architecture.md); for launch commands see docs/usage.md (/alebal123bal/khadas_yolov8n_multithread/blob/main/docs/usage.md).

Related repositories

- RKNN_TRAIN_YOLO (https://github.com/alebal123bal/RKNN_TRAIN_YOLO): the entire pipeline for training, converting, and exporting the YOLO model into the Rockchip NPU .rknn format used here.
- RKLLM_LLAMA_QWEN (https://github.com/alebal123bal/RKLLM_LLAMA_QWEN): the entire pipeline for running optimized LLM models on the RK3588S, either on the NPU (RKLLM) or the CPU (llama).

A 3-thread inference pool runs one RKNN context per NPU core (rknn_dup_context + rknn_set_core_mask), pipelining capture, inference, and display across all three cores. At 1080p with YOLOv8n (640×640) this lifts throughput from ~31.2 FPS (naïve single-threaded loop) to the 46 FPS OS08A10 camera ceiling: the pipeline is no longer the bottleneck, the sensor is.
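As a rough illustration of the one-worker-per-core idea, here is a minimal C++ sketch. It is not the project's code: `infer_stub`, `run_pool`, `delivered_fps` and the `Detection` struct are hypothetical names, and the stub stands in for the real RKNN call so the example compiles without the SDK. In the actual pipeline each worker would presumably own a context obtained via rknn_dup_context and pinned with rknn_set_core_mask. The FPS helper assumes idealized linear scaling across cores, which real hardware will not match exactly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical result type: which frame was processed, and on which core.
struct Detection {
    int frame_id;
    int core;
};

// Hypothetical stand-in for one NPU core's inference call (no RKNN SDK needed).
Detection infer_stub(int frame_id, int core) {
    return Detection{frame_id, core};
}

// Push n_frames through n_cores workers. Worker k handles frames
// k, k + n_cores, k + 2*n_cores, ... and writes only to its own slots,
// so results stay in capture order without any locking.
std::vector<Detection> run_pool(int n_frames, int n_cores) {
    std::vector<Detection> results(static_cast<std::size_t>(n_frames));
    std::vector<std::thread> workers;
    for (int core = 0; core < n_cores; ++core) {
        workers.emplace_back([&results, n_frames, n_cores, core] {
            for (int f = core; f < n_frames; f += n_cores) {
                results[static_cast<std::size_t>(f)] = infer_stub(f, core);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return results;
}

// Idealized throughput bound (an assumption): the loop rate scales with the
// number of cores, but the sensor's frame rate caps what can be delivered.
double delivered_fps(double single_loop_fps, int n_cores, double camera_fps) {
    return std::min(single_loop_fps * n_cores, camera_fps);
}
```

With the figures quoted above, `delivered_fps(31.2, 3, 46.0)` evaluates to the camera's 46 FPS, which matches the claim that the sensor rather than the pipeline becomes the limit once all three cores are busy.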
Full per-model FPS, latency, and CPU/NPU/RAM numbers are in docs/benchmarks.md (/alebal123bal/khadas_yolov8n_multithread/blob/main/docs/benchmarks.md).

Every heavy per-frame operation runs on a dedicated fixed-function block of the RK3588S (camera ISP, RGA, NPU), never on the CPU, so there are no large intermediate framebuffers or scratch tensors CPU-side. A fixed pool of pre-allocated buffers (N_BUF, see BufPool in src/main.cc, /alebal123bal/khadas_yolov8n_multithread/blob/main/yolov8n_cap_multithread/src/main.cc) is recycled instead of allocating per frame, so memory stays flat and bounded: ~137–152 MB RSS for one 1080p stream, ~276–304 MB for two (and that double-counts the shared librknnrt.so / librga.so pages).

Because the NPU, ISP and RGA are identical across the whole RK3588S range, the same binary runs at full speed on the cheapest 2 GB boards (~€90); no 8/16 GB dev kit is required. See docs/architecture.md (/alebal123bal/khadas_yolov8n_multithread/blob/main/docs/architecture.md) for the per-frame offload table and pipeline diagram.

Native (on the board):

    cd yolov8n_cap_multithread
    bash build.sh

Cross-compile (WSL / x86-64 Linux):

    # one-time setup
    sudo apt-get install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu
    bash setup_sdk.sh    # fetches librga v1.10.5 8 + librknnrt v2.3.2

    cd yolov8n_cap_multithread
    bash build.sh        # uses toolchain-aarch64.cmake (aarch64-linux-gnu-g++)
    scp -r install/yolov8n_cap_multithread/ khadas@