{"slug": "show-hn-dual-yolov8n-uav-detection-on-rk3588s-at-42-fps-using-npu", "title": "Show HN: Dual YOLOv8n UAV Detection on RK3588S at 42 FPS Using NPU", "summary": "A developer built a real-time YOLOv8n UAV detection pipeline on the Rockchip RK3588S SoC that runs at 46 FPS using all three NPU cores, saturating the camera's frame rate. The pipeline uses only ~140 MB of RAM per stream and runs entirely on fixed-function silicon, keeping the CPU free. It also integrates an on-device LLM (Qwen2.5-0.5B) to generate natural-language assessments when a tracked UAV leaves the scene.", "body_md": "**Real-time YOLOv8n UAV detection at the sensor's 46 FPS ceiling, in ~140 MB of\nRAM.** A high-throughput, low-footprint computer-vision pipeline\nfor the **Rockchip RK3588S** SoC: it captures live 1080p MIPI frames, runs\nYOLOv8n across all **3 NPU cores** in parallel (lifting throughput from ~31 to\n**46 FPS** — the camera, not the pipeline, is now the limit), and streams the\nannotated result to HDMI or RTSP. Capture, color-convert/resize and inference\nrun entirely on fixed-function silicon (ISP, RGA, NPU), so the CPU stays free\nand memory holds **flat at ~140 MB per stream** — small enough to run on even\nthe cheapest 2 GB RK3588S boards, not just high-end dev kits. Targets any\nRK3588S board; built and tested on the **Khadas Edge2**.\n\nThen it goes a step further: when a tracked UAV leaves the scene, an on-device LLM\n(Qwen2.5-0.5B, on the *same* NPU) writes a natural-language assessment of what\njust happened. 
The whole thing is a chain of small, independent processes\nconnected by Unix-domain sockets: detections flow downstream into multi-object\ntracking, temporal-feature extraction, a presence FSM, and the on-demand LLM\nsummary.\n\n**Highlights**\n\n- **Saturates the sensor:** 3-thread NPU inference lifts throughput from ~31 FPS to the 46 FPS camera ceiling, so the pipeline is no longer the bottleneck.\n- **Fully hardware-accelerated:** capture (ISP), color-convert/resize (RGA), and inference (NPU) never touch the CPU, giving a flat ~140 MB RSS per stream.\n- **Runs on any RK3588S board:** because the footprint is so small (~140 MB for one stream, ~290 MB for two), it fits comfortably on the cheapest RK3588S boards on the market, even 2 GB models that sell for **as little as ~€90**, not just high-end dev kits.\n- **Two cameras at once:** independent per-device sockets let two streams run and be controlled side by side.\n- **Composable pipeline:** detection → ByteTrack → temporal features → presence FSM → on-demand LLM summary, each a separate process.\n- **NPU hand-off for the LLM:** a `blackout` / `resume` control plane frees the whole NPU so the LLM runs at full speed, then hands it back to the cameras.\n\n**Target hardware:** any RK3588S-based board, aarch64 Linux, with an OS08A10\nMIPI camera. Developed and tested on the **Khadas Edge2**. 
Cross-compiles from\nx86-64/WSL or builds natively on the board.\n\nFor the full software architecture (Mermaid diagrams of the internal pipeline\nand the multi-process topology) see [docs/architecture.md](/alebal123bal/khadas_yolov8n_multithread/blob/main/docs/architecture.md);\nfor launch commands see [docs/usage.md](/alebal123bal/khadas_yolov8n_multithread/blob/main/docs/usage.md).\n\n**Related repositories**\n\n- [RKNN_TRAIN_YOLO](https://github.com/alebal123bal/RKNN_TRAIN_YOLO): the entire pipeline for training, converting, and exporting the YOLO model into the Rockchip NPU `.rknn` format used here.\n- [RKLLM_LLAMA_QWEN](https://github.com/alebal123bal/RKLLM_LLAMA_QWEN): the entire pipeline for running optimized LLM models on the RK3588S, either on the NPU (RKLLM) or the CPU (llama).\n\nA 3-thread inference pool runs one RKNN context per NPU core\n(`rknn_dup_context` + `rknn_set_core_mask`), pipelining capture, inference, and\ndisplay across all three cores. At 1080p with YOLOv8n 640×640 this lifts\nthroughput from **~31.2 FPS** (naïve single-threaded loop) to the **46 FPS**\nOS08A10 camera ceiling: the pipeline is no longer the bottleneck, the sensor\nis. Full per-model FPS, latency, and CPU/NPU/RAM numbers are in\n[docs/benchmarks.md](/alebal123bal/khadas_yolov8n_multithread/blob/main/docs/benchmarks.md).\n\nEvery heavy per-frame operation runs on a dedicated fixed-function block of the\nRK3588S (camera ISP, RGA, NPU), never on the CPU, so there are no large\nintermediate framebuffers or scratch tensors CPU-side. 
A fixed pool of\npre-allocated buffers (`N_BUF`, see `BufPool` in [src/main.cc](/alebal123bal/khadas_yolov8n_multithread/blob/main/yolov8n_cap_multithread/src/main.cc))\nis recycled instead of allocating per frame, so memory stays **flat and bounded**:\n**~137–152 MB RSS for one 1080p stream**, **~276–304 MB for two** (and that double-counts the shared\n`librknnrt.so` / `librga.so` pages).\n\nBecause the NPU, ISP and RGA are identical across the whole RK3588S range, the\nsame binary runs at full speed on the cheapest 2 GB boards (**~€90**); no\n8/16 GB dev kit is required. See [docs/architecture.md](/alebal123bal/khadas_yolov8n_multithread/blob/main/docs/architecture.md)\nfor the per-frame offload table and pipeline diagram.\n\n**Native (on the board):**\n\n```\ncd yolov8n_cap_multithread\nbash build.sh\n```\n\n**Cross-compile (WSL / x86-64 Linux):**\n\n```\n# one-time setup\nsudo apt-get install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu\nbash setup_sdk.sh          # fetches librga v1.10.5_[8] + librknnrt v2.3.2\n\ncd yolov8n_cap_multithread\nbash build.sh              # uses toolchain-aarch64.cmake (aarch64-linux-gnu-g++)\nscp -r install/yolov8n_cap_multithread/ khadas@<board-ip>:~/programs/\n```\n\nRun: `./yolov8n_cap_multithread <rknn model> <device number> <rtsp port | hdmi>`\n\nSee [docs/usage.md](/alebal123bal/khadas_yolov8n_multithread/blob/main/docs/usage.md) for launch commands, and\n[docs/usage_advanced.md](/alebal123bal/khadas_yolov8n_multithread/blob/main/docs/usage_advanced.md) for the IPC control/data\nplane, the downstream tracking/temporal/LLM stages, and RTSP streaming setup.\n\n```\nyolov8n_cap_multithread/\n├── CMakeLists.txt              # builds main pipeline + all auxiliary processes\n├── build.sh                    # convenience wrapper around CMake\n├── toolchain-aarch64.cmake     # cross-compile toolchain (WSL / x86 → aarch64)\n├── data/\n│   ├── coco_1_labels_list.txt\n│   └── model/                  # .rknn model 
files\n│\n├── include/                    # YOLO pipeline headers\n│   ├── camera_util.h\n│   ├── drm_func.h\n│   ├── local_display.h         # HDMI output via DRM / Wayland\n│   ├── model_utils.h\n│   ├── postprocess.h           # YOLOv8 decode + NMS\n│   ├── rga_func.h              # Rockchip RGA color-space conversion / resize\n│   ├── rtsp_stream.h           # GStreamer RTSP publisher\n│   └── ipc/                    # shared IPC layer (control + data planes)\n│       ├── bounded_queue.h     # drop-oldest queue used by all publishers\n│       ├── i_control_server.h\n│       ├── i_data_publisher.h\n│       ├── messages.h          # in-process DetectionMessage type\n│       ├── unix_control_server.h\n│       ├── unix_data_publisher.h\n│       ├── wire_protocol.h     # ALL on-the-wire structs + socket paths\n│       └── yolo_control_state.h\n│\n├── src/                        # YOLO pipeline implementation\n│   ├── main.cc                 # multi-threaded RKNN pipeline (3 NPU cores)\n│   ├── camera_util.cc\n│   ├── local_display.cc\n│   ├── model_utils.cc\n│   ├── postprocess.cc\n│   ├── rga_func.cc\n│   ├── rtsp_stream.cc\n│   └── ipc/\n│       ├── unix_control_server.cc      # JSON control plane over AF_UNIX\n│       └── unix_data_publisher.cc      # binary detection stream over AF_UNIX\n│\n├── tracker/                    # ByteTrack stage (separate process)\n│   ├── include/\n│   │   └── bytetrack_adapter.h         # IByteTracker interface\n│   └── src/\n│       ├── bytetrack_service.cc        # main() — reads data, writes tracks\n│       └── iou_tracker.cc              # default IOU-greedy implementation\n│\n├── temporal/                   # Temporal-features stage (separate process)\n│   ├── include/\n│   │   ├── track_state.h               # per-track history + feature math\n│   │   └── track_manager.h             # lifecycle + per-frame orchestration\n│   └── src/\n│       ├── temporal_service.cc         # main() — reads tracks, writes events\n│       ├── 
track_state.cc\n│       └── track_manager.cc\n│\n├── tools/                      # Standalone client / debug binaries\n│   ├── control_client.cc       # send pause/resume/blackout/status commands\n│   ├── data_receiver.cc        # consume raw detections   (yolo_data socket)\n│   ├── tracks_receiver.cc      # consume tracked dets     (yolo_tracks socket)\n│   ├── events_receiver.cc      # consume temporal events  (yolo_events socket)\n│   └── event_summarizer.cc     # presence FSM + on-demand LLM (production sink)\n│\n├── utility_board_scripts/      # board-side helpers (deployed to install tree)\n│   └── run_qwen.sh             # feeds a snapshot to Qwen2.5-0.5B via llm_demo\n│\n├── build/                      # CMake out-of-source build tree\n└── install/                    # `make install` deploy tree (scp this to board)\n    └── yolov8n_cap_multithread/\n        ├── yolov8n_cap_multithread\n        ├── bytetrack_service\n        ├── temporal_service\n        ├── control_client\n        ├── data_receiver\n        ├── tracks_receiver\n        ├── events_receiver\n        ├── event_summarizer\n        ├── data/                       # models + labels\n        ├── utility_board_scripts/      # run_qwen.sh\n        └── lib/                        # librknnrt.so, librga.so\n```\n\nEach stage is an independent OS process; they communicate via per-device\nUnix-domain sockets (`<device>`\n\n= V4L2 device number, e.g. `33`\n\n). The full\nsoftware architecture — the internal `main.cc`\n\npipeline and the multi-process\ntopology, both as Mermaid diagrams — is documented in\n[docs/architecture.md](/alebal123bal/khadas_yolov8n_multithread/blob/main/docs/architecture.md).\n\nLicensed under the **Apache License 2.0** — see [LICENSE](/alebal123bal/khadas_yolov8n_multithread/blob/main/LICENSE).\n\nThis is an **independent, personal project** built for **educational and\nresearch purposes only**. 
It is **not affiliated with or endorsed by any\nemployer or client** of the author, and is **not intended for production,\noperational, safety-critical, surveillance, or defense use**. The \"UAV\" class\nis only a sample detection target for benchmarking the inference pipeline. The\nsoftware is provided **\"AS IS\", without warranty of any kind**, and **you are\nsolely responsible** for complying with all applicable export-control and other\nregulations. See [DISCLAIMER.md](/alebal123bal/khadas_yolov8n_multithread/blob/main/DISCLAIMER.md) for the full text.", "url": "https://wpnews.pro/news/show-hn-dual-yolov8n-uav-detection-on-rk3588s-at-42-fps-using-npu", "canonical_source": "https://github.com/alebal123bal/khadas_yolov8n_multithread", "published_at": "2026-06-14 14:37:51+00:00", "updated_at": "2026-06-14 16:10:35.780091+00:00", "lang": "en", "topics": ["computer-vision", "artificial-intelligence", "ai-infrastructure", "ai-agents", "ai-tools"], "entities": ["Rockchip RK3588S", "YOLOv8n", "Khadas Edge2", "Qwen2.5-0.5B", "OS08A10", "NPU", "ISP", "RGA"], "alternates": {"html": "https://wpnews.pro/news/show-hn-dual-yolov8n-uav-detection-on-rk3588s-at-42-fps-using-npu", "markdown": "https://wpnews.pro/news/show-hn-dual-yolov8n-uav-detection-on-rk3588s-at-42-fps-using-npu.md", "text": "https://wpnews.pro/news/show-hn-dual-yolov8n-uav-detection-on-rk3588s-at-42-fps-using-npu.txt", "jsonld": "https://wpnews.pro/news/show-hn-dual-yolov8n-uav-detection-on-rk3588s-at-42-fps-using-npu.jsonld"}}