Building a Drone-Based Security Reconnaissance System with Computer Vision

A developer built a drone-based security reconnaissance system that uses computer vision to detect restricted-zone breaches. The system relays PTZ camera and drone feeds through MediaMTX, samples frames every two seconds, and runs RF-DETR inference via Roboflow to detect people and cars. A confirmed breach auto-dispatches the drone to track the intruder without human intervention until the alert fires.

Build a drone-based security system by relaying every PTZ camera and drone feed through MediaMTX, sampling one frame every two seconds, and POSTing it to a Roboflow Inference Server running RF-DETR to detect people and cars. Use supervision's PolygonZone plus ByteTrack to confirm a sustained restricted-zone breach, which auto-dispatches the drone to track the intruder, with no per-frame inference and no human in the loop until the alert fires.

Fixed security cameras have a frustrating property: they only see where you bolted them. Cover a real perimeter (a yard, a lot, a fence line) and you're either buying a dozen cameras and a wiring crew, or paying a person to walk it at 2 a.m. Both are expensive, and both leave gaps.

A drone flips that math. One aircraft can patrol an arc that would take six fixed cameras, then land and charge. But a drone streaming video is just an expensive pair of eyes unless something is watching the feed and deciding what matters. That "something" is computer vision.

This post walks through a working build that ties it all together: two pan-tilt-zoom (PTZ) cameras and one drone, all running live object detection, feeding a single dashboard that knows the difference between "a person walked by" and "someone is standing inside the restricted zone, launch the drone."

Drone-Based Security System: The Architecture

The whole system is a handful of small services that each do one job:

Figure: System architecture. Cameras and drone stream over RTSP into a MediaMTX relay, which sends WebRTC video to the dashboard and frames to the FastAPI workers; workers post frames to the RF-DETR inference server, and the zone/event logic dispatches the drone, records a clip, and pushes events to the dashboard. A confirmed breach auto-dispatches the drone.

A few design choices worth calling out:

- Browsers can't play RTSP, so every camera and the drone publish into MediaMTX, which relays each stream out as low-latency WebRTC for the dashboard and as RTSP for the backend to pull frames from. One relay, three streams.
- Detection is a separate service. The Python workers never run a model in-process. They POST frames to Roboflow's Inference Server over HTTP and get detections back. That keeps the CV concern isolated and swappable.
- Every service owns its own event stream (a ring buffer + a WebSocket). The frontend subscribes to all of them and merges them into one chronological, severity-colored log. No shared database, no message broker, so services stay loosely coupled.

Step 1: Get frames, not just video

The cameras and drone stream continuously, but you do not want to run a detector on every frame. It's wasteful and, on a CPU, impossible to keep up with. So each camera worker grabs a frame on a fixed cadence (INFERENCE_INTERVAL_SEC=2.0 by default, one inference every two seconds) and sends just that one off for detection.

Before sending, the worker shrinks the frame to a 640px longest side and JPEG-encodes it at quality 80:

```python
# RF-DETR resizes to a small square internally, so downscaling here loses
# little accuracy but cuts encode + base64 + transfer + decode latency a lot.
INFER_MAX_SIDE = 640

# Excerpt from the worker: `send` is the frame being sent for inference.
ok, buf = cv2.imencode(".jpg", send, [cv2.IMWRITE_JPEG_QUALITY, 80])
```

This is the single highest-leverage latency optimization in the build. The model downsizes the image anyway, so a 4K frame buys you nothing but slower JPEG encoding, a bigger base64 payload, and a slower decode on the server side. Send 640px and scale the returned boxes back up to the original frame.

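
The downscale-then-rescale step can be sketched in a few lines. This is a minimal illustration, not the build's actual `infer` method: the helper names `downscale_size` and `scale_boxes_back` are hypothetical, and boxes are assumed to be plain `(x1, y1, x2, y2)` tuples in pixel coordinates of the downscaled frame.

```python
def downscale_size(width, height, max_side=640):
    """Return (new_width, new_height, scale) so the longest side is <= max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height, 1.0  # already small enough, leave untouched
    scale = max_side / longest
    return round(width * scale), round(height * scale), scale


def scale_boxes_back(boxes, scale):
    """Map (x1, y1, x2, y2) boxes from the downscaled frame to the original."""
    return [tuple(v / scale for v in box) for box in boxes]


# A 1920x1080 frame shrinks to 640x360; a worker would pass the new size to
# an image-resize call (for example cv2.resize) before JPEG-encoding.
new_w, new_h, scale = downscale_size(1920, 1080)

small_boxes = [(100.0, 50.0, 200.0, 150.0)]  # hypothetical detection
full_boxes = scale_boxes_back(small_boxes, scale)
```

Dividing by the same scale factor that shrank the frame keeps the returned boxes aligned with the full-resolution video, so annotations and zone tests can keep working in original pixel coordinates.
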
CODE: the full infer method, including scaling detections back to original resolution

Step 2: The model, and why we didn't train one yet

The detector is RF-DETR (https://rfdetr.roboflow.com/latest/?ref=blog.roboflow.com), specifically rfdetr-base, a transformer-based object detection model that Roboflow runs out of the box with the standard COCO classes, which already include person and car. We filter the results down to just those two classes downstream, so the model itself stays generic:

```
MODEL_ID=rfdetr-base
TARGET_CLASSES=person,car
CONFIDENCE_THRESHOLD=0.5
```

For perimeter security, off-the-shelf person/car detection gets you surprisingly far, and starting here means zero labeling before you have a working system. MODEL_ID is an environment variable end-to-end, so the moment you need something COCO doesn't cover (a "person carrying a bag," a "delivery van vs. private car," aerial-angle people who look nothing like ground-level training data), you point it at a fine-tuned model with a one-line config change.

That fine-tuning path is the natural Roboflow loop:

1. Collect real footage from your own cameras and drone. The most valuable dataset is the one shot from your angles and altitudes.
2. Label it in Roboflow Annotate, or bootstrap labels with a model you already have and just correct them.
3. Pull in Universe datasets for classes you don't want to label from scratch. Aerial person/vehicle datasets are a strong head start for drone-altitude views.
4. Train a fine-tuned RF-DETR, then deploy it back through the same Inference Server. No pipeline rewrite.

Step 3: Deploy the detector

The Inference Server runs as a container straight from roboflow/inference-server-cpu. The PTZ worker calls one route:

```
POST /infer/object_detection
{
  "model_id": "rfdetr-base",
  "image": { "type": "base64", "value": "<jpeg ...>" },
  "confidence": 0.5
}
```

The default image is CPU-only, which is fine because inference runs on a low cadence rather than per-frame.
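
As a rough sketch of the worker side of that route, the request body can be assembled and the response filtered with the standard library alone. The helper names `build_payload` and `filter_targets` are hypothetical, and the response shape (a `predictions` list whose items carry a `class` field) is an assumption for illustration; check it against what your Inference Server actually returns.

```python
import base64
import json

MODEL_ID = "rfdetr-base"
TARGET_CLASSES = {"person", "car"}
CONFIDENCE_THRESHOLD = 0.5


def build_payload(jpeg_bytes):
    """Build the JSON body for POST /infer/object_detection."""
    return {
        "model_id": MODEL_ID,
        "image": {
            "type": "base64",
            "value": base64.b64encode(jpeg_bytes).decode("ascii"),
        },
        "confidence": CONFIDENCE_THRESHOLD,
    }


def filter_targets(response):
    """Keep only predictions whose class is in TARGET_CLASSES."""
    predictions = response.get("predictions", [])
    return [p for p in predictions if p.get("class") in TARGET_CLASSES]


payload = build_payload(b"\xff\xd8fake-jpeg-bytes")  # placeholder bytes
body = json.dumps(payload)  # what a worker would send over HTTP

# Hypothetical response, shaped like the assumption described above.
sample = {"predictions": [
    {"class": "person", "confidence": 0.91},
    {"class": "dog", "confidence": 0.88},
    {"class": "car", "confidence": 0.64},
]}
kept = filter_targets(sample)
```

Filtering on the worker rather than in the model is what lets the same generic detector serve other class lists later: only TARGET_CLASSES changes.
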
When you want real-time, per-frame detection (say, running this on a Jetson mounted on a larger drone), you swap the image to roboflow/inference-server-gpu, flip USE_GPU=true, and the rest of the system doesn't change. Same container, same API, just CUDA underneath.

BENCHMARK: measured FPS and per-frame latency, CPU server vs. GPU/Jetson; fill in from your own runs

Step 4: Turn boxes into decisions

A bounding box on its own isn't security. It's just a coordinate. The logic that makes it useful lives in three layers:

Restricted zones, the easy way. You could hand-roll a point-in-polygon test, but you don't have to. This is exactly what supervision (https://github.com/roboflow/supervision?ref=blog.roboflow.com), Roboflow's open-source computer-vision toolkit (pip install supervision), gives you out of the box. sv.PolygonZone takes a polygon and tells you which detections fall inside it; sv.PolygonZoneAnnotator draws the zone on the frame.

The one detail that matters is the anchor. Set triggering_anchors to BOTTOM_CENTER so the test uses a person's feet, not the box center. Ground contact is a far better proxy for "standing in the zone" than torso height, which can hang over a boundary the person isn't actually crossing.

```python
import numpy as np
import supervision as sv

# Restricted-zone polygon in pixel coordinates, for a 1280×720 frame.
zone_polygon = np.array([[704, 180], [1216, 180], [1216, 612], [704, 612]])

zone = sv.PolygonZone(
    polygon=zone_polygon,
    triggering_anchors=(sv.Position.BOTTOM_CENTER,),  # a person's feet
)
zone_annotator = sv.PolygonZoneAnnotator(zone=zone, color=sv.Color.RED)

# Per frame: supervision parses the inference response directly.
# `result`, `frame`, and PERSON_CLASS_ID come from the surrounding worker code.
detections = sv.Detections.from_inference(result)
people = detections[detections.class_id == PERSON_CLASS_ID]
in_zone = zone.trigger(detections=people)  # one boolean per detection
breach = bool(in_zone.any())               # is anyone standing in the zone?
frame = zone_annotator.annotate(scene=frame)
```

zone.trigger returns a boolean array (one flag per detection), and zone.current_count tells you how many people are inside right now. That's the whole breach test, and it's the same primitive whether the camera is fixed or flying on the drone.

Confirmation, not twitch. A single frame with someone in the zone doesn't trip anything. The breach has to persist a few seconds before it fires, with an exit grace period so one dropped frame doesn't end the event prematurely. Run detections through a tracker first (sv.ByteTrack) and every person gets a stable tracker_id, so "the same intruder, in the zone, for three seconds" becomes something you can actually measure instead of guessing frame to frame. This single rule is what separates an annoying system from a usable one.

The payoff: cameras dispatch the drone. When a fixed PTZ camera confirms a breach, it doesn't just turn the log row red. It POSTs to the drone service to launch and go track the intruder, then keeps forwarding the detection's bounding-box width so the drone can hold range. The fixed camera is the tripwire; the drone is the response. Meanwhile the whole episode is recorded to an annotated MP4 and linked from the event feed.

What Broke, And How We Fixed It

False positives from flicker. Early on, every momentary detection inside a zone fired an alert. A person walking past the edge would set it off. The sustained-confirmation window plus the exit grace fixed it: real intrusions last seconds; noise doesn't.

Altitude and angle drift. COCO's person class is trained mostly on ground-level photos. From a drone looking down, people are a very different shape, and confidence sags. That's exactly the gap the Roboflow fine-tuning loop closes, and the reason MODEL_ID is swappable from day one.

The drone has no GPS. The build's real-hardware driver targets a DJI Tello via djitellopy, which only reports relative motion and barometric height.
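
Without GPS, any position has to be estimated by integrating velocity over time. A minimal sketch of that kind of dead reckoning follows; the `DeadReckoner` class, its units (metres and seconds), and the fixed sample values are assumptions for illustration, not the build's driver code.

```python
class DeadReckoner:
    """Integrate velocity samples into an estimated (x, y) position."""

    def __init__(self):
        self.x = 0.0
        self.y = 0.0

    def update(self, vx, vy, dt):
        """Advance the estimate: velocity (m/s) assumed constant for dt seconds."""
        self.x += vx * dt
        self.y += vy * dt
        return self.x, self.y


est = DeadReckoner()
est.update(vx=1.0, vy=0.0, dt=2.0)  # 2 m along x
est.update(vx=0.0, vy=0.5, dt=4.0)  # then 2 m along y
position = (est.x, est.y)
```

Because each sample's error is added to the running total, the estimate drifts the longer the aircraft flies, which is why it is only good enough for short hops and a rough map.
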
The driver dead-reckons position by integrating reported velocity and maps it onto pseudo-coordinates so the flight-path mini-map still renders. But you treat position as an estimate, and "fly to" becomes a best-effort short hop, not GPS waypoint navigation. For a craft with real GPS, there's a MAVLink/ArduPilot driver path behind the same interface.

What Worked

One drone genuinely covered an arc that would have needed several fixed cameras, and the scan-area mission flies an orbit autonomously rather than needing a pilot.

The 640px / low-cadence approach kept CPU-only inference comfortably real-time-enough for patrol. You don't need 30 FPS to catch someone standing in a restricted zone for three seconds.

The camera-triggers-drone handoff is the part that feels like more than a demo: a fixed sensor making a decision that puts a mobile sensor on target, with no human in the loop until the alert lands.

The honest limitations: COCO classes only take you so far at drone altitude, the Tello is an indoor-scale proof of concept rather than a perimeter aircraft, and edge FPS is entirely a function of whether you're on CPU or a GPU/Jetson.

Try It Yourself: Build a Drone-Based Security Reconnaissance System with Vision AI

The pattern here generalizes to almost any monitoring problem, drone or not: relay the video, sample frames on a cadence, detect with a swappable model, and wrap the boxes in decision logic. The CV part is the part you don't have to build from scratch.

If you want to start with your own footage, sign up for Roboflow (https://app.roboflow.com/?ref=blog.roboflow.com), label a few hundred frames in Annotate or grab a head-start dataset (https://universe.roboflow.com/pian-jiangfeng-gmail-com/aerial-cars?ref=blog.roboflow.com) from Universe (https://universe.roboflow.com/?ref=blog.roboflow.com), train an RF-DETR model, and deploy it through Inference (https://docs.roboflow.com/changelog/explore-by-month/february-2026/inference-1.0-modular-vision-execution-engine?ref=blog.roboflow.com) with the same one-line MODEL_ID swap shown above. Point it at a camera and see what it catches.

Cite this Post

Use the following entry to cite this post in your research:

Tyler Odenthal. (Jun 30, 2026). Building a Drone-Based Security Reconnaissance System with Computer Vision. Roboflow Blog: https://blog.roboflow.com/drone-based-security-reconnaissance-system/