# Building a Drone-Based Security Reconnaissance System with Computer Vision

> Source: <https://blog.roboflow.com/drone-based-security-reconnaissance-system/>
> Published: 2026-06-30 14:59:15+00:00

*Build a drone-based security system by relaying every PTZ camera and drone feed through MediaMTX, sampling one frame every two seconds, and POSTing it to a Roboflow Inference Server running RF-DETR to detect people and cars. Use supervision's PolygonZone plus ByteTrack to confirm a sustained restricted-zone breach, which auto-dispatches the drone to track the intruder, with no per-frame inference and no human in the loop until the alert fires.*

Fixed security cameras have a frustrating property: they only see where you bolted them. Cover a real perimeter (a yard, a lot, a fence line) and you're either buying a dozen cameras and a wiring crew, or paying a person to walk it at 2 a.m. Both are expensive, and both leave gaps.

A drone flips that math. One aircraft can patrol an arc that would take six fixed cameras, then land and charge. But a drone streaming video is just an expensive pair of eyes unless *something* is watching the feed and deciding what matters. That "something" is computer vision.

This post walks through a working build that ties it all together: two pan-tilt-zoom (PTZ) cameras and one drone, all running live object detection, feeding a single dashboard that knows the difference between "a person walked by" and "someone is standing inside the restricted zone, launch the drone."

## Drone-Based Security System: The Architecture

The whole system is a handful of small services that each do one job:

System architecture: cameras and drone stream over RTSP into a MediaMTX relay, which sends WebRTC video to the dashboard and frames to the FastAPI workers; workers post frames to the RF-DETR inference server, and the zone/event logic dispatches the drone, records a clip, and pushes events to the dashboard. A confirmed breach auto-dispatches the drone.

A few design choices worth calling out:

- **Browsers can't play RTSP**, so every camera and the drone publish into **MediaMTX**, which relays each stream out as low-latency WebRTC for the dashboard *and* as RTSP for the backend to pull frames from. One relay, three streams.
- **Detection is a separate service.** The Python workers never run a model in-process. They POST frames to Roboflow's **Inference Server** over HTTP and get detections back. That keeps the CV concern isolated and swappable.
- **Every service owns its own event stream** (a ring buffer + a WebSocket). The frontend subscribes to all of them and merges one chronological, severity-colored log. No shared database, no message broker, so services stay loosely coupled.
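To make the "ring buffer per service, merged in the frontend" idea concrete, here is a minimal sketch in plain Python. The `Event` fields, the `EventBuffer` class, and the `merge_logs` helper are illustrative names, not the build's actual code. The real system pushes events over a WebSocket, which is left out here.

```python
from collections import deque
from dataclasses import dataclass
import heapq


@dataclass(frozen=True)
class Event:
    ts: float        # Unix timestamp in seconds
    source: str      # which service emitted it (hypothetical names)
    severity: str    # e.g. "info", "warning", "alert"
    message: str


class EventBuffer:
    """Fixed-size ring buffer: the oldest events fall off when it is full."""

    def __init__(self, maxlen=200):
        self._events = deque(maxlen=maxlen)

    def push(self, event):
        self._events.append(event)

    def snapshot(self):
        return list(self._events)


def merge_logs(*buffers):
    """Merge per-service buffers into one chronological list."""
    # Assumes each buffer is already in time order, so a k-way merge suffices.
    streams = (b.snapshot() for b in buffers)
    return list(heapq.merge(*streams, key=lambda e: e.ts))
```

Because each service only appends to its own buffer, nothing has to coordinate writes. The merge happens at read time, which is what lets the services stay decoupled.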

### Step 1: Get frames, not just video

The cameras and drone stream continuously, but you do *not* want to run a detector on every frame. It's wasteful and, on a CPU, impossible to keep up with. So each camera worker grabs a frame on a fixed cadence (INFERENCE_INTERVAL_SEC=2.0 by default, one inference every two seconds) and sends just that one off for detection.
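The cadence logic itself is tiny. A rough sketch, assuming the worker reads frames in a loop and passes in a clock value such as `time.monotonic()` (the `FrameSampler` name is made up for illustration):

```python
class FrameSampler:
    """Decides whether the current frame is due for inference."""

    def __init__(self, interval_sec=2.0):
        self.interval_sec = interval_sec
        self._next_due = None  # nothing sampled yet, so the first frame is due

    def should_sample(self, now):
        """Return True at most once per interval; `now` is in seconds."""
        if self._next_due is None or now >= self._next_due:
            self._next_due = now + self.interval_sec
            return True
        return False
```

Every frame still flows to the dashboard over WebRTC. Only the frames where `should_sample()` returns True get encoded and sent to the detector.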

Before sending, the worker shrinks the frame to a 640px longest side and JPEG-encodes it at quality 80:

```
# RF-DETR resizes to a small square internally, so downscaling here loses
# little accuracy but cuts encode + base64 + transfer + decode latency a lot.
INFER_MAX_SIDE = 640
ok, buf = cv2.imencode(".jpg", send, [cv2.IMWRITE_JPEG_QUALITY, 80])
```

This is the single highest-leverage latency optimization in the build. The model downsizes the image anyway, so a 4K frame buys you nothing but slower JPEG encoding, a bigger base64 payload, and a slower decode on the server side. Send 640px and scale the returned boxes back up to the original frame.

```
[CODE: the full `infer()` method, including scaling detections back to original resolution]
```
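This is not the build's `infer()` method, but the arithmetic behind "send 640px, scale the boxes back" fits in a few lines. The sketch below assumes boxes arrive as `(x1, y1, x2, y2)` pixel tuples in the downscaled frame. Both helper names are hypothetical.

```python
INFER_MAX_SIDE = 640


def downscale_size(width, height, max_side=INFER_MAX_SIDE):
    """Return (new_width, new_height, scale) with the longest side capped."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height, 1.0  # already small enough, send as-is
    scale = max_side / longest
    return round(width * scale), round(height * scale), scale


def scale_boxes_back(boxes, scale):
    """Map (x1, y1, x2, y2) boxes from the downscaled frame to the original."""
    return [tuple(v / scale for v in box) for box in boxes]
```

In the worker you would pass the computed size to something like `cv2.resize(frame, (new_w, new_h))` before the `cv2.imencode` call, then run the returned boxes through `scale_boxes_back` with the same `scale`.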

### Step 2: The model, and why we didn't train one (yet)

The detector is [RF-DETR](https://rfdetr.roboflow.com/latest/?ref=blog.roboflow.com) (rfdetr-base), a transformer-based object detection model that Roboflow runs out of the box with the standard COCO classes, which already include person and car. We filter the results down to just those two classes downstream, so the model itself stays generic:

```
MODEL_ID=rfdetr-base
TARGET_CLASSES=person,car
CONFIDENCE_THRESHOLD=0.5
```
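Here is one way the downstream filter could read those three settings. It is a sketch, not the build's code. It assumes each prediction is a dict with `class` and `confidence` keys, and the function names are illustrative.

```python
import os


def load_config(env=os.environ):
    """Read detector settings, falling back to the defaults shown above."""
    raw_classes = env.get("TARGET_CLASSES", "person,car")
    return {
        "model_id": env.get("MODEL_ID", "rfdetr-base"),
        "target_classes": {c.strip() for c in raw_classes.split(",") if c.strip()},
        "confidence": float(env.get("CONFIDENCE_THRESHOLD", "0.5")),
    }


def filter_predictions(predictions, config):
    """Keep predictions of a targeted class at or above the threshold."""
    return [
        p for p in predictions
        if p["class"] in config["target_classes"]
        and p["confidence"] >= config["confidence"]
    ]
```

Keeping the class filter outside the model is what makes the swap cheap: a fine-tuned model with different class names only needs a new TARGET_CLASSES value.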

For perimeter security, off-the-shelf person/car detection gets you surprisingly far, and starting here means *zero* labeling before you have a working system. **MODEL_ID is an environment variable end-to-end**, so the moment you need something COCO doesn't cover (a "person carrying a bag," a "delivery van vs. private car," aerial-angle people who look nothing like ground-level training data), you point it at a fine-tuned model with a one-line config change.

That fine-tuning path is the natural Roboflow loop:

- **Collect** real footage from your own cameras and drone. The most valuable dataset is the one shot from *your* angles and altitudes.
- **Label** it in **Roboflow Annotate**, or bootstrap labels with a model you already have and just correct them.
- **Pull in Universe datasets** for classes you don't want to label from scratch. Aerial person/vehicle datasets are a strong head start for drone-altitude views.
- **Train** a fine-tuned RF-DETR, then deploy it back through the same Inference Server. No pipeline rewrite.

### Step 3: Deploy the detector

The Inference Server runs as a container straight from roboflow/inference-server-cpu. The PTZ worker calls one route:

```
POST /infer/object_detection
  { "model_id": "rfdetr-base",
    "image": { "type": "base64", "value": "<jpeg>" },
    "confidence": 0.5 }
```
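A minimal client for that route, using only the standard library. The payload shape comes from the request above. The `server_url` value, the timeout, and the function names are assumptions, and a real deployment may need extra fields (an API key, for example) that this sketch leaves out.

```python
import base64
import json
import urllib.request


def build_payload(jpeg_bytes, model_id="rfdetr-base", confidence=0.5):
    """Build the JSON body for POST /infer/object_detection."""
    return {
        "model_id": model_id,
        "image": {
            "type": "base64",
            "value": base64.b64encode(jpeg_bytes).decode("ascii"),
        },
        "confidence": confidence,
    }


def post_frame(server_url, jpeg_bytes, timeout=5.0, **payload_kwargs):
    """POST one JPEG frame and return the parsed JSON response."""
    body = json.dumps(build_payload(jpeg_bytes, **payload_kwargs)).encode("utf-8")
    request = urllib.request.Request(
        server_url.rstrip("/") + "/infer/object_detection",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())
```

A worker would call `post_frame("http://localhost:9001", buf.tobytes())` with the buffer from `cv2.imencode`, where the host and port are whatever your container actually exposes.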

The default image is **CPU-only**, which is fine *because* inference runs on a low cadence rather than per-frame. When you want real-time, per-frame detection (say, running this on a **Jetson** mounted on a larger drone), you swap the image to roboflow/inference-server-gpu, flip USE_GPU=true, and the rest of the system doesn't change. Same service, same API, just CUDA underneath.

```
[BENCHMARK: measured FPS and per-frame latency, CPU server vs. GPU/Jetson; fill in from your own runs]
```

### Step 4: Turn boxes into decisions

A bounding box on its own isn't security. It's just a coordinate. The logic that makes it *useful* lives in three layers:

**Restricted zones, the easy way.** You could hand-roll a point-in-polygon test, but you don't have to. This is exactly what [supervision](https://github.com/roboflow/supervision?ref=blog.roboflow.com), Roboflow's open-source computer-vision toolkit (pip install supervision), gives you out of the box. sv.PolygonZone takes a polygon and tells you which detections fall inside it; sv.PolygonZoneAnnotator draws the zone on the frame.

The one detail that matters is the *anchor*. Set triggering_anchors to BOTTOM_CENTER so the test uses a person's **feet**, not the box center. Ground contact is a far better proxy for "standing in the zone" than torso height, which can hang over a boundary the person isn't actually crossing.

``` python
import numpy as np
import supervision as sv

# Restricted-zone polygon in pixel coordinates, for a 1280×720 frame.
zone_polygon = np.array([[704, 180], [1216, 180], [1216, 612], [704, 612]])

zone = sv.PolygonZone(
    polygon=zone_polygon,
    triggering_anchors=(sv.Position.BOTTOM_CENTER,),   # a person's feet
)
zone_annotator = sv.PolygonZoneAnnotator(zone=zone, color=sv.Color.RED)

# Per frame: supervision parses the inference response directly.
# `result` is the Inference Server's JSON response, `frame` is the image it
# came from, and PERSON_CLASS_ID is the model's class index for "person".
detections = sv.Detections.from_inference(result)
people = detections[detections.class_id == PERSON_CLASS_ID]

in_zone = zone.trigger(detections=people)   # one boolean per detection
breach = bool(in_zone.any())                # is anyone standing in the zone?
frame = zone_annotator.annotate(scene=frame)   # draw the zone on the frame
```

zone.trigger() returns a boolean array (one flag per detection) and zone.current_count tells you how many people are inside right now. That's the whole breach test, and it's the same primitive whether the camera is fixed or flying on the drone.

**Confirmation, not twitch.** A single frame with someone in the zone doesn't trip anything. The breach has to *persist* a few seconds before it fires, with an exit grace period so one dropped frame doesn't end the event prematurely. Run detections through a tracker first (sv.ByteTrack) and every person gets a stable tracker_id, so "the same intruder, in the zone, for three seconds" becomes something you can actually measure instead of guessing frame to frame. This single rule is what separates an annoying system from a usable one.
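The confirmation rule can be sketched as a small state machine keyed on tracker ID. The class name and both thresholds are illustrative assumptions, not the build's values. Note that with a two-second sampling cadence, one missed frame means a four-second gap between sightings, so the grace period here is set longer than that.

```python
class BreachConfirmer:
    """Confirms a breach only after a sustained stay, tolerating brief dropouts."""

    def __init__(self, confirm_sec=3.0, exit_grace_sec=5.0):
        self.confirm_sec = confirm_sec        # how long a stay must last to fire
        self.exit_grace_sec = exit_grace_sec  # how long an ID may vanish before its stay ends
        self._first_seen = {}   # tracker_id -> time the current stay began
        self._last_seen = {}    # tracker_id -> time last seen inside the zone
        self.confirmed = set()  # tracker_ids whose breach has already fired

    def update(self, ids_in_zone, now):
        """Feed the tracker IDs currently in the zone; return newly confirmed IDs."""
        # End the stay of anyone missing for longer than the grace period.
        for tid in list(self._last_seen):
            if now - self._last_seen[tid] > self.exit_grace_sec:
                del self._first_seen[tid]
                del self._last_seen[tid]
                self.confirmed.discard(tid)

        newly_confirmed = []
        for tid in ids_in_zone:
            self._first_seen.setdefault(tid, now)
            self._last_seen[tid] = now
            dwell = now - self._first_seen[tid]
            if tid not in self.confirmed and dwell >= self.confirm_sec:
                self.confirmed.add(tid)
                newly_confirmed.append(tid)
        return newly_confirmed
```

With supervision, the IDs would come from something like `tracked = tracker.update_with_detections(people)` on an `sv.ByteTrack()` instance, followed by `tracked.tracker_id[zone.trigger(detections=tracked)]`. Each ID fires once per stay, so a confirmed intruder doesn't re-trigger the dispatch on every sampled frame.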

**The payoff: cameras dispatch the drone.** When a fixed PTZ camera confirms a breach, it doesn't just turn the log row red. It POSTs to the drone service to launch and go track the intruder, then keeps forwarding the detection's bounding-box width so the drone can hold range. The fixed camera is the tripwire; the drone is the response. Meanwhile the whole episode is recorded to an annotated MP4 and linked from the event feed.
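How box width turns into "hold range" isn't spelled out above, so treat the following as one plausible reading rather than the build's controller: a proportional rule that pushes the drone forward when the target looks too small and back when it looks too big. The target ratio, gain, and speed cap are made-up numbers, and `range_command` is a hypothetical name.

```python
def range_command(box_width_px, frame_width_px, target_ratio=0.25,
                  gain=100.0, max_speed=30.0):
    """Return a forward (+) or backward (-) speed from the target's apparent width."""
    ratio = box_width_px / frame_width_px
    error = target_ratio - ratio   # positive: target looks small, so move closer
    speed = gain * error
    # Clamp so a bad detection can't command a lunge.
    return max(-max_speed, min(max_speed, speed))
```

Apparent width is only a rough proxy for distance (it changes when a person turns sideways), which is one more reason to clamp the output and keep the gain modest.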

### What Broke, And How We Fixed It

**False positives from flicker.** Early on, every momentary detection inside a zone fired an alert. A person walking *past* the edge would set it off. The sustained-confirmation window plus the exit grace fixed it: real intrusions last seconds; noise doesn't.

**Altitude and angle drift.** COCO's person examples come mostly from ground-level photos. From a drone looking down, people are a very different shape, and confidence sags. That's exactly the gap the Roboflow fine-tuning loop closes, and the reason MODEL_ID is swappable from day one.

**The drone has no GPS.** The build's real-hardware driver targets a **DJI Tello** (via djitellopy), which only reports relative motion and barometric height. The driver *dead-reckons* position by integrating reported velocity and maps it onto pseudo-coordinates so the flight-path mini-map still renders. But you treat position as an estimate, and "fly to" becomes a best-effort short hop, not GPS waypoint navigation. For a craft with real GPS, there's a MAVLink/ArduPilot driver path behind the same interface.
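Dead reckoning is just integration: multiply each reported velocity by the time since the last report and accumulate. A minimal sketch, assuming velocities have already been converted to metres per second in a fixed frame anchored at takeoff (the class name and units are assumptions, not the driver's actual interface):

```python
class DeadReckoner:
    """Estimates position by integrating reported velocity over time."""

    def __init__(self):
        self.x = 0.0          # metres from the takeoff point
        self.y = 0.0
        self._last_t = None   # time of the previous velocity report

    def update(self, vx, vy, t):
        """Take velocity (m/s) reported at time t (s); return the estimated (x, y)."""
        if self._last_t is not None:
            dt = t - self._last_t
            self.x += vx * dt
            self.y += vy * dt
        self._last_t = t
        return self.x, self.y
```

Every velocity error gets integrated too, so the estimate drifts the longer the drone flies. That is why position here is good enough for a mini-map but not for waypoint navigation.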

### What Worked

- **One drone genuinely covered an arc** that would have needed several fixed cameras, and the scan-area mission flies an orbit autonomously rather than needing a pilot.
- **The 640px / low-cadence approach kept CPU-only inference comfortably real-time-enough** for patrol. You don't need 30 FPS to catch someone standing in a restricted zone for three seconds.
- **The camera-triggers-drone handoff** is the part that feels like more than a demo: a fixed sensor making a decision that puts a mobile sensor on target, with no human in the loop until the alert lands.

The honest limitations: COCO classes only take you so far at drone altitude, the Tello is an indoor-scale proof of concept rather than a perimeter aircraft, and edge FPS is entirely a function of whether you're on CPU or a GPU/Jetson.

## Try It Yourself: Build a Drone-Based Security Reconnaissance System with Vision AI

The pattern here generalizes to almost any monitoring problem, drone or not: relay the video, sample frames on a cadence, detect with a swappable model, and wrap the boxes in *decision logic*. The CV part is the part you don't have to build from scratch.

If you want to start with your own footage, [sign up for Roboflow](https://app.roboflow.com/?ref=blog.roboflow.com), label a few hundred frames in **Annotate** (or grab a head-start [dataset](https://universe.roboflow.com/pian-jiangfeng-gmail-com/aerial-cars?ref=blog.roboflow.com) from [Universe](https://universe.roboflow.com/?ref=blog.roboflow.com)), train an RF-DETR model, and deploy it through [Inference](https://docs.roboflow.com/changelog/explore-by-month/february-2026/inference-1.0-modular-vision-execution-engine?ref=blog.roboflow.com) with the same one-line MODEL_ID swap shown above. Point it at a camera and see what it catches.

**Cite this Post**

Use the following entry to cite this post in your research:

[Tyler Odenthal](/author/tyler/). (Jun 30, 2026).
Building a Drone-Based Security Reconnaissance System with Computer Vision. Roboflow Blog: https://blog.roboflow.com/drone-based-security-reconnaissance-system/