# I Built an Automated Document Management System with Paperless-NGX

A software developer built a fully automated document management system using Paperless-NGX that processes over 900 documents for a family of four across multiple languages. The system ingests physical papers via a ScanSnap scanner and digital documents from email, then uses Google Document AI and Gemini 2.5 Flash for OCR, classification, and tagging. The setup runs on a single VPS with nine Docker containers and requires only 30 seconds of daily manual effort.

How I built a zero-effort document management system for a family of four: 900+ documents across multiple languages with AI classification, barcode tracking, encrypted backups, and a scanner-to-archive pipeline that runs on autopilot.

Managing paper and digital documents for a family of four across multiple countries and languages was becoming unmanageable. We had tax documents in German, medical records in English, contracts scattered across three email accounts, and a growing pile of physical papers that I could never find when I needed them.

I spent weeks building a fully automated document management system that now handles 900+ documents with close to zero daily effort. This post walks you through the entire setup, from the scanner on my desk to the encrypted backups on GitHub, with every configuration file and script you need to build it yourself.

## The Problem

Our household generates a surprising amount of paperwork: insurance letters, tax assessments, medical bills, employment contracts, kindergarten registrations, and vehicle documents. They come in different languages, arrive through different channels, and belong to different family members.
Before this setup, my "system" was:

- A physical folder that was always full and never organized
- Email attachments scattered across three accounts
- A phone full of photos of documents "I'll file later"
- No way to search for anything: I had to remember where I put it

I needed a system that could:

- Ingest documents from email, scanner, and manual upload automatically
- Classify documents by type, person, and topic using AI
- Track which physical documents I have and where they are
- Be searchable and accessible from anywhere
- Back itself up to multiple locations without my intervention

## My Workflow: From Paper to Searchable Archive

Here's the daily workflow that now runs on autopilot:

### Physical Documents

- A letter arrives in the mail
- I print an ASN barcode label using Avery Zweckform L4731REV-25 labels and their online designer, then stick it on the document
- I place the document on my Ricoh ScanSnap iX1600 scanner
- The scanner auto-uploads the PDF to a Google Drive folder, sorted into a subfolder by category (Finance, Health, Work, etc.)
- Every 5 minutes, rclone moves new files from Google Drive to the server
- Paperless-NGX detects the new file and starts processing
- A workflow assigns the document type based on which subfolder it came from
- Google Document AI performs high-quality OCR, then Gemini 2.5 Flash generates a clean title, identifies the correspondent, assigns tags, and extracts the creation date
- The document is filed on disk by person, correspondent, and year
- The ASN barcode is detected automatically; a cron job tags the document as "Physical Filed" and syncs the ASN mapping to a Google Sheet
- The physical document goes into a numbered binder, matching the ASN

### Digital Documents

- An invoice arrives by email
- One of 24 mail rules detects it and consumes the attachment
- The document is auto-tagged as "Digital Only" and goes through the same AI classification pipeline

The entire process takes about 30 seconds of my time (sticking the label and placing the paper on the scanner). Everything else is automated.

## Architecture Overview

The system runs on a single VPS (4 vCPU, 8 GB RAM, 80 GB NVMe) with 9 Docker containers:

| Container | Role | Memory Limit |
|---|---|---|
| Paperless-NGX | Core document management | 2 GB |
| Paperless-GPT | AI classification (Gemini 2.5 Flash + Document AI) | 256 MB |
| PostgreSQL 16 | Database | 512 MB |
| Redis | Caching and task queue | 256 MB |
| Gotenberg | Document conversion | 512 MB |
| Apache Tika | Text extraction from Office formats | 512 MB |
| Cloudflare Tunnel | Secure HTTPS access (zero open ports) | 128 MB |
| Portainer | Container management UI | 256 MB |
| Watchtower | Automatic container updates | 256 MB |

Total memory footprint: under 5 GB, leaving headroom on the 8 GB VPS.

### Document Processing Flow

[Figure: document processing flow diagram]

## Step-by-Step Build Guide

### Step 1: Server Setup

Start with a fresh Ubuntu VPS.
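When choosing the VPS size, it can help to add up the container memory limits from the architecture table. The following is a minimal Python sketch of that arithmetic. The limits are copied from the table; the dictionary and function names are only illustrative and are not part of the setup.

```python
# Memory limits in MB, copied from the architecture table.
# The names used here are illustrative only; nothing below is part of
# Paperless-NGX or of the author's scripts.
MEMORY_LIMITS_MB = {
    "Paperless-NGX": 2048,
    "Paperless-GPT": 256,
    "PostgreSQL 16": 512,
    "Redis": 256,
    "Gotenberg": 512,
    "Apache Tika": 512,
    "Cloudflare Tunnel": 128,
    "Portainer": 256,
    "Watchtower": 256,
}


def total_memory_gb(limits_mb):
    """Sum the per-container limits and convert MB to GB (1 GB = 1024 MB)."""
    return sum(limits_mb.values()) / 1024


if __name__ == "__main__":
    total = total_memory_gb(MEMORY_LIMITS_MB)
    print(f"{len(MEMORY_LIMITS_MB)} containers, about {total:.1f} GB of limits")
```

The limits add up to 4736 MB, roughly 4.6 GB, which is consistent with the under-5 GB footprint stated in the architecture overview. With the sizing settled, the server itself can be prepared.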
This script installs Docker, configures the firewall, sets up fail2ban, installs rclone for Google Drive sync, and creates the directory structure:

```bash
#!/bin/bash
set -e

# Update system
sudo apt update && sudo apt upgrade -y

# Install Docker
curl -fsSL https://get.docker.com | sudo sh
sudo usermod -aG docker $USER

# Firewall - only SSH (all web traffic goes through Cloudflare Tunnel)
sudo ufw allow OpenSSH
sudo ufw --force enable

# Brute-force protection
sudo apt install -y fail2ban
sudo systemctl enable fail2ban && sudo systemctl start fail2ban

# Directory structure
mkdir -p ~/paperless/{data,media,export,consume,redis,db,prompts,backups,scripts}
mkdir -p ~/paperless/consume/{Arbeit,Dokumente,Fahrzeuge,Finanzen,Gesundheit,Wohnen,Sonstiges}

# Install rclone for Google Drive sync
curl https://rclone.org/install.sh | sudo bash

# Backup cron (daily at 2 AM)
(crontab -l 2>/dev/null; echo '0 2 * * * ~/paperless/backup.sh >> ~/paperless/backup.log 2>&1') | crontab -
```

### Step 2: Docker Compose Configuration

This is the complete `docker-compose.yml`. Replace the placeholder values (`YOUR_...`) with your own credentials:

```yaml
services:
  broker:
    image: redis:7-alpine
    container_name: paperless-redis
    read_only: true
    healthcheck:
      test: ["CMD-SHELL", "redis-cli ping || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
    security_opt:
      - no-new-privileges:true
    environment:
      REDIS_ARGS: "--save 60 10"
    restart: unless-stopped
    volumes:
      - ~/paperless/redis:/data
    networks:
      - paperless-net
    deploy:
      resources:
        limits:
          memory: 256M

  gotenberg:
    image: gotenberg/gotenberg:8.7
    container_name: paperless-gotenberg
    restart: unless-stopped
    security_opt:
      - no-new-privileges:true
    command:
      - "gotenberg"
      - "--chromium-disable-javascript=true"
      - "--chromium-allow-list=.*"
    networks:
      - paperless-net
    deploy:
      resources:
        limits:
          memory: 512M

  db:
    image: postgres:16-alpine
    container_name: paperless-db
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "pg_isready", "-q", "-d", "paperless", "-U", "paperless"]
      timeout: 45s
      interval: 10s
      retries: 10
    security_opt:
      - no-new-privileges:true
    volumes:
      - ~/paperless/db:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: YOUR_DB_PASSWORD
    networks:
      - paperless-net
    deploy:
      resources:
        limits:
          memory: 512M

  paperless:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    container_name: paperless-ngx
    healthcheck:
      test: ["CMD", "curl", "-fs", "-S", "--max-time", "2", "http://localhost:8000"]
      interval: 30s
      timeout: 10s
      retries: 5
    security_opt:
      - no-new-privileges:true
    restart: unless-stopped
    depends_on:
      db:
        condition: service_healthy
      broker:
        condition: service_healthy
      gotenberg:
        condition: service_started
    ports:
      - "127.0.0.1:8001:8000"
    volumes:
      - ~/paperless/data:/usr/src/paperless/data
      - ~/paperless/media:/usr/src/paperless/media
      - ~/paperless/export:/usr/src/paperless/export
      - ~/paperless/consume:/usr/src/paperless/consume
      - ~/paperless/scripts:/usr/src/paperless/scripts
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_TIKA_ENABLED: 1
      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
      PAPERLESS_TIKA_ENDPOINT: http://tika:9998
      PAPERLESS_TIME_ZONE: Europe/Berlin
      PAPERLESS_SECRET_KEY: YOUR_RANDOM_SECRET_KEY
      PAPERLESS_ADMIN_USER: admin
      PAPERLESS_ADMIN_PASSWORD: YOUR_ADMIN_PASSWORD
      PAPERLESS_URL: "https://your-domain.com"
      # OCR Settings
      PAPERLESS_OCR_LANGUAGE: "deu+eng"
      PAPERLESS_OCR_LANGUAGES: "tur aze"
      PAPERLESS_OCR_MODE: "skip"
      PAPERLESS_OCR_DPI: 300
      PAPERLESS_OCR_SKIP_ARCHIVE_FILE: with_text
      PAPERLESS_OCR_CLEAN: clean
      PAPERLESS_OCR_CACHING: "true"
      PAPERLESS_OCR_USER_ARGS: '{"invalidate_digital_signatures": true}'
      # Consumer Settings
      PAPERLESS_CONSUMER_POLLING: 10
      PAPERLESS_CONSUMER_RECURSIVE: "true"
      PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS: "false"
      PAPERLESS_CONSUMER_RETRY_COUNT: 3
      PAPERLESS_CONSUMER_DELETE_DUPLICATES: "true"
      # File Naming
      PAPERLESS_FILENAME_FORMAT: "{{ correspondent|slugify }}/{{ created_year }}/{{ created }}_{{ title|slugify }}"
      PAPERLESS_FILENAME_FORMAT_REMOVE_NONE: "true"
      PAPERLESS_FILENAME_DATE_ORDER: "YMD"
      # Workers
      PAPERLESS_TASK_WORKERS: 2
      PAPERLESS_THREADS_PER_WORKER: 2
      # ASN Barcode Detection
      PAPERLESS_CONSUMER_ENABLE_BARCODES: "true"
      PAPERLESS_CONSUMER_ENABLE_ASN_BARCODE: "true"
      PAPERLESS_CONSUMER_ASN_BARCODE_PREFIX: "ASN"
      PAPERLESS_CONSUMER_BARCODE_SCANNER: "ZXING"
      # Email for notifications
      PAPERLESS_EMAIL_HOST: "YOUR_SMTP_HOST"
      PAPERLESS_EMAIL_PORT: 465
      PAPERLESS_EMAIL_HOST_USER: "YOUR_EMAIL"
      PAPERLESS_EMAIL_HOST_PASSWORD: "YOUR_EMAIL_APP_PASSWORD"
      PAPERLESS_EMAIL_USE_SSL: "true"
      PAPERLESS_EMAIL_FROM: "YOUR_EMAIL"
      PAPERLESS_EMAIL_TASK_CRON: "*/5 * * * *"
      # Proxy settings for Cloudflare Tunnel
      PAPERLESS_USE_X_FORWARD_HOST: "true"
      PAPERLESS_USE_X_FORWARD_PORT: "true"
      PAPERLESS_PROXY_SSL_HEADER: '["HTTP_X_FORWARDED_PROTO", "https"]'
      PAPERLESS_ALLOWED_HOSTS: "localhost,paperless,your-domain.com"
      PAPERLESS_CORS_ALLOWED_HOSTS: "https://your-domain.com"
      PAPERLESS_CSRF_TRUSTED_ORIGINS: "https://your-domain.com"
      PAPERLESS_DEBUG: "false"
    networks:
      - paperless-net
    deploy:
      resources:
        limits:
          memory: 2G

  tika:
    image: apache/tika:latest
    container_name: paperless-tika
    restart: unless-stopped
    security_opt:
      - no-new-privileges:true
    networks:
      - paperless-net
    deploy:
      resources:
        limits:
          memory: 512M

  paperless-gpt:
    image: icereed/paperless-gpt:latest
    container_name: paperless-gpt
    environment:
      PAPERLESS_BASE_URL: "http://paperless:8000"
      PAPERLESS_API_TOKEN: "YOUR_PAPERLESS_API_TOKEN"
      PAPERLESS_PUBLIC_URL: "https://your-domain.com"
      MANUAL_TAG: "paperless-gpt"
      AUTO_TAG: "paperless-gpt-auto"
      LLM_PROVIDER: "googleai"
      GOOGLEAI_API_KEY: "YOUR_GOOGLE_AI_API_KEY"
      LLM_MODEL: "gemini-2.5-flash"
      TOKEN_LIMIT: 0
      OCR_PROCESS_MODE: "whole_pdf"
      OCR_PROVIDER: "google_docai"
      GOOGLE_PROJECT_ID: "YOUR_GCP_PROJECT"
      GOOGLE_LOCATION: "eu"
      GOOGLE_PROCESSOR_ID: "YOUR_PROCESSOR_ID"
      GOOGLE_APPLICATION_CREDENTIALS: "/app/credentials.json"
      AUTO_OCR_TAG: "paperless-gpt-ocr-auto"
      OCR_LIMIT_PAGES: "5"
      LOG_LEVEL: "info"
    volumes:
      - ~/paperless/prompts:/app/prompts
      - ~/paperless/google-ai.json:/app/credentials.json
    ports:
      - "127.0.0.1:8080:8080"
    depends_on:
      - paperless
    networks:
      - paperless-net
    deploy:
      resources:
        limits:
          memory: 256M

  cloudflared:
    image: cloudflare/cloudflared:latest
    container_name: paperless-cloudflared
    command: tunnel --no-autoupdate run --token YOUR_TUNNEL_TOKEN
    restart: unless-stopped
    security_opt:
      - no-new-privileges:true
    networks:
      - paperless-net
    deploy:
      resources:
        limits:
          memory: 128M

networks:
  paperless-net:
    driver: bridge
```

### Step 3: AI Classification with Paperless-GPT

[Paperless-GPT](https://github.com/icereed/paperless-gpt?ref=turalali.com) is the brain of the system. It uses Google Gemini 2.5 Flash for classification and Google Document AI for OCR. Every new document gets:

- A clean, descriptive title extracted from the content
- The correspondent identified (stripped of legal suffixes like GmbH or AG)
- A document type from your predefined categories
- Tags selected from your curated list
- The creation date parsed from the document, not the scan date
- Custom field extraction (expiration dates, amounts, etc.)

A note on privacy: Since Google Document AI and Gemini process document content in the cloud, I only send non-sensitive documents through the automated pipeline. Sensitive documents, such as passport copies, tax returns with personal IDs, or medical records with detailed diagnoses, are classified and titled manually in the Paperless UI. The OCR for those still runs locally via Tesseract (Paperless-NGX's built-in OCR engine), so the content never leaves the server. This is a conscious trade-off: the AI pipeline saves hours of work on the 90% of documents that aren't sensitive, while the 10% that are get handled with extra care.
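The split between automated and manual handling can be pictured as a tagging decision. The sketch below is hypothetical: the post does not describe how sensitive documents are kept out of the pipeline, so the keyword list and the function `tags_for_new_document` are assumptions made purely for illustration. Only the two tag names come from the `AUTO_TAG` and `AUTO_OCR_TAG` settings in the Step 2 configuration.

```python
# Hypothetical sketch: decide whether a new document is handed to the
# cloud AI pipeline. The keyword list and function name are assumptions;
# only the two tag names are taken from the Step 2 configuration.
AUTO_TAG = "paperless-gpt-auto"          # value of AUTO_TAG in Step 2
AUTO_OCR_TAG = "paperless-gpt-ocr-auto"  # value of AUTO_OCR_TAG in Step 2

# Assumed markers of sensitive content (illustrative, not from the post).
SENSITIVE_KEYWORDS = ("passport", "reisepass", "diagnose", "steuer-id")


def tags_for_new_document(filename: str) -> list[str]:
    """Return the pipeline tags for a document, or none if it looks sensitive."""
    name = filename.lower()
    if any(keyword in name for keyword in SENSITIVE_KEYWORDS):
        # No pipeline tags: the document stays on the server and is
        # titled and classified by hand.
        return []
    return [AUTO_TAG, AUTO_OCR_TAG]
```

A filename check like this is deliberately crude. In practice the decision is more likely made by a person at scan time, which matches the manual handling described above.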
The prompts are fully customizable Go templates stored in `/prompts/`. Here's the correspondent prompt as an example. It tells the AI to avoid legal suffixes and provides the existing correspondent list as context:

```
I will provide you with the content of a document. Your task is to suggest a correspondent that is most relevant to the document.

Try to avoid any legal or financial suffixes like "GmbH" or "AG" in the correspondent name. For example use "Microsoft" instead of "Microsoft Ireland Operations Limited".

If you can't find a suitable correspondent, respond with "Unknown".

Example Correspondents:
{{.AvailableCorrespondents | join ", "}}

The content is likely in {{.Language}}.

Document Content:
{{.Content}}
```

### Step 4: Consume Subfolders and Workflows

The ScanSnap saves files to Google Drive subfolders based on document category. Paperless has 7 consume subfolders, each mapped to a document type via workflows:

| Subfolder | Document Type | Workflow Filter |
|---|---|---|
| Arbeit/ | Arbeit und Beruf | `*/Arbeit/*` |
| Dokumente/ | Wichtige Dokumente | `*/Dokumente/*` |
| Fahrzeuge/ | Fahrzeuge | `*/Fahrzeuge/*` |
| Finanzen/ | Finanzen und Steuern | `*/Finanzen/*` |
| Gesundheit/ | Gesundheit und Versicherungen | `*/Gesundheit/*` |
| Wohnen/ | Wohnen | `*/Wohnen/*` |
| Sonstiges/ | AI decides | (none) |

Each workflow triggers on Consumption Started (type 1) and uses `filter_path` to match the subfolder. This ensures the document type is set before the AI runs, so even if the AI disagrees (e.g., a health document in the Finance folder), the subfolder-based type sticks.

The exception is Sonstiges (Miscellaneous): documents scanned into this folder have no workflow, so Paperless-GPT classifies them freely based on content.

Additional workflows automatically assign storage paths based on person tags, organizing files on disk as `Person/Correspondent/Year/Date_Title`.

### Step 5: Google Drive Sync

The ScanSnap uploads to Google Drive.
A cron job syncs new files to the server every 5 minutes:

```
# /etc/cron.d/paperless-gdrive-consume
# Note: cron entries must stay on a single line (no backslash continuation).
*/5 * * * * ubuntu rclone move "Gdrive:Paperless/Consume/" /home/ubuntu/paperless/consume/ --log-file=/tmp/paperless-gdrive-consume.log --log-level=INFO 2>/dev/null
```

The `rclone move` command moves (rather than copies) files, so Google Drive acts as a temporary drop zone. Configure rclone with `rclone config` and set up Google Drive OAuth.

### Step 6: Physical Document Tracking (ASN Barcodes)

Every physical document gets an Archive Serial Number (ASN) barcode label. I use Avery Zweckform L4731REV-25 removable labels (189 per sheet) and their [online designer](https://www.avery-zweckform.com/vorlagen-software/design-drucken?ref=turalali.com) to print ASN barcodes. Paperless automatically reads the barcode using ZXING when the document is scanned.

A cron script runs every minute to:

- Find all documents with an ASN that aren't tagged "Physical Filed" yet
- Add the "Physical Filed" tag and remove the "Digital Only" tag
- Sync all ASN-to-document mappings to a Google Sheet as a safety backup

```bash
#!/bin/bash
# /home/ubuntu/paperless/scripts/asn-physical-filed.sh
# Replace PHYSICAL_FILED_TAG_ID, PHYSICAL_FILED_ID and DIGITAL_ONLY_ID
# with the numeric tag IDs from your own Paperless instance.
TOKEN="YOUR_PAPERLESS_API_TOKEN"
API="http://localhost:8001/api"

# Find docs with ASN but without "Physical Filed" tag
DOCS=$(curl -s "${API}/documents/?archive_serial_number__isnull=false&tags__id__none=PHYSICAL_FILED_TAG_ID&page_size=100" \
  -H "Authorization: Token ${TOKEN}")

# Extract the document IDs from the JSON response
IDS=$(echo "$DOCS" | python3 -c "
import json,sys
data = json.load(sys.stdin)
for d in data.get('results', []):
    print(d['id'])
" 2>/dev/null)

for doc_id in $IDS; do
  # Build the updated tag list: add "Physical Filed", drop "Digital Only"
  TAGS=$(curl -s "${API}/documents/${doc_id}/" -H "Authorization: Token ${TOKEN}" | python3 -c "
import json,sys
d = json.load(sys.stdin)
tags = d['tags']
if PHYSICAL_FILED_ID not in tags:
    tags.append(PHYSICAL_FILED_ID)
if DIGITAL_ONLY_ID in tags:
    tags.remove(DIGITAL_ONLY_ID)
print(json.dumps(tags))
" 2>/dev/null)

  # Write the new tag list back to the document
  curl -s -X PATCH "${API}/documents/${doc_id}/" \
    -H "Authorization: Token ${TOKEN}" \
    -H "Content-Type: application/json" \
    -d "{\"tags\": ${TAGS}}" >/dev/null 2>&1
done
```

The Google Sheets sync script uses gspread and a Google service account to write all ASN mappings to a spreadsheet. This means even if I lose the entire Paperless server, I still have a record of which ASN corresponds to which document.

### Step 7: Backup Strategy (3-2-1 Rule)

Backups run automatically at 2:00 AM daily with three destinations and rotation:

| Destination | Method | Retention |
|---|---|---|
| Google Drive | rclone upload | 7 daily / 4 weekly / 3 monthly |
| GitHub (private repo) | AES-256-CBC encrypted | 7 daily / 4 weekly / 3 monthly |
| Local | On-disk copy | Same rotation |

The backup script exports all documents from Paperless, uploads to Google Drive, encrypts and pushes to GitHub, rotates old backups, and verifies integrity. Healthchecks.io notifies me if anything fails.

```bash
#!/bin/bash
# backup.sh - Daily Paperless backup with rotation and monitoring
set -e

DATE=$(date +%Y-%m-%d)
DAY_OF_WEEK=$(date +%u)
DAY_OF_MONTH=$(date +%d)
PAPERLESS_DIR="/home/ubuntu/paperless"
BACKUP_DIR="$PAPERLESS_DIR/backups"
CONFIG_REPO="/home/ubuntu/paperless-config"
GDRIVE_DIR="Backups/Paperless"

# Load .env (contains HEALTHCHECK_URL and ENCRYPTION_PASSPHRASE)
source "$PAPERLESS_DIR/.env"

# Healthchecks.io integration
healthcheck_start()   { curl -fsS -m 10 --retry 5 "${HEALTHCHECK_URL}/start" >/dev/null 2>&1 || true; }
healthcheck_success() { curl -fsS -m 10 --retry 5 "$HEALTHCHECK_URL" >/dev/null 2>&1 || true; }
healthcheck_fail()    { curl -fsS -m 10 --retry 5 "${HEALTHCHECK_URL}/fail" >/dev/null 2>&1 || true; }

healthcheck_start

# Export from Paperless (-sm = split manifest for faster imports)
cd "$PAPERLESS_DIR"
docker compose exec -T paperless document_exporter ../export --zip -sm

EXPORT_FILE=$(ls -t "$PAPERLESS_DIR/export/export-"*.zip 2>/dev/null | head -1)
[ -z "$EXPORT_FILE" ] && { healthcheck_fail; exit 1; }

# Determine backup type (1st of month: monthly, Sunday: weekly, else daily)
if [[ "$DAY_OF_MONTH" == "01" ]]; then
  BACKUP_TYPE="monthly"
elif [[ "$DAY_OF_WEEK" == "7" ]]; then
  BACKUP_TYPE="weekly"
else
  BACKUP_TYPE="daily"
fi
BACKUP_NAME="paperless-${BACKUP_TYPE}-${DATE}.zip"

# Copy locally and upload to Google Drive
cp "$EXPORT_FILE" "$BACKUP_DIR/$BACKUP_NAME"
rclone copy "$BACKUP_DIR/$BACKUP_NAME" "Gdrive:$GDRIVE_DIR/$BACKUP_TYPE/"

# Rotate local and remote backups
find "$BACKUP_DIR" -name "paperless-daily-*.zip" -mtime +7 -delete
find "$BACKUP_DIR" -name "paperless-weekly-*.zip" -mtime +28 -delete
find "$BACKUP_DIR" -name "paperless-monthly-*.zip" -mtime +90 -delete
rclone delete "Gdrive:$GDRIVE_DIR/daily/" --min-age 7d 2>/dev/null || true
rclone delete "Gdrive:$GDRIVE_DIR/weekly/" --min-age 28d 2>/dev/null || true
rclone delete "Gdrive:$GDRIVE_DIR/monthly/" --min-age 90d 2>/dev/null || true

# Verify integrity
unzip -t "$BACKUP_DIR/$BACKUP_NAME" >/dev/null 2>&1 || { healthcheck_fail; exit 1; }

# Encrypted backup to GitHub
if [[ -n "$ENCRYPTION_PASSPHRASE" && -d "$CONFIG_REPO" ]]; then
  mkdir -p "$CONFIG_REPO/encrypted-backups/$BACKUP_TYPE"
  openssl enc -aes-256-cbc -salt -pbkdf2 \
    -in "$BACKUP_DIR/$BACKUP_NAME" \
    -out "$CONFIG_REPO/encrypted-backups/$BACKUP_TYPE/paperless-${BACKUP_TYPE}-${DATE}.zip.enc" \
    -pass pass:"$ENCRYPTION_PASSPHRASE"

  # Rotate encrypted backups
  find "$CONFIG_REPO/encrypted-backups/daily" -name "*.enc" -mtime +7 -delete 2>/dev/null || true
  find "$CONFIG_REPO/encrypted-backups/weekly" -name "*.enc" -mtime +28 -delete 2>/dev/null || true
  find "$CONFIG_REPO/encrypted-backups/monthly" -name "*.enc" -mtime +90 -delete 2>/dev/null || true

  cd "$CONFIG_REPO" && git add -A
  git diff --staged --quiet || git commit -m "Backup $DATE - $BACKUP_TYPE (encrypted)" && git push origin main
fi

healthcheck_success
```

### Step 8: Security Hardening

There are two approaches to making Paperless accessible remotely. I use both; pick whichever fits your threat model.

#### Option A: Tailscale VPN (Private Access Only)

[Tailscale](https://tailscale.com/?ref=turalali.com) is a zero-config WireGuard mesh VPN.
Install it on your server and your devices, and Paperless becomes accessible only to machines on your private network, invisible to the rest of the internet.

```bash
# Install Tailscale
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up --ssh

# Expose Paperless via Tailscale HTTPS (only accessible from your tailnet)
tailscale serve --bg --https=443 http://localhost:8001

# Lock SSH to Tailscale only
sudo ufw allow in on tailscale0 to any port 22
# Step 1 opened SSH with "ufw allow OpenSSH", so that is the rule to delete
sudo ufw delete allow OpenSSH
```

The result: Paperless is available at https://