{"slug": "i-built-an-automated-document-management-system-with-paperless-ngx", "title": "I Built an Automated Document Management System with Paperless-NGX", "summary": "A software developer built a fully automated document management system using Paperless-NGX that processes over 900 documents for a family of four across multiple languages. The system ingests physical papers via a ScanSnap scanner and digital documents from email, then uses Google Document AI and Gemini 2.5 Flash for OCR, classification, and tagging. The setup runs on a single VPS with nine Docker containers and requires only 30 seconds of daily manual effort.", "body_md": "# How I Built a Fully Automated Document Management System with Paperless-NGX\n\nHow I built a zero-effort document management system for a family of four - 900+ documents across multiple languages with AI classification, barcode tracking, encrypted backups, and a scanner-to-archive pipeline that runs on autopilot.\n\nManaging paper and digital documents for a family of four across multiple countries and languages was becoming unmanageable. Tax documents in German, medical records in English, contracts scattered across three email accounts, and a growing pile of physical papers that I could never find when I needed them.\n\nI spent weeks building a fully automated document management system that now handles 900+ documents with zero daily effort. This post walks you through the entire setup — from the scanner on my desk to the encrypted backups on GitHub — with every configuration file and script you need to build it yourself.\n\n## The Problem\n\nOur household generates a surprising amount of paperwork. 
Insurance letters, tax assessments, medical bills, employment contracts, kindergarten registrations, vehicle documents, all in different languages, arriving through different channels, belonging to different family members.\n\nBefore this setup, my \"system\" was:\n\n- A physical folder that was always full and never organized\n- Email attachments scattered across three accounts\n- A phone full of photos of documents \"I'll file later\"\n- No way to search for anything; I had to remember where I put it\n\nI needed a system that could:\n\n- Ingest documents from email, scanner, and manual upload automatically\n- Classify documents by type, person, and topic using AI\n- Track which physical documents I have and where they are\n- Be searchable and accessible from anywhere\n- Back itself up to multiple locations without my intervention\n\n## My Workflow: From Paper to Searchable Archive\n\nHere's the daily workflow that now runs on autopilot:\n\n### Physical Documents\n\n- A letter arrives in the mail\n- I print an **ASN barcode label** using **Avery Zweckform L4731REV-25** labels and their online designer, then stick it on the document\n- I place the document on my **Ricoh ScanSnap iX1600** scanner\n- The scanner auto-uploads the PDF to a **Google Drive folder**, sorted into a subfolder by category (Finance, Health, Work, etc.)\n
- Every 5 minutes, **rclone** moves new files from Google Drive to the server\n- **Paperless-NGX** detects the new file and starts processing\n- A **workflow** assigns the document type based on which subfolder it came from\n- **Google Document AI** performs high-quality OCR, then **Gemini 2.5 Flash** generates a clean title, identifies the correspondent, assigns tags, and extracts the creation date\n- The document is filed on disk by person, correspondent, and year\n- The ASN barcode is detected automatically; a cron job tags the document as \"Physical Filed\" and syncs the ASN mapping to a Google Sheet\n- The physical document goes into a numbered binder, matching the ASN\n\n### Digital Documents\n\n- An invoice arrives by email\n- One of 24 mail rules detects it and consumes the attachment\n- The document is auto-tagged as \"Digital Only\" and goes through the same AI classification pipeline\n\nThe entire process takes about 30 seconds of my time (sticking the label and placing the paper on the scanner). Everything else is automated.\n\n## Architecture Overview\n\nThe system runs on a single VPS (4 vCPU, 8 GB RAM, 80 GB NVMe) with 9 Docker containers:\n\n| Container | Role | Memory Limit |\n|---|---|---|\n| Paperless-NGX | Core document management | 2 GB |\n| Paperless-GPT | AI classification (Gemini 2.5 Flash + Document AI) | 256 MB |\n| PostgreSQL 16 | Database | 512 MB |\n| Redis | Caching and task queue | 256 MB |\n| Gotenberg | Document conversion | 512 MB |\n| Apache Tika | Text extraction from Office formats | 512 MB |\n| Cloudflare Tunnel | Secure HTTPS access (zero open ports) | 128 MB |\n| Portainer | Container management UI | 256 MB |\n| Watchtower | Automatic container updates | 256 MB |\n\nTotal memory footprint: under 5 GB, leaving headroom on the 8 GB VPS.\n\n## Document Processing Flow\n\n## Step-by-Step Build Guide\n\n### Step 1: Server Setup\n\nStart with a fresh Ubuntu VPS. 
This script installs Docker, configures the firewall, sets up fail2ban, installs rclone for Google Drive sync, creates the directory structure, and registers the daily backup cron job:\n\n``` bash\n#!/bin/bash\nset -e\n\n# Update system\nsudo apt update && sudo apt upgrade -y\n\n# Install Docker\ncurl -fsSL https://get.docker.com | sudo sh\nsudo usermod -aG docker $USER\n\n# Firewall - only SSH (all web traffic goes through Cloudflare Tunnel)\nsudo ufw allow OpenSSH\nsudo ufw --force enable\n\n# Brute-force protection\nsudo apt install -y fail2ban\nsudo systemctl enable fail2ban && sudo systemctl start fail2ban\n\n# Directory structure\nmkdir -p ~/paperless/{data,media,export,consume,redis,db,prompts,backups,scripts}\nmkdir -p ~/paperless/consume/{Arbeit,Dokumente,Fahrzeuge,Finanzen,Gesundheit,Wohnen,Sonstiges}\n\n# Install rclone for Google Drive sync\ncurl https://rclone.org/install.sh | sudo bash\n\n# Backup cron (daily at 2 AM)\n(crontab -l 2>/dev/null; echo '0 2 * * * ~/paperless/backup.sh >> ~/paperless/backup.log 2>&1') | crontab -\n```\n\n### Step 2: Docker Compose Configuration\n\nThis is the `docker-compose.yml` for the Paperless stack (Portainer and Watchtower from the table above are not included in this listing). 
Replace the placeholder values (`YOUR_*`) with your own credentials:\n\n```\nservices:\n  broker:\n    image: redis:7-alpine\n    container_name: paperless-redis\n    read_only: true\n    healthcheck:\n      test: [\"CMD-SHELL\", \"redis-cli ping || exit 1\"]\n      interval: 30s\n      timeout: 10s\n      retries: 3\n    security_opt:\n      - no-new-privileges:true\n    environment:\n      REDIS_ARGS: \"--save 60 10\"\n    restart: unless-stopped\n    volumes:\n      - ~/paperless/redis:/data\n    networks:\n      - paperless-net\n    deploy:\n      resources:\n        limits:\n          memory: 256M\n\n  gotenberg:\n    image: gotenberg/gotenberg:8.7\n    container_name: paperless-gotenberg\n    restart: unless-stopped\n    security_opt:\n      - no-new-privileges:true\n    command:\n      - \"gotenberg\"\n      - \"--chromium-disable-javascript=true\"\n      - \"--chromium-allow-list=.*\"\n    networks:\n      - paperless-net\n    deploy:\n      resources:\n        limits:\n          memory: 512M\n\n  db:\n    image: postgres:16-alpine\n    container_name: paperless-db\n    restart: unless-stopped\n    healthcheck:\n      test: [\"CMD\", \"pg_isready\", \"-q\", \"-d\", \"paperless\", \"-U\", \"paperless\"]\n      timeout: 45s\n      interval: 10s\n      retries: 10\n    security_opt:\n      - no-new-privileges:true\n    volumes:\n      - ~/paperless/db:/var/lib/postgresql/data\n    environment:\n      POSTGRES_DB: paperless\n      POSTGRES_USER: paperless\n      POSTGRES_PASSWORD: YOUR_DB_PASSWORD\n    networks:\n      - paperless-net\n    deploy:\n      resources:\n        limits:\n          memory: 512M\n\n  paperless:\n    image: ghcr.io/paperless-ngx/paperless-ngx:latest\n    container_name: paperless-ngx\n    healthcheck:\n      test: [\"CMD\", \"curl\", \"-fs\", \"-S\", \"--max-time\", \"2\", \"http://localhost:8000\"]\n      interval: 30s\n      timeout: 10s\n      retries: 5\n    security_opt:\n      - no-new-privileges:true\n    restart: unless-stopped\n    depends_on:\n      db:\n        condition: service_healthy\n      broker:\n        condition: service_healthy\n      gotenberg:\n        condition: service_started\n    ports:\n      - \"127.0.0.1:8001:8000\"\n    volumes:\n      - ~/paperless/data:/usr/src/paperless/data\n      - ~/paperless/media:/usr/src/paperless/media\n      - ~/paperless/export:/usr/src/paperless/export\n      - ~/paperless/consume:/usr/src/paperless/consume\n      - ~/paperless/scripts:/usr/src/paperless/scripts\n    environment:\n      PAPERLESS_REDIS: redis://broker:6379\n      PAPERLESS_DBHOST: db\n      # Must match POSTGRES_PASSWORD of the db service\n      PAPERLESS_DBPASS: YOUR_DB_PASSWORD\n      PAPERLESS_TIKA_ENABLED: 1\n      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000\n      PAPERLESS_TIKA_ENDPOINT: http://tika:9998\n      PAPERLESS_TIME_ZONE: Europe/Berlin\n      PAPERLESS_SECRET_KEY: YOUR_RANDOM_SECRET_KEY\n      PAPERLESS_ADMIN_USER: admin\n      PAPERLESS_ADMIN_PASSWORD: YOUR_ADMIN_PASSWORD\n      PAPERLESS_URL: \"https://your-domain.com\"\n      # OCR Settings\n      PAPERLESS_OCR_LANGUAGE: \"deu+eng\"\n      PAPERLESS_OCR_LANGUAGES: \"tur aze\"\n      PAPERLESS_OCR_MODE: \"skip\"\n      PAPERLESS_OCR_DPI: 300\n      PAPERLESS_OCR_SKIP_ARCHIVE_FILE: with_text\n      PAPERLESS_OCR_CLEAN: clean\n      PAPERLESS_OCR_CACHING: true\n      PAPERLESS_OCR_USER_ARGS: '{\"invalidate_digital_signatures\": true}'\n      # Consumer Settings\n      PAPERLESS_CONSUMER_POLLING: 10\n      PAPERLESS_CONSUMER_RECURSIVE: \"true\"\n      PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS: \"false\"\n      PAPERLESS_CONSUMER_RETRY_COUNT: 3\n      PAPERLESS_CONSUMER_DELETE_DUPLICATES: true\n      # File Naming\n      PAPERLESS_FILENAME_FORMAT: \"{{ correspondent|slugify }}/{{ created_year }}/{{ created }} {{ title|slugify }}\"\n      PAPERLESS_FILENAME_FORMAT_REMOVE_NONE: \"true\"\n      PAPERLESS_FILENAME_DATE_ORDER: \"YMD\"\n      # Workers\n      PAPERLESS_TASK_WORKERS: 2\n      PAPERLESS_THREADS_PER_WORKER: 2\n      # ASN Barcode Detection\n      PAPERLESS_CONSUMER_ENABLE_BARCODES: \"true\"\n      PAPERLESS_CONSUMER_ENABLE_ASN_BARCODE: \"true\"\n      PAPERLESS_CONSUMER_ASN_BARCODE_PREFIX: \"ASN\"\n      PAPERLESS_CONSUMER_BARCODE_SCANNER: \"ZXING\"\n      # Email (for notifications)\n      PAPERLESS_EMAIL_HOST: \"YOUR_SMTP_HOST\"\n      PAPERLESS_EMAIL_PORT: 465\n      PAPERLESS_EMAIL_HOST_USER: \"YOUR_EMAIL\"\n      PAPERLESS_EMAIL_HOST_PASSWORD: \"YOUR_EMAIL_APP_PASSWORD\"\n      PAPERLESS_EMAIL_USE_SSL: \"true\"\n      PAPERLESS_EMAIL_FROM: \"YOUR_EMAIL\"\n      PAPERLESS_EMAIL_TASK_CRON: \"*/5 * * * *\"\n      # Proxy settings (for Cloudflare Tunnel)\n      PAPERLESS_USE_X_FORWARD_HOST: \"true\"\n      PAPERLESS_USE_X_FORWARD_PORT: \"true\"\n      PAPERLESS_PROXY_SSL_HEADER: '[\"HTTP_X_FORWARDED_PROTO\", \"https\"]'\n      PAPERLESS_ALLOWED_HOSTS: \"localhost,paperless,your-domain.com\"\n      PAPERLESS_CORS_ALLOWED_HOSTS: \"https://your-domain.com\"\n      PAPERLESS_CSRF_TRUSTED_ORIGINS: \"https://your-domain.com\"\n      PAPERLESS_DEBUG: false\n    networks:\n      - paperless-net\n    deploy:\n      resources:\n        limits:\n          memory: 2G\n\n  tika:\n    image: apache/tika:latest\n    container_name: paperless-tika\n    restart: unless-stopped\n    security_opt:\n      - no-new-privileges:true\n    networks:\n      - paperless-net\n    deploy:\n      resources:\n        limits:\n          memory: 512M\n\n  paperless-gpt:\n    image: icereed/paperless-gpt:latest\n    container_name: paperless-gpt\n    environment:\n      PAPERLESS_BASE_URL: \"http://paperless:8000\"\n      PAPERLESS_API_TOKEN: \"YOUR_PAPERLESS_API_TOKEN\"\n      PAPERLESS_PUBLIC_URL: \"https://your-domain.com\"\n      MANUAL_TAG: \"paperless-gpt\"\n      AUTO_TAG: \"paperless-gpt-auto\"\n      LLM_PROVIDER: \"googleai\"\n      GOOGLEAI_API_KEY: \"YOUR_GOOGLE_AI_API_KEY\"\n      LLM_MODEL: \"gemini-2.5-flash\"\n      TOKEN_LIMIT: 0\n      OCR_PROCESS_MODE: \"whole_pdf\"\n      OCR_PROVIDER: \"google_docai\"\n      GOOGLE_PROJECT_ID: \"YOUR_GCP_PROJECT\"\n      GOOGLE_LOCATION: \"eu\"\n      GOOGLE_PROCESSOR_ID: \"YOUR_PROCESSOR_ID\"\n      GOOGLE_APPLICATION_CREDENTIALS: \"/app/credentials.json\"\n      AUTO_OCR_TAG: \"paperless-gpt-ocr-auto\"\n      OCR_LIMIT_PAGES: \"5\"\n      LOG_LEVEL: \"info\"\n    volumes:\n      - ~/paperless/prompts:/app/prompts\n      - ~/paperless/google-ai.json:/app/credentials.json\n    ports:\n      - \"127.0.0.1:8080:8080\"\n    depends_on:\n      - paperless\n    networks:\n      - paperless-net\n    deploy:\n      resources:\n        limits:\n          memory: 256M\n\n  cloudflared:\n    image: cloudflare/cloudflared:latest\n    container_name: paperless-cloudflared\n    command: tunnel --no-autoupdate run --token YOUR_TUNNEL_TOKEN\n    restart: unless-stopped\n    security_opt:\n      - no-new-privileges:true\n    networks:\n      - paperless-net\n    deploy:\n      resources:\n        limits:\n          memory: 128M\n\nnetworks:\n  paperless-net:\n    driver: bridge\n```\n\n### Step 3: AI Classification with Paperless-GPT\n\n[Paperless-GPT](https://github.com/icereed/paperless-gpt?ref=turalali.com) is the brain of the system. It uses **Google Gemini 2.5 Flash** for classification and **Google Document AI** for OCR. Every new document gets:\n\n- A clean, descriptive **title** extracted from the content\n- The **correspondent** identified (stripped of legal suffixes like GmbH or AG)\n- A **document type** from your predefined categories\n- **Tags** selected from your curated list\n- The **creation date** parsed from the document, not the scan date\n- **Custom field** extraction (expiration dates, amounts, etc.)\n\n**A note on privacy:** Since Google Document AI and Gemini process document content in the cloud, I only send non-sensitive documents through the automated pipeline. 
Sensitive documents (passport copies, tax returns with personal IDs, medical records with detailed diagnoses) are classified and titled manually in the Paperless UI. The OCR for those still runs locally via Tesseract (Paperless-NGX's built-in OCR engine), so the content never leaves the server. This is a conscious trade-off: the AI pipeline saves hours of work on the 90% of documents that aren't sensitive, while the 10% that are get handled with extra care.\n\nThe prompts are fully customizable Go templates stored in `/prompts/`. Here's the correspondent prompt as an example. It tells the AI to avoid legal suffixes and provides the existing correspondent list as context:\n\n```\nI will provide you with the content of a document.\nYour task is to suggest a correspondent that is most relevant to the document.\n\nTry to avoid any legal or financial suffixes like \"GmbH\" or \"AG\" in the\ncorrespondent name. For example use \"Microsoft\" instead of\n\"Microsoft Ireland Operations Limited\".\n\nIf you can't find a suitable correspondent, respond with \"Unknown\".\n\nExample Correspondents:\n{{.AvailableCorrespondents | join \", \"}}\n\nThe content is likely in {{.Language}}.\n\nDocument Content:\n{{.Content}}\n```\n\n### Step 4: Consume Subfolders and Workflows\n\nThe ScanSnap saves files to Google Drive subfolders based on document category. Paperless has 7 consume subfolders; 6 of them are mapped to a document type via workflows:\n\n| Subfolder | Document Type | Workflow Filter |\n|---|---|---|\n| Arbeit/ | Arbeit und Beruf | `**/Arbeit/**` |\n| Dokumente/ | Wichtige Dokumente | `**/Dokumente/**` |\n| Fahrzeuge/ | Fahrzeuge | `**/Fahrzeuge/**` |\n| Finanzen/ | Finanzen und Steuern | `**/Finanzen/**` |\n| Gesundheit/ | Gesundheit und Versicherungen | `**/Gesundheit/**` |\n| Wohnen/ | Wohnen | `**/Wohnen/**` |\n| Sonstiges/ | AI decides | none |\n\nEach workflow triggers on **Consumption Started** (type 1) and uses `filter_path` to match the subfolder. 
This ensures the document type is set *before* the AI runs, so even if the AI disagrees (e.g., a health document in the Finance folder), the subfolder-based type sticks. The exception is **Sonstiges** (Miscellaneous): documents scanned into this folder have no workflow, so Paperless-GPT classifies them freely based on content.\n\nAdditional workflows automatically assign storage paths based on person tags, organizing files on disk as `Person/Correspondent/Year/Date Title`.\n\n### Step 5: Google Drive Sync\n\nThe ScanSnap uploads to Google Drive. A cron job syncs new files to the server every 5 minutes:\n\n```\n# /etc/cron.d/paperless-gdrive-consume\n# cron does not support backslash line continuations, so the entry stays on one line\n*/5 * * * * ubuntu rclone move \"Gdrive:Paperless/Consume/\" /home/ubuntu/paperless/consume/ --log-file=/tmp/paperless-gdrive-consume.log --log-level=INFO 2>/dev/null\n```\n\nThe `rclone move` command moves (not copies) files, so Google Drive acts as a temporary drop zone. Configure rclone with `rclone config` and set up Google Drive OAuth.\n\n### Step 6: Physical Document Tracking (ASN Barcodes)\n\nEvery physical document gets an **Archive Serial Number (ASN)** barcode label. I use **Avery Zweckform L4731REV-25** removable labels (189 per sheet) and their [online designer](https://www.avery-zweckform.com/vorlagen-software/design-drucken?ref=turalali.com) to print ASN barcodes. 
Paperless automatically reads the barcode using ZXING when the document is scanned.\n\nA cron script runs every minute to:\n\n- Find all documents with an ASN that aren't tagged \"Physical Filed\" yet\n- Add the \"Physical Filed\" tag and remove the \"Digital Only\" tag\n- Sync all ASN-to-document mappings to a Google Sheet as a safety backup\n\n``` bash\n#!/bin/bash\n# /home/ubuntu/paperless/scripts/asn-physical-filed.sh\nTOKEN=\"YOUR_PAPERLESS_API_TOKEN\"\nAPI=\"http://localhost:8001/api\"\n# Numeric tag IDs from your own Paperless instance\nPHYSICAL_FILED_ID=\"YOUR_PHYSICAL_FILED_TAG_ID\"\nDIGITAL_ONLY_ID=\"YOUR_DIGITAL_ONLY_TAG_ID\"\n\n# Find docs with ASN but without \"Physical Filed\" tag\nDOCS=$(curl -s \"${API}/documents/?archive_serial_number__isnull=false&tags__id__none=${PHYSICAL_FILED_ID}&page_size=100\" \\\n  -H \"Authorization: Token ${TOKEN}\")\n\n# Extract the matching document IDs from the JSON response\nIDS=$(echo \"$DOCS\" | python3 -c \"\nimport json,sys\ndata = json.load(sys.stdin)\nfor d in data.get('results', []):\n    print(d['id'])\n\" 2>/dev/null)\n\nfor doc_id in $IDS; do\n  # Build the new tag list: add \"Physical Filed\", drop \"Digital Only\"\n  TAGS=$(curl -s \"${API}/documents/${doc_id}/\" -H \"Authorization: Token ${TOKEN}\" | python3 -c \"\nimport json,sys\nd = json.load(sys.stdin)\ntags = d['tags']\nif ${PHYSICAL_FILED_ID} not in tags:\n    tags.append(${PHYSICAL_FILED_ID})\nif ${DIGITAL_ONLY_ID} in tags:\n    tags.remove(${DIGITAL_ONLY_ID})\nprint(json.dumps(tags))\n\" 2>/dev/null)\n  # Skip the update if the tag list could not be built\n  [ -z \"$TAGS\" ] && continue\n  curl -s -X PATCH \"${API}/documents/${doc_id}/\" \\\n    -H \"Authorization: Token ${TOKEN}\" \\\n    -H \"Content-Type: application/json\" \\\n    -d \"{\\\"tags\\\": ${TAGS}}\" > /dev/null 2>&1\ndone\n```\n\nThe Google Sheets sync script uses `gspread` and a Google service account to write all ASN mappings to a spreadsheet. 
This means even if I lose the entire Paperless server, I still have a record of which ASN corresponds to which document.\n\n### Step 7: Backup Strategy (3-2-1 Rule)\n\nBackups run automatically at 2:00 AM daily with three destinations and rotation:\n\n| Destination | Method | Retention |\n|---|---|---|\n| Google Drive | rclone upload | 7 daily / 4 weekly / 3 monthly |\n| GitHub (private repo) | AES-256-CBC encrypted | 7 daily / 4 weekly / 3 monthly |\n| Local | On-disk copy | Same rotation |\n\nThe backup script exports all documents from Paperless, uploads to Google Drive, encrypts and pushes to GitHub, rotates old backups, and verifies integrity. Healthchecks.io notifies me if anything fails.\n\n``` bash\n#!/bin/bash\n# backup.sh - Daily Paperless backup with rotation and monitoring\nset -e\n\nDATE=$(date +%Y-%m-%d)\nDAY_OF_WEEK=$(date +%u)\nDAY_OF_MONTH=$(date +%d)\n\nPAPERLESS_DIR=\"/home/ubuntu/paperless\"\nBACKUP_DIR=\"$PAPERLESS_DIR/backups\"\nCONFIG_REPO=\"/home/ubuntu/paperless-config\"\nGDRIVE_DIR=\"Backups/Paperless\"\n\n# Load .env (contains HEALTHCHECK_URL and ENCRYPTION_PASSPHRASE)\nsource \"$PAPERLESS_DIR/.env\"\n\n# Healthchecks.io integration\nhealthcheck_start() { curl -fsS -m 10 --retry 5 \"${HEALTHCHECK_URL}/start\" >/dev/null 2>&1 || true; }\nhealthcheck_success() { curl -fsS -m 10 --retry 5 \"$HEALTHCHECK_URL\" >/dev/null 2>&1 || true; }\nhealthcheck_fail() { curl -fsS -m 10 --retry 5 \"${HEALTHCHECK_URL}/fail\" >/dev/null 2>&1 || true; }\n\nhealthcheck_start\n\n# Export from Paperless (-sm = split manifest for faster imports)\ncd \"$PAPERLESS_DIR\"\ndocker compose exec -T paperless document_exporter ../export --zip -sm\n\nEXPORT_FILE=$(ls -t \"$PAPERLESS_DIR/export/export-\"*.zip 2>/dev/null | head -1)\n[ -z \"$EXPORT_FILE\" ] && { healthcheck_fail; exit 1; }\n\n# Determine backup type\nif [ \"$DAY_OF_MONTH\" == \"01\" ]; then\n    BACKUP_TYPE=\"monthly\"\nelif [ \"$DAY_OF_WEEK\" == \"7\" ]; then\n    BACKUP_TYPE=\"weekly\"\nelse\n    
BACKUP_TYPE=\"daily\"\nfi\nBACKUP_NAME=\"paperless-${BACKUP_TYPE}-${DATE}.zip\"\n\n# Copy locally and upload to Google Drive\ncp \"$EXPORT_FILE\" \"$BACKUP_DIR/$BACKUP_NAME\"\nrclone copy \"$BACKUP_DIR/$BACKUP_NAME\" \"Gdrive:$GDRIVE_DIR/$BACKUP_TYPE/\"\n\n# Rotate local and remote backups\nfind \"$BACKUP_DIR\" -name \"paperless-daily-*.zip\" -mtime +7 -delete\nfind \"$BACKUP_DIR\" -name \"paperless-weekly-*.zip\" -mtime +28 -delete\nfind \"$BACKUP_DIR\" -name \"paperless-monthly-*.zip\" -mtime +90 -delete\nrclone delete \"Gdrive:$GDRIVE_DIR/daily/\" --min-age 7d 2>/dev/null || true\nrclone delete \"Gdrive:$GDRIVE_DIR/weekly/\" --min-age 28d 2>/dev/null || true\nrclone delete \"Gdrive:$GDRIVE_DIR/monthly/\" --min-age 90d 2>/dev/null || true\n\n# Verify integrity\nunzip -t \"$BACKUP_DIR/$BACKUP_NAME\" >/dev/null 2>&1 || { healthcheck_fail; exit 1; }\n\n# Encrypted backup to GitHub\nif [ -n \"$ENCRYPTION_PASSPHRASE\" ] && [ -d \"$CONFIG_REPO\" ]; then\n    mkdir -p \"$CONFIG_REPO/encrypted-backups/$BACKUP_TYPE\"\n    openssl enc -aes-256-cbc -salt -pbkdf2 \\\n        -in \"$BACKUP_DIR/$BACKUP_NAME\" \\\n        -out \"$CONFIG_REPO/encrypted-backups/$BACKUP_TYPE/paperless-${BACKUP_TYPE}-${DATE}.zip.enc\" \\\n        -pass pass:\"$ENCRYPTION_PASSPHRASE\"\n\n    # Rotate encrypted backups\n    find \"$CONFIG_REPO/encrypted-backups/daily\" -name \"*.enc\" -mtime +7 -delete 2>/dev/null || true\n    find \"$CONFIG_REPO/encrypted-backups/weekly\" -name \"*.enc\" -mtime +28 -delete 2>/dev/null || true\n    find \"$CONFIG_REPO/encrypted-backups/monthly\" -name \"*.enc\" -mtime +90 -delete 2>/dev/null || true\n\n    cd \"$CONFIG_REPO\" && git add -A\n    git diff --staged --quiet || git commit -m \"Backup $DATE - $BACKUP_TYPE (encrypted)\" && git push origin main\nfi\n\nhealthcheck_success\n```\n\n### Step 8: Security Hardening\n\nThere are two approaches to making Paperless accessible remotely. 
I use both; pick whichever fits your threat model.\n\n#### Option A: Tailscale VPN (Private Access Only)\n\n[Tailscale](https://tailscale.com/?ref=turalali.com) is a zero-config WireGuard mesh VPN. Install it on your server and your devices, and Paperless becomes accessible only to machines on your private network, invisible to the rest of the internet.\n\n```\n# Install Tailscale\ncurl -fsSL https://tailscale.com/install.sh | sh\nsudo tailscale up --ssh\n\n# Expose Paperless via Tailscale HTTPS (only accessible from your tailnet)\ntailscale serve --bg --https=443 http://localhost:8001\n\n# Lock SSH to Tailscale only\nsudo ufw allow in on tailscale0 to any port 22\n# Remove the public SSH rule that Step 1 created via the OpenSSH profile\nsudo ufw delete allow OpenSSH\n```\n\nThe result: Paperless is available at `https://<your-machine>.tail*.ts.net` with a valid TLS certificate, accessible from any device on your tailnet: laptop, phone, tablet. No public ports, no certificate management, no dynamic DNS.\n\n**Pros:** Maximum security. The server has no public-facing ports at all. Even SSH is VPN-only.\n\n**Cons:** Every device needs Tailscale installed. 
Not suitable if you need public access.\n\n#### Option B: Cloudflare Tunnel (Internet-Facing)\n\nIf you need to access Paperless from any browser without installing a VPN client, [Cloudflare Tunnel](https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/?ref=turalali.com) gives you a public URL with DDoS protection and no exposed ports.\n\n```\n# Install cloudflared\ncurl -fsSL https://pkg.cloudflare.com/cloudflare-main.gpg | sudo tee /usr/share/keyrings/cloudflare-archive-keyring.gpg >/dev/null\necho \"deb [signed-by=/usr/share/keyrings/cloudflare-archive-keyring.gpg] https://pkg.cloudflare.com/cloudflared $(lsb_release -cs) main\" | sudo tee /etc/apt/sources.list.d/cloudflared.list\nsudo apt update && sudo apt install cloudflared\n\n# Authenticate and create tunnel\ncloudflared tunnel login\ncloudflared tunnel create paperless\ncloudflared tunnel route dns paperless docs.yourdomain.com\n\n# Run the tunnel (point it at Paperless, which the compose file publishes on host port 8001)\ncloudflared tunnel --url http://localhost:8001 run paperless\n```\n\n**Pros:** Accessible from any browser, no client software needed, free DDoS protection.\n\n**Cons:** Your application is exposed to the internet. Add [Cloudflare Access](https://developers.cloudflare.com/cloudflare-one/policies/access/?ref=turalali.com) or another auth layer (Authelia, Authentik) in front for additional protection.\n\n**My recommendation:** Use Tailscale for daily access and lock everything behind the VPN. 
If you need occasional public access for specific use cases, add a Cloudflare Tunnel with a Zero Trust Access policy on top.\n\nBaseline hardening for either option:\n\n- All containers run with `no-new-privileges: true`\n- Memory limits on every container to prevent runaway processes\n- Randomized 66-character `SECRET_KEY`\n- Redis container with a read-only root filesystem\n- All ports bound to `127.0.0.1` (localhost only)\n- **fail2ban** for SSH brute-force protection\n\n## The Numbers\n\nAfter running for several weeks:\n\n| Metric | Count |\n|---|---|\n| Documents managed | 900+ |\n| Correspondents (auto-detected) | 254 |\n| Document types | 7 |\n| Tags | 90+ |\n| Mail rules | 24 (across 3 email accounts) |\n| Active workflows | 21 |\n| Saved views | 13 |\n| Storage paths (per person) | 4 |\n| Docker containers | 9 |\n| Total disk usage | ~1.3 GB |\n\n## What I Learned\n\n**Subfolder-based classification beats tag-based approaches.** I tried three different methods: using consume subdirectories as tags, tracking tags with enforce workflows, and simple path-based workflows. The simplest approach (subfolder → document type workflow) won. Scanning into a named folder is faster and more reliable than relying on AI alone.\n\n**AI needs guardrails for document types.** Gemini is excellent at generating titles and identifying correspondents, but it sometimes reclassifies documents based on content rather than intent. A health insurance bill about a work injury might get classified as \"Health\" when you filed it under \"Work.\" The subfolder workflow running before AI solves this.\n\n**Physical-digital bridging matters.** The ASN barcode system with Google Sheets backup means I never lose track of where a physical document is stored. 
Even if my server dies, I can look up any document by its ASN.\n\n**Backup redundancy is worth the complexity.** Three backup destinations (Google Drive + encrypted GitHub + local) with automatic rotation mean I can rebuild the entire system from scratch on a new server in under an hour.\n\n**Start with fewer tags.** I ended up with 90+ tags, which is borderline too many. The AI tag suggestions work better with a focused, curated list. If I started over, I'd aim for 30-40 carefully chosen tags.\n\n*The entire system runs hands-off. Documents arrive by email or scanner, get classified by AI, filed by person, and backed up to three locations. The only manual step is sticking an ASN barcode label on physical documents before scanning, about 30 seconds per document.*", "url": "https://wpnews.pro/news/i-built-an-automated-document-management-system-with-paperless-ngx", "canonical_source": "https://turalali.com/how-i-built-a-fully-automated-document-management-system-with-paperless-ngx/", "published_at": "2026-05-29 10:44:06+00:00", "updated_at": "2026-05-29 10:45:58.748884+00:00", "lang": "en", "topics": ["ai-tools", "ai-products", "machine-learning", "computer-vision", "natural-language-processing"], "entities": ["Paperless-NGX", "GitHub"], "alternates": {"html": "https://wpnews.pro/news/i-built-an-automated-document-management-system-with-paperless-ngx", "markdown": "https://wpnews.pro/news/i-built-an-automated-document-management-system-with-paperless-ngx.md", "text": "https://wpnews.pro/news/i-built-an-automated-document-management-system-with-paperless-ngx.txt", "jsonld": "https://wpnews.pro/news/i-built-an-automated-document-management-system-with-paperless-ngx.jsonld"}}