OpenAI-compatible proxy for DeepSeek V4 Flash with intelligent auto context compression features

This article describes a Python script that acts as an OpenAI-compatible proxy for the DeepSeek V4 Flash model and reduces API usage through automatic context compression. The proxy compresses system prompts, deduplicates markdown blocks and repeated user message segments, and summarizes the conversation when the token budget is exceeded. It also caches assistant reasoning and uses SHA-256 fingerprinting to remove boilerplate content. All client-supplied model parameters are ignored in favor of a fixed global configuration.

#!/usr/bin/env python3
"""
Zero-dependency OpenAI-compatible proxy for DeepSeek V4 Flash.

Author: g023
License: MIT

All client‑supplied model and generation parameters are ignored.
The proxy always uses the model, max output tokens, and other settings
defined in the global configuration (see --help and the constants below).

Optimisations:
  - System prompt compression (auto-summarized via DeepSeek API; originals
    stored in ./pre_sys/, summaries cached in ./post_sys/)
  - Markdown block deduplication (keeps only the latest occurrence in full)
  - Conversation summarisation (triggers when token budget is exceeded)
  - Assistant reasoning is cached to avoid redundant re‑generation
  - Inter‑message content fingerprinting & deduplication (Feature F-1)
      - Removes repeated boilerplate segments (environment_info, userMemory,
        reminderInstructions, etc.) from user messages across conversation turns.
      - Segments are hashed (SHA‑256), duplicates replaced with an empty string
        (or a minimal placeholder if the message becomes empty).
      - Per‑conversation fingerprint storage with LRU eviction.

Reads from local file K.dat for API key if DEEPSEEK_API_KEY env var is not set.

Just a proof-of-concept pet project. Do not expose this server to the internet.
"""

import argparse
import collections
import copy
import hashlib
import http.server
import json
import logging
import os
import re
import signal
import socketserver
import sys
import threading
import time
import urllib.error
import urllib.request
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple

# ==============================================================================
# Global configuration – these forcibly override every client request
# ==============================================================================
DEEPSEEK_BASE = "https://api.deepseek.com"
DEFAULT_MODEL = "deepseek-v4-flash"   # "deepseek-v4-flash" or "deepseek-v4-pro"; model that will always be used
MAX_CACHE_SIZE = 500                  # LRU cache for assistant reasoning
MAX_CONTEXT = 128000                  # tokens (context size)
SUMMARY_RATIO = 0.8                   # trigger summarisation at 80 % of MAX_CONTEXT
SUMMARY_MODEL = DEFAULT_MODEL         # model used for the summarisation call
MAX_OUTPUT_TOKENS = 128000            # max tokens to generate (overrides client)
THINKING_MODE = "auto"                # "enabled", "disabled", or "auto" (default)

# --------------------------------------------------------------------------
# Local file save toggles – set to False to disable disk writes
# --------------------------------------------------------------------------
SAVE_PREPOST_MSGS = False             # save pre/post message dumps to ./pre_msg/ and ./post_msg/
SAVE_PREPOST_SYSTEM = True            # save original/summarized system prompts to ./pre_sys/ and ./post_sys/

# --------------------------------------------------------------------------
# Retry configuration for summarisation calls
# --------------------------------------------------------------------------
SUMMARISE_MAX_RETRIES = 3
SUMMARISE_RETRY_BASE_SLEEP = 2.0      # seconds, doubled each attempt

# --------------------------------------------------------------------------
# Feature F-1: Inter‑message content fingerprinting & deduplication
# --------------------------------------------------------------------------
MAX_FINGERPRINT_HISTORY = 100         # max number of segments stored per conversation

# Known boilerplate XML tags – each as (open tag, close tag)
BOILERPLATE_PATTERNS = {
    "environment_info": "