{"slug": "show-hn-parsehawk-100-local-document-ai-with-api-cli-and-web-ui", "title": "Show HN: ParseHawk – 100% Local Document AI with API, CLI, and Web UI", "summary": "ParseHawk, a new local-first document AI tool, launched on Hacker News, enabling users to extract structured JSON from PDFs, scans, and images entirely on their own hardware without sending data to third-party APIs. The tool supports API, CLI, and Web UI interfaces, runs on macOS Apple Silicon and Linux with NVIDIA GPUs, and is designed for sensitive documents like invoices and medical records.", "body_md": "**Local-first document AI. Run 100% locally by default, with API, CLI, and Web UI.**\n\n[Quickstart](#quickstart) ·\n[First extraction](#first-extraction) ·\n[API, CLI, and Web UI](#api-cli-and-web-ui) ·\n[Requirements](#requirements) ·\n[Development](#development)\n\nParseHawk turns PDFs, scans, images, text files, and Markdown into structured JSON without sending sensitive documents to a third-party AI API. It is built for developers and teams working with private data: invoices, receipts, contracts, internal documents, customer files, medical or financial records, and other unstructured inputs that should stay under your control.\n\nThe default setup runs fully locally. ParseHawk uses vLLM on Linux NVIDIA\nmachines and vLLM Metal on macOS Apple Silicon, so you can run practical\ndocument extraction on a server or even on your MacBook. 
You can drive the same\nworkflow from the browser, from `curl`, or from the `parsehawk` CLI.\n\n- Extract structured JSON from unstructured PDFs, scans, images, text, and Markdown\n- Define your own schemas for the data you want back\n- Run zero-shot extraction with only instructions and a schema\n- Add few-shot examples when a document type needs more guidance\n- Improve extraction quality without training a model\n- Improve extractors over time with better instructions, schemas, and examples\n- Get validated JSON output using JSON Schema Draft 2020-12\n- Keep files, jobs, extractors, and results local by default\n- Use the Web UI for humans and the REST API or CLI for scripts, services, and agents\n- Control both the local stack and the extraction API from one `parsehawk` CLI\n- Run on Linux with vLLM or on macOS Apple Silicon with vLLM Metal\n\nParseHawk runs on macOS Apple Silicon and Linux x86_64 with an NVIDIA GPU. Windows is not supported yet.\n\n## macOS Apple Silicon details\n\nRequired:\n\n- `uv`\n- Docker Desktop\n- Xcode Command Line Tools\n- Apple Silicon Mac with enough unified memory for NuExtract3-W4A16\n\nVerified:\n\n- MacBook Pro M3 Pro with 18 GB unified memory\n- MacBook Pro M3 Pro with 36 GB unified memory\n\nRecommended:\n\n- 16 GB unified memory minimum for the default local workflow\n- 32 GB or more for larger context lengths\n\n## Linux NVIDIA details\n\nRequired:\n\n- `uv`\n- Docker Engine\n- Docker Compose\n- NVIDIA driver\n- NVIDIA Container Toolkit\n- NVIDIA GPU with enough VRAM for NuExtract3-W4A16\n\nVerified:\n\n- NVIDIA L4 with 24 GB VRAM\n\nRecommended:\n\n- 16 GB VRAM minimum for the default local workflow\n- 24 GB VRAM or more for larger context lengths\n\nRun ParseHawk from a Git checkout with\n[uv](https://docs.astral.sh/uv/getting-started/installation/) and install the\nCLI as an editable local tool:\n\n```\ngit clone https://github.com/parsehawk/parsehawk.git\ncd parsehawk\nuv tool install --editable .\nparsehawk 
start\n```\n\nThen open:\n\n- Web UI: [http://127.0.0.1:5173](http://127.0.0.1:5173)\n- API docs: [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs)\n- OpenAPI JSON: [http://127.0.0.1:8000/openapi.json](http://127.0.0.1:8000/openapi.json)\n\nStop ParseHawk:\n\n```\nparsehawk stop\n```\n\nCheck your local setup:\n\n```\nparsehawk doctor\n```\n\nThe easiest first run is image-to-JSON extraction with the bundled receipt image\nand the seeded prebuilt `Receipt` extractor.\n\n- Start ParseHawk with `parsehawk start`.\n- Open [http://127.0.0.1:5173](http://127.0.0.1:5173).\n- Upload `tests/fixtures/receipt/receipt.jpg`.\n- Select the prebuilt `Receipt` extractor.\n- Select the uploaded file and click **Run extraction**.\n- Inspect the extracted fields and JSON result.\n\nExpected fields include:\n\n```\n{\n  \"merchant_name\": \"PARSEHAWK COFFEE\",\n  \"receipt_id\": \"R-1001\",\n  \"date\": \"2026-06-21\",\n  \"total\": 11.22,\n  \"currency\": \"EUR\"\n}\n```\n\n```\nparsehawk files upload tests/fixtures/receipt/receipt.jpg\nparsehawk extractors list\nparsehawk extract \\\n  tests/fixtures/receipt/receipt.jpg \\\n  --extractor extractor_... \\\n  --wait\n```\n\nUse the `Receipt` extractor ID from `extractors list`.\n\n```\nAPI=http://127.0.0.1:8000\n\nEXTRACTOR_ID=$(\n  curl -s \"$API/v1/extractors\" |\n    jq -r '.[] | select(.name==\"Receipt\" and .is_prebuilt==true) | .id'\n)\n\nFILE_ID=$(\n  curl -s -X POST \"$API/v1/files\" \\\n    -F \"upload=@tests/fixtures/receipt/receipt.jpg;type=image/jpeg\" |\n    jq -r '.id'\n)\n\nJOB_ID=$(\n  curl -s -X POST \"$API/v1/jobs\" \\\n    -H \"Content-Type: application/json\" \\\n    -d \"{\\\"extractor_id\\\":\\\"$EXTRACTOR_ID\\\",\\\"file_id\\\":\\\"$FILE_ID\\\"}\" |\n    jq -r '.id'\n)\n\ncurl -s \"$API/v1/jobs/$JOB_ID\" | jq .\n```\n\nJobs are asynchronous. Poll `GET /v1/jobs/{job_id}` until `status` is\n`completed` or `failed`.\n\nParseHawk exposes one local API. 
The CLI and Web UI are clients of that API.\nThe CLI has two jobs: it controls the local ParseHawk stack (`start`, `stop`,\n`status`, `doctor`, `restart`) and it works with the data plane (`files`,\n`extractors`, `schemas`, `jobs`, and one-shot `extract`).\n\nCore resources:\n\n```\nPOST   /v1/files\nGET    /v1/files\nGET    /v1/files/{file_id}\nGET    /v1/files/{file_id}/content\nDELETE /v1/files/{file_id}\n\nPOST   /v1/schemas/validate\n\nPOST   /v1/extractors\nGET    /v1/extractors\nGET    /v1/extractors/{extractor_id}\nPATCH  /v1/extractors/{extractor_id}\nDELETE /v1/extractors/{extractor_id}\n\nPOST   /v1/jobs\nGET    /v1/jobs\nGET    /v1/jobs/{job_id}\nDELETE /v1/jobs/{job_id}\n```\n\nJobs return the canonical extracted JSON inline as `job.result.data` once\ncompleted.\n\nUseful CLI commands:\n\n```\nparsehawk files upload document.pdf\nparsehawk files list\nparsehawk schemas validate schema.json\nparsehawk extractors create --name invoice_v1 --schema schema.json --instructions \"Extract invoice fields.\"\nparsehawk jobs create --extractor extractor_... 
--file-id file_...\nparsehawk jobs get job_...\nparsehawk extract document.pdf --schema schema.json --instructions \"Extract invoice fields.\" --wait\n```\n\nPublic IDs are TypeID-style strings with resource prefixes such as `file_...`,\n`extractor_...`, and `job_...`.\n\nAn extractor combines:\n\n- a name\n- natural-language instructions\n- JSON Schema Draft 2020-12\n- optional few-shot examples\n- optional thinking mode\n\nA minimal extractor schema:\n\n```\n{\n  \"type\": \"object\",\n  \"properties\": {\n    \"invoice_number\": {\n      \"type\": [\"string\", \"null\"],\n      \"description\": \"The invoice number exactly as shown on the document.\"\n    },\n    \"total_amount\": {\n      \"type\": [\"number\", \"null\"],\n      \"description\": \"The final total amount to pay.\"\n    }\n  },\n  \"required\": [\"invoice_number\", \"total_amount\"],\n  \"additionalProperties\": false\n}\n```\n\nFew-shot examples can use inline text or uploaded files:\n\n```\n{\n  \"name\": \"invoice_v1\",\n  \"instructions\": \"Extract the invoice fields exactly.\",\n  \"schema\": {\n    \"type\": \"object\",\n    \"properties\": {\n      \"invoice_number\": { \"type\": [\"string\", \"null\"] }\n    },\n    \"required\": [\"invoice_number\"],\n    \"additionalProperties\": false\n  },\n  \"examples\": [\n    {\n      \"input\": { \"type\": \"text\", \"text\": \"Invoice #A-123\" },\n      \"output\": { \"invoice_number\": \"A-123\" }\n    },\n    {\n      \"input\": { \"type\": \"file\", \"file_id\": \"file_...\" },\n      \"output\": { \"invoice_number\": \"B-456\" }\n    }\n  ]\n}\n```\n\nParseHawk validates model output against the schema and stores the canonical\nresult under `job.result.data`.\n\nThe schema dialect is documented in\n[docs/schemas/parsehawk-extraction-schema.schema.json](/parsehawk/parsehawk/blob/main/docs/schemas/parsehawk-extraction-schema.schema.json).\nIt supports JSON Schema plus optional `x-parsehawk.semantic` metadata 
for\nNuExtract3-oriented scalar hints.\n\nThe default model is:\n\n```\nnumind/NuExtract3-W4A16\n```\n\nParseHawk talks to the runtime through an OpenAI-compatible API. On macOS, the runtime runs on the host through vLLM Metal because Metal acceleration is not available inside a normal Linux container. On Linux, the runtime runs as part of Docker Compose.\n\nCurrent defaults:\n\n| Setting | Default |\n|---|---|\n| vLLM package | `vllm==0.23.0` |\n| Linux runtime image | `vllm/vllm-openai:v0.23.0` |\n| Model | `numind/NuExtract3-W4A16` |\n| GPU memory utilization | `0.5` |\n| Max model length | `8192` by default, `32768` on larger Apple Silicon Macs |\n| PDF render DPI | `170` |\n| PDF max pages | `25` |\n\nCommon overrides:\n\n```\nPARSEHAWK_VLLM_MAX_MODEL_LEN=16384 parsehawk start\nPARSEHAWK_VLLM_GPU_MEMORY_UTILIZATION=0.6 parsehawk start\nPARSEHAWK_VLLM_MODEL=numind/NuExtract3-W4A16 parsehawk start\nPARSEHAWK_VLLM_IMAGE=vllm/vllm-openai:v0.23.0 parsehawk start\n```\n\nParseHawk uses Pydantic settings. Common environment variables:\n\n| Environment variable | Default | Description |\n|---|---|---|\n| `PARSEHAWK_DATA_DIR` | `data` | Local storage directory for SQLite, uploaded files, logs, and local state. |\n| `PARSEHAWK_DATABASE_PATH` | `data/parsehawk.db` | SQLite database path. |\n| `PARSEHAWK_LOG_LEVEL` | `INFO` | Log level for API, worker, runtime, and Web UI logs. |\n| `PARSEHAWK_LOG_MODEL_IO` | `false` | When `true` and `PARSEHAWK_LOG_LEVEL=DEBUG`, log model-runtime request and response JSON from the API/worker process. Image data URLs are redacted. |\n| `PARSEHAWK_INFERENCE_ENGINE` | `none` | API/worker inference engine. `parsehawk start` sets this to `vllm` when a runtime is configured. |\n| `PARSEHAWK_VLLM_BASE_URL` | `http://127.0.0.1:8080/v1` | OpenAI-compatible model runtime URL. |\n| `PARSEHAWK_VLLM_MODEL` | `numind/NuExtract3-W4A16` | Model name sent to the runtime. |\n| `PARSEHAWK_VLLM_MAX_MODEL_LEN` | platform-specific | vLLM context length. Overrides the automatic local default. |\n| `PARSEHAWK_VLLM_MAX_NUM_SEQS` | `128` | Linux vLLM maximum concurrent decode sequences. |\n| `PARSEHAWK_VLLM_GPU_MEMORY_UTILIZATION` | `0.5` | vLLM memory reservation fraction. |\n| `PARSEHAWK_VLLM_IMAGE` | `vllm/vllm-openai:v0.23.0` | Linux Docker runtime image. |\n| `PARSEHAWK_VLLM_CACHE_HOME` | `~/.cache/vllm` | Linux host cache for vLLM compile artifacts. |\n| `PARSEHAWK_PDF_MAX_PAGES` | `25` | Maximum PDF pages rendered for one extraction. |\n| `PARSEHAWK_PDF_RENDER_DPI` | `170` | PDF page image render DPI. |\n| `PARSEHAWK_TELEMETRY_DISABLED` | `false` | When truthy, disables anonymous usage analytics. |\n\nCLI config:\n\n```\nparsehawk config list\nparsehawk config set log.level DEBUG\nparsehawk restart\n```\n\nParseHawk collects **anonymous usage analytics**. Two events are sent to\n[PostHog](https://posthog.com):\n\n- `install`: once per install, the first time you start ParseHawk.\n- `run_started`: each time a user starts an extraction run.\n\nEach event carries only coarse, non-identifying data:\n\n- a random per-install id stored in `data/telemetry-id`\n- the input type (`file` or `text`, on runs)\n- the ParseHawk version and your operating system name\n- an approximate location (country/region)\n\nParseHawk never sends file contents, file names, extractor instructions,\nschemas, or extracted data, and it never creates a personal profile from the\nper-install id. 
The first time you run `parsehawk start` or `parsehawk dev`, you\nwill see a notice describing this.\n\nTo opt out, set either of these before starting ParseHawk:\n\n```\nexport PARSEHAWK_TELEMETRY_DISABLED=1\nexport DO_NOT_TRACK=1\n```\n\nWhen ParseHawk runs in Docker, these variables are passed through to the API and worker containers automatically.\n\nMaintainers can tag internal usage instead of dropping it:\n\n```\nexport PARSEHAWK_TELEMETRY_INTERNAL=1\n```\n\nBy default ParseHawk stores local state under `data/`:\n\n```\ndata/\n  parsehawk.db\n  files/\n  logs/\n  parsehawk-state.json\n  telemetry-id\n```\n\nStop ParseHawk before deleting `data/`:\n\n```\nparsehawk stop\nrm -rf data\nparsehawk start\n```\n\nIf `data/` is deleted while ParseHawk is still running, old processes can keep\nserving from already-open SQLite handles. `parsehawk start` refuses to start\nwhen target ports are already occupied without a live state file. In that case,\nstop the process using the port and start again.\n\nDevelopment requires:\n\n- `git`\n- `just`\n- `uv`\n- `pnpm`\n\nUseful commands:\n\n```\njust setup          # install dependencies and pre-commit hooks\njust start          # product-like Docker mode\njust dev            # local-source development mode\njust web-dev        # Web UI dev server only\njust smoke          # local smoke flow\njust test           # Python tests\njust e2e            # local end-to-end API tests (needs the model runtime up)\njust format         # format Python\njust lint           # Ruff linting\njust typecheck      # ty type checking\njust web-typecheck  # TypeScript checks\njust web-test       # Web UI tests\njust web-build      # production Web UI build\njust check          # all standard checks\njust hooks-run      # run pre-commit on all files\n```\n\nPre-commit hooks are not installed automatically by Git. 
Run this once per clone:\n\n```\njust setup\n```\n\nThe hooks run Ruff, ty, Python tests, Web UI typecheck, and Web UI tests. CI should still run the same checks; hooks are just the fast local feedback loop.\n\nDevelopment mode:\n\n```\nparsehawk dev\n```\n\nProduct-like local mode:\n\n```\nparsehawk start\n```\n\nStart with the built-in health checks:\n\nCheck status:\n\n```\nparsehawk status\n```\n\nRead logs:\n\n```\nls data/logs\ntail -f data/logs/api.log\ntail -f data/logs/worker.log\ntail -f data/logs/runtime.log\n```\n\nRestart:\n\n```\nparsehawk restart\n```\n\nIf Docker or the runtime gets into a strange state, stop ParseHawk before removing local data:\n\n```\nparsehawk stop\nrm -rf data\nparsehawk start\n```\n\nIf the model runtime is slow to become ready, give it a few minutes on first startup while vLLM loads model weights, profiles memory, and warms kernels.\n\nTo start only the API and Web UI without local inference:\n\n```\nparsehawk start --runtime none\n```\n\nParseHawk stands on excellent open-source projects, including:\n\n- [FastAPI](https://github.com/fastapi/fastapi) for the API framework and OpenAPI docs\n- [vLLM](https://github.com/vllm-project/vllm) and [vLLM Metal](https://github.com/vllm-project/vllm-metal) for local model serving\n- [NuExtract3](https://huggingface.co/numind/NuExtract3-W4A16) for the default extraction model\n- [Pydantic](https://github.com/pydantic/pydantic), [Ruff](https://github.com/astral-sh/ruff), and [uv](https://github.com/astral-sh/uv) for the Python toolchain\n- [React](https://github.com/facebook/react), [Vite](https://github.com/vitejs/vite), and [Tailwind CSS](https://github.com/tailwindlabs/tailwindcss) for the Web UI\n\nNear-term focus:\n\n- make the macOS and Linux runtime paths boringly reliable\n- publish an installable CLI package\n- improve the Web UI schema builder\n- add stronger end-to-end runtime smoke tests\n- document deployment options for VPS and container platforms\n\nLater:\n\n- Python SDK\n- migrations and PostgreSQL 
support\n- batch extraction\n- review/correction workflows\n- eval tooling\n- bring-your-own OpenAI-compatible runtime\n\nParseHawk is developed by Totoy GmbH in Vienna, Austria. If you are interested\nin an enterprise deployment, private-cloud setup, or managed infrastructure for\nsensitive document workflows, contact [support@totoy.ai](mailto:support@totoy.ai).\n\nParseHawk follows SemVer.\n\nUntil `v1.0.0`, ParseHawk is in developer preview. Breaking changes may happen\nin any minor release, for example from `v0.1.0` to `v0.2.0`.\n\nPatch releases, such as `v0.1.1`, are intended to be backward-compatible bug\nfixes for that minor line.\n\nWe will move to `v1.0.0` once the core CLI commands, the REST API, and the\nconfig file format are stable enough for users to rely on.\n\nParseHawk is open source under the Apache-2.0 license. See [LICENSE](/parsehawk/parsehawk/blob/main/LICENSE).\n\nThird-party dependencies retain their own licenses.", "url": "https://wpnews.pro/news/show-hn-parsehawk-100-local-document-ai-with-api-cli-and-web-ui", "canonical_source": "https://github.com/parsehawk/parsehawk", "published_at": "2026-06-25 10:57:50+00:00", "updated_at": "2026-06-25 11:14:29.846115+00:00", "lang": "en", "topics": ["ai-tools", "ai-products", "natural-language-processing", "developer-tools"], "entities": ["ParseHawk", "vLLM", "NuExtract3", "NVIDIA", "Apple", "Docker", "GitHub"], "alternates": {"html": "https://wpnews.pro/news/show-hn-parsehawk-100-local-document-ai-with-api-cli-and-web-ui", "markdown": "https://wpnews.pro/news/show-hn-parsehawk-100-local-document-ai-with-api-cli-and-web-ui.md", "text": "https://wpnews.pro/news/show-hn-parsehawk-100-local-document-ai-with-api-cli-and-web-ui.txt", "jsonld": "https://wpnews.pro/news/show-hn-parsehawk-100-local-document-ai-with-api-cli-and-web-ui.jsonld"}}