sds-converter: Converting Safety Data Sheets to MHLW Standard JSON with Rust and LLMs

The article describes **sds-converter**, a Rust-based tool that uses Large Language Models (LLMs) to convert unstructured Safety Data Sheets (SDS) into Japan’s MHLW standard JSON format, which contains roughly 200 deeply nested fields and intentional typos. It supports multiple input formats (PDF, DOCX, XLSX, TXT) and various LLM providers (Anthropic, OpenAI, Google Gemini, local via Ollama), using parallel calls and automatic retries to handle diverse international SDS formats. The tool also offers validation, batch processing, and conversion to JIS Z 7253-compliant Word documents.

Safety Data Sheets (SDS) are mandatory documents for every chemical product: solvents, adhesives, industrial gases, cleaning agents. Every manufacturer that supplies a hazardous chemical must provide one. In Japan, the governing standard is JIS Z 7253, which defines 16 sections covering chemical identity, hazard classification, first aid, storage, transport information, and more.

The Ministry of Health, Labour and Welfare (MHLW) published a standard JSON schema in March 2025 for electronic SDS data exchange between chemical management systems. The schema has roughly 200 deeply nested fields covering all 16 sections.

The problem is that real SDS documents don't arrive structured to this schema. Even two documents that are both compliant with JIS Z 7253 will differ in ways that break rule-based parsers: "≥99.5%", "99.5% or higher", and "approximately 100%" all mean the same thing.

The MHLW schema compounds this: it has intentional typos that must be reproduced exactly. `HumanExposureAndEmergencyMeasuress` ends in a double s. `TestGuidline` is missing an e. `Desclaimer` has transposed letters. These are in the official spec, and validation fails if you "fix" them.
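The exact-spelling requirement is easy to break when a spell checker or a well-meaning reviewer touches the key names. A minimal sketch of a guard against that, using only the Rust standard library, might look like the following. The three official spellings are the ones quoted above; the `KeyCheck` enum and the `check_key` function are hypothetical names for illustration and are not part of sds-converter.

```rust
/// Pairs of (official spelling in the MHLW schema, tempting "corrected" spelling).
/// The official names are quoted from the article; everything else in this
/// sketch is illustrative and not taken from sds-converter itself.
const OFFICIAL_VS_CORRECTED: [(&str, &str); 3] = [
    ("HumanExposureAndEmergencyMeasuress", "HumanExposureAndEmergencyMeasures"),
    ("TestGuidline", "TestGuideline"),
    ("Desclaimer", "Disclaimer"),
];

#[derive(Debug, PartialEq)]
enum KeyCheck {
    /// The key matches an official (misspelled) name exactly.
    Official,
    /// The key is a "fixed" spelling; carries the official name to use instead.
    Corrected(&'static str),
    /// The key is not one of the three names tracked here.
    Unknown,
}

/// Classifies a JSON key against the three known intentional typos.
fn check_key(key: &str) -> KeyCheck {
    for &(official, corrected) in OFFICIAL_VS_CORRECTED.iter() {
        if key == official {
            return KeyCheck::Official;
        }
        if key == corrected {
            return KeyCheck::Corrected(official);
        }
    }
    KeyCheck::Unknown
}
```

Running a check like this over generated keys would catch a "corrected" `Disclaimer` before it reaches schema validation and would report `Desclaimer` as the name to emit.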
To handle SDS from international manufacturers (GHS/OSHA format) or Chinese suppliers (GB/T 16483 format) in the same pipeline, you'd need separate parsers for each format. Writing and maintaining those is impractical. I built sds-converter to handle this with an LLM instead.

    cargo install sds-converter

    # PDF → MHLW standard JSON
    export ANTHROPIC_API_KEY=sk-ant-...
    sds-converter to-json --input input.pdf --output output.json

    # MHLW JSON → JIS Z 7253-compliant Word document
    sds-converter to-docx --input output.json --output result.docx --lang ja

    # Schema validation
    sds-converter validate --input output.json

    # Extract raw text (no LLM call; useful for debugging)
    sds-converter extract-text --input input.pdf

Supported input: PDF, DOCX, XLSX, TXT. Text is pulled from the PDF or DOCX file. Use `extract-text` to inspect exactly what gets sent to the LLM, which is useful when extraction quality is lower than expected.

Note: Encrypted PDFs and scan-only image PDFs are not supported, because text extraction requires selectable text.

The 16 sections are split into two groups and extracted with two parallel LLM calls, roughly halving per-file latency. Results from both calls are merged. Sections skipped in the first pass are automatically retried. HTTP rate-limit responses (429/529) trigger exponential backoff retries (2s → 4s → 8s, up to 3 attempts). The merged result is written as MHLW SDS data exchange format v1.0 JSON.
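The retry policy described above (429/529 responses, delays of 2s, 4s, 8s) can be sketched with the standard library alone. This illustrates the policy and is not the tool's actual code: `is_rate_limited`, `backoff_delay`, and `retry_with_backoff` are hypothetical names, and the sketch assumes that "up to 3 attempts" means three retries after the initial call. The sleep function is passed in so the logic can be exercised without actually waiting.

```rust
use std::time::Duration;

/// HTTP status codes the article lists as rate-limit responses.
fn is_rate_limited(status: u16) -> bool {
    status == 429 || status == 529
}

/// Delay before retry number `retry` (1-based): 2s, 4s, 8s.
fn backoff_delay(retry: u32) -> Duration {
    Duration::from_secs(2u64.pow(retry))
}

/// Calls `request` until it succeeds, fails with a non-rate-limit status,
/// or the retry budget is used up. `request` returns Ok(body) or
/// Err(http_status). `sleep` is injected: pass `std::thread::sleep` for real
/// waiting, or a recording closure in tests.
fn retry_with_backoff<T, F, S>(max_retries: u32, mut request: F, mut sleep: S) -> Result<T, u16>
where
    F: FnMut() -> Result<T, u16>,
    S: FnMut(Duration),
{
    let mut retries_done = 0;
    loop {
        match request() {
            Ok(body) => return Ok(body),
            // Rate limited and budget left: wait, then try again.
            Err(status) if is_rate_limited(status) && retries_done < max_retries => {
                retries_done += 1;
                sleep(backoff_delay(retries_done));
            }
            // Any other error, or budget exhausted: give up.
            Err(status) => return Err(status),
        }
    }
}
```

Injecting the sleep function is a design choice for testability; a blocking caller would pass `std::thread::sleep` as the third argument, and an async client would need an async timer instead.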
    # OpenAI GPT (gpt-4o-mini by default)
    sds-converter to-json --input input.pdf --output output.json \
      --provider openai --api-key $OPENAI_API_KEY

    # Google Gemini (gemini-2.0-flash by default)
    sds-converter to-json --input input.pdf --output output.json \
      --provider gemini --api-key $GEMINI_API_KEY

    # Local LLM via Ollama (any OpenAI-compatible endpoint)
    sds-converter to-json --input input.pdf --output output.json \
      --provider local --base-url http://localhost:11434/v1 \
      --model llama3.2 --api-key dummy

`--quality` controls both the model and how much text is sent to the LLM per call. At `high`, the full document text, including the later sections (transport information, regulatory), is included. Use `--quality high` when complete 16-section coverage matters.

    sds-converter to-json \
      --input-dir ./pdfs/ \
      --output-dir ./json/ \
      --lang ja \
      --concurrency 4

`validate` checks structural completeness of the extracted JSON and returns warnings without hard-failing, so partial results are still usable.

    sds-converter validate --input output.json

Examples of what it checks:

When using the library, `convert_to_json` returns a SdsRoot, Vec