{"slug": "sds-converter-converting-safety-data-sheets-to-mhlw-standard-json-with-rust-and", "title": "sds-converter: Converting Safety Data Sheets to MHLW Standard JSON with Rust and LLMs", "summary": "The article describes **sds-converter**, a Rust-based tool that uses Large Language Models (LLMs) to convert unstructured Safety Data Sheets (SDS) into Japan’s MHLW standard JSON format, which contains roughly 200 deeply nested fields and intentional typos. It supports multiple input formats (PDF, DOCX, XLSX, TXT) and various LLM providers (Anthropic, OpenAI, Google Gemini, local via Ollama), using parallel calls and automatic retries to handle diverse international SDS formats. The tool also offers validation, batch processing, and conversion to JIS Z 7253-compliant Word documents.", "body_md": "Safety Data Sheets (SDS) are mandatory documents for chemical products such as solvents, adhesives, industrial gases, and cleaning agents. Every manufacturer that supplies a hazardous chemical must provide one. In Japan, the governing standard is JIS Z 7253, which defines 16 sections covering chemical identity, hazard classification, first aid, storage, transport information, and more.\nThe Ministry of Health, Labour and Welfare (MHLW) published a standard JSON schema in March 2025 for electronic SDS data exchange between chemical management systems. The schema has roughly 200 deeply nested fields covering all 16 sections.\nThe problem is that real SDS documents don't arrive structured to this schema.\nEven two documents that both comply with JIS Z 7253 will differ in ways that break rule-based parsers:\n`\"≥99.5%\"`, `\"99.5% or higher\"`, and `\"approximately 100%\"` all mean the same thing.\nThe MHLW schema compounds this: it has intentional typos that must be reproduced exactly. `HumanExposureAndEmergencyMeasuress` ends in a double `s`. `TestGuidline` is missing an `e`. `Desclaimer` has an `e` where the `i` should be. 
These are in the official spec, and validation fails if you \"fix\" them.\nTo handle SDS from international manufacturers (GHS/OSHA format) or Chinese suppliers (GB/T 16483 format) in the same pipeline, a rule-based approach would need a separate parser for each format. Writing and maintaining those is impractical, so I built sds-converter to handle this with an LLM instead.\ncargo install sds-converter\n# PDF → MHLW standard JSON\nexport ANTHROPIC_API_KEY=sk-ant-...\nsds-converter to-json --input input.pdf --output output.json\n# MHLW JSON → JIS Z 7253-compliant Word document\nsds-converter to-docx --input output.json --output result.docx --lang ja\n# Schema validation\nsds-converter validate --input output.json\n# Extract raw text (no LLM call, useful for debugging)\nsds-converter extract-text --input input.pdf\nSupported input: PDF, DOCX, XLSX, TXT.\nText is first extracted from the input file. Use `extract-text` to inspect exactly what gets sent to the LLM, which helps when extraction quality is lower than expected.\nNote: Encrypted PDFs and scan-only (image) PDFs are not supported, because text extraction requires selectable text.\nThe 16 sections are split into two groups and extracted with two parallel LLM calls, roughly halving per-file latency.\nResults from both calls are merged. Sections skipped in the first pass are automatically retried. 
HTTP rate-limit and overload responses (429/529) trigger retries with exponential backoff (2s → 4s → 8s, up to 3 attempts).\nThe merged result is written as MHLW SDS data exchange format v1.0 JSON.\n# OpenAI GPT (gpt-4o-mini by default)\nsds-converter to-json --input input.pdf --output output.json \\\n  --provider openai --api-key \"$OPENAI_API_KEY\"\n# Google Gemini (gemini-2.0-flash by default)\nsds-converter to-json --input input.pdf --output output.json \\\n  --provider gemini --api-key \"$GEMINI_API_KEY\"\n# Local LLM via Ollama (any OpenAI-compatible endpoint)\nsds-converter to-json --input input.pdf --output output.json \\\n  --provider local --base-url http://localhost:11434/v1 \\\n  --model llama3.2 --api-key dummy\n`--quality` controls both the model and how much text is sent to the LLM per call.\nAt `high`, the full document text, including the later sections (transport information, regulatory), is included. Use `--quality high` when complete 16-section coverage matters.\nsds-converter to-json \\\n  --input-dir ./pdfs/ \\\n  --output-dir ./json/ \\\n  --lang ja \\\n  --concurrency 4\n`validate` checks the structural completeness of the extracted JSON and returns warnings without hard-failing, so partial results are still usable.\nsds-converter validate --input output.json\nExamples of what it checks:\nWhen using the library, `convert_to_json` returns a `(SdsRoot, Vec<String>)` tuple, so the warnings are surfaced inline.\n{\n  \"Datasheet\": {\n    \"IssueDate\": \"2024-03-31\",\n    \"SDS-SchemaVersionNo\": \"1.0\"\n  },\n  \"Identification\": {\n    \"TradeProductIdentity\": {\n      \"TradeNameJP\": \"Sample Product\"\n    },\n    \"SupplierInformation\": {\n      \"CompanyName\": \"Sample Corp\",\n      \"Phone\": \"03-0000-0000\"\n    }\n  }\n}\nThe full schema covers all 16 JIS Z 7253 sections with ~200 fields. 
The official spec and developer manual are on the MHLW website (Japanese).\n[dependencies]\nsds-converter-core = \"0.1\"\nuse sds_converter_core::{\n    converter::{AnthropicBackend, LlmConfig},\n    convert_to_json, ConvertConfig, Language,\n};\n// This example also needs tokio, anyhow, and serde_json in Cargo.toml.\n#[tokio::main]\nasync fn main() -> anyhow::Result<()> {\n    // Read the API key from the environment and use the default LLM settings.\n    let backend = AnthropicBackend::new(\n        std::env::var(\"ANTHROPIC_API_KEY\")?,\n        LlmConfig::default(),\n    );\n    let config = ConvertConfig {\n        source_language: Some(Language::Japanese),\n        output_language: Language::Japanese,\n        ..Default::default()\n    };\n    // Returns the parsed SDS together with any warnings.\n    let (sds, warnings) = convert_to_json(\n        std::path::Path::new(\"input.pdf\"), &backend, &config\n    ).await?;\n    for w in &warnings { eprintln!(\"WARN: {w}\"); }\n    std::fs::write(\"output.json\", serde_json::to_string_pretty(&sds)?)?;\n    Ok(())\n}\nuse sds_converter_core::{convert_from_json, ConvertConfig, Language, SdsRoot};\nfn main() -> anyhow::Result<()> {\n    // Load previously generated MHLW JSON.\n    let sds: SdsRoot = serde_json::from_str(&std::fs::read_to_string(\"output.json\")?)?;\n    let config = ConvertConfig {\n        output_language: Language::Japanese,\n        ..Default::default()\n    };\n    convert_from_json(&sds, std::path::Path::new(\"result.docx\"), &config)?;\n    Ok(())\n}\nuse sds_converter_core::{LlmBackend, SdsError};\nstruct MyBackend;\nimpl LlmBackend for MyBackend {\n    async fn complete(&self, system: &str, user: &str) -> Result<String, SdsError> {\n        // Call your LLM API, return the raw JSON string response\n        todo!()\n    }\n}\nsds-converter is the only open-source tool that supports the MHLW schema, runs entirely locally, and handles the full round-trip.\n`sds-converter-core`: library. LLM extraction, DOCX generation, MHLW schema types.\n`sds-converter`: CLI binary. 
`to-json`, `to-docx`, `validate`, `extract-text` subcommands.\nFeedback welcome, especially on section 3 component table extraction and non-Japanese document accuracy.", "url": "https://wpnews.pro/news/sds-converter-converting-safety-data-sheets-to-mhlw-standard-json-with-rust-and", "canonical_source": "https://dev.to/kent-tokyo/sds-converter-converting-safety-data-sheets-to-mhlw-standard-json-with-rust-and-llms-ihg", "published_at": "2026-05-22 23:09:11+00:00", "updated_at": "2026-05-23 00:03:47.348208+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "open-source", "data", "enterprise-software"], "entities": ["MHLW", "JIS Z 7253", "Rust", "sds-converter", "GHS", "OSHA", "GB/T 16483"], "alternates": {"html": "https://wpnews.pro/news/sds-converter-converting-safety-data-sheets-to-mhlw-standard-json-with-rust-and", "markdown": "https://wpnews.pro/news/sds-converter-converting-safety-data-sheets-to-mhlw-standard-json-with-rust-and.md", "text": "https://wpnews.pro/news/sds-converter-converting-safety-data-sheets-to-mhlw-standard-json-with-rust-and.txt", "jsonld": "https://wpnews.pro/news/sds-converter-converting-safety-data-sheets-to-mhlw-standard-json-with-rust-and.jsonld"}}