# sds-converter: Converting Safety Data Sheets to MHLW Standard JSON with Rust and LLMs

> Source: <https://dev.to/kent-tokyo/sds-converter-converting-safety-data-sheets-to-mhlw-standard-json-with-rust-and-llms-ihg>
> Published: 2026-05-22 23:09:11+00:00

Safety Data Sheets (SDS) are mandatory documents for every chemical product — solvents, adhesives, industrial gases, cleaning agents. Every manufacturer that supplies a hazardous chemical must provide one. In Japan, the governing standard is JIS Z 7253, which defines 16 sections covering chemical identity, hazard classification, first aid, storage, transport information, and more.
The Ministry of Health, Labour and Welfare (MHLW) published a standard JSON schema in March 2025 for electronic SDS data exchange between chemical management systems. The schema has roughly 200 deeply nested fields covering all 16 sections.
The problem is that real SDS documents don't arrive structured to this schema.
Even two documents both compliant with JIS Z 7253 will differ in ways that break rule-based parsers:
`"≥99.5%"`, `"99.5% or higher"`, and `"approximately 100%"` all mean the same thing.

The MHLW schema compounds this: it has typos in its field names that must be reproduced exactly. `HumanExposureAndEmergencyMeasuress` ends in a double s. `TestGuidline` is missing an e. `Desclaimer` has an e where the i belongs. These are in the official spec, and validation fails if you "fix" them.
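One way to guard against well-meaning "corrections" is a small check that scans generated JSON for the tempting spellings. This is a minimal sketch, not code from sds-converter: the function name and the lookup table are hypothetical, and only the three official spellings come from the text above.

```rust
/// Pairs of (tempting "corrected" spelling, official MHLW spelling).
/// Illustrative only: the real schema may contain more such names.
const OFFICIAL_SPELLINGS: [(&str, &str); 3] = [
    ("HumanExposureAndEmergencyMeasures", "HumanExposureAndEmergencyMeasuress"),
    ("TestGuideline", "TestGuidline"),
    ("Disclaimer", "Desclaimer"),
];

/// Returns the official spelling for every "corrected" key found in `json`.
/// Keys are matched together with their quotes, so the official
/// `"HumanExposureAndEmergencyMeasuress"` does not trigger a false positive.
fn find_corrected_keys(json: &str) -> Vec<&'static str> {
    OFFICIAL_SPELLINGS
        .iter()
        .filter(|(corrected, _)| {
            let quoted = format!("\"{corrected}\"");
            json.contains(quoted.as_str())
        })
        .map(|(_, official)| *official)
        .collect()
}
```

An empty result means none of the known names were "fixed"; anything else lists the official spellings that should be restored.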
To handle SDS from international manufacturers (GHS/OSHA format) or Chinese suppliers (GB/T 16483 format) in the same pipeline, you'd need separate parsers for each format. Writing and maintaining those is impractical. I built sds-converter to handle this with an LLM instead.
```bash
cargo install sds-converter

# PDF → MHLW standard JSON
export ANTHROPIC_API_KEY=sk-ant-...
sds-converter to-json --input input.pdf --output output.json

# MHLW JSON → JIS Z 7253-compliant Word document
sds-converter to-docx --input output.json --output result.docx --lang ja

# Schema validation
sds-converter validate --input output.json

# Extract raw text (no LLM call; useful for debugging)
sds-converter extract-text --input input.pdf
```

Supported input: PDF, DOCX, XLSX, TXT.
Text is pulled from the PDF or DOCX file. Use `extract-text` to inspect exactly what gets sent to the LLM, which is useful when extraction quality is lower than expected.
Note: Encrypted PDFs and scan-only (image) PDFs are not supported — text extraction requires selectable text.
The 16 sections are split into two groups and extracted with two parallel LLM calls, roughly halving per-file latency.
Results from both calls are merged. Sections skipped in the first pass are automatically retried. HTTP rate-limit responses (429/529) trigger exponential backoff retries (2s → 4s → 8s, up to 3 attempts).
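The backoff schedule above can be sketched with the standard library alone. This is an illustration rather than the tool's actual retry code: `Attempt`, `backoff_delay` and `with_backoff` are hypothetical names, and "up to 3 attempts" is read here as three retries after the initial call, which matches the three delays listed.

```rust
use std::time::Duration;

/// Outcome of one call, as seen by the retry loop.
enum Attempt<T> {
    Done(T),
    /// Stands in for an HTTP 429 or 529 response.
    RateLimited,
}

/// Delay before retry number `retry` (0-based): 2s, 4s, 8s.
fn backoff_delay(retry: u32) -> Duration {
    Duration::from_secs(2u64 << retry)
}

/// Calls `op` once, then retries up to `max_retries` times while it reports
/// rate limiting, waiting via `sleep` between calls.
/// Returns `None` if every call was rate-limited.
fn with_backoff<T>(
    max_retries: u32,
    mut op: impl FnMut() -> Attempt<T>,
    mut sleep: impl FnMut(Duration),
) -> Option<T> {
    for retry in 0..=max_retries {
        match op() {
            Attempt::Done(value) => return Some(value),
            Attempt::RateLimited if retry < max_retries => sleep(backoff_delay(retry)),
            Attempt::RateLimited => break,
        }
    }
    None
}
```

Passing the sleeper as a parameter keeps the schedule testable without real waiting; in a blocking program `std::thread::sleep` can be passed directly.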
The merged result is written as MHLW SDS data exchange format v1.0 JSON.
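The two-call split, the merge, and the retry of skipped sections can be sketched as follows. Several things here are assumptions for illustration: the 1-8 / 9-16 grouping, the `Sections` map type, and the use of scoped OS threads as a stand-in for whatever async runtime the tool really uses. The `extract` closure stands in for one LLM call.

```rust
use std::collections::BTreeMap;
use std::thread;

/// Section number (1..=16) mapped to the JSON fragment extracted for it.
type Sections = BTreeMap<u8, String>;

/// Runs `extract` for two section groups in parallel, merges the results,
/// then asks once more for any section that is still missing.
fn extract_in_two_calls<F>(extract: F) -> Sections
where
    F: Fn(&[u8]) -> Sections + Sync,
{
    // Assumed grouping; the article does not state the real split.
    let first: Vec<u8> = (1..=8).collect();
    let second: Vec<u8> = (9..=16).collect();

    let (mut merged, late) = thread::scope(|s| {
        let a = s.spawn(|| extract(&first));
        let b = s.spawn(|| extract(&second));
        (
            a.join().expect("first call panicked"),
            b.join().expect("second call panicked"),
        )
    });
    merged.extend(late);

    // Single retry pass for sections skipped in the first pass.
    let missing: Vec<u8> = (1..=16).filter(|n| !merged.contains_key(n)).collect();
    if !missing.is_empty() {
        merged.extend(extract(&missing));
    }
    merged
}
```

Keying the merge by section number means the order in which the two calls finish does not affect the result.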
```bash
# OpenAI GPT (gpt-4o-mini by default)
sds-converter to-json --input input.pdf --output output.json \
  --provider openai --api-key $OPENAI_API_KEY

# Google Gemini (gemini-2.0-flash by default)
sds-converter to-json --input input.pdf --output output.json \
  --provider gemini --api-key $GEMINI_API_KEY

# Local LLM via Ollama (any OpenAI-compatible endpoint)
sds-converter to-json --input input.pdf --output output.json \
  --provider local --base-url http://localhost:11434/v1 \
  --model llama3.2 --api-key dummy
```

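The provider defaults shown in the commands above could be modelled roughly like this. It is a hypothetical sketch, not the CLI's real argument handling: the `Provider` enum and both functions are invented for illustration, `"anthropic"` is an assumed flag value, and the requirement that local endpoints name a model is also an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Provider {
    Anthropic,
    OpenAi,
    Gemini,
    Local,
}

/// Maps a `--provider` value to a variant. "anthropic" is an assumed
/// spelling; the article only shows openai, gemini and local.
fn parse_provider(flag: &str) -> Option<Provider> {
    match flag {
        "anthropic" => Some(Provider::Anthropic),
        "openai" => Some(Provider::OpenAi),
        "gemini" => Some(Provider::Gemini),
        "local" => Some(Provider::Local),
        _ => None,
    }
}

/// Picks the model: an explicit `--model` wins, otherwise the provider default.
fn resolve_model(provider: Provider, explicit: Option<&str>) -> Result<String, String> {
    if let Some(model) = explicit {
        return Ok(model.to_string());
    }
    match provider {
        Provider::OpenAi => Ok("gpt-4o-mini".to_string()),
        Provider::Gemini => Ok("gemini-2.0-flash".to_string()),
        // The Anthropic default is not stated in the article, so the sketch does not guess.
        Provider::Anthropic => Err("default model not stated; pass --model".to_string()),
        // Assumption: a local endpoint has no sensible default model.
        Provider::Local => Err("local endpoints need an explicit --model".to_string()),
    }
}
```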
`--quality` controls both the model and how much text is sent to the LLM per call.

At `high`, the full document text including the later sections (transport information, regulatory) is included. Use `--quality high` when complete 16-section coverage matters.
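To see why later sections can drop out below `high`, here is a sketch that assumes the lower levels send only a prefix of the extracted text. The prefix strategy, the character-based budget and both function names are assumptions; the article does not give the actual limits. The helper cuts on a character boundary, which matters for Japanese text where slicing a `&str` at an arbitrary byte offset would panic.

```rust
/// Returns the prefix of `text` holding at most `max_chars` characters,
/// always cutting on a UTF-8 character boundary.
fn truncate_chars(text: &str, max_chars: usize) -> &str {
    match text.char_indices().nth(max_chars) {
        // Byte offset of the first character that does not fit.
        Some((byte_idx, _)) => &text[..byte_idx],
        // The text is already within the budget.
        None => text,
    }
}

/// `budget == None` models the full-text case (the `high` level);
/// `Some(n)` models a lower level with a hypothetical limit of `n` characters.
fn text_for_llm(text: &str, budget: Option<usize>) -> &str {
    match budget {
        Some(n) => truncate_chars(text, n),
        None => text,
    }
}
```

Under this assumption, sections near the end of a long SDS (transport, regulatory) are the first to be cut off.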
```bash
sds-converter to-json \
  --input-dir ./pdfs/ \
  --output-dir ./json/ \
  --lang ja \
  --concurrency 4
```

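A bounded worker pool is one way to picture what `--concurrency 4` does for a directory of files. The sketch below uses scoped threads from the standard library as a stand-in (the real tool may well use an async runtime instead), and `convert_batch` is a hypothetical name. The `convert` closure stands in for converting one file.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Applies `convert` to every input using at most `concurrency` worker threads.
/// Results come back in input order.
fn convert_batch<T, R, F>(inputs: &[T], concurrency: usize, convert: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    // Shared cursor: each worker claims the next unprocessed index.
    let next = AtomicUsize::new(0);
    let results: Mutex<Vec<Option<R>>> =
        Mutex::new((0..inputs.len()).map(|_| None).collect());

    thread::scope(|s| {
        let workers = concurrency.clamp(1, inputs.len().max(1));
        for _ in 0..workers {
            s.spawn(|| loop {
                let i = next.fetch_add(1, Ordering::Relaxed);
                let Some(input) = inputs.get(i) else { break };
                let out = convert(input);
                results.lock().unwrap()[i] = Some(out);
            });
        }
    });

    results
        .into_inner()
        .unwrap()
        .into_iter()
        .map(|slot| slot.expect("every slot is filled by a worker"))
        .collect()
}
```

The lock is held only while storing a result, so slow conversions do not block each other.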
`validate` checks structural completeness of the extracted JSON and returns warnings without hard-failing, so partial results are still usable.

```bash
sds-converter validate --input output.json
```

Examples of what it checks:
When using the library, `convert_to_json` returns a `(SdsRoot, Vec<String>)` tuple, so the warnings are surfaced inline.
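The "warn, don't fail" approach can be illustrated with a check that collects messages instead of returning an error. This is a hypothetical sketch: the real validator works on `SdsRoot`, whereas this one takes a flattened map of dotted paths, and the three expected fields are simply borrowed from the sample output in this article rather than from the tool's actual rule set.

```rust
use std::collections::HashMap;

/// Collects a warning for every expected field that is missing or blank.
/// Never fails: an incomplete document just yields a longer list.
fn collect_warnings(fields: &HashMap<&str, &str>) -> Vec<String> {
    // Illustrative subset of paths; not the tool's real checklist.
    const EXPECTED: [&str; 3] = [
        "Datasheet.IssueDate",
        "Identification.TradeProductIdentity.TradeNameJP",
        "Identification.SupplierInformation.CompanyName",
    ];
    EXPECTED
        .iter()
        .filter(|path| fields.get(*path).map_or(true, |v| v.trim().is_empty()))
        .map(|path| format!("missing or empty field: {path}"))
        .collect()
}
```

The caller decides what to do with the list, for example printing each entry to stderr while still writing the partial JSON.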
```json
{
  "Datasheet": {
    "IssueDate": "2024-03-31",
    "SDS-SchemaVersionNo": "1.0"
  },
  "Identification": {
    "TradeProductIdentity": {
      "TradeNameJP": "Sample Product"
    },
    "SupplierInformation": {
      "CompanyName": "Sample Corp",
      "Phone": "03-0000-0000"
    }
  }
}
```

The full schema covers all 16 JIS Z 7253 sections with ~200 fields. The official spec and developer manual are on the MHLW website (Japanese).
```toml
[dependencies]
sds-converter-core = "0.1"
```

```rust
// Document → MHLW JSON.
// Besides sds-converter-core, this example needs tokio, anyhow and serde_json in Cargo.toml.
use sds_converter_core::{
    converter::{AnthropicBackend, LlmConfig},
    convert_to_json, ConvertConfig, Language,
};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let backend = AnthropicBackend::new(
        std::env::var("ANTHROPIC_API_KEY")?,
        LlmConfig::default(),
    );
    let config = ConvertConfig {
        source_language: Some(Language::Japanese),
        output_language: Language::Japanese,
        ..Default::default()
    };

    // Warnings are returned alongside the result instead of failing the call.
    let (sds, warnings) = convert_to_json(
        std::path::Path::new("input.pdf"), &backend, &config
    ).await?;
    for w in &warnings { eprintln!("WARN: {w}"); }

    std::fs::write("output.json", serde_json::to_string_pretty(&sds)?)?;
    Ok(())
}
```

```rust
// MHLW JSON → Word document (no LLM call involved).
use sds_converter_core::{convert_from_json, ConvertConfig, Language, SdsRoot};

fn main() -> anyhow::Result<()> {
    let sds: SdsRoot = serde_json::from_str(&std::fs::read_to_string("output.json")?)?;
    let config = ConvertConfig {
        output_language: Language::Japanese,
        ..Default::default()
    };
    convert_from_json(&sds, std::path::Path::new("result.docx"), &config)?;
    Ok(())
}
```

```rust
// Custom backend: implement LlmBackend to plug in any other LLM.
use sds_converter_core::{LlmBackend, SdsError};

struct MyBackend;

impl LlmBackend for MyBackend {
    async fn complete(&self, system: &str, user: &str) -> Result<String, SdsError> {
        // Call your LLM API, return the raw JSON string response
        todo!()
    }
}
```

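As far as I know, sds-converter is the only open-source tool that supports the MHLW schema, can run entirely locally (with a local LLM such as Ollama), and handles the full round-trip from source document to MHLW JSON and back to a Word document.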
- `sds-converter-core`: library. LLM extraction, DOCX generation, MHLW schema types.
- `sds-converter`: CLI binary. `to-json`, `to-docx`, `validate`, `extract-text` subcommands.

Feedback welcome, especially on section 3 component table extraction and non-Japanese document accuracy.
