Named Entity Recognition (NER) remains one of the most common production workloads in natural language processing. Whether you are extracting patient identifiers from clinical notes, tracking company mentions in financial filings, or tagging locations in legal contracts, the underlying challenge is the same: identify and classify atomic spans of text with high precision and recall. Traditional approaches rely on fine-tuned transformer models or brittle rule engines, but the rise of large language models has shifted the paradigm toward zero-shot and few-shot extraction. The catch is cost. When you pay by the token, processing long documents or running high-frequency agentic pipelines becomes expensive quickly. Oxlo.ai removes that constraint with request-based pricing, making LLM-driven NER economically viable for documents of any length.
Fine-tuned BERT variants are fast, but they are also rigid. Adding a new entity type means re-labeling data and retraining. LLMs accept a schema at inference time. You can pivot from extracting PERSON and ORG to extracting PRODUCT_SKU and MANUFACTURING_DATE by updating a prompt, with no redeployment. They also handle nested and discontinuous entities better than token-classification models, and they can infer implicit relationships between mentions.
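To make the "schema at inference time" idea concrete, here is a minimal sketch of how the label set can be treated as plain data. The helper names `build_ner_schema` and `build_system_prompt` are hypothetical, chosen for illustration only, and the prompt wording is an assumption rather than a recommended template.

```python
import json


def build_ner_schema(labels):
    """Return a JSON Schema for NER output restricted to the given labels."""
    if not labels:
        raise ValueError("at least one label is required")
    return {
        "type": "object",
        "properties": {
            "entities": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "text": {"type": "string"},
                        "label": {"type": "string", "enum": list(labels)},
                    },
                    "required": ["text", "label"],
                },
            }
        },
        "required": ["entities"],
    }


def build_system_prompt(labels):
    """Embed the schema in a system prompt (illustrative wording)."""
    schema = build_ner_schema(labels)
    return (
        "You are a precise NER engine. Return JSON matching this schema:\n"
        + json.dumps(schema)
    )


# Switching entity types is a data change, not a retraining job.
general_prompt = build_system_prompt(["PERSON", "ORG"])
catalog_prompt = build_system_prompt(["PRODUCT_SKU", "MANUFACTURING_DATE"])
```

Because the label list is an ordinary argument, the same deployed code can serve both extraction tasks; only the prompt sent with each request differs.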
The trade-off has always been inference cost and latency, especially when you need to process entire pages or documents rather than short sentences. The cost side of that trade-off largely disappears when your provider charges a flat rate per request; latency still scales with input length.
The most reliable way to run NER with an LLM is to enforce a structured output. Oxlo.ai supports JSON mode and function calling across its chat models, so you can constrain the response to a schema and parse it deterministically. Below is a minimal Python example using the OpenAI SDK, pointed at Oxlo.ai.
```python
import json

import openai

# Point the OpenAI SDK at Oxlo.ai's OpenAI-compatible endpoint.
client = openai.OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_API_KEY",
)

# JSON Schema describing the expected output: a list of labeled spans.
schema = {
    "type": "object",
    "properties": {
        "entities": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "text": {"type": "string"},
                    "label": {"type": "string", "enum": ["PERSON", "ORG", "GPE", "DATE", "MONEY"]},
                    "start": {"type": "integer"},
                    "end": {"type": "integer"},
                },
                "required": ["text", "label", "start", "end"],
            },
        }
    },
    "required": ["entities"],
}

text = "Apple Inc. is planning to open a new office in Austin by March 2026, investing over $1 billion."

# The request never sees the `schema` variable on its own, so embed it in the
# system prompt for the model to follow.
system_prompt = (
    "You are a precise NER engine. Extract all named entities from the user "
    "text and return valid JSON matching this JSON Schema. Do not add extra "
    "commentary.\n\n" + json.dumps(schema)
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Extract entities from the following text:\n\n{text}"},
    ],
    response_format={"type": "json_object"},  # JSON mode
    temperature=0.1,  # low temperature for more stable extractions
)

# Parse the JSON string returned by the model and pretty-print it.
result = json.loads(response.choices[0].message.content)
print(json.dumps(result, indent=2))
```
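A parsed response is still model output, so it can be worth checking before it reaches downstream systems. The sketch below is one possible post-processing step, not part of any SDK: the helper name `validate_entities` and the sample `raw` response are hypothetical. It assumes character offsets reported by a model may be off, so it recomputes them by searching the input text and drops any entity whose surface form does not appear verbatim or whose label is outside the allowed set.

```python
def validate_entities(result, text, allowed_labels):
    """Keep well-formed entities and recompute character offsets from the text."""
    entities = result.get("entities") if isinstance(result, dict) else None
    if not isinstance(entities, list):
        return []

    cleaned = []
    next_search_pos = {}  # surface form -> index to resume searching from
    for ent in entities:
        if not isinstance(ent, dict):
            continue
        surface, label = ent.get("text"), ent.get("label")
        if not isinstance(surface, str) or not surface:
            continue
        if not isinstance(label, str) or label not in allowed_labels:
            continue
        # Resume after the previous match of the same surface form, so
        # repeated mentions map to successive occurrences in the text.
        start = text.find(surface, next_search_pos.get(surface, 0))
        if start == -1:
            continue  # not a verbatim span of the input
        end = start + len(surface)
        next_search_pos[surface] = end
        cleaned.append({"text": surface, "label": label, "start": start, "end": end})
    return cleaned


text = "Apple Inc. is planning to open a new office in Austin by March 2026, investing over $1 billion."

# Hypothetical model response with typical problems: wrong offsets,
# an entity that is not in the text, and a label outside the schema.
raw = {
    "entities": [
        {"text": "Apple Inc.", "label": "ORG", "start": 0, "end": 9},
        {"text": "Austin", "label": "GPE", "start": 40, "end": 46},
        {"text": "Tim Cook", "label": "PERSON", "start": 0, "end": 8},
        {"text": "March 2026", "label": "EVENT", "start": 57, "end": 67},
    ]
}

allowed = ["PERSON", "ORG", "GPE", "DATE", "MONEY"]
cleaned = validate_entities(raw, text, allowed)
for ent in cleaned:
    print(ent)
```

Recomputing offsets with `str.find` trades some flexibility (it will not match a paraphrased or normalized span) for the guarantee that every returned `start`/`end` pair slices the original text exactly.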
This pattern works with any