LLM for Text Summarization: Best Practices and Optimization Techniques


We are going to build a production-ready document summarizer that ingests long-form text and emits structured JSON with a TL;DR, key points, and action items. If you process research papers, support tickets, or meeting transcripts, this gives you a reusable pipeline you can drop into any backend.

pip install openai

I always start by confirming the API contract works. This snippet initializes the Oxlo.ai client and sends a one-sentence summary request to DeepSeek V3.2, which is available on the free tier. If you see a response, your environment is ready.

from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

response = client.chat.completions.create(
    model="deepseek-v3.2",
    messages=[
        {"role": "user", "content": "Summarize this in one sentence: The quick brown fox jumps over the lazy dog."},
    ],
)

print(response.choices[0].message.content)
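
Hard-coding the key is fine for a smoke test, but it is easy to commit by accident. A minimal sketch of reading it from the environment instead follows. The variable name OXLO_API_KEY and the helper load_api_key are assumptions made for illustration, not names documented by Oxlo.ai.

```python
import os


def load_api_key(var_name: str = "OXLO_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error.

    The default variable name is a hypothetical choice, not an Oxlo.ai convention.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key
```

With this in place, the client could be created as OpenAI(base_url="https://api.oxlo.ai/v1", api_key=load_api_key()).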

The system prompt does most of the work of shaping tone and structure, so I keep it in a dedicated constant. I instruct the model to act as a precise document summarizer and emit only valid JSON.

SYSTEM_PROMPT = """You are a precise document summarizer. Read the user's text and produce a JSON object with exactly these keys:
- title: a short, descriptive title
- tldr: a one-sentence summary under 20 words
- key_points: an array of 3 to 5 bullet strings
- action_items: an array of specific next steps, or an empty array if none exist

Rules:
- Output only the JSON object, with no markdown fences and no preamble.
- Base every field strictly on the provided text.
- Be concise. Avoid filler words."""
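
The prompt states a contract, but the model can still drift from it. A minimal sketch of checking a parsed reply against the four keys and limits listed above might look like this. The name validate_summary and the wording of the error messages are my own, and the word count is a simple whitespace split.

```python
REQUIRED_KEYS = {"title", "tldr", "key_points", "action_items"}


def validate_summary(summary) -> list[str]:
    """Return a list of problems; an empty list means the summary conforms."""
    if not isinstance(summary, dict):
        return ["summary is not a JSON object"]

    problems = []
    missing = REQUIRED_KEYS - set(summary)
    extra = set(summary) - REQUIRED_KEYS
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected keys: {sorted(extra)}")

    tldr = summary.get("tldr")
    if isinstance(tldr, str):
        # The prompt asks for "under 20 words", so 20 or more is a violation.
        if len(tldr.split()) >= 20:
            problems.append("tldr is not under 20 words")
    elif "tldr" in summary:
        problems.append("tldr is not a string")

    points = summary.get("key_points")
    if isinstance(points, list) and all(isinstance(p, str) for p in points):
        if not 3 <= len(points) <= 5:
            problems.append("key_points must contain 3 to 5 items")
    elif "key_points" in summary:
        problems.append("key_points is not an array of strings")

    actions = summary.get("action_items")
    if "action_items" in summary and not (
        isinstance(actions, list) and all(isinstance(a, str) for a in actions)
    ):
        problems.append("action_items is not an array of strings")

    return problems
```

An empty list means the reply can be passed downstream; anything else is a reason to retry or log the raw output.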

Next, I wrap the prompt in a reusable function that calls Oxlo.ai. I use Llama 3.3 70B here because it follows system instructions reliably for structured extraction.

import json
from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

def summarize(text: str) -> dict:
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
        temperature=0.2,
    )
    
    raw = response.choices[0].message.content.strip()
    if raw.startswith("```"):
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0].strip()
    return json.loads(raw)
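
The fence check above covers the most common failure, but models sometimes add a short preamble as well. A slightly more forgiving parser, sketched here under the assumption that the reply contains exactly one top-level JSON object, slices from the first opening brace to the last closing brace. The name extract_json is hypothetical.

```python
import json


def extract_json(raw: str) -> dict:
    """Parse the JSON object embedded in a model reply.

    Handles bare JSON, fenced JSON (with or without a language tag), and
    short preambles by slicing from the first '{' to the last '}'.
    Braces inside a preamble or trailing note would defeat this heuristic.
    """
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    # json.loads raises json.JSONDecodeError (a ValueError) on malformed JSON.
    return json.loads(raw[start : end + 1])
```

Inside summarize, the fence-stripping lines and the final json.loads call could then be replaced by return extract_json(raw).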

Most token-based providers make long inputs expensive, but Oxlo.ai uses flat per-request pricing regardless of prompt length, so a 50,000-character annual report costs the same as a single sentence. See https://oxlo.ai/pricing for details. For this step I switch to Kimi K2.6, which supports a 131K context window, so I can drop the entire document into one request without chunking logic.

def summarize_long(text: str) -> dict:
    response = client.chat.completions.create(
        model="kimi-k2.6",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
        temperature=0.2,
    )
    
    raw = response.choices[0].message.content.strip()
    if raw.startswith("```"):
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0].strip()
    return json.loads(raw)
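
Whether a document really fits in one request depends on its token count, not its character count. A rough pre-flight check is sketched below. It assumes about four characters per token for English text, which is a common rule of thumb rather than a figure from Oxlo.ai or the model's tokenizer, and it reads the 131K window quoted above as 131,000 tokens. The function name and the reserve for the reply are illustrative.

```python
def fits_context(
    text: str,
    system_prompt: str = "",
    context_tokens: int = 131_000,      # "131K" from the article, read conservatively
    chars_per_token: float = 4.0,       # heuristic; real tokenizers vary
    reply_reserve_tokens: int = 2_000,  # assumed headroom for the JSON reply
) -> bool:
    """Estimate whether the prompt plus the reply stays inside the context window."""
    estimated_prompt_tokens = (len(text) + len(system_prompt)) / chars_per_token
    return estimated_prompt_tokens + reply_reserve_tokens <= context_tokens
```

By this estimate the 50,000-character report mentioned above uses roughly 12,500 tokens, so it fits with a wide margin. A document that fails the check would need a chunking step before summarize_long is called.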

When the input is dense with jargon, I run a second pass to simplify language while preserving meaning. I chain two calls: the first extracts the raw summary, and the second rewrites the tldr and key_points for non-expert readers. I use Qwen 3 32B for the rewrite because it handles technical rephrasing precisely.

REFINE_PROMPT = """You are an editor. Take the JSON summary below and rewrite only the 'tldr' and 'key_points' fields so a non-expert can understand them. Keep the 'title' and 'action_items' exactly as they are. Output only valid JSON."""

def summarize_and_refine(text: str) -> dict:
    first = summarize_long(text)
    
    response = client.chat.completions.create(
        model="qwen-3-32b",
        messages=[
            {"role": "system", "content": REFINE_PROMPT},
            {"role": "user", "content": json.dumps(first, indent=2)},
        ],
        temperature=0.3,
    )
    
    raw = response.choices[0].message.content.strip()
    if raw.startswith("```"):
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0].strip()
    return json.loads(raw)
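
The refine prompt asks the model to leave title and action_items untouched, but nothing in the code enforces that. One way to guarantee it, sketched here with a hypothetical helper named merge_refined, is to copy only the two rewritten fields out of the second reply and keep everything else from the first pass.

```python
def merge_refined(first: dict, refined: dict) -> dict:
    """Take only 'tldr' and 'key_points' from the refine pass.

    'title' and 'action_items' always come from the first pass, so a refine
    call that alters or drops them cannot change the final result.
    """
    merged = dict(first)  # shallow copy; the first-pass dict is not mutated
    for key in ("tldr", "key_points"):
        if key in refined:
            merged[key] = refined[key]
    return merged
```

In summarize_and_refine, the last line could then return merge_refined(first, json.loads(raw)) instead of the second reply as-is.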

Here is the complete script. I feed it a sample quarterly earnings excerpt and print the refined JSON.

import json
from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

SYSTEM_PROMPT = """You are a precise document summarizer. Read the user's text and produce a JSON object with exactly these keys:
- title: a short, descriptive title
- tldr: a one-sentence summary under 20 words
- key_points: an array of 3 to 5 bullet strings
- action_items: an array of specific next steps, or an empty array if none exist

Rules:
- Output only the JSON object, with no markdown fences and no preamble.
- Base every field strictly on the provided text.
- Be concise. Avoid filler words."""

REFINE_PROMPT = """You are an editor. Take the JSON summary below and rewrite only the 'tldr' and 'key_points' fields so a non-expert can understand them. Keep the 'title' and 'action_items' exactly as they are. Output only valid JSON."""

def summarize_long(text: str) -> dict:
    response = client.chat.completions.create(
        model="kimi-k2.6",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
        temperature=0.2,
    )
    raw = response.choices[0].message.content.strip()
    if raw.startswith("```"):
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0].strip()
    return json.loads(raw)

def summarize_and_refine(text: str) -> dict:
    first = summarize_long(text)
    response = client.chat.completions.create(
        model="qwen-3-32b",
        messages=[
            {"role": "system", "content": REFINE_PROMPT},
            {"role": "user", "content": json.dumps(first, indent=2)},
        ],
        temperature=0.3,
    )
    raw = response.choices[0].message.content.strip()
    if raw.startswith("```"):
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0].strip()
    return json.loads(raw)

if __name__ == "__main__":
    document = """
    Q3 2024 Earnings Highlights

    Revenue grew 12% year-over-year to $840M, driven primarily by cloud infrastructure adoption in APAC and expansion of the enterprise tier. Operating margin compressed to 18% from 22% last quarter due to increased headcount in R&D and a one-time restructuring charge of $14M. The board approved a $200M share buyback program to be executed over the next twelve months. CFO guidance for Q4 projects revenue between $855M and $875M, with margin recovery to 20% as the restructuring costs roll off. The company also announced a strategic partnership with a major semiconductor vendor to co-design AI accelerators for edge deployments, with first silicon expected in late 2025.
    """
    
    result = summarize_and_refine(document)
    print(json.dumps(result, indent=2))

Example output:

{
  "title": "Q3 2024 Earnings and Q4 Outlook",
  "tldr": "Revenue rose 12 percent to 840 million dollars, but profit margins dropped because of hiring and restructuring costs.",
  "key_points": [
    "Cloud infrastructure sales in Asia Pacific pushed revenue up 12 percent year over year",
    "Operating margin fell to 18 percent from 22 percent due to research hiring and a 14 million dollar restructuring charge",
    "The board authorized a 200 million dollar stock buyback over the next year",
    "Fourth quarter revenue is expected to reach 855 to 875 million dollars with margins rebounding to 20 percent",
    "A new chip partnership targets edge AI hardware arriving in late 2025"
  ],
  "action_items": [
    "Monitor Q4 margin recovery toward the 20 percent target",
    "Track progress on the semiconductor partnership and 2025 silicon timeline",
    "Evaluate impact of the share buyback on capital allocation"
  ]
}

There are two concrete ways to productionize this. First, wrap the summarizer in a FastAPI endpoint and accept file uploads so other services can POST PDFs or raw text. Second, enable streaming by passing stream=True to client.chat.completions.create and forward the text chunks as they arrive, which keeps perceived latency low for interactive UIs. The streamed fragments are not valid JSON on their own, so parse the result only after the stream has finished.
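
A minimal sketch of the streaming side, assuming the endpoint returns chunks shaped like the OpenAI SDK's streaming chunks (each exposing choices[0].delta.content). The helper name collect_streamed_json is hypothetical, and I have not verified streaming behavior for every model on Oxlo.ai.

```python
import json
from typing import Iterable


def collect_streamed_json(stream: Iterable) -> dict:
    """Concatenate streamed text deltas and parse the result as JSON.

    Each chunk is expected to expose chunk.choices[0].delta.content,
    which may be None (for example on the first or last chunk).
    """
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices at all
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
            # An interactive UI could forward `delta` to the browser here.
    # Only the fully assembled text is parseable as JSON.
    return json.loads("".join(parts))
```

Usage would look like stream = client.chat.completions.create(model="kimi-k2.6", messages=[...], stream=True) followed by result = collect_streamed_json(stream).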

source & further reading

dev.to (original article)