LLM for Text Summarization: Best Practices and Optimization Techniques

A developer built a production-ready document summarizer using Oxlo.ai's API, which ingests long-form text and outputs structured JSON with a TL;DR, key points, and action items. The pipeline uses DeepSeek V3.2, Llama 3.3 70B, Kimi K2.6, and Qwen 3 32B. Kimi K2.6's 131K context window lets it handle large documents without chunking, and Oxlo.ai's flat per-request pricing means a long input costs no more than a short one. The system includes a refinement step to simplify jargon for non-expert readers.

We are going to build a production-ready document summarizer that ingests long-form text and emits structured JSON with a TL;DR, key points, and action items. If you process research papers, support tickets, or meeting transcripts, this gives you a reusable pipeline you can drop into any backend.

```
pip install openai
```

I always start by confirming the API contract works. This snippet initializes the Oxlo.ai client and sends a one-sentence summary request to DeepSeek V3.2, which is available on the free tier. If you see a response, your environment is ready.

```python
from openai import OpenAI

# Oxlo.ai exposes an OpenAI-compatible endpoint, so the standard SDK works.
client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY",
)

response = client.chat.completions.create(
    model="deepseek-v3.2",
    messages=[
        {"role": "user", "content": "Summarize this in one sentence: The quick brown fox jumps over the lazy dog."},
    ],
)
print(response.choices[0].message.content)
```

The system prompt is the main lever for tone and structure, so I keep it in a dedicated constant. I instruct the model to act as a precise document summarizer and emit only valid JSON.

```python
SYSTEM_PROMPT = """You are a precise document summarizer. Read the user's text and produce a JSON object with exactly these keys:
- title: a short, descriptive title
- tldr: a one-sentence summary under 20 words
- key_points: an array of 3 to 5 bullet strings
- action_items: an array of specific next steps, or an empty array if none exist

Rules:
- Output only the JSON object, with no markdown fences and no preamble.
- Base every field strictly on the provided text.
- Be concise. Avoid filler words."""
```

Next, I wrap the prompt in a reusable function that calls Oxlo.ai. I use Llama 3.3 70B here because it follows system instructions reliably for structured extraction.
```python
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY",
)


def summarize(text: str) -> dict:
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
        temperature=0.2,
    )
    raw = response.choices[0].message.content.strip()
    # Strip a Markdown code fence if the model adds one despite the instructions.
    if raw.startswith("```"):
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0].strip()
    return json.loads(raw)
```

Most token-based providers make long inputs expensive, but Oxlo.ai uses flat per-request pricing regardless of prompt length, so a 50,000-character annual report costs the same as a single sentence. See https://oxlo.ai/pricing for details. For this step I switch to Kimi K2.6, which supports a 131K context window, so I can drop the entire document into one request without chunking logic.

```python
def summarize_long(text: str) -> dict:
    response = client.chat.completions.create(
        model="kimi-k2.6",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
        temperature=0.2,
    )
    raw = response.choices[0].message.content.strip()
    # Strip a Markdown code fence if the model adds one despite the instructions.
    if raw.startswith("```"):
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0].strip()
    return json.loads(raw)
```

When the input is dense with jargon, I run a second pass to simplify language while preserving meaning. I chain two calls: the first extracts the raw summary, and the second rewrites the tldr and key points for non-expert readers. I use Qwen 3 32B for the rewrite because it handles technical rephrasing precisely.

```python
REFINE_PROMPT = """You are an editor. Take the JSON summary below and rewrite only the 'tldr' and 'key_points' fields so a non-expert can understand them. Keep the 'title' and 'action_items' exactly as they are.
Output only valid JSON."""


def summarize_and_refine(text: str) -> dict:
    # First pass: extract the structured summary from the full document.
    first = summarize_long(text)
    # Second pass: rewrite tldr and key_points in plain language.
    response = client.chat.completions.create(
        model="qwen-3-32b",
        messages=[
            {"role": "system", "content": REFINE_PROMPT},
            {"role": "user", "content": json.dumps(first, indent=2)},
        ],
        temperature=0.3,
    )
    raw = response.choices[0].message.content.strip()
    if raw.startswith("```"):
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0].strip()
    return json.loads(raw)
```

Here is the complete script. I feed it a sample quarterly earnings excerpt and print the refined JSON.

```python
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY",
)

SYSTEM_PROMPT = """You are a precise document summarizer. Read the user's text and produce a JSON object with exactly these keys:
- title: a short, descriptive title
- tldr: a one-sentence summary under 20 words
- key_points: an array of 3 to 5 bullet strings
- action_items: an array of specific next steps, or an empty array if none exist

Rules:
- Output only the JSON object, with no markdown fences and no preamble.
- Base every field strictly on the provided text.
- Be concise. Avoid filler words."""

REFINE_PROMPT = """You are an editor. Take the JSON summary below and rewrite only the 'tldr' and 'key_points' fields so a non-expert can understand them. Keep the 'title' and 'action_items' exactly as they are.
Output only valid JSON."""


def summarize_long(text: str) -> dict:
    response = client.chat.completions.create(
        model="kimi-k2.6",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
        temperature=0.2,
    )
    raw = response.choices[0].message.content.strip()
    # Strip a Markdown code fence if the model adds one despite the instructions.
    if raw.startswith("```"):
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0].strip()
    return json.loads(raw)


def summarize_and_refine(text: str) -> dict:
    first = summarize_long(text)
    response = client.chat.completions.create(
        model="qwen-3-32b",
        messages=[
            {"role": "system", "content": REFINE_PROMPT},
            {"role": "user", "content": json.dumps(first, indent=2)},
        ],
        temperature=0.3,
    )
    raw = response.choices[0].message.content.strip()
    if raw.startswith("```"):
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0].strip()
    return json.loads(raw)


if __name__ == "__main__":
    document = """
    Q3 2024 Earnings Highlights

    Revenue grew 12% year-over-year to $840M, driven primarily by cloud infrastructure adoption in APAC and expansion of the enterprise tier. Operating margin compressed to 18% from 22% last quarter due to increased headcount in R&D and a one-time restructuring charge of $14M.

    The board approved a $200M share buyback program to be executed over the next twelve months. CFO guidance for Q4 projects revenue between $855M and $875M, with margin recovery to 20% as the restructuring costs roll off.

    The company also announced a strategic partnership with a major semiconductor vendor to co-design AI accelerators for edge deployments, with first silicon expected in late 2025.
    """
    result = summarize_and_refine(document)
    print(json.dumps(result, indent=2))
```

Example output:

```json
{
  "title": "Q3 2024 Earnings and Q4 Outlook",
  "tldr": "Revenue rose 12 percent to 840 million dollars, but profit margins dropped because of hiring and restructuring costs.",
  "key_points": [
    "Cloud infrastructure sales in Asia Pacific pushed revenue up 12 percent year over year",
    "Operating margin fell to 18 percent from 22 percent due to research hiring and a 14 million dollar restructuring charge",
    "The board authorized a 200 million dollar stock buyback over the next year",
    "Fourth quarter revenue is expected to reach 855 to 875 million dollars with margins rebounding to 20 percent",
    "A new chip partnership targets edge AI hardware arriving in late 2025"
  ],
  "action_items": [
    "Monitor Q4 margin recovery toward the 20 percent target",
    "Track progress on the semiconductor partnership and 2025 silicon timeline",
    "Evaluate impact of the share buyback on capital allocation"
  ]
}
```

There are two concrete ways to productionize this. First, wrap the summarizer in a FastAPI endpoint and accept file uploads so other services can POST PDFs or raw text. Second, enable streaming by passing stream=True to client.chat.completions.create and forward the text fragments as they arrive, which keeps perceived latency low for interactive UIs. The JSON can only be parsed once the full response has arrived.
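The streaming idea could look roughly like the sketch below. It assumes Oxlo.ai's endpoint honors `stream=True` the same way the OpenAI SDK's chat completions call does, which I have not verified beyond the OpenAI-compatible client used in this post. The names `stream_summary_text`, `collect_summary`, and `REQUIRED_KEYS` are hypothetical helpers introduced for illustration, and the sketch skips the code-fence stripping used earlier.

```python
import json
from typing import Any, Iterator

# Keys the system prompt asks the model to produce.
REQUIRED_KEYS = {"title", "tldr", "key_points", "action_items"}


def stream_summary_text(
    client: Any,
    text: str,
    system_prompt: str,
    model: str = "kimi-k2.6",
) -> Iterator[str]:
    """Yield text fragments of the model's reply as they arrive.

    `client` is assumed to be an OpenAI-compatible client object.
    """
    stream = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
        ],
        temperature=0.2,
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if not chunk.choices:
            continue
        fragment = chunk.choices[0].delta.content
        if fragment:
            yield fragment


def collect_summary(fragments: Iterator[str]) -> dict:
    """Join streamed fragments and parse them once the reply is complete."""
    raw = "".join(fragments).strip()
    summary = json.loads(raw)
    # Fail loudly if the model dropped one of the expected keys.
    missing = REQUIRED_KEYS - set(summary)
    if missing:
        raise ValueError(f"summary is missing keys: {sorted(missing)}")
    return summary
```

With the `client` and `SYSTEM_PROMPT` defined earlier, `collect_summary(stream_summary_text(client, document, SYSTEM_PROMPT))` would return the same kind of dict as `summarize_long`. An interactive UI would more likely forward each fragment to the browser while buffering it, then parse the buffer at the end.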