[ARTICLE · art-30206] src=dev.to ↗ pub= topic=large-language-models verified=true sentiment=· neutral

Comparing LLM Models: A Technical Deep Dive

A developer built a lightweight Python harness to compare production-grade open large language models hosted on Oxlo.ai. The harness sends an identical prompt to four models (Llama 3.3 70B, Qwen 3 32B, Kimi K2.6, and DeepSeek V3.2) concurrently, records wall-clock latency for each response, and has a judge model score each output on correctness, clarity, and conciseness.

4 min read · Published Jun 16, 2026

I needed a fast, repeatable way to compare production-grade open models before routing traffic to them. In this post, I will walk through a lightweight Python harness that sends identical prompts to four different Oxlo.ai models, times each response, and scores the outputs with a judge model so you can pick the right one for your workload.

pip install openai

We start by initializing the client and defining the models we want to test. I picked a mix of generalist, reasoning, and multilingual models that Oxlo.ai hosts.

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.environ.get("OXLO_API_KEY")
)

CANDIDATE_MODELS = [
    "llama-3.3-70b",
    "qwen-3-32b",
    "kimi-k2.6",
    "deepseek-v3.2",
]

TEST_PROMPT = (
    "Write a Python function that accepts a list of integers and returns "
    "the longest strictly increasing subsequence. Include type hints, "
    "a docstring, and a simple test case in the same code block."
)

Before we fire requests, we need a consistent rubric. I give the judge model its own system prompt so every response is scored against the same criteria on every run.

JUDGE_SYSTEM_PROMPT = """You are an expert code reviewer. You will receive a user request and a candidate response. Score the response on three axes from 1 to 5:
1. Correctness: does the code solve the problem and pass the included test?
2. Clarity: are the docstring, types, and variable names clear?
3. Conciseness: is the solution free of unnecessary bloat?

Return ONLY a JSON object with keys: model, correctness, clarity, conciseness, total_score, and one_sentence_verdict.
"""
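Because the rubric asks the judge to report its own total_score, it can help to check the returned object before trusting it. The sketch below is an assumption layered on the rubric, not part of the original harness: `validate_scores` is a hypothetical helper that checks each axis is an integer from 1 to 5 and recomputes the total instead of relying on the judge's arithmetic.

```python
AXES = ("correctness", "clarity", "conciseness")

def validate_scores(scores: dict) -> dict:
    """Check the three rubric axes and recompute total_score from them."""
    cleaned = dict(scores)
    for axis in AXES:
        value = cleaned.get(axis)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{axis} must be an integer from 1 to 5, got {value!r}")
    # Overwrite whatever total the judge reported with the actual sum.
    cleaned["total_score"] = sum(cleaned[axis] for axis in AXES)
    return cleaned
```

Calling this on the parsed judge output would turn a malformed score into an explicit error rather than a silently wrong table row.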

Waiting for four sequential API calls is slow. I use a thread pool to hit all candidate models at once and record wall-clock latency for each.

import time
import concurrent.futures

def query_model(model_id: str, prompt: str) -> dict:
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model_id,
        messages=[
            {"role": "system", "content": "You are a helpful coding assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.2,
    )
    elapsed = time.perf_counter() - start
    return {
        "model": model_id,
        "text": response.choices[0].message.content,
        "latency_sec": round(elapsed, 2),
    }

def run_benchmark(prompt: str):
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        futures = {
            executor.submit(query_model, m, prompt): m
            for m in CANDIDATE_MODELS
        }
        for future in concurrent.futures.as_completed(futures):
            results.append(future.result())
    return results
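In run_benchmark, future.result() re-raises any exception from the worker, so one failing model ends the whole run. Here is a minimal sketch of one way to keep going, assuming you would rather record the failure than abort. `run_tolerant` and the stand-in `fake_query` are hypothetical names used only for illustration; in the harness you would pass query_model instead.

```python
import concurrent.futures

def run_tolerant(query_fn, model_ids, prompt):
    """Run query_fn(model_id, prompt) per model; record errors instead of raising."""
    results = []
    workers = max(1, len(model_ids))
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        futures = {executor.submit(query_fn, m, prompt): m for m in model_ids}
        for future in concurrent.futures.as_completed(futures):
            model_id = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:  # any worker failure, e.g. a network or API error
                results.append({"model": model_id, "error": repr(exc)})
    return results

def fake_query(model_id, prompt):
    # Hypothetical stand-in for query_model so the sketch runs offline.
    if model_id == "broken-model":
        raise RuntimeError("simulated API failure")
    return {"model": model_id, "text": f"echo: {prompt}", "latency_sec": 0.0}

demo = run_tolerant(fake_query, ["model-a", "broken-model"], "hello")
```

Entries that carry an "error" key would need to be filtered out before they reach the judge.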

Now we feed each candidate response into a judge. I use llama-3.3-70b as the judge because it gives stable JSON formatting.

import json

def judge_response(candidate: dict, original_prompt: str) -> dict:
    judge_input = (
        f"User request:\n{original_prompt}\n\n"
        f"Candidate response from {candidate['model']}:\n{candidate['text']}\n\n"
        "Score the response and return the JSON object."
    )
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": judge_input},
        ],
        temperature=0.1,
    )
    raw = response.choices[0].message.content.strip()
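    if raw.startswith("```"):
        # Strip a Markdown fence and its optional "json" language tag,
        # without touching the word "json" elsewhere in the payload.
        raw = raw.split("```")[1].removeprefix("json").strip()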
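    scores = json.loads(raw)
    # Candidate fields go last so the judge's reply cannot overwrite
    # "model", "text", or "latency_sec".
    return {**scores, **candidate}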

def score_all(results: list, prompt: str):
    return [judge_response(r, prompt) for r in results]
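Judge models sometimes add a sentence before or after the JSON instead of returning it bare. As a hedged alternative to the prefix check above, the sketch below takes the span from the first `{` to the last `}` and parses that. `extract_json` is a hypothetical helper, and the approach assumes the reply contains exactly one JSON object.

```python
import json

def extract_json(raw: str) -> dict:
    """Parse the span from the first '{' to the last '}' of a model reply."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge reply")
    return json.loads(raw[start : end + 1])
```

If the reply contains two separate objects, json.loads raises an error on the combined span, which is arguably better than silently picking one.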

Finally, we print a markdown table so the differences are obvious at a glance.

def print_report(scored_results: list):
    print("| Model | Latency (s) | Correctness | Clarity | Conciseness | Total | Verdict |")
    print("|-------|-------------|-------------|---------|-------------|-------|---------|")
    for r in scored_results:
        print(
            f"| {r['model']} | {r['latency_sec']} | "
            f"{r['correctness']} | {r['clarity']} | {r['conciseness']} | "
            f"{r['total_score']} | {r['one_sentence_verdict']} |"
        )

if __name__ == "__main__":
    print("Running benchmark...")
    raw_results = run_benchmark(TEST_PROMPT)
    scored = score_all(raw_results, TEST_PROMPT)
    scored.sort(key=lambda x: x["total_score"], reverse=True)
    print_report(scored)
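With totals ranging from 3 to 15, ties on total_score can happen, and a sort on that key alone leaves tied rows in whatever order they arrived. One option, assuming lower latency should win a tie, is a two-part sort key. `rank` is a hypothetical helper and the rows below are made-up sample data.

```python
def rank(scored_results: list) -> list:
    """Sort by total_score descending, then by latency_sec ascending."""
    return sorted(scored_results, key=lambda r: (-r["total_score"], r["latency_sec"]))

# Made-up rows for illustration only.
sample = [
    {"model": "model-a", "total_score": 13, "latency_sec": 3.8},
    {"model": "model-b", "total_score": 14, "latency_sec": 4.2},
    {"model": "model-c", "total_score": 13, "latency_sec": 1.9},
]
ranked = [r["model"] for r in rank(sample)]
```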

Save the script as benchmark.py, export your key, and run it.

export OXLO_API_KEY="your-key-here"
python benchmark.py

Example output (values will vary by run):

Running benchmark...
| Model | Latency (s) | Correctness | Clarity | Conciseness | Total | Verdict |
|-------|-------------|-------------|---------|-------------|-------|---------|
| deepseek-v3.2 | 4.2 | 5 | 5 | 4 | 14 | Produces correct LIS with clean type hints and a valid doctest. |
| kimi-k2.6 | 3.8 | 5 | 4 | 4 | 13 | Correct solution but slightly verbose docstring. |
| qwen-3-32b | 2.1 | 4 | 4 | 5 | 13 | Correct logic, omits explicit test case in the block. |
| llama-3.3-70b | 1.9 | 4 | 5 | 4 | 13 | Good structure, test case is present but uses print instead of assert. |

Swap the static prompt for a JSONL test suite so you can regression-test model behavior on every deploy. You can also add a lightweight Streamlit frontend so non-engineers can run comparisons and vote on their preferred output.
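A minimal sketch of the JSONL idea, assuming one JSON object per line with a `prompt` field and an optional `id`. The field names and `load_prompts` are assumptions for illustration, not part of the original script.

```python
import json
from pathlib import Path

def load_prompts(path: str) -> list[dict]:
    """Read one JSON object per line, skipping blank lines."""
    cases = []
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue
            case = json.loads(line)
            if "prompt" not in case:
                raise ValueError(f"line {line_no}: missing 'prompt' field")
            cases.append(case)
    return cases
```

Each case's prompt could then be passed to run_benchmark and score_all in a loop, with the results grouped by the case's id.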
