Comparing LLM Models: A Technical Deep Dive

A developer built a lightweight Python harness to compare production-grade open large language models hosted on Oxlo.ai. The harness sends identical prompts to four models (Llama 3.3 70B, Qwen 3 32B, Kimi K2.6, and DeepSeek V3.2) concurrently, records the wall-clock latency of each response, and uses a judge model to score the outputs on correctness, clarity, and conciseness.

I needed a fast, repeatable way to compare production-grade open models before routing traffic to them. In this post, I will walk through a lightweight Python harness that sends identical prompts to four different Oxlo.ai models, times each response, and scores the outputs with a judge model so you can pick the right one for your workload.

The only third-party dependency is the `openai` package.

```bash
pip install openai
```

We start by initializing the client and defining the models we want to test. I picked a mix of generalist, reasoning, and multilingual models that Oxlo.ai hosts.

```python
from openai import OpenAI
import os

# The OpenAI client is pointed at Oxlo.ai's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.environ.get("OXLO_API_KEY"),
)

CANDIDATE_MODELS = [
    "llama-3.3-70b",
    "qwen-3-32b",
    "kimi-k2.6",
    "deepseek-v3.2",
]

TEST_PROMPT = (
    "Write a Python function that accepts a list of integers and returns "
    "the longest strictly increasing subsequence. Include type hints, "
    "a docstring, and a simple test case in the same code block."
)
```

Before we fire requests, we need a consistent rubric. I use a separate system prompt for the judge model so scoring stays consistent across runs.

```python
JUDGE_SYSTEM_PROMPT = """You are an expert code reviewer. You will receive a user request and a candidate response.
Score the response on three axes from 1 to 5:
1. Correctness: does the code solve the problem and pass the included test?
2. Clarity: are the docstring, types, and variable names clear?
3. Conciseness: is the solution free of unnecessary bloat?
Return ONLY a JSON object with keys: model, correctness, clarity, conciseness, total_score, and one_sentence_verdict.
"""
```

Waiting for four sequential API calls is slow. I use a thread pool to hit all candidate models at once and record wall-clock latency for each.
```python
import time
import concurrent.futures


def query_model(model_id: str, prompt: str) -> dict:
    # Time the full round trip, including network overhead.
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model_id,
        messages=[
            {"role": "system", "content": "You are a helpful coding assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.2,
    )
    elapsed = time.perf_counter() - start
    return {
        "model": model_id,
        "text": response.choices[0].message.content,
        "latency_sec": round(elapsed, 2),
    }


def run_benchmark(prompt: str):
    results = []
    # One worker per candidate so all requests are in flight at once.
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        futures = {
            executor.submit(query_model, m, prompt): m
            for m in CANDIDATE_MODELS
        }
        for future in concurrent.futures.as_completed(futures):
            results.append(future.result())
    return results
```

Now we feed each candidate response into a judge. I use llama-3.3-70b as the judge because it gives stable JSON formatting.

```python
import json


def judge_response(candidate: dict, original_prompt: str) -> dict:
    judge_input = (
        f"User request:\n{original_prompt}\n\n"
        f"Candidate response from {candidate['model']}:\n{candidate['text']}\n\n"
        "Score the response and return the JSON object."
    )
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": judge_input},
        ],
        temperature=0.1,
    )
    raw = response.choices[0].message.content.strip()
    # The judge sometimes wraps its JSON in a markdown code fence; unwrap it.
    fence = "`" * 3
    if raw.startswith(fence):
        # Drop only a leading "json" language tag, not every "json" in the body
        # (str.removeprefix needs Python 3.9+).
        raw = raw.split(fence)[1].removeprefix("json").strip()
    scores = json.loads(raw)
    return {**candidate, **scores}


def score_all(results: list, prompt: str):
    return [judge_response(r, prompt) for r in results]
```

Finally, we print a markdown table so the differences are obvious at a glance.
```python
def print_report(scored_results: list):
    print("| Model | Latency (s) | Correctness | Clarity | Conciseness | Total | Verdict |")
    print("|-------|-------------|-------------|---------|-------------|-------|---------|")
    for r in scored_results:
        print(
            f"| {r['model']} | {r['latency_sec']} | "
            f"{r['correctness']} | {r['clarity']} | {r['conciseness']} | "
            f"{r['total_score']} | {r['one_sentence_verdict']} |"
        )


if __name__ == "__main__":
    print("Running benchmark...")
    raw_results = run_benchmark(TEST_PROMPT)
    scored = score_all(raw_results, TEST_PROMPT)
    scored.sort(key=lambda x: x["total_score"], reverse=True)
    print_report(scored)
```

Save the script as `benchmark.py`, export your key, and run it.

```bash
export OXLO_API_KEY="your-key-here"
python benchmark.py
```

Example output (values will vary by run):

```text
Running benchmark...
| Model | Latency (s) | Correctness | Clarity | Conciseness | Total | Verdict |
|-------|-------------|-------------|---------|-------------|-------|---------|
| deepseek-v3.2 | 4.2 | 5 | 5 | 4 | 14 | Produces correct LIS with clean type hints and a valid doctest. |
| kimi-k2.6 | 3.8 | 5 | 4 | 4 | 13 | Correct solution but slightly verbose docstring. |
| qwen-3-32b | 2.1 | 4 | 4 | 5 | 13 | Correct logic, omits explicit test case in the block. |
| llama-3.3-70b | 1.9 | 4 | 5 | 4 | 13 | Good structure, test case is present but uses print instead of assert. |
```

Swap the static prompt for a JSONL test suite so you can regression-test model behavior on every deploy. You can also add a lightweight Streamlit frontend so non-engineers can run comparisons and vote on their preferred output.
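The JSONL regression-suite idea could be sketched roughly as follows. This is an illustration under assumptions, not part of the original harness: the file layout (one JSON object per line with a `prompt` field) and the helper names `load_suite` and `summarize` are hypothetical. The row keys `model`, `total_score`, and `latency_sec` mirror the ones the harness already produces.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean


def load_suite(path: str) -> list[dict]:
    """Read one JSON object per non-blank line; each must carry a 'prompt' key.

    The 'prompt' field name is an assumption made for this sketch.
    """
    cases = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between test cases
        case = json.loads(line)
        if "prompt" not in case:
            raise ValueError(f"{path}:{line_no} is missing the 'prompt' field")
        cases.append(case)
    return cases


def summarize(scored_rows: list[dict]) -> dict[str, dict]:
    """Average total_score and latency_sec per model across all test cases."""
    by_model = defaultdict(list)
    for row in scored_rows:
        by_model[row["model"]].append(row)
    return {
        model: {
            # float() guards against a judge that returns scores as strings.
            "mean_total": round(mean(float(r["total_score"]) for r in rows), 2),
            "mean_latency_sec": round(mean(float(r["latency_sec"]) for r in rows), 2),
            "cases": len(rows),
        }
        for model, rows in by_model.items()
    }
```

In the harness, one way to wire this in would be to loop over `load_suite("suite.jsonl")`, call `score_all(run_benchmark(case["prompt"]), case["prompt"])` for each case, collect the rows in a single list, and pass that list to `summarize`. Averaging over many prompts should also dampen the run-to-run noise that a single-prompt comparison is exposed to.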