{"slug": "comparing-llm-models-a-technical-deep-dive", "title": "Comparing LLM Models: A Technical Deep Dive", "summary": "A developer built a lightweight Python harness to compare production-grade open large language models from Oxlo.ai. The harness sends identical prompts to four models—Llama 3.3 70B, Qwen 3 32B, Kimi K2.6, and DeepSeek V3.2—times each response, and scores outputs using a judge model for objective comparison. The approach uses concurrent requests to measure wall-clock latency and a separate judge model to evaluate correctness, clarity, and conciseness.", "body_md": "I needed a fast, repeatable way to compare production-grade open models before routing traffic to them. In this post, I will walk through a lightweight Python harness that sends identical prompts to four different Oxlo.ai models, times each response, and scores the outputs with a judge model so you can pick the right one for your workload.\n\n`pip install openai`\n\nWe start by initializing the client and defining the models we want to test. I picked a mix of generalist, reasoning, and multilingual models that Oxlo.ai hosts.\n\n``` python\nfrom openai import OpenAI\nimport os\n\nclient = OpenAI(\n    base_url=\"https://api.oxlo.ai/v1\",\n    api_key=os.environ.get(\"OXLO_API_KEY\")\n)\n\nCANDIDATE_MODELS = [\n    \"llama-3.3-70b\",\n    \"qwen-3-32b\",\n    \"kimi-k2.6\",\n    \"deepseek-v3.2\",\n]\n\nTEST_PROMPT = (\n    \"Write a Python function that accepts a list of integers and returns \"\n    \"the longest strictly increasing subsequence. Include type hints, \"\n    \"a docstring, and a simple test case in the same code block.\"\n)\n```\n\nBefore we fire requests, we need a consistent rubric. I use a separate system prompt for the judge model so scoring stays objective across runs.\n\n```\nJUDGE_SYSTEM_PROMPT = \"\"\"You are an expert code reviewer. You will receive a user request and a candidate response. Score the response on three axes from 1 to 5:\n1. 
Correctness: does the code solve the problem and pass the included test?\n2. Clarity: are the docstring, types, and variable names clear?\n3. Conciseness: is the solution free of unnecessary bloat?\n\nReturn ONLY a JSON object with keys: model, correctness, clarity, conciseness, total_score, and one_sentence_verdict.\n\"\"\"\n```\n\nWaiting for four sequential API calls is slow. I use a thread pool to hit all candidate models at once and record wall-clock latency for each.\n\n``` python\nimport time\nimport concurrent.futures\n\ndef query_model(model_id: str, prompt: str) -> dict:\n    start = time.perf_counter()\n    response = client.chat.completions.create(\n        model=model_id,\n        messages=[\n            {\"role\": \"system\", \"content\": \"You are a helpful coding assistant.\"},\n            {\"role\": \"user\", \"content\": prompt},\n        ],\n        temperature=0.2,\n    )\n    elapsed = time.perf_counter() - start\n    return {\n        \"model\": model_id,\n        \"text\": response.choices[0].message.content,\n        \"latency_sec\": round(elapsed, 2),\n    }\n\ndef run_benchmark(prompt: str):\n    results = []\n    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:\n        futures = {\n            executor.submit(query_model, m, prompt): m\n            for m in CANDIDATE_MODELS\n        }\n        for future in concurrent.futures.as_completed(futures):\n            results.append(future.result())\n    return results\n```\n\nNow we feed each candidate response into a judge. 
I use llama-3.3-70b as the judge because it gives stable JSON formatting.\n\n``` python\nimport json\n\ndef judge_response(candidate: dict, original_prompt: str) -> dict:\n    judge_input = (\n        f\"User request:\\n{original_prompt}\\n\\n\"\n        f\"Candidate response from {candidate['model']}:\\n{candidate['text']}\\n\\n\"\n        \"Score the response and return the JSON object.\"\n    )\n    response = client.chat.completions.create(\n        model=\"llama-3.3-70b\",\n        messages=[\n            {\"role\": \"system\", \"content\": JUDGE_SYSTEM_PROMPT},\n            {\"role\": \"user\", \"content\": judge_input},\n        ],\n        temperature=0.1,\n    )\n    raw = response.choices[0].message.content.strip()\n    # Strip a Markdown code fence if the judge wrapped its JSON in one.\n    if raw.startswith(\"```\"):\n        raw = raw.split(\"```\")[1].removeprefix(\"json\").strip()\n    scores = json.loads(raw)\n    # Candidate keys win so the judge cannot overwrite the real model ID.\n    return {**scores, **candidate}\n\ndef score_all(results: list, prompt: str):\n    return [judge_response(r, prompt) for r in results]\n```\n\nFinally, we print a markdown table so the differences are obvious at a glance.\n\n``` python\ndef print_report(scored_results: list):\n    print(\"| Model | Latency (s) | Correctness | Clarity | Conciseness | Total | Verdict |\")\n    print(\"|-------|-------------|-------------|---------|-------------|-------|---------|\")\n    for r in scored_results:\n        print(\n            f\"| {r['model']} | {r['latency_sec']} | \"\n            f\"{r['correctness']} | {r['clarity']} | {r['conciseness']} | \"\n            f\"{r['total_score']} | {r['one_sentence_verdict']} |\"\n        )\n\nif __name__ == \"__main__\":\n    print(\"Running benchmark...\")\n    raw_results = run_benchmark(TEST_PROMPT)\n    scored = score_all(raw_results, TEST_PROMPT)\n    scored.sort(key=lambda x: x[\"total_score\"], reverse=True)\n    print_report(scored)\n```\n\nSave the script as `benchmark.py`, export your key, and run it.\n\n```\nexport OXLO_API_KEY=\"your-key-here\"\npython 
benchmark.py\n```\n\nExample output (values will vary by run):\n\n```\nRunning benchmark...\n| Model | Latency (s) | Correctness | Clarity | Conciseness | Total | Verdict |\n|-------|-------------|-------------|---------|-------------|-------|---------|\n| deepseek-v3.2 | 4.2 | 5 | 5 | 4 | 14 | Produces correct LIS with clean type hints and a valid doctest. |\n| kimi-k2.6 | 3.8 | 5 | 4 | 4 | 13 | Correct solution but slightly verbose docstring. |\n| qwen-3-32b | 2.1 | 4 | 4 | 5 | 13 | Correct logic, omits explicit test case in the block. |\n| llama-3.3-70b | 1.9 | 4 | 5 | 4 | 13 | Good structure, test case is present but uses print instead of assert. |\n```\n\nSwap the static prompt for a JSONL test suite so you can regression-test model behavior on every deploy. You can also add a lightweight Streamlit frontend so non-engineers can run comparisons and vote on their preferred output.", "url": "https://wpnews.pro/news/comparing-llm-models-a-technical-deep-dive", "canonical_source": "https://dev.to/shashank_ms_6a35baa4be138/comparing-llm-models-a-technical-deep-dive-dhp", "published_at": "2026-06-16 21:35:26+00:00", "updated_at": "2026-06-16 21:59:58.399696+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "machine-learning"], "entities": ["Oxlo.ai", "Llama 3.3 70B", "Qwen 3 32B", "Kimi K2.6", "DeepSeek V3.2", "OpenAI"], "alternates": {"html": "https://wpnews.pro/news/comparing-llm-models-a-technical-deep-dive", "markdown": "https://wpnews.pro/news/comparing-llm-models-a-technical-deep-dive.md", "text": "https://wpnews.pro/news/comparing-llm-models-a-technical-deep-dive.txt", "jsonld": "https://wpnews.pro/news/comparing-llm-models-a-technical-deep-dive.jsonld"}}