{"slug": "research-what-s-the-default-language-of-an-llm", "title": "Research: \"What's the Default Language of an LLM?\"", "summary": "A developer replicated Chad Fowler's study on LLM default coding languages using local models, finding that most models default to Python when task context is ambiguous. The glm-4.7-flash model produced 48 \"none\" results and generated excessively verbose outputs exceeding 6,000 lines, suggesting potential configuration issues or model-specific quirks. The findings confirm that local LLMs exhibit similar language preferences to cloud-based models, with Python dominating unspecified coding tasks.", "body_md": "Chad Fowler did an interesting study and posted about it on LinkedIn, asking the question, \"if I ask Claude / GPT / Gemini for \"a script that...\" or \"a small web app for...\", what am I going to get back?\" I thought, \"What about local LLMs? Does that change the conversation at all?\"\n\nFirst off, his original LinkedIn post is [here](https://www.linkedin.com/posts/fowlerchad_when-you-hand-an-llm-a-coding-task-and-dont-share-7466109947307651072-xQ31/), just to give credit where credit is due. Fortunately, he also put together a nice little test harness [up on GitHub](https://github.com/chad/whichlang), which I was able to [fork](https://github.com/tedneward/Research-whichlang). I encourage readers to go look at either repository to understand the project code and methodology before continuing.\n\nThe code required a few changes to run locally:\n\nI modified the `models.yaml` file (which contained the list of models to run the prompt against). The original had a list of cloud models and providers, so it wasn't too hard to add a list of local-hosted models and URLs. 
There's one small mismatch: the code expects an environment variable (`OPENAI_API_KEY`) that's used as part of the API calls, so to run locally I had to set some kind of value there (a la `export OPENAI_API_KEY=foobar` in the shell before running). A longer-term fix would probably be to check whether it is provided and, if not, simply not go looking for it and see if the call fails.\n\nThe original was using a second call to a cloud model to \"judge\" the returned LLM result, in order to determine what language the LLM had used to generate the code. Since I was running everything locally, I needed to modify the code to use a local LLM. Rather than switch models to match what was being used (or deliberately pick a different model from the one being used), I just chose a model and hard-coded it.\n\nI also added an `extract.py` script that takes the JSONL file and turns each row into a standalone file in a peer `extractions` directory. This turned out to be necessary because I was getting some very weird results from the glm-4.7-flash model--more on this later. 
The extract script works a lot like the report script: it takes the JSONL and extracts the data into standalone files, one for each row.\n\nIn my initial run, I used `qwen-3.6`, `qwen3-coder`, `gpt-oss`, `gemma4`, and `glm-4.7-flash`, and while most of the time the results aligned pretty closely with [Chad's original results](https://github.com/chad/whichlang/blob/main/REPORT.md), the `glm-4.7-flash` model really choked hard.\n\nLike, 48 `none` results, hard.\n\nThe rest of the models behaved somewhat similarly to what Chad found in his work: lots of preference for Python when the context of the problem didn't strongly suggest (if not outright enforce) something else.\n\nBut the glm-4.7-flash failures were curious, as most of the time it was exceptionally verbose and its output actually spilled out into a *second* response, which was the classifier-judge request. For example, with the `cli-dir-size` task, which `gemma4` completed in about 70 lines of response, the `glm-4.7-flash` model used over 6k lines no fewer than four times, and in some cases it got to a workable solution and then talked itself right out of it. I have zero idea why that would be the case, but it was a common problem. We can see this when running the `python3 -m whichlang.extract` script, which breaks the JSONL out into separate files for easier comparison.\n\nNow, I can't say for certain that the problem was with the model, since it could very well have been something I did wrong in the Ollama setup/configuration, but I couldn't say exactly what that would be. 
Asking Ollama for its model configuration, I got:\n\n```\ntedneward@Teds-MBP-16 Research-whichlang % ollama show glm-4.7-flash\n  Model\n    architecture        glm4moelite    \n    parameters          29.9B          \n    context length      202752         \n    embedding length    2048           \n    quantization        Q4_K_M         \n    requires            0.15.0         \n\n  Capabilities\n    completion    \n    tools         \n    thinking      \n\n  Parameters\n    temperature    1    \n\n  License\n    MIT License                        \n    Copyright (c) [year] [fullname]    \n    ...\n```\n\n... which seems fine, but...? Certainly its context length and embedding length seemed reasonable, and I did nothing to change any of the configuration after the `ollama pull`, but `glm-4.7-flash` consistently failed like this over several runs.\n\nIn and of themselves, my modifications to Chad's experiment were pretty minor and incremental at best--the only real \"value-add\" was the added data in the `runs.jsonl` results. For the most part, what I think of as the \"standard\" local coding models, `gemma4`, `gpt-oss`, and the various `qwen3` models, all did pretty well, well enough that I consider them to be on par with what the cloud models would create for a bunch of these sorts of tasks. I think the `glm-4.7-flash` model is stronger than this experiment suggests, but it may need some kind of tuning or better harnessing to avoid what appeared to be getting caught in a \"dead-end\" loop.\n\nIf anything, my personal \"big win\" is the `tasks.yaml` file, which I plan to use as a harness for some of my other experiments, most notably the one I was working on before Chad distracted me, around the various permutations of \"skills\" files that we see across the industry. 
The tasks seem like a nice collection to feed to OpenCode and capture the results.\n\nOne last thing: When Chad and I were DM'ing about this experiment, one thing that became very apparent was how much he hopes this can serve as an ongoing, \"live\" experiment to which others can contribute and improve. I heartily second that emotion--like Chad, I'm putting all this out into the public space so that people can take it and run with it, maybe adding new models (cloud or local) and/or new tasks, or even just running the experiment with different parameters (temperature, context lengths, whatever). The more data we can get that shows different behavior of the models, the more we collectively as an industry can get a handle on exactly what and how these models can help us.\n\nAnd in the end, isn't that what these things are supposed to be doing? Helping us, I mean?", "url": "https://wpnews.pro/news/research-what-s-the-default-language-of-an-llm", "canonical_source": "http://blogs.newardassociates.com/blog/2026/rnd-whichlang.html", "published_at": "2026-06-03 00:00:00+00:00", "updated_at": "2026-06-04 10:10:16.401959+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "generative-ai", "ai-research", "ai-tools"], "entities": ["Chad Fowler", "LinkedIn", "GitHub", "Claude", "GPT", "Gemini", "OpenAI"], "alternates": {"html": "https://wpnews.pro/news/research-what-s-the-default-language-of-an-llm", "markdown": "https://wpnews.pro/news/research-what-s-the-default-language-of-an-llm.md", "text": "https://wpnews.pro/news/research-what-s-the-default-language-of-an-llm.txt", "jsonld": "https://wpnews.pro/news/research-what-s-the-default-language-of-an-llm.jsonld"}}