{"slug": "kog-laneformer-2b-the-latency-first-model-behind-kog-inference-engine", "title": "Kog Laneformer 2B: The Latency-First Model Behind Kog Inference Engine", "summary": "Kog released Laneformer 2B, a 2.3B-parameter coding model designed for high-speed decoding, on Hugging Face Hub. The model was trained from scratch with latency as the primary objective, achieving 45.1% HumanEval+ and 51.6% MBPP+ in greedy decoding. Kog, a Paris-based AI infrastructure startup, developed the model to co-design architecture and runtime for real-time inference.", "body_md": "# Kog Laneformer 2B: The Latency-First Model Behind Kog Inference Engine\n\nToday Kog is releasing the weights and model code of Laneformer 2B on Hugging Face Hub, the 2.3B-parameter instruction-tuned coding model designed for high-speed decoding.\n\nMost LLM research optimizes for benchmark quality first, and inference metrics like speed are often treated as a serving problem that comes later: train the model, then quantize it, shard it, batch inputs, cache inputs, and write better kernels.\n\nKog took a different route and treated speed as our first objective. What changes when a model is designed from the ground up with decoding speed maximization in mind? 
Which architectural choices does that rule out, and which ones still preserve strong model performance?\n\nThis blog post is the story of how Kog trained Laneformer 2B from scratch into a capable coding model while respecting the hardware constraints required by our Kog Inference Engine and the budget constraints of a startup.\n\nAbout Kog\n\nKog is a Paris-based AI infrastructure startup building a real-time inference engine for AI agents with innovative low-level GPU engineering and LLM architecture research.\n\nFor more background, see Kog's website and introductory blog post:\n\nTL;DR\n\n- Kog designed a lane-structured Transformer architecture for high-speed single-request decoding on our inference stack.\n- Kog validated the custom architectural changes at small scale, then trained the final 2.3B model from scratch on ~4T pre-training tokens, continued on ~2T code/reasoning-heavy tokens, and instruction-tuned on ~210M tokens.\n- Kog shows that, even with moderate resources, it is possible to build and deploy a custom small language model with competitive coding benchmark results in its size range.\n- Laneformer 2B reaches **45.1% HumanEval+** and **51.6% MBPP+** in greedy decoding.\n- Kog releases the weights, Hugging Face model code, and documentation as [kogai‑laneformer‑2b‑it ↗](https://huggingface.co/kogai/laneformer-2b-it?ref=blog.kog.ai).\n- You can experience the accelerated version via our Kog Inference Engine on our [playground ↗](https://playground.kog.ai/?ref=blog.kog.ai).\n\nThe Laneformer 2B technical report is available on Hugging Face.\n\n## The idea\n\nAt low batch sizes, [decode speed is not just a FLOPs problem](https://blog.kog.ai/real-time-llm-inference-on-standard-gpus-3-000-tokens-s-per-request/). 
A lot of time goes into moving weights, synchronizing kernels, and paying communication costs layer after layer.\n\nThis overhead increases even more in multi-GPU setups, where inter-GPU communication is introduced. At the model architecture level, [Tensor Parallelism (TP)](https://danielvegamyhre.github.io/ml/performance/2025/03/30/illustrated-megatron.html?ref=blog.kog.ai) is a well-known way to split work across GPUs, but each layer forces the devices to stop and exchange results before moving on to the next layer.\n\nThis led us to a simple question: can we hide those communication costs instead of paying them at every layer?\n\nNaive attempts to solve this problem can introduce ad hoc architectural changes that hurt model quality and make the method difficult to apply to an existing pre-trained architecture without leaving performance on the table.\n\nFast inference does not require training a new model from scratch: Kog's inference engine already achieves very high decoding speeds on standard pre-trained architectures through low-level GPU optimization. But to go further, the runtime can no longer be treated as a separate serving layer: the model architecture itself has to expose the right structure for the engine to exploit.\n\nThose observations left us with a single conclusion: for the fastest single-request inference, **architecture and runtime should be designed together**. Laneformer is our first model trained from scratch to explore that co-design point.\n\nAs a small startup, we could not yet solve this by scaling up indefinitely. The target had to be deliberately constrained: **design and train a small-scale model with strong coding capabilities and extreme inference decoding speed.**\n\n## The story\n\n### Hiding overhead\n\nTensor Parallelism (TP) is effective because it splits large matrix operations across GPUs, but it pays for this with inter-GPU synchronization. 
At batch-size-one decoding, this cost is especially painful.\n\nThe obvious idea is to delay the communication introduced by TP. In practice, doing this naively hurts the model: once hidden states are no longer synchronized at the usual boundaries, quality drops off sharply, and architectural changes become necessary to keep training stable and preserve quality.\n\nWe spent this phase testing variants at small scale. Interestingly, many of our more complex ideas either degraded quality or made the implementation too brittle. The useful lesson was almost embarrassingly simple: try the obvious thing first! Understand why it fails and fix it with the smallest architectural change needed.\n\nThat path led to the mechanism we now call [Delayed Tensor Parallelism (DTP)](https://blog.kog.ai/delayed-tensor-parallelism-for-faster-transformer-inference). For the full mechanism, see our DTP deep dive.\n\n### Designing the architecture\n\nOnce DTP had a viable shape, the rest of the model design had to stay conservative. DTP was already changing one important assumption of a standard Tensor-Parallel Transformer, so we did not want to introduce unrelated architectural novelty at the same time.\n\nThere was also a speed budget. In the idealized limit where communication and other overheads are hidden, batch-size-one throughput becomes bounded by GPU memory bandwidth. Each generated token requires streaming model weights and the KV cache from GPU memory to compute units, so the maximum theoretical tokens per second depends strongly on how much data must be read per forward pass.\n\nSince we were not targeting very long contexts in this release, the main practical question became: how large can the model be?\n\nIn theory, this question could be answered from hardware bandwidth alone. 
In practice, another constraint mattered just as much: what budget and compute could we actually sustain, and how long could we train?\n\nThe final size was therefore chosen at the intersection of three constraints:\n\n- small enough to train from scratch with our resources,\n- large enough to make coding benchmarks and post-training meaningful,\n- compatible with DTP and our inference engine, so as to reach the highest possible speed.\n\nThe 2B scale ended up being perfect for us.\n\nBelow is our architecture card, showing how DTP and lanes fit into a mostly standard decoder Transformer.\n\nSome notes:\n\n- The most significant architectural change is the new 8-lane system to support DTP. To perfectly hide the TP communication overhead, we calibrated the delay to 2.\n- We use causal Grouped-Query Attention (GQA) with 32 query heads and 16 key/value heads, allowing us to shard the heads evenly across our 8 lanes (4 query heads and 2 key/value heads per lane).\n- We used Sliding Window Attention (SWA) for 10 of the 15 layers of the model to make sure streaming the KV cache would never add non-negligible latency at our context size.\n\n### Pre-training and mid-training\n\nPre-training, even at the 2B scale, is a mountain to climb! Fortunately, today there are many useful resources, such as the [Smol training playbook](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook?ref=blog.kog.ai), to help you get started.\n\nSo first, let's devise the data recipe!\n\n#### Data mixtures and recipes\n\nOur goal was to produce a capable coding model. Since we did not have access to proprietary data, we approached the problem under an open-source constraint.\n\nThanks to the current open ecosystem, and particularly NVIDIA's work on Nemotron, we were able to gather more than 20TB of high-quality filtered open-source data and avoid a long and costly data mixture search.\n\n🤔 Remark\n\nGetting 20TB of data from HF might sound simple, but working at this scale becomes a systems problem in its own right. 
Moving, filtering, validating, and reshuffling the data can take days, so small mistakes in the data pipeline can quickly become expensive.\n\nFollowing the [NVIDIA Nemotron paper](https://arxiv.org/abs/2604.12374?ref=blog.kog.ai), we used a multi-phase recipe as our starting point:\n\n- start with a broad generalist data mixture,\n- continue with a more focused data mixture,\n- optionally end with a short final phase to improve specific capabilities.\n\nThat being said, our 2B model is deliberately small. At this scale, it cannot absorb every useful signal contained in trillions of available tokens. So even if we could draw inspiration from Nemotron data mixtures, the final mixture had to be more selective. For Laneformer 2B, we introduced one twist: we concentrated most of the specialization in the second phase.\n\nPhase 1 would build a broad base. Phase 2 would shift the model strongly toward coding and reasoning capability.\n\n🤔 Remark\n\nThis was an explicit trade-off. But a strong data-mixture shift between phases is not standard practice, and we expected it to hurt some general capabilities. For this release, we accepted that risk, betting that we would receive in exchange a strong coding-focused model.\n\n#### Infrastructure and performance\n\nThe training stack was built around repeatability as much as raw throughput: launch scripts with isolated snapshots, checkpoint conversion, HF export, validation, lm-eval support, CORE tracking, and dashboarding all became part of the project to ensure a reliable pre-training run with real-time monitoring.\n\nNo need to repeat how scary it is to spend hundreds of thousands of dollars on a run with no possibility of fixing it afterward. 
To avoid any potential training downtime we ensured with our partners that we could swap nodes effectively thanks to spare machines, and made restart reliability part of the training system.\n\nWe also spent quite a lot of time precisely tuning our model parameter budget, architecture details, number of data shards, and TorchTitan's distributed-training and compilation knobs to reach about 17k tokens/s/GPU during pre-training.\n\nOur pre-training setup ended up being:\n\n- 24 nodes with 8 NVIDIA H100 GPUs each, for 192 H100 GPUs.\n- Efficient European infrastructure through Scaleway and ADASTRA.\n- About 21 days of training.\n- Mixed FP32/BF16 precision.\n- Fully Sharded Data Parallel (FSDP) parallelism.\n- AdamW with a WSD learning-rate schedule.\n- Final-step checkpoint selection.\n\n#### Runs\n\nFor pre-training and mid-training, we used [TorchTitan](https://github.com/pytorch/torchtitan?ref=blog.kog.ai) as our distributed training stack. TorchTitan gave us a simple and reliable foundation for large-scale FSDP training while still leaving enough flexibility to support the custom Laneformer architecture.\n\nLaneformer is close to a standard decoder-only Transformer in many ways, but its lane structure and DTP-oriented layout meant that we needed to control the model implementation, distributed configuration and compilation carefully.\n\nThe pre-training phases used a sequence length of 4,096 and a global batch size of 1,536, or about 6.29M tokens per optimizer step.\n\nAs we can see, moving to phase 2 drastically reduces the token diversity and the model stabilizes at a much lower cross-entropy than during phase 1. This is to be expected.\n\n🤔 Remark\n\nAlthough the chart shows multiple colors, they all correspond to the same training loss run. 
The color changes mark points where the training job was restarted from a temporary checkpoint after a failure.\n\n#### Evaluation tracking\n\nTo monitor validation during pre-training, we decided to use the [DataComp-LM Core centered accuracy](https://arxiv.org/abs/2406.11794?ref=blog.kog.ai), which is computed over 22 low-variance tasks by subtracting each task's random baseline.\n\nBesides the stable training loss curves (no spikes, no waves), this broad evaluation suite gave us the signal to ensure that our model was learning and improving across the whole training run.\n\nBelow, we can see the evolution of the metric across the two phases. We note an acceleration of performance during the learning-rate cooldown.\n\nWe also note that our choice to radically shift the Phase 2 data mixture toward coding and reasoning comes with a non-negligible cost on the general capabilities of the model (we will see later, after post-training, that it was an acceptable price to pay).\n\n## Post-training\n\nFor post-training, we settled on a very simple recipe, focusing on Supervised Fine-Tuning (SFT) with instruction and identity fine-tuning. 
Stay tuned for our next version of the model with a much stronger post-training pipeline.\n\nOur goal was mainly to surface the base model's capabilities with a lightweight process, using around 210M tokens.\n\nNotably, we experimented with synthetic data generation for the identity fine-tuning and saw promising possibilities for future releases!\n\n### Full training summary\n\n| Stage | Tokens | Steps | Purpose |\n|---|---|---|---|\n| Pre-training | About 4T | 620,000 | Build the base model on broad Nemotron pre-training data |\n| Mid-training | About 2T | 310,000 | Continue on a more code- and reasoning-heavy mixture |\n| Post-training | About 210M | 200 | Instruction-tune the released model |\n\n## Evaluation\n\nWe evaluated our final instruct model on **HumanEval+** and **MBPP+** using the official [evalplus](https://github.com/evalplus/evalplus?ref=blog.kog.ai) runner.\n\nEven though `evalplus` already provides a preprocessing step to ensure the correct piece of code is evaluated, we added one more step to make sure that only the relevant code block is preprocessed when possible. 
This drastically improves the score of models that have a tendency to produce multiple code blocks per response, reducing variance when using pass@N settings.\n\n🤔 Remark\n\nThis preprocessing step consists of making sure we select, if possible, only the \"code block containing the target function name\"; this is what we call `target_function`.\n\nHere is a summary of our evaluation system. All models were evaluated with the **exact same settings**.\n\n| Setting | Value |\n|---|---|\n| Mode | `greedy (temperature=0, do_sample=False)` |\n| Preprocessing | `target_function` block selection |\n| Runner | evalplus |\n\n### Second phase validation\n\nBeyond the base-model CORE metrics, the first verification we ran was MBPP+ and HumanEval+ on our Phase 1 and Phase 2 base models, each post-trained for coding and instruction.\n\nWe find that:\n\n- Phase 2 did improve our model's coding performance by more than 10 points across HumanEval+ tests.\n\n### Comparison with similar-size models\n\nFinally, using our `evalplus` setup, we evaluate well-known public models in our size range to ensure that our training recipe is efficient.\n\nWe find that:\n\n- Our model is extremely competitive for a model of its size on the different HumanEval benchmarks.\n\nThese results validate that the model is useful enough for small code-generation problems.\n\n🤔 Remark\n\n**Laneformer benefits from stochastic decoding**: pass@N improves consistently for N ∈ {2, 4, 8, 16}. 
As a single pass usually takes less than 0.3 seconds, sampling multiple candidate solutions becomes a practical way to trade a small amount of latency for higher coding accuracy.\n\n### Inference speed\n\nIn our public Kog Inference Engine (KIE) preview, Laneformer 2B reaches 3,000 output tokens/s/request on 8× AMD MI300X and 2,100 output tokens/s/request on 8× NVIDIA H200, using FP16, batch size 1, and no speculative decoding.\n\nTo our knowledge, these are the fastest publicly demonstrated single-request decoding results for a 2B-class model on standard datacenter GPUs.\n\n## Why open-source our 2B model?\n\nThis Hugging Face release lets you try the model as a Transformers-compatible model, read the custom architecture, run your own experiments, fine-tune it, or use it as a reference when thinking about latency-oriented model design.\n\nIt is both a usable checkpoint and a research artifact.\n\nWe believe open source is the best way to make this kind of systems and architecture work useful, inspectable, and extensible by the broader community.\n\nWe are sharing the weights, implementation, and recipe because we believe latency should be treated as a model design constraint, not only as a serving concern.\n\n### What we are releasing\n\n[kogai/laneformer‑2b‑it](https://huggingface.co/kogai/laneformer-2b-it?ref=blog.kog.ai) includes:\n\n- An instruction-tuned Laneformer 2B checkpoint in BF16.\n- A custom Hugging Face implementation for the `laneformer` architecture.\n- Model configuration, architecture metadata, tokenizer information, and chat template.\n- Evaluation results and documentation.\n\nKog-owned materials in the repository, i.e. model weights, custom Hugging Face code, configuration, metadata, documentation, and model card, are released under Apache License 2.0. 
Tokenizer artifacts are based on the Llama 2 tokenizer and are distributed under the Llama 2 Community License; users and redistributors are responsible for complying with those terms.\n\nTo use it properly, please refer to the Hugging Face repository model card.\n\n## Limitations\n\nLaneformer 2B is a small coding-focused model, not a general-purpose frontier model.\n\n- Its 4,096-token context length and sliding-window attention choices reflect our latency target. Long-context extension is currently in progress.\n- Our Phase 2 data mixture intentionally emphasizes code and reasoning, and CORE tracking suggests that this specialization comes at some cost to broad general capabilities.\n- The released Hugging Face checkpoint is useful as a standard model, but the main latency benefits require the Kog Inference Engine and the DTP-aware execution path.\n\n## Conclusion\n\nWe are releasing Laneformer 2B, a small language model with coding capabilities. In addition to the final instruct model checkpoint, we also share the full backstory and training recipe, including pre-training, mid-training, post-training, and evaluation details.\n\nWe hope this model will spark interest in a new kind of model architecture, co-designed with hardware in mind, not only to achieve high benchmark performance but also to reach high TPS at inference time!\n\n## Acknowledgments\n\nLaneformer 2B was trained and released with support from several partners and open ecosystems. 
We thank [Scaleway](https://www.scaleway.com/?ref=blog.kog.ai) and [Adastra](https://www.cines.fr/calcul/adastra/?ref=blog.kog.ai) for the French GPU infrastructure used during training, the [TorchTitan](https://github.com/pytorch/torchtitan?ref=blog.kog.ai) project for the distributed training foundation, [Hugging Face](https://huggingface.co/?ref=blog.kog.ai) for the release platform and Transformers ecosystem, [NVIDIA](https://www.nvidia.com/?ref=blog.kog.ai) for the Nemotron datasets and papers, and [evalplus](https://github.com/evalplus/evalplus?ref=blog.kog.ai) for their rigorous evaluation framework!\n\nA special shout-out to the global machine-learning community and open-source ecosystem, without which this work would never have been possible.\n\n## Explore these links to dig deeper\n\n- Test our speed in the [Kog Playground](https://playground.kog.ai/?ref=blog.kog.ai)\n- Kog Laneformer 2B HF model: [huggingface.co/kogai/laneformer‑2b‑it](https://huggingface.co/kogai/laneformer-2b-it?ref=blog.kog.ai)\n- Kog Inference Engine launch post: [blog.kog.ai](https://blog.kog.ai/real-time-llm-inference-on-standard-gpus-3-000-tokens-s-per-request/)\n- Kog monokernel post: [blog.kog.ai](https://blog.kog.ai/building-a-single-kernel-latency-optimized-llm-inference-engine-on-amd-mi300x-gpus/)\n- Delayed Tensor Parallelism post: [blog.kog.ai](https://blog.kog.ai/delayed-tensor-parallelism-for-faster-transformer-inference/)\n- Nemotron dataset: [huggingface.co/collections/nvidia/nemotron](https://huggingface.co/collections/nvidia/nemotron-pre-training-datasets?ref=blog.kog.ai)\n- Hugging Face Smol training playbook: [Smol training playbook](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook?ref=blog.kog.ai#training-compass-why--what--how)\n\n## Citation\n\nIf you use Laneformer 2B, please cite this blog post:\n\n```\n@online{kog_laneformer_2b_2026,\n  title   = {{Laneformer 2B: The Latency-First Model Behind Kog Inference Engine}},\n  author  = {Kog 
Team},\n  year    = {2026},\n  url     = {https://blog.kog.ai/kog-laneformer-2b-the-latency-first-model-behind-kog-inference-engine},\n  note    = {Kog blog}\n}\n```\n\n", "url": "https://wpnews.pro/news/kog-laneformer-2b-the-latency-first-model-behind-kog-inference-engine", "canonical_source": "https://blog.kog.ai/kog-laneformer-2b-the-latency-first-model-behind-kog-inference-engine/", "published_at": "2026-06-25 13:33:40+00:00", "updated_at": "2026-06-25 13:49:37.433677+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-startups", "ai-research", "ai-tools"], "entities": ["Kog", "Laneformer 2B", "Hugging Face", "Kog Inference Engine", "HumanEval+", "MBPP+"], "alternates": {"html": "https://wpnews.pro/news/kog-laneformer-2b-the-latency-first-model-behind-kog-inference-engine", "markdown": "https://wpnews.pro/news/kog-laneformer-2b-the-latency-first-model-behind-kog-inference-engine.md", "text": "https://wpnews.pro/news/kog-laneformer-2b-the-latency-first-model-behind-kog-inference-engine.txt", "jsonld": "https://wpnews.pro/news/kog-laneformer-2b-the-latency-first-model-behind-kog-inference-engine.jsonld"}}