Train your own LLM? Here's what happens

wpnews.pro

Large Language Models feel almost effortless to use today. You send a prompt, get an answer, and move on. But behind that simplicity lies a process that is anything but trivial – and often underestimated.

In conversations with colleagues and customers, one question kept coming up: “What does it actually take to build an LLM?” Not use one. Not fine-tune one. But build one end-to-end.

So I decided to find out.

Instead of relying on cloud-scale infrastructure, I took a more constrained (and realistic) approach: build a small LLM on a single notebook, understand every step of the process, and push it all the way to production-like usage. The goal wasn’t to compete with state-of-the-art models – it was to make the mechanics visible. And that’s where things get interesting.

Because training a model is only half the story. The real question is: how do you actually use it in a data-driven environment? In this case, directly where the data lives in the database. In this article, I’ll walk through the full journey:

What it takes to train a simple LLM (and where it fails),
Why fine-tuning beats starting from scratch in most cases,
And how to run inference directly inside the Exasol database, enabling parallel processing of large text datasets without moving data around.

This is not a production blueprint. It’s a hands-on exploration of how LLMs really work – from raw data to SQL query.

If you’ve ever wondered what’s behind the curtain of modern AI – and how it connects to real data systems – this is for you.

Engineering the Training Pipeline

Before training the LLM, there’s a more practical question to solve: how do you actually run the process? You need scripts – or ideally a full application – that orchestrate everything from data ingestion to deployment.

Here’s where things get interesting. Instead of writing the application from scratch, I took a different route: I asked an AI to build it for me. This approach is often called vibe coding. In simple terms, rather than focusing on syntax, you focus on intent and iterate from there.

Over multiple sessions, I guided an AI coding assistant to develop a complete training application. The goal was ambitious: cover the entire lifecycle of an LLM, including dataset handling, training, evaluation, and deployment. For the runtime environment, I chose the Apple ecosystem, specifically a machine powered by an Apple Silicon chip. The integrated GPU speeds up local model training considerably, making it a practical choice for experimentation. The coding was done in Xcode, Apple’s development environment, which integrates with Anthropic’s Claude Code and OpenAI Codex, so the entire development was handled in one place.

The resulting application serves as a control center for the entire LLM pipeline, enabling users to define a neural network architecture or choose a pre-trained model, load and preprocess training datasets, train the model with configurable parameters, evaluate its performance through checkpoints or intermediate states, deploy the trained model, and run inference directly within an Exasol environment.

Interestingly, building this application required relatively little time compared to the effort required to develop and train the models themselves, highlighting how AI-assisted development is reshaping modern software engineering practices.

The screenshot below captures the application during an active training phase, presenting real-time log output, interval-based metrics collected between model checkpoints, GPU utilization data, and visualizations that track training progress and convergence behavior.

Framing the Problem Space

Defining the use case means focusing on the system’s practical objective: building a large language model that can answer questions based on previously unseen text. In technical terms, this involves enabling generalization over unstructured data so the model can interpret and extract meaning without requiring task-specific retraining for each new input. From a business perspective, this translates into faster, more consistent access to insights hidden within large volumes of information.

A clear example is a portfolio manager operating in financial markets, whose daily workflow often involves reviewing hundreds or even thousands of documents, such as earnings reports, regulatory disclosures, press releases, and analyst briefings. While the underlying questions tend to remain straightforward, identifying the relevant answers across such a fragmented and high-volume information landscape is time-consuming and error-prone.

An LLM designed for question answering addresses this challenge by automatically processing unstructured text and surfacing the key insights, effectively reducing manual effort while improving speed and scalability. Because the model can handle previously unseen documents without additional retraining, it becomes a flexible tool that supports both technical workflows and business decision-making in dynamic, data-intensive environments.

Training Begins: From Setup to Signals

In earlier experiments within Exasol’s AI Lab, the workflow relied on pre-trained models from Hugging Face, providing strong baseline performance with minimal setup. In this iteration, the focus shifts to building a model from scratch to better understand the full training lifecycle of a large language model and identify where complexity, resource constraints, and scaling challenges arise during implementation.

Model validation combines standardized benchmarks with custom-generated data. A key reference point is the Stanford Question Answering Dataset (SQuAD) [1], which frames the task as extractive question answering over contextual passages. For example, a passage describing the Apollo program requires the model to infer the year of the first manned mission from within the text rather than relying on memorized facts. To test generalization beyond curated datasets, additional evaluation data includes self-authored content, such as simplified narratives about Walt Disney characters, which introduces more variability and less predictable linguistic structure.
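To make the extractive setup concrete, here is a minimal sketch of a SQuAD-style record and an exact-match check. The sample passage, the field names, and the helper functions are illustrative assumptions, not the data or the evaluation code used in this project; the normalization follows the common SQuAD convention of ignoring case, punctuation, and English articles.

```python
import re
import string

# Hypothetical record in the SQuAD style: a context passage, a question,
# and one or more acceptable answer strings taken from the passage.
example = {
    "context": "Apollo 7, the first crewed Apollo mission, flew in October 1968.",
    "question": "In which year did the first crewed Apollo mission fly?",
    "answers": ["1968", "October 1968"],
}


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized reference."""
    return normalize(prediction) in {normalize(a) for a in answers}


def is_extractive(prediction: str, context: str) -> bool:
    """True if the normalized prediction occurs inside the normalized context."""
    return normalize(prediction) in normalize(context)
```

A check like `is_extractive` is one simple way to tell whether an answer is grounded in the passage or comes from somewhere else, which is the property the Apollo example is meant to test.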

With the pipeline in place and evaluation defined, the system moves into training, where practical challenges quickly emerge. The initial attempt starts from a randomly initialized neural network, meaning all weights are assigned without prior knowledge. Training is performed on an extremely limited dataset consisting of only a few short text samples. From a technical standpoint, the process behaves as expected: the training loop executes correctly, loss values decrease, and inference runs without errors. However, the generated output consists entirely of incoherent text, demonstrating that a functioning pipeline alone is insufficient without adequate data. The model fails to learn any meaningful linguistic structure because the dataset is too small to capture patterns in grammar, semantics, or context. This confirms that the limitation lies not in the architecture or implementation, but in the scale of the training data.
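It helps to see what the loss actually measures here. The sketch below computes the next-token cross-entropy from raw logits in plain Python; the toy vocabulary size and the function names are assumptions for illustration, and a real training loop would use a framework implementation instead. A model with roughly uniform predictions starts near ln(vocabulary size), and on a handful of samples the loss can fall simply because the model fits those few sequences, which says little about general language ability.

```python
import math


def cross_entropy(logits: list[float], target: int) -> float:
    """Negative log-probability of the target token under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


def mean_loss(batch_logits: list[list[float]], targets: list[int]) -> float:
    """Average next-token loss over a sequence of positions."""
    losses = [cross_entropy(row, t) for row, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)


# Uniform predictions over the vocabulary give a loss of ln(vocab_size).
vocab_size = 8  # toy value chosen for illustration
uniform = [[0.0] * vocab_size for _ in range(3)]
baseline = mean_loss(uniform, [1, 4, 7])

# A confident, correct prediction drives the loss towards zero.
confident = cross_entropy([10.0, 0.0, 0.0], target=0)
```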

The next iteration addresses this constraint by significantly increasing the dataset size through an extended data ingestion pipeline that can retrieve larger corpora from multiple sources, including Hugging Face. This introduces new challenges related to data handling, as large datasets must be divided into manageable chunks to support efficient batch processing. Preprocessing becomes a non-trivial, time-intensive step, which is mitigated by caching transformed data for reuse across training sessions, particularly when resuming from checkpoints.
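The chunking and caching steps can be sketched as follows. This is a simplified illustration under stated assumptions: the function names, the JSON cache format, and the hash-based cache key are hypothetical and not taken from the training application, and the tokenizer is left out so that the example works on plain lists of token IDs.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def chunk_tokens(token_ids: list[int], block_size: int) -> list[list[int]]:
    """Split a long token stream into full blocks; a short remainder is dropped."""
    usable = len(token_ids) - len(token_ids) % block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]


def cached_chunks(token_ids: list[int], block_size: int, cache_dir: Path) -> list[list[int]]:
    """Return chunks from the cache if present, otherwise compute and store them."""
    # The key covers both the data and the setting that changes the result.
    payload = json.dumps({"ids": token_ids, "block": block_size}).encode()
    key = hashlib.sha256(payload).hexdigest()
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    chunks = chunk_tokens(token_ids, block_size)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(chunks))
    return chunks


cache_dir = Path(tempfile.mkdtemp())  # throwaway directory for the demo
first = cached_chunks(list(range(10)), block_size=4, cache_dir=cache_dir)
second = cached_chunks(list(range(10)), block_size=4, cache_dir=cache_dir)  # cache hit
```

Hashing the full token stream is acceptable for a demo; for large corpora a pipeline would more likely key the cache on the source file and preprocessing settings so that the lookup itself stays cheap.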

With more data, the model begins to produce structurally valid text. Sentences become grammatically consistent, punctuation stabilizes, and outputs are generally readable. Despite these improvements, the semantic quality remains limited, with responses often lacking precision and consistency. This highlights that while increased data volume improves surface-level fluency, achieving meaningful performance also depends on sufficient training time and computational resources.

At this stage, the key constraint shifts from data availability to resource investment. Training a high-quality model from scratch is computationally expensive and time-consuming, especially in constrained environments. Since the system already supports flexible data ingestion from benchmarks, external repositories, and local sources, the primary bottleneck is no longer data access but the time required to achieve acceptable model performance. This leads to a strategic decision between continuing full-scale training from scratch, which offers maximum control but demands significant resources, and transitioning to fine-tuning an existing model, which reduces time to results while introducing trade-offs in flexibility and customization.

Where It Gets Real

At this stage, it became clear that continuing with full training from scratch was not the most effective approach for this scenario. While earlier experiments demonstrated that building a large language model end-to-end is technically feasible, the associated cost in terms of time and computational resources is substantial. This is particularly evident when working with larger datasets and more complex architectures on consumer-grade hardware. Even a high-performance system such as an Apple MacBook Pro equipped with an Apple M4 Max reaches practical limits. Although such hardware performs well for inference workloads, including running large parameter models locally, training remains significantly more resource-intensive and quickly becomes impractical for iterative development.

A more viable approach is to leverage a pre-trained model as a starting point. In this case, the choice fell on GPT-2, a widely used model developed by OpenAI. GPT-2 was trained on a large corpus of text data and is capable of handling a range of natural language processing tasks, including text generation and zero-shot question answering. However, when applied directly to the specific question-answering use case, the model exhibited several limitations. The generated responses were often inconsistent, occasionally contradictory, and not reliably grounded in the provided context – as seen in the screenshot below. This outcome highlights an important distinction: strong general-purpose capabilities do not automatically translate into task-specific performance.

Rather than discarding the model, the focus shifts to fine-tuning, the process of adapting a pre-trained model to a specific task. Technically, this resembles standard training, but the dynamics of optimization differ significantly. A key parameter in this process is the learning rate, which controls how much the model’s weights are updated during each training step. Unlike training from scratch, where higher learning rates help accelerate convergence from random initialization, fine-tuning requires much smaller adjustments. If the learning rate is too high, the model risks overwriting the knowledge acquired during pre-training, effectively degrading performance rather than refining it. By using a reduced learning rate, the model retains its general language understanding while gradually aligning with the target task.
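A toy calculation shows why the learning rate matters so much at this point. The sketch below applies a single plain SGD update to a small, made-up weight vector and measures how far the weights move away from their pre-trained values. The weights, gradients, and both learning rates are illustrative assumptions, not values from this project, and real fine-tuning typically uses an adaptive optimizer rather than plain SGD.

```python
import math


def sgd_step(weights: list[float], grads: list[float], lr: float) -> list[float]:
    """One plain SGD update: w <- w - lr * g."""
    return [w - lr * g for w, g in zip(weights, grads)]


def drift(before: list[float], after: list[float]) -> float:
    """Euclidean distance between two weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(before, after)))


# Hypothetical pre-trained weights and a gradient from one fine-tuning batch.
pretrained = [0.8, -1.2, 0.3]
grads = [2.0, -1.0, 0.5]

# Illustrative learning rates, not the values used in the article.
small = sgd_step(pretrained, grads, lr=5e-5)
large = sgd_step(pretrained, grads, lr=5e-1)

drift_small = drift(pretrained, small)  # weights barely move
drift_large = drift(pretrained, large)  # weights move far from the pre-trained state
```

With plain SGD the drift after one step is the learning rate times the gradient norm, so the larger rate moves the weights ten thousand times further in a single update. Repeated over many steps, that is how pre-trained knowledge gets overwritten.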

This shift from full training to fine-tuning represents a move from computationally expensive, brute-force learning toward more targeted optimization. In practice, it significantly reduces resource requirements while still achieving meaningful improvements in task performance. After several thousand training steps and iterative validation during the process, the resulting model reached a level of quality that, while not perfect, is sufficient for the intended use case.

With a fine-tuned model available, the next step is deployment. In this setup, the model is integrated directly into the Exasol database environment. The training application automates deployment by generating a Python-based user-defined function that encapsulates the inference logic. This allows the model to be executed close to the data, eliminating the need for data movement and enabling seamless integration into existing analytical workflows. The required runtime support is provided through a script language container with Transformers compatibility, as available in Exasol’s AI-Lab [2]. Notably, this deployment mechanism aligns with earlier stages of the workflow, where the base model was integrated using the same approach, making it straightforward to replace or upgrade models without altering downstream consumers.

Evaluation of the fine-tuned model on the same context–question pairs used previously shows a clear improvement. The model now produces largely correct, consistent, and contextually grounded answers. Compared to both the from-scratch training attempts and the baseline pre-trained model, the difference in output quality is significant. There remains a small performance gap, as one test case still produces an incorrect answer (see the screenshot below). This is most likely due to insufficient training rather than a fundamental limitation of the model architecture. The overall behavior indicates that the model generalizes well and does not exhibit signs of overfitting, suggesting that additional training iterations or expanded dataset coverage would further improve results without compromising generalization.

From Model to Production: Deploying LLM Inference Inside Exasol

This deployment pattern is consistent with earlier stages of the workflow, where even the base, non-fine-tuned model was executed using the same user-defined function mechanism. Maintaining this consistency simplifies model lifecycle management, as different model versions can be swapped without modifying how they are consumed. Once the deployment pipeline is established, improvements to the model become an iterative process that does not require changes to downstream integration.

Evaluating the fine-tuned model on the same context–question pairs used in previous experiments shows a clear and measurable improvement in performance. The model now produces answers that are largely correct, consistent, and grounded in the provided context, representing a significant step forward compared to both the from-scratch training attempts and the baseline pre-trained model. There is, however, a remaining edge case where one of the questions is answered incorrectly. This behavior is most likely due to insufficient training rather than a limitation of the model architecture, suggesting the model has not yet fully captured all relevant patterns in the dataset.

The overall performance suggests that the model generalizes well across examples and does not exhibit typical signs of overfitting, such as memorizing specific inputs at the expense of broader applicability. Instead, the remaining gap appears incremental and can likely be addressed through extended training or increased dataset coverage. Fine-tuning has therefore shifted the model from unreliable outputs to largely correct, usable results, with further improvements achievable through continued iteration.

At this stage, the process shifts from experimentation to optimization: refining the model until it consistently meets the required level of accuracy.

The Python-based UDF used to run inference with both models during testing is:

CREATE OR REPLACE PYTHON3_TE SCALAR SCRIPT "EXASOL_DIB"."GPT2_INFERENCE" ("text" VARCHAR(10000) UTF8) RETURNS VARCHAR(10000) UTF8 AS
import re
from pathlib import Path

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_PATH = Path('/buckets/bfsailab/xxxxxx/ai-lab/models/xxxxxx/dirkllm_gpt2_question-answering')

# Load once per UDF instance rather than once per row: module-level code
# runs a single time, while run() is called for every input row.
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)
model.eval()  # inference mode: disables dropout

def run(ctx):
    if ctx.text is None:  # pass SQL NULL through unchanged
        return None
    inputs = tokenizer(ctx.text, return_tensors='pt')
    with torch.no_grad():  # no gradients are needed for generation
        out = model.generate(**inputs, max_new_tokens=128,
                             pad_token_id=tokenizer.eos_token_id)
    decoded = tokenizer.decode(out[0], skip_special_tokens=True)
    # Take the text after "answer" up to the first newline; otherwise return everything.
    m = re.search(r'(?i)answer\W*(.*?)(?=\n+|$)', decoded)
    return m.group(1).strip() if m else decoded

Results in Action: What the System Actually Delivers

A subtle but important improvement after fine-tuning is the structure of the generated output. The model now consistently produces a compact answer followed by newline delimiters, effectively behaving like an implicit stop signal that separates the final result from any residual text. This behavior was not present in the base model, which tended to produce more verbose, less deterministic continuations.

From an implementation perspective, this makes post-processing straightforward. A simple regex-based extraction is sufficient to isolate the answer segment, without requiring heuristic parsing or complex output validation logic. You can see the simple logic at the end of the Python function above. This reduces integration complexity significantly and improves reliability in production pipelines.
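The extraction step can be tried outside the database. The sketch below reuses the pattern from the UDF above; the sample output and its "Context / Question / Answer" prompt layout are assumptions for illustration, since the exact prompt format is not shown in this article.

```python
import re

# Same pattern as in the UDF: find "answer", skip separators such as ": ",
# then capture lazily up to the first newline or the end of the string.
ANSWER_PATTERN = re.compile(r'(?i)answer\W*(.*?)(?=\n+|$)')


def extract_answer(decoded: str) -> str:
    """Return the text after the answer marker, or the full string as a fallback."""
    match = ANSWER_PATTERN.search(decoded)
    return match.group(1).strip() if match else decoded


# Hypothetical decoded output; the prompt layout is an assumption.
decoded = (
    "Context: Mickey lives in a small house with his dog Pluto.\n"
    "Question: Who lives with Mickey?\n"
    "Answer: Pluto\n\n"
    "Question: Where does"
)

answer = extract_answer(decoded)
```

Two properties of the pattern are worth keeping in mind: it uses the first occurrence of the word "answer", so a context that contains that word would shift the match, and because `\W*` also consumes newlines, an empty answer line would cause the following line to be captured instead.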

At the system level, deployment within Exasol provides an additional scalability advantage. When the model is exposed via a user-defined function, execution is automatically distributed across database nodes. Inference runs directly where the data is partitioned, enabling native parallelism without any orchestration layer or application-side coordination.

This architecture removes cross-system data movement entirely and allows inference to scale linearly with the cluster. For workloads involving large document sets, this design effectively turns the LLM into a distributed compute primitive embedded in the SQL execution engine, which is a major advantage over external inference services.
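From the consumer side, using the model is an ordinary SQL call. The sketch below builds such a query in Python; the UDF name is the one from the script above, while the schema, table, and column names are hypothetical placeholders. The resulting string can be sent through any SQL client or driver connected to the database.

```python
def quote(identifier: str) -> str:
    """Wrap an identifier in double quotes, doubling any embedded quotes."""
    return '"' + identifier.replace('"', '""') + '"'


def build_inference_query(schema: str, table: str, id_column: str, prompt_column: str) -> str:
    """Build a SELECT that calls the inference UDF once per row of the table."""
    return (
        f"SELECT {quote(id_column)}, "
        f'"EXASOL_DIB"."GPT2_INFERENCE"({quote(prompt_column)}) AS "ANSWER" '
        f"FROM {quote(schema)}.{quote(table)}"
    )


# Hypothetical schema, table, and column names for illustration only.
query = build_inference_query("MY_SCHEMA", "DOCUMENT_PROMPTS", "DOC_ID", "PROMPT")
```

Because the function is a scalar script, the database decides how the rows are spread across nodes; the query itself contains no parallelization logic.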

Final Thoughts

Reflecting on the overall project, the objective was not only to build a model but to understand the full lifecycle of LLM development, from data ingestion and training to deployment and in-database execution. The system that emerged is functional and demonstrates the pipeline’s feasibility, even though it does not represent a production-grade, general-purpose model.

A key enabler in this process was AI-assisted development, also known as vibe coding. It significantly accelerated implementation by supporting rapid prototyping, iterative refinement, and automated debugging. However, this comes with a trade-off: generated code introduces abstraction layers that are not always transparent, shifting part of the engineering effort into prompt design and iterative correction rather than explicit implementation.

From a practical standpoint, training an LLM from scratch is not a viable approach for most scenarios. The requirements for data volume, compute resources, and training time exceed what is reasonable outside dedicated infrastructure environments. In most real-world applications, starting from pre-trained models and applying fine-tuning or retrieval-augmented generation provides a far more efficient path.

In environments with sufficient compute capacity, more aggressive customization or full training may be justified, but those cases are exceptions rather than the norm. The more robust pattern is to build on existing models and extend them incrementally rather than attempting full reconstruction. Ultimately, the main outcome of this project is not the model itself, but the validation of an end-to-end architecture and a clearer understanding of where effort should be invested. The exact tooling is secondary; what matters is how effectively it is applied within the workflow.

Happy Exasoling!

source & further reading

exasol.com — original article