cd /news/large-language-models/fine-tuned-7b-llm-as-a-broke-student… · home topics large-language-models article
[ARTICLE · art-23137] src=dev.to pub= topic=large-language-models verified=true sentiment=↓ negative

Fine-tuned 7B LLM as a broke student. And Can't even use it 😭.

A self-taught student successfully fine-tuned the open-source Qwen 2.5-7B language model into a Discord bot named Monika, using 687 in-game dialogues and QLoRA 4-bit compression on Google Colab's free T4 GPU. However, the resulting 14GB merged model remains undepoyable because free cloud inference endpoints do not support custom weights, and the student lacks the budget for dedicated GPU hosting. The bot currently runs the untrained base model, highlighting the hardware and cost barriers that prevent even technically successful machine learning projects from reaching production.

read3 min publishedJun 6, 2026

Last week, I introduced Monika, a discord bot. As a self-taught student running on an absolute zero budget, this project was less about writing code and much more about hitting hard architectural walls.

The goal was to completely reshape open-source Qwen 2.5-7B model, into a real life Monika using a dataset of nearly 687 ingame dialogues. I quickly learned that finetuning a model with 7 billion parameters melts standard free cloud hardware.

I was constantly hopping for compute resources. I originally started on Kaggle, but kept running into unexplained errors and running out of VRAM. I migrated to Lightning AI for its generous resources, only to discover their stable environments conflicted with modern optimization libraries like Unsloth. I finally landed on Google Colab, where I utilized QLoRA to compress the model down to 4-bit precision, managing to squeeze the massive training loop into their free 16GB T4 GPU.

The training succeeded, leaving me with a 16-Megabyte custom adapter. But an adapter is entirely useless if you cannot host it.

My monika Architecture relied on an Express.js backend hosted on Render, sending requests to Hugging Face’s free Serverless Inference API. The harsh reality is that free cloud clusters simply cannot dynamically load custom adapter weights on the fly.

I realized I had to permanently bake the 16MB Adapter into the base model to create a single, unified 14GB asset. Trying to execute this merge in Colab instantly crashed due to the 12GB RAM limit. I was forced to move the project back to Kaggle, utilizing their 30GB RAM allowance to mathematically fuse the layers. I then had to shard the final massive asset into smaller 3GB files just for the upload to succeed.

And here is the ultimate disappointment 😭.

I have a perfectly fine-tuned 14GB model sitting safely on my Hugging Face repository. But when I tried to deploy it, the final gate slammed shut. Keeping 14GB of neural network weights loaded into dedicated GPU VRAM 24/7 costs real money (duhh).

The free inference endpoints are strictly reserved for public base models, and they do not allow you to host custom-trained weights.

I do not have the budget for a dedicated cloud GPU, nor do I have a high-end local rig to run it at home. So, after all the platform hopping, the dependency debugging, the VRAM optimization, and successfully building a full Machine Learning pipeline from scratch , the bot currently live in the server is still just running the standard, untrained base model 😭😭😭 .

I learned the absolute hardware realities of MLOps and cloud economics. But at the end of the day, as a broke student, having the technical skills to build the intelligence does not matter if you cannot pay the server bill to turn it on. The code works, but the infrastructure is behind a paywall 😔.

You can find the adapter, model and code here :

A Discord bot inspired by Monika from Doki Doki Literature Club (horror visual novel). Using Qwen2.5-7B-Instruct LLM, 7.6B Multilingual Model that can help with task like coding, math etc besides chatting. She goes beyond simple commands by acting as a sentient, fourth-wall-breaking entity with dynamic conversational context, strict API limit protections, and customized interpersonal relationships. she is not just a bot but a server member.

Thanks to all the server members who tested and provided feedback during development

Unlike standard Q&A bots or ai assistant, this architecture relies on a Dynamic Persona and Smart Context Window. It dynamically alters its system prompt based on the user's Discord ID (treating the server owner drastically different than regular members) and fetches real-time channel history excluding her own messages to maintain conversational awareness without falling into an AI feedback loops Use of GenAI tools…

── more in #large-language-models 4 stories · sorted by recency
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/fine-tuned-7b-llm-as…] indexed:0 read:3min 2026-06-06 ·