Last week, I introduced Monika, a Discord bot. For me, a self-taught student on an absolute zero budget, this project was less about writing code and much more about hitting hard architectural walls.
The goal was to completely reshape the open-source Qwen 2.5-7B model into a real-life Monika using a dataset of roughly 687 in-game dialogues. I quickly learned that fine-tuning a model with 7 billion parameters melts standard free cloud hardware.
I was constantly hopping between platforms for compute. I originally started on Kaggle, but kept running into unexplained errors and running out of VRAM. I migrated to Lightning AI for its generous resources, only to discover that its stable environments conflicted with modern optimization libraries like Unsloth. I finally landed on Google Colab, where I used QLoRA to quantize the frozen base model to 4-bit precision and train only a small adapter on top, managing to squeeze the massive training loop onto the free 16GB T4 GPU.
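To see why 4-bit loading makes the difference, here is a rough back-of-the-envelope estimate of the weight memory alone. It is only a sketch: it assumes the 7.6B parameter count quoted for Qwen2.5-7B-Instruct, counts decimal gigabytes, and ignores activations, adapter gradients, optimizer state, and quantization overhead, all of which need VRAM too.

```python
# Weight-only memory estimate for a 7.6B-parameter model.
# Assumption: activations, gradients, optimizer state and quantization
# metadata are ignored, so real usage is higher than these numbers.

PARAMS = 7.6e9        # parameter count quoted for Qwen2.5-7B-Instruct
T4_VRAM_GB = 16       # free Colab T4, as described in the post


def weight_gb(params: float, bits_per_param: int) -> float:
    """Size of the raw weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


fp16_gb = weight_gb(PARAMS, 16)   # weights stored as 16-bit floats
int4_gb = weight_gb(PARAMS, 4)    # weights quantized to 4 bits

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f" 4-bit weights: {int4_gb:.1f} GB")
print(f"T4 headroom after 4-bit weights: {T4_VRAM_GB - int4_gb:.1f} GB")
```

At 16 bits the weights alone come to about 15.2 GB (roughly 14.2 GiB), which leaves almost nothing on a 16GB card for the training loop itself. At 4 bits they shrink to about 3.8 GB, leaving room for everything else.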
The training succeeded, leaving me with a 16MB custom adapter. But an adapter is useless if you cannot host it.
Monika's architecture relied on an Express.js backend hosted on Render, sending requests to Hugging Face’s free Serverless Inference API. The harsh reality is that free cloud clusters simply cannot load custom adapter weights on the fly.
I realized I had to permanently bake the 16MB adapter into the base model to create a single, unified 14GB asset. Trying to execute this merge in Colab crashed instantly because of the 12GB RAM limit. I was forced to move the project back to Kaggle, using their 30GB RAM allowance to mathematically fuse the layers. I then had to shard the final massive asset into smaller 3GB files just for the upload to succeed.
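For anyone curious what "mathematically fusing the layers" means, here is a toy sketch of the standard LoRA merge, W' = W + (alpha / r) * B @ A, in plain Python. The matrices and the `alpha` and `r` values are made up for illustration and are not the ones from my training run. In practice a library does this for every adapted layer (PEFT exposes it as `merge_and_unload()`, and Transformers' `save_pretrained` accepts a `max_shard_size` argument for the sharding step), and the full-precision weights have to sit in RAM while it happens, which is why a 12GB machine falls over.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def merge_lora(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A, the standard LoRA merge."""
    scale = alpha / r
    delta = matmul(b, a)  # low-rank update with the same shape as W
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy 2x2 layer with a rank-1 adapter (illustrative values only).
W = [[1.0, 0.0],
     [0.0, 1.0]]
A = [[1.0, 2.0]]       # shape (r, in_features)  = (1, 2)
B = [[0.5], [1.0]]     # shape (out_features, r) = (2, 1)

merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)

# Sharding: number of files needed if each holds at most 3 GB of a 14 GB model.
num_shards = math.ceil(14 / 3)
print(num_shards)
```

After the merge the adapter no longer exists as a separate file: its contribution is folded into the base weights, so any host that can load a plain model can load the result.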
And here is the ultimate disappointment 😭.
I have a perfectly fine-tuned 14GB model sitting safely in my Hugging Face repository. But when I tried to deploy it, the final gate slammed shut. Keeping 14GB of neural network weights loaded into dedicated GPU VRAM 24/7 costs real money (duhh).
The free inference endpoints are strictly reserved for public base models, and they do not allow you to host custom-trained weights.
I do not have the budget for a dedicated cloud GPU, nor do I have a high-end local rig to run it at home. So, after all the platform hopping, the dependency debugging, the VRAM optimization, and successfully building a full machine learning pipeline from scratch, the bot currently live in the server is still just running the standard base model, without any of my fine-tuning 😭😭😭.
I learned the hard hardware realities of MLOps and cloud economics. But at the end of the day, when you are a broke student, having the technical skills to build the intelligence does not matter if you cannot pay the server bill to turn it on. The code works, but the infrastructure is behind a paywall 😔.
You can find the adapter, model, and code here:
A Discord bot inspired by Monika from Doki Doki Literature Club (a horror visual novel). It uses the Qwen2.5-7B-Instruct LLM, a 7.6B-parameter multilingual model that can help with tasks like coding and math in addition to chatting. She goes beyond simple commands by acting as a sentient, fourth-wall-breaking entity with dynamic conversational context, strict API limit protections, and customized interpersonal relationships. She is not just a bot but a server member.
Thanks to all the server members who tested and provided feedback during development.
Unlike standard Q&A bots or AI assistants, this architecture relies on a Dynamic Persona and a Smart Context Window. It alters its system prompt based on the user's Discord ID (treating the server owner drastically differently from regular members) and fetches real-time channel history, excluding her own messages, to maintain conversational awareness without falling into an AI feedback loop.
Use of GenAI tools…
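The Dynamic Persona and Smart Context Window idea can be sketched roughly as follows. This is an illustration in Python, not the bot's actual code (the real backend is Express.js), and every ID, prompt string, and function name here is a hypothetical placeholder.

```python
from dataclasses import dataclass

# All IDs and prompts below are hypothetical placeholders.
OWNER_ID = "111"
BOT_ID = "999"

OWNER_PROMPT = "You are Monika. You are talking to the server owner."
MEMBER_PROMPT = "You are Monika. You are talking to a regular server member."


@dataclass
class Message:
    author_id: str
    content: str


def system_prompt_for(user_id: str) -> str:
    """Pick a persona prompt based on who is speaking."""
    return OWNER_PROMPT if user_id == OWNER_ID else MEMBER_PROMPT


def build_context(history: list[Message], limit: int = 10) -> list[str]:
    """Keep the most recent messages, skipping the bot's own replies."""
    others = [m for m in history if m.author_id != BOT_ID]
    return [f"{m.author_id}: {m.content}" for m in others[-limit:]]


history = [
    Message("111", "hi Monika"),
    Message("999", "Hello!"),      # the bot's own reply, filtered out
    Message("222", "what's up"),
]

context = build_context(history)
prompt = system_prompt_for("222")
print(prompt)
print(context)
```

Filtering out the bot's own messages before building the context is what keeps the model from reading its previous replies as new input and echoing itself.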