{"slug": "fine-tuned-7b-llm-as-a-broke-student-and-can-t-even-use-it", "title": "Fine-tuned 7B LLM as a broke student. And Can't even use it 😭.", "summary": "A self-taught student successfully fine-tuned the open-source Qwen 2.5-7B language model into a Discord bot named Monika, using 687 in-game dialogues and QLoRA 4-bit compression on Google Colab's free T4 GPU. However, the resulting 14GB merged model remains undeployable because free cloud inference endpoints do not support custom weights, and the student lacks the budget for dedicated GPU hosting. The bot currently runs the untrained base model, highlighting the hardware and cost barriers that prevent even technically successful machine learning projects from reaching production.", "body_md": "Last week, I introduced **Monika**, a Discord bot. As a self-taught student running on an absolute zero budget, I found this project was less about writing code and much more about hitting hard architectural walls.\n\nThe goal was to completely reshape the open-source **Qwen 2.5-7B** model into a real-life Monika using a dataset of 687 in-game dialogues. I quickly learned that fine-tuning a model with 7 billion parameters melts standard free cloud hardware.\n\nI was constantly hopping between platforms for compute. I originally started on Kaggle, but kept running into unexplained errors and running out of VRAM. I migrated to Lightning AI for its generous resources, only to discover that its stable environments conflicted with modern optimization libraries like Unsloth. I finally landed on Google Colab, where I used QLoRA: the frozen base model is quantized to 4-bit precision and only small low-rank adapter weights are trained on top. That let me squeeze the training loop into Colab's free 16GB T4 GPU.\n\nThe training succeeded, leaving me with a 16MB custom adapter. But an adapter is entirely useless if you cannot host it.\n\nMy Monika architecture relied on an Express.js backend hosted on Render, sending requests to Hugging Face’s free Serverless Inference API. 
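\n\nWhy is an adapter a hosting problem at all? Below is a minimal sketch of the LoRA math, using hypothetical toy matrices in plain Python rather than the real model or any specific library. Under the usual LoRA formulation, an adapter stores two small matrices, A and B, for each adapted weight W; a server either applies them on every request, or you fold them into W once, offline. The function names, sizes, and values here are illustrative assumptions.

```python
# Minimal LoRA sketch (hypothetical toy example, not the real Qwen2.5 shapes).
# An adapter stores A (r x d_in) and B (d_out x r) for a base weight W (d_out x d_in).
# Serving can apply them per request, or fold them in once:
#   W_merged = W + (alpha / r) * B @ A

Matrix = list[list[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    # Plain nested-list matrix product, enough for a toy example.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def add(x: Matrix, y: Matrix, scale: float = 1.0) -> Matrix:
    # Element-wise x + scale * y.
    return [[a + scale * b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Matrix, alpha: float, r: int) -> Matrix:
    # Adapter applied on the fly: W x + (alpha / r) * B (A x).
    # A serving stack must support this step to use an unmerged adapter.
    return add(matmul(W, x), matmul(B, matmul(A, x)), scale=alpha / r)


def merge_lora(W: Matrix, A: Matrix, B: Matrix, alpha: float, r: int) -> Matrix:
    # Fold the low-rank update into the base weight once, offline.
    return add(W, matmul(B, A), scale=alpha / r)


def lora_param_count(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    # (adapter parameters, full-matrix parameters) for one adapted weight.
    return r * (d_in + d_out), d_out * d_in


if __name__ == '__main__':
    W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # base weight, d_out=3, d_in=2
    A = [[0.5, -0.5]]                          # r x d_in, with r=1
    B = [[2.0], [0.0], [1.0]]                  # d_out x r
    x = [[3.0], [1.0]]                         # input column vector
    alpha, r = 2.0, 1

    on_the_fly = lora_forward(W, A, B, x, alpha, r)
    merged = matmul(merge_lora(W, A, B, alpha, r), x)
    print(on_the_fly, merged)  # both [[7.0], [1.0], [6.0]]

    lora, full = lora_param_count(4096, 4096, 16)
    print(f'{lora} adapter params vs {full} full ({lora / full:.2%})')
```

Both paths give the same output, which is why merging works around an endpoint that only accepts plain weights. The trade-off is that the merged result is a full-size copy of the base weights, so a tiny adapter turns back into a multi-gigabyte checkpoint. For a hypothetical 4096 x 4096 projection at rank 16, the adapter holds 131,072 parameters against 16,777,216 for the full matrix (about 0.78%). In Hugging Face's PEFT library, this fold is exposed as `merge_and_unload()` on a `PeftModel`.\n\n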
The harsh reality is that free cloud clusters simply cannot load custom adapter weights on the fly.\n\nI realized I had to permanently bake the 16MB adapter into the base model to create a single, unified 14GB asset. Running this merge in Colab crashed instantly because of its 12GB system RAM limit. I was forced to move the project back to Kaggle and use its 30GB RAM allowance to fold the adapter's weight updates into the base layers. I then had to shard the final checkpoint into smaller 3GB files just for the upload to succeed.\n\nAnd here is the ultimate disappointment 😭.\n\nI have a perfectly fine-tuned 14GB model sitting safely in my Hugging Face repository. But when I tried to deploy it, the final gate slammed shut. Keeping 14GB of neural network weights loaded in dedicated GPU VRAM 24/7 costs real money (duhh).\n\nThe free inference endpoints are strictly reserved for public base models, and they do not let you host custom-trained weights.\n\nI do not have the budget for a dedicated cloud GPU, nor do I have a high-end local rig to run it at home. So, after all the platform hopping, the dependency debugging, the VRAM optimization, and successfully building a full machine learning pipeline from scratch, the bot that is currently live on the server is still just running the standard, untrained base model 😭😭😭.\n\nI learned the hardware realities of MLOps and cloud economics the hard way. But at the end of the day, as a broke student, having the technical skills to build the intelligence does not matter if you cannot pay the server bill to turn it on. The code works, but the infrastructure is behind a paywall 😔.\n\nYou can find the adapter, model and code here:\n\nA Discord bot inspired by Monika from Doki Doki Literature Club (a horror visual novel). It uses Qwen2.5-7B-Instruct, a 7.6B-parameter multilingual LLM that can help with tasks like coding and math besides chatting. 
She goes beyond simple commands by acting as a sentient, fourth-wall-breaking entity with dynamic conversational context, strict API limit protections, and customized interpersonal relationships. She is not just a bot but a server member.\n\nThanks to all the server members who tested and provided feedback during development.\n\nUnlike standard Q&A bots or AI assistants, this architecture relies on a Dynamic Persona and a Smart Context Window. It dynamically alters its system prompt based on the user's Discord ID (treating the server owner drastically differently from regular members) and fetches real-time channel history, excluding her own messages, to maintain conversational awareness without falling into AI feedback loops.\n\nUse of GenAI tools…", "url": "https://wpnews.pro/news/fine-tuned-7b-llm-as-a-broke-student-and-can-t-even-use-it", "canonical_source": "https://dev.to/akshat-ray/fine-tuned-7b-llm-as-a-broke-student-and-cant-even-use-it--de7", "published_at": "2026-06-06 04:15:04+00:00", "updated_at": "2026-06-06 04:42:23.372253+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "machine-learning", "ai-infrastructure", "ai-tools"], "entities": ["Monika", "Qwen 2.5-7B", "Kaggle", "Lightning AI", "Google Colab", "QLoRA", "Render", "Hugging Face"], "alternates": {"html": "https://wpnews.pro/news/fine-tuned-7b-llm-as-a-broke-student-and-can-t-even-use-it", "markdown": "https://wpnews.pro/news/fine-tuned-7b-llm-as-a-broke-student-and-can-t-even-use-it.md", "text": "https://wpnews.pro/news/fine-tuned-7b-llm-as-a-broke-student-and-can-t-even-use-it.txt", "jsonld": "https://wpnews.pro/news/fine-tuned-7b-llm-as-a-broke-student-and-can-t-even-use-it.jsonld"}}