Running Local LLM - 0$ Personal Agentic AI Assistant - Part 3

A developer building a free personal AI assistant on Oracle Cloud's ARM instances found that only smaller language models are practical for CPU-only inference. Testing showed that a 3-billion-parameter model delivers acceptable 4-7 second response times for Telegram messages, while larger models such as a 70B-parameter model require about 40GB of RAM and cannot run on the free tier. The project recommends using Ollama with the llama3.2:3b model and warns that partial downloads can fill disk space without appearing in the model list.

Part 3 of the Zero Dollar personal AI Assistant series: Running Local LLMs on a Free Cloud Server, What Actually Works. Part 1 covers the architecture. Part 2 covers free Oracle Cloud setup.

Running a language model locally sounds straightforward until you try it. Download a model, point your app at it, done. In practice, there are real constraints: RAM limits, disk-space surprises, and CPU inference-speed walls that most tutorials gloss over.

This article is honest about all of it: what works on a free Oracle ARM instance, what doesn't, and how a hybrid of local inference and a free API fallback makes the whole thing practical.

Before picking a model, understand what you're getting into.

Your Oracle ARM instance has no GPU. Every token a language model generates is computed on CPU cores. This matters because modern LLMs were designed to run on GPUs, whose massively parallel architecture is what makes inference fast. On a CPU, that parallelism doesn't exist in the same way.

What this means in practice:

| Model size | RAM needed | Tokens/sec on 4 ARM CPUs | Response time for 100 tokens |
|---|---|---|---|
| 3B parameters | ~2GB | 15-25 tok/s | 4-7 seconds |
| 8B parameters | ~5GB | 5-10 tok/s | 10-20 seconds |
| 14B parameters | ~9GB | 2-5 tok/s | 20-50 seconds |
| 70B parameters | ~40GB | Won't fit | n/a |

For a personal assistant responding to Telegram messages, 4-7 seconds for a short response is acceptable.
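
The response-time column in the table is just reply length divided by throughput. As a rough sketch of that arithmetic (the `estimate_seconds` helper is hypothetical, not part of Ollama, and it ignores prompt-processing time, so real replies will be somewhat slower):

```shell
# Estimate generation time on CPU: seconds = tokens / tokens-per-second.
# estimate_seconds is an illustrative helper, not an Ollama command.
estimate_seconds() {
  local tokens="$1" tok_per_sec="$2"
  # LC_ALL=C keeps the decimal separator a dot regardless of system locale.
  LC_ALL=C awk -v t="$tokens" -v r="$tok_per_sec" 'BEGIN { printf "%.1f\n", t / r }'
}

# Reproduce the 3B row of the table for a 100-token reply.
echo "3B, fast end: $(estimate_seconds 100 25) s"   # 4.0 s
echo "3B, slow end: $(estimate_seconds 100 15) s"   # 6.7 s
```

Plugging in the 8B figures (5-10 tok/s) gives the 10-20 second range from the table, which is why model size matters so much before any agent overhead is added.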
The usage pattern is asynchronous: you send a message, put your phone down, and pick it up again when the reply arrives. It's a different mental model from a real-time chat UI, but workable.

The mistake to avoid: pulling a 70B model because it benchmarks well. It needs 40GB of RAM at minimum and simply won't run on your instance. I learned this the hard way: a partial 42GB download filled the disk before the model even ran.

Ollama is the runtime that downloads and runs open-source models locally. Think of it as the music player; the models are the music it plays.

Always use tmux before long-running commands:

    sudo apt install tmux -y
    tmux new -s setup

If your SSH session drops mid-install, reconnect and run `tmux attach -t setup` to pick up exactly where you left off. Skipping tmux for a large model download is how you end up restarting from scratch.

Install Ollama:

    curl -fsSL https://ollama.com/install.sh | sh

Verify it's running:

    systemctl status ollama
    ollama --version

Ollama installs as a systemd service and starts automatically on boot, so no manual management is needed.

This is where most guides give you a benchmark table and call it done. What actually matters for your use case is the RAM-to-quality tradeoff on CPU hardware.

The models that make sense for this stack:

    ollama pull llama3.2:3b
    ollama pull llama3.1:8b
    ollama pull phi4

The recommendation for this stack: llama3.2:3b

Not because it's the best model; it isn't. It's because OpenClaw's agent mode wraps every model call with tool context, memory, session history, and system prompts. What feels fast in a bare `ollama run` test becomes significantly slower when the agent layer adds 2-3KB of context to every request.

With that overhead, the 3B model stays within acceptable response times. The 8B model starts hitting timeout issues in agent mode on CPU.

If you want better quality and can accept 30-90 second response times for complex queries, llama3.1:8b is worth trying.

Model files are large. Managing disk space proactively saves painful cleanup sessions later.
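
One way to be proactive is to check free space before starting a pull rather than after the disk fills. The sketch below assumes a hypothetical `has_room_for` helper (it is not an Ollama feature), and the headroom figure in the usage line is a guess, not a measured model size:

```shell
# Hypothetical pre-flight check: succeed only if the filesystem holding
# the given path has at least the requested number of gigabytes free.
has_room_for() {
  local needed_gb="$1" path="${2:-/}"
  local free_kb
  # df -P prints portable one-line-per-filesystem output; with -k,
  # column 4 is the available space in kilobytes.
  free_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
  [ "$free_kb" -ge $(( needed_gb * 1024 * 1024 )) ]
}

# Usage sketch (6 GB of headroom is an assumed figure):
# has_room_for 6 /usr/share/ollama && ollama pull llama3.1:8b
```

Chaining the check with `&&` means the pull never starts when space is short, which avoids leaving a partial download behind.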
Check your current disk usage:

    df -h
    du -sh /usr/share/ollama/.ollama/models/

List downloaded models:

    ollama list

Remove a model you no longer need:

    ollama rm