[ARTICLE · art-13515] src=dev.to pub= topic=large-language-models verified=true sentiment=· neutral

Running Local LLM - 0$ Personal Agentic AI Assistant - Part 3

A developer building a free personal AI assistant on Oracle Cloud's ARM instances found that only smaller language models are practical for CPU-only inference. Testing showed that a 3-billion-parameter model delivers acceptable 4-7 second response times for Telegram messages, while larger models like 70B parameters require 40GB of RAM and cannot run on the free tier. The project recommends using Ollama with the llama3.2:3b model and warns that partial downloads can fill disk space without appearing in the model list.

6 min read · published May 25, 2026

Part 3 of the Zero Dollar Personal AI Assistant series: Running Local LLMs on a Free Cloud Server, What Actually Works. Part 1 covers the architecture. Part 2 covers the free Oracle Cloud setup.

Running a language model locally sounds straightforward until you try it. Download a model, point your app at it, done. In practice, there are real constraints: RAM limits, disk-space surprises, and CPU inference-speed walls that most tutorials gloss over.

This article is honest about all of it: what works on a free Oracle ARM instance, what doesn't, and how a hybrid of local inference plus a free API fallback makes the whole thing practical.

Before picking a model, understand what you're getting into.

Your Oracle ARM instance has no GPU, so every token a language model generates is computed on CPU cores. This matters because modern LLMs are designed to run on GPUs, whose massively parallel architecture is what makes inference fast. A CPU offers only a fraction of that parallelism.

What this means in practice:

Model size       RAM needed   Tokens/sec on 4 ARM CPUs   Response time (100 tokens)
3B parameters    ~2GB         15-25 tok/s                4-7 seconds
8B parameters    ~5GB         5-10 tok/s                 10-20 seconds
14B parameters   ~9GB         2-5 tok/s                  20-50 seconds
70B parameters   ~40GB        Won't fit                  n/a
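
The response-time column is simply output length divided by generation speed. A minimal sketch of that arithmetic, assuming a hypothetical helper named estimate_seconds and ignoring model load and prompt-processing time, both of which add to the real figure:

```shell
#!/bin/sh
# estimate_seconds TOKENS TOK_PER_SEC
# Hypothetical helper: seconds needed to generate TOKENS at a steady
# TOK_PER_SEC. Model load and prompt processing are not included.
estimate_seconds() {
  awk -v tokens="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", tokens / rate }'
}

# 3B row of the table: 100 tokens at the fast and slow ends of 15-25 tok/s.
estimate_seconds 100 25   # prints 4.0
estimate_seconds 100 15   # prints 6.7
```

Plugging in the 8B and 14B ranges reproduces the other rows the same way.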

For a personal assistant responding to Telegram messages, 4-7 seconds for a short response is acceptable. You send a message, put your phone down, and pick it up again when the reply arrives. It's a different mental model from a real-time chat UI, but workable.

The mistake to avoid: pulling a 70B model because it benchmarks well. It needs 40GB RAM minimum and simply won't run on your instance. I learned this the hard way: a partial 42GB download filled the disk before the model even ran.
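
One way to avoid that mistake is to compare a model's RAM estimate against the machine before pulling anything. The sketch below is only an illustration: model_fits and total_ram_gb are hypothetical helper names, the RAM figures come from the table above, and the 4GB headroom is an arbitrary assumption.

```shell
#!/bin/sh
# total_ram_gb: total system memory in whole GB, rounded down,
# read from /proc/meminfo (Linux only).
total_ram_gb() {
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
}

# model_fits NEEDED_GB TOTAL_GB
# Succeeds only if the model's RAM estimate plus HEADROOM_GB
# (default 4, an assumption) fits into TOTAL_GB.
model_fits() {
  [ $(( $1 + ${HEADROOM_GB:-4} )) -le "$2" ]
}

# Check the table's RAM estimates against a 24GB instance.
for spec in 3B:2 8B:5 14B:9 70B:40; do
  name=${spec%%:*}
  need=${spec##*:}
  if model_fits "$need" 24; then
    echo "$name: fits"
  else
    echo "$name: won't fit"
  fi
done
```

On the server itself, model_fits 9 "$(total_ram_gb)" checks against the real memory size instead of a hard-coded 24.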

Ollama is the runtime that downloads and runs open-source models locally. Think of it as the music player; the models are the music it plays.

Always start a tmux session before running long commands:

sudo apt install tmux -y
tmux new -s setup

If your SSH session drops mid-install, reconnect and run tmux attach -t setup to pick up exactly where you left off. Skipping tmux for a large model download is how you end up restarting from scratch.

Install Ollama:

curl -fsSL https://ollama.com/install.sh | sh

Verify it's running:

systemctl status ollama
ollama --version

Ollama installs as a systemd service and starts automatically on boot, so no manual management is needed.

This is where most guides give you a benchmark table and call it done. What actually matters for your use case is the RAM-to-quality tradeoff on CPU hardware.

The models that make sense for this stack:

ollama pull llama3.2:3b
ollama pull llama3.1:8b
ollama pull phi4

The recommendation for this stack: llama3.2:3b

Not because it's the best model (it isn't), but because OpenClaw's agent mode wraps every model call with tool context, memory, session history, and system prompts. What feels fast in a bare ollama run test becomes significantly slower when the agent layer adds 2-3KB of context to every request. With that overhead, the 3B model stays within acceptable response times, while the 8B model starts hitting timeouts in agent mode on CPU.

If you want better quality and can accept 30-90 second response times for complex queries, llama3.1:8b is worth trying.

Model files are large. Managing disk space proactively saves painful cleanup sessions later.

Check your current disk usage:

df -h
du -sh /usr/share/ollama/.ollama/models/
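
The same df output can be turned into a guard that refuses to start a pull when the disk is nearly full. This is a sketch, not part of Ollama: free_gb and has_room are hypothetical helper names, and the 5GB figure in the example is an assumed download size.

```shell
#!/bin/sh
# free_gb PATH: whole GB available on the filesystem that holds PATH.
# df -P keeps each filesystem on one line; -k reports 1024-byte blocks.
free_gb() {
  df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# has_room PATH NEEDED_GB: succeeds if at least NEEDED_GB is free.
has_room() {
  [ "$(free_gb "$1")" -ge "$2" ]
}

# Example: only pull when about 5GB is free where the models live.
# has_room /usr/share/ollama/.ollama/models 5 && ollama pull llama3.1:8b
```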

List downloaded models:

ollama list

Remove a model you no longer need:

ollama rm <modelname>

The gotcha with partial downloads:

If a download fails or you cancel it, Ollama leaves a partial file in the blobs directory. These can be gigabytes in size and won't show up in ollama list. Check and clean manually:

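sudo systemctl stop ollama

# Warning: this removes every blob, including fully downloaded models, so re-pull what you need afterwards.
# sh -c makes the * expand as the ollama user, who can always read the directory.
sudo -u ollama sh -c 'rm -rf /usr/share/ollama/.ollama/models/blobs/*'

sudo systemctl start ollama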

If the disk fills up and growpart fails with "no space left on device", you need to free space before the partition can be extended, because even growing the volume requires some temporary space. Remove partial downloads first, then retry growpart.
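
Because partial downloads never appear in ollama list, it can help to look at what is actually taking up space before deleting anything. A small sketch, assuming a hypothetical helper named largest_files and the blobs path used in this article:

```shell
#!/bin/sh
# largest_files DIR [COUNT]: biggest regular files under DIR, largest first.
# Sizes are in KB (du -k), so the numeric sort behaves the same everywhere.
largest_files() {
  find "$1" -type f -exec du -k {} + | sort -rn | head -n "${2:-10}"
}

# Blobs path as used in this article; reading it may require sudo.
# largest_files /usr/share/ollama/.ollama/models/blobs 5
```

A multi-gigabyte file that does not correspond to any model in ollama list is a likely leftover from an interrupted pull.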

Here's the truth about local-only inference for an AI assistant: it works, but it has a quality ceiling. The 3B model handles most everyday tasks fine. Occasionally, though, a complex question, a nuanced writing task, or anything that requires real reasoning either produces a weak response or times out entirely.

The solution: use the local model as the primary and Google's Gemini API as a free fallback.

Why Gemini free tier works here:

The flow:

Your message
     ↓
Ollama llama3.2:3b (primary)
     ↓ if timeout or failure
Gemini 2.5 Flash (fallback) ← free, fast, no card needed
     ↓
Response to Telegram

Most responses come from the local model at zero cost. Complex queries or timeouts fall through to Gemini, also at zero cost. The experience from your phone is just: you send a message, you get a response.
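
The flow above can be sketched in plain shell. This only illustrates the idea and is not how OpenClaw implements it: with_fallback, ask_local, ask_gemini, LOCAL_TIMEOUT, and GEMINI_API_KEY are names chosen here, the 60-second timeouts are assumptions, and both curl and jq need to be installed. The Ollama call uses its HTTP API on the default port 11434, and the Gemini URL is the same one used in the test call in this article.

```shell
#!/bin/sh
# with_fallback PRIMARY FALLBACK PROMPT
# Runs PRIMARY with the prompt; if it fails or prints nothing, runs FALLBACK.
with_fallback() {
  if wf_reply=$("$1" "$3") && [ -n "$wf_reply" ]; then
    printf '%s\n' "$wf_reply"
  else
    "$2" "$3"
  fi
}

# Local model through Ollama's HTTP API. --max-time makes curl give up
# after LOCAL_TIMEOUT seconds; on failure the pipeline prints nothing.
ask_local() {
  jq -n --arg p "$1" '{model: "llama3.2:3b", prompt: $p, stream: false}' |
    curl -sS --fail --max-time "${LOCAL_TIMEOUT:-60}" \
      http://localhost:11434/api/generate -d @- |
    jq -r '.response // empty'
}

# Gemini fallback; expects the API key in the GEMINI_API_KEY variable.
ask_gemini() {
  jq -n --arg p "$1" '{contents: [{parts: [{text: $p}]}]}' |
    curl -sS --fail --max-time 60 \
      -H 'Content-Type: application/json' \
      -H "X-goog-api-key: $GEMINI_API_KEY" \
      "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-lite:generateContent" \
      -d @- |
    jq -r '.candidates[0].content.parts[0].text // empty'
}

# Usage:
# with_fallback ask_local ask_gemini "Just reply OKAY"
```

Treating an empty reply the same as a failure keeps the logic simple: a timeout, a stopped Ollama service, and a malformed response all fall through to the fallback.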

The Gemini API key you receive looks like this: AIza...

No credit card, no billing setup. Takes two minutes.

Check RAM usage while the model is loaded:

free -h

With llama3.2:3b loaded, you should see ~2-3GB used out of 24GB, which leaves plenty of headroom for OpenClaw and everything else.

Check Ollama has auto-started:

systemctl status ollama

It should show active (running). The model itself loads into RAM only when first called and then stays resident for subsequent calls (by default Ollama unloads it again after about five minutes of inactivity), which is why the first response after a reboot takes longer than subsequent ones.

Test Ollama directly:

ollama run llama3.2:3b "Just Reply OKAY!"

Once the model is loaded, it should respond in under 10 seconds. If it consistently takes longer, something is wrong with the Ollama service.
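
To put a number on the cold-start difference, the call can be timed twice in a row. A rough sketch using a hypothetical helper named elapsed, which measures whole seconds only:

```shell
#!/bin/sh
# elapsed CMD [ARGS...]: runs the command, discards its output,
# and prints how many whole seconds it took.
elapsed() {
  el_start=$(date +%s)
  "$@" > /dev/null 2>&1 || true
  el_end=$(date +%s)
  echo $(( el_end - el_start ))
}

# First call includes loading the model; the second should be faster.
# elapsed ollama run llama3.2:3b "Just Reply OKAY!"
# elapsed ollama run llama3.2:3b "Just Reply OKAY!"
```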

Test the Gemini API call:

curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-lite:generateContent" \
  -H 'Content-Type: application/json' \
  -H 'X-goog-api-key: API_KEY' \
  -X POST \
  -d '{
    "contents": [
      {
        "parts": [
          {
            "text": "Just! Reply OKAY"
          }
        ]
      }
    ]
  }'

The HTTP response status code should be 200, with the response text in the body, and the call should appear in the logs in Google AI Studio.

llama3.3 requires 40GB, so it will never run on a 24GB instance. Remove it and pull a smaller model:

ollama rm llama3.3
ollama pull llama3.2:3b

Disk full during model download

The download filled your boot volume. Stop Ollama, remove partial files as the ollama user (not root), free space, then extend the partition if needed via Oracle Console → Boot Volume resize.

Ollama slow after reboot

The first call after a reboot loads the model into RAM, so a slow first response is expected. Subsequent calls are faster since the model stays resident.

With Ollama running and your hybrid local + Gemini fallback configured, the AI layer is ready.

Part 4 will cover installing OpenClaw on Linux — the right user, systemd service setup, the config file traps, and every mistake worth avoiding so you don't have to make them yourself.

This article is the third in a five-part series:

Stay tuned, all links will be updated as articles are published.

If you have reached this point, I have made a satisfactory effort to keep you reading. Please be kind enough to leave any comments or share any corrections.
