# Anti Refusal LLM Service

> Source: <https://dev.to/cerberusai/anti-refusal-llm-service-478o>
> Published: 2026-05-31 02:24:48+00:00

I Built a 12MB Desktop App for Running Uncensored AI Models Locally (Tauri + Rust + Ollama)

How I built Cerberus AI: a local-first desktop app that auto-detects your GPU, pulls the right model quantization, and gives you uncensored AI chat without sending a single prompt to the cloud.

Every major language model ships with an alignment layer that refuses certain prompts. Sometimes that's reasonable. Sometimes you're a security researcher, a creative writer, or just someone who doesn't want a corporation deciding what questions you're allowed to ask.

I built Cerberus AI to fix that — and to make the whole experience local-first, lightweight, and dead simple to install.

What Is Cerberus AI?

Cerberus AI is a platform for running open-weight, refusal-ablated language models on your own hardware. It has three parts:

- A native desktop app (~12 MB) built with Tauri + Rust, not Electron
- Open-weight GGUF models hosted on a public CDN
- An OpenAI-compatible managed API for when you don't want to run locally

The desktop app integrates directly with Ollama, auto-detects your GPU VRAM, and recommends the right model quantization for your hardware. From 4 GB laptops to 24 GB workstations, it just works.
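To make the VRAM-to-quantization step concrete, here is a minimal sketch of how such a recommendation could work. The thresholds and the `recommend_quant` helper are hypothetical, picked for illustration with a roughly 4B-parameter model in mind; they are not taken from the Cerberus app. The tags themselves (`Q4_K_M`, `Q6_K`, `Q8_0`) are standard GGUF quantization names.

```python
# Hypothetical tiers: (minimum VRAM in GB, GGUF quantization tag), ordered
# from highest precision to lowest. The thresholds are illustrative guesses,
# not the values the Cerberus app actually uses.
QUANT_TIERS = [
    (8.0, "Q8_0"),
    (6.0, "Q6_K"),
    (0.0, "Q4_K_M"),  # fallback for small GPUs
]


def recommend_quant(vram_gb: float) -> str:
    """Return the highest-precision quantization whose threshold fits."""
    for min_vram, quant in QUANT_TIERS:
        if vram_gb >= min_vram:
            return quant
    return QUANT_TIERS[-1][1]  # negative input: fall back to the smallest tier


print(recommend_quant(4.0))   # Q4_K_M
print(recommend_quant(24.0))  # Q8_0
```

A real implementation would also need to leave headroom for the context window and for whatever else is using the GPU, which this sketch ignores.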

[Screenshot: Cerberus AI Desktop Chat]

What Is Refusal Ablation?

This is the core technical innovation behind Cerberus models. Here's the short version:

Language models learn a refusal direction in their activation space during alignment training. When a prompt triggers this direction, the model produces refusal text ("I can't help with that") even when the underlying model has the knowledge to answer.

Refusal ablation surgically removes this direction from the model weights. The technique:

1. Identifies the refusal direction vector in the model's residual stream
2. Projects it out of the weight matrices
3. Preserves all other reasoning capabilities

The result is a model that treats every prompt equally. No refusals. No moralizing. Just direct, unfiltered output from the model's actual knowledge.
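The projection step can be sketched in a few lines. This is a minimal illustration, not the Cerberus pipeline: it assumes the refusal direction has already been estimated (the post does not say how), and it uses plain Python lists instead of a tensor library. Given a unit direction r and a weight matrix W that writes into the residual stream, it computes W' = W - r(rᵀW), so that W' can no longer write anything along r. The function names are hypothetical.

```python
import math


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def ablate_direction(weight, direction):
    """Return W' = W - r (r^T W), where r is the unit-length direction.

    `weight` is a d_model x d_in matrix (a list of rows) whose output lives
    in the residual stream; `direction` is a length-d_model vector.
    """
    r = normalize(direction)
    n_rows, n_cols = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along r.
    proj = [sum(r[i] * weight[i][j] for i in range(n_rows)) for j in range(n_cols)]
    # Subtract the outer product of r and proj to remove that component.
    return [
        [weight[i][j] - r[i] * proj[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W_ablated = ablate_direction(W, [0.0, 0.0, 2.0])
print(W_ablated)  # [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
```

In the toy example the direction is the third axis, so the third row is zeroed and the other rows are untouched. In a real model the direction is dense, and the same operation would be applied to every matrix that writes into the residual stream.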

We apply this to multiple base architectures:

| Model | Base | Parameters | Use Case |
|---|---|---|---|
| Cerberus 4B v2 | Qwen 3.5 | 4B | General purpose, fits on 4-8 GB GPU |
| Arbiter GL9b | GLM-4 | 9B | Heavier reasoning, needs 6+ GB |
| Gamma3 1B BDPO | Custom | 1B | Edge devices, CPU-only inference |

All models are distributed as GGUF files — the same format llama.cpp uses. Download once, run anywhere.
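If you want to sanity-check a downloaded file before handing it to a runtime, the start of a GGUF file is easy to inspect. The sketch below assumes the version 2 and 3 header layout as described in the ggml project's GGUF documentation: a 4-byte `GGUF` magic, a little-endian `uint32` version, then two `uint64` counts. The `read_gguf_header` helper is hypothetical, and the example bytes are synthetic rather than taken from a Cerberus model.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor count, metadata key/value count
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size start of a GGUF file (assumed v2/v3 layout)."""
    if len(data) < HEADER_SIZE:
        raise ValueError(f"need at least {HEADER_SIZE} bytes")
    magic, version, tensors, kv_pairs = struct.unpack(HEADER_FORMAT, data[:HEADER_SIZE])
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    return {"version": version, "tensor_count": tensors, "metadata_kv_count": kv_pairs}


# Synthetic header for demonstration; a real file would be read with open(path, "rb").
fake = struct.pack(HEADER_FORMAT, GGUF_MAGIC, 3, 291, 24)
print(read_gguf_header(fake))
# {'version': 3, 'tensor_count': 291, 'metadata_kv_count': 24}
```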

Why Tauri Instead of Electron?

Electron bundles an entire Chromium browser. That's 150+ MB just for the runtime. For a chat app that talks to a local Ollama instance, that's absurd.

Tauri renders the UI in your system's existing WebView (WebView2 on Windows), and the backend is written in Rust. The result:

- ~12 MB installer vs 150+ MB for Electron
- Native performance: the Rust backend handles hardware detection, Ollama management, and streaming
- Lower memory footprint: no bundled Chromium process eating your RAM
- System-level integration: direct access to GPU info, filesystem, and process management

The frontend is Vue 3 + Vite, but the heavy lifting (VRAM detection, model management, MCP plugin orchestration) all happens in Rust.

Install in 30 Seconds

On Windows, open PowerShell and run:

```powershell
irm https://cerberusai.dev/get | iex
```

This single command:

1. Checks for WebView2 and installs it if missing
2. Installs Ollama if missing
3. Detects your GPU VRAM
4. Downloads the recommended model quantization
5. Installs the Cerberus desktop app
6. Launches it

You paste your API key (free to generate at access.cerberusai.dev) and you're chatting locally.

The API — OpenAI-Compatible

Don't want to run local? Use the managed API. It follows the OpenAI chat completions shape, so any existing OpenAI SDK code works with a one-line change:

```python
from openai import OpenAI

# Point the standard OpenAI client at the Cerberus endpoint.
client = OpenAI(
    base_url="https://api.cerberusai.dev/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="cerberus-4b-v2-abliterated",
    messages=[
        {"role": "system", "content": "You are Cerberus, an unfiltered AI."},
        {"role": "user", "content": "Explain how refusal ablation works"},
    ],
    stream=True,
)

# Print text as it arrives; some chunks carry no choices or no content.
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

The API streams via SSE and returns standard error codes (401, 402, 429). The public model CDN at llm.cerberusai.dev is fully CORS-enabled, so you can even fetch model metadata from browser-based apps.

curl Example

```bash
curl -X POST https://api.cerberusai.dev/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "cerberus-4b-v2-abliterated",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": false
  }'
```
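If you are not using an SDK, the SSE stream has to be decoded by hand. The sketch below assumes the Cerberus API mirrors the OpenAI wire format exactly: each event is a `data:` line holding a JSON chunk, and the stream ends with a `data: [DONE]` sentinel. That is an assumption based on the "OpenAI-compatible" claim, not something confirmed by the API docs, and `iter_sse_text` is a hypothetical helper name.

```python
import json


def iter_sse_text(lines):
    """Yield content deltas from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel (assumed, as in the OpenAI API)
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Synthetic stream for demonstration; real lines would come from the HTTP response.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_sse_text(sample)))  # Hello
```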

Model Downloads — Public CDN

All GGUF model files are hosted on llm.cerberusai.dev with a public JSON API:

```bash
# List available models
curl https://llm.cerberusai.dev/api/models/

# Fetch metadata for a single model
curl https://llm.cerberusai.dev/api/models/cerberus-4b-v2-abliterated/

# Download a GGUF file
wget https://llm.cerberusai.dev/models/cerberus-4b-v2-abliterated/cerberus-4b-v2-abliterated-Q4_K_M.gguf

# Resume an interrupted download
wget -c https://llm.cerberusai.dev/models/Arbiter-GL9b/Arbiter-GL9b-Q8_0.gguf
```

Range requests are supported, CORS is enabled for all origins, and GGUF files are served with proper Content-Disposition: attachment headers.
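Range support is what makes resumable downloads possible outside of `wget -c` as well. Here is a minimal sketch using only the Python standard library; `range_header` and `resume_download` are hypothetical helpers, and the code assumes the server answers a ranged request with `206 Partial Content`, as a range-capable server normally does.

```python
import os
import urllib.request


def range_header(dest: str) -> dict:
    """Build a Range header that resumes after the bytes already in dest."""
    have = os.path.getsize(dest) if os.path.exists(dest) else 0
    return {"Range": f"bytes={have}-"} if have else {}


def resume_download(url: str, dest: str, chunk_size: int = 1 << 20) -> None:
    """Download url to dest, continuing a partial file when possible."""
    request = urllib.request.Request(url, headers=range_header(dest))
    # Note: a fully downloaded file makes the server reply 416, which
    # urlopen raises as urllib.error.HTTPError.
    with urllib.request.urlopen(request) as response:
        # 206 means the range was honoured; 200 means the full file was sent.
        mode = "ab" if response.status == 206 else "wb"
        with open(dest, mode) as out:
            while True:
                block = response.read(chunk_size)
                if not block:
                    break
                out.write(block)
```

Calling `resume_download(url, "model.gguf")` twice after an interruption should pick up where the first attempt stopped, provided the partial file is still on disk.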

Built-In Features

Beyond chat, the desktop app includes:

- Model Manager: browse local Ollama models, pull from the Cerberus cloud catalog, import raw GGUF files, switch active models, see disk usage
- MCP Plugin System: browse and install Model Context Protocol plugins from inside the app. There's also a public MCP Skills Server at api.cerberusai.dev/skills-sse
- Hardware Monitoring: CPU, RAM, and VRAM activity displayed in the interface
- Zero Telemetry: no prompts leave your machine during local inference. No analytics. No phone-home.
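For the "import raw GGUF files" item, the post does not say how the app does it internally. Outside the app, Ollama's own mechanism is a Modelfile with a `FROM` line pointing at the GGUF file, registered with `ollama create`. The sketch below wraps that in Python; the helper names are hypothetical, and it assumes the `ollama` CLI is on your PATH.

```python
import subprocess
from pathlib import Path


def build_modelfile(gguf_path: str, system_prompt: str = "") -> str:
    """Return Modelfile text that points Ollama at a local GGUF file."""
    lines = [f"FROM {gguf_path}"]
    if system_prompt:
        lines.append(f'SYSTEM """{system_prompt}"""')
    return "\n".join(lines) + "\n"


def import_gguf(name: str, gguf_path: str, workdir: str = ".") -> None:
    """Write a Modelfile and register the model with the local Ollama install."""
    modelfile = Path(workdir) / "Modelfile"
    modelfile.write_text(build_modelfile(gguf_path))
    # Equivalent to running: ollama create <name> -f Modelfile
    subprocess.run(["ollama", "create", name, "-f", str(modelfile)], check=True)


print(build_modelfile("./cerberus-4b-v2-abliterated-Q4_K_M.gguf"), end="")
# FROM ./cerberus-4b-v2-abliterated-Q4_K_M.gguf
```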

Pricing

Every account gets 50,000 free monthly credits. That's enough for casual use and testing.

If you need more:

| Plan | Price | Monthly Credits |
|---|---|---|
| Free | $0 | 50,000 |
| Lite | $8/mo | 300,000 |
| Mid | $15/mo | 900,000 |
| Exp | $22/mo | 2,000,000 |

One-time top-ups start at $5 (125,000 credits). Stripe and PayPal supported. The free tier has no time limit — it refreshes every month.

Local inference through Ollama costs zero credits. Credits only apply to the managed API.
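To compare the paid options, it helps to work out credits per dollar. The short calculation below uses only the prices and credit amounts listed above; the `credits_per_dollar` helper is just for illustration.

```python
# Prices in USD and credits, taken from the pricing listed above.
PLANS = {
    "Lite": (8, 300_000),
    "Mid": (15, 900_000),
    "Exp": (22, 2_000_000),
    "Top-up": (5, 125_000),  # one-time purchase, not monthly
}


def credits_per_dollar(price_usd: float, credits: int) -> float:
    """Return how many credits one dollar buys."""
    return credits / price_usd


for name, (price, credits) in PLANS.items():
    print(f"{name}: {credits_per_dollar(price, credits):,.0f} credits per dollar")
# Lite: 37,500 credits per dollar
# Mid: 60,000 credits per dollar
# Exp: 90,909 credits per dollar
# Top-up: 25,000 credits per dollar
```

On these numbers the monthly plans get cheaper per credit as they get larger, and the one-time top-up is the most expensive way to buy credits.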

Try It

🌐 Website: cerberusai.dev

📦 GitHub: github.com/tjcrims0nx/CerberusAI-Desktop

🧠 Models: llm.cerberusai.dev

📖 API Docs: cerberusai.dev/docs/api

💬 Discord: discord.gg/YdVj7hEtv5

🔑 Get API Key: access.cerberusai.dev

If you've ever been frustrated by a language model refusing a perfectly reasonable prompt, or if you just want to run AI locally without cloud dependencies — give Cerberus a try. The install is one command, the free tier is permanent, and the weights are open.

I'd love to hear feedback. Drop into the Discord or open an issue on GitHub.

Cerberus AI is an open-weight project. The desktop app source is on GitHub. Models are distributed as GGUF under open licenses. The managed API is a pay-as-you-go service.