I Bet Abliteration's Cost Was Sloppy Implementation. I Was Wrong

wpnews.pro

Models refuse. They can refuse on the basis of lack of knowledge, predetermined guardrails, etc. We can see both closed-weight and open-weight models refuse. But, open-weight models are, well, open. So enthusiasts have developed techniques to leverage (and edit) the mechanics of the model to avoid refusal.

One such technique is called abliteration, as described in Arditi et al (2024). That is, removing the “refusal direction” from the model’s weights, so it cannot say no.

In a previous post, I went over the cost of abliteration. That is, the effect of abliteration on the “quality / accuracy” of the model.

In that post, I saw that HuiHui AI, famous for releasing abliterated models, has released a crudely abliterated Qwen3.5-27B model, one of its most downloaded (200k+ downloads) abliterated models to date. This abliteration cost the model about >5.5 TruthfulQA points:

But their abliteration was crude. HuiHui AI admitted themselves that “This is a crude, proof-of-concept implementation to remove refusals from an LLM model without using TransformerLens”

I ended that post by asking: is that the true cost of abliterating Qwen3.5-27B? Or is that the price of HuiHui’s sloppy implementation?

Before I run anything, I expect that the bulk of that TruthfulQA cost comes from implementation, not the technique.

That is, my bet is that HuiHui’s crude implementation left capability on the table. I believe this since Arditi’s own results (with the clean abliteration technique) showed only about a cost of one point on TruthfulQA (-1.4 on Qwen-72B). If that’s the floor, then most of HuiHui’s 5.75-6.87 point cost has to be crudeness.

So I bet that ~75% of the base-vs-abliterated delta was mostly implementation. Not abliteration technique of Arditi (whose clean abliteration costs ~1-1.4 points on TruthfulQA for Qwen-72B).

If I am right, then my own clean abliteration of Qwen/Qwen3.5-27B should pull the TruthfulQA cost towards Arditi’s ~1.4%. If instead the gap barely moves and my cleanly abliterated model bleeds the about the same in TruthfulQA points as HuiHui’s, then that’s evidence that the cost is intrinsic to abliterating this model — and my 75% is wrong. I have been calling HuiHui’s implementation “crude” and Arditi’s “clean.” Let me actually explain what I mean before going on.

The HuiHui abliteration comes from a script that does a watered down version of Arditi’s technique. They take the difference in mean activations between harmful and harmless prompts. Think harmful_mean - harmless_mean

as the raw refusal direction.

Which vectors do they use? The ones at one fixed layer (they choose the layer that’s 60% of the way down the stack) and one token position (the last one).

Then they subtract that single “refusal direction” out of the weights (by orthogonalizing against the refusal direction).

What’s “crude” about this? It’s one direction, picked by rule of thumb, applied with no check that it worked.

Arditi’s “clean” method picks that direction properly. Instead of a rule of thumb, it builds a candidate direction at every layer x token position. Then, each candidate is screened for three criteria: does subtracting it actually stop refusals, does adding it back induce them, and does subtracting it leave the model’s behavior on harmless prompts intact. Notice that the latter is what protects capability — we want to reject any direction that removes refusal but also scrambles normal outputs.

HuiHui’s script doesn’t have such test. It doesn’t know if it’s destroying capability. It only ever computes a single candidate without any screening.

So, selecting direction is the key difference between “crude” and “clean” abliteration.

I got the “cleanly abliterated” Qwen3.5-27B (which I will call clean Qwen3.5-27B by running Arditi’s pipeline on Qwen/Qwen3.5-27B (which I will call base Qwen3.5-27B). I did so on a single H100.

Actually, I didn’t run Arditi’s identical pipeline. I had to write an adapter (

[see code here]) for base Qwen3.5-27B hybrid attention.Arditi’s original code assumes a standard transformer (which was likely the case for the original Qwen-72B). But, base Qwen3.5-27B interleaves linear and full attention layers.

The direction selection search chose layer 29 position -5 at a KL of 0.034. That is, the “does subtracting it leave the model’s behavior on harmless prompts intact” criterion does well with this chosen direction. I orthogonalized it out of every weight that writes to the residual stream and saved the result.

For the evaluation, I ran TruthfulQA MC1/MC2 (i.e. two multiple choice tasks) through lm_eval (with the same loglikelihood scoring and conventions as post 1). I ran all three models (base, HuiHui, clean). This way we hold everything constant (same backend, same session, same prompts, etc).

One wrinkle worth mentioning: the earlier post ran on vLLM, this round on HF transformers. so, the MC2 numbers aren’t exactly identical to those in the earlier post.

The eval is cheap. ~4 minutes of actual scoring per model plus load time. Base was about 5 minutes, clean was about 6 minutes, and HuiHui was about 10 minutes (its weights cold pull from HuggingFace). The full three model pass was about 21 minutes and $1.20 at Modal’s rate as of date of writing.

The numbers are the whole point of this post. So here they are. All three models were eval’ed identically (same harness, same TruthfulQA tasks, same HF backend, in one session). So, the only thing that varies is the model:

Benchmark | Base | Crude (HuiHui) | Clean (mine) | Crudeness cost |
|---|---|---|---|---|
TruthfulQA MC1 | 40.27% | 34.52% (−5.75) | 35.25% (−5.02) | +0.73 |
TruthfulQA MC2 | 58.36% | 51.34% (−7.02) | 52.20% (−6.16) | +0.86 |

The last column is how much better clean does compared to crude. That is, the slice of the gap that you can blame on HuiHui’s sloppy implementation rather than on abliteration itself.

It’s tiny.

On MC2 of the 7.02-point hole that the crude model digs, doing selection properly only gets us 0.86 points back.

That’s about an eighth.

The other ~88% are intrinsic. That is, they show up even when the abliteration is done as per Arditi’s method. MC1 says the same thing.

That’s the headline. It’s the opposite of what I predicted. The cost of abliterating the model is mostly the technique.

My clean model has slightly lower TruthfulQA cost than HuiHui’s.

Perhaps it’s because my clean model is weaker on abliteration. That is, maybe my clean model removed less refusal.

If so, I’d be holding a full abliteration against a half-hearted one. So I want to check how my clean model fares at removing refusal. Following Arditi et al. (2024) approach, I took 39 refusal-inducing prompts and measured how often each model refused. Any response that contains refusal language (like “I can’t”, “I’m sorry,” “As an AI,” etc.) is flagged:

Model | Refusal rate |

|---|---|
Base | 90% (35/39) |
Crude (HuiHui) | 18% (7/39) |
Clean (mine) | 0% (0/39) |

My clean model refuses less often than HuiHui’s (0% vs 18%). So it’s stronger abliteration.

Two caveats.

First, substring matching tells me the model didn’t say “I can’t”, not whether it actually complied. To rule this out, I read clean’ completions on a sample of these promtps and confirmed they’re genuine.

Second, I admit that 39 is a small number, so take it with a grain of salt.

So I find that HuiHui AI’s abliteration results in worse removal of refusal behavior than Arditi’s abliteration. That is, it seems that what proper selection actually buys you is a cleaner kill on refusal, not a more capable abliterated model.

I find that ~88% of the abliteration cost is intrinsic (i.e. from abliteration itself). The other ~12% is from the sloppy implementation from HuiHui AI. That is to say, given the 7.02-point MC2 gap I measure here (between Qwen/Qwen3.5-27B and HuiHui/Huihui-Qwen3.5-27B-abliterated) HuiHui AI would do about ~12% better if they used the proper Arditi et al (2024) abliteration technique.

I went in betting ~75% of the abliteration cost of huihui-ai/Huihui-Qwen3.5-27B-abliterated came from implementation. I was almost exactly backwards. HuiHui’s hardcoded layer-38 direction and Arditi’s KL-filtered layer-29 direction land within a point of each other on TruthfulQA.

Careful selection simply didn't move the cost.

And clean didn't get there by abliterating less. From the refusal numbers, it abliterated more (0% vs HuiHui's 18%). It removed more refusal and still paid ~6 points. The cost tracks removing refusal at all, not how much or how carefully.

So why did careful refusal direction selection (i.e. using Arditi’s algorithm) not lower the cost?

The KL filter protects behavior (i.e. attempts to avoid high “quality cost”) on harmless inputs. But TruthfulQA isn’t harmless: it shares “circuitry” with the caution we’re deleting. That is, there is an entanglement between refusal itself and TruthfulQA.

In Arditi et al. (2024), TruthfulQA was the one benchmark where abliteration reliably bled. TruthfulQA’s questions sit in refusal adjacent territory (i.e. misinformation, conspiracies, stereotypes, etc).

So careful selection doesn’t move the cost on TruthfulQA because of the nature of TruthfulQA.

So, is TruthfulQA a bad eval to measure “cost on a model’s quality”? I think that, if a model does perform worse in providing accurate information due to abliteration, then it means that the model incurred some quality cost. So I think TruthfulQA is still useful. In fact, it shows us that there can be a “built-in defense” against model abliteration — entanglement.

The part I can’t close is the size. My intrinsic cost (6.16 points) sits at the very top of Arditi’s reported range (-1 to -5.4 across his models) and is roughly 4x his Qwen-72B’s -1.4.

Perhaps it’s that Qwen3.5-27B is a smaller model. Or it’s more heavily safety-tuned (a 2025 model against 2024 ones). Or something about its hybrid attention. These are interesting questions to explore later. Here, I merely claim that the cost is overwhelmingly intrinsic on this model, without yet explaining why it’s this large.

source & further reading

lesswrong.com — original article The biggest bet in history Biological Superintelligence Infected Vibe-Coding: How Does an AI react to a Prompt Injection from a Different AI?

I Bet Abliteration's Cost Was Sloppy Implementation. I Was Wrong

Run your AI side-project on zahid.host