# iddqd, or the hardest kind of unsafe Rust

> Source: <https://oxide.computer/blog/iddqd-unsafe>
> Published: 2026-06-02 16:20:18+00:00

I’m the main author of [ iddqd](https://docs.rs/iddqd), a Rust library for maps (named after the

[Doom cheat code](https://doomwiki.org/wiki/Doom_cheat_codes#All_Doom_engine_versions)) where keys are borrowed from values. At Oxide we use it extensively in

[Omicron](https://github.com/oxidecomputer/omicron), our control plane—the software that

[sits](https://rfd.shared.oxide.computer/rfd/0048)at the heart of every Oxide rack, provisions resources like compute and storage for our customers, and ensures the rack stays up and running over time.

`iddqd`

maintains in-memory indexes of the kinds of large records that show up everywhere in a system like that, such as [disks](https://github.com/oxidecomputer/omicron/blob/2e319eb3b9b9fcffb8cee71cffb9cd13397cb316/sled-agent/config-reconciler/src/internal_disks.rs#L602-L637)or

[sled inventories](https://github.com/oxidecomputer/omicron/blob/2e319eb3b9b9fcffb8cee71cffb9cd13397cb316/nexus/types/src/inventory.rs#L682-L717). As a result, it must be correct: if it misbehaves, our control plane can malfunction in ways that are unpredictable and hard to diagnose.

`iddqd`

consists of a fair amount of unsafe code underneath. There’s been [some recent concern](https://github.com/oven-sh/bun/issues/30719) over the amount of unsafe code in Rust rewrites, so I thought I’d write about some of the unsafe code in `iddqd`

and how we try to tame it.

[What problem does iddqd solve?](#what-problem-does-iddqd-solve)

With Rust’s standard library maps, keys are stored separately from values. Let’s say you want to store a map of records keyed by an email address. With [ std::collections::BTreeMap](https://doc.rust-lang.org/std/collections/struct.BTreeMap.html), you might write something like:

```
// Email is typically a newtype which validates that the email address is well-formed.// In this example, we alias it to String for simplicity.type Email = String;struct User {    name: String,    age: u8,}let mut users = BTreeMap::<Email, User>::new();users.insert(    "alice@example.com".to_string(),    User { name: "Alice".to_string(), age: 30 },);users.insert(    "bob@example.com".to_string(),    User { name: "Bob".to_string(), age: 35 },);
```

This approach has what I consider to be a pretty major downside: the key (email) is not stored in the same struct as the value (the rest of the record). How would you handle this? One way is to pass around both the email and the user, for example with [ get_key_value](https://doc.rust-lang.org/std/collections/struct.BTreeMap.html#method.get_key_value):

```
fn process_user(email: &Email, user: &User) {    /* ... */}let (email, user) = users.get_key_value("alice@example.com").unwrap();process_user(email, user);
```

As an extension, you could maybe have a struct which combines both at fetch time:

```
struct UserRecord<'a> {    email: &'a Email,    user: &'a User,}let (email, user) = users.get_key_value("alice@example.com").unwrap();let record = UserRecord { email, user };
```

In practice, this gets quite unwieldy when you have lots of different types of records that need this kind of treatment.

Alternatively, you could duplicate the email across the key and the value:

``` js
struct User {    email: Email,    name: String,    age: u8,}let mut users = BTreeMap::<Email, User>::new();let email = "alice@example.com".to_string();users.insert(    email.clone(),    User { email, name: "Alice".to_string(), age: 30 },);
```

But that has the risk that the emails stored in the key and the value fall out of sync.

`iddqd`

provides a better alternative. With [ IdOrdMap](https://docs.rs/iddqd/latest/iddqd/struct.IdOrdMap.html), you can write something like:

```
struct User {    email: Email,    name: String,    age: u8,}// You implement a trait which tells iddqd how to extract the key from the// record.impl IdOrdItem for User {    // The key type can borrow from the value!    type Key<'a> = &'a Email;    fn key(&self) -> Self::Key<'_> {        &self.email    }    // This macro expands into a small amount of boilerplate to work around    // a Rust borrow checker limitation.    id_upcast!();}// Then, you can insert records:let mut users = IdOrdMap::<User>::new();users    .insert_unique(User {        email: "alice@example.com".to_string(),        name: "Alice".to_string(),        age: 30,    })    .unwrap();users    .insert_unique(User {        email: "bob@example.com".to_string(),        name: "Bob".to_string(),        age: 35,    })    .unwrap();// And look them up by email:assert_eq!(&users.get("alice@example.com").unwrap().name, "Alice");assert_eq!(users.get("bob@example.com").unwrap().age, 35);
```

At Oxide, this has proven to be an invaluable pattern: many of the records in our control plane are quite large (think database lookups), and `iddqd`

is a great fit for them. It also comes with several other features that directly address pain points we’ve dealt with at Oxide. A few worth calling out:

[First-class support](https://docs.rs/iddqd/latest/iddqd/#examples)for complex keys that borrow from more than one field, without having to resort to[workarounds like dynamic dispatch](https://github.com/sunshowers-code/borrow-complex-key-example).Maps with

[two](https://docs.rs/iddqd/latest/iddqd/struct.BiHashMap.html)or[three](https://docs.rs/iddqd/latest/iddqd/struct.TriHashMap.html)keys per item, each independently indexing the same record, without the usual pattern of maintaining multiple maps by hand.Serde implementations that

[serialize](https://docs.rs/iddqd/latest/iddqd/struct.IdOrdMap.html#impl-Serialize-for-IdOrdMap%3CT%3E)as sequences rather than maps, so that non-string keys can be serialized in JSON[. Importantly, these implementations reject duplicate keys. (For backwards compatibility, serialization as maps is[1](#_footnotedef_1)][also supported](https://docs.rs/iddqd/latest/iddqd/id_ord_map/struct.IdOrdMapAsMap.html).)

Like many of our other crates, `iddqd`

is built for Oxide’s needs but is generally useful to the Rust community. You’re [welcome](https://crates.io/crates/iddqd) to use it in your own projects as well.

[On unsafe Rust](#on-unsafe-rust)

Before I move on, I want to talk about what it means for `unsafe`

to exist in a memory-safe language. The big concern is [ undefined behavior](https://doc.rust-lang.org/reference/behavior-considered-undefined.html) (UB): a program behaves in an unpredictable way because core assumptions made by the language or compiler have been violated. Rust calls an abstraction

*sound*if no safe code can use it to cause UB, and

*unsound*otherwise.

The vast majority of Rust code is safe, which (assuming that any unsafe used by the safe code is sound) means that no UB can occur. However, due to the fundamental undecidability of static analysis (see [Rice’s theorem](https://en.wikipedia.org/wiki/Rice%27s_theorem)), it is impossible for the Rust compiler—or any kind of algorithm that terminates—to accept all programs without UB and reject all programs with it. Therefore, when writing such an algorithm, its authors have to make a decision: do they reject some programs without UB, accept some programs with UB, or both? The Rust compiler does the first: within the context of safe Rust, it rejects all programs with UB but also some without UB. (This is the correct choice!)

What if your program is in the no-man’s-land where it gets rejected even though you know it doesn’t have UB? To express those kinds of programs, the Rust compiler provides an escape hatch: the `unsafe`

keyword. By writing it, the author takes responsibility for soundness: they vouch that no safe code can use this to cause UB, and that the compiler should trust them.

What does it mean for some unsafe code to not have UB? Informally, it must follow [the rules of Rust](https://cacm.acm.org/research/safe-systems-programming-in-rust/)—all the rules that the compiler proves for safe Rust. Some examples of these rules are:

There are no data races.

There must be no reads of uninitialized or freed memory.

There must not ever be multiple aliases of

`&mut`

references to the same region of memory.Immutable data must not be mutated.

The rules of Rust are very hard to reason about! I consider them to be significantly harder than C, for example, particularly due to the rules around mutable aliasing[ 2]. So judicious use of unsafe Rust usually involves encapsulating it behind a safe abstraction.

[Unsafe Rust requires reasoning about safe Rust](#unsafe-rust-requires-reasoning-about-safe-rust)

In all but the simplest of cases, it is not possible to be sure that such an abstraction is sound without also reasoning about the surrounding safe Rust. In this section, I’ll walk through three examples in increasing order of difficulty.

The first example is the [ split_at_mut method](https://doc.rust-lang.org/std/primitive.slice.html#method.split_at_mut) on slices. This method splits a mutable slice into two separate parts.

`split_at_mut`

is a safe method, but safe Rust lacks the ability to express this kind of partitioning of a slice. Thus, `split_at_mut`

requires unsafe Rust.The implementation is roughly:

``` php
fn split_at_mut<T>(slice: &mut [T], mid: usize) -> (&mut [T], &mut [T]) {    assert!(mid <= slice.len());    let ptr = slice.as_mut_ptr();    let remaining = slice.len() - mid;    unsafe {        (            std::slice::from_raw_parts_mut(ptr, mid),            std::slice::from_raw_parts_mut(ptr.add(mid), remaining),        )    }}
```

To be sure of the correctness of the unsafe block, you also have to consider several things about the surrounding safe code:

that the function was provided a

`&mut [T]`

, and not, for example, a`&[T]`

the

`assert!`

ensuring that`mid`

is in boundsthe fact that

`remaining`

is`slice.len() - mid`

, and not, for example,`(slice.len() - mid) + 1`

If any of these invariants (all of which are safe Rust) are not upheld, the safe abstraction becomes unsound.

The next level up is that in many cases you also have to analyze the *entire module* the unsafe code is present in. Consider this vector type:

```
struct MyVec<T> {    ptr: NonNull<T>,    len: usize,    cap: usize,}impl<T> MyVec<T> {    pub fn get(&self, i: usize) -> Option<&T> {        (i < self.len).then(|| unsafe { &*self.ptr.as_ptr().add(i) })    }}
```

The soundness of `get`

relies on **every other method** in the module, safe and unsafe, ensuring that `len`

is correct and in bounds. Encapsulation and privacy (`len`

cannot be mutated from outside the module) are load-bearing here.

The hardest situations, though, are when unsafe code is generic and calls back into user-supplied code. In those cases, the [fundamental principle](https://doc.rust-lang.org/nomicon/safe-unsafe-meaning.html) is:

*Safe Rust code, no matter how pathological or adversarially written, must never cause unsafe code to exhibit undefined behavior.*

In other words, you no longer have the luxury of analyzing just the safe code you see; you must put yourself in the shoes of an attacker, considering every safe Rust program one could write, and be resilient to *all* of them. For example, it is possible for user code to misbehave in a way that leads to internal tables or data structures being corrupted. Safe but buggy Rust may cause unsafe Rust to be slower, produce incorrect results, or even panic, but it must not cause UB!

(Why assume the worst? You might argue that well-meaning Rust developers are quite unlikely to write adversarial code. One reason for having a firm rule is that guarding against adversarial code means guarding against innocent mistakes as a corollary; a bright line makes it easier to assign responsibility for protection against UB.)

An example is the [ ExactSizeIterator](https://doc.rust-lang.org/std/iter/trait.ExactSizeIterator.html) trait, which has a

`len`

method that returns the exact size of an iterator. A developer might be tempted to take advantage of this in unsafe Rust:

``` php
pub fn collect_exact<T, I: ExactSizeIterator<Item = T>>(iter: I) -> MyVec<T> {    let n = iter.len();    let mut v = MyVec::with_capacity(n);    for (i, item) in iter.enumerate() {        unsafe { std::ptr::write(v.ptr.as_ptr().add(i), item); }    }    v.len = n;    v}
```

This code does not cause UB if the iterator is well-behaved. But there’s no guarantee of such behavior! It is entirely possible to write a broken `ExactSizeIterator`

implementation that returns a bogus result, all in safe Rust:

```
struct BadIterator {    remaining: usize,}impl Iterator for BadIterator {    type Item = u32;    fn next(&mut self) -> Option<u32> {        (self.remaining > 0).then(|| { self.remaining -= 1; 42 })    }}impl ExactSizeIterator for BadIterator {    fn len(&self) -> usize {        0 // but this actually yields `self.remaining` items!    }}
```

`collect_exact`

will write into unallocated memory in this case, so it is unsound in general[ 3].

Another example is the [ std::io::Read trait](https://doc.rust-lang.org/std/io/trait.Read.html), where a safe but buggy

`Read`

implementation can return the wrong number of bytes read. See [Rust RFC 2930](https://github.com/rust-lang/rfcs/blob/master/text/2930-read-buf.md)for more discussion about this.

`iddqd`

is in exactly this situation: it consists of generic data structures which call into user-supplied code, so it is working at this highest level of difficulty[ 4].

[How iddqd works](#how-iddqd-works)

`iddqd`

's architecture is pretty straightforward: it uses a combination of a list of items (internally called an `ItemSet`

) and a table with indexes into the slots. For example, [ IdHashMap<T>](https://docs.rs/iddqd/latest/iddqd/struct.IdHashMap.html) consists of:

an

`ItemSet<T>`

, which (similar to the) internally has:`slab`

cratea

`Vec<ItemSlot<T>>`

, where`ItemSlot<T>`

is an enum with two variants:`Occupied(T)`

`Vacant { next: ItemIndex }`

(where`ItemIndex`

represents an integer index into the`Vec<ItemSlot<T>>`

)

a

`free_head: ItemIndex`

, which points to either the most recently freed`Vacant`

slot, or is a sentinel (marked as`SENT`

in the diagrams below) indicating that no free vacant slots are available. (The combination of`free_head`

and the`Vacant`

slots forms a*free chain*.)

a

over`hashbrown::HashTable`

`ItemIndex`

(`hashbrown`

is the fast hash table implementation used by the standard library)

For example, an `IdHashMap`

holding three items might look like the diagram below. The top row is the hash table over `ItemIndex`

; the middle row is the `Vec<ItemSlot<T>>`

.

Adding a new item consists of checking `free_head`

to see if any `Vacant`

slots are available. If so, we read the `next`

index out of that `Vacant`

slot, overwrite the slot with `Occupied(…)`

, then set `free_head`

to the `next`

value we read. Otherwise, we push a new `Occupied`

slot to the end. Finally, the hash table is updated, fetching the key and computing the hash, then recording the new index in the hash table.

Starting from the state above: Let’s say we want to insert a new item `t₃`

. Calling `insert(t₃)`

recycles slot `2`

from the free chain, advances `free_head`

to `1`

, and records the new hash entry:

Removing an item given a key involves this process in reverse: first, consult the hash table using the hash computed from the key, and remove the found index from the hash table. With the index in hand, replace the `ItemSlot`

with `Vacant`

, setting the internal `next`

to the current value of `free_head`

. Finally, set `free_head`

to the index that was just replaced with `Vacant`

.

Continuing from the post-insert state above: Calling `remove(k₀)`

drops the hash entry for `k₀`

, marks slot `3`

as `Vacant`

with its `next`

pointing at the old `free_head`

(`1`

), and updates `free_head`

to `3`

:

This pattern generalizes to the other map types:

follows a similar structure, except the hash table is replaced with a`IdOrdMap`

[B-tree index](https://github.com/oxidecomputer/iddqd/blob/dd52024760267ce0a5bcd5b98a745a4b4468ebe5/crates/iddqd/src/support/btree_table.rs)[.[5](#_footnotedef_5)]and`BiHashMap`

store two and three hash tables respectively.`TriHashMap`

[Unsafe Rust in iddqd](#unsafe-rust-in-iddqd)

`iddqd`

has a few different kinds of unsafe Rust within it. Most uses of unsafe follow [well-established patterns](https://github.com/oxidecomputer/iddqd/blob/1588c91bb7d120f9175533c0f36fdcf58c84f13c/crates/iddqd/src/support/borrow.rs#L18) also used [within Rust’s standard library](https://github.com/rust-lang/rust/blob/c854ac880bf68671f56e6a28e4ff5cb7e92da62c/library/alloc/src/collections/btree/borrow.rs#L14)—these are not dependent on user code, and a sufficiently smart borrow checker would make these uses of unsafe unnecessary. But there are a few that stand out.

Perhaps the most challenging one to reason about is mutable iteration over items. For [ IdOrdMap::iter_mut](https://docs.rs/iddqd/latest/iddqd/struct.IdOrdMap.html#method.iter_mut), iteration is ordered by the key. One might start off by writing something like:

```
pub struct IterMut<'a, T: IdOrdItem> {    items: &'a mut ItemSet<T>,    // This is an ordered iterator over the B-tree equivalent to a hash table.    iter: btree_table::Iter<'a>,}impl<'a, T: IdOrdItem + 'a> Iterator for IterMut<'a, T> {    type Item = &'a mut T;    #[inline]    fn next(&mut self) -> Option<Self::Item> {        // Obtain the next index to return, in order.        let index = self.iter.next()?;        // Fetch the item.        let item: &mut T = &mut self.items[index];        Some(item)    }}
```

But this would immediately return an error:

``` php
error: lifetime may not live long enough
   --> crates/iddqd/src/id_ord_map/iter.rs:90:9
    |
 79 | impl<'a, T: IdOrdItem + 'a> Iterator for IterMut<'a, T> {
    |      -- lifetime `'a` defined here
...
 83 |     fn next(&mut self) -> Option {
    |             - let's call the lifetime of this reference `'1`
...
 90 |         Some(item)
    |         ^^^^^^^^^^ method was supposed to return data with lifetime `'a`
    |                    but it is returning data with lifetime `'1`
```

The reason this occurs is that Rust’s iterators require that the returned values do not borrow from the iterator itself[ 6]. Many uses of iterators do not need this capability—in particular, most for loops only operate on one item at a time:

``` js
let mut items = IdOrdMap::<T>::new();// ... insert some itemsfor item in &mut items {    item.value += 1;}
```

But Rust’s iterators also let you create values that outlive the iterator, via e.g. the [ collect combinator](https://doc.rust-lang.org/std/iter/trait.Iterator.html#method.collect):

``` js
let mut items = IdOrdMap::<T>::new();// ... insert some itemslet items_mut: Vec<&mut T> = items.iter_mut().collect();
```

`items_mut`

cannot outlive `items`

itself, but *can* outlive the iterator returned by `items.iter_mut()`

.

Crucially, this relies on knowing that all of the `&mut T`

returned by the iterator are disjoint—or, in other words, that the iterator never yields the same value twice. (Remember that one of the rules of Rust is that there must never be multiple `&mut`

references to the same memory.) But the Rust borrow checker just sees a succession of accesses into a list, and it has no way of knowing that the indexes are all different. If the human writing the code has knowledge that all of the indexes are different, though, they can use an unsafe pattern called [lifetime extension](https://doc.rust-lang.org/stable/std/mem/fn.transmute.html#examples) to tell the borrow checker to let this code through:

```
impl<'a, T: IdOrdItem + 'a> Iterator for IterMut<'a, T> {    type Item = &'a mut T;    #[inline]    fn next(&mut self) -> Option<Self::Item> {        // Obtain the next index to return, in order.        let index = self.iter.next()?;        // Fetch the item.        let item: &mut T = &mut self.items[index];        // Extend the lifetime of the item to 'a.        //        // SAFETY: The indexes returned by `self.iter` are known to all be different,        // so successive calls to `IterMut::next` never create mutable aliases to        // the same memory.        let item = unsafe { std::mem::transmute::<&mut T, &'a mut T>(item) };        Some(item)    }}
```

[A bug and a fix](#a-bug-and-a-fix)

In the previous section, we saw how unsafe Rust acts as an escape hatch for facts that the Rust compiler cannot establish by itself—in this case, that the indexes returned by `self.iter`

are all different. But *are the indexes actually different*? Based on the storage scheme described above, it would seem like that is the case. But this is generic code which calls into user-provided functions, so there’s a chance that sufficiently pathological user code can trick the map into storing duplicate indexes.

Let’s take one such example we recently found and fixed. Suppose you have an `IdOrdMap`

called `items`

storing five items with integer keys 0 through 4, inserted and stored in order. Here, the top row is the B-tree storing indexes in key order, while the bottom row is the `Vec<ItemSlot<T>>`

:

Now suppose you use the [Entry API](https://docs.rs/iddqd/latest/iddqd/struct.IdOrdMap.html#method.entry) (a standard Rust data structure pattern) to fetch an item by index 0:

```
// In this example, the integer keys are wrapped in another type. We're going// to swap in a different Ord for Key in the next step.let entry = items.entry(Key(0));
```

This will successfully look up the item stored at index 0 and return an [ OccupiedEntry](https://docs.rs/iddqd/latest/iddqd/id_ord_map/struct.OccupiedEntry.html). The

`OccupiedEntry`

internally stores the fact that the item was found at index 0.Now, at this point, let’s deliberately switch the `Ord`

implementation for `Key`

to always return [ Equal](https://doc.rust-lang.org/std/cmp/enum.Ordering.html#variant.Equal), no matter the values inside. There are a few different ways to do this in safe Rust, most of which use some form of

[interior mutability](https://doc.rust-lang.org/reference/interior-mutability.html). (This is unlikely in ordinary Rust, but

*that doesn’t matter*: unsafe generic Rust must handle arbitrarily pathological Rust!)

Now remove the entry from the map using [ OccupiedEntry::remove](https://docs.rs/iddqd/latest/iddqd/id_ord_map/struct.OccupiedEntry.html#method.remove):

``` js
let item = entry.remove();
```

At this point, `iddqd`

would attempt to remove *index* 0 from the B-tree, descending into the tree again but comparing only by *key*[ 7]. Because we made comparisons by key always return

`Equal`

, the comparison would short-circuit at the first element it compares against. For example, if the first comparison lands on the B-tree entry `(key = 2, index = 2)`

, then that entry would be removed. But the `OccupiedEntry`

was carrying `index = 0`

, so it would also remove item 0 from the item set:The map is no longer in a consistent state. The index 0 is still in the B-tree index but the corresponding slot in the `ItemSet`

is vacant; meanwhile, item 2 in the `ItemSet`

does not have any pointers to it.

At this point, though, the invariant that there aren’t duplicate indexes hasn’t been violated. Where it does get violated is with this next step: suppose the `Ord`

implementation now switches back to being honest[ 8], and you insert a new entry with key 1000. Since

`free_head`

points to item 0, `iddqd`

will attempt to reuse that slot. It is *this*step that results in duplicate indexes:

There are now duplicate indexes pointing to the same item, and `IterMut`

becomes unsound.

This was fixed through a combination of two things:

**When descending into the B-tree, also check equality against the index.** In other words, if we’re looking for a key with a known index in the B-tree,[compare against](https://github.com/oxidecomputer/iddqd/blob/dd52024760267ce0a5bcd5b98a745a4b4468ebe5/crates/iddqd/src/support/btree_table.rs#L458-L476). As covered in the example above, previously, we would just check that the key is the same; the pathological*both*`key`

and`index`

`Ord`

implementation returned`Equal`

, tricking the map into believing that the entry at`(key = 2, index = 2)`

was the one to remove. We changed this to additionally use the index as a tie-breaker. The comparison against`(key = 2, index = 2)`

now resolves to`Less`

(since`0 < 2`

), so the search doesn’t spuriously match here.By construction, the only way the search can now succeed is if the stored index actually equals the one we’re looking for.

**If there are no matches, fall back to a linear scan.** The tie-breaker eliminates spurious matches on the wrong entry, but it doesn’t guarantee that the search finds the*right*entry. (The B-tree is sorted by*key*, while the tie-breaker compares by*index*, and in general these two orderings are independent.) If the tree search yields no results, we must still keep the tree and the`ItemSet`

in sync. In that case, we remove the index from the B-tree by[doing a linear scan](https://github.com/oxidecomputer/iddqd/blob/dd52024760267ce0a5bcd5b98a745a4b4468ebe5/crates/iddqd/src/support/btree_table.rs#L267-L276)without calling into user code. The remove operation then takes linear rather than logarithmic time, but that is an acceptable tradeoff, since buggy user code is the only way to encounter the fallback.

This is the kind of long-range reasoning that is sometimes required to establish the soundness of a safe abstraction. The overall system relies on all of these moving parts working correctly across arbitrarily complex series of operations.

[Validating iddqd](#validating-iddqd)

Because `iddqd`

is such a foundational data structure at Oxide, we go through great lengths to validate its correctness, including the soundness of its abstractions. There is a great deal of analytical reasoning performed by experienced Rust authors and reviewers, as described in the section above. But we also empirically validate `iddqd`

along several different dimensions. None of the layers is sufficient by itself, but by rigorously doing all of them we can be more confident that `iddqd`

is correct.

The layers of validation we apply are:

| Technique | Catches |
|---|---|
Analytical reasoning | Soundness errors |
Example-based tests (including doctests) | Regressions, broken examples |
Pathological tests with Miri | Adversarial trait impls causing UB |
Model-based tests (PBT with | Divergence from oracle over random operation sequences |
Panic safety fault injection | Invariant violations on mid-operation panics |
LLM adversarial review | Blind spots in adversarial user code |

In this section, I’ll provide a brief overview of each layer. Click through the links to see more details and code samples.

[Analytical reasoning](#analytical-reasoning)

All blocks and patterns of unsafe Rust have been analyzed by at least one human reviewer, and up to three. Thanks to several of my Oxide colleagues for lending their time and expertise to these reviews—these discussions helped sharpen our reasoning considerably.

The reasoning for each unsafe block is captured in a `// SAFETY:`

comment above the block. We use Clippy’s [ undocumented_unsafe_blocks](https://rust-lang.github.io/rust-clippy/master/index.html?search=undocumented_unsafe_blocks#undocumented_unsafe_blocks) lint to verify that these comments are present.

[Example-based tests](#example-based-tests)

The simplest kind of empirical validation is with **example-based tests**: create a map, perform some operations, and ensure that the results of those operations are as expected. `iddqd`

has [unit tests for its internals](https://github.com/oxidecomputer/iddqd/blob/dd52024760267ce0a5bcd5b98a745a4b4468ebe5/crates/iddqd/src/support/btree_table.rs#L639) and [integration tests for the public API](https://github.com/oxidecomputer/iddqd/blob/dd52024760267ce0a5bcd5b98a745a4b4468ebe5/crates/iddqd/tests/integration/id_ord_map.rs). These also live as doctests, where they double as documentation.

[Pathological tests with Miri](#pathological-tests-with-miri)

`iddqd`

also has a [battery of pathological tests](https://github.com/oxidecomputer/iddqd/blob/dd52024760267ce0a5bcd5b98a745a4b4468ebe5/crates/iddqd/tests/integration/pathological.rs) which supply buggy implementations of

`Ord`

and other traits.In CI, both regular and pathological tests are run under the [Miri interpreter](https://github.com/rust-lang/miri/), which can detect many classes of undefined behavior. (If you’re curious about the details, we run with Miri under both Stacked and [Tree Borrows](https://www.ralfj.de/blog/2023/06/02/tree-borrows.html).) But note that some classes of UB can be detected in the regular test environment as well. For example, in the previous section we established that the tables not containing duplicate indexes is a necessary condition for soundness; this invariant [can be verified](https://github.com/oxidecomputer/iddqd/blob/dd52024760267ce0a5bcd5b98a745a4b4468ebe5/crates/iddqd/src/support/btree_table.rs#L113-L159) outside the Miri context as well.

[Model-based testing and fault injection](#model-based-testing-and-fault-injection)

`iddqd`

uses two layers of randomized testing: model-based comparison against a known-correct oracle, and fault injection on top of it.

For data structures, example-based tests alone are generally considered insufficient. A much stronger kind of testing is **model-based testing**, also known as **stateful property-based testing** (stateful PBT): random sequences of operations are applied to an instance of the type, and the results are compared against an *oracle* that is known to be correct. `iddqd`

has extensive model-based tests against [a NaiveMap oracle](https://github.com/oxidecomputer/iddqd/blob/dd52024760267ce0a5bcd5b98a745a4b4468ebe5/crates/iddqd-test-utils/src/naive_map.rs#L9) that is inefficient but clearly correct. (I’ve talked about this style of testing before

[on](https://www.youtube.com/watch?v=CcYK8E1UuJY).)

*Oxide and Friends*An extension of model-based testing is **fault injection**, where bugs are randomly inserted into user code. For `iddqd`

, a fruitful avenue for fault injection has been *panic safety* (or *unwind safety*): user code panicking in the middle of an operation must not cause the map’s invariants to be violated[ 9]. We systematically explore fault injection by

[generating random sequences](https://github.com/oxidecomputer/iddqd/blob/dd52024760267ce0a5bcd5b98a745a4b4468ebe5/crates/iddqd/tests/integration/id_ord_map.rs#L1113-L1255)of map operations, where each operation is associated with a (randomly selected) panic countdown. Each call into user code decrements the countdown by 1; if it reaches 0, the code panics. Randomized panic safety testing found several subtle bugs in

`iddqd`

([example](https://github.com/oxidecomputer/iddqd/pull/259)), including some that escalated into unsoundness.

The model-based tests also verify internal invariants after each operation, such as the no-duplicate-index condition described in the previous section. We’ve found in practice that model-based tests are too slow to run under Miri, so we verify the invariants on which soundness (and correctness) is known to be dependent instead.

[LLM adversarial review](#llm-adversarial-review)

Another kind of validation that’s recently become available is **LLM-driven adversarial review**. Current-generation frontier models[ 10] found several ways to write pathological implementations of user code that would corrupt the map. In

[one notable case](https://github.com/oxidecomputer/iddqd/pull/260), an LLM constructed a way for the map’s invariants to break on a custom allocator panicking and unwinding. This is a panic safety issue (and a failure mode I hadn’t thought about before), but distinct from the existing panic safety tests that only covered panics in regular user code like an

`Ord`

implementation.LLMs can sometimes produce plausible-but-wrong soundness claims. An effective way to guard against that is [red-green TDD](https://simonwillison.net/guides/agentic-engineering-patterns/red-green-tdd/), using Miri as an oracle: For a soundness bug, first have the LLM write a test case demonstrating the bug, running it under Miri to show undefined behavior—the *red* phase. Then, after fixing the bug, rerun the test to show that it now passes—the *green* phase.

[What about formal methods?](#what-about-formal-methods)

To formally verify `iddqd`

, one’s first thought would be to use a [model checker](https://en.wikipedia.org/wiki/Model_checking) like [Kani](https://github.com/model-checking/kani) to establish that the maps don’t violate internal invariants. But `iddqd`

is unfortunately a bit too complicated for Kani to handle, and the required proofs blow up beyond what is feasible for the tool.

The [Creusot](https://creusot.rs/) deductive verifier can help Rust developers prove their code is free of panics and other errors, but as of this writing it’s [unable to prove](https://github.com/creusot-rs/creusot/blob/7a68885eb85a3bea6f65e7cb50b9013e4aa7d1dc/creusot-std/src/std/panicking.rs#L10-L45) invariants that must hold even if user code panics or otherwise misbehaves.

The infeasibility of proofs is a common problem with applying formal methods to implementations, so they are often applied to a higher-level specification instead. For `iddqd`

, the closest thing to a specification is `NaiveMap`

, but it can be easily observed to be correct without needing a formal proof.

How do we ensure that the implementation matches the specification? Model-based testing does a lot of heavy lifting here. While they aren’t a formal proof that `iddqd`

matches `NaiveMap`

, running model-based tests thousands of times in CI provides fairly high confidence that it does.

We keep an eye on developments in this space; the in-development [ cargo-anneal](https://crates.io/crates/cargo-anneal), which lets you embed

[Lean](https://lean-lang.org/)soundness proofs alongside unsafe Rust as doc comments, looks quite interesting.

If you’re working on formal verification tooling, `iddqd`

is a great candidate to benchmark against because of its crisp invariants and relatively constrained yet non-trivial scope. We’d be especially interested in proofs that hold over arbitrary trait implementations, and of refinement between `iddqd`

and `NaiveMap`

. Please reach out by [filing an issue](https://github.com/oxidecomputer/iddqd/issues/new) or [emailing me](mailto:rain@oxide.computer)!

[Conclusion](#conclusion)

Writing unsafe generic Rust is extremely difficult. Each invariant that the unsafe code relies upon has to be carefully upheld over arbitrary trait implementations, including adversarial ones. This post covers one such example: how an `Ord`

implementation could be carefully written to trick the map into creating mutable aliases to the same memory, and how we fixed it.

No single technique can hope to catch all bugs, which is why `iddqd`

uses several layers of validation. Humans carefully reasoning about every line of unsafe code goes a long way; example-based, pathological, and randomized tests provide empirical evidence; and frontier models can find new and surprising ways to break code that humans might not have thought of. This is a lot of machinery, but at Oxide [we believe](https://oxide.computer/principles) that for foundational infrastructure, this level of rigor is justified. And if you agree, [we’re hiring](https://oxide.computer/careers)!

*Diagrams courtesy Ben Leonard.*
