Comparison and Benchmarking of Rust Decimal Crates


My English is not very good, so this article was translated with the help of AI. Here is the [Chinese version].

As is well known, because 2 and 10 do not share the same prime factors, binary fractions cannot represent most decimal fractions exactly. For example, f64 has the classic arithmetic error: 0.1 + 0.2 != 0.3.

Some application scenarios, such as finance, require exact representation of decimal fractions. This is why decimal crates are needed. They use integers to represent the mantissa, along with a scale representing the number of decimal places. For example, the value 1.23 can be represented using the integer 123 with scale = 2.
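Both points can be checked with a few lines of Rust. This is only a minimal sketch: the helper names binary_sum_is_exact and to_decimal_string are hypothetical, and the formatter assumes a non-negative mantissa and a small scale.

```rust
/// Returns whether 0.1 + 0.2 equals 0.3 when computed in binary f64.
fn binary_sum_is_exact() -> bool {
    0.1_f64 + 0.2_f64 == 0.3_f64
}

/// Formats a non-negative mantissa with `scale` decimal places.
/// Hypothetical helper for illustration only; `scale` must be below 20.
fn to_decimal_string(mantissa: u64, scale: u32) -> String {
    if scale == 0 {
        return mantissa.to_string();
    }
    let divisor = 10u64.pow(scale);
    let int_part = mantissa / divisor;
    let frac_part = mantissa % divisor;
    // Pad the fraction with leading zeros up to `scale` digits.
    format!("{}.{:0width$}", int_part, frac_part, width = scale as usize)
}
```

With a decimal representation, 0.1 + 0.2 is just the integer addition 1 + 2 at scale 1, and to_decimal_string(3, 1) formats the result as 0.3.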

There are many decimal crates in the Rust ecosystem, each with different designs and trade-offs. Their differences mainly fall into two dimensions:

Whether the scale is fixed or variable. This corresponds to Fixed-point vs Floating-point.

Whether the number of integers used to store the mantissa is fixed or arbitrary. This corresponds to Fixed-precision vs Arbitrary-precision.

This article chooses several crates for comparison and benchmarking.

Table of contents:

The first two sections (Fixed-point and Floating-point, Fixed-size and Arbitrary-precision) introduce the characteristics of these categories. There is nothing particularly new here, so experienced readers may skip them.

The next section (Choosing Crates) introduces several decimal crates.

The final section (Benchmark Comparison) is the main focus of this article, benchmarking and comparing these crates.

Fixed-point vs Floating-point

In fixed-point arithmetic, the scale is fixed and bound to the type. In floating-point arithmetic, the scale is variable and stored in each instance.

Let’s illustrate this with code.

A typical fixed-point type definition might look like this:

struct FixedPoint<const SCALE: i32>(i128); // scale is bound to type

A typical floating-point decimal type might look like this:

struct FloatingPoint {
    mantissa: i128,
    scale: i32, // scale is stored in each instance
}

This clearly shows that fixed-point numbers have fixed decimal precision, while floating-point decimals have variable precision. For example, FixedPoint<2> always has 2 decimal places, while the precision of FloatingPoint depends on each instance’s scale.

Because of this distinction, fixed-point and floating-point types exhibit the following differences:

Fixed-point numbers have a smaller representable range, while floating-point numbers can represent a much larger range. This is because floating-point numbers sacrifice decimal precision as values become larger.

Fixed-point arithmetic is simpler and faster, while floating-point arithmetic is more complex and slower. For example, addition for fixed-point numbers only requires integer addition on the mantissa. Floating-point addition must first check whether the scales are equal (this check itself can already be slower than the addition), and if not, align the scales through multiplication. This will be discussed in detail in the benchmark section.

Fixed-point arithmetic is somewhat more cumbersome to use, while floating-point arithmetic is more convenient. For example, with the FixedPoint type above, the scale must be determined at compile time for each type, such as how many decimal places Balance or Price should have. Floating-point decimals do not require this consideration.

The difference between the two is somewhat analogous to the difference between statically typed and dynamically typed languages.
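The difference in addition cost described above can be sketched as follows. This is an illustration built on the two type definitions shown earlier, not the code of any particular crate; the checked_add methods and the overflow handling (simply returning None) are assumptions made for brevity.

```rust
/// Fixed-point: the scale is part of the type, so only the mantissa is stored.
#[derive(Clone, Copy, Debug, PartialEq)]
struct FixedPoint<const SCALE: i32>(i128);

impl<const SCALE: i32> FixedPoint<SCALE> {
    /// Both operands share SCALE, so this is one checked integer addition.
    fn checked_add(self, rhs: Self) -> Option<Self> {
        self.0.checked_add(rhs.0).map(Self)
    }
}

/// Floating-point: every instance carries its own scale.
#[derive(Clone, Copy, Debug, PartialEq)]
struct FloatingPoint {
    mantissa: i128,
    scale: i32,
}

impl FloatingPoint {
    /// Checks the scales first and aligns them by multiplication if needed.
    fn checked_add(self, rhs: Self) -> Option<Self> {
        if self.scale == rhs.scale {
            // Fast path: same scale, plain integer addition.
            let mantissa = self.mantissa.checked_add(rhs.mantissa)?;
            return Some(FloatingPoint { mantissa, scale: self.scale });
        }
        // Slow path: bring the operand with the smaller scale up to the larger one.
        let (hi, lo) = if self.scale > rhs.scale { (self, rhs) } else { (rhs, self) };
        let diff = hi.scale.checked_sub(lo.scale)? as u32; // positive by construction
        let factor = 10i128.checked_pow(diff)?;
        let aligned = lo.mantissa.checked_mul(factor)?;
        let mantissa = hi.mantissa.checked_add(aligned)?;
        Some(FloatingPoint { mantissa, scale: hi.scale })
    }
}
```

For example, adding 1.23 (mantissa 123, scale 2) and 0.5 (mantissa 5, scale 1) with the floating-point type first rescales 5 to 50 and then yields mantissa 173 at scale 2.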

Most applications use decimal crates simply to represent decimal fractions exactly, without particularly high requirements for performance or strict decimal precision. In such cases, floating-point decimals are usually preferred for convenience. However, for more serious services, especially many financial systems that require strict decimal precision or high performance, fixed-point decimals are recommended. For example, USD assets should have exactly 2 decimal places, neither more nor less.

NOTE: Since built-in floating-point types in programming languages (such as C’s float and double, or Rust’s f32 and f64) are commonly referred to as “floating-point”, and these types cannot represent decimal fractions exactly, many people mistakenly think that “floating-point” inherently cannot represent decimal fractions exactly. This is WRONG! More precisely, these are “binary floating-point” numbers. The inability to represent decimal fractions exactly comes from the “binary” part, not the “floating-point” part. Because people often omit the word “binary”, floating-point arithmetic unfairly gets blamed. In fact, even binary fixed-point types, such as the fixed crate, also cannot represent decimal fractions exactly. As long as a crate is decimal-based, whether fixed-point or floating-point, it can represent decimal fractions exactly.

NOTE: Floating-point arithmetic has a standard called IEEE 754, which defines both binary floating-point formats (used by f32/f64) and decimal floating-point formats. However, this standard is only one implementation approach for floating-point arithmetic, not the entirety of floating-point arithmetic itself. Other implementations are also possible. In practice, most decimal crates do not follow IEEE 754 decimal formats.

Fixed-precision vs Arbitrary-precision

First, let’s clarify the meaning of the word “precision” here. The term has two conflicting meanings: the number of fraction places (digits after the decimal point), and the number of significant digits.

For example, the value 1.23 has 2 fraction places but 3 significant digits. Both meanings are widely used. For example, std::fmt uses the former meaning, while here (Fixed-precision vs Arbitrary-precision) the latter meaning is used. This is the standard terminology, but it easily causes confusion. “Fixed-precision” is often misunderstood as fixed fraction places, leading to confusion with fixed-point arithmetic.

To avoid ambiguity, this article uses the term Fixed-size instead of Fixed-precision.

As the name suggests, Fixed-size types use a fixed number of integers (one or more). Arbitrary-precision types use as many integers as necessary: expanding to the left to avoid overflow, and expanding to the right to avoid precision loss.

Naturally, this requires heap allocation, meaning the type is not Copy, and the crate is not no-alloc. All operations also become significantly slower. Unless there is a clear requirement for arbitrary precision, Fixed-size types are generally preferable.

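Choosing Crates

We chose the following decimal crates for comparison and benchmarking: bigdecimal, fastnum, rust_decimal, decimax, and primitive_fixed_point_decimal.

bigdecimal

| Floating-point | Arbitrary-precision |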

This is currently the only actively maintained Arbitrary-precision decimal crate. Internally, it uses a Vec<u64> or Vec<u32> to represent the mantissa. Its memory layout looks like this:

+-u64----+--------+--------+--------+--------+
| sign   | Vec<u64>                 | scale  |
+--------+--+-----+--------+--------+--------+
            |
            +--------+--------+----
            | u64    |  …     |
            +--------+--------+----

Metadata alone occupies 5 machine words, totaling 40 bytes, making the memory layout relatively loose. Since memory allocation is required during creation and expansion, and pointer dereferencing is needed during access, performance is relatively poor, as will be clearly shown in the benchmarks below.
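The five-word figure can be reproduced with std::mem::size_of. The struct below is a hypothetical stand-in that mirrors the diagram (a one-word sign, a Vec, and a one-word scale); it is not bigdecimal’s actual definition. The Vec documentation describes Vec as a (pointer, capacity, length) triplet, which accounts for three of the five words.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for the layout in the diagram above;
/// not bigdecimal's actual definition.
#[allow(dead_code)]
struct HeapDecimal {
    sign: usize,      // one machine word
    digits: Vec<u64>, // pointer + capacity + length: three machine words
    scale: isize,     // one machine word
}

/// Number of machine words occupied by the inline (non-heap) part.
fn inline_words() -> usize {
    size_of::<HeapDecimal>() / size_of::<usize>()
}
```

On a 64-bit target a machine word is 8 bytes, so five words give the 40 bytes mentioned above, before counting the heap buffer that holds the digits themselves.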

In short, this crate prioritizes Arbitrary-precision at the expense of memory efficiency and performance.

fastnum

| Floating-point | Fixed-size |

Its Decimal definition is:

struct Decimal<const N: usize>

Here, N is the number of u64s used to represent the mantissa. For example, Decimal<2> uses two u64s, giving a 128-bit mantissa. This is why its documentation also describes it as Arbitrary-precision. The difference is that bigdecimal adjusts precision at runtime, while fastnum determines it at compile time.

The memory layout is:

+-u64----+--------+...+--------+
| [u64; N]            | CBlock |
+--------+--------+...+--------+

CBlock is an 8-byte ControlBlock used by fastnum to store metadata. Besides sign and scale, it contains additional fields. See the documentation for details.

fastnum also provides many scientific functions typically found in f32/f64, such as sin, cos, sqrt, and log. None of the other decimal crates provide such functionality. Personally, I do not think these features are particularly reasonable. People use decimal arithmetic to represent decimal fractions exactly, while scientific computations typically produce irrational numbers that cannot be represented exactly anyway. Scenarios requiring such operations (even in finance, such as pricing models) are better suited to much faster binary floating-point types (f32/f64).

The documentation claims the crate is blazing fast, but its benchmark comparisons are mostly against the already slow bigdecimal. In the benchmarks below, compared to the other selected crates, fastnum turns out to be the slowest. However, since it considers itself Arbitrary-precision, its intended competitor is probably bigdecimal.

Also, its documentation is extremely detailed.

rust_decimal

| Floating-point | Fixed-size |

The most popular decimal crate in the Rust ecosystem. Judging from download counts, reverse dependencies, and ecosystem integration (serde, postgres, etc.), it is by far the most widely used. It is also one of the oldest decimal crates, with its first release dating back to late 2016. Its age is probably a major reason for its popularity.

It only supports 128-bit signed decimals. Memory layout:

+-u32--+------+------+------+
| flag | high | mid  | low  |
+------+------+------+------+

The mantissa consists of three u32s (high, mid, and low), totaling 96 bits, roughly equivalent to 28 decimal digits. Arithmetic operations must process all three u32s sequentially, which hurts performance.
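To illustrate what processing three limbs sequentially means, here is a generic multi-limb addition with carry propagation. It is a sketch of the general technique, not rust_decimal’s actual implementation; the function name add_96 and the [low, mid, high] limb order are assumptions.

```rust
/// Adds two 96-bit mantissas stored as [low, mid, high] limbs.
/// Returns None if the sum does not fit in 96 bits.
fn add_96(a: [u32; 3], b: [u32; 3]) -> Option<[u32; 3]> {
    let mut out = [0u32; 3];
    let mut carry = 0u64;
    for i in 0..3 {
        // Widen to u64 so the carry out of each limb is visible.
        let sum = a[i] as u64 + b[i] as u64 + carry;
        out[i] = sum as u32; // keep the low 32 bits
        carry = sum >> 32;   // 0 or 1, passed on to the next limb
    }
    if carry == 0 { Some(out) } else { None }
}
```

Each limb depends on the carry of the previous one, so the three additions cannot be collapsed into a single instruction the way a native 128-bit addition can.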

The flag field stores the sign and the scale (in the range [0, 28]).

The documentation claims this memory layout is chosen for performance optimization. However, the benchmarks below show that rust_decimal is not actually the fastest. Historically, this design likely existed because Rust originally lacked stable 128-bit integers.

The API also reveals traces of the pre-i128 era. For example, the constructor from i64 is called new, while the later-added i128 constructor is named from_i128_with_scale.

decimax

| Floating-point | Fixed-size |

This crate occupies essentially the same niche as rust_decimal.

Advantages:

Disadvantages:

rust_decimal.

One reason this crate was selected is that I am its author :)

It uses a single integer representation. For the 128-bit signed type, the memory layout is:

+-u128-----------------------+
|S|scale| mantissa           |
+----------------------------+

The sign (S) and scale occupy 1 bit and 5 bits respectively, leaving 122 bits for the mantissa, or roughly 36 decimal digits, significantly more than rust_decimal’s 28 digits.

Arithmetic uses a single u128 instead of three u32s, making it faster.
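The digit counts quoted for the different mantissa widths can be checked with a small helper. This is a sketch; the name full_decimal_digits is hypothetical, and it counts how many decimal digits are guaranteed to fit, meaning that every number with that many digits is representable.

```rust
/// Largest d such that every d-digit decimal number fits in `bits` bits.
fn full_decimal_digits(bits: u32) -> u32 {
    assert!((1..=128).contains(&bits));
    let max: u128 = if bits == 128 { u128::MAX } else { (1u128 << bits) - 1 };
    let mut digits = 0;
    let mut power: u128 = 1; // always 10^digits
    while let Some(next) = power.checked_mul(10) {
        // `next - 1` is the largest number with digits + 1 decimal digits.
        if next - 1 > max {
            break;
        }
        power = next;
        digits += 1;
    }
    digits
}
```

This gives 28 digits for a 96-bit mantissa and 36 digits for a 122-bit mantissa, matching the figures above.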

primitive_fixed_point_decimal

| Fixed-point | Fixed-size |

This is the only Fixed-point crate selected in this article. Its main difference from the others is precisely that it is Fixed-point, as discussed earlier in Fixed-point and Floating-point.

Compared with other Fixed-point decimal crates, its biggest feature is that besides the typical FixedPoint style (using const generics to fix decimal places at compile time), it also provides an Out-of-band scale mode, allowing the scale to be specified at runtime for greater flexibility.

For example, in a multi-currency fund management system, using the typical FixedPoint type forces all currencies to share the same decimal precision. Defining:

type Balance = FixedPoint<2>

means all currencies are limited to 2 decimal places.

With the crate’s Out-of-band scale types, each currency can define its own decimal precision. See the Out-of-band documentation for details.

Since the scale is bound to the type (either through const generics or Out-of-band metadata), no scale needs to be stored in the instance itself. Therefore, instances only store the mantissa. For the 128-bit signed type, the memory layout is:

+-i128-----------------------+
| signed-mantissa            |
+----------------------------+

This crate also differs in another implementation detail: it uses signed mantissas, while all the other selected crates separate sign and mantissa handling. This distinction also originates from the difference between floating-point and fixed-point arithmetic, but we will not go into detail here. The only thing worth noting is that this leaves the mantissa with 127 bits instead of 128.

Let’s compare memory efficiency by looking at metadata size:

Spoiler: this ranking matches the benchmark results.

Benchmark Comparison

Now we arrive at the core of this article: benchmark results.

We use criterion for benchmarking. The project source code is available on GitHub.

Benchmarks were run on three machines:

Results vary somewhat across environments. For simplicity, this article only presents and analyzes the first machine (AMD EPYC). Readers interested in other environments can refer to the full results. You are also welcome to run the benchmarks on your own machine; instructions are included in the project’s page.

Besides the decimal crates above, native Rust f64 is also included for comparison. Since stable f128 is not yet available, it was not benchmarked. However, in my private tests, f128 performs almost identically to f64.

We primarily benchmark 128-bit and 64-bit signed types. However:

bigdecimal is variable-sized, so bit width is irrelevant.

fastnum supports much larger sizes, so this benchmark somewhat underutilizes it.

rust_decimal only supports 128-bit, not 64-bit.

Benchmark cases:

Subtraction behaves similarly to addition and is therefore omitted.

Operand selection: Different benchmark cases use different scale configurations depending on the scenario. The mantissas themselves (more precisely: both addition operands, both multiplication operands, and the dividend for division) are all powers of 10, increasing exponentially. For example, x = 3 on the chart means the operand is 1e3.

Because different crates support different mantissa sizes, their representable ranges differ, resulting in different line lengths in the charts:

bigdecimal supports arbitrary precision, but was restricted here to 128-bit-equivalent values, or 38 decimal digits.

fastnum:128 has a full 128-bit mantissa, also about 38 digits.

prim-fpdec:128 has a 127-bit mantissa, but still roughly 38 decimal digits.

decimax:128 has a 122-bit mantissa, about 36 digits.

rust_decimal has a 96-bit mantissa, only about 28 digits.

The following sections explain the details.

The addition process works as follows:

This section benchmarks the equal-scale case. The next section covers unequal scales.

For simplicity, we use identical operands. The scale does not affect the benchmark and is fixed at 10. The mantissas are powers of 10 increasing in magnitude.

Chart:

As expected, bigdecimal sits far above the others. The remaining crates are compressed near the bottom, so we temporarily remove bigdecimal:

Now things are much clearer.

For 128-bit types:

fastnum:128 is the slowest

rust_decimal comes next

decimax follows

prim-fpdec:128 is the fastest

The first three are floating-point decimals, so they must first check whether the scales are equal before addition. This check itself is relatively expensive and slows down the entire operation.

prim-fpdec:128 is fixed-point, so the operation is essentially just integer addition, almost a single CPU instruction.

For 64-bit types:

fastnum:64 is slightly faster than fastnum:128

decimax:64 performs similarly to decimax:128

prim-fpdec:64 performs similarly to prim-fpdec:128

Most curves are stable, except rust_decimal and fastnum:64, both of which exhibit noticeable jumps, though for different reasons:

For rust_decimal, the jump occurs because numbers are internally represented using three u32s. Small mantissas fitting within one u32 only require one addition, while larger mantissas require operations across all three u32s. Hence the jump around x = 9.

For fastnum:64, the jump occurs because its 64-bit mantissa can represent up to 19 decimal digits. Since our benchmarks use powers of 10, the problematic case occurs around 1e19. Adding two such values yields 2e19, exceeding the 64-bit range (~1.84e19). Following floating-point behavior, the implementation must rescale by dropping one digit from the mantissa and lowering the scale by one to match: mantissa /= 10; scale -= 1;. Since division is slow, the addition operation suddenly becomes much slower. Other floating-point crates may encounter similar situations, though not within this benchmark range. Fixed-point crates cannot rescale, so they simply overflow and return an error instead.
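The overflow case can be sketched as follows. This is not fastnum’s code, and the function name add_same_scale is hypothetical. It takes scale as the number of decimal places, as defined at the start of the article, so dropping one mantissa digit lowers the scale by one; the dropped digit is truncated rather than rounded.

```rust
/// Adds two equal-scale decimals with unsigned 64-bit mantissas.
/// Returns the (mantissa, scale) of the sum.
fn add_same_scale(a: u64, b: u64, scale: i32) -> (u64, i32) {
    match a.checked_add(b) {
        // Common case: the sum fits, so this is one integer addition.
        Some(sum) => (sum, scale),
        // Overflow: redo the sum in a wider type, then drop one digit.
        None => {
            let wide = a as u128 + b as u128;
            // The division is what makes this path slow.
            ((wide / 10) as u64, scale - 1)
        }
    }
}
```

With both mantissas at 1e19 and scale 10, the sum 2e19 does not fit in 64 bits, so the result becomes mantissa 2e18 with scale 9, which is the same value.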

Now let’s look at addition where the operand scales differ.

Fixed-point types cannot participate in this benchmark, so primitive_fixed_point_decimal is excluded.

Before adding mantissas, floating-point decimals must first align the scales. The algorithm typically works as follows:

In this benchmark, operand scales are fixed at 10 and 0, differing by 10. Therefore, alignment requires multiplying by 1e10. Once the mantissa grows beyond 1e(MAX_SCALE - 10), multiplication overflows and the slower fallback path involving division is triggered.
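A simplified version of this two-path alignment might look like the sketch below. It is an assumption-laden illustration, not taken from any of the benchmarked crates: the name add_unequal is hypothetical, mantissas are i64 for brevity, and the fallback rescales the finer operand all the way down, whereas real implementations typically try to lose as few digits as possible.

```rust
/// Adds (a, a_scale) and (b, b_scale), where scale = number of decimal places.
fn add_unequal(a: i64, a_scale: u32, b: i64, b_scale: u32) -> Option<(i64, u32)> {
    // Order the operands so that `hi` has the larger scale.
    let ((hi, hi_scale), (lo, lo_scale)) = if a_scale >= b_scale {
        ((a, a_scale), (b, b_scale))
    } else {
        ((b, b_scale), (a, a_scale))
    };
    let factor = 10i64.checked_pow(hi_scale - lo_scale)?;

    // Fast path: scale the coarser operand up by 10^diff.
    if let Some(aligned) = lo.checked_mul(factor) {
        return Some((hi.checked_add(aligned)?, hi_scale));
    }
    // Fallback: the multiplication overflowed, so scale the finer operand
    // down instead. This needs a division and discards low digits.
    let reduced = hi / factor;
    Some((reduced.checked_add(lo)?, lo_scale))
}
```

For example, 1.23 + 5 takes the fast path and yields mantissa 623 at scale 2, while adding 1e17 (scale 0) to a value with scale 10 overflows the multiplication and goes through the division.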

Chart:

Again, bigdecimal dominates the chart, so we temporarily remove it:

Compared with equal-scale addition, absolute times are much slower because of scale alignment.

As explained above, all curves eventually exhibit jumps.

Among them:

rust_decimal shows the largest jump, tripling from ~15ns to ~45ns and becoming unstable afterward.

fastnum:128 shows a moderate jump.

decimax:128 shows the smallest jump.

Performance ranking (slower first):

Before the jump: fastnum:128, rust_decimal, decimax:128

After the jump: rust_decimal, fastnum:128, decimax:128

Now let’s examine multiplication.

Decimal multiplication consists of two parts: multiplying the mantissas and adding the scales.

Both steps may overflow. If either overflows, a second phase is triggered, reducing both mantissa and scale to avoid overflow. Since division is involved, performance degrades significantly.

We again use identical operands with exponentially increasing mantissas. To avoid overflow of the decimal value itself (as opposed to overflow of the mantissa multiplication), scales are increased together with the mantissas so that the actual value remains 1.

Once the mantissa reaches approximately half the representable range, mantissa multiplication overflows and triggers the second phase.
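The two phases can be sketched with 64-bit mantissas and a 128-bit intermediate product. This is only an illustration of the idea, not the algorithm of any specific crate; the name mul is hypothetical, and the second phase truncates instead of rounding.

```rust
/// Multiplies two decimals with 64-bit mantissas; scale = decimal places.
fn mul(a: i64, a_scale: u32, b: i64, b_scale: u32) -> Option<(i64, u32)> {
    // Phase 1: multiply mantissas (in a wider type) and add scales.
    let mut product = a as i128 * b as i128;
    let mut scale = a_scale.checked_add(b_scale)?;

    // Phase 2: only runs if the product does not fit the 64-bit mantissa.
    while product > i64::MAX as i128 || product < i64::MIN as i128 {
        if scale == 0 {
            return None; // the value itself is out of range
        }
        product /= 10; // drops the lowest digit (truncating)
        scale -= 1;
    }
    Some((product as i64, scale))
}
```

Multiplying 1.5 by 0.25 stays in the first phase and gives mantissa 375 at scale 3. Multiplying 1e10 at scale 10 by itself (the value 1 times 1, as in the benchmark) produces 1e20 at scale 20, which no longer fits, so the second phase reduces it to 1e18 at scale 18.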

Chart:

Besides bigdecimal, both fastnum curves become extremely large in the latter half. To better observe the other crates, we remove the entire bigdecimal curve and truncate the fastnum curves:

The chart is still somewhat messy, so let’s break it down carefully.

Because of mantissa multiplication overflow, most curves exhibit jumps around their midpoint.

First, consider the post-jump behavior for 128-bit types:

fastnum:128 slows down extremely rapidly after the jump.

rust_decimal exhibits multiple jumps, likely because of its three-u32 representation.

decimax and prim-oob-fpdec:128 are much more stable and significantly faster.

Now consider the pre-jump region:

fastnum:128 and rust_decimal are both stable before their jumps (x=19 and x=14 respectively), though fastnum survives longer.

decimax and prim-oob-fpdec:128 are not only stable but extremely fast before their jumps.

Careful readers may notice that primitive_fixed_point_decimal appears as two variants: prim-oob-fpdec:128 and prim-const-fpdec:128. Only the former was discussed earlier. This difference arises from fixed-point semantics. The multiplication process described earlier (multiply mantissas, add scales) applies to floating-point decimals. For fixed-point decimals, however, the result scale is predetermined. After adding operand scales, the implementation must further adjust to the target scale, similar to the overflow-adjustment phase. In other words, the second phase that floating-point types only enter later is always active for fixed-point types. This is somewhat unfair to fixed-point arithmetic. Fortunately, primitive_fixed_point_decimal provides the more flexible Out-of-band Scale mode, allowing the result scale to equal the sum of operand scales. This avoids the second phase during the early part of the benchmark, enabling fairer comparison with floating-point types. That is what prim-oob-fpdec:128 measures.

However, this is not the real-world use case for fixed-point arithmetic. The Out-of-band Scale feature was not designed specifically for this benchmark. To reflect realistic fixed-point usage, we also benchmark prim-const-fpdec:128, where the result scale remains fixed, forcing the second phase throughout the entire benchmark. As the chart shows, prim-const-fpdec:128 is initially the slowest, but later it becomes one of the fastest, converging with prim-oob-fpdec:128.

Does this mean fixed-point multiplication is slower than floating-point multiplication for small mantissas? For this specific case, yes. But over longer computation chains, not necessarily. Floating-point multiplication appears faster because it postpones scale adjustment, allowing both scale and mantissa to grow. As shown throughout this article, larger scales and mantissas tend to slow down subsequent operations. Unless the multiplication result is final and never used again (not even formatted as a string), the earlier performance advantage tends to be paid back later.
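To make the always-active second phase concrete, here is a sketch of const-scale fixed-point multiplication. The Fixed type and its checked_mul method are hypothetical and use a 64-bit mantissa for brevity; they are not the API of primitive_fixed_point_decimal.

```rust
/// Fixed-point decimal with SCALE decimal places (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Fixed<const SCALE: u32>(i64);

impl<const SCALE: u32> Fixed<SCALE> {
    /// The raw product has 2 * SCALE decimal places, so it is always
    /// divided by 10^SCALE to get back to the type's own scale.
    fn checked_mul(self, rhs: Self) -> Option<Self> {
        let wide = self.0 as i128 * rhs.0 as i128;
        let divisor = 10i128.checked_pow(SCALE)?;
        let adjusted = wide / divisor; // truncates toward zero
        if adjusted > i64::MAX as i128 || adjusted < i64::MIN as i128 {
            None
        } else {
            Some(Self(adjusted as i64))
        }
    }
}
```

Even for small operands such as 1.50 times 2.00, the division by 10^SCALE runs on every call. That is the cost the const-scale variant pays up front, and in exchange the result keeps a small, predictable scale for later operations.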

The 64-bit results behave similarly and are omitted here.

Division has several notable characteristics:

Overall, division tends to consume disproportionate development and benchmarking effort for a relatively small portion of real-world usage. Therefore, this article only benchmarks two simple cases, exact division and non-exact division, without attempting an exhaustive or perfectly fair comparison.

This section discusses the former, exact division.

For exactly divisible floating-point division, there are again two subcases:

The mantissas divide directly, for example 200 / 25.

The mantissas do not divide directly, for example 2 / 25.

In the second case, 2 does not divide evenly by 25, but after rescaling to 200, division succeeds. The difficulty is that the implementation initially does not know how much rescaling is needed, or whether exact division is even possible. Therefore, implementations often first scale up aggressively, then perform the division, and finally strip trailing zeros. For example, 2 might first become 20000000000, producing 800000000, and only afterward get reduced back to 8. Even the zero-stripping phase must be discovered iteratively, making this path potentially very slow.
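The scale-up, divide, and strip sequence can be sketched as follows. This is a simplified illustration, not the code of any benchmarked crate: the name div_exact and the extra parameter (how many digits to scale up by) are assumptions, and both operands are plain integers with scale 0.

```rust
/// Divides a / b when the result is an exact decimal.
/// `extra` is how many digits the dividend is scaled up by before dividing.
/// Returns (mantissa, scale), or None if the division is not exact at this
/// precision (or if b is zero or an intermediate step overflows).
fn div_exact(a: i128, b: i128, extra: u32) -> Option<(i128, u32)> {
    // Step 1: scale up aggressively.
    let scaled = a.checked_mul(10i128.checked_pow(extra)?)?;
    // Step 2: divide, and give up if there is a remainder.
    if scaled.checked_rem(b)? != 0 {
        return None;
    }
    let mut quotient = scaled.checked_div(b)?;
    let mut scale = extra;
    // Step 3: strip trailing zeros, one digit at a time.
    while scale > 0 && quotient % 10 == 0 {
        quotient /= 10;
        scale -= 1;
    }
    Some((quotient, scale))
}
```

With extra = 10, dividing 2 by 25 follows the numbers in the text: 2 becomes 20000000000, the quotient is 800000000, and stripping eight zeros leaves mantissa 8 at scale 2, which is 0.08.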

To cover both cases, the benchmark fixes the divisor at 1e8, while the dividend again increases as powers of 10.

Thus:

When x < 8, rescaling is required (slow path).

When x >= 8, direct division succeeds (fast path).

Fixed-point types do not have these distinctions because the quotient scale is predetermined.

Chart:

For floating-point types:

Below x=8, all implementations are very slow.

From x=8 onward, rust_decimal, fastnum:128, and decimax become much faster, while bigdecimal remains slow.

For fixed-point:

prim-fpdec:128 avoids quotient-scale determination and is initially very fast. Later, larger mantissas gradually slow it down.

Now consider the non-exact division case.

As explained above, exactness only matters for floating-point decimals. Fixed-point behavior remains unchanged, so the fixed-point results here should match the previous benchmark.

Chart:

Again, removing bigdecimal makes the comparison clearer:

Compared with their exact-division counterparts:

bigdecimal, fastnum:128, and rust_decimal are consistently much slower.

decimax:128 becomes significantly faster and very stable.

prim-fpdec:128, being fixed-point, behaves identically to the exact-division benchmark.

The reasons likely require code-level analysis of each implementation and are beyond the scope of this article.

Overall, except for a few special cases, the approximate performance ranking is:

bigdecimal << fastnum < rust_decimal < decimax < primitive_fixed_point_decimal

(Further left means slower.)

Floating-point arithmetic paths depend heavily on the specific operands, making performance relatively unstable. Fixed-point arithmetic, by comparison, is much more predictable, which is reflected in the mostly flat curves above.

Again, it is important to emphasize that these crates target different use cases, so pure performance comparison is not entirely fair.

This article introduced several categories of decimal crates and benchmarked several representative implementations.

Based on the results, the following recommendations can be made:

If dynamic arbitrary precision is required, bigdecimal is the only option, at the cost of losing Copy semantics and suffering very poor performance.

If types larger than 128-bit are required, fastnum is the only choice. This article does not benchmark larger-than-128-bit types, but performance is unlikely to be excellent. Interested readers can modify the benchmark project and test it themselves.

If fixed decimal precision is required, primitive_fixed_point_decimal is the only suitable option. Although slightly less convenient than floating-point types, it provides higher and more stable performance.

If none of the above requirements apply and you simply want exact decimal representation, rust_decimal or decimax are both good choices. The former has a stronger ecosystem; the latter offers better performance.

source & further reading

wubingzheng.github.io — original article
