Bonsai LLM Benchmark: Jetson Orin Nano Super 8GB

wpnews.pro

5 Bonsai-family 1–1.58bit LLMs benchmarked across 4 power modes on Jetson Orin Nano Super 8GB. 25W sweet spot: 47–48% more tok/s than 15W, best output tok/J for all sub-4B models.

My Jetson Nano Orin Super cluster setup - its real!

Benchmark Configuration #

| Platform | NVIDIA Jetson Orin Nano Super 8GB | | CPU | 6-core Arm Cortex-A78AE | | GPU | NVIDIA Ampere (1024 CUDA cores, 32 Tensor cores) | | Memory | 8 GB LPDDR5 shared CPU+GPU | | JetPack | R36.4.7 (L4T 36.4) | | Backend | llama.cpp CUDA, -ngl 99 (all layers on GPU), --no-cache-prompt | | Runs | Four full sweeps: 7W, 15W, 25W, MAXN_SUPER | | Sweep | prompt ∈ {256, 512, 1024, 2048} tok × gen ∈ {128, 256, 512} tok × 20 reqs/combo | | Concurrency | 1 (single-user) | | Key metric | |

Github for scripts and code available at

[link.]## Executive Summary

Five Bonsai-family 1-1.53bit LLMs were benchmarked across all four Jetson Orin Nano Super power modes: 7W, 15W, 25W, and MAXN_SUPER. Each model ran 12 combinations of prompt × generation length (20 requests per combo) at every power mode where it could load.

Key finding:

25W is the energy-efficiency sweet spot for all models ≤4B parameters. For Bonsai-8B, 15W and 25W deliver near-identical output tok/J (~1 % difference), making 15W the more power-conservative choice.
MAXN costs 10–11 % more energy per token than 25W across every model tested.** 25W delivers *47–48 %*more output tok/s than 15W while maintaining or improving output tok/J for sub-4B models (ctx=2048, gen=512). No thermal throttling was observed at any power mode- peak junction temperature (TJ) reached 75.3 °C at MAXN (Bonsai-8B), well below the 95 °C hardware throttle threshold. All other models peak below 72 °C even at MAXN.

Throughput and efficiency winner at each mode (ctx=2048, gen=512, Ternary-Bonsai-1.7B dominates):

[ ]Table 1: Throughput and efficiency winner at each power mode

Mode	Fastest model	Output Tok/s	Output Tok/J
7W	Ternary-Bonsai-1.7B	9.0	4.64
15W	Ternary-Bonsai-1.7B	23.4	4.94
25W	Ternary-Bonsai-1.7B	34.7
5.18
MAXN	Ternary-Bonsai-1.7B	38.0	4.55

Ternary-Bonsai-8B (Q2_0, ~1.4 GB) failed at every power mode: OOM in 8 GB unified memory when combined with KV cache and CUDA overhead. All five remaining models have complete data across all four power modes.

Data Availability #

Complete per-cell JSON exports (all 33 metrics, all 12 prompt×gen combos × 20 requests per cell) are published on Hugging Face Datasets:

Mode	Dataset	Models	Cells
7W
`YuvrajSingh9886/bonsai-jetson-benchmark-7w`

YuvrajSingh9886/bonsai-jetson-benchmark-15w

YuvrajSingh9886/bonsai-jetson-benchmark-25w

YuvrajSingh9886/bonsai-jetson-benchmark-maxn

Each dataset contains the full profile_export_aiperf.json

per cell (all 33 metrics including ISL,

,

OSL

avg/p50/p90/p99,

TTFT

,

ITL

,

output tok/s

,

request latency

,

prefill tok/s

,

power W

),

output tok/J

tegrastats.log

, and per-model server logs.## 1. Test Setup

1.1 Hardware

[ ]Table 2: Hardware configuration

Component	Detail
Board	Jetson Orin Nano Super 8GB (Developer Kit)
CPU	6× Arm Cortex-A78AE @ up to 1.728 GHz
GPU	NVIDIA Ampere, 1024 CUDA cores, 32 Tensor cores
Memory	8 GB LPDDR5 102 GB/s (unified CPU + GPU)
CMA	256 MB (contiguous memory pool; depletes across sequential model loads)
Cooling	Active fan; peak junction temperature ≤ 75 °C across all modes

1.2 Software Stack

Layer	Version / Detail
OS / JetPack	JetPack R36.4.7 (Ubuntu 22.04, L4T 36.4)
CUDA	12.6
llama.cpp	CUDA backend, `-ngl 99` , `--no-cache-prompt --cache-ram 0`
Inference server	`llama-server` : host `0.0.0.0:8080` , `--parallel 1` , `-c 2560`
Load generator	`aiperf` (NVIDIA AI Performance tool)
Power telemetry	`tegrastats` at 500 ms,
`VDD_CPU_GPU_CV`
Python	3.10 (aiperf-env), pandas, seaborn, matplotlib
Datasets	Synthetic prompts at exact token counts (256, 512, 1024, 2048) generated synthetically via aiperf
Concurrency	1 user, 1 request at a time (`--parallel 1` , `--concurrency 1` ) - single-user latency and throughput profile only
Batch size	512 tokens physical (`-ub` / ubatch, default) · 2048 logical (`-b` , default) for llama.cpp

1.3 Models Under Test

Model	Quant	GGUF size	Tokenizer
Bonsai-1.7B	Q1_0	~237 MB	Qwen/Qwen3-1.7B
Bonsai-4B	Q1_0	~540 MB	Qwen/Qwen3-4B
Bonsai-8B	Q1_0	~1.1 GB	Qwen/Qwen3-8B
Ternary-Bonsai-1.7B	Q2_0	~300 MB	Qwen/Qwen3-1.7B
Ternary-Bonsai-4B	Q2_0	~700 MB	Qwen/Qwen3-4B
Ternary-Bonsai-8B	Q2_0	~1.4 GB	N/A

1.4 Power Modes

[ ]Table 5: Power mode configurations

`VDD_CPU_GPU_CV`

7W-m 3

15W-m 0

25W-m 1

MAXN-m 2

jetson_clocks

1 020 MHz****1 728 MHz### 1.5 Benchmark Methodology

For each model

×prompt

×gen combo

,aiperf

sends 20 single-concurrency requests with synthetic prompts at the exact target token count. - Power is sampled from tegrastats

(mW → W) at 500 ms intervals. Tegrastats samples are assigned to exact prefill/decode phase windows using per-request nanosecond timestamps fromVDD_CPU_GPU_CV

profile_export.jsonl

.=output_tok_J

÷ (OSL

×decode_power_W

) - decode energy only, prefill excluded. Seedecode_time_p50_s

Appendix H.3. - Clocks were locked with jetson_clocks

at all modes. - Each run’s power and clock speed was capped at x W through nvpmodel

and monitored for thermal stability (no sustained throttling;junction temp

≤ 75 °C). Latency percentile used throughout: all,TTFT

, and request latency (ITL

) values reported in charts, tables, and energy calculations use theRL

p50 (median) over the 20 requests per combo. The mean is not used for latency because occasional slow requests (GC , memory compaction, OS scheduling) inflate it without reflecting typical behaviour. p90 and p99 are available in the raw per-mode report files (Data Availability) for tail-latency analysis.

2. Results: Charts #

All charts use data from all four power modes.

2.1 Throughput vs Prompt Length

Output tok/s vs prompt length

at gen=512 across all models and modes; 25W (orange) consistently leads for models ≤4B:

[ ]Figure 1: Output tok/s vs prompt length (gen=512, all models and modes)

Canonical cell

(ctx=2048, gen=512), side-by-side output tok/s and output tok/J bars for all 4 modes:

[ ]Figure 2: Canonical cell: output tok/s and tok/J side by side (ctx=2048, gen=512)

2.2 Energy Efficiency

Output Tok/J vs prompt length

atgen=512; 25W leads for ≤4B models; 15W and 25W are near-tied for Bonsai-8B:

[ ]Figure 3: Output tok/J vs prompt length (gen=512, all models and modes)

Output Tok/J heatmap

(gen × prompt) for Standard Bonsai models (1.7B, 4B, 8B) at all 4 modes:

[ ]Figure 4: Output tok/J heatmap: Standard Bonsai models at all 4 power modes (gen × prompt)

Output Tok/J heatmap

for Ternary Bonsai models (1.7B, 4B) at all 4 power modes:

[ ]Figure 5: Output tok/J heatmap: Ternary Bonsai models at all 4 power modes (gen × prompt)

Bonsai-8B spotlight

: output tok/J at all 4 power modes across all three gen lengths - the model where 15W and 25W are energy-equivalent:

[ ]Figure 6: Bonsai-8B output tok/J at all 4 power modes across gen lengths

Prefill tok/J

(input tokens per joule of prefill energy) vs prompt length atgen=512, how efficiently each mode processes the prompt; higher is better:

[ ]Figure 7a: Prefill tok/J (input tok / J) vs prompt length (gen=512, all models and modes)

Note:[=]prefill_tok_J

[/ (]ISL

[×]prefill_power_W

[_s) uses]TTFT

prefill_power_W

derived from tegrastats samples assigned to exact prefill windows using per-request nanosecond timestamps (request_start_ns

→request_ack_ns

) fromprofile_export.jsonl

.Prefill draws significantly more power than decodeon Bonsai models (up to ~1.6x for 1.7B/4B). See[Appendix H.5]for the full methodology.

Decode tok/J

(output tokens per joule of decode energy) vs prompt length atgen=512, output generation efficiency; 25W leads for sub-4B models:

[ ]Figure 7b: Decode tok/J (output tok / J) vs prompt length (gen=512, all models and modes)

Total tok/J

((input + output) tokens per joule of total request energy) vs prompt length atgen=512, overall request efficiency; 25W wins for sub-4B models at every prompt length:

[ ]Figure 7c: Total tok/J (input+output tok / J) vs prompt length (gen=512, all models and modes)

Phase power draw at the canonical cell (ctx=2048, gen=512) - the wattage each phase actually draws, showing why prefill and decode have different energy costs:

[ ]Figure 7d: Prefill phase power (W) - all models × all power modes (ctx=2048, gen=512)

[ ]Figure 7e: Decode phase power (W) - all models × all power modes (ctx=2048, gen=512)

Prefill is a fully-batched forward pass over the prompt (compute-heavy); decode is one token at a time (memory-bandwidth bound). Prefill consistently draws more watts than decode at the same power mode. Per-mode heatmaps across all 12 prompt × gen combinations:

[C.3 Phase Power].

Full tok/J charts for all ctx/gen combinations:

[D.1 Prefill]·[D.2 Decode]·[D.3 Total].

2.3 Latency

TTFT p50 at ctx=2048, gen=512; 25W and MAXN reduce

by

TTFT

*29–39 %*vs 15W:

[ ]**Figure 8: **

TTFT

p50 by power mode (ctx=2048, gen=512)ITL

(inter-token latency) p50 at ctx=2048, gen=512; lower is better:

[ ]**Figure 9: **

ITL

p50 by power mode (ctx=2048, gen=512)Request latency (E2E)

p50 at ctx=2048, gen=512; total time from request start to last token received:

[ ]Figure 10: Request latency (E2E) p50 by power mode (ctx=2048, gen=512)

2.4 Prefill Throughput

25W and MAXN provide ~29–47 % faster prefill than 15W:

[ ]Figure 11: Prefill throughput by power mode (gen=512, avg over all prompt lengths)

2.5 Power Draw

[ ]**Figure 12: Median **

VDD_CPU_GPU_CV

power draw per model at each power mode[ ]**Table 6: Median power draw per model at each power mode (W, **

VDD_CPU_GPU_CV

)| Model | 7W | 15W | 25W | MAXN | |---|---|---|---|---| | Bonsai-1.7B | 1.48 | 3.24 | 4.51 | 5.41 | | Bonsai-4B | 1.60 | 3.59 | 5.07 | 6.10 | | Bonsai-8B | 2.07 | 5.42 | 8.03 | 9.91 | | Ternary-Bonsai-1.7B | 1.95 | 4.75 | 6.71 | 8.42 | | Ternary-Bonsai-4B | 1.99 | 5.06 | 7.41 | 9.05 |

Median over all 12 prompt × gen combos. Bold = highest observed power draw per model row.

2.5.1 VDD_IN + VDD_CPU_GPU_CV ≠ TDP

The nvpmodel TDP cap (7W/15W/25W/MAXN) applies to the total module draw, measured by the VDD_IN rail.

is a

VDD_CPU_GPU_CV

subset- they’re not additive:

VDD_IN ≈ VDD_CPU_GPU_CV + VDD_SOC + misc rails

The wattage looks low as in table 6 because the

rail only captures the GPU, CPU, and CV engine power - not the entire module. The remaining power is drawn by other components (memory controller, media blocks, I/O, DRAM self-refresh, PMIC losses) that are outside the scope ofVDD_CPU_GPU_CV

but still contribute to the total power budget under the TDP cap.VDD_CPU_GPU_CV

Since, we only test single-user, single-request mode, there isn’t enough load to push the total module power up to the TDP ceiling.

VDD_IN               ≈ 7.5W  (total module - well under the 15W cap)
├── VDD_CPU_GPU_CV    = 3.2W  (CPU + GPU + CV engine)
├── VDD_SOC           = 1.7W  (memory controller, media blocks)
└── other rails       ≈ 2.6W  (misc I/O, DRAM self-refresh, PMIC losses)

No thermal throttling was triggered at any mode because the Bonsai decode workload never saturates the GPU compute units - the TDP ceiling is never approached.

3. Analysis #

3.1 Higher tok/sec != efficient model (tok/J)

Tok/s (left half) and tok/J (right half) are intentionally both shown- a faster mode does not always mean a more efficient one.

MAXN beats 25W on raw tok/s (+8–11 %) for all models but loses on tok/J (−10–11 %) because its power increase outpaces the throughput gain for ctx = 2048, gen = 512.

[=]output_tok_J

[/ (]OSL

[×]decode_power_W

[) - decode energy only. See]decode_time_p50_s

[Appendix H.3]for full formula.

Full breakdown across all 12 ctx × gen combinations:

[Appendix I].

[ ]Table 7: Canonical cell comparison (ctx=2048, gen=512)

Model	7W tok/s	15W tok/s	25W tok/s
Bonsai-1.7B	6.5	16.4	24.1
26.1	4.27	4.99	5.24
4.77	25W
Bonsai-4B	3.5	9.2	13.5
14.6	2.19	2.51	2.64
2.37	25W
Bonsai-8B	3.6	9.9	14.0
15.1	1.70	1.83
1.81	1.63	15W
Ternary-Bonsai-1.7B	9.0	23.4	34.7
38.0	4.64	4.94	5.18
4.55	25W
Ternary-Bonsai-4B	4.1	11.4	16.9
18.6	2.08	2.25	2.30
2.06	25W

3.2 The 25W Sweet Spot

25W is the best mode for output tok/J and tok/sec for all sub-4B Bonsai models, and is near-parity for Bonsai-8B. The arithmetic is:

Going from 15W → 25W: output tok/s rises47–48 %(GPU clock 612 → 820 MHz), while power rises*~38–46 %. Net output tok/J gain:+1 to +6 %for 1.7B–4B models (including Ternary-Bonsai-4B: 2.25 → 2.30);−1 %for Bonsai-8B (memory-bandwidth bound, clock gain wiped by power overhead). - Going from 25W → MAXN: output tok/s gains+8–11 %(decode is memory-bandwidth bound, not compute bound), while power rises~20–25 %*. Net output tok/J loss:*−10 to −11 %*across all models.

The GPU clock ceiling at 15W (612 MHz) leaves significant decode throughput on the table for sub-8B models. Raising it to 820 MHz at 25W captures most of the available improvement at modest extra power. The final jump to 1020 MHz at MAXN costs disproportionate power for marginal gains.

3.3 Best use cases for each power mode

[ ]Table 8: Recommended power mode by use case

Use case	Recommended mode
Always-on inference (sub-4B)	25W: best
`TTFT`

output tok/J

) and highest tok/sec; vs 25W: +8-10% tok/s, +10% prefill time (TTFT

), -9-12%TTFT

output tok/J, -2-5%TTFT

output tok/J, -6-14%TTFT

output tok/J; lowest power budgetAll above numbers are for sub-4B models tested. For 8B models performance, please check the individual sections for your interest.

3.4 Throughput Speedup Summary

All figures are median(p50) across the full prompt × gen sweep (12 combos per model); throughput uses median(p50 tok/s).

[ ]Table 9: Output throughput speedup ratios - all pairwise mode comparisons

Model	25W / 15W	MAXN / 15W	15W / 7W	25W / 7W	MAXN / 7W
Bonsai-1.7B	1.47x	1.59x	2.57x	3.78x	4.08x
1.08x
Bonsai-4B	1.48x	1.59x	2.65x	3.90x	4.22x
1.08x
Bonsai-8B	1.48x	1.64x	2.79x	4.11x	4.57x
1.11x
Ternary-Bonsai-1.7B	1.48x	1.62x	2.64x	3.92x	4.28x
1.09x
Ternary-Bonsai-4B	1.48x	1.64x	2.79x	4.13x	4.56x
1.10x

25W delivers a consistent*~1.47–1.48xspeedup vs 15W across all models.15W gives about2.5–2.8xboost vs7W; Bonsai-8B and Ternary-4B show the largest gain at 2.78x.** MAXN/25W**is consistently~1.08–1.10xfor all models - extra compute headroom over an already-fast 25W baseline translates to only modest gains, since decode is memory-bandwidth bound.MAXN/7W reaches4.56xforTernary-Bonsai-4B- the largest speedup in the sweep.** 25W/7W**for Ternary-Bonsai-4B is4.13x*, the highest 25W gain of any model.- For Bonsaimodels (1.4B/4B), the throughput isconsistenly higher compared toTernarymodels.

[ ]Figure 14: Output throughput speedup vs 15W baseline - all models and modes

3.5 Latency Characteristics

At ctx=256 a model like Bonsai-1.7B prefills in ~170 ms (25W); at ctx=2048 that grows to ~1353 ms.

TTFT

scales near-linearly with prompt across all modes.Inter-token latency ( is the median per-token decode cost.

ITL

) p50heatmaps per power mode (all 5 models, all 12 prompt×gen combos) are in

ITL

- see Figures F.2a–F.2d. At the canonical ctx=2048, gen=512:

Appendix F.2 7W | 15W | 25W | MAXN |

**Figure 10a: **

ITL

p50 heatmaps - all 4 power modes (rows = gen length, cols = prompt length)is driven primarily by model size and GPU clock (or power mode) but also loosely on prompt length too, like between ctx length=256 to ctx length=2048, theITL

increases because of the larger context (prompt + generation -> kv cache scans increases too).ITL

Bonsai models have lowerthanITL

Ternarymodels at every mode.- The Ternary-Bonsai-1.7B achieves lower than Bonsai-1.7B at every mode despite larger file size, consistent with ternary weights being faster to load from DRAM per decode step.ITL

** Decode time (s) p50** is the time spent generating output tokens.

[ ]Table 10a: Decode time speedup ratios - all pairwise mode comparisons

Model	25W vs 15W	MAXN vs 15W	15W vs 7W	25W vs 7W	MAXN vs 7W
Bonsai-1.7B	1.47x	1.58x	2.57x	3.78x	4.08x
1.08x
Bonsai-4B	1.48x	1.59x	2.65x	3.91x	4.22x
1.08x
Bonsai-8B	1.50x	1.64x	2.79x	4.17x	4.57x
1.10x
Ternary-Bonsai-1.7B	1.48x	1.62x	2.64x	3.92x	4.28x
1.09x
Ternary-Bonsai-4B	1.48x	1.64x	2.79x	4.13x	4.56x
1.10x

See

[Appendix H.9]for the full speedup formula. decode_time =[p50 −]RL

[p50; median over all 12 prompt × gen combos.]TTFT

Decode time for Ternary models is lower than those ofBonsai models. Why? - The reason lies in 1-bit vs 1.58-bit model structure. Bonsai’s 1-bit quantization requires more complex bit unpacking and on-the-fly dequantization during decode, whichadds overhead per token. Ternary’s 2-bit structure withoptimized ternary-CUDA kernels allows for moreefficient memory access patterns and simpler decode logic, reducing per-token latency despite the larger file size.

Thinking with Claude, there is this amusing reason, since 1.58-bits (-1,0,1} has ‘0’ as a valid weight, it can skip the multiply-accumulate for those zero weights during dequant stage (conversion to fp16 or bf16 for GEMM), while 1-bit quantization has to do the full compute for every weight bit, even if many are effectively zero after unpacking. This leads to more efficient decoding for the ternary models.

median(

TTFT

speedupTTFT_baseline

) / median(TTFT_mode

) over all 12 prompt × gen combos (see H.9):

[ ]**Table 11: **

TTFT

speedup ratios - all pairwise mode comparisons| Model | 25W vs 15W | MAXN vs 15W | 15W vs 7W | 25W vs 7W | MAXN vs 7W | MAXN vs 25W | |---|---|---|---|---|---|---| | Bonsai-1.7B | 1.46x | 1.60x | 2.71x | 3.95x | 4.34x | 1.10x | | Bonsai-4B | 1.47x | 1.61x | 2.86x | 4.21x | 4.62x | 1.10x | | Bonsai-8B | 1.29x | 1.56x | 2.84x | 3.66x | 4.42x | 1.21x | | Ternary-Bonsai-1.7B | 1.46x | 1.60x | 2.75x | 4.02x | 4.41x | 1.10x | | Ternary-Bonsai-4B | 1.48x | 1.62x | 3.08x | 4.54x | 4.98x | 1.10x |

Bonsai-8B shows a smallerimprovement at 25W vs 15W (1.29x) compared to the 4B models (1.47–1.48x). This confirms the prefill is also becoming memory-bandwidth bound for the larger model.TTFT

MAXN/7W reaches4.98xfor Ternary-Bonsai-4B prefill - the largestspeedup in the sweep.TTFT

25W/7W is4.54xfor Ternary-Bonsai-4B, also the highest across all models at that comparison.

Request latency (E2E) speedup - median( RL p50 at baseline) / median(

p50 at mode) over all 12 prompt × gen combos (see

RL

H.9):

[ ]Table 12: Request latency (E2E) speedup ratios - all pairwise mode comparisons

Model	25W vs 15W	MAXN vs 15W	15W vs 7W	25W vs 7W	MAXN vs 7W
Bonsai-1.7B	1.47x	1.59x	2.57x	3.78x	4.09x
1.08x
Bonsai-4B	1.47x	1.60x	2.66x	3.92x	4.24x
1.08x
Bonsai-8B	1.51x	1.60x	2.79x	4.21x	4.47x
1.06x
Ternary-Bonsai-1.7B	1.48x	1.62x	2.64x	3.91x	4.28x
1.09x
Ternary-Bonsai-4B	1.48x	1.64x	2.81x	4.16x	4.59x
1.10x

Mirrors the speedup trends since prefill dominates request latency at these context sizes.TTFT

3.6 Model Size vs Efficiency

The relationship is clear: smaller quantized models always win on total tok/J, not just tok/s.

[ ]Table 13: Best total tok/J ranked by model size

Model	Params	GGUF	Best total tok/J
Bonsai-1.7B	1.7B	~237 MB	62.5
25W / 2048 / 128
Ternary-Bonsai-1.7B	1.7B	~300 MB	59.6
25W / 2048 / 128
Bonsai-4B	4B	~540 MB	28.7
25W / 2048 / 128
Ternary-Bonsai-4B	4B	~700 MB	25.5
25W / 2048 / 128
Bonsai-8B	8B	~1.1 GB	18.8
15W / 2048 / 128

Total tok/J = (

[+]ISL

[) / (]OSL

[×]p50_power_W

[_p50_s) - see]RL

[Appendix H.6]for the full formula. Peaks at ctx=2048, gen=128 for every model because the long prompt dominates the numerator while short gen minimises decode energy. All 48 mode × ctx × gen combinations were searched.

[ ]Figure 13: Best output tok/J per model - all power modes

Bonsai-1.7B at 25W achieves 62.5 total tok/J, more than 3× more efficient than Bonsai-8B (18.8) across the full request.

Notably, Bonsai-1.7B edges Ternary-Bonsai-1.7B on total tok/J (62.5 vs 59.6) despite having fewer parameters - the Q1_0 Bonsai-1.7B is slightly lighter on memory bandwidth than the Q2_0 ternary variant. Ternary-Bonsai-1.7B wins on output tok/s and output tok/J at the canonical cell instead.

3.7 Energy Efficiency: Decode tok/J and Total tok/J

[ ]Table 14: Request energy split – prefill vs decode % of total request energy (median over all 12 prompt x gen combos)

Model	Phase	7W	15W	25W	MAXN
Bonsai-1.7B	Prefill	7%	7%	6%	6%
Bonsai-1.7B	Decode	93%	93%	94%	94%
Bonsai-4B	Prefill	9%	9%	9%	9%
Bonsai-4B	Decode	91%	91%	91%	91%
Bonsai-8B	Prefill	11%	12%	15%	12%
Bonsai-8B	Decode	89%	88%	85%	88%
Ternary-Bonsai-1.7B	Prefill	9%	9%	9%	9%
Ternary-Bonsai-1.7B	Decode	91%	91%	91%	91%
Ternary-Bonsai-4B	Prefill	10%	10%	10%	10%
Ternary-Bonsai-4B	Decode	90%	90%	90%	90%

[=]prefill_J

[x]prefill_power_W

[_p50_s;]TTFT

[=]decode_J

[x]decode_power_W

[;]decode_time_p50_s

prefill%

=[/ (]prefill_J

[+]prefill_J

[). Phase power from tegrastats samples assigned to exact phase windows via per-request nanosecond timestamps. See]decode_J

[H.5].

Two complementary tok/J lenses on energy efficiency - see H.6 for formulas:

See Figure 7b (decode tok/J vs prompt length) and Figure 7c (total tok/J vs prompt length) in section 2.2. Full combinations: D.2 Decode · D.3 Total.

Key findings:

25W wins on both metrics for sub-4B models at every prompt and gen length. The exception is Bonsai-8B, where 15W and 25W are almost the same on tok/J (1.81 vs 1.83 output tok/J at canonical cell). - The 1.7B models reach

~5 tok/J (decode) vs**~2.5–2.7 tok/J** for the 4B models and**~1.8 tok/J** for Bonsai-8B. Smaller models are dramatically more energy-efficient per output token. - Decode dominates request energy. As clearly visible inTable 14, the decode phase accounts for ~90–94 % of total request energy across all models and modes. Prefill is a smaller fraction. The reason is the amount of time spent. Prefill is a one-time cost at the start of the request, while decode is an ongoing cost that accumulates with every generated token. Even though prefill power can be similar to decode power, the much longer duration of decode means it contributes more to total energy. - The ternary 1.7B model has

slightly lowerdecode tok/J than Bonsai-1.7B (standard) despite higher raw throughput - the Q2_0 format requires more DRAM bandwidth per token than Q1_0, which slightly increases decode energy relative to throughput. - Ternary-Bonsai models draw 10–20 % more decode power than same-size Bonsai-1bit despite both dequantizing to fp16 GEMM. The difference is fully explained by memory bandwidth - not compute intensity:

Bonsai Q1_0 (1.125 bpw)	Ternary-Bonsai Q2_0 (2.06 bpw)
Model size (4B)	540 MiB	1,020 MiB
Bits per weight	1.125	2.06
Dequant op	`sign × scale` (trivial)
2-bit lookup → `{-1,0,+1}` × scale
Bytes read per token	540 MiB	1,020 MiB (1.9× more)
Power impact	baseline	+10–20 % (DRAM traffic dominates)

Neither format runs natively on GPU hardware - Q1_0 and Q2_0 both dequantize to fp16 before GEMM. The power gap comes from

1.9× more weight bytes moved through the memory controllerin the Ternary variant, not from a difference in compute.

Phase power by mode (W) – median over all 12 prompt × gen combos:

Model	Phase	7W	15W	25W	MAXN
Bonsai-1.7B	Prefill	2.07	4.52	5.81	7.30
Bonsai-1.7B	Decode	1.48	3.22	4.50	5.41
Bonsai-4B	Prefill	2.15	5.44	7.70	9.09
Bonsai-4B	Decode	1.60	3.59	5.05	6.10
Bonsai-8B	Prefill	2.13	5.97	8.47	10.56
Bonsai-8B	Decode	2.05	5.41	8.03	9.91
Ternary-Bonsai-1.7B	Prefill	2.05	5.04	7.08	8.80
Ternary-Bonsai-1.7B	Decode	1.95	4.75	6.71	8.42
Ternary-Bonsai-4B	Prefill	2.05	5.49	7.90	9.62
Ternary-Bonsai-4B	Decode	1.99	5.06	7.41	9.04

[ ]Table 14a: Prefill power ratios – all pairwise mode comparisons

Model	25W/15W	MAXN/15W	15W/7W	25W/7W	MAXN/7W
Bonsai-1.7B	1.29x	1.62x	2.18x	2.80x	3.52x
1.26x
Bonsai-4B	1.42x	1.67x	2.53x	3.57x	4.22x
1.18x
Bonsai-8B	1.42x	1.77x	2.80x	3.97x	4.95x
1.25x
Ternary-Bonsai-1.7B	1.40x	1.75x	2.46x	3.45x	4.29x
1.24x
Ternary-Bonsai-4B	1.44x	1.75x	2.68x	3.85x	4.69x
1.22x

[ ]Table 14b: Decode power ratios – all pairwise mode comparisons

Model	25W/15W	MAXN/15W	15W/7W	25W/7W	MAXN/7W
Bonsai-1.7B	1.40x	1.68x	2.17x	3.04x	3.65x
1.20x
Bonsai-4B	1.41x	1.70x	2.25x	3.16x	3.81x
1.21x
Bonsai-8B	1.48x	1.83x	2.64x	3.91x	4.83x
1.23x
Ternary-Bonsai-1.7B	1.41x	1.77x	2.43x	3.44x	4.31x
1.25x
Ternary-Bonsai-4B	1.46x	1.79x	2.54x	3.72x	4.54x
1.22x

Ratios show how much more power mode A draws vs mode B (ratio > 1 = A draws more). Computed from

[tegrastats samples assigned to prefill/decode phase windows via per-request nanosecond timestamps; median over all 12 prompt x gen combos. See]VDD_CPU_GPU_CV

[H.5]for phase assignment methodology.

Key observations:

Prefill draws more power than decode at every mode for sub-4B models (prefill is compute-bound; decode is memory-bandwidth-bound). Bonsai-8B is the exception – its decode power approaches prefill at higher modes because of higher FLOPS and memory bandwidth demands and so does the energy too.25W vs 15W adds only 1.29-1.48x power but delivers 1.47-1.50x throughput– the sweet spot where each extra watt of draw returns more than proportional speed.** MAXN vs 25W adds just 1.18-1.26x power on prefill and 1.20-1.25x on decode**, yet delivers only +8-10% tok/s (1.08-1.10x) and loses -9-12% output tok/J – the diminishing-returns zone where the power increase outpaces the out tok/sec and tok/J gain because of single-user, single-request setup not able to fully utilzie the extra GPU headroom.7W decode power (1.48-2.05 W) is remarkably low– the SoC at minimum clocks draws less than a typical USB charger during autoregressive generation.

[ ]Figure 15: Total energy per request vs output length at 25W, ctx=2048

[ ]Figure 16: Decode energy per output token in mJ (ctx=2048, gen=512)

4. Conclusion #

What These Numbers Mean for Edge Inference

At Ternary-Bonsai-1.7B Q2_0:

up to 38.4 tok/s at 25W (ctx=256): real-time fluent generation0.24 s at ctx=256 (25W)TTFT

~300 MB on disk: trivially portable** 6.83 W under load**: runs on a USB-C power bank** 5.74 output tok/J**(ctx=256, gen=256): best output tok/J for the Ternary-1.7B at 25W

Bonsai-1.7B Q1_0 pushes even further: 5.84 output tok/J (ctx=256, gen=256) in only 237 MB at 4.51 W average under load, 26.0 tok/s and **0.21 s ** (25W, ctx=256). Total tok/J peaks at

TTFT

62.5(ctx=2048, gen=128, best in suite) where the long prompt dominates the numerator.

The standard Q1_0 models are lighter on disk and memory bandwidth; the Ternary Q2_0 variants generate faster output tokens per second, thus Ternary models are better for latency-sensitive applications whileBonsai models are mostly energy-efficient per output token.

The Clear Winner: 25W Mode (for sub-4B models)

25W (nvpmodel -m 1) is the Pareto-optimal power mode for Bonsai-1.7B and Bonsai-4B inference on the Jetson Orin Nano Super. It is the right answer for the majority of deployments:

~47 % morethroughput than 15W (1.7B class) and*~46-47 % faster*(prefill time)TTFT

Only ~40 % morepower than 15W *10–11 %*better output tok/J than MAXN (25W: 2.30–5.24 output tok/J across sub-4B models at ctx=2048, gen=512; up to 5.84 at ctx=256)- Low peak power (≤ 7 W for 1.7B–4B models) for sustained operation; peak TJ 65.9-66.2 °C – over 28 °C of thermal headroom before the 95 °C hardware throttle threshold

For Bonsai-8B: prefer 15W for energy-efficiency-neutral operation at ~43 % lower board power than 25W.

Appendix A: Full 4-Mode Comparison (ctx=2048, gen=512) #

Raw numbers from the canonical benchmark cell. All latencies in milliseconds. Power =

[median over each run window.]VDD_CPU_GPU_CV

[ ]Table 15: Full 4-mode comparison, ctx=2048, gen=512

| Model | Mode | Output Tok/s |

`TTFT`

p50 (ms)

ITL

24.11353.141.435.2413.53133.374.042.641.8314.05502.471.4834.71515.428.845.1816.93569.059.29****2.30## Appendix B: Thermal Summary - All Power Modes

Power and temperature: median over each model’s full benchmark window (all 12 prompt×gen combos). No model triggered thermal throttling at any power mode (threshold ≈ 95 °C).

Junction temperature (TJ)is the hottest internal die temperature on the Jetson SoC, reported bytegrastats

astj@

. It is the primary metric for thermal safety: if TJ reaches ~95 °C, the hardware automatically throttles clocks to prevent damage. Peak TJ < 76 °C across all runs means thermal headroom is ample.

[ ]Table 16: Thermal summary - all power modes

Model	Mode	Avg Power (W)	Avg CPU (°C)	Avg GPU (°C)	Peak TJ (°C)	Throttled
Bonsai-1.7B	7W	1.48	53.6	54.9	55.8	No
Bonsai-1.7B	15W	3.24	55.6	56.8	59.0	No
Bonsai-1.7B	25W	4.51	62.5	63.7	65.9	No
Bonsai-1.7B	MAXN	5.41	62.1	63.3	65.9	No
Bonsai-4B	7W	1.60	53.7	55.0	57.3	No
Bonsai-4B	15W	3.59	58.3	59.5	61.7	No
Bonsai-4B	25W	5.07	62.4	63.8	66.2	No
Bonsai-4B	MAXN	6.10	63.4	64.7	67.7	No
Bonsai-8B	7W	2.07	54.7	56.1	58.3	No
Bonsai-8B	15W	5.42	61.1	62.5	64.6	No
Bonsai-8B	25W	8.03	66.3	67.9	70.4	No
Bonsai-8B	MAXN	9.91	69.9	71.8	75.3	No
Ternary-Bonsai-1.7B	7W	1.95	54.8	56.2	57.0	No
Ternary-Bonsai-1.7B	15W	4.75	61.2	62.5	63.8	No
Ternary-Bonsai-1.7B	25W	6.71	64.3	65.9	69.2	No
Ternary-Bonsai-1.7B	MAXN	8.42	68.2	69.7	72.4	No
Ternary-Bonsai-4B	7W	1.99	54.7	56.0	57.8	No
Ternary-Bonsai-4B	15W	5.06	60.6	62.0	63.7	No
Ternary-Bonsai-4B	25W	7.41	65.7	67.2	69.3	No
Ternary-Bonsai-4B	MAXN	9.05	68.4	70.0	71.8	No

Appendix C: Full 12-Combination Heatmaps (All Power Modes) #

Each heatmap is a 2×3

grid (5 models, 6th panel unused) showing all 12 prompt×gen

combinations for one power mode and one metric. Rows = gen length (128, 256, 512 tok), columns = prompt length (256, 512, 1024, 2048 tok). Brighter colour = higher value.

C.1 Output Tok/s heatmaps

Figure C.1a: All 12 combos at 7W

Figure C.1b: All 12 combos at 15W

Figure C.1c: All 12 combos at 25W

Figure C.1d: All 12 combos at MAXN

C.2 Output Tok/J heatmaps

Figure C.2a: All 12 combos at 7W

Figure C.2b: All 12 combos at 15W

Figure C.2c: All 12 combos at 25W

Figure C.2d: All 12 combos at MAXN

C.3 Phase power heatmaps (prefill and decode W)

Observed VDD_CPU_GPU_CV power during the prefill and decode phases separately, across all 12 prompt × gen combinations per power mode. Prefill is compute-heavy (batched forward pass); decode is memory-bandwidth bound (one token at a time) - the difference in watts between the two phases is visible in every mode.

Figure C.3a: Prefill power (W) at 7W

Figure C.3b: Prefill power (W) at 15W

Figure C.3c: Prefill power (W) at 25W

Figure C.3d: Prefill power (W) at MAXN

Figure C.3e: Decode power (W) at 7W

Figure C.3f: Decode power (W) at 15W

Figure C.3g: Decode power (W) at 25W

Figure C.3h: Decode power (W) at MAXN

Appendix D: Prefill / Decode / Total tok/J: All Combinations #

All charts are 2×3 faceted line plots with a fixed y-scale across all subplots. The canonical combination (ctx=2048, gen=512) is also shown in §2.2.

D.1 Prefill tok/J (input tok / J) vs prompt length

Figure D.1a: Prefill tok/J vs prompt length: gen=128

Figure D.1b: Prefill tok/J vs prompt length: gen=256

Figure D.1c: Prefill tok/J vs prompt length: gen=512 (canonical, also in § 2.2)

D.2 Decode tok/J (output tok / J) - independent of prompt length

Decode tok/J depends on the number of output tokens (gen length), not input prompt length, since decode happens after prefill completes. These charts show decode tok/J as a function of gen length for each prompt context length.

Figure D.2a: Decode tok/J vs gen length: ctx=256

Figure D.2b: Decode tok/J vs gen length: ctx=512

Figure D.2c: Decode tok/J vs gen length: ctx=1024

Figure D.2d: Decode tok/J vs gen length: ctx=2048

D.3 Total tok/J ((input+output) tok / J) vs prompt length

Figure D.3a: Total tok/J vs prompt length: gen=128

Figure D.3b: Total tok/J vs prompt length: gen=256

Figure D.3c: Total tok/J vs prompt length: gen=512 (canonical, also in § 2.2)

Appendix E: Latency: All Combinations #

E.1 Request Latency (E2E)

Request latency (E2E) p50 - total time from request start to last token received. Line charts show variation with prompt length (2×3 facet, fixed y-scale).

E.1a Request latency vs prompt length (by gen length)

Figure E.1a: Request latency vs prompt length: gen=128

Figure E.1b: Request latency vs prompt length: gen=256

Figure E.1c: Request latency vs prompt length: gen=512 (canonical)

E.2 `TTFT`

: All Prompt × Gen Combinations

TTFT

TTFT p50 (median time to first token, ms) is driven almost entirely by prompt length - it is the prefill cost. These charts show how it varies across all 12 prompt × gen combinations and across all 4 power modes.

E.2a `TTFT`

vs prompt length (by gen length)

TTFT

**Figure E.2a: **

TTFT

vs prompt length: gen=128**Figure E.2b: **

TTFT

vs prompt length: gen=256**Figure E.2c: **

TTFT

vs prompt length: gen=512*(canonical)*

E.3 `TTFT`

heatmaps (gen x prompt) per power mode

TTFT

Each cell is TTFT in ms. Rows = gen length, columns = prompt length. Independent of

gen

length hence the same across rows.| Figure E.3a: TTFT |

Figure E.3b:heatmap: 15WTTFT

Figure E.3c:heatmap: 25WTTFT

Figure E.3d:heatmap: MAXNTTFT

Appendix F: `ITL` #

: All Combinations

ITL

Inter-token latency (ms) = time between consecutive output tokens. It measures decode cost and is driven by model size and GPU clock, not prompt length.

F.1 `ITL`

vs prompt length (by gen length)

ITL

**Figure F.1a: **

ITL

vs prompt length: gen=128**Figure F.1b: **

ITL

vs prompt length: gen=256**Figure F.1c: **

ITL

vs prompt length: gen=512*(canonical, also in section 2.3)*

F.2 `ITL`

heatmaps (gen x prompt) per power mode

ITL

| Figure F.2a: ITL |

Figure F.2b:heatmap: 15WITL

Figure F.2c:heatmap: 25WITL

Figure F.2d:heatmap: MAXNITL

Appendix G: Prefill Throughput: All Combinations #

Prefill throughput (tok/s) measures how fast the model processes input tokens. It scales with prompt length (longer prompts hit peak GPU utilisation) and GPU clock speed.

G.1 Prefill throughput vs prompt length (by gen length)

Figure G.1a: Prefill throughput vs prompt length: gen=128

Figure G.1b: Prefill throughput vs prompt length: gen=256

Prefill throughput is independent of gen length, so gen=128 and gen=256 show the same trend.

G.2 Prefill throughput heatmaps (gen x prompt) per power mode

| Figure G.2a: Prefill throughput heatmap: 7W |

Figure G.2b: Prefill throughput heatmap: 15WFigure G.2c: Prefill throughput heatmap: 25WFigure G.2d: Prefill throughput heatmap: MAXN## Appendix H: All Metrics, Formulas, and Calculation Methods

This appendix documents every metric reported in this benchmark, its formula, its source, and any caveats.

H.1 Raw inputs from aiperf and tegrastats

Symbol	Source	Definition
`ISL`
aiperf JSON `input_sequence_length.avg`
Actual input tokens processed per request (may differ from target due to tokenizer rounding)
`OSL`
aiperf JSON `output_sequence_length.avg`
Actual output tokens generated per request
`TTFT`
aiperf JSON `time_to_first_token.p50` (ms)
Median time from request sent to first output token received; proxy for prefill duration. p50 used (not avg) to avoid skew from occasional slow requests
`ITL`
aiperf JSON `inter_token_latency.p50` (ms)
Median time between consecutive output tokens; per-token decode cost. p50 used for robustness against outliers
`RL`
aiperf JSON `request_latency.p50` (ms)
Median total wall time per request:
`TTFT`

tok_s

output_token_throughput_per_user.p50

/ RL in steady state)OSL

prefill_tput

prefill_throughput_per_user.p50

t0

, t1

start_time

, end_time

(ISO 8601)mW_i

field (mW)VDD_CPU_GPU_CV

i

All aiperf metrics are averages over 20 requests per combo. Percentile variants (p50, p90, p99) are also available in the raw JSON but not reproduced here.

H.2 Power

p50_power_W = median(mW_i for all tegrastats samples where t0 <= sample_time <= t1) / 1000

covers the CPU, GPU, and Computer Vision engine railVDD_CPU_GPU_CV

Does NOT include board overhead (fan, storage, USB) which is on VDD_IN

VDD_IN

is ~1.5-3 W higher thanduring inferenceVDD_CPU_GPU_CV

Tegrastats interval: 500 ms

H.3 Output tok/J (main efficiency metric)

output_tok_J = OSL / decode_J
             = OSL / (decode_power_W * (RL_p50_s - TTFT_p50_s))

decode_power_W

is the median power across tegrastats samples assigned to exact decode windows using per-request nanosecond timestamps from profile_export.jsonl

(see H.5). The denominator is decode-phase energy only - prefill excluded because no output tokens are generated during prefill.

Higher is better. This is the primary metric of the benchmark.

Note: because decode time and decode power are both roughly independent of prompt length, output tok/J is also roughly independent of prompt length.

H.4 Request latency energy

total_J = p50_power_W * (RL_p50 / 1000)

Energy consumed by one median request from first byte sent to last token received. p50_power_W

is the median tegrastats sample over the full run window. Accurate for all cells regardless of TTFT.

H.5 Prefill and decode energy

prefill_J  = prefill_power_W * (TTFT / 1000)
decode_J   = decode_power_W  * ((RL - TTFT) / 1000)
total_J    = prefill_J + decode_J

prefill_%  = prefill_J / total_J * 100

Exact per-request phase power from profile_export.jsonl. aiperf writes one JSON record per request to

profile_export.jsonl

, with nanosecond-precision timestamps:

request_start_ns  - when request was sent
request_ack_ns    - when first output token was received  (= prefill end)
request_end_ns    - when last output token was received   (= decode end)

For each request i

, phase windows in wall-clock seconds are:

prefill_start_i = t0 + (request_start_ns_i - request_start_ns_0) / 1e9
prefill_end_i   = t0 + (request_ack_ns_i   - request_start_ns_0) / 1e9
decode_end_i    = t0 + (request_end_ns_i   - request_start_ns_0) / 1e9

where t0

is the aiperf start_time

ISO timestamp and request_start_ns_0

is the first request’s start. Each tegrastats sample at wall-clock time ep

is assigned to whichever request’s phase window it falls in.

prefill_power_W

= median of samples in any prefill window across all 20 requests. decode_power_W

= median of samples in any decode window across all 20 requests.

Energy uses exact p50 (median) phase durations from the per-request data:

p50_ttft_s   = median(request_ack_ns_i - request_start_ns_i) / 1e9  over all 20 requests
p50_decode_s = median(request_end_ns_i - request_ack_ns_i)   / 1e9  over all 20 requests

Why this matters: prefill draws significantly more power than decode on Bonsai models - prefill is a batched forward pass (compute-heavy), decode is one token at a time (memory-bandwidth bound). At 25W the ratio is ~1.6x for 1.7B/4B models, ~1.1x for 8B. Using full-run average power for both phases would underestimate decode_power_W

and therefore overestimate decode_J

, making output_tok_J

too low.

Residual approximation: a 500 ms tegrastats sample that straddles a prefill→decode boundary within a single request is assigned to one phase entirely. This affects only samples near the request_ack_ns

boundary and is negligible across 20 requests totalling hundreds of samples.

H.6 Phase tok/J metrics

prefill_tok_J = ISL / prefill_J
              = ISL / (prefill_power_W * TTFT_s)

decode_tok_J  = OSL / decode_J
              = OSL / (decode_power_W * (RL_s - TTFT_s))

total_tok_J   = (ISL + OSL) / total_J
              = (ISL + OSL) / (prefill_J + decode_J)

Where TTFT_s = [

TTFT](#glossary) / 1000

, RL_s = RL / 1000

.

: input tokens per joule of prefill energy, using phase-specificprefill_tok_J

prefill_power_W

.: identical todecode_tok_J

the primary benchmark metric, using phase-specificoutput_tok_J

decode_power_W

.: all tokens (in + out) per joule of total request energy.total_tok_J

H.7 mJ per output token

mJ_per_output_tok = (decode_J / OSL) * 1000
                  = 1000 / decode_tok_J

Millijoules per generated output token ( decode_J is in joules, ×1000 converts to mJ for readability). Carries the same caveat as I.5 for cells where

< 500 ms.

TTFT

H.8 Prefill throughput

prefill_tput (tok/s) = aiperf JSON prefill_throughput_per_user.p50

Directly from aiperf. Measures how fast input tokens are processed during the prefill phase. Scales with prompt length (longer prompts hit peak GPU utilisation) and GPU clock.

H.9 Speedup ratio methodology (Tables 9, 10a, 11, 12)

All speedup ratios use median over all 12 prompt × gen combos (4 ctx lengths × 3 gen lengths). “A vs B” means A is the faster mode; ratio > 1 means A is faster than B.

Table 9 - Output throughput (tok/s):

speedup_A_vs_B = median(tok_s_A  over 12 combos) / median(tok_s_B  over 12 combos)
tok_s = output_token_throughput_per_user.p50  (aiperf)

Table 10a - Decode time:

decode_time_s = (request_latency.p50 - time_to_first_token.p50) / 1000
speedup_A_vs_B = median(decode_time_s_B over 12 combos) / median(decode_time_s_A over 12 combos)

Table 11 - TTFT (prefill time):

speedup_A_vs_B = median(TTFT_B over 12 combos) / median(TTFT_A over 12 combos)
TTFT = time_to_first_token.p50  (aiperf, ms)

Table 12 - Request latency (E2E):

speedup_A_vs_B = median(RL_B over 12 combos) / median(RL_A over 12 combos)
RL = request_latency.p50  (aiperf, ms)

H.10 Best total tok/J per model (Table 13)

best_total_tok_J(model) = max(total_tok_J(mode, model, gen, ctx))
                          over all modes in {7W, 15W, 25W, MAXN}
                          and all gen in {128, 256, 512}
                          and all ctx in {256, 512, 1024, 2048}

total_tok_J = (ISL + OSL) / (p50_power_W * RL_p50_s)

The single highest total tok/J value observed for that model across all 48 combinations. Peaks at ctx=2048, gen=128 for every model because the long prompt dominates the ( ISL +

) numerator.

OSL

H.11 `TTFT`

, ITL

, RL

percentiles

TTFT

ITL

RL

All percentile variants come directly from aiperf JSON without further computation:

TTFT       = time_to_first_token.p50   (canonical; p50 used everywhere)
TTFT_p90   = time_to_first_token.p90
TTFT_p99   = time_to_first_token.p99
ITL        = inter_token_latency.p50    (canonical; p50 used everywhere)
ITL_p99    = inter_token_latency.p99
RL         = request_latency.p50        (canonical; p50 used everywhere)
RL_p99     = request_latency.p99

H.12 Energy caveat: which metrics are accurate vs approximate

Metric	Accurate?	Condition
`output_tok_J`

total_J

decode_J

decode_tok_J

total_tok_J

which is accuratetotal_J

prefill_J

>= 500 ms onlyTTFT

prefill_tok_J

>= 500 ms onlyTTFT

prefill_J

prefill_%

>= 500 ms onlyTTFT

prefill_J

mJ_per_output_tok

decode_J

Phase energies use phase-specific power from tegrastats samples assigned to exact prefill/decode windows via per-request nanosecond timestamps from profile_export.jsonl

(see I.5). The residual approximation is only the few tegrastats samples that straddle a phase boundary near each request’s TTFT transition - negligible across 20 requests.

is computable for 239 of 240 cells; the one exception (Bonsai-8B / 25W / ctx=256 / gen=512) is a complete benchmark failure - all 20 aiperf requests errored - unrelated to power measurement.

output_tok_J

H.13 Power and temperature

p50_power_W = median(tegrastats.VDD_CPU_GPU_CV[mW] / 1000
               for all samples where aiperf_t0 <= sample_time <= aiperf_t1)

Power is the **median ** (CPU+GPU+CV rail) from

VDD_CPU_GPU_CV

tegrastats

sampled at 500 ms intervals, over each model’s active inference window only (idle/cool-down between models excluded). Median is used instead of mean to suppress rare outlier samples (e.g. OS scheduling spikes) that do not reflect sustained inference power.Junction temperature (TJ) is the hottest internal die temperature on the Jetson SoC, reported by tegrastats

as tj@

. The hardware automatically throttles GPU/CPU clocks when TJ reaches ~95 °C to prevent damage. Peak TJ < 76 °C across all runs confirms ample thermal headroom at every power mode.

Symbol	Source	Definition
`VDD_CPU_GPU_CV`

cpu@

gpu@

tj@

p50_power_W

over active inference window (W)VDD_CPU_GPU_CV

avg_cpu_C

avg_gpu_C

peak_tj_C

Throttling is flagged when peak_tj_C > 85 °C

(leaving a 10 °C safety margin below the hardware limit).

Appendix I: All ctx x gen Combinations - tok/s and tok/J #

Full breakdown of output tok/s and output tok/J for all 5 models across all 4 power modes, repeated for every combination of prompt context length (ctx) and generation length (gen). Each table mirrors Table 7 (the canonical ctx=2048, gen=512 cell) but at a different operating point.

Bold values indicate the peak tok/J mode for that model row; when two modes tie on tok/J (to 2 decimal places), the higher-throughput mode is bolded. All values use p50 (median) over 20 requests.

gen=128 tok

Table I.1: ctx=256, gen=128

Model	7W tok/s	15W tok/s	25W tok/s
Bonsai-1.7B	6.8	17.7	26.0
28.0	4.64	5.53	5.81
5.30	25W
Bonsai-4B	3.6	9.6	14.2
15.3	2.22	2.71	2.83
2.55	25W
Bonsai-8B	3.7	10.4	15.5
17.1	1.77	1.93	1.93
1.74	25W
Ternary-Bonsai-1.7B	9.7	25.9	38.2
41.9	5.12	5.50	5.60
5.05	25W
Ternary-Bonsai-4B	4.3	12.1	18.0
19.8	2.19	2.39	2.44
2.22	25W

Table I.2: ctx=512, gen=128

Model	7W tok/s	15W tok/s	25W tok/s
Bonsai-1.7B	6.8	17.5	25.7
27.7	4.60	5.47	5.80
5.16	25W
Bonsai-4B	3.6	9.6	14.1
15.2	2.51	2.69	2.79
2.52	25W
Bonsai-8B	3.7	10.3	15.3
16.9	1.80	1.91	1.92
1.72	25W
Ternary-Bonsai-1.7B	9.6	25.6	37.7
41.3	5.07	5.48	5.49
4.93	25W
Ternary-Bonsai-4B	4.3	12.0	17.8
19.7	2.18	2.39	2.42
2.19	25W

Table I.3: ctx=1024, gen=128

Model	7W tok/s	15W tok/s	25W tok/s
Bonsai-1.7B	6.7	17.1	25.2
27.2	4.54	5.30	5.58
5.00	25W
Bonsai-4B	3.6	9.4	13.9
15.0	2.42	2.66	2.76
2.48	25W
Bonsai-8B	3.7	10.2	15.1
16.7	1.75	1.90	1.90
1.70	25W
Ternary-Bonsai-1.7B	9.4	24.9	36.6
40.2	4.87	5.28	5.35
4.80	25W
Ternary-Bonsai-4B	4.2	11.8	17.5
19.3	2.15	2.35	2.38
2.15	25W

Table I.4: ctx=2048, gen=128

Model	7W tok/s	15W tok/s	25W tok/s
Bonsai-1.7B	6.5	16.5	24.3
26.2	4.43	5.04	5.29
4.75	25W
Bonsai-4B	3.5	9.2	13.5
14.7	2.21	2.56	2.66
2.41	25W
Bonsai-8B	3.6	9.9	14.0
15.1	1.72	1.84	1.84
1.65	25W
Ternary-Bonsai-1.7B	9.1	23.5	34.8
38.1	4.68	4.99	5.17
4.55	25W
Ternary-Bonsai-4B	4.1	11.4	16.9
18.7	2.14	2.29	2.32
2.09	25W

gen=256 tok

Table I.5: ctx=256, gen=256

Model	7W tok/s	15W tok/s	25W tok/s
Bonsai-1.7B	6.8	17.7	26.0
28.0	4.62	5.51	5.84
5.20	25W
Bonsai-4B	3.6	9.6	14.2
15.3	2.27	2.70	2.82
2.54	25W
Bonsai-8B	3.7	10.4	15.5
17.1	1.87	1.92	1.93
1.73	25W
Ternary-Bonsai-1.7B	9.7	25.9	38.4
41.9	5.10	5.48	5.74
4.99	25W
Ternary-Bonsai-4B	4.3	12.1	18.0
19.8	2.18	2.40	2.42
2.19	25W

Table I.6: ctx=512, gen=256

Model	7W tok/s	15W tok/s	25W tok/s
Bonsai-1.7B	6.8	17.5	25.7
27.7	4.58	5.45	5.77
5.14	25W
Bonsai-4B	3.6	9.6	14.1
15.2	2.31	2.68	2.80
2.52	25W
Bonsai-8B	3.7	10.3	15.3
16.9	1.83	1.90	1.91
1.71	25W
Ternary-Bonsai-1.7B	9.6	25.5	37.8
41.3	4.94	5.40	5.68
4.92	25W
Ternary-Bonsai-4B	4.3	12.0	17.8
19.6	2.30	2.38	2.41
2.17	25W

Table I.7: ctx=1024, gen=256

Model	7W tok/s	15W tok/s	25W tok/s
Bonsai-1.7B	6.7	17.1	25.2
27.2	4.52	5.28	5.62
5.01	25W
Bonsai-4B	3.6	9.4	13.9
15.0	2.24	2.62	2.76
2.47	25W
Bonsai-8B	3.7	10.2	15.1
16.7	1.84	1.89	1.89
1.69	25W
Ternary-Bonsai-1.7B	9.4	24.8	36.8
40.1	4.85	5.25	5.53
4.78	25W
Ternary-Bonsai-4B	4.2	11.8	17.5
19.3	2.28	2.34	2.37
2.14	25W

Table I.8: ctx=2048, gen=256

Model	7W tok/s	15W tok/s	25W tok/s
Bonsai-1.7B	6.5	16.5	24.3
26.2	4.41	5.02	5.31
4.76	25W
Bonsai-4B	3.5	9.2	13.5
14.6	2.20	2.55	2.65
2.38	25W
Bonsai-8B	3.6	9.9
14.0	15.1	1.71	1.84
1.82	1.64	15W
Ternary-Bonsai-1.7B	9.1	23.5	34.8
38.2	4.66	4.97	5.15
4.60	25W
Ternary-Bonsai-4B	4.1	11.4	16.9
18.7	2.18	2.28	2.30
2.08	25W

gen=512 tok

Table I.9: ctx=256, gen=512

Model	7W tok/s	15W tok/s	25W tok/s
Bonsai-1.7B	6.8	17.6	25.9
27.9	4.59	5.47	5.80
5.20	25W
Bonsai-4B	3.6	9.6	14.1
15.3	2.26	2.69	2.80
2.53	25W
Bonsai-8B	3.7	10.4

| 17.0 | 1.90 | 1.91 |
| 1.72 | 15W | | Ternary-Bonsai-1.7B | 9.7 | 25.7 | 38.1 | 41.7 | 5.06 | 5.38 | 5.69 | 4.97 | 25W | | Ternary-Bonsai-4B | 4.3 | 12.1 | 17.9 | 19.7 | 2.17 | 2.39 | 2.41 | 2.18 | 25W |

Table I.10: ctx=512, gen=512

Model	7W tok/s	15W tok/s	25W tok/s
Bonsai-1.7B	6.7	17.4	25.6
27.6	4.56	5.41	5.73
5.15	25W
Bonsai-4B	3.6	9.5	14.0
15.2	2.25	2.64	2.79
2.49	25W
Bonsai-8B	3.7	10.3
15.3	16.9	1.86	1.91
1.90	1.71	15W
Ternary-Bonsai-1.7B	9.6	25.4	37.6
41.1	5.01	5.35	5.62
4.91	25W
Ternary-Bonsai-4B	4.3	12.0	17.7
19.6	2.24	2.37	2.40
2.17	25W

Table I.11: ctx=1024, gen=512

Model	7W tok/s	15W tok/s	25W tok/s
Bonsai-1.7B	6.7	17.1	25.0
27.1	4.38	5.31	5.58
5.02	25W
Bonsai-4B	3.6	9.4	13.9
15.0	2.28	2.58	2.75
2.46	25W
Bonsai-8B	3.7	10.2
14.4	16.6	1.84	1.88
1.85	1.69	15W
Ternary-Bonsai-1.7B	9.4	24.6	36.6
40.0	4.81	5.20	5.46
4.77	25W
Ternary-Bonsai-4B	4.2	11.7	17.4
19.2	2.13	2.33	2.36
2.13	25W

Table I.12: ctx=2048, gen=512

Model	7W tok/s	15W tok/s	25W tok/s
Bonsai-1.7B	6.5	16.4	24.1
26.1	4.27	4.99	5.24
4.77	25W

Bonsai-4B	3.5	9.2	13.5
14.6	2.19	2.51	2.64
2.37	25W

Bonsai-8B	3.6	9.9
14.0	15.1	1.70	1.83
1.81	1.63	15W

Ternary-Bonsai-1.7B	9.0	23.4	34.7
38.0	4.64	4.94	5.18
4.55	25W

Ternary-Bonsai-4B	4.1	11.4	16.9
18.6	2.08	2.25	2.30
2.06	25W
—

Trivia: Why Ternary Beats 1-bit on Jetson Despite Being Larger #

Running the ternary models faster than the Q1_0 Bonsai models seems counterintuitive: ternary weights are stored at 2 bits each (4 per byte) while Q1_0 is 1 bit each (8 per byte), so Q1_0 has a smaller file and less DRAM pressure per decode step. Yet ternary wins on every mode. Here is why.

Storage vs compute type are separate things.

Format	Storage	Compute path	Tensor core support on GA10B
Ternary {-1, 0, +1}	2 bits/weight	unpack to INT8, use INT8 WMMA	Yes
Q1_0 {-1, +1}	1 bit/weight	needs XNOR+popcount (BMMA / INT1)	No

True 1-bit compute uses XNOR + popcount via BMMA instructions. XOR the packed weight bits with packed activation bits, then count the ones with popcount. Extremely fast when supported. NVIDIA added INT1 tensor cores (BMMA) on A100 (GA100) and select Turing chips. The Jetson Orin Nano runs the GA10B Ampere die, which does not include BMMA. So Q1_0 falls back to slow CUDA core emulation.

Ternary unpacks to INT8 {-1, 0, 1}, and INT8 WMMA tensor cores are present on GA10B. Ternary gets hardware-accelerated matmul; Q1_0 does not.

Zero-weight sparsity is a second real advantage. BitNet 1.58 models carry roughly 30-50% zero weights. Those accumulations are skipped entirely, saving a genuine fraction of flops even in the FP16/INT8 accumulation path. Q1_0 binary weights have no zeros, so there is no sparsity to exploit.

What good 1-bit kernels would look like. A proper XNOR+popcount kernel on hardware that supports BMMA would in theory beat ternary: smaller DRAM footprint AND fast bitwise arithmetic. The reason 1-bit loses here is not an inherent property of 1-bit quantization – it is a hardware availability gap on this specific Jetson die, combined with immature llama.cpp kernel support for Q1_0.

Kernel quality is about more than the arithmetic. Even when the per-weight operation is just add/subtract, the CUDA kernel controls: memory coalescing (whether 32 warp threads read contiguous addresses into a single DRAM transaction or scatter into 32 separate ones); shared memory tiling ( weight blocks into fast on-chip SRAM before accumulation vs hitting DRAM on every access); warp divergence from the zero-weight skip (a naive if/skip branch serializes the warp – a good kernel uses predicated execution or precomputed bitmasks); and occupancy (register pressure determines how many warps stay in flight to hide memory latency). The arithmetic is one clock cycle. Everything around it determines whether the GPU is actually executing that cycle or stalling.

⚠ Why Models with Weights > ~1 GB Were Not Tested (and even so with higher ctx/gen lengths) #

All models in this benchmark have GGUF weights ≤ 958 MB. Larger models (e.g. Gemma3-4B Q4_K_M at 2.4 GB) fail to load on JetPack R36.4.7 (L4T 36.4) regardless of power mode or available memory.This is a known regression in the CUDA IOVA / NvMap contiguous-memory allocator introduced in this JetPack release - not a simple “out of RAM” failure.

Root cause:On Jetson platforms, the CUDA driver allocates device-mapped memory through theNvMap

kernel driver, which requires asingle contiguous blockin the IOVA (I/O Virtual Address) space. Unlike a general-purpose allocator that can stitch together scattered pages, NvMap must find one unbroken IOVA range large enough for the entire allocation in a single call. For a 2.4 GB GGUF like Gemma3-4B, that means requesting a contiguous block of roughly2.4 GB(plus KV-cache and CUDA runtime overhead) before any other tensor or buffer is placed in the address space.The allocation fails immediately with

NvMapMemAllocInternalTagged: error 12 (ENOMEM)

errno 12 isENOMEM

, “not enough memory” in the contiguous-mapping sense, not the total-capacity sense.

What this means in practice:Any GGUF model requiring more than roughly1.1 GBof contiguous CUDA buffers is blocked at load time on this JetPack version. This is why the benchmark is limited to models under ~1 GB GGUF size. Smaller models load fine because their contiguous IOVA requirement fits within what the fragmented address space can still provide.

Affected platform:NVIDIA Jetson Orin Nano Super 8GB running JetPack R36.4.7 (L4T 36.4.7 / Ubuntu 22.04). The same board onJetPack 6.2.2 (L4T 36.5)resolves this regression.

source & further reading

smolhub.com — original article Tiny LLM Benchmark: Jetson Orin Nano Super 8GB Clustering 3 Jetson Orin Nano Super

Bonsai LLM Benchmark: Jetson Orin Nano Super 8GB

Benchmark Configuration #

Data Availability #

1.1 Hardware

1.2 Software Stack

1.3 Models Under Test

1.4 Power Modes

2. Results: Charts #

2.1 Throughput vs Prompt Length

2.2 Energy Efficiency

2.3 Latency

2.4 Prefill Throughput

2.5 Power Draw

2.5.1 VDD_IN + VDD_CPU_GPU_CV ≠ TDP

3. Analysis #

3.1 Higher tok/sec != efficient model (tok/J)

3.2 The 25W Sweet Spot

3.3 Best use cases for each power mode

3.4 Throughput Speedup Summary

3.5 Latency Characteristics

3.6 Model Size vs Efficiency

3.7 Energy Efficiency: Decode tok/J and Total tok/J

4. Conclusion #

What These Numbers Mean for Edge Inference

The Clear Winner: 25W Mode (for sub-4B models)

Appendix A: Full 4-Mode Comparison (ctx=2048, gen=512) #

Appendix C: Full 12-Combination Heatmaps (All Power Modes) #

C.1 Output Tok/s heatmaps

C.2 Output Tok/J heatmaps

C.3 Phase power heatmaps (prefill and decode W)

Appendix D: Prefill / Decode / Total tok/J: All Combinations #

D.1 Prefill tok/J (input tok / J) vs prompt length

D.2 Decode tok/J (output tok / J) - independent of prompt length

D.3 Total tok/J ((input+output) tok / J) vs prompt length

Appendix E: Latency: All Combinations #

E.1 Request Latency (E2E)

E.1a Request latency vs prompt length (by gen length)

E.2 TTFT

E.2a TTFT

E.3 TTFT

Appendix F: ITL #

F.1 ITL

F.2 ITL

Appendix G: Prefill Throughput: All Combinations #

G.1 Prefill throughput vs prompt length (by gen length)

G.2 Prefill throughput heatmaps (gen x prompt) per power mode

H.1 Raw inputs from aiperf and tegrastats

H.2 Power

H.3 Output tok/J (main efficiency metric)

H.4 Request latency energy

H.5 Prefill and decode energy

H.6 Phase tok/J metrics

H.7 mJ per output token

H.8 Prefill throughput

H.9 Speedup ratio methodology (Tables 9, 10a, 11, 12)

H.10 Best total tok/J per model (Table 13)

H.11 TTFT

H.12 Energy caveat: which metrics are accurate vs approximate

H.13 Power and temperature

Appendix I: All ctx x gen Combinations - tok/s and tok/J #

gen=128 tok

gen=256 tok

gen=512 tok

Trivia: Why Ternary Beats 1-bit on Jetson Despite Being Larger #

⚠ Why Models with Weights > ~1 GB Were Not Tested (and even so with higher ctx/gen lengths) #

Run your AI side-project on zahid.host

E.2 `TTFT`

E.2a `TTFT`

E.3 `TTFT`

Appendix F: `ITL` #

F.1 `ITL`

F.2 `ITL`

H.11 `TTFT`