Bonsai LLM Benchmark: Jetson Orin Nano Super 8GB

NVIDIA Jetson Orin Nano Super 8GB benchmarks show 25W as the energy-efficiency sweet spot for sub-4B Bonsai LLMs, delivering 47-48% more tokens per second than 15W while maintaining or improving output tok/J. The Ternary-Bonsai-1.7B model achieved the highest throughput and efficiency across all power modes, while the 8B model failed due to out-of-memory errors.

Bonsai LLM Benchmark: Jetson Orin Nano Super 8GB 5 Bonsai-family 1–1.58bit LLMs benchmarked across 4 power modes on Jetson Orin Nano Super 8GB. 25W sweet spot: 47–48% more tok/s than 15W, best output tok/J for all sub-4B models. My Jetson Nano Orin Super cluster setup - its real Benchmark Configuration | Platform | NVIDIA Jetson Orin Nano Super 8GB | | CPU | 6-core Arm Cortex-A78AE | | GPU | NVIDIA Ampere 1024 CUDA cores, 32 Tensor cores | | Memory | 8 GB LPDDR5 shared CPU+GPU | | JetPack | R36.4.7 L4T 36.4 | | Backend | llama.cpp CUDA, -ngl 99 all layers on GPU , --no-cache-prompt | | Runs | Four full sweeps: 7W, 15W, 25W, MAXN SUPER | | Sweep | prompt ∈ {256, 512, 1024, 2048} tok × gen ∈ {128, 256, 512} tok × 20 reqs/combo | | Concurrency | 1 single-user | | Key metric | | Github for scripts and code available at link. Executive Summary Five Bonsai-family 1-1.53bit LLMs were benchmarked across all four Jetson Orin Nano Super power modes: 7W , 15W , 25W , and MAXN SUPER . Each model ran 12 combinations of prompt × generation length 20 requests per combo at every power mode where it could load. Key finding : - 25W is the energy-efficiency sweet spot for all models ≤4B parameters. For Bonsai-8B, 15W and 25W deliver near-identical output tok/J ~1 % difference , making 15W the more power-conservative choice. - MAXN costs 10–11 % more energy per token than 25W across every model tested. 25W delivers 47–48 % more output tok/s than 15W while maintaining or improving output tok/J for sub-4B models ctx=2048, gen=512 . No thermal throttling was observed at any power mode - peak junction temperature TJ reached 75.3 °C at MAXN Bonsai-8B , well below the 95 °C hardware throttle threshold. All other models peak below 72 °C even at MAXN. Throughput and efficiency winner at each mode ctx=2048, gen=512, Ternary-Bonsai-1.7B dominates : Table 1: Throughput and efficiency winner at each power mode | Mode | Fastest model | Output Tok/s | Output Tok/J | |---|---|---|---| | 7W | Ternary-Bonsai-1.7B | 9.0 | 4.64 | | 15W | Ternary-Bonsai-1.7B | 23.4 | 4.94 | | 25W | Ternary-Bonsai-1.7B | 34.7 | 5.18 | | MAXN | Ternary-Bonsai-1.7B | 38.0 | 4.55 | Ternary-Bonsai-8B Q2 0, ~1.4 GB failed at every power mode: OOM in 8 GB unified memory when combined with KV cache and CUDA overhead. All five remaining models have complete data across all four power modes. Data Availability Complete per-cell JSON exports all 33 metrics, all 12 prompt×gen combos × 20 requests per cell are published on Hugging Face Datasets: | Mode | Dataset | Models | Cells | |---|---|---|---| | 7W | YuvrajSingh9886/bonsai-jetson-benchmark-7w | YuvrajSingh9886/bonsai-jetson-benchmark-15w YuvrajSingh9886/bonsai-jetson-benchmark-25w YuvrajSingh9886/bonsai-jetson-benchmark-maxn Each dataset contains the full profile export aiperf.json per cell all 33 metrics including ISL glossary , , glossary OSL avg/p50/p90/p99, glossary TTFT , glossary ITL , glossary output tok/s , glossary request latency , appendix-h8 prefill tok/s , appendix-h2 power W , appendix-h3 output tok/J tegrastats.log , and per-model server logs. 1. Test Setup 1.1 Hardware Table 2: Hardware configuration | Component | Detail | |---|---| | Board | Jetson Orin Nano Super 8GB Developer Kit | | CPU | 6× Arm Cortex-A78AE @ up to 1.728 GHz | | GPU | NVIDIA Ampere, 1024 CUDA cores, 32 Tensor cores | | Memory | 8 GB LPDDR5 102 GB/s unified CPU + GPU | | CMA | 256 MB contiguous memory pool; depletes across sequential model loads | | Cooling | Active fan; peak junction temperature ≤ 75 °C across all modes | 1.2 Software Stack | Layer | Version / Detail | |---|---| | OS / JetPack | JetPack R36.4.7 Ubuntu 22.04, L4T 36.4 | | CUDA | 12.6 | | llama.cpp | CUDA backend, -ngl 99 , --no-cache-prompt --cache-ram 0 | | Inference server | llama-server : host 0.0.0.0:8080 , --parallel 1 , -c 2560 | | Load generator | aiperf NVIDIA AI Performance tool | | Power telemetry | tegrastats at 500 ms, VDD CPU GPU CV | | Python | 3.10 aiperf-env , pandas, seaborn, matplotlib | | Datasets | Synthetic prompts at exact token counts 256, 512, 1024, 2048 generated synthetically via aiperf | | Concurrency | 1 user, 1 request at a time --parallel 1 , --concurrency 1 - single-user latency and throughput profile only | | Batch size | 512 tokens physical -ub / ubatch, default · 2048 logical -b , default for llama.cpp | 1.3 Models Under Test | Model | Quant | GGUF size | Tokenizer | |---|---|---|---| | Bonsai-1.7B | Q1 0 | ~237 MB | Qwen/Qwen3-1.7B | | Bonsai-4B | Q1 0 | ~540 MB | Qwen/Qwen3-4B | | Bonsai-8B | Q1 0 | ~1.1 GB | Qwen/Qwen3-8B | | Ternary-Bonsai-1.7B | Q2 0 | ~300 MB | Qwen/Qwen3-1.7B | | Ternary-Bonsai-4B | Q2 0 | ~700 MB | Qwen/Qwen3-4B | | Ternary-Bonsai-8B | Q2 0 | ~1.4 GB | N/A | 1.4 Power Modes Table 5: Power mode configurations | Mode | nvpmodel | GPU clock | CPU clock | VDD CPU GPU CV | |---| 7W -m 3 15W -m 0 25W -m 1 MAXN -m 2 + jetson clocks 1 020 MHz 1 728 MHz 1.5 Benchmark Methodology - For each model × prompt × gen combo , aiperf sends 20 single-concurrency requests with synthetic prompts at the exact target token count. - Power is sampled from tegrastats mW → W at 500 ms intervals. Tegrastats samples are assigned to exact prefill/decode phase windows using per-request nanosecond timestamps from VDD CPU GPU CV profile export.jsonl .= output tok J ÷ OSL × decode power W - decode energy only, prefill excluded. See decode time p50 s Appendix H.3 appendix-h3 . - Clocks were locked with jetson clocks at all modes. - Each run’s power and clock speed was capped at x W through nvpmodel and monitored for thermal stability no sustained throttling; junction temp ≤ 75 °C . Latency percentile used throughout: all, TTFT , and request latency ITL values reported in charts, tables, and energy calculations use the RL p50 median over the 20 requests per combo. The mean is not used for latency because occasional slow requests GC pause, memory compaction, OS scheduling inflate it without reflecting typical behaviour. p90 and p99 are available in the raw per-mode report files Data Availability data-availability for tail-latency analysis. 2. Results: Charts All charts use data from all four power modes. 2.1 Throughput vs Prompt Length Output tok/s vs prompt length at gen=512 across all models and modes; 25W orange consistently leads for models ≤4B: Figure 1: Output tok/s vs prompt length gen=512, all models and modes Canonical cell ctx=2048, gen=512 , side-by-side output tok/s and output tok/J bars for all 4 modes: Figure 2: Canonical cell: output tok/s and tok/J side by side ctx=2048, gen=512 2.2 Energy Efficiency Output Tok/J vs prompt length at gen=512 ; 25W leads for ≤4B models; 15W and 25W are near-tied for Bonsai-8B: Figure 3: Output tok/J vs prompt length gen=512, all models and modes Output Tok/J heatmap gen × prompt for Standard Bonsai models 1.7B, 4B, 8B at all 4 modes: Figure 4: Output tok/J heatmap: Standard Bonsai models at all 4 power modes gen × prompt Output Tok/J heatmap for Ternary Bonsai models 1.7B, 4B at all 4 power modes: Figure 5: Output tok/J heatmap: Ternary Bonsai models at all 4 power modes gen × prompt Bonsai-8B spotlight : output tok/J at all 4 power modes across all three gen lengths - the model where 15W and 25W are energy-equivalent: Figure 6: Bonsai-8B output tok/J at all 4 power modes across gen lengths Prefill tok/J input tokens per joule of prefill energy vs prompt length at gen=512 , how efficiently each mode processes the prompt; higher is better: Figure 7a: Prefill tok/J input tok / J vs prompt length gen=512, all models and modes Note: = prefill tok J / ISL × prefill power W s uses TTFT prefill power W derived from tegrastats samples assigned to exact prefill windows using per-request nanosecond timestamps request start ns → request ack ns from profile export.jsonl .Prefill draws significantly more power than decodeon Bonsai models up to ~1.6x for 1.7B/4B . See Appendix H.5 for the full methodology. Decode tok/J output tokens per joule of decode energy vs prompt length at gen=512 , output generation efficiency; 25W leads for sub-4B models: Figure 7b: Decode tok/J output tok / J vs prompt length gen=512, all models and modes Total tok/J input + output tokens per joule of total request energy vs prompt length at gen=512 , overall request efficiency; 25W wins for sub-4B models at every prompt length: Figure 7c: Total tok/J input+output tok / J vs prompt length gen=512, all models and modes Phase power draw at the canonical cell ctx=2048, gen=512 - the wattage each phase actually draws, showing why prefill and decode have different energy costs: Figure 7d: Prefill phase power W - all models × all power modes ctx=2048, gen=512 Figure 7e: Decode phase power W - all models × all power modes ctx=2048, gen=512 Prefill is a fully-batched forward pass over the prompt compute-heavy ; decode is one token at a time memory-bandwidth bound . Prefill consistently draws more watts than decode at the same power mode. Per-mode heatmaps across all 12 prompt × gen combinations: C.3 Phase Power . Full tok/J charts for all ctx/gen combinations: D.1 Prefill · D.2 Decode · D.3 Total . 2.3 Latency TTFT glossary p50 at ctx=2048, gen=512; 25W and MAXN reduce by glossary TTFT 29–39 % vs 15W: Figure 8: TTFT p50 by power mode ctx=2048, gen=512 ITL inter-token latency p50 at ctx=2048, gen=512; lower is better: Figure 9: ITL p50 by power mode ctx=2048, gen=512 Request latency E2E p50 at ctx=2048, gen=512; total time from request start to last token received: Figure 10: Request latency E2E p50 by power mode ctx=2048, gen=512 2.4 Prefill Throughput 25W and MAXN provide ~29–47 % faster prefill than 15W: Figure 11: Prefill throughput by power mode gen=512, avg over all prompt lengths 2.5 Power Draw Figure 12: Median VDD CPU GPU CV power draw per model at each power mode Table 6: Median power draw per model at each power mode W, VDD CPU GPU CV | Model | 7W | 15W | 25W | MAXN | |---|---|---|---|---| | Bonsai-1.7B | 1.48 | 3.24 | 4.51 | 5.41 | | Bonsai-4B | 1.60 | 3.59 | 5.07 | 6.10 | | Bonsai-8B | 2.07 | 5.42 | 8.03 | 9.91 | | Ternary-Bonsai-1.7B | 1.95 | 4.75 | 6.71 | 8.42 | | Ternary-Bonsai-4B | 1.99 | 5.06 | 7.41 | 9.05 | Median over all 12 prompt × gen combos. Bold = highest observed power draw per model row. 2.5.1 VDD IN + VDD CPU GPU CV ≠ TDP The nvpmodel TDP cap 7W/15W/25W/MAXN applies to the total module draw , measured by the VDD IN glossary rail. is a glossary VDD CPU GPU CV subset - they’re not additive: VDD IN ≈ VDD CPU GPU CV + VDD SOC + misc rails - The wattage looks low as in table 6 because the rail only captures the GPU, CPU, and CV engine power - not the entire module. The remaining power is drawn by other components memory controller, media blocks, I/O, DRAM self-refresh, PMIC losses that are outside the scope of VDD CPU GPU CV but still contribute to the total power budget under the TDP cap. VDD CPU GPU CV - Since, we only test single-user, single-request mode, there isn’t enough load to push the total module power up to the TDP ceiling. VDD IN ≈ 7.5W total module - well under the 15W cap ├── VDD CPU GPU CV = 3.2W CPU + GPU + CV engine ├── VDD SOC = 1.7W memory controller, media blocks └── other rails ≈ 2.6W misc I/O, DRAM self-refresh, PMIC losses No thermal throttling was triggered at any mode because the Bonsai decode workload never saturates the GPU compute units - the TDP ceiling is never approached. 3. Analysis 3.1 Higher tok/sec = efficient model tok/J Tok/s left half and tok/J right half are intentionally both shown- a faster mode does not always mean a more efficient one. - MAXN beats 25W on raw tok/s +8–11 % for all models but loses on tok/J −10–11 % because its power increase outpaces the throughput gain for ctx = 2048, gen = 512 . = output tok J / OSL × decode power W - decode energy only. See decode time p50 s Appendix H.3 for full formula. Full breakdown across all 12 ctx × gen combinations: Appendix I . Table 7: Canonical cell comparison ctx=2048, gen=512 | Model | 7W tok/s | 15W tok/s | 25W tok/s | MAXN tok/s | 7W tok/J | 15W tok/J | 25W tok/J | MAXN tok/J | Peak tok/J | |---|---|---|---|---|---|---|---|---|---| | Bonsai-1.7B | 6.5 | 16.4 | 24.1 | 26.1 | 4.27 | 4.99 | 5.24 | 4.77 | 25W | | Bonsai-4B | 3.5 | 9.2 | 13.5 | 14.6 | 2.19 | 2.51 | 2.64 | 2.37 | 25W | | Bonsai-8B | 3.6 | 9.9 | 14.0 | 15.1 | 1.70 | 1.83 | 1.81 | 1.63 | 15W | | Ternary-Bonsai-1.7B | 9.0 | 23.4 | 34.7 | 38.0 | 4.64 | 4.94 | 5.18 | 4.55 | 25W | | Ternary-Bonsai-4B | 4.1 | 11.4 | 16.9 | 18.6 | 2.08 | 2.25 | 2.30 | 2.06 | 25W | 3.2 The 25W Sweet Spot 25W is the best mode for output tok/J and tok/sec for all sub-4B Bonsai models, and is near-parity for Bonsai-8B. The arithmetic is: - Going from 15W → 25W : output tok/s rises 47–48 % GPU clock 612 → 820 MHz , while power rises ~38–46 % . Net output tok/J gain: +1 to +6 % for 1.7B–4B models including Ternary-Bonsai-4B: 2.25 → 2.30 ; −1 % for Bonsai-8B memory-bandwidth bound, clock gain wiped by power overhead . - Going from 25W → MAXN : output tok/s gains +8–11 % decode is memory-bandwidth bound, not compute bound , while power rises ~20–25 % . Net output tok/J loss: −10 to −11 % across all models. The GPU clock ceiling at 15W 612 MHz leaves significant decode throughput on the table for sub-8B models. Raising it to 820 MHz at 25W captures most of the available improvement at modest extra power. The final jump to 1020 MHz at MAXN costs disproportionate power for marginal gains. 3.3 Best use cases for each power mode Table 8: Recommended power mode by use case | Use case | Recommended mode | |---|---| | Always-on inference sub-4B | 25W: best TTFT | output tok/J appendix-h3 and highest tok/sec; vs 25W: +8-10% tok/s, +10% prefill time glossary TTFT , -9-12% glossary TTFT output tok/J appendix-h3 , -2-5% glossary TTFT output tok/J appendix-h3 , -6-14% glossary TTFT output tok/J appendix-h3 ; lowest power budgetAll above numbers are for sub-4B models tested. For 8B models performance, please check the individual sections for your interest. 3.4 Throughput Speedup Summary All figures are median p50 across the full prompt × gen sweep 12 combos per model ; throughput uses median p50 tok/s . Table 9: Output throughput speedup ratios - all pairwise mode comparisons | Model | 25W / 15W | MAXN / 15W | 15W / 7W | 25W / 7W | MAXN / 7W | MAXN / 25W | |---|---|---|---|---|---|---| | Bonsai-1.7B | 1.47x | 1.59x | 2.57x | 3.78x | 4.08x | 1.08x | | Bonsai-4B | 1.48x | 1.59x | 2.65x | 3.90x | 4.22x | 1.08x | | Bonsai-8B | 1.48x | 1.64x | 2.79x | 4.11x | 4.57x | 1.11x | | Ternary-Bonsai-1.7B | 1.48x | 1.62x | 2.64x | 3.92x | 4.28x | 1.09x | | Ternary-Bonsai-4B | 1.48x | 1.64x | 2.79x | 4.13x | 4.56x | 1.10x | 25W delivers a consistent ~1.47–1.48x speedup vs 15W across all models. 15W gives about 2.5–2.8x boost vs 7W ; Bonsai-8B and Ternary-4B show the largest gain at 2.78x. MAXN/25W is consistently ~1.08–1.10x for all models - extra compute headroom over an already-fast 25W baseline translates to only modest gains, since decode is memory-bandwidth bound. MAXN/7W reaches 4.56x for Ternary-Bonsai-4B - the largest speedup in the sweep. 25W/7W for Ternary-Bonsai-4B is 4.13x , the highest 25W gain of any model.- For Bonsai models 1.4B/4B , the throughput is consistenly higher compared to Ternary models. Figure 14: Output throughput speedup vs 15W baseline - all models and modes 3.5 Latency Characteristics At ctx=256 a model like Bonsai-1.7B prefills in ~170 ms 25W ; at ctx=2048 that grows to ~1353 ms. TTFT scales near-linearly with prompt across all modes. Inter-token latency is the median per-token decode cost. ITL p50 heatmaps per power mode all 5 models, all 12 prompt×gen combos are in glossary ITL - see Figures F.2a–F.2d. At the canonical ctx=2048, gen=512: appendix-f2 Appendix F.2 7W | 15W | 25W | MAXN | Figure 10a: ITL p50 heatmaps - all 4 power modes rows = gen length, cols = prompt length is driven primarily by model size and GPU clock or power mode but also loosely on prompt length too, like between ctx length=256 to ctx length=2048, the ITL increases because of the larger context prompt + generation - kv cache scans increases too . ITL Bonsai models have lowerthan ITL Ternary models at every mode.- The Ternary-Bonsai-1.7B achieves lower than Bonsai-1.7B at every mode despite larger file size, consistent with ternary weights being faster to load from DRAM per decode step. ITL Decode time s p50 is the time spent generating output tokens. Table 10a: Decode time speedup ratios - all pairwise mode comparisons | Model | 25W vs 15W | MAXN vs 15W | 15W vs 7W | 25W vs 7W | MAXN vs 7W | MAXN vs 25W | |---|---|---|---|---|---|---| | Bonsai-1.7B | 1.47x | 1.58x | 2.57x | 3.78x | 4.08x | 1.08x | | Bonsai-4B | 1.48x | 1.59x | 2.65x | 3.91x | 4.22x | 1.08x | | Bonsai-8B | 1.50x | 1.64x | 2.79x | 4.17x | 4.57x | 1.10x | | Ternary-Bonsai-1.7B | 1.48x | 1.62x | 2.64x | 3.92x | 4.28x | 1.09x | | Ternary-Bonsai-4B | 1.48x | 1.64x | 2.79x | 4.13x | 4.56x | 1.10x | See Appendix H.9 for the full speedup formula. decode time = p50 − RL p50; median over all 12 prompt × gen combos. TTFT - Decode time for Ternary models is lower than those of Bonsai models. Why? - The reason lies in 1-bit vs 1.58-bit model structure. Bonsai’s 1-bit quantization requires more complex bit unpacking and on-the-fly dequantization during decode, which adds overhead per token. Ternary’s 2-bit structure with optimized ternary-CUDA kernels allows for more efficient memory access patterns and simpler decode logic, reducing per-token latency despite the larger file size. Thinking with Claude, there is this amusing reason, since 1.58-bits -1,0,1} has ‘0’ as a valid weight, it can skip the multiply-accumulate for those zero weights during dequant stage conversion to fp16 or bf16 for GEMM , while 1-bit quantization has to do the full compute for every weight bit, even if many are effectively zero after unpacking. This leads to more efficient decoding for the ternary models. - median TTFT speedup TTFT baseline / median TTFT mode over all 12 prompt × gen combos see H.9 appendix-h9 : Table 11: TTFT speedup ratios - all pairwise mode comparisons| Model | 25W vs 15W | MAXN vs 15W | 15W vs 7W | 25W vs 7W | MAXN vs 7W | MAXN vs 25W | |---|---|---|---|---|---|---| | Bonsai-1.7B | 1.46x | 1.60x | 2.71x | 3.95x | 4.34x | 1.10x | | Bonsai-4B | 1.47x | 1.61x | 2.86x | 4.21x | 4.62x | 1.10x | | Bonsai-8B | 1.29x | 1.56x | 2.84x | 3.66x | 4.42x | 1.21x | | Ternary-Bonsai-1.7B | 1.46x | 1.60x | 2.75x | 4.02x | 4.41x | 1.10x | | Ternary-Bonsai-4B | 1.48x | 1.62x | 3.08x | 4.54x | 4.98x | 1.10x | Bonsai-8B shows a smallerimprovement at 25W vs 15W 1.29x compared to the 4B models 1.47–1.48x . This confirms the prefill is also becoming memory-bandwidth bound for the larger model. TTFT MAXN/7W reaches 4.98x for Ternary-Bonsai-4B prefill - the largestspeedup in the sweep. TTFT 25W/7W is 4.54x for Ternary-Bonsai-4B, also the highest across all models at that comparison. Request latency E2E speedup - median RL glossary p50 at baseline / median p50 at mode over all 12 prompt × gen combos see glossary RL H.9 appendix-h9 : Table 12: Request latency E2E speedup ratios - all pairwise mode comparisons | Model | 25W vs 15W | MAXN vs 15W | 15W vs 7W | 25W vs 7W | MAXN vs 7W | MAXN vs 25W | |---|---|---|---|---|---|---| | Bonsai-1.7B | 1.47x | 1.59x | 2.57x | 3.78x | 4.09x | 1.08x | | Bonsai-4B | 1.47x | 1.60x | 2.66x | 3.92x | 4.24x | 1.08x | | Bonsai-8B | 1.51x | 1.60x | 2.79x | 4.21x | 4.47x | 1.06x | | Ternary-Bonsai-1.7B | 1.48x | 1.62x | 2.64x | 3.91x | 4.28x | 1.09x | | Ternary-Bonsai-4B | 1.48x | 1.64x | 2.81x | 4.16x | 4.59x | 1.10x | - Mirrors the speedup trends since prefill dominates request latency at these context sizes. TTFT 3.6 Model Size vs Efficiency The relationship is clear: smaller quantized models always win on total tok/J , not just tok/s. Table 13: Best total tok/J ranked by model size | Model | Params | GGUF | Best total tok/J | At mode / ctx / gen | |---|---|---|---|---| | Bonsai-1.7B | 1.7B | ~237 MB | 62.5 | 25W / 2048 / 128 | | Ternary-Bonsai-1.7B | 1.7B | ~300 MB | 59.6 | 25W / 2048 / 128 | | Bonsai-4B | 4B | ~540 MB | 28.7 | 25W / 2048 / 128 | | Ternary-Bonsai-4B | 4B | ~700 MB | 25.5 | 25W / 2048 / 128 | | Bonsai-8B | 8B | ~1.1 GB | 18.8 | 15W / 2048 / 128 | Total tok/J = + ISL / OSL × p50 power W p50 s - see RL Appendix H.6 for the full formula. Peaks at ctx=2048, gen=128 for every model because the long prompt dominates the numerator while short gen minimises decode energy. All 48 mode × ctx × gen combinations were searched. Figure 13: Best output tok/J per model - all power modes Bonsai-1.7B at 25W achieves 62.5 total tok/J , more than 3× more efficient than Bonsai-8B 18.8 across the full request. Notably, Bonsai-1.7B edges Ternary-Bonsai-1.7B on total tok/J 62.5 vs 59.6 despite having fewer parameters - the Q1 0 Bonsai-1.7B is slightly lighter on memory bandwidth than the Q2 0 ternary variant. Ternary-Bonsai-1.7B wins on output tok/s and output tok/J at the canonical cell instead. 3.7 Energy Efficiency: Decode tok/J and Total tok/J Table 14: Request energy split – prefill vs decode % of total request energy median over all 12 prompt x gen combos | Model | Phase | 7W | 15W | 25W | MAXN | |---|---|---|---|---|---| | Bonsai-1.7B | Prefill | 7% | 7% | 6% | 6% | | Bonsai-1.7B | Decode | 93% | 93% | 94% | 94% | | Bonsai-4B | Prefill | 9% | 9% | 9% | 9% | | Bonsai-4B | Decode | 91% | 91% | 91% | 91% | | Bonsai-8B | Prefill | 11% | 12% | 15% | 12% | | Bonsai-8B | Decode | 89% | 88% | 85% | 88% | | Ternary-Bonsai-1.7B | Prefill | 9% | 9% | 9% | 9% | | Ternary-Bonsai-1.7B | Decode | 91% | 91% | 91% | 91% | | Ternary-Bonsai-4B | Prefill | 10% | 10% | 10% | 10% | | Ternary-Bonsai-4B | Decode | 90% | 90% | 90% | 90% | = prefill J x prefill power W p50 s; TTFT = decode J x decode power W ; decode time p50 s prefill% = / prefill J + prefill J . Phase power from tegrastats samples assigned to exact phase windows via per-request nanosecond timestamps. See decode J H.5 . Two complementary tok/J lenses on energy efficiency - see H.6 appendix-h6 for formulas: See Figure 7b figure-7b decode tok/J vs prompt length and Figure 7c figure-7c total tok/J vs prompt length in section 2.2. Full combinations: D.2 Decode appendix-d2 · D.3 Total appendix-d3 . Key findings: - 25W wins on both metrics for sub-4B models at every prompt and gen length. The exception is Bonsai-8B, where 15W and 25W are almost the same on tok/J 1.81 vs 1.83 output tok/J at canonical cell . - The 1.7B models reach ~5 tok/J decode vs ~2.5–2.7 tok/J for the 4B models and ~1.8 tok/J for Bonsai-8B. Smaller models are dramatically more energy-efficient per output token. - Decode dominates request energy. As clearly visible in Table 14 table-14 , the decode phase accounts for ~90–94 % of total request energy across all models and modes. Prefill is a smaller fraction. The reason is the amount of time spent. Prefill is a one-time cost at the start of the request, while decode is an ongoing cost that accumulates with every generated token. Even though prefill power can be similar to decode power, the much longer duration of decode means it contributes more to total energy. - The ternary 1.7B model has slightly lower decode tok/J than Bonsai-1.7B standard despite higher raw throughput - the Q2 0 format requires more DRAM bandwidth per token than Q1 0, which slightly increases decode energy relative to throughput. - Ternary-Bonsai models draw 10–20 % more decode power than same-size Bonsai-1bit despite both dequantizing to fp16 GEMM. The difference is fully explained by memory bandwidth - not compute intensity: | Bonsai Q1 0 1.125 bpw | Ternary-Bonsai Q2 0 2.06 bpw | | |---|---|---| | Model size 4B | 540 MiB | 1,020 MiB | | Bits per weight | 1.125 | 2.06 | | Dequant op | sign × scale trivial | 2-bit lookup → {-1,0,+1} × scale | | Bytes read per token | 540 MiB | 1,020 MiB 1.9× more | | Power impact | baseline | +10–20 % DRAM traffic dominates | Neither format runs natively on GPU hardware - Q1 0 and Q2 0 both dequantize to fp16 before GEMM. The power gap comes from 1.9× more weight bytes moved through the memory controllerin the Ternary variant, not from a difference in compute. Phase power by mode W – median over all 12 prompt × gen combos: | Model | Phase | 7W | 15W | 25W | MAXN | |---|---|---|---|---|---| | Bonsai-1.7B | Prefill | 2.07 | 4.52 | 5.81 | 7.30 | | Bonsai-1.7B | Decode | 1.48 | 3.22 | 4.50 | 5.41 | | Bonsai-4B | Prefill | 2.15 | 5.44 | 7.70 | 9.09 | | Bonsai-4B | Decode | 1.60 | 3.59 | 5.05 | 6.10 | | Bonsai-8B | Prefill | 2.13 | 5.97 | 8.47 | 10.56 | | Bonsai-8B | Decode | 2.05 | 5.41 | 8.03 | 9.91 | | Ternary-Bonsai-1.7B | Prefill | 2.05 | 5.04 | 7.08 | 8.80 | | Ternary-Bonsai-1.7B | Decode | 1.95 | 4.75 | 6.71 | 8.42 | | Ternary-Bonsai-4B | Prefill | 2.05 | 5.49 | 7.90 | 9.62 | | Ternary-Bonsai-4B | Decode | 1.99 | 5.06 | 7.41 | 9.04 | Table 14a: Prefill power ratios – all pairwise mode comparisons | Model | 25W/15W | MAXN/15W | 15W/7W | 25W/7W | MAXN/7W | MAXN/25W | |---|---|---|---|---|---|---| | Bonsai-1.7B | 1.29x | 1.62x | 2.18x | 2.80x | 3.52x | 1.26x | | Bonsai-4B | 1.42x | 1.67x | 2.53x | 3.57x | 4.22x | 1.18x | | Bonsai-8B | 1.42x | 1.77x | 2.80x | 3.97x | 4.95x | 1.25x | | Ternary-Bonsai-1.7B | 1.40x | 1.75x | 2.46x | 3.45x | 4.29x | 1.24x | | Ternary-Bonsai-4B | 1.44x | 1.75x | 2.68x | 3.85x | 4.69x | 1.22x | Table 14b: Decode power ratios – all pairwise mode comparisons | Model | 25W/15W | MAXN/15W | 15W/7W | 25W/7W | MAXN/7W | MAXN/25W | |---|---|---|---|---|---|---| | Bonsai-1.7B | 1.40x | 1.68x | 2.17x | 3.04x | 3.65x | 1.20x | | Bonsai-4B | 1.41x | 1.70x | 2.25x | 3.16x | 3.81x | 1.21x | | Bonsai-8B | 1.48x | 1.83x | 2.64x | 3.91x | 4.83x | 1.23x | | Ternary-Bonsai-1.7B | 1.41x | 1.77x | 2.43x | 3.44x | 4.31x | 1.25x | | Ternary-Bonsai-4B | 1.46x | 1.79x | 2.54x | 3.72x | 4.54x | 1.22x | Ratios show how much more power mode A draws vs mode B ratio 1 = A draws more . Computed from tegrastats samples assigned to prefill/decode phase windows via per-request nanosecond timestamps; median over all 12 prompt x gen combos. See VDD CPU GPU CV H.5 for phase assignment methodology. Key observations: Prefill draws more power than decode at every mode for sub-4B models prefill is compute-bound; decode is memory-bandwidth-bound . Bonsai-8B is the exception – its decode power approaches prefill at higher modes because of higher FLOPS and memory bandwidth demands and so does the energy too. 25W vs 15W adds only 1.29-1.48x power but delivers 1.47-1.50x throughput – the sweet spot where each extra watt of draw returns more than proportional speed. MAXN vs 25W adds just 1.18-1.26x power on prefill and 1.20-1.25x on decode , yet delivers only +8-10% tok/s 1.08-1.10x and loses -9-12% output tok/J – the diminishing-returns zone where the power increase outpaces the out tok/sec and tok/J gain because of single-user, single-request setup not able to fully utilzie the extra GPU headroom. 7W decode power 1.48-2.05 W is remarkably low – the SoC at minimum clocks draws less than a typical USB charger during autoregressive generation. Figure 15: Total energy per request vs output length at 25W, ctx=2048 Figure 16: Decode energy per output token in mJ ctx=2048, gen=512 4. Conclusion What These Numbers Mean for Edge Inference At Ternary-Bonsai-1.7B Q2 0: up to 38.4 tok/s at 25W ctx=256 : real-time fluent generation 0.24 s at ctx=256 25W TTFT ~300 MB on disk : trivially portable 6.83 W under load : runs on a USB-C power bank 5.74 output tok/J ctx=256, gen=256 : best output tok/J for the Ternary-1.7B at 25W Bonsai-1.7B Q1 0 pushes even further: 5.84 output tok/J ctx=256, gen=256 in only 237 MB at 4.51 W average under load, 26.0 tok/s and 0.21 s 25W, ctx=256 . Total tok/J peaks at TTFT 62.5 ctx=2048, gen=128, best in suite where the long prompt dominates the numerator. - The standard Q1 0 models are lighter on disk and memory bandwidth; the Ternary Q2 0 variants generate faster output tokens per second, thus Ternary models are better for latency-sensitive applications while Bonsai models are mostly energy-efficient per output token. The Clear Winner: 25W Mode for sub-4B models 25W nvpmodel -m 1 is the Pareto-optimal power mode for Bonsai-1.7B and Bonsai-4B inference on the Jetson Orin Nano Super. It is the right answer for the majority of deployments: ~47 % more throughput than 15W 1.7B class and ~46-47 % faster prefill time TTFT - Only ~40 % more power than 15W 10–11 % better output tok/J than MAXN 25W: 2.30–5.24 output tok/J across sub-4B models at ctx=2048, gen=512; up to 5.84 at ctx=256 - Low peak power ≤ 7 W for 1.7B–4B models for sustained operation; peak TJ 65.9-66.2 °C – over 28 °C of thermal headroom before the 95 °C hardware throttle threshold For Bonsai-8B : prefer 15W for energy-efficiency-neutral operation at ~43 % lower board power than 25W. Appendix A: Full 4-Mode Comparison ctx=2048, gen=512 Raw numbers from the canonical benchmark cell. All latencies in milliseconds. Power = median over each run window. VDD CPU GPU CV Table 15: Full 4-mode comparison, ctx=2048, gen=512 | Model | Mode | Output Tok/s | TTFT | |---| p50 ms glossary ITL 24.1 1353.1 41.43 5.24 13.5 3133.3 74.04 2.64 1.83 14.0 5502.4 71.48 34.7 1515.4 28.84 5.18 16.9 3569.0 59.29 2.30 Appendix B: Thermal Summary - All Power Modes Power and temperature: median over each model’s full benchmark window all 12 prompt×gen combos . No model triggered thermal throttling at any power mode threshold ≈ 95 °C . Junction temperature TJ is the hottest internal die temperature on the Jetson SoC, reported by tegrastats as tj@ . It is the primary metric for thermal safety: if TJ reaches ~95 °C, the hardware automatically throttles clocks to prevent damage. Peak TJ < 76 °C across all runs means thermal headroom is ample. Table 16: Thermal summary - all power modes | Model | Mode | Avg Power W | Avg CPU °C | Avg GPU °C | Peak TJ °C | Throttled | |---|---|---|---|---|---|---| | Bonsai-1.7B | 7W | 1.48 | 53.6 | 54.9 | 55.8 | No | | Bonsai-1.7B | 15W | 3.24 | 55.6 | 56.8 | 59.0 | No | | Bonsai-1.7B | 25W | 4.51 | 62.5 | 63.7 | 65.9 | No | | Bonsai-1.7B | MAXN | 5.41 | 62.1 | 63.3 | 65.9 | No | | Bonsai-4B | 7W | 1.60 | 53.7 | 55.0 | 57.3 | No | | Bonsai-4B | 15W | 3.59 | 58.3 | 59.5 | 61.7 | No | | Bonsai-4B | 25W | 5.07 | 62.4 | 63.8 | 66.2 | No | | Bonsai-4B | MAXN | 6.10 | 63.4 | 64.7 | 67.7 | No | | Bonsai-8B | 7W | 2.07 | 54.7 | 56.1 | 58.3 | No | | Bonsai-8B | 15W | 5.42 | 61.1 | 62.5 | 64.6 | No | | Bonsai-8B | 25W | 8.03 | 66.3 | 67.9 | 70.4 | No | | Bonsai-8B | MAXN | 9.91 | 69.9 | 71.8 | 75.3 | No | | Ternary-Bonsai-1.7B | 7W | 1.95 | 54.8 | 56.2 | 57.0 | No | | Ternary-Bonsai-1.7B | 15W | 4.75 | 61.2 | 62.5 | 63.8 | No | | Ternary-Bonsai-1.7B | 25W | 6.71 | 64.3 | 65.9 | 69.2 | No | | Ternary-Bonsai-1.7B | MAXN | 8.42 | 68.2 | 69.7 | 72.4 | No | | Ternary-Bonsai-4B | 7W | 1.99 | 54.7 | 56.0 | 57.8 | No | | Ternary-Bonsai-4B | 15W | 5.06 | 60.6 | 62.0 | 63.7 | No | | Ternary-Bonsai-4B | 25W | 7.41 | 65.7 | 67.2 | 69.3 | No | | Ternary-Bonsai-4B | MAXN | 9.05 | 68.4 | 70.0 | 71.8 | No | Appendix C: Full 12-Combination Heatmaps All Power Modes Each heatmap is a 2×3 grid 5 models, 6th panel unused showing all 12 prompt×gen combinations for one power mode and one metric. Rows = gen length 128, 256, 512 tok , columns = prompt length 256, 512, 1024, 2048 tok . Brighter colour = higher value. C.1 Output Tok/s heatmaps Figure C.1a: All 12 combos at 7W Figure C.1b: All 12 combos at 15W Figure C.1c: All 12 combos at 25W Figure C.1d: All 12 combos at MAXN C.2 Output Tok/J heatmaps Figure C.2a: All 12 combos at 7W Figure C.2b: All 12 combos at 15W Figure C.2c: All 12 combos at 25W Figure C.2d: All 12 combos at MAXN C.3 Phase power heatmaps prefill and decode W Observed VDD CPU GPU CV glossary power during the prefill and decode phases separately, across all 12 prompt × gen combinations per power mode. Prefill is compute-heavy batched forward pass ; decode is memory-bandwidth bound one token at a time - the difference in watts between the two phases is visible in every mode. Figure C.3a: Prefill power W at 7W Figure C.3b: Prefill power W at 15W Figure C.3c: Prefill power W at 25W Figure C.3d: Prefill power W at MAXN Figure C.3e: Decode power W at 7W Figure C.3f: Decode power W at 15W Figure C.3g: Decode power W at 25W Figure C.3h: Decode power W at MAXN Appendix D: Prefill / Decode / Total tok/J: All Combinations All charts are 2×3 faceted line plots with a fixed y-scale across all subplots. The canonical combination ctx=2048, gen=512 is also shown in §2.2. D.1 Prefill tok/J input tok / J vs prompt length Figure D.1a: Prefill tok/J vs prompt length: gen=128 Figure D.1b: Prefill tok/J vs prompt length: gen=256 Figure D.1c: Prefill tok/J vs prompt length: gen=512 canonical, also in § 2.2 D.2 Decode tok/J output tok / J - independent of prompt length Decode tok/J depends on the number of output tokens gen length , not input prompt length, since decode happens after prefill completes. These charts show decode tok/J as a function of gen length for each prompt context length. Figure D.2a: Decode tok/J vs gen length: ctx=256 Figure D.2b: Decode tok/J vs gen length: ctx=512 Figure D.2c: Decode tok/J vs gen length: ctx=1024 Figure D.2d: Decode tok/J vs gen length: ctx=2048 D.3 Total tok/J input+output tok / J vs prompt length Figure D.3a: Total tok/J vs prompt length: gen=128 Figure D.3b: Total tok/J vs prompt length: gen=256 Figure D.3c: Total tok/J vs prompt length: gen=512 canonical, also in § 2.2 Appendix E: Latency: All Combinations E.1 Request Latency E2E Request latency E2E p50 - total time from request start to last token received. Line charts show variation with prompt length 2×3 facet, fixed y-scale . E.1a Request latency vs prompt length by gen length Figure E.1a: Request latency vs prompt length: gen=128 Figure E.1b: Request latency vs prompt length: gen=256 Figure E.1c: Request latency vs prompt length: gen=512 canonical E.2 TTFT : All Prompt × Gen Combinations TTFT TTFT glossary p50 median time to first token, ms is driven almost entirely by prompt length - it is the prefill cost. These charts show how it varies across all 12 prompt × gen combinations and across all 4 power modes. E.2a TTFT vs prompt length by gen length TTFT Figure E.2a: TTFT vs prompt length: gen=128 Figure E.2b: TTFT vs prompt length: gen=256 Figure E.2c: TTFT vs prompt length: gen=512 canonical E.3 TTFT heatmaps gen x prompt per power mode TTFT Each cell is TTFT glossary in ms. Rows = gen length, columns = prompt length. Independent of gen length hence the same across rows.| Figure E.3a: TTFT | Figure E.3b: heatmap: 15W glossary TTFT Figure E.3c: heatmap: 25W glossary TTFT Figure E.3d: heatmap: MAXN glossary TTFT Appendix F: ITL : All Combinations ITL Inter-token latency ms = time between consecutive output tokens. It measures decode cost and is driven by model size and GPU clock, not prompt length. F.1 ITL vs prompt length by gen length ITL Figure F.1a: ITL vs prompt length: gen=128 Figure F.1b: ITL vs prompt length: gen=256 Figure F.1c: ITL vs prompt length: gen=512 canonical, also in section 2.3 F.2 ITL heatmaps gen x prompt per power mode ITL | Figure F.2a: ITL | Figure F.2b: heatmap: 15W glossary ITL Figure F.2c: heatmap: 25W glossary ITL Figure F.2d: heatmap: MAXN glossary ITL Appendix G: Prefill Throughput: All Combinations Prefill throughput tok/s measures how fast the model processes input tokens. It scales with prompt length longer prompts hit peak GPU utilisation and GPU clock speed. G.1 Prefill throughput vs prompt length by gen length Figure G.1a: Prefill throughput vs prompt length: gen=128 Figure G.1b: Prefill throughput vs prompt length: gen=256 Prefill throughput is independent of gen length, so gen=128 and gen=256 show the same trend. G.2 Prefill throughput heatmaps gen x prompt per power mode | Figure G.2a: Prefill throughput heatmap: 7W | Figure G.2b: Prefill throughput heatmap: 15W Figure G.2c: Prefill throughput heatmap: 25W Figure G.2d: Prefill throughput heatmap: MAXN Appendix H: All Metrics, Formulas, and Calculation Methods This appendix documents every metric reported in this benchmark, its formula, its source, and any caveats. H.1 Raw inputs from aiperf and tegrastats | Symbol | Source | Definition | |---|---|---| ISL | aiperf JSON input sequence length.avg | Actual input tokens processed per request may differ from target due to tokenizer rounding | OSL | aiperf JSON output sequence length.avg | Actual output tokens generated per request | TTFT | aiperf JSON time to first token.p50 ms | Median time from request sent to first output token received; proxy for prefill duration. p50 used not avg to avoid skew from occasional slow requests | ITL | aiperf JSON inter token latency.p50 ms | Median time between consecutive output tokens; per-token decode cost. p50 used for robustness against outliers | RL | aiperf JSON request latency.p50 ms | Median total wall time per request: TTFT | tok s output token throughput per user.p50 / RL in steady state glossary OSL prefill tput prefill throughput per user.p50 t0 , t1 start time , end time ISO 8601 mW i field mW glossary VDD CPU GPU CV i All aiperf metrics are averages over 20 requests per combo. Percentile variants p50, p90, p99 are also available in the raw JSON but not reproduced here. H.2 Power p50 power W = median mW i for all tegrastats samples where t0 <= sample time <= t1 / 1000 covers the CPU, GPU, and Computer Vision engine rail VDD CPU GPU CV - Does NOT include board overhead fan, storage, USB which is on VDD IN VDD IN is ~1.5-3 W higher thanduring inference VDD CPU GPU CV - Tegrastats interval: 500 ms H.3 Output tok/J main efficiency metric output tok J = OSL / decode J = OSL / decode power W RL p50 s - TTFT p50 s decode power W is the median power across tegrastats samples assigned to exact decode windows using per-request nanosecond timestamps from profile export.jsonl see H.5 appendix-h5 . The denominator is decode-phase energy only - prefill excluded because no output tokens are generated during prefill. Higher is better. This is the primary metric of the benchmark. Note: because decode time and decode power are both roughly independent of prompt length, output tok/J is also roughly independent of prompt length. H.4 Request latency energy total J = p50 power W RL p50 / 1000 Energy consumed by one median request from first byte sent to last token received. p50 power W is the median tegrastats sample over the full run window. Accurate for all cells regardless of TTFT glossary . H.5 Prefill and decode energy prefill J = prefill power W TTFT / 1000 decode J = decode power W RL - TTFT / 1000 total J = prefill J + decode J prefill % = prefill J / total J 100 Exact per-request phase power from profile export.jsonl. aiperf writes one JSON record per request to profile export.jsonl , with nanosecond-precision timestamps: request start ns - when request was sent request ack ns - when first output token was received = prefill end request end ns - when last output token was received = decode end For each request i , phase windows in wall-clock seconds are: prefill start i = t0 + request start ns i - request start ns 0 / 1e9 prefill end i = t0 + request ack ns i - request start ns 0 / 1e9 decode end i = t0 + request end ns i - request start ns 0 / 1e9 where t0 is the aiperf start time ISO timestamp and request start ns 0 is the first request’s start. Each tegrastats sample at wall-clock time ep is assigned to whichever request’s phase window it falls in. prefill power W = median of samples in any prefill window across all 20 requests. decode power W = median of samples in any decode window across all 20 requests. Energy uses exact p50 median phase durations from the per-request data: p50 ttft s = median request ack ns i - request start ns i / 1e9 over all 20 requests p50 decode s = median request end ns i - request ack ns i / 1e9 over all 20 requests Why this matters: prefill draws significantly more power than decode on Bonsai models - prefill is a batched forward pass compute-heavy , decode is one token at a time memory-bandwidth bound . At 25W the ratio is ~1.6x for 1.7B/4B models, ~1.1x for 8B. Using full-run average power for both phases would underestimate decode power W and therefore overestimate decode J , making output tok J too low. Residual approximation: a 500 ms tegrastats sample that straddles a prefill→decode boundary within a single request is assigned to one phase entirely. This affects only samples near the request ack ns boundary and is negligible across 20 requests totalling hundreds of samples. H.6 Phase tok/J metrics prefill tok J = ISL / prefill J = ISL / prefill power W TTFT s decode tok J = OSL / decode J = OSL / decode power W RL s - TTFT s total tok J = ISL + OSL / total J = ISL + OSL / prefill J + decode J Where TTFT s = TTFT glossary / 1000 , RL s = RL / 1000 . : input tokens per joule of prefill energy, using phase-specific prefill tok J prefill power W .: identical to decode tok J - the primary benchmark metric, using phase-specific output tok J decode power W .: all tokens in + out per joule of total request energy. total tok J H.7 mJ per output token mJ per output tok = decode J / OSL 1000 = 1000 / decode tok J Millijoules per generated output token decode J appendix-h5 is in joules, ×1000 converts to mJ for readability . Carries the same caveat as I.5 for cells where < 500 ms. glossary TTFT H.8 Prefill throughput prefill tput tok/s = aiperf JSON prefill throughput per user.p50 Directly from aiperf. Measures how fast input tokens are processed during the prefill phase. Scales with prompt length longer prompts hit peak GPU utilisation and GPU clock. H.9 Speedup ratio methodology Tables 9, 10a, 11, 12 All speedup ratios use median over all 12 prompt × gen combos 4 ctx lengths × 3 gen lengths . “A vs B” means A is the faster mode; ratio 1 means A is faster than B. Table 9 - Output throughput tok/s : speedup A vs B = median tok s A over 12 combos / median tok s B over 12 combos tok s = output token throughput per user.p50 aiperf Table 10a - Decode time: decode time s = request latency.p50 - time to first token.p50 / 1000 speedup A vs B = median decode time s B over 12 combos / median decode time s A over 12 combos Table 11 - TTFT prefill time : speedup A vs B = median TTFT B over 12 combos / median TTFT A over 12 combos TTFT = time to first token.p50 aiperf, ms Table 12 - Request latency E2E : speedup A vs B = median RL B over 12 combos / median RL A over 12 combos RL = request latency.p50 aiperf, ms H.10 Best total tok/J per model Table 13 best total tok J model = max total tok J mode, model, gen, ctx over all modes in {7W, 15W, 25W, MAXN} and all gen in {128, 256, 512} and all ctx in {256, 512, 1024, 2048} total tok J = ISL + OSL / p50 power W RL p50 s The single highest total tok/J value observed for that model across all 48 combinations. Peaks at ctx=2048, gen=128 for every model because the long prompt dominates the ISL glossary + numerator. glossary OSL H.11 TTFT , ITL , RL percentiles TTFT ITL RL All percentile variants come directly from aiperf JSON without further computation: TTFT = time to first token.p50 canonical; p50 used everywhere TTFT p90 = time to first token.p90 TTFT p99 = time to first token.p99 ITL = inter token latency.p50 canonical; p50 used everywhere ITL p99 = inter token latency.p99 RL = request latency.p50 canonical; p50 used everywhere RL p99 = request latency.p99 H.12 Energy caveat: which metrics are accurate vs approximate | Metric | Accurate? | Condition | |---|---|---| output tok J | total J decode J decode tok J total tok J which is accurate appendix-h4 total J prefill J = 500 ms only glossary TTFT prefill tok J = 500 ms only glossary TTFT prefill J prefill % = 500 ms only glossary TTFT prefill J mJ per output tok decode J Phase energies use phase-specific power from tegrastats samples assigned to exact prefill/decode windows via per-request nanosecond timestamps from profile export.jsonl see I.5 . The residual approximation is only the few tegrastats samples that straddle a phase boundary near each request’s TTFT glossary transition - negligible across 20 requests. is computable for 239 of 240 cells; the one exception Bonsai-8B / 25W / ctx=256 / gen=512 is a complete benchmark failure - all 20 aiperf requests errored - unrelated to power measurement. appendix-h3 output tok J H.13 Power and temperature p50 power W = median tegrastats.VDD CPU GPU CV mW / 1000 for all samples where aiperf t0 <= sample time <= aiperf t1 Power is the median CPU+GPU+CV rail from VDD CPU GPU CV tegrastats sampled at 500 ms intervals, over each model’s active inference window only idle/cool-down between models excluded . Median is used instead of mean to suppress rare outlier samples e.g. OS scheduling spikes that do not reflect sustained inference power. Junction temperature TJ is the hottest internal die temperature on the Jetson SoC, reported by tegrastats as tj@ . The hardware automatically throttles GPU/CPU clocks when TJ reaches ~95 °C to prevent damage. Peak TJ < 76 °C across all runs confirms ample thermal headroom at every power mode. | Symbol | Source | Definition | |---|---|---| VDD CPU GPU CV | cpu@ gpu@ tj@ p50 power W over active inference window W glossary VDD CPU GPU CV avg cpu C avg gpu C peak tj C Throttling is flagged when peak tj C 85 °C leaving a 10 °C safety margin below the hardware limit . Appendix I: All ctx x gen Combinations - tok/s and tok/J Full breakdown of output tok/s and output tok/J for all 5 models across all 4 power modes, repeated for every combination of prompt context length ctx and generation length gen . Each table mirrors Table 7 table-7 the canonical ctx=2048, gen=512 cell but at a different operating point. Bold values indicate the peak tok/J mode for that model row; when two modes tie on tok/J to 2 decimal places , the higher-throughput mode is bolded. All values use p50 median over 20 requests. gen=128 tok Table I.1: ctx=256, gen=128 | Model | 7W tok/s | 15W tok/s | 25W tok/s | MAXN tok/s | 7W tok/J | 15W tok/J | 25W tok/J | MAXN tok/J | Peak tok/J | |---|---|---|---|---|---|---|---|---|---| | Bonsai-1.7B | 6.8 | 17.7 | 26.0 | 28.0 | 4.64 | 5.53 | 5.81 | 5.30 | 25W | | Bonsai-4B | 3.6 | 9.6 | 14.2 | 15.3 | 2.22 | 2.71 | 2.83 | 2.55 | 25W | | Bonsai-8B | 3.7 | 10.4 | 15.5 | 17.1 | 1.77 | 1.93 | 1.93 | 1.74 | 25W | | Ternary-Bonsai-1.7B | 9.7 | 25.9 | 38.2 | 41.9 | 5.12 | 5.50 | 5.60 | 5.05 | 25W | | Ternary-Bonsai-4B | 4.3 | 12.1 | 18.0 | 19.8 | 2.19 | 2.39 | 2.44 | 2.22 | 25W | Table I.2: ctx=512, gen=128 | Model | 7W tok/s | 15W tok/s | 25W tok/s | MAXN tok/s | 7W tok/J | 15W tok/J | 25W tok/J | MAXN tok/J | Peak tok/J | |---|---|---|---|---|---|---|---|---|---| | Bonsai-1.7B | 6.8 | 17.5 | 25.7 | 27.7 | 4.60 | 5.47 | 5.80 | 5.16 | 25W | | Bonsai-4B | 3.6 | 9.6 | 14.1 | 15.2 | 2.51 | 2.69 | 2.79 | 2.52 | 25W | | Bonsai-8B | 3.7 | 10.3 | 15.3 | 16.9 | 1.80 | 1.91 | 1.92 | 1.72 | 25W | | Ternary-Bonsai-1.7B | 9.6 | 25.6 | 37.7 | 41.3 | 5.07 | 5.48 | 5.49 | 4.93 | 25W | | Ternary-Bonsai-4B | 4.3 | 12.0 | 17.8 | 19.7 | 2.18 | 2.39 | 2.42 | 2.19 | 25W | Table I.3: ctx=1024, gen=128 | Model | 7W tok/s | 15W tok/s | 25W tok/s | MAXN tok/s | 7W tok/J | 15W tok/J | 25W tok/J | MAXN tok/J | Peak tok/J | |---|---|---|---|---|---|---|---|---|---| | Bonsai-1.7B | 6.7 | 17.1 | 25.2 | 27.2 | 4.54 | 5.30 | 5.58 | 5.00 | 25W | | Bonsai-4B | 3.6 | 9.4 | 13.9 | 15.0 | 2.42 | 2.66 | 2.76 | 2.48 | 25W | | Bonsai-8B | 3.7 | 10.2 | 15.1 | 16.7 | 1.75 | 1.90 | 1.90 | 1.70 | 25W | | Ternary-Bonsai-1.7B | 9.4 | 24.9 | 36.6 | 40.2 | 4.87 | 5.28 | 5.35 | 4.80 | 25W | | Ternary-Bonsai-4B | 4.2 | 11.8 | 17.5 | 19.3 | 2.15 | 2.35 | 2.38 | 2.15 | 25W | Table I.4: ctx=2048, gen=128 | Model | 7W tok/s | 15W tok/s | 25W tok/s | MAXN tok/s | 7W tok/J | 15W tok/J | 25W tok/J | MAXN tok/J | Peak tok/J | |---|---|---|---|---|---|---|---|---|---| | Bonsai-1.7B | 6.5 | 16.5 | 24.3 | 26.2 | 4.43 | 5.04 | 5.29 | 4.75 | 25W | | Bonsai-4B | 3.5 | 9.2 | 13.5 | 14.7 | 2.21 | 2.56 | 2.66 | 2.41 | 25W | | Bonsai-8B | 3.6 | 9.9 | 14.0 | 15.1 | 1.72 | 1.84 | 1.84 | 1.65 | 25W | | Ternary-Bonsai-1.7B | 9.1 | 23.5 | 34.8 | 38.1 | 4.68 | 4.99 | 5.17 | 4.55 | 25W | | Ternary-Bonsai-4B | 4.1 | 11.4 | 16.9 | 18.7 | 2.14 | 2.29 | 2.32 | 2.09 | 25W | gen=256 tok Table I.5: ctx=256, gen=256 | Model | 7W tok/s | 15W tok/s | 25W tok/s | MAXN tok/s | 7W tok/J | 15W tok/J | 25W tok/J | MAXN tok/J | Peak tok/J | |---|---|---|---|---|---|---|---|---|---| | Bonsai-1.7B | 6.8 | 17.7 | 26.0 | 28.0 | 4.62 | 5.51 | 5.84 | 5.20 | 25W | | Bonsai-4B | 3.6 | 9.6 | 14.2 | 15.3 | 2.27 | 2.70 | 2.82 | 2.54 | 25W | | Bonsai-8B | 3.7 | 10.4 | 15.5 | 17.1 | 1.87 | 1.92 | 1.93 | 1.73 | 25W | | Ternary-Bonsai-1.7B | 9.7 | 25.9 | 38.4 | 41.9 | 5.10 | 5.48 | 5.74 | 4.99 | 25W | | Ternary-Bonsai-4B | 4.3 | 12.1 | 18.0 | 19.8 | 2.18 | 2.40 | 2.42 | 2.19 | 25W | Table I.6: ctx=512, gen=256 | Model | 7W tok/s | 15W tok/s | 25W tok/s | MAXN tok/s | 7W tok/J | 15W tok/J | 25W tok/J | MAXN tok/J | Peak tok/J | |---|---|---|---|---|---|---|---|---|---| | Bonsai-1.7B | 6.8 | 17.5 | 25.7 | 27.7 | 4.58 | 5.45 | 5.77 | 5.14 | 25W | | Bonsai-4B | 3.6 | 9.6 | 14.1 | 15.2 | 2.31 | 2.68 | 2.80 | 2.52 | 25W | | Bonsai-8B | 3.7 | 10.3 | 15.3 | 16.9 | 1.83 | 1.90 | 1.91 | 1.71 | 25W | | Ternary-Bonsai-1.7B | 9.6 | 25.5 | 37.8 | 41.3 | 4.94 | 5.40 | 5.68 | 4.92 | 25W | | Ternary-Bonsai-4B | 4.3 | 12.0 | 17.8 | 19.6 | 2.30 | 2.38 | 2.41 | 2.17 | 25W | Table I.7: ctx=1024, gen=256 | Model | 7W tok/s | 15W tok/s | 25W tok/s | MAXN tok/s | 7W tok/J | 15W tok/J | 25W tok/J | MAXN tok/J | Peak tok/J | |---|---|---|---|---|---|---|---|---|---| | Bonsai-1.7B | 6.7 | 17.1 | 25.2 | 27.2 | 4.52 | 5.28 | 5.62 | 5.01 | 25W | | Bonsai-4B | 3.6 | 9.4 | 13.9 | 15.0 | 2.24 | 2.62 | 2.76 | 2.47 | 25W | | Bonsai-8B | 3.7 | 10.2 | 15.1 | 16.7 | 1.84 | 1.89 | 1.89 | 1.69 | 25W | | Ternary-Bonsai-1.7B | 9.4 | 24.8 | 36.8 | 40.1 | 4.85 | 5.25 | 5.53 | 4.78 | 25W | | Ternary-Bonsai-4B | 4.2 | 11.8 | 17.5 | 19.3 | 2.28 | 2.34 | 2.37 | 2.14 | 25W | Table I.8: ctx=2048, gen=256 | Model | 7W tok/s | 15W tok/s | 25W tok/s | MAXN tok/s | 7W tok/J | 15W tok/J | 25W tok/J | MAXN tok/J | Peak tok/J | |---|---|---|---|---|---|---|---|---|---| | Bonsai-1.7B | 6.5 | 16.5 | 24.3 | 26.2 | 4.41 | 5.02 | 5.31 | 4.76 | 25W | | Bonsai-4B | 3.5 | 9.2 | 13.5 | 14.6 | 2.20 | 2.55 | 2.65 | 2.38 | 25W | | Bonsai-8B | 3.6 | 9.9 | 14.0 | 15.1 | 1.71 | 1.84 | 1.82 | 1.64 | 15W | | Ternary-Bonsai-1.7B | 9.1 | 23.5 | 34.8 | 38.2 | 4.66 | 4.97 | 5.15 | 4.60 | 25W | | Ternary-Bonsai-4B | 4.1 | 11.4 | 16.9 | 18.7 | 2.18 | 2.28 | 2.30 | 2.08 | 25W | gen=512 tok Table I.9: ctx=256, gen=512 | Model | 7W tok/s | 15W tok/s | 25W tok/s | MAXN tok/s | 7W tok/J | 15W tok/J | 25W tok/J | MAXN tok/J | Peak tok/J | |---|---|---|---|---|---|---|---|---|---| | Bonsai-1.7B | 6.8 | 17.6 | 25.9 | 27.9 | 4.59 | 5.47 | 5.80 | 5.20 | 25W | | Bonsai-4B | 3.6 | 9.6 | 14.1 | 15.3 | 2.26 | 2.69 | 2.80 | 2.53 | 25W | | Bonsai-8B | 3.7 | 10.4 | - | 17.0 | 1.90 | 1.91 | - | 1.72 | 15W | | Ternary-Bonsai-1.7B | 9.7 | 25.7 | 38.1 | 41.7 | 5.06 | 5.38 | 5.69 | 4.97 | 25W | | Ternary-Bonsai-4B | 4.3 | 12.1 | 17.9 | 19.7 | 2.17 | 2.39 | 2.41 | 2.18 | 25W | Table I.10: ctx=512, gen=512 | Model | 7W tok/s | 15W tok/s | 25W tok/s | MAXN tok/s | 7W tok/J | 15W tok/J | 25W tok/J | MAXN tok/J | Peak tok/J | |---|---|---|---|---|---|---|---|---|---| | Bonsai-1.7B | 6.7 | 17.4 | 25.6 | 27.6 | 4.56 | 5.41 | 5.73 | 5.15 | 25W | | Bonsai-4B | 3.6 | 9.5 | 14.0 | 15.2 | 2.25 | 2.64 | 2.79 | 2.49 | 25W | | Bonsai-8B | 3.7 | 10.3 | 15.3 | 16.9 | 1.86 | 1.91 | 1.90 | 1.71 | 15W | | Ternary-Bonsai-1.7B | 9.6 | 25.4 | 37.6 | 41.1 | 5.01 | 5.35 | 5.62 | 4.91 | 25W | | Ternary-Bonsai-4B | 4.3 | 12.0 | 17.7 | 19.6 | 2.24 | 2.37 | 2.40 | 2.17 | 25W | Table I.11: ctx=1024, gen=512 | Model | 7W tok/s | 15W tok/s | 25W tok/s | MAXN tok/s | 7W tok/J | 15W tok/J | 25W tok/J | MAXN tok/J | Peak tok/J | |---|---|---|---|---|---|---|---|---|---| | Bonsai-1.7B | 6.7 | 17.1 | 25.0 | 27.1 | 4.38 | 5.31 | 5.58 | 5.02 | 25W | | Bonsai-4B | 3.6 | 9.4 | 13.9 | 15.0 | 2.28 | 2.58 | 2.75 | 2.46 | 25W | | Bonsai-8B | 3.7 | 10.2 | 14.4 | 16.6 | 1.84 | 1.88 | 1.85 | 1.69 | 15W | | Ternary-Bonsai-1.7B | 9.4 | 24.6 | 36.6 | 40.0 | 4.81 | 5.20 | 5.46 | 4.77 | 25W | | Ternary-Bonsai-4B | 4.2 | 11.7 | 17.4 | 19.2 | 2.13 | 2.33 | 2.36 | 2.13 | 25W | Table I.12: ctx=2048, gen=512 | Model | 7W tok/s | 15W tok/s | 25W tok/s | MAXN tok/s | 7W tok/J | 15W tok/J | 25W tok/J | MAXN tok/J | Peak tok/J | | |---|---|---|---|---|---|---|---|---|---|---| | Bonsai-1.7B | 6.5 | 16.4 | 24.1 | 26.1 | 4.27 | 4.99 | 5.24 | 4.77 | 25W | | | Bonsai-4B | 3.5 | 9.2 | 13.5 | 14.6 | 2.19 | 2.51 | 2.64 | 2.37 | 25W | | | Bonsai-8B | 3.6 | 9.9 | 14.0 | 15.1 | 1.70 | 1.83 | 1.81 | 1.63 | 15W | | | Ternary-Bonsai-1.7B | 9.0 | 23.4 | 34.7 | 38.0 | 4.64 | 4.94 | 5.18 | 4.55 | 25W | | | Ternary-Bonsai-4B | 4.1 | 11.4 | 16.9 | 18.6 | 2.08 | 2.25 | 2.30 | 2.06 | 25W | — | Trivia: Why Ternary Beats 1-bit on Jetson Despite Being Larger Running the ternary models faster than the Q1 0 Bonsai models seems counterintuitive: ternary weights are stored at 2 bits each 4 per byte while Q1 0 is 1 bit each 8 per byte , so Q1 0 has a smaller file and less DRAM pressure per decode step. Yet ternary wins on every mode. Here is why. Storage vs compute type are separate things. | Format | Storage | Compute path | Tensor core support on GA10B | |---|---|---|---| | Ternary {-1, 0, +1} | 2 bits/weight | unpack to INT8, use INT8 WMMA | Yes | | Q1 0 {-1, +1} | 1 bit/weight | needs XNOR+popcount BMMA / INT1 | No | True 1-bit compute uses XNOR + popcount via BMMA instructions. XOR the packed weight bits with packed activation bits, then count the ones with popcount. Extremely fast when supported. NVIDIA added INT1 tensor cores BMMA on A100 GA100 and select Turing chips. The Jetson Orin Nano runs the GA10B Ampere die, which does not include BMMA. So Q1 0 falls back to slow CUDA core emulation. Ternary unpacks to INT8 {-1, 0, 1}, and INT8 WMMA tensor cores are present on GA10B. Ternary gets hardware-accelerated matmul; Q1 0 does not. Zero-weight sparsity is a second real advantage. BitNet 1.58 models carry roughly 30-50% zero weights. Those accumulations are skipped entirely, saving a genuine fraction of flops even in the FP16/INT8 accumulation path. Q1 0 binary weights have no zeros, so there is no sparsity to exploit. What good 1-bit kernels would look like. A proper XNOR+popcount kernel on hardware that supports BMMA would in theory beat ternary: smaller DRAM footprint AND fast bitwise arithmetic. The reason 1-bit loses here is not an inherent property of 1-bit quantization – it is a hardware availability gap on this specific Jetson die, combined with immature llama.cpp kernel support for Q1 0. Kernel quality is about more than the arithmetic. Even when the per-weight operation is just add/subtract, the CUDA kernel controls: memory coalescing whether 32 warp threads read contiguous addresses into a single DRAM transaction or scatter into 32 separate ones ; shared memory tiling loading weight blocks into fast on-chip SRAM before accumulation vs hitting DRAM on every access ; warp divergence from the zero-weight skip a naive if/skip branch serializes the warp – a good kernel uses predicated execution or precomputed bitmasks ; and occupancy register pressure determines how many warps stay in flight to hide memory latency . The arithmetic is one clock cycle. Everything around it determines whether the GPU is actually executing that cycle or stalling. ⚠ Why Models with Weights ~1 GB Were Not Tested and even so with higher ctx/gen lengths All models in this benchmark have GGUF weights ≤ 958 MB. Larger models e.g. Gemma3-4B Q4 K M at 2.4 GB fail to load on JetPack R36.4.7 L4T 36.4 regardless of power mode or available memory.This is a known regression in the CUDA IOVA / NvMap contiguous-memory allocator introduced in this JetPack release - not a simple “out of RAM” failure. Root cause:On Jetson platforms, the CUDA driver allocates device-mapped memory through the NvMap kernel driver, which requires asingle contiguous blockin the IOVA I/O Virtual Address space. Unlike a general-purpose allocator that can stitch together scattered pages, NvMap must find one unbroken IOVA range large enough for the entire allocation in a single call. For a 2.4 GB GGUF like Gemma3-4B, that means requesting a contiguous block of roughly2.4 GB plus KV-cache and CUDA runtime overhead before any other tensor or buffer is placed in the address space.The allocation fails immediately with NvMapMemAllocInternalTagged: error 12 ENOMEM - errno 12 is ENOMEM , “not enough memory” in the contiguous-mapping sense, not the total-capacity sense. What this means in practice:Any GGUF model requiring more than roughly1.1 GBof contiguous CUDA buffers is blocked at load time on this JetPack version. This is why the benchmark is limited to models under ~1 GB GGUF size. Smaller models load fine because their contiguous IOVA requirement fits within what the fragmented address space can still provide. Affected platform:NVIDIA Jetson Orin Nano Super 8GB running JetPack R36.4.7 L4T 36.4.7 / Ubuntu 22.04 . The same board onJetPack 6.2.2 L4T 36.5 resolves this regression.