cd /news/machine-learning/klongpy-pytorch-back-end-and-autogra… · home topics machine-learning article
[ARTICLE · art-14429] src=klongpy.org pub= topic=machine-learning verified=true sentiment=· neutral

KlongPy: PyTorch Back End and Autograd

KlongPy now supports a PyTorch backend that enables GPU acceleration and automatic differentiation for gradient-based computations. The torch backend outperforms NumPy by up to 8x on large arrays and provides exact gradients via the `:>` autograd operator, while the `∇` operator always uses numeric differentiation regardless of backend. Users can enable the PyTorch backend through the `--backend torch` flag at the command line or by setting `backend="torch"` when creating a KlongInterpreter.

read8 min publishedMay 26, 2026

KlongPy supports multiple array backends. The PyTorch backend enables GPU acceleration and automatic differentiation (autograd) for gradient-based computations.

Enabling the PyTorch Backend #

Command Line

kgpy --backend torch

kgpy --backend torch --device cuda

Programmatically

from klongpy import KlongInterpreter

klong = KlongInterpreter(backend="torch")
print(klong._backend.name)  # 'torch'

klong = KlongInterpreter(backend="torch", device="cuda")

Backend Comparison #

Feature NumPy Backend PyTorch Backend
Default Yes No (use --backend torch )
Object dtype Yes No
String operations Yes Not supported
GPU acceleration No Yes (CUDA/MPS)
Autograd Numeric only Native autograd
Small array performance Faster Slightly slower
Large array performance Good Better (especially on GPU)

Performance #

The torch backend excels with large arrays:

Benchmark              NumPy      Torch      Winner
---------------------------------------------------------
vector_add_100K        0.04ms     0.08ms     NumPy (2x)
vector_add_1M          0.36ms     0.07ms     Torch (5x)
compound_expr_1M       0.61ms     0.07ms     Torch (8x)
grade_up_100K          0.59ms     0.19ms     Torch (3x)

For small arrays (<100K elements), NumPy is slightly faster due to lower dispatch overhead. For larger arrays, torch wins significantly.

Automatic Differentiation #

KlongPy provides several gradient and differentiation operators:

Typing Special Characters

Symbol Name Mac Windows
Nabla Character Viewer (Ctrl+Cmd+Space) Alt+8711
Partial Option + d
Alt+8706

On Mac,

can be typed directly with Option + d. For

, use the Character Viewer or copy-paste.

:>

Autograd Operator (Recommended)

The :>

operator uses PyTorch autograd for exact gradients:

f::{x^2}         :" Define f(x) = x^2
f:>3             :" Compute f'(3) = 6.0

The syntax is function:>point

where:

  • function

is a scalar-valued function (must return a single number)

  • point

is the input at which to compute the gradient

Numeric Gradient Operator

The

operator always uses numeric differentiation (finite differences), regardless of backend:

f::{x^2}         :" Define f(x) = x^2
3∇f              :" Compute f'(3) ≈ 6.0

The syntax is point∇function

(note: reversed order from :>

).

How They Work

Operator Method Precision Speed
:> with torch
PyTorch autograd Exact Fast
:> without torch
Numeric ~1e-6 error Slower
(any backend)
Always numeric ~1e-6 error Slower

With the torch backend (--backend torch

or backend='torch'

), prefer :>

for:

  • Exact gradients (no floating-point approximation error)
  • Complex computational graphs
  • Better performance on large arrays

Examples

Scalar function:

f::{x^3}          :" f(x) = x^3
f:>2              :" f'(2) = 3*4 = 12.0

Polynomial:

p::{((3*x^4)-(2*x^2))+x}   :" p(x) = 3x^4 - 2x^2 + x
p:>1                        :" p'(1) = 12 - 4 + 1 = 9.0

Vector function (sum of squares):

g::{+/x^2}             :" g(x) = sum(x_i^2)
g:>[1.0 2.0 3.0]       :" [2 4 6] = 2*x

Gradient descent:

f::{x^2}
x::5.0
lr::0.1

:" Update rule: x = x - lr * grad
x::x-(lr*f:>x)

Multi-Parameter Gradients

Compute gradients for multiple parameters simultaneously using a list of symbols:

w::2.0
b::3.0
loss::{(w^2)+(b^2)}

:" Compute gradients for both w and b
grads::loss:>[w b]    :" [4.0 6.0] = [2w, 2b]

This is especially useful for neural network training:

w::1.0
b::0.0
X::[1 2 3]
Y::[3 5 7]

:" MSE loss
loss::{(+/((w*X)+b-Y)^2)%3}

:" Compute both gradients in one call
grads::loss:>[w b]

Jacobian Computation

Compute the Jacobian matrix (matrix of partial derivatives) using the

operator or .jacobian()

function:

f::{x^2}                 :" Element-wise square

:" Using ∂ operator (point∂function)
[1 2]∂f                  :" [[2 0] [0 4]] diagonal matrix

:" Using .jacobian() function
.jacobian(f;[1 2])       :" Same result

For vector-valued functions f: R^n -> R^m, the Jacobian is an m x n matrix where J[i,j] = df_i/dx_j.

Multi-Parameter Jacobians

Just like gradients, you can compute Jacobians with respect to multiple parameters using a list of symbols:

w::[1.0 2.0]
b::[3.0 4.0]
f::{w^2}                 :" Returns [w0^2, w1^2]

:" Compute Jacobians for both w and b
jacobians::[w b]∂f       :" Returns [J_w, J_b]

This returns a list of Jacobian matrices, one per parameter. Useful for analyzing how vector-valued functions depend on multiple parameter sets.

Custom Optimizers

KlongPy provides the gradient primitives (:>

,

, .jacobian()

). For optimizers, use the example classes in examples/autograd/optimizers.py

which you can copy to your project and customize.

Manual gradient descent (no optimizer needed):

w::10.0
loss::{w^2}
lr::0.1

:" Update rule: w = w - lr * gradient
{w::w-(lr*loss:>w)}'!50
w                        :" Close to 0

Using a custom optimizer class:

  • Copy examples/autograd/optimizers.py

to your project directory - Import with .pyf()

:

:" Import the optimizer class
.pyf("optimizers";"SGDOptimizer")

:" Setup parameters and loss
w::10.0
loss::{w^2}

:" Create optimizer with learning rate 0.1
opt::SGDOptimizer(klong;["w"];:{["lr" 0.1]})

:" Run optimization steps
{opt(loss)}'!50
w                        :" Close to 0

Available example optimizers:

  • SGDOptimizer

  • Stochastic Gradient Descent with optional momentum

  • AdamOptimizer

  • Adam optimizer with adaptive learning rates

SGD with momentum:

.pyf("optimizers";"SGDOptimizer")
opt::SGDOptimizer(klong;["w"];:{["lr" 0.01 "momentum" 0.9]})

Adam optimizer:

.pyf("optimizers";"AdamOptimizer")
opt::AdamOptimizer(klong;["w" "b"];:{["lr" 0.001]})

Training loop example:

.pyf("optimizers";"AdamOptimizer")

w::1.0;b::0.0
X::[1 2 3];Y::[3 5 7]
loss::{(+/((w*X)+b-Y)^2)%3}
opt::AdamOptimizer(klong;["w" "b"];:{["lr" 0.1]})

:" Train for 500 steps
{opt(loss)}'!500

Creating your own optimizer:

The example optimizers use multi_grad_of_fn

from klongpy.autograd

to compute gradients for multiple parameters. Copy and modify the optimizer classes to implement custom update rules (RMSprop, AdaGrad, learning rate schedules, etc.).

GPU Acceleration #

When CUDA or Apple MPS is available, tensors automatically use GPU:

from klongpy import KlongInterpreter

klong = KlongInterpreter(backend='torch')
print(klong._backend.device)  # 'cuda:0', 'mps:0', or 'cpu'

Device Selection

The backend automatically selects the best available device: 1. CUDA (NVIDIA GPU) - if available 2. MPS (Apple Silicon) - if available 3. CPU - fallback

MPS Limitations

Apple's MPS backend has some limitations: - No float64 support (uses float32) - Some operations fall back to CPU

Mixing with Python #

Access torch tensors directly:

from klongpy import KlongInterpreter

klong = KlongInterpreter(backend='torch')

result = klong('2*1+!1000000')
print(type(result))  # <class 'torch.Tensor'>
print(result.device)  # cuda:0, mps:0, or cpu

import numpy as np
np_result = result.cpu().numpy()

Best Practices #

Use torch for large computations: Switch to torch backend for arrays >100K elements - Keep data as tensors: Avoid unnecessary conversions between numpy and torch - Batch operations: Combine operations to minimize dispatch overhead - Use autograd for gradients: Native autograd is faster and more accurate than numeric differentiation

Function Compilation #

The torch backend supports compiling Klong functions for optimized execution using torch.compile

:

.compile(fn;input)

  • Compile Function

Compiles a function for faster execution:

f::{x^2}
cf::.compile(f;3.0)      :" Returns compiled function
cf(5.0)                   :" 25.0 (optimized)

The compiled function runs significantly faster for complex computations.

.export(fn;input;path)

  • Export Computation Graph

Exports the function's computation graph to a file for inspection:

f::{(x^3)+(2*x^2)+x}
info::.export(f;2.0;"model.pt2")
.p(info@"graph")         :" Print computation graph

Returns a dictionary with:

  • "compiled_fn"

  • The compiled function

  • "export_path"

  • Path where graph was saved

  • "graph"

  • String representation of computation graph

The exported .pt2

file can be loaded with torch.export.load()

in Python.

.compilex(fn;input;options)

  • Extended Compilation

Compile with advanced options for mode and backend:

f::{x^2}

:" Fast compilation for development
cf::.compilex(f;3.0;:{["mode" "reduce-overhead"]})

:" Maximum optimization for production
cf::.compilex(f;3.0;:{["mode" "max-autotune"]})

:" Debug mode (no compilation)
cf::.compilex(f;3.0;:{["backend" "eager"]})

Options dictionary:

  • "mode"

  • Compilation mode (see table below)

  • "backend"

  • Compilation backend (see table below)

  • "fullgraph"

  • Set to 1 to require full graph compilation

  • "dynamic"

  • Set to 1 for dynamic shapes, 0 for static

.cmodes()

  • Query Compilation Modes

Get information about available modes and backends:

info::.cmodes()
.p(info@"modes")          :" Available compilation modes
.p(info@"backends")       :" Available backends
.p(info@"recommendations") :" Suggested settings

Compilation Mode Comparison

Mode Compile Time Runtime Speed Best For
default
Medium Good General use
reduce-overhead
Fast Moderate Development/testing
max-autotune
Slow Best Production

Backend Comparison

Backend Description
inductor
Default - C++/Triton code generation (fastest)
eager
No compilation - runs original Python (debugging)
aot_eager
Ahead-of-time eager (debugging + autograd)
cudagraphs
CUDA graphs - reduces GPU kernel launch overhead

Note: Compilation requires a C++ compiler on your system. Use "backend" "eager"

to bypass compilation for debugging. If compilation fails, an error message will indicate the issue.

Gradient Verification #

Use .gradcheck()

to verify that autograd gradients are correct:

.gradcheck(fn;inputs)

  • Verify Gradients

Verifies autograd gradients against numeric gradients:

f::{x^2}
.gradcheck(f;3.0)        :" Returns 1 if correct

g::{+/x^2}
.gradcheck(g;[1.0 2.0 3.0])  :" Returns 1

This uses torch.autograd.gradcheck

internally for rigorous verification.

Use cases:

  • Verifying custom gradient implementations
  • Debugging gradient computation issues
  • Ensuring numerical stability

Troubleshooting #

"PyTorch backend does not support object dtype"

The torch backend cannot handle mixed-type arrays or nested structures with varying shapes. Use the numpy backend for these cases.

MPS float64 errors

MPS doesn't support float64. The backend automatically converts to float32, but some precision-sensitive operations may behave differently.

Slow small array operations

For arrays <10K elements, numpy may be faster. Consider using numpy backend for small array workloads or batching small operations together.

torch.compile errors

If .compile()

fails with C++ errors, ensure you have:

  • A C++ compiler installed (clang++ or g++)
  • The required header files (may need Xcode Command Line Tools on macOS)
── more in #machine-learning 4 stories · sorted by recency
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/klongpy-pytorch-back…] indexed:0 read:8min 2026-05-26 ·