# KlongPy: PyTorch Back End and Autograd

> Source: <http://www.klongpy.org/torch_backend/>
> Published: 2026-05-26 12:51:00+00:00

# PyTorch Backend and Autograd[¶](#pytorch-backend-and-autograd)

KlongPy supports multiple array backends. The PyTorch backend enables GPU acceleration and automatic differentiation (autograd) for gradient-based computations.

## Enabling the PyTorch Backend[¶](#enabling-the-pytorch-backend)

### Command Line[¶](#command-line)

```
# Use --backend flag
kgpy --backend torch

# With GPU device selection
kgpy --backend torch --device cuda
```

### Programmatically[¶](#programmatically)

``` python
from klongpy import KlongInterpreter

# Create interpreter with torch backend
klong = KlongInterpreter(backend="torch")
print(klong._backend.name)  # 'torch'

# With specific device
klong = KlongInterpreter(backend="torch", device="cuda")
```

## Backend Comparison[¶](#backend-comparison)

| Feature | NumPy Backend | PyTorch Backend |
|---|---|---|
| Default | Yes | No (use `--backend torch` ) |
| Object dtype | Yes | No |
| String operations | Yes | Not supported |
| GPU acceleration | No | Yes (CUDA/MPS) |
| Autograd | Numeric only | Native autograd |
| Small array performance | Faster | Slightly slower |
| Large array performance | Good | Better (especially on GPU) |

## Performance[¶](#performance)

The torch backend excels with large arrays:

```
Benchmark              NumPy      Torch      Winner
---------------------------------------------------------
vector_add_100K        0.04ms     0.08ms     NumPy (2x)
vector_add_1M          0.36ms     0.07ms     Torch (5x)
compound_expr_1M       0.61ms     0.07ms     Torch (8x)
grade_up_100K          0.59ms     0.19ms     Torch (3x)
```

For small arrays (<100K elements), NumPy is slightly faster due to lower dispatch overhead. For larger arrays, torch wins significantly.

## Automatic Differentiation[¶](#automatic-differentiation)

KlongPy provides several gradient and differentiation operators:

### Typing Special Characters[¶](#typing-special-characters)

| Symbol | Name | Mac | Windows |
|---|---|---|---|
`∇` |
Nabla | Character Viewer (Ctrl+Cmd+Space) | Alt+8711 |
`∂` |
Partial | Option + d |
Alt+8706 |

On Mac, `∂`

can be typed directly with **Option + d**. For `∇`

, use the Character Viewer or copy-paste.

`:>`

Autograd Operator (Recommended)[¶](#autograd-operator-recommended)

The `:>`

operator uses PyTorch autograd for exact gradients:

```
f::{x^2}         :" Define f(x) = x^2
f:>3             :" Compute f'(3) = 6.0
```

The syntax is `function:>point`

where:
- `function`

is a scalar-valued function (must return a single number)
- `point`

is the input at which to compute the gradient

`∇`

Numeric Gradient Operator[¶](#numeric-gradient-operator)

The `∇`

operator **always** uses numeric differentiation (finite differences), regardless of backend:

```
f::{x^2}         :" Define f(x) = x^2
3∇f              :" Compute f'(3) ≈ 6.0
```

The syntax is `point∇function`

(note: reversed order from `:>`

).

### How They Work[¶](#how-they-work)

| Operator | Method | Precision | Speed |
|---|---|---|---|
`:>` with torch |
PyTorch autograd | Exact | Fast |
`:>` without torch |
Numeric | ~1e-6 error | Slower |
`∇` (any backend) |
Always numeric | ~1e-6 error | Slower |

With the torch backend (`--backend torch`

or `backend='torch'`

), prefer `:>`

for:
- Exact gradients (no floating-point approximation error)
- Complex computational graphs
- Better performance on large arrays

### Examples[¶](#examples)

**Scalar function:**

```
f::{x^3}          :" f(x) = x^3
f:>2              :" f'(2) = 3*4 = 12.0
```

**Polynomial:**

```
p::{((3*x^4)-(2*x^2))+x}   :" p(x) = 3x^4 - 2x^2 + x
p:>1                        :" p'(1) = 12 - 4 + 1 = 9.0
```

**Vector function (sum of squares):**

```
g::{+/x^2}             :" g(x) = sum(x_i^2)
g:>[1.0 2.0 3.0]       :" [2 4 6] = 2*x
```

**Gradient descent:**

```
f::{x^2}
x::5.0
lr::0.1

:" Update rule: x = x - lr * grad
x::x-(lr*f:>x)
```

### Multi-Parameter Gradients[¶](#multi-parameter-gradients)

Compute gradients for multiple parameters simultaneously using a list of symbols:

```
w::2.0
b::3.0
loss::{(w^2)+(b^2)}

:" Compute gradients for both w and b
grads::loss:>[w b]    :" [4.0 6.0] = [2w, 2b]
```

This is especially useful for neural network training:

```
w::1.0
b::0.0
X::[1 2 3]
Y::[3 5 7]

:" MSE loss
loss::{(+/((w*X)+b-Y)^2)%3}

:" Compute both gradients in one call
grads::loss:>[w b]
```

### Jacobian Computation[¶](#jacobian-computation)

Compute the Jacobian matrix (matrix of partial derivatives) using the `∂`

operator or `.jacobian()`

function:

```
f::{x^2}                 :" Element-wise square

:" Using ∂ operator (point∂function)
[1 2]∂f                  :" [[2 0] [0 4]] diagonal matrix

:" Using .jacobian() function
.jacobian(f;[1 2])       :" Same result
```

For vector-valued functions f: R^n -> R^m, the Jacobian is an m x n matrix where J[i,j] = df_i/dx_j.

### Multi-Parameter Jacobians[¶](#multi-parameter-jacobians)

Just like gradients, you can compute Jacobians with respect to multiple parameters using a list of symbols:

```
w::[1.0 2.0]
b::[3.0 4.0]
f::{w^2}                 :" Returns [w0^2, w1^2]

:" Compute Jacobians for both w and b
jacobians::[w b]∂f       :" Returns [J_w, J_b]
```

This returns a list of Jacobian matrices, one per parameter. Useful for analyzing how vector-valued functions depend on multiple parameter sets.

### Custom Optimizers[¶](#custom-optimizers)

KlongPy provides the gradient primitives (`:>`

, `∂`

, `.jacobian()`

). For optimizers, use the example classes in `examples/autograd/optimizers.py`

which you can copy to your project and customize.

**Manual gradient descent (no optimizer needed):**

```
w::10.0
loss::{w^2}
lr::0.1

:" Update rule: w = w - lr * gradient
{w::w-(lr*loss:>w)}'!50
w                        :" Close to 0
```

**Using a custom optimizer class:**

- Copy
`examples/autograd/optimizers.py`

to your project directory - Import with
`.pyf()`

:

```
:" Import the optimizer class
.pyf("optimizers";"SGDOptimizer")

:" Setup parameters and loss
w::10.0
loss::{w^2}

:" Create optimizer with learning rate 0.1
opt::SGDOptimizer(klong;["w"];:{["lr" 0.1]})

:" Run optimization steps
{opt(loss)}'!50
w                        :" Close to 0
```

**Available example optimizers:**
- `SGDOptimizer`

- Stochastic Gradient Descent with optional momentum
- `AdamOptimizer`

- Adam optimizer with adaptive learning rates

**SGD with momentum:**

```
.pyf("optimizers";"SGDOptimizer")
opt::SGDOptimizer(klong;["w"];:{["lr" 0.01 "momentum" 0.9]})
```

**Adam optimizer:**

```
.pyf("optimizers";"AdamOptimizer")
opt::AdamOptimizer(klong;["w" "b"];:{["lr" 0.001]})
```

**Training loop example:**

```
.pyf("optimizers";"AdamOptimizer")

w::1.0;b::0.0
X::[1 2 3];Y::[3 5 7]
loss::{(+/((w*X)+b-Y)^2)%3}
opt::AdamOptimizer(klong;["w" "b"];:{["lr" 0.1]})

:" Train for 500 steps
{opt(loss)}'!500
```

**Creating your own optimizer:**

The example optimizers use `multi_grad_of_fn`

from `klongpy.autograd`

to compute gradients for multiple parameters. Copy and modify the optimizer classes to implement custom update rules (RMSprop, AdaGrad, learning rate schedules, etc.).

## GPU Acceleration[¶](#gpu-acceleration)

When CUDA or Apple MPS is available, tensors automatically use GPU:

``` python
from klongpy import KlongInterpreter

klong = KlongInterpreter(backend='torch')
print(klong._backend.device)  # 'cuda:0', 'mps:0', or 'cpu'
```

### Device Selection[¶](#device-selection)

The backend automatically selects the best available device: 1. CUDA (NVIDIA GPU) - if available 2. MPS (Apple Silicon) - if available 3. CPU - fallback

### MPS Limitations[¶](#mps-limitations)

Apple's MPS backend has some limitations: - No float64 support (uses float32) - Some operations fall back to CPU

## Mixing with Python[¶](#mixing-with-python)

Access torch tensors directly:

``` python
from klongpy import KlongInterpreter

klong = KlongInterpreter(backend='torch')

# KlongPy operations return torch tensors
result = klong('2*1+!1000000')
print(type(result))  # <class 'torch.Tensor'>
print(result.device)  # cuda:0, mps:0, or cpu

# Convert to numpy when needed
import numpy as np
np_result = result.cpu().numpy()
```

## Best Practices[¶](#best-practices)

-
**Use torch for large computations**: Switch to torch backend for arrays >100K elements -
**Keep data as tensors**: Avoid unnecessary conversions between numpy and torch -
**Batch operations**: Combine operations to minimize dispatch overhead -
**Use autograd for gradients**: Native autograd is faster and more accurate than numeric differentiation

## Function Compilation[¶](#function-compilation)

The torch backend supports compiling Klong functions for optimized execution using `torch.compile`

:

`.compile(fn;input)`

- Compile Function[¶](#compilefninput-compile-function)

Compiles a function for faster execution:

```
f::{x^2}
cf::.compile(f;3.0)      :" Returns compiled function
cf(5.0)                   :" 25.0 (optimized)
```

The compiled function runs significantly faster for complex computations.

`.export(fn;input;path)`

- Export Computation Graph[¶](#exportfninputpath-export-computation-graph)

Exports the function's computation graph to a file for inspection:

```
f::{(x^3)+(2*x^2)+x}
info::.export(f;2.0;"model.pt2")
.p(info@"graph")         :" Print computation graph
```

Returns a dictionary with:
- `"compiled_fn"`

- The compiled function
- `"export_path"`

- Path where graph was saved
- `"graph"`

- String representation of computation graph

The exported `.pt2`

file can be loaded with `torch.export.load()`

in Python.

`.compilex(fn;input;options)`

- Extended Compilation[¶](#compilexfninputoptions-extended-compilation)

Compile with advanced options for mode and backend:

```
f::{x^2}

:" Fast compilation for development
cf::.compilex(f;3.0;:{["mode" "reduce-overhead"]})

:" Maximum optimization for production
cf::.compilex(f;3.0;:{["mode" "max-autotune"]})

:" Debug mode (no compilation)
cf::.compilex(f;3.0;:{["backend" "eager"]})
```

**Options dictionary:**
- `"mode"`

- Compilation mode (see table below)
- `"backend"`

- Compilation backend (see table below)
- `"fullgraph"`

- Set to 1 to require full graph compilation
- `"dynamic"`

- Set to 1 for dynamic shapes, 0 for static

`.cmodes()`

- Query Compilation Modes[¶](#cmodes-query-compilation-modes)

Get information about available modes and backends:

```
info::.cmodes()
.p(info@"modes")          :" Available compilation modes
.p(info@"backends")       :" Available backends
.p(info@"recommendations") :" Suggested settings
```

### Compilation Mode Comparison[¶](#compilation-mode-comparison)

| Mode | Compile Time | Runtime Speed | Best For |
|---|---|---|---|
`default` |
Medium | Good | General use |
`reduce-overhead` |
Fast | Moderate | Development/testing |
`max-autotune` |
Slow | Best | Production |

### Backend Comparison[¶](#backend-comparison_1)

| Backend | Description |
|---|---|
`inductor` |
Default - C++/Triton code generation (fastest) |
`eager` |
No compilation - runs original Python (debugging) |
`aot_eager` |
Ahead-of-time eager (debugging + autograd) |
`cudagraphs` |
CUDA graphs - reduces GPU kernel launch overhead |

**Note:** Compilation requires a C++ compiler on your system. Use `"backend" "eager"`

to bypass compilation for debugging. If compilation fails, an error message will indicate the issue.

## Gradient Verification[¶](#gradient-verification)

Use `.gradcheck()`

to verify that autograd gradients are correct:

`.gradcheck(fn;inputs)`

- Verify Gradients[¶](#gradcheckfninputs-verify-gradients)

Verifies autograd gradients against numeric gradients:

```
f::{x^2}
.gradcheck(f;3.0)        :" Returns 1 if correct

g::{+/x^2}
.gradcheck(g;[1.0 2.0 3.0])  :" Returns 1
```

This uses `torch.autograd.gradcheck`

internally for rigorous verification.

**Use cases:**
- Verifying custom gradient implementations
- Debugging gradient computation issues
- Ensuring numerical stability

## Troubleshooting[¶](#troubleshooting)

### "PyTorch backend does not support object dtype"[¶](#pytorch-backend-does-not-support-object-dtype)

The torch backend cannot handle mixed-type arrays or nested structures with varying shapes. Use the numpy backend for these cases.

### MPS float64 errors[¶](#mps-float64-errors)

MPS doesn't support float64. The backend automatically converts to float32, but some precision-sensitive operations may behave differently.

### Slow small array operations[¶](#slow-small-array-operations)

For arrays <10K elements, numpy may be faster. Consider using numpy backend for small array workloads or batching small operations together.

### torch.compile errors[¶](#torchcompile-errors)

If `.compile()`

fails with C++ errors, ensure you have:
- A C++ compiler installed (clang++ or g++)
- The required header files (may need Xcode Command Line Tools on macOS)