{"slug": "klongpy-pytorch-back-end-and-autograd", "title": "KlongPy: PyTorch Back End and Autograd", "summary": "KlongPy now supports a PyTorch backend that enables GPU acceleration and automatic differentiation for gradient-based computations. The torch backend outperforms NumPy by up to 8x on large arrays and provides exact gradients via the `:>` autograd operator, while the `∇` operator always uses numeric differentiation regardless of backend. Users can enable the PyTorch backend through the `--backend torch` flag at the command line or by setting `backend=\"torch\"` when creating a KlongInterpreter.", "body_md": "# PyTorch Backend and Autograd\n\nKlongPy supports multiple array backends. The PyTorch backend enables GPU acceleration and automatic differentiation (autograd) for gradient-based computations.\n\n## Enabling the PyTorch Backend\n\n### Command Line\n\n```\n# Use the --backend flag\nkgpy --backend torch\n\n# With GPU device selection\nkgpy --backend torch --device cuda\n```\n\n### Programmatically\n\n``` python\nfrom klongpy import KlongInterpreter\n\n# Create an interpreter with the torch backend\nklong = KlongInterpreter(backend=\"torch\")\nprint(klong._backend.name)  # 'torch'\n\n# With a specific device\nklong = KlongInterpreter(backend=\"torch\", device=\"cuda\")\n```\n\n## Backend Comparison\n\n| Feature | NumPy Backend | PyTorch Backend |\n|---|---|---|\n| Default | Yes | No (use `--backend torch`) |\n| Object dtype | Yes | No |\n| String operations | Yes | Not supported |\n| GPU acceleration | No | Yes (CUDA/MPS) |\n| Autograd | Numeric only | Native autograd |\n| Small array performance | Faster | Slightly slower |\n| Large array performance | Good | Better (especially on GPU) |\n\n## Performance\n\nThe torch backend excels with large arrays:\n\n```\nBenchmark              NumPy      Torch      Winner\n---------------------------------------------------------\nvector_add_100K        0.04ms     0.08ms     NumPy (2x)\nvector_add_1M          0.36ms     0.07ms     Torch (5x)\ncompound_expr_1M       0.61ms     0.07ms     Torch (8x)\ngrade_up_100K          0.59ms     0.19ms     Torch (3x)\n```\n\nFor small arrays (around 100K elements or fewer), NumPy can be faster because of its lower dispatch overhead; in the benchmark above it wins `vector_add_100K` by 2x, although torch still wins `grade_up_100K`. For larger arrays, torch wins significantly (5x to 8x at 1M elements).\n\n## Automatic Differentiation\n\nKlongPy provides several gradient and differentiation operators, described below.\n\n### Typing Special Characters\n\n| Symbol | Name | Mac | Windows |\n|---|---|---|---|\n| `∇` | Nabla | Character Viewer (Ctrl+Cmd+Space) | Alt+8711 |\n| `∂` | Partial | Option + d | Alt+8706 |\n\nOn Mac, `∂` can be typed directly with **Option + d**. For `∇`, use the Character Viewer or copy and paste it.\n\n### `:>` Autograd Operator (Recommended)\n\nThe `:>` operator uses PyTorch autograd for exact gradients:\n\n```\nf::{x^2}         :\" Define f(x) = x^2\nf:>3             :\" Compute f'(3) = 6.0\n```\n\nThe syntax is `function:>point`, where:\n- `function` is a scalar-valued function (it must return a single number)\n- `point` is the input at which to compute the gradient\n\n### `∇` Numeric Gradient Operator\n\nThe `∇` operator **always** uses numeric differentiation (finite differences), regardless of backend:\n\n```\nf::{x^2}         :\" Define f(x) = x^2\n3∇f              :\" Compute f'(3) ≈ 6.0\n```\n\nThe syntax is `point∇function` (note the reversed argument order compared with `:>`).\n\n### How They Work\n\n| Operator | Method | Precision | Speed |\n|---|---|---|---|\n| `:>` with torch | PyTorch autograd | Exact | Fast |\n| `:>` without torch | Numeric | ~1e-6 error | Slower |\n| `∇` (any backend) | Always numeric | ~1e-6 error | Slower |\n\nWith the torch backend (`--backend torch` or 
`backend='torch'`), prefer `:>` for:\n- Exact gradients (no finite-difference approximation error)\n- Complex computational graphs\n- Better performance on large arrays\n\n### Examples\n\n**Scalar function:**\n\n```\nf::{x^3}          :\" f(x) = x^3\nf:>2              :\" f'(2) = 3*4 = 12.0\n```\n\n**Polynomial:**\n\n```\np::{((3*x^4)-(2*x^2))+x}   :\" p(x) = 3x^4 - 2x^2 + x\np:>1                        :\" p'(1) = 12 - 4 + 1 = 9.0\n```\n\n**Vector function (sum of squares):**\n\n```\ng::{+/x^2}             :\" g(x) = sum(x_i^2)\ng:>[1.0 2.0 3.0]       :\" [2 4 6] = 2*x\n```\n\n**Gradient descent:**\n\n```\nf::{x^2}\nx::5.0\nlr::0.1\n\n:\" Update rule: x = x - lr * grad\nx::x-(lr*f:>x)\n```\n\n### Multi-Parameter Gradients\n\nCompute gradients for multiple parameters simultaneously using a list of symbols:\n\n```\nw::2.0\nb::3.0\nloss::{(w^2)+(b^2)}\n\n:\" Compute gradients for both w and b\ngrads::loss:>[w b]    :\" [4.0 6.0] = [2w, 2b]\n```\n\nThis is especially useful for neural network training:\n\n```\nw::1.0\nb::0.0\nX::[1 2 3]\nY::[3 5 7]\n\n:\" MSE loss\nloss::{(+/((w*X)+b-Y)^2)%3}\n\n:\" Compute both gradients in one call\ngrads::loss:>[w b]\n```\n\n### Jacobian Computation\n\nCompute the Jacobian matrix (the matrix of partial derivatives) using the `∂` operator or the `.jacobian()` function:\n\n```\nf::{x^2}                 :\" Element-wise square\n\n:\" Using the ∂ operator (point∂function)\n[1 2]∂f                  :\" [[2 0] [0 4]] diagonal matrix\n\n:\" Using the .jacobian() function\n.jacobian(f;[1 2])       :\" Same result\n```\n\nFor a vector-valued function f: R^n -> R^m, the Jacobian is an m x n matrix with entries J[i,j] = ∂f_i/∂x_j.\n\n### Multi-Parameter Jacobians\n\nJust like gradients, you can compute Jacobians with respect to multiple parameters using a list of symbols:\n\n```\nw::[1.0 2.0]\nb::[3.0 4.0]\nf::{w^2}                 :\" Returns [w0^2, w1^2]\n\n:\" Compute Jacobians for both w and b\njacobians::[w b]∂f       :\" Returns [J_w, J_b]\n```\n\nThis returns a list of Jacobian matrices, one per parameter. This is useful for analyzing how vector-valued functions depend on multiple parameter sets.\n\n### Custom Optimizers\n\nKlongPy provides the gradient primitives (`:>`, `∂`, `.jacobian()`). For optimizers, use the example classes in `examples/autograd/optimizers.py`, which you can copy to your project and customize.\n\n**Manual gradient descent (no optimizer needed):**\n\n```\nw::10.0\nloss::{w^2}\nlr::0.1\n\n:\" Update rule: w = w - lr * gradient\n{w::w-(lr*loss:>w)}'!50\nw                        :\" Close to 0\n```\n\n**Using a custom optimizer class:**\n\n1. Copy `examples/autograd/optimizers.py` to your project directory.\n2. Import it with `.pyf()`:\n\n```\n:\" Import the optimizer class\n.pyf(\"optimizers\";\"SGDOptimizer\")\n\n:\" Set up parameters and loss\nw::10.0\nloss::{w^2}\n\n:\" Create an optimizer with learning rate 0.1\nopt::SGDOptimizer(klong;[\"w\"];:{[\"lr\" 0.1]})\n\n:\" Run optimization steps\n{opt(loss)}'!50\nw                        :\" Close to 0\n```\n\n**Available example optimizers:**\n- `SGDOptimizer` - Stochastic Gradient Descent with optional momentum\n- `AdamOptimizer` - Adam optimizer with adaptive learning rates\n\n**SGD with momentum:**\n\n```\n.pyf(\"optimizers\";\"SGDOptimizer\")\nopt::SGDOptimizer(klong;[\"w\"];:{[\"lr\" 0.01 \"momentum\" 0.9]})\n```\n\n**Adam optimizer:**\n\n```\n.pyf(\"optimizers\";\"AdamOptimizer\")\nopt::AdamOptimizer(klong;[\"w\" \"b\"];:{[\"lr\" 0.001]})\n```\n\n**Training loop example:**\n\n```\n.pyf(\"optimizers\";\"AdamOptimizer\")\n\nw::1.0;b::0.0\nX::[1 2 3];Y::[3 5 7]\nloss::{(+/((w*X)+b-Y)^2)%3}\nopt::AdamOptimizer(klong;[\"w\" \"b\"];:{[\"lr\" 0.1]})\n\n:\" Train for 500 steps\n{opt(loss)}'!500\n```\n\n**Creating your own optimizer:**\n\nThe example optimizers use `multi_grad_of_fn` from 
`klongpy.autograd` to compute gradients for multiple parameters. Copy and modify the optimizer classes to implement custom update rules (RMSprop, AdaGrad, learning rate schedules, etc.).\n\n## GPU Acceleration\n\nWhen CUDA or Apple MPS is available, tensors are placed on the GPU automatically:\n\n``` python\nfrom klongpy import KlongInterpreter\n\nklong = KlongInterpreter(backend='torch')\nprint(klong._backend.device)  # 'cuda:0', 'mps:0', or 'cpu'\n```\n\n### Device Selection\n\nThe backend automatically selects the best available device, in this order:\n1. CUDA (NVIDIA GPU), if available\n2. MPS (Apple Silicon), if available\n3. CPU, as the fallback\n\n### MPS Limitations\n\nApple's MPS backend has some limitations:\n- No float64 support (float32 is used instead)\n- Some operations fall back to the CPU\n\n## Mixing with Python\n\nAccess torch tensors directly:\n\n``` python\nfrom klongpy import KlongInterpreter\n\nklong = KlongInterpreter(backend='torch')\n\n# KlongPy operations return torch tensors\nresult = klong('2*1+!1000000')\nprint(type(result))  # <class 'torch.Tensor'>\nprint(result.device)  # cuda:0, mps:0, or cpu\n\n# Convert to a NumPy array when needed (move the tensor to the CPU first)\nnp_result = result.cpu().numpy()\n```\n\n## Best Practices\n\n- **Use torch for large computations**: Switch to the torch backend for arrays larger than about 100K elements.\n- **Keep data as tensors**: Avoid unnecessary conversions between NumPy and torch.\n- **Batch operations**: Combine operations to minimize dispatch overhead.\n- **Use autograd for gradients**: Native autograd is faster and more accurate than numeric differentiation.\n\n## Function Compilation\n\nThe torch backend supports compiling Klong functions for optimized execution using `torch.compile`.\n\n### `.compile(fn;input)` - Compile Function\n\nCompiles a function for faster 
execution:\n\n```\nf::{x^2}\ncf::.compile(f;3.0)      :\" Returns a compiled function\ncf(5.0)                   :\" 25.0 (optimized)\n```\n\nThe compiled function can run significantly faster for complex computations.\n\n### `.export(fn;input;path)` - Export Computation Graph\n\nExports the function's computation graph to a file for inspection:\n\n```\nf::{(x^3)+(2*x^2)+x}\ninfo::.export(f;2.0;\"model.pt2\")\n.p(info@\"graph\")         :\" Print the computation graph\n```\n\nReturns a dictionary with:\n- `\"compiled_fn\"` - The compiled function\n- `\"export_path\"` - Path where the graph was saved\n- `\"graph\"` - String representation of the computation graph\n\nThe exported `.pt2` file can be loaded in Python with `torch.export.load()`.\n\n### `.compilex(fn;input;options)` - Extended Compilation\n\nCompile with advanced options for mode and backend:\n\n```\nf::{x^2}\n\n:\" Fast compilation for development\ncf::.compilex(f;3.0;:{[\"mode\" \"reduce-overhead\"]})\n\n:\" Maximum optimization for production\ncf::.compilex(f;3.0;:{[\"mode\" \"max-autotune\"]})\n\n:\" Debug mode (no compilation)\ncf::.compilex(f;3.0;:{[\"backend\" \"eager\"]})\n```\n\n**Options dictionary:**\n- `\"mode\"` - Compilation mode (see the mode table below)\n- `\"backend\"` - Compilation backend (see the backend table below)\n- `\"fullgraph\"` - Set to 1 to require full-graph compilation\n- `\"dynamic\"` - Set to 1 for dynamic shapes, 0 for static shapes\n\n### `.cmodes()` - Query Compilation Modes\n\nGet information about available modes and backends:\n\n```\ninfo::.cmodes()\n.p(info@\"modes\")          :\" Available compilation modes\n.p(info@\"backends\")       :\" Available backends\n.p(info@\"recommendations\") :\" Suggested settings\n```\n\n### Compilation Mode Comparison\n\n| Mode | Compile Time | Runtime Speed | Best For |\n|---|---|---|---|\n| `default` | Medium | Good | General use |\n| `reduce-overhead` | Fast | Moderate | Development/testing |\n| `max-autotune` | Slow | Best | Production |\n\n### Compilation Backend Comparison\n\n| Backend | Description |\n|---|---|\n| `inductor` | Default; C++/Triton code generation (fastest) |\n| `eager` | No compilation; runs the original Python (debugging) |\n| `aot_eager` | Ahead-of-time eager (debugging + autograd) |\n| `cudagraphs` | CUDA graphs; reduces GPU kernel launch overhead |\n\n**Note:** Compilation requires a C++ compiler on your system. Use `\"backend\" \"eager\"` to bypass compilation for debugging. If compilation fails, the error message indicates the cause.\n\n## Gradient Verification\n\nUse `.gradcheck()` to verify that autograd gradients are correct.\n\n### `.gradcheck(fn;inputs)` - Verify Gradients\n\nVerifies autograd gradients against numeric gradients:\n\n```\nf::{x^2}\n.gradcheck(f;3.0)        :\" Returns 1 if correct\n\ng::{+/x^2}\n.gradcheck(g;[1.0 2.0 3.0])  :\" Returns 1\n```\n\nThis uses `torch.autograd.gradcheck` internally for rigorous verification.\n\n**Use cases:**\n- Verifying custom gradient implementations\n- Debugging gradient computation issues\n- Ensuring numerical stability\n\n## Troubleshooting\n\n### \"PyTorch backend does not support object dtype\"\n\nThe torch backend cannot handle mixed-type arrays or nested structures with varying shapes. Use the NumPy backend for these cases.\n\n### MPS float64 errors\n\nMPS doesn't support float64. The backend automatically converts to float32, but some precision-sensitive operations may behave differently.\n\n### Slow small array operations\n\nFor small arrays (around 100K elements or fewer), NumPy may be faster. 
Consider using the NumPy backend for small-array workloads, or batching small operations together.\n\n### torch.compile errors\n\nIf `.compile()` fails with C++ errors, make sure you have:\n- A C++ compiler installed (clang++ or g++)\n- The required header files (on macOS, this may require the Xcode Command Line Tools)", "url": "https://wpnews.pro/news/klongpy-pytorch-back-end-and-autograd", "canonical_source": "http://www.klongpy.org/torch_backend/", "published_at": "2026-05-26 12:51:00+00:00", "updated_at": "2026-05-26 13:10:04.022181+00:00", "lang": "en", "topics": ["machine-learning", "neural-networks", "ai-tools", "ai-infrastructure", "ai-chips"], "entities": ["KlongPy", "PyTorch", "NumPy", "CUDA", "MPS", "GPU"], "alternates": {"html": "https://wpnews.pro/news/klongpy-pytorch-back-end-and-autograd", "markdown": "https://wpnews.pro/news/klongpy-pytorch-back-end-and-autograd.md", "text": "https://wpnews.pro/news/klongpy-pytorch-back-end-and-autograd.txt", "jsonld": "https://wpnews.pro/news/klongpy-pytorch-back-end-and-autograd.jsonld"}}