{"slug": "the-fastest-python-struct", "title": "The Fastest Python Struct?", "summary": "JP Hutchins benchmarks Python struct definition speed, finding that metaprogramming approaches like decorators and metaclasses incur higher upfront runtime costs than manual type definitions. The analysis focuses on compile-time performance for CLI tools and build systems, where startup time is critical.", "body_md": "[The Fastest Python Struct?](https://www.crumpledpaper.tech/2026-06-21-python-struct-profiling)\n\n[JP Hutchins](/)\n\n# The Fastest Python Struct?\n\nAll posts written without LLM assistance unless otherwise noted.\n\nPython is *fast enough*. Python programmers tend to understand the [Python Cost Model](https://ocw.mit.edu/courses/6-006-introduction-to-algorithms-fall-2011/pages/readings/python-cost-model/), Python’s strengths and weaknesses, libraries that give compiled performance, and when to use a compiled language from the start.\n\nSo why do I care? Why do I get obsessed enough to coerce Claude into running these\nbenchmarks and writing these Plotly charts? I do not know.[1](#user-content-fn-1)\n\nBut! I do know *what* I care about (for now) - and today (and some of the past weekend, and perhaps some of the next one),\nit’s definitely the ** cost of defining (ideally immutable) record types (AKA structs) in Python**.\n\nSo let’s get this out of the way: this write up is about benchmarking “Python type speed” (informally: compile-time), it is NOT about benchmarking\n\n- serialization\n- instantiation\n- attribute access\n- validation\n- memory\n\nRight, so that’s what Python programmers often care about, because they are probably working on long running programs, like apps, servers or pipelines, where the cost of defining a **type** is paid upfront, one time, whereas the cost of allocation, instantiation, validation, and serialization is paid repeatedly. 
So yeah, if that’s what you care about, this post is not for you.\n\nBut I did include [instance cost](#instance-cost) benchmarks if you’re curious. 😻\n\nIf you already know you care about type definition speed, then jump straight to the [analysis](#structs-under-test-suts), otherwise keep reading for my motivation and context on this subject.\n\n[#](#how-fast-to---help)“how fast to `--help`”\n\nI tend to work on CLIs for developers and tooling for build systems or test suites where the time from program start to end is what we’re measuring. Perhaps you’ve noticed that running a command from a CLI may be near-instant in a compiled program, but in Python, it can easily be hundreds of milliseconds: perceptible for UX, noticeable in CI/CD, and amplified by repeated calls as part of build system tooling.\n\nUnlike in a compiled language, Python type definitions are not free (free in the sense that they were paid for during compilation *ahem, Rust*). They are code to be executed on every startup. And that includes imports of libraries and their type trees and dependency trees. We’ll see in the benchmarks that (evil-runtime-) metaprogramming, like decorators, metaclasses, or worse, has more of an upfront *runtime type generation* cost than manual type definitions.\n\nCan we get the best of everything: a Pythonic type definition style, complete static typing and [match](https://docs.python.org/3/reference/datamodel.html#object.__match_args__), with the speed of a hand-written C struct, and the startup time of a compiled extension? I *think* so. *seriously, I’m not sure, need to do more work, but I have good preliminary data*\n\nBut why not use a compiled language and framework like Rust + [clap](https://docs.rs/clap/latest/clap/)? I certainly do, but what can I say? I love the Python ecosystem, build tooling libraries, and the rapidly evolving type system. 
And I believe that the type system can continue evolving so that we can offload a lot of the correctness to the type checker, and reap runtime speed benefits. That’s what this post is about.\n\n[#](#ok-ok-whatever-but-why-structs)OK, OK, whatever, but why “structs”?\n\nI’ll confess that I am an advocate of functional programming (FP), with little compromise. But the tortured kind, that can’t be bothered to learn Haskell, or study Lisp, and seems to end up rewriting the same handful of patterns in every language. So, it’s not the *structs alone* that I am after. It’s the **sum types and pattern matching**.\n\nLong story short, I use sum types and pattern matching everywhere, all the time, from Rust to embedded C, from TypeScript to Python, from JSON to CBOR. Even if you’re not an FP…enthusiast, you’ve likely used them in Python without thinking of them as such, when reaching for `MyType | None` (an `Option` or `Maybe` type).\n\nThis example imagines that some immutable device info burned onto a ROM is versioned V1 and V2. V1 guarantees presence of the serial number, but not the manufactured date. V2 guarantees both and adds a bootloader SHA.\n\n``` python\nfrom typing import NamedTuple\n\nclass DeviceInfoV1(NamedTuple):\n\tserial_number: str\n\tmanufactured_utc_ms: int | None\n\nclass DeviceInfoV2(NamedTuple):\n\tserial_number: str\n\tmanufactured_utc_ms: int\n\tbootloader_sha: int\n\n# PEP 695 type alias statement (requires Python 3.12+)\ntype DeviceInfo = DeviceInfoV1 | DeviceInfoV2\n```\n\n`DeviceInfo` is a *sum type* of two *product types*, `DeviceInfoV1` and `DeviceInfoV2`, and there are only two representable states, each validated by the type system, not at runtime. 
Here’s what the naive product type would look like:\n\n``` python\nfrom typing import NamedTuple\n\nclass DeviceInfo(NamedTuple):\n\tserial_number: str\n\tmanufactured_utc_ms: int | None\n\tbootloader_sha: int | None\n```\n\nInvalid runtime states are now possible: `DeviceInfo(serial_number=\"abc\", manufactured_utc_ms=None, bootloader_sha=123)` is a valid instance of the naive product type, but it is not a valid `DeviceInfoV1` or `DeviceInfoV2`.\n\nUsing a product type instead of a sum type shifts the burden of correctness from the type system to the runtime.\n\n[#](#aw-f)Aw, f&*#!\n\nI promised myself I wouldn’t evangelize FP (Day 583). 💀\n\nIt’s not really about FP, that just happens to be *my* motivation. There are plenty of different ways to utilize Algebraic Data Types (ADTs) in Python, and if you care about Python startup time, then I think you’ll enjoy these benchmark results.\n\nBesides, this *can’t* be about FP, because functional programmers don’t care about performance, memory, or know anything about compilers and instruction sets.\n\n“Functional programming, strictly defined, is dumb…the way you manage mutable state is by making an entire copy of the data structure with the changes in the new copy of the data structure…here’s the problem: computers, they’re all bags of mutable state.”\n\nChris Lattner, [Creator Of Swift On Functional Programming (YouTube)]\n\nOdd for the creator of LLVM, Clang, Swift, and Mojo to mischaracterize FP as anything other than an abstraction. I wasn’t aware of the “functional” instruction sets competing with x86 and ARM.\n\n[#](#wtf-are-we-testing-again)WTF are we testing again?\n\nI use `NamedTuple` all the time, mostly because it means I don’t have to add `@dataclass(frozen=True)` everywhere, but in the back of my mind I have always *believed* that `NamedTuple` must be super efficient and compact, like `const struct` in C or `struct` in Rust. 
Once I realized that I’d been carrying on with this belief for years, I decided to set up this benchmark to understand how much I was truly paying for my types.\n\n[#](#the-contenders)THE CONTENDERS\n\n*author’s commentary italicized to avoid bias\n\n- manual python slotted class: **“Native Final Slots”** *ewwwwwwwwwwww*\n- manual python slotted class (Brett Cannon’s manual `record-type`): **“Manual Record Type”** *oh god that’s even worse, this IS a waste of time, we’re going to turn Python into Java or something*\n- from Python’s standard library, `typing` module: `NamedTuple` *pewwwwww pew pew pewwwwwwoooo*\n- also from the standard library, `dataclasses`: `dataclass(frozen=True)` *boooooooooo metaprogramming suuuuuuuuuuucks…unless it’s rust’s bs…or constexpr…at least it’s not C macros…but booooooooooooooooooo*\n- from legendary Python core developer Br Br Bre Brett Cannon, iiiiiiiit’s `record-type` *a new hope*\n- from a 20 minute Claude hallucination that rips off `msgspec` and `record-type` *WITH ATTRIBUTION* wait, I’m calling it!?!?! `record-type (C)` *hey, 20 minutes is not bad, it usually costs me $200 to get slop CPython!*\n- weighing in at 11 years of development, the original 🎺 `attrs` *medieval horns, but in tune* 🎺\n- fast AF and only ~14.3% vowels iiiiiiiiiiiiiiit’s `msgspec` *what is JSON for anymore?*\n\n[#](#structs-under-test-suts)Structs Under Test (SUTs)\n\n| Implementation | Description |\n|---|---|\n| [native slots](#native-slots) | `Final`-annotated fields, the closest thing to a naive native record. |\n| [manual record](#manual-record) | `record-type` |\n| [NamedTuple](https://typing.python.org/en/latest/spec/namedtuples.html#named-tuples) | `typing` module |\n| [dataclass](https://docs.python.org/3/library/dataclasses.html) | `dataclasses` module |\n| [frozen dataclass](https://docs.python.org/3/library/dataclasses.html#frozen-instances) | `@dataclass(frozen=True)` |\n| [record-type](https://pypi.org/project/record-type/) | `record` type for Python |\n| [record-type (C)](#record-type-c) | `record-type` and `msgspec` |\n| [attrs](https://www.attrs.org/en/stable/) | |\n| [msgspec](https://github.com/msgspec/msgspec) | |\n\nEach of these implementations will be evaluated with and without [mypyc](https://mypyc.readthedocs.io/) compilation, and as a cold start (no bytecode cache) and warm start (bytecode cache present), when relevant.\nAll of the implementations are tested on a struct type of three ints:\n\n```\nstruct StructUnderTest {\n\ta: int\n\tb: int\n\tc: int\n}\n```\n\nRefer to the [methodology](#methodology) section for details on how the benchmarks were run.\n\n[#](#module-cost)Module Cost\n\nWhen you import your base type or decorator, you also must pay a one-time cost, regardless of how many types you define, for that module’s source tree. The cold import is roughly 6-8× a warm one, because the whole transitive source tree has to be recompiled to bytecode.\n\n[#](#type-cost)Type Cost\n\nSo, how much does it cost to *define* a type? Remember that this cost is paid once on every **program start**, or at least when it is first **imported**.\n\nMany of these benefit greatly from a warm start, which is the most common use of a Python program. 
Cold start is included because it’s the first impression that a user gets: *“how fast to --help?”*\n\nLooking just at the warm start, we can start to see 3 performance tiers:\n\n- ~7-12 µs: `native slots`, `record-type (C)`, `msgspec`, and `manual record`\n- ~76-96 µs (~8× slower): `NamedTuple`, `record-type`\n- ~200-370 µs (~20-30× slower): `dataclass`, `dataclass(frozen=True)`, `attrs`\n\nThe tiers come down to [how many methods each implementation has to generate when the\ntype is defined](#why-three-type-cost-tiers).\n\nUse the table below to sort relative performance.\n\n| implementation | |||\n|---|---|---|---|\n| 0.10× | 0.60× | 0.11× | |\n| 0.11× | 0.35× | — | |\n| 0.13× | 0.42× | — | |\n| 0.15× | 2.1× | 0.18× | |\n| 1.00× | 1.00× | 1.00× | |\n| 1.3× | 1.2× | — | |\n| 3.0× | 2.5× | 3.0× | |\n| 4.0× | 3.2× | — | |\n| 4.9× | 3.8× | 5.2× |\n\nPer-type cost: each cell is × the baseline row (NamedTuple by default; click any row to re-base). Lower is faster to define. Click a column header to sort.\n\n[#](#so-whats-the-fastest-startup)So what’s the fastest startup?\n\nTotal startup is calculated as the fixed dependency import (the [module cost](#module-cost)) plus the number of types × the per-type definition cost.\n\nThe interactive chart below shows the startup time on the Y axis and the number of types defined on the X axis. The scales can be toggled together between log (Y log10, X log2 from 1 to 4,096 types) and linear (Y clipped to 0ms–1,000ms, X from 0 to 4,096 types). For each implementation, the solid line is the warm time, and the dotted line is the cold time. Click on a name in the legend to toggle, double click to isolate, and double click on a disabled name to reset.\n\n[#](#conclusion)Conclusion\n\nFor my purposes, I can draw a few conclusions from this.\n\n- `NamedTuple` (my go-to) is sorta in the middle and is probably not dragging start times too much. But its per-type cost is ~8× the native/C implementations, so as the program grows, it will start to add up.\n- `msgspec` is faster than `NamedTuple` above ~256 (warm) type definitions. But this assumes absolute **dependency discipline** that negates some of the upsides of Python’s ecosystem. If you import `msgspec`, or `dataclass`, anywhere, or if any of your dependencies have a high module or type cost, then `NamedTuple`’s low module cost is dwarfed and you may as well have started with a cheaper struct implementation.\n- The decorator-based implementations (`dataclass`, `record-type`, and `attrs`) all have a high type cost, but with that comes (evil-runtime-) metaprogramming capabilities.\n- The C implementation of record-type is good enough (wins by every metric) that I’ll be rewriting it and getting it under a test suite.\n- I will update this article once I have a tested implementation! *It may be too good to be true*\n- I will definitely be trying out `msgspec` in the future. I wasn’t familiar with it before working on this report, but it’s very exciting to see these numbers, not to mention that it has de/serialization on top of being a basic struct. I’d love to see [CDDL/CBOR 🔥](https://datatracker.ietf.org/doc/html/rfc8610) and [postcard ✉️](https://github.com/jamesmunns/postcard) de/serializers!\n\n[#](#appendix)Appendix\n\nHere lives more stuff that wasn’t directly relevant to my goal of assessing startup time, but is still fun.\n\n[#](#instance-cost)Instance Cost\n\nWhat can I say, since the benchmark suite was set up, I couldn’t resist. The instance costs are relevant to the program speed once it’s begun, and you’ll see that they are quite a bit tighter than the module and type cost comparisons. 
There’s a total spread of under 4x, from ~60ns up to ~220ns per instance.\n\n[#](#construction)Construction\n\n| implementation | ||\n|---|---|---|\n| 0.44× | — | |\n| 0.45× | — | |\n| 0.63× | 0.53× | |\n| 0.63× | 0.77× | |\n| 1.00× | 1.00× | |\n| 1.5× | — | |\n| 1.6× | 0.55× | |\n| 1.6× | 1.6× | |\n| 1.6× | — |\n\nPer-instance construction cost — each cell is × the baseline row (NamedTuple by default; click any row to re-base). Lower is faster. Click a column header to sort.\n\n[#](#memory)Memory\n\nMemory is driven by object layout. Freezing a type never changes its footprint —\n`frozen=True`\n\nonly changes the write path, not the storage. mypyc trades a few\nbytes per instance (one pointer to its method table, akin to a C++ vtable) for\nspeed, 2 and gives every compiled class a fixed layout even without\n\n`__slots__`\n\n.\n\n[3](#user-content-fn-3)[#](#the-cost-of-immutability)The cost of immutability\n\nImmutability sometimes costs time or space and is never more efficient.\n\n[#](#native-slots)`native slots`\n\nA plain slotted class with `Final`\n\nfields.\n\n``` python\nfrom typing import Final\n\nclass NativeFinal:\n\t__slots__ = (\"a\", \"b\", \"c\")\n\n\tdef __init__(self, a: int, b: int, c: int) -> None:\n\t\tself.a: Final = a\n\t\tself.b: Final = b\n\t\tself.c: Final = c\n```\n\nThe `Final`\n\nis for the static checker, meaning that it has zero runtime cost.[4](#user-content-fn-9)`mypy`\n\nrejects `o.a = 99`\n\n, but the assignment succeeds anyway, on the\ninterpreted class *and* the compiled `.so`\n\n. So this is the closest thing to a\nnative record mypyc can produce — a compact slotted object (64 bytes; 72\ncompiled) whose `__init__`\n\nit lowers to C-level slot stores, but it is *not*\nactually immutable at runtime (zero cost abstraction).\n\n[#](#manual-record)`manual record`\n\n`native slots`\n\nis cheap precisely because it\ndoes *less*. It has no `__eq__`\n\n, `__hash__`\n\n, or `__repr__`\n\n, and — as we saw — it\nisn’t even immutable. 
Every other record here gives you all of that. So here is\n[Brett Cannon’s record-type pattern](https://github.com/brettcannon/record-type): a complete, genuinely-immutable hand-written record with\n`__slots__`\n\n, `__match_args__`\n\n, a real `__setattr__`\n\nguard, and\n`__eq__`\n\n/`__hash__`\n\n/`__repr__`\n\n:\n\n```\nclass ManualRecord:\n\t__slots__ = (\"a\", \"b\", \"c\")\n\t__match_args__ = (\"a\", \"b\", \"c\")\n\n\tdef __init__(self, a: int, b: int, c: int) -> None:\n\t\tobject.__setattr__(self, \"a\", a)\n\t\tobject.__setattr__(self, \"b\", b)\n\t\tobject.__setattr__(self, \"c\", c)\n\n\tdef __setattr__(self, _attr, _val):\n\t\traise TypeError(\"immutable\")\n\n\tdef __eq__(self, other):\n\t\tif not isinstance(other, type(self)):\n\t\t\treturn NotImplemented\n\t\treturn self.a == other.a and self.b == other.b and self.c == other.c\n\n\tdef __hash__(self):\n\t\treturn hash((self.a, self.b, self.c))\n```\n\n[#](#record-type-c)record-type (C)\n\nThe manual record marks the pure-Python performance ceiling: complete and immutable, with\nnear-zero import, but either slow to construct (222 ns) or — once mypyc lowers its\n`object.__setattr__`\n\ninit — fast (78 ns) yet *larger* (96 bytes). `msgspec.Struct`\n\nshows C clears that ceiling: compact (64 bytes), immutable, ~62 ns construction,\n~10 µs/type. Its one catch is the module cost. `import msgspec`\n\nruns\n**~19 ms**, because it’s a serialization library and you can’t get just the struct without importing the whole kitchen sink.[5](#user-content-fn-8)\n\nCan you get msgspec’s record qualities *without* its import tax? A **research\nprototype** (*read: LLM slop*) on a [branch of Brett Cannon’s record-type](https://github.com/JPHutchins/record-type/pull/1)\nanswers yes. 
It’s a ~600-(slop)-line C extension: an inheritable\n\n`Record`\n\nbase you\nsubclass (subtype) exactly like `NamedTuple`\n\n:\n\n``` python\nfrom native_record import Record\n\nclass Point(Record):\n\ta: int\n\tb: int\n\tc: int\n```\n\nA C *metaclass* reads the class-body annotations directly (no `inspect`\n\n, no\n`exec`\n\n) and builds a frozen, slotted type whose constructor is a C-level\nvectorcall, borrowing msgspec’s type-creation trick, with none of its codec machinery.\nAnd you saw in the charts above that it wins in every category.\n\n[#](#buuuuuuuuuuuuut)buuuuuuuuuuuuut…\n\nIt’s a **research prototype**, not a release. It lives on a\nPR branch, not PyPI. And there’s one real semantic limit: a class body can’t\nexpress Python’s full parameter grammar (positional-only, keyword-only, `*args`\n\n,\n`**kwargs`\n\n) the way `@record`\n\n’s function signature can — fine for the\nrecord-shaped common case, but not literally 1:1 with the decorator. (Per-type\nhere is measured exactly like every other construct — module self-time ÷ K, which\nincludes the ~7 µs the bare `class`\n\nstatement costs regardless — so it is directly\ncomparable to the figures above.)\n\n[#](#why-three-type-cost-tiers)Why three type-cost tiers?\n\n- fastest:\n`native slots`\n\n,`record-type (C)`\n\n,`msgspec`\n\n, and`manual record`\n\n- ~8× slower:\n`NamedTuple`\n\n,`record-type`\n\n- ~20-30× slower:\n`dataclass`\n\n,`dataclass(frozen=True)`\n\n,`attrs`\n\nThe single best predictor turned out to\nbe how many methods each construct has to **generate at class-creation**: zero,\none, or several. 
(Trace it yourself with\n[ codegen_probe.py](https://github.com/JPHutchins/python-struct-profiling/blob/d8acfd5f63824b87b24e820e6f6859e0194da4c6/codegen_probe.py),\nwhich captures every\n\n`exec`\n\n/ `eval`\n\n/ `compile`\n\na single definition triggers.)[#](#tier-1--nothing-generated)Tier 1 — nothing generated.\n\n`native slots`\n\nand `manual record`\n\nare\nhand-written, so their methods compile *once* into the `.pyc`\n\nand the `class`\n\nstatement only has to build the type. `msgspec`\n\nand `record-type (C)`\n\ngenerate no\n*Python* either. A C metaclass assembles the type directly.\n\n[#](#tier-2--one-generated-method)Tier 2 — one generated method.\n\n[ collections.namedtuple](https://github.com/python/cpython/blob/v3.14.0/Lib/collections/__init__.py#L361)\nbuilds a\n\n`tuple`\n\nsubclass — a descriptor per field and a single `eval`\n\n’d `__new__`\n\n:\n\n```\nlambda _cls, a, b, c: _tuple_new(_cls, (a, b, c))\n```\n\nwith [ typing.NamedTuple](https://github.com/python/cpython/blob/v3.14.0/Lib/typing.py#L3027)\nadding\n\n[PEP 649](https://peps.python.org/pep-0649/)annotation handling on top.\n\n`record-type`\n\n’s [takes the other road —](https://github.com/brettcannon/record-type/blob/2023.1/records.py#L86)\n\n`@record`\n\n`inspect.signature`\n\nto read the fields, then one `exec`\n\n’d\nclass whose only generated logic is the `__init__`\n\n(`__eq__`\n\n/ `__hash__`\n\n/\n`__repr__`\n\ncome from a `Record`\n\nbase):\n\n``` python\nclass C(Record):\n\t__slots__ = ('a', 'b', 'c')\n\n\tdef __init__(self, /, a, b, c) -> None:\n\t\tobject.__setattr__(self, 'a', a)\n\t\tobject.__setattr__(self, 'b', b)\n\t\tobject.__setattr__(self, 'c', c)\n```\n\nA metaclass-plus-factory and a decorator-plus-`inspect`\n\n: different machinery, the\nsame one-method’s-worth of work, the same tier.\n\n[#](#tier-3--several-generated-methods-plus-field-work-and-a-rebuild)Tier 3 — several generated methods, plus field work and a rebuild.\n\n[ 
dataclass](https://github.com/python/cpython/blob/v3.14.0/Lib/dataclasses.py#L934)\nturns the annotations into\n\n`Field`\n\nobjects and generates `__init__`\n\n, `__repr__`\n\n,\nand `__eq__`\n\nin one shot (a factory that returns the three):\n\n``` python\ndef __create_fn__(\n\t__dataclass_type_a__,\n\t__dataclass_type_b__,\n\t__dataclass_type_c__,\n\t__dataclass_HAS_DEFAULT_FACTORY__,\n\t__dataclass_builtins_object__,\n\t__dataclass___init___return_type__,\n\t__dataclasses_recursive_repr\n):\n\tdef __init__(\n\t\tself,\n\t\ta:__dataclass_type_a__,\n\t\tb:__dataclass_type_b__,\n\t\tc:__dataclass_type_c__\n\t) -> __dataclass___init___return_type__:\n\t\tself.a=a\n\t\tself.b=b\n\t\tself.c=c\n\t@__dataclasses_recursive_repr()\n\tdef __repr__(self):\n\t\treturn f\"{self.__class__.__qualname__}(a={self.a!r}, b={self.b!r}, c={self.c!r})\"\n\tdef __eq__(self,other):\n\t\tif self is other:\n\t\t\treturn True\n\t\tif other.__class__ is self.__class__:\n\t\t\treturn self.a==other.a and self.b==other.b and self.c==other.c\n\t\treturn NotImplemented\n\treturn (__init__,__repr__,__eq__,)\n```\n\n`frozen=True`\n\nadds three more: `__setattr__`\n\n, `__delattr__`\n\n, `__hash__`\n\n— and\n`slots=True`\n\n[creates the class a second time](https://github.com/python/cpython/blob/v3.14.0/Lib/dataclasses.py#L1277),\nsince slots can’t be added in place.\n[ attrs](https://github.com/python-attrs/attrs/blob/26.1.0/src/attr/_make.py#L796)\nis a more layered version of the same idea.\n\n[#](#namedtuple-in-mypyc)NamedTuple in mypyc\n\nI was really hoping that mypyc was going to compile NamedTuple to a native struct. 
[Compiling the module](#interpreted-vs-compiled) changes *almost nothing* about the `NamedTuple`\n\n, while it transforms `native slots`\n\n:\n\n| metric | NamedTuple interpreted | NamedTuple mypyc | native slots interpreted | native slots mypyc |\n|---|---|---|---|---|\n`isinstance(_, tuple)` | yes | yes | no | no |\n| bytes / instance | 88 | 88 | 64 | 72 |\n`__new__` (type) instructions | 7 bytecodes | 7 bytecodes | C | C |\n`__init__` (instance) instructions | C | C | 9 bytecodes | C |\n| instance (ns) | 138 | 142 | 87.5 | 75.7 |\n\nThe NamedTuple columns are identical: same footprint, same construct time. Its `__new__`\n\nis still seven interpreted bytecodes inside the compiled extension module, building a tuple and handing it to `tuple.__new__`\n\n:\n\n```\n1 RESUME                            0\n  LOAD_GLOBAL                       1 (_tuple_new + NULL)\n  LOAD_FAST_BORROW_LOAD_FAST_BORROW 1 (_cls, a)\n  LOAD_FAST_BORROW_LOAD_FAST_BORROW 35 (b, c)\n  BUILD_TUPLE                       3\n  CALL                              2\n  RETURN_VALUE\n```\n\nContrast the native record. It has no `__new__`\n\nat all; its `__init__`\n\nwrites\nthe three fields straight into their slots with `STORE_ATTR`\n\n— no tuple, no\nlength field, no boxed item array. (The `Final`\n\nannotations add *zero* bytecode;\nthey’re a pure type-checker hint, so this is byte-for-byte a plain slotted\nclass.)\n\n```\n11 RESUME                   0\n12 LOAD_FAST_BORROW_LOAD_FAST_BORROW 16 (a, self)\n   STORE_ATTR               0 (a)\n13 LOAD_FAST_BORROW_LOAD_FAST_BORROW 32 (b, self)\n   STORE_ATTR               1 (b)\n14 LOAD_FAST_BORROW_LOAD_FAST_BORROW 48 (c, self)\n   STORE_ATTR               2 (c)\n   LOAD_CONST               0 (None)\n   RETURN_VALUE\n```\n\nmypyc *does* lower this `__init__`\n\nto C — recall its 9 bytecodes became C-level in\nthe compiled column. 
But for *this* record you barely see it in the construction\nnumbers (87 → 76 ns, within run-to-run noise): the `__init__`\n\nis only three `STORE_ATTR`\n\ns, and the interpreted `timeit`\n\nloop crosses the\ninterpreter↔native boundary on every call, which caps any gain. Where compiling a\nhand-written `__init__`\n\n*does* pay off is when it does real interpreted work —\n[manual record](#manual-record) routes every\nfield through `object.__setattr__`\n\nand drops from 222 to 78 ns once compiled, a\nspeedup a frozen `dataclass`\n\ncan’t get. NamedTuple’s `__new__`\n\n, by contrast, stays\ninterpreted even when compiled and there’s nothing for mypyc to lower at all without\nbreaking the tuple contract.\n\nSo, I’ve been right to reach for `NamedTuple`\n\nas a cheaper immutable type than `dataclass(frozen=True)`\n\n, but I was wrong to think that it was *perfectly* efficient and compact like a C struct.\n\n[#](#further-reading)Further reading\n\n**A first-class record type for Python.** Brett Cannon’s[record-type proposal](https://discuss.python.org/t/introducing-record-types-in-python/34397)(and a terser`struct Point(x: int, y: int)`\n\nspelling), with the proof-of-concept`record`\n\ndecorator already on PyPI. As proposed it standardizes the boilerplate — a concise frozen, slotted dataclass — rather than adding a performance primitive: a decorator’s generated`__init__`\n\nstays interpreted, so it can’t push past the pure-Python floor the[manual record](#manual-record)maps out.**Unboxed value types in mypyc.**[mypyc#841](https://github.com/mypyc/mypyc/issues/841)tracks the performance angle these benchmarks can’t reach: user-defined*unboxed*value types (≈16 bytes vs 40 for a heap object), passed around in native code and boxed only when they enter a Python container. mypyc already does this for native integers (`i64`\n\n/`i32`\n\n) — just not yet for user-defined records. 
Open since 2021 with no implementation: a direction, not a date, and nothing to benchmark yet.\n\n[#](#methodology)Methodology\n\nAll measurements were taken on a single machine: CPython 3.14.0 (installed and\nmanaged with [uv](https://docs.astral.sh/uv/)), mypy/mypyc 2.1.0, attrs 26.1.0,\nmsgspec 0.21.1, and record-type 2023.1.post1, on x86_64 Linux (WSL2) with gcc\n13.3. The C-backed `record-type (C)`\n\nis built from the\n[branch linked above](https://github.com/JPHutchins/record-type/pull/1) (a\nresearch prototype, not a release). Absolute numbers will differ on your hardware\nand Python build; the *relative* shape is the takeaway. Every struct carries the\nsame three `int`\n\nfields.\n\n[#](#interpreted-vs-compiled)Interpreted vs compiled\n\nThe standard-library constructs (plain classes, slotted, `Final`\n\n-slotted,\n`NamedTuple`\n\n, and the dataclass variants) live in one module that is the unit of\ncompilation: `mypyc containers.py`\n\nproduces a `containers.*.so`\n\n. An interpreted\ndriver imports that module and detects which form it got by testing whether\n`__file__`\n\nends in `.so`\n\n. This mirrors how mypyc is actually used — you compile\nthe *definitions* and call into them from ordinary interpreted code. 
The `attrs`\n\n,\n`msgspec`\n\n, and both `record-type`\n\nclasses are defined in the driver itself, not in\nthe compiled module, so there is no mypyc form to measure — the charts and tables\nleave their mypyc column empty rather than copy in the interpreted value.\n(`record-type (C)`\n\nis already a compiled C extension, so mypyc has nothing to\nadd — it *is* the native form.)\n\nEven inside the compiled `.so`\n\n, the `@dataclass`\n\ndecorator and the `NamedTuple`\n\nmetaclass run as interpreted CPython, and the `__init__`\n\n/ `__new__`\n\nthey generate\nstay interpreted bytecode: mypyc compiles the module’s own code, not the code those\ntools synthesize at runtime.\n\n[#](#memory-footprint)Memory footprint\n\n`sys.getsizeof`\n\nreports one object’s size but doesn’t follow the `__dict__`\n\npointer, so it understates classes that carry one. 6 The headline figures instead\ncome from a bulk\n\n`tracemalloc`\n\nmeasurement — allocate 200,000 instances and\nsubtract a same-length `[None] * n`\n\nlist measured the same way, so the list’s own\nbacking storage cancels and what remains is the instances’ allocation (GC header\nincluded):\n\n``` python\nimport gc, tracemalloc\n\ndef mem_per_instance(ctor, args, n=200_000):\n\tgc.collect()\n\ttracemalloc.start()\n\tbase = [None] * n\n\tbase_cur, _ = tracemalloc.get_traced_memory()\n\tobjs = [ctor(*args) for _ in range(n)]\n\tcur, _ = tracemalloc.get_traced_memory()\n\ttracemalloc.stop()\n\treturn (cur - base_cur) / n\n```\n\nTreat the per-instance figure as ±one allocator alignment word.\n\n[#](#bytecode)Bytecode\n\nAllocation bytecode is counted with `dis.get_instructions`\n\non `__new__`\n\nand\n`__init__`\n\n(unwrapping the `staticmethod`\n\nthat wraps a NamedTuple’s `__new__`\n\n),\nand disassembled with `dis.dis`\n\nfor the listings above. 
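
As a rough illustration of that counting step, the sketch below applies `dis.get_instructions` to a NamedTuple's `__new__` and to a slotted class's `__init__`. `Point`, `Slotted`, and `instruction_count` are names made up for this example rather than taken from `bench.py`, and the counts it prints depend on the CPython version.

```python
import dis
from typing import NamedTuple


class Point(NamedTuple):
    a: int
    b: int
    c: int


class Slotted:
    __slots__ = ('a', 'b', 'c')

    def __init__(self, a: int, b: int, c: int) -> None:
        self.a = a
        self.b = b
        self.c = c


def instruction_count(func) -> int:
    # A class stores __new__ wrapped in a staticmethod; unwrap it if needed.
    func = getattr(func, '__func__', func)
    return sum(1 for _ in dis.get_instructions(func))


# Read __new__ from the class dict so the staticmethod wrapper is visible.
new_count = instruction_count(vars(Point)['__new__'])
init_count = instruction_count(Slotted.__init__)
print(f'NamedTuple __new__: {new_count} instructions')
print(f'slotted __init__:   {init_count} instructions')
```
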
Deallocation has no\nPython bytecode to count: teardown is C-level `tp_dealloc`\n\n/ `tp_free`\n\nunless a\nclass defines a Python `__del__`\n\n, which none of these do.[7](#user-content-fn-5)\n\n[#](#per-instance-timing)Per-instance timing\n\nConstruction and attribute access are timed with [ timeit](https://docs.python.org/3/library/timeit.html) — the minimum of seven\nrepeats of 1,000,000 iterations for construction, 5,000,000 for access, reported\nas nanoseconds per operation.\n\nThe\n\n[8](#user-content-fn-6)`timeit`\n\nloop is interpreted, so every iteration\ncrosses the interpreter↔native boundary. mypyc’s attribute-access and call\nspeedups land on the *compiled→compiled*path, so an interpreted loop reaching into a compiled class won’t see them (and can read slightly slower) — which is why the compiled instantiation numbers sit on top of the interpreted ones rather than below.\n\n[#](#import--type-construction-time)Import / type-construction time\n\nThe obvious approach — `timeit`\n\non `make_dataclass()`\n\nor `namedtuple()`\n\n—\nmeasures the wrong thing. The dynamic factory forms differ from the `@dataclass`\n\nand `class C(NamedTuple)`\n\nforms you actually write (the functional `NamedTuple(...)`\n\ncall understates the class-statement form by roughly 3×), and `timeit`\n\nis blind to\nboth mypyc and the one-time cost of importing the supporting library, since those\nhappen before the loop starts.\n\nSo every import number comes from a **fresh interpreter** under\n`python -X importtime`\n\n, reading the *self* time attributed to the module — self\ntime excludes child imports, so the supporting library isn’t double-counted:\n\n**Per-type cost.** Generate a module of K = 200 identical-shape classes in the real class-statement form, import it, and read its self-time; the per-type figure is that self-time ÷ 200, the median of five fresh interpreters (this is what the committed`importtime_sweep.py`\n\nreports). 
Dividing by K folds a small fixed per-module overhead into each figure. What that per-type cost*consists*of — the methods each construct generates at class-creation — is dissected in[Why three type-cost tiers](#why-three-type-cost-tiers).**Cold vs warm.**“Warm” imports with the`__pycache__/*.pyc`\n\nalready written; “cold” deletes`__pycache__`\n\nfirst, so the source is recompiled to bytecode in-process. Their difference is the source→bytecode compile cost (tens of µs/type — ~25–55 here, scaling with each class’s source size).**Dependency import.**`python -X importtime -c \"import LIB\"`\n\nin a fresh interpreter gives the cumulative cost of first-importing a library. The cold variant points`PYTHONPYCACHEPREFIX`\n\nat an empty directory so the whole transitive source tree must recompile.**mypyc axis.** The generated module is compiled with`mypyc`\n\nand the resulting`.so`\n\nimported under the same harness. A compiled extension has no Python source to recompile, so there’s no cold/warm gap — yet its per-type creation cost is barely lower than interpreted, likely because type creation is dominated by CPython’s`PyType_Ready`\n\n, which runs either way.\n\n[#](#the-crossover-model)The crossover model\n\nThe startup chart is a model, not a direct measurement: total startup is taken as\na fixed dependency import plus N × the measured per-type construction cost,\nevaluated for cold and warm. The crossover is where two such lines meet —\n`N = (dep_b - dep_a) / (per_type_a - per_type_b)`\n\n. 
It assumes a single dependency\nimported once and a linear per-type cost (both hold well here); the cold curves\nroll up shared sub-dependencies, so several of these libraries imported together\ncost less than the sum of their individual lines.\n\n[#](#reporting)Reporting\n\nBytes and counts are integers; timing data is quoted to three significant figures.\nImport timings vary run to run, so each is reported as the median of five fresh\nprocesses; instantiation is the *minimum* of seven `timeit` repeats (the\nconventional low-noise estimator). Treat the per-instance nanosecond figures as\n±10% — the construct-to-construct *shape* is what’s robust, not the third digit.\n\n[#](#limitations-and-cross-validation)Limitations and cross-validation\n\n- **One machine, no isolation.** Everything ran on a single WSL2 host — which sits on Hyper-V, as does the Windows install beside it, so there’s no bare-metal baseline on this box (and no WSL2-specific virtualization penalty to factor out either) — with no CPU pinning or frequency-scaling control. Repeating on separate hardware, several Python versions, and a second OS would confirm the shape; pinning the CPU steadies the absolute numbers.[9](#user-content-fn-7)\n- **Compiled construction is timed from an interpreted loop.** That measures the common interpreted-caller-into-compiled-class case, not compiled→compiled throughput. 
A benchmark loop itself compiled with mypyc would show whether its call and attribute speedups close the gap.\n- **“Cold” is a cold *bytecode* cache, not a cold disk.** The source stays in the OS page cache between runs, so the cold figures isolate source→bytecode compilation, not first-read I/O.\n- **Per-type cost is self-time ÷ K.** That folds a small fixed per-module overhead into each figure; a regression over several values of K would separate the fixed cost from the per-type slope (the correction is sub-microsecond for the cheap constructs).\n- **`msgspec`’s ~19 ms import is library-wide.** There is no struct-only import to isolate — the codec comes with it — so it’s a fair number to report but not a pure struct-definition cost.[5](#user-content-fn-8)\n- **`record-type (C)` is a research prototype.** Its numbers may shift once it’s hardened and packaged.\n- **Five runs is modest.** More repeats, and reporting dispersion alongside the median, would tighten the import figures.\n\n[#](#reproducing)Reproducing\n\nEverything here is reproducible from the\n[python-struct-profiling](https://github.com/JPHutchins/python-struct-profiling)\nrepository — the data in this post was produced at commit [`b2f2eb7`](https://github.com/JPHutchins/python-struct-profiling/tree/b2f2eb7da90762e51957189f130d10f22d2eb77a). Two committed harnesses produce every number, and a third dissects the type-definition mechanism — all on the same machine, all carrying the identical three-`int`-field shape:\n\n- `bench.py` — memory (`tracemalloc`), bytecode (`dis`), and instantiation (`timeit`), run once against the interpreted module and once against the `mypyc`-compiled `containers.so`.\n- `importtime_sweep.py` — the import / type-creation axis: it generates a module of K real class-statement / decorator forms per construct, imports it under `python -X importtime` in a fresh interpreter, and divides the module self-time by K. 
The figures here are `--k 200 --runs 5`.\n- `codegen_probe.py` (added at `d8acfd5`) — the mechanism behind the [three type-cost tiers](#why-three-type-cost-tiers): it traces the `exec` / `eval` / `compile` each construct runs at class-creation and counts how many methods each one generates (zero, one, or several).\n\n[#](#raw-data)Raw data\n\nEvery figure above is derived from this one table set (the charts and these tables read the same array, so they cannot disagree):\n\n**Table 1 — Import / type-creation cost**, µs per class (median of 5 fresh\n`-X importtime` runs, K = 200). *mypyc* is the compiled `.so`;\n“—” means the construct is off the compiled axis (attrs, msgspec, and both record-types are\ndefined outside the compiled module; `record-type (C)` is already a C extension).\n\n| construct | variant | warm | cold | mypyc |\n|---|---|---|---|---|\n| native slots | mutable | 7.3 | 59.3 | 6.9 |\n| native slots | frozen | 7.4 | 62.2 | 6.9 |\n| manual record | frozen | 11.5 | 214.5 | 11.1 |\n| NamedTuple | frozen | 76.2 | 104.3 | 63.3 |\n| dataclass | mutable | 228.4 | 261.0 | 190.3 |\n| dataclass | frozen | 373.4 | 401.2 | 328.5 |\n| record-type | frozen | 96.4 | 122.4 | — |\n| record-type (C) | frozen | 8.6 | 36.0 | — |\n| attrs | mutable | 264.6 | 288.7 | — |\n| attrs | frozen | 301.4 | 332.2 | — |\n| msgspec | mutable | 10.5 | 40.1 | — |\n| msgspec | frozen | 10.2 | 44.0 | — |\n\n**Table 2 — One-time dependency import**, milliseconds cumulative in a fresh\ninterpreter. Paid once per process regardless of how many types you define. 
The native\nrecord imports no library.\n\n| library | warm | cold |\n|---|---|---|\n| native (none) | 0.0 | 0.0 |\n| manual (none) | 0.0 | 0.0 |\n| typing | 4.0 | 33.9 |\n| dataclasses | 11.5 | 81.9 |\n| record-type | 12.5 | 91.3 |\n| record-type (C) | 0.2 | 0.2 |\n| attrs | 22.2 | 128.5 |\n| msgspec | 19.1 | 131.7 |\n\n**Table 3 — Per-instance memory**, bytes (tracemalloc, GC header included).\nFreezing never changes the footprint; mypyc adds one 8-byte vtable word to the native\nclasses it compiles.\n\n| construct | variant | interpreted | mypyc |\n|---|---|---|---|\n| native slots | mutable | 64 | 72 |\n| native slots | frozen | 64 | 72 |\n| manual record | frozen | 64 | 96 |\n| NamedTuple | frozen | 88 | 88 |\n| dataclass | mutable | 64 | 72 |\n| dataclass | frozen | 64 | 72 |\n| record-type | frozen | 64 | — |\n| record-type (C) | frozen | 64 | — |\n| attrs | mutable | 80 | — |\n| attrs | frozen | 80 | — |\n| msgspec | mutable | 64 | — |\n| msgspec | frozen | 64 | — |\n\n**Table 4 — Instantiation**, nanoseconds (min of 7 timeit repeats of 1e6\niterations). The timeit loop is interpreted, so a compiled class called from it shows no\nmypyc speedup — and can read *noticeably* slower from the per-call interpreter↔native\nboundary (e.g. mutable dataclass 87.5→109.5). 
Treat these as ±10%; the construct-to-construct\nshape is the robust signal, not small interpreted-vs-mypyc deltas.\n\n| construct | variant | interpreted | mypyc |\n|---|---|---|---|\n| native slots | mutable | 87.3 | 75.2 |\n| native slots | frozen | 87.5 | 75.7 |\n| manual record | frozen | 222.5 | 78.4 |\n| NamedTuple | frozen | 138.3 | 141.5 |\n| dataclass | mutable | 87.5 | 109.5 |\n| dataclass | frozen | 224.3 | 226.0 |\n| record-type | frozen | 227.0 | — |\n| record-type (C) | frozen | 61.2 | — |\n| attrs | mutable | 88.5 | — |\n| attrs | frozen | 209.1 | — |\n| msgspec | mutable | 63.0 | — |\n| msgspec | frozen | 62.5 | — |\n\n**Table 5 — Construction bytecode**, instruction counts from `dis`.\n“C” = no Python bytecode (C-level). Freezing is what turns the 9-instruction\n`__init__` into 25 (every field routed through `object.__setattr__`);\nthese counts are unchanged inside the compiled module except the native\n`__init__`, which mypyc lowers to C.\n\n| construct | `__new__` | `__init__` (mutable) | `__init__` (frozen) |\n|---|---|---|---|\n| native slots | C | 9 | 9 |\n| manual record | C | — | 24 |\n| NamedTuple | 7 | — | C |\n| dataclass | C | 9 | 25 |\n| record-type | C | — | 24 |\n| record-type (C) | C | — | C |\n| attrs | C | 9 | 25 |\n| msgspec | C | C | C |\n\nDerived: the NamedTuple ↔ msgspec startup crossover sits at **229**\ntypes (warm) and **1,622** types (cold), computed from\nTables 1 and 2.\n\n[#](#footnote-label)Footnotes\n\n1. If you have any ideas, please LMK so I can explain it to my family. [↩](#user-content-fnref-1)\n2. [“Introduction”](https://mypyc.readthedocs.io/en/stable/introduction.html). mypyc.readthedocs.io. Retrieved 2026-06-21. “Classes are compiled to *C extension classes*. They use vtables for fast method calls and attribute access.” [↩](#user-content-fnref-2)\n3. [“Native classes”](https://mypyc.readthedocs.io/en/stable/native_classes.html). mypyc.readthedocs.io. Retrieved 2026-06-21. 
“Only attributes defined within a class definition (or in a base class) can be assigned to (similar to using `__slots__`).” [↩](#user-content-fnref-3)\n4. [“typing.Final”](https://docs.python.org/3/library/typing.html#typing.Final). docs.python.org. Retrieved 2026-06-21. “There is no runtime checking of these properties.” (See also [PEP 591](https://peps.python.org/pep-0591/).) [↩](#user-content-fnref-9)\n5. `src/msgspec/__init__.py`. github.com/jcrist/msgspec. Retrieved 2026-06-21. `Struct` is imported from the compiled `._core` extension, and importing the package eagerly runs `from . import inspect, json, msgpack, structs, toml, yaml`; the codecs in `json.py` / `msgpack.py` re-export from that same `_core`, so there is no struct-only import to isolate. [↩](#user-content-fnref-8) [↩2](#user-content-fnref-8-2)\n6. [“sys.getsizeof”](https://docs.python.org/3/library/sys.html#sys.getsizeof). docs.python.org. Retrieved 2026-06-21. “Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.” [↩](#user-content-fnref-4)\n7. [“tp_dealloc”](https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_dealloc). docs.python.org. Retrieved 2026-06-21. “A pointer to the instance destructor function. […] free all memory buffers owned by the instance, and call the type’s `tp_free` function to free the object itself.” [↩](#user-content-fnref-5)\n8. [“timeit”](https://docs.python.org/3/library/timeit.html). docs.python.org. Retrieved 2026-06-21. The module “provides a simple way to time small bits of Python code”; the minimum is reported because “the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. 
So the `min()` of the result is probably the only number you should be interested in.” [↩](#user-content-fnref-6)\n9. [“Comparing WSL Versions”](https://learn.microsoft.com/en-us/windows/wsl/compare-versions). learn.microsoft.com. Retrieved 2026-06-21. “WSL 2 is running as a Hyper-V virtual machine.” The Windows host beside it is itself a partition on that same hypervisor — [“Hyper-V Architecture”](https://learn.microsoft.com/en-us/windows-server/virtualization/hyper-v/architecture): “The Microsoft hypervisor must have at least one parent, or root, partition, running Windows … [which] has direct access to hardware devices.” [↩](#user-content-fnref-7)\n\n© 2026 by JP Hutchins. Published under a Creative Commons\nAttribution-NonCommercial 4.0 International\n([CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/))\nlicense.", "url": "https://wpnews.pro/news/the-fastest-python-struct", "canonical_source": "https://www.crumpledpaper.tech/2026-06-21-python-struct-profiling/", "published_at": "2026-06-24 01:19:34+00:00", "updated_at": "2026-06-24 01:44:34.630108+00:00", "lang": "en", "topics": ["developer-tools", "machine-learning"], "entities": ["JP Hutchins", "Python", "Claude", "Plotly", "Rust", "clap"], "alternates": {"html": "https://wpnews.pro/news/the-fastest-python-struct", "markdown": "https://wpnews.pro/news/the-fastest-python-struct.md", "text": "https://wpnews.pro/news/the-fastest-python-struct.txt", "jsonld": "https://wpnews.pro/news/the-fastest-python-struct.jsonld"}}