The Fastest Python Struct?

JP Hutchins benchmarks Python struct definition speed, finding that metaprogramming approaches like decorators and metaclasses incur higher upfront runtime costs than manual type definitions. The analysis focuses on compile-time performance for CLI tools and build systems, where startup time is critical.

JP Hutchins, https://www.crumpledpaper.tech/2026-06-21-python-struct-profiling
All posts written without LLM assistance unless otherwise noted.

Python is fast enough. Python programmers tend to understand the Python Cost Model (https://ocw.mit.edu/courses/6-006-introduction-to-algorithms-fall-2011/pages/readings/python-cost-model/), Python’s strengths and weaknesses, libraries that give compiled performance, and when to use a compiled language from the start.

So why do I care? Why do I get obsessed enough to coerce Claude into running these benchmarks and writing these Plotly charts? I do not know. [1]

But I do know what I care about for now (today, some of the past weekend, and perhaps some of the next one): the cost of defining, ideally immutable, record types, AKA structs, in Python.

So let’s get this out of the way: this write-up is about benchmarking “Python type speed” (informally: compile-time). It is NOT about benchmarking:

- serialization
- instantiation
- attribute access
- validation
- memory

Right, so that’s what Python programmers often care about, because they are probably working on long-running programs, like apps, servers or pipelines, where the cost of defining a type is paid upfront, one time, whereas the cost of allocation, instantiation, validation, and serialization is paid repeatedly. So yeah, if that’s what you care about, this post is not for you. But I did include instance cost benchmarks if you’re curious. 😻

If you already know you care about type definition speed, then jump straight to the analysis in Structs Under Test (SUTs); otherwise keep reading for my motivation and context on this subject.

“how fast to --help”

I tend to work on CLIs for developers and tooling for build systems or test suites where the time from program start to end is what we’re measuring.
Perhaps you’ve noticed that running a command from a CLI may be near instant in a compiled program, but in Python, it can easily be hundreds of milliseconds: perceptible for UX, noticeable in CI/CD, and amplified by repeated calls as part of build system tooling.

Unlike in a compiled language, Python type definitions are not free (free in the sense that they were paid for during compilation, ahem, Rust). They are code to be executed on every startup. And that includes imports of libraries and their type trees and dependency trees. We’ll see in the benchmarks that evil-runtime-metaprogramming, like decorators, metaclasses, or worse, has more of an upfront runtime type generation cost than manual type definitions.

Can we get the best of everything: a Pythonic type definition style, complete static typing and `match` support via `object.__match_args__` (https://docs.python.org/3/reference/datamodel.html), with the speed of a hand-written C struct, and the startup time of a compiled extension? I think so. (Seriously, I’m not sure; I need to do more work, but I have good preliminary data.)

But why not use a compiled language and framework like Rust + clap (https://docs.rs/clap/latest/clap/)? I certainly do, but what can I say? I love the Python ecosystem, build tooling libraries, and the rapidly evolving type system. And I believe that the type system can continue evolving so that we can offload a lot of the correctness to the type checker, and reap runtime speed benefits. That’s what this post is about.

OK, OK, whatever, but why “structs”?

I’ll confess that I am an advocate of functional programming (FP), with little compromise. But the tortured kind, that can’t be bothered to learn Haskell, or study Lisp, and seems to end up rewriting the same handful of patterns in every language. So, it’s not the structs alone that I am after. It’s the sum types and pattern matching.
Long story short, I use sum types and pattern matching everywhere, all the time, from Rust to embedded C, from TypeScript to Python, from JSON to CBOR. Even if you’re not an FP…enthusiast, you’ve likely used them in Python without thinking of them as such, when reaching for `MyType | None` (an Option or Maybe type).

This example imagines that some immutable device info burned onto a ROM is versioned V1 and V2. V1 guaranteed presence of the serial number, but not the manufactured date. V2 guarantees both and adds a bootloader SHA.

```python
from typing import NamedTuple


class DeviceInfoV1(NamedTuple):
    serial_number: str
    manufactured_utc_ms: int | None


class DeviceInfoV2(NamedTuple):
    serial_number: str
    manufactured_utc_ms: int
    bootloader_sha: int


type DeviceInfo = DeviceInfoV1 | DeviceInfoV2
```

`DeviceInfo` is a sum type of two product types, `DeviceInfoV1` and `DeviceInfoV2`, and there are only two representable states, each validated by the type system, not at runtime. Here’s what the naive product type would look like:

```python
class DeviceInfo(NamedTuple):
    serial_number: str
    manufactured_utc_ms: int | None
    bootloader_sha: int | None
```

Invalid runtime states are now possible: `DeviceInfo(serial_number="abc", manufactured_utc_ms=None, bootloader_sha=123)` is a valid instance of the naive product type, but it is not a valid `DeviceInfoV1` or `DeviceInfoV2`. Using a product type instead of a sum type shifts the burden of correctness from the type system to the runtime.

Aw, f&

I promised myself I wouldn’t evangelize FP (Day 583). 💀

It’s not really about FP, that just happens to be my motivation. There are plenty of different ways to utilize Algebraic Data Types (ADTs) in Python, and if you care about Python startup time, then I think you’ll enjoy these benchmark results.

Besides, this can’t be about FP, because functional programmers don’t care about performance, memory, or know anything about compilers and instruction sets.
“Functional programming, strictly defined, is dumb…the way you manage mutable state is by making an entire copy of the data structure with the changes in the new copy of the data structure…here’s the problem: computers, they’re all bags of mutable state.” Chris Lattner, Creator Of Swift On Functional Programming YouTube Odd for the creator of LLVM, Clang, Swift, and Mojo to mischaracterize FP as anything other than an abstraction. I wasn’t aware of the “functional” instruction sets competing with x86 and ARM. wtf-are-we-testing-again WTF are we testing again? I use NamedTuple all the time, mostly because it means I don’t have to add @dataclass frozen=true everywhere, but in the back of my mind I have always believed that NamedTuple must be super efficient and compact, like const struct in C or struct in Rust. Once I realized that I’d been carrying on with this belief for years, I decided to setup this benchmark to understand how much I was truly paying for my types. the-contenders THE CONTENDERS author’s commentary italicized to avoid bias - manual python slotted class: “Native Final Slots” ewwwwwwwwwwww - manual python slotted class Brett Cannon’s manual record-type : “Manual Record Type” oh god that’s even worse, this IS a waste of time, we’re going to turn Python into Java or something - from Python’s standard library, typing module: NamedTuple pewwwwww pew pew pewwwwwwoooo - also from the standard library, dataclasses : dataclass frozen=True boooooooooo metaprogramming suuuuuuuuuuucks…unless it’s rust’s bs…or constexpr…at least it’s not C macros…but booooooooooooooooooo - from legendary Python core developer Br Br Bre Brett Cannon, iiiiiiiit’s record-type a new hope - from a 20 minute Claude hallucination that rips off msgspec and record-type WITH ATTRIBUTIONwait, I’m calling it ? ? 
record-type C hey, 20 minutes is not bad, it usually costs me $200 to get slop CPython - weighing in at 11 years of development, the original 🎺 attrs medieval horns, but in tune 🎺 - fast AF and only ~14.3% vowels iiiiiiiiiiiiiiit’s msgspec what is JSON for anymore? structs-under-test-suts Structs Under Test SUTs | Implementation | Description | |---|---| | Final -annotated fields, the closest thing to a naive native record. manual record manual-record record-type NamedTuple https://typing.python.org/en/latest/spec/namedtuples.html named-tuples typing module dataclass https://docs.python.org/3/library/dataclasses.html dataclasses module frozen dataclass https://docs.python.org/3/library/dataclasses.html frozen-instances @dataclass frozen=true record-type https://pypi.org/project/record-type/ record type for Python record-type C record-type-c record-type and msgspec attrs https://www.attrs.org/en/stable/ msgspec https://github.com/msgspec/msgspec Each of these implementations will be evaluated with and without mypyc https://mypyc.readthedocs.io/ compilation, and as a cold start no bytecode cache and warm start bytecode cache present , when relevant. All of the implementations are tested on a struct type of three ints: struct StructUnderTest { a: int b: int c: int } Refer to the methodology methodology section for details on how the benchmarks were run. module-cost Module Cost When you import your base type or decorator, you also must pay a one time cost, regardless of how many types you define, for that module’s source tree. The cold import is roughly 6-8× a warm one, because the whole transitive source tree has to be recompiled to bytecode. type-cost Type Cost So, how much does it cost to define a type? Remember that this cost is paid once on every program start , or at least when it is first imported . Many of these benefit greatly from a warm start, which is the most common use of a Python program. 
Cold start is included because it’s the first impression that a user gets: “how fast to --help?”

Looking just at the warm start, we can start to see 3 performance tiers:

- ~7-12 µs: native slots, record-type C, msgspec, and manual record
- ~76-96 µs (~8× slower): NamedTuple, record-type
- ~200-370 µs (~20-30× slower): dataclass, dataclass(frozen=True), attrs

The tiers come down to how many methods each implementation has to generate when the type is defined (see Why three type-cost tiers?). The table below shows relative performance.

| implementation | warm | cold | mypyc |
|---|---|---|---|
| native slots | 0.10× | 0.60× | 0.11× |
| record-type C | 0.11× | 0.35× | — |
| msgspec | 0.13× | 0.42× | — |
| manual record | 0.15× | 2.1× | 0.18× |
| NamedTuple | 1.00× | 1.00× | 1.00× |
| record-type | 1.3× | 1.2× | — |
| dataclass | 3.0× | 2.5× | 3.0× |
| attrs | 4.0× | 3.2× | — |
| dataclass(frozen=True) | 4.9× | 3.8× | 5.2× |

Per-type cost: each cell is × the baseline row (NamedTuple). Lower is faster to define. Row labels and the warm / cold / mypyc column headers are matched to Table 1 in Raw data.

So what’s the fastest startup?

Total startup is calculated as the fixed dependency import (the module cost), plus the number of types × the per-type definition cost.

The interactive chart in the original post shows the startup time on the Y axis and the number of types defined on the X axis. The scales can be toggled together between log (Y log10, X log2 from 1 to 4,096 types) and linear (Y clipped to 0ms–1,000ms, X from 0 to 4,096 types). For each implementation, the solid line is the warm time, and the dotted line is the cold time.

Conclusion

For my purposes, I can draw a few conclusions from this.

- NamedTuple (my go-to) is sorta in the middle and is probably not dragging start times too much. But its per-type cost is ~8× the native/C implementations, so as the program grows, it will start to add up.
- msgspec is faster than NamedTuple above ~256 warm type definitions.
But this assumes absolute dependency discipline that negates some of the upsides of Python’s ecosystem. If you import msgspec , or dataclass , anywhere, or if any of your dependencies have a high module or type cost, then NamedTuple ’s low module cost is dwarfed and you may as well have started with a cheaper struct implementation.- The decorator-based implementations dataclass , record-type , and attrs all have a high type cost, but with that comes evil-runtime- metaprogamming capabilities. - The C implementation of record-type is good enough wins by every metric that I’ll be rewriting it and getting it under a test suite. - I will update this article once I have a tested implementation It may be too good to be true - I will definitely be trying out msgspec in the future. I wasn’t familiar with it before working on this report, but it’s very exciting to see these numbers, not to mention that it has de/serialization on top of being a basic struct. I’d love to see CDDL/CBOR 🔥 https://datatracker.ietf.org/doc/html/rfc8610 and postcard ✉️ https://github.com/jamesmunns/postcard de/serializers appendix Appendix Here lives more stuff that wasn’t directly relevant to my goal of assessing startup time, but is still fun. instance-cost Instance Cost What can I say, since the benchmark suite was setup, I couldn’t resist. The instance costs are relevant to the program speed once it’s begun, and you’ll see that they are quite a bit tighter than the module and type cost comparisons. There’s a total spread of under 4x, from ~60ns up to ~220ns per instance. construction Construction | implementation | || |---|---|---| | 0.44× | — | | | 0.45× | — | | | 0.63× | 0.53× | | | 0.63× | 0.77× | | | 1.00× | 1.00× | | | 1.5× | — | | | 1.6× | 0.55× | | | 1.6× | 1.6× | | | 1.6× | — | Per-instance construction cost — each cell is × the baseline row NamedTuple by default; click any row to re-base . Lower is faster. Click a column header to sort. memory Memory Memory is driven by object layout. 
Freezing a type never changes its footprint: `frozen=True` only changes the write path, not the storage. mypyc trades a few bytes per instance (one pointer to its method table, akin to a C++ vtable) for speed, [2] and gives every compiled class a fixed layout even without `__slots__`. [3]

The cost of immutability

Immutability sometimes costs time or space and is never more efficient.

native slots

A plain slotted class with `Final` fields.

```python
from typing import Final


class NativeFinal:
    __slots__ = "a", "b", "c"

    def __init__(self, a: int, b: int, c: int) -> None:
        # Final is a type-checker hint only; nothing enforces it at runtime.
        self.a: Final = a
        self.b: Final = b
        self.c: Final = c
```

The `Final` is for the static checker, meaning that it has zero runtime cost. [4] mypy rejects `o.a = 99`, but the assignment succeeds anyway, on the interpreted class and the compiled `.so`. So this is the closest thing to a native record mypyc can produce: a compact slotted object (64 bytes; 72 compiled) whose `__init__` it lowers to C-level slot stores, but it is not actually immutable at runtime (zero cost abstraction).

manual record

native slots is cheap precisely because it does less. It has no `__eq__`, `__hash__`, or `__repr__`, and, as we saw, it isn’t even immutable. Every other record here gives you all of that. So here is Brett Cannon’s record-type pattern (https://github.com/brettcannon/record-type): a complete, genuinely-immutable hand-written record with `__slots__`, `__match_args__`, a real `__setattr__` guard, and `__eq__` / `__hash__` / `__repr__`:

```python
class ManualRecord:
    __slots__ = "a", "b", "c"
    __match_args__ = "a", "b", "c"

    def __init__(self, a: int, b: int, c: int) -> None:
        # Go through object.__setattr__ to bypass the guard defined below.
        object.__setattr__(self, "a", a)
        object.__setattr__(self, "b", b)
        object.__setattr__(self, "c", c)

    def __setattr__(self, attr, val):
        # Every ordinary attribute assignment is rejected.
        raise TypeError("immutable")

    def __eq__(self, other):
        if not isinstance(other, type(self)):
            return NotImplemented
        return self.a == other.a and self.b == other.b and self.c == other.c

    def __hash__(self):
        return hash((self.a, self.b, self.c))
```

record-type C

The manual record marks the pure-Python performance ceiling: complete and immutable, with near-zero import, but either slow to construct (222 ns) or, once mypyc lowers its `object.__setattr__` `__init__`, fast (78 ns) yet larger (96 bytes).

`msgspec.Struct` shows C clears that ceiling: compact (64 bytes), immutable, ~62 ns construction, ~10 µs/type. Its one catch is the module cost. `import msgspec` runs ~19 ms, because it’s a serialization library and you can’t get just the struct without importing the whole kitchen sink. [5]

Can you get msgspec’s record qualities without its import tax? A research prototype (read: LLM slop) on a branch of Brett Cannon’s record-type (https://github.com/JPHutchins/record-type/pull/1) answers yes. It’s a ~600-(slop)-line C extension: an inheritable `Record` base you subclass (subtype) exactly like `NamedTuple`:

```python
from native_record import Record


class Point(Record):
    a: int
    b: int
    c: int
```

A C metaclass reads the class-body annotations directly (no `inspect`, no `exec`) and builds a frozen, slotted type whose constructor is a C-level vectorcall, borrowing msgspec’s type-creation trick, with none of its codec machinery.

And you saw in the charts above that it wins in every category.

buuuuuuuuuuuuut…

It’s a research prototype, not a release. It lives on a PR branch, not PyPI. And there’s one real semantic limit: a class body can’t express Python’s full parameter grammar (positional-only, keyword-only, `*args`, `**kwargs`) the way `@record`’s function signature can. That is fine for the record-shaped common case, but not literally 1:1 with the decorator.
Per-type here is measured exactly like every other construct — module self-time ÷ K, which includes the ~7 µs the bare class statement costs regardless — so it is directly comparable to the figures above. why-three-type-cost-tiers Why three type-cost tiers? - fastest: native slots , record-type C , msgspec , and manual record - ~8× slower: NamedTuple , record-type - ~20-30× slower: dataclass , dataclass frozen=True , attrs The single best predictor turned out to be how many methods each construct has to generate at class-creation : zero, one, or several. Trace it yourself with codegen probe.py https://github.com/JPHutchins/python-struct-profiling/blob/d8acfd5f63824b87b24e820e6f6859e0194da4c6/codegen probe.py , which captures every exec / eval / compile a single definition triggers. tier-1--nothing-generated Tier 1 — nothing generated. native slots and manual record are hand-written, so their methods compile once into the .pyc and the class statement only has to build the type. msgspec and record-type C generate no Python either. A C metaclass assembles the type directly. tier-2--one-generated-method Tier 2 — one generated method. collections.namedtuple https://github.com/python/cpython/blob/v3.14.0/Lib/collections/ init .py L361 builds a tuple subclass — a descriptor per field and a single eval ’d new : lambda cls, a, b, c: tuple new cls, a, b, c with typing.NamedTuple https://github.com/python/cpython/blob/v3.14.0/Lib/typing.py L3027 adding PEP 649 https://peps.python.org/pep-0649/ annotation handling on top. record-type ’s takes the other road — https://github.com/brettcannon/record-type/blob/2023.1/records.py L86 @record inspect.signature to read the fields, then one exec ’d class whose only generated logic is the init eq / hash / repr come from a Record base : python class C Record : slots = 'a', 'b', 'c' def init self, /, a, b, c - None: object. setattr self, 'a', a object. setattr self, 'b', b object. 
setattr self, 'c', c A metaclass-plus-factory and a decorator-plus- inspect : different machinery, the same one-method’s-worth of work, the same tier. tier-3--several-generated-methods-plus-field-work-and-a-rebuild Tier 3 — several generated methods, plus field work and a rebuild. dataclass https://github.com/python/cpython/blob/v3.14.0/Lib/dataclasses.py L934 turns the annotations into Field objects and generates init , repr , and eq in one shot a factory that returns the three : python def create fn dataclass type a , dataclass type b , dataclass type c , dataclass HAS DEFAULT FACTORY , dataclass builtins object , dataclass init return type , dataclasses recursive repr : def init self, a: dataclass type a , b: dataclass type b , c: dataclass type c - dataclass init return type : self.a=a self.b=b self.c=c @ dataclasses recursive repr def repr self : return f"{self. class . qualname } a={self.a r}, b={self.b r}, c={self.c r} " def eq self,other : if self is other: return True if other. class is self. class : return self.a==other.a and self.b==other.b and self.c==other.c return NotImplemented return init , repr , eq , frozen=True adds three more: setattr , delattr , hash — and slots=True creates the class a second time https://github.com/python/cpython/blob/v3.14.0/Lib/dataclasses.py L1277 , since slots can’t be added in place. attrs https://github.com/python-attrs/attrs/blob/26.1.0/src/attr/ make.py L796 is a more layered version of the same idea. namedtuple-in-mypyc NamedTuple in mypyc I was really hoping that mypyc was going to compile NamedTuple to a native struct. 
Compiling the module interpreted-vs-compiled changes almost nothing about the NamedTuple , while it transforms native slots : | metric | NamedTuple interpreted | NamedTuple mypyc | native slots interpreted | native slots mypyc | |---|---|---|---|---| isinstance , tuple | yes | yes | no | no | | bytes / instance | 88 | 88 | 64 | 72 | new type instructions | 7 bytecodes | 7 bytecodes | C | C | init instance instructions | C | C | 9 bytecodes | C | | instance ns | 138 | 142 | 87.5 | 75.7 | The NamedTuple columns are identical: same footprint, same construct time. Its new is still seven interpreted bytecodes inside the compiled extension module, building a tuple and handing it to tuple. new : 1 RESUME 0 LOAD GLOBAL 1 tuple new + NULL LOAD FAST BORROW LOAD FAST BORROW 1 cls, a LOAD FAST BORROW LOAD FAST BORROW 35 b, c BUILD TUPLE 3 CALL 2 RETURN VALUE Contrast the native record. It has no new at all; its init writes the three fields straight into their slots with STORE ATTR — no tuple, no length field, no boxed item array. The Final annotations add zero bytecode; they’re a pure type-checker hint, so this is byte-for-byte a plain slotted class. 11 RESUME 0 12 LOAD FAST BORROW LOAD FAST BORROW 16 a, self STORE ATTR 0 a 13 LOAD FAST BORROW LOAD FAST BORROW 32 b, self STORE ATTR 1 b 14 LOAD FAST BORROW LOAD FAST BORROW 48 c, self STORE ATTR 2 c LOAD CONST 0 None RETURN VALUE mypyc does lower this init to C — recall its 9 bytecodes became C-level in the compiled column. But for this record you barely see it in the construction numbers 87 → 76 ns, within run-to-run noise : the init is only three STORE ATTR s, and the interpreted timeit loop crosses the interpreter↔native boundary on every call, which caps any gain. Where compiling a hand-written init does pay off is when it does real interpreted work — manual record manual-record routes every field through object. setattr and drops from 222 to 78 ns once compiled, a speedup a frozen dataclass can’t get. 
NamedTuple’s new , by contrast, stays interpreted even when compiled and there’s nothing for mypyc to lower at all without breaking the tuple contract. So, I’ve been right to reach for NamedTuple as a cheaper immutable type than dataclass frozen=True , but I was wrong to think that it was perfectly efficient and compact like a C struct. further-reading Further reading A first-class record type for Python. Brett Cannon’s record-type proposal https://discuss.python.org/t/introducing-record-types-in-python/34397 and a terser struct Point x: int, y: int spelling , with the proof-of-concept record decorator already on PyPI. As proposed it standardizes the boilerplate — a concise frozen, slotted dataclass — rather than adding a performance primitive: a decorator’s generated init stays interpreted, so it can’t push past the pure-Python floor the manual record manual-record maps out. Unboxed value types in mypyc. mypyc 841 https://github.com/mypyc/mypyc/issues/841 tracks the performance angle these benchmarks can’t reach: user-defined unboxed value types ≈16 bytes vs 40 for a heap object , passed around in native code and boxed only when they enter a Python container. mypyc already does this for native integers i64 / i32 — just not yet for user-defined records. Open since 2021 with no implementation: a direction, not a date, and nothing to benchmark yet. methodology Methodology All measurements were taken on a single machine: CPython 3.14.0 installed and managed with uv https://docs.astral.sh/uv/ , mypy/mypyc 2.1.0, attrs 26.1.0, msgspec 0.21.1, and record-type 2023.1.post1, on x86 64 Linux WSL2 with gcc 13.3. The C-backed record-type C is built from the branch linked above https://github.com/JPHutchins/record-type/pull/1 a research prototype, not a release . Absolute numbers will differ on your hardware and Python build; the relative shape is the takeaway. Every struct carries the same three int fields. 
Interpreted vs compiled

The standard-library constructs (plain classes, slotted, `Final`-slotted, `NamedTuple`, and the dataclass variants) live in one module that is the unit of compilation: `mypyc containers.py` produces a `containers` `.so` extension module. An interpreted driver imports that module and detects which form it got by testing whether `__file__` ends in `.so`. This mirrors how mypyc is actually used: you compile the definitions and call into them from ordinary interpreted code.

The attrs, msgspec, and both record-type classes are defined in the driver itself, not in the compiled module, so there is no mypyc form to measure; the charts and tables leave their mypyc column empty rather than copy in the interpreted value. record-type C is already a compiled C extension, so mypyc has nothing to add: it is the native form.

Even inside the compiled `.so`, the `@dataclass` decorator and the `NamedTuple` metaclass run as interpreted CPython, and the `__init__` / `__new__` they generate stay interpreted bytecode: mypyc compiles the module’s own code, not the code those tools synthesize at runtime.

Memory footprint

`sys.getsizeof` reports one object’s size but doesn’t follow the `__dict__` pointer, so it understates classes that carry one. [6] The headline figures instead come from a bulk `tracemalloc` measurement: allocate 200,000 instances and subtract a same-length `[None] * n` list measured the same way, so the list’s own backing storage cancels and what remains is the instances’ allocation (GC header included):

```python
import gc
import tracemalloc


def mem_per_instance(ctor, args, n=200_000):
    gc.collect()
    tracemalloc.start()
    # Baseline: a same-length list of None, measured the same way.
    base = [None] * n
    base_cur, _ = tracemalloc.get_traced_memory()
    # Keep every instance alive in a list so none is freed before measuring.
    objs = [ctor(*args) for _ in range(n)]
    cur, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return (cur - base_cur) / n
```

Treat the per-instance figure as ±one allocator alignment word.
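The `sys.getsizeof` caveat is easy to see on a tiny pair of classes. This is a rough illustration rather than part of the benchmark harness: `Plain` and `Slotted` are hypothetical names, and the exact byte counts vary by CPython version, so none are asserted here.

```python
import sys


class Plain:
    """Ordinary class: instance attributes live in a per-instance __dict__."""

    def __init__(self, a: int, b: int, c: int) -> None:
        self.a, self.b, self.c = a, b, c


class Slotted:
    """Slotted class: attributes are stored inline, so there is no __dict__."""

    __slots__ = ("a", "b", "c")

    def __init__(self, a: int, b: int, c: int) -> None:
        self.a, self.b, self.c = a, b, c


plain, slotted = Plain(1, 2, 3), Slotted(1, 2, 3)

# getsizeof reports the object itself and does not follow the dict pointer.
shallow = sys.getsizeof(plain)
# Adding the dict's own size gives a closer (still approximate) picture.
with_dict = shallow + sys.getsizeof(plain.__dict__)

print(f"Plain   shallow={shallow} B, with __dict__={with_dict} B")
print(f"Slotted shallow={sys.getsizeof(slotted)} B (no __dict__ to follow)")
```

The gap between the two `Plain` numbers is the part a single `sys.getsizeof` call misses, which is why a bulk allocation measurement is the safer headline figure.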
bytecode Bytecode Allocation bytecode is counted with dis.get instructions on new and init unwrapping the staticmethod that wraps a NamedTuple’s new , and disassembled with dis.dis for the listings above. Deallocation has no Python bytecode to count: teardown is C-level tp dealloc / tp free unless a class defines a Python del , which none of these do. 7 user-content-fn-5 per-instance-timing Per-instance timing Construction and attribute access are timed with timeit https://docs.python.org/3/library/timeit.html — the minimum of seven repeats of 1,000,000 iterations for construction, 5,000,000 for access, reported as nanoseconds per operation. The 8 user-content-fn-6 timeit loop is interpreted, so every iteration crosses the interpreter↔native boundary. mypyc’s attribute-access and call speedups land on the compiled→compiled path, so an interpreted loop reaching into a compiled class won’t see them and can read slightly slower — which is why the compiled instantiation numbers sit on top of the interpreted ones rather than below. import--type-construction-time Import / type-construction time The obvious approach — timeit on make dataclass or namedtuple — measures the wrong thing. The dynamic factory forms differ from the @dataclass and class C NamedTuple forms you actually write the functional NamedTuple ... call understates the class-statement form by roughly 3× , and timeit is blind to both mypyc and the one-time cost of importing the supporting library, since those happen before the loop starts. So every import number comes from a fresh interpreter under python -X importtime , reading the self time attributed to the module — self time excludes child imports, so the supporting library isn’t double-counted: Per-type cost. 
Generate a module of K = 200 identical-shape classes in the real class-statement form, import it, and read its self-time; the per-type figure is that self-time ÷ 200, the median of five fresh interpreters (this is what the committed `importtime_sweep.py` reports). Dividing by K folds a small fixed per-module overhead into each figure. What that per-type cost consists of (the methods each construct generates at class-creation) is dissected in Why three type-cost tiers?.

Cold vs warm. “Warm” imports with the `__pycache__` / `.pyc` already written; “cold” deletes `__pycache__` first, so the source is recompiled to bytecode in-process. Their difference is the source→bytecode compile cost (tens of µs/type: ~25–55 here, scaling with each class’s source size).

Dependency import. `python -X importtime -c "import LIB"` in a fresh interpreter gives the cumulative cost of first-importing a library. The cold variant points `PYTHONPYCACHEPREFIX` at an empty directory so the whole transitive source tree must recompile.

mypyc axis. The generated module is compiled with mypyc and the resulting `.so` imported under the same harness. A compiled extension has no Python source to recompile, so there’s no cold/warm gap, yet its per-type creation cost is barely lower than interpreted, likely because type creation is dominated by CPython’s `PyType_Ready`, which runs either way.

The crossover model

The startup chart is a model, not a direct measurement: total startup is taken as a fixed dependency import plus N × the measured per-type construction cost, evaluated for cold and warm. The crossover is where two such lines meet: `N = (dep_b - dep_a) / (per_type_a - per_type_b)`. It assumes a single dependency imported once and a linear per-type cost (both hold well here); the cold curves roll up shared sub-dependencies, so several of these libraries imported together cost less than the sum of their individual lines.
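The crossover formula can be checked by hand. The sketch below is an illustration, not code from the repository: `startup_us` and `crossover` are hypothetical helpers, and the inputs are the warm figures for `NamedTuple` (imported via `typing`) and frozen msgspec taken from Tables 1 and 2 in Raw data.

```python
def startup_us(dep_ms: float, per_type_us: float, n_types: int) -> float:
    """Modelled startup in µs: fixed dependency import plus N per-type costs."""
    return dep_ms * 1000.0 + n_types * per_type_us


def crossover(dep_a_ms: float, per_type_a_us: float,
              dep_b_ms: float, per_type_b_us: float) -> float:
    """N where lines a and b meet: (dep_b - dep_a) / (per_type_a - per_type_b)."""
    return (dep_b_ms - dep_a_ms) * 1000.0 / (per_type_a_us - per_type_b_us)


# Warm figures from Tables 1 and 2: (dependency import ms, per-type µs).
NAMEDTUPLE = (4.0, 76.2)  # typing import, NamedTuple per-type cost
MSGSPEC = (19.1, 10.2)    # msgspec import, frozen msgspec per-type cost

n = crossover(*NAMEDTUPLE, *MSGSPEC)
print(f"lines cross near N = {n:.0f} types")
print(f"at 256 types: NamedTuple {startup_us(*NAMEDTUPLE, 256):.0f} µs, "
      f"msgspec {startup_us(*MSGSPEC, 256):.0f} µs")
```

With these inputs the two lines meet at roughly 229 types, so by 256 types the msgspec line is already the lower one, in line with the "above ~256" figure in the conclusion.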
Reporting

Bytes and counts are integers; timing data is quoted to three significant figures. Import timings vary run to run, so each is reported as the median of five fresh processes; instantiation is the minimum of seven `timeit` repeats (the conventional low-noise estimator). Treat the per-instance nanosecond figures as ±10%: the construct-to-construct shape is what’s robust, not the third digit.

Limitations and cross-validation

One machine, no isolation. Everything ran on a single WSL2 host with no CPU pinning or frequency-scaling control. WSL2 sits on Hyper-V, as does the Windows install beside it, so there’s no bare-metal baseline on this box and no WSL2-specific virtualization penalty to factor out either. Repeating on separate hardware, several Python versions, and a second OS would confirm the shape; pinning the CPU steadies the absolute numbers. [9]

Compiled construction is timed from an interpreted loop. That measures the common interpreted-caller-into-compiled-class case, not compiled→compiled throughput. A benchmark loop itself compiled with mypyc would show whether its call and attribute speedups close the gap.

“Cold” is a cold bytecode cache, not a cold disk. The source stays in the OS page cache between runs, so the cold figures isolate source→bytecode compilation, not first-read I/O.

Per-type cost is self-time ÷ K. That folds a small fixed per-module overhead into each figure; a regression over several values of K would separate the fixed cost from the per-type slope (the correction is sub-microsecond for the cheap constructs).

msgspec’s ~19 ms import is library-wide. [5] There is no struct-only import to isolate (the codec comes with it), so it’s a fair number to report but not a pure struct-definition cost.

record-type C is a research prototype. Its numbers may shift once it’s hardened and packaged.

Five runs is modest.
More repeats, and reporting dispersion alongside the median, would tighten the import figures.

Reproducing

Everything here is reproducible from the python-struct-profiling (https://github.com/JPHutchins/python-struct-profiling) repository; the data in this post was produced at commit b2f2eb7 (https://github.com/JPHutchins/python-struct-profiling/tree/b2f2eb7da90762e51957189f130d10f22d2eb77a). Two committed harnesses produce every number, and a third dissects the type-definition mechanism, all on the same machine, all carrying the identical three-int-field shape:

- bench.py: memory (tracemalloc), bytecode (dis), and instantiation (timeit), run once against the interpreted module and once against the mypyc-compiled containers.so.
- importtime_sweep.py: the import / type-creation axis. It generates a module of K real class-statement / decorator forms per construct, imports it under python -X importtime in a fresh interpreter, and divides the module self-time by K. The figures here are --k 200 --runs 5.
- codegen_probe.py (added at d8acfd5): the mechanism behind the three type-cost tiers (#why-three-type-cost-tiers). It traces the exec / eval / compile each construct runs at class creation and counts how many methods each one generates (zero, one, or several).

Raw data

Every figure above is derived from this one table set (the charts and these tables read the same array, so they cannot disagree):

Table 1: Import / type-creation cost, µs per class (median of 5 fresh -X importtime runs, K = 200). mypyc is the compiled .so; “—” means the construct is off the compiled axis (attrs, msgspec, and both record-types are defined outside the compiled module; record-type C is already a C extension).

| construct | variant | warm | cold | mypyc |
|---|---|---|---|---|
| native slots | mutable | 7.3 | 59.3 | 6.9 |
| native slots | frozen | 7.4 | 62.2 | 6.9 |
| manual record | frozen | 11.5 | 214.5 | 11.1 |
| NamedTuple | frozen | 76.2 | 104.3 | 63.3 |
| dataclass | mutable | 228.4 | 261.0 | 190.3 |
| dataclass | frozen | 373.4 | 401.2 | 328.5 |
| record-type | frozen | 96.4 | 122.4 | — |
| record-type C | frozen | 8.6 | 36.0 | — |
| attrs | mutable | 264.6 | 288.7 | — |
| attrs | frozen | 301.4 | 332.2 | — |
| msgspec | mutable | 10.5 | 40.1 | — |
| msgspec | frozen | 10.2 | 44.0 | — |

Table 2: One-time dependency import, milliseconds (cumulative in a fresh interpreter). Paid once per process regardless of how many types you define. The native record imports no library.

| library | warm | cold |
|---|---|---|
| native (none) | 0.0 | 0.0 |
| manual (none) | 0.0 | 0.0 |
| typing | 4.0 | 33.9 |
| dataclasses | 11.5 | 81.9 |
| record-type | 12.5 | 91.3 |
| record-type C | 0.2 | 0.2 |
| attrs | 22.2 | 128.5 |
| msgspec | 19.1 | 131.7 |

Table 3: Per-instance memory, bytes (tracemalloc, GC header included). Freezing never changes the footprint; mypyc adds one 8-byte vtable word to the native classes it compiles.

| construct | variant | interpreted | mypyc |
|---|---|---|---|
| native slots | mutable | 64 | 72 |
| native slots | frozen | 64 | 72 |
| manual record | frozen | 64 | 96 |
| NamedTuple | frozen | 88 | 88 |
| dataclass | mutable | 64 | 72 |
| dataclass | frozen | 64 | 72 |
| record-type | frozen | 64 | — |
| record-type C | frozen | 64 | — |
| attrs | mutable | 80 | — |
| attrs | frozen | 80 | — |
| msgspec | mutable | 64 | — |
| msgspec | frozen | 64 | — |

Table 4: Instantiation, nanoseconds (min of 7 timeit repeats of 1e6 iterations). The timeit loop is interpreted, so a compiled class called from it shows no mypyc speedup, and can read noticeably slower from the per-call interpreter↔native boundary (e.g. mutable dataclass 87.5→109.5).
Treat these as ±10%; the construct-to-construct shape is the robust signal, not small interpreted-vs-mypyc deltas.

| construct | variant | interpreted | mypyc |
|---|---|---|---|
| native slots | mutable | 87.3 | 75.2 |
| native slots | frozen | 87.5 | 75.7 |
| manual record | frozen | 222.5 | 78.4 |
| NamedTuple | frozen | 138.3 | 141.5 |
| dataclass | mutable | 87.5 | 109.5 |
| dataclass | frozen | 224.3 | 226.0 |
| record-type | frozen | 227.0 | — |
| record-type C | frozen | 61.2 | — |
| attrs | mutable | 88.5 | — |
| attrs | frozen | 209.1 | — |
| msgspec | mutable | 63.0 | — |
| msgspec | frozen | 62.5 | — |

Table 5: Construction bytecode, instruction counts from dis. “C” = no Python bytecode (C-level). Freezing is what turns the 9-instruction __init__ into 25 (every field routed through object.__setattr__); these counts are unchanged inside the compiled module except the native __init__, which mypyc lowers to C.

| construct | __new__ | __init__ (mutable) | __init__ (frozen) |
|---|---|---|---|
| native slots | C | 9 | 9 |
| manual record | C | — | 24 |
| NamedTuple | 7 | — | C |
| dataclass | C | 9 | 25 |
| record-type | C | — | 24 |
| record-type C | C | — | C |
| attrs | C | 9 | 25 |
| msgspec | C | C | C |

Derived: the NamedTuple ↔ msgspec startup crossover sits at 229 types (warm) and 1,622 types (cold), computed from Tables 1 and 2.

Footnotes

1. If you have any ideas, please LMK so I can explain it to my family.
2. “Introduction” (https://mypyc.readthedocs.io/en/stable/introduction.html). mypyc.readthedocs.io. Retrieved 2026-06-21. “Classes are compiled to C extension classes. They use vtables for fast method calls and attribute access.”
3. “Native classes” (https://mypyc.readthedocs.io/en/stable/native_classes.html). mypyc.readthedocs.io. Retrieved 2026-06-21. “Only attributes defined within a class definition (or in a base class) can be assigned to (similar to using __slots__).”
4. “typing.Final” (https://docs.python.org/3/library/typing.html#typing.Final). docs.python.org. Retrieved 2026-06-21. “There is no runtime checking of these properties.” See also PEP 591 (https://peps.python.org/pep-0591/).
5. src/msgspec/__init__.py. github.com/jcrist/msgspec. Retrieved 2026-06-21. Struct is imported from the compiled ._core extension, and importing the package eagerly runs from . import inspect, json, msgpack, structs, toml, yaml; the codecs in json.py / msgpack.py re-export from that same _core, so there is no struct-only import to isolate.
6. “sys.getsizeof” (https://docs.python.org/3/library/sys.html#sys.getsizeof). docs.python.org. Retrieved 2026-06-21. “Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.”
7. “tp_dealloc” (https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_dealloc). docs.python.org. Retrieved 2026-06-21. “A pointer to the instance destructor function. … free all memory buffers owned by the instance, and call the type’s tp_free function to free the object itself.”
8. “timeit” (https://docs.python.org/3/library/timeit.html). docs.python.org. Retrieved 2026-06-21. The module “provides a simple way to time small bits of Python code”; the minimum is reported because “the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in.”
9. “Comparing WSL Versions” (https://learn.microsoft.com/en-us/windows/wsl/compare-versions). learn.microsoft.com. Retrieved 2026-06-21. “WSL 2 is running as a Hyper-V virtual machine.” The Windows host beside it is itself a partition on that same hypervisor; see “Hyper-V Architecture” (https://learn.microsoft.com/en-us/windows-server/virtualization/hyper-v/architecture): “The Microsoft hypervisor must have at least one parent, or root, partition, running Windows … which has direct access to hardware devices.”

© 2026 by JP Hutchins. Published under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/) license.