Durable execution, the hard way

A developer has published a step-by-step guide to building a durable execution engine from scratch using Go and Postgres, inspired by Kelsey Hightower's "Kubernetes the hard way." The guide, which implements a minimal but fully working workflow engine across seven lessons, is designed to help engineers understand how systems like Hatchet and Temporal operate at a deeper level.

Inspired by Kelsey Hightower's Kubernetes the hard way (https://github.com/kelseyhightower/kubernetes-the-hard-way), we're going to build a durable execution engine from scratch using Go and Postgres.

Durable execution is a mechanism to incrementally checkpoint the state of a function as it makes progress, so that in the case of unexpected failure, the function can recover from where it left off. It's particularly relevant in newer stacks and projects implementing AI agents, which are long-running and stateful. A system which implements durable execution is often called a "workflow engine."

This guide uses Go and templated SQL using sqlc (https://sqlc.dev/). The only dependencies are:

- Go 1.25+
- Postgres (by default, created via Docker)
- pgx

If you are interested in contributing support for other languages, please create a GitHub issue. I'll be sharing updates (new lessons, other languages) for this guide on Twitter (https://x.com/abelanger5) if you'd like to follow along.

You will benefit from this guide if you:

- Want to understand how durable execution engines like Hatchet and Temporal work at a deeper level
- Are implementing your own workflow engine and would like a simple starting point for your architecture

This guide expects that you understand the foundations of SQL databases, can read code, and are familiar with some minimal backend engineering concepts, such as queues. More advanced terminology will be introduced in each lesson.

For a motivating guide on durable execution, see the blog post How to think about durable execution (https://hatchet.run/blog/durable-execution).

Each directory in /lessons (/hatchet-dev/durable-execution-the-hard-way/blob/main/lessons) is set up with an identical structure:

- A README.md file for navigating the lesson
- A main.go file for running the example code produced by the lesson, which can be run via `go run .`
- A sql directory which contains a schema.sql file, a queries.sql file, and some files for generating templated queries via sqlc

By the final lesson, we'll have a minimal but fully working workflow engine. Note that these lessons are not focused on developer ergonomics: we'll be building the bare minimum to understand the fundamentals, but won't implement the typical niceties you'd see in a client SDK.

The lessons are:

- Prerequisites (/hatchet-dev/durable-execution-the-hard-way/blob/main/lessons/01-prerequisites)
- Simple task queue (/hatchet-dev/durable-execution-the-hard-way/blob/main/lessons/02-simple-task-queue)
- Limiting concurrent tasks (/hatchet-dev/durable-execution-the-hard-way/blob/main/lessons/03-limiting-concurrent-tasks)
- Task queue improvements (/hatchet-dev/durable-execution-the-hard-way/blob/main/lessons/04-task-queue-improvements)
- Durable event log (/hatchet-dev/durable-execution-the-hard-way/blob/main/lessons/05-durable-event-log)
- Tracking non-determinism (/hatchet-dev/durable-execution-the-hard-way/blob/main/lessons/06-non-determinism)
- Durable tasks (/hatchet-dev/durable-execution-the-hard-way/blob/main/lessons/07-durable-tasks)

This guide is a somewhat opinionated view on durable execution. Specifically, it implements:

- Durable execution entirely in Postgres.
- Two types of functions: durable tasks and regular tasks. These map directly to durable tasks and tasks in Hatchet, and are akin to Temporal workflows and activities.
- Regular tasks invokable as standalone tasks, meaning this guide also implements a simple Postgres-backed task queue in the first few lessons.
- Multiple types of retries and replays, which are treated as distinct:
  - Retries will retry a durable task without resetting the event history, preserving the execution state of the function
  - Replays will reset a durable task's execution history to start from scratch
  - Forking will reset a durable task's execution history at a given point in the execution history, effectively creating a "fork" of that task.
    This will be the subject of a future lesson.

You can modify the schema, queries, and code in each lesson to experiment. To regenerate the SQL files in each directory, run the following:

`go run github.com/sqlc-dev/sqlc/cmd/sqlc generate --file sql/sqlc.yaml`

If you discover an error in the core logic of any lesson, please file a GitHub issue. We'd be happy to reward you with a baked good from a bakery near you (yes, we're serious). If a bakery isn't available, we'd be happy to send you a Hatchet tee or hat. If you understandably don't want more vendor swag, you'll have my eternal gratitude.

AI has not been used to write any prose in this guide. All mistakes and turns of phrase are my own. AI has been used to:

- Verify that each lesson of this guide is independently runnable and instructions are easy to follow
- Generate mermaid diagrams

If there's sufficient interest, I'd be happy to put together additional lessons, such as:

- Using Postgres LISTEN / NOTIFY to speed up processing significantly
- Durable sleep
- Branching and forking the durable event log
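
To make the distinction between retries, replays, and forks described earlier more concrete, here is a minimal in-memory sketch in Go. It is not the code from the lessons: the guide stores its event history in Postgres, while this example keeps it in a slice so it runs without a database. The names `EventLog`, `Step`, `Replay`, `Fork`, and `runTask` are hypothetical, and modeling a fork as a copy of the history prefix is an assumption made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// EventLog is a hypothetical in-memory stand-in for a durable event history.
// Each entry is the checkpointed result of one completed step.
type EventLog struct {
	events []string
}

// Step returns the checkpointed result at position idx if one exists.
// Otherwise it runs fn and, on success, appends the result as a checkpoint.
// It assumes steps are invoked in order (idx 0, 1, 2, ...).
func (l *EventLog) Step(idx int, fn func() (string, error)) (string, error) {
	if idx < len(l.events) {
		return l.events[idx], nil // already checkpointed: skip the work
	}
	out, err := fn()
	if err != nil {
		return "", err // failed steps are not checkpointed
	}
	l.events = append(l.events, out)
	return out, nil
}

// Replay discards the whole history so the task starts from scratch.
func (l *EventLog) Replay() { l.events = nil }

// Fork returns a new log holding only the first `at` checkpoints.
// Assumption: a fork is modeled as a copy, leaving the original untouched.
func (l *EventLog) Fork(at int) *EventLog {
	if at < 0 {
		at = 0
	}
	if at > len(l.events) {
		at = len(l.events)
	}
	return &EventLog{events: append([]string(nil), l.events[:at]...)}
}

// runTask is a toy durable task with three steps. calls counts how often each
// step body really executes; failAt names a step that simulates a crash.
func runTask(l *EventLog, calls map[string]int, failAt string) error {
	for i, name := range []string{"fetch", "transform", "store"} {
		_, err := l.Step(i, func() (string, error) {
			calls[name]++
			if name == failAt {
				return "", errors.New("simulated crash in " + name)
			}
			return name + ":done", nil
		})
		if err != nil {
			return err
		}
	}
	return nil
}

func main() {
	history := &EventLog{}
	calls := map[string]int{}

	// First attempt fails in the last step; earlier checkpoints survive.
	err := runTask(history, calls, "store")
	fmt.Println("attempt 1 error:", err, "| checkpoints:", len(history.events))

	// Retry: the history is kept, so only the failed step executes again.
	err = runTask(history, calls, "")
	fmt.Println("retry error:", err, "| step executions:", calls)

	// Fork after the first checkpoint: later steps run again on the fork.
	fork := history.Fork(1)
	_ = runTask(fork, calls, "")
	fmt.Println("after fork | step executions:", calls)

	// Replay: the history is reset, so every step executes again.
	history.Replay()
	_ = runTask(history, calls, "")
	fmt.Println("after replay | step executions:", calls)
}
```

The `calls` counter is the part worth watching: a retry only increments the step that failed, a fork increments every step after the fork point, and a replay increments all of them.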