# Passing DBs Through Continuations

> Source: <https://remy.wang/blog/cps.html>
> Published: 2026-06-07 01:17:40+00:00

*Dedicated to the Minnowbrook
Analytic Reasoning Seminar with special thanks to Kris Micinski and Michael Ballantyne*

Suppose you want to write a database. You'd probably start by
implementing relational algebra operators — projection, filter, join,
etc. The easy way is to implement them as functions that take in tables
and return tables, and assemble them into a larger expression. That was
how [Prela](https://prela-lang.org) worked in its first
incarnation. The code was clean, but it was hella slow! Which was not
surprising, because every operator materialized every intermediate
result. The standard solution to this is the [iterator
model](https://cs-people.bu.edu/mathan/reading-groups/papers-classics/volcano.pdf), where each operator implements an *Iterator* interface
that streams intermediate tables row by row instead of materializing
them. But implementing the iterator model naively still incurs overhead:
every call to `Iterator.next()`

triggers a dynamic dispatch,
which costs vtable lookups and destroys cache locality. There are two
standard remedies: [vectorization
and compilation](https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf). A vectorized database amortizes the overhead by
implementing `Iterator.next_batch()`

which returns a whole
batch of data that can be processed together; a compiled database, well,
compiles the incoming query directly to fast machine code that runs
without any dynamic dispatch. Either approach takes a lot of very smart
people spending their entire working life to build, and it's why systems
like DuckDB and Umbra exist. I'm moderately smart but don't have a lot
of time, so I was looking for a shortcut. The [shortcut](https://dl.acm.org/doi/10.1145/165180.165214) I
stumbled upon was so beautiful that I literally cried[ 1](#fn1)
when I finally understood it, and I hope my explanation below will make
you cry too :' )

To keep things simple, let's suppose we're just dealing with lists of
numbers, and we want to do two very simple things to them:
`inc`

adds 1 to every number, and `dbl`

doubles
them. That's pretty easy to write:2

```
inc(xs) = [x + 1 for x in xs]

dbl(xs) = [2 * x for x in xs]
```

Now, we can chain them together with `dbl(inc(xs))`

which
will do two steps in sequence. Problem is, because each function takes
in a list and returns a list, our program produces an
*intermediate*, namely `inc(xs)`

. This allocates a new
list only to be thrown away by the call to `dbl`

. Things only
gets worse when we chain together multiple calls to `inc`

and
`dbl`

. A more efficient implementation would *fuse*
together the operations:

```
inc_n_dbl(xs) = [2 * (x + 1) for x in xs]
```

Of course, we can't write down every possible combination of operators like this. Is there a way to define each operator modularly, yet still have them compose into tightly fused operations automatically? Yes, if we use a bit of magic from functional compilers — continuation-passing style (CPS).

The key idea of CPS is to define operators that *do* things
instead of *making* things. `inc`

and `dbl`

as defined above each takes in a list and *makes* a list.
Instead, the CPS version of each operator takes in a list and an
additional input `k`

: this `k`

is a function that
the caller passes in, specifying what it wants to do with each element
after the operator's work is done. `k`

is called the
*continuation*. Let's look at some code:

```
function inc(xs, k)
  for x in xs
    k(x + 1)
  end
end
```

Now suppose `k`

is the `print`

function, then
`inc`

as defined above will add 1 to each number, then print
the result. Note that nothing is returned, and `inc`

only
does its job (adding 1) then performs what it's told to (apply
`k`

). As an exercise, you can try and write down
`dbl`

in CPS style.

But currently each of `inc`

and `dbl`

still
takes in a list, and there's no obvious way to compose multiple
operators. To do that, we replace `xs`

with a "child"
operator `op`

:

``` php
inc(op, k) = op(x -> k(x + 1))

dbl(op, k) = op(x -> k(x * 2))

function scan(xs, k)
  for x in xs
    k(x)
  end
end
```

Intuitively, `inc`

now trusts its child `op`

to
do its job, namely, that `op`

will apply the continuation it
receives to each item. So instead of iterating over `xs`

,
`inc`

simply tags the `+ 1`

step onto the
continuation and passes it to `op`

. I've also defined a
"source" operator `scan`

that connects the input list to the
operators. Let's see the code in action.

`inc(scan(xs), print)`

.`inc`

, this will call
`scan(xs, x -> print(x + 1))`

`scan`

, this gets us
`for x in xs; print(x + 1); end`

So chaining together `inc`

and `scan`

indeed
does what we want! Now let's try a longer chain
`dbl(inc(scan(xs)), print)`

:

`dbl`

gets us
`inc(scan(xs), x -> print(x * 2))`

`inc`

gets us
`scan(xs, x -> print((x + 1) * 2))`

`scan`

gets us
`for x in xs; print((x + 1) * 2); end`

Notice how I used the word `expand`

— if we annotate every
operator definition with `@inline`

, the compiler will
actually unfold the code as we did above, and an operator chain gets
compiled down to a fused loop in the end! You can try expanding longer
chains like `dbl(inc(dbl(inc(scan(xs)))), print)`

to get some
practice thinking about CPS. Julia also has handy tools like
`@code_typed`

that lets you inspect the compiled code, or the
aptly named [Cthulhu.jl](https://github.com/JuliaDebug/Cthulhu.jl) that does
that interactively. In summary, the example shows that if we define
operators modularly with CPS, inlining the definitions will
automatically produce tightly fused compiled code.

None of these is really new, and have been known to functional
programmers for decades by the name of [deforestation](https://en.wikipedia.org/wiki/Deforestation_(computer_science)).
But when implemented in Prela, something incredible happens: *a clean
CPS-style interpreter for Prela automagically recovers fast columnar
execution when compiled!*

The central design principle behind Prela is "everything is a
*binary* relation". This means Prela maps cleanly to both a
logical Entity/Relationship data model, as well as to a columnar
physical storage. I won't go into details here, but encourage you to
play with the language to get a feel for that. Practically, this means
we fully normalize every wide table with m attributes to m binary
relations. For example, a table `movie`

with columns
`ID, year, title`

becomes:

`movie`

) over
`ID`

(you can think of this as a unary table over
`ID`

)`ID`

to `year`

`ID`

to `title`

But now the issue is, even for a simple
`SELECT * FROM movie`

we need to join together 3 different
tables! Whereas a column store would simply run:

```
for i in 0:n
  print(id_col[i], year_col[i], title_col[i])
end
```

In other words, the column store *co-iterates* the columns in
one pass to compute the query.

Let's first look at how Prela used to run this query. The most
important operator in Prela is the *relation composition* \rightarrow, which generalizes function
composition the same way relations generalize functions. In standard
relational algebra: R \rightarrow S = \pi_{x,
z}(R \Join_{R.y = S.y} S) where R's schema is over x and y, and
S's schema is over y and z. The second most important Prela operator is
the *product* \times which takes
two binary relations and joins them: R \times
S = R \Join_{R.x = S.x} S where R's schema is over x, y, and S's
schema is over x and z.

So `SELECT * FROM movie`

is spelled \text{movie} \rightarrow \text{year} \times
\text{title} in Prela. Now, this requires first joining
`year`

with `title`

, whose result is joined with
`movie`

. We can make this a bit cheaper if we can assume the
primary key `ID`

s are dense and continuous, in which case we
can just store `year`

as an array of integers and
`title`

as an array of strings; and for `ID`

s, we
only need to store one single number `n`

which says how many
IDs there are. But even with this, we still need to do the work to join
the tables.

Instead, let's define *compose* and *product* in
CPS:

``` php
compose(lhs, rhs, k) = lhs((x, y) -> rhs(y, (z -> k(x, z))))

product(lhs, rhs, x, k) = lhs(x, (y -> rhs(x, (z -> k((y, z))))))

scan_id(n, k) = for i in 0:n; k(i, i); end

probe(col, i, k) = k(col[i])
```

Let's go over each line carefully. Semantically `compose`

is supposed to return a (binary) relation by composing the
`lhs`

and `rhs`

relations. In CPS, its job is to
apply `k`

to every pair in this composition. Its first
argument, `lhs`

, represents a binary relation and applies the
given continuation to every pair. The `rhs`

is a little
different: it represents a relation that supports *lookup*, i.e.,
`rhs(key, k)`

will look up the values associated with
`key`

, then apply `k`

to each such value. Now
going back to `compose`

— we're saying that, for each
`(x, y)`

tuple in the LHS, we will look up `y`

from the RHS, then for each matching `z`

, we apply
`k(x, z)`

. In loops this will be:

```
for (x, y) in lhs
  for z in rhs[y]
    k(x, z)
  end
end
```

Which is exactly a hash join and a projection that throws away
`y`

.

Next, `product`

itself is a relation supporting lookup,
and so are its arguments. To lookup `x`

in a product, we
first look it up from the `lhs`

which gets us a bunch of
`y`

s. Then for each `y`

, we now look up
`x`

from the `rhs`

, getting a bunch of
`z`

s. Finally, for each `(y, z)`

pair, we apply
the continuation `k((y, z))`

. In loops:

```
for y in lhs[x]
  for z in rhs[x]
    k((y, z))
  end
end
```

This is what will happen if you look up `x`

in R \Join_{R.x = S.x} S.

Finally, we have the "source" operators `scan_id`

and
`probe`

. `scan_id`

is `scan`

but
specialized for a dense ID relation where we only store `n`

:
it simply increments `i`

from 0 to `n`

and applies
`k`

to `(i, i)`

. `probe`

represents an
input relation that supports looking up a primary key `i`

and
which is backed by a dense vector, so looking up an ID `i`

simply indexes `col[i]`

and applies `k`

to the
value.

We're now ready to put everything together and pull the trigger: the
Prela query \text{movie} \rightarrow
\text{year} \times \text{title} desugars to
`compose(scan_id(n), product(probe(year), probe(title)))`

where `n`

is the number of movies. Here are the definitions
again for reference:

``` php
compose(lhs, rhs, k) = lhs((x, y) -> rhs(y, (z -> k(x, z))))
product(lhs, rhs, x, k) = lhs(x, (y -> rhs(x, (z -> k((y, z))))))
scan_id(n, k) = for i in 0:n; k(i, i); end
probe(col, i, k) = k(col[i])
```

Expand `compose`

:

``` php
scan_id(n, (x, y) -> product(probe(year), probe(title), y, (z -> k(x, z))))
```

Expand `scan_id`

:

``` php
for i in 0:n
  product(probe(year), probe(title), i, (z -> k(i, z)))
end
```

Expand `product`

:

``` php
for i in 0:n
  probe(year, i, (y -> probe(title, i, (z -> k(i, (y, z))))))
end
```

Expand `probe`

:

```
for i in 0:n
  k(i, (year[i], title[i]))
end
```

And taking `k = print`

, we finally have:

```
for i in 0:n
  print(i, (year[i], title[i]))
end
```

Spectacular!!

To keep the examples small, I've made several simplifications. The
actual Prela implementation defines two methods `drive`

and
`probe`

for each operator which fire depending on how the
operator is accessed — scanned or probed. But that's pretty much it, and
the complete source is around 1000 lines of Julia code in a single file,
supporting select, project, join, groupby, aggregation, CTEs, UDFs, and
with performance matching DuckDB on TPCH and Join Order Benchmark. I
should note that the performance numbers are riding on lots of
assumptions though, the strongest ones being:

Nevertheless, the CPS approach cleanly separates the responsibility of the data engine — which is to map queries to efficient code in some target language — and the responsibility of the compiler which is to produce fast code quickly, and we see there's a lot of room for the compiler to improve. The assumption that PKs are dense can also be relaxed with the help of B-trees, bitmap filters, and other data structures.

But perhaps the biggest strength of the CPS style is that it makes
Prela *extensible*, as users can write their own operator in a
natural way with a few lines of code, with the assurance that it will be
compiled and fused with the rest of the query for efficient
execution.

Long before AI psychosis, there was FP psychosis,
clinically defined as the intense psychological response to
understanding functional programming concepts like recursion, higher
order functions, monads, or in this case, continuation passing style.[↩︎](#fnref1)

All code in this post is in Julia.[↩︎](#fnref2)

`scan(xs)`

stands for
`k -> scan(xs, k)`

, i.e., it is the [curried](https://en.wikipedia.org/wiki/Currying) application of
`scan`

to `xs`

. Similarly for
`inc(scan(xs))`

below which curries `inc`

with
`scan(xs)`

.[↩︎](#fnref3)