Passing DBs Through Continuations

wpnews.pro

Dedicated to the Minnowbrook Analytic Reasoning Seminar with special thanks to Kris Micinski and Michael Ballantyne

Suppose you want to write a database. You'd probably start by implementing relational algebra operators — projection, filter, join, etc. The easy way is to implement them as functions that take in tables and return tables, and assemble them into a larger expression. That was how Prela worked in its first incarnation. The code was clean, but it was hella slow! Which was not surprising, because every operator materialized every intermediate result. The standard solution to this is the iterator model, where each operator implements an Iterator interface that streams intermediate tables row by row instead of materializing them. But implementing the iterator model naively still incurs overhead: every call to Iterator.next()

triggers a dynamic dispatch, which costs vtable lookups and destroys cache locality. There are two standard remedies: vectorization and compilation. A vectorized database amortizes the overhead by implementing Iterator.next_batch()

which returns a whole batch of data that can be processed together; a compiled database, well, compiles the incoming query directly to fast machine code that runs without any dynamic dispatch. Either approach takes a lot of very smart people spending their entire working life to build, and it's why systems like DuckDB and Umbra exist. I'm moderately smart but don't have a lot of time, so I was looking for a shortcut. The shortcut I stumbled upon was so beautiful that I literally cried 1 when I finally understood it, and I hope my explanation below will make you cry too :' )

To keep things simple, let's suppose we're just dealing with lists of numbers, and we want to do two very simple things to them: inc

adds 1 to every number, and dbl

doubles them. That's pretty easy to write:2

inc(xs) = [x + 1 for x in xs]

dbl(xs) = [2 * x for x in xs]

Now, we can chain them together with dbl(inc(xs))

which will do two steps in sequence. Problem is, because each function takes in a list and returns a list, our program produces an intermediate, namely inc(xs)

. This allocates a new list only to be thrown away by the call to dbl

. Things only gets worse when we chain together multiple calls to inc

and dbl

. A more efficient implementation would fuse together the operations:

inc_n_dbl(xs) = [2 * (x + 1) for x in xs]

Of course, we can't write down every possible combination of operators like this. Is there a way to define each operator modularly, yet still have them compose into tightly fused operations automatically? Yes, if we use a bit of magic from functional compilers — continuation-passing style (CPS).

The key idea of CPS is to define operators that do things instead of making things. inc

and dbl

as defined above each takes in a list and makes a list. Instead, the CPS version of each operator takes in a list and an additional input k

: this k

is a function that the caller passes in, specifying what it wants to do with each element after the operator's work is done. k

is called the continuation. Let's look at some code:

function inc(xs, k)
  for x in xs
    k(x + 1)
  end
end

Now suppose k

is the print

function, then inc

as defined above will add 1 to each number, then print the result. Note that nothing is returned, and inc

only does its job (adding 1) then performs what it's told to (apply k

). As an exercise, you can try and write down dbl

in CPS style.

But currently each of inc

and dbl

still takes in a list, and there's no obvious way to compose multiple operators. To do that, we replace xs

with a "child" operator op

:

inc(op, k) = op(x -> k(x + 1))

dbl(op, k) = op(x -> k(x * 2))

function scan(xs, k)
  for x in xs
    k(x)
  end
end

Intuitively, inc

now trusts its child op

to do its job, namely, that op

will apply the continuation it receives to each item. So instead of iterating over xs

, inc

simply tags the + 1

step onto the continuation and passes it to op

. I've also defined a "source" operator scan

that connects the input list to the operators. Let's see the code in action.

inc(scan(xs), print)

.inc

, this will call scan(xs, x -> print(x + 1))

scan

, this gets us for x in xs; print(x + 1); end

So chaining together inc

and scan

indeed does what we want! Now let's try a longer chain dbl(inc(scan(xs)), print)

:

dbl

gets us inc(scan(xs), x -> print(x * 2))

inc

gets us scan(xs, x -> print((x + 1) * 2))

scan

gets us for x in xs; print((x + 1) * 2); end

Notice how I used the word expand

— if we annotate every operator definition with @inline

, the compiler will actually unfold the code as we did above, and an operator chain gets compiled down to a fused loop in the end! You can try expanding longer chains like dbl(inc(dbl(inc(scan(xs)))), print)

to get some practice thinking about CPS. Julia also has handy tools like @code_typed

that lets you inspect the compiled code, or the aptly named Cthulhu.jl that does that interactively. In summary, the example shows that if we define operators modularly with CPS, inlining the definitions will automatically produce tightly fused compiled code.

None of these is really new, and have been known to functional programmers for decades by the name of deforestation. But when implemented in Prela, something incredible happens: a clean CPS-style interpreter for Prela automagically recovers fast columnar execution when compiled!

The central design principle behind Prela is "everything is a binary relation". This means Prela maps cleanly to both a logical Entity/Relationship data model, as well as to a columnar physical storage. I won't go into details here, but encourage you to play with the language to get a feel for that. Practically, this means we fully normalize every wide table with m attributes to m binary relations. For example, a table movie

with columns ID, year, title

becomes:

movie

) over ID

(you can think of this as a unary table over ID

)ID

to year

ID

to title

But now the issue is, even for a simple SELECT * FROM movie

we need to join together 3 different tables! Whereas a column store would simply run:

for i in 0:n
  print(id_col[i], year_col[i], title_col[i])
end

In other words, the column store co-iterates the columns in one pass to compute the query.

Let's first look at how Prela used to run this query. The most important operator in Prela is the relation composition \rightarrow, which generalizes function composition the same way relations generalize functions. In standard relational algebra: R \rightarrow S = \pi_{x, z}(R \Join_{R.y = S.y} S) where R's schema is over x and y, and S's schema is over y and z. The second most important Prela operator is the product \times which takes two binary relations and joins them: R \times S = R \Join_{R.x = S.x} S where R's schema is over x, y, and S's schema is over x and z.

So SELECT * FROM movie

is spelled \text{movie} \rightarrow \text{year} \times \text{title} in Prela. Now, this requires first joining year

with title

, whose result is joined with movie

. We can make this a bit cheaper if we can assume the primary key ID

s are dense and continuous, in which case we can just store year

as an array of integers and title

as an array of strings; and for ID

s, we only need to store one single number n

which says how many IDs there are. But even with this, we still need to do the work to join the tables.

Instead, let's define compose and product in CPS:

compose(lhs, rhs, k) = lhs((x, y) -> rhs(y, (z -> k(x, z))))

product(lhs, rhs, x, k) = lhs(x, (y -> rhs(x, (z -> k((y, z))))))

scan_id(n, k) = for i in 0:n; k(i, i); end

probe(col, i, k) = k(col[i])

Let's go over each line carefully. Semantically compose

is supposed to return a (binary) relation by composing the lhs

and rhs

relations. In CPS, its job is to apply k

to every pair in this composition. Its first argument, lhs

, represents a binary relation and applies the given continuation to every pair. The rhs

is a little different: it represents a relation that supports lookup, i.e., rhs(key, k)

will look up the values associated with key

, then apply k

to each such value. Now going back to compose

— we're saying that, for each (x, y)

tuple in the LHS, we will look up y

from the RHS, then for each matching z

, we apply k(x, z)

. In loops this will be:

for (x, y) in lhs
  for z in rhs[y]
    k(x, z)
  end
end

Which is exactly a hash join and a projection that throws away y

.

Next, product

itself is a relation supporting lookup, and so are its arguments. To lookup x

in a product, we first look it up from the lhs

which gets us a bunch of y

s. Then for each y

, we now look up x

from the rhs

, getting a bunch of z

s. Finally, for each (y, z)

pair, we apply the continuation k((y, z))

. In loops:

for y in lhs[x]
  for z in rhs[x]
    k((y, z))
  end
end

This is what will happen if you look up x

in R \Join_{R.x = S.x} S.

Finally, we have the "source" operators scan_id

and probe

. scan_id

is scan

but specialized for a dense ID relation where we only store n

: it simply increments i

from 0 to n

and applies k

to (i, i)

. probe

represents an input relation that supports looking up a primary key i

and which is backed by a dense vector, so looking up an ID i

simply indexes col[i]

and applies k

to the value.

We're now ready to put everything together and pull the trigger: the Prela query \text{movie} \rightarrow \text{year} \times \text{title} desugars to compose(scan_id(n), product(probe(year), probe(title)))

where n

is the number of movies. Here are the definitions again for reference:

compose(lhs, rhs, k) = lhs((x, y) -> rhs(y, (z -> k(x, z))))
product(lhs, rhs, x, k) = lhs(x, (y -> rhs(x, (z -> k((y, z))))))
scan_id(n, k) = for i in 0:n; k(i, i); end
probe(col, i, k) = k(col[i])

Expand compose

:

scan_id(n, (x, y) -> product(probe(year), probe(title), y, (z -> k(x, z))))

Expand scan_id

:

for i in 0:n
  product(probe(year), probe(title), i, (z -> k(i, z)))
end

Expand product

:

for i in 0:n
  probe(year, i, (y -> probe(title, i, (z -> k(i, (y, z))))))
end

Expand probe

:

for i in 0:n
  k(i, (year[i], title[i]))
end

And taking k = print

, we finally have:

for i in 0:n
  print(i, (year[i], title[i]))
end

Spectacular!!

To keep the examples small, I've made several simplifications. The actual Prela implementation defines two methods drive

and probe

for each operator which fire depending on how the operator is accessed — scanned or probed. But that's pretty much it, and the complete source is around 1000 lines of Julia code in a single file, supporting select, project, join, groupby, aggregation, CTEs, UDFs, and with performance matching DuckDB on TPCH and Join Order Benchmark. I should note that the performance numbers are riding on lots of assumptions though, the strongest ones being:

Nevertheless, the CPS approach cleanly separates the responsibility of the data engine — which is to map queries to efficient code in some target language — and the responsibility of the compiler which is to produce fast code quickly, and we see there's a lot of room for the compiler to improve. The assumption that PKs are dense can also be relaxed with the help of B-trees, bitmap filters, and other data structures.

But perhaps the biggest strength of the CPS style is that it makes Prela extensible, as users can write their own operator in a natural way with a few lines of code, with the assurance that it will be compiled and fused with the rest of the query for efficient execution.

Long before AI psychosis, there was FP psychosis, clinically defined as the intense psychological response to understanding functional programming concepts like recursion, higher order functions, monads, or in this case, continuation passing style.↩︎

All code in this post is in Julia.↩︎

scan(xs)

stands for k -> scan(xs, k)

, i.e., it is the curried application of scan

to xs

. Similarly for inc(scan(xs))

below which curries inc

with scan(xs)

.↩︎

source & further reading

remy.wang — original article

Passing DBs Through Continuations

Run your AI side-project on zahid.host