Presentation: Million PDFs: Building a Modern Document Infrastructure with Rust and Typst


Transcript

Erik Steiger: I will be talking today about PDFs. I know we are currently in the era of AI, and every other talk has something to do with LLMs. My background is also in AI and software, but over the last one-and-a-half years I've been working in consulting with companies that do a lot with documents, especially in the compliance area, and that has been a pain point of mine. I wrote about it and it struck a nerve. Just to understand why you are here, I would like to see who of you came to this talk because you also had something to do with PDFs and have a pain point, maybe something you're trying to solve. Maybe some of you are here because the title mentioned Rust or Typst, or because of serverless and so on?

Two Experiences - Banking and Manufacturing

The two experiences I had last year: the first was at a bank that was trying to scale. As you can imagine, like at most banks, it's legacy software, and it actually still is COBOL running the show, at least for the client where I was. In particular, the PDF pipeline they had was getting too slow. It got to the point where you, the customer, buy something like a stock and then wait for days for the PDF to arrive. It's not only inconvenient, it's also not allowed by regulation. At some point the German regulatory authority was like, guys, you cannot do this. They were forced to change. Then they started thinking, what do we do? Obviously, there was this big movement of bringing everything into the cloud. They thought about, can we use AWS Lambda? At this point, I had already moved on because I was assigned to another project.

From colleagues who still worked there, I heard that one or two years later they were still trying to figure out how to do PDF rendering in a nice pipeline. I just heard something about Java and pre-compiled programs, one for every template. I was like, this is going to be a mess. I'm not in contact anymore, but I don't think they have updated it in the last two years.

The second experience was in manufacturing. There it was not the speed or the latency of every PDF, it was more, how do we manage them? It was a regulated industry, where a truck that leaves the facility needs the right weighing slip that says, ok, the truck weighs this many tons. What kind of gas, for example, is in there? If it's, for example, medical gas, there are regulatory requirements. I worked on it for some time, and I can tell you the workflow was somewhat like this. The customer has some problem. The truck driver can't leave because the certificate is just not getting printed. He calls us because we were the service provider for this system. First we had to jump through two VPNs to some remote desktop to see, is it actually an error? Is it a user error? Where's the PDF? Then, because of how the system was built, we had to take a whole backup of the production database, which was like 5, 6, 7 gigabytes, download it through the two VPNs, and load the backup into our test environment. Then we had to create a valid fake order, because the way the system was built, you needed an order to test whether the certificate could be printed, and then try to reproduce the problem. Because I came from working with startups, everything felt like a lot of this, waiting for a long time.

I came from somewhere where we used modern tools and not these outdated tools. Who of you knows Crystal Reports? Would you speak highly of it? There are thumbs down in the last row. It really was a pain point. For those of you who don't know it, it's software that originated in 1984, which is a very long time ago. I don't know if we had PCs back then, but it seems like it. How it worked: you connected it with credentials to a database, then it would fetch the schema. There, on the left side, you see these little fields that you can drag and drop into your PDF. That's how you build it. Now imagine, because we had a lot of factories in different countries, the request comes: we have a new factory in the Czech Republic. Can you please translate?

This means you click on every single field you see, copy the text, paste it into DeepL or your LLM, and replace it. This was horrible. It broke me mentally, I feel like. If you look around, how else can you generate PDFs? There are other tools, for example, LaTeX. Do some of you come from a scientific background and know LaTeX? You can think of LaTeX as a programming language that spits out a PDF. I know it from my studies in math, and if you try to do formulas in Word, it will definitely break you. In LaTeX you just have code for how to do a pi sign and so on. Another thing that's used in the industry is something like Puppeteer. This goes more into web development, because you render your document with web technologies in Chrome, and you just take the PDF and download it.

Obviously, it has its drawbacks, because you have the whole overhead of starting a headless Chrome instance, a browser that renders the page and then spits out the PDF. The benefit is obviously that you can use existing skills: if you know CSS, you can probably get around and build a PDF. For LaTeX, for example, you would have to learn how to set a section, a break, and so on.

Coming from startups, working with code that was managed with Git, where you had something like reproducibility with Docker and CI/CD, I felt like working with documents in the projects I worked on was super horrible. It was always like, embrace the chaos and just spend four or five hours until it works. I always felt there was room for improvement. My early Christmas wishlist for this would obviously start with speed, if you're someone who needs to render a million PDFs by the end of the day, for example, a bank or a broker. Memory consumption is on it too, because spinning up a whole browser just to print a PDF is probably not the most efficient way. Version control is something that we as developers know and love and also depend on. I remember 15 years back, when you didn't know Git, you would name your files something like final version v3 for some project.

In general, I wanted the modern developer experience, something like syntax highlighting and cross-platform support. Crystal Reports, for example, that I showed earlier, only works on Windows, and not even on the ARM version. Where I was working, I tried to use my Mac, but because my Mac is one of the new Apple Silicon machines, which are ARM-based, Crystal Reports was having none of it when I tried to use a virtual machine. One big thing, as you saw with the workflow, was that we spent a long time downloading a whole database just to get this connection and test with it. In the end, we actually just wanted the JSON with the data we want to print.

A Modern Typesetter

For those who know LaTeX: in the past few years, a new typesetter called Typst came out of a team in Berlin, based on a master's thesis. What LaTeX and Typst share is that they are both typesetters. This means it's not like Word, where what you see is what you get. It's more like you have code where you write, this is my text, and please format it very nicely. You get nice paragraph breaks and so on. You see here on the right side that the syntax resembles some kind of Markdown-like DSL, where you have equal signs for a section header and so on. The nice thing is it's leaner than LaTeX. For those who worked with LaTeX and tried to install it on their PC, it's 5 gigabytes for the whole distribution. It's very complicated. There are a lot of dependencies. Typst is like a modern rewrite where you just download it, which is I think less than 50 megabytes, and then you compile the file you want into a PDF. It's not only more modern in the developer experience, it's also faster for larger files and it has very nice error messages. With LaTeX, for example, if something breaks because a variable isn't right, you get huge error messages and you're just lost. Here you get really good error messages.

My idea was then, ok, so we have Typst, what can we do with it? In our case, we have a template, something like on the right, and we have our data that comes from a machine, or customer data from our database. How can we inject it? Because in the original idea of Typst, you would prepare a fixed file and compile it into a PDF. We want something in between.

We want a fixed template that always looks the same. If you have invoices, you always want them to look the same way. You just want to replace the name, the invoice ID, and so on. I wanted something that was very easy to use as a library in the language you want to use. For example, you see down there, in its simplest form, it's just: render this template. There the template is just a string, but obviously, you can get it from the file system, and then you pass some JSON with the data. That was the goal.
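The "template plus JSON" interface can be approximated from outside with the Typst command line, which accepts key-value inputs via `--input` and exposes them to the template as strings in `sys.inputs`. The following is only a sketch under that assumption, not the speaker's Rust library: the function name, file names, and the `data` key are hypothetical, and the command is built but not executed.

```python
import json


def build_typst_command(template_path, data, output_path):
    """Build (but do not run) a Typst CLI call that hands JSON data to a template.

    Assumption: the template reads the string from sys.inputs under the
    hypothetical key "data" and decodes the JSON itself.
    """
    payload = json.dumps(data, sort_keys=True)
    return [
        "typst", "compile",
        "--input", f"data={payload}",
        template_path,
        output_path,
    ]


# Hypothetical invoice data and file names, for illustration only.
cmd = build_typst_command(
    "invoice.typ", {"name": "Ada", "invoice_id": 42}, "invoice.pdf"
)
```

Passing the resulting list to `subprocess.run` would perform the actual compile on a machine where Typst is installed.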

Serverless Rust Implementation

I wrote a blog post where I used Typst and bundled it into something that was then deployed on AWS Lambda. It used AWS Lambda, Terraform to have infrastructure as code, and it was implemented in Rust. It was a side project to show myself that there's an opportunity for something better than Crystal Reports, which everyone was using. I posted about it, and it got, at least for me, a lot of impressions. I also posted on Reddit about it, and there were people who definitely felt the pain of generating PDFs, which actually surprised me a little bit. There were comments like, bad Crystal Reports memories from my internship, so it seems there's a community that shares this pain. The architecture that I went for, for the serverless approach on AWS, looked something like this. You would have two Lambdas, two functions.

The first one was actually dumb, because it was just taking a request and putting it into a queue. This is the SQS you see there. The second function bundled Typst into a Lambda function: it read the request, got the template from S3, rendered the PDF, and put it back into an S3 bucket. What made this possible, or what it relies on heavily, was cargo-lambda. It's a way of writing Rust-based AWS Lambdas. The result looked something like this: you have your endpoint, you give it the name of the template you want to print and the data, and what comes back is just a confirmation: we queued it, it's going to be rendered. Obviously, at this point, there's no retry or error handling. It was just about, can we use Typst to get performant rendering? We can.
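The two-function flow described above can be sketched in a few lines. This is an in-memory illustration only: a `queue.Queue` stands in for SQS, a dictionary stands in for the S3 bucket, and `fake_render` is a placeholder for the Typst render. All names and key layouts are hypothetical, and, like the original, there is no retry or error handling.

```python
import json
import queue

render_queue = queue.Queue()  # stand-in for the SQS queue
bucket = {"templates/invoice.typ": "Invoice for {name}"}  # stand-in for S3


def enqueue_handler(request):
    """First function: accept the request, queue it, return a confirmation."""
    render_queue.put(json.dumps(request))
    return {"status": "queued", "template": request["template"]}


def fake_render(template_source, data):
    """Placeholder for the Typst render; returns bytes, not a real PDF."""
    return template_source.format(**data).encode("utf-8")


def render_worker():
    """Second function: take one job, load the template, render, store the result."""
    job = json.loads(render_queue.get())
    source = bucket[f"templates/{job['template']}.typ"]
    output_key = f"output/{job['template']}-{job['data']['invoice_id']}.pdf"
    bucket[output_key] = fake_render(source, job["data"])
    return output_key


ack = enqueue_handler(
    {"template": "invoice", "data": {"name": "Ada", "invoice_id": 7}}
)
key = render_worker()
```

The caller only ever sees the confirmation from the first function; the rendered file appears in the bucket once the worker has processed the queued job.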

The results were quite astonishing, in the sense that it uses way less memory than traditional approaches, like spinning up a whole browser or running LaTeX. It's below 50 megabytes, and the rendering takes less than 100 milliseconds, so in the two-digit milliseconds. If you scale this up, it costs less than 50 cents to render a million PDFs. In comparison, that's about 20 times less than with other approaches.
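A back-of-the-envelope check of the cost figure, as a sketch under stated assumptions. The prices below are assumed AWS Lambda list prices (per GB-second of compute and per million requests) and vary by region, architecture, and over time. The memory setting of 128 MB and the 100 ms duration are taken as upper bounds from the numbers above. Queue, storage, and data transfer costs are left out, so this is not necessarily how the talk's figure was computed.

```python
def lambda_cost_usd(invocations, memory_mb, duration_ms,
                    price_per_gb_second=0.0000166667,   # assumed list price
                    price_per_million_requests=0.20):   # assumed list price
    """Estimate Lambda cost: compute (GB-seconds) plus per-request charges."""
    gb_seconds = invocations * (memory_mb / 1024) * (duration_ms / 1000)
    compute = gb_seconds * price_per_gb_second
    requests = invocations / 1_000_000 * price_per_million_requests
    return compute + requests


# One million renders at 128 MB and 100 ms each.
cost = lambda_cost_usd(1_000_000, memory_mb=128, duration_ms=100)
print(f"{cost:.2f} USD")
```

Under these assumptions the estimate comes out at roughly 41 cents, about half compute and half request charges, which is consistent with the "less than 50 cents" figure.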

This is just speed in a way, and I was also not happy, because what this didn't address was the whole management of PDFs. A tip for you, if you have some side project: I can only recommend putting it somewhere, Reddit or so, because people will find the mistakes for you. For example, I didn't know that if you have provisioned concurrency, it generates billing the whole time. There were also some points where I thought, yes, you could actually just remove the three parts and call the Lambda that renders it directly through the Lambda function URL. I did some tracing to find out, what's the actual speed? How much can we get out of it? It turns out that caching the template helps a lot. You can imagine what Typst does: it gets a template with some variables, and then it lays it out on the page, how wide and how big the words are.

If you just change a little part below, like a new number, the rest can stay compiled. You cache it, and you just replace this part. This makes it so fast that on a very small AWS Lambda, you can get a rendering below 2 milliseconds, because most of it is cached. Obviously, there was still room for improvement. For example, the uploads to S3 still happened one after another, so this could be improved to cut another 30%. What we had at this point was a rendering engine that works on simple templates. The Typst templates are text files: you can open them in your code editor and edit them. You can put them into Git, and you see Git diffs. The input was JSON, so that's something you can very easily mock up. The developer experience was very good because you could compile it on your PC. I could compile it for ARM, for the Linux environment on AWS. The performance was also really great.

What Was Lacking?

What was lacking was, what do you do if you're a bank or financial institution, and you want to use the servers you already have in your basement? You're not going to use AWS, or maybe you shouldn't. How do you really do version control for these templates? Creating a Git repository for every template you have is definitely not something I would recommend. Multi-file support was also not there. You don't want one massive file with 2,000 to 10,000 lines. You want to structure it. You want to include maybe images, logos, and so on. This was also not in the first version that I worked on. The debugging was also still not great. I thought to myself, what would we need, and where can we draw inspiration from? It was something like Docker Hub or package managers, because they have strict versioning.

For Docker, for example, you not only know, ok, I have this Docker image for Python, you can even specify by a hash exactly which version, which layer you get. Based on this, I drew together some ideas I wanted to work on. The first is content-addressable storage. You don't store a file under its name, something ending in .png; instead you hash the content, you get a hash that looks like random characters, and you take this as the file name. You automatically get deduplication with this. If you have a lot of templates that use the same logo, it's the same content, so you get the same hash, so you don't store multiple copies.

The second is something like the tags we know from working with Git, where you also have branches. I wanted something that's easy to use and human-readable, so I can say, I have this invoice at latest, so it's the latest version. Maybe you have something like invoice at version 3 for the next rollout, when you have a new corporate design or so. Sometimes, in a regulated industry, you want to be sure that you take exactly the template where the team said, we're going to fix this one, and that you always take this template. So there's also a way to specify one specific template, and you have to understand that template means not only one file but the whole bundle, which brings us to the next part.

We need a way to bundle logos and assets like fonts into one package, such that the output is always the same. The last part is that we obviously still want a nice library that we can use, for example, in a server when we later build the next step.
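A minimal sketch of content-addressable storage with automatic deduplication, as described above. SHA-256 is an assumption here, since the talk does not name the hash function, and the `BlobStore` class with its in-memory dictionary is a hypothetical stand-in for real storage.

```python
import hashlib


class BlobStore:
    """Stores content under the hash of its bytes instead of a file name."""

    def __init__(self):
        self.blobs = {}

    def put(self, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()  # assumed hash function
        self.blobs[digest] = content  # identical content maps to the same key
        return digest

    def get(self, digest: str) -> bytes:
        return self.blobs[digest]


store = BlobStore()
logo = b"fake logo bytes"        # placeholder for a real image file
hash_a = store.put(logo)         # logo uploaded with one template
hash_b = store.put(logo)         # same logo uploaded with another template
```

Both uploads return the same hash and the store holds a single copy, which is the deduplication effect: templates sharing a logo share one stored blob.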

Just to show you what this could look like, imagine you have an endpoint, your backend where you manage all your PDF templates. We can say, take this file, where you see main.typ. This is the main entry point, the main template, and you can attach files to it and add some metadata. What you get back is a hash, which is a representation of all the data you've put in. If you look at how it is stored, you see that it creates a manifest file under the hash that you got back. This gets stored, and the content looks like this. It's similar to how Git works, if you know it: it builds something like a Merkle tree, which we also know from cryptocurrencies. It builds a tree where we reference the data we will later pull into our template by their hashes. We build a JSON document out of it, and then hash it again.
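The manifest step can be sketched as follows: each file is referenced by the hash of its content, the references go into a JSON document, and that document is hashed again to identify the whole bundle. The field names (`entrypoint`, `files`, `metadata`), the use of SHA-256, and the canonical JSON encoding are assumptions for illustration, not the actual manifest format of the speaker's registry.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(entrypoint, files, metadata):
    """Reference every file by content hash, then hash the manifest itself."""
    manifest = {
        "entrypoint": entrypoint,
        "files": {path: sha256_hex(content) for path, content in files.items()},
        "metadata": metadata,
    }
    # Sorted keys and fixed separators keep the encoding, and so the hash, stable.
    encoded = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return sha256_hex(encoded.encode("utf-8")), manifest


files = {"main.typ": b"= Invoice", "logo.png": b"fake logo bytes"}
manifest_hash, manifest = build_manifest("main.typ", files, {"name": "invoice"})
```

Because the manifest hash covers the file hashes, changing any file in the bundle changes the manifest hash as well, which is the Merkle-tree property the talk refers to.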

When I say I want the template starting with 39a9, this means, get me these files, put them into a bundle, and then render it. Because you don't always want to remember the whole hash, we can also create references. This only means that latest points to this hash, and this hash again identifies the manifest file. The whole flow looks like this: if I type in invoice latest, we look up the reference, we take the hash, we fetch the manifest, and then we pull all the blob files from storage, which can be the image or the main template.
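The lookup flow (reference, then hash, then manifest, then blobs) can be sketched with two dictionaries. Everything here is a simplified assumption: the `(name, tag)` key shape, the manifest fields, and SHA-256 are hypothetical choices made for the example.

```python
import hashlib
import json

blobs = {}   # content hash -> bytes (files and manifests alike)
refs = {}    # (template name, tag) -> manifest hash


def put(content: bytes) -> str:
    digest = hashlib.sha256(content).hexdigest()
    blobs[digest] = content
    return digest


def publish(name, tag, files):
    """Store all files and the manifest, then point the tag at the manifest."""
    manifest = {
        "entrypoint": "main.typ",
        "files": {path: put(content) for path, content in files.items()},
    }
    manifest_hash = put(json.dumps(manifest, sort_keys=True).encode("utf-8"))
    refs[(name, tag)] = manifest_hash
    return manifest_hash


def resolve(name, tag):
    """Follow reference -> manifest hash -> manifest -> blob contents."""
    manifest = json.loads(blobs[refs[(name, tag)]])
    return {path: blobs[digest] for path, digest in manifest["files"].items()}


v1 = publish("invoice", "latest", {"main.typ": b"= Invoice", "logo.png": b"logo"})
bundle = resolve("invoice", "latest")
```

Publishing a new version under the same tag only moves the reference; the old manifest stays retrievable by its hash, which is what makes pinning an exact version possible.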

A New Architecture

Coming back to the workflow: when our customer calls us and says, this certificate that we tried to print didn't work, then, because we store not only the template but also the data, we can look up the renders that contained, in this case, the ID of the order. We look it up and see there's a failed render entry, where we get the render ID, the template reference, and even a hash of the data that was used. If you also store the data that was used to render, you can look it up and see, maybe they forgot something. In this case, maybe they forgot the expiry date. This means we can reproduce failed renders: take the data that was used, download it, inspect it, and test until you get it right. To zoom out a bit, the thing I did with AWS serverless was mainly a rendering engine.
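The failed-render lookup described above can be sketched as a filter over a render log. The record fields, the example values, and the error text are hypothetical; a real registry would keep these records in a database and store the input data in blob storage under its hash.

```python
# Hypothetical render log; hashes are shortened placeholders.
render_log = [
    {"render_id": "r-1", "template_ref": "certificate latest",
     "data_hash": "data-hash-a", "order_id": "order-100",
     "status": "ok", "error": None},
    {"render_id": "r-2", "template_ref": "certificate latest",
     "data_hash": "data-hash-b", "order_id": "order-200",
     "status": "failed", "error": "missing field: expiry_date"},
]

# Input data stored under its hash, so a failed render can be replayed.
stored_data = {
    "data-hash-b": {"order_id": "order-200", "gas": "medical"},
}


def failed_renders_for_order(log, order_id):
    """Return every failed render that belongs to the given order."""
    return [
        record for record in log
        if record["order_id"] == order_id and record["status"] == "failed"
    ]


failures = failed_renders_for_order(render_log, "order-200")
replay_input = stored_data[failures[0]["data_hash"]]
```

With the template reference and the original input in hand, the render can be repeated locally without touching the production database.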

You can go back and build the registry on top using modern techniques we know from something like Docker Hub or package managers, where we have template management. It's more compliant, and you get analytics. Because if you have a large factory that prints a lot of documents, you also want to know which one failed in the last week and then inspect why. Then, if you have something like this, it's pretty easy to build a server on top that you can then horizontally and vertically scale. You have parallelization that pretty nicely uses all your CPU cores without much of a headache on your side. You get strong caching, which reduces the latency even further. You just have a modern data interface where you can use something like JSON.

To give you an idea of where this could lead: you can even put a UI on top. We had this keynote, and this would be like the pre-MCP version, still a normal UI, where you would see the past renders that your facility did, which ones failed, and why they failed. You can download the PDF, you can inspect the data. Coming back to where the old tools were: LaTeX is very heavy, with very big dependencies, and you need 5 gigabytes to install it. The Docker images with LaTeX installed are huge. In this case, if you bundle it, it's less than 100 megabytes. Then Crystal Reports still depended on having an active database connection. You don't need that here. You just have your template and you give it some data. Then there's Puppeteer. I don't know who of you has tried to render something with that?

The nice thing is that you can design with CSS, but it consumes a lot of memory and you have long cold start times. If you try to use something like Puppeteer on AWS Lambda, it's going to be quite slow because you need at least a second to start the browser. For me, this was the project where I saw that you could use modern tooling that was originally meant for scientific work. Typst was meant for typesetting scientific papers. Because of how performant it is, you can also use it for document generation in industry. If you then add the modern tooling that we know from Docker and from Git, from version control, you even get something with strong compliance guarantees. We can manage it very nicely, and we always get the PDF that we wanted in the first place.

Resources

For those of you who are interested or think this could be relevant, all the code and everything I mentioned is open source. Even the original Typst PDF renderer is open source. If it's something for you, look into it and reach out if you have questions.

Questions and Answers

Participant 1: Can you show an example of this document where you did this benchmark with this 1 million PDFs?

Erik Steiger: Yes. I don't know if your question goes into if it was too simple?

Participant 1: Yes. Because in my experience, for example, if you look at the poster here: the poster consists of six texts and seven images and some vector graphics. It depends on how many elements you have and how complicated they are, and how many fonts you need.

Erik Steiger: The template I used looked like this. I tried to actually have an honest comparison. I tried to design it as a trade confirmation for a bank. I know you're more in the invoicing?

Participant 1: Yes. It looks ok. It's one page with a lot of text, some lines, some vector.

Erik Steiger: Exactly.

Participant 1: It's no images. The top left is text or image?

Erik Steiger: I think this was a PNG file, yes. There you see that most of the data that would change was randomized. Most of the stuff you see, for example the table, had to be laid out again. Sometimes the result even had two pages, depending on how long the table was.

Participant 2: The comparison to LaTeX, because on the Docker container size, I couldn't agree more. It's a hell of a thing, LaTeX in a Docker container. Did you run it speed-wise against LaTeX, out of curiosity?

Erik Steiger: No. I remember from my times using it back then, I think the fastest you get is 300 milliseconds to 500 milliseconds. That said, it's not obvious which program to use, because LaTeX is huge. There are different engines, XeTeX and LuaLaTeX. For a comparison, you can optimize it very heavily. If I had built a PDF rendering engine 5 years ago, I would have used LaTeX. I also saw stock agencies using it with MATLAB, but I always felt it's not very modern because of the big compilation size and the error messaging that was [inaudible 00:32:25], in a way. If you're interested, there are still the blog posts on my website that talk a bit more about the AWS Lambda setup, how you can compile a Rust program into an AWS Lambda function, and how to deploy it with Terraform.

Participant 3: Is everything that you introduced here open source? Every tool, or is there something where you need to pay for a license?

Erik Steiger: Everything is open source. That's the repo where most of the crates are. It's split into three crates. There's the main library that builds around Typst. Then you have the registry, which works with all these manifests and creates the hashes. The last one is more or less just a layer on top where you have a server. All of it is open source. Like I mentioned, it relies heavily on the PDF rendering engine Typst. If you want to have a go at it, feel free.

Participant 3: Yes, I'm using it mainly from Java Enterprise work, I'm using just for inputs. We've come close to 100 milliseconds per rendering, because you can now use caching and browsing enables you to start it very fast, not to wait for the VM to start up, but some [inaudible 00:34:21] is still out of reach.

Erik Steiger: Depending on how complicated it is, with caching, you get sub-10-millisecond renders.

Participant 3: Usually, these are large PDFs, with all the disclaimers and all the information from the customer, and stuff like that. They're large like that.

Erik Steiger: Was speed your main problem?

Participant 3: Amount of PDFs is my main problem, so very large amount of PDFs.

Erik Steiger: To generate them in time.

Participant 3: Yes.

Erik Steiger: Yes, similar to the bank. Yes, try it.

Participant 4: My question is: you have text in JSON form, then you have a hash of it in the registry, you map it to a template, and then you render the PDF. That's where I'm wondering, can you go in reverse? You have the PDF, then you get the template which was used, then you get the JSON back. Is it possible to go back like that?

Erik Steiger: You mean you rendered a bunch of PDFs, and then the customer comes back to you with a PDF and says, can you check when it was rendered and with which data?

Participant 4: Yes.

Erik Steiger: If you have it digitally, probably yes, because at some point you generate the PDF. You could store this PDF, or at least its hash, in your bucket, and then you would know which manifest was used to render it, and from there you can derive all the rest. It's interesting for compliance areas where you want to know which version was used.
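The reverse lookup in this answer can be sketched as an index from the hash of the rendered PDF to the manifest and data that produced it. This is an illustration under assumptions: SHA-256, the record fields, and the placeholder hash strings are all hypothetical.

```python
import hashlib

render_index = {}   # hash of PDF bytes -> what was used to produce it


def record_render(pdf_bytes, manifest_hash, data_hash):
    """Remember which manifest and data produced this exact PDF."""
    pdf_hash = hashlib.sha256(pdf_bytes).hexdigest()
    render_index[pdf_hash] = {
        "manifest_hash": manifest_hash,
        "data_hash": data_hash,
    }
    return pdf_hash


def trace_pdf(pdf_bytes):
    """Look up a PDF by its content hash; None if it is not byte-identical."""
    return render_index.get(hashlib.sha256(pdf_bytes).hexdigest())


pdf = b"%PDF-1.7 placeholder bytes"
record_render(pdf, "manifest-hash-placeholder", "data-hash-placeholder")
origin = trace_pdf(pdf)
```

The lookup only works for a byte-identical file: a PDF that was re-saved or altered in any way hashes differently and is not found.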

Participant 4: Where I work, we do a lot of PDF processing, so we need to figure out exactly what's in the PDF, essentially, and then you go back to a JSON format for that, and then we create, let's say, more PDFs from that. It's like a circle that you need to do. I was just wondering if you can, let's say, maintain similar hashes, then maybe you can figure out what's the closest hash, and say, ok, this is maybe the format that was used.

Erik Steiger: That's a bit different. If the PDF is just a little bit different visually, or even if something differs that's not visible in the PDF, this changes the hash. What you're talking about is more like visual comparison.

Participant 5: In my experience, when you design documents, for instance invoices, the customer often asks for very specific layouts. My question would be whether you had to set some kind of boundaries, or whether Typst was powerful enough to create any custom layout?

Erik Steiger: I don't know if Typst does animation, and I know PDF can do animation. I don't know if you want to go that far. I think for most layouts that we see in PDFs, it's possible. Because we are obviously using Typst, you could look at the examples they give. Most of them are of a scientific nature, but you can do graphs. You can bring in images. It's actually quite powerful. I would say yes to your question, if it's not very exotic.


source & further reading

infoq.com — original article