Podcast: Chasing Efficient Java Development: From 1BRC to Developing Hardwood AI Natively

Gunnar Morling, technologist at Confluent and Java Champion, shares his experiences building high-performance applications in Java, especially in the data space. He discusses his experiments with building durable execution engines, bootstrapping, and the AI-native development of Hardwood, a minimal-dependency Java parser for Apache Parquet.

Key Takeaways

The use of modern Java versions offers substantial out-of-the-box performance benefits, including reduced object memory footprint through compact object headers and improved concurrent garbage collection (e.g., ZGC).
Durable execution engines allow complex, long-running workflows to be defined as plain, end-to-end code, simplifying the implementation of resumable and recoverable processes with minimal infrastructure.
Unlike the existing Java parser, which introduces a large dependency footprint and liabilities such as supply chain attack risks and class path conflicts, Hardwood was built as a fast project with no mandatory dependencies to minimise these issues.
Hardwood uses highly granular page-level parallelisation, running on Java's virtual threads for lightweight, scalable concurrency, to make use of all available CPU cores.
Developing Hardwood AI-natively went smoothly, mainly thanks to the format's extensive documentation. However, human oversight remains critical for ensuring code quality, adherence to design (e.g., design documents, a minimal public API), and preventing regressions.

Transcript

Olimpiu Pop: Hello everybody. I'm Olimpiu Pop, an InfoQ editor, and I have in front of me Gunnar Morling. He's one of the people who know best what Java does with data, because he has an eye for, and an interest in, these kinds of things. So without further ado, Gunnar, can you please introduce yourself?

Gunnar Morling: Yes, Olimpiu, thank you so much for having me. Yes, what you said pretty much describes my interest. I'm very much interested in the intersection between Java and the data space. I've been around for quite a bit. For instance, I used to work on the Hibernate project and on Bean Validation. I used to be the spec lead for Bean Validation 2.0 back in the day, which is definitely a data story in the Java space. Then later on, I used to work on Debezium, which is a tool for change data capture, which is about taking data out of a database such as Postgres or MySQL and putting it into Kafka and then, for instance, enabling real-time data flows from a database into your data warehouse or search index.

It also allows you to do microservice data exchanges and so on. These days, I work as a technologist at Confluent. It's a wide mixture of different things. There's an internally facing part of it. So I do investigations into technologies. Should we invest in certain projects or maybe make a certain acquisition? So I would do some research in that space. Sometimes I'm answering questions for our leadership team. Somebody might ask, "Hey, can somebody explain Flink watermarks to me?", and I'll write a one-pager about that. And then, publicly, I focus a lot on writing on my blog.

I go out to conferences and, yes, also try to do some prototyping and some open-source projects. I'm still involved with Debezium. I'm working on a new project, which we are going to talk about today. So yes, it's a bit of all of that.

Olimpiu Pop: That sounds fun.

Gunnar Morling: It is fun. Yes, absolutely. I definitely enjoy doing it. And I've been in the industry for quite some time, but I still enjoy it every day.

Reflecting on the One Billion Row Challenge's Unexpected Impact [02:32]

Olimpiu Pop: Well, a lot of things have changed for you, Gunnar, in terms of company and projects, but the gist remains the same: a passion for technology, being curious, and things like that. So I'm happy to see that you're your old self. There are two things I remember when people talk about you, and both are related. One of them was the One Billion Row Challenge, which exploded, and you brought together the whole Java community. It started with the idea, "Okay, I think Java can be really fast", going against the prevailing opinion. How do you feel about it? I think it's been almost three years since that happened.

Gunnar Morling: Yes. It's very interesting that you ask, because it's been two years and a bit. So this happened in January two years ago. And people still ask about it, and people still do pull requests against the GitHub repo. I mean, it's long closed, actually. There's like a readme that says the project for this challenge has been concluded, but people are still curious. Every now and then, somebody would talk about it, maybe they do a podcast or something. So yes, it still interests people. Also, people ask: Will there be a new challenge? And I'm still thinking about it.

For the first year, I definitely didn't feel ready to do it again because it was immensely stressful. I spent the entire January of 2024 running the challenge, and I didn't feel like ever doing it again. Later on, I kind of recovered, and if there's a good idea that would interest people, I'm happy to do it. The thing is, and I believe this is why it blew up back then, it hit a sweet spot: it was very easy to explain. You have a file with 1 billion rows and need to aggregate the values. Everybody understands it in a minute. And still, it left lots of room for optimisation. People were busy optimising for the entire month, and they would have kept going had I not stopped them at the end of January. So I'm still looking for another problem which combines those two characteristics. And then yes, I would be open to doing it. I would also want to automate it much more; it was very much manual labour, and I just didn't anticipate that. So yes, I would spend more time automating it and setting it up. If anyone has a good idea, I'm all for it.

Olimpiu Pop: Great. Thank you. Good luck with your future idea and all.

Gunnar Morling: Yes.

JDK 17 as the New Baseline Provides More Performance Out of the Box [04:50]

Olimpiu Pop: The other thing is, how do you feel Java has evolved in this time span? Because I know that some people went old-fashioned, and they just used things that were the old way of doing stuff that people in the mechanical sympathy space were using at some point. And then there were the other guys who were exploring the new stuff that was just getting started. Now, the new stuff at that point is already part of the JDK; it's part of the LTS. So it's probably in production now. How do you feel the Java ecosystem has changed over the last three years?

Gunnar Morling: Yes, I mean, not all of that actually is stable. So there is the vector API, which people heavily used in the challenge. And I don't even know, it's like in the 11th incubator version or something like that. So it's still progressing. But yes, things like the foreign memory API have been finalised, and people are using them. Yes, Java has come a long way since then. I still think that if you want to reach that super-advanced level of performance, you'd probably have to pull quite a few of those tricks people use.

And then there are additional new features which did not even exist back then. So, for instance, one thing I'm really interested in is the compact object headers JEP, which essentially reduces the size of each object on the Java heap. This allows your JVM to use less memory and, well, spend fewer cycles on GC because there's a smaller total amount of memory to manage. And this is one of the really cool things about the JVM, in my opinion. You can keep your application as is, and just by upgrading to a new Java version, you get better performance.

And you would, for instance, also benefit from those new concurrent GC algorithms like ZGC, which have also improved substantially since the challenge. That's why I always recommend everybody not to stay on those ancient versions. I know some people are still on Java 8 or whatever; definitely go and upgrade to the latest versions. It gives you all those performance improvements. And once you've made that leap to a relatively current version, let's say 17, upgrading is also really easy. So after 17, pretty much everything is a drop-in replacement, and it doesn't take long to upgrade to the latest one.

Olimpiu Pop: Okay. So it seems that Java 17 is the new Java 8. I remember that back in the day, when I started with Java, it was 1.4 at the time. And then Java 8 was like an explosion, and that was the epoch moment. And now it's like it seems that 17 is the new-

Gunnar Morling: Kind of baseline.

Olimpiu Pop: Yes. I don't want to talk about Titans, but I have to say it, so it seems that Java 17 is the Java 8 of the Oracle era of Java.

Gunnar Morling: Right. Between 8 and 17, there were quite a few disruptive changes, like locking down the usage of reflection, the removal of APIs like JAXB. And so the module system was introduced, of course. So all those things gave a bit of friction, I would say. But once you are on 17, it's pretty smooth. And going to the latest, it's not a lot of work.

Olimpiu Pop: I had several debates with people, and they said, "Okay, well, if you change the Java runtime environment, you're not taking advantage of the whole change". But even if you change the JVM you're running on, it'll just bring improvements, and that's a good first step. And then you can just look at the features and what you can encapsulate in the code to make it even better.

Gunnar Morling: So that very much hits it home. I would say when people are looking into those upgrades, and maybe they want to make the case with their management team to do it, I would always recommend that they don't focus on language features because I mean, yes, it's nice if you can express certain things in a more concise way or maybe a safer way and so on. But at the end of the day, those people in charge probably care more about hard things like saving money, for instance. And so that's why I think making the case in terms of performance, making the case in terms of observability is going to be more successful.

There's this whole topic about the flight recorder and all the options it enables. That's the better avenue for making the case for upgrades. And then, well, kind of for free, you get to use all those nice language improvements, too.

Durable Execution in Java: Writing Resumable, Recoverable Workflows in Plain Code [08:59]

Olimpiu Pop: You do like to play around, let's say it like that. And luckily for us, the community, whenever you play around, you also share some things on your blog. And the other thing that you played around with not long ago was about the durable execution engine written in Java. And there are a couple of things that people consider Java not to be. And one of them was that Java is not fast. You busted that.

Gunnar Morling: Hopefully.

Olimpiu Pop: And some people are saying that if you need a really efficient language ... but you proved several times that Java is efficient enough to build data tools. Tell us a bit about the durable execution engine. What was the motivation, what you wanted to achieve, and what did you actually achieve?

Gunnar Morling: To be clear, the actual state stores for that one are in SQLite (so in C), but it's integrated into Java and the engine itself. But yes, maybe to set the scene, for the benefit of everybody: what is durable execution, and what's the problem it solves? When we work on enterprise applications, there's always this problem of what we could call workflows or long-running business transactions, right? So there is a certain activity you need to do over here. You may need to send a message to another service.

Maybe there's like a batch job that, for instance, processes your purchase orders and moves them through their life cycle from one step to the next. So we have these long-running activities in our applications. That's a very common situation. The problem is that reasoning about those processes can be very hard when they're implemented across different systems, components, and jobs. It's difficult to understand the end-to-end flow. So, what happens if I receive a purchase order in my system? What actually happens? It gets sent to the shipment service, and some component processes it there.

Maybe something goes wrong. How can I get insight into where the order is stuck and why it hasn't shipped to the customer? The idea of this durable execution is to essentially take a very different look at that problem. And the idea is: okay, let's define our processes essentially as a plain program that you write from start to finish. It could be Java, it could be anything really, but I mean, I'm active in the Java space, right? So that's what interests me. So we write our program end-to-end in plain code.

Then the special twist is that the individual steps in that flow are essentially units of persisting state and of making things resumable. So let's say, in my e-commerce scenario, I want to persist an incoming purchase order, and maybe I need to do some sort of customer check: do they have the right creditworthiness? I need to assign stock, fulfil the shipment, send it out to the customer, and so on. All these steps should, of course, happen in the exact sequence, and each should happen only once. You don't want to assign stock twice for the same purchase order if something goes wrong.

The idea is to write those things as a plain Java program, and then have an engine around it that takes those steps and materialises their progress. So, for instance, we could call out to another system to, I don't know, assign stock, then take the result and store it in a local state store. And this is where SQLite comes into the picture. Suppose this flow continues and later fails, say because of an issue when processing the shipment. Then we could restart that flow, and our durable execution engine would figure out, "Okay, I have already done those first two steps out of five, so we don't need to run them again".

And then our flow would resume from the first step which hasn't been run yet. So that's essentially the idea of durable execution: give you a representation of your end-to-end flows, and make them resumable and recoverable in case of failures.

Olimpiu Pop: Thank you. That sounds like a distributed transaction, but taken to the next level.

Gunnar Morling: In a way, yes, it might use distributed transactions under the hood, but really, the cool thing is that for you as an application programmer, you don't really have to think too much about all those complexities. It's just about defining your flow, and then this engine, whatever it is, takes care of making those guarantees a reality. It's not a new idea. This has been around for quite a while. I mean, you could also think about traditional workflow engines; they kind of are in that same space, but they tend to have less convenient representations of the flow.

So, with durable execution, the idea really is: it's plain code you write, and developers love it. Not an XML representation or whatever. I was curious about it. Also, there's sometimes this perception of complexity around it, and people ask, "Hey, how does this actually work?" I have this program; how does it manage to resume from a method invocation somewhere down the line? So I was curious how I could do this. Also, let's remove the complexity and reach a point where we don't need much infrastructure. It's a plain Java program.

There's the state store in SQLite, but really, the complexity needed here isn't that high. And that was kind of the idea: to get an understanding and then also to share it with other people. It doesn't have to be complex. This concept is relatively easy to implement, and that’s what I showed when building this durable execution engine, called Persistasaurus.
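To illustrate how little machinery the core idea needs, here is a minimal sketch of step memoisation. It is not how Persistasaurus is implemented: the class and method names are hypothetical, and an in-memory `HashMap` stands in for the SQLite state store, so nothing here survives a JVM restart.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical minimal engine; names and structure are illustrative only.
class DurableFlowSketch {

    // Assumption: stand-in for a persistent state store such as SQLite.
    static final Map<String, String> STATE_STORE = new HashMap<>();

    // Records every attempt to run a step body, for demonstration purposes.
    static final List<String> EXECUTED = new ArrayList<>();

    // Runs a step body unless a result for it was stored by an earlier run.
    static String step(String flowId, String name, Supplier<String> body) {
        String key = flowId + "/" + name;
        String stored = STATE_STORE.get(key);
        if (stored != null) {
            return stored; // completed earlier: replay the stored result
        }
        EXECUTED.add(name);
        String result = body.get();   // may throw; then nothing is stored
        STATE_STORE.put(key, result); // materialise progress before moving on
        return result;
    }

    // The flow itself is plain code, written from start to finish.
    static String processOrder(String orderId, boolean failOnShipment) {
        String saved = step(orderId, "persistOrder", () -> "saved");
        String stock = step(orderId, "assignStock", () -> "stock-reserved");
        String shipment = step(orderId, "shipOrder", () -> {
            if (failOnShipment) {
                throw new IllegalStateException("shipment service unavailable");
            }
            return "shipped";
        });
        return saved + "," + stock + "," + shipment;
    }

    public static void main(String[] args) {
        try {
            processOrder("order-42", true);
        } catch (IllegalStateException e) {
            System.out.println("First run failed: " + e.getMessage());
        }
        // Second run: the first two steps are served from the state store.
        System.out.println(processOrder("order-42", false));
        System.out.println("Step bodies attempted: " + EXECUTED);
    }
}
```

On the second run, only the shipment step is attempted again, so stock is not assigned twice for the same order.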

Olimpiu Pop: And just to close the loop on things that Java is supposedly not capable of doing, there's one that you didn't target, at least to my knowledge. That's TUIs (Textual User Interfaces). And that's a topic to discuss with Max Andersen, because it seems he was very keen on proving that Java can do it.

Gunnar Morling: Absolutely.

Hardwood: Building a Fast, Zero-Dependencies Parquet Parser in Java [14:52]

Olimpiu Pop: Yes. Great. But coming back to your current playground, I spoke about Parquet and the whole stack. The guys from InfluxData are calling it FDAP, which stands for Apache Arrow Flight, DataFusion, Arrow, and Parquet. That's what they used for the InfluxDB rewrite in Rust. And I'm very happy to hear that this is happening in Java as well, because I'm also looking into the model, and it's nice to have Apache Arrow as the matching memory model, which you can then pair with Parquet.

Given that it's a professional project, I would expect that this came from somebody else, or actually, it's not the case, and you came up with the idea, and then you guys started. What's the story of Parquet, and well, actually Hardwood?

Gunnar Morling: So Apache Parquet is a widely used file format for storing data in a columnar way. So if you think about a CSV file, it stores data in a row-based fashion. So each record which you want to persist in your CSV file is a new line. And this is good for some use cases, but not so good for others. So think about a situation where you, for instance, aggregate all the values from a given field of your data. So let's say you want to aggregate the entire value of all your purchase orders.

With a CSV file or a row-based file like that, you would have to go through all the lines, find the purchase order field, and sum up those values. So it wouldn't be very fast. And exactly for those kinds of use cases, there's this idea, okay, let's store data not row-based, but column-based. For each column, we store all the values, and we do that for all the columns in our dataset. And now this has a few advantages. We can query that data very selectively. If you are interested in aggregating all the values from the purchase order value column, we can sequentially read the data, just the purchase order value per order.

So it's very fast to read: we don't even have to do the I/O for the other columns, since we are not interested in them, so that's very efficient. And storing that data also becomes very efficient that way. For instance, if you think about timestamps, maybe your dataset is ordered. So, instead of storing each timestamp as a fully self-contained value, you could just persist the first one. And then for the next one, you just store the delta. Maybe it's just two milliseconds later, so you store a two instead of a full timestamp.
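The delta idea for ordered timestamps can be sketched in a few lines. This is a simplified illustration with hypothetical names; Parquet's own delta encodings are more elaborate than this (they additionally pack the deltas into as few bits as possible), so the sketch shows the principle rather than the file format.

```java
import java.util.Arrays;

// Simplified illustration of delta encoding; not Parquet's actual on-disk encoding.
class DeltaEncodingSketch {

    // Stores the first value as-is and every later value as the
    // difference from its predecessor.
    static long[] encode(long[] values) {
        long[] deltas = new long[values.length];
        long previous = 0;
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - previous;
            previous = values[i];
        }
        return deltas;
    }

    // Rebuilds the original values by accumulating the deltas.
    static long[] decode(long[] deltas) {
        long[] values = new long[deltas.length];
        long running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            values[i] = running;
        }
        return values;
    }

    public static void main(String[] args) {
        long[] timestamps = {1700000000000L, 1700000000002L, 1700000000005L, 1700000000005L};
        long[] deltas = encode(timestamps);
        System.out.println(Arrays.toString(deltas));
        System.out.println(Arrays.equals(timestamps, decode(deltas)));
    }
}
```

After the first entry, the encoded array holds only small numbers, which is what makes the data compress well.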

So it's very space-efficient, and it also lends itself very well to compression. So, essentially, that's why those columnar file formats are so interesting. In particular, in the context of data analytics, Apache Parquet is, I would say, the default file format in data lakes, and it's heavily used by open table formats like Apache Iceberg, Delta Lake, and several others. And now the thing is, Parquet has been around for quite some time, and there is a widely used parser and writer in Java, I should say, but it's also very dependency-heavy.

So if you use that existing parser, which, again, is great work by the community, you pull in essentially the entire Hadoop stack, and you end up with a huge footprint of dependencies. And so this is where this project, Hardwood, comes in. In terms of the name, Hardwood is similar to Parquet, so it's a play on different kinds of flooring. So the idea for Hardwood is two things. First, let's see whether we can build a new parser for Apache Parquet without any mandatory dependencies. So it's all written from scratch.

And also, I want to make it very fast. This is where we can revisit the One Billion Row Challenge, as many of its learnings apply to building this real-world system. I want to make this multi-threaded; it is multi-threaded, and I want to take advantage of all the CPU cores I have. So that's the idea: building a parser for Apache Parquet in Java, optimised for minimal dependencies and very fast. So that's the high-level gist for it.

Olimpiu Pop: That sounds great. So it should have a very short BOM, right? Because of the Cyber Resilience Act (CRA), we now have to look into that as well.

Gunnar Morling: I mean, that's very real. All this supply chain attack situation, like all those dependencies, they're a liability essentially, and you don't want to have them. And that’s not even talking about things like class path conflicts, you might have different versions. So, really, here, the idea is to minimise the dependency set as much as possible. So, for instance, and this is where I can actually come back to newer Java versions, there's no logging dependency because, well, since version 9, Java has a minimal yet good-enough logging abstraction. So I'm using that.

And then people can just add a binding to whatever logging infrastructure they want to use. In addition, there are a couple of optional dependencies. So, for instance, Parquet can utilise different compression algorithms, such as LZ4 and GZip. And now, I didn't want to be in the business of re-implementing all those compression algorithms. So I'm integrating those as optional dependencies. So if you want to parse a given Parquet file with a specific compression algorithm, you would pull in that compression dependency. And it's the same, for instance, for object storage support.

So you can now also parse files which are stored in an S3 bucket. So in that case, there's an extra module as part of the Hardwood project that you would pull in.
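The built-in logging abstraction mentioned above is `System.Logger`, available since Java 9. The following is a minimal usage sketch; the logger name and the logged message are made up for illustration and are not taken from the Hardwood codebase.

```java
// Minimal System.Logger usage; logger name and message are hypothetical.
class LoggingSketch {

    // No third-party logging dependency: the JDK supplies the facade.
    static final System.Logger LOG = System.getLogger("example.parquet.reader");

    static void logOpened(String file, int rowGroups) {
        // Parameters are substituted using java.text.MessageFormat-style placeholders.
        LOG.log(System.Logger.Level.INFO, "Opened {0} with {1} row groups", file, rowGroups);
    }

    public static void main(String[] args) {
        logOpened("orders.parquet", 3);
    }
}
```

By default, the JDK routes these calls to `java.util.logging`. An application can redirect them to another logging backend by providing a `System.LoggerFinder` implementation, which is the kind of binding described above.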

Hardwood's Architecture: Parallelisation and Performance Optimisation [20:25]

Olimpiu Pop: You mentioned that you put a lot of effort into two main goals. One is a small footprint, and the other is speed. What are the tricks? Looking back, what would be your recommendation if somebody came and asked, "Hey, Gunnar, what should I keep in mind if I'm building a parser and I want it to have these attributes?" What are the architectural takeaways?

Gunnar Morling: Right. So I mean, the first thing that comes to mind is parallelisation. As I mentioned, our machines have so many CPU cores. On my MacBook, I don't even know, there are like 16 cores or something like that. And if I'm single-threaded, I leave 15 cores unused, and it's a huge waste. So the idea here is to utilise all the CPU cores. And the problem, as I also learned, is that parallelising the parsing of a Parquet file is surprisingly tricky. For instance, you could say that for each of the columns in my file, I just use a different thread, one thread per column.

This sounds good, and that's how I started. The problem is that your dataset may not have many columns. Maybe you just have three columns, so you would be using just three of the 16 CPU cores. It's better than before, but it's still not really good. The idea then was, okay, let's go one step further down: Parquet actually stores column data in what it calls pages. So now the engine has this page-level parallelism. All the pages in our file are distributed across our worker threads, so we can utilise essentially all the CPU resources we have.

And now the challenge with that one was: depending on the kind of encoding a column has, it might take a different amount of CPU time to decode. I can have slow columns and fast columns. Well, if I have a fixed number of threads that I use to decode the pages of a given column, again, I would leave resources underutilised. So here the idea is to implement some sort of adaptive balancing.

So, essentially, slower columns get more worker threads assigned, and faster columns get fewer. And that way, again, I'm using essentially as much CPU as I can. So that's the entire topic of parallelisation. There is this idea of pre-fetching. If I work on multiple files, which together form a data set, if I'm getting towards the end of my first file, I'm already starting to preload contents from the next file. So just to avoid any cold-start scenario.
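A minimal sketch of the page-level idea follows, under simplifying assumptions: a "page" is just an `int[]`, "decoding" means summing it, and the names are hypothetical. It uses a fixed thread pool so that it runs on older JDKs; on Java 21 or later, `Executors.newVirtualThreadPerTaskExecutor()` could be swapped in. The adaptive balancing between slow and fast columns is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: one task per page instead of one thread per column.
class PageParallelismSketch {

    // Stand-in for decoding one page; here it only sums the values.
    static long decodePage(int[] page) {
        long sum = 0;
        for (int value : page) {
            sum += value;
        }
        return sum;
    }

    // Submits every page as its own task, so the degree of parallelism
    // depends on the number of pages rather than the number of columns.
    static long sumAllPages(List<int[]> pages) {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService workers = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> pending = new ArrayList<>();
            for (int[] page : pages) {
                Callable<Long> task = () -> decodePage(page);
                pending.add(workers.submit(task));
            }
            long total = 0;
            for (Future<Long> result : pending) {
                total += result.get(); // waits for that page's task to finish
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while decoding", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("page decoding failed", e.getCause());
        } finally {
            workers.shutdown();
        }
    }

    public static void main(String[] args) {
        List<int[]> pages = List.of(new int[]{1, 2, 3}, new int[]{4, 5}, new int[]{6});
        System.out.println(sumAllPages(pages));
    }
}
```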

There's the idea of avoiding the boxing overhead whenever possible. So as much as I can, I keep data in primitive arrays: int arrays, double arrays, and so on. And this ties back to an earlier question you had. Let's say you have this page representation in the Hardwood codebase. I need to have an int page, a float page, and a double page, because I cannot have a generic representation that would be backed by those primitive arrays, right? Currently, if I used generics, I would have an ArrayList of Integer, not of int.

So I would pay the boxing overhead. With Project Valhalla, once we get it, we'll avoid that. I could have a generic data structure that still doesn't pay the object overhead.
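The contrast can be sketched as follows. The `IntPage` class here is a hypothetical stand-in, not Hardwood's actual page type: it keeps values in a plain `int[]`, whereas the generic alternative has to hold one `Integer` object per element.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a primitive-backed page versus a boxed list.
class PrimitivePageSketch {

    // Type-specific page: values sit in one contiguous int[] without wrappers.
    static final class IntPage {
        private final int[] values;

        IntPage(int[] values) {
            this.values = values;
        }

        long sum() {
            long total = 0;
            for (int value : values) {
                total += value; // no unboxing needed
            }
            return total;
        }
    }

    // Generic alternative: every element is a boxed Integer object.
    static long sumBoxed(List<Integer> values) {
        long total = 0;
        for (Integer value : values) {
            total += value; // unboxes each element
        }
        return total;
    }

    public static void main(String[] args) {
        IntPage page = new IntPage(new int[]{1, 2, 3});
        List<Integer> boxed = new ArrayList<>(List.of(1, 2, 3));
        System.out.println(page.sum() == sumBoxed(boxed));
    }
}
```

Both variants compute the same result; the difference lies in memory layout and per-element object overhead, which is why a separate page class per primitive type is needed today.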

Olimpiu Pop: Virtual threads or old-style threads?

Gunnar Morling: Virtual threads, actually, I don't think it makes a huge difference for the particular workload, but yes, I felt, why not? And so that's what I ended up using.

Olimpiu Pop: Thank you. When should I look at Hardwood?

Gunnar Morling: When should you use it? I believe, eventually everyone who wants to parse Parquet files in Java should choose this option. Right now, I mean, we are still very early. I started the project at the beginning of the year. I did a first release, Alpha1, a few weeks ago. I'm going to release a Beta version very soon. Functionally, I would say it's pretty complete for the reading side of things. We support all encodings and all physical and logical types.

We support both row-based and column-based ways of reading the data. We support data mapping into Avro. As I mentioned, we support reading data from an S3 bucket, and there's this entire notion of just fetching the data we need. So, for instance, if we are only interested in a specific column, well, we would only get those bytes from the remote bucket for that column.

There's also this notion of predicate pushdown. So you can say, "I want to have only data which satisfies a certain filter criterion". I don't know, only purchases over a hundred. We can evaluate that filter against row group and page statistics which Parquet supports. So essentially, it also allows you to cut down on the chunks you even read. So we support all that. It's pretty complete read-wise, I would say. We expect to have a stable 1.0 release soon, and shortly thereafter, we will have write support and make it a fully comprehensive Parquet library for both reading and writing.
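Statistics-based skipping can be sketched like this. The `Chunk` class and its fields are hypothetical stand-ins for a row group or page that carries min/max statistics; they are not Hardwood's or Parquet's actual types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of skipping chunks based on min/max statistics.
class StatisticsFilterSketch {

    // Stand-in for a row group or page together with its statistics.
    static final class Chunk {
        final long min; // would serve a "less than" filter in the same way
        final long max;
        final long[] values;

        Chunk(long... values) {
            if (values.length == 0) {
                throw new IllegalArgumentException("chunk must not be empty");
            }
            long lo = values[0];
            long hi = values[0];
            for (long value : values) {
                lo = Math.min(lo, value);
                hi = Math.max(hi, value);
            }
            this.min = lo;
            this.max = hi;
            this.values = values;
        }
    }

    // If even the largest value fails "value > threshold", no row can match.
    static boolean canSkip(Chunk chunk, long threshold) {
        return chunk.max <= threshold;
    }

    static List<Long> valuesGreaterThan(List<Chunk> chunks, long threshold) {
        List<Long> matches = new ArrayList<>();
        for (Chunk chunk : chunks) {
            if (canSkip(chunk, threshold)) {
                continue; // skipped without looking at the chunk's values
            }
            for (long value : chunk.values) {
                if (value > threshold) {
                    matches.add(value);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = List.of(
                new Chunk(10L, 20L, 30L), new Chunk(150L, 200L), new Chunk(90L, 101L));
        System.out.println(valuesGreaterThan(chunks, 100L));
    }
}
```

A chunk that passes the statistics check may still contain non-matching rows, so the per-value filter inside the loop remains necessary.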

Olimpiu Pop: So let's say I'm just reading a line or whatever.

Gunnar Morling: Right.

Olimpiu Pop: What will I get back?

Gunnar Morling: There are two modes, essentially. There's what we call the row reader API, which allows you to iterate through your dataset. It's an iterator kind of pattern, and you can access all the columns of your rows. Parquet supports nested data, so you could have substructures, or you could have lists, and this row-based API would then, for instance, give you an array of, I don't know, the comments of your blog posts, or whatever your data is about. So there's that, which is nice if you want to work with this data in some sort of object-based way.

Also, and I just merged this earlier this week, there's support for giving you this as an Avro record. Many people in the Parquet space use Apache Avro as a binding format, so we support it as well. And this is good if you want to access complexly structured data in an object-based way, let's say.

And the other alternative is this column reader API, as we call it. And this essentially gives you arrays of data just from a specific column. And this is very interesting because it makes it very fast. Because then, for instance, you could take an entire array of data and feed it into the vector API and process it very efficiently. So we have those two APIs; depending on the use case, you would use one or the other.

Olimpiu Pop: Okay, cool. Any integration with Apache Arrow, since it seems to be an "in-memory" equivalent for Parquet?

Gunnar Morling: Yes. So at this point, not. I thought about using it, but then, well, it comes back to this question of avoiding dependencies, and so far, we don't use Arrow. It's something which maybe we should explore at some point.

Olimpiu Pop: Well, I suppose it's about the community and who uses it. And then I suppose that question will come around if it actually makes sense.

Gunnar Morling: Absolutely.

AI-Native Development: The Massive Productivity Booster that Needs Human Oversight [27:33]

Olimpiu Pop: The other thing I think you underlined in the post announcing the project is that Hardwood essentially is an “AI-native” project. How does it feel? What are the lessons learned from that? Because, obviously, everybody needs to do it.

Gunnar Morling: Absolutely. And also, it actually touches on one of the motivations for starting the project. I mean, yes, I generally felt there is a need for this project, a Parquet parser with minimal dependencies which is very fast. So it just needs to exist. I want to build it. But then, I was also looking to gain real-world experience using AI to build this kind of tool. How far can AI take me in doing this? It's built AI-first. So I use Claude Code extensively for building it. People in the community use it for their contributions. But what I really want to emphasise is that we don't vibe code.

The idea is not to just take whatever Claude comes up with. No, the idea really is that we want to understand the code, guide the agent, and establish certain structures. We want to have a codebase that is well-maintainable. So it's pretty prescriptive about how we use AI. In my CLAUDE.md file, for instance, I tell it very explicitly: always start with a design document. There are certain things we want to keep in mind as we develop our project.

So we want to have a minimal public API, and as much as possible, code should not be in publicly facing packages. We want to avoid duplication; maybe we need to refactor things. All of those things are in my CLAUDE.md file, and as much as possible, it adheres to that. If it doesn't, well, then I say, "Hey, now we have some redundancy here, let's clean it up and extract some sort of helper method or whatever it is".

And I mean, in particular, I think for Parquet, it's actually a very good problem to be solved with AI because there's a very well-written spec. It's very clear what we need to build. There are specifications of all the different parts of the file and so on. So it's very clearly defined. And also, there's a very extensive test suite provided by the Parquet community. So there's like, I don't know, hundreds of Parquet files. We're taking the existing Parquet parser, passing all those files to it, and comparing the output to whatever Hardwood gives us; if there's a difference, it's a bug we need to fix. And AI is great for that. So I can tell it “For that one file from the test suite, there's a difference to the upstream parser. So why is it, and can you go fix it?” And by now, actually, we have achieved full parity, so the outcome is exactly the same.

But agentic coding isn’t perfect; for instance, it doesn't include the idea that code needs to be maintainable. So very often, it will be like, okay, let me add another if-else over here and let me, I don't know, duplicate some stuff over there. So you still need to be on top of things, guide it and make sure what it produces is meaningful and good quality. But it's a massive productivity booster. We would not be close to where we are without using those tools.
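The comparison against the upstream parser described above is a form of differential testing. The harness below is a generic sketch with hypothetical names; to stay self-contained, it compares `Integer.parseInt` with a deliberately flawed hand-written parser rather than two Parquet readers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.function.Function;

// Generic differential-testing sketch; the two "parsers" are stand-ins.
class DifferentialTestSketch {

    // Returns every input for which candidate and reference disagree.
    static <I, O> List<I> findMismatches(List<I> inputs,
                                         Function<I, O> reference,
                                         Function<I, O> candidate) {
        List<I> mismatches = new ArrayList<>();
        for (I input : inputs) {
            if (!Objects.equals(reference.apply(input), candidate.apply(input))) {
                mismatches.add(input);
            }
        }
        return mismatches;
    }

    // Deliberately flawed candidate: it ignores a leading minus sign.
    static Integer naiveParse(String text) {
        int result = 0;
        for (char c : text.toCharArray()) {
            if (Character.isDigit(c)) {
                result = result * 10 + (c - '0');
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Function<String, Integer> reference = Integer::parseInt;
        Function<String, Integer> candidate = DifferentialTestSketch::naiveParse;
        List<String> inputs = List.of("12", "-7", "0");
        // Each reported input is a concrete, reproducible case to investigate.
        System.out.println(findMismatches(inputs, reference, candidate));
    }
}
```

Each mismatching input is a small, concrete reproduction, which is the kind of task that can be handed to a coding agent, as described above.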

Olimpiu Pop: Okay. I like your approach of using a design document before implementing something bigger. Are you using any kind of standard? Because I know, as you mentioned, that having something documented publicly makes it a lot easier with these new models. Is it something that you gave it a template, or is it something that you just pointed, okay, do the ADR, whatever?

Gunnar Morling: No, it's not. We don't have a template. Maybe we should actually add one. When we started that, it kind of came up with a good structure by itself, like setting a context, what's the problem we want to achieve? What do we have already? So I felt it kind of makes sense, but yes, I totally agree. It could make sense to establish a template, actually.

Olimpiu Pop: Okay. Well, there is obviously a GitHub repo for that. It's on the ADR side, and what I like is that there are a lot of templates, and it depends on the size. I really enjoyed reading through those, and they should help. Funny enough, because you mentioned that Parquet has a lot of tests and it's very well documented. I had two different conversations with two of your connections: Birgitta Böckeler from ThoughtWorks and Adam Bien, the airhacks master of Java. Both of them made similar points. Adam mentioned that he's using the BCE pattern a lot, which is quite old. And what he was mentioning is that Java is quite good for model generation, given that all the processes, all the community processes, are very well documented.

You have this interface, the JSRs, and the implementation. And that allowed the community to build around it. And the other point was that there were three experiments from the guys at Cursor, Anthropic and OpenAI. All of them built something, and they wanted to push longer-running projects. And some of them built a C compiler. The others built a simple tool internally to go deeper into the enterprise space. And actually, the underlying point was that when you are building on tools that are highly specified, and you have a lot of tests, it's a lot easier.

For instance, for the compiler or the browser itself, because you already have things built. So I think that's worth noting.

Gunnar Morling: I mean, it's a good safety net against regressions. If we make changes, we'll determine whether something that used to work suddenly stops working. The one question that is still a bit open to me is how we go about performance regressions. Because for me, performance is a top concern. And I spend lots of time with Async Profiler, JFR, and other tools to minimise allocations. By the way, we also pool those object arrays and reuse them. This is top of mind, and I'm also concerned about regressions.
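To illustrate the pooling idea mentioned above, here is a minimal sketch of a pool that reuses `Object[]` buffers instead of allocating a new one each time. It is a generic, single-threaded illustration under assumed names (`ObjectArrayPool`, `acquire`, `release`), not Hardwood's actual implementation.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

/** Minimal, single-threaded pool that hands out reusable Object[] buffers. */
final class ObjectArrayPool {
    private final ArrayDeque<Object[]> free = new ArrayDeque<>();
    private final int arrayLength;
    private final int maxPooled;

    ObjectArrayPool(int arrayLength, int maxPooled) {
        this.arrayLength = arrayLength;
        this.maxPooled = maxPooled;
    }

    /** Returns a pooled array if one is available, otherwise allocates a new one. */
    Object[] acquire() {
        Object[] array = free.pollFirst();
        return array != null ? array : new Object[arrayLength];
    }

    /** Gives an array back to the pool so that a later acquire() can reuse it. */
    void release(Object[] array) {
        if (array.length != arrayLength || free.size() >= maxPooled) {
            return; // wrong size or pool is full: let the garbage collector reclaim it
        }
        Arrays.fill(array, null); // clear references so the pool does not keep old values alive
        free.addFirst(array);
    }

    public static void main(String[] args) {
        ObjectArrayPool pool = new ObjectArrayPool(1024, 8);
        Object[] first = pool.acquire();
        first[0] = "value";
        pool.release(first);
        Object[] second = pool.acquire();
        System.out.println("same array reused: " + (first == second));
    }
}
```

Clearing the array on release trades a little CPU time for not retaining stale objects; a concurrent reader would need a thread-safe or per-thread variant of this pool.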

And so I'm having some conversations. There's an Apache project called Otava. It's about continuous performance tracking and identifying regressions in your performance metrics. So that's something I also want to set up to identify whether we made a change and actually made things slower, so we can prevent that earlier.
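As a rough illustration of what detecting a performance regression can mean, the sketch below compares the median of the current benchmark timings against the median of a baseline and flags a slowdown beyond a tolerance. This is a deliberately simple stand-in with hypothetical names (`RegressionCheck`, `isRegression`); it is not how Otava works, and the timings are made-up sample data.

```java
import java.util.Arrays;

/** Naive regression gate: compare median timings of two benchmark runs. */
final class RegressionCheck {

    /** Median of the given samples; the input array is not modified. */
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /**
     * Returns true if the current median is slower than the baseline median
     * by more than the given relative tolerance (0.10 means 10 percent).
     */
    static boolean isRegression(double[] baselineMillis, double[] currentMillis, double tolerance) {
        if (baselineMillis.length == 0 || currentMillis.length == 0) {
            throw new IllegalArgumentException("both runs need at least one sample");
        }
        return median(currentMillis) > median(baselineMillis) * (1.0 + tolerance);
    }

    public static void main(String[] args) {
        // Made-up timings in milliseconds, for illustration only.
        double[] baseline = {100, 102, 98, 101};
        double[] current = {120, 118, 125, 119};
        System.out.println("regression: " + isRegression(baseline, current, 0.10));
    }
}
```

Using the median rather than the mean makes the check less sensitive to a single noisy run, but a fixed threshold can still miss slow drifts across many commits, which is the gap that continuous tracking tools aim to close.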

Olimpiu Pop: It's something that I'm looking into as well, but from a different angle. For us, it's about performance translating to battery life, because in my current role, it's IoT, but it's highly mobile, which means battery. And if you have a performance regression, that means battery life is draining, which means higher cost, and all these are due to the kinds of things coming. And now I remember a point that Luca Mezzalira had. He mentioned that at the point when he started building, he was implementing microfrontends.

He had a container, a simple container limited in terms of performance, memory, and so on. Obviously, it was about the size of the package at that point. And that's what I'm trying to see whether I can create a restraint, something like a digital twin for our devices that allows us to see it virtually, but still a long way because it's custom-built, it's more complex. But nevertheless, I'm very curious to see what the solution that you'll reach is, because that was also in my mind, because you have all these things that are very superficial.

But then, at that point in time, you had this mastery of how to tweak the JVM and make it work. And then if you're really pushing into that, then you discussed mechanical sympathy, and you saw that also in your One Billion Row Challenge, because those people who really wanted to go deep, they really went for the gold, optimising it for the given machine.

Gunnar Morling: Right. I think that's where AI is interesting, because it can help us with things. But right now I also feel, yes, you cannot just let it go on its own. I couldn't say, "Go off and build a Parquet parser for me". This won't work very well. So you have to guide it; you have to stay on top of things.

Now, oftentimes people are also concerned about the impact on developers, and what it means for us. Will we all lose our jobs? And I mean, obviously, I don't have the answer to that, but I think right now my feeling is it's kind of bimodal.

If you are new in the field and maybe you do relatively easy work, it is probably impactful. But also, if you are experienced and have been around for a while, and you know what to build, then it's a massive productivity booster, and you can get things done that you just couldn't get done before because you didn't have the capacity. Yes, I feel like it's having different effects across the developer spectrum in terms of experience.

Olimpiu Pop: For a lot of developers our age, it's bringing the fun back, because in senior positions, you were stuck in bureaucracy, and then you were frustrated that the younger folks don't really get it. And then there was the gap between how we would expect things to be and how the newer generations are doing them. But now you have the ability to actually do those kinds of things together with other points and to have that high standard. My only concern, and that's something that I still don't have a true response to it and not even a path, is how are we bringing new folks in?

Gunnar Morling: Absolutely. Yes, that's a massive concern.

The Future of Programming: AI, Cost, and Maintaining Developer Skills [36:52] #

Olimpiu Pop: Because I see two distinct trends. Some of them are just, well, I need someone with at least 4 or 5 years of experience. They build some stuff. And then there are the others who say, "Okay, we don't need developers now because I know how to discuss with the AI, and then we have everything". And then, you see their big flaws in the code. These are the points where we need to close the gap and think long-term, and they're an excellent stepping stone toward the evolution of programming. Even though I still have a curiosity.

And I'll just say it out loud: as an industry, we have spent a lot of time building higher-level languages that make things a lot easier. Now, it's even better than that; you just generate. So now, what we do is go... let's think about Java. We are just writing higher-level code that follows best practices and gets compiled, then interpreted, and so on. You have so many layers until you actually reach the bare metal.

I expect that, at some point, we will see constructs that are ahead-of-time compiled, where you take bits and pieces of the construct, and we circumvent the code altogether; time will tell.

Gunnar Morling: I mean, it's all ... I think, still up in the air. Right now, nothing of that is deterministic. So if you don't have an intermediary step where we can at least see what is going on, it's going to be very hard to evolve those things and debug them and so on. So, right now, I don't see a world where we just skip that source generation step. But yes, I mean, it's all evolving quickly. So it might look very different in 12 months' time.

Olimpiu Pop: I didn't even want to touch on the cost side of things because for me, the cost is multifaceted, more than the material cost, more than the money that we're paying for. It's about the infrastructure costs; it's about the water we use; it's about electricity and the pollution that comes with it, with CO2. And I think that those are the hard problems to solve, and that's all about us as an industry to go in the right direction.

Gunnar Morling: Yeah, as I mentioned, I use Claude Code all day long, and sometimes I ask it, "Hey, can you make that change?" And then I feel like I could just write it myself. It would obviously be more efficient in terms of CPU cycles, but it's so easy once you are in that mode where you essentially always work with the coding assistant. So yes, that's definitely a lot on my mind. Will we lose the skill of doing stuff ourselves? And the other day I was sitting on an aeroplane and didn't have access to Claude Code. And then I thought, okay, what should I do? I could start to code some stuff, but I feel it would just be so slow. There's no point in doing it. So then I ended up reading something. So yes, it has all those dynamics, and it's all evolving so much.

Olimpiu Pop: Well, at some point, I read a book called The Glassbox, and it mentioned similar things, but it was about driving a stick versus an automatic gearbox. At a given moment, it's about taking yourself out of your comfort zone and trying to do different things. I think that's about discipline and how you do this stuff, but that's a whole different conversation.

Contributing to Hardwood Community [40:05] #

Gunnar Morling: Right. Before we finish, I want to mention just one thing because it's important. We spoke about building Hardwood, but also, actually, a community is forming around it. So I just want to give a big shout-out to Rion, Andres, and a couple of other people who contributed to the project. Without those people, it would just be much less fun, and also, we wouldn't be where we are now. So big kudos to that community. And of course, I hope it continues to grow, and we will have an even more diverse community around Hardwood.

Olimpiu Pop: Great. You mean Andres Almiray?

Gunnar Morling: Absolutely. Yes. So he helped a lot with the build and release infrastructure.

Olimpiu Pop: This guy is in all important projects. I was telling him the other week, when we had the recording, that he's probably one of the most prolific people in the open source community in the Java space.

Gunnar Morling: Yes, yes.

Olimpiu Pop: Thank you for your time today. Thank you for putting together Hardwood, and best of luck. We're looking forward to having the 1.0 release.

Gunnar Morling: Awesome. Yes. Thank you so much for having me. This was fun.

Olimpiu Pop: Thank you.

Mentioned:

The One Billion Row Challenge
Hardwood Project Website
Hardwood Project on GitHub
Bean Validation Specification
Apache Parquet Website
Apache Iceberg Website
Apache Arrow Website
SQLite Website
Open JDK Project Loom
Open JDK Project Valhalla
Oracle Documentation on JFR
Async Profiler on GitHub
Apache Otava (Incubator) Project
Digital Strategy - Cyber Resilience Act
Building a durable execution engine in Java
From Java EE to Quarkus and LLMs: Adam Bien’s Playbook for Boring, Future‑Proof Systems

source & further reading

infoq.com — original article